Friday, December 24, 2010

Christmas Tree

Christmas tree I made this year



Merry Christmas!

Tuesday, December 21, 2010

2010 LLVM Developers' Meeting

Videos from the 2010 LLVM Developers' Meeting are now available here

Thursday, December 16, 2010

acer aspire

kernel: [37201.728875] CPU2: Core temperature above threshold, cpu clock throttled (total events = 1)
kernel: [37201.728880] CPU3: Core temperature above threshold, cpu clock throttled (total events = 1)
kernel: [37201.729881] CPU3: Core temperature/speed normal
kernel: [37201.729884] CPU2: Core temperature/speed normal
kernel: [37301.339112] [Hardware Error]: Machine check events logged
kernel: [37501.209513] CPU2: Core temperature above threshold, cpu clock throttled (total events = 4206)
kernel: [37501.209517] CPU3: Core temperature above threshold, cpu clock throttled (total events = 4206)
kernel: [37501.210542] CPU2: Core temperature/speed normal
kernel: [37501.210546] CPU3: Core temperature/speed normal


For the first time I had an emergency power-off due to CPU overheating.
Acer Aspire is definitely bullshit.

Thursday, December 9, 2010

yet another perl limitation

Another funny thing about Perl.
Suppose $i is 4.


while ($i > 0) {
    $i--;

    next if ($i == 2);

    print $i, "\n";
}


will produce
3 1 0


which is the expected result. So far, so good.

Now let's move to do-while:

do {
    $i--;

    next if ($i == 2);

    print $i, "\n";
} while ($i > 0);


Oops, bad luck:
3
Can't "next" outside a loop block at ./test.pl line 10.

And at the same time
do {
    $i--;

    next if ($i == 2);

    print $i, "\n";
} for (1 .. $i);


will happily print
3 1 0

Guess why?

perldoc will shed light:
do BLOCK does not count as a loop, so the loop control statements next, last, or redo cannot be used to leave or restart the block. See perlsyn for alternative strategies.

Well... I don't know. I hope they really had good reasons to do so.
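For contrast, C and C++ treat do-while as a real loop, so `continue` is legal there and jumps straight to the `while` condition. Below is a small C++ sketch of the same countdown (the function name is made up for illustration). In Perl itself, perlsyn's suggested workaround for `next` is to double the braces, `do {{ next if ...; }} while (...)`, so that `next` leaves the inner bare block.

```cpp
#include <string>

// Same countdown as the Perl do-while above, written in C++.
// `continue` inside a do-while jumps to the loop condition, so 2 is
// skipped and the loop keeps going instead of aborting.
std::string countdown_skipping_two(int i)
{
    std::string out;
    do {
        --i;
        if (i == 2)
            continue;                 // legal: do-while is a loop in C/C++
        out += std::to_string(i);
        out += ' ';
    } while (i > 0);
    return out;
}
```

With i = 4 this yields "3 1 0 ", matching the Perl while-loop output.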

Tuesday, December 7, 2010

sched: automated per tty task groups

Surely you have already heard about those magic 220 lines
that blow your head off (yes, "[RFC/RFT PATCH v3] sched: automated
per tty task groups").

Lennart Poettering, however, proposed a similar solution, yet
without any need to patch your kernel.

I've changed it a bit, since the original didn't work for me.
I added these lines to the end of .bashrc:
if [ "$PS1" ] ; then
    mkdir -m 0700 /cgroup/cpu/user/$$
    echo $$ > /cgroup/cpu/user/$$/tasks
fi


and to /etc/rc.local:
mount -t cgroup cgroup /cgroup/cpu -o cpu
mkdir -p -m 0777 /cgroup/cpu/user

Quick test:
ls /cgroup/cpu/user/
11564 1164 15342 24870 28728 3746 3812 ...


Tested while compiling the kernel with -j16, emacs with -j8, and doing a few
other things in the meantime...
Seems to work.

The truth behind the scenes is that the kernel is much more
complicated than the books tell you.

P.S.
It may require libcgroup.
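A rough C++ equivalent of the two .bashrc lines, assuming the cpu controller is mounted at /cgroup/cpu and /cgroup/cpu/user exists as set up in rc.local. join_own_cgroup() is a hypothetical helper, not part of libcgroup or any other library. For experimenting, base can be any writable directory, in which case "tasks" is just an ordinary file.

```cpp
#include <cerrno>
#include <fstream>
#include <string>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// What the .bashrc snippet does for the calling process:
//   mkdir -m 0700 <base>/$$ ; echo $$ > <base>/$$/tasks
// On the setup above, base would be /cgroup/cpu/user.
bool join_own_cgroup(const std::string &base)
{
    const pid_t pid = getpid();                        // the shell's $$
    const std::string dir = base + "/" + std::to_string(pid);

    // mkdir(2) applies the umask; 0700 survives the usual 022 umask.
    if (mkdir(dir.c_str(), 0700) != 0 && errno != EEXIST)
        return false;

    // Writing a pid into "tasks" moves that process into the group.
    std::ofstream tasks(dir + "/tasks");
    tasks << pid << '\n';
    tasks.flush();   // force the write now so a rejected pid shows up as a stream error
    return static_cast<bool>(tasks);
}
```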

Saturday, November 27, 2010

"#39861: Switch module doesn't like subroutine prototypes" part 2

Spent some time playing with the Perl Switch bug.

Well, let's start from the beginning.
The main parsing routine in Switch.pm, more or less, is
sub filter_blocks, which tries to parse $source and build $text.

filter_blocks calls Text::Balanced::_match_* to distinguish code blocks
from quoted strings, to validate variables, etc... and that's the point where the
whole thing hits the fan. The Text::Balanced::_match_variable call comes too
early here, since _match_variable validates $#, $$, $^ but fails to
validate $). In other words, suppose we are parsing

sub foo($) {
        switch($_[0]) {
                case /ACK/i {
                        return "ACK";
                }                   
                case /NACK/i {
                        return "NACK";
                }                    
        }       
}



Text::Balanced::_match_variable will return the whole block, since
it doesn't know what to do with $). Yet it works quite well for $$, $,,
etc. My first solution was to change

m{\G\$\s*(?!::)(\d+|[][&`'+*./|,";%=~:?!\@<>()-]|\^[a-z]?)}gci)

to

m{\G\$\)\?\s*(?!::)(\d+|[][&`'+*./|,";%=~:?!\@<>()-]|\^[a-z]?)}gci)


So we now parse $) correctly. However, I didn't like it much.

What we (IMHO) really should do is teach filter_blocks what a
subroutine is. I did a very simple and general thing (which may fail for
some sophisticated cases):
       
diff --git a/Switch.pm b/Switch.pm
index 2189ae0..781bae8 100755
--- a/Switch.pm
+++ b/Switch.pm
@@ -111,6 +111,11 @@ sub filter_blocks
             }
             next component;
         }
+        if ($source =~ m/\G(\s*sub.+)\{/) {
+            $text .= $1;
+            pos $source += length($1);
+            next component;
+        }
         if ($source =~ m/(\G\s*$pod_or_DATA)/gc) {
             $text .= $1;
             next component;


This shifts the position in the currently parsed $source to avoid the wrong
Text::Balanced::_match_variable call on a subroutine declaration.

Of course, this is not tested at all, except with my simple script.
Just playing.

posix cpu timers: RCU read-side critical section

POSIX CPU timers were calling find_task_by_vpid() in an unsafe way
(since commit 4221a9918e38b7494cee341dda7b7b4bb8c04bde, that call requires
an RCU read-side critical section).

Thomas Gleixner wrote:
| We can remove the tasklist_lock while at it. rcu_read_lock is enough.

The patch also replaces thread_group_leader() with has_group_leader_pid(),
in accordance with a comment by Oleg Nesterov:

| ... thread_group_leader() check is not relaible without
| tasklist. If we race with de_thread() find_task_by_vpid() can find
| the new leader before it updates its ->group_leader.
|
| perhaps it makes sense to change posix_cpu_timer_create() to use
| has_group_leader_pid() instead, just to make this code not look racy
| and avoid adding new problems.



Thanks to:
    Reviewed-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Stanislaw Gruszka
    Signed-off-by: Thomas Gleixner


BTW, Oleg, in turn, fixed a security issue:
commit e0a70217107e6f9844628120412cb27bb4cea194
Author: Oleg Nesterov <>
posix-cpu-timers: workaround to suppress the problems with mt exec


---

diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 6842eeb..05bb717 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -37,13 +37,13 @@ static int check_clock(const clockid_t which_clock)
     if (pid == 0)
         return 0;

-    read_lock(&tasklist_lock);
+    rcu_read_lock();
     p = find_task_by_vpid(pid);
     if (!p || !(CPUCLOCK_PERTHREAD(which_clock) ?
-           same_thread_group(p, current) : thread_group_leader(p))) {
+           same_thread_group(p, current) : has_group_leader_pid(p))) {
         error = -EINVAL;
     }
-    read_unlock(&tasklist_lock);
+    rcu_read_unlock();

     return error;
 }
@@ -390,7 +390,7 @@ int posix_cpu_timer_create(struct k_itimer *new_timer)

     INIT_LIST_HEAD(&new_timer->it.cpu.entry);

-    read_lock(&tasklist_lock);
+    rcu_read_lock();
     if (CPUCLOCK_PERTHREAD(new_timer->it_clock)) {
         if (pid == 0) {
             p = current;
@@ -404,7 +404,7 @@ int posix_cpu_timer_create(struct k_itimer *new_timer)
             p = current->group_leader;
         } else {
             p = find_task_by_vpid(pid);
-            if (p && !thread_group_leader(p))
+            if (p && !has_group_leader_pid(p))
                 p = NULL;
         }
     }
@@ -414,7 +414,7 @@ int posix_cpu_timer_create(struct k_itimer *new_timer)
     } else {
         ret = -EINVAL;
     }
-    read_unlock(&tasklist_lock);
+    rcu_read_unlock();

     return ret;
 }
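A rough way to reach these kernel paths from userspace, as a sketch: glibc's clock_getcpuclockid() validates a pid by issuing clock_getres() on the encoded CPU-time clock, and in kernels of that era posix_cpu_clock_getres() started with check_clock() (this mapping is my reading of the sources, worth double-checking). process_cpu_time_ns() is an illustrative helper, not a library API. Pass another process's pid to exercise the find_task_by_vpid() lookup; pid 0 means the caller.

```cpp
#include <ctime>
#include <sys/types.h>

// Returns the CPU time consumed by process `pid` (0 = caller) in
// nanoseconds, or -1 if the pid cannot be turned into a CPU-time clock.
long long process_cpu_time_ns(pid_t pid)
{
    clockid_t cid;
    // Returns an error number (e.g. ESRCH) instead of setting errno.
    if (clock_getcpuclockid(pid, &cid) != 0)
        return -1;

    timespec ts{};
    if (clock_gettime(cid, &ts) != 0)
        return -1;
    return static_cast<long long>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}
```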

Friday, November 26, 2010

"#39861: Switch module doesn't like subroutine prototypes"

Recently I ran into unexpected Perl 5 behaviour.
This simple and valid code

#!/usr/bin/perl
use POSIX;
use strict;

use Switch;

sub foo($) {
    switch($_[0]) {
        case /ACK/i {
            return "ACK";
        }
        case /NACK/i {
            return "NACK";
        }
    }
}

1;

failed to compile due to
syntax error at ./test.pl line 8, near ") {"
syntax error at ./test.pl line 15, near "}"
Bareword "case" not allowed while "strict subs" in use at ./test.pl line 12.
Bareword "NACK" not allowed while "strict subs" in use at ./test.pl line 12.
Execution of ./test.pl aborted due to compilation errors.



If we change foo's prototype to expect more than one scalar
parameter, or drop the prototype entirely, everything works just
as expected. IOW, the case with one scalar parameter is kind of special (read: broken).

This turned out to be a known problem (#39861, according to Perl 5's bugzilla)...
since around 2004. Only one thing remains to write: WTF?

Sunday, November 14, 2010

Eliminate instructions at compile time trick

Vasiliy Kulikov noted a bug in select code and proposed a fix:
 [..] struct timeval has padding bytes at the end.  This struct is copied to
 userspace with these padding bytes uninitialized.  This leads to leaking
 of contents of kernel stack memory.


 --- a/fs/select.c
 +++ b/fs/select.c
 @@ -306,6 +306,7 @@ static int poll_select_copy_remaining(struct timespec
 *end_time, void __user *p,
               rts.tv_sec = rts.tv_nsec = 0;

       if (timeval) {
 +             memset(&rtv, 0, sizeof(rtv));
               rtv.tv_sec = rts.tv_sec;
               rtv.tv_usec = rts.tv_nsec / NSEC_PER_USEC;




 Andrew Morton noted that
 | struct timeval has padding bytes at the end.
 On sparc and parisc.  On all other architectures this patch is a waste
 of cycles.

 And came up with this patch:

       if (timeval) {
 -             memset(&rtv, 0, sizeof(rtv));
 +             if (sizeof(rtv) > sizeof(rtv.tv_sec) + sizeof(rtv.tv_usec))
 +                     memset(&rtv, 0, sizeof(rtv));
               rtv.tv_sec = rts.tv_sec;
               rtv.tv_usec = rts.tv_nsec / NSEC_PER_USEC;


 The `if' gets eliminated at compile time.  With this approach we add
 four bytes of text to the sparc64 build and zero bytes of text to the
 x86_64 build.
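The same trick in a self-contained C++ sketch. The struct is hypothetical: on common 64-bit ABIs an int64_t followed by an int32_t leaves 4 bytes of tail padding (much like timeval on sparc64), while on ABIs where int64_t is only 4-byte aligned inside structs (e.g. i386) there is none, and the memset disappears. memcpy stands in for copy_to_user().

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical user-visible struct; it may or may not have tail
// padding depending on the ABI.
struct out_time {
    std::int64_t sec;
    std::int32_t usec;
};

// Fill a temporary and copy it out, zeroing padding only where it exists.
void fill_out_time(out_time *dst, std::int64_t sec, std::int32_t usec)
{
    out_time t;
    // Both sides are compile-time constants, so with optimization enabled
    // the test is folded away and, where there is no padding, so is the memset.
    if (sizeof(t) > sizeof(t.sec) + sizeof(t.usec))
        std::memset(&t, 0, sizeof(t));
    t.sec = sec;
    t.usec = usec;
    std::memcpy(dst, &t, sizeof(t));   // stands in for copy_to_user()
}
```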

ioprio: rcu protect find_task_by_vpid call

Since commit 4221a9918e38b7494cee341dda7b7b4bb8c04bde, a
find_task_by_pid_ns() call needs to be protected by the RCU read lock.

Tetsuo Handa wrote:
| Usually tasklist gives enough protection, but if copy_process() fails
| it calls free_pid() lockless and does call_rcu(delayed_put_pid().
| This means, without rcu lock find_pid_ns() can't scan the hash table
| safely.

An "unsafe" find_task_by_pid_ns() call may trigger a lockdep warning like this:
Call Trace:
 [<ffffffff810656f2>] lockdep_rcu_dereference+0xaa/0xb2
 [<ffffffff81053c67>] find_task_by_pid_ns+0x4f/0x68
 [<ffffffff81053c9d>] find_task_by_vpid+0x1d/0x1f
 [<ffffffff811104e2>] sys_ioprio_get+0x50/0x2da
 [<ffffffff81002182>] system_call_fastpath+0x16/0x1b


V2: RCU critical section expanded according to a comment
by Paul E. McKenney.

The patch below adds the missing RCU protection to sys_ioprio_{set,get}.

--- a/fs/ioprio.c
+++ b/fs/ioprio.c
@@ -111,12 +111,14 @@ SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio)
        read_lock(&tasklist_lock);
        switch (which) {
                case IOPRIO_WHO_PROCESS:
+                       rcu_read_lock();
                        if (!who)
                                p = current;
                        else
                                p = find_task_by_vpid(who);
                        if (p)
                                ret = set_task_ioprio(p, ioprio);
+                       rcu_read_unlock();
                        break;
                case IOPRIO_WHO_PGRP:
                        if (!who)
@@ -205,12 +207,14 @@ SYSCALL_DEFINE2(ioprio_get, int, which, int, who)
        read_lock(&tasklist_lock);
        switch (which) {
                case IOPRIO_WHO_PROCESS:
+                       rcu_read_lock();
                        if (!who)
                                p = current;
                        else
                                p = find_task_by_vpid(who);
                        if (p)
                                ret = get_task_ioprio(p);
+                       rcu_read_unlock();
                        break;
                case IOPRIO_WHO_PGRP:
                        if (!who)



Monday, November 8, 2010

"We sometimes do this trick"

Recently on lkml we had a patch proposal by Don Zickus.

I had a minor nit: I thought it made sense
to simplify this loop

       touch_all_nmi_watchdogs:
       ...
       for_each_present_cpu(cpu) {
               if (per_cpu(watchdog_nmi_touch, cpu) != true)
                       per_cpu(watchdog_nmi_touch, cpu) = true;
       }

to
       for_each_present_cpu(cpu) {
               per_cpu(watchdog_nmi_touch, cpu) = true;
       }



Andrew Morton wrote in response:
We sometimes do this trick to avoid dirtying lots of cachelines
which already held the correct value.  It'll be extra-benefical
when dealing with other CPU's data, I expect.

This is really reasonable. Once again: each time you make a decision,
try to think about it the opposite way too.
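The check-before-store pattern in isolation, as a minimal C++ sketch. The flag vector is a hypothetical stand-in for the per-CPU watchdog flags. Reading a flag leaves its cacheline clean, while storing to it dirties the line and invalidates copies held by other CPUs, even when the value was already true. A single-threaded demo like this shows only the logic, not the cache effect.

```cpp
#include <cstddef>
#include <vector>

// Set every flag, but store only where the value actually changes.
// Returns how many stores were needed.
std::size_t touch_all(std::vector<unsigned char> &flags)
{
    std::size_t stores = 0;
    for (unsigned char &f : flags) {
        if (f != 1) {       // cheap read; the line stays clean if already set
            f = 1;          // dirty the line only when necessary
            ++stores;
        }
    }
    return stores;
}
```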

Friday, November 5, 2010

Happy birthday to Me.

Almost as young as GNU


On the photo: Stephen Fry

Wednesday, November 3, 2010

"I really do want to do the merge"

Words of wisdom by Linus Torvalds
 
"..I do feel that actually seeing the merge conflicts really does help me get a feel 
for what I'm merging.."
 
On Sat, Oct 30, 2010 at 6:51 AM, Chris Mason wrote:
>
> There were some minor conflicts with Linus' current tree, so my branch
> is merged with Linus' tree as of this morning.

Gaah. Please don't do this. Unless it's a _really_ messy merge, I
really do want to do the merge. It's fine to have an alternate
pre-merged branch for me to compare against, but please do that
separately.

So what I did was to just instead merge the state before your merge,
and in the process I:

 (a) noticed that your merge was incorrect (you had left around a
unused "error:" label in btrfs_mount()), since I did use your merge as
something to compare against (see above). That label had been removed
in your branch by  commit 0e78340f3c1f, but your merge resurrected it.

 (b) saw just how horribly nasty your writeback_inodes_sb() end result
was, and decided to clean up the estimation of dirty pages in order to
not end up with the function call argument from hell.

Now, it's obviously totally possible that I screwed things up entirely
in the process, but as mentioned elsewhere, I do feel that actually
seeing the merge conflicts really does help me get a feel for what I'm
merging, and what the points of conflict are.

And yes, maybe it's just me showing my insecurities again. I have
various mental hangups, and liking to feel like I know roughly what is
going on is one of them. Doing the merges and looking at the code that
clashes makes me feel like I have some kind of awareness of how things
are interacting in the development process.

linux-btrfs

Saturday, October 30, 2010

GCC Summit 2010

GCC Summit 2010 slides are available at
http://gcc.gnu.org/wiki/summit2010

Friday, October 22, 2010

top failed to read /proc/stat

Just noticed that the top utility fails to read /proc/stat after
a CPU is taken offline.

The problem is that Cpu_tot is not updated before calling
cpus_refresh().

In cpus_refresh() we try to read and sscanf() Cpu_tot per-CPU lines from /proc/stat:

   for (i = 0; 1 < Cpu_tot && i < Cpu_tot; i++) {
      if (!fgets(buf, sizeof(buf), fp)) std_err("failed /proc/stat read");
      cpus[i].x = 0;  // FIXME: can't tell by kernel version number
      cpus[i].y = 0;  // FIXME: can't tell by kernel version number
      cpus[i].z = 0;  // FIXME: can't tell by kernel version number
      num = sscanf(buf, "cpu%u %Lu %Lu %Lu %Lu %Lu %Lu %Lu %Lu",
         &cpus[i].id,
         &cpus[i].u, &cpus[i].n, &cpus[i].s, &cpus[i].i, &cpus[i].w, &cpus[i].x, &cpus[i].y, &cpus[i].z
      );
      if (num < 4)
          std_err("failed /proc/stat read");
   }


Which is wrong, since:
cat /proc/stat
cpu  5700 0 1474 271836 3554 0 41 0 0 0
cpu0 2167 0 565 66945 974 0 13 0 0 0
cpu1 2600 0 507 66512 1030 0 11 0 0 0
cpu2 518 0 214 69082 810 0 5 0 0 0
cpu3 413 0 186 69296 738 0 9 0 0 0
intr ....


echo 0 > /sys/devices/system/cpu/cpu3/online
cat /proc/stat
cpu  5831 0 1531 292475 3592 0 43 0 0 0
cpu0 2236 0 591 72477 979 0 14 0 0 0
cpu1 2647 0 531 72052 1051 0 12 0 0 0
cpu2 527 0 217 74713 821 0 6 0 0 0
intr ...


(note the missing cpu3 line).


The solution may look similar to this one:

      smp_num_cpus = sysconf(_SC_NPROCESSORS_ONLN);
      if(smp_num_cpus<1) smp_num_cpus=1;
      Cpu_tot = smp_num_cpus;


before the cpus_refresh() call.
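An alternative sketch, not what the post proposes: count the cpuN lines actually present in /proc/stat instead of relying on a CPU total obtained separately, which can go stale if a CPU goes offline between the two reads. /proc/stat lists the aggregate "cpu" line and the per-CPU lines first, so the scan can stop at the first non-cpu line. count_cpu_lines() is a hypothetical helper.

```cpp
#include <cctype>
#include <cstdio>
#include <cstring>

// Count "cpuN ..." lines in /proc/stat-formatted input. The aggregate
// "cpu  ..." line (no digit after "cpu") is not counted; scanning stops
// at the first line not starting with "cpu", e.g. "intr ...".
int count_cpu_lines(std::FILE *fp)
{
    char buf[512];
    int n = 0;
    while (std::fgets(buf, sizeof(buf), fp)) {
        if (std::strncmp(buf, "cpu", 3) != 0)
            break;
        if (std::isdigit(static_cast<unsigned char>(buf[3])))
            ++n;
    }
    return n;
}
```

In top, one would call it on the already opened /proc/stat stream and rewind() before the parsing loop.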

Thursday, October 21, 2010

Mutt 1.5.21 mail_check_recent option

I updated to mutt 1.5.21 recently, and one change has been really annoying me.
Mutt doesn't mark a recently visited folder with 'N' (new mail) until it
receives a new message... even if there are unread messages.
A quick look at the mutt 1.5.21 source code gave the following result:

/* returns 1 if maildir has new mail */
static int buffy_maildir_hasnew (BUFFY* mailbox)
{
  char path[_POSIX_PATH_MAX];
  DIR *dirp;
  struct dirent *de;
  char *p;
  int rc = 0;
  struct stat sb;

  snprintf (path, sizeof (path), "%s/new", mailbox->path);

  /* when $mail_check_recent is set, if the new/ directory hasn't been modified since
   * the user last exited the mailbox, then we know there is no recent mail.
   */
  if (option(OPTMAILCHECKRECENT))
  {
    if (stat(path, &sb) == 0 && sb.st_mtime < mailbox->last_visited)
      return 0;
  }

  if ((dirp = opendir (path)) == NULL)
  {
    mailbox->magic = 0;
    return 0;
  }

  while ((de = readdir (dirp)) != NULL)
  {
    if (*de->d_name == '.')
      continue;

    if (!(p = strstr (de->d_name, ":2,")) || !strchr (p + 3, 'T'))
    {
      if (option(OPTMAILCHECKRECENT))
      {
        char msgpath[_POSIX_PATH_MAX];

        snprintf(msgpath, sizeof(msgpath), "%s/%s", path, de->d_name);
        /* ensure this message was received since leaving this mailbox */
        if (stat(msgpath, &sb) == 0 && (sb.st_ctime <= mailbox->last_visited))
          continue;
      }
      /* one new and undeleted message is enough */
      mailbox->new = 1;
      rc = 1;
      break;
    }
  }

  closedir (dirp);

  return rc;
}

Gotcha! Quick grep:
{"mail_check_recent",DT_BOOL, R_NONE, OPTMAILCHECKRECENT, 1 },
When set, Mutt will only notify you about new mail that has been received
since the last time you opened the mailbox.  When unset, Mutt will notify
you if any new mail exists in the mailbox, regardless of whether you have
visited it recently.
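To make the per-message rule concrete, here is a C++ restatement of the test inside buffy_maildir_hasnew() above (it leaves out the directory-level mtime shortcut). counts_as_new() is an illustrative name, not a mutt function. An unread message that was already in new/ when you last left the mailbox stops counting once $mail_check_recent is set.

```cpp
#include <cstring>
#include <ctime>

// A file in new/ counts as new mail unless it is a dotfile or flagged
// trashed ('T'), and, when $mail_check_recent is set, unless it arrived
// (ctime) no later than the last visit to the mailbox.
bool counts_as_new(const char *name, bool check_recent,
                   std::time_t msg_ctime, std::time_t last_visited)
{
    if (name[0] == '.')
        return false;                        // skipped, as in mutt
    const char *flags = std::strstr(name, ":2,");
    if (flags && std::strchr(flags + 3, 'T'))
        return false;                        // trashed message
    if (check_recent && msg_ctime <= last_visited)
        return false;                        // already there when we left
    return true;
}
```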

The corresponding ChangeLog entry (diff 1.5.20 - 1.5.21):
2010-09-13 17:25 -0700  Michael Elkins (20b2d496349f)
* init.h: make $mail_check_recent set by default

The solution is to add
unset mail_check_recent

to .muttrc


Why...

Glibc: moving forward to gcc-4.6

Author: Ulrich Drepper
Date:   Tue Oct 19 12:56:42 2010 -0400

    Provide FP_FAST_FMA{,F,L} definitions for x86/x86-64.

diff --git a/sysdeps/x86_64/bits/mathdef.h b/sysdeps/x86_64/bits/mathdef.h
index 7b16189..9146392 100644
--- a/sysdeps/x86_64/bits/mathdef.h
+++ b/sysdeps/x86_64/bits/mathdef.h
[..]

+/* The GCC 4.6 compiler will define __FP_FAST_FMA{,F,L} if the fma{,f,l}
+   builtins are supported.  */
+# if __FP_FAST_FMA
+#  define FP_FAST_FMA 1
+# endif
+
+# if __FP_FAST_FMAF
+#  define FP_FAST_FMAF 1
+# endif
+
+# if __FP_FAST_FMAL
+#  define FP_FAST_FMAL 1
+# endif
+
 #endif /* ISO C99 */


http://www.linuxselfhelp.com/gnu/glibc/html_chapter/libc_20.html
On processors which do not implement multiply-add in hardware, fma can be very slow since it must avoid intermediate rounding. `math.h' defines the symbols FP_FAST_FMA, FP_FAST_FMAF, and FP_FAST_FMAL when the corresponding version of fma is no slower than the expression `x*y + z'. In the GNU C library, this always means the operation is implemented in hardware.
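How a program might use these macros, as a minimal sketch (mul_add is a made-up name): call fma() only where <math.h> says it is no slower than x*y + z, otherwise use the plain expression. The two branches can differ in the last bit, since fma rounds once and x*y + z rounds twice.

```cpp
#include <cmath>

// Prefer fma() only when FP_FAST_FMA says it is at least as fast as x*y + z.
double mul_add(double x, double y, double z)
{
#ifdef FP_FAST_FMA
    return std::fma(x, y, z);   // single rounding, done in hardware
#else
    return x * y + z;           // software fma would be slow here
#endif
}
```

Note that GCC may also contract x * y + z into an fma instruction on its own, depending on -ffp-contract and the target flags.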

Tuesday, October 19, 2010

wired: "Oct. 14, 1985: C++ Adds to Programming"

Interview with Bjarne Stroustrup (C++ is 25... a bit outdated).
Wired.com: Most programmers are particular about the music they listen to while coding or writing. What do you listen to?

Stroustrup: Tchaikovsky’s Fifth, Wagner’s The Ring Without Words, Grieg’s Peer Gynt Suite, Sibelius, Nielsen’s The Inextinguishable, various Mozart concertos, The Dixie Chicks, Beatles’ Abbey Road, Handel’s Messiah and Water Music, Eric Clapton, Beethoven’s Fifth and Seventh. I looked to see what my laptop had been playing lately.
...
Wired.com: Any advice for young programmers?

Stroustrup: I guess giving advice is easy compared to taking it. Know your fundamentals (algorithms, data structures, machine architecture, systems) and know several programming languages to the point where you can use them idiomatically.
Know some non-computer field of study well — math, biology, history, optics, whatever. Learn to communicate effectively in speech and in writing. Spend an unreasonable amount of time on some difficult topic to really master it. Try to do something that might make a difference in the world.

Full story via wired.com

Tuesday, October 12, 2010

www.☭.net

Just for note: Unicode 6.0 has been released.

www.☭.net

Sunday, October 10, 2010

10/10/10

101010 (in binary) == 42 - The Answer to Life, The Universe, and Everything.
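Read as a binary number, 10/10/10 without the slashes is 32 + 8 + 2 = 42. A one-line check in C++ (from_binary is just an illustrative name):

```cpp
#include <string>

// Parse a string of binary digits, e.g. "101010".
int from_binary(const std::string &digits)
{
    return std::stoi(digits, nullptr, 2);
}
```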

Wednesday, September 29, 2010

delete null pointer

Hi,

ISO C++03
5.3.5
/2 ... if the value of the operand of delete is the null pointer the
operation has no effect.

/7   The delete-expression will call a deallocation function (3.7.3.2).

C++0x
5.3.5
/2 ... the value of the operand of delete may be a null pointer value.

/7   If the value of the operand of the delete-expression is not a
null pointer value, the delete-expression will call a deallocation
function (3.7.4.2). Otherwise, it is unspecified whether the
deallocation function will be called. [ Note: The deallocation
function is called regardless of whether the destructor for the
object or some element of the array throws an exception. -- end note ]
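A practical consequence, sketched in C++ with a made-up Widget class: user code never needs an `if (p)` guard before `delete p`, but since C++0x leaves it unspecified whether the deallocation function is called for a null operand, a class-specific operator delete should itself tolerate a null argument.

```cpp
#include <new>

struct Widget {
    static int deallocations;

    // May be called with nullptr under the C++0x wording; must be harmless.
    static void operator delete(void *p) noexcept
    {
        if (p == nullptr)
            return;
        ++deallocations;
        ::operator delete(p);   // memory came from the global operator new
    }
};

int Widget::deallocations = 0;

void destroy(Widget *w)
{
    delete w;   // no null check needed: deleting a null pointer has no effect
}
```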

Wednesday, September 22, 2010

Benford's law

An interesting article on Wikipedia:

Benford's law

Benford's law, also called the first-digit law, states that in lists of
numbers from many (but not all) real-life sources of data, the leading
digit is distributed in a specific, non-uniform way. According to this
law, the first digit is 1 almost one third  of the time, and larger
digits occur as the leading digit with lower and lower frequency, to the
point where 9 as a first digit occurs less than one time in twenty.

The distribution is as follows:
1     30.1%
2     17.6%
3     12.5%
4     9.7%
5     7.9%
6     6.7%
7     5.8%
8     5.1%
9     4.6%
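The table follows from the formula P(d) = log10(1 + 1/d). The nine values sum to exactly 1, because log10((d + 1)/d) telescopes to log10(10). A small C++ sketch that reproduces the percentages:

```cpp
#include <array>
#include <cmath>

// Benford's law: probability that the leading digit is d (d = 1..9)
// is log10(1 + 1/d). p[0] is for digit 1, p[8] for digit 9.
std::array<double, 9> benford_distribution()
{
    std::array<double, 9> p{};
    for (int d = 1; d <= 9; ++d)
        p[d - 1] = std::log10(1.0 + 1.0 / d);
    return p;
}
```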

Is it useful? Yep.
Following this idea, Mark Nigrini showed that Benford's law could be used
as an indicator of accounting and expenses fraud.

In the United States, evidence based on Benford's law is legally admissible
in criminal cases at the federal, state, and local levels.

So, whenever you're about to fake some data - use digits between 5 and 9
carefully. You've been warned.

Miklos Szeredi: memory barrier question

Miklos Szeredi posted a question about memory barriers on lkml,
which led to an interesting discussion on memory barriers, compilers
and the Universe.

Please read lkml.org/lkml/2010/9/15/223

Thursday, September 16, 2010

ext4 regression

Hello,

Commit 66e61a9e9504f61b9a928c9055368c81da613a50
"ext4: Once a day, printk file system error information to dmesg"

introduced a daily error report via a kernel timer.

An error report may look like this:
[  313.485876] EXT4-fs (sda6): error count: 13
[  313.485887] EXT4-fs (sda6): initial error at 1283093815: ext4_lookup:1052: inode 4980737
[  313.485895] EXT4-fs (sda6): last error at 1283094174: ext4_lookup:1052: inode 4980737


and I find it quite useful.

Sad but true: when the timer fires on an already unmounted fs,
print_daily_error_info() dereferences a NULL pointer to the ext4
superblock info (EXT4_SB(sb) returns NULL), resulting in an OOPS
(fatal error during soft IRQ).

The stack trace will look similar to this one
(lots of helpful info cut):

IRQ
        run_timer_softirq
        ?run_timer_softirq
        ?print_daily_error_info
        ?__do_softirq
        __do_softirq
        call_softirq
        do_softirq
        irq_exit
        smp_apic_timer_interrupt
        apic_timer_interrupt
EOI
        intel_idle
        intel_idle
        cpuidle_idle_call
        cpu_idle
        start_secondary



And the solution is:

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 2614774..751997d 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -719,6 +719,7 @@ static void ext4_put_super(struct super_block *sb)
             ext4_abort(sb, "Couldn't clean up the journal");
     }

+    del_timer(&sbi->s_err_report);
     ext4_release_system_zone(sb);
     ext4_mb_release(sb);
     ext4_ext_release(sb);



Ted Ts'o wrote:
Good catch!  Thanks for the patch.  I will include this into ext4
tree, and I will probably push it separately to Linus so that it gets
into 2.6.36, since this is a regresssion.

I had some concerns about print_daily_error_info:
> By the way, isn't print_daily_error_info racy? Is it safe to call
> print_daily_error_info
> (by timer event (softirq)) when we're remounting fs, etc.?


Ted Ts'o answered:
It should be fine.  Remounting doesn't actually change out the struct
superblock.  There is a chance that the information might not be fully
complete if an error is printed exactly as the same time as
print_daily_error_info() is run, but I'm not sure it's worth trying to
protect against that race, since the worst that this will mean is a
confusing report in the /var/log/messages file, and the ext4 error
message will be printed right next to it, which will have all of the
information the system administrator will need.

YAAYYY!

Wednesday, September 15, 2010

const iterators for elements removal

Q: Adam Badura

What is the point of const iterators being unusable for elements
removal?
        I mean that if I have a const collection object then no matter what
iterators I have I will not be able to erase any element. But if I
have a non-const collection object then why am I not allowed to erase
its elements with a const iterator? After all I could as well take new
non-const begin and advance it until it is equal to the const iterator
and use such iterator for erasure.


A: Bo Persson
You are right, this will be fixed in the next standard, C++0x. 

 A: Daniel Krügler
Perfectly right - therefore the Standard Library has fixed this
awful state by accepting LWG issue: lwg-defects
It is the constness of the container which should control whether it can be modified through a
member function such as erase(), not the constness of the iterators.

via comp.lang.c++.moderated.
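What the fix looks like in practice, as a small C++11 sketch (erase_first_even is a made-up name): vector::erase now takes a const_iterator, so a non-const container can be modified through one. In C++03 you had to do what the question describes and advance a fresh begin() to the same position.

```cpp
#include <vector>

// Erase the first even element, found via a const_iterator.
// Returns the iterator that erase() hands back (or end() if none found).
std::vector<int>::iterator erase_first_even(std::vector<int> &v)
{
    for (std::vector<int>::const_iterator it = v.cbegin(); it != v.cend(); ++it) {
        if (*it % 2 == 0)
            return v.erase(it);   // C++11: iterator erase(const_iterator)
    }
    return v.end();
}
```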

Tuesday, September 14, 2010

"We no longer support gcc 3.x, so remove the workaround for it."

 H. Peter Anvin wrote:
We no longer support gcc 3.x, so remove the workaround for it.

[PATCH 1/5] Disallow building with gcc < 3.4
[PATCH 2/5] x86, gcc: Disallow building Linux/x86 with gcc 3.x/4.0 

[PATCH 3/5] x86, cpu: Remove gcc 3.x workarounds in
[PATCH 4/5] x86, mem: Remove gcc < 4.1 support code for memcpy()
[PATCH 5/5] x86, bitops: Remove gcc < 4.1 workaround


This patchset bumps the minimum supported gcc version to 3.4 for the
general kernel, and to 4.1 for x86.
Please read
lkml.org/lkml/2010/9/13/464 and http://lkml.org/lkml/2010/9/13/73 

Thursday, September 9, 2010

it's possible that there won't be any 2.4.37.11 at all

Willy Tarreau wrote:
Some of you have noticed that the last update was released 7 months ago.
This is long, but these days, very few of the issues reported on 2.6 also
affect 2.4, so basically the number of bug reports on 2.4 fades out quite
fast.
So I'm releasing 2.4.37.10 here. If nothing happens before September 2011,
it's possible that there won't be any 2.4.37.11 at all. By that time, the
2.6 kernel will have been available for almost 8 years, this should have
been enough for anyone to have a look at it. Users now have one year to
migrate or to report critical bugs.
At one point, I envisaged to start a 2.4.38 with a bunch of updated drivers.
Now I'd prefer that the users migrate to 2.6. 

Please read the lkml.org message.

Wednesday, September 1, 2010

The cost of static

Hi,

gcc-4.5.1

Suppose we have this very simple (and dumb) code:
void save_state(int i) {
    static int _foo_i = i + 0x09;
    static int _foo_j = i;
}



g++ -O2 will give us:

   4005c0 <+0>:    cmpb   $0x0,0x200491(%rip)        # 0x600a58 <_ZGVZ10save_stateiE6_foo_i>
   4005c7 <+7>:    push   %rbx
   4005c8 <+8>:    mov    %edi,%ebx
   4005ca <+10>:    je     0x400600 <_Z10save_statei+64>
   4005cc <+12>:    cmpb   $0x0,0x20048d(%rip)        # 0x600a60 <_ZGVZ10save_stateiE6_foo_j>
   4005d3 <+19>:    je     0x4005e0 <_Z10save_statei+32>
   4005d5 <+21>:    pop    %rbx
   4005d6 <+22>:    retq  
   4005d7 <+23>:    nopw   0x0(%rax,%rax,1)
   4005e0 <+32>:    mov    $0x600a60,%edi
   4005e5 <+37>:    callq  0x4004a0 <__cxa_guard_acquire@plt>
   4005ea <+42>:    test   %eax,%eax
   4005ec <+44>:    je     0x4005d5 <_Z10save_statei+21>
   4005ee <+46>:    mov    %ebx,0x200474(%rip)        # 0x600a68 <_ZZ10save_stateiE6_foo_j>
   4005f4 <+52>:    mov    $0x600a60,%edi
   4005f9 <+57>:    pop    %rbx
   4005fa <+58>:    jmpq   0x4004c0 <__cxa_guard_release@plt>
   4005ff <+63>:    nop
   400600 <+64>:    mov    $0x600a58,%edi
   400605 <+69>:    callq  0x4004a0 <__cxa_guard_acquire@plt>
   40060a <+74>:    test   %eax,%eax
   40060c <+76>:    je     0x4005cc <_Z10save_statei+12>
   40060e <+78>:    lea    0x9(%rbx),%eax
   400611 <+81>:    mov    $0x600a58,%edi
   400616 <+86>:    mov    %eax,0x200450(%rip)        # 0x600a6c <_ZZ10save_stateiE6_foo_i>
   40061c <+92>:    callq  0x4004c0 <__cxa_guard_release@plt>
   400621 <+97>:    jmp    0x4005cc <_Z10save_statei+12>


First of all, we check the global guard _ZGVZ10save_stateiE6_foo_i to see whether the local
_ZZ10save_stateiE6_foo_i has already been initialized (the guard also serves to
protect that initialization).

After all the crazy dynamic-linker work (do_lookup_x, _dl_name_match_p, check_match.10800, _dl_lookup_symbol_x, etc.)
we have
0x00007ffff7b913a3 <+179>:    movb   $0x1,0x1(%rdi)
in __cxa_guard_acquire, which sets our global _ZGVZ10save_stateiE6_foo_i to:
0x600a58 <_ZGVZ10save_stateiE6_foo_i>:    0x00000100

and
   0x00007ffff7b91459 <+57>:    movb   $0x0,0x1(%rdi)
   0x00007ffff7b9145d <+61>:    movb   $0x1,(%rdi)


in __cxa_guard_release which sets _ZGVZ10save_stateiE6_foo_i to:
0x600a58 <_ZGVZ10save_stateiE6_foo_i>:    0x00000001


g++ -Os will give us:
   4005b4 <+0>:    cmpb   $0x0,0x20046d(%rip)        # 0x600a28 <_ZGVZ10save_stateiE6_foo_i>
   4005bb <+7>:    push   %rbx
   4005bc <+8>:    mov    %edi,%ebx
   4005be <+10>:    jne    0x4005e1 <_Z10save_statei+45>
   4005c0 <+12>:    mov    $0x600a28,%edi
   4005c5 <+17>:    callq  0x4004a0 <__cxa_guard_acquire@plt>
   4005ca <+22>:    test   %eax,%eax
   4005cc <+24>:    je     0x4005e1 <_Z10save_statei+45>
   4005ce <+26>:    lea    0x9(%rbx),%eax
   4005d1 <+29>:    mov    $0x600a28,%edi
   4005d6 <+34>:    mov    %eax,0x200460(%rip)        # 0x600a3c <_ZZ10save_stateiE6_foo_i>
   4005dc <+40>:    callq  0x4004c0 <__cxa_guard_release@plt>
   4005e1 <+45>:    cmpb   $0x0,0x200448(%rip)        # 0x600a30 <_ZGVZ10save_stateiE6_foo_j>
   4005e8 <+52>:    jne    0x400609 <_Z10save_statei+85>
   4005ea <+54>:    mov    $0x600a30,%edi
   4005ef <+59>:    callq  0x4004a0 <__cxa_guard_acquire@plt>
   4005f4 <+64>:    test   %eax,%eax
   4005f6 <+66>:    je     0x400609 <_Z10save_statei+85>
   4005f8 <+68>:    mov    %ebx,0x20043a(%rip)        # 0x600a38 <_ZZ10save_stateiE6_foo_j>
   4005fe <+74>:    mov    $0x600a30,%edi
   400603 <+79>:    pop    %rbx
   400604 <+80>:    jmpq   0x4004c0 <__cxa_guard_release@plt>
   400609 <+85>:    pop    %rbx
   40060a <+86>:    retq



By default, g++ also tries to help us with
__cxa_guard_acquire/__cxa_guard_release,

which make the initialization of local statics thread-safe.

If you don't need it - don't pay for it.

g++ -O2 -fno-threadsafe-statics
g++ -Os -fno-threadsafe-statics
will generate identical code:
   4004d4 <+0>:    cmpb   $0x0,0x20042d(%rip)        # 0x600908 <_ZGVZ10save_stateiE6_foo_i>
   4004db <+7>:    jne    0x4004ed <_Z10save_statei+25>
   4004dd <+9>:    lea    0x9(%rdi),%eax
   4004e0 <+12>:    movb   $0x1,0x200421(%rip)        # 0x600908 <_ZGVZ10save_stateiE6_foo_i>
   4004e7 <+19>:    mov    %eax,0x20042f(%rip)        # 0x60091c <_ZZ10save_stateiE6_foo_i>
   4004ed <+25>:    cmpb   $0x0,0x20041c(%rip)        # 0x600910 <_ZGVZ10save_stateiE6_foo_j>
   4004f4 <+32>:    jne    0x400503 <_Z10save_statei+47>
   4004f6 <+34>:    mov    %edi,0x20041c(%rip)        # 0x600918 <_ZZ10save_stateiE6_foo_j>
   4004fc <+40>:    movb   $0x1,0x20040d(%rip)        # 0x600910 <_ZGVZ10save_stateiE6_foo_j>
   400503 <+47>:    retq



The interesting part here is a plain byte store to the guard variable:
movb $0x1, _ZGVZ10save_stateiE6_foo_i

Keep it simple.
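For comparison, here is a plain C sketch of what the -fno-threadsafe-statics code above boils down to: one guard byte per static, tested on every call and set once the value has been stored. The C++ source isn't shown, so this assumes save_state(int i) contains something like "static int foo_i = i + 9;" and "static int foo_j = i;" (a guess reconstructed from the lea 0x9(%rdi) and mov %edi stores). The guard and value variable names are made up to mirror the mangled symbols.

```c
/*
 * Hypothetical C equivalent of the g++ -O2/-Os -fno-threadsafe-statics
 * output above, assuming save_state(int i) holds
 *     static int foo_i = i + 9;
 *     static int foo_j = i;
 * Names are made up to mirror the mangled symbols.
 */
static unsigned char guard_foo_i;  /* role of _ZGVZ10save_stateiE6_foo_i */
static unsigned char guard_foo_j;  /* role of _ZGVZ10save_stateiE6_foo_j */
static int foo_i;                  /* role of _ZZ10save_stateiE6_foo_i */
static int foo_j;                  /* role of _ZZ10save_stateiE6_foo_j */

void save_state(int i)
{
    if (!guard_foo_i) {     /* cmpb $0x0, guard ; jne ... */
        foo_i = i + 9;      /* lea 0x9(%rdi),%eax ; mov %eax, foo_i */
        guard_foo_i = 1;    /* movb $0x1, guard */
    }
    if (!guard_foo_j) {
        foo_j = i;          /* mov %edi, foo_j */
        guard_foo_j = 1;
    }
    /* No locking at all: two threads entering here for the first time
     * could both run the initializers. Preventing that is exactly what
     * __cxa_guard_acquire/__cxa_guard_release are for. */
}
```

Calling save_state(1) and then save_state(5) leaves foo_i == 10 and foo_j == 1: only the first call initializes anything.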

Friday, August 27, 2010

ANSI C vs ISO C++

Q:
int x = 0;
int y = 0;
(1 ? x : y) = 4;



A:
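This is not legal in C: the result of the conditional operator is not an lvalue,
so it cannot be assigned to. The standard also says:
6.5.15/4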
If an attempt is made to modify the result of a conditional operator
or to access it after the next sequence point, the behavior is undefined.


However, in C++ we have:
5.16/4
If the second and third operands are lvalues and have the same type,
the result is of that type and is an lvalue.
5.16/5
Otherwise, the result is an rvalue.


via linkedin.com, cpptrivia.blogspot.com
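If you need the same effect in C, the usual workaround is to select a pointer instead of an object: the conditional still yields an rvalue, but dereferencing that pointer gives a modifiable lvalue. A minimal sketch (the function name is made up for illustration):

```c
/* C workaround for (cond ? x : y) = value;
 * the conditional picks a pointer (an rvalue),
 * and dereferencing it yields a modifiable lvalue. */
void assign_selected(int cond, int *x, int *y, int value)
{
    *(cond ? x : y) = value;
}
```

With int x = 0, y = 0; the call assign_selected(1, &x, &y, 4) sets x to 4, which is what the C++ line does.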

Thursday, August 26, 2010

Google reader

Hello,
Like you, I have lots of subscriptions in Google Reader. Not a big deal. The thing
that freaks me out is Google's brain-damaged option:
- sort subscriptions list.

I mean, Google offers you reading trends, subscription trends, how many times you
clicked, and so on. So the obvious question is: why not sort subscriptions by demand,
according to those stats? E.g. sort by "Items/Day", "% Read", or "# Read".

And no, "sort alphabetically" or "sort by drag-and-drop" ain't a solution. Really.

Am I missing something?

Wednesday, August 25, 2010

acer aspire bios update

Time server: europe.pool.ntp.org (ntpdate run roughly every second)

BIOS 1.13
adjust time server 80.96.120.252 offset 0.104322 sec
adjust time server 80.96.120.252 offset 0.099397 sec
adjust time server 80.96.120.252 offset 0.096243 sec
adjust time server 80.96.120.252 offset 0.091021 sec
adjust time server 80.96.120.252 offset 0.087349 sec
adjust time server 80.96.120.252 offset 0.087027 sec
adjust time server 80.96.120.252 offset 0.085541 sec
adjust time server 80.96.120.252 offset 0.082374 sec


BIOS 1.06
adjust time server 88.191.117.61 offset 0.002307 sec
adjust time server 88.191.117.61 offset 0.003218 sec
adjust time server 88.191.117.61 offset 0.003611 sec
adjust time server 88.191.117.61 offset 0.004132 sec
adjust time server 88.191.117.61 offset 0.004492 sec
adjust time server 88.191.117.61 offset 0.004967 sec
adjust time server 88.191.117.61 offset 0.005108 sec
adjust time server 88.191.117.61 offset 0.005881 sec
adjust time server 88.191.117.61 offset 0.006293 sec
adjust time server 88.191.117.61 offset 0.006553 sec
adjust time server 88.191.117.61 offset 0.007018 sec
adjust time server 88.191.117.61 offset 0.007411 sec
adjust time server 88.191.117.61 offset 0.007894 sec
adjust time server 88.191.117.61 offset 0.008193 sec
adjust time server 88.191.117.61 offset 0.008534 sec
adjust time server 88.191.117.61 offset 0.008972 sec
adjust time server 88.191.117.61 offset 0.008819 sec
adjust time server 88.191.117.61 offset 0.009721 sec
adjust time server 88.191.117.61 offset 0.010161 sec
adjust time server 88.191.117.61 offset 0.012594 sec

...To Infinity and Beyond!... CANCELLED

Monday, August 23, 2010

touch_(nmi|softlockup)_watchdog

Hello,

It is a mistake to think you can solve any major problems just with potatoes.
Douglas Adams

[   67.703556] BUG: using smp_processor_id() in preemptible [00000000] code
[   67.703563] caller is touch_nmi_watchdog+0x15/0x2c
[   67.703568] Call Trace:
[   67.703575]  [<ffffffff811f6bf1>] debug_smp_processor_id+0xc9/0xe4
[   67.703578]  [<ffffffff81092766>] touch_nmi_watchdog+0x15/0x2c
[   67.703584]  [<ffffffff81222950>] acpi_os_stall+0x34/0x40
[   67.703589]  [<ffffffff812398d2>] acpi_ex_system_do_stall+0x34/0x38
[   67.703591]  [<ffffffff81238396>] acpi_ex_opcode_1A_0T_0R+0x6d/0xa1
[   67.703595]  [<ffffffff8122e280>] acpi_ds_exec_end_op+0xf8/0x578
[   67.703598]  [<ffffffff812457f9>] acpi_ps_parse_loop+0x88a/0xa55
[   67.703604]  [<ffffffff81244a00>] acpi_ps_parse_aml+0x104/0x3c4
[   67.703607]  [<ffffffff81246198>] acpi_ps_execute_method+0x20f/0x2f3
[   67.703610]  [<ffffffff8124021f>] acpi_ns_evaluate+0x18b/0x2d2
[   67.703614]  [<ffffffff8123fad0>] acpi_evaluate_object+0x1b8/0x2fc
[   67.703617]  [<ffffffff8123e020>] ? acpi_get_sleep_type_data+0x21c/0x236
[   67.703620]  [<ffffffff8123d9fb>] acpi_enter_sleep_state_prep+0x61/0xd9
[   67.703623]  [<ffffffff81224205>] acpi_sleep_prepare+0x4f/0x56
[   67.703626]  [<ffffffff81224268>] __acpi_pm_prepare+0x13/0x2e
[   67.703629]  [<ffffffff81224448>] acpi_pm_prepare+0xe/0x1f
[   67.703632]  [<ffffffff81224466>] acpi_hibernation_pre_snapshot+0xd/0x1e
[   67.703637]  [<ffffffff81071b80>] hibernation_snapshot+0xaf/0x258
[   67.703641]  [<ffffffff81074dca>] snapshot_ioctl+0x25c/0x547
[   67.703645]  [<ffffffff81056efc>] ? __srcu_read_unlock+0x3b/0x57
[   67.703649]  [<ffffffff810e7f7d>] vfs_ioctl+0x31/0xa2
[   67.703652]  [<ffffffff810e88dc>] do_vfs_ioctl+0x47c/0x4af
[   67.703655]  [<ffffffff8125ee3c>] ? n_tty_write+0x0/0x35e
[   67.703659]  [<ffffffff8100203a>] ? sysret_check+0x2e/0x69
[   67.703662]  [<ffffffff810e8960>] sys_ioctl+0x51/0x75
[   67.703665]  [<ffffffff81002002>] system_call_fastpath+0x16/0x1b
[...]
[   67.703668] BUG: using smp_processor_id() in preemptible [00000000] code
[   67.703670] caller is touch_softlockup_watchdog+0x15/0x2b
[   67.703674] Call Trace:
[   67.703677]  [<ffffffff811f6bf1>] debug_smp_processor_id+0xc9/0xe4
[   67.703680]  [<ffffffff8109273b>] touch_softlockup_watchdog+0x15/0x2b
[   67.703682]  [<ffffffff81092779>] touch_nmi_watchdog+0x28/0x2c
[...]



Sometimes things are much more complicated than they look at first.
The obvious solution (get_cpu()/put_cpu() around smp_processor_id()) is also the wrong one.

Frederic Weisbecker wrote:
[..]
It is buggy by nature.
[..]
The problem is on the caller. Considering such udelays loop:

* if it's in a irq disabled section, call touch_nmi_watchdog(), because this
  could prevent the nmi watchdog irq from firing
* if it's in a non-preemptable section, call touch_softlockup_watchdog(), because
  this could prevent the softlockup watchdog task from beeing scheduled
* if it's from a preemptable task context, this should call cond_resched() to
  avoid huge latencies on !CONFIG_PREEMPT

But acpi_os_stall() seem to be called from 4 different places, and these places
may run in different context like the above described.
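The rule from the quote can be written down as a small decision function. This is a userspace model only: the two inputs stand in for what irqs_disabled() and the preempt count would tell you in the kernel, and every name below is hypothetical.

```c
#include <stdbool.h>

/* What a long busy-wait (udelay()-style) loop should do, per the rule above. */
enum stall_action {
    STALL_TOUCH_NMI_WATCHDOG,        /* irqs disabled: the NMI watchdog irq can't fire */
    STALL_TOUCH_SOFTLOCKUP_WATCHDOG, /* non-preemptible: the watchdog task can't run */
    STALL_COND_RESCHED,              /* preemptible task: just let others run */
};

/* irqs_off and preempt_count model the caller's context (hypothetical inputs).
 * In the kernel, touch_nmi_watchdog() also touches the softlockup watchdog,
 * as the second trace above shows. */
enum stall_action stall_action_for(bool irqs_off, int preempt_count)
{
    if (irqs_off)
        return STALL_TOUCH_NMI_WATCHDOG;
    if (preempt_count > 0)
        return STALL_TOUCH_SOFTLOCKUP_WATCHDOG;
    return STALL_COND_RESCHED;
}
```

The catch is that acpi_os_stall() can't hardcode one branch, because its callers run in all three contexts.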

So get_cpu()/put_cpu() would just mask the problem, while what we actually need is to fix it in the callers. That wasn't obvious to me (little, silly me).

So, git reset: back to the previous code in touch_(nmi|softlockup)_watchdog.

Saturday, August 21, 2010

thermal_throttle_add_dev

Hello,

[13874.228704] BUG: using smp_processor_id() in preemptible [00000000] code: bash/10661
[13874.228715] caller is thermal_throttle_add_dev+0x20/0xa4
[13874.228721] Pid: 10661, comm: bash Not tainted 2.6.36-rc0-git11-07631-gfa34556-dirty #109
[13874.228725] Call Trace:
[13874.228738]  [<ffffffff811f6285>] debug_smp_processor_id+0xc9/0xe4
[13874.228744]  [<ffffffff8136ccac>] thermal_throttle_add_dev+0x20/0xa4
[13874.228750]  [<ffffffff8136cd82>] thermal_throttle_cpu_callback+0x52/0xb5
[13874.228759]  [<ffffffff81057198>] notifier_call_chain+0x32/0x5e
[13874.228767]  [<ffffffff8103c818>] ? cpu_maps_update_begin+0x12/0x14
[13874.228774]  [<ffffffff810571e3>] __raw_notifier_call_chain+0x9/0xb
[13874.228780]  [<ffffffff8103c6db>] __cpu_notify+0x1b/0x2d
[13874.228786]  [<ffffffff8136eeb2>] _cpu_up+0x6b/0xe9
[13874.228792]  [<ffffffff8136ef7a>] cpu_up+0x4a/0x57
[13874.228799]  [<ffffffff813638b7>] store_online+0x41/0x6e
[13874.228807]  [<ffffffff8129186b>] sysdev_store+0x1b/0x1d
[13874.228816]  [<ffffffff8113188e>] sysfs_write_file+0x103/0x13f
[13874.228824]  [<ffffffff810da5ff>] vfs_write+0xb1/0x14e
[13874.228830]  [<ffffffff810da898>] sys_write+0x45/0x6c
[13874.228840]  [<ffffffff81002002>] system_call_fastpath+0x16/0x1b


My first solution was quite simple: just don't call smp_processor_id() from preemptible context.
In other words, something like this:

--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -205,7 +205,9 @@ static int therm_throt_process(bool new_event, int event, int level)
 static __cpuinit int thermal_throttle_add_dev(struct sys_device *sys_dev)
 {
     int err;
-    struct cpuinfo_x86 *c = &cpu_data(smp_processor_id());
+    int cpu = get_cpu();
+    struct cpuinfo_x86 *c = &cpu_data(cpu);
+    put_cpu();

     err = sysfs_create_group(&sys_dev->kobj, &thermal_attr_group);
     if (err)

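However, we know the exact cpu when we are about to call thermal_throttle_add_dev().
Moreover, on the CPU_UP_PREPARE path the callback runs on whichever CPU handles the
hotplug request, not on the CPU being brought up, so smp_processor_id() (or
get_cpu()/put_cpu()) is not just redundant here: it picks the wrong cpu_data.
Here is the second solution: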

--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -202,11 +202,12 @@ static int therm_throt_process(bool new_event, int event, int level)

 #ifdef CONFIG_SYSFS
 /* Add/Remove thermal_throttle interface for CPU device: */
-static __cpuinit int thermal_throttle_add_dev(struct sys_device *sys_dev)
+static __cpuinit int thermal_throttle_add_dev(struct sys_device *sys_dev,
+                unsigned int cpu)
 {
     int err;
-    struct cpuinfo_x86 *c = &cpu_data(smp_processor_id());
-
+    struct cpuinfo_x86 *c = &cpu_data(cpu);
+   
     err = sysfs_create_group(&sys_dev->kobj, &thermal_attr_group);
     if (err)
         return err;
@@ -251,7 +252,7 @@ thermal_throttle_cpu_callback(struct notifier_block *nfb,
     case CPU_UP_PREPARE:
     case CPU_UP_PREPARE_FROZEN:
         mutex_lock(&therm_cpu_lock);
-        err = thermal_throttle_add_dev(sys_dev);
+        err = thermal_throttle_add_dev(sys_dev, cpu);
         mutex_unlock(&therm_cpu_lock);
         WARN_ON(err);
         break;
@@ -287,7 +288,7 @@ static __init int thermal_throttle_init_device(void)
 #endif
     /* connect live CPUs to sysfs */
     for_each_online_cpu(cpu) {
-        err = thermal_throttle_add_dev(get_cpu_sysdev(cpu));
+        err = thermal_throttle_add_dev(get_cpu_sysdev(cpu), cpu);
         WARN_ON(err);
     }
 #ifdef CONFIG_HOTPLUG_CPU



Good news everyone:

Subject: [tip:x86/urgent] x86, hwmon ...
Commit-ID:  51e3c1b558b31b11bf5fc66d3c6f5adacf3573f7


Thanks.


P.S. by the way:
LKML-Reference: <20100820073634.GB5209@swordfish.minsk.epam.com>

I'll ask EPAM for a bonus... $10,000... cash. If you know what I mean. Yeah!

Wednesday, August 11, 2010

reiserfs evict inode

Hi,

2.6.36-rc0-git11-07128-g4104046-dirty

[ 2213.717957] ------------[ cut here ]------------
[ 2213.719401] kernel BUG at fs/inode.c:298!
[ 2213.720821] invalid opcode: 0000 [#7] PREEMPT SMP
[ 2213.722248] last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:13/PNP0C0A:00/power_supply/BAT0/status
[ 2213.723692] CPU 0
[ 2213.729816]
[..]
[ 2213.753445] Stack:
[ 2213.754980]  ffff880135abfde8 ffff880117474610 ffff880135abfe58 ffffffff8113a1f8
[ 2213.755021] <0> ffff880155543800 0000000000000000 0000000000000000 0000000000000000
[ 2213.756590] <0> 0000000000000000 0000000000000000 0000000000000000 0000000000000001
[ 2213.759677] Call Trace:
[ 2213.761205]  [<ffffffff8113a1f8>] reiserfs_evict_inode+0x13c/0x151
[ 2213.762736]  [<ffffffff810ebcac>] evict+0x22/0x92
[ 2213.764248]  [<ffffffff810ec884>] iput+0x1c8/0x228
[ 2213.765746]  [<ffffffff810e3ca1>] do_unlinkat+0x107/0x15a
[ 2213.767393]  [<ffffffff810e1654>] ? path_put+0x2c/0x30
[ 2213.768909]  [<ffffffff8136d760>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 2213.770368]  [<ffffffff810e5004>] sys_unlink+0x11/0x13
[ 2213.771806]  [<ffffffff81002002>] system_call_fastpath+0x16/0x1b
[ 2213.773246] Code: 02 00 00 00 74 02 0f 0b 48 8d 87 e0 02 00 00 48 39 87 e0 02 00 00 74 02 0f 0b 48 8b 87 30 03 00 00 a8 20 75 02 0f 0b a8 40 74 02 <0f> 0b a8 80 74 1d 48 8d bf 30 03 00 00 b9 02 00 00 00 48 c7 c2
[ 2213.776684] RIP  [<ffffffff810ebc58>] end_writeback+0x3b/0x6d
[ 2213.778296]  RSP <ffff880135abfdd8>
[ 2213.798230] ---[ end trace 4b833f744d46ce1f ]---



The problem is that on the delete path reiserfs_evict_inode() falls through into
the no_delete: label and calls end_writeback() a second time, hitting
kernel BUG at fs/inode.c:298 because inode->i_state is already I_CLEAR.
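The bug is easy to reproduce in miniature. Below is a userspace model, not reiserfs code: struct toy_inode and the toy_* functions are made up, the delete path is assumed to be taken when the inode has no links left, and a call counter stands in for the BUG() that the second end_writeback() triggers.

```c
/* Hypothetical model of the fall-through bug; not reiserfs code. */
struct toy_inode {
    int nlink;                /* still linked: skip the delete path */
    int end_writeback_calls;  /* the kernel BUG()s on the second call */
};

static void toy_end_writeback(struct toy_inode *inode)
{
    inode->end_writeback_calls++;
}

void toy_evict_buggy(struct toy_inode *inode)
{
    if (inode->nlink)
        goto no_delete;
    /* ... delete the object ... */
    toy_end_writeback(inode);
    /* missing return: execution falls into no_delete: */
no_delete:
    toy_end_writeback(inode);
}

void toy_evict_fixed(struct toy_inode *inode)
{
    if (inode->nlink)
        goto no_delete;
    /* ... delete the object ... */
    toy_end_writeback(inode);
    return;                   /* the one-line fix from the patch */
no_delete:
    toy_end_writeback(inode);
}
```

On an unlinked inode the buggy version calls toy_end_writeback() twice; the fixed version calls it exactly once on either path.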

---

diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
index ae35413..87e11d2 100644
--- a/fs/reiserfs/inode.c
+++ b/fs/reiserfs/inode.c
@@ -83,7 +83,8 @@ void reiserfs_evict_inode(struct inode *inode)
        dquot_drop(inode);
        inode->i_blocks = 0;
        reiserfs_write_unlock_once(inode->i_sb, depth);
-
+       return;
+
 no_delete:
        end_writeback(inode);
        dquot_drop(inode);



Al Viro wrote:
Applied, thanks.

Tuesday, August 10, 2010

scheduling while atomic

At first I was like
uname -a
Linux 2.6.36-rc0-git9-06241-g2ba110e-dirty

but then I was like
[    0.010045] BUG: scheduling while atomic: swapper/0/0x10000002
[    0.010140] no locks held by swapper/0.
[    0.010219] Modules linked in:
[    0.010356] Pid: 0, comm: swapper Not tainted 2.6.36-rc0-git9-06241-g2ba110e-dirty #99
[    0.010455] Call Trace:
[    0.010535]  [] __schedule_bug+0x72/0x77
[    0.010622]  [] schedule+0xdc/0x8f2
[    0.010707]  [] __cond_resched+0x13/0x1f
[    0.010792]  [] _cond_resched+0x29/0x30
[    0.010878]  [] acpi_ps_complete_op+0x2c6/0x2db
[    0.010965]  [] ? acpi_ds_load1_end_op+0x51/0x249
[    0.011052]  [] acpi_ps_parse_loop+0x8b8/0xa55
[    0.011139]  [] ? trace_hardirqs_on+0xd/0xf
[    0.011224]  [] acpi_ps_parse_aml+0x104/0x3c4
[    0.011310]  [] acpi_ns_one_complete_parse+0x125/0x142
[    0.011399]  [] ? acpi_os_signal_semaphore+0x5f/0x6f
[    0.011482]  [] acpi_ns_parse_table+0x49/0x8e
[    0.011567]  [] acpi_ns_load_table+0x78/0x114
[    0.011652]  [] acpi_load_tables+0xa1/0x18e
[    0.011736]  [] acpi_early_init+0x6c/0xf7
[    0.011821]  [] start_kernel+0x3fd/0x40d
[    0.011905]  [] x86_64_start_reservations+0xb1/0xb5
[    0.011990]  [] x86_64_start_kernel+0xf8/0x107



UPD:

 #define ACPI_PREEMPTION_POINT() \
  do { \
-  if (!in_atomic_preempt_off() && !irqs_disabled()) \
+  if (!irqs_disabled()) \
    cond_resched(); \
  } while (0)
+#endif

Saturday, July 31, 2010

Back to the future!

Hi,

dmesg:

[    0.000000] ... MAX_LOCKDEP_CHAINS:      32768
[    0.000000] ... CHAINHASH_SIZE:          16384
[    0.000000]  memory used by lock dependency info: 5855 kB
[    0.000000]  per task-struct memory footprint: 1920 bytes
[    0.000000] hpet clockevent registered
[    0.000000] Fast TSC calibration using PIT
[    0.003333] Detected 2261.025 MHz processor.
[    0.000012] Calibrating delay loop (skipped), value calculated using timer frequency.. 4523.46 BogoMIPS (lpj=7536750)
[    0.000176] pid_max: default: 32768 minimum: 301
[    0.000308] Security Framework initialized
[    0.000413] Mount-cache hash table entries: 256
[    0.001246] Initializing cgroup subsys ns
[    0.001328] Initializing cgroup subsys cpuacct
[    0.001413] Initializing cgroup subsys devices
[    0.001493] Initializing cgroup subsys blkio



Actually, the one and only cool thing about the Acer Aspire 5741G. Yeah...
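The timestamps jump back from 0.003333 to 0.000012 right after "Detected 2261.025 MHz processor.". To spot such jumps in a longer dmesg dump, a small helper like this one will do. It is a sketch: the function name is made up, and it assumes the usual "[ seconds.micros]" prefix on each line.

```c
#include <stddef.h>
#include <stdio.h>

/* Return the index of the first dmesg line whose "[ seconds]" timestamp
 * is smaller than the previous one, or -1 if time never goes backwards.
 * Lines without a leading timestamp are skipped. */
long first_backwards_jump(const char *const lines[], size_t n)
{
    double prev = 0.0;
    int have_prev = 0;

    for (size_t i = 0; i < n; i++) {
        double t;

        if (sscanf(lines[i], " [%lf]", &t) != 1)
            continue;
        if (have_prev && t < prev)
            return (long)i;
        prev = t;
        have_prev = 1;
    }
    return -1;
}
```

Fed the lines from "Fast TSC calibration using PIT" through "Calibrating delay loop", it points at the "Calibrating delay loop" line.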

Thursday, July 8, 2010

I/O load

Hi,

Finding the source of I/O load (as root):

echo 1 > /proc/sys/vm/block_dump
dmesg    # pipe through grep | sort | head, whatever you like
echo 0 > /proc/sys/vm/block_dump

[23303.008133] jbd2/sda7-8(3588): WRITE block 73544 on sda7
[23303.009090] jbd2/sda5-8(1315): WRITE block 2506872 on sda5
[23303.009169] jbd2/sda5-8(1315): WRITE block 8740776 on sda5
[23303.009189] jbd2/sda5-8(1315): WRITE block 8740784 on sda5
[23303.009204] jbd2/sda5-8(1315): WRITE block 8740792 on sda5
[23303.012135] syslog-ng(3725): READ block 1572968 on sda5
[23303.012168] syslog-ng(3725): READ block 1573000 on sda5
[23303.012201] syslog-ng(3725): READ block 1573064 on sda5
[23303.012221] syslog-ng(3725): READ block 1573080 on sda5
[23303.012243] syslog-ng(3725): READ block 1573104 on sda5
[23303.012261] syslog-ng(3725): READ block 1573176 on sda5
[23303.068168] syslog-ng(3725): dirtied inode 14390 (everything.log) on sda5
[23303.068194] syslog-ng(3725): dirtied inode 14390 (everything.log) on sda5
[23303.068205] syslog-ng(3725): dirtied inode 14390 (everything.log) on sda5
[23303.068525] syslog-ng(3725): dirtied inode 14386 (kernel.log) on sda5
[23303.068543] syslog-ng(3725): dirtied inode 14386 (kernel.log) on sda5
[23303.068552] syslog-ng(3725): dirtied inode 14386 (kernel.log) on sda5
[23303.072523] jbd2/sda5-8(1315): WRITE block 8740800 on sda5
[23307.374341] bash(30701): READ block 417840 on sda5
[23307.408615] bash(30701): dirtied inode 7275 (dmesg) on sda5
[23309.008101] jbd2/sda5-8(1315): WRITE block 2711320 on sda5
[23309.008156] jbd2/sda5-8(1315): WRITE block 2806184 on sda5
[23309.008229] jbd2/sda5-8(1315): WRITE block 8740808 on sda5
[23309.008248] jbd2/sda5-8(1315): WRITE block 8740816 on sda5
[23309.008262] jbd2/sda5-8(1315): WRITE block 8740824 on sda5
[23309.020531] jbd2/sda5-8(1315): WRITE block 8740832 on sda5
[23312.836095] flush-8:0(1348): WRITE block 19660808 on sda7
[23312.836137] flush-8:0(1348): WRITE block 19660832 on sda7
[23312.836159] flush-8:0(1348): WRITE block 19661496 on sda7
[23312.836178] flush-8:0(1348): WRITE block 19968464 on sda7
[23312.836288] flush-8:0(1348): WRITE block 0 on sda7
[23312.836307] flush-8:0(1348): WRITE block 8 on sda7
[23315.008123] jbd2/sda5-8(1315): WRITE block 8740840 on sda5
[23315.008246] jbd2/sda5-8(1315): WRITE block 8740848 on sda5
[23315.018987] jbd2/sda5-8(1315): WRITE block 8740856 on sda5
[..]
[23384.161854] firefox(30716): WRITE block 4107056 on sda7
[23384.161860] firefox(30716): WRITE block 4107064 on sda7
[23384.161867] firefox(30716): WRITE block 4107072 on sda7
[23384.161873] firefox(30716): WRITE block 4107080 on sda7
[23384.161880] firefox(30716): WRITE block 4107088 on sda7
[23384.161886] firefox(30716): WRITE block 4107096 on sda7
[23384.161952] firefox(30716): WRITE block 4107104 on sda7
[23384.161960] firefox(30716): WRITE block 4107112 on sda7
[23384.161967] firefox(30716): WRITE block 4107120 on sda7
[23384.161974] firefox(30716): WRITE block 4107128 on sda7
[23384.161980] firefox(30716): WRITE block 4107136 on sda7
[23384.161987] firefox(30716): WRITE block 4107144 on sda7
[23384.161993] firefox(30716): WRITE block 4107152 on sda7
[23384.161999] firefox(30716): WRITE block 4107160 on sda7
[23384.162006] firefox(30716): WRITE block 4107168 on sda7
[23384.162013] firefox(30716): WRITE block 4107176 on sda7
[23384.162019] firefox(30716): WRITE block 4107184 on sda7
[23384.162026] firefox(30716): WRITE block 4107192 on sda7
[23384.162033] firefox(30716): WRITE block 4107200 on sda7
[23384.162039] firefox(30716): WRITE block 4107208 on sda7
[23384.162047] firefox(30716): WRITE block 4107216 on sda7
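Instead of a grep | sort pipeline, the same output can be tallied per task. A minimal sketch, assuming the line format shown above ("[time] comm(pid): OP ..."); struct io_tally and tally_block_dump() are made-up names.

```c
#include <stdio.h>
#include <string.h>

#define TASK_LEN 64

/* Per-task counters for block_dump lines such as
 * "[23303.012135] syslog-ng(3725): READ block 1572968 on sda5". */
struct io_tally {
    char task[TASK_LEN];      /* "comm(pid)" as printed by the kernel */
    unsigned reads, writes, dirtied;
};

/* Tally lines into out[] (at most max tasks, in order of first
 * appearance); returns the number of distinct tasks recorded. */
size_t tally_block_dump(const char *const lines[], size_t n,
                        struct io_tally out[], size_t max)
{
    size_t used = 0;

    for (size_t i = 0; i < n; i++) {
        char task[TASK_LEN], op[16];
        size_t j;

        /* skip "[time]", read "comm(pid)" up to ':', then the operation */
        if (sscanf(lines[i], " [%*f] %63[^:]: %15s", task, op) != 2)
            continue;

        for (j = 0; j < used; j++)
            if (strcmp(out[j].task, task) == 0)
                break;
        if (j == used) {              /* first line from this task */
            if (used == max)
                continue;             /* table full: ignore new tasks */
            memset(&out[used], 0, sizeof(out[used]));
            strcpy(out[used].task, task);
            used++;
        }

        if (strcmp(op, "READ") == 0)
            out[j].reads++;
        else if (strcmp(op, "WRITE") == 0)
            out[j].writes++;
        else if (strcmp(op, "dirtied") == 0)
            out[j].dirtied++;
    }
    return used;
}
```

Sorting the resulting array by writes (e.g. with qsort()) then shows the top writers, firefox in the log above.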

Surely You're Joking, Mr. Feynman!

Hi,

Currently I'm reading "Surely You're Joking, Mr. Feynman! (Adventures of a Curious Character)".
And this book rocks. Really. A unique mix of enormous knowledge, curiosity, humor... and
everything else you wouldn't expect from a Nobel Prize-winning physicist.

And yeah, this is recommended reading.


The next one I'd like to read is
"Reminiscences of Los Alamos 1943-1945 (Studies in the History of Modern Science)".

Monday, July 5, 2010

Ever seen a grown man naked?

Greetings everyone,

I've published a simple test program which, however, triggers some 'not so easy'
problems in vfs/reiserfs.

Like a possible deadlock:

[  573.405720]
[  573.405722] =======================================================
[  573.405728] [ INFO: possible circular locking dependency detected ]
[  573.405732] 2.6.35-rc3-dbg-git6-00502-g94feaba-dirty #65
[  573.405735] -------------------------------------------------------
[  573.405739] a.out/7287 is trying to acquire lock:
[  573.405742]  (&sb->s_type->i_mutex_key#10){+.+.+.}, at: [] reiserfs_file_release+0x11d/0x344
[  573.405758]
[  573.405759] but task is already holding lock:
[  573.405762]  (&mm->mmap_sem){++++++}, at: [] sys_mmap_pgoff+0xa4/0xe7
[  573.405772]
[  573.405773] which lock already depends on the new lock.
[  573.405774]
[  573.405777]
[  573.405778] the existing dependency chain (in reverse order) is:
[  573.405781]
[  573.405782] -> #1 (&mm->mmap_sem){++++++}:
[  573.405789]        [] lock_acquire+0x59/0x70
[  573.405797]        [] might_fault+0x53/0x70
[  573.405803]        [] copy_to_user+0x30/0x48
[  573.405809]        [] filldir64+0x95/0xc9
[  573.405815]        [] reiserfs_readdir_dentry+0x35d/0x4d9
[  573.405821]        [] reiserfs_readdir+0x12/0x17
[  573.405827]        [] vfs_readdir+0x6d/0x92
[  573.405831]        [] sys_getdents64+0x63/0xa2
[  573.405836]        [] sysenter_do_call+0x12/0x32
[  573.405843]
[  573.405843] -> #0 (&sb->s_type->i_mutex_key#10){+.+.+.}:
[  573.405851]        [] __lock_acquire+0x96d/0xbe1
[  573.405857]        [] lock_acquire+0x59/0x70
[  573.405862]        [] __mutex_lock_common+0x39/0x36b
[  573.405869]        [] mutex_lock_nested+0x12/0x15
[  573.405874]        [] reiserfs_file_release+0x11d/0x344
[  573.405880]        [] fput+0xe0/0x16a
[  573.405886]        [] remove_vma+0x28/0x47
[  573.405892]        [] do_munmap+0x1e8/0x200
[  573.405897]        [] mmap_region+0x6b/0x372
[  573.405902]        [] do_mmap_pgoff+0x23c/0x282
[  573.405908]        [] sys_mmap_pgoff+0xbd/0xe7
[  573.405913]        [] sysenter_do_call+0x12/0x32
[  573.405919]
[  573.405920] other info that might help us debug this:
[  573.405921]
[  573.405925] 1 lock held by a.out/7287:
[  573.405928]  #0:  (&mm->mmap_sem){++++++}, at: [] sys_mmap_pgoff+0xa4/0xe7
[  573.405937]
[  573.405938] stack backtrace:
[  573.405942] Pid: 7287, comm: a.out Not tainted 2.6.35-rc3-dbg-git6-00502-g94feaba-dirty #65
[  573.405946] Call Trace:
[  573.405951]  [] ? printk+0xf/0x11
[  573.405957]  [] print_circular_bug+0x8a/0x96
[  573.405962]  [] __lock_acquire+0x96d/0xbe1
[  573.405969]  [] ? mark_lock+0x26/0x1b3
[  573.405975]  [] lock_acquire+0x59/0x70
[  573.405980]  [] ? reiserfs_file_release+0x11d/0x344
[  573.405986]  [] __mutex_lock_common+0x39/0x36b
[  573.405991]  [] ? reiserfs_file_release+0x11d/0x344
[  573.405997]  [] mutex_lock_nested+0x12/0x15
[  573.406003]  [] ? reiserfs_file_release+0x11d/0x344
[  573.406008]  [] reiserfs_file_release+0x11d/0x344
[  573.406014]  [] ? fput+0x90/0x16a
[  573.406019]  [] fput+0xe0/0x16a
[  573.406024]  [] remove_vma+0x28/0x47
[  573.406030]  [] ? arch_unmap_area_topdown+0x0/0x18
[  573.406035]  [] do_munmap+0x1e8/0x200
[  573.406040]  [] mmap_region+0x6b/0x372
[  573.406046]  [] do_mmap_pgoff+0x23c/0x282
[  573.406052]  [] sys_mmap_pgoff+0xbd/0xe7
[  573.406058]  [] sysenter_do_call+0x12/0x32
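Reading the report: vfs_readdir() holds the directory's i_mutex when filldir64() faults on the user buffer and takes mmap_sem (chain #1), while mmap() holds mmap_sem when fput() ends up in reiserfs_file_release() taking an i_mutex of the same class (chain #0). A classic AB-BA inversion. Below is a toy lock-order checker, loosely in the spirit of lockdep but far simpler: it only remembers direct pairs (no longer chains, no real lock classes), and the lock names are just labels.

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy lock-order checker (hypothetical, much simpler than lockdep). */
enum toy_lock { TOY_I_MUTEX, TOY_MMAP_SEM, NR_TOY_LOCKS };

static const char *const toy_lock_name[NR_TOY_LOCKS] = {
    [TOY_I_MUTEX]  = "i_mutex",
    [TOY_MMAP_SEM] = "mmap_sem",
};

/* seen_order[a][b]: b has been acquired while a was held */
static bool seen_order[NR_TOY_LOCKS][NR_TOY_LOCKS];

/* Called when lock "next" is acquired while lock "held" is held.
 * Returns false (and prints a report) if the opposite order was
 * recorded earlier; otherwise records this order and returns true. */
bool toy_lock_acquire(enum toy_lock held, enum toy_lock next)
{
    if (seen_order[next][held]) {
        printf("possible circular locking: %s -> %s, but %s -> %s seen before\n",
               toy_lock_name[held], toy_lock_name[next],
               toy_lock_name[next], toy_lock_name[held]);
        return false;
    }
    seen_order[held][next] = true;
    return true;
}
```

Feeding it the two orders from the report, readdir first and mmap second, makes the second call return false, the toy equivalent of the warning above.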



Error causing RO remount:
[  202.300464] REISERFS error (device sda9): vs-2100 add_save_link: search_by_key ([-1 7812832 0x1 IND]) returned 1
[  202.300473] REISERFS (device sda9): Remounting filesystem read-only
[  202.301603] ------------[ cut here ]------------
[  202.301615] WARNING: at fs/reiserfs/journal.c:3436 journal_end+0x5b/0xaf()
[  202.301689] Pid: 5055, comm: a.out Not tainted 2.6.35-rc3-dbg-git6-00502-g94feaba-dirty #65
[  202.301693] Call Trace:
[  202.301701]  [] warn_slowpath_common+0x65/0x7a
[  202.301707]  [] ? journal_end+0x5b/0xaf
[  202.301712]  [] warn_slowpath_null+0xf/0x13
[  202.301718]  [] journal_end+0x5b/0xaf
[  202.301725]  [] reiserfs_truncate_file+0x19f/0x233
[  202.301733]  [] reiserfs_vfs_truncate_file+0xd/0xf
[  202.301738]  [] vmtruncate+0x23/0x29
[  202.301745]  [] inode_setattr+0x47/0x68
[  202.301751]  [] reiserfs_setattr+0x242/0x297
[  202.301758]  [] ? down_write+0x22/0x2a
[  202.301764]  [] notify_change+0x15c/0x26b
[  202.301770]  [] do_truncate+0x64/0x7d
[  202.301776]  [] ? _raw_spin_unlock+0x33/0x3f
[  202.301783]  [] do_last+0x450/0x459
[  202.301789]  [] do_filp_open+0x1c0/0x41a
[  202.301798]  [] ? get_parent_ip+0xb/0x31
[  202.301804]  [] ? sub_preempt_count+0x7c/0x89
[  202.301810]  [] ? alloc_fd+0xb4/0xbf
[  202.301816]  [] do_sys_open+0x48/0xdf
[  202.301821]  [] sys_open+0x1e/0x26
[  202.301827]  [] sysenter_do_call+0x12/0x32
[  202.301833] ---[ end trace c4e3312bdadd2dc5 ]---


And even an Oops...

Al Viro wrote:
> OK...  See 22093b8f3d387f77 in vfs-2.6.git for-next (should
> propagate to git.kernel.org shortly).  That ought to deal with
> this crap, assuming I hadn't fucked up somewhere...


YAY!

/*
 * 2010, Sergey Senozhatsky. GPLv2
 *
*/

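#define _GNU_SOURCE          /* for O_LARGEFILE */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>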


int main(void)
{
    char buf[4096];
    int i = 0;
    /* we don't really care */
    for (; i < 4096; i++)
        buf[i] = (i + 65) % 255;

    for (i = 0; i < 10; i++) {

        pid_t pid = fork();
        if (pid > 0 ) {
            printf("parent...");
        } else if (pid == 0) {
           
            printf("child...\n");
            int fd = open("conftest.mmap", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0600);
            if (fd >= 0) {
                printf("OPEN ok %d\n", fd);
                if (write(fd, buf, 4096) < 0)
                    printf("WRITE error\n");
                else
                    printf("WRITE ok\n");
               
                close(fd);
            } else {
                printf("OPEN error\n");
            }
           
            fd = open("conftest.mmap", O_RDWR|O_LARGEFILE);
            if (fd >= 0) {
                printf("OPEN conftest.mmap %d\n", fd);
               
                void *map = mmap((void*)0xb78a8000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, fd, 0);
                if (map == MAP_FAILED) {
                    printf("MMAP failed\n");
                    close(fd);
                    goto out;
                } else {
                    printf("MMAP ok\n");
                }
               
                if (read(fd, buf, 4096) < 0)
                    printf("READ failed\n");
                else
                    printf("READ ok\n");

                close(fd);
                munmap(map, 4096);
            } else {
                printf("Error: can't open conftest.mmap\n");
            }
           
        out:
            fd = open(".", O_RDONLY|O_LARGEFILE);
            if (fd >= 0) {
                printf("OPEN . ok %d... closing\n", fd);
                close(fd);
            } else {
                printf("OPEN error\n");
            }
           
            struct stat _stat;
            if (fstatat(AT_FDCWD, "conftest.mmap", &_stat, AT_SYMLINK_NOFOLLOW) < 0)
                printf("FSTATAT error\n");
            else
                printf("FSTATAT ok\n");
           
            if (unlinkat(AT_FDCWD, "conftest.mmap", 0) < 0)
                printf("UNLINKAT error\n");
            else
                printf("UNLINKAT ok\n");

            /*
             * Yep...
             * return 0;
             */
        } else {
            printf("FORK error\n");
        }
    }
   
    return 0;
}