How I fixed a kernel regression using git bisect
As a happy user of the ArchLinux testing repository, I’m sometimes “forced” to deal with bugs and regressions that do not break the whole system but still take time to track down and fix.
Besides breakage in userland applications like libvirt or wicd, a kernel regression, which might only affect your own system, can be really difficult to spot.
My Thinkpad T43 notebook has suffered from a “won’t-wakeup-from-suspend” bug since the 3.7 kernel tree, and apparently this bug is still present in the latest stable upstream releases. Searching for the cause of this regression on Google isn’t really helpful, and opening tasks on your distribution’s bug tracker leads you, in this case, to the only but unpopular solution: kernel bisection. After reading about the pros and cons of this unanalytical method, I wanted to give it a try:
git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linux-git
cd linux-git
git bisect start
git bisect good v3.6
git bisect bad v3.7-rc1
As you can see, I define the range in which I assume the culprit commit lies. Now the process of compiling and testing several kernel revisions starts, narrowing down the bug with each step. Unfortunately, I always had to compress the current revision and compile it using the official ArchLinux PKGBUILD file on my remote server, because I couldn’t figure out how to test the kernel afterwards on my local system in a “clean” manner. Installing the bisected kernel with pacman was the more convenient way :)
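Before spending hours on kernel builds, the good/bad loop can be rehearsed on a throwaway repository. The following is only a minimal sketch under stated assumptions: the repository, the file names and the "BROKEN" marker are made up for illustration, and the test is automated with git bisect run, whereas the kernel bisection described here had to be judged by hand after each boot.

```shell
#!/bin/sh
# Toy bisection: everything here (repo, files, "BROKEN" marker) is hypothetical.
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "bisect@example.invalid"
git config user.name "Bisect Demo"

# Create 20 commits; commit 13 introduces the pretend regression.
i=1
while [ "$i" -le 20 ]; do
    echo "change $i" >> history.txt
    if [ "$i" -eq 13 ]; then
        echo "BROKEN" > status.txt
    fi
    git add -A
    git commit -q -m "commit $i"
    i=$((i + 1))
done

# The root commit is known good, HEAD is known bad.
first=$(git rev-list --max-parents=0 HEAD)
git bisect start HEAD "$first" > /dev/null

# The test command exits 0 for a good revision and 1 for a bad one.
git bisect run sh -c '! grep -q BROKEN status.txt 2>/dev/null' > /dev/null

# refs/bisect/bad now points at the first bad commit.
culprit=$(git log -1 --format=%s refs/bisect/bad)
echo "first bad commit: $culprit"

# Return to the branch that was checked out before the bisection.
git bisect reset > /dev/null 2>&1
```

When the test cannot be scripted, as with a suspend bug, the git bisect run line is replaced by building and booting the checked-out revision and then typing git bisect good or git bisect bad by hand.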
So here’s the quite long bisection log. Note that at every step git reported how many commits were left to check.
git bisect start
# good: [49b8c695e331c9685e6ffdbf34872509d77c8459] Merge branch 'x86/fpu' into x86/smap
git bisect good 49b8c695e331c9685e6ffdbf34872509d77c8459
# good: [49b8c695e331c9685e6ffdbf34872509d77c8459] Merge branch 'x86/fpu' into x86/smap
git bisect good 49b8c695e331c9685e6ffdbf34872509d77c8459
# bad: [a20acf99f75e49271381d65db097c9763060a1e8] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next
git bisect bad a20acf99f75e49271381d65db097c9763060a1e8
# good: [06d2fe153b9b35e57221e35831a26918f462db68] Merge tag 'driver-core-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
git bisect good 06d2fe153b9b35e57221e35831a26918f462db68
# good: [3498d13b8090c0b0ef911409fbc503a7c4cca6ef] Merge tag 'tty-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
git bisect good 3498d13b8090c0b0ef911409fbc503a7c4cca6ef
# bad: [61464c8357c8f6b780e4c44f5c79471799c51ca7] Merge tag 'cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
git bisect bad 61464c8357c8f6b780e4c44f5c79471799c51ca7
# good: [cc150a2861e744d8f574d571762cc7e9f928abb3] Merge tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
git bisect good cc150a2861e744d8f574d571762cc7e9f928abb3
# good: [60e59920152c7bafc8a2eb3031a62f22c2bc9e95] Merge branch 'board' of git://github.com/hzhuang1/linux into next/cleanup
git bisect good 60e59920152c7bafc8a2eb3031a62f22c2bc9e95
# bad: [797b9e5ae93270ec27a1f1ed48cd697d01b2269f] Merge branch 'for-linus' of git://git.samba.org/sfrench/cifs-2.6
git bisect bad 797b9e5ae93270ec27a1f1ed48cd697d01b2269f
# good: [71953fc6e4ce5ac05b594d8e5866accf531aa969] cifs: remove kmap lock and rsize limit
git bisect good 71953fc6e4ce5ac05b594d8e5866accf531aa969
# good: [c052e2b423f3eabe9f3f32e60744afa5cf26f6b9] cifs: obtain file access during backup intent lookup (resend)
git bisect good c052e2b423f3eabe9f3f32e60744afa5cf26f6b9
# good: [cdeb9b014331af4282be522824e36f3aa33f0671] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
git bisect good cdeb9b014331af4282be522824e36f3aa33f0671
# good: [a57d985e378ca69f430b85852e4187db3698a89e] Merge tag 'please-pull-ia64-for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux
git bisect good a57d985e378ca69f430b85852e4187db3698a89e
# bad: [b2cc2a074de75671bbed5e2dda67a9252ef353ea] x86, smep, smap: Make the switching functions one-way
git bisect bad b2cc2a074de75671bbed5e2dda67a9252ef353ea
# good: [5a5a51db78ef24aa61a4cb2ae36f07f6fa37356d] x86-32: Start out eflags and cr4 clean
git bisect good 5a5a51db78ef24aa61a4cb2ae36f07f6fa37356d
# bad: [73201dbec64aebf6b0dca855b523f437972dc7bb] x86, suspend: On wakeup always initialize cr4 and EFER
git bisect bad 73201dbec64aebf6b0dca855b523f437972dc7bb
# bad: [73201dbec64aebf6b0dca855b523f437972dc7bb] x86, suspend: On wakeup always initialize cr4 and EFER
git bisect bad 73201dbec64aebf6b0dca855b523f437972dc7bb
# bad: [73201dbec64aebf6b0dca855b523f437972dc7bb] x86, suspend: On wakeup always initialize cr4 and EFER
git bisect bad 73201dbec64aebf6b0dca855b523f437972dc7bb
So, the last part of the session finally showed me a result:
73201dbec64aebf6b0dca855b523f437972dc7bb is the first bad commit
commit 73201dbec64aebf6b0dca855b523f437972dc7bb
Author: H. Peter Anvin <hpa@linux.intel.com>
Date: Wed Sep 26 15:02:34 2012 -0700
x86, suspend: On wakeup always initialize cr4 and EFER
We already have a flag word to indicate the existence of MISC_ENABLES,
so use the same flag word to indicate existence of cr4 and EFER, and
always restore them if they exist. That way if something passes a
nonzero value when the value *should* be zero, we will still
initialize it.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Link: http://lkml.kernel.org/r/1348529239-17943-1-git-send-email-hpa@linux.intel.com
:040000 040000 bb093059ee142f1dd5bd7fe44368ba657701e451 a13e5f81b5a83f783ebeb6317599a7cd6cd4056b M arch
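The length of such a session follows from the halving: each good/bad answer discards roughly half of the remaining commits, so the number of builds grows only with the base-2 logarithm of the range. As a rough sketch (bisect_steps is a hypothetical helper, and git's own estimate can differ by a step or so because kernel history is not linear):

```shell
#!/bin/sh
# Estimate how many halvings are needed to narrow n candidate commits to one.
# bisect_steps is an illustrative helper, not part of git.
bisect_steps() {
    n=$1
    steps=0
    span=1
    # Double the covered span until it reaches the number of candidates.
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2))
        steps=$((steps + 1))
    done
    echo "$steps"
}

bisect_steps 10000   # prints 14
```

For a real range, the input could come from git rev-list --count v3.6..v3.7-rc1, run inside the kernel checkout.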
I contacted the author of this commit, but he was unsure about the cause of the bug and wanted to reproduce it on the same Thinkpad model. Meanwhile, I tried reverting part of this commit and actually fixed the bug with this patch (without knowing what it does):
--- a/arch/x86/realmode/rm/wakeup_asm.S 2013-02-23 13:53:04.280001331 +0000
+++ b/arch/x86/realmode/rm/wakeup_asm.S 2013-02-23 13:54:14.363333655 +0000
@@ -93,8 +93,8 @@
/* Restore MISC_ENABLE before entering protected mode, in case
BIOS decided to clear XD_DISABLE during S3. */
- movl pmode_behavior, %edi
- btl $WAKEUP_BEHAVIOR_RESTORE_MISC_ENABLE, %edi
+ movl pmode_behavior, %eax
+ btl $WAKEUP_BEHAVIOR_RESTORE_MISC_ENABLE, %eax
jnc 1f
movl pmode_misc_en, %eax
@@ -110,15 +110,15 @@
movl pmode_cr3, %eax
movl %eax, %cr3
- btl $WAKEUP_BEHAVIOR_RESTORE_CR4, %edi
- jnc 1f
- movl pmode_cr4, %eax
- movl %eax, %cr4
+ movl pmode_cr4, %ecx
+ jecxz 1f
+ movl %ecx, %cr4
1:
- btl $WAKEUP_BEHAVIOR_RESTORE_EFER, %edi
- jnc 1f
movl pmode_efer, %eax
movl pmode_efer + 4, %edx
+ movl %eax, %ecx
+ orl %edx, %ecx
+ jz 1f
movl $MSR_EFER, %ecx
wrmsr
1:
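One possible way to produce a partial revert like this is to reverse-apply only the part of the suspect commit that touches a single file. This is a sketch on a toy repository, not necessarily how the patch above was created; the file names and contents are made up for illustration.

```shell
#!/bin/sh
# Toy partial revert: undo a commit's changes to one file only.
# The repository and files a.txt / b.txt are hypothetical.
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "revert@example.invalid"
git config user.name "Revert Demo"

printf 'one\n' > a.txt
printf 'one\n' > b.txt
git add -A
git commit -q -m "base"

# The "suspect" commit changes both files.
printf 'two\n' > a.txt
printf 'two\n' > b.txt
git commit -q -am "change both files"
suspect=$(git rev-parse HEAD)

# Take the suspect commit's diff for a.txt only and apply it in reverse
# to the working tree; b.txt keeps the commit's change.
git diff "$suspect^" "$suspect" -- a.txt | git apply -R

a_now=$(cat a.txt)
b_now=$(cat b.txt)
echo "a.txt: $a_now, b.txt: $b_now"
```

The resulting working-tree change can then be written out with git diff and carried into a package build. A patch with a/ and b/ path prefixes, like the one above, is normally applied from the top of the source tree with patch -p1.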
Now my laptop is running the latest stable kernel with the patch applied, and suspend2ram is working again :) Let’s hope that this gets fixed upstream very soon.
Update July 2013: The kernel developer H. Peter Anvin was able to fix the bug with the commit “x86, suspend: Handle CPUs which fail to #GP on RDMSR”. The changes have been included since kernel 3.11-rc2.