Exquisitely sensitive stability testing - the linux kernel!

Discussion in 'Overclocking, Cooling & Modding' started by graysky, Jul 29, 2012.

  1. graysky

    graysky ARP Reviewer

    TL;DR Summary
    The Linux kernel is a powerful tool for detecting instabilities in your overclock settings, with both greater accuracy and greater sensitivity than either Prime95 or IBT/LinX.

    More Details
    The Linux kernel gives users a dead simple method for detecting hardware instabilities, like those caused by an 'unstable' overclock. There is nothing special to install, as this functionality seems to be natively included in the kernel itself. To use it, simply run a standard stress test such as Prime95 or Linpack and watch the output from dmesg. If the system is unstable due to insufficient voltage or excessive heat, the kernel will report:

    Code:
    [Hardware Error]: Machine check events logged
    I have seen the kernel throw these errors during a prime95 run before prime95 itself reported an error in the math. Further, I have seen these errors appear when linpack did not detect that the settings were unstable, as evidenced by the residual number not changing during the run in which the error occurred.
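
A minimal sketch of automating that check, assuming the marker text is exactly the message quoted above. The function names `find_mce_lines` and `read_dmesg` are made up for illustration, and on some systems dmesg needs root to run:

```python
import subprocess

# Marker text taken from the kernel message quoted in the post.
MCE_MARKER = "Machine check events logged"


def find_mce_lines(log_text):
    """Return every line of a kernel log snapshot that contains the marker."""
    return [line for line in log_text.splitlines() if MCE_MARKER in line]


def read_dmesg():
    """Return the current kernel ring buffer as text (may require root)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True)
    return result.stdout
```

Calling `find_mce_lines(read_dmesg())` during or after a stress run returns an empty list when no machine check has been logged since boot.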

    How to Stress Test Under Linux
    Probably the most newb-friendly flavor of Linux is Ubuntu. Users can run it live off a CD or a USB stick without installing it to their systems. Further, it is pre-configured to boot into a GUI with network and hardware autodetected. Download an image from the Ubuntu home page. I recommend the 64-bit version, as 32-bit Linux suffers from the same <4 GB memory limitation that 32-bit Windows does.

    Note: don't feel like Ubuntu is your only option. There are many other Linux distributions out there from which to choose.

    Download the ISO, burn it to a CD or write it to a USB stick, and boot. Ubuntu prompts users to either "try ubuntu" or "install ubuntu." Just hit the "try ubuntu" button and you will be dumped into the live Linux environment.

    Here are a few suggestions for stress testing:
    1) mprime ---> linux version of prime95. Help to download and run mprime.
    2) linpack ---> back end to both LinX and IBT. Help to download and run linpack.
    3) x264 video encoding.
    4) Compiling something large like the linux kernel.

    On my own machine I have been able to pass tests #1 and #2 and yet been unable to get more than 10 min into an x264 encode or to compile something 4-5 times without errors. It is important to test using several orthogonal stresses. While stressing, print the output of the kernel ring buffer. You can do this in one of two ways:

    1) Open a terminal and type dmesg to see a snapshot.
    2) Perhaps more useful is to be informed when something happens rather than typing dmesg over and over again! You can do this with the following command:
    Code:
    sudo cat /proc/kmsg
    It looks like nothing is happening, but actually, the command more or less opened a connection to the ring buffer; it will update when something happens. To test it, plug in a USB thumb drive.

    Example on my box:
    Code:
    <5>[13393.025582] scsi 10:0:0:0: Direct-Access     Kingston DataTraveler 112 1.00 PQ: 0 ANSI: 2
    <5>[13393.026103] sd 10:0:0:0: [sdc] 7831552 512-byte logical blocks: (4.00 GB/3.73 GiB)
    <5>[13393.026449] sd 10:0:0:0: [sdc] Write Protect is off
    Anyway, you will want to watch for that message I posted above:
    Code:
    [Hardware Error]: Machine check events logged
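
Rather than watching the stream by eye, the lines can be parsed and filtered. This is only a sketch: it assumes each line has the `<priority>[seconds] message` shape shown in the example output above, and the names `KmsgRecord`, `parse_kmsg_line` and `watch_for_mce` are made up for illustration.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional

# Matches lines shaped like: <5>[13393.025582] scsi 10:0:0:0: ...
KMSG_LINE = re.compile(r"^<(\d+)>\[\s*(\d+\.\d+)\]\s?(.*)$")


class KmsgRecord(NamedTuple):
    priority: int    # number between the angle brackets
    seconds: float   # timestamp between the square brackets
    message: str     # remainder of the line


def parse_kmsg_line(line: str) -> Optional[KmsgRecord]:
    """Split one kmsg line into its parts; return None if it does not match."""
    match = KMSG_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    return KmsgRecord(int(match.group(1)), float(match.group(2)), match.group(3))


def watch_for_mce(lines: Iterable[str]) -> Iterator[KmsgRecord]:
    """Yield only the records that carry the machine check message."""
    for line in lines:
        record = parse_kmsg_line(line)
        if record is not None and "Machine check events logged" in record.message:
            yield record
```

Passing an open file object for /proc/kmsg (as root) to `watch_for_mce` blocks until a matching line arrives, much like the `cat` command above. Lines that do not parse are skipped silently.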
    EDIT: I edited this thread to add a few more recommendations for stability. Why? Both mprime and linpack are what I term 'higher voltage' stress programs. They will demand the most vcore when run. Options 3 and 4 are 'lower voltage' stress programs. They do not always call for the highest vcore when running and as such can be more prone to throwing errors.

    Example on my overclocked i7-3770K (4.50 GHz); vcore is +0.020 V in offset mode with all powersaving features enabled.

    Idle: 0.7440 V - 0.8320 V (varies).
    Mprime small FFTs: 1.2880 V (steady).
    Mprime large FFTs: 1.3040 V (steady).
    Mprime blend: 1.2960 V (steady).
    Linpack: 1.2320 V - 1.2720 V (varies).
    x264 encoding: 1.2320 V - 1.2720 V (varies).
    gcc compiling: 1.2720 V (steady).

    I can actually run with a vcore of +0.005 and remain stable in both mprime and linpack but get errors under both x264 and gcc. This is why I recommend selecting stresses from both the 'higher voltage' category and the 'lower voltage' category.
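
Repeating a workload from each category can be scripted. The sketch below is one way to do it, not a tested harness: `repeat_command` and `run_plan` are hypothetical helper names, and whatever command lines you put in the plan (for example a kernel build via `make`) are placeholders you supply yourself.

```python
import subprocess


def repeat_command(argv, runs):
    """Run argv `runs` times and return the list of non-zero exit codes."""
    failures = []
    for _ in range(runs):
        result = subprocess.run(
            argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures.append(result.returncode)
    return failures


def run_plan(plan, runs):
    """plan maps a category label to {workload name: argv}.

    Returns {"category/name": number of failed runs}.
    """
    report = {}
    for category, workloads in plan.items():
        for name, argv in workloads.items():
            report[f"{category}/{name}"] = len(repeat_command(argv, runs))
    return report
```

A clean exit code is not proof of stability on its own, so keep watching the kernel log for the machine check message while the plan runs.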
     
    Last edited: May 21, 2013
  2. Adrian Wong

    Adrian Wong Da Boss Staff Member

    Hmm.. It actually logged a direct access error to your Kingston flash memory drive?
     
  3. graysky

    graysky ARP Reviewer

    No, the ring buffer logs most events; I just inserted the USB stick to get it to output something.
     
  4. Adrian Wong

    Adrian Wong Da Boss Staff Member

    Oh, then what would an actual error look like? Do you have a sample?
     
  5. graysky

    graysky ARP Reviewer

     
  6. Adrian Wong

    Adrian Wong Da Boss Staff Member

    Oh, I saw the syntax. But have you recorded any error messages yet?

    Also, can the same error be replicated if you run another test?

    Thanks!
     
  7. graysky

    graysky ARP Reviewer

    That is the error message. You can augment it with a piece of software called mcelog. To answer your question, yes, it is reproducible when voltage is too low.

    Here is an example mcelog on this machine when undervolted below the stability threshold:

    Code:
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    MCE 0
    CPU 2 BANK 0 
    TIME 1343004751 Sun Jul 22 20:52:31 2012
    MCG status:
    MCi status:
    Error enabled
    MCA: Unknown Error 5
    STATUS 9000004000010005 MCGSTATUS 0
    MCGCAP c09 APICID 4 SOCKETID 0 
    CPUID Vendor Intel Family 6 Model 58
    mcelog: Unsupported new Family 6 Model 3a CPU: only decoding architectural errors
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    MCE 0
    CPU 2 BANK 0 
    TIME 1343004753 Sun Jul 22 20:52:33 2012
    MCG status:
    MCi status:
    Error enabled
    MCA: Unknown Error 5
    STATUS 9000004000010005 MCGSTATUS 0
    MCGCAP c09 APICID 4 SOCKETID 0 
    CPUID Vendor Intel Family 6 Model 58
    mcelog: Unsupported new Family 6 Model 3a CPU: only decoding architectural errors
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    MCE 0
    CPU 2 BANK 0 
    TIME 1343004754 Sun Jul 22 20:52:34 2012
    MCG status:
    MCi status:
    Error enabled
    MCA: Unknown Error 5
    STATUS 9000004000010005 MCGSTATUS 0
    MCGCAP c09 APICID 4 SOCKETID 0 
    CPUID Vendor Intel Family 6 Model 58
    mcelog: Unsupported new Family 6 Model 3a CPU: only decoding architectural errors
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    MCE 0
    CPU 2 BANK 0 
    TIME 1343004755 Sun Jul 22 20:52:35 2012
    MCG status:
    MCi status:
    Error enabled
    MCA: Unknown Error 5
    STATUS 9000004000010005 MCGSTATUS 0
    MCGCAP c09 APICID 4 SOCKETID 0 
    CPUID Vendor Intel Family 6 Model 58
    mcelog: Unsupported new Family 6 Model 3a CPU: only decoding architectural errors
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    MCE 0
    CPU 2 BANK 0 
    TIME 1343004756 Sun Jul 22 20:52:36 2012
    MCG status:
    MCi status:
    Error enabled
    MCA: Unknown Error 5
    STATUS 9000004000010005 MCGSTATUS 0
    MCGCAP c09 APICID 4 SOCKETID 0 
    CPUID Vendor Intel Family 6 Model 58
    mcelog: Unsupported new Family 6 Model 3a CPU: only decoding architectural errors
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    MCE 0
    CPU 2 BANK 0 
    TIME 1343004757 Sun Jul 22 20:52:37 2012
    MCG status:
    MCi status:
    Error enabled
    MCA: Unknown Error 5
    STATUS 9000004000010005 MCGSTATUS 0
    MCGCAP c09 APICID 4 SOCKETID 0 
    CPUID Vendor Intel Family 6 Model 58
    mcelog: Unsupported new Family 6 Model 3a CPU: only decoding architectural errors
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    MCE 0
    CPU 2 BANK 0 
    TIME 1343004759 Sun Jul 22 20:52:39 2012
    MCG status:
    MCi status:
    Error enabled
    MCA: Unknown Error 5
    STATUS 9000004000010005 MCGSTATUS 0
    MCGCAP c09 APICID 4 SOCKETID 0 
    CPUID Vendor Intel Family 6 Model 58
    mcelog: Unsupported new Family 6 Model 3a CPU: only decoding architectural errors
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    MCE 0
    CPU 2 BANK 0 
    TIME 1343004759 Sun Jul 22 20:52:39 2012
    MCG status:
    MCi status:
    Error enabled
    MCA: Unknown Error 5
    STATUS 9000004000010005 MCGSTATUS 0
    MCGCAP c09 APICID 4 SOCKETID 0 
    CPUID Vendor Intel Family 6 Model 58
    mcelog: Unsupported new Family 6 Model 3a CPU: only decoding architectural errors
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    MCE 0
    CPU 2 BANK 0 
    TIME 1343004759 Sun Jul 22 20:52:39 2012
    MCG status:
    MCi status:
    Error enabled
    MCA: Unknown Error 5
    STATUS 9000004000010005 MCGSTATUS 0
    MCGCAP c09 APICID 4 SOCKETID 0 
    CPUID Vendor Intel Family 6 Model 58
    mcelog: Unsupported new Family 6 Model 3a CPU: only decoding architectural errors
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    MCE 0
    CPU 2 BANK 0 
    TIME 1343004761 Sun Jul 22 20:52:41 2012
    MCG status:
    MCi status:
    Error enabled
    MCA: Unknown Error 5
    STATUS 9000004000010005 MCGSTATUS 0
    MCGCAP c09 APICID 4 SOCKETID 0 
    CPUID Vendor Intel Family 6 Model 58
    mcelog: Unsupported new Family 6 Model 3a CPU: only decoding architectural errors
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    MCE 0
    CPU 2 BANK 0 
    TIME 1343004762 Sun Jul 22 20:52:42 2012
    MCG status:
    MCi status:
    Error enabled
    MCA: Unknown Error 5
    STATUS 9000004000010005 MCGSTATUS 0
    MCGCAP c09 APICID 4 SOCKETID 0 
    CPUID Vendor Intel Family 6 Model 58
    mcelog: Unsupported new Family 6 Model 3a CPU: only decoding architectural errors
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    MCE 0
    CPU 2 BANK 0 
    TIME 1343004762 Sun Jul 22 20:52:42 2012
    MCG status:
    MCi status:
    Error enabled
    MCA: Unknown Error 5
    STATUS 9000004000010005 MCGSTATUS 0
    MCGCAP c09 APICID 4 SOCKETID 0 
    CPUID Vendor Intel Family 6 Model 58
    mcelog: Unsupported new Family 6 Model 3a CPU: only decoding architectural errors
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    MCE 0
    CPU 2 BANK 0 
    TIME 1343004763 Sun Jul 22 20:52:43 2012
    MCG status:
    MCi status:
    Error enabled
    MCA: Unknown Error 5
    STATUS 9000004000010005 MCGSTATUS 0
    MCGCAP c09 APICID 4 SOCKETID 0 
    CPUID Vendor Intel Family 6 Model 58
    mcelog: Unsupported new Family 6 Model 3a CPU: only decoding architectural errors
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    MCE 0
    CPU 2 BANK 0 
    TIME 1343004763 Sun Jul 22 20:52:43 2012
    MCG status:
    MCi status:
    Error enabled
    MCA: Unknown Error 5
    STATUS 9000004000010005 MCGSTATUS 0
    MCGCAP c09 APICID 4 SOCKETID 0 
    CPUID Vendor Intel Family 6 Model 58
    mcelog: Unsupported new Family 6 Model 3a CPU: only decoding architectural errors
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    MCE 0
    CPU 2 BANK 0 
    TIME 1343004763 Sun Jul 22 20:52:43 2012
    MCG status:
    MCi status:
    Error enabled
    MCA: Unknown Error 5
    STATUS 9000004000010005 MCGSTATUS 0
    MCGCAP c09 APICID 4 SOCKETID 0 
    CPUID Vendor Intel Family 6 Model 58
    mcelog: Unsupported new Family 6 Model 3a CPU: only decoding architectural errors
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    MCE 0
    CPU 2 BANK 0 
    TIME 1343004764 Sun Jul 22 20:52:44 2012
    MCG status:
    MCi status:
    Error enabled
    MCA: Unknown Error 5
    STATUS 9000004000010005 MCGSTATUS 0
    MCGCAP c09 APICID 4 SOCKETID 0 
    CPUID Vendor Intel Family 6 Model 58
    mcelog: Unsupported new Family 6 Model 3a CPU: only decoding architectural errors
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    MCE 0
    CPU 2 BANK 0 
    TIME 1343004766 Sun Jul 22 20:52:46 2012
    MCG status:
    MCi status:
    Error enabled
    MCA: Unknown Error 5
    STATUS 9000004000010005 MCGSTATUS 0
    MCGCAP c09 APICID 4 SOCKETID 0 
    CPUID Vendor Intel Family 6 Model 58
    mcelog: Unsupported new Family 6 Model 3a CPU: only decoding architectural errors
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
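
A log this repetitive is easier to read once summarized. The sketch below counts records per CPU, bank and STATUS value, and splits a STATUS word into a few fields. The bit positions used (VAL 63, OVER 62, UC 61, EN 60, PCC 57, MCA error code in bits 15:0, model-specific code in bits 31:16) are an assumption based on the architectural machine check status layout documented by Intel; check them against the manual for your CPU. The function names are made up for illustration.

```python
import re
from collections import Counter

# Assumed bit positions in the machine check status word (see lead-in).
STATUS_FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN", 57: "PCC"}


def decode_status(status_hex):
    """Split a STATUS hex string into flag names and error code fields."""
    value = int(status_hex, 16)
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (value >> bit) & 1
    ]
    return {
        "flags": flags,
        "mca_code": value & 0xFFFF,            # bits 15:0
        "model_code": (value >> 16) & 0xFFFF,  # bits 31:16
    }


def summarize_mcelog(text):
    """Count complete records per (cpu, bank, status) in mcelog-style text."""
    counts = Counter()
    cpu_bank = None
    for raw in text.splitlines():
        line = raw.strip()
        match = re.match(r"CPU (\d+) BANK (\d+)", line)
        if match:
            cpu_bank = (int(match.group(1)), int(match.group(2)))
            continue
        match = re.match(r"STATUS ([0-9a-fA-F]+)", line)
        if match and cpu_bank is not None:
            counts[cpu_bank + (match.group(1),)] += 1
    return counts
```

Under those assumptions, the STATUS value in the log above decodes to the VAL and EN flags with an MCA error code of 5, which agrees with the "Error enabled" and "Unknown Error 5" lines that mcelog printed.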
     
  8. graysky

    graysky ARP Reviewer

    Edited main post with new info.
     
  9. Adrian Wong

    Adrian Wong Da Boss Staff Member

    Mind if I post it up on Tech ARP's main site, graysky?
     
    Last edited: May 22, 2013
  10. graysky

    graysky ARP Reviewer

    Go for it. I just wish the software I recommend were actually in the Ubuntu repos so users could get at it easily from the live CD. Both linpack and mprime need to be downloaded as binaries. Handbrake (a really good x264 encoder) needs to be installed from a PPA. Only gcc/cc is available from the repos. For this to be useful, people really need to put Linux on a small partition rather than running it from a live CD. Although that is easy to do, it is a bit more involved for non-Linux users.

    EDIT: I have written the same content on the Arch wiki for Arch Linux users (and others) as well: https://wiki.archlinux.org/index.php/Stress_Test
     
  11. Adrian Wong

    Adrian Wong Da Boss Staff Member

    Thanks! :thumb:
     
