[mythtv-users] Random system freezes -- clueless

Sun Dec 20 21:04:56 UTC 2015

On Sun, 20 Dec 2015 13:22:47 -0600
Craig Huff <huffcslists at gmail.com> wrote:
> Okay, I have a clue. Now what to do with it?
> 
> I found "general protection fault" and "Oops" messages while trolling
> through the /var/log/kern.log* and /var/log/syslog* files.  Here are
> the ten I found in the kern.log* files over the last ~30 days and the
> following RIP: lines that indicate where the problems occurred:
> 
> Nov 25 21:40:32 penguin kernel: [101907.758881] general protection
> fault: 0000 [#1] SMP
> Nov 25 21:40:32 penguin kernel: [101907.770634] RIP:
> 0010:[<ffffffff81087592>]  [<ffffffff81087592>] find_pid_ns+0x92/0xc0
> 
> Dec  5 18:34:41 penguin kernel: [ 1140.280943] general protection
> fault: 0000 [#1] SMP
> Dec  5 18:34:41 penguin kernel: [ 1140.296076] RIP:
> 0010:[<ffffffff81087592>]  [<ffffffff81087592>] find_pid_ns+0x92/0xc0
> 
> Dec  5 18:34:45 penguin kernel: [ 1144.400283] general protection
> fault: 0000 [#2] SMP
> Dec  5 18:34:45 penguin kernel: [ 1144.417924] RIP:
> 0010:[<ffffffff81087592>]  [<ffffffff81087592>] find_pid_ns+0x92/0xc0
> 
> Dec 11 01:01:01 penguin kernel: [ 7524.783834] general protection
> fault: 0000 [#1] SMP
> Dec 11 01:01:01 penguin kernel: [ 7524.784155] RIP:
> 0010:[<ffffffff81162037>]  [<ffffffff81162037>]
> page_evictable+0x17/0x40
> 
> Dec 11 06:14:33 penguin kernel: [ 9565.507236] general protection
> fault: 0000 [#1] SMP
> Dec 11 06:14:33 penguin kernel: [ 9565.507890] RIP:
> 0010:[<ffffffff81162037>]  [<ffffffff81162037>]
> page_evictable+0x17/0x40
> 
> Dec 11 19:43:01 penguin kernel: [11447.492095] general protection
> fault: 0000 [#1] SMP
> Dec 11 19:43:01 penguin kernel: [11447.492586] RIP:
> 0010:[<ffffffff81162037>]  [<ffffffff81162037>]
> page_evictable+0x17/0x40
> 
> Dec 12 10:28:17 penguin kernel: [   53.135696] Oops: 0000 [#1] SMP
> Dec 12 10:28:17 penguin kernel: [   53.136108] RIP:
> 0010:[<ffffffff811866c4>]  [<ffffffff811866c4>]
> anon_vma_fork+0xa4/0x140
> 
> Dec 20 05:05:32 penguin kernel: [  592.633393] general protection
> fault: 0000 [#1] SMP
> Dec 20 05:05:32 penguin kernel: [  592.682342] RIP:
> 0010:[<ffffffff810875f2>]  [<ffffffff810875f2>] find_pid_ns+0x92/0xc0
> 
> Dec 20 11:22:22 penguin kernel: [  212.770621] Oops: 0002 [#1] SMP
> Dec 20 11:22:22 penguin kernel: [  212.794726] RIP:
> 0010:[<ffffffff81186310>]  [<ffffffff81186310>]
> unlink_anon_vmas+0xf0/0x200
> 
> Dec 20 12:06:59 penguin kernel: [ 2890.593387] Oops: 0002 [#2] SMP
> Dec 20 12:06:59 penguin kernel: [ 2890.612896] RIP:
> 0010:[<ffffffff81186310>]  [<ffffffff81186310>]
> unlink_anon_vmas+0xf0/0x200
> 
> 
> I have a sinking feeling that my CPU is bad, but how do I test it to
> prove that is the case?  I've seen mentions of running programs (xxx
> Sieve xxx?) but I don't know what they were, where to get them
> (apt-get xxx?), or if there's parameter settings to address for best
> testing.
> 
> OTOH, it might be the RAM, but I'd have to take Myth off-line for
> more than 12 hours (max time I've tried so far to run memtest86+)
> which didn't turn up any errors.
> 
> Any suggestions?  Need a bigger piece of the log for one/several/all
> of these?

Dear Craig,

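The excerpts you show indicate that memory contains something
unexpected (it is corrupt). The faults land in unrelated core code:
PID lookup (find_pid_ns), page reclaim (page_evictable) and anonymous
memory mapping (anon_vma_fork, unlink_anon_vmas). It is unlikely to be
a kernel bug because these code paths have been in heavy use for many
years. I don't think it is your CPU; the errors would likely be much
more extensive.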
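
To see at a glance how the faults are spread over kernel functions, a
small shell sketch like the following could tally the "RIP:" lines. It
assumes each report sits on one unwrapped line in the log file (as in
kern.log itself, not as wrapped in this mail), and the function name
tally_rip_symbols is purely illustrative.

```shell
# Count how often each kernel symbol appears on "RIP: " lines.
# Usage: tally_rip_symbols LOGFILE [LOGFILE...]
tally_rip_symbols() {
    grep -h 'RIP: ' "$@" |
        # keep only the symbol name that precedes "+0xOFFSET/0xSIZE"
        sed -E 's/.*\] ([A-Za-z0-9_.]+)\+0x[0-9a-f]+\/0x[0-9a-f]+.*/\1/' |
        sort | uniq -c | sort -rn
}
```

For example: tally_rip_symbols /var/log/kern.log /var/log/kern.log.1
(rotated logs that have been gzipped would need unpacking first).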

If it turns out to be a memory problem, that would explain why the
symptoms you experience seem to change. You may find that running
different daemons (and mythtv especially) changes the symptoms of
the problem. This is why I recommend testing memory by
parallel-compiling a massive software package: the more memory churns
over, the sooner a duff bit will hit something important.

I advise you to run the mcelog daemon. Even if you are not using
error-correcting memory you may still get a log message if a memory
bank is on the blink.
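
As a rough sketch of how to look for such messages: the kernel
normally announces machine checks with lines containing "Machine check
events logged" or "[Hardware Error]". The helper name
count_mce_reports and the log paths in the usage comment are
assumptions and will vary by distribution.

```shell
# Count lines in the given log files that look like machine-check reports.
# Prints a single number; missing or unreadable files are skipped silently.
count_mce_reports() {
    cat -- "$@" 2>/dev/null |
        grep -c -E 'Machine check events logged|\[Hardware Error\]' || true
}

# Hypothetical usage (paths are assumptions, adjust for your system):
#   count_mce_reports /var/log/kern.log /var/log/syslog /var/log/mcelog
```

A count of zero does not prove the memory is good; it only means the
hardware did not report anything itself.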

Before going too far it is important to establish a proper test case.
If it stinks like memory then you should do a memory test now, before
opening the case up. What you need is a repeatable and automatic test
that demonstrates that a problem definitely exists.

GCC comes from <http://gcc.gnu.org/>. You will need to install the
development packages for GCC's dependencies. There are quite a few, I
am afraid. I made a mistake when I gave the commands to build GCC. This
is what you want:

extract tar file
mkdir obj ; cd obj
...srcdir.../configure --enable-languages=all --enable-threads
make -j 8_or_more all-gcc

You don't have to use GCC; any big software package would do.
OpenOffice, Qt, you name it. GCC has the advantage that it bootstraps
itself: the build runs the compiler it has just built, so a single-bit
error anywhere in the compiled code won't just lurk there until you
manually exercise every part of it. This should cause the build to fall
over in some (usually inexplicable) manner if it encounters corruption
anywhere. This is what you are looking for.
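
To make the test repeatable and automatic, the build can be wrapped in
a loop that stops at the first failing pass and keeps a log of each
one. This is only a sketch: stress_loop, build_once and GCC_SRC are
names made up for illustration, and the build command in the usage
comment simply mirrors the commands above.

```shell
# Run a command repeatedly; stop and report at the first failing pass.
# Usage: stress_loop PASSES LOGDIR COMMAND [ARGS...]
stress_loop() {
    passes=$1
    logdir=$2
    shift 2
    mkdir -p "$logdir" || return 2
    i=1
    while [ "$i" -le "$passes" ]; do
        # each pass gets its own log so a failure can be inspected later
        if ! "$@" >"$logdir/pass-$i.log" 2>&1; then
            echo "pass $i FAILED, see $logdir/pass-$i.log"
            return 1
        fi
        echo "pass $i ok"
        i=$((i + 1))
    done
    return 0
}

# Hypothetical usage, with GCC_SRC pointing at the extracted source tree:
#   build_once() {
#       rm -rf obj && mkdir obj &&
#       (cd obj && "$GCC_SRC/configure" --enable-languages=all \
#            --enable-threads && make -j 8 all-gcc)
#   }
#   stress_loop 5 ./stress-logs build_once
```

A pass that fails on healthy hardware for an ordinary reason (a missing
dependency, say) will fail the same way every time; failures that move
around from pass to pass are the ones that point at the hardware.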

Assuming this yields a result, it is time to run down my checklist.
I might add that whilst you have the chassis open you should consider
re-seating the memory. Before you do that (and before you open the case
to do anything) ensure that you have clean (but not dry) hands, are
wearing a wrist-strap and have an anti-static mat upon which to place
the motherboard and other computer parts. The last thing you want to do
is damage the hardware further by giving it a zap. Your CPU and RAM run
on little more than 1 volt. Bone-dry hands can support a charge of
several hundred volts without you noticing. An anti-static bench mat
from somewhere like Maplin (<http://www.maplin.co.uk>) costs a lot less
than a new CPU.

That may just do the trick, so re-run the memory check like so:

rm -rf obj
mkdir obj ; cd obj
and so on...

If it works then you should have reasonable confidence in the hardware.
If not, it is probably a dodgy memory stick. Remove all but one memory
stick and re-run the above test. Rinse and repeat until you find the
culprit. Before you remove any sticks, please read your motherboard's
manual to see which socket a single memory stick should go in. I advise
you to write a number on each memory stick and keep a paper record of
your testing.

If you follow these suggestions I am confident that you will get to the
nub of the problem fairly quickly. I do this sort of thing (with
something much, much, much bigger than GCC) whenever I suspect
hardware problems on my computers and it has always proved reliable.

Mike.