[mythtv-users] OT: Random system freezes -- clueless

Fri Dec 18 18:56:24 UTC 2015

On Thu, 17 Dec 2015 18:31:02 -0600
Craig Huff <huffcslists at gmail.com> wrote:
> My Myth system has a bad habit of freezing, except when I'm doing
> something on it.  It will automatically wake up to record something
> and shutdown once it is finished and everything works.  But
> sometimes... The system just stops in its tracks, like it hit a halt
> instruction.  There are no errors reported in any of the log files
> in /var/log, /var/log/lightdm, /var/log/upstart, /var/log/mysql,
> or /var/log/mythtv.  They just stop dead in their tracks and I have
> to hit the reset button or cycle the power to get it going again.
> Not even installing the watchdog package and configuring it rescues
> the system.  I get lots of watchdog check messages in the logs right
> up until the system freezes in its tracks at some random time hours
> or days after the last incident.  I've run memtest for hours with no
> problems reported -- like overnight plus half the next day.  I've run
> disk diagnostics and swapped disks (within limits).  I've tried most
> things I found on my copy of the Ultimate Boot CD.  All to say I've
> found no smoking gun.  So, like I said, I'm clueless.  I've ordered a
> different motherboard, but I'm only shooting in the dark with that.
> 
> Anyone have something better than a wild guess about what's going on?

Dear Craig,

Hardware or software? That is the question.

I doubt it is a hardware problem because these are actually very rare. I
have a lot of machines and I flog them hard. Also many of them are very
old indeed (10, 15 years even). Of course I have experienced hardware
problems but they are usually quite easy to diagnose. Try this general
checklist:

1. Are your hard discs shagged? I have one go every year on average but
then I have a lot of hard discs (and I flog them). Some drives last for
2 years, some for 5, some for 20 years and still going strong. It
is usually obvious when your disc has gone; you will get I/O errors all
over the place.

2. Is the chassis clean? It is a good idea to clean out all the filth
every year or so. You should remove the fans and clean them with a
cloth. It's fiddly work but it will keep even cheapo fans going for 10
years. You can blow out the crud from the fins on the heat sinks. Also
make sure you open up the PSU and clean its fans. Unplug it first and
let it discharge for several minutes so you don't get a shock.

Unless you can see burn marks on a component in the PSU it is very
unlikely to be at fault. I've had more PSUs than I care to remember but
they are virtually indestructible. Furthermore modern mobos generate the
CPU's 1V or so on board from the 12V rail, so that rail does not have to
be tightly regulated. There could be a problem with the power supply
circuit on the mobo but unless you can see burn marks I doubt it would
be the culprit. A faulty one wouldn't have left the factory.

Similarly I don't believe that overheating on the motherboard is a
problem. Most of the important chips have built-in temperature sensors
and emergency shutdown circuits which are not under O/S control. An
overheat problem (which I have experienced before now) will just cause
an O/S shutdown with a console message. Mike Perkins did raise the
possibility of the heat sink being improperly bonded to the CPU. This
would result in a temperature rise and that would result in a thermal
shutdown, not the silent freeze you describe.

3. Whilst you have everything in pieces just check that there aren't any
metal things sticking up underneath the motherboard. I had a
motherboard which behaved oddly once. It turned out that one part on
the reverse was touching an unused boss and shorting out. Surprisingly
the fault was intermittent.

4. Is the battery on the motherboard worn out? I had a couple of
motherboards produce intermittent network errors whilst Linux was
running. It turned out the BIOS wasn't configuring the network cards
properly because the BIOS settings were borked. I replaced the mobo
battery and reset it to factory defaults.

5. If your battery is OK try resetting your BIOS to factory settings.
This removes any overclocking "tuning" you did earlier. This is a good
thing. Overclocking means running the chips faster than their
manufacturers are prepared to guarantee they will work without fault. If
you are still having faults by now they are probably self-inflicted. I
had a lot of USB problems on a couple of boards until I realised their
memory timing was bad. All the faults were intermittent. A BIOS reset
solved it.

6. Turn off all the "green" facilities your motherboard and peripherals
offer. "Green" hard disc drives spin down and will produce system
stalls aplenty. They shag their motors quickly into the bargain.
There's no such thing as a free lunch.

7. Unplug as many cards as you can and boot without them. Duff video
drivers have produced hangs for me. Although not a hardware fault, it
looks like a hardware fault because the problem goes away when the
video driver can't load.

To do this successfully I strongly recommend you connect a serial cable
to COM1 and configure your OS to boot over a serial console with some
element of debug output. You must boot with the following items in
your kernel command line "console=tty0 console=ttyS0,115200n8" (or whatever
your baud rate is). You can use cu, minicom etc. to connect from another
computer. Ensure you have a large scroll-back on your xterm so you can
look back at the "crash" message.
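
As a rough sketch of the serial console setup above, assuming your
system boots with GRUB 2 and reads /etc/default/grub (an assumption on
my part; other boot loaders are configured differently), the settings
might look like this. The 115200 rate and unit 0 (COM1) are simply the
values from the command line above.

```shell
# /etc/default/grub (fragment; assumes GRUB 2)
# Kernel messages go to both the screen and the first serial port.
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8"
# Optional: show the GRUB menu on the serial line as well.
GRUB_TERMINAL="console serial"
GRUB_SERIAL_COMMAND="serial --unit=0 --speed=115200 --word=8 --parity=no --stop=1"
```

After editing, regenerate the boot configuration (update-grub on
Debian-style systems) and reboot. From the other computer something
like "cu -l /dev/ttyS0 -s 115200" should connect; the device name
depends on which port or USB adapter you use at that end.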

Next I recommend that you install a kernel with CONFIG_MAGIC_SYSRQ
set. If it all goes to shit you can use Alt+SysRq+S (or ~#s if you use
cu) to sync everything (to lessen the effect of a disc crash) and
Alt+SysRq+T or ~#t to dump the task states (to identify a hung task).
Alt+SysRq+H or ~#h provides a help message. If you can't find a
pre-compiled kernel with it, compile one and install it yourself. You
might learn something.
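
Before rebuilding anything it is worth checking whether the kernel you
are running already has SysRq. A minimal sketch, assuming the usual
/proc/sys/kernel/sysrq interface; sysrq_status is just a helper name I
made up for this:

```shell
#!/bin/sh
# Report whether the running kernel exposes SysRq and how it is masked.
# The optional argument overrides the file to read (handy for testing).
sysrq_status() {
    f=${1:-/proc/sys/kernel/sysrq}
    if [ ! -r "$f" ]; then
        echo "unavailable"      # file missing: SysRq is probably not built in
        return 1
    fi
    case "$(cat "$f")" in
        0) echo "disabled" ;;
        1) echo "all" ;;
        *) echo "partial" ;;    # bitmask: only some functions are allowed
    esac
}

sysrq_status || true

# To allow every SysRq function from the keyboard until the next reboot
# (as root):
#   echo 1 > /proc/sys/kernel/sysrq
# To fire one without a keyboard (as root), e.g. a task dump:
#   echo t > /proc/sysrq-trigger
```

If it prints "partial", the mask may be blocking the task dump from the
keyboard even though sync works; writing 1 to the file lifts that.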

If you are running X then put this in your start-up script:

setxkbmap -option terminate:ctrl_alt_bksp

This will allow you to give the X server a three-fingered salute
(Ctrl+Alt+Backspace) when things go wrong.

Finally, to check if your RAM is any good, forget memtest and run a real
workload; memtest doesn't produce realistic signal timings. Many memory
problems seem to be related to inter-symbol interference (from what I
can tell), so you need to flog your RAM and busses with realistic
accesses. What you want to do is run a really big compile job, and do
this in a loop for a whole day. The compile job must be much bigger
than the available RAM.

One of the best and most accessible programmes to try is GCC. GCC is
big if you enable all the languages and do a 3-stage bootstrap and it
builds well in parallel. The bootstrap build compares the binaries
produced in stage 2 with those in stage 3 which can add to your
confidence. Make sure you have lots of swap available and run make
with a large parallelism. You want it to drag your system to
its knees to exercise all your RAM and signal timings.

Try something like:

# extract the GCC source tar file, then:
mkdir obj ; cd obj
...srcdir.../configure --enable-languages=all --enable-threads
make -j 8    # or more
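
To keep that going in a loop for a whole day, as suggested above,
something like the following might do. It is only a sketch: soak and
one_build are names I made up, and SRC, OBJ and JOBS are hypothetical
settings you would point at your own unpacked source tree and scratch
directory.

```shell
#!/bin/sh
# Hypothetical locations; SRC must be an absolute path to the unpacked source.
SRC=${SRC:-$HOME/gcc-src}
OBJ=${OBJ:-$HOME/gcc-obj}
JOBS=${JOBS:-8}

# Repeat a command until HOURS have passed or it fails.
# Prints the number of clean passes on success.
soak() {
    hours=$1; shift
    end=$(( $(date +%s) + hours * 3600 ))
    pass=0
    while [ "$(date +%s)" -lt "$end" ]; do
        "$@" || { echo "FAILED after $pass clean passes" >&2; return 1; }
        pass=$((pass + 1))
    done
    echo "$pass"
}

# One full build from scratch, run in a subshell so the cd does not leak.
one_build() (
    rm -rf "$OBJ" && mkdir -p "$OBJ" && cd "$OBJ" &&
    "$SRC/configure" --enable-languages=all --enable-threads &&
    make -j "$JOBS"
)

# Uncomment to flog the machine for 24 hours:
# soak 24 one_build
```

A failure that turns up in a different place on each run points at the
hardware; one that stops at the same spot every time is more likely a
missing build prerequisite.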

If this goes swimmingly then it is unlikely to be hardware. I suggest
you re-install your O/S.

Hope this helps,

Mike.