[mythtv-users] OT: Random system freezes -- clueless

Sat Dec 19 02:23:01 UTC 2015

On 12/18/2015 01:56 PM, Mike Thomas wrote:
> On Thu, 17 Dec 2015 18:31:02 -0600
> Craig Huff <huffcslists at gmail.com> wrote:
>> My Myth system has a bad habit of freezing, except when I'm doing
>> something on it.  It will automatically wake up to record something
>> and shutdown once it is finished and everything works.  But
>> sometimes... The system just stops in its tracks, like it hit a halt
>> instruction.  There are no errors reported in any of the log files
>> in /var/log, /var/log/lightdm, /var/log/upstart, /var/log/mysql,
>> or /var/log/mythtv.  They just stop dead in their tracks and I have
>> to hit the reset button or cycle the power to get it going again.
>> Not even installing the watchdog package and configuring it rescues
>> the system.  I get lots of watchdog check messages in the logs right
>> up until the system freezes in its tracks at some random time hours
>> or days after the last incident.  I've run memtest for hours with no
>> problems reported -- like overnight plus half the next day.  I've run
>> disk diagnostics and swapped disks (within limits).  I've tried most
>> things I found on my copy of the Ultimate Boot CD.  All to say I've
>> found no smoking gun.  So, like I said, I'm clueless.  I've ordered a
>> different motherboard, but I'm only shooting in the dark with that.
>>
>> Anyone have something better than a wild guess about what's going on?
>
> Dear Craig,
>
> Hardware or software? That is the question.
>
> I doubt it is a hardware problem because these are actually very rare. I
> have a lot of machines and I flog them hard. Also many of them are very
> old indeed (10, 15 years even). Of course I have experienced hardware
> problems but they are usually quite easy to diagnose. Try this general
> check list:
>
> 1. Are your hard discs shagged? I have one go every year on average but
> then I have a lot of hard discs (and I flog them). Some drives last for
> 2 years, some for 5, some for 20 years and still going strong. It is
> usually obvious when your disc has gone; you will get IO errors all
> over the place.
>
> 2. Is the chassis clean? It is a good idea to clean out all the filth
> every year or so. You should remove the fans and clean them with a
> cloth. It's fiddly work but it will keep even cheapo fans going for 10
> years. You can blow out the crud from the fins on the heat sinks. Also
> make sure you open up the PSU and clean its fans. Make sure you have
> let the PSU discharge for several minutes so you don't get a shock.
>
> Unless you can see burn marks on a component in the PSU it is very
> unlikely to be at fault. I've had more PSUs than I care to remember but
> they are virtually indestructible. Furthermore modern mobos generate 1V
> from an unregulated 12V supply. There could be a problem with the power
> supply circuit on the mobo but unless you can see burn marks I doubt
> it would be the culprit. It wouldn't have left the factory.
>
> Similarly I don't believe that overheating on the motherboard is a
> problem. Most of the important chips have built-in temperature sensors
> and emergency shutdown circuits which are not under O/S control. An
> overheat problem (which I have experienced before now) will just cause
> an O/S shutdown with a console message. Mike Perkins did raise the
> possibility of the heat sink being improperly bonded to the CPU. This
> would result in a temperature rise and that would result in a thermal
> shutdown.
>
> 3. Whilst you have everything in pieces just check that there aren't any
> metal things sticking up underneath the motherboard. I had a
> motherboard which behaved oddly once. It turned out that one part on
> the reverse was touching an unused boss and shorting out. Surprisingly
> the fault was intermittent.
>
> 4. Is the battery on the motherboard worn out? I had a couple of
> motherboards produce intermittent network errors whilst Linux was
> running. It turned out the BIOS wasn't configuring the network cards
> properly because the BIOS settings were borked. I replaced the mobo
> battery and reset it to factory defaults.
>
> 5. If your battery is OK try resetting your BIOS to factory settings.
> This removes any overclocking "tuning" you did earlier. This is a good
> thing. Overclocking means running the chips faster than their
> manufacturers are prepared to guarantee they will work without fault. If
> you are still having faults by now they are probably self-inflicted. I
> had a lot of USB problems on a couple of boards until I realised their
> memory timing was bad. All the faults were intermittent. A BIOS reset
> solved it.
>
> 6. Turn off all the "green" facilities your motherboard and peripherals
> offer. "Green" hard disc drives spin down and will produce system
> stalls aplenty. They shag their motors quickly into the bargain.
> There's no such thing as a free lunch.
>
> 7. Unplug as many cards as you can and boot without them. Duff video
> drivers have produced hangs for me. Although not a hardware fault, it
> looks like a hardware fault because the problem goes away when the
> video driver can't load.
>
> To do this successfully I strongly recommend you connect a serial cable
> to COM1 and configure your OS to boot over a serial console with some
> element of debug output. You must boot with the following items in
> your command line "console=tty0 console=ttyS0,115200n8" (or whatever
> your baud rate is). You can use cu, minicom etc. to connect from another
> computer. Ensure you have a large scroll-back on your xterm so you can
> look back at the "crash" message.
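
A quick way to confirm that the serial console setting actually took effect is to look at the command line of the running kernel. This is only a sketch: the function name has_serial_console is made up for illustration, and it checks nothing beyond the presence of a console=ttyS entry.

```shell
# Return success if a kernel command line (passed as $1) names a serial console.
# has_serial_console is a hypothetical helper, not part of any package.
has_serial_console() {
    case " $1 " in
        *" console=ttyS"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the kernel that is running right now.
if has_serial_console "$(cat /proc/cmdline 2>/dev/null)"; then
    echo "serial console configured"
else
    echo "no serial console on the kernel command line"
fi
```

If the second message appears after a reboot, the boot loader configuration most likely did not pick up the change.
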
>
> Next I recommend that you install a kernel with CONFIG_MAGIC_SYSRQ
> set. If it all goes to shit you can use Alt+SysRq+S (or ~#s if you use
> cu) to sync everything (to lessen the effect of a disc crash) and
> Alt+SysRq+T or ~#t to dump the task states (to identify a hung task).
> Alt+SysRq+h or ~#h provides a help message. If you can't find a
> pre-compiled kernel compile it and install it yourself. You might learn
> something.
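
Before rebuilding anything it may be worth checking whether the running kernel already has SysRq available, since many distribution kernels do. The sketch below assumes the usual /proc/sys/kernel/sysrq interface (0 is off, 1 is everything, other values are a bitmask); sysrq_status is a name invented for this example.

```shell
# Describe a SysRq setting read from a file (normally /proc/sys/kernel/sysrq).
# sysrq_status is a hypothetical helper name.
sysrq_status() {
    file=${1:-/proc/sys/kernel/sysrq}
    if [ ! -r "$file" ]; then
        echo "unavailable"
        return 1
    fi
    value=$(cat "$file")
    case "$value" in
        0) echo "disabled" ;;
        1) echo "all functions enabled" ;;
        *) echo "partially enabled (bitmask $value)" ;;
    esac
}

echo "SysRq: $(sysrq_status)"
# To enable every function until the next reboot, as root:
#   echo 1 > /proc/sys/kernel/sysrq
```

If this reports "unavailable", the kernel was probably built without the option and a different kernel is needed.
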
>
> If you are running X then put this in your start-up script:
>
> setxkbmap -option terminate:ctrl_alt_bksp
>
> This will allow you to give the X server a three-fingered salute when
> things go wrong.
>
> Finally, the best way to check if your RAM is any good is to forget
> using memtest; it doesn't produce realistic signal timings. Many memory
> problems seem to be related to inter-symbol interference (from what I
> can tell), so you need to flog your RAM and busses with realistic
> accesses. What you want to do is run a really big compile job, and do
> this in a loop for a whole day. The compile job must be much bigger
> than the available RAM.
>
> One of the best and most accessible programmes to try is GCC. GCC is
> big if you enable all the languages and do a 3-stage bootstrap and it
> builds well in parallel. The bootstrap build compares the binaries
> produced in stage 2 with those in stage 3 which can add to your
> confidence. Make sure you have lots of swap available and run make
> with a large parallelism. You want it to drag your system to
> its knees to exercise all your RAM and signal timings.
>
> Try something like:
>
> # extract the tar file, then build in a separate directory
> mkdir obj ; cd obj
> ...srcdir.../configure --enable-languages=all --enable-threads
> make -j 8    # 8 or more
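
To get the "loop for a whole day" part, the build can be wrapped in something that repeats it and stops at the first failure. This is a rough sketch, not a tested GCC recipe: run_until_failure is a made-up name, and the make invocation in the comment is an assumption about how you would restart the build.

```shell
# Run a command repeatedly; stop at the first failure or after $1 runs.
# run_until_failure is a hypothetical helper name.
run_until_failure() {
    max=$1
    shift
    i=1
    while [ "$i" -le "$max" ]; do
        echo "run $i: $*"
        if ! "$@"; then
            echo "run $i FAILED"
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $max runs passed"
}

# Example (assumed, run from the obj directory after configure):
#   run_until_failure 10 sh -c 'make clean >/dev/null && make -j 8'
```

A failure that shows up on run 4 but not on runs 1 to 3 points at flaky hardware rather than at the source tree.
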
>
> If this goes swimmingly then it is unlikely to be hardware. I suggest
> you re-install your O/S.
>
> Hope this helps,
>
> Mike

Craig, you do not give your version of myth, but in the past I have had 
exactly the same thing. After having done the first half of Mike's list, 
I strongly suspect a race condition, as nothing but the big red switch 
would recover. An active ssh session into the box would become locked up 
too. Recent master branch code has not exhibited this problem (last 4 
months or so).

The only thing which Mike missed is checking for a full partition 
*including /tmp*. Having this happen on a separate /var partition just 
slows the box waaayyy down: your box on Prozac!  But you can even ssh in 
and just kill some log files to recover. Same thing with a chock full 
/video partition.

Having this happen on '/' (because /var got hit by a runaway) just kills 
the box. (Ok *maybe* it will respond sometime, but my patience never 
lasted that long. If it feels like a big-red-switch hang, it IS a 
big-red-switch hang!)
An overfull /tmp will thrash your swap, but the box should still run 
reasonably. It is /var/log which is your problem spot. (And why /, 
/home, and /var are separate partitions on my boxen.)
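
For what it is worth, something along these lines can be run from cron to warn before a partition fills. It is a sketch only: full_filesystems is a name I made up, the 90% threshold is arbitrary, and it assumes mount points without spaces.

```shell
# Print "mountpoint usage" for every filesystem at or above a threshold.
# Reads `df -P` style output on stdin; full_filesystems is a hypothetical name.
full_filesystems() {
    threshold=${1:-90}
    awk -v t="$threshold" 'NR > 1 {
        use = $5
        sub(/%/, "", use)          # strip the percent sign from the Capacity column
        if (use + 0 >= t) print $6, $5
    }'
}

# Check the partitions discussed above; no output means nothing is over 90%.
df -P / /var /tmp 2>/dev/null | full_filesystems 90
```

If /var shows up, "du -s /var/log/* | sort -n" should point at the runaway log.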

Geoff
