[mythtv-users] Backend machine seems to crash every few days

Thu Sep 29 15:35:39 UTC 2016

On 09/29/16 07:00, Steve Goodey wrote:
> I don't know if this is a Myth backend issue, or something bigger, but
> my backend/server seems to crash/hang every few days.
>
> The computer seems to still be running (the light is on), but I can't
> access the machine via vnc or ssh or anything. I can't even ping it.
> It's a headless server, so I just have to hold the power button in and
> then start it up again.
>
> Can anyone help me to get to the bottom of this? I have no idea what I'm
> looking at if I view a log file.
>
> Any help would be much appreciated.
>
> Thanks,
> Damian
> _______________________________________________
> Damian,
>
> Unfortunately there's not a lot to go on with the info you've given.
>
> I'm no expert on this, but I think the first thing I'd do is see if you
> can get a monitor, keyboard and mouse hooked up so that you can see what
> state the machine is in when it goes wrong.
>
It's probably not a mythbackend issue, per se, since a non-privileged
application should not be able to crash a system.  It is possible that
Myth or another program is doing something that the kernel doesn't
handle properly, but that's still a kernel bug IMO.  A hardware issue or
kernel bug would be my leading candidates.

Steve's suggestion is a good one - put a monitor and keyboard on the
system if possible.  You might see a message displayed on the console. 
If your kernel was built with CONFIG_MAGIC_SYSRQ, you might be able to
use the magic SysRq keys to tell whether the kernel trapped a problem
but is still running (SysRq keys work) or there was a kernel panic
(SysRq keys don't work).  If the keys do work, you can also use them to
sync file systems, do a cleaner reboot, etc.
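
To find out ahead of time whether SysRq is enabled, you can read
/proc/sys/kernel/sysrq.  Here is a rough Python sketch that decodes the
value.  The bit meanings are the ones listed in the kernel's sysrq
documentation as far as I know them; the function name describe_sysrq
is just something I made up for illustration.

```python
from pathlib import Path

# Bit meanings as listed in the kernel's sysrq documentation.
SYSRQ_BITS = {
    2: "control of console logging level",
    4: "control of keyboard (SAK, unraw)",
    8: "debugging dumps of processes etc.",
    16: "sync command",
    32: "remount read-only",
    64: "signalling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all RT tasks",
}


def describe_sysrq(value):
    """Return human-readable descriptions for a kernel.sysrq setting."""
    if value == 0:
        return ["SysRq disabled"]
    if value == 1:
        return ["all SysRq functions enabled"]
    # Any other value is a bitmask of the allowed functions.
    return [name for bit, name in SYSRQ_BITS.items() if value & bit]


if __name__ == "__main__":
    path = Path("/proc/sys/kernel/sysrq")
    if path.exists():
        current = int(path.read_text().strip())
        print(current, describe_sysrq(current))
    else:
        print("No /proc/sys/kernel/sysrq; the kernel may lack CONFIG_MAGIC_SYSRQ")
```

If the output does not include the sync and reboot functions, the
corresponding key combinations will be ignored at the console.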

You should definitely look at your system log files, typically in
/var/log.  The specific files will vary by Linux distribution and your
settings for syslog (kern.log, syslog, and messages are common names). 
Look specifically for any messages that were logged just prior to the
hang time and that seem unusual compared to log entries during a time of
normal operation, e.g., kernel panic, out of memory killer messages,
device not responding, etc.
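
If the logs are too long to read by eye, a small script can pull out
the lines worth a closer look.  This is only a sketch: the pattern list
is my guess at common trouble signs, not an exhaustive set, and you
would pass it whichever log files your distribution actually uses.

```python
import re
import sys

# Guesses at common trouble signs; extend the list for your own system.
SUSPICIOUS = re.compile(
    r"kernel panic|out of memory|oom-killer|oops|not responding"
    r"|call trace|hung_task",
    re.IGNORECASE,
)


def suspicious_lines(lines):
    """Yield (line_number, text) for every line matching a trouble pattern."""
    for number, line in enumerate(lines, start=1):
        if SUSPICIOUS.search(line):
            yield number, line.rstrip("\n")


def scan_file(path):
    """Scan one log file, tolerating any undecodable bytes in it."""
    with open(path, errors="replace") as handle:
        return list(suspicious_lines(handle))


if __name__ == "__main__":
    # Example: python3 scan_logs.py /var/log/syslog /var/log/kern.log
    for log_path in sys.argv[1:]:
        for number, text in scan_file(log_path):
            print(f"{log_path}:{number}: {text}")
```

Pay most attention to matches with timestamps shortly before the time
you had to power-cycle the machine.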

Consider *any* changes you made to the system shortly before this began
happening.  Have you added any devices, updated the kernel, installed
new programs, etc.?

Do you have any system monitoring, particularly for temperature?  If
not, see if you have environmental monitoring configured and if the
lm-sensors command "sensors" works.  If it does, start looking at the
output periodically and see if the core temperatures climb.  You could
put the "sensors" command into a crontab entry and save the output every
5 minutes or so to create a crude, temporary monitor.  Inspect your CPU
fan and make sure it is running.  Make sure the fan and the CPU cooler
fins are not obstructed by dust, fur, etc.
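
For the crude temperature monitor, something along these lines should
do.  It is a sketch that assumes the "sensors" command is installed and
on the PATH; the log path and the function names are placeholders I
picked.  It forces each record to disk with fsync so the last reading
before a hard hang has a better chance of surviving.

```python
import os
import subprocess
from datetime import datetime

LOG_PATH = "/var/tmp/sensors.log"  # placeholder; use any writable path


def read_sensors():
    """Run the lm-sensors "sensors" command and return its text output."""
    result = subprocess.run(
        ["sensors"], capture_output=True, text=True, check=True
    )
    return result.stdout


def format_record(timestamp, output):
    """Prefix one block of sensor output with a timestamp header line."""
    stamp = timestamp.isoformat(timespec="seconds")
    return f"=== {stamp} ===\n{output.rstrip()}\n\n"


def append_record(path=LOG_PATH, reader=read_sensors, now=datetime.now):
    """Append one timestamped reading to the log and flush it to disk."""
    record = format_record(now(), reader())
    with open(path, "a") as handle:
        handle.write(record)
        handle.flush()
        os.fsync(handle.fileno())
    return record


if __name__ == "__main__":
    try:
        append_record()
    except (OSError, subprocess.CalledProcessError) as err:
        print(f"Could not record sensors output: {err}")
```

A crontab entry such as the following would run it every 5 minutes
(the script location is hypothetical):

    */5 * * * * /usr/bin/python3 /home/youruser/sensors_log.py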

Keith