[mythtv-users] mythbackend partial lockups making system all but unusable

Mon Jul 17 17:21:30 UTC 2006

Hi Folks,

I think I may have run my system into the dirt and in doing so uncovered 
a bit of a problem. The OSs of the 3 systems are fine and keep running 
with no issues, but after anywhere between 20 minutes and 6 hours the 
mythbackend on the master server and on one or more of the slaves will 
have locked up and become unresponsive to incoming requests.

If the SVN already has fixes for this, then I guess my next question is 
when they might get out into the wider world of atrpms et al.

The system is as follows:

Main server: no cards, RHEL4 i386 with atrpms repo
Sky server: 1 x PVR350, FC5 i386 with atrpms
Freeview server: 3 x Nova-T, FC5 i386 with atrpms

All machines are fully updated as of today.

This is a guess based on having pored over the issue for several weeks, 
but I think the scheduler and/or protocol driver threads freeze in the 
master, and that in turn locks up the slave servers. Restarting a slave 
on its own does not work, as the master is still locked up. Restarting 
the master on its own doesn't help either: the slaves seem to lock it up 
again when it tries to find out the status of their cards and that 
request never returns.

The thing that makes me think it is not all of the threads is that if 
the backend is recording, the recording carries on, even though the 
backend is unresponsive for all other functions. It is really galling to 
have to kill a backend that is recording in order to get the rest of the 
system working.

In order to unlock the system, the following seems to work best.

- Restart mythbackend on the master and tail -f the mythbackend
   log file. If the log line "Reschedule requested for id ??" is not
   followed within 1 to 5 seconds by the schedule result, restart
   it again and repeat.

- When the master is freerunning (after 2 to 5 restarts), navigate
   the frontend to the "play recordings" screen. This will bring
   out a bunch of master log error messages of the kind:

2006-07-17 16:44:40.958 MainServer::HandleQueryRecordings()
                         Couldn't find backend for:
                         Silent Witness : "Cargo. Part One"

- From this the locked up slave server can be identified. Go there and
   restart its mythbackend.

This works about two times out of three.

If it doesn't, the master backend will be locked up again, so you will 
need to start from the top.
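
The "watch the log and count seconds" step above could be automated. 
What follows is only a minimal sketch, not anything mythbackend itself 
provides. It assumes that every log entry starts with a timestamp in the 
same shape as the excerpt above, and the RESULT marker is a guess at how 
the schedule result line begins; both need checking against a real log. 
find_stalls and its parameters are names made up for this example.

```python
import re
from datetime import datetime, timedelta

# Assumed timestamp shape, taken from the log excerpt in this post.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ ")
REQUEST = "Reschedule requested for id"
RESULT = "Scheduled "  # ASSUMPTION: start of the schedule result text


def find_stalls(lines, limit_seconds=5, end_time=None):
    """Return the times of reschedule requests that got no result in time."""
    stalls = []
    pending = None  # time of the oldest unanswered reschedule request
    limit = timedelta(seconds=limit_seconds)
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # continuation lines carry no timestamp
        now = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if pending is not None and now - pending > limit:
            stalls.append(pending)  # the answer is overdue
            pending = None
        if REQUEST in line:
            if pending is None:
                pending = now
        elif RESULT in line:
            pending = None  # the scheduler answered
    # A request still open at the end only counts if the caller says
    # what time it is now.
    if pending is not None and end_time is not None:
        if end_time - pending > limit:
            stalls.append(pending)
    return stalls
```

Feeding it the lines of the master's log file and passing the current 
time as end_time would flag a request that is still unanswered, which is 
the case that calls for another restart.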

For a while I had the network and schedule log items switched on to see 
if anything obvious turned up. Sadly the copious additional output to 
the log files slowed something down and made the problem a whole lot 
worse. While they were running though, nothing really obvious jumped out.

Without spending several more days poring over the code, can anyone 
tell me whether there is a timeout in the client/server protocol 
driver, so that if the client or server fails to come back with a 
response, the end that sent the command can time out and re-establish 
the connection? At the moment a failure to reply seems to lock up the 
sending end.
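
To make the question concrete, here is the kind of behaviour being 
asked about, as a minimal sketch using plain Python sockets. It is not 
MythTV code and says nothing about how the real protocol driver works; 
request_with_timeout and its parameters are hypothetical names. 
Re-sending after a timeout is also only safe for requests that can be 
repeated harmlessly, such as status queries.

```python
import socket


def request_with_timeout(host, port, payload, timeout=5.0, retries=2):
    """Send payload on a fresh connection; return the reply or None."""
    for _ in range(retries + 1):
        try:
            # A new connection per attempt stands in for "re-establish".
            with socket.create_connection((host, port), timeout=timeout) as sock:
                sock.settimeout(timeout)  # bounds both the send and the wait
                sock.sendall(payload)
                return sock.recv(4096)  # first chunk of the reply only
        except OSError:
            # Covers timeouts, refused connections and resets; try again.
            continue
    return None  # the peer never answered: report it instead of hanging
```

The point of the design is that a silent peer costs the sender at most 
timeout * (retries + 1) seconds, after which the caller gets None and 
can carry on with its other work.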

Oh, and I do wish <CR> and <LF> characters could be ignored by the 
protocol somehow, as they make it very difficult to drive the protocol 
by hand using telnet. Telnet and many other low-level TCP tools insist 
on sending a CR at the end of the line; however, this is counted as the 
first character of the next command. The last character of that command 
then spills over into the one after it, and it all goes pear-shaped. 
It would be SO useful for debugging.
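
The spill-over described above is what you would expect from a 
length-prefixed protocol. The sketch below assumes, as far as I 
understand the Myth protocol and not from anything stated in this post, 
that each message is an 8-character, space-padded decimal length 
followed by exactly that many bytes of payload. frame and unframe are 
hypothetical helper names, and "HELLO" and "WORLD" are placeholder 
payloads, not real commands.

```python
def frame(command: str) -> bytes:
    """Prefix the payload with its length, left-justified in 8 characters."""
    payload = command.encode("utf-8")
    return f"{len(payload):<8d}".encode("ascii") + payload


def unframe(buffer: bytes):
    """Split one message off the front; return (message, rest).

    Returns (None, buffer) when a complete message has not arrived yet.
    """
    if len(buffer) < 8:
        return None, buffer
    length = int(buffer[:8].decode("ascii").strip())
    if len(buffer) < 8 + length:
        return None, buffer
    end = 8 + length
    return buffer[8:end].decode("utf-8"), buffer[end:]


# A stray CR between two messages shifts everything after it by one byte.
stream = frame("HELLO") + b"\r" + frame("WORLD")
first, rest = unframe(stream)    # "HELLO" parses cleanly
second, rest = unframe(rest)     # header now starts at the CR
print(repr(first), repr(second), repr(rest))
```

Under that assumption, a small script that builds messages with frame() 
and writes them straight to the socket would avoid the line-ending 
problem altogether, since nothing is appended after the payload.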

Cheers

Andy M

-- 
____________________________________________________________

          Andrew Meredith BEng CEng CITP MBCS MIEE
                 The Anvil Organisation Ltd
                          Director
   http://www.anvil.org/     --    sip:andrew at sip.anvil.org
   andrew at anvil.org       --------      +44 (0) 1249 460560
____________________________________________________________