[mythtv-commits] Ticket #10822: Non-responsive backend locks up all networked systems
MythTV
noreply at mythtv.org
Sat Jun 9 16:34:44 UTC 2012
#10822: Non-responsive backend locks up all networked systems
----------------------+--------------------------------------------
Reporter: ewilde@… | Type: Bug Report - Hang/Deadlock
Status: new | Priority: minor
Milestone: unknown | Component: MythTV - General
Version: 0.25 | Severity: medium
Keywords: | Ticket locked: 0
----------------------+--------------------------------------------
I have three systems running MythTV 0.25: one master backend (MB) and two
secondaries (S1 and S2).
For some reason (there is nothing in the log), the backend process on S1
decided to take a vacation (sort of). The frontend task could not do
anything on that system that required the services of the backend process
(e.g. play a recorded program). However, the backend process was able to
start and complete a recording at its appointed time.
Meanwhile, on the MB and S2 machines, the recorded program showed up in
the list and was briefly available for concurrent (while recording)
viewing until it abruptly stopped playing. At this point, the frontend
process on the MB system (or the S2 system, take your choice) appeared to
be hung. Several attempts to kill and restart it ended the same way: the
frontend appeared hung as soon as the program was viewed. Eventually,
patience prevailed and an attempt to view the program did not hang but
returned to the recorded programs list (after about 5 minutes).
Attempts to view any other programs on the MB or S2 systems met with
similar hangs, although the recording files were local to those machines.
This behavior continued until the recording ended on the S1 system, at
which point the MB and S2 started working normally (i.e. recordings could
be played).
However, any attempts to play the recording from either the MB or S2
system resulted in the "File doesn't exist" message. Incidentally, this
message is often a lie. The file is right where it should be, on the
system that is not responding to the helm. Changing the message to reflect
the true state of the problem would be swell. How about, "The file is
located on a server which isn't responding to the helm," instead of
misleading the user into thinking the file didn't get created? They might
be foolishly tempted to delete the missing recording from the database.
Anyway, other recordings could be played but not the one in question.
Meanwhile, the S1 system continued to be essentially broken. Several
restarts of the frontend process had no effect.
Not until the backend process on the S1 system was restarted did anything
start to work reasonably well. Incidentally, the new upstart method of
controlling the backend does not work: it gives a lame "stop/waiting"
message but the actual task soldiers on in its semi-hung state. A
"ps x -A" followed by a "kill -9" of the appropriate task
number did the job. Then, "start mythtv-backend" brought up a working
backend.
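For reference, the manual recovery just described can be sketched in a few lines. This is only an illustration of the kill-and-restart sequence, not MythTV code: a dummy "sleep" child stands in for the semi-hung mythbackend process, since in practice you would look up the real PID (e.g. with "pgrep -x mythbackend") and then run "start mythtv-backend" afterwards.

```python
import os
import signal
import subprocess

# Recovery sketch: force-kill a backend task that ignores a normal stop.
# A dummy "sleep" child stands in for the hung mythbackend here.
proc = subprocess.Popen(["sleep", "300"])   # stand-in for the hung backend

os.kill(proc.pid, signal.SIGKILL)           # the "kill -9" that actually worked
proc.wait()                                 # reap it so it doesn't linger as a zombie

print("terminated" if proc.poll() is not None else "still running")
# Finally, restart via upstart:  start mythtv-backend
```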
Note that the recording on the S1 system worked fine. Once the backend
process on that system was restarted, the recording file appeared on all
systems (MB, S1, S2) and could be played. The frontend process on S1 was
restarted and it worked too. All was right with the world.
So, it would appear that the portion of the backend process on S1 that
responds to file transfer requests from other processes went south.
Perhaps there was more to it than that, since the frontend process on S1
apparently could not talk to the backend process either. The recording
portion worked fine, however, since the scheduled recording started
(perhaps this was before the problem began) and stopped at the correct
time and the recording file itself is 100% fine.
This lack of response on one system (S1) had the effect of hanging all of
the other systems (MB, S2). In the backend log file on S1, there is
absolutely nothing to indicate any problem whatsoever. All I see is the
recording starting and stopping normally.
Meanwhile, in the backend log file on MB (or S2, take your choice), I see
a boatload of messages (and I do mean a boatload) like:
E FreeSpaceUpdater playbacksock.cpp:139 (SendReceiveStringList)
PlaybackSock::SendReceiveStringList(): Response too short
which go on for at least 30 minutes from around the time the event
appears to have happened.
I'll be happy to send you whatever logs you need.
Here are some general observations:
1) The network timeouts are far too long. Waiting for two minutes
and then retrying a couple of times (for a total of 5 or 6 minutes)
is way too long. If the answer on a locally-connected device does
not come back in 500ms, something is probably wrong. Ten to twenty
seconds is more than enough. Remember, the user is sitting there
waiting for something to happen. After two minutes, they are
probably thinking about buying a gun. From my years of experience
in the online systems business, the goal should be two second
response time (not always achievable but still a good goal).
2) Perhaps a separate task that just looks up the recorded programs
list and streams files would be good (e.g. I have a server that
serves videos for Mythvideo, using Samba. One can select a video
from the menu, hit enter, and it starts playing in 2 seconds. Once
it begins, nothing ever interrupts it). Asking the backend to do
everything may not be such a good idea.
3) Error recovery from network problems seems poor. Often,
when something is wrong (e.g. can't view a recording), the fix is
simply to restart the frontend task. This would imply that error
recovery could be had simply by closing the socket and reopening
it.
4) It would be great if recording tasks could be spawned separately
from the backend task. Then, one could restart the backend without
losing all of one's recordings. Believe it or not, this state
happens quite often (i.e. when one would like to recycle the
backend process but must wait for valuable recordings to end).
5) A kill everything and start it all back up fresh (except for
recordings) command that actually works would be swell. It would
be really swell if it could be bound to a key on the
keyboard/remote that wasn't routed through the hung frontend
process. It would be even sweller if pressing this key would
package up all of the log entries (in the vicinity of the
problem), plus any other pertinent information, and send it from a
background task (heavy emphasis on the word background) to MythTV
Command Central for debugging purposes. There are users who do not
even know how a computer works who, nonetheless, would then be
capable of restarting the broken system and simultaneously sending
in a bug report. The Evil Empire has, for example, had great
success with such a system towards improving the reliability of
their software.
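Observations 1 and 3 above amount to a simple client-side pattern: use a short connect timeout, and on any error close the socket and reconnect with a fresh one rather than waiting out multi-minute retries. A minimal sketch of that pattern follows; the host, timeout, and retry count are illustrative, and 6543 is assumed to be the backend's usual protocol port.

```python
import socket

def probe_backend(host, port, timeout=0.5, retries=3):
    """Return True if the backend accepts a connection within `timeout` seconds."""
    for attempt in range(retries):
        # Error recovery per observation 3: always start from a fresh socket
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)        # fail fast per observation 1
        try:
            sock.connect((host, port))
            return True                 # backend is answering
        except OSError:
            pass                        # refused or timed out; retry with a new socket
        finally:
            sock.close()
    return False

# e.g. probe_backend("192.168.1.10", 6543) gives up within ~1.5 seconds
# instead of leaving the user staring at a hung frontend for minutes.
```

With something like this, a frontend could mark a non-responding server's recordings as unavailable quickly instead of hanging every playback attempt.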
Fundamental problems that are probably network-related have been around
since MythTV 0.21. I realize that debugging problems that are spread
across multiple, networked devices is very difficult, if not impossible.
So, whatever assistance I can render in the form of capturing packets,
creating log files, running debugging code, etc., I'd be happy to help
out.
--
Ticket URL: <http://code.mythtv.org/trac/ticket/10822>
MythTV <http://code.mythtv.org/trac>
MythTV Media Center