[mythtv-commits] Ticket #10822: Non-responsive backend locks up all networked systems
MythTV
noreply at mythtv.org
Sat Jun 9 16:34:44 UTC 2012
#10822: Non-responsive backend locks up all networked systems
----------------------+--------------------------------------------
Reporter: ewilde@… | Type: Bug Report - Hang/Deadlock
Status: new | Priority: minor
Milestone: unknown | Component: MythTV - General
Version: 0.25 | Severity: medium
Keywords: | Ticket locked: 0
----------------------+--------------------------------------------
I have three systems running MythTV 0.25: one master backend (MB) and two
secondaries (S1 and S2).
For some reason (there is nothing in the log), the backend process on S1
decided to take a vacation (sort of). The frontend task could not do
anything on that system that required the services of the backend process
(e.g. play a recorded program). However, the backend process was able to
start and complete a recording at its appointed time.
Meanwhile, on the MB and S2 machines, the recorded program showed up in
the list and was briefly available for concurrent (while recording)
viewing until it abruptly stopped playing. At this point, the frontend
process on the MB system (or the S2 system, take your choice) appeared to
be hung. Several attempts to kill and restart it ended the same way: the
frontend appeared hung as soon as the program was viewed. Eventually,
patience prevailed and an attempt to view the program did not hang but
returned to the recorded programs list (after about 5 minutes).
Attempts to view any other programs on the MB or S2 systems met with
similar hangs, although the recording files were local to those machines.
This behavior continued until the recording ended on the S1 system, at
which point the MB and S2 started working normally (i.e. recordings could
be played).
However, any attempts to play the recording from either the MB or S2
system resulted in the "File doesn't exist" message. Incidentally, this
message is often a lie. The file is right where it should be, on the
system that is not responding to the helm. Changing the message to reflect
the true state of the problem would be swell. How about, "The file is
located on a server which isn't responding to the helm," instead of
misleading the user into thinking the file didn't get created? They might
be foolishly tempted to delete the missing recording from the database.
Anyway, other recordings could be played but not the one in question.
Meanwhile, the S1 system continued to be essentially broken. Several
restarts of the frontend process had no effect.
Not until the backend process on the S1 system was restarted did anything
start to work reasonably well. Incidentally, the new upstart method of
controlling the backend does not work: it gives a lame "stop/waiting"
message but the actual task soldiers on in its semi-hung state. A
"ps x -A" followed by a "kill -9" of the appropriate task
number did the job. Then, "start mythtv-backend" brought up a working
backend.
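For reference, the manual recovery just described can be sketched in a few lines. This is only an illustration of the kill-and-restart sequence, not MythTV code: a dummy "sleep" child stands in for the semi-hung mythbackend process, since in practice you would look up the real PID (e.g. with "pgrep -x mythbackend") and then run "start mythtv-backend" afterwards.

```python
import os
import signal
import subprocess

# Recovery sketch: force-kill a backend task that ignores a normal stop.
# A dummy "sleep" child stands in for the hung mythbackend here.
proc = subprocess.Popen(["sleep", "300"])   # stand-in for the hung backend

os.kill(proc.pid, signal.SIGKILL)           # the "kill -9" that actually worked
proc.wait()                                 # reap it so it doesn't linger as a zombie

print("terminated" if proc.poll() is not None else "still running")
# Finally, restart via upstart:  start mythtv-backend
```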
Note that the recording on the S1 system worked fine. Once the backend
process on that system was restarted, the recording file appeared on all
systems (MB, S1, S2) and could be played. The frontend process on S1 was
restarted and it worked too. All was right with the world.
So, it would appear that the portion of the backend process on S1 that
responds to file transfer requests from other processes went south.
Perhaps there was more to it than that, since the frontend process on S1
apparently could not talk to the backend process either. The recording
portion worked fine, however, since the scheduled recording started
(perhaps this was before the problem began) and stopped at the correct
time and the recording file itself is 100% fine.
This lack of response on one system (S1) had the effect of hanging all of
the other systems (MB, S2). In the backend log file on S1, there is
absolutely nothing to indicate any problem whatsoever. All I see is the
recording starting and stopping normally.
Meanwhile, in the backend log file on MB (or S2, take your choice), I see
a boatload of messages (and I do mean a boatload) like:
E FreeSpaceUpdater playbacksock.cpp:139 (SendReceiveStringList)
PlaybackSock::SendReceiveStringList(): Response too short
which go on for at least 30 minutes from around the time the event
appears to have happened.
I'll be happy to send you whatever logs you need.
Here are some general observations:
1) The network timeouts are far too long. Waiting for two minutes
and then retrying a couple of times (for a total of 5 or 6 minutes)
is way too long. If the answer on a locally-connected device does
not come back in 500ms, something is probably wrong. Ten to twenty
seconds is more than enough. Remember, the user is sitting there
waiting for something to happen. After two minutes, they are
probably thinking about buying a gun. From my years of experience
in the online systems business, the goal should be two second
response time (not always achievable but still a good goal).
2) Perhaps a separate task that just looks up the recorded programs
list and streams files would be good (e.g. I have a server that
serves videos for Mythvideo, using Samba. One can select a video
from the menu, hit enter, and it starts playing in 2 seconds. Once
it begins, nothing ever interrupts it). Asking the backend to do
everything may not be such a good idea.
3) Error recovery from network problems seems poor. Often,
when something is wrong (e.g. can't view a recording), the fix is
simply to restart the frontend task. This would imply that error
recovery could be had simply by closing the socket and reopening
it.
4) It would be great if recording tasks could be spawned separately
from the backend task. Then, one could restart the backend without
losing all of one's recordings. Believe it or not, this state
happens quite often (i.e. when one would like to recycle the
backend process but must wait for valuable recordings to end).
5) A kill everything and start it all back up fresh (except for
recordings) command that actually works would be swell. It would
be really swell if it could be bound to a key on the
keyboard/remote that wasn't routed through the hung frontend
process. It would be even sweller if pressing this key would
package up all of the log entries (in the vicinity of the
problem), plus any other pertinent information, and send it from a
background task (heavy emphasis on the word background) to MythTV
Command Central for debugging purposes. There are users who do not
even know how a computer works who, nonetheless, would then be
capable of restarting the broken system and simultaneously sending
in a bug report. The Evil Empire has, for example, had great
success with such a system towards improving the reliability of
their software.
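Observations 1 and 3 above amount to a simple client-side pattern: use a short connect timeout, and on any error close the socket and reconnect with a fresh one rather than waiting out multi-minute retries. A minimal sketch of that pattern follows; the host, timeout, and retry count are illustrative, and 6543 is assumed to be the backend's usual protocol port.

```python
import socket

def probe_backend(host, port, timeout=0.5, retries=3):
    """Return True if the backend accepts a connection within `timeout` seconds."""
    for attempt in range(retries):
        # Error recovery per observation 3: always start from a fresh socket
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)        # fail fast per observation 1
        try:
            sock.connect((host, port))
            return True                 # backend is answering
        except OSError:
            pass                        # refused or timed out; retry with a new socket
        finally:
            sock.close()
    return False

# e.g. probe_backend("192.168.1.10", 6543) gives up within ~1.5 seconds
# instead of leaving the user staring at a hung frontend for minutes.
```

With something like this, a frontend could mark a non-responding server's recordings as unavailable quickly instead of hanging every playback attempt.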
Fundamental problems that are probably network-related have been around
since MythTV 0.21. I realize that debugging problems that are spread
across multiple, networked devices is very difficult, if not impossible.
So, whatever assistance I can render in the form of capturing packets,
creating log files, running debugging code, etc., I'd be happy to help
out.
--
Ticket URL: <http://code.mythtv.org/trac/ticket/10822>
MythTV <http://code.mythtv.org/trac>
MythTV Media Center