[mythtv-commits] Ticket #1882: Fix broken signal handling in housekeeper.c

MythTV mythtv at cvs.mythtv.org
Sat May 27 22:25:58 UTC 2006


#1882: Fix broken signal handling in housekeeper.c
-----------------------------------+----------------------------------------
 Reporter:  mrsam at courier-mta.com  |       Owner:  ijr 
     Type:  patch                  |      Status:  new 
 Priority:  minor                  |   Milestone:      
Component:  mythtv                 |     Version:  0.19
 Severity:  medium                 |  
-----------------------------------+----------------------------------------
 A series of events led me to scrutinize the code in housekeeper.c that
 runs mythfilldatabase, and the mutual interaction between this code and
 the myth_system() code in libmyth/util.cpp.

 The series of events in question, was a fairly reproducible instance of my
 job queue getting wedged.  After about a day of running myth 0.19's
 mythbackend, some mythcommflag or transcode job would inevitably stuck in
 the "Running" stage.  This happened every time.  A closer examination
 showed that the job apparently completed succesfully, but jobqueue never
 got updated, so that this job got wedged in the "Running" state, and any
 new jobs just kept piling on in the job queue, in the "Queued" state.

 An even closer examination revealed that the thread responsible for
 invoking myth_system("/usr/bin/mythcommflag [args]"), etcetera, was in no-
 man's land. Attaching a debugger to that thread revealed the following
 interesting call trace.  Now, I did not, unfortunately, save the actual
 gdb's backtrace, but the tail end of the stack frame was, in essence:

 1. util.cpp:859 # the call to waitpid() in myth_system()
 2. A few stack frames in glibc's waitpid() function
 3. [signal handler]
 4. housekeeper.cpp:26   # The call to wait(0) in reapChild() in
 housekeeper.cpp
 5. The wait() syscall
 6. <---- You are here

 And here, I was stuck.

 So, now, how did I get from myth_system() to housekeeper.cpp?

 Well, the answer can be found in the linuxthreads FAQ.

 http://pauillac.inria.fr/~xleroy/linuxthreads/faq.html#J

 ============================================================================

 J.3: I've installed a handler on a signal. Which thread executes the
 handler when the signal is received?

 [ ... ]

 If the signal is sent to a particular thread using pthread_kill(), then
 that thread executes the handler.

 [ ... ]

 If the signal is sent via kill() or the tty interface (e.g. by pressing
 ctrl-C), then the POSIX specs say that the handler is executed by any
 thread in the process that does not currently block the signal. In other
 terms, POSIX considers that the signal is sent to the process (the
 collection of all threads) as a whole, and any thread that is not blocking
 this signal can then handle it.

 ============================================================================

 So, the here's what I believe is happening here.

 A.  The housekeeper thread calls "signal(SIGCHLD, &reapChild)", then forks
 off and runs mythfilldatabase.  The intent here is to set
 "HouseKeeper_filldb_running = false;".  Housekeeper's main loop checks if
 mythfilldatabase is already running in the background, using this flag,
 and, if so, that factors into some decisions.  housekeeper does not wait()
 for mythfilldatabase to end, it just sets the signal handler, forks it
 off, and lets the SIGCHLD handler reset the "HouseKeeper_filldb_running"
 flag when it's done.

 But, the signal handler remains set forever.  So:

 B.  At some point later, myth_system() forks off and runs mythcommflag, or
 transcode

 C.  The child process does its thing, and exits.

 D.  The signal handler, reapChild(), that meant to be used when
 mythfilldatabase exits, now gets invoked to handle the SIGCHLD that gets
 generated when mythcommflag/transcode exits.

 It looks like what's happening is that when a child process gets forked,
 the new process is considered to be a child of the entire process, not
 just the thread that forked it, so when the child process finishes, the
 resulting SIGCHLD is *NOT* sent via the pthread_kill() mechanism.  It gets
 posted to the entire process, and, as per linuxthreads FAQ, any thread can
 pick it up, and
 run with it.

 Now, which thread actually gets it, is not really germane.  Whichever
 thread gets chosen, it's going to invoke reapChild(), which the real
 problem.

 The fix is NOT to clear the signal handler when mythfilldatabase is done.
 Consider this sequence of events:

 1. myth_system() forks off and runs mythcommflag

 2. housekeeper.cpp decides it's time to run mythfilldatabase.

 3. housekeeper.cpp sets "HouseKeeper_filldb_running = true", sets up a
 SIGCHLD handler, and forks off the mythfilldatabase

 4. mythcommflag terminates, and generates a SIGCHLD.

 5. Since there's a signal handler set for SIGCHLD, it gets invoked.

 5. reapChild() sets "HouseKeeper_filldb_running = false", while
 mythfilldatabase continues to chug along, doing its business.

 6.  Boom.

 So, I think there's a problem here, that needs fixin'.  Maybe it's already
 fixed in SVN -- I'm looking at the 0.19 release -- so this message can be
 ignored.

 I believe that 99.9% of the time this does not actually end up causing
 anything bad -- but this is definitely an issue that needs fixing
 nevertheless -- and there are other reasons why, in my case, I can
 reproduce this fairly consistently.  Why it's happening to me it's another
 story, but, to make a long story short, the positive side of me being able
 to reproduce this badness also means that I can fix it.  Which I did.
 I've patched this, and I've been testing the fix for almost a week now,
 and it completely eliminates this problem, the problem being: you should
 either use a signal handler for all child processes, or not.  You can't
 reliably use a signal handler for some processes, but then rely on a
 wait/waitpid for other child processes.  You just
 can't do it.

 My fix is, essentially: ditch the signal handler, create a new detached
 thread that forks() mythfilldatabase, waitpid()s for it, clears the
 HouseKeeper_filldb_running flag, and goes away.  Nice and tidy.

 As I mentioned, for various reasons I can reliably reproduce the problems
 caused by improper signal handling. After applying the following patch, no
 problems have been observed.

 This patch is against mythtv-0.19 release.

-- 
Ticket URL: <http://svn.mythtv.org/trac/ticket/1882>
MythTV <http://www.mythtv.org/>
MythTV


More information about the mythtv-commits mailing list