[mythtv-users] SOLVED Random lockups on Mythbackend

Larry Finger Larry.Finger at lwfinger.net
Sun Apr 10 21:37:31 UTC 2011


On 04/10/2011 04:17 PM, f-myth-users at media.mit.edu wrote:
>      >  Date: Sun, 10 Apr 2011 15:47:11 -0500
>      >  From: dwoody1<dwoody1 at charter.net>
>
>      >  After a lot of testing I found the problem. The memory was bad. The
>      >  memory test ran for 4 days and over 100 passes without even one failure
>      >  (go figure).
>
> That's perfectly reasonable, if unfortunate.
>
> Lots of people assume, "If it passes memtest86+, it -must- be good,"
> but that assumes that memtest86+ can test everything, which it can't.
> (If it -fails-, you know you have a problem, but if it passes, you
> actually don't know much.)
>
> For example, I've got an old EPoX Ultra 9NPA3 that would corrupt a few
> bits out of a few GB (in a pattern-dependent and mostly reproducible
> way), -only- if CPU throttling was enabled.  Since memtest86+ always
> runs at full CPU, it couldn't detect the problem.  But I spent a while
> tracking it down and implicated only the memory or its datapaths,
> because I could take one of the problematic files (of a few hundred
> meg or a gig), read it once (to get it in the FS cache in RAM), and
> then md5sum of the in-memory cached copy would return incorrect
> results in a loop like "sleep 10; md5sum thing" but would return
> correct results without the sleep---or with the sleep but if I
> ran something CPU-intensive in another shell.  [I had earlier
> exonerated all disk datapaths---tried IDE, SATA, and USB---and
> since I was using a crypto FS, I -knew- those paths couldn't be
> flipping random bits or they'd be flipping random -cleartext-
> bits and trashing entire sectors of the FS, which wasn't happening.
> I'm very glad I was paranoid and tested the machine before putting
> it into production---I'd originally found the problem simply copying
> a terabyte to it and checksumming the results, and when they didn't
> match, started backtracing, beginning with the network hardware.]
>
> My solution for that motherboard was to just disable CPU throttling.
>
> Now, it's -possible- that some different brand of memory might have
> been just enough different that these marginal throttling-dependent
> changes in datapath speeds wouldn't have led to corruption, but I
> didn't feel like screwing with it; turning off throttling didn't
> matter and instantly fixed the problem for good.  It's now the first
> thing I try when I see nondeterministic behavior, along with disabling
> any spread-spectrum the motherboard might have available.  (I have a
> pair of other, even older, motherboards where having SS on often leads
> to boots where the clock runs fast by many minutes per hour; turning
> off SS decreases the incidence of that by about 10x or more.)

I was certainly aware that memtest86+ could never prove that memory had no 
faults, but your stories are scary. That is nearly enough to suggest a new line 
of work.
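
For anyone who wants to try the kind of check described in the quoted
message (read a large file once so it sits in the filesystem cache, then
checksum it repeatedly with an idle pause in between), here is a minimal
sketch in Python. It is not the script used in the quoted message; the
function names (md5_of_file, repeated_checksums, summarize) and the default
values are made up for illustration. It also assumes the file is small
enough to stay in the page cache, which the script cannot verify on its
own.

```python
import hashlib
import sys
import time
from collections import Counter


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def repeated_checksums(path, rounds=20, delay=10.0):
    """Checksum the same file `rounds` times, sleeping `delay` seconds
    between passes so the CPU has a chance to go idle (hypothetical
    defaults, loosely modelled on "sleep 10; md5sum thing")."""
    results = []
    for i in range(rounds):
        if i > 0:
            time.sleep(delay)
        results.append(md5_of_file(path))
    return results


def summarize(digests):
    """Count how many times each distinct digest was seen."""
    return Counter(digests)


if __name__ == "__main__":
    counts = summarize(repeated_checksums(sys.argv[1]))
    for digest, seen in counts.most_common():
        print(seen, digest)
    if len(counts) > 1:
        print("MISMATCH: the same file produced different checksums")
```

More than one distinct digest for an unchanged file points at the memory
or its datapaths rather than the disk, provided the reads really came from
cache. A single digest on every pass proves little, for the same reason a
clean memtest86+ run proves little.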

