[mythtv-users] vdpau and the star-studded screen of death

Tue Dec 22 19:27:34 UTC 2009

    > Date: Tue, 22 Dec 2009 00:17:02 -0600
    > From: Kenneth Emerson <kenneth.emerson at gmail.com>

    > On Tue, Dec 1, 2009 at 4:26 PM, John Drescher <drescherjm at gmail.com> wrote:

    > > I wanted to follow up on a possible fix (at least for me) and thought it
    > might be useful for others.  Searching dmesg I found that I had XID errors
    > coming from the nvidia module when this problem occurred.  Searching for XID
    > errors, I found this link on one of the Ubuntu forums (
    > http://ubuntuforums.org/showthread.php?t=1163786).  This led me to try
    > adding this option to the nvidia module (in /etc/modprobe.d/nvidia.conf):

    > options nvidia NVreg_RegistryDwords="PerfLevelSrc=0x2222"
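
For anyone wanting to check their own logs for the same symptom, here
is a minimal sketch of the dmesg search described above.  It assumes
the nvidia module tags these messages with the string "NVRM: Xid"; the
function names find_xid_lines and read_dmesg are made up for
illustration and are not part of any tool.

```python
import subprocess


def find_xid_lines(log_text):
    """Return the lines of a kernel log that mention an NVIDIA Xid error."""
    # Assumption: the nvidia module prefixes these messages with "NVRM: Xid".
    return [line for line in log_text.splitlines() if "NVRM: Xid" in line]


def read_dmesg():
    """Return the kernel ring buffer as text by running the dmesg command."""
    # Reading dmesg may need root on systems that restrict the kernel log.
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    return result.stdout
```

Calling find_xid_lines(read_dmesg()) on the affected machine should
list any Xid messages logged since boot; an empty list only means
nothing matched that string.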

Fascinating.  That page seems to indicate issues with the GPU going
in and out of a power-saving slower-clock mode.  If that's the issue,
it's very similar to one motherboard I have that is also unreliable
if CPU throttling is in effect---it corrupts a bit or two of RAM every
few tens of gig of transfers.  (*)

My solution was to disable throttling on that motherboard, and the
problem went away.  Sounds like nvidia may have a similar issue.  If
so, it may be very sensitively dependent upon board rev, vendor and/or
batch of RAM they put on their board, temperature, etc.  Fun.
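
The post doesn't say how throttling was disabled on that board, so
the following is only a sketch of one way to check whether a Linux box
is currently allowed to change CPU frequency.  It assumes the kernel
exposes the cpufreq sysfs files under /sys/devices/system/cpu; the
helper names are hypothetical.

```python
import glob
import os


def read_governors(cpu_root="/sys/devices/system/cpu"):
    """Map each CPU name (e.g. 'cpu0') to its current cpufreq governor."""
    governors = {}
    pattern = os.path.join(cpu_root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    for path in sorted(glob.glob(pattern)):
        # Path ends in .../cpuN/cpufreq/scaling_governor, so cpuN is third from last.
        cpu_name = path.split(os.sep)[-3]
        with open(path) as handle:
            governors[cpu_name] = handle.read().strip()
    return governors


def throttling_possible(governors):
    """True if any CPU uses a governor other than 'performance'."""
    # The 'performance' governor keeps the CPU at its highest allowed frequency.
    return any(name != "performance" for name in governors.values())
```

An empty result from read_governors() just means no cpufreq files were
found (no driver loaded, or scaling handled elsewhere, e.g. in the
BIOS), not that throttling is off.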

[My problem of course never caused an error in memtest86+ because that
runs the CPU at 100%---and memtest86+ hasn't found the last few cases
of memory settings/motherboard incompatibility I've had anyway (though
the kernel has found them).  I wonder if nvidia has similar diagnostics,
but, again, they probably wouldn't help with a throttling issue
because most such diagnostics are intended to be stress tests and try
to run everything flat out.]

(*) Was annoying to diagnose 'cause I first found it when transferring
a terabyte to an adjacent machine, so I assumed it was cabling or NICs
or my hub; tracked it back & back & back to discover it wasn't net,
wasn't RAM (per se), but that running repeated md5sums on the same
file gave me the same answers if they were run continuously [CPU
therefore stuck at max freq], different answers with 10-second sleeps
between runs [CPU throttled up & down], and the same answers again
with 10-second sleeps but something else running at 100% in another
process [again, CPU stuck at max freq].  The
problem was partially pattern-dependent, but I used a file that tended
to fail to be transferred correctly as my test case.  Debugging was
vastly aided by observation that IDE, SATA, and USB devices were all
affected [ruling out a wide variety of hardware faults] and I knew
from the very start that the disk & datapaths were good, because the
original filesystem was an encrypted one anyway [instantly ruling out
a few bad disk bits, which would have randomized entire files, not
just flipped a few bits], which very rapidly got me looking at RAM
because the common element was the RAM cache of filesystem contents.
Most time-consuming issue was that the machine was a fileserver, so
verifying that nothing had been corrupted on its way to being written
to disk for the several weeks that throttling had been enabled took
some care; at least GPU corruption shouldn't lead to bad files... :)
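
The repeated-md5sum test in the footnote can be sketched roughly as
follows.  This is not the original procedure, just an illustration
under assumptions: the function names, the chunk size, and the default
run count are all made up, and the pause is a parameter so the
continuous and 10-second-sleep cases can both be tried.

```python
import hashlib
import time


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def repeated_digests(path, runs=10, pause_seconds=0.0):
    """Hash the same file `runs` times and count how often each digest appears."""
    counts = {}
    for index in range(runs):
        value = md5_of_file(path)
        counts[value] = counts.get(value, 0) + 1
        # Sleep between runs (not after the last) so the CPU has a chance to idle.
        if pause_seconds and index < runs - 1:
            time.sleep(pause_seconds)
    return counts
```

On a healthy machine repeated_digests(path, pause_seconds=10) should
return a single digest; more than one distinct digest for an unchanged
file would point at corruption somewhere between the disk cache and
the CPU, as in the case described above.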