[mythtv-users] Is this a failing drive?

Tue Nov 22 15:22:26 UTC 2011

On 11/22/2011 9:57 AM, Warren Sturm wrote:
> On Tue, 2011-11-22 at 09:04 +0000, Tim Draper wrote:
>> On 22 November 2011 03:32, Don Brett<dlbrett at zoominternet.net>  wrote:
>>> On 11/21/2011 10:29 PM, Don Brett wrote:
>>>
>>> On 11/21/2011 11:12 AM, Keith Pyle wrote:
>>>
>>> On 11/21/11 06:00, Manuel McLure wrote:
>>>
>>> On Sun, Nov 20, 2011 at 7:40 PM, Don Brett<dlbrett at zoominternet.net>  wrote:
>>>
>>> This is a little off-topic, but it's a new drive on a new Mythbuntu
>>> installation.  Symptoms are:
>>>
>>> -partition table got corrupted (after about 50 hours on the drive); an
>>> 8-hour low level format got it back
>>> -the box occasionally freezes-up for a few seconds
>>> -I see multiple instances of these errors in /var/log/syslog:
>>>
>>> Nov 20 09:54:02 zedo kernel: [    6.292384] EXT4-fs (sda2): re-mounted. Opts: errors=remount-ro
>>> Nov 20 09:55:12 zedo kernel: [   81.573425] EXT4-fs (sda2): re-mounted. Opts: errors=remount-ro,commit=0
>>> Nov 20 09:58:51 zedo kernel: [  300.004041] [Hardware Error]: Machine check events logged
>>>
>>>
>>> From /var/log/mcelog, I see multiple entries of this:
>>>
>>> mcelog: failed to prefill DIMM database from DMI data
>>> Kernel does not support page offline interface
>>> mcelog: mcelog read: No such device
>>> Hardware event. This is not a software error.
>>> MCE 0
>>> CPU 0 4 northbridge
>>> MISC c008000001000000 ADDR 1844184
>>> TIME 1321843678 Sun Nov 20 21:47:58 2011
>>>   Northbridge NB Array Error
>>>        bit42 = L3 subcache in error bit 0
>>>        bit43 = L3 subcache in error bit 1
>>>        bit46 = corrected ecc error
>>>        bit59 = misc error valid
>>>        bit62 = error overflow (multiple errors)
>>>   memory/cache error 'evict mem transaction, generic transaction, level generic'
>>> STATUS dc074c60001c017b MCGSTATUS 0
>>> MCGCAP 106 APICID 0 SOCKETID 0
>>> CPUID Vendor AMD Family 16 Model 5
>>> Hardware event. This is not a software error.
>>>
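
The bit list in the mcelog entry above can be checked by hand against the STATUS value. The following is only a rough sketch: the labels for bits 42, 43, 46, 59 and 62 are copied from the mcelog output, while bit 61 (the architectural "uncorrected error" flag) and the names `STATUS_BITS` and `decode_status` are additions made for illustration.

```python
# Labels for bits 42, 43, 46, 59 and 62 are copied from the quoted mcelog
# output. Bit 61 (UC, "uncorrected error") is an added assumption for
# illustration; it is not printed in the log.
STATUS_BITS = {
    42: "L3 subcache in error bit 0",
    43: "L3 subcache in error bit 1",
    46: "corrected ecc error",
    59: "misc error valid",
    61: "uncorrected error",
    62: "error overflow (multiple errors)",
}


def decode_status(status_hex):
    """Return the labels of the known bits that are set in a STATUS value."""
    status = int(status_hex, 16)
    return [label for bit, label in sorted(STATUS_BITS.items())
            if (status >> bit) & 1]


if __name__ == "__main__":
    # STATUS value taken from the mcelog entry.
    for label in decode_status("dc074c60001c017b"):
        print(label)
```

With this STATUS value the uncorrected-error bit comes out clear, which agrees with mcelog reporting a corrected ECC error.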
>>>
>>> I replaced the sata drive cables, disconnected the dvd drive, tried a
>>> different power supply.  I also tried another drive on the same box (but
>>> it was an ide) and had no errors.  With a different motherboard...still
>>> got the "re-mount" errors but none of the "[Hardware Error]" entries.
>>> Anyone have a suggestion?
>>>
>>>
>>> PS - the hardware is: (everything but the power supply and case is new)
>>> -ASUS M4A78LT-M AM3 AMD 760G HDMI Micro ATX AMD Motherboard
>>> -SAMSUNG EcoGreen F4 HD204UI 2TB SATA 3.0Gb/s 3.5" Internal Hard Drive
>>> -ADATA Gaming Series 2GB 240-Pin DDR3 SDRAM DDR3 1600 (PC3 12800)
>>> Desktop Memory
>>> -ZOTAC ZT-20203-10L GeForce GT 220 1GB 128-bit DDR2 PCI Express 2.0 x16
>>> HDCP Ready Video Card
>>> -ThermalTake 430 power supply
>>>
>>> It's not a disk problem, that's a CPU or motherboard problem. The disk
>>> corruption is caused by your memory contents getting corrupted and
>>> being written to disk.
>>>
>>> See http://halobates.de/mce.pdf for details on exactly what a "machine
>>> check exception" is.
>>>
>>> With the caveat that I'm not an expert on this...
>>>
>>> I suspect your cache memory or perhaps the Northbridge (handles
>>> communication among cores, possibly RAM, video) has a problem.
>>> Depending on your specific CPU, the Northbridge may be on the CPU, i.e.,
>>> CPU cores and Northbridge are all in one package.  This is the case for
>>> many recent, mainstream processors from both Intel and AMD.
>>>
>>> The errors you included suggest a multi-bit error in the L3 cache.  L3
>>> is special memory where (a limited amount of) recently used instructions
>>> and data are stored for faster access by the CPU than going to main
>>> RAM.  As Manuel wrote, if the cache is corrupted, it could lead to all
>>> manner of intermittent and seemingly random problems, including those
>>> you mentioned.
>>>
>>> If this is a Northbridge/cache problem and the Northbridge is on the CPU
>>> die, then your only fix will be to replace the CPU.  If the Northbridge
>>> is a separate chip on the motherboard, then you'll have to replace the
>>> motherboard but could keep the CPU.
>>>
>>> It may be worthwhile trying to see if Asus will help you if this is a new
>>> motherboard.  (I have no personal experience with Asus support and don't
>>> know if/how they will help.)
>>>
>>> Keith
>>> _______________________________________________
>>> mythtv-users mailing list
>>> mythtv-users at mythtv.org
>>> http://www.mythtv.org/mailman/listinfo/mythtv-users
>>>
>>>
>>>
>>> I just noticed that I hadn't included the cpu on the list of hardware; it's
>>> an AMD Athlon II X3 445 Rana (3.1GHz Socket AM3 95W Triple-Core Desktop
>>> Processor ADX445WFGMBOX).  Apparently this processor doesn't have an L3
>>> cache (from Toms Hardware - Rana, triple-core, no L3 cache (2.7+ GHz)), does
>>> that matter, or is the cache on the motherboard?
>>>
>> Shouldn't really matter.
>>
>>> I looked up the features on the motherboard chipset, it has a North Bridge
>>> (AMD 760G); I assume that means the cpu does not have an integrated
>>> northbridge...right?  So it looks like my problem might be with the
>>> motherboard.
>>>
>>> Sidenote - Some other threads implied it might be a memory problem, so I
>>> played with it a little.  The box started with (2) 2 gig ddr3's.  I removed
>>> one of the sticks...similar behavior.  Replaced that with the other stick
>>> (still running with a single stick)...errors increased a lot, 2-3 errors a
>>> minute.   Does that mean anything?
>>>
>> Could well be a faulty motherboard. Although I don't have any real-life
>> tales, I'd have thought a faulty DIMM slot would also have shown up in the
>> memtest you ran.
>>
>>
>>> I forgot to mention, I ran memtest86.  It showed no errors after 5 passes (it
>>> ran for about 4 hours).
>>>
>> It's not the RAM then.
>> The only thing left now is CPU and/or motherboard, so based on other
>> people's and your own opinions, I'd also say motherboard.
> Perhaps one thing to check is if cpuspeed is running.  I had similar
> freezes that went away after I removed cpuspeed.
>
I sent an email to Asus; we'll see what happens from there.  In the
meantime, I had missed the suggestion to check the smartctl output (I'll
add this command to my toolbox...thank-you-very-much).  Not sure what
all this means, but here it is:

Install with:
don at zedo:~$ sudo apt-get install smartmontools

See output with:
don at zedo:~$ sudo smartctl -a /dev/sda

Output for reallocated sector count is:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
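
The attribute row above can be read field by field. This is a minimal sketch rather than a proper smartctl parser: the helper names `SmartAttr`, `parse_attr_row` and `below_threshold` are made up for illustration, and the rule applied is the one from the smartctl documentation, where an attribute counts as failed when its normalized VALUE is less than or equal to THRESH.

```python
from collections import namedtuple

# Field names follow the column header printed by `smartctl -a`.
SmartAttr = namedtuple(
    "SmartAttr",
    "id name flag value worst thresh type updated when_failed raw_value",
)

# The Reallocated_Sector_Ct row from the report, on one line.
ROW = ("  5 Reallocated_Sector_Ct   0x0033   252   252   010    "
       "Pre-fail  Always       -       0")


def parse_attr_row(row):
    """Split one whitespace-separated attribute row into named fields."""
    f = row.split(None, 9)  # RAW_VALUE is last and may contain spaces
    return SmartAttr(int(f[0]), f[1], f[2], int(f[3]), int(f[4]),
                     int(f[5]), f[6], f[7], f[8], f[9].strip())


def below_threshold(attr):
    """True when the normalized VALUE has dropped to THRESH or lower."""
    return attr.value <= attr.thresh


if __name__ == "__main__":
    attr = parse_attr_row(ROW)
    print(attr.name, "raw:", attr.raw_value,
          "failed:", below_threshold(attr))
```

For this row the normalized value (252) is well above the threshold (10) and the raw count is 0, so the attribute itself shows no reallocated sectors.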

Output for logs at end of report is:

Error 42 occurred at disk power-on lifetime: 150 hours (6 days + 6 hours)
   When the command that caused the error occurred, the device was active or idle.

   After command completion occurred, registers were:
   ER ST SC SN CL CH DH
   -- -- -- -- -- -- --
   84 51 00 00 00 00 a0

   Commands leading to the command that caused the error were:
   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
   -- -- -- -- -- -- -- --  ----------------  --------------------
   ec 00 00 00 00 00 a0 08      00:00:03.662  IDENTIFY DEVICE
   ec 00 00 00 00 00 a0 08      00:00:03.662  IDENTIFY DEVICE
   ef 03 46 00 00 00 a0 08      00:00:03.662  SET FEATURES [Set transfer mode]
   ef 10 02 00 00 00 a0 08      00:00:03.662  SET FEATURES [Reserved for Serial ATA]
   27 00 00 00 00 00 e0 08      00:00:03.662  READ NATIVE MAX ADDRESS EXT

Error 41 occurred at disk power-on lifetime: 150 hours (6 days + 6 hours)
   When the command that caused the error occurred, the device was active or idle.

   After command completion occurred, registers were:
   ER ST SC SN CL CH DH
   -- -- -- -- -- -- --
   84 51 00 00 00 00 a0

   Commands leading to the command that caused the error were:
   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
   -- -- -- -- -- -- -- --  ----------------  --------------------
   ec 00 00 00 00 00 a0 08      00:00:03.656  IDENTIFY DEVICE
   00 00 01 01 00 00 00 08      00:00:03.656  NOP [Abort queued commands]
   00 00 01 01 00 00 00 00      00:00:03.656  NOP [Abort queued commands]
   00 00 01 01 00 00 00 00      00:00:03.656  NOP [Abort queued commands]
   ec 00 00 00 00 00 a0 08      00:00:03.651  IDENTIFY DEVICE

Error 40 occurred at disk power-on lifetime: 150 hours (6 days + 6 hours)
   When the command that caused the error occurred, the device was active or idle.

   After command completion occurred, registers were:
   ER ST SC SN CL CH DH
   -- -- -- -- -- -- --
   84 51 00 00 00 00 a0

   Commands leading to the command that caused the error were:
   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
   -- -- -- -- -- -- -- --  ----------------  --------------------
   ec 00 00 00 00 00 a0 08      00:00:03.651  IDENTIFY DEVICE
   00 00 01 01 00 00 00 08      00:00:03.651  NOP [Abort queued commands]
   00 00 01 01 00 00 00 00      00:00:03.651  NOP [Abort queued commands]
   ec 00 01 00 00 00 00 08      00:00:03.650  IDENTIFY DEVICE
   ea 00 00 00 00 00 e0 08      00:00:03.641  FLUSH CACHE EXT

Error 39 occurred at disk power-on lifetime: 150 hours (6 days + 6 hours)
   When the command that caused the error occurred, the device was active or idle.

   After command completion occurred, registers were:
   ER ST SC SN CL CH DH
   -- -- -- -- -- -- --
   84 51 00 00 00 00 00

   Commands leading to the command that caused the error were:
   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
   -- -- -- -- -- -- -- --  ----------------  --------------------
   ec 00 01 00 00 00 00 08      00:00:03.650  IDENTIFY DEVICE
   ea 00 00 00 00 00 e0 08      00:00:03.641  FLUSH CACHE EXT
   61 00 08 94 8d 04 40 08      00:00:03.641  WRITE FPDMA QUEUED
   ea 00 00 00 00 00 e0 08      00:00:03.641  FLUSH CACHE EXT
   61 00 48 4c 8d 04 40 08      00:00:03.641  WRITE FPDMA QUEUED

Error 38 occurred at disk power-on lifetime: 149 hours (6 days + 5 hours)
   When the command that caused the error occurred, the device was active or idle.

   After command completion occurred, registers were:
   ER ST SC SN CL CH DH
   -- -- -- -- -- -- --
   84 51 00 00 00 00 a0

   Commands leading to the command that caused the error were:
   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
   -- -- -- -- -- -- -- --  ----------------  --------------------
   ec 00 00 00 00 00 a0 08      00:00:03.071  IDENTIFY DEVICE
   00 00 00 00 00 00 00 08      00:00:03.070  NOP [Abort queued commands]
   00 00 00 00 00 00 00 08      00:00:03.070  NOP [Abort queued commands]
   00 00 00 00 00 00 00 08      00:00:03.070  NOP [Abort queued commands]
   00 00 08 64 87 04 40 08      00:00:03.060  NOP [Abort queued commands]
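
Every error entry above shows the same two register values, ER = 84 and ST = 51. A rough sketch of how those bytes break down into flags follows. The bit names are the ones I understand the ATA command set to use for the Error and Status registers, so treat both tables as assumptions to check against the spec; status bit 4 is left out on purpose because its meaning depends on the command and spec revision. The names `ATA_ERROR_BITS`, `ATA_STATUS_BITS` and `decode_bits` are made up for the example.

```python
# Assumed bit names for the ATA Error register (ER column).
ATA_ERROR_BITS = {
    7: "ICRC",  # interface CRC error
    6: "UNC",   # uncorrectable data error
    5: "MC",    # media changed
    4: "IDNF",  # ID not found
    3: "MCR",   # media change request
    2: "ABRT",  # command aborted
    1: "NM",    # no media
}

# Assumed bit names for the ATA Status register (ST column).
# Bit 4 is deliberately omitted (command dependent).
ATA_STATUS_BITS = {
    7: "BSY",   # busy
    6: "DRDY",  # device ready
    5: "DF",    # device fault
    3: "DRQ",   # data request
    0: "ERR",   # error
}


def decode_bits(value, table):
    """Return the names of the table bits set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if (value >> bit) & 1]


if __name__ == "__main__":
    print("ER 0x84:", decode_bits(0x84, ATA_ERROR_BITS))
    print("ST 0x51:", decode_bits(0x51, ATA_STATUS_BITS))
```

If those tables are right, ER = 84 is an interface CRC error plus an aborted command, which would point at the link between the drive and the controller rather than at the platters. That is only a reading of the flags, not a diagnosis.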