Fight for the Internet 1!

Friday, April 4, 2014

Hard Drive Problems Solved

Overview of Problem
I had a bit of a saga trying to get a hard drive to work properly on my computer. Here are the steps I took to fix the issues. Hopefully they will help someone else someday.

Hardware
New Hard Drive: Seagate 4TB, 5900 RPM, SATA3 6.0 Gbps
Motherboard: Brand: ASRock, Model: Z77 Extreme4.

Problem
I tried copying several terabytes of data to the new drive. It didn't seem to have a problem with lots of small files, but it would eventually hit an error when copying files that were several gigabytes in size. However, the size threshold that triggered the errors was inconsistent. Sometimes it would copy 4.5GB files, and other times it would fail on 1.8GB files. (I later figured out why.)

Upon error, the drive would be remounted read-only, and no attempt to remount it writable seemed to work. I always had to reboot. Sometimes konqueror would report "Errno: 30" upon a failed copy.
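For anyone wondering what "Errno: 30" means, here is a minimal sketch that looks the number up with Python's standard library. The variable names are my own. On Linux I would expect it to report EROFS (read-only file system), which matches the read-only remount described above, but the mapping is platform-specific.

```python
import errno
import os

code = 30  # the number konqueror reported

# errno.errorcode maps numeric codes to their symbolic names on this platform.
name = errno.errorcode.get(code, "unknown")

# os.strerror gives the human-readable message for the same code.
message = os.strerror(code)

print(f"errno {code}: {name} ({message})")
```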

Occasionally I think it even managed to hard-lock my system. I was impatient, so I didn't wait long (maybe a minute or so) before hard-resetting the power on my computer. (What can I say? I trust EXT3/4's journaling to recover my work, and all my other programs recover as well.)

Diagnosis
I ran some tests using the SMART diagnostics, and they said the drive was healthy. I haven't run a full test yet, but I will, just in case. The quick tests reported no problems.

I checked the cabling and found the drive was on the same power cable as a 3TB Seagate. I don't think the two drives would have been drawing enough power over the same line to cause fluctuations (they are the only two Serial ATA/SATA drives in the whole system, and nothing else was running). Still, I gave the new 4TB Seagate its own dedicated power line.

Next I took a step whose importance I didn't realize until later. I deleted the partition table of the problem drive (it only had one massive 4TB partition) and created a new one. I also reformatted the drive. (For most of this article I used EXT4, but I did some experiments with XFS. XFS never triggered the error, but I didn't wait hours to force a trigger, so my results can't be judged either way for that format.)

I have read several accounts stating that once you have triggered the errors, you must correct them or they will continue to happen. This is consistent with what I encountered, particularly the way the errors were triggered by large files of different, inconsistent sizes.

I simply wiped and reformatted the drive, but others claim they have been able to recover by using fsck.ext4 to correct the problems. I cannot comment on this.

More Diagnosis
I finally thought to check the output of dmesg, and found some very useful information.

[ 891.079292] EXT4-fs (sdb1): mounted filesystem with ordered data mode. Opts: (null)
[11668.781082] ata4.00: exception Emask 0x10 SAct 0x7fffffff SErr 0x400100 action 0x6 frozen
[11668.781086] ata4.00: irq_stat 0x08000000, interface fatal error
[11668.781087] ata4: SError: { UnrecovData Handshk }
[11668.781089] ata4.00: failed command: WRITE FPDMA QUEUED
[11668.781091] ata4.00: cmd 61/00:00:00:18:b9/04:00:9b:00:00/40 tag 0 ncq 524288 out
res 40/00:d0:00:80:b9/00:00:9b:00:00/40 Emask 0x10 (ATA bus error)
[11668.781093] ata4.00: status: { DRDY }

......
(Snip)
.....
[11846.581765] ata4: hard resetting link
[11846.886462] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[11846.887197] ata4.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
[11846.887199] ata4.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
[11846.887200] ata4.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
[11846.888785] ata4.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
[11846.888788] ata4.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
[11846.888789] ata4.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
[11846.889558] ata4.00: configured for UDMA/133
[11846.889594] ata4: EH complete
[11892.733319] ata4.00: exception Emask 0x10 SAct 0x7fffffff SErr 0x400100 action 0x6 frozen
[11892.733322] ata4.00: irq_stat 0x08000000, interface fatal error
[11892.733324] ata4: SError: { UnrecovData Handshk }
[11892.733325] ata4.00: failed command: WRITE FPDMA QUEUED
[11892.733328] ata4.00: cmd 61/00:00:00:3c:76/04:00:9e:00:00/40 tag 0 ncq 524288 out
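A log like this is easier to read when it is boiled down to a few counts. The following is only a rough sketch, not a full libata log parser: it takes a handful of lines copied from the excerpt above and counts exceptions, failed commands and link resets per ata port. The function name, the regular expression and the shape of the summary dictionary are all my own choices.

```python
import re

# A few lines copied from the dmesg excerpt above.
SAMPLE = """\
[11668.781082] ata4.00: exception Emask 0x10 SAct 0x7fffffff SErr 0x400100 action 0x6 frozen
[11668.781089] ata4.00: failed command: WRITE FPDMA QUEUED
[11846.581765] ata4: hard resetting link
[11892.733319] ata4.00: exception Emask 0x10 SAct 0x7fffffff SErr 0x400100 action 0x6 frozen
[11892.733325] ata4.00: failed command: WRITE FPDMA QUEUED
"""

# Matches "[timestamp] ataN: message" and "[timestamp] ataN.MM: message".
LINE_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<port>ata\d+)(?:\.\d+)?:\s+(?P<msg>.*)$"
)


def summarize(log_text):
    """Return per-port counts of exceptions, failed commands and link resets."""
    summary = {}
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # not an ataN message (e.g. the EXT4-fs mount line)
        entry = summary.setdefault(
            m["port"],
            {"exceptions": 0, "failed_commands": 0, "resets": 0, "first_error_s": None},
        )
        msg = m["msg"]
        if msg.startswith("exception"):
            entry["exceptions"] += 1
            if entry["first_error_s"] is None:
                # Seconds since boot at which the first exception was logged.
                entry["first_error_s"] = float(m["ts"])
        elif msg.startswith("failed command:"):
            entry["failed_commands"] += 1
        elif "hard resetting link" in msg:
            entry["resets"] += 1
    return summary


print(summarize(SAMPLE))
```

To try it on a real system, the output of dmesg could be saved to a file and its contents passed to summarize() in place of SAMPLE.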



Some users online claimed the SATA cable itself could be going bad. This is a genuine possibility, so I switched to a different one.

Also, there are long threads reporting problems like this for certain SATA controllers, particularly the Marvell 9123. I believe this is a software issue and not a hardware failure, but I'm no kernel dev. Others have reported issues with JMicron controllers as well. I looked up my motherboard and found a post by another user with a very similar model and the same problem. I checked which controller my motherboard has by running the 'lspci' command.

The chipset for the controller is "Intel Corporation 7 Series/C210 Series Chipset Family 6-port SATA Controller [AHCI mode] (rev 04)".
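If the lspci output is long, the SATA entries can be picked out with a small filter. This is a minimal sketch that assumes lspci is installed and that the controller line contains the word "SATA" or "AHCI", as mine does. The function name and the keyword list are my own.

```python
import shutil
import subprocess


def sata_controllers(lspci_text):
    """Return the lspci lines that mention SATA or AHCI (case-insensitive)."""
    keywords = ("sata", "ahci")
    return [
        line.strip()
        for line in lspci_text.splitlines()
        if any(k in line.lower() for k in keywords)
    ]


if __name__ == "__main__":
    # Only call lspci if it is actually available on this machine.
    if shutil.which("lspci"):
        result = subprocess.run(
            ["lspci"], capture_output=True, text=True, check=False
        )
        for line in sata_controllers(result.stdout):
            print(line)
    else:
        print("lspci not found on this system")
```

A board with more than one SATA controller should show one line per controller, which helps when working out which ports belong to which chip.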

Finally I checked the cabling in my computer again. I had mistakenly hooked up the hard drive to one of the SATA2 (3.0 Gbps) ports. (There are 8 ports on my system, so it was a simple mistake, but ugh...) In fact, the dmesg output shows the link running at SATA2 speed (3.0 Gbps), even though the drive is capable of SATA3 speed (6.0 Gbps).
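The negotiated speed can be pulled out of the "SATA link up" line without reading the whole log by eye. Here is a small sketch using the line from the dmesg excerpt earlier in this post. The function name and the lookup table are my own, and the table only covers the three nominal SATA line rates.

```python
import re

# Line copied from the dmesg excerpt earlier in this post.
LINK_LINE = "[11846.886462] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)"

# Nominal line rate of each SATA generation, in Gbps.
GENERATION_BY_GBPS = {1.5: "SATA1", 3.0: "SATA2", 6.0: "SATA3"}

LINK_RE = re.compile(r"(ata\d+): SATA link up (\d+(?:\.\d+)?) Gbps")


def negotiated_links(log_text):
    """Map each ataN port to (speed in Gbps, generation) from its latest 'link up' line."""
    links = {}
    for port, speed in LINK_RE.findall(log_text):
        gbps = float(speed)
        # Later lines overwrite earlier ones, so the last reset wins.
        links[port] = (gbps, GENERATION_BY_GBPS.get(gbps, "unknown"))
    return links


print(negotiated_links(LINK_LINE))  # → {'ata4': (3.0, 'SATA2')}
```

After moving the drive to a SATA3 port, the same check on fresh dmesg output should show whether the link now comes up at 6.0 Gbps.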

I switched its connection to a SATA3 port.


Solution

I have applied several fixes to this problem so far, and I'm not sure which one solved it.

1) Gave the drive a dedicated power line. It's a big hard drive, and sharing power with another big hard drive could have caused minor fluctuations.
2) Recreated the partition table of the problem drive and formatted the drive as EXT4. I made sure to do this after every error / failure.
3) Connected the drive to a different SATA controller on my motherboard.
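One way to check whether fixes like these have held is to repeat the kind of large write that used to fail and confirm the data reads back intact. This is not something I ran as part of the steps above. It is a hedged sketch with a function name and parameters of my own choosing. Note that the read-back may be served from the page cache, so this mainly exercises the write path, which is where the WRITE FPDMA QUEUED failures showed up. A failed write would surface as an OSError.

```python
import hashlib
import os
import tempfile


def write_and_verify(directory, size_bytes, chunk_size=1024 * 1024):
    """Write random data to a temp file in `directory`, read it back, compare SHA-256."""
    written = hashlib.sha256()
    fd, path = tempfile.mkstemp(dir=directory, suffix=".bin")
    try:
        with os.fdopen(fd, "wb") as f:
            remaining = size_bytes
            while remaining > 0:
                chunk = os.urandom(min(chunk_size, remaining))
                f.write(chunk)
                written.update(chunk)
                remaining -= len(chunk)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data out to the device

        read_back = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                read_back.update(chunk)
        return written.hexdigest() == read_back.hexdigest()
    finally:
        os.remove(path)  # always clean up the test file


if __name__ == "__main__":
    # Small demo in the system temp directory. For a real test, point
    # `directory` at the drive's mount point and use a multi-gigabyte size.
    print(write_and_verify(tempfile.gettempdir(), 8 * 1024 * 1024))
```

Running it several times with sizes in the 2GB to 5GB range, while watching dmesg for new ata exceptions, would roughly mirror the copies that triggered the errors.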