So I have this nice new ABMX box running FC5 as my main server. It is a great setup, and it is small (1U). The unfortunate part is that the RAID array keeps degrading. This box has a 3ware 7006-2 controller running two WDC WD3200JB hard drives. At first everything was fine. After two or three days, I got a notice that I had a drive error (0x0F:0x000A) and that the RAID 1 array was now a single drive. I went into the nice 3DM2 and started a rebuild. It worked fine and ran for about two more days as a RAID system. It then failed again. This time it had a controller error (0x0F:0x0004).
At this point, I contacted 3ware's technical support. They were fast to respond to email, and they recommended I try different cables. I gave that a try (using 3ware-certified cables, even), but the problem was the same. I emailed them again, and they recommended I try swapping the drives to see whether the error moved or the same drive still had issues. At this point I got a little frustrated and ordered a new WD3200JB drive. I figure that if the drive in the box is bad, I can replace it with the new one and RMA the old one. If the drive isn't bad, I now have a new drive for backups on my main computer.
The hard drive arrived today, so I forwarded it on to my colocation (hi Greg!). We shall see how it works. If the controller turns out to be bad, I hope 3ware is good about replacements.
I also sent an email to ABMX tech support. The next day they gave me a phone call at work. I was impressed and pleasantly surprised. The person I talked to said ABMX would be happy to help if I was unsuccessful in dealing with 3ware and WD directly.
I will let you know what happens. For the record, here is my error transcript:
### 3DM Version: 2.04.00.014
### Time Stamp: 20:38.15 30-May-2006
### Host Name: sphinx.grussling.com
### OS Version: Linux 2.6.16-1.2096_FC5
### Driver Version: 1.26.02.001
### Controller ID: 0
### Model: 7006-2
### Firmware: FE7X 1.05.00.068
### BIOS: BE7X 1.08.00.048
### Serial #: L14804A6050317
### Memory: 512 kB
### BEGIN Firmware Print Log
3ware DiskSwitch 2/4/8/12
FE7X 1.05.00.068 19-May-04
Model No. : 7006-2
Bios BE7X 1.08.00.048
(c) 1997 - 2003 3ware
Achip version # 03.20
Achip version #
Achip version #
Checking Pchip Version
(Will Hang if Incorrect)...
Pchip version # 01.30-66
.5MB Sbuf
Segments :06
Sbuf memory test...
.5MB Sbuf
OK
Alloc rnd :
bkgrnd tasks stopped
waiting for disks ready...
Spinup check:
Aport 00
Aport 01
disks ready.
Drive 00: UDMA100 WDC WD3200JB-22KFA0
Drive 01: UDMA100 WDC WD3200JB-22KFA0
READY
Unit 00: Degraded TwinStor[0:1x] of a CBOD[0] and a CBOD[1]
AEN sent to host: 0002
<< SOFT reset: count = 0001 >>
Time: 0000C453 msec
3ware DiskSwitch 2/4/8/12
FE7X 1.05.00.068 19-May-04
Model No. : 7006-2
Bios BE7X 1.08.00.048
(c) 1997 - 2003 3ware
Achip version # 03.20
Achip version #
Achip version #
Checking Pchip Version
(Will Hang if Incorrect)...
Pchip version # 01.30-66
Alloc rnd :
bkgrnd tasks stopped
waiting for disks ready...
Spinup check:
Aport 00
Aport 01
disks ready.
Drive 00: UDMA100 WDC WD3200JB-22KFA0
Drive 01: UDMA100 WDC WD3200JB-22KFA0
READY
Unit 00: Degraded TwinStor[0:1x] of a CBOD[0] and a CBOD[1]
AEN sent to host: 0002
AEN sent to host: 0001
bkgrnd tasks stopped
Drive 00: UDMA100 WDC WD3200JB-22KFA0
Unit 00: Incomplete Degraded TwinStor[0:Fx] of a CBOD[0] and a CBOD[1]
Drive 00: UDMA100 WDC WD3200JB-22KFA0
Drive 01: UDMA100 WDC WD3200JB-22KFA0
---------Error---------
Status: 00C4
Code: 0031
Time: 0002D74F msec
C
Unit 00: Incomplete Degraded TwinStor[0:Fx] of a CBOD[0] and a CBOD[1]
Unit 01: CBOD[1]
AEN sent to host: 000B
Rebuilding Unit 00
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
20 21
TFR Out 00 40 40 AD 62 E0 35
Aport timeout 01 003E2150 D278
TFR In 04 83 45 00 00 A0 51
Reset drive ...
TFR In 01 01 01 00 00 A0 50
TFR In 01 01 01 00 00 A0 50
AEN sent to host: 010A
AEN sent to host: 0004
bkgrnd tasks stopped
Unit 00: Degraded TwinStor[0:1x] of a CBOD[0] and a CBOD[1]
Drive 00: UDMA100 WDC WD3200JB-22KFA0
Unit 00: Incomplete Degraded TwinStor[0:Fx] of a CBOD[0] and a CBOD[1]
Drive 00: UDMA100 WDC WD3200JB-22KFA0
Drive 01: UDMA100 WDC WD3200JB-22KFA0
---------Error---------
Status: 00C4
Code: 0031
Time: 021DE8F4 msec
C
Unit 00: Incomplete Degraded TwinStor[0:Fx] of a CBOD[0] and a CBOD[1]
Unit 01: CBOD[1]
AEN sent to host: 000B
Rebuilding Unit 00
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
20 21 22 23 24 25
TFR Out 00 40 C0 E3 5C E0 35
Aport timeout 01 027B47E0 F8F8
TFR In 04 83 45 00 00 A0 51
Reset drive ...
TFR In 01 01 01 00 00 A0 50
TFR In 01 01 01 00 00 A0 50
AEN sent to host: 010A
AEN sent to host: 0004
bkgrnd tasks stopped
Unit 00: Degraded TwinStor[0:1x] of a CBOD[0] and a CBOD[1]
--------- SMART Info for last 24 hrs, Day 0001 ---------
0001 soft resets were received.
Aport 01 had 0002 timeouts reading.
---------
--------- SMART Info for last 24 hrs, Day 0002 ---------
0000 soft resets were received.
No timeouts occured on any Aport.
---------
--------- SMART Info for last 24 hrs, Day 0003 ---------
0000 soft resets were received.
No timeouts occured on any Aport.
---------
--------- SMART Info for last 24 hrs, Day 0004 ---------
0000 soft resets were received.
No timeouts occured on any Aport.
---------
--------- SMART Info for last 24 hrs, Day 0005 ---------
0000 soft resets were received.
No timeouts occured on any Aport.
---------
--------- SMART Info for last 24 hrs, Day 0006 ---------
0000 soft resets were received.
No timeouts occured on any Aport.
---------
Drive 00: UDMA100 WDC WD3200JB-22KFA0
Unit 00: Incomplete Degraded TwinStor[0:Fx] of a CBOD[0] and a CBOD[1]
Drive 00: UDMA100 WDC WD3200JB-22KFA0
Drive 01: UDMA100 WDC WD3200JB-22KFA0
---------Error---------
Status: 00C4
Code: 0031
Time: 20613E3E msec
C
Unit 00: Incomplete Degraded TwinStor[0:Fx] of a CBOD[0] and a CBOD[1]
Unit 01: CBOD[1]
AEN sent to host: 000B
Rebuilding Unit 00
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
30 31 32 33 34 35 36 37 38 39 3A 3B
TFR Out 00 40 C0 A1 D2 E0 35
Aport timeout 01 20DFD818 8852
TFR In 04 83 45 00 00 A0 51
Reset drive ...
TFR In 01 01 01 00 00 A0 50
TFR In 01 01 01 00 00 A0 50
AEN sent to host: 010A
AEN sent to host: 0004
bkgrnd tasks stopped
Unit 00: Degraded TwinStor[0:1x] of a CBOD[0] and a CBOD[1]
### END Firmware Print Log
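A transcript like this is easier to read once the repeated events are tallied. Here is a minimal Python sketch that counts the "AEN sent to host" codes and the "Aport timeout" lines per port, and decodes the "Time:" stamps. The function name and the regular expressions are my own, based only on the lines in the transcript above and not on any 3ware documentation. Reading the port number and the "Time:" value as hexadecimal is an assumption.

```python
import re
from collections import Counter

# Patterns are guesses based on the lines in this transcript only.
AEN_RE = re.compile(r"AEN sent to host: ([0-9A-Fa-f]{4})$")
TIMEOUT_RE = re.compile(r"Aport timeout ([0-9A-Fa-f]{2}) ")
TIME_RE = re.compile(r"Time: ([0-9A-Fa-f]{8}) msec$")

def summarize_firmware_log(text):
    """Return (AEN code counts, timeouts per port, 'Time:' values in ms)."""
    aen_codes = Counter()
    timeouts_by_port = Counter()
    times_ms = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        m = AEN_RE.match(line)
        if m:
            aen_codes[m.group(1).upper()] += 1
            continue
        m = TIMEOUT_RE.match(line)
        if m:
            # Assumption: the port number is printed in hex.
            timeouts_by_port[int(m.group(1), 16)] += 1
            continue
        m = TIME_RE.match(line)
        if m:
            # Assumption: the stamp is a hex count of milliseconds.
            times_ms.append(int(m.group(1), 16))
    return aen_codes, timeouts_by_port, times_ms

# A few lines copied from the transcript above.
sample = """\
Time: 0002D74F msec
AEN sent to host: 000B
Rebuilding Unit 00
TFR Out 00 40 40 AD 62 E0 35
Aport timeout 01 003E2150 D278
AEN sent to host: 010A
AEN sent to host: 0004
"""

aen_codes, timeouts_by_port, times_ms = summarize_firmware_log(sample)
print(dict(aen_codes))
print(dict(timeouts_by_port))
print(times_ms)
```

In the full transcript, all three "Aport timeout" lines are on Aport 01, and each one comes partway through a rebuild. If the hex-milliseconds reading is right, the last error stamp (20613E3E) works out to roughly 6.3 days, which would line up with the six days of SMART entries.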
Comments
New Drive Installed
I installed the new hard drive. I started to rebuild the RAID array
and received the following notifications:
20060602113142 - Controller 0
ERROR - (0x0F:0x0006): Incomplete unit detected: Unit #0
20060602113144 - Controller 0
ERROR - (0x0F:0x0002): Unit degraded: Unit #0
20060602202544 - Controller 0
ERROR - (0x0F:0x0005): Rebuild failed: Unit #0
I was depressed because it appeared the new drive was no help, but
when I logged into 3DM2, it said the array was OK and that all drives
were functioning fine. I don't know who to believe!
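The notification timestamps look like YYYYMMDDHHMMSS. Assuming that is the format (my reading, not something 3DM2 told me), this little Python sketch works out how long the rebuild ran between the "Unit degraded" notice and the "Rebuild failed" notice. The helper name parse_stamp is just for illustration.

```python
from datetime import datetime

def parse_stamp(stamp):
    """Parse a notification stamp, assuming the layout YYYYMMDDHHMMSS."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S")

# Stamps copied from the notifications above.
degraded = parse_stamp("20060602113144")  # Unit degraded
failed = parse_stamp("20060602202544")    # Rebuild failed

elapsed = failed - degraded
print(elapsed)  # 8:54:00
```

So the rebuild appears to have run for almost nine hours before the failure notice was sent.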
Still working
It's been over a month, and the RAID array is still "OK". I don't understand.
To really muddy the waters, I took the "bad" drive that was causing all the problems and installed it in my machine at home. I formatted it and ran all the WD diagnostic tools on it. They found no problems. I then formatted it with mke2fs using "-c -c", which runs the slower read-write bad-block test instead of the quick read-only one. No problems were found. I have been running it in my box for the last month, and it has worked great. I don't understand :-(
Another Update
It has been over 3 months, and the "bad" drive and the RAID array are still working fine. I don't know what was/is up. I no longer get RAID errors (3DM2 still shows OK), and the "bad" drive is still running in my main computer with no problems. Both of the hard drives have the same model number. Maybe there was a drive firmware update?