Hi,
I'm getting massive filesystem corruption on an md RAID comprising 4
SATA disks. I tried ext3, xfs and reiserfs on RAID level 5 as well as
ext3 on RAID level 1 (using only 2 disks); all can be crashed reliably
by running bonnie++ for just a few minutes. In the case of ext3, I
usually get dmesg output like this:
[...]
md0: rw=1, want=1482184800, limit=490223232
attempt to access beyond end of device
md0: rw=1, want=1482184800, limit=490223232
attempt to access beyond end of device
md0: rw=1, want=1482184800, limit=490223232
Buffer I/O error on device md0, logical block 185273099
lost page write due to I/O error on md0
Aborting journal on device md0.
EXT3-fs error (device md0) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device md0) in ext3_dirty_inode: Journal has aborted
EXT3-fs error (device md0) in ext3_new_blocks: Journal has aborted
ext3_abort called.
EXT3-fs error (device md0): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
The filesystems are impossible to repair afterwards; e2fsck in
particular will run for ages and eventually segfault.
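As a sanity check on the numbers in the log above, here is a minimal Python sketch that parses one of the "want/limit" lines and compares it with the logical block from the "Buffer I/O error" line. It assumes that want and limit are counted in 512-byte sectors and that the filesystem uses 4 KiB blocks; neither is stated in the log, and the helper name parse_beyond_end is made up for illustration.

```python
import re

# Matches kernel lines such as "md0: rw=1, want=1482184800, limit=490223232".
LINE_RE = re.compile(
    r"(?P<dev>\w+): rw=(?P<rw>\d+), want=(?P<want>\d+), limit=(?P<limit>\d+)"
)

SECTOR_BYTES = 512       # assumption: want/limit are in 512-byte sectors
FS_BLOCK_BYTES = 4096    # assumption: ext3 was created with 4 KiB blocks


def parse_beyond_end(line):
    """Return the fields of a 'want/limit' line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    want = int(m.group("want"))
    limit = int(m.group("limit"))
    return {
        "dev": m.group("dev"),
        "rw": int(m.group("rw")),
        "want": want,
        "limit": limit,
        "overshoot_sectors": want - limit,
    }


info = parse_beyond_end("md0: rw=1, want=1482184800, limit=490223232")

# Logical block taken from "Buffer I/O error on device md0, logical block 185273099".
logical_block = 185273099
sectors_per_block = FS_BLOCK_BYTES // SECTOR_BYTES
block_end_sector = (logical_block + 1) * sectors_per_block

print("device size in GiB:", info["limit"] * SECTOR_BYTES / 2**30)
print("sectors past the end:", info["overshoot_sectors"])
print("want / limit ratio:", info["want"] / info["limit"])
print("block end matches want:", block_end_sector == info["want"])
```

Under those assumptions the failing logical block ends exactly at the want sector, so the write is aimed at a block roughly three times past the end of md0 rather than just over the boundary.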
By contrast, ext3 directly on the physical disk partition works fine and
withstood days of continuous bonnie++ runs.
This is with Etch, kernel 2.6.18-4-686-bigmem. FWIW, the machine used to
run Sarge with a 2.4 kernel, where the RAID worked fine.
Now, it seems quite unlikely that RAID is completely broken in 2.6, so I
suppose it might be related to the hardware: it's a Pentium 4 @ 2.8 GHz,
1.5 GiB RAM, the SATA Controller is a Promise S150 SX4 using the
sata_sx4 kernel module.
Any ideas on this?
--
Filesystem corruption on md (Software) RAID
Quoting Sebastian Flothow :
> Hi,
>
> I'm getting massive filesystem corruption on an md RAID comprising 4
> SATA disks. I tried ext3, xfs and reiserfs on RAID level 5 as well as
> ext3 on RAID level 1 (using only 2 disks); all can be crashed reliably
> by running bonnie++ for just a few minutes. In the case of ext3, I
> usually get dmesg output like this:
>
> [...]
> md0: rw=1, want=1482184800, limit=490223232
> attempt to access beyond end of device
> md0: rw=1, want=1482184800, limit=490223232
> attempt to access beyond end of device
> md0: rw=1, want=1482184800, limit=490223232
> Buffer I/O error on device md0, logical block 185273099
> lost page write due to I/O error on md0
> Aborting journal on device md0.
> EXT3-fs error (device md0) in ext3_reserve_inode_write: Journal has aborted
> EXT3-fs error (device md0) in ext3_dirty_inode: Journal has aborted
> EXT3-fs error (device md0) in ext3_new_blocks: Journal has aborted
> ext3_abort called.
> EXT3-fs error (device md0): ext3_journal_start_sb: Detected aborted journal
> Remounting filesystem read-only
>
> The filesystems are impossible to repair afterwards; e2fsck in
> particular will run for ages and eventually segfault.
>
> By contrast, ext3 directly on the physical disk partition works fine and
> withstood days of continuous bonnie++ runs.
>
> This is with Etch, kernel 2.6.18-4-686-bigmem. FWIW, the machine used to
> run Sarge with a 2.4 kernel, where the RAID worked fine.
>
> Now, it seems quite unlikely that RAID is completely broken in 2.6, so I
> suppose it might be related to the hardware: it's a Pentium 4 @ 2.8 GHz,
> 1.5 GiB RAM, the SATA Controller is a Promise S150 SX4 using the
> sata_sx4 kernel module.
>
>
Definitely sounds like the hardware is failing.
You could try installing smartmontools onto your system and use it
to scan your drives. It might tell you if you have some bad sectors, or some
other failing component.
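A rough sketch of what such a scan could look like, driven from Python: it runs smartctl -H (overall health) and smartctl -A (vendor attributes such as reallocated sectors) for each disk. The device names /dev/sda and /dev/sdb are only placeholders, the helper names are made up, and the commands normally need root; disks behind some controllers may also need extra smartctl options that are not shown here.

```python
import shutil
import subprocess


def smartctl_commands(devices):
    """Build the smartctl invocations for each device (health, then attributes)."""
    cmds = []
    for dev in devices:
        cmds.append(["smartctl", "-H", dev])  # overall health self-assessment
        cmds.append(["smartctl", "-A", dev])  # vendor-specific SMART attributes
    return cmds


def run_checks(devices):
    """Run the commands and collect (command, exit code, output) tuples."""
    if shutil.which("smartctl") is None:
        print("smartctl not found; install smartmontools first")
        return []
    results = []
    for cmd in smartctl_commands(devices):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results.append((cmd, proc.returncode, proc.stdout))
    return results


if __name__ == "__main__":
    # Placeholder device names; replace with the members of the array.
    for cmd, rc, out in run_checks(["/dev/sda", "/dev/sdb"]):
        print(" ".join(cmd), "-> exit code", rc)
        print(out)
```

A non-zero exit code or a growing reallocated/pending sector count would point at a disk; clean output on all disks would point elsewhere.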
Also, try not using the bigmem kernel. AFAIK, it's designed for 32-bit
systems with more than 4 GB of RAM. (Although I would guess that
shouldn't make a difference.)
Filesystem corruption on md (Software) RAID
michael@estone.ca wrote:
> Definitely sounds like the hardware is failing.
> You could try installing smartmontools onto your system and use it
> to scan your drives. It might tell you if you have some bad sectors, or
> some other failing component.
The hardware is fine - I checked the SMART status, did a full read/write
test with badblocks on md0, and in fact the very same hardware and RAID
setup worked fine for the past year, using a 2.4 kernel, with high
filesystem load every day.
It's just when I put a filesystem on top of an md device that things
break - my assumption is that there is a race in the kernel involving
sata_sx4 and the md modules. Given that the Promise SX4 is not really a
shining piece of hardware, and not that popular, I wouldn't be surprised
if the driver is a bit flaky too.
Anyway, we've decided to replace it with a real RAID controller; that
should sort things out.
--
Re: Filesystem corruption on md (Software) RAID
Hello guys, I have a similar problem after reinstalling the FTP/Samba server, moving it from Fedora to Debian Etch with the default 2.6.18 kernel.
/var/log/syslog and dmesg say:
Mar 14 22:13:41 fiber-ftp2 kernel: sda1: rw=0, want=12478208576, limit=3750840072
Mar 14 22:13:41 fiber-ftp2 kernel: attempt to access beyond end of device
Mar 14 22:13:41 fiber-ftp2 kernel: sda1: rw=0, want=21000039256, limit=3750840072
Mar 14 22:13:41 fiber-ftp2 kernel: attempt to access beyond end of device
Mar 14 22:13:41 fiber-ftp2 kernel: sda1: rw=0, want=9143591032, limit=3750840072
Mar 14 22:13:41 fiber-ftp2 kernel: attempt to access beyond end of device
Mar 14 22:13:41 fiber-ftp2 kernel: sda1: rw=0, want=6060777040, limit=3750840072
Mar 14 22:13:41 fiber-ftp2 kernel: attempt to access beyond end of device
Mar 14 22:13:41 fiber-ftp2 kernel: sda1: rw=0, want=9143591032, limit=3750840072
Mar 14 22:13:41 fiber-ftp2 kernel: attempt to access beyond end of device
Mar 14 22:13:41 fiber-ftp2 kernel: sda1: rw=0, want=6060777040, limit=3750840072
Mar 14 22:13:41 fiber-ftp2 kernel: attempt to access beyond end of device
Mar 14 22:13:41 fiber-ftp2 kernel: sda1: rw=0, want=9143591032, limit=3750840072
Mar 14 22:13:41 fiber-ftp2 kernel: attempt to access beyond end of device
Mar 14 22:13:41 fiber-ftp2 kernel: sda1: rw=0, want=6060777040, limit=3750840072
Mar 14 22:13:41 fiber-ftp2 kernel: attempt to access beyond end of device
and so on ... the whole log is full
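With a log this repetitive it can help to see which sectors are actually being requested and how often. A minimal Python sketch, using a few of the lines above as sample input; the function name summarise is made up, and the conversion to TiB assumes that limit is counted in 512-byte sectors.

```python
import re
from collections import Counter

# A few lines copied from the syslog excerpt above.
SAMPLE = """\
Mar 14 22:13:41 fiber-ftp2 kernel: sda1: rw=0, want=12478208576, limit=3750840072
Mar 14 22:13:41 fiber-ftp2 kernel: sda1: rw=0, want=21000039256, limit=3750840072
Mar 14 22:13:41 fiber-ftp2 kernel: sda1: rw=0, want=9143591032, limit=3750840072
Mar 14 22:13:41 fiber-ftp2 kernel: sda1: rw=0, want=6060777040, limit=3750840072
Mar 14 22:13:41 fiber-ftp2 kernel: sda1: rw=0, want=9143591032, limit=3750840072
"""

PATTERN = re.compile(r"(\w+): rw=(\d+), want=(\d+), limit=(\d+)")


def summarise(log_text):
    """Count how often each 'want' value occurs and collect the 'limit' values seen."""
    wants = Counter()
    limits = set()
    for m in PATTERN.finditer(log_text):
        wants[int(m.group(3))] += 1
        limits.add(int(m.group(4)))
    return wants, limits


wants, limits = summarise(SAMPLE)
for want, count in wants.most_common():
    print(f"want={want} seen {count} time(s)")
for limit in limits:
    # Assumption: 512-byte sectors.
    print(f"limit={limit} sectors, about {limit * 512 / 2**40:.2f} TiB")
```

On the real file, summarise(open("/var/log/syslog").read()) would show whether the same handful of out-of-range requests is being retried over and over, which is what the excerpt suggests.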
I haven't seen any system crashes so far and the server is working fine, but something serious may still come up :/
Hardware: Pentium 4 @ 3.0 GHz, 1 GB RAM, hardware 3ware RAID controller, RAID 0 (sda1), 1.7 TB
Any idea what could cause this?
In the meantime I'll run fsck to check whether there are any problems with the disks.
thanks!
The Answer Is Not In The Box, It's In The Band