Operating system: SUSE Linux 11
File system: ext3
Symptoms
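On day X an alert came in. A check showed that writes to the file system on /dev/sda1 were failing because it had gone read-only, and the IP storage was reporting an alarm. We ran umount /img right away, but after the unmount the file system could not be mounted again.
fdisk -l reported I/O errors. Rebooting the application server did not help: the mount still failed with I/O errors.
After the storage vendor had resolved the problem on the IP storage, we rebooted the application server again, and the file system still could not be mounted.
The mount command hung for a long time, but /var/log/messages showed that the system was still working through the blocks: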
Nov 2 06:04:53 linux11 kernel: [128293.578670] Buffer I/O error on device sda1, logical block 483584660
Nov 2 06:04:53 linux11 kernel: [128293.578672] lost page write due to I/O error on sda1
Nov 2 06:05:01 linux11 /usr/sbin/cron[15283]: (root) CMD ( /opt/hp/hp-health/bin/check-for-restart-requests)
Nov 2 06:05:53 linux11 kernel: [128353.584893] sd 9:0:0:0: [sda] Unhandled sense code
Nov 2 06:05:53 linux11 kernel: [128353.584898] sd 9:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Nov 2 06:05:53 linux11 kernel: [128353.584901] sd 9:0:0:0: [sda] Sense Key : Medium Error [current]
Nov 2 06:05:53 linux11 kernel: [128353.584905] sd 9:0:0:0: [sda] Add. Sense: Medium not present
Nov 2 06:05:53 linux11 kernel: [128353.584910] sd 9:0:0:0: [sda] CDB: Write(10): 2a 00 e6 97 59 5f 00 00 08 00
Nov 2 06:05:53 linux11 kernel: [128353.584916] end_request: I/O error, dev sda, sector 3868678495
Nov 2 06:05:53 linux11 kernel: [128353.584920] Buffer I/O error on device sda1, logical block 483584804
Nov 2 06:05:53 linux11 kernel: [128353.584922] lost page write due to I/O error on sda1
Nov 2 06:05:53 linux11 kernel: [128353.599875] sd 9:0:0:0: [sda] Unhandled sense code
Nov 2 06:05:53 linux11 kernel: [128353.599878] sd 9:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Nov 2 06:05:53 linux11 kernel: [128353.599880] sd 9:0:0:0: [sda] Sense Key : Medium Error [current]
Nov 2 06:05:53 linux11 kernel: [128353.599883] sd 9:0:0:0: [sda] Add. Sense: Medium not present
Nov 2 06:05:53 linux11 kernel: [128353.599886] sd 9:0:0:0: [sda] CDB: Write(10): 2a 00 e6 97 5f bf 00 00 08 00
Nov 2 06:05:53 linux11 kernel: [128353.599890] end_request: I/O error, dev sda, sector 3868680127
Nov 2 06:05:53 linux11 kernel: [128353.599893] Buffer I/O error on device sda1, logical block 483585008
Nov 2 06:05:53 linux11 kernel: [128353.599895] lost page write due to I/O error on sda1
Nov 2 06:05:53 linux11 kernel: [128353.600872] sd 9:0:0:0: [sda] Unhandled sense code
Nov 2 06:05:53 linux11 kernel: [128353.600875] sd 9:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Nov 2 06:05:53 linux11 kernel: [128353.600877] sd 9:0:0:0: [sda] Sense Key : Medium Error [current]
Nov 2 06:05:53 linux11 kernel: [128353.600879] sd 9:0:0:0: [sda] Add. Sense: Medium not present
Nov 2 06:05:53 linux11 kernel: [128353.600882] sd 9:0:0:0: [sda] CDB: Write(10): 2a 00 e6 97 62 47 00 00 08 00
Nov 2 06:05:53 linux11 kernel: [128353.600887] end_request: I/O error, dev sda, sector 3868680775
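The log entries above (highlighted in red in the original) show that the system was still working. The engineer advised us to keep waiting, and after about 20 hours the mount command finally returned: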
linux11:~ #mount /dev/sda1 /mnt/
mount: wrong fs type, bad option, bad superblock on /dev/sda1,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
linux11:~ #dmesg|tail -50
[138764.297170] sd 9:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[138764.297172] sd 9:0:0:0: [sda] Sense Key : Medium Error [current]
[138764.297175] sd 9:0:0:0: [sda] Add. Sense: Medium not present
[138764.297178] sd 9:0:0:0: [sda] CDB: Write(10): 2a 00 f2 1f f5 b7 00 00 10 00
[138764.297182] end_request: I/O error, dev sda, sector 4062180791
[138764.312193] sd 9:0:0:0: [sda] Unhandled sense code
[138764.312197] sd 9:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[138764.312199] sd 9:0:0:0: [sda] Sense Key : Medium Error [current]
[138764.312202] sd 9:0:0:0: [sda] Add. Sense: Medium not present
[138764.312204] sd 9:0:0:0: [sda] CDB: Write(10): 2a 00 f2 20 37 9f 00 00 08 00
[138764.312209] end_request: I/O error, dev sda, sector 4062197663
[138764.312224] sd 9:0:0:0: [sda] Unhandled sense code
[138764.312226] sd 9:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[138764.312228] sd 9:0:0:0: [sda] Sense Key : Medium Error [current]
[138764.312230] sd 9:0:0:0: [sda] Add. Sense: Medium not present
[138764.312233] sd 9:0:0:0: [sda] CDB: Write(10): 2a 00 f2 20 38 b7 00 00 08 00
[138764.312237] end_request: I/O error, dev sda, sector 4062197943
[138764.312242] sd 9:0:0:0: [sda] Unhandled sense code
[138764.312243] sd 9:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[138764.312245] sd 9:0:0:0: [sda] Sense Key : Medium Error [current]
[138764.312247] sd 9:0:0:0: [sda] Add. Sense: Medium not present
[138764.312250] sd 9:0:0:0: [sda] CDB: Write(10): 2a 00 f2 20 7f 87 00 00 08 00
[138764.312254] end_request: I/O error, dev sda, sector 4062216071
[138824.286688] sd 9:0:0:0: [sda] Unhandled sense code
[138824.286692] sd 9:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[138824.286696] sd 9:0:0:0: [sda] Sense Key : Medium Error [current]
[138824.286699] sd 9:0:0:0: [sda] Add. Sense: Medium not present
[138824.286704] sd 9:0:0:0: [sda] CDB: Write(10): 2a 00 f2 20 f2 bf 00 00 08 00
[138824.286710] end_request: I/O error, dev sda, sector 4062245567
[138824.286714] __ratelimit: 8 callbacks suppressed
[138824.286718] Buffer I/O error on device sda1, logical block 507780688
[138824.286719] lost page write due to I/O error on sda1
[138824.324706] sd 9:0:0:0: [sda] Unhandled sense code
[138824.324709] sd 9:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[138824.324711] sd 9:0:0:0: [sda] Sense Key : Medium Error [current]
[138824.324714] sd 9:0:0:0: [sda] Add. Sense: Medium not present
[138824.324717] sd 9:0:0:0: [sda] CDB: Write(10): 2a 00 f2 20 fa 1f 00 00 08 00
[138824.324722] end_request: I/O error, dev sda, sector 4062247455
[138824.324726] Buffer I/O error on device sda1, logical block 507780924
[138824.324727] lost page write due to I/O error on sda1
[138824.324741] sd 9:0:0:0: [sda] Unhandled sense code
[138824.324742] sd 9:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[138824.324744] sd 9:0:0:0: [sda] Sense Key : Medium Error [current]
[138824.324747] sd 9:0:0:0: [sda] Add. Sense: Medium not present
[138824.324749] sd 9:0:0:0: [sda] CDB: Write(10): 2a 00 f2 2e a1 17 00 00 08 00
[138824.324754] end_request: I/O error, dev sda, sector 4063142167
[138824.324756] Buffer I/O error on device sda1, logical block 507892763
[138824.324758] lost page write due to I/O error on sda1
[138824.324773] JBD: recovery failed
[138824.324774] EXT3-fs: error loading journal.
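The numbers in these log lines can be cross-checked against each other, which helps confirm that the failing writes all land inside sda1. The sketch below decodes the logical block address from a Write(10) CDB and converts the resulting sector into a file system block. The function names are made up for illustration, and the conversion assumes 512-byte sectors, a partition starting at sector 63, and 4 KiB ext3 blocks; none of these are stated in the logs themselves, but they reproduce the logged values.

```shell
#!/bin/sh
# Decode the LBA from a SCSI Write(10) CDB as printed by the kernel.
# Bytes 2..5 of the CDB hold the big-endian logical block address.
cdb_lba() {
    # $1..$10 are the ten hex bytes, e.g. 2a 00 e6 97 59 5f 00 00 08 00
    echo $(( 0x$3$4$5$6 ))
}

# Map a disk sector to a file system block number inside the partition.
# Assumptions: partition starts at sector 63, 512-byte sectors, 4 KiB blocks
# (8 sectors per block).
sector_to_fs_block() {
    echo $(( ($1 - 63) / 8 ))
}

lba=$(cdb_lba 2a 00 e6 97 59 5f 00 00 08 00)
echo "sector:   $lba"
echo "fs block: $(sector_to_fs_block "$lba")"
```

For the CDB logged at 06:05:53 this yields sector 3868678495 and block 483584804, the same values the kernel prints in the following end_request and Buffer I/O error lines.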
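Repair plan
The engineer's initial diagnosis was a damaged superblock, and the following repair plan was drawn up:
1. Use dd to copy the original /dev/sda1 partition to another partition. The original partition is 2 TB, so a new space slightly larger than 2 TB was allocated on the IP storage and attached to the application server as the backup target.
2. Once the backup is complete, repair the file system with fsck.ext3.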
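If the primary superblock really is damaged, fsck.ext3 is usually pointed at one of the backup copies with its -b option. The sketch below lists where those backups would normally sit. It is only an estimate under stated assumptions: 4 KiB blocks, 32768 blocks per group, and the sparse_super feature (backups in group 1 and in groups that are powers of 3, 5, and 7). The function name and the block count passed to it are illustrative; the count is the sda1 size (2097150257664 bytes) divided by 4096.

```shell
#!/bin/sh
# List candidate backup superblock locations for an ext2/ext3 file system.
# Assumptions: 4 KiB blocks, 32768 blocks per group, sparse_super enabled.
backup_superblocks() {
    total_blocks=$1
    per_group=32768
    # number of block groups, rounded up
    groups=$(( (total_blocks + per_group - 1) / per_group ))
    for base in 3 5 7; do
        g=1
        while [ "$g" -lt "$groups" ]; do
            # group 1 is a power of every base; print it only once
            if [ "$g" -ne 1 ] || [ "$base" -eq 3 ]; then
                echo $(( g * per_group ))
            fi
            g=$(( g * base ))
        done
    done | sort -n
}

# 511999574 = assumed size of sda1 in 4 KiB blocks
backup_superblocks 511999574 | head -5
```

A candidate from the list could then be tried as, for example, fsck.ext3 -b 32768 /dev/sdX1, where /dev/sdX1 is a placeholder for the partition being repaired.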
I. Data backup
Create the new partition /dev/sdb1
linux11:/var/log #fdisk /dev/sdb
The number of cylinders for this disk is set to 267075.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
(e.g., DOS FDISK, OS/2 FDISK)
Command (m for help): n
Command action
e extended
p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-267075, default 1):
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-267075, default 267075):
Using default value 267075
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.
linux11:/var/log # fdisk -l
Disk /dev/cciss/c0d0: 300.0 GB, 299966445568 bytes
255 heads, 63 sectors/track, 36468 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x000bf615
Device Boot Start End Blocks Id System
/dev/cciss/c0d0p1 * 1 38 305203+ 83 Linux
/dev/cciss/c0d0p2 39 4215 33551752+ 82 Linux swap / Solaris
/dev/cciss/c0d0p3 4216 36468 259072222+ 83 Linux
Disk /dev/sda: 2097.2 GB, 2097152000000 bytes
255 heads, 63 sectors/track, 254964 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x000a13a0
Device Boot Start End Blocks Id System
/dev/sda1 1 254964 2047998298+ 83 Linux
Disk /dev/sdb: 2196.8 GB, 2196766720000 bytes
255 heads, 63 sectors/track, 267075 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x24828d3f
Device Boot Start End Blocks Id System
/dev/sdb1 1 267075 2145279906 83 Linux
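Before copying a raw partition it is worth confirming that the target is at least as large as the source. The sketch below recomputes both partition sizes from the cylinder counts in the fdisk output. It assumes the geometry fdisk prints (255 heads x 63 sectors = 16065 sectors per cylinder, 512-byte sectors) and that each partition starts at sector 63, which is not shown in the listing; the function name is illustrative.

```shell
#!/bin/sh
# Recompute partition sizes from the cylinder counts fdisk printed.
# Assumptions: 16065 sectors per cylinder, 512-byte sectors,
# and the first partition starting at sector 63.
part_bytes() {
    cylinders=$1
    echo $(( (cylinders * 16065 - 63) * 512 ))
}

src=$(part_bytes 254964)   # /dev/sda1
dst=$(part_bytes 267075)   # /dev/sdb1
echo "sda1: $src bytes"
echo "sdb1: $dst bytes"

if [ "$dst" -ge "$src" ]; then
    echo "target is large enough, spare: $(( (dst - src) / 1000000000 )) GB"
else
    echo "target is too small"
fi
```

Under these assumptions the sdb1 result divided by 1024 equals the 2145279906 blocks fdisk lists, and the sda1 result is 2097150257664 bytes.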
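Note: we first tried to format the new partition with mkfs. Because the file system is 2 TB, formatting took a very long time and we eventually cancelled it. Be aware that kill does not end the process quickly either; we could only wait. We then re-allocated the storage space and partitioned it again, this time without formatting.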
Start the data backup
dd if=/dev/sda1 of=/dev/sdb1 bs=8M
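At first bs was not specified, so dd used its default block size of only 512 bytes. After about 30 hours the measured throughput was only about 1 MB/s, so we interrupted the copy and restarted it with bs=8M.
The stat package (presumably sysstat) was not installed on the application server, so here is another way to measure the copy speed, by tracing the running dd process (PID 11929 in this case):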
>strace.log
time strace -o strace.log -p 11929
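Let it run for a while, then stop it with Ctrl+C.
Count the number of write calls:
grep -c write strace.log
echo "COUNT*8/SECONDS" | bc
COUNT is the number printed by grep, SECONDS is the elapsed (real) time printed by time, and 8 is the size of each write in MB (bs=8M). The result is the estimated copy speed in MB/s.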
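The estimate can be wrapped in a small helper. This is a sketch only: the function name and the sample log are made up for illustration, it uses shell integer arithmetic instead of bc, and it assumes the log was written by strace -o without -f, so every line starts with the syscall name.

```shell
#!/bin/sh
# Estimate dd throughput from a strace sample.
# $1 = strace log file, $2 = MB per write (8 for bs=8M), $3 = sampled seconds
estimate_mbps() {
    # count only lines that are write() calls
    count=$(grep -c '^write(' "$1")
    echo $(( count * $2 / $3 ))
}

# Hypothetical sample: a tiny fake log standing in for real strace output
log=$(mktemp)
printf 'read(0, "", 8388608) = 8388608\nwrite(1, "", 8388608) = 8388608\n' > "$log"
printf 'read(0, "", 8388608) = 8388608\nwrite(1, "", 8388608) = 8388608\n' >> "$log"
echo "estimated speed: $(estimate_mbps "$log" 8 1) MB/s"
rm -f "$log"
```

Where GNU dd is in use, sending it SIGUSR1 (kill -USR1 PID) makes it print its current statistics, which may be a simpler check than tracing.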
The backup finished roughly 30 hours later (dd itself reports 130468 s, which is about 36 hours):
249999+1 records in
249999+1 records out
2097150257664 bytes (2.1 TB) copied, 130468 s, 16.1 MB/s