我们的文章会在微信公众号IT民工的龙马人生和博客网站( www.htz.pw )同步更新 ,欢迎关注收藏,也欢迎大家转载,但是请在文章开始地方标注文章出处,谢谢!
由于博客中有大量代码,通过页面浏览效果更佳。
最近半个月遇到有两个客户的Oracle Exadata一体机出现物理磁盘的损坏,一个客户是机械磁盘、一个客户是FLASH磁盘。很巧的是这两个客户他们的日常运维过程中都是只看物理服务器的故障信号灯。但是在一体机环境中其实这远远不够的,就如今天我们分享的这个案例一样,一台存储节点故障灯并没有亮,但是磁盘已经出现了大量的坏块,导致最后重平衡失败。
针对Oracle一体机环境在硬件巡检时,有几个维度必须检查到:- 1,所有节点的messages日志
- 2,存储节点cell的alert日志
- 3,cellcli中的alerthistory日志
- 4,asm的alert日志
- 5,ilom日志
- 6,最后才服务器信号灯
复制代码 故障概述
2025年6月19日,值班人员在机房巡检时发现Exadata一体机存储节点亮黄灯,随即通知数据库工程师。经排查,确认存储6节点(CD_05_HTZHTZHADM06)出现故障,影响了部分业务磁盘的正常使用。
故障现象
- 存储节点 HTZHTZHADM06 的物理磁盘出现异常,指示灯报警。
- 通过 CellCLI 工具查询,发现相关物理磁盘、celldisk、griddisk 状态异常,部分磁盘状态为 proactive failure。
- ASM 日志显示,业务 I/O 被转移到其他联机伙伴磁盘,rebalance 过程中,其他节点磁盘也出现坏块计数增加,rebalance 长时间无法完成。
- 数据库层面,部分 ASM 磁盘未能自动 drop,rebalance 状态长时间处于 WAIT 或 RUN,无法顺利结束。
故障原因分析
- 硬盘物理故障
存储节点6的物理磁盘(/dev/sdf,252:5)出现硬件故障,CellCLI 查询状态为 proactive failure,errorCount 累计至 46 次。
- 磁盘组冗余受损
故障磁盘droping过程中触发磁盘组重平衡过程,遇到节点磁盘出现坏块,使得重平衡失败。
- 自动修复受阻
由于部分磁盘健康状况不佳,导致相关 griddisk 的 asmdeactivationoutcome 状态为“Cannot deactivate because partner disk ... has poor health”,影响了自动修复和冗余恢复流程。
处理过程
确定损坏磁盘的信息
1,通过list physicaldisk 查看状态标志包含failure的所有磁盘
- CellCLI> list physicaldisk attributes all
- 252:0 22 /dev/sda HardDisk 252 0 0_0 "HGST H1231A823SUN010T" A680 2018-06-27T12:43:40+08:00 sas R6WSUN 8.91015625T 0 normal
- 252:1 23 /dev/sdb HardDisk 252 0 0_1 "HGST H1231A823SUN010T" A680 2018-06-27T12:43:40+08:00 sas R74S6N 8.91015625T 1 normal
- 252:2 28 /dev/sdc HardDisk 252 0 0_2 "HGST H1231A823SUN010T" A680 2024-09-12T10:24:58+08:00 sas R52JHK 8.91015625T 2 normal
- 252:3 20 /dev/sdd HardDisk 252 0 0_3 "HGST H1231A823SUN010T" A680 2018-06-27T12:43:40+08:00 sas R75DLN 8.91015625T 3 normal
- 252:4 26 /dev/sde HardDisk 252 0 0_4 "HGST H1231A823SUN010T" A680 2018-06-27T12:43:40+08:00 sas R63KJN 8.91015625T 4 normal
- 252:5 27 /dev/sdf HardDisk 252 0 0_5 "HGST H1231A823SUN010T" A680 2018-06-27T12:43:40+08:00 sas R714UN 8.91015625T 5 normal
- 252:6 25 /dev/sdg HardDisk 252 0 0_6 "HGST H1231A823SUN010T" A680 2018-06-27T12:43:40+08:00 sas R6SLBN 8.91015625T 6 normal
- 252:7 24 /dev/sdh HardDisk 252 1106 0_7 "HGST H1231A823SUN010T" A680 2018-06-27T12:43:40+08:00 sas R7D8GN 8.91015625T 7 normal
- 252:8 19 /dev/sdi HardDisk 252 0 0_8 "HGST H1231A823SUN010T" A680 2018-06-27T12:43:40+08:00 sas R6W8AN 8.91015625T 8 normal
- 252:9 18 /dev/sdj HardDisk 252 0 0_9 "HGST H1231A823SUN010T" A680 2018-06-27T12:43:40+08:00 sas R6Y44N 8.91015625T 9 normal
- 252:10 17 /dev/sdk HardDisk 252 0 0_10 "HGST H1231A823SUN010T" A680 2018-06-27T12:43:40+08:00 sas R7D4RN 8.91015625T 10 normal
- 252:11 16 /dev/sdl HardDisk 252 0 0_11 "HGST H1231A823SUN010T" A680 2018-06-27T12:43:40+08:00 sas R7G62N 8.91015625T 11 normal
- FLASH_10_1 /dev/nvme2n1 FlashDisk 10_0 "Oracle Flash Accelerator F640 PCIe Card" QDV1RF35 2018-06-27T12:44:13+08:00 PHLE805600PD6P4BGN-1 2.910957656800746917724609375T "PCI Slot: 10; FDOM: 1" normal
- FLASH_10_2 /dev/nvme3n1 FlashDisk 10_0 "Oracle Flash Accelerator F640 PCIe Card" QDV1RF35 2018-06-27T12:44:13+08:00 PHLE805600PD6P4BGN-2 2.910957656800746917724609375T "PCI Slot: 10; FDOM: 2" normal
- FLASH_4_1 /dev/nvme4n1 FlashDisk 4_0 "Oracle Flash Accelerator F640 PCIe Card" QDV1RF35 2018-06-27T12:44:13+08:00 PHLE8055008T6P4BGN-1 2.910957656800746917724609375T "PCI Slot: 4; FDOM: 1" normal
- FLASH_4_2 /dev/nvme5n1 FlashDisk 4_0 "Oracle Flash Accelerator F640 PCIe Card" QDV1RF35 2018-06-27T12:44:13+08:00 PHLE8055008T6P4BGN-2 2.910957656800746917724609375T "PCI Slot: 4; FDOM: 2" normal
- FLASH_5_1 /dev/nvme6n1 FlashDisk 5_0 "Oracle Flash Accelerator F640 PCIe Card" QDV1RF35 2018-06-27T12:44:13+08:00 PHLE805600P76P4BGN-1 2.910957656800746917724609375T "PCI Slot: 5; FDOM: 1" normal
- FLASH_5_2 /dev/nvme7n1 FlashDisk 5_0 "Oracle Flash Accelerator F640 PCIe Card" QDV1RF35 2019-04-21T10:13:44+08:00 PHLE805600P76P4BGN-2 2.910957656800746917724609375T "PCI Slot: 5; FDOM: 2" normal
- FLASH_6_1 /dev/nvme0n1 FlashDisk 6_0 "Oracle Flash Accelerator F640 PCIe Card" QDV1RF35 2018-06-27T12:44:13+08:00 PHLE8056009A6P4BGN-1 2.910957656800746917724609375T "PCI Slot: 6; FDOM: 1" normal
- FLASH_6_2 /dev/nvme1n1 FlashDisk 6_0 "Oracle Flash Accelerator F640 PCIe Card" QDV1RF35 2019-04-22T09:36:12+08:00 PHLE8056009A6P4BGN-2 2.910957656800746917724609375T "PCI Slot: 6; FDOM: 2" normal
- M2_SYS_0 /dev/sdm M2Disk "INTEL SSDSCKJB150G7" N2010121 2018-06-27T12:44:19+08:00 PHDW802004L8150A 139.73558807373046875G "M.2 Slot: 0" normal
- M2_SYS_1 /dev/sdn M2Disk "INTEL SSDSCKJB150G7" N2010121 2018-06-27T12:44:19+08:00 PHDW802004ZT150A 139.73558807373046875G "M.2 Slot: 1" normal
复制代码 2,通过list celldisk查看状态
- CD_00_HTZHTZHADM01 2018-06-27T17:33:49+08:00 /dev/sda /dev/sda HardDisk 0 0 e432c76d-f0d4-46b5-8ef4-c6aa2f7efee0 R6WSUN 8.9094085693359375T normal
- CD_01_HTZHTZHADM01 2018-06-27T17:33:49+08:00 /dev/sdb /dev/sdb HardDisk 0 0 25f32ef1-dd5a-4cf7-8440-1e74abe8e858 R74S6N 8.9094085693359375T normal
- CD_02_HTZHTZHADM01 2024-09-12T10:25:05+08:00 /dev/sdc /dev/sdc HardDisk 0 0 76bcd3bd-9e46-4f7e-bece-046d4e83276a R52JHK 8.9094085693359375T normal
- CD_03_HTZHTZHADM01 2018-06-27T17:33:49+08:00 /dev/sdd /dev/sdd HardDisk 0 0 0b1c0522-e657-4d70-b6ea-85811ddf5913 R75DLN 8.9094085693359375T normal
- CD_04_HTZHTZHADM01 2018-06-27T17:33:49+08:00 /dev/sde /dev/sde HardDisk 0 0 c169081a-1ba8-435f-ba7c-a605b3049c78 R63KJN 8.9094085693359375T normal
- CD_05_HTZHTZHADM01 2018-06-27T17:33:49+08:00 /dev/sdf /dev/sdf HardDisk 0 0 4b56c0f2-a6a0-43a0-bbeb-314a70626b09 R714UN 8.9094085693359375T normal
- CD_06_HTZHTZHADM01 2018-06-27T17:33:50+08:00 /dev/sdg /dev/sdg HardDisk 0 0 0161eb49-0dca-4ae5-a247-a0b810491270 R6SLBN 8.9094085693359375T normal
- CD_07_HTZHTZHADM01 2018-06-27T17:33:50+08:00 /dev/sdh /dev/sdh HardDisk 11413920 0 d7ef18a7-0aae-4136-8a98-2ec3978c498b R7D8GN 8.9094085693359375T normal
- CD_08_HTZHTZHADM01 2018-06-27T17:33:50+08:00 /dev/sdi /dev/sdi HardDisk 0 0 312cc384-6cb4-4724-a88d-e43a06003e80 R6W8AN 8.9094085693359375T normal
- CD_09_HTZHTZHADM01 2018-06-27T17:33:50+08:00 /dev/sdj /dev/sdj HardDisk 0 0 700e6561-de2f-4577-8cfe-d8d09435d291 R6Y44N 8.9094085693359375T normal
- CD_10_HTZHTZHADM01 2018-06-27T17:33:50+08:00 /dev/sdk /dev/sdk HardDisk 0 0 1ba9303f-16a7-4c75-bd06-9e772307b63a R7D4RN 8.9094085693359375T normal
- CD_11_HTZHTZHADM01 2018-06-27T17:33:50+08:00 /dev/sdl /dev/sdl HardDisk 0 0 0a429ee8-790b-4087-bfa5-74a94a79238a R7G62N 8.9094085693359375T normal
- FD_00_HTZHTZHADM01 2018-06-27T17:33:51+08:00 /dev/md310 /dev/md310 FlashDisk 0 0 4d5272bf-4aba-48c4-8e98-b5c651ff19ed PHLE805600PD6P4BGN-2,PHLE805600PD6P4BGN-1 5.8218994140625T normal
- FD_01_HTZHTZHADM01 2018-06-27T17:33:51+08:00 /dev/md304 /dev/md304 FlashDisk 0 0 6eea3fb4-deaf-436f-aa61-46ecf4bc5f2b PHLE8055008T6P4BGN-2,PHLE8055008T6P4BGN-1 5.8218994140625T normal
- FD_02_HTZHTZHADM01 2018-06-27T17:33:52+08:00 /dev/md305 /dev/md305 FlashDisk 0 0 e10fe5b7-2d5c-48d4-a1a6-37612ddee585 PHLE805600P76P4BGN-2,PHLE805600P76P4BGN-1 5.8218994140625T normal
- FD_03_HTZHTZHADM01 2018-06-27T17:33:53+08:00 /dev/md306 /dev/md306 FlashDisk 0 0 38181a27-b808-4238-a98c-04325146db87 PHLE8056009A6P4BGN-2,PHLE8056009A6P4BGN-1 5.8218994140625T normal
复制代码 3,通过list griddisk查看状态
- CellCLI> list griddisk attributes all
- DATAC1_CD_00_HTZHTZHADM01 DATAC1 DATAC1_CD_00_HTZHTZHADM01 HTZHTZHADM01 FD_03_HTZHTZHADM01 default CD_00_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup DATAC1" 2018-06-27T17:38:56+08:00 HardDisk 0 8666a91f-f41b-4ad9-91af-72e5e1dd3840 7.1279296875T active
- DATAC1_CD_01_HTZHTZHADM01 DATAC1 DATAC1_CD_01_HTZHTZHADM01 HTZHTZHADM01 FD_01_HTZHTZHADM01 default CD_01_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup DATAC1" 2018-06-27T17:38:56+08:00 HardDisk 0 98e70d44-755a-416e-b221-1339bfe6accf 7.1279296875T active
- DATAC1_CD_02_HTZHTZHADM01 DATAC1 DATAC1_CD_02_HTZHTZHADM01 HTZHTZHADM01 FD_03_HTZHTZHADM01 default CD_02_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup DATAC1" 2024-09-12T10:25:05+08:00 HardDisk 0 20be2fbd-de47-455f-aceb-6f33451227e9 7.1279296875T active
- DATAC1_CD_03_HTZHTZHADM01 DATAC1 DATAC1_CD_03_HTZHTZHADM01 HTZHTZHADM01 FD_01_HTZHTZHADM01 default CD_03_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup DATAC1" 2018-06-27T17:38:56+08:00 HardDisk 0 c5dd1a7b-3ab4-4afd-ba95-6b52d7e0c990 7.1279296875T active
- DATAC1_CD_04_HTZHTZHADM01 DATAC1 DATAC1_CD_04_HTZHTZHADM01 HTZHTZHADM01 FD_00_HTZHTZHADM01 default CD_04_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup DATAC1" 2018-06-27T17:38:55+08:00 HardDisk 0 66daaae1-f9bd-409f-a587-056a49bb0737 7.1279296875T active
- DATAC1_CD_05_HTZHTZHADM01 DATAC1 DATAC1_CD_05_HTZHTZHADM01 HTZHTZHADM01 FD_02_HTZHTZHADM01 default CD_05_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup DATAC1" 2018-06-27T17:38:56+08:00 HardDisk 0 475cab0a-eef4-4c5b-b3be-1ff23b155f51 7.1279296875T active
- DATAC1_CD_06_HTZHTZHADM01 DATAC1 DATAC1_CD_06_HTZHTZHADM01 HTZHTZHADM01 FD_03_HTZHTZHADM01 default CD_06_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup DATAC1" 2018-06-27T17:38:57+08:00 HardDisk 0 f75ea3a4-7ee2-4f85-9fee-dada8e11c6cf 7.1279296875T active
- DATAC1_CD_07_HTZHTZHADM01 DATAC1 DATAC1_CD_07_HTZHTZHADM01 HTZHTZHADM01 FD_01_HTZHTZHADM01 default CD_07_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup DATAC1" 2018-06-27T17:38:55+08:00 HardDisk 4 d30809fc-57df-442d-bea9-b22d12c72281 7.1279296875T active
- DATAC1_CD_08_HTZHTZHADM01 DATAC1 DATAC1_CD_08_HTZHTZHADM01 HTZHTZHADM01 FD_00_HTZHTZHADM01 default CD_08_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup DATAC1" 2018-06-27T17:38:56+08:00 HardDisk 0 9cc88c99-31dc-4731-be5f-1f4e467efb01 7.1279296875T active
- DATAC1_CD_09_HTZHTZHADM01 DATAC1 DATAC1_CD_09_HTZHTZHADM01 HTZHTZHADM01 FD_02_HTZHTZHADM01 default CD_09_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup DATAC1" 2018-06-27T17:38:55+08:00 HardDisk 0 1656d696-640e-459b-a15f-c6c3c2ec504a 7.1279296875T active
- DATAC1_CD_10_HTZHTZHADM01 DATAC1 DATAC1_CD_10_HTZHTZHADM01 HTZHTZHADM01 FD_00_HTZHTZHADM01 default CD_10_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup DATAC1" 2018-06-27T17:38:55+08:00 HardDisk 0 881d0d07-1a4a-4492-8320-407bc8b2e5f0 7.1279296875T active
- DATAC1_CD_11_HTZHTZHADM01 DATAC1 DATAC1_CD_11_HTZHTZHADM01 HTZHTZHADM01 FD_02_HTZHTZHADM01 default CD_11_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup DATAC1" 2018-06-27T17:38:57+08:00 HardDisk 0 13fddb80-857b-40c5-9b41-68efced3dc40 7.1279296875T active
- RECOC1_CD_00_HTZHTZHADM01 RECOC1 RECOC1_CD_00_HTZHTZHADM01 HTZHTZHADM01 FD_03_HTZHTZHADM01 default CD_00_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup RECOC1" 2018-06-27T17:39:19+08:00 HardDisk 0 a4ad0311-da18-4419-80dc-8b75227329d2 1.78143310546875T active
- RECOC1_CD_01_HTZHTZHADM01 RECOC1 RECOC1_CD_01_HTZHTZHADM01 HTZHTZHADM01 FD_01_HTZHTZHADM01 default CD_01_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup RECOC1" 2018-06-27T17:39:19+08:00 HardDisk 0 ce81a19a-3651-4dc1-9dc7-e84b41754a76 1.78143310546875T active
- RECOC1_CD_02_HTZHTZHADM01 RECOC1 RECOC1_CD_02_HTZHTZHADM01 HTZHTZHADM01 FD_03_HTZHTZHADM01 default CD_02_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup RECOC1" 2024-09-12T10:25:05+08:00 HardDisk 0 789705ed-e76f-43ab-ac1f-2544bebb1ea3 1.78143310546875T active
- RECOC1_CD_03_HTZHTZHADM01 RECOC1 RECOC1_CD_03_HTZHTZHADM01 HTZHTZHADM01 FD_01_HTZHTZHADM01 default CD_03_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup RECOC1" 2018-06-27T17:39:21+08:00 HardDisk 0 37098fd1-2708-4a4b-86a7-d4a8c14ddf88 1.78143310546875T active
- RECOC1_CD_04_HTZHTZHADM01 RECOC1 RECOC1_CD_04_HTZHTZHADM01 HTZHTZHADM01 FD_00_HTZHTZHADM01 default CD_04_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup RECOC1" 2018-06-27T17:39:20+08:00 HardDisk 0 e18f3551-0af6-44b3-b577-7f08bde773c3 1.78143310546875T active
- RECOC1_CD_05_HTZHTZHADM01 RECOC1 RECOC1_CD_05_HTZHTZHADM01 HTZHTZHADM01 FD_02_HTZHTZHADM01 default CD_05_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup RECOC1" 2018-06-27T17:39:20+08:00 HardDisk 0 cbb48299-dbac-4230-b784-508c954a461b 1.78143310546875T active
- RECOC1_CD_06_HTZHTZHADM01 RECOC1 RECOC1_CD_06_HTZHTZHADM01 HTZHTZHADM01 FD_03_HTZHTZHADM01 default CD_06_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup RECOC1" 2018-06-27T17:39:21+08:00 HardDisk 0 0010e607-713f-4055-8924-2fc23a31e9a1 1.78143310546875T active
- RECOC1_CD_07_HTZHTZHADM01 RECOC1 RECOC1_CD_07_HTZHTZHADM01 HTZHTZHADM01 FD_01_HTZHTZHADM01 default CD_07_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup RECOC1" 2018-06-27T17:39:20+08:00 HardDisk 23825674 a2560429-fb29-4022-a85a-98e0baeacf28 1.78143310546875T active
- RECOC1_CD_08_HTZHTZHADM01 RECOC1 RECOC1_CD_08_HTZHTZHADM01 HTZHTZHADM01 FD_00_HTZHTZHADM01 default CD_08_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup RECOC1" 2018-06-27T17:39:20+08:00 HardDisk 0 70997c8e-4bf0-4820-97a9-32c48be234ff 1.78143310546875T active
- RECOC1_CD_09_HTZHTZHADM01 RECOC1 RECOC1_CD_09_HTZHTZHADM01 HTZHTZHADM01 FD_02_HTZHTZHADM01 default CD_09_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup RECOC1" 2018-06-27T17:39:19+08:00 HardDisk 0 309f073f-262f-4042-b9c7-b935c6f1a143 1.78143310546875T active
- RECOC1_CD_10_HTZHTZHADM01 RECOC1 RECOC1_CD_10_HTZHTZHADM01 HTZHTZHADM01 FD_00_HTZHTZHADM01 default CD_10_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup RECOC1" 2018-06-27T17:39:19+08:00 HardDisk 0 fbc56c09-c95f-4230-a971-31b2a18fc10c 1.78143310546875T active
- RECOC1_CD_11_HTZHTZHADM01 RECOC1 RECOC1_CD_11_HTZHTZHADM01 HTZHTZHADM01 FD_02_HTZHTZHADM01 default CD_11_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup RECOC1" 2018-06-27T17:39:19+08:00 HardDisk 0 96061cc1-ad24-4a8e-afe7-f974d0930a8f 1.78143310546875T active
- CellCLI>
复制代码 4,确认故障LUN的位置
- name: CD_05_HTZHTZHADM06
- comment:
- creationTime: 2018-06-27T17:33:50+08:00
- deviceName: /dev/sdf
- devicePartition: /dev/sdf
- diskType: HardDisk
- errorCount: 46
- freeSpace: 0
- id: 86b90b17-fbe4-4622-821c-f3e96881ff09
- physicalDisk: R74YWN
- size: 8.9094085693359375T
- status: proactive failure
- CellCLI>
- 查看griddisk的信息
- list griddisk where celldisk=CD_05_HTZHTZHADM06
- CellCLI> list griddisk where celldisk=CD_05_HTZHTZHADM06
- DATAC1_CD_05_HTZHTZHADM06 proactive failure
- RECOC1_CD_05_HTZHTZHADM06 proactive failure
复制代码 5,确认失败的磁盘所关联的ASM disk是否已经自动drop。
使用grid用户登录到数据库节点,使用sqlplus / as sysasm连接到ASM实例。- SQL> set lines 180 pages 999
- col path format a50
- select group_number,path,header_status,mount_status,mode_status,name from v$ASM_DISK where path like '%CD_05_HTZHTZHADM06%';
- SQL> SQL>
- GROUP_NUMBER PATH HEADER_STATU MOUNT_S MODE_ST NAME
- ------------ -------------------------------------------------- ------------ ------- ------- ------------------------------
- 0 o/192.168.10.19;192.168.10.20/DATAC1_CD_05_sjxzcel FORMER CLOSED ONLINE
- adm06
- 3 o/192.168.10.19;192.168.10.20/RECOC1_CD_05_sjxzcel MEMBER CACHED ONLINE RECOC1_CD_05_HTZHTZHADM06
- adm06
复制代码 发现未完全drop。
6,查看数据库reblance状态
- SQL>
- INST_ID GROUP_NUMBER OPERA PASS STAT POWER ACTUAL SOFAR EST_WORK EST_RATE EST_MINUTES ERROR_CODE CON_ID
- ------- ------------ ----- ---- ----- ----- ------ -------- -------- -------- ----------- ---------- ------
- 1 3 REBAL COMPACT WAIT 12 12 0 0 0 0 0
- 3 3 REBAL REBALANCE RUN 12 121789 1240900 0 0 0 0
- 3 3 REBAL REBUILD DONE 12 12 0 0 0 0 0
- 3 3 REBAL RESYNC DONE 12 12 0 0 0 0 0
- 1 3 REBAL COMPACT WAIT 12 12 0 0 0 0 0
- 2 3 REBAL REBALANCE WAIT 12 12 0 0 0 0 0
- 2 3 REBAL REBUILD WAIT 12 12 0 0 0 0 0
- 2 3 REBAL RESYNC WAIT 12 12 0 0 0 0 0
- 3 3 REBAL COMPACT WAIT 12 12 0 0 0 0 0
- 3 3 REBAL REBALANCE WAIT 12 12 0 0 0 0 0
- 3 3 REBAL REBUILD WAIT 12 12 0 0 0 0 0
- 3 3 REBAL RESYNC WAIT 12 12 0 0 0 0 0
- 1 3 REBAL COMPACT WAIT 12 12 0 0 0 0 0
- 1 3 REBAL REBALANCE WAIT 12 12 0 0 0 0 0
- 1 3 REBAL REBUILD WAIT 12 12 0 0 0 0 0
- 1 3 REBAL RESYNC WAIT 12 12 0 0 0 0 0
- 2 3 REBAL COMPACT WAIT 12 12 0 0 0 0 0
- 2 3 REBAL REBALANCE WAIT 12 12 0 0 0 0 0
- 2 3 REBAL REBUILD WAIT 12 12 0 0 0 0 0
- 2 3 REBAL RESYNC WAIT 12 12 0 0 0 0 0
- 20 rows selected.
复制代码 发现一直是这个状态。
确认ALERT日志信息
某节点的日志
- ----------------Alert----------------
- 2025-06-16T16:51:26.645923+08:00
- NOTE: updating disk modes to 0x5 from 0x7 for disk 19 (DATAC1_CD_05_HTZHTZHADM06) in group 1 (DATAC1): lflags 0x0
- NOTE: disk 19 (DATAC1_CD_05_HTZHTZHADM06) in group 1 (DATAC1) is offline for reads
- NOTE: updating disk modes to 0x1 from 0x5 for disk 19 (DATAC1_CD_05_HTZHTZHADM06) in group 1 (DATAC1): lflags 0x0
- NOTE: disk 19 (DATAC1_CD_05_HTZHTZHADM06) in group 1 (DATAC1) is offline for writes
- NOTE: disk 19 (DATAC1_CD_05_HTZHTZHADM06) in group 1 (DATAC1) is offline for writes
- SUCCESS: disk DATAC1_CD_05_HTZHTZHADM06 (19.1897451371) dropped from diskgroup DATAC1
- ----------------ASM磁盘故障-------------------------------
- WARNING: I/O on unhealthy ASM disk (DATAC1_CD_05_HTZHTZHADM06) in group DATAC1 /0x3a82859c will be diverted to its online partner disks
- 2025-06-19T18:08:35.323341+08:00
- SQL> /* Exadata Auto Mgmt: Proactive DROP ASM Disk */
- alter diskgroup RECOC1 drop
- disk RECOC1_CD_05_HTZHTZHADM06
- NOTE: GroupBlock outside rolling migration privileged region
- NOTE: initiating offline for alter one membership refresh for group=3
- 2025-06-19T18:08:37.501287+08:00
- 2025-06-19T18:08:25.583485+08:00
- SQL> /* Exadata Auto Mgmt: Proactive DROP ASM Disk */
- alter diskgroup RECOC1 drop
- disk RECOC1_CD_05_HTZHTZHADM06
- 2025-06-19T18:08:26.836233+08:00
- NOTE: Attempting voting file refresh on diskgroup RECOC1
- NOTE: Refresh completed on diskgroup RECOC1. No voting file found.
复制代码 某节点的信息
- 2025-06-23T21:54:19.843872+08:00
- WARNING: Read Failed. group:3 disk:80 AU:38398 offset:1048576 size:1048576
- path:o/192.168.10.9;192.168.10.10/RECOC1_CD_07_HTZHTZHADM01
- incarnation:0x7118d6cb asynchronous result:'I/O error'
- subsys:OSS krq:0x7f19ef8f3000 osderr1:0xc9 osderr2:0x0
- Exadata error:'Generic I/O error'
- IO elapsed time: 7057 usec Time waited on I/O: 6048 usec
- WARNING: Read Failed. group:3 disk:80 AU:38398 offset:0 size:1048576
- path:o/192.168.10.9;192.168.10.10/RECOC1_CD_07_HTZHTZHADM01
- incarnation:0x7118d6cb asynchronous result:'I/O error'
- subsys:OSS krq:0x7f19ef98288 bufp:0x7f19ea383000 osderr1:0xc9 osderr2:0x0
- Exadata error:'Generic I/O error'
- IO elapsed time: 15155 usec Time waited on I/O: 14146 usec
- NOTE: Suppressing further IO Read errors on group:3 disk:80
- WARNING: Read Failed. group:3 disk:80 AU:38374 offset:0 size:1048576
- path:o/192.168.10.9;192.168.10.10/RECOC1_CD_07_HTZHTZHADM01
- incarnation:0x7118d6cb asynchronous result:'I/O error'
- subsys:OSS krq:0x7f19ef969348 bufp:0x7f19ea073000 osderr1:0xc9 osderr2:0x0
- Exadata error:'Generic I/O error'
- IO elapsed time: 7068 usec Time waited on I/O: 2021 usec
- 2025-06-23T21:54:50.567089+08:00
- WARNING: Read Failed. group:3 disk:80 AU:38398 offset:1048576 size:1048576
- path:o/192.168.10.9;192.168.10.10/RECOC1_CD_07_HTZHTZHADM01
- incarnation:0x7118d6cb asynchronous result:'I/O error'
- subsys:OSS krq:0x7f19ef8f3000 osderr1:0xc9 osderr2:0x0
- Exadata error:'Generic I/O error'
- IO elapsed time: 40362 usec Time waited on I/O: 6050 usec
- WARNING: Read Failed. group:3 disk:80 AU:38398 offset:0 size:1048576
- path:o/192.168.10.9;192.168.10.10/RECOC1_CD_07_HTZHTZHADM01
- incarnation:0x7118d6cb asynchronous result:'I/O error'
- subsys:OSS krq:0x7f19ef969348 bufp:0x7f19ea073000 osderr1:0xc9 osderr2:0x0
- Exadata error:'Generic I/O error'
- IO elapsed time: 10082 usec Time waited on I/O: 1009 usec
- NOTE: Suppressing further IO Read errors on group:3 disk:80
- WARNING: Read Failed. group:3 disk:80 AU:57491 offset:0 size:1048576
复制代码 Exadata error:'Generic I/O error'和Read Failed这里给得很明显,提示在从平衡的时候读取AU时出现了IO错误。
确定IO故障节点的CELL的日志
- Read Error on Cell Disk CD_07_sjxzceladm01 (/dev/sdh) at device offset 8078426636288 bytes with size 1048576 bytes membuf 0x1c89a00000, bioreq 0x616d522c (errno: No data available [61])
- Read Error on Grid Disk RECOC1_CD_07_sjxzceladm01 at grid disk offset 241134731264 bytes with size 1048576 bytes from database +ASM
- 2025-06-23T17:23:05.781390+08:00
- Read Error on Cell Disk CD_07_sjxzceladm01 (/dev/sdh) at device offset 7844585799680 bytes with size 1048576 bytes membuf 0x1e70f00000, bioreq 0x6120b2b4 (errno: No data available [61])
- Read Error on Grid Disk RECOC1_CD_07_sjxzceladm01 at grid disk offset 7293894656 bytes with size 1048576 bytes from database +ASM
- 2025-06-23T17:23:05.790166+08:00
复制代码 errno: No data available [61]通过这行日志,也可以很明确知道IO有异常
操作系统日志
- Jun 24 00:17:43 sjxzceladm01 kernel: [29925171.789307] sd 0:2:7:0: [sdh] tag#7 BRCM Debug mfi stat 0x2d, data len requested/completed 0x100000/0x0
- Jun 24 00:17:43 sjxzceladm01 kernel: [29925171.790573] sd 0:2:7:0: [sdh] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
- Jun 24 00:17:43 sjxzceladm01 kernel: [29925171.790576] sd 0:2:7:0: [sdh] tag#7 Sense Key : Medium Error [current]
- Jun 24 00:17:43 sjxzceladm01 kernel: [29925171.790578] sd 0:2:7:0: [sdh] tag#7 Add. Sense: Unrecovered read error
复制代码 操作系统层面同样在报读取错误
故障处理
强制删除磁盘
使用踢盘命令alter physicaldisk 252:5 drop for replacement剔除存储节点6的故障硬盘
更换物理磁盘
更换磁盘,每个磁盘都有一个挂扣,拨动挂扣将旧的磁盘移除,然后在将新的磁盘推入槽位,锁起挂扣。更换完成后,磁盘上的LED指示灯消失,绿灯亮起。
确认更换磁盘的状态
在物理磁盘更换完成以后,系统会自动创建LUN,celldisk,griddisk,当其是系统盘时,如果磁盘包含系统分区,RAID同时也会自动进行重组。
在存储服务器这一端的cellcli命令提示符下执行如下命令可查看lun,physicaldisk,celldisk,griddisk的状态,创建时间及名称,确认更换后的信息正确无误。- CellCLI> list lun 0_5 detail
- name: 0_5
- cellDisk: CD_05_HTZHTZHADM06
- deviceName: /dev/sdf
- diskType: HardDisk
- id: 0_5
- isSystemLun: FALSE
- lunSize: 8.90940952301025390625T
- lunUID: 0_5
- physicalDrives: 252:5
- raidLevel: 0
- lunWriteCacheMode: "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
- status: normal
复制代码- CellCLI> list physicaldisk 252:5 detail
- name: 252:5
- deviceId: 29
- deviceName: /dev/sdf
- diskType: HardDisk
- enclosureDeviceId: 252
- errOtherCount: 0
- luns: 0_5
- makeModel: "HGST H1231A823SUN010T"
- physicalFirmware: A680
- physicalInsertTime: 2025-06-24T02:30:45+08:00
- physicalInterface: sas
- physicalSerial: R94HEN
- physicalSize: 8.91015625T
- slotNumber: 5
- status: normal
复制代码- CellCLI> list celldisk where lun=0_5 detail
- name: CD_05_HTZHTZHADM06
- comment:
- creationTime: 2025-06-24T02:30:53+08:00
- deviceName: /dev/sdf
- devicePartition: /dev/sdf
- diskType: HardDisk
- errorCount: 0
- freeSpace: 0
- id: 1a41238e-cbb6-4d66-8427-5e7dc2e2729f
- physicalDisk: R94HEN
- size: 8.9094085693359375T
- status: normal
复制代码- CellCLI> list griddisk where celldisk=CD_05_HTZHTZHADM06 detail
- name: DATAC1_CD_05_HTZHTZHADM06
- asmDiskGroupName: DATAC1
- asmDiskName: DATAC1_CD_05_HTZHTZHADM06
- asmFailGroupName: HTZHTZHADM06
- availableTo:
- cachedBy: FD_01_HTZHTZHADM06
- cachingPolicy: default
- cellDisk: CD_05_HTZHTZHADM06
- comment: "Cluster cluster-clu1 diskgroup DATAC1"
- creationTime: 2025-06-24T02:30:53+08:00
- diskType: HardDisk
- errorCount: 0
- id: ac2ecc45-619e-464c-bf66-5c0fd7e8c608
- size: 7.1279296875T
- status: active
- name: RECOC1_CD_05_HTZHTZHADM06
- asmDiskGroupName: RECOC1
- asmDiskName: RECOC1_CD_05_HTZHTZHADM06
- asmFailGroupName: HTZHTZHADM06
- availableTo:
- cachedBy: FD_01_HTZHTZHADM06
- cachingPolicy: default
- cellDisk: CD_05_HTZHTZHADM06
- comment: "Cluster cluster-clu1 diskgroup RECOC1"
- creationTime: 2025-06-24T02:30:53+08:00
- diskType: HardDisk
- errorCount: 0
- id: 5ad626b7-9167-4f64-b07d-cfb767ca8d3d
- size: 1.78143310546875T
- status: active
复制代码 然后在数据库服务器的ASM实例这一段查看griddisk是否已经正确添加到ASM磁盘组:- SQL> set linesize 180 pages 999
- col path format a50
- select group_number,path,header_status,mount_status,name from v$ASM_DISK where path like '%CD_05_HTZHTZHADM06%';
- SQL> SQL>
- GROUP_NUMBER PATH HEADER_STATU MOUNT_S NAME
- ------------ -------------------------------------------------- ------------ ------- ------------------------------
- 1 o/192.168.10.19;192.168.10.20/DATAC1_CD_05_sjxzcel MEMBER CACHED DATAC1_CD_05_HTZHTZHADM06
- adm06
- 3 o/192.168.10.19;192.168.10.20/RECOC1_CD_05_sjxzcel MEMBER CACHED RECOC1_CD_05_HTZHTZHADM06
- adm06
复制代码 ------------------作者介绍-----------------------
姓名:黄廷忠
现就职:Oracle中国高级服务团队
曾就职:OceanBase、云和恩墨、东方龙马等
电话、微信、QQ:18081072613
个人博客: (http://www.htz.pw)
CSDN地址: (https://blog.csdn.net/wwwhtzpw)
博客园地址: (https://www.cnblogs.com/www-htz-pw)
来源:程序园用户自行投稿发布,如果侵权,请联系站长删除
免责声明:如果侵犯了您的权益,请联系站长,我们会及时删除侵权内容,谢谢合作! |