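Oracle 11gR2: CRS service crashes during fiber link failover
Background:
We deployed Oracle 11gR2 (11.2.0.4) on Red Hat Enterprise Linux 5.8, with multipath link redundancy provided by RDAC. Once the deployment was finished we needed to test that redundancy. Our fiber links are connected as follows. We ran the multipath tests in the following combinations: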
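Cable-pull test, combination 1:
1. Unplug fiber links ② and ④ first: everything is normal, no problems. Plug them back in, wait five minutes, then go to step 2.
2. Then unplug fiber links ① and ③: the database service stays up, but the crs process crashes and cannot be reached. Restarting the crs process by hand restores it.
Cable-pull test, combination 2:
1. Unplug fiber links ① and ③ first: everything is normal, no problems. Plug them back in, wait five minutes, then go to step 2.
2. Then unplug fiber links ② and ④: the database service stays up, but the crs process crashes and cannot be reached. Restarting the crs process by hand restores it.
Cable-pull test, combination 3:
1. Unplug fiber links ① and ④ first: everything is normal, no problems. Plug them back in, wait five minutes, then go to step 2.
2. Then unplug fiber links ② and ③: everything is normal, no problems.
Cable-pull test, combination 4:
1. Unplug fiber links ② and ③ first: everything is normal, no problems. Plug them back in, wait five minutes, then go to step 2.
2. Then unplug fiber links ① and ④: everything is normal, no problems.
Controller switchover test:
1. Log in to the storage management console and confirm that the disks are currently owned by controller A. Manually move all of them to controller B: everything is normal, no problems.
2. Five minutes later, log in to the storage management console again and move all disks from controller B back to controller A: everything is normal, no problems.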
Symptoms:
The problem occurs in step 2 of combinations 1 and 2. The symptoms are as follows:
[grid@db01 ~] $ crs_stat -t -v
CRS-0184: Cannot communicate with the CRS daemon.

[root@db01 ~]
oracle 2687 1 0 00:12 ? 00:00:00 ora_pmon_woo
oracle 2689 1 0 00:12 ? 00:00:00 ora_psp0_woo
oracle 2691 1 0 00:12 ? 00:00:00 ora_vktm_woo
oracle 2695 1 0 00:12 ? 00:00:00 ora_gen0_woo
oracle 2697 1 0 00:12 ? 00:00:00 ora_diag_woo
oracle 2699 1 0 00:12 ? 00:00:00 ora_dbrm_woo
oracle 2701 1 0 00:12 ? 00:00:00 ora_dia0_woo
oracle 2703 1 0 00:12 ? 00:00:00 ora_mman_woo
oracle 2705 1 0 00:12 ? 00:00:00 ora_dbw0_woo
oracle 2707 1 0 00:12 ? 00:00:00 ora_lgwr_woo
oracle 2709 1 0 00:12 ? 00:00:01 ora_ckpt_woo
oracle 2711 1 0 00:12 ? 00:00:00 ora_smon_woo
oracle 2713 1 0 00:12 ? 00:00:00 ora_reco_woo
oracle 2715 1 0 00:12 ? 00:00:00 ora_mmon_woo
oracle 2717 1 0 00:12 ? 00:00:00 ora_mmnl_woo
oracle 2719 1 0 00:12 ? 00:00:00 ora_d000_woo
oracle 2721 1 0 00:12 ? 00:00:00 ora_s000_woo
oracle 2728 1 0 00:12 ? 00:00:00 ora_rvwr_woo

SQL> select host_name,instance_name,status from gv$instance;

HOST_NAME INSTANCE_NAME STATUS
---------- ---------------- ------------
db01 woo OPEN
db02 woo OPEN
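The check above (CRS client fails while the instance background processes are still running) can be scripted. The following is a minimal sketch, not the author's procedure: the function names `crsd_unreachable` and `count_instance_procs` are hypothetical, and they only inspect text that has already been captured from `crs_stat` and `ps -ef`.

```shell
#!/bin/bash
# Hypothetical helpers (names are illustrative, not from the original post).

# Return 0 if the captured CRS client output contains CRS-0184,
# the "Cannot communicate with the CRS daemon" message seen in the session above.
crsd_unreachable() {
    # $1: text captured from `crs_stat -t -v` (or another CRS client command)
    case "$1" in
        *CRS-0184*) return 0 ;;
        *)          return 1 ;;
    esac
}

# Count Oracle background processes of one instance in `ps -ef` text read from stdin.
count_instance_procs() {
    # $1: instance name, e.g. woo
    grep -c "ora_[a-z0-9]*_$1\$"
}
```

On a live node this could be used as `crsd_unreachable "$(crs_stat -t -v 2>&1)" && ps -ef | count_instance_procs woo`. The post only says that crs was restarted by hand; on 11.2 one commonly used way to do that is `crsctl start res ora.crsd -init` as root, which should be verified for your environment before use.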
Log investigation:
OS messages:
ASM alert log:
CRS log:
2014-10-30 13:48:26.715: [ CRSPE][1174640960]{2:1454:184} RI [ora.OCR_VOT001.dg db02 1] new target state: [OFFLINE] old value: [ONLINE]
2014-10-30 13:48:26.716: [ CRSOCR][1166235968]{2:1454:184} Multi Write Batch processing...
2014-10-30 13:48:26.716: [ CRSPE][1174640960]{2:1454:184} RI [ora.OCR_VOT001.dg db02 1] new internal state: [STOPPING] old value: [STABLE]
2014-10-30 13:48:26.716: [ CRSPE][1174640960]{2:1454:184} Sending message to agfw: id = 3284
2014-10-30 13:48:26.716: [ CRSPE][1174640960]{2:1454:184} CRS-2673: Attempting to stop 'ora.OCR_VOT001.dg' on 'db02'

2014-10-30 13:48:26.720: [ CRSPE][1174640960]{2:1454:184} Received reply to action [Stop] message ID: 3284
2014-10-30 13:48:26.725: [ OCRRAW][1166235968]proprior: Header check from OCR device 0 offset 6651904 failed (26).
2014-10-30 13:48:26.725: [ OCRRAW][1166235968]proprior: Retrying buffer read from another mirror for disk group [+OCR_VOT001] for block at offset [6651904]
2014-10-30 13:48:26.725: [ OCRASM][1166235968]proprasmres: Total 0 mirrors detected
2014-10-30 13:48:26.725: [ OCRASM][1166235968]proprasmres: Only 1 mirror found in this disk group.
2014-10-30 13:48:26.725: [ OCRASM][1166235968]proprasmres: Need to invoke checkdg. Mirror #0 has an invalid buffer.
2014-10-30 13:48:26.740: [ CRSPE][1174640960]{2:1454:184} Received reply to action [Stop] message ID: 3284
2014-10-30 13:48:26.740: [ CRSPE][1174640960]{2:1454:184} RI [ora.OCR_VOT001.dg db02 1] new internal state: [STABLE] old value: [STOPPING]
2014-10-30 13:48:26.740: [ CRSPE][1174640960]{2:1454:184} RI [ora.OCR_VOT001.dg db02 1] new external state [OFFLINE] old value: [ONLINE] label = []
2014-10-30 13:48:26.740: [ CRSPE][1174640960]{2:1454:184} CRS-2677: Stop of 'ora.OCR_VOT001.dg' on 'db02' succeeded

2014-10-30 13:48:26.740: [ CRSRPT][1176742208]{2:1454:184} Published to EVM CRS_RESOURCE_STATE_CHANGE for ora.OCR_VOT001.dg
2014-10-30 13:48:40.891: [ OCRASM][1166235968]proprasmres: kgfoControl returned error [8]
[ OCRASM][1166235968]SLOS : SLOS: cat=8, opn=kgfoCkDG01, dep=15032, loc=kgfokge

2014-10-30 13:48:40.891: [ OCRASM][1166235968]ASM Error Stack : ORA-27091: unable to queue I/O
ORA-15079: ASM file is closed
ORA-06512: at line 4

2014-10-30 13:48:55.542: [ CRSMAIN][199140176] Initializing OCR
[ CLWAL][199140176]clsw_Initialize: OLR initlevel [70000]
2014-10-30 13:48:55.805: [ OCRASM][199140176]proprasmo: Error in open/create file in dg [OCR_VOT001]
[ OCRASM][199140176]SLOS : SLOS: cat=8, opn=kgfoOpen01, dep=15056, loc=kgfokge

2014-10-30 13:48:55.805: [ OCRASM][199140176]ASM Error Stack :
2014-10-30 13:48:55.825: [ OCRASM][199140176]proprasmo: kgfoCheckMount returned [6]
2014-10-30 13:48:55.825: [ OCRASM][199140176]proprasmo: The ASM disk group OCR_VOT001 is not found or not mounted
2014-10-30 13:48:55.825: [ OCRRAW][199140176]proprioo: Failed to open [+OCR_VOT001]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2014-10-30 13:48:55.825: [ OCRRAW][199140176]proprioo: No OCR/OLR devices are usable
2014-10-30 13:48:55.825: [ OCRASM][199140176]proprasmcl: asmhandle is NULL
2014-10-30 13:48:55.826: [ GIPC][199140176] gipcCheckInitialization: possible incompatible non-threaded init from [prom.c : 690], original from [clsss.c : 5343]
2014-10-30 13:48:55.826: [ default][199140176]clsvactversion:4: Retrieving Active Version from local storage.
2014-10-30 13:48:55.827: [ CSSCLNT][199140176]clssgsgrppubdata: group (ocr_db-cluster) not found
2014-10-30 13:48:55.827: [ OCRRAW][199140176]proprio_repairconf: Failed to retrieve the group public data. CSS ret code [20]
2014-10-30 13:48:55.830: [ OCRRAW][199140176]proprioo: Failed to auto repair the OCR configuration.
2014-10-30 13:48:55.830: [ OCRRAW][199140176]proprinit: Could not open raw device
2014-10-30 13:48:55.830: [ OCRASM][199140176]proprasmcl: asmhandle is NULL
2014-10-30 13:48:55.831: [ OCRAPI][199140176]a_init:16!: Backend init unsuccessful : [26]
2014-10-30 13:48:55.832: [ CRSOCR][199140176] OCR context init failure. Error: PROC-26: Error while accessing the physical storage
2014-10-30 13:48:55.832: [ CRSD][199140176] Created alert : (:CRSD00111:) : Could not init OCR, error: PROC-26: Error while accessing the physical storage
2014-10-30 13:48:55.832: [ CRSD][199140176][PANIC] CRSD exiting: Could not init OCR, code: 26
2014-10-30 13:48:55.832: [ CRSD][199140176] Done.
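In a long crsd.log the decisive lines are easy to miss among the state-change chatter. As a rough sketch, the filter below keeps only the lines that mark the failure chain in the excerpt above (ASM I/O error, disk group not mounted, PROC-26, CRSD panic). The function name `crsd_failure_lines` is hypothetical and the pattern list is taken from this one excerpt, so it is an assumption that the same messages appear in other occurrences.

```shell
#!/bin/bash
# Hypothetical filter: read crsd.log text on stdin and print only the lines
# that match the error signatures seen in the excerpt above.
crsd_failure_lines() {
    grep -E 'PROC-26|ORA-15079|ORA-27091|not found or not mounted|\[PANIC\]'
}
```

Usage, assuming the usual 11.2 log location (an assumption; adjust the path for your Grid home and host name): `crsd_failure_lines < $GRID_HOME/log/db02/crsd/crsd.log`.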
There are two ways to address the fault:
At the multipath failover level, see the following parameters:
FailOverQuiescenceTime:
Quiescence Timeout before Failover (Mode Select Page 2C) command. The time, in seconds, the array will wait for a quiescence condition to clear for an explicit failover operation. A typical setting is 20 seconds.
FailedPathCheckingInterval:
This parameter defines how long (in seconds) the MPP driver should wait before initiating a path-validation action. Default value is 60 seconds.
E.g.:
[root@db01 ~]# cat /etc/mpp.conf
VirtualDiskProductId=VirtualDisk
DebugLevel=0x0
NotReadyWaitTime=270
BusyWaitTime=270
QuiescenceWaitTime=270
InquiryWaitTime=60
MaxLunsPerArray=256
MaxPathsPerController=4
ScanInterval=60
InquiryInterval=1
MaxArrayModules=30
ErrorLevel=3
SelectionTimeoutRetryCount=0
UaRetryCount=10
RetryCount=10
SynchTimeout=170
FailOverQuiescenceTime=20
FailoverTimeout=120
FailBackToCurrentAllowed=1
ControllerIoWaitTime=300
ArrayIoWaitTime=600
DisableLUNRebalance=0
SelectiveTransferMaxTransferAttempts=5
SelectiveTransferMinIOWaitTime=3
IdlePathCheckingInterval=60
RecheckFailedPathWaitTime=30
FailedPathCheckingInterval=60
ArrayFailoverWaitTime=300
PrintSenseBuffer=0
ClassicModeFailover=0
AVTModeFailover=0
LunFailoverDelay=3
LoadBalancePolicy=1
ImmediateVirtLunCreate=0
BusResetTimeout=150
LunScanDelay=2
AllowHBAsgDevs=0
S2ToS3Key=471f51f35ec5426e
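To compare the two failover-related settings across nodes without reading the whole file, a small lookup helper is enough. This is only a sketch: `mpp_param` is a hypothetical name, and it assumes the simple `Key=Value` layout shown in the listing above.

```shell
#!/bin/bash
# Hypothetical helper: print the value of one key from an RDAC mpp.conf-style file.
mpp_param() {
    # $1: parameter name, $2: config file (defaults to /etc/mpp.conf)
    awk -F= -v key="$1" '$1 == key { print $2 }' "${2:-/etc/mpp.conf}"
}
```

For example, `mpp_param FailOverQuiescenceTime` and `mpp_param FailedPathCheckingInterval` print 20 and 60 for the file above. The post does not say which values, if any, were changed here. If you do edit /etc/mpp.conf, RDAC generally needs its boot image rebuilt (typically with `mppUpdate`) and a reboot before the change applies; treat that as an assumption to confirm against the documentation for your RDAC version.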
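At the ASM detection-time level:
Only the ASM hidden parameter _asm_hbeatiowait needs to be raised. I set it straight to 120, reran all five test groups, and the problem did not recur, so the fault is resolved.
(For how to view hidden parameter values, see: archive-1980)
E.g.: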
[root@db01 ~] # su - grid
[grid@db01 ~] $ sqlplus sysasm/oracle

SQL*Plus: Release 11.2.0.4.0 Production on Wed Nov 12 22:15:18 2014

Copyright (c) 1982, 2013, Oracle. All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> alter system set "_asm_hbeatiowait"=120 scope=spfile sid='*';

System altered.

SQL>
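Because the parameter is set with scope=spfile, the new value is only read at the next ASM instance startup, so it is worth confirming the value actually in effect before rerunning the tests. The article referenced for viewing hidden parameters is not reproduced here, so the following is a sketch using the commonly seen x$ksppi / x$ksppcv join rather than the author's own query; the function name `hidden_param_sql` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: emit a SQL statement that shows the current value of one
# hidden parameter. It only prints SQL text; it does not connect to anything.
hidden_param_sql() {
    # $1: parameter name, e.g. _asm_hbeatiowait
    cat <<EOF
select a.ksppinm name, b.ksppstvl value, a.ksppdesc description
  from x\$ksppi a, x\$ksppcv b
 where a.indx = b.indx
   and a.ksppinm = '$1';
EOF
}
```

As the grid user with the ASM environment set, the output could be piped into SQL*Plus, for example `hidden_param_sql _asm_hbeatiowait | sqlplus -S / as sysasm`. The x$ views are only visible to a SYS-level connection.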