Environment: OEL 6.8, Oracle 11.2.0.4 two-node RAC.
The alert log on node 2 reports the following errors:
Fri Nov 24 09:11:42 2017
skgxpvfynet: mtype: 61 process 11799 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_ora_11799.trc (incident=123381):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_123381/orcl2_ora_11799_i123381.trc
Fri Nov 24 09:11:42 2017
skgxpvfynet: mtype: 61 process 11801 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
opiodr aborting process unknown ospid (11743) as a result of ORA-603
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_ora_11801.trc (incident=123382):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_123382/orcl2_ora_11801_i123382.trc
Dumping diagnostic data in directory=[cdmp_20171124091142], requested by (instance=2, osid=11743), summary=[incident=123380].
The corresponding trace files:
# /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_ora_11799.trc
*** 2017-11-24 09:11:42.123
*** CLIENT ID:() 2017-11-24 09:11:42.123
*** SERVICE NAME:() 2017-11-24 09:11:42.123
*** MODULE NAME:() 2017-11-24 09:11:42.123
*** ACTION NAME:() 2017-11-24 09:11:42.123
SKGXP:[7fbcc2bfda88.0]{0}: SKGXPVFYNET: Socket self-test could not verify successful transmission of 32768 bytes (mtype 61).
SKGXP:[7fbcc2bfda88.1]{0}: The network is required to support UDP protocol sends of this size. Socket is bound to 169.254.188.234.
SKGXP:[7fbcc2bfda88.2]{0}: phase 'send', 0 tries, 100 loops, 4905 ms (last)
struct ksxpp * ksxppg_ [0xc122540, 0x7fbcc2995310) = 0x7fbcc2995308
Dump of memory from 0x00007FBCC2995308 to 0x00007FBCC2996838

# /u01/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_123381/orcl2_ora_11799_i123381.trc
Dump continued from file: /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_ora_11799.trc
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurr
#========= Dump for incident 123381 (ORA 603) ========
A search on MOS turned up Oracle Linux: ORA-27301: OS Failure Message: No Buffer Space Available (Doc ID 2041723.1), which indicates the problem is likely caused by the NIC's MTU being set too high, leaving the interface short of buffer space.
CAUSE
This happens due to less space available for network buffer reservation.

SOLUTION
1. On servers with High Physical Memory, the parameter vm.min_free_kbytes should be set in the order of 0.4% of total Physical Memory. This helps in keeping a larger range of defragmented memory pages available for network buffers, reducing the probability of low-buffer-space conditions.
   *** For example, on a server which is having 256GB RAM, the parameter vm.min_free_kbytes should be set to 1048576.
   *** On NUMA Enabled Systems, the value of vm.min_free_kbytes should be multiplied by the number of NUMA nodes since the value is to be split across all the nodes. On NUMA Enabled Systems, the value of vm.min_free_kbytes = n * 0.4% of total Physical Memory. Here 'n' is the number of NUMA nodes.
2. Additionally, the MTU value should be modified as below:
   # ifconfig lo mtu 16436
   To make the change persistent over reboot, add the following line in the file /etc/sysconfig/network-scripts/ifcfg-lo:
   MTU=16436
   Save the file and restart the network service to load the changes:
   # service network restart
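As a rough illustration of the vm.min_free_kbytes arithmetic above, the shell sketch below derives 0.4% of physical memory, multiplies it by the NUMA node count, and applies the value with sysctl. It is a sketch only, not part of the MOS note: the numactl-based node detection and the rounding are assumptions, and the result should be sanity-checked before it is applied on a production RAC node.

# Sketch: vm.min_free_kbytes = 0.4% of RAM x NUMA node count (assumes numactl is installed)
total_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)               # physical memory in KB
nodes=$(numactl --hardware 2>/dev/null | awk '/available:/ {print $2}')
[ -z "$nodes" ] && nodes=1                                          # fall back to a single node
min_free=$(( total_kb * 4 / 1000 * nodes ))                         # 0.4% per NUMA node
echo "setting vm.min_free_kbytes=$min_free"
sysctl -w vm.min_free_kbytes=$min_free
# Add "vm.min_free_kbytes = <value>" to /etc/sysctl.conf to make it persistent.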
This appears to be an error specific to the OEL operating system. Comparing OEL, RHEL, and CentOS shows that only OEL sets the loopback interface's MTU to 65536, while the other systems use 16436, and the MOS solution is precisely to change the loopback MTU to 16436. The loopback MTU on this server is currently set as follows:
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:1193604960 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1193604960 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:498857656538 (464.5 GiB)  TX bytes:498857656538 (464.5 GiB)
The loopback MTU on this server is still at the default of 65536, so change it as described in the MOS document:
ifconfig lo mtu 16436
This command modifies the in-memory interface parameter and takes effect immediately; the loopback interface now shows:
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:1193606824 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1193606824 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:498858566038 (464.5 GiB)  TX bytes:498858566038 (464.5 GiB)
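For a quick spot check of just the MTU value, without the full ifconfig output, it can also be read straight from sysfs; the path below is the standard location on OEL 6 kernels:

# Read the loopback MTU directly from sysfs
cat /sys/class/net/lo/mtu
16436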
However, the value reverts to the default after a reboot. To make the change survive a reboot, the interface configuration file must also be modified:
vi /etc/sysconfig/network-scripts/ifcfg-lo

DEVICE=lo
IPADDR=127.0.0.1
NETMASK=255.0.0.0
NETWORK=127.0.0.0
# If you're having problems with gated making 127.0.0.0/8 a martian,
# you can change this to something else (255.255.255.255, for example)
BROADCAST=127.255.255.255
ONBOOT=yes
NAME=loopback
MTU=16436
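Finally, per the MOS note quoted above, restart the network service so the persistent setting is loaded, then confirm the MTU again; the verification command below is illustrative. Note that restarting the network service on a live RAC node briefly disrupts the interconnect, so schedule it in a maintenance window.

service network restart
ifconfig lo | grep MTU        # should now report MTU:16436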