Environment: OEL 6.8, Oracle 11.2.0.4 two-node RAC.
The alert log on node 2 reports the following errors:
Fri Nov 24 09:11:42 2017
skgxpvfynet: mtype: 61 process 11799 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_ora_11799.trc (incident=123381):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_123381/orcl2_ora_11799_i123381.trc
Fri Nov 24 09:11:42 2017
skgxpvfynet: mtype: 61 process 11801 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
opiodr aborting process unknown ospid (11743) as a result of ORA-603
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_ora_11801.trc (incident=123382):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_123382/orcl2_ora_11801_i123382.trc
Dumping diagnostic data in directory=[cdmp_20171124091142], requested by (instance=2, osid=11743), summary=[incident=123380].
The corresponding trace files:
# /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_ora_11799.trc
*** 2017-11-24 09:11:42.123
*** CLIENT ID:() 2017-11-24 09:11:42.123
*** SERVICE NAME:() 2017-11-24 09:11:42.123
*** MODULE NAME:() 2017-11-24 09:11:42.123
*** ACTION NAME:() 2017-11-24 09:11:42.123
SKGXP:[7fbcc2bfda88.0]{0}: SKGXPVFYNET: Socket self-test could not verify successful transmission of 32768 bytes (mtype 61).
SKGXP:[7fbcc2bfda88.1]{0}: The network is required to support UDP protocol sends of this size. Socket is bound to 169.254.188.234.
SKGXP:[7fbcc2bfda88.2]{0}: phase 'send', 0 tries, 100 loops, 4905 ms (last)
struct ksxpp * ksxppg_ [0xc122540, 0x7fbcc2995310) = 0x7fbcc2995308
Dump of memory from 0x00007FBCC2995308 to 0x00007FBCC2996838

# /u01/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_123381/orcl2_ora_11799_i123381.trc
Dump continued from file: /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_ora_11799.trc
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurr
#========= Dump for incident 123381 (ORA 603) ========
A search on MOS turned up Oracle Linux: ORA-27301: OS Failure Message: No Buffer Space Available (Doc ID 2041723.1), which indicates the problem is likely caused by the NIC's MTU being set too high, leaving the interface short of buffer space.
CAUSE
This happens due to less space available for network buffer reservation.

SOLUTION
1. On servers with High Physical Memory, the parameter vm.min_free_kbytes should be set in the order of 0.4% of total Physical Memory. This helps in keeping a larger range of defragmented memory pages available for network buffers, reducing the probability of low-buffer-space conditions.
   *** For example, on a server which is having 256GB RAM, the parameter vm.min_free_kbytes should be set to 1048576.
   *** On NUMA Enabled Systems, the value of vm.min_free_kbytes should be multiplied by the number of NUMA nodes since the value is to be split across all the nodes. On NUMA Enabled Systems, the value of vm.min_free_kbytes = n * 0.4% of total Physical Memory. Here 'n' is the number of NUMA nodes.
2. Additionally, the MTU value should be modified as below:
   # ifconfig lo mtu 16436
   To make the change persistent over reboot, add the following line in the file /etc/sysconfig/network-scripts/ifcfg-lo:
   MTU=16436
   Save the file and restart the network service to load the changes:
   # service network restart
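As a rough illustration of the vm.min_free_kbytes arithmetic above, the shell sketch below derives 0.4% of physical memory, multiplies it by the NUMA node count, and applies the value with sysctl. It is a sketch only, not part of the MOS note: the numactl-based node detection and the rounding are assumptions, and the result should be sanity-checked before it is applied on a production RAC node.

# Sketch: vm.min_free_kbytes = 0.4% of RAM x NUMA node count (assumes numactl is installed)
total_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)               # physical memory in KB
nodes=$(numactl --hardware 2>/dev/null | awk '/available:/ {print $2}')
[ -z "$nodes" ] && nodes=1                                          # fall back to a single node
min_free=$(( total_kb * 4 / 1000 * nodes ))                         # 0.4% per NUMA node
echo "setting vm.min_free_kbytes=$min_free"
sysctl -w vm.min_free_kbytes=$min_free
# Add "vm.min_free_kbytes = <value>" to /etc/sysctl.conf to make it persistent.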
This appears to be an error specific to the OEL operating system. Comparing OEL, RHEL, and CentOS shows that only OEL sets the loopback interface's MTU to 65536, while the other systems use 16436, and the MOS solution is precisely to change the loopback MTU to 16436. The loopback MTU on this server is currently set as follows:
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:1193604960 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1193604960 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:498857656538 (464.5 GiB)  TX bytes:498857656538 (464.5 GiB)
The loopback MTU on this server is still at the default of 65536, so change it as described in the MOS document:
ifconfig lo mtu 16436
This command modifies the in-memory interface parameter and takes effect immediately; the loopback interface now shows:
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:1193606824 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1193606824 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:498858566038 (464.5 GiB)  TX bytes:498858566038 (464.5 GiB)
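For a quick spot check of just the MTU value, without the full ifconfig output, it can also be read straight from sysfs; the path below is the standard location on OEL 6 kernels:

# Read the loopback MTU directly from sysfs
cat /sys/class/net/lo/mtu
16436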
However, the value reverts to the default after a reboot. To make the change survive a reboot, the interface configuration file must also be modified:
vi /etc/sysconfig/network-scripts/ifcfg-lo

DEVICE=lo
IPADDR=127.0.0.1
NETMASK=255.0.0.0
NETWORK=127.0.0.0
# If you're having problems with gated making 127.0.0.0/8 a martian,
# you can change this to something else (255.255.255.255, for example)
BROADCAST=127.255.255.255
ONBOOT=yes
NAME=loopback
MTU=16436
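Finally, per the MOS note quoted above, restart the network service so the persistent setting is loaded, then confirm the MTU again; the verification command below is illustrative. Note that restarting the network service on a live RAC node briefly disrupts the interconnect, so schedule it in a maintenance window.

service network restart
ifconfig lo | grep MTU        # should now report MTU:16436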