Wednesday, February 25, 2015

RS-700 Celloflsrv hang detected on Exadata 12.1.1.1

RS-700 [Celloflsrv hang detected. It will be terminated] [SYS_121111_140712] [] [] [] [] [] [] [] [] [] []
Server Model: Oracle Corporation SUN SERVER X4-2L High Capacity
Release Version: 12.1.1.1.1
Release Label: OSS_12.1.1.1.1_LINUX.X64_140712

This is Bug 19132065: on Oracle Linux, semtimedop() wakeups by timeout lag, which causes offload operations to fail (and may degrade performance) and raises errors similar to one or more of the following:
- ORA-700 [Offload issue job timed out]
- ORA-700 [Offload group not open]
- RS-700 [Celloflsrv hang detected. It will be terminated]
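
To see how often these errors show up, a small helper along these lines can count them in a log file. This is only a sketch: the function name count_offload_errors is made up for illustration, the log path is whatever alert log you pass in, and the exact message text in your logs may differ from the strings above.

```shell
#!/bin/sh
# count_offload_errors LOGFILE
# Hypothetical helper: prints "<count> <message>" for each known error string.
count_offload_errors() {
    log=$1
    for pattern in \
        'ORA-700 [Offload issue job timed out]' \
        'ORA-700 [Offload group not open]' \
        'RS-700 [Celloflsrv hang detected'
    do
        # -F treats the pattern as a fixed string, so the brackets are literal;
        # -c prints the number of matching lines (0 if there are none).
        n=$(grep -c -F -- "$pattern" "$log")
        printf '%s %s\n' "$n" "$pattern"
    done
}
```

Call it as count_offload_errors /path/to/alert.log on a storage server or DB node, using the alert log location of your own installation.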

This bug affects Exadata storage server software version 12.1.1.1.
It is caused by delayed RCU processing on the DB nodes, which makes offload jobs fail in the cell services.
It affects database performance, not availability.
The error occurs mostly when cellsrv attempts a read optimization.
Reducing the RCU delay is the workaround across the whole stack.

Step 1: Set rcu_delay for runtime

# echo 1 > /proc/sys/kernel/rcu_delay
Verify the setting
# cat /proc/sys/kernel/rcu_delay
1
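
Step 1 can also be wrapped in a small function that sets the value and reads it back. This is a sketch only: the name set_rcu_delay and the optional path argument are assumptions added so the logic can be tried against a scratch file before touching /proc.

```shell
#!/bin/sh
# set_rcu_delay [PATH]
# Hypothetical helper: writes 1 to the rcu_delay tunable and verifies it.
# PATH defaults to /proc/sys/kernel/rcu_delay; must run as root for the real file.
set_rcu_delay() {
    target=${1:-/proc/sys/kernel/rcu_delay}
    if [ ! -e "$target" ]; then
        echo "$target not found; this kernel may not expose rcu_delay" >&2
        return 1
    fi
    echo 1 > "$target" || return 1
    # Read the value back, as in the manual "cat" check.
    current=$(cat "$target")
    if [ "$current" = "1" ]; then
        echo "rcu_delay is now 1"
    else
        echo "rcu_delay is $current, expected 1" >&2
        return 1
    fi
}
```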

Step 2: Set rcu_delay in /etc/sysctl.conf so the setting persists across reboots

Add "kernel.rcu_delay=1" to /etc/sysctl.conf
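
If you would rather script the edit than do it by hand, something like the following should work. It is a sketch under assumptions: the function name persist_rcu_delay and the optional file argument are made up for illustration, and it relies on sed accepting -i.bak (as GNU sed on Oracle Linux does). It updates an existing kernel.rcu_delay line or appends one, and leaves a .bak copy of the file.

```shell
#!/bin/sh
# persist_rcu_delay [FILE]
# Hypothetical helper: makes sure FILE (default /etc/sysctl.conf) contains
# exactly the line "kernel.rcu_delay=1". Safe to run more than once.
persist_rcu_delay() {
    conf=${1:-/etc/sysctl.conf}
    if grep -q '^[[:space:]]*kernel\.rcu_delay[[:space:]]*=' "$conf"; then
        # An entry already exists: rewrite it in place, keeping a backup.
        sed -i.bak 's/^[[:space:]]*kernel\.rcu_delay[[:space:]]*=.*/kernel.rcu_delay=1/' "$conf"
    else
        cp -p "$conf" "$conf.bak"
        # If the last line has no trailing newline, add one before appending.
        if [ -n "$(tail -c 1 "$conf")" ]; then
            echo >> "$conf"
        fi
        printf 'kernel.rcu_delay=1\n' >> "$conf"
    fi
}
```

Run it as root with no argument to edit /etc/sysctl.conf, or pass a copy of the file first to check the result.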

Step 3: Restart cellsrv on storage servers

CellCLI> alter cell restart services cellsrv;
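
With several storage servers, the same restart can be issued from one host in a loop. The sketch below assumes passwordless root ssh to each cell and a hypothetical cell list file with one hostname per line; the function name and the RUN variable are likewise made up for illustration. By default it only prints the commands, and it runs them only when RUN=1 is set.

```shell
#!/bin/sh
# restart_cellsrv_on_cells CELL_LIST_FILE
# Hypothetical helper: CELL_LIST_FILE holds one storage server hostname per line.
# Prints the ssh commands by default; set RUN=1 to actually execute them.
restart_cellsrv_on_cells() {
    cell_list=$1
    cmd="cellcli -e 'alter cell restart services cellsrv'"
    while IFS= read -r cell; do
        # Skip blank lines and comment lines in the list file.
        case $cell in ''|\#*) continue ;; esac
        if [ "${RUN:-0}" = "1" ]; then
            # -n keeps ssh from consuming the rest of the list on stdin.
            ssh -n "root@$cell" "$cmd" || return 1
        else
            echo "ssh root@$cell \"$cmd\""
        fi
    done < "$cell_list"
}
```

The loop does not check grid disk status between restarts, so treat it as a starting point rather than a finished procedure.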


This workaround is automatically applied in the following cases:
When a new system is deployed with Exadata 11.2.3.3.1 or 12.1.1.1.1 using OEDA Sep 2014 or later.
When storage servers are upgraded to 11.2.3.3.1 or 12.1.1.1.1 and the patchmgr plugins patch is properly staged before running patchmgr, as documented.
When database servers are upgraded to 11.2.3.3.1 or 12.1.1.1.1 using dbnodeupdate.sh v3.58 or later.

References https://www.kernel.org/doc/Documentation/RCU/whatisRCU.txt