Friday, March 18, 2016

(DB35) Cellsrv crash Leads to Exadata Hang and database corruption.

I didn't realize how serious this bug was until I saw its outcome. The fix looks simple, but the bug had a broad impact on the Exadata cluster.

(DB35) Cellsrv process on storage servers may crash when performing a cell-to-cell offload operation (Doc ID 2074393.1)

As per MOS note Doc ID 2074393.1, the following combination of versions triggers a bug that crashes cellsrv during a cell-to-cell offload operation.

-Exadata Storage Server software version is <= 12.1.2.1.3.
-Grid Infrastructure version is 12.1.0.2, lower than 12.1.0.2.160119.
-Check the Grid home for the fixes for bugs 21218243 and 22304421:
 /u01/app/12.1.0.2/grid/OPatch/opatch lsinventory -bugs_fixed -oh /u01/app/12.1.0.2/grid | egrep '21218243|22304421'
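
The opatch check above can be wrapped in a small helper that reports each bug number separately. This is only a minimal sketch: the function name `report_fixes` is hypothetical, and it does nothing more than look for the two bug numbers as whole words in whatever text it reads on standard input. It makes no assumption about the layout of the opatch output.

```shell
# Hypothetical helper (not part of opatch): reads the output of
# `opatch lsinventory -bugs_fixed` on stdin and prints, for each bug
# number of interest, whether that number appears anywhere in the text.
report_fixes() (
    inventory=$(cat)                      # capture all of stdin once
    for bug in 21218243 22304421; do
        if printf '%s\n' "$inventory" | grep -qw -- "$bug"; then
            echo "$bug: listed"
        else
            echo "$bug: not listed"
        fi
    done
)
```

It would be used by piping the same command into it, for example `/u01/app/12.1.0.2/grid/OPatch/opatch lsinventory -bugs_fixed -oh /u01/app/12.1.0.2/grid | report_fixes`.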

Once it occurs, expect one of the following ORA-00600 errors in the cell alert log.

ORA-00600: [FLASHCACHE::ISSUEIO INV PINPAGE(R) RESULT]
ORA-00600: [CopyFromRemote::processWaitForSendCompl:invFCFlags]
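
To look for these two signatures in a cell alert log, a fixed-string grep is enough. The sketch below is an illustration under assumptions: `scan_cell_alert` is a hypothetical helper name, and the log file path must be supplied by the caller, since no alert log location is given here.

```shell
# Hypothetical helper: print every line of the given log file that contains
# one of the two ORA-00600 signatures, prefixed with its line number.
# -F treats the patterns as fixed strings, so the brackets, parentheses and
# colons in the signatures are matched literally.
# Returns non-zero (grep's exit status) when neither signature is found.
scan_cell_alert() {
    grep -n -F \
        -e 'FLASHCACHE::ISSUEIO INV PINPAGE(R) RESULT' \
        -e 'CopyFromRemote::processWaitForSendCompl:invFCFlags' \
        -- "$1"
}
```

Call it with the path to the alert log as the only argument, for example `scan_cell_alert /path/to/cell/alert.log` (the path here is a placeholder).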

-Every time ORA-600 [CopyFromRemote::processWaitForSendCompl:invFCFlags] occurred, the cell server processes crashed and were restarted. While they were down, both copies of some files were unavailable.
 A diskgroup is dismounted if both copies of any file are unavailable, so the DATA diskgroup, where all the datafiles are located, was dismounted and the databases failed.


- Bug 22304421 : 
------------ 
When ASM data relocation processes start up, they send cell-to-cell offload IOCTLs with flash cache hints, but cellsrv versions lower than 12.1.2.2.0 crash when they receive these hints.
ORA-00600: internal error code, arguments: [CopyFromRemote::processWaitForSendCompl:invFCFlags], [4], [4], [0x60015A622618], [], [], [], [], [], [], [], []

Result
------
-Exadata cluster hangs/freezes.
-Voting / OCR disk corruption.
-Database corruption occurs (ORA-01578: ORACLE data block corrupted).