前几天一个朋友说他们的数据库存储坏了,但是存储修好之后,数据库无法启动了,让我帮忙诊断一下。想想自己已经很久很久没处理过这么复杂的Case了,简单记录分析一下。就当怀旧吧,毕竟这么老的环境也不多见了。
首先看alert log,发现最早的记录是2010年3月26号。 想想2010年的自己,还是一个年轻羞涩的帅小伙!时光如梭!已经10多年过去了。。。
Fri Mar 26 14:31:36 2010
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 lan901 10.1.1.0 configured from OCR for use as a cluster interconnect
Interface type 1 lan900 192.120.1.0 configured from OCR for use as a public interface
Shared memory segment for instance monitoring created
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as /xxx/10.2/db_1/dbs/arch
Autotune of undo retention is turned on.
IMODE=BR
ILAT =36
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.4.0.
System parameters with non-default values:
processes = 300
sga_target = 1610612736
control_files = /dev/vgxxxx/rrac_control01, /dev/vgxxxx/rrac_control02, /dev/vgxxxx/rrac_control03
db_block_size = 8192
compatible = 10.2.0.3.0
?我们继续看数据库启动时刻的报错:
Fri Dec 22 18:03:55 2023
ALTER DATABASE MOUNT
Fri Dec 22 18:03:55 2023
This instance was first to mount
Fri Dec 22 18:03:59 2023
Errors in file /users/oracle/admin/xxxx/udump/xxxx1_ora_19376.trc:
ORA-00600: internal error code, arguments: [kccpb_sanity_check_2], [17033858], [17033853], [0x000000000], [], [], [], []
ORA-600 signalled during: ALTER DATABASE MOUNT...
Fri Dec 22 18:04:00 2023
作为技术经验比较老道的老司机,一眼就知道这是控制文件损坏了,大家网上应能搜到不少的案例,如果我们去查官方解释,那么是这样的:
ORA-600 [kccpb_sanity_check_2] indicates that the seq# of the last read block is
higher than the seq# of the control file header block. This is indication of
the lost write of the header block during commit of the previous cf?transaction.
问用户说他们没有数据库备份(实际上数据库是非归档),那么既然是这样,Oracle官方推荐的利用备份恢复控制文件的方法肯定是行不通了。
那我们先来人工重建数据库控制文件吧。对于重建控制文件的脚本,大家都知道,其有几个关键的点:dbname、数据库是否归档、字符集。前面2个信息很容易从数据库alert log去获取。那么数据库字符集呢?数据库都无法打开,自然是无法去查数据库视图的。
以前如果学过我备份恢复课程的同学一定知道,我之前讲过,Oracle的字符集等信息存储在props$这个数据库字典表中,而在Oracle?10g版本中,该字典表存储在file 1 block 722 block中。因此我们可以这样来获取字符集信息:
dd if=/xxx/xxx/system01.dbf of=/tmp/dd.722 bs=8192 skip=722 count=1
strings /tmp/dd.722| more
可能是很久很久没玩过裸设备的环境了,发现上面的命令居然不行,然后我搜索alert log发现,居然能看到10多年的数据库创建信息:
Fri Mar 26 14:31:38 2010
CREATE DATABASE "xxx"
CONTROLFILE REUSE
MAXINSTANCES 32
MAXLOGHISTORY 1
MAXLOGFILES 192
MAXLOGMEMBERS 3
MAXDATAFILES 1024
DATAFILE '/dev/xxxx/rrac_system' SIZE 4000M
EXTENT MANAGEMENT LOCAL
SYSAUX DATAFILE '/dev/xxxx/rrac_sysaux' SIZE 4000M
SMALLFILE DEFAULT TEMPORARY TABLESPACE TEMP TEMPFILE '/dev/xxxx/rrac_temp' SIZE 4000M
SMALLFILE UNDO TABLESPACE "UNDOTBS1" DATAFILE '/dev/xxxx/rrac_undotbs1' SIZE 4000M
CHARACTER SET ZHS16CGB231280
NATIONAL CHARACTER SET AL16UTF16
LOGFILE GROUP 1 ('/dev/xxxx/rrac_redo01') SIZE 500M,
GROUP 2 ('/dev/xxxx/rrac_redo02') SIZE 500M,
GROUP 5 ('/dev/xxxx/rrac_redo03') SIZE 500M,
GROUP 6 ('/dev/xxxx/rrac_redo13') SIZE 500M
USER SYS IDENTIFIED BY *USER SYSTEM IDENTIFIED BY
Fri Mar 26 14:31:38 2010
当我们知道了这些信息之后,就可以编辑重建controlfile的脚本了。从上面的信息,我们可以看到,使用的是裸设备;从2010年至今,这套Oracle rac环境运行时间超过了13年了(hp unix serviceGuard做的集群)。想想用户提供的这个alert log居然就10Mb不到,居然存了13年的告警信息;想到这里,我再回想一下OceanBase的log,简直头大。创建控制文件的脚本如下:
CREATE CONTROLFILE REUSE DATABASE "xxxx" NORESETLOGS NOARCHIVELOG
MAXLOGFILES 20
MAXLOGMEMBERS 3
MAXDATAFILES 1024
MAXINSTANCES 16
MAXLOGHISTORY 292
LOGFILE
GROUP 1 '/dev/xxxx/rrac_redo01' SIZE 500M,
GROUP 2 '/dev/xxxx/rrac_redo02' SIZE 500M,
GROUP 5 '/dev/xxxx/rrac_redo03' SIZE 500M,
GROUP 3 '/dev/xxxx/rrac_redo11' SIZE 500M,
GROUP 4 '/dev/xxxx/rrac_redo12' SIZE 500M,
GROUP 6 '/dev/xxxx/rrac_redo13' SIZE 500M
DATAFILE
'/dev/xxxx/rrac_system',
'/dev/xxxx/rrac_sysaux',
......
'/dev/xxxx/ropen_sample_x32',
'/dev/xxxx/ropen_sample_x31',
'/dev/xxxx/ropen_sample_x40',
'/dev/xxxx/ropen_sample_x33',
'/dev/xxxx/rrac_undotbs1',
'/dev/xxxx/rrac_undotbs2'
CHARACTER SET ZHS16CGB231280
;
创建完控制文件之后,此次此刻就可以Mount数据库实例了。然而我查下了下v$datafile /v$datafile_header发现checkpoint_change#差别非常大,高达数百万之多。
这种无备份,无归档的情况之下,scn差距如此之大,基本上只能去强制打开数据库了。我尝试使用recover database using backup controlfile until cancel?恢复之后;打开数据库时报了如下错误:
Fri Dec 22 21:27:39 2023
SMON: enabling cache recovery
Fri Dec 22 21:27:39 2023
Instance recovery: looking for dead threads
Instance recovery: lock domain invalid but no dead threads
Fri Dec 22 21:27:39 2023
Errors in file /users/oracle/admin/xxxx/udump/xxxx1_ora_2702.trc:
ORA-00600: internal error code, arguments: [kcbgtcr_13], [], [], [], [], [], [], []
Fri Dec 22 21:27:42 2023
Errors in file /users/oracle/admin/xxxx/udump/xxxx1_ora_2702.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00600: internal error code, arguments: [kcbgtcr_13], [], [], [], [], [], [], []
Fri Dec 22 21:27:42 2023
Error 704 happened during db open, shutting down database
USER: terminating instance due to error 704
Instance terminated by USER, pid = 2702
ORA-1092 signalled during: alter database open resetlogs...
很明显Oracle在open的时刻,进行bootstrap初始化就失败了。从报错的ORA-00600?kcbgtrc来看,这是在构造一致性读。说明该过程中读到了一些异常的数据块。
要解决这个问题也不难,尝试禁止SMON 回滚(使用event 10510/10511/10512),并结合_allow_corruption_resetlogs、undo_management=manual参数即可打开数据库;然后重建undotbs1/undotbs2 即可,并add tempfile reuse。
本以为很容易就打开数据库了,应该没啥事儿了,可以收工了。习惯性尝试create table test,虽然能进行dml,但是无法commit也无法进行drop table操作。此时去看alert log会发现类似的一些错误:
Fri Dec 22 22:24:50 2023
Errors in file /users/oracle/admin/xxxx/udump/xxxx1_ora_8198.trc:
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Fri Dec 22 22:30:08 2023
Errors in file /users/oracle/admin/xxxx/udump/xxxx1_ora_9362.trc:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []
Fri Dec 22 22:38:40 2023
MMNL absent for 1220 secs; Foregrounds taking over
从上述的错误我们大致就知道,必然是有一些数据字典表和索引不一致了。对于这类错误,我们可以通过如下操作来分析,观察具体在哪里不一致。
SQL> oradebug setmypid
Statement processed.
SQL>
SQL> analyze table obj$ validate structure cascade;
analyze table obj$ validate structure cascade
*
ERROR at line 1:
ORA-01499: table/index cross reference failure - see trace file
这下麻烦了。Oracle最核心的obj$出问题了,另外看日志还发现dependency$?的索引也异常了。对于dependency$?而言,处理相对简单,很早 之前在博客(www.killdb.com)写过使用bbed来处理的方法。实际上数据库如果能open的情况之下,那么可以直接drop 并重建dependency$上面的2个索引即可(设置event 38003?并重启实例后可以操作)。
然而对于obj# <56?的这部分核心数据字典对象,是无法通过event 38003来处理的。我们来看看analayze分析的结果:
row mismatch in index tsn: 0 rdba: 0x004000e1
env: (scn: 0x0007.9a76fa55 xid: 0x0000.000.00000000 uba: 0x00000000.0000.00 statement num=0 parent xid: xid: 0x0000.000.00000000 scn: 0x0000.00000000 8s
ch: scn: 0x0000.00000000)col 0; len 1; (1): 80
col 1; len 18; (18): 57 52 48 24 5f 46 49 4c 45 53 54 41 54 58 53 5f 50 4b
col 2; len 2; (2): c1 05
col 3; NULL
col 4; NULL
col 5; len 29; (29):
57 52 48 24 5f 46 49 4c 45 53 54 5f 33 36 35 31 39 38 37 31 34 36 5f 31 31
37 35 36 31
rowid = 00476487.2f
Block header dump: 0x00476434
Object id on Block? Y
seg/obj: 0x12 csc: 0x07.9a116a07 itc: 1 flg: - typ: 1 - DATA
fsl: 0 fnx: 0x0 ver: 0x01
Itl Xid Uba Flag Lck Scn/Fsc
0x01 0x0005.022.009aee83 0x0083af02.0c73.05 --U- 2 fsc 0x0075.9a116a15
很明显,这里已经提示我们,row mismatch了,而且obj $的这个block 似乎也有点异常,还有2行lock记录,这里通过bbed直接处理掉(处理掉之后,解决了smon进程回滚遇到的一些错误,包括alert log中的一些ora-01555错误。如下所示:
SMON: following errors trapped and ignored:
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 5 with name "_SYSSMU5$" too small
*** 2023-12-23 18:18:01.898
WAIT #0: nam='smon timer' ela= 9790111 sleep time=10 failed=67 p3=0 obj#=1 tim=157899982335
=====================
PARSING IN CURSOR #4 len=294 dep=1 uid=0 oct=3 lid=0 tim=157899982612 hv=2327026800 ad='1e597860'
select o.owner#,o.obj#,decode(o.linkname,null, decode(u.name,null,'SYS',u.name),o.remoteowner), o.name,o.linkname,o.namespace,o.subname from user$ u, obj$ o where u.user#(+)=o.owner# and o.type#=:1 and not exists (select p_obj# from dependency$ where p_obj# = o.obj#) order by o.obj# for update
END OF STMT
PARSE #4:c=0,e=46,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,tim=157899982608
BINDS #4:
kkscoacd
Bind#0
oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
oacflg=08 fl2=0001 frm=00 csi=00 siz=24 off=0
kxsbbbfp=9ffffffffd321588 bln=22 avl=02 flg=05
value=10
WAIT #4: nam='db file sequential read' ela= 89 file#=2 block#=241410 blocks=1 obj#=1 tim=157900007254
EXEC #4:c=30000,e=24700,p=1,cr=1655,cu=0,mis=0,r=0,dep=1,og=4,tim=157900007428
ERROR #4:err=1555 tim=16168960
为什么一定要去处理obj$,我发现上述的错误是非常奇怪的。实际上到这里,我已经把之前的undotbs1/undotbs2 都drop重建了。
Execution Plan
----------------------------------------------------------
Plan hash value: 1234705684
-----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 110 | 349 (3)| 00:00:05 |
| 1 | SORT ORDER BY | | 1 | 110 | 349 (3)| 00:00:05 |
| 2 | NESTED LOOPS OUTER | | 1 | 110 | 348 (2)| 00:00:05 |
|* 3 | HASH JOIN ANTI | | 1 | 94 | 347 (2)| 00:00:05 |
|* 4 | TABLE ACCESS FULL | OBJ$ | 75 | 6675 | 316 (2)| 00:00:04 |
| 5 | INDEX FAST FULL SCAN| I_DEPENDENCY2 | 26382 | 128K| 29 (0)| 00:00:01 |
| 6 | TABLE ACCESS CLUSTER | USER$ | 1 | 16 | 1 (0)| 00:00:01 |
|* 7 | INDEX UNIQUE SCAN | I_USER# | 1 | | 0 (0)| 00:00:01 |
-----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("P_OBJ#"="O"."OBJ#")
4 - filter("O"."TYPE#"=10)
7 - access("U"."USER#"(+)="O"."OWNER#")
这里为什么Oracle在查询的时候还会去读file 2 block 241410呢,很明显这是应该是之前的undo信息。因为此时我尝试dump这个undo block,发现是是一个未格式化的block。按理说已经处理掉执行计划中的几个表和索引,且都analyze过没问题。所以一定是问题出在obj$上。
尽管如此,改了几个block,包括将上述报错的回滚段drop掉,我发现exp数据的时候,仍然会报错ora-01555,那么大概率是读写的部分block还是有一些异常,既然如此就推一下scn(因为看alert log还发现了零星的ora-00600 2662 错误)。
SQL> startup mount pfile='/tmp/1.ora';
ORACLE instance started.
Total System Global Area 1610612736 bytes
Fixed Size 2056504 bytes
Variable Size 385879752 bytes
Database Buffers 1207959552 bytes
Redo Buffers 14716928 bytes
Database mounted.
SQL> oradebug setmypid
Statement processed.
SQL> oradebug DUMPvar SGA kcsgscn_
kcslf kcsgscn_ [C0000000C0E197E8, C0000000C0E19818) = 00000000 00000009 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 C0000000 ...
SQL> ORADEBUG POKE 0xC0000000C0E197E8 4 0x0000001E
BEFORE: [C0000000C0E197E8, C0000000C0E197EC) = 00000000
AFTER: [C0000000C0E197E8, C0000000C0E197EC) = 0000001E
SQL> oradebug DUMPvar SGA kcsgscn_
kcslf kcsgscn_ [C0000000C0E197E8, C0000000C0E19818) = 0000001E 00000009 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 C0000000 ...
SQL> alter database open;
Database altered.
最开始本想使用event 10015 来推scn,发现并不起作用,既然如此就使用oradebug poke操作吧,需要注意的是,hp unix环境的scn?高位在前面,因此poke的时候注意level(16进制)。
在恢复过程中,发现alert log陆续还有不少的ora-00600错误,比如:
Non-fatal internal error happenned while SMON was doing non-existent object cleanup.
SMON encountered 7 out of maximum 100 non-fatal internal errors.
Errors in file /users/oracle/admin/xxxx/bdump/xxxx1_smon_5214.trc:
ORA-00600: internal error code, arguments: [25027], [1], [8630018], [], [], [], [], []
以及一些跟数据坏块相关的错误。
在exp 数据的过程中,发现有些表有问题会导致数据库crash,最终还是要通过bbed将obj$的索引删除掉,会减少一些错误。然而数据字典基表的索引被删除,也会有一些副作用,比如你的递归sql查询效率会大幅降低。
我想以后很难再遇到这样的Case了,简单记录一下!
备注:我在想,如果国产数据库遇到极端的灾难情况,我们是否有一些手段来抢救数据呢?比如OceanBase、达梦、openGauss?