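This post continues from the previous one, "Orchestrator源码解读2-故障失败发现" (Orchestrator source code walkthrough 2: failure detection), which covered how failures are detected. After Orchestrator (OC) collects status information from the managed databases, it runs one complex query against its own backend database, and some of the status values are already evaluated inside that SQL. The values returned by the query are stored in a struct, and the failure type is then determined mainly from that struct's fields. Only some of these failure types require handling.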
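To make the "decide from struct fields" step concrete, here is a minimal sketch in Go. It copies a handful of the master-related conditions from the table that follows. The type name `analysis` and the function `classify` are hypothetical names chosen for illustration, and only the fields used by these few conditions are modelled. The sketch also assumes the checks run top to bottom with the first match winning, which is why `DeadMasterWithoutReplicas` is tested before `DeadMaster`.

```go
package main

import "fmt"

// analysis is a hypothetical, trimmed-down stand-in for the struct that holds
// the result of the backend query. The field names follow the conditions
// quoted in the table; the type name is an assumption made for this sketch.
type analysis struct {
	IsMaster                      bool
	LastCheckValid                bool
	LastCheckPartialSuccess       bool
	CountReplicas                 uint
	CountValidReplicas            uint
	CountValidReplicatingReplicas uint
}

// classify is a hypothetical helper that returns the failure type name.
// Cases are evaluated in order, so an earlier case takes precedence.
func classify(a analysis) string {
	switch {
	case a.IsMaster && !a.LastCheckValid && a.CountReplicas == 0:
		return "DeadMasterWithoutReplicas"
	case a.IsMaster && !a.LastCheckValid && a.CountValidReplicas == a.CountReplicas && a.CountValidReplicatingReplicas == 0:
		return "DeadMaster"
	case a.IsMaster && !a.LastCheckValid && a.CountReplicas > 0 && a.CountValidReplicas == 0 && a.CountValidReplicatingReplicas == 0:
		return "DeadMasterAndReplicas"
	case a.IsMaster && !a.LastCheckValid && a.CountValidReplicas < a.CountReplicas && a.CountValidReplicas > 0 && a.CountValidReplicatingReplicas == 0:
		return "DeadMasterAndSomeReplicas"
	case a.IsMaster && !a.LastCheckValid && !a.LastCheckPartialSuccess && a.CountValidReplicas > 0 && a.CountValidReplicatingReplicas > 0:
		return "UnreachableMaster"
	}
	// Anything this sketch does not cover is reported as NoProblem here.
	return "NoProblem"
}

func main() {
	dead := analysis{IsMaster: true, CountReplicas: 3, CountValidReplicas: 3}
	unreachable := analysis{IsMaster: true, CountReplicas: 3, CountValidReplicas: 3, CountValidReplicatingReplicas: 2}
	healthy := analysis{IsMaster: true, LastCheckValid: true, CountReplicas: 3, CountValidReplicas: 3, CountValidReplicatingReplicas: 3}

	fmt.Println(classify(dead))        // DeadMaster
	fmt.Println(classify(unreachable)) // UnreachableMaster
	fmt.Println(classify(healthy))     // NoProblem
}
```

Without the ordering, a master with zero replicas would also satisfy the `DeadMaster` condition, since `CountValidReplicas == CountReplicas` and `CountValidReplicatingReplicas == 0` both hold when every count is zero.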
| Category | Failure type | Handler function | Source condition | Condition in words | Failure description | isActionableRecovery | Handler when isInEmergencyOperationGracefulPeriod | isActionableRecovery when isInEmergencyOperationGracefulPeriod |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | NoProblem | none | | Cluster is healthy | Healthy | FALSE | | |
| | DeadMasterWithoutReplicas | none | a.IsMaster && !a.LastCheckValid && a.CountReplicas == 0 | Instance is a master, its last check failed, and it has no replicas | Master is down and has no replicas | FALSE | | |
| | DeadMaster | checkAndRecoverDeadMaster | a.IsMaster && !a.LastCheckValid && a.CountValidReplicas == a.CountReplicas && a.CountValidReplicatingReplicas == 0 | Instance is a master, its last check failed, all replicas are alive, and no replica is replicating normally | Master is down; all replicas are alive but replication is broken on all of them | TRUE | checkAndRecoverGenericProblem | FALSE |
| | DeadMasterAndReplicas | checkAndRecoverGenericProblem | a.IsMaster && !a.LastCheckValid && a.CountReplicas > 0 && a.CountValidReplicas == 0 && a.CountValidReplicatingReplicas == 0 | Instance is a master, its last check failed, it has at least one replica, no replica is alive, and no replica is replicating normally | Master and all of its replicas are down | FALSE | | |
| | DeadMasterAndSomeReplicas | checkAndRecoverDeadMaster | a.IsMaster && !a.LastCheckValid && a.CountValidReplicas < a.CountReplicas && a.CountValidReplicas > 0 && a.CountValidReplicatingReplicas == 0 | Instance is a master, its last check failed, fewer replicas are alive than exist, at least one replica is alive, and no replica is replicating normally | Master and some of its replicas are down | TRUE | checkAndRecoverGenericProblem | FALSE |
| | UnreachableMasterWithLaggingReplicas | checkAndRecoverGenericProblem | a.IsMaster && !a.LastCheckValid && a.CountLaggingReplicas == a.CountReplicas && a.CountDelayedReplicas < a.CountReplicas && a.CountValidReplicatingReplicas > 0 | Instance is a master, its last check failed, all replicas are lagging, not all replicas are counted as delayed, and at least one replica is replicating normally | Master is unreachable and all replicas are lagging | FALSE | | |
| | UnreachableMaster | checkAndRecoverGenericProblem | a.IsMaster && !a.LastCheckValid && !a.LastCheckPartialSuccess && a.CountValidReplicas > 0 && a.CountValidReplicatingReplicas > 0 | Instance is a master, its last check failed without partial success, at least one replica is alive, and at least one replica is replicating normally | Orchestrator cannot connect to the master, but some replicas are still replicating normally | FALSE | | |
| | MasterSingleReplicaNotReplicating | | a.IsMaster && a.LastCheckValid && a.CountReplicas == 1 && a.CountValidReplicas == a.CountReplicas && a.CountValidReplicatingReplicas == 0 | Instance is a master, its last check succeeded, it has exactly one replica, that replica is alive, and no replica is replicating normally | Master is healthy and has a single replica, but that replica is not replicating normally | | | |
| | MasterSingleReplicaDead | | a.IsMaster && a.LastCheckValid && a.CountReplicas == 1 && a.CountValidReplicas == 0 | Instance is a master, its last check succeeded, it has exactly one replica, and no replica is alive | Master is healthy and has a single replica, but that replica is down | | | |
| | AllMasterReplicasNotReplicating | checkAndRecoverGenericProblem | a.IsMaster && a.LastCheckValid && a.CountReplicas > 1 && a.CountValidReplicas == a.CountReplicas && a.CountValidReplicatingReplicas == 0 | Instance is a master, its last check succeeded, it has more than one replica, all replicas are alive, and no replica is replicating normally | Master is healthy, but none of its replicas are replicating normally | FALSE | | |
| | AllMasterReplicasNotReplicatingOrDead | checkAndRecoverGenericProblem | a.IsMaster && a.LastCheckValid && a.CountReplicas > 1 && a.CountValidReplicas < a.CountReplicas && a.CountValidReplicas > 0 && a.CountValidReplicatingReplicas == 0 | | Master is healthy, but every replica is either not replicating normally or down | FALSE | | |
| Semi-synchronous replication | LockedSemiSyncMasterHypothesis | | | | | | | |
| | LockedSemiSyncMaster | checkAndRecoverLockedSemiSyncMaster | a.IsMaster && a.SemiSyncMasterEnabled && a.SemiSyncMasterStatus && a.SemiSyncMasterWaitForReplicaCount > 0 && a.SemiSyncMasterClients < a.SemiSyncMasterWaitForReplicaCount | | Semi-sync master is locked because it is not receiving acknowledgements from replicas | TRUE | checkAndRecoverGenericProblem | FALSE |
| | MasterWithTooManySemiSyncReplicas | checkAndRecoverMasterWithTooManySemiSyncReplicas | config.Config.EnforceExactSemiSyncReplicas && a.IsMaster && a.SemiSyncMasterEnabled && a.SemiSyncMasterStatus && a.SemiSyncMasterWaitForReplicaCount > 0 && a.SemiSyncMasterClients > a.SemiSyncMasterWaitForReplicaCount | | More semi-sync replicas than configured | TRUE | | |
| | MasterWithoutReplicas | | | | | | | |
| Co-Master | DeadCoMaster | checkAndRecoverDeadCoMaster | | | Orchestrator cannot reach the co-master and none of its replicas are replicating normally | TRUE | | |
| | DeadCoMasterAndSomeReplicas | checkAndRecoverDeadCoMaster | | | | TRUE | | |
| | UnreachableCoMaster | | | | | | | |
| | AllCoMasterReplicasNotReplicating | | | | | | | |
| Intermediate Master (intermediate master in cascading replication) | DeadIntermediateMaster | checkAndRecoverDeadIntermediateMaster | | | | TRUE | | |
| | DeadIntermediateMasterWithSingleReplica | checkAndRecoverDeadIntermediateMaster | | | | TRUE | | |
| | DeadIntermediateMasterWithSingleReplicaFailingToConnect | checkAndRecoverDeadIntermediateMaster | | | | TRUE | | |
| | DeadIntermediateMasterAndSomeReplicas | checkAndRecoverDeadIntermediateMaster | | | | TRUE | | |
| | DeadIntermediateMasterAndReplicas | checkAndRecoverGenericProblem | | | | FALSE | | |
| | UnreachableIntermediateMasterWithLaggingReplicas | checkAndRecoverGenericProblem | | | | FALSE | | |
| | UnreachableIntermediateMaster | | | | | | | |
| | AllIntermediateMasterReplicasFailingToConnectOrDead | checkAndRecoverDeadIntermediateMaster | | | | | | |
| | AllIntermediateMasterReplicasNotReplicating | | | | | | | |
| | FirstTierReplicaFailingToConnectToMaster | | | | | | | |
| | BinlogServerFailingToConnectToMaster | | | | | | | |
| Group replication | DeadReplicationGroupMemberWithReplicas | checkAndRecoverDeadGroupMemberWithReplicas | | | | TRUE | | |
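The handler columns of the table can be read as a lookup from failure type to a handler plus an isActionableRecovery flag. The sketch below is one way to express that lookup in Go, not Orchestrator's actual code. The function name `handlerFor` and the `inGracePeriod` parameter are hypothetical, handlers are represented by their names as plain strings, and the sketch assumes that the two trailing columns give the handler and flag used while isInEmergencyOperationGracefulPeriod holds. Only rows that list both a handler and an isActionableRecovery value are included.

```go
package main

import "fmt"

// handlerFor is a hypothetical lookup built from the table rows. It returns
// the handler name and the isActionableRecovery flag for a failure type.
// inGracePeriod stands in for the isInEmergencyOperationGracefulPeriod check.
func handlerFor(failureType string, inGracePeriod bool) (handler string, actionable bool) {
	switch failureType {
	case "DeadMaster", "DeadMasterAndSomeReplicas":
		if inGracePeriod {
			return "checkAndRecoverGenericProblem", false
		}
		return "checkAndRecoverDeadMaster", true
	case "LockedSemiSyncMaster":
		if inGracePeriod {
			return "checkAndRecoverGenericProblem", false
		}
		return "checkAndRecoverLockedSemiSyncMaster", true
	case "MasterWithTooManySemiSyncReplicas":
		return "checkAndRecoverMasterWithTooManySemiSyncReplicas", true
	case "DeadCoMaster", "DeadCoMasterAndSomeReplicas":
		return "checkAndRecoverDeadCoMaster", true
	case "DeadIntermediateMaster",
		"DeadIntermediateMasterWithSingleReplica",
		"DeadIntermediateMasterWithSingleReplicaFailingToConnect",
		"DeadIntermediateMasterAndSomeReplicas":
		return "checkAndRecoverDeadIntermediateMaster", true
	case "DeadReplicationGroupMemberWithReplicas":
		return "checkAndRecoverDeadGroupMemberWithReplicas", true
	case "DeadMasterAndReplicas",
		"UnreachableMaster",
		"UnreachableMasterWithLaggingReplicas",
		"AllMasterReplicasNotReplicating",
		"AllMasterReplicasNotReplicatingOrDead",
		"DeadIntermediateMasterAndReplicas",
		"UnreachableIntermediateMasterWithLaggingReplicas":
		return "checkAndRecoverGenericProblem", false
	}
	// Failure types without a listed handler fall through with no action.
	return "", false
}

func main() {
	for _, grace := range []bool{false, true} {
		h, ok := handlerFor("DeadMaster", grace)
		fmt.Printf("DeadMaster grace=%v -> %s actionable=%v\n", grace, h, ok)
	}
	// DeadMaster grace=false -> checkAndRecoverDeadMaster actionable=true
	// DeadMaster grace=true -> checkAndRecoverGenericProblem actionable=false
}
```

Read this way, the grace period downgrades a `DeadMaster` or `LockedSemiSyncMaster` finding from an actionable recovery to the generic, non-actionable handler.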