© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.

ARC316-R
Operational excellence: Building resilient systems

Islam Ghanim, Principal Technical Account Manager
Will Laws, Senior Solutions Architect

How operations eat resilience for breakfast

Why systems fail

HARDWARE: Disk faults, clock skew
SOFTWARE: Kernel, OS
OPERATIONS: Deployment, logs
ENVIRONMENT: Power, networking

The perception

Hardware & software: Compute, data, code
Operations: Logs, metrics, deployments
Afterthought
Stick with defaults
Large time investment
Minimal time investment
Resilience
Extensive functional and non-functional testing

Why systems “actually” fail

Failure causes: Operations 42%, Software 25%, Hardware 18%, Environment 15%

Source: “Why Do Computers Stop and What Can Be Done About It?” Jim Gray, Tandem Computers, 1985

Failure Mode    Probability    MTBF
Operations      42%            31 Years
Software        25%            50 Years
Hardware        18%            73 Years
Environment     15%            87 Years

“The techniques for fault-tolerant hardware are well documented and quite successful. Dealing with system configuration, operations, and maintenance remains an unsolved problem.”
Jim Gray, 1985, “Why Do Computers Stop and What Can Be Done About It?”

MTBF, MTTD, and MTTR

[Timeline: Issue, Detection, Repair, Issue]

MTBF, MTTD, and MTTR

MTTD
Lower MTTD
DETE