
Alert Fatigue in Hyperscale Environments: A Metrics-Based Approach to Signal Tuning in Open Networks.pdf

Uploader: 明**** ID: 1011993 2025-12-21 26 pages 1.81 MB

Alert Fatigue in Hyperscale Environments: A Metrics-Based Approach to Signal Tuning in Open Networks
Pooja Gupte, Networking

It's 2:14 am and you are on-call
The Problem: Alert Fatigue

Scale of Hyperscale cloud
Internet, WAN, Region, Region, DC Network
Retail, Enterprise, Media & Entertainment customers
Backbone Transport, Routing Domain, Resiliency & Redundancy
50 geographic regions worldwide
Millions of physical network devices

Why Hyperscale Makes It Worse: Failures
Telemetry lag, correlation failure, configuration drift
Port failure, misconfigured VLAN, control plane crash
NIC failure, OS/driver crash, server-to-ToR link flap
Link congestion, spine switch outage, packet drops due to buffer exhaustion

When One Small Failure Becomes a Storm
The Noise

Introducing the solution: The Shift
Metrics like TTD, TTM and TTN help focus on what matters
TTD (Time to Detect): event start time to alert creation time
TTM (Time to Mitigate): time it took to mitigate the issue and stop customer impact
TTN (Time to Notify): time it took to notify the customer about the impact of this event on their workloads

Metrics Driven Solution
Telemetry flow
ToR, Spine, Fabric switches
Port counters, error rates, link state
Statistical/ML Driven
AI/Agentic workflows

Metrics Driven Solution: TTD

| timestamp | dc | rack | device | layer | alert_name | severity |
|---|---|---|---|---|---|---|
| 2025-08-21 10:00:00 | DC-01 | R04 | Server-213 | Server | Packet retransmission errors | warning |
| 2025-08-21 10:04:00 | DC-01 | R04 | Server-219 | Server | High latency observed | warning |
| 2025-08-21 10:14:00 | DC-01 | R04 | ToR-17 | ToR | Link flap detected | critical |
| 2025-08-21 10:15:00 | DC-01 | R04 | ToR-17 | ToR | Interface errors rising | major |
| 2025-08-21 10:18:00 | DC-01 | R04 | Spine-05 | Spine | Northbound congestion | warning |
| 2025-0… | | | | | | |
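The three metrics named in the slides (TTD, TTM, TTN) can be sketched as simple timestamp differences. This is only an illustration, not the speaker's implementation: the `Incident` class, its field names, and the example timestamps are hypothetical. The slides define TTD as event start time to alert creation time; measuring TTM and TTN from the event start as well is an assumption made here, since the preview does not state their reference point.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Timestamps for one incident. All field names are hypothetical."""
    event_start: datetime        # when the underlying fault began
    alert_created: datetime      # when the alert was created
    mitigated: datetime          # when customer impact stopped
    customer_notified: datetime  # when the customer was told about the impact


def ttd(inc: Incident) -> timedelta:
    """Time to Detect: event start time to alert creation time (per the slides)."""
    return inc.alert_created - inc.event_start


def ttm(inc: Incident) -> timedelta:
    """Time to Mitigate. ASSUMPTION: measured from event start."""
    return inc.mitigated - inc.event_start


def ttn(inc: Incident) -> timedelta:
    """Time to Notify. ASSUMPTION: measured from event start."""
    return inc.customer_notified - inc.event_start


def parse(ts: str) -> datetime:
    """Parse the 'YYYY-MM-DD HH:MM:SS' format used in the slide's alert table."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")


# Illustrative values only; they are not taken from the presentation.
example = Incident(
    event_start=parse("2025-08-21 10:00:00"),
    alert_created=parse("2025-08-21 10:14:00"),
    mitigated=parse("2025-08-21 10:45:00"),
    customer_notified=parse("2025-08-21 10:20:00"),
)

print("TTD:", ttd(example))
print("TTM:", ttm(example))
print("TTN:", ttn(example))
```

Tracking the three values per incident makes it possible to see which stage (detection, mitigation, or customer notification) dominates the total response time.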

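According to "Alert Fatigue in Hyperscale Environments: A Metrics-Based Approach to Signal Tuning in Open Networks", the main points are:
1. **The alert fatigue problem**: In hyperscale environments the sheer volume of alerts causes alert fatigue, making it hard to locate and resolve problems quickly.
2. **Hyperscale challenges**: Device failures, misconfigurations, network congestion, and similar faults produce heavy alert noise.
3. **Solution**: Adopt a metrics-based approach to signal tuning that focuses on key metrics: TTD (Time to Detect), TTM (Time to Mitigate), and TTN (Time to Notify).
4. **TTD optimization**: Device-level and cross-device aggregation merges multiple alerts into one, reducing the number of alerts and speeding up detection (a 60-85% reduction).
5. **TTM optimization**: AI agents carry out troubleshooting, cutting the time spent on manual intervention and speeding up resolution.
6. **TTN optimization**: Notifying customers quickly restores customer trust and improves customer satisfaction.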
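The device-level and cross-device aggregation mentioned in the summary can be sketched as grouping alerts by a key within a time window. This is a minimal sketch under stated assumptions, not the method from the presentation: the `aggregate` function, the 30-minute window, and the severity ordering (warning < major < critical) are all hypothetical. The sample rows are the five complete rows of the alert table in the slide preview.

```python
from datetime import datetime, timedelta

FIELDS = ("timestamp", "dc", "rack", "device", "layer", "alert_name", "severity")
ROWS = [
    ("2025-08-21 10:00:00", "DC-01", "R04", "Server-213", "Server", "Packet retransmission errors", "warning"),
    ("2025-08-21 10:04:00", "DC-01", "R04", "Server-219", "Server", "High latency observed", "warning"),
    ("2025-08-21 10:14:00", "DC-01", "R04", "ToR-17", "ToR", "Link flap detected", "critical"),
    ("2025-08-21 10:15:00", "DC-01", "R04", "ToR-17", "ToR", "Interface errors rising", "major"),
    ("2025-08-21 10:18:00", "DC-01", "R04", "Spine-05", "Spine", "Northbound congestion", "warning"),
]

# ASSUMPTION: this ordering is not stated in the source.
SEVERITY_RANK = {"warning": 0, "major": 1, "critical": 2}


def load(rows):
    """Turn raw tuples into dicts with a parsed datetime timestamp."""
    alerts = []
    for row in rows:
        alert = dict(zip(FIELDS, row))
        alert["timestamp"] = datetime.strptime(alert["timestamp"], "%Y-%m-%d %H:%M:%S")
        alerts.append(alert)
    return alerts


def aggregate(alerts, key, window):
    """Merge alerts that share key(alert) and arrive within `window`
    of the first alert in their group. Returns one dict per merged alert."""
    groups = []
    latest = {}  # key -> most recent group for that key
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        k = key(alert)
        group = latest.get(k)
        if group is None or alert["timestamp"] - group["first_seen"] > window:
            group = {"key": k, "first_seen": alert["timestamp"],
                     "severity": alert["severity"], "alerts": []}
            latest[k] = group
            groups.append(group)
        group["alerts"].append(alert)
        # The merged alert carries the highest severity seen so far.
        if SEVERITY_RANK[alert["severity"]] > SEVERITY_RANK[group["severity"]]:
            group["severity"] = alert["severity"]
    return groups


alerts = load(ROWS)
window = timedelta(minutes=30)  # hypothetical window size

# Device-level: one merged alert per device.
per_device = aggregate(alerts, lambda a: (a["dc"], a["rack"], a["device"]), window)
# Cross-device: one merged alert per rack, spanning Server, ToR and Spine layers.
per_rack = aggregate(alerts, lambda a: (a["dc"], a["rack"]), window)

print(len(alerts), "raw alerts ->", len(per_device), "per device ->", len(per_rack), "per rack")
```

On these five rows, device-level grouping merges the two ToR-17 alerts and leaves four alerts, while rack-level grouping collapses everything into a single critical alert for DC-01/R04. A fixed window anchored on the first alert is the simplest choice; a sliding window or topology-aware key would be reasonable alternatives.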