Flux de travail de réponse aux incidents

devops

Flux de travail de réponse aux incidents de production avec routage basé sur la sévérité, coordination en salle de crise, désignation d’un commandant d’incident, communication client, analyse de la cause racine et planification du post-mortem.

Télécharger le fichier .fz

Code FlowZap complet

Monitoring { # Monitoring System
n1: circle label:"Start"
n2: rectangle label:"Alert triggered"
n3: rectangle label:"Page on-call responder"
n4: rectangle label:"Log incident timeline"
n5: circle label:"End"
n1.handle(right) -> n2.handle(left)
n2.handle(bottom) -> OnCall.n6.handle(top) [label="SEV-1 Alert"]
n3.handle(bottom) -> OnCall.n6.handle(top) [label="Paged"]
n4.handle(right) -> n5.handle(left)
}
OnCall { # On-Call Responder
n6: rectangle label:"Acknowledge incident"
n7: rectangle label:"Assess severity and impact"
n8: diamond label:"Severity level?"
n9: rectangle label:"Declare major incident"
n10: rectangle label:"Begin troubleshooting"
n11: rectangle label:"Escalate to team lead"
n6.handle(right) -> n7.handle(left)
n7.handle(right) -> n8.handle(left)
n8.handle(right) -> n9.handle(left) [label="SEV-1/2"]
n8.handle(bottom) -> n10.handle(top) [label="SEV-3/4"]
n9.handle(bottom) -> IncidentCommand.n12.handle(top) [label="Open bridge"]
n10.handle(bottom) -> Resolution.n18.handle(top) [label="Investigate"]
n11.handle(bottom) -> IncidentCommand.n12.handle(top) [label="Needs help"]
}
IncidentCommand { # Incident Command
n12: rectangle label:"Open war room bridge"
n13: rectangle label:"Assign incident commander"
n14: rectangle label:"Coordinate response teams"
n15: rectangle label:"Communicate status updates"
n16: diamond label:"Customer impact?"
n17: rectangle label:"Notify customer success"
n12.handle(right) -> n13.handle(left)
n13.handle(right) -> n14.handle(left)
n14.handle(right) -> n15.handle(left)
n15.handle(right) -> n16.handle(left)
n16.handle(right) -> n17.handle(left) [label="Yes"]
n16.handle(bottom) -> Resolution.n18.handle(top) [label="No"]
n17.handle(bottom) -> Resolution.n18.handle(top) [label="Notified"]
}
Resolution { # Resolution
n18: rectangle label:"Identify root cause"
n19: rectangle label:"Implement fix"
n20: diamond label:"Issue resolved?"
n21: rectangle label:"Verify recovery"
n22: rectangle label:"Close incident"
n23: rectangle label:"Schedule postmortem"
n18.handle(right) -> n19.handle(left)
n19.handle(right) -> n20.handle(left)
n20.handle(right) -> n21.handle(left) [label="Yes"]
n20.handle(bottom) -> n18.handle(bottom) [label="No - Continue"]
n21.handle(right) -> n22.handle(left)
n22.handle(right) -> n23.handle(left)
n23.handle(top) -> Monitoring.n4.handle(bottom) [label="Complete"]
}

Pourquoi ce workflow ?

When production goes down, every minute costs money and customer trust. A well-defined incident response workflow reduces Mean Time to Recovery (MTTR) by ensuring the right people are notified, war rooms are coordinated, and postmortems capture learnings. This workflow implements industry-standard severity-based routing.

Comment ça fonctionne

Step 1: Monitoring alerts trigger the workflow with severity classification (SEV1-SEV4).
Step 2: SEV1/SEV2 incidents immediately page the on-call engineer via PagerDuty or Opsgenie.
Step 3: An incident commander is assigned and a war room (Slack channel or Zoom) is created.
Step 4: Customer communication is drafted and sent through status page updates.
Step 5: Root cause analysis begins in parallel with mitigation efforts.
Step 6: Once resolved, a postmortem is scheduled and action items are tracked in Jira.

Alternatives

Ad-hoc incident response leads to finger-pointing, missed notifications, and repeated incidents. Enterprise tools like ServiceNow or PagerDuty Incident Response cost $20-50 per user/month. This workflow provides a visual runbook that integrates with your existing alerting stack.

Key Facts

Template Name	Flux de travail de réponse aux incidents
Category	devops
Steps	6 workflow steps
Format	FlowZap Code (.fz file)

Modèles associés

Workflow d’alerte de surveillance

devops

Flux de travail de surveillance et d’alerte avec collecte de métriques, évaluation des seuils, routage des alertes, politiques d’escalade et création d’incidents.

Voir le modèle

Flux de travail de rotation d’astreinte

devops

Flux de travail de rotation d’astreinte avec création de planning, relais de poste, gestion des dérogations, politiques d’escalade et répartition équitable des rotations.

Voir le modèle

Flux de travail de revue des accès

devops

Workflow trimestriel de revue des accès utilisateurs avec certification par le manager, validation de la séparation des tâches, suivi des remédiations et reporting de conformité pour les audits.

Voir le modèle

Flux de travail de sauvegarde et restauration

devops

Flux de travail de sauvegarde et restauration avec sauvegardes planifiées, réplication hors site, application de la politique de rétention, tests de restauration et validation RTO/RPO.

Voir le modèle

Flux de travail de renouvellement de certificats

devops

Flux de travail de renouvellement de certificats SSL/TLS avec surveillance des dates d’expiration, demande de certificat par type (DV/OV/EV), validation de domaine, déploiement sur les répartiteurs de charge et vérification de l’état de santé avec possibilité de rollback.

Voir le modèle

Flux de travail de chaos engineering

devops

Flux de travail de chaos engineering avec définition de l’hypothèse, surveillance de l’état stable, injection contrôlée de pannes, limitation du périmètre d’impact et validation de la résilience.

Voir le modèle

Retour à tous les modèles