AZ-042 — Incident Postmortem Canon and Lessons Registry v1

Status

This document defines:

  • the canonical form of incident postmortems;
  • the taxonomy of lessons learned;
  • the mapping between incident, cause, control failure, remediation and lessons registry;
  • the rules by which incidents actually change the threat canon, runbooks, checklists, monitoring and conformance corpus.

AZ-001 through AZ-041 already provide:

  • the specification of the protocol and its subsystems;
  • the security, incident response and recovery model;
  • launch discipline, monitoring, stabilization and archive;
  • upgrade, hard fork, key compromise and long-term preservation;
  • the threat model canon, conformance claim framework and economic attack evaluation.

AZ-042 answers the question: how do we turn a real incident into a canonical set of truths and actionable lessons, so that the organization and the protocol do not repeat the same class of failure under another name, and the memory of the incident is not lost in informal summaries or inconsistent postmortems?

The purpose of this document is to fix:

  • the canonical structure of the postmortem;
  • the taxonomy of findings and lessons;
  • the relationship between incident, cause, control failure and remediation;
  • the central registry of lessons learned;
  • the rules for follow-up, closure and verification;
  • the link to the threat canon, residual risk canon, runbooks, checklists, monitoring profiles, conformance corpus and audit export.

This document builds on:

  • AZ-002 through AZ-041, with direct emphasis on AZ-008, AZ-015, AZ-027, AZ-030, AZ-031, AZ-032, AZ-038, AZ-039, AZ-040 and AZ-041.

Terms:

  • MUST = mandatory
  • MUST NOT = forbidden
  • SHOULD = strongly recommended
  • MAY = optional

1. Objective

AZ-042 answers 10 critical questions:

  1. What is a canonical incident postmortem?
  2. Which fields and sections must it contain?
  3. How do we separate chronology, cause, impact, control failures and remediation?
  4. What is a lesson and how is it stored centrally?
  5. How do we link an incident to the threat canon and residual risk canon?
  6. How do we link the postmortem to runbooks, checklists, conformance and monitoring?
  7. How do we verify that remediation actually happened?
  8. When is a postmortem complete and when is it insufficient?
  9. How do we export lessons and postmortems for external audit?
  10. How do we avoid narrative, vague or non-actionable postmortems?

2. Principii

2.1 Postmortem is a truth artifact, not a ritual essay

The postmortem MUST be treated as an operational, auditable artifact, not as a ceremonial closing text.

2.2 Incident memory must be structured

Chronology, causes, failed controls, remediation and lessons MUST be separate and typed.

2.3 Every material incident must change something or explain why not

A material incident SHOULD produce:

  • remediation;
  • registry updates;
  • risk updates;
  • or an explicit explanation of why none is needed.

2.4 Root cause is rarely enough alone

The postmortem SHOULD capture:

  • trigger;
  • contributing factors;
  • failed assumptions;
  • failed controls;
  • detection gaps;
  • response gaps;
  • and blast radius.

2.5 Lessons without ownership decay into folklore

Every material lesson SHOULD have:

  • owner;
  • due boundary;
  • verification path;
  • and closure state.

2.6 The canon must outlive the people involved

Incident memory MUST remain useful even if the original team leaves or changes.


3. Postmortem purpose

3.1 A canonical postmortem SHOULD answer:

  • what happened?
  • when?
  • what was affected?
  • how did we detect it?
  • which control failed or was missing?
  • what allowed the escalation?
  • what did we do?
  • what are we changing?
  • how do we know we learned something real?

3.2 Rule

If the postmortem cannot answer these questions clearly, it is insufficient.


4. Incident classes in postmortem scope

4.1 The canon SHOULD support postmortems for:

  • consensus incidents
  • finality/liveness incidents
  • validation divergence
  • BVM execution incidents
  • witness/proof incidents
  • governance incidents
  • economic or spam incidents
  • release/genesis/provenance incidents
  • key compromise incidents
  • operator or rollout incidents
  • archive/preservation incidents
  • audit/export integrity incidents

4.2 Rule

Material incidents across all major trust boundaries SHOULD have canonical postmortem support.


5. Postmortem object model

5.1 Canonical structure

IncidentPostmortemRecord {
  version_major
  version_minor

  postmortem_id
  incident_id
  postmortem_scope_hash
  incident_class
  severity_class
  timeline_root
  impact_root
  cause_root
  control_failure_root
  remediation_root
  lesson_root
  risk_update_root?
  verification_plan_root?
  status
  created_at_unix_ms
  finalized_at_unix_ms?
  authoring_scope_hash
  metadata_hash?
}

5.2 status

  • PM_DRAFT
  • PM_IN_REVIEW
  • PM_FINAL
  • PM_SUPERSEDED
  • PM_REVOKED

5.3 Rule

Material incidents SHOULD reach PM_FINAL unless explicitly superseded.


6. Postmortem scope model

6.1 Canonical structure

PostmortemScope {
  scope_id
  target_network_class
  target_chain_id?
  target_genesis_hash?
  affected_release_package_id?
  affected_upgrade_proposal_id?
  affected_incident_window_hash
  affected_asset_root
}

6.2 Rule

Scope MUST bind the postmortem to exact incident context, not generalize implicitly.


7. Timeline model

7.1 Purpose

Incident memory needs exact chronology.

7.2 Canonical structure

IncidentTimelineEntry {
  timeline_entry_id
  event_class
  timestamp_unix_ms
  event_ref?
  summary_hash
}

7.3 event_class examples

  • first_signal_observed
  • incident_opened
  • escalation_triggered
  • operator_action_taken
  • decision_issued
  • mitigation_applied
  • recovery_started
  • recovery_completed
  • incident_closed
  • postmortem_opened
  • postmortem_finalized

7.4 Rule

Timelines SHOULD be precise enough to reconstruct sequencing and latency of response.
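
As an illustration of what "precise enough to reconstruct sequencing and latency" can mean in practice, the following minimal Python sketch derives response latencies from timeline entries. The dataclass mirrors the fields in 7.2; the helper names `first_timestamp` and `latency_ms`, the sample timestamps, and the choice to measure from the first occurrence of each event class are assumptions made for this example and are not defined by AZ-042.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class IncidentTimelineEntry:
    timeline_entry_id: str
    event_class: str
    timestamp_unix_ms: int
    summary_hash: str
    event_ref: Optional[str] = None  # optional in 7.2, so declared last here


def first_timestamp(entries: Sequence[IncidentTimelineEntry], event_class: str) -> Optional[int]:
    """Earliest timestamp recorded for an event class, or None if it never occurs."""
    stamps = [e.timestamp_unix_ms for e in entries if e.event_class == event_class]
    return min(stamps) if stamps else None


def latency_ms(entries: Sequence[IncidentTimelineEntry], start_class: str, end_class: str) -> Optional[int]:
    """Milliseconds between the first occurrences of two event classes."""
    start = first_timestamp(entries, start_class)
    end = first_timestamp(entries, end_class)
    if start is None or end is None:
        return None  # the timeline cannot answer this question
    return end - start


# Hypothetical sample timeline (timestamps are illustrative only).
timeline = [
    IncidentTimelineEntry("t1", "first_signal_observed", 1_000, "h1"),
    IncidentTimelineEntry("t2", "incident_opened", 61_000, "h2"),
    IncidentTimelineEntry("t3", "mitigation_applied", 301_000, "h3"),
]

detection_to_open = latency_ms(timeline, "first_signal_observed", "incident_opened")
open_to_mitigation = latency_ms(timeline, "incident_opened", "mitigation_applied")
```

A `None` result signals that the timeline lacks one of the two events, which is itself a useful indicator that the chronology is not yet precise enough.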


8. Impact model

8.1 Canonical structure

IncidentImpactRecord {
  impact_id
  affected_asset_root
  impact_class
  severity_class
  blast_radius_hash
  duration_hash?
  observed_metric_delta_root?
  notes_hash?
}

8.2 impact_class examples

  • finality_delay
  • liveness_loss
  • validation_inconsistency
  • execution_failure
  • governance_distortion
  • operator_unavailability
  • provenance_confidence_loss
  • archive_integrity_loss
  • auditability_gap

8.3 Rule

Impact SHOULD be recorded as actual effect, not mixed with cause.


9. Cause model

9.1 Need

Cause analysis must be typed.

9.2 Canonical structure

IncidentCauseRecord {
  cause_id
  cause_class
  primary
  statement_hash
  evidence_root?
}

9.3 cause_class examples

  • code_defect
  • parameter_misconfiguration
  • operator_misconfiguration
  • key_compromise
  • incompatible_upgrade_behavior
  • control_missing
  • monitoring_blind_spot
  • unexpected_adversary_strategy
  • archive_or_storage_failure
  • external_dependency_failure

9.4 Rule

Postmortem SHOULD support multiple causes with explicit primary/non-primary distinction.
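
A minimal Python sketch of the primary/non-primary distinction follows. The dataclass mirrors 9.2; the helper `split_causes` and the sample records are hypothetical. The sketch does not enforce how many primary causes a postmortem may have, because the text above does not state a limit.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class IncidentCauseRecord:
    cause_id: str
    cause_class: str
    primary: bool
    statement_hash: str
    evidence_root: Optional[str] = None


def split_causes(
    causes: Sequence[IncidentCauseRecord],
) -> Tuple[List[IncidentCauseRecord], List[IncidentCauseRecord]]:
    """Partition causes into (primary, non_primary), preserving input order."""
    primary = [c for c in causes if c.primary]
    non_primary = [c for c in causes if not c.primary]
    return primary, non_primary


# Hypothetical sample: one primary cause and one non-primary cause.
causes = [
    IncidentCauseRecord("c1", "code_defect", True, "h1"),
    IncidentCauseRecord("c2", "monitoring_blind_spot", False, "h2"),
]
primary, non_primary = split_causes(causes)
```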


10. Contributing factor model

10.1 Need

Many incidents have contributing factors distinct from primary cause.

10.2 Canonical structure

ContributingFactorRecord {
  factor_id
  factor_class
  statement_hash
  evidence_root?
}

10.3 factor_class examples

  • poor_threshold_tuning
  • runbook_gap
  • checklist_gap
  • alert_noise
  • delayed_detection
  • delayed_escalation
  • mixed_fleet_confusion
  • inadequate_test_coverage
  • residual_risk_underestimated
  • operator_training_gap

10.4 Rule

Contributing factors SHOULD NOT be buried inside prose.


11. Control failure model

11.1 Canonical structure

ControlFailureRecord {
  control_failure_id
  control_ref?
  failure_class
  statement_hash
  evidence_root?
  prevention_gap
  detection_gap
  recovery_gap
}

11.2 failure_class examples

  • control_missing
  • control_present_but_not_executed
  • control_present_but_ineffective
  • control_scope_too_narrow
  • control_silenced_by_noise
  • control_bypassed
  • control_not_tested

11.3 Rule

A material postmortem SHOULD identify control failures explicitly.


12. Detection analysis model

12.1 Purpose

The postmortem needs to record how the incident was noticed and where detection failed.

12.2 Canonical structure

DetectionAnalysisRecord {
  detection_id
  first_detection_source_class
  detection_latency_hash
  detection_quality_class
  missed_signal_root?
  notes_hash?
}

12.3 first_detection_source_class examples

  • monitoring_alert
  • operator_observation
  • user_report
  • audit_finding
  • simulation_or_test
  • external_party_report
  • archive_verification_run

12.4 Rule

Detection SHOULD be analyzed separately from mitigation quality.


13. Response analysis model

13.1 Canonical structure

ResponseAnalysisRecord {
  response_id
  response_quality_class
  escalation_latency_hash?
  mitigation_latency_hash?
  runbook_fit_class
  coordination_quality_class
  notes_hash?
}

13.2 response_quality_class examples

  • effective
  • delayed_but_effective
  • partially_effective
  • ineffective
  • harmful_side_effects

13.3 Rule

Response analysis SHOULD capture process quality, not only technical fix outcome.


14. Remediation model

14.1 Canonical structure

RemediationRecord {
  remediation_id
  remediation_class
  description_hash
  owner_role_class
  due_boundary_hash?
  verification_required
  remediation_status
}

14.2 remediation_class examples

  • code_fix
  • parameter_change
  • runbook_update
  • checklist_update
  • monitoring_threshold_update
  • alert_routing_update
  • training_or_process_update
  • conformance_case_addition
  • risk_canon_update
  • governance_policy_update
  • archive_repair_or_migration
  • no_change_explained

14.3 remediation_status

  • proposed
  • approved
  • in_progress
  • implemented
  • verified
  • deferred
  • rejected

14.4 Rule

Every material remediation SHOULD have owner and state.


15. Verification plan model

15.1 Need

Fixes and remediations need validation.

15.2 Canonical structure

RemediationVerificationPlan {
  verification_plan_id
  remediation_ref
  verification_method_root
  success_criteria_root
  due_boundary_hash?
}

15.3 verification_method examples

  • conformance_regression_case
  • replay_test
  • simulation_rerun
  • operator_drill
  • monitoring_validation
  • audit_export_check
  • archive_rebuild_check

15.4 Rule

Material remediation SHOULD NOT close without a verification plan.


16. Lesson model

16.1 Definition

A lesson is a portable statement extracted from incident truth that should influence future behavior, design or review.

16.2 Canonical structure

IncidentLesson {
  lesson_id
  lesson_class
  statement_hash
  source_postmortem_id
  applicability_scope_hash
  owner_role_class
  lesson_status
}

16.3 lesson_class examples

  • design_lesson
  • control_lesson
  • operator_lesson
  • monitoring_lesson
  • governance_lesson
  • archive_lesson
  • audit_lesson
  • rollout_lesson
  • training_lesson

16.4 lesson_status

  • proposed
  • accepted
  • implemented
  • verified
  • retired
  • superseded

16.5 Rule

A lesson SHOULD be more general than raw incident facts, but still concrete enough to act on.


17. Lessons registry

17.1 Definition

Lessons Registry = authoritative collection of active and historical lessons learned.

17.2 Canonical structure

LessonsRegistry {
  registry_id
  registry_scope_hash
  active_lesson_root
  retired_lesson_root?
  superseded_lesson_root?
  timestamp_unix_ms
}

17.3 Rule

The registry SHOULD support lookup by incident class, subsystem, control family and lesson class.


18. Postmortem findings model

18.1 Canonical structure

PostmortemFinding {
  finding_id
  finding_class
  severity_class
  statement_hash
  evidence_root?
  action_required
}

18.2 finding_class examples

  • root_cause_confirmed
  • control_gap
  • monitoring_gap
  • operator_gap
  • risk_underestimated
  • rollout_weakness
  • archive_weakness
  • training_gap
  • false_assumption_exposed

18.3 Rule

Findings SHOULD drive lessons and remediation, not just summarize narrative.


19. Postmortem completeness criteria

19.1 A postmortem SHOULD be considered complete only if it includes:

  • incident scope
  • timeline
  • impact analysis
  • cause and contributing factors
  • control failures
  • response analysis
  • remediation set
  • lesson set
  • verification plan for material remediations
  • links to risk or canon updates where relevant

19.2 Rule

A postmortem missing these SHOULD be considered incomplete for material incidents.
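
The completeness criteria in 19.1 can be checked mechanically. The Python sketch below is one possible shape for such a check, under stated assumptions: the section key names, the dictionary representation of a postmortem, and the function `missing_sections` are invented for this example. The conditional item "links to risk or canon updates where relevant" is deliberately left out, since relevance requires human judgment.

```python
from typing import Any, List, Mapping

# Hypothetical key names, one per unconditional item in 19.1.
REQUIRED_SECTIONS = (
    "incident_scope",
    "timeline",
    "impact_analysis",
    "cause_and_contributing_factors",
    "control_failures",
    "response_analysis",
    "remediation_set",
    "lesson_set",
)


def missing_sections(postmortem: Mapping[str, Any], has_material_remediation: bool = True) -> List[str]:
    """Return the names of required sections that are absent or empty."""
    required = list(REQUIRED_SECTIONS)
    if has_material_remediation:
        # 19.1 asks for a verification plan only for material remediations.
        required.append("verification_plan")
    return [name for name in required if not postmortem.get(name)]


# Hypothetical draft: control failures are empty and no verification plan exists.
draft = {
    "incident_scope": "scope-hash",
    "timeline": ["t1", "t2"],
    "impact_analysis": ["i1"],
    "cause_and_contributing_factors": ["c1"],
    "control_failures": [],
    "response_analysis": "r1",
    "remediation_set": ["rem1"],
    "lesson_set": ["l1"],
}
gaps = missing_sections(draft)
```

An empty result would mean the unconditional items are present; it does not by itself establish that their content is adequate.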


20. Incomplete or limited postmortems

20.1 Sometimes evidence is missing or the incident is still unfolding.

20.2 Rule

In such cases, the postmortem MUST explicitly state:

  • incomplete sections
  • why incomplete
  • what evidence is still missing
  • review or refresh due boundary

20.3 Rule

A partial postmortem MUST NOT pretend finality.
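
The two rules above can be enforced with a small guard. The Python sketch below is an assumption-laden illustration: the `PartialPostmortemNotice` type, its field names, and `validate_status` are hypothetical; only the status value `PM_FINAL` and the four required statements come from this document.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PartialPostmortemNotice:
    incomplete_sections: List[str]
    reason: str
    missing_evidence: List[str]
    refresh_due_boundary: str


def validate_status(status: str, notice: Optional[PartialPostmortemNotice]) -> str:
    """Return the status unchanged if it is consistent with the declared gaps."""
    if notice is None or not notice.incomplete_sections:
        return status  # nothing declared incomplete, no constraint applies
    if status == "PM_FINAL":
        raise ValueError("a partial postmortem must not be marked PM_FINAL")
    if not (notice.reason and notice.missing_evidence and notice.refresh_due_boundary):
        raise ValueError("a partial postmortem must state reason, missing evidence and refresh boundary")
    return status


# Hypothetical notice for a postmortem whose cause analysis is still open.
notice = PartialPostmortemNotice(
    incomplete_sections=["cause_root"],
    reason="node logs not yet recovered",
    missing_evidence=["validator logs for the incident window"],
    refresh_due_boundary="due-boundary-hash",
)
accepted_status = validate_status("PM_IN_REVIEW", notice)
```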


21. Relationship to threat canon

21.1 Material postmortem findings SHOULD update:

  • threat records
  • control records
  • residual risk records
  • risk review records

21.2 Rule

Incident lessons that expose new adversary capabilities or new control failure modes MUST be reflected in AZ-039 canon.


22. Relationship to conformance corpus

22.1 If an incident reveals a reproducible protocol or process defect, the postmortem SHOULD trigger:

  • new conformance cases
  • regression cases
  • upgrade boundary cases
  • operator procedure cases
  • monitoring validation cases

22.2 Rule

If no corpus update is needed, postmortem SHOULD explain why.


23. Relationship to runbooks and checklists

23.1 Postmortem remediation SHOULD explicitly state whether to update:

  • incident runbooks
  • operator launch manuals
  • operator checklists
  • restart/rejoin procedures
  • upgrade rollout procedures

23.2 Rule

If an incident exposed a weakness in operator action, checklist/runbook updates SHOULD either be made or be explicitly waived with rationale.


24. Relationship to monitoring profiles

24.1 Postmortem SHOULD assess whether to change:

  • alert thresholds
  • alert routes
  • anomaly classes
  • correlation rules
  • restricted posture criteria
  • blind spot detection

24.2 Rule

Monitoring gaps exposed by incidents SHOULD feed directly into AZ-031 profiles.


25. Relationship to launch and upgrade decisions

25.1 Serious incidents SHOULD influence:

  • launch blockers
  • upgrade blockers
  • restricted posture extension
  • rollout policy changes
  • mixed-fleet restrictions
  • risk acceptance reviews

25.2 Rule

Postmortem outcomes SHOULD be decision-relevant, not historical only.


26. Relationship to archive and audit export

26.1 Postmortem records SHOULD be archivable and exportable via audit interface.

26.2 Export MAY include:

  • postmortem record
  • findings
  • remediation set
  • lessons subset
  • risk updates
  • redacted evidence
  • closure verification state

26.3 Rule

External postmortem export SHOULD remain claim-centered and redaction-aware.


27. Closure model

27.1 Incident closure and postmortem closure are related but distinct.

27.2 Incident MAY close before all remediations are verified.

In that case, postmortem SHOULD remain open or partially finalized until follow-up is tracked.

27.3 Rule

“Incident closed” MUST NOT imply “lesson implemented”.


28. Remediation closure verification

28.1 Verification SHOULD require:

  • explicit evidence of implementation
  • explicit evidence of test/replay/drill if applicable
  • closure status in remediation record
  • update of related lesson status

28.2 Rule

Material remediation MUST NOT close solely on code merge or note in chat.
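
A closure gate along the lines of 28.1 might look like the Python sketch below. The `ClosureEvidence` type and `may_close` are hypothetical, and mapping "closure status in remediation record" to `remediation_status == "verified"` is an assumption of this example, not something the canon states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClosureEvidence:
    implementation_evidence: bool  # explicit evidence the change exists
    test_applicable: bool          # whether a test/replay/drill applies at all
    test_evidence: bool            # explicit evidence the test/replay/drill ran
    remediation_status: str        # value from the 14.3 status list
    lesson_status_updated: bool    # related lesson status was updated


def may_close(evidence: ClosureEvidence) -> bool:
    """True only when every 28.1 requirement is backed by explicit evidence."""
    if not evidence.implementation_evidence:
        return False  # a merge or chat note alone is not evidence
    if evidence.test_applicable and not evidence.test_evidence:
        return False
    if evidence.remediation_status != "verified":
        return False  # assumption: "verified" is the closing state
    return evidence.lesson_status_updated


# Hypothetical case: code merged and marked implemented, but never replayed.
merged_only = ClosureEvidence(True, True, False, "implemented", False)
fully_verified = ClosureEvidence(True, True, True, "verified", True)
```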


29. Postmortem review and sign-off

29.1 Recommended reviewers

  • incident commander
  • subsystem owner
  • security lead if security-relevant
  • ops lead if operationally material
  • audit scribe or review owner
  • governance liaison if governance scope touched

29.2 Rule

Major postmortems SHOULD be reviewed by more than one perspective.


30. Postmortem supersession

30.1 Need

Later evidence may produce an improved postmortem or a corrected analysis.

30.2 Rule

Supersession MUST be explicit:

  • prior_postmortem_id
  • new_postmortem_id
  • supersession_reason
  • what materially changed

30.3 Rule

The older postmortem remains visible as a historical artifact.
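
The supersession rules can be sketched in Python as follows. The record fields follow the list in 30.2 (with `material_changes` standing in for "what materially changed"); the in-memory `statuses` dictionary, the `log` list, and `apply_supersession` are hypothetical stand-ins for whatever store an implementation actually uses.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SupersessionRecord:
    prior_postmortem_id: str
    new_postmortem_id: str
    supersession_reason: str
    material_changes: str


def apply_supersession(
    statuses: Dict[str, str],
    log: List[SupersessionRecord],
    record: SupersessionRecord,
) -> None:
    """Mark the prior postmortem PM_SUPERSEDED and keep it in the store."""
    fields = (
        record.prior_postmortem_id,
        record.new_postmortem_id,
        record.supersession_reason,
        record.material_changes,
    )
    if not all(fields):
        raise ValueError("supersession must state both ids, the reason and what changed")
    if record.prior_postmortem_id not in statuses or record.new_postmortem_id not in statuses:
        raise KeyError("both postmortems must already exist in the store")
    # The prior record is re-labelled, never deleted, so history stays visible.
    statuses[record.prior_postmortem_id] = "PM_SUPERSEDED"
    log.append(record)


# Hypothetical store with two finalized postmortems for the same incident.
statuses = {"pm-001": "PM_FINAL", "pm-002": "PM_FINAL"}
log: List[SupersessionRecord] = []
apply_supersession(
    statuses,
    log,
    SupersessionRecord("pm-001", "pm-002", "new evidence", "primary cause reclassified"),
)
```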


31. Postmortem revocation

31.1 Need

A postmortem may be materially wrong or scope-confused.

31.2 Rule

Revocation MUST be explicit and SHOULD include:

  • target postmortem_id
  • reason hash
  • replacement ref if any
  • effect on lessons/remediation linkage

31.3 Rule

Revocation MUST NOT silently erase incident memory.


32. Lessons verification

32.1 A lesson SHOULD move from accepted to implemented/verified only when:

  • linked remediation completed
  • verification plan passed
  • related canon/runbook/checklist/corpus updates recorded
  • lesson no longer merely aspirational

32.2 Rule

“Lesson learned” without structural change SHOULD NOT be marked implemented.
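
One strict reading of 32.1 is that a lesson leaves `accepted` only when all listed conditions hold. The Python sketch below encodes that reading; the function `may_promote_lesson` and its boolean parameters are hypothetical, and a looser interpretation (for example, allowing `implemented` before the verification plan has passed) would also be compatible with the text.

```python
PROMOTED_STATES = ("implemented", "verified")


def may_promote_lesson(
    current_status: str,
    target_status: str,
    remediation_completed: bool,
    verification_passed: bool,
    updates_recorded: bool,
) -> bool:
    """Strict reading of 32.1: every condition must hold before promotion."""
    if current_status != "accepted" or target_status not in PROMOTED_STATES:
        return False
    return remediation_completed and verification_passed and updates_recorded


# Hypothetical case: remediation shipped, but no runbook or corpus update recorded.
aspirational = may_promote_lesson("accepted", "implemented", True, True, False)
structural = may_promote_lesson("accepted", "verified", True, True, True)
```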


33. Lessons query model

33.1 Registry SHOULD support queries such as:

  • all lessons from consensus incidents
  • all monitoring lessons not yet verified
  • all lessons affecting upgrade rollout
  • all lessons from key compromise incidents
  • all lessons that changed residual risk canon
  • all retired or superseded lessons

33.2 Rule

Lessons registry SHOULD be operationally searchable, not just archival.
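
The queries listed in 33.1 suggest a filterable index over lessons. The Python sketch below shows one minimal form. `LessonIndexEntry` is hypothetical: `incident_class` is not a field of `IncidentLesson` in 16.2 and is assumed here to be resolved from the source postmortem when the index is built. The function `find_lessons` and its parameters are likewise invented for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class LessonIndexEntry:
    lesson_id: str
    lesson_class: str
    lesson_status: str
    incident_class: str  # assumed to be copied from the source postmortem


def find_lessons(
    entries: Iterable[LessonIndexEntry],
    lesson_class: Optional[str] = None,
    incident_class: Optional[str] = None,
    exclude_status: Optional[str] = None,
) -> List[LessonIndexEntry]:
    """Return entries matching every criterion that was supplied."""
    result = []
    for entry in entries:
        if lesson_class is not None and entry.lesson_class != lesson_class:
            continue
        if incident_class is not None and entry.incident_class != incident_class:
            continue
        if exclude_status is not None and entry.lesson_status == exclude_status:
            continue
        result.append(entry)
    return result


# Hypothetical registry contents.
registry = [
    LessonIndexEntry("l1", "monitoring_lesson", "accepted", "consensus"),
    LessonIndexEntry("l2", "monitoring_lesson", "verified", "key_compromise"),
    LessonIndexEntry("l3", "operator_lesson", "implemented", "consensus"),
]

# "all monitoring lessons not yet verified"
unverified_monitoring = find_lessons(registry, lesson_class="monitoring_lesson", exclude_status="verified")
# "all lessons from consensus incidents"
from_consensus = find_lessons(registry, incident_class="consensus")
```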


34. Postmortem anti-patterns

34.1 Systems SHOULD avoid:

  1. postmortem as narrative with no structured fields
  2. root cause only, no contributing factors
  3. “human error” as endpoint with no control analysis
  4. no remediation owner
  5. no verification plan
  6. no lesson registry linkage
  7. no threat/risk canon update after new class of incident
  8. incident closed with remediation still aspirational
  9. lessons worded too vaguely to act on
  10. rewriting historical postmortem without supersession

35. Formal goals

AZ-042 pursues the following goals:

35.1 Structured incident memory

Every material incident can be reconstructed as timeline, cause, impact, control failure and remediation.

35.2 Actionable learning

Lessons become owned, tracked and verified changes rather than slogans.

35.3 Canon integration

Incident knowledge updates threat canon, residual risk, conformance and operations artifacts.

35.4 Audit-grade post-incident truth

External and internal reviewers can inspect incident learning in a consistent, exportable form.


36. Document formula

Incident Postmortem Canon = structured incident truth + typed findings + lessons registry + remediation with verification + canon/runbook/checklist/corpus updates


37. Relationship to the rest of the suite

  • AZ-015 defines incident response.
  • AZ-039 defines the threat and residual risk canon.
  • AZ-040 defines conformance claims.
  • AZ-042 defines how incidents become operational memory and verifiable systemic change.

In short: AZ-042 turns the incident from a passing event into infrastructure for learning and control.


38. What comes next

After AZ-042, the next document is:

AZ-043 — Constitutional Record and Network Identity Canon

It must fix:

  • which records are constitutional for the network's identity;
  • how chain identity, genesis, hard fork lineage and governance-critical records are linked;
  • and how we define the network's ultimate long-term normative memory.

Closing

An incident is truly closed not when the alerts have stopped, but when the cause is understood, the missing control is named, the change is made, the lesson is preserved, and the next occurrence of the same class of failure becomes less likely.

That is where a postmortem of real value begins.