AZ-042 — Incident Postmortem Canon and Lessons Registry v1
Status
This document defines:
- the canonical form of incident postmortems;
- the taxonomy of lessons learned;
- the mapping between incident, cause, control failure, remediation and the lessons registry;
- and the rules by which incidents actually change the threat canon, runbooks, checklists, monitoring and the conformance corpus.
AZ-001 through AZ-041 already provide:
- the specification of the protocol and its subsystems;
- the security, incident response and recovery model;
- launch discipline, monitoring, stabilization and archive;
- upgrade, hard fork, key compromise and long-term preservation;
- the threat model canon, the conformance claim framework and economic attack evaluation.
AZ-042 answers the question: how do we turn a real incident into a canonical set of truths and actionable lessons, so that the organization and the protocol do not repeat the same class of failure under a different name, and so that the memory of the incident is not lost in informal summaries or inconsistent postmortems?
The purpose of this document is to fix:
- the canonical structure of the postmortem;
- the taxonomy of findings and lessons;
- the relationship between incident, cause, control failure and remediation;
- the central lessons-learned registry;
- the rules for follow-up, closure and verification;
- and the links to the threat canon, residual risk canon, runbooks, checklists, monitoring profiles, conformance corpus and audit export.
This document builds on:
- AZ-002 through AZ-041, with direct emphasis on AZ-008, AZ-015, AZ-027, AZ-030, AZ-031, AZ-032, AZ-038, AZ-039, AZ-040 and AZ-041.
Terms:
- MUST = mandatory
- MUST NOT = forbidden
- SHOULD = strongly recommended
- MAY = optional
1. Objective
AZ-042 answers 10 critical questions:
- What is a canonical incident postmortem?
- Which fields and sections must it contain?
- How do we separate chronology, cause, impact, control failures and remediation?
- What is a lesson and how is it stored centrally?
- How do we link an incident to the threat canon and the residual risk canon?
- How do we link the postmortem to runbooks, checklists, conformance and monitoring?
- How do we verify that remediation actually happened?
- When is a postmortem complete and when is it insufficient?
- How do we export lessons and postmortems for external audit?
- How do we avoid narrative, vague or non-actionable postmortems?
2. Principles
2.1 Postmortem is a truth artifact, not a ritual essay
The postmortem MUST be treated as an operational, auditable artifact, not as a ceremonial closing text.
2.2 Incident memory must be structured
Chronology, causes, failed controls, remediation and lessons MUST be separated and typed.
2.3 Every material incident must change something or explain why not
A material incident SHOULD produce:
- remediation;
- registry updates;
- risk updates;
- or an explicit explanation of why none is needed.
2.4 Root cause is rarely enough alone
The postmortem SHOULD capture:
- trigger;
- contributing factors;
- failed assumptions;
- failed controls;
- detection gaps;
- response gaps;
- and blast radius.
2.5 Lessons without ownership decay into folklore
Every material lesson SHOULD have:
- owner;
- due boundary;
- verification path;
- and closure state.
2.6 The canon must outlive the people involved
Incident memory MUST remain useful even if the original team leaves or changes.
3. Postmortem purpose
3.1 A canonical postmortem SHOULD answer:
- what happened?
- when?
- what was affected?
- how did we detect it?
- which control failed or was missing?
- what allowed the escalation?
- what did we do?
- what are we changing?
- how do we know we learned something real?
3.2 Rule
If the postmortem cannot answer these questions clearly, it is insufficient.
4. Incident classes in postmortem scope
4.1 The canon SHOULD support postmortems for:
- consensus incidents
- finality/liveness incidents
- validation divergence
- BVM execution incidents
- witness/proof incidents
- governance incidents
- economic or spam incidents
- release/genesis/provenance incidents
- key compromise incidents
- operator or rollout incidents
- archive/preservation incidents
- audit/export integrity incidents
4.2 Rule
Material incidents across all major trust boundaries SHOULD have canonical postmortem support.
5. Postmortem object model
5.1 Canonical structure
IncidentPostmortemRecord {
version_major
version_minor
postmortem_id
incident_id
postmortem_scope_hash
incident_class
severity_class
timeline_root
impact_root
cause_root
control_failure_root
remediation_root
lesson_root
risk_update_root?
verification_plan_root?
status
created_at_unix_ms
finalized_at_unix_ms?
authoring_scope_hash
metadata_hash?
}
5.2 status
- PM_DRAFT
- PM_IN_REVIEW
- PM_FINAL
- PM_SUPERSEDED
- PM_REVOKED
5.3 Rule
Material incidents SHOULD reach PM_FINAL unless explicitly superseded.
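As a minimal sketch of rule 5.3, the status values can be modeled as an enumeration and the rule as a predicate. The `material` flag and the name `satisfies_finality_rule` are assumptions made for illustration; AZ-042 does not define how materiality is computed, and it does not define status transitions.

```python
from enum import Enum


class PostmortemStatus(Enum):
    """Status values listed in section 5.2."""
    PM_DRAFT = "PM_DRAFT"
    PM_IN_REVIEW = "PM_IN_REVIEW"
    PM_FINAL = "PM_FINAL"
    PM_SUPERSEDED = "PM_SUPERSEDED"
    PM_REVOKED = "PM_REVOKED"


def satisfies_finality_rule(status: PostmortemStatus, material: bool) -> bool:
    """Rule 5.3: a material incident should end in PM_FINAL unless superseded.

    `material` is a hypothetical input; deciding materiality is out of scope here.
    """
    if not material:
        return True
    return status in (PostmortemStatus.PM_FINAL, PostmortemStatus.PM_SUPERSEDED)
```

A check like this could run over open postmortems to list material incidents still sitting in PM_DRAFT or PM_IN_REVIEW.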
6. Postmortem scope model
6.1 Canonical structure
PostmortemScope {
scope_id
target_network_class
target_chain_id?
target_genesis_hash?
affected_release_package_id?
affected_upgrade_proposal_id?
affected_incident_window_hash
affected_asset_root
}
6.2 Rule
Scope MUST bind the postmortem to exact incident context, not generalize implicitly.
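One possible way to bind a postmortem to its exact scope is to derive `postmortem_scope_hash` from a deterministic serialization of the scope. The sketch below assumes SHA-256 over sorted-key JSON; AZ-042 does not specify the hash function or the encoding, so both are illustrative choices only, as is the function name `scope_hash`.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class PostmortemScope:
    """Fields from section 6.1; fields marked '?' there are Optional here."""
    scope_id: str
    target_network_class: str
    affected_incident_window_hash: str
    affected_asset_root: str
    target_chain_id: Optional[str] = None
    target_genesis_hash: Optional[str] = None
    affected_release_package_id: Optional[str] = None
    affected_upgrade_proposal_id: Optional[str] = None


def scope_hash(scope: PostmortemScope) -> str:
    """Hash a deterministic JSON encoding of the scope (assumed scheme, not normative)."""
    payload = json.dumps(asdict(scope), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two scopes that differ only in chain id must not share a hash.
scope_a = PostmortemScope("s1", "mainnet", "window-1", "assets-1", target_chain_id="chain-1")
scope_b = PostmortemScope("s1", "mainnet", "window-1", "assets-1", target_chain_id="chain-2")
```

Because every field enters the hash, changing any part of the incident context produces a different scope hash, which is what keeps the scope from generalizing implicitly.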
7. Timeline model
7.1 Purpose
Incident memory needs exact chronology.
7.2 Canonical structure
IncidentTimelineEntry {
timeline_entry_id
event_class
timestamp_unix_ms
event_ref?
summary_hash
}
7.3 event_class examples
- first_signal_observed
- incident_opened
- escalation_triggered
- operator_action_taken
- decision_issued
- mitigation_applied
- recovery_started
- recovery_completed
- incident_closed
- postmortem_opened
- postmortem_finalized
7.4 Rule
Timelines SHOULD be precise enough to reconstruct sequencing and latency of response.
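To illustrate what "reconstruct sequencing and latency" can mean in practice, the sketch below orders timeline entries and measures the delay between two event classes. The helper names `ordered` and `latency_ms`, the tie-break on `timeline_entry_id`, and the sample timestamps are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class IncidentTimelineEntry:
    """Fields from section 7.2."""
    timeline_entry_id: str
    event_class: str
    timestamp_unix_ms: int
    summary_hash: str
    event_ref: Optional[str] = None


def ordered(entries: List[IncidentTimelineEntry]) -> List[IncidentTimelineEntry]:
    """Sort by timestamp; ties are broken by entry id so the order is stable."""
    return sorted(entries, key=lambda e: (e.timestamp_unix_ms, e.timeline_entry_id))


def latency_ms(entries: List[IncidentTimelineEntry],
               start_class: str, end_class: str) -> Optional[int]:
    """Milliseconds from the first start_class event to the next end_class event.

    Returns None when either event is missing or the end never follows the start.
    """
    start = None
    for entry in ordered(entries):
        if start is None and entry.event_class == start_class:
            start = entry.timestamp_unix_ms
        elif start is not None and entry.event_class == end_class:
            return entry.timestamp_unix_ms - start
    return None


# Sample entries, deliberately out of order.
timeline = [
    IncidentTimelineEntry("t3", "mitigation_applied", 1_700_000_900_000, "h3"),
    IncidentTimelineEntry("t1", "first_signal_observed", 1_700_000_000_000, "h1"),
    IncidentTimelineEntry("t2", "incident_opened", 1_700_000_120_000, "h2"),
]
```

With these sample entries, the delay from `first_signal_observed` to `incident_opened` is 120000 ms and the delay to `mitigation_applied` is 900000 ms.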
8. Impact model
8.1 Canonical structure
IncidentImpactRecord {
impact_id
affected_asset_root
impact_class
severity_class
blast_radius_hash
duration_hash?
observed_metric_delta_root?
notes_hash?
}
8.2 impact_class examples
- finality_delay
- liveness_loss
- validation_inconsistency
- execution_failure
- governance_distortion
- operator_unavailability
- provenance_confidence_loss
- archive_integrity_loss
- auditability_gap
8.3 Rule
Impact SHOULD be recorded as actual effect, not mixed with cause.
9. Cause model
9.1 Need
Cause analysis must be typed.
9.2 Canonical structure
IncidentCauseRecord {
cause_id
cause_class
primary
statement_hash
evidence_root?
}
9.3 cause_class examples
- code_defect
- parameter_misconfiguration
- operator_misconfiguration
- key_compromise
- incompatible_upgrade_behavior
- control_missing
- monitoring_blind_spot
- unexpected_adversary_strategy
- archive_or_storage_failure
- external_dependency_failure
9.4 Rule
Postmortem SHOULD support multiple causes with explicit primary/non-primary distinction.
10. Contributing factor model
10.1 Need
Many incidents have contributing factors distinct from primary cause.
10.2 Canonical structure
ContributingFactorRecord {
factor_id
factor_class
statement_hash
evidence_root?
}
10.3 factor_class examples
- poor_threshold_tuning
- runbook_gap
- checklist_gap
- alert_noise
- delayed_detection
- delayed_escalation
- mixed_fleet_confusion
- inadequate_test_coverage
- residual_risk_underestimated
- operator_training_gap
10.4 Rule
Contributing factors SHOULD NOT be buried inside prose.
11. Control failure model
11.1 Canonical structure
ControlFailureRecord {
control_failure_id
control_ref?
failure_class
statement_hash
evidence_root?
prevention_gap
detection_gap
recovery_gap
}
11.2 failure_class examples
- control_missing
- control_present_but_not_executed
- control_present_but_ineffective
- control_scope_too_narrow
- control_silenced_by_noise
- control_bypassed
- control_not_tested
11.3 Rule
A material postmortem SHOULD identify control failures explicitly.
12. Detection analysis model
12.1 Purpose
The postmortem needs to record how the incident was noticed and where detection failed.
12.2 Canonical structure
DetectionAnalysisRecord {
detection_id
first_detection_source_class
detection_latency_hash
detection_quality_class
missed_signal_root?
notes_hash?
}
12.3 first_detection_source_class examples
- monitoring_alert
- operator_observation
- user_report
- audit_finding
- simulation_or_test
- external_party_report
- archive_verification_run
12.4 Rule
Detection SHOULD be analyzed separately from mitigation quality.
13. Response analysis model
13.1 Canonical structure
ResponseAnalysisRecord {
response_id
response_quality_class
escalation_latency_hash?
mitigation_latency_hash?
runbook_fit_class
coordination_quality_class
notes_hash?
}
13.2 response_quality_class examples
- effective
- delayed_but_effective
- partially_effective
- ineffective
- harmful_side_effects
13.3 Rule
Response analysis SHOULD capture process quality, not only technical fix outcome.
14. Remediation model
14.1 Canonical structure
RemediationRecord {
remediation_id
remediation_class
description_hash
owner_role_class
due_boundary_hash?
verification_required
remediation_status
}
14.2 remediation_class examples
- code_fix
- parameter_change
- runbook_update
- checklist_update
- monitoring_threshold_update
- alert_routing_update
- training_or_process_update
- conformance_case_addition
- risk_canon_update
- governance_policy_update
- archive_repair_or_migration
- no_change_explained
14.3 remediation_status
- proposed
- approved
- in_progress
- implemented
- verified
- deferred
- rejected
14.4 Rule
Every material remediation SHOULD have owner and state.
15. Verification plan model
15.1 Need
Fixes and remediations need validation.
15.2 Canonical structure
RemediationVerificationPlan {
verification_plan_id
remediation_ref
verification_method_root
success_criteria_root
due_boundary_hash?
}
15.3 verification_method examples
- conformance_regression_case
- replay_test
- simulation_rerun
- operator_drill
- monitoring_validation
- audit_export_check
- archive_rebuild_check
15.4 Rule
Material remediation SHOULD NOT close without a verification plan.
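The link between sections 14 and 15 can be sketched as a check that lists remediations which require verification but have no plan pointing at them. The function name `remediations_missing_plan` and the sample records are hypothetical; treating `remediation_ref` as equal to `remediation_id` is an assumption about how the reference is resolved.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RemediationRecord:
    """Fields from section 14.1."""
    remediation_id: str
    remediation_class: str
    description_hash: str
    owner_role_class: str
    verification_required: bool
    remediation_status: str
    due_boundary_hash: Optional[str] = None


@dataclass
class RemediationVerificationPlan:
    """Fields from section 15.2."""
    verification_plan_id: str
    remediation_ref: str  # assumed to hold the remediation_id it verifies
    verification_method_root: str
    success_criteria_root: str
    due_boundary_hash: Optional[str] = None


def remediations_missing_plan(remediations: List[RemediationRecord],
                              plans: List[RemediationVerificationPlan]) -> List[str]:
    """Return ids of remediations that need verification but have no plan."""
    planned = {plan.remediation_ref for plan in plans}
    return [r.remediation_id for r in remediations
            if r.verification_required and r.remediation_id not in planned]


sample_remediations = [
    RemediationRecord("r1", "code_fix", "d1", "subsystem_owner", True, "implemented"),
    RemediationRecord("r2", "no_change_explained", "d2", "ops_lead", False, "approved"),
]
sample_plan = RemediationVerificationPlan("vp1", "r1", "methods-1", "criteria-1")
```

Without a plan, `r1` is reported as blocked; once `sample_plan` exists, the list is empty.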
16. Lesson model
16.1 Definition
A lesson is a portable statement extracted from incident truth that should influence future behavior, design or review.
16.2 Canonical structure
IncidentLesson {
lesson_id
lesson_class
statement_hash
source_postmortem_id
applicability_scope_hash
owner_role_class
lesson_status
}
16.3 lesson_class examples
- design_lesson
- control_lesson
- operator_lesson
- monitoring_lesson
- governance_lesson
- archive_lesson
- audit_lesson
- rollout_lesson
- training_lesson
16.4 lesson_status
- proposed
- accepted
- implemented
- verified
- retired
- superseded
16.5 Rule
A lesson SHOULD be more general than raw incident facts, but still concrete enough to act on.
17. Lessons registry
17.1 Definition
Lessons Registry = authoritative collection of active and historical lessons learned.
17.2 Canonical structure
LessonsRegistry {
registry_id
registry_scope_hash
active_lesson_root
retired_lesson_root?
superseded_lesson_root?
timestamp_unix_ms
}
17.3 Rule
The registry SHOULD support lookup by incident class, subsystem, control family and lesson class.
18. Postmortem findings model
18.1 Canonical structure
PostmortemFinding {
finding_id
finding_class
severity_class
statement_hash
evidence_root?
action_required
}
18.2 finding_class examples
- root_cause_confirmed
- control_gap
- monitoring_gap
- operator_gap
- risk_underestimated
- rollout_weakness
- archive_weakness
- training_gap
- false_assumption_exposed
18.3 Rule
Findings SHOULD drive lessons and remediation, not just summarize narrative.
19. Postmortem completeness criteria
19.1 A postmortem SHOULD be considered complete only if it includes:
- incident scope
- timeline
- impact analysis
- cause and contributing factors
- control failures
- response analysis
- remediation set
- lesson set
- verification plan for material remediations
- links to risk or canon updates where relevant
19.2 Rule
A postmortem missing these SHOULD be considered incomplete for material incidents.
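The completeness criteria in 19.1 lend themselves to a mechanical check. The sketch below treats a postmortem as a mapping from section name to content; the section keys, the two boolean flags, and the function name `missing_sections` are illustrative assumptions rather than canonical identifiers.

```python
from typing import List, Mapping

# Sections that 19.1 requires unconditionally (key names are assumed).
REQUIRED_SECTIONS = (
    "incident_scope",
    "timeline",
    "impact_analysis",
    "cause_and_contributing_factors",
    "control_failures",
    "response_analysis",
    "remediation_set",
    "lesson_set",
)


def missing_sections(postmortem: Mapping[str, object], *,
                     has_material_remediation: bool,
                     canon_update_relevant: bool) -> List[str]:
    """List required sections that are absent or empty."""
    required = list(REQUIRED_SECTIONS)
    if has_material_remediation:
        required.append("verification_plan")  # required for material remediations
    if canon_update_relevant:
        required.append("risk_or_canon_links")  # required only where relevant
    return [name for name in required if not postmortem.get(name)]


draft = {
    "incident_scope": "scope-1",
    "timeline": ["t1", "t2"],
    "impact_analysis": ["i1"],
}
gaps = missing_sections(draft, has_material_remediation=True, canon_update_relevant=False)
```

An empty result is a necessary condition for completeness under 19.1, not a sufficient one: it only shows the sections exist, not that their content is adequate.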
20. Incomplete or limited postmortems
20.1 Sometimes evidence may be missing or incident may still be unfolding.
20.2 Rule
In such cases, postmortem MUST explicitly state:
- incomplete sections
- why incomplete
- what evidence is still missing
- review or refresh due boundary
20.3 Rule
A partial postmortem MUST NOT pretend finality.
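Rules 20.2 and 20.3 can be sketched as a validation over an explicit incompleteness declaration. The record `IncompletenessDeclaration` and its field names are hypothetical, and reading "pretend finality" as "carry status PM_FINAL" is an interpretation, not something the canon states.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncompletenessDeclaration:
    """Hypothetical record holding the four statements required by 20.2."""
    incomplete_sections: List[str]
    reason: str
    missing_evidence: List[str]
    refresh_due_boundary: str


def partial_postmortem_violations(status: str,
                                  decl: IncompletenessDeclaration) -> List[str]:
    """Return the rule violations for a postmortem known to be partial."""
    violations = []
    if not decl.incomplete_sections:
        violations.append("incomplete sections not listed")
    if not decl.reason.strip():
        violations.append("reason for incompleteness not stated")
    if not decl.missing_evidence:
        violations.append("missing evidence not listed")
    if not decl.refresh_due_boundary.strip():
        violations.append("review or refresh due boundary not set")
    if status == "PM_FINAL":  # assumed reading of rule 20.3
        violations.append("partial postmortem marked PM_FINAL")
    return violations


declaration = IncompletenessDeclaration(
    incomplete_sections=["cause_and_contributing_factors"],
    reason="validator logs for the incident window are not yet recovered",
    missing_evidence=["validator logs"],
    refresh_due_boundary="boundary-hash-1",
)
```

The sample declaration passes while the postmortem stays in review and fails as soon as the status is set to PM_FINAL.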
21. Relationship to threat canon
21.1 Material postmortem findings SHOULD update:
- threat records
- control records
- residual risk records
- risk review records
21.2 Rule
Incident lessons that expose new adversary capabilities or new control failure modes MUST be reflected in AZ-039 canon.
22. Relationship to conformance corpus
22.1 If incident reveals reproducible protocol or process defect, postmortem SHOULD trigger:
- new conformance cases
- regression cases
- upgrade boundary cases
- operator procedure cases
- monitoring validation cases
22.2 Rule
If no corpus update is needed, postmortem SHOULD explain why.
23. Relationship to runbooks and checklists
23.1 Postmortem remediation SHOULD explicitly state whether to update:
- incident runbooks
- operator launch manuals
- operator checklists
- restart/rejoin procedures
- upgrade rollout procedures
23.2 Rule
If incident exposed operator action weakness, checklist/runbook updates SHOULD be mandatory or explicitly waived with rationale.
24. Relationship to monitoring profiles
24.1 Postmortem SHOULD assess whether to change:
- alert thresholds
- alert routes
- anomaly classes
- correlation rules
- restricted posture criteria
- blind spot detection
24.2 Rule
Monitoring gaps exposed by incidents SHOULD feed directly into AZ-031 profiles.
25. Relationship to launch and upgrade decisions
25.1 Serious incidents SHOULD influence:
- launch blockers
- upgrade blockers
- restricted posture extension
- rollout policy changes
- mixed-fleet restrictions
- risk acceptance reviews
25.2 Rule
Postmortem outcomes SHOULD be decision-relevant, not historical only.
26. Relationship to archive and audit export
26.1 Postmortem records SHOULD be archivable and exportable via audit interface.
26.2 Export MAY include:
- postmortem record
- findings
- remediation set
- lessons subset
- risk updates
- redacted evidence
- closure verification state
26.3 Rule
External postmortem export SHOULD remain claim-centered and redaction-aware.
27. Closure model
27.1 Incident closure and postmortem closure are related but distinct.
27.2 Incident MAY close before all remediations are verified.
In that case, postmortem SHOULD remain open or partially finalized until follow-up is tracked.
27.3 Rule
“Incident closed” MUST NOT imply “lesson implemented”.
28. Remediation closure verification
28.1 Verification SHOULD require:
- explicit evidence of implementation
- explicit evidence of test/replay/drill if applicable
- closure status in remediation record
- update of related lesson status
28.2 Rule
Material remediation MUST NOT close solely on code merge or note in chat.
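The four requirements in 28.1 can be expressed as a list of closure blockers. In this sketch the record `ClosureEvidence` and its fields are hypothetical, and treating `verified` as the closing value of `remediation_status` is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ClosureEvidence:
    """Hypothetical bundle of what a reviewer checks before closing a remediation."""
    implementation_evidence_ref: Optional[str]
    test_applicable: bool
    test_evidence_ref: Optional[str]
    remediation_status: str
    lesson_status_updated: bool


def closure_blockers(ev: ClosureEvidence) -> List[str]:
    """Return the reasons a material remediation cannot close yet."""
    blockers = []
    if not ev.implementation_evidence_ref:
        blockers.append("missing implementation evidence")
    if ev.test_applicable and not ev.test_evidence_ref:
        blockers.append("missing test/replay/drill evidence")
    if ev.remediation_status != "verified":  # assumed closing status
        blockers.append("remediation record not in closure status")
    if not ev.lesson_status_updated:
        blockers.append("related lesson status not updated")
    return blockers


# A merged fix alone: implementation evidence exists, nothing else does.
merge_only = ClosureEvidence("merge-ref-1", True, None, "implemented", False)
fully_closed = ClosureEvidence("merge-ref-1", True, "replay-run-7", "verified", True)
```

The `merge_only` case shows rule 28.2 in action: a code merge satisfies only the first requirement and leaves three blockers open.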
29. Postmortem review and sign-off
29.1 Recommended reviewers
- incident commander
- subsystem owner
- security lead if security-relevant
- ops lead if operationally material
- audit scribe or review owner
- governance liaison if governance scope touched
29.2 Rule
Major postmortems SHOULD be reviewed by more than one perspective.
30. Postmortem supersession
30.1 Need
Later evidence may produce an improved postmortem or a corrected analysis.
30.2 Rule
Supersession MUST be explicit:
- prior_postmortem_id
- new_postmortem_id
- supersession_reason
- what materially changed
30.3 Rule
Older postmortem remains visible as historical artifact.
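A minimal sketch of explicit supersession follows, assuming a simple in-memory mapping from `postmortem_id` to status. The record name `SupersessionRecord`, the field `material_changes`, and the function `supersede` are illustrative; the point is that the prior postmortem is re-labelled, never deleted.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SupersessionRecord:
    """The four explicit items required by 30.2."""
    prior_postmortem_id: str
    new_postmortem_id: str
    supersession_reason: str
    material_changes: str


def supersede(statuses: Dict[str, str], record: SupersessionRecord) -> None:
    """Mark the prior postmortem as superseded while keeping it in the store."""
    if record.prior_postmortem_id == record.new_postmortem_id:
        raise ValueError("a postmortem cannot supersede itself")
    for pm_id in (record.prior_postmortem_id, record.new_postmortem_id):
        if pm_id not in statuses:
            raise KeyError(f"unknown postmortem: {pm_id}")
    if not record.supersession_reason.strip() or not record.material_changes.strip():
        raise ValueError("reason and material changes must be stated")
    statuses[record.prior_postmortem_id] = "PM_SUPERSEDED"


store = {"pm-1": "PM_FINAL", "pm-2": "PM_FINAL"}
supersede(store, SupersessionRecord("pm-1", "pm-2",
                                    "new evidence on detection latency",
                                    "primary cause reclassified"))
```

After the call, `pm-1` is still present in the store with status PM_SUPERSEDED, so the historical artifact stays visible.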
31. Postmortem revocation
31.1 Need
A postmortem may be materially wrong or scope-confused.
31.2 Rule
Revocation MUST be explicit and SHOULD include:
- target postmortem_id
- reason hash
- replacement ref if any
- effect on lessons/remediation linkage
31.3 Rule
Revocation MUST NOT silently erase incident memory.
32. Lessons verification
32.1 A lesson SHOULD move from accepted to implemented/verified only when:
- linked remediation completed
- verification plan passed
- related canon/runbook/checklist/corpus updates recorded
- lesson no longer merely aspirational
32.2 Rule
“Lesson learned” without structural change SHOULD NOT be marked implemented.
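The promotion conditions in 32.1 can be sketched as a gate on the lesson status. The record `LessonEvidence` and the function `may_promote` are hypothetical, and representing "no longer merely aspirational" as a reviewer-set boolean is an assumption.

```python
from dataclasses import dataclass


@dataclass
class LessonEvidence:
    """Hypothetical flags mirroring the four conditions in 32.1."""
    linked_remediation_completed: bool
    verification_plan_passed: bool
    canon_updates_recorded: bool
    structural_change_confirmed: bool  # reviewer judgment: not merely aspirational


def may_promote(lesson_status: str, evidence: LessonEvidence) -> bool:
    """True only if an accepted lesson meets every condition for implemented/verified."""
    if lesson_status != "accepted":
        return False
    return (evidence.linked_remediation_completed
            and evidence.verification_plan_passed
            and evidence.canon_updates_recorded
            and evidence.structural_change_confirmed)


slogan_only = LessonEvidence(False, False, False, False)
backed_by_change = LessonEvidence(True, True, True, True)
```

A lesson with no structural change behind it (`slogan_only`) stays at accepted, which is the behavior rule 32.2 asks for.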
33. Lessons query model
33.1 Registry SHOULD support queries such as:
- all lessons from consensus incidents
- all monitoring lessons not yet verified
- all lessons affecting upgrade rollout
- all lessons from key compromise incidents
- all lessons that changed residual risk canon
- all retired or superseded lessons
33.2 Rule
Lessons registry SHOULD be operationally searchable, not just archival.
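Two of the queries in 33.1 are sketched below over an in-memory list of lessons. Since `IncidentLesson` carries no incident class of its own, the sketch joins through `source_postmortem_id` using a lookup table; that table, the function name `query_lessons`, and the sample values are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class IncidentLesson:
    """Fields from section 16.2."""
    lesson_id: str
    lesson_class: str
    statement_hash: str
    source_postmortem_id: str
    applicability_scope_hash: str
    owner_role_class: str
    lesson_status: str


def query_lessons(lessons: Iterable[IncidentLesson],
                  incident_class_by_postmortem: Dict[str, str], *,
                  lesson_class: Optional[str] = None,
                  incident_class: Optional[str] = None,
                  status_not: Optional[str] = None) -> List[IncidentLesson]:
    """Filter lessons; every criterion left as None is ignored."""
    result = []
    for lesson in lessons:
        if lesson_class is not None and lesson.lesson_class != lesson_class:
            continue
        if status_not is not None and lesson.lesson_status == status_not:
            continue
        source_class = incident_class_by_postmortem.get(lesson.source_postmortem_id)
        if incident_class is not None and source_class != incident_class:
            continue
        result.append(lesson)
    return result


registry = [
    IncidentLesson("l1", "monitoring_lesson", "s1", "pm-1", "a1", "ops_lead", "accepted"),
    IncidentLesson("l2", "monitoring_lesson", "s2", "pm-2", "a2", "ops_lead", "verified"),
    IncidentLesson("l3", "design_lesson", "s3", "pm-1", "a3", "subsystem_owner", "implemented"),
]
classes = {"pm-1": "consensus", "pm-2": "key_compromise"}  # assumed class labels

unverified_monitoring = query_lessons(registry, classes,
                                      lesson_class="monitoring_lesson",
                                      status_not="verified")
from_consensus = query_lessons(registry, classes, incident_class="consensus")
```

Here `unverified_monitoring` contains only `l1`, and `from_consensus` contains `l1` and `l3`. A production registry would likely index these dimensions instead of scanning.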
34. Postmortem anti-patterns
34.1 Systems SHOULD avoid:
- postmortem as narrative with no structured fields
- root cause only, no contributing factors
- “human error” as endpoint with no control analysis
- no remediation owner
- no verification plan
- no lesson registry linkage
- no threat/risk canon update after new class of incident
- incident closed with remediation still aspirational
- lessons worded too vaguely to act on
- rewriting historical postmortem without supersession
35. Formal goals
AZ-042 pursues the following goals:
35.1 Structured incident memory
Every material incident can be reconstructed as timeline, cause, impact, control failure and remediation.
35.2 Actionable learning
Lessons become owned, tracked and verified changes rather than slogans.
35.3 Canon integration
Incident knowledge updates threat canon, residual risk, conformance and operations artifacts.
35.4 Audit-grade post-incident truth
External and internal reviewers can inspect incident learning in a consistent, exportable form.
36. Document formula
Incident Postmortem Canon = structured incident truth + typed findings + lessons registry + remediation with verification + canon/runbook/checklist/corpus updates
37. Relationship to the rest of the suite
- AZ-015 defines incident response.
- AZ-039 defines the threat and residual risk canon.
- AZ-040 defines conformance claims.
- AZ-042 defines how incidents become operational memory and verifiable systemic change.
In short: AZ-042 turns the incident from a passing event into infrastructure for learning and control.
38. What comes next
After AZ-042, the next document is:
AZ-043 — Constitutional Record and Network Identity Canon
It must fix:
- which records are constitutional for the identity of the network;
- how chain identity, genesis, hard fork lineage and governance-critical records relate to each other;
- and how we define the network's ultimate long-term normative memory.
Closing
An incident is truly closed not when the alerts have stopped, but when the cause is understood, the missing control is named, the change is made, the lesson is preserved, and the next occurrence of the same class of failure becomes less likely.
That is where a postmortem with real value begins.