AZ-042 — Incident Postmortem Canon and Lessons Registry v1

Status

This document defines:

the canonical form of incident postmortems;
the taxonomy of lessons learned;
the mapping between incident, cause, control failure, remediation and the lessons registry;
and the rules by which incidents actually change the threat canon, runbooks, checklists, monitoring and the conformance corpus.

After AZ-001 through AZ-041, the following already exist:

the specification of the protocol and its subsystems;
the security, incident response and recovery model;
launch discipline, monitoring, stabilization and archive;
upgrade, hard fork, key compromise and long-term preservation;
the threat model canon, the conformance claim framework and economic attack evaluation.

AZ-042 answers the question: how do we turn a real incident into a canonical set of truths and actionable lessons, so that the organization and the protocol do not repeat the same class of failure under another name, and so that the memory of the incident is not lost in informal summaries or inconsistent postmortems?

The purpose of this document is to fix:

the canonical structure of the postmortem;
the taxonomy of findings and lessons;
the relationship between incident, cause, control failure and remediation;
the central lessons-learned registry;
the rules for follow-up, closure and verification;
and the link to the threat canon, residual risk canon, runbooks, checklists, monitoring profiles, conformance corpus and audit export.

This document builds on:

AZ-002 through AZ-041, with direct emphasis on AZ-008, AZ-015, AZ-027, AZ-030, AZ-031, AZ-032, AZ-038, AZ-039, AZ-040 and AZ-041.

Terms:

MUST = mandatory
MUST NOT = prohibited
SHOULD = strongly recommended
MAY = optional

1. Objective

AZ-042 answers 10 critical questions:

What is a canonical incident postmortem?
What fields and sections must it contain?
How do we separate chronology, cause, impact, control failures and remediation?
What is a lesson and how is it stored centrally?
How do we link an incident to the threat canon and the residual risk canon?
How do we link the postmortem to runbooks, checklists, conformance and monitoring?
How do we verify that remediation actually happened?
When is a postmortem complete and when is it insufficient?
How do we export lessons and postmortems for external audit?
How do we avoid narrative, vague or non-actionable postmortems?

2. Principles

2.1 Postmortem is a truth artifact, not a ritual essay

The postmortem MUST be treated as an operational, auditable artifact, not as a ceremonial closing text.

2.2 Incident memory must be structured

Chronology, causes, failed controls, remediation and lessons MUST be kept separate and typed.

2.3 Every material incident must change something or explain why not

A material incident SHOULD produce:

remediation;
registry updates;
risk updates;
or an explicit explanation of why none is needed.

2.4 Root cause is rarely enough alone

The postmortem SHOULD capture:

trigger;
contributing factors;
failed assumptions;
failed controls;
detection gaps;
response gaps;
and blast radius.

2.5 Lessons without ownership decay into folklore

Every material lesson SHOULD have:

owner;
due boundary;
verification path;
and closure state.

2.6 The canon must outlive the people involved

Incident memory MUST remain useful even if the original team leaves or changes.

3. Postmortem purpose

3.1 A canonical postmortem SHOULD answer:

what happened?
when?
what was affected?
how did we detect it?
which control failed or was missing?
what allowed the escalation?
what did we do?
what are we changing?
how do we know we learned something real?

3.2 Rule

If the postmortem cannot answer these questions clearly, it is insufficient.

4. Incident classes in postmortem scope

4.1 Canon SHOULD support postmortems for:

consensus incidents
finality/liveness incidents
validation divergence
BVM execution incidents
witness/proof incidents
governance incidents
economic or spam incidents
release/genesis/provenance incidents
key compromise incidents
operator or rollout incidents
archive/preservation incidents
audit/export integrity incidents

4.2 Rule

Material incidents across all major trust boundaries SHOULD have canonical postmortem support.

5. Postmortem object model

5.1 Canonical structure

IncidentPostmortemRecord {
  version_major
  version_minor

  postmortem_id
  incident_id
  postmortem_scope_hash
  incident_class
  severity_class
  timeline_root
  impact_root
  cause_root
  control_failure_root
  remediation_root
  lesson_root
  risk_update_root?
  verification_plan_root?
  status
  created_at_unix_ms
  finalized_at_unix_ms?
  authoring_scope_hash
  metadata_hash?
}

5.2 status

PM_DRAFT
PM_IN_REVIEW
PM_FINAL
PM_SUPERSEDED
PM_REVOKED

5.3 Rule

Material incidents SHOULD reach PM_FINAL unless explicitly superseded.

6. Postmortem scope model

6.1 Canonical structure

PostmortemScope {
  scope_id
  target_network_class
  target_chain_id?
  target_genesis_hash?
  affected_release_package_id?
  affected_upgrade_proposal_id?
  affected_incident_window_hash
  affected_asset_root
}

6.2 Rule

Scope MUST bind the postmortem to exact incident context, not generalize implicitly.

7. Timeline model

7.1 Purpose

Incident memory needs exact chronology.

7.2 Canonical structure

IncidentTimelineEntry {
  timeline_entry_id
  event_class
  timestamp_unix_ms
  event_ref?
  summary_hash
}

7.3 event_class examples

first_signal_observed
incident_opened
escalation_triggered
operator_action_taken
decision_issued
mitigation_applied
recovery_started
recovery_completed
incident_closed
postmortem_opened
postmortem_finalized

7.4 Rule

Timelines SHOULD be precise enough to reconstruct sequencing and latency of response.
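
As an illustration of the rule above, the following minimal sketch orders timeline entries and derives response latencies from them. The dataclass mirrors the IncidentTimelineEntry fields; the helper names ordered and latency_ms, the tie-breaking by entry id, and the sample values are assumptions made for this example, not part of the canon.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IncidentTimelineEntry:
    timeline_entry_id: str
    event_class: str
    timestamp_unix_ms: int
    summary_hash: str
    event_ref: Optional[str] = None


def ordered(entries):
    """Return entries sorted by timestamp; ties are broken by entry id (assumption)."""
    return sorted(entries, key=lambda e: (e.timestamp_unix_ms, e.timeline_entry_id))


def latency_ms(entries, start_class, end_class):
    """Milliseconds from the first start_class event to the first end_class
    event at or after it. Returns None if either event is missing."""
    seq = ordered(entries)
    start = next((e for e in seq if e.event_class == start_class), None)
    if start is None:
        return None
    end = next(
        (e for e in seq
         if e.event_class == end_class
         and e.timestamp_unix_ms >= start.timestamp_unix_ms),
        None,
    )
    return None if end is None else end.timestamp_unix_ms - start.timestamp_unix_ms


# Sample entries, deliberately out of order; all values are illustrative.
timeline = [
    IncidentTimelineEntry("t3", "mitigation_applied", 1_700_000_900_000, "h3"),
    IncidentTimelineEntry("t1", "first_signal_observed", 1_700_000_000_000, "h1"),
    IncidentTimelineEntry("t2", "incident_opened", 1_700_000_120_000, "h2"),
]

print(latency_ms(timeline, "first_signal_observed", "incident_opened"))     # 120000
print(latency_ms(timeline, "first_signal_observed", "mitigation_applied"))  # 900000
print(latency_ms(timeline, "first_signal_observed", "incident_closed"))     # None
```

A missing event yields None rather than a guessed value, so a gap in the chronology stays visible instead of being silently filled.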

8. Impact model

8.1 Canonical structure

IncidentImpactRecord {
  impact_id
  affected_asset_root
  impact_class
  severity_class
  blast_radius_hash
  duration_hash?
  observed_metric_delta_root?
  notes_hash?
}

8.2 impact_class examples

finality_delay
liveness_loss
validation_inconsistency
execution_failure
governance_distortion
operator_unavailability
provenance_confidence_loss
archive_integrity_loss
auditability_gap

8.3 Rule

Impact SHOULD be recorded as actual effect, not mixed with cause.

9. Cause model

9.1 Need

Cause analysis must be typed.

9.2 Canonical structure

IncidentCauseRecord {
  cause_id
  cause_class
  primary
  statement_hash
  evidence_root?
}

9.3 cause_class examples

code_defect
parameter_misconfiguration
operator_misconfiguration
key_compromise
incompatible_upgrade_behavior
control_missing
monitoring_blind_spot
unexpected_adversary_strategy
archive_or_storage_failure
external_dependency_failure

9.4 Rule

Postmortem SHOULD support multiple causes with explicit primary/non-primary distinction.

10. Contributing factor model

10.1 Need

Many incidents have contributing factors distinct from primary cause.

10.2 Canonical structure

ContributingFactorRecord {
  factor_id
  factor_class
  statement_hash
  evidence_root?
}

10.3 factor_class examples

poor_threshold_tuning
runbook_gap
checklist_gap
alert_noise
delayed_detection
delayed_escalation
mixed_fleet_confusion
inadequate_test_coverage
residual_risk_underestimated
operator_training_gap

10.4 Rule

Contributing factors SHOULD NOT be buried inside prose.

11. Control failure model

11.1 Canonical structure

ControlFailureRecord {
  control_failure_id
  control_ref?
  failure_class
  statement_hash
  evidence_root?
  prevention_gap
  detection_gap
  recovery_gap
}

11.2 failure_class examples

control_missing
control_present_but_not_executed
control_present_but_ineffective
control_scope_too_narrow
control_silenced_by_noise
control_bypassed
control_not_tested

11.3 Rule

A material postmortem SHOULD identify control failures explicitly.

12. Detection analysis model

12.1 Purpose

The postmortem needs to record how the incident was noticed and where detection failed.

12.2 Canonical structure

DetectionAnalysisRecord {
  detection_id
  first_detection_source_class
  detection_latency_hash
  detection_quality_class
  missed_signal_root?
  notes_hash?
}

12.3 first_detection_source_class examples

monitoring_alert
operator_observation
user_report
audit_finding
simulation_or_test
external_party_report
archive_verification_run

12.4 Rule

Detection SHOULD be analyzed separately from mitigation quality.

13. Response analysis model

13.1 Canonical structure

ResponseAnalysisRecord {
  response_id
  response_quality_class
  escalation_latency_hash?
  mitigation_latency_hash?
  runbook_fit_class
  coordination_quality_class
  notes_hash?
}

13.2 response_quality_class examples

effective
delayed_but_effective
partially_effective
ineffective
harmful_side_effects

13.3 Rule

Response analysis SHOULD capture process quality, not only technical fix outcome.

14. Remediation model

14.1 Canonical structure

RemediationRecord {
  remediation_id
  remediation_class
  description_hash
  owner_role_class
  due_boundary_hash?
  verification_required
  remediation_status
}

14.2 remediation_class examples

code_fix
parameter_change
runbook_update
checklist_update
monitoring_threshold_update
alert_routing_update
training_or_process_update
conformance_case_addition
risk_canon_update
governance_policy_update
archive_repair_or_migration
no_change_explained

14.3 remediation_status

proposed
approved
in_progress
implemented
verified
deferred
rejected

14.4 Rule

Every material remediation SHOULD have owner and state.

15. Verification plan model

15.1 Need

Fixes and remediations need validation.

15.2 Canonical structure

RemediationVerificationPlan {
  verification_plan_id
  remediation_ref
  verification_method_root
  success_criteria_root
  due_boundary_hash?
}

15.3 verification_method examples

conformance_regression_case
replay_test
simulation_rerun
operator_drill
monitoring_validation
audit_export_check
archive_rebuild_check

15.4 Rule

Material remediation SHOULD NOT close without a verification plan.
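
One possible way to check this rule mechanically is sketched below. The two dataclasses follow the RemediationRecord and RemediationVerificationPlan field lists; the function missing_plans, the use of verification_required as the trigger, and the owner_role_class values are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RemediationRecord:
    remediation_id: str
    remediation_class: str
    description_hash: str
    owner_role_class: str
    verification_required: bool
    remediation_status: str
    due_boundary_hash: Optional[str] = None


@dataclass(frozen=True)
class RemediationVerificationPlan:
    verification_plan_id: str
    remediation_ref: str
    verification_method_root: str
    success_criteria_root: str
    due_boundary_hash: Optional[str] = None


def missing_plans(remediations, plans):
    """Return ids of remediations that require verification but have no plan."""
    covered = {p.remediation_ref for p in plans}
    return [
        r.remediation_id
        for r in remediations
        if r.verification_required and r.remediation_id not in covered
    ]


# Illustrative data; role names are hypothetical.
remediations = [
    RemediationRecord("rem-1", "code_fix", "d1", "subsystem_owner", True, "implemented"),
    RemediationRecord("rem-2", "runbook_update", "d2", "ops_lead", True, "in_progress"),
    RemediationRecord("rem-3", "no_change_explained", "d3", "security_lead", False, "approved"),
]
plans = [RemediationVerificationPlan("vp-1", "rem-1", "m1", "s1")]

print(missing_plans(remediations, plans))  # ['rem-2']
```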

16. Lesson model

16.1 Definition

A lesson is a portable statement extracted from incident truth that should influence future behavior, design or review.

16.2 Canonical structure

IncidentLesson {
  lesson_id
  lesson_class
  statement_hash
  source_postmortem_id
  applicability_scope_hash
  owner_role_class
  lesson_status
}

16.3 lesson_class examples

design_lesson
control_lesson
operator_lesson
monitoring_lesson
governance_lesson
archive_lesson
audit_lesson
rollout_lesson
training_lesson

16.4 lesson_status

proposed
accepted
implemented
verified
retired
superseded

16.5 Rule

A lesson SHOULD be more general than raw incident facts, but still concrete enough to act on.

17. Lessons registry

17.1 Definition

Lessons Registry = authoritative collection of active and historical lessons learned.

17.2 Canonical structure

LessonsRegistry {
  registry_id
  registry_scope_hash
  active_lesson_root
  retired_lesson_root?
  superseded_lesson_root?
  timestamp_unix_ms
}

17.3 Rule

The registry SHOULD support lookup by incident class, subsystem, control family and lesson class.

18. Postmortem findings model

18.1 Canonical structure

PostmortemFinding {
  finding_id
  finding_class
  severity_class
  statement_hash
  evidence_root?
  action_required
}

18.2 finding_class examples

root_cause_confirmed
control_gap
monitoring_gap
operator_gap
risk_underestimated
rollout_weakness
archive_weakness
training_gap
false_assumption_exposed

18.3 Rule

Findings SHOULD drive lessons and remediation, not just summarize narrative.

19. Postmortem completeness criteria

19.1 A postmortem SHOULD be considered complete only if it includes:

incident scope
timeline
impact analysis
cause and contributing factors
control failures
response analysis
remediation set
lesson set
verification plan for material remediations
links to risk or canon updates where relevant

19.2 Rule

A postmortem missing these SHOULD be considered incomplete for material incidents.
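
The completeness criteria in 19.1 lend themselves to a simple checklist function. The sketch below is one hedged interpretation: the section labels are hypothetical names for the listed items, and the two boolean parameters model the conditional items ("for material remediations", "where relevant").

```python
# Section names below are hypothetical labels for the items listed in 19.1.
REQUIRED_SECTIONS = (
    "incident_scope",
    "timeline",
    "impact_analysis",
    "causes_and_contributing_factors",
    "control_failures",
    "response_analysis",
    "remediation_set",
    "lesson_set",
)


def missing_sections(present, has_material_remediation, canon_update_relevant):
    """Return the 19.1 items a postmortem still lacks, in listing order."""
    missing = [s for s in REQUIRED_SECTIONS if s not in present]
    # Conditional items: required only when the condition in 19.1 applies.
    if has_material_remediation and "verification_plan" not in present:
        missing.append("verification_plan")
    if canon_update_relevant and "risk_or_canon_links" not in present:
        missing.append("risk_or_canon_links")
    return missing


draft = {"incident_scope", "timeline", "impact_analysis", "remediation_set"}
for name in missing_sections(draft, has_material_remediation=True,
                             canon_update_relevant=False):
    print("missing:", name)
```

An empty result means the postmortem meets the listed criteria; it says nothing about the quality of each section, which remains a review concern.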

20. Incomplete or limited postmortems

20.1 Sometimes evidence may be missing or incident may still be unfolding.

20.2 Rule

In such cases, postmortem MUST explicitly state:

incomplete sections
why incomplete
what evidence is still missing
review or refresh due boundary

20.3 Rule

A partial postmortem MUST NOT pretend finality.
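
The two rules above can be enforced together, as in the following sketch. IncompletenessDeclaration is a hypothetical record holding the four items required by 20.2, and finality_violations is an illustrative helper; neither is defined by this canon.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class IncompletenessDeclaration:
    """Hypothetical record carrying the four items required by 20.2."""
    incomplete_sections: Sequence[str]
    reason_hash: str
    missing_evidence_root: str
    refresh_due_boundary_hash: str


def finality_violations(status: str,
                        actual_gaps: Sequence[str],
                        declaration: Optional[IncompletenessDeclaration]) -> list:
    """List rule violations for a postmortem that still has gaps."""
    if not actual_gaps:
        return []
    problems = []
    if status == "PM_FINAL":
        problems.append("partial postmortem presented as PM_FINAL")
    if declaration is None:
        problems.append("gaps are not declared")
    elif set(declaration.incomplete_sections) != set(actual_gaps):
        problems.append("declared gaps do not match actual gaps")
    return problems


decl = IncompletenessDeclaration(["cause_root"], "why-hash", "evidence-root", "due-hash")
print(finality_violations("PM_IN_REVIEW", ["cause_root"], decl))  # []
print(finality_violations("PM_FINAL", ["cause_root"], None))
```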

21. Relationship to threat canon

21.1 Material postmortem findings SHOULD update:

threat records
control records
residual risk records
risk review records

21.2 Rule

Incident lessons that expose new adversary capabilities or new control failure modes MUST be reflected in AZ-039 canon.

22. Relationship to conformance corpus

22.1 If incident reveals reproducible protocol or process defect, postmortem SHOULD trigger:

new conformance cases
regression cases
upgrade boundary cases
operator procedure cases
monitoring validation cases

22.2 Rule

If no corpus update is needed, postmortem SHOULD explain why.

23. Relationship to runbooks and checklists

23.1 Postmortem remediation SHOULD explicitly state whether to update:

incident runbooks
operator launch manuals
operator checklists
restart/rejoin procedures
upgrade rollout procedures

23.2 Rule

If incident exposed operator action weakness, checklist/runbook updates SHOULD be mandatory or explicitly waived with rationale.

24. Relationship to monitoring profiles

24.1 Postmortem SHOULD assess whether to change:

alert thresholds
alert routes
anomaly classes
correlation rules
restricted posture criteria
blind spot detection

24.2 Rule

Monitoring gaps exposed by incidents SHOULD feed directly into AZ-031 profiles.

25. Relationship to launch and upgrade decisions

25.1 Serious incidents SHOULD influence:

launch blockers
upgrade blockers
restricted posture extension
rollout policy changes
mixed-fleet restrictions
risk acceptance reviews

25.2 Rule

Postmortem outcomes SHOULD be decision-relevant, not historical only.

26. Relationship to archive and audit export

26.1 Postmortem records SHOULD be archivable and exportable via audit interface.

26.2 Export MAY include:

postmortem record
findings
remediation set
lessons subset
risk updates
redacted evidence
closure verification state

26.3 Rule

External postmortem export SHOULD remain claim-centered and redaction-aware.

27. Closure model

27.1 Incident closure and postmortem closure are related but distinct.

27.2 Incident MAY close before all remediations are verified.

In that case, postmortem SHOULD remain open or partially finalized until follow-up is tracked.

27.3 Rule

“Incident closed” MUST NOT imply “lesson implemented”.

28. Remediation closure verification

28.1 Verification SHOULD require:

explicit evidence of implementation
explicit evidence of test/replay/drill if applicable
closure status in remediation record
update of related lesson status

28.2 Rule

Material remediation MUST NOT close solely on code merge or note in chat.
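
A closure gate built on 28.1 might look like the sketch below. ClosureEvidence is a hypothetical bundle of the four listed items, and treating remediation_status == "verified" as the "closure status" is an assumption of this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClosureEvidence:
    """Hypothetical bundle of the four items listed in 28.1."""
    implementation_evidence_root: Optional[str]
    exercise_applicable: bool              # does a test/replay/drill apply?
    exercise_evidence_root: Optional[str]
    remediation_status: str
    lesson_status_updated: bool


def closure_blockers(ev: ClosureEvidence) -> list:
    """Return the reasons a material remediation cannot be closed yet."""
    blockers = []
    if not ev.implementation_evidence_root:
        blockers.append("no implementation evidence")
    if ev.exercise_applicable and not ev.exercise_evidence_root:
        blockers.append("no test/replay/drill evidence")
    if ev.remediation_status != "verified":
        blockers.append("remediation record not in verified state")
    if not ev.lesson_status_updated:
        blockers.append("related lesson status not updated")
    return blockers


# A code merge alone: nothing recorded except an 'implemented' status.
merged_only = ClosureEvidence(None, True, None, "implemented", False)
print(len(closure_blockers(merged_only)))  # 4

complete = ClosureEvidence("impl-root", True, "drill-root", "verified", True)
print(closure_blockers(complete))  # []
```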

29. Postmortem review and sign-off

29.1 Recommended reviewers

incident commander
subsystem owner
security lead if security-relevant
ops lead if operationally material
audit scribe or review owner
governance liaison if governance scope touched

29.2 Rule

Major postmortems SHOULD be reviewed by more than one perspective.

30. Postmortem supersession

30.1 Need

Later evidence may produce improved postmortem or corrected analysis.

30.2 Rule

Supersession MUST be explicit:

prior_postmortem_id
new_postmortem_id
supersession_reason
what materially changed

30.3 Rule

Older postmortem remains visible as historical artifact.
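
The supersession rules can be sketched as a pure function over a postmortem status map. SupersessionRecord follows the four items in 30.2 (material_changes stands for "what materially changed"); apply_supersession and the PM_DRAFT default for the new postmortem are assumptions of this example.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SupersessionRecord:
    prior_postmortem_id: str
    new_postmortem_id: str
    supersession_reason: str
    material_changes: str  # "what materially changed"


def apply_supersession(statuses: dict, record: SupersessionRecord) -> dict:
    """Return a new id -> status map with the prior postmortem marked superseded."""
    empty = [f.name for f in fields(record) if not getattr(record, f.name)]
    if empty:
        raise ValueError(f"supersession must be explicit; empty fields: {empty}")
    if record.prior_postmortem_id not in statuses:
        raise KeyError(record.prior_postmortem_id)
    updated = dict(statuses)
    updated[record.prior_postmortem_id] = "PM_SUPERSEDED"  # kept, not deleted
    updated.setdefault(record.new_postmortem_id, "PM_DRAFT")  # assumed default
    return updated


before = {"pm-7": "PM_FINAL"}
after = apply_supersession(
    before,
    SupersessionRecord("pm-7", "pm-9", "new evidence", "primary cause reclassified"),
)
print(after)  # {'pm-7': 'PM_SUPERSEDED', 'pm-9': 'PM_DRAFT'}
```

The input map is left unchanged and the prior entry is never removed, which keeps the older postmortem available as a historical artifact.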

31. Postmortem revocation

31.1 Need

A postmortem may be materially wrong or scope-confused.

31.2 Rule

Revocation MUST be explicit and SHOULD include:

target postmortem_id
reason hash
replacement ref if any
effect on lessons/remediation linkage

31.3 Rule

Revocation MUST NOT silently erase incident memory.

32. Lessons verification

32.1 A lesson SHOULD move from accepted to implemented/verified only when:

linked remediation completed
verification plan passed
related canon/runbook/checklist/corpus updates recorded
lesson no longer merely aspirational

32.2 Rule

“Lesson learned” without structural change SHOULD NOT be marked implemented.

33. Lessons query model

33.1 Registry SHOULD support queries such as:

all lessons from consensus incidents
all monitoring lessons not yet verified
all lessons affecting upgrade rollout
all lessons from key compromise incidents
all lessons that changed residual risk canon
all retired or superseded lessons

33.2 Rule

Lessons registry SHOULD be operationally searchable, not just archival.
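
Two of the queries listed in 33.1 are sketched below as an in-memory filter. IncidentLesson follows the field list in 16.2; the query function, the incident_class_by_pm lookup (postmortem id to incident class) and the identifier values such as "consensus" are hypothetical and only illustrate the shape of such a lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentLesson:
    lesson_id: str
    lesson_class: str
    statement_hash: str
    source_postmortem_id: str
    applicability_scope_hash: str
    owner_role_class: str
    lesson_status: str


def query(lessons, incident_class_by_pm, *, lesson_class=None,
          incident_class=None, exclude_status=()):
    """Filter lessons; a criterion left as None is not applied."""
    out = []
    for lesson in lessons:
        if lesson_class is not None and lesson.lesson_class != lesson_class:
            continue
        source_class = incident_class_by_pm.get(lesson.source_postmortem_id)
        if incident_class is not None and source_class != incident_class:
            continue
        if lesson.lesson_status in exclude_status:
            continue
        out.append(lesson.lesson_id)
    return out


# Illustrative registry content.
lessons = [
    IncidentLesson("les-1", "monitoring_lesson", "s1", "pm-1", "a1", "ops_lead", "accepted"),
    IncidentLesson("les-2", "monitoring_lesson", "s2", "pm-2", "a2", "ops_lead", "verified"),
    IncidentLesson("les-3", "design_lesson", "s3", "pm-1", "a3", "subsystem_owner", "implemented"),
]
incident_class_by_pm = {"pm-1": "consensus", "pm-2": "key_compromise"}

# "all monitoring lessons not yet verified"
print(query(lessons, incident_class_by_pm,
            lesson_class="monitoring_lesson", exclude_status=("verified",)))  # ['les-1']
# "all lessons from consensus incidents"
print(query(lessons, incident_class_by_pm, incident_class="consensus"))  # ['les-1', 'les-3']
```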

34. Postmortem anti-patterns

34.1 Systems SHOULD avoid:

postmortem as narrative with no structured fields
root cause only, no contributing factors
“human error” as endpoint with no control analysis
no remediation owner
no verification plan
no lesson registry linkage
no threat/risk canon update after new class of incident
incident closed with remediation still aspirational
lessons worded too vaguely to act on
rewriting historical postmortem without supersession

35. Formal goals

AZ-042 pursues the following goals:

35.1 Structured incident memory

Every material incident can be reconstructed as timeline, cause, impact, control failure and remediation.

35.2 Actionable learning

Lessons become owned, tracked and verified changes rather than slogans.

35.3 Canon integration

Incident knowledge updates threat canon, residual risk, conformance and operations artifacts.

35.4 Audit-grade post-incident truth

External and internal reviewers can inspect incident learning in a consistent, exportable form.

36. Document formula

Incident Postmortem Canon = structured incident truth + typed findings + lessons registry + remediation with verification + canon/runbook/checklist/corpus updates

37. Relationship to the rest of the suite

AZ-015 defines incident response.
AZ-039 defines the threat and residual risk canon.
AZ-040 defines conformance claims.
AZ-042 defines how incidents become operational memory and verifiable systemic change.

In short: AZ-042 turns the incident from a passing event into infrastructure for learning and control.

38. What comes next

After AZ-042, the correct next document is:

AZ-043 — Constitutional Record and Network Identity Canon

That document must fix:

which records are constitutional for the network's identity;
how chain identity, genesis, hard fork lineage and governance-critical records are linked;
and how we define the network's ultimate long-term normative memory.

Closing

An incident is truly closed not when the alerts have stopped, but when the cause is understood, the missing control is named, the change is made, the lesson is preserved, and the next occurrence of the same class of failure becomes less likely.

That is where a postmortem of real value begins.