AZ-015 — Incident Response and Recovery Runbooks v1

Status

This document defines the operational runbooks for incidents and recovery in ATLAS ZERO.

Earlier specifications defined:

protocol rules,
consensus,
BVM,
witness,
economics,
agents,
governance,
security,
conformance,
node architecture,
fraud proofs and slashing evidence.

AZ-015 answers the practical question: what does the system do, and what do operators do, when things go wrong?

Its purpose is to define:

incident classes;
severity levels;
standard operational steps;
the conditions for safe mode, halt, quarantine, and recovery;
traceability of decisions;
criteria for returning to normal operation.

This document builds on:

AZ-002 through AZ-014.

Terms:

MUST = mandatory
MUST NOT = forbidden
SHOULD = strongly recommended
MAY = optional

1. Objective

AZ-015 answers 10 operational questions:

What is an incident in the protocol?
How is it classified and prioritized?
Who may declare safe mode, halt, or quarantine?
What is the exact order of the response steps?
Which evidence and logs must be preserved?
When is emergency governance used?
How are replay, rebuild, and recovery performed?
How is the state of the system communicated without ambiguity?
When is a return to normal permitted?
How do we keep recovery from becoming a source of arbitrariness or divergence?

2. Principles

2.1 Incident response is protocol-aware

Incident response MUST respect:

finality;
boundedness;
constitutional governance;
the safe mode and emergency powers model.

2.2 Contain first, explain second

In critical incidents, the priority order is:

detection,
minimal confirmation,
containment,
evidence preservation,
then the full explanation.

2.3 No hidden fixes

Nodes and operators MUST NOT apply “invisible fixes” that change protocol truth without a verifiable trail.

2.4 Recovery is staged

Recovery SHOULD be gradual:

local degradation,
protocol safe mode,
scoped halt,
emergency restriction,
replay/rebuild,
controlled re-enable.

2.5 Evidence preservation

Every relevant incident MUST preserve:

the objects involved,
relevant state roots,
structured logs,
parameter state,
evidence objects,
who decided what and when.

3. Incident taxonomy

3.1 Primary incident classes

ATLAS ZERO SHOULD classify incidents into:

INC_CONSENSUS
INC_VALIDATION
INC_BVM
INC_WITNESS_PROOF
INC_ECONOMIC
INC_AGENT_CONTROL
INC_GOVERNANCE
INC_KEY_COMPROMISE
INC_INFRASTRUCTURE
INC_SUPPLY_CHAIN
INC_OBSERVABILITY_ONLY

3.2 Why this matters

The incident class determines:

who must be alerted;
whether safe mode is sufficient;
whether protocol-level proof is required;
whether recovery affects consensus or only operations.

4. Severity model

4.1 Standard severity levels

Protocol operations SHOULD use:

SEV_0_INFO
SEV_1_LOW
SEV_2_MEDIUM
SEV_3_HIGH
SEV_4_CRITICAL
SEV_5_SYSTEMIC

4.2 Guidance

SEV_0_INFO

Observability only, or a warning with no operational impact.

SEV_1_LOW

Minor degradation, with no impact on finality or funds.

SEV_2_MEDIUM

Local or temporary impact; local safe mode may be needed.

SEV_3_HIGH

May affect:

liveness,
critical journaling,
local oracle correctness,
bounded agent operation.

SEV_4_CRITICAL

May affect:

finality,
validation,
bounded execution,
integrity of funds in certain domains.

SEV_5_SYSTEMIC

Broad threat to the protocol:

deterministic split,
incompatible notarizations,
systemically exploitable validation/VM bug,
governance capture or dangerous expansion,
compromise of major keys/committees.

5. Incident states

5.1 Lifecycle states

An incident SHOULD pass through:

DETECTED
TRIAGED
CONFIRMED
CONTAINING
STABILIZED
RECOVERING
MONITORING
CLOSED
POSTMORTEM_PENDING
POSTMORTEM_PUBLISHED

5.2 Rule

A critical incident MUST NOT jump directly to CLOSED without evidence, a recovery state, and review.
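
The lifecycle rule can be sketched as a small transition check. This is a minimal illustration, not a normative state machine: it assumes that "critical" means SEV_4_CRITICAL and above, that the lifecycle only moves forward, and that lower severities may skip states. The function name can_transition and its boolean flags are hypothetical.

```python
# Lifecycle order as listed in section 5.1.
LIFECYCLE = [
    "DETECTED", "TRIAGED", "CONFIRMED", "CONTAINING", "STABILIZED",
    "RECOVERING", "MONITORING", "CLOSED",
    "POSTMORTEM_PENDING", "POSTMORTEM_PUBLISHED",
]

# Assumption: "critical" means SEV_4_CRITICAL and above.
CRITICAL_SEVERITIES = {"SEV_4_CRITICAL", "SEV_5_SYSTEMIC"}


def can_transition(current, target, severity,
                   has_evidence=False, has_recovery_state=False, reviewed=False):
    """Return True if moving from `current` to `target` is allowed."""
    i, j = LIFECYCLE.index(current), LIFECYCLE.index(target)
    if j <= i:
        return False  # assumption: the lifecycle only moves forward
    if severity not in CRITICAL_SEVERITIES:
        return True  # assumption: lower severities may skip states
    if j != i + 1:
        return False  # critical incidents advance one state at a time
    if target == "CLOSED":
        # Rule 5.2: closing needs evidence, a recovery state, and review.
        return has_evidence and has_recovery_state and reviewed
    return True


print(can_transition("DETECTED", "CLOSED", "SEV_4_CRITICAL"))
print(can_transition("MONITORING", "CLOSED", "SEV_4_CRITICAL", True, True, True))
```

The first call is rejected because a critical incident tries to skip straight to CLOSED; the second is accepted because it arrives from MONITORING with all three preconditions met.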

6. Incident roles

6.1 Roles

Operationally, the system SHOULD separate:

Incident Reporter
On-call Operator
Incident Commander
Consensus Lead
Execution/BVM Lead
Security Lead
Governance Liaison
Recovery Operator
Audit Scribe

6.2 Rule

In small systems, some roles may coincide. In serious systems, critical roles SHOULD be separate.

6.3 Audit Scribe

An explicit role for journaling decisions is recommended. Without it, the postmortem becomes imprecise.

7. Incident object model

7.1 Abstract structure

IncidentRecord {
  incident_id
  incident_class
  severity
  detection_time
  detected_by
  status
  affected_scopes_hash
  initial_evidence_hash
  current_response_profile
  related_object_refs
  related_runbook_id
  metadata_hash?
}

7.2 incident_id

incident_id = H("AZ:INCIDENT:" || canonical_incident_record)
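
A minimal sketch of the identifier derivation follows. The specification does not fix H or the canonical encoding in this section, so the sketch assumes SHA-256 and a canonical form of JSON with sorted keys and compact separators. It also assumes the incident_id field itself is excluded from the hashed record. The helper names canonical_bytes and incident_id are hypothetical.

```python
import hashlib
import json

DOMAIN_TAG = b"AZ:INCIDENT:"  # domain-separation prefix from section 7.2


def canonical_bytes(record: dict) -> bytes:
    """Assumed canonical encoding: sorted keys, compact separators, UTF-8."""
    # incident_id is excluded so the identifier does not depend on itself.
    body = {k: v for k, v in record.items() if k != "incident_id"}
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def incident_id(record: dict) -> str:
    """H("AZ:INCIDENT:" || canonical_incident_record), with H assumed to be SHA-256."""
    return hashlib.sha256(DOMAIN_TAG + canonical_bytes(record)).hexdigest()


record = {
    "incident_class": "INC_BVM",
    "severity": "SEV_4_CRITICAL",
    "detection_time": 1700000000,
    "detected_by": "node-7",
    "status": "DETECTED",
}
# Same fields in a different insertion order must yield the same identifier.
same_record_reordered = dict(reversed(list(record.items())))
print(incident_id(record) == incident_id(same_record_reordered))
```

Because keys are sorted before hashing, two nodes that build the same record in a different field order derive the same identifier, which is the property a canonical encoding is meant to provide.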

7.3 Rule

Incident records SHOULD be canonical and append-only in audit systems. Normative protocol effects still require protocol objects, not just incident records.

8. Response profiles

8.1 Need

Not all incidents call for the same intensity of response.

8.2 Standard profiles

RP_OBSERVE_ONLY
RP_LOCAL_DEGRADED
RP_SAFE_MODE_SCOPED
RP_HALT_SCOPED
RP_EMERGENCY_RESTRICTION
RP_PROTOCOL_RECOVERY
RP_FORENSIC_ONLY

8.3 Rule

The response profile SHOULD be chosen from severity + incident class + estimated blast radius.

9. Generic incident response ladder

9.1 Canonical order

For any significant incident, the order SHOULD be:

detect
classify
preserve evidence
scope impact
choose containment profile
apply safe mode/halt/restriction if needed
validate current finalized truth
prepare recovery path
perform recovery
monitor stabilization
publish postmortem and permanent fixes

9.2 Rule

Operators MUST NOT begin “cleanup” that destroys evidence before it has been preserved.

10. Evidence preservation requirements

10.1 For any SEV_3+

MUST preserve at minimum:

relevant object refs
local and finalized state roots
active parameter state
governance/emergency state
relevant logs
peer messages around event window where relevant
slash/fraud evidence if present
code/version identifiers

10.2 For consensus incidents

Also preserve:

block candidate DAG view
committee derivation inputs
verifier/notary messages
notarization candidates
replay traces

10.3 For BVM incidents

Also preserve:

module blob
code hash
args bytes
prior state blob/root
execution trace if available
effect accumulator data
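
The checklists in this section can be sketched as a lookup that returns what to preserve for a given severity and class. This is illustrative only: it assumes the class-specific lists in 10.2 and 10.3 are added on top of the 10.1 base set, and it returns an empty list below SEV_3 only because this section states no requirement there. The name required_evidence is hypothetical.

```python
SEVERITY_RANK = {
    "SEV_0_INFO": 0, "SEV_1_LOW": 1, "SEV_2_MEDIUM": 2,
    "SEV_3_HIGH": 3, "SEV_4_CRITICAL": 4, "SEV_5_SYSTEMIC": 5,
}

BASE_EVIDENCE = [  # section 10.1
    "relevant object refs",
    "local and finalized state roots",
    "active parameter state",
    "governance/emergency state",
    "relevant logs",
    "peer messages around event window where relevant",
    "slash/fraud evidence if present",
    "code/version identifiers",
]

CLASS_EVIDENCE = {
    "INC_CONSENSUS": [  # section 10.2
        "block candidate DAG view",
        "committee derivation inputs",
        "verifier/notary messages",
        "notarization candidates",
        "replay traces",
    ],
    "INC_BVM": [  # section 10.3
        "module blob",
        "code hash",
        "args bytes",
        "prior state blob/root",
        "execution trace if available",
        "effect accumulator data",
    ],
}


def required_evidence(severity, incident_class):
    """Return the minimum evidence checklist for an incident."""
    if SEVERITY_RANK[severity] < 3:
        return []  # no minimum is stated below SEV_3 in this section
    # Base set plus any class-specific additions (assumed cumulative).
    return BASE_EVIDENCE + CLASS_EVIDENCE.get(incident_class, [])


for item in required_evidence("SEV_4_CRITICAL", "INC_BVM"):
    print(item)
```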

11. Safe mode definitions

11.1 Local degraded mode

Node-local. May do:

reduce RPC exposure
stop proposer role
stop agent submission integration
keep full validation active
keep observing network

Does NOT change protocol truth.

11.2 Protocol safe mode

Protocol-recognized restriction mode. Can do:

tighten risk thresholds
force extra approvals
restrict agent classes
restrict feature subsets
move domains to exit-only

Requires protocol objects if consensus-relevant.

11.3 Rule

Every safe mode MUST be labeled either:

local only or
protocol active

Never mix them implicitly.

12. Halt definitions

12.1 Local service halt

Stops local service components, without claiming a protocol-wide halt:

API
proposer
agent relayer
indexing

12.2 Scoped protocol halt

Stops, if protocol objects authorize it:

one machine,
one mandate scope,
one witness issuer class,
one feature domain,
one economic subsystem subset.

12.3 Systemic protocol halt

Highly exceptional. Should be considered only when:

widespread invalid state risk exists,
finality or execution integrity is threatened,
less restrictive containment is insufficient.

12.4 Rule

Protocol halts MUST be scoped as narrowly as possible.

13. Quarantine model

13.1 Purpose

Quarantine is weaker than permanent disablement. Used when confidence is incomplete but risk is material.

13.2 Quarantine targets

issuer
oracle source
committee member set
machine family
BVM module family
agent operator
governance actor stream
node peer cluster locally

13.3 Effects

May include:

ignore or down-rank objects from target
require corroboration
disallow new actions from target
freeze role privileges pending review

13.4 Rule

Protocol quarantine with consensus impact requires protocol authority path. Local peer quarantine does not.

14. Consensus incident runbook

14.1 Trigger examples

conflicting notarization observed
repeated no-finality
committee selection mismatch
front selection divergence
suspicious double-signing

14.2 Minimum actions

preserve notarization candidates and committee derivation inputs
stop local proposer/notary if risk of compounding fault
continue validation if safe
replay finalized and candidate front from last safe checkpoint
identify whether issue is:
- invalid object,
- local implementation divergence,
- genuine protocol fault,
- adversarial equivocation
generate fraud proof if possible
if blast radius high, prepare emergency restriction request
avoid accepting new dubious finality until verified

14.3 Recovery goals

confirm last unquestionably finalized epoch
isolate incompatible candidate branches
ensure no false hard-final state is exposed as safe
restore healthy finality path

15. Validation incident runbook

15.1 Trigger examples

one implementation accepts, another rejects same tx
canonical decode mismatch
reference resolution mismatch
expiry/revocation mismatch

15.2 Minimum actions

preserve exact input bytes and state fixture
run AZ-011/AZ-003 targeted replay
classify whether:
- node bug,
- spec ambiguity,
- malformed object,
- feature activation mismatch
if consensus-critical ambiguity exists, enter safe mode for affected roles
publish temporary operational guidance
prepare patched implementation and/or governance clarification if needed

15.3 Rule

Validation ambiguity is potentially systemic until proven otherwise.

16. BVM incident runbook

16.1 Trigger examples

divergent execution results
unexpected trap/revert pattern
effect digest mismatch
manifest accepted but should fail
exec cost mis-accounting
host call permission leak

16.2 Minimum actions

preserve module, args, prior state, parameter state
disable risky machine family or module scope if needed
compare against semantic oracle/runtime reference
run deterministic replay on multiple implementations
classify:
- verifier bug,
- runtime bug,
- malformed bytecode,
- spec gap
if exploitability exists, escalate to protocol restriction or halt for affected domain
prepare patched verifier/runtime
require explicit re-enable criteria

16.3 Rule

BVM incidents with possible boundedness bypass SHOULD default to fail closed for affected machine scope.

17. Witness / proof incident runbook

17.1 Trigger examples

contradictory oracle claims
revocation mismatch
proof verifier bug
stale claim accepted
unauthorized witness emission

17.2 Minimum actions

preserve witness/proof objects and issuer policies
mark affected issuer/source/domain
if high impact, quarantine issuer and require corroboration
derive whether objects are:
- invalid,
- contradictory,
- stale,
- parser-bug dependent
generate fraud proof or invalidation trail where possible
if settlement or treasury flows depend on affected witness family, freeze those flows as needed
define revalidation plan

18. Economic incident runbook

18.1 Trigger examples

fee underpricing exploited
rent bypass
slash amount bug
reward distortion
no-finality reward leak
spam overwhelms state or mempool economics

18.2 Minimum actions

preserve parameter state and observed exploit path
quantify exploit economics and blast radius
apply local throttles if safe and non-consensus
if protocol-level fix needed, prepare emergency restriction or fast governance path within constitution
mark affected feature/domain
stop claiming economic normality until patch active
simulate patched parameters before activation

18.3 Rule

Economic incidents that do not yet corrupt finality can still justify rapid restriction if exploit path is cheap and repeatable.

19. Agent incident runbook

19.1 Trigger examples

action outside mandate
missing mandatory log
action after halt/revoke
cap bypass
compromised operator key
inconsistent decision/execution logs

19.2 Minimum actions

preserve mandate snapshot and action refs
halt or suspend affected mandate scope
rotate agent/operator if compromise suspected
move to exit-only where possible
produce witness/audit observation or slashing evidence if applicable
quantify open exposure and close/reduce if policy allows
require re-authorization before restart

19.3 Rule

For high-impact agents, halt first and explain second.

20. Governance incident runbook

20.1 Trigger examples

activation before timelock
challenge window bypass
missing required review
emergency action out of scope
proposal class mislabeling
conflicting activations

20.2 Minimum actions

preserve all proposal/review/vote/outcome/activation objects
compute active governance state from last finalized good point
identify whether incident is:
- invalid object,
- implementation bug,
- procedural abuse,
- constitutional violation
suspend applying disputed future activations
if already activated improperly, enter governance anomaly state and apply constitutionally allowed containment
require constitutional review for restart of disputed path

20.3 Rule

Governance anomalies must never be hidden behind UI-level reinterpretation.

21. Key compromise runbook

21.1 Trigger examples

validator key leak
notary key compromise
oracle issuer key compromise
agent operator key compromise
governance signer compromise

21.2 Minimum actions

preserve compromise evidence and timeline
rotate or revoke affected keys/policies
quarantine recent messages if required by rules
assess whether any signed objects become slashable evidence
reduce privileges on remaining linked scopes
force re-authorization for critical flows
publish blast radius assessment

21.3 Rule

Key compromise response SHOULD prefer scoped revocation and role separation over broad panic shutdown where possible.

22. Infrastructure incident runbook

22.1 Trigger examples

data corruption in indexer
snapshot corruption
disk or DB issues
network partition local to operator
telemetry pipeline failure
time sync drift

22.2 Minimum actions

distinguish consensus-critical vs non-critical impact
if finalized truth at risk locally, stop producing role actions
rebuild from last finalized checkpoint if needed
quarantine corrupted local indexes
verify state root against peers/reference checkpoints
only re-enable proposer/verifier/notary roles after integrity checks pass

22.3 Rule

Infrastructure failure MUST NOT silently continue as if node were healthy in a consensus role.

23. Supply chain incident runbook

23.1 Trigger examples

malicious dependency
compromised build artifact
divergent release binary
compiler bug affecting determinism

23.2 Minimum actions

preserve binary hashes and build metadata
compare reproducible builds
suspend rollout and possibly role participation
identify affected versions
require rebuilt, verified artifacts
consider protocol-level quarantine of known-bad implementation families only if constitutionally allowed and technically grounded

24. Containment decision matrix

24.1 Containment choices

observe only
local degraded mode
local service halt
scoped protocol safe mode
scoped protocol halt
issuer/committee quarantine
emergency restriction
recovery mode

24.2 Selection guidance

Use the least restrictive measure that:

stops further damage,
preserves evidence,
does not create protocol ambiguity,
is actually fast enough.

24.3 Rule

Containment MUST be scoped and reversible where possible.
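
The selection guidance can be sketched as a search for the first acceptable measure. This is a hedged illustration: it treats the order of the list in 24.1 as a ranking from least to most restrictive, which is an assumption, and the Assessment type and select_containment function are hypothetical names. The four boolean judgments still come from the operators, not from code.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Assumption: the order in 24.1 runs from least to most restrictive.
CONTAINMENT_CHOICES = [
    "observe only",
    "local degraded mode",
    "local service halt",
    "scoped protocol safe mode",
    "scoped protocol halt",
    "issuer/committee quarantine",
    "emergency restriction",
    "recovery mode",
]


@dataclass(frozen=True)
class Assessment:
    """Operator judgment of one containment choice against section 24.2."""
    stops_damage: bool
    preserves_evidence: bool
    avoids_protocol_ambiguity: bool
    fast_enough: bool

    def acceptable(self) -> bool:
        return all((self.stops_damage, self.preserves_evidence,
                    self.avoids_protocol_ambiguity, self.fast_enough))


def select_containment(assessments: Dict[str, Assessment]) -> Optional[str]:
    """Return the least restrictive assessed choice meeting all four criteria."""
    for choice in CONTAINMENT_CHOICES:
        assessment = assessments.get(choice)
        if assessment is not None and assessment.acceptable():
            return choice
    return None  # nothing assessed so far is acceptable; escalate to a human decision


assessments = {
    "observe only": Assessment(False, True, True, True),
    "local degraded mode": Assessment(True, True, True, True),
    "scoped protocol halt": Assessment(True, True, True, True),
}
print(select_containment(assessments))
```

Here the halt would also work, but the local degraded mode is returned because it comes earlier in the ordering and already satisfies every criterion.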

25. Emergency governance escalation criteria

25.1 Consider emergency path when:

protocol-level exploit is active or imminent;
less restrictive containment is insufficient;
affected scope is clearly definable;
evidence exists or risk is extreme;
delay from normal governance would materially worsen damage.

25.2 Do not use emergency path when:

issue is purely local/operator-side;
impact is observability-only;
evidence is too weak and restriction would be broader than justified;
normal path is fast enough and adequate.
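
The two lists above can be read as a gate with exclusions. The sketch below assumes the conditions in 25.1 must all hold together (with "evidence exists or risk is extreme" as a single either-or condition) and that any one condition in 25.2 rules the emergency path out. The text does not state this combination logic explicitly, so treat it as one possible reading. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EscalationInputs:
    # Section 25.1 conditions
    exploit_active_or_imminent: bool = False
    lesser_containment_insufficient: bool = False
    scope_clearly_definable: bool = False
    evidence_exists: bool = False
    risk_extreme: bool = False
    delay_worsens_damage: bool = False
    # Section 25.2 exclusions
    purely_local: bool = False
    observability_only: bool = False
    weak_evidence_broad_restriction: bool = False
    normal_path_adequate: bool = False


def consider_emergency_path(x: EscalationInputs) -> bool:
    """True only if no 25.2 exclusion applies and every 25.1 condition holds."""
    excluded = any((x.purely_local, x.observability_only,
                    x.weak_evidence_broad_restriction, x.normal_path_adequate))
    if excluded:
        return False
    return all((
        x.exploit_active_or_imminent,
        x.lesser_containment_insufficient,
        x.scope_clearly_definable,
        x.evidence_exists or x.risk_extreme,
        x.delay_worsens_damage,
    ))


live_exploit = EscalationInputs(
    exploit_active_or_imminent=True,
    lesser_containment_insufficient=True,
    scope_clearly_definable=True,
    risk_extreme=True,
    delay_worsens_damage=True,
)
print(consider_emergency_path(live_exploit))
```

A positive result only means the emergency path may be considered; the decision itself still goes through the allowed governance process.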

26. Recovery modes

26.1 Recovery mode classes

RM_LOCAL_REBUILD
RM_FINALIZED_REPLAY
RM_SPECULATIVE_BRANCH_RESET
RM_PROTOCOL_SAFE_RESTART
RM_DOMAIN_REENABLE
RM_POST_EMERGENCY_NORMALIZATION

26.2 Rule

Recovery mode must specify:

target scope
entry condition
expected outputs
exit criteria

27. Replay and rebuild runbook

27.1 Use when

local state corruption
suspected deterministic divergence
uncertain candidate branch correctness
governance state ambiguity after bug
BVM runtime discrepancy

27.2 Steps

identify last trusted finalized checkpoint
freeze consensus-role outputs locally
load canonical object sets and parameter state
replay finalized path
compare derived roots and receipts
rebuild speculative branches if needed
revalidate current active restrictions and governance activations
only rejoin role participation after consistency checks

27.3 Rule

Replay source of truth MUST be finalized objects and canonical fixtures, not opportunistic local cache state.
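
A minimal sketch of the replay comparison step follows. The state transition here is a stand-in: apply_object simply folds each object into a SHA-256 hash chain, which is an assumption made for illustration and not the protocol's real transition function. The names first_divergence and reference_roots are hypothetical.

```python
import hashlib
from typing import List, Optional


def apply_object(state_root: bytes, obj: bytes) -> bytes:
    """Hypothetical state transition: fold one finalized object into the root."""
    return hashlib.sha256(state_root + obj).digest()


def first_divergence(checkpoint_root: bytes,
                     finalized_objects: List[bytes],
                     expected_roots: List[bytes]) -> Optional[int]:
    """Replay from a trusted checkpoint; return the first mismatching index, or None."""
    if len(finalized_objects) != len(expected_roots):
        raise ValueError("need one expected root per finalized object")
    root = checkpoint_root
    for index, (obj, expected) in enumerate(zip(finalized_objects, expected_roots)):
        root = apply_object(root, obj)
        if root != expected:
            return index  # derived root disagrees with the reference root
    return None


# Build reference roots from the canonical finalized objects.
checkpoint = bytes(32)
objects = [b"obj-1", b"obj-2", b"obj-3"]
reference_roots = []
root = checkpoint
for obj in objects:
    root = apply_object(root, obj)
    reference_roots.append(root)

# A local cache that silently altered the second object.
corrupted = [b"obj-1", b"obj-X", b"obj-3"]
print(first_divergence(checkpoint, objects, reference_roots))    # None
print(first_divergence(checkpoint, corrupted, reference_roots))  # 1
```

A node would rejoin role participation only when the replay returns no divergence; a reported index points at the first object to investigate.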

28. Exit-only recovery path

28.1 Use when

agents or machine families may be unsafe for new positions;
open exposure must be reduced before deeper restart.

28.2 Behavior

allow close/reduce actions
disallow increase of exposure
require extra journaling
tighten caps
possibly require manual approvals

28.3 Rule

Exit-only mode SHOULD be preferred over full freeze when user protection benefits from orderly de-risking.

29. Restart criteria

29.1 A subsystem SHOULD NOT restart normal operation until:

root cause category is at least bounded;
containment is holding;
replay/rebuild checks pass where relevant;
active emergency restrictions are understood;
key roles are safe or rotated;
required reviews/approvals are complete;
telemetry signals are back within thresholds;
explicit restart decision is logged.

29.2 Rule

“Seems fine now” is not a sufficient restart criterion.
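
The restart criteria can be sketched as an explicit checklist gate. The key names below are hypothetical labels for the eight conditions in 29.1. A criterion counts as met only when it is explicitly recorded as True, so a missing entry blocks restart. For "where relevant" items such as replay checks, this sketch assumes the operator records True with a logged justification when the check does not apply.

```python
RESTART_CRITERIA = (
    "root_cause_bounded",
    "containment_holding",
    "replay_checks_passed",
    "emergency_restrictions_understood",
    "key_roles_safe_or_rotated",
    "reviews_complete",
    "telemetry_within_thresholds",
    "restart_decision_logged",
)


def unmet_restart_criteria(checks: dict) -> list:
    """Return the criteria not explicitly recorded as True."""
    return [name for name in RESTART_CRITERIA if checks.get(name) is not True]


def may_restart(checks: dict) -> bool:
    """Restart is allowed only when every criterion is explicitly met."""
    return not unmet_restart_criteria(checks)


# Quiet telemetry alone leaves seven criteria unmet.
print(unmet_restart_criteria({"telemetry_within_thresholds": True}))
```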

30. Re-enable phases

30.1 Recommended order

observer/indexer paths
RPC read-only
validation-only participation
verifier role if safe
proposer role if safe
notary role last
agent execution domains last among app domains if incident touched them

30.2 Rule

The more authority a component has, the later it should be re-enabled.
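
The ordering rule can be sketched as a guard that refuses to enable a component before everything earlier in the list is back. The component labels are hypothetical short names for the node-side items in 30.1. Agent execution domains are left out of this sketch because the text orders them among app domains rather than among node roles.

```python
from typing import Optional, Set

# Node-side order from section 30.1, lowest authority first.
REENABLE_ORDER = [
    "observer_indexer",
    "rpc_read_only",
    "validation_only",
    "verifier",
    "proposer",
    "notary",
]


def can_enable(component: str, enabled: Set[str]) -> bool:
    """A component may be enabled only after all lower-authority ones."""
    index = REENABLE_ORDER.index(component)
    return all(earlier in enabled for earlier in REENABLE_ORDER[:index])


def next_component(enabled: Set[str]) -> Optional[str]:
    """Return the next component to re-enable, or None when all are back."""
    for component in REENABLE_ORDER:
        if component not in enabled:
            return component
    return None


enabled = {"observer_indexer", "rpc_read_only"}
print(can_enable("notary", enabled))
print(next_component(enabled))
```

With only the observer and read-only RPC paths restored, the notary role is refused and validation-only participation is reported as the next step.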

31. Communication states

31.1 Public status classes

Operational communications SHOULD distinguish:

healthy
degraded
safe mode
scoped halt
recovery in progress
monitoring after recovery

31.2 Rule

Never present:

speculative state as final;
local node issue as protocol-wide issue without basis;
protocol-wide restriction as mere local maintenance.

32. Mandatory audit trail

32.1 For any SEV_3+ incident, audit trail SHOULD contain:

incident id
timeline of events
who declared severity
who declared containment profile
affected scopes
evidence references
protocol objects used for restriction/recovery
replay/rebuild results
restart criteria and sign-off
postmortem link/hash

32.2 Rule

Missing audit trail in critical incidents is itself an operational failure.

33. Postmortem requirements

33.1 Every SEV_4+ incident SHOULD produce:

root cause class
exact triggering condition
timeline
blast radius
why earlier controls did or did not work
immediate fix
long-term fix
spec/implementation/process changes required
whether governance change is needed
whether conformance vectors should be added

33.2 Rule

Postmortem should distinguish:

protocol bug,
implementation bug,
ops failure,
governance failure,
external dependency failure.

34. Runbook catalog

34.1 Each runbook SHOULD have:

runbook_id
title
incident classes covered
trigger signatures
preconditions
immediate actions
escalation paths
recovery steps
exit criteria
evidence checklist

34.2 Suggested IDs

RB-CONSENSUS-001
RB-VALIDATION-001
RB-BVM-001
RB-WITNESS-001
RB-ECON-001
RB-AGENT-001
RB-GOV-001
RB-KEY-001
RB-INFRA-001
RB-SUPPLY-001

35. Example runbook skeleton

35.1 Abstract structure

Runbook {
  runbook_id
  title
  incident_class
  severity_min
  trigger_conditions_hash
  evidence_checklist_hash
  immediate_actions_hash
  containment_matrix_hash
  recovery_steps_hash
  exit_criteria_hash
}

35.2 Rule

Runbook content SHOULD be versioned and auditable. Critical operational changes to runbooks SHOULD be reviewed.
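
One way to make runbook content versioned and auditable is to derive each *_hash field from the content it summarizes. The sketch below assumes SHA-256 over a compact JSON array of the items. Both the hashing scheme and the severity_min value shown are hypothetical, and the example items are taken from the BVM runbook in section 16.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import List


def content_hash(items: List[str]) -> str:
    """Assumed scheme: SHA-256 over a compact JSON array of the items."""
    encoded = json.dumps(items, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Runbook:
    runbook_id: str
    title: str
    incident_class: str
    severity_min: str
    trigger_conditions_hash: str
    evidence_checklist_hash: str
    immediate_actions_hash: str
    containment_matrix_hash: str
    recovery_steps_hash: str
    exit_criteria_hash: str


bvm_runbook = Runbook(
    runbook_id="RB-BVM-001",
    title="BVM incident runbook",
    incident_class="INC_BVM",
    severity_min="SEV_3_HIGH",  # hypothetical threshold, not set by the text
    trigger_conditions_hash=content_hash(
        ["divergent execution results", "effect digest mismatch"]),
    evidence_checklist_hash=content_hash(
        ["module blob", "code hash", "args bytes"]),
    immediate_actions_hash=content_hash(
        ["preserve module, args, prior state, parameter state"]),
    containment_matrix_hash=content_hash(
        ["disable risky machine family or module scope if needed"]),
    recovery_steps_hash=content_hash(["prepare patched verifier/runtime"]),
    exit_criteria_hash=content_hash(["require explicit re-enable criteria"]),
)
print(bvm_runbook.runbook_id, bvm_runbook.exit_criteria_hash[:16])
```

Any edit to a checklist changes the corresponding hash, so a reviewer can tell which part of a runbook moved between versions without diffing the full text.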

36. Drills and exercises

36.1 Recommended drills

Serious deployments SHOULD practice:

no-finality drill
conflicting notarization drill
BVM divergence drill
oracle contradiction drill
agent key compromise drill
governance activation anomaly drill
snapshot rebuild drill
exit-only recovery drill

36.2 Purpose

Unexercised runbooks are weaker than they look.

37. Interaction with fraud proofs and slashing

37.1 Rule

Incident response and slashing are related but distinct.

37.2 Operational order

Often:

contain
preserve evidence
derive fraud proof
apply slash or invalidation path
continue recovery

37.3 Caution

Do not wait for slash execution before containing a live exploit.

38. Interaction with governance

38.1 Rule

Runbooks may recommend emergency governance escalation, but MUST NOT silently execute governance powers outside allowed process.

38.2 Requirement

If emergency action is used, the runbook SHOULD specify:

why normal path insufficient,
why scope is minimal,
expected expiry,
what post-activation review is required.

39. Interaction with conformance suites

39.1 Every critical incident SHOULD trigger review of AZ-011 coverage.

Questions to ask:

should a new vector be added?
was an existing vector insufficient?
do multiple implementations now diverge on a missed edge case?

39.2 Rule

Incidents should improve the protocol test corpus, not just operations.

40. Runbook anti-patterns

Operators and implementers SHOULD avoid:

reboot first, preserve evidence later
local patch that changes consensus behavior silently
protocol-wide panic halt for a local-only bug
waiting for full root cause before containment
restoring proposer/notary roles before replay checks
unclear distinction between safe mode and halt
emergency action with no expiry or no audit trail
trusting dashboards over canonical replay
re-enabling compromised keys without rotation
closing incident because telemetry is quiet but root cause unknown

41. Formal goals

AZ-015 pursues four goals:

41.1 Containment soundness

Incidents can be contained quickly without producing even more protocol ambiguity.

41.2 Recovery replayability

Recovery can reconstruct trusted state from canonical finalized truth.

41.3 Operational auditability

Incident decisions can be traced and evaluated after the fact.

41.4 Controlled return to service

Return to normal happens through verifiable criteria, not assumptions.

42. Document formula

Incident Response = classify + preserve evidence + contain minimally + replay canonical truth + recover in stages + re-enable by criteria + publish audit trail

43. Relationship to the rest of the suite

AZ-010 defined the security model.
AZ-014 defined proof and penalty.
AZ-015 defines the human and node operations around incidents.

In short: if AZ-014 says how you prove the fault, AZ-015 says how you survive it.

44. What comes next

After AZ-015, the next document is:

AZ-016 — Genesis Specification

It should define:

genesis parameters,
initial assets,
initial validators,
the initial seed,
initial registries,
initial param state,
initial feature flags.

Closing

A serious protocol is not one that assumes incidents will never happen. It is one that already knows how to recognize them, how to avoid losing evidence, how to avoid making things worse through chaotic reactions, and how to return to operation without lying about the state of the system.

That is where real operational resilience begins.