AZ-015 — Incident Response and Recovery Runbooks v1
Status
This document defines the operational runbooks for incidents and recovery in ATLAS ZERO.
Earlier specifications defined:
- protocol rules,
- consensus,
- BVM,
- witness,
- economics,
- agents,
- governance,
- security,
- conformance,
- node architecture,
- fraud proofs and slashing evidence.
AZ-015 answers the practical question: what does the system do, and what do operators do, when things go wrong?
Its purpose is to fix:
- incident classes;
- severity levels;
- standard operational steps;
- the conditions for safe mode, halt, quarantine, and recovery;
- traceability of decisions;
- criteria for returning to normal operation.
This document builds on:
- AZ-002 through AZ-014.
Terms:
- MUST = mandatory
- MUST NOT = forbidden
- SHOULD = strongly recommended
- MAY = optional
1. Objective
AZ-015 answers 10 operational questions:
- What is an incident in the protocol?
- How is it classified and prioritized?
- Who can declare safe mode, halt, or quarantine?
- What is the exact order of response steps?
- What evidence and logs must be preserved?
- When is emergency governance used?
- How are replay, rebuild, and recovery performed?
- How is system state communicated without ambiguity?
- When is a return to normal permitted?
- How do we keep recovery from becoming a source of arbitrariness or divergence?
2. Principles
2.1 Incident response is protocol-aware
Incident response MUST respect:
- finality;
- boundedness;
- constitutional governance;
- the safe mode and emergency powers model.
2.2 Contain first, explain second
In critical incidents, the priority order is:
- detection,
- minimal confirmation,
- containment,
- evidence preservation,
- then the full explanation.
2.3 No hidden fixes
Nodes and operators MUST NOT apply "invisible fixes" that change protocol truth without a verifiable trail.
2.4 Recovery is staged
Recovery SHOULD be gradual:
- local degradation,
- protocol safe mode,
- scoped halt,
- emergency restriction,
- replay/rebuild,
- controlled re-enable.
2.5 Evidence preservation
Every relevant incident MUST preserve:
- the objects involved,
- relevant state roots,
- structured logs,
- parameter state,
- evidence objects,
- who decided what and when.
3. Incident taxonomy
3.1 Primary incident classes
ATLAS ZERO SHOULD classify incidents into:
INC_CONSENSUS
INC_VALIDATION
INC_BVM
INC_WITNESS_PROOF
INC_ECONOMIC
INC_AGENT_CONTROL
INC_GOVERNANCE
INC_KEY_COMPROMISE
INC_INFRASTRUCTURE
INC_SUPPLY_CHAIN
INC_OBSERVABILITY_ONLY
3.2 Why this matters
The incident class determines:
- who must be alerted;
- whether safe mode is sufficient;
- whether protocol-level proof is required;
- whether recovery affects consensus or only operations.
4. Severity model
4.1 Standard severity levels
Protocol operations SHOULD use:
SEV_0_INFO
SEV_1_LOW
SEV_2_MEDIUM
SEV_3_HIGH
SEV_4_CRITICAL
SEV_5_SYSTEMIC
4.2 Guidance
SEV_0_INFO
Observability only, or a warning with no operational impact.
SEV_1_LOW
Minor degradation, with no impact on finality or funds.
SEV_2_MEDIUM
Local or temporary impact, possibly local safe mode.
SEV_3_HIGH
May affect:
- liveness,
- critical journaling,
- local oracle correctness,
- bounded agent operation.
SEV_4_CRITICAL
May affect:
- finality,
- validation,
- bounded execution,
- integrity of funds in certain domains.
SEV_5_SYSTEMIC
Broad threat to the protocol:
- deterministic split,
- incompatible notarizations,
- systemically exploitable validation/VM bug,
- capture or dangerous expansion of governance,
- compromise of major keys/committees.
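The severity levels form an ordered scale, and later sections refer to thresholds such as "SEV_3+" and "SEV_4+". A minimal Python sketch of that ordering follows; the class name `Severity` and the helper `at_least` are hypothetical names chosen for illustration, not part of the specification.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from section 4.1, ordered from least to most severe."""
    SEV_0_INFO = 0
    SEV_1_LOW = 1
    SEV_2_MEDIUM = 2
    SEV_3_HIGH = 3
    SEV_4_CRITICAL = 4
    SEV_5_SYSTEMIC = 5


def at_least(severity: Severity, threshold: Severity) -> bool:
    """Return True if `severity` meets or exceeds `threshold` (e.g. "SEV_3+")."""
    return severity >= threshold
```

With this ordering, a check like `at_least(sev, Severity.SEV_3_HIGH)` can gate the evidence and audit requirements that apply from SEV_3 upward.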
5. Incident states
5.1 Lifecycle states
An incident SHOULD pass through:
DETECTED
TRIAGED
CONFIRMED
CONTAINING
STABILIZED
RECOVERING
MONITORING
CLOSED
POSTMORTEM_PENDING
POSTMORTEM_PUBLISHED
5.2 Rule
A critical incident MUST NOT jump directly to CLOSED without evidence, a recovery state, and review.
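The closing rule can be enforced as a guard on state transitions. The sketch below is illustrative only: it assumes "critical" means SEV_4_CRITICAL and above, and the `IncidentProgress` fields (`evidence_refs`, `recovery_state_ref`, `review_ref`) are hypothetical placeholders for whatever the audit system actually records.

```python
from dataclasses import dataclass, field
from typing import List

LIFECYCLE_STATES = (
    "DETECTED", "TRIAGED", "CONFIRMED", "CONTAINING", "STABILIZED",
    "RECOVERING", "MONITORING", "CLOSED",
    "POSTMORTEM_PENDING", "POSTMORTEM_PUBLISHED",
)

# Assumption: "critical" covers SEV_4_CRITICAL and SEV_5_SYSTEMIC.
CRITICAL = {"SEV_4_CRITICAL", "SEV_5_SYSTEMIC"}


@dataclass
class IncidentProgress:
    severity: str
    status: str = "DETECTED"
    evidence_refs: List[str] = field(default_factory=list)
    recovery_state_ref: str = ""
    review_ref: str = ""


def transition(p: IncidentProgress, new_status: str) -> None:
    """Move to new_status, refusing to close a critical incident prematurely."""
    if new_status not in LIFECYCLE_STATES:
        raise ValueError(f"unknown state: {new_status}")
    if new_status == "CLOSED" and p.severity in CRITICAL:
        if not (p.evidence_refs and p.recovery_state_ref and p.review_ref):
            raise ValueError("critical incident lacks evidence, recovery state, or review")
    p.status = new_status
```

The guard deliberately checks only the closing rule; it does not enforce a strict linear order between the other states, since the text says incidents SHOULD pass through them but does not define the allowed transitions.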
6. Incident roles
6.1 Roles
Operationally, the system SHOULD separate:
Incident Reporter
On-call Operator
Incident Commander
Consensus Lead
Execution/BVM Lead
Security Lead
Governance Liaison
Recovery Operator
Audit Scribe
6.2 Rule
In small systems, some roles may coincide. In serious deployments, critical roles SHOULD be separate.
6.3 Audit Scribe
An explicit role for journaling decisions is recommended. Without it, the postmortem becomes imprecise.
7. Incident object model
7.1 Abstract structure
IncidentRecord {
  incident_id
  incident_class
  severity
  detection_time
  detected_by
  status
  affected_scopes_hash
  initial_evidence_hash
  current_response_profile
  related_object_refs
  related_runbook_id
  metadata_hash?
}
7.2 incident_id
incident_id = H("AZ:INCIDENT:" || canonical_incident_record)
7.3 Rule
Incident records SHOULD be canonical and append-only in audit systems. Normative protocol effects still require protocol objects, not just incident records.
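As an illustration of the incident_id derivation in 7.2, the following Python sketch hashes a domain-separated canonical record. Several choices here are assumptions not fixed by this document: H is taken to be SHA-256, the canonical encoding is taken to be UTF-8 JSON with sorted keys, and the `incident_id` field is excluded from the hashed body because a record cannot contain its own hash.

```python
import hashlib
import json


def canonical_bytes(record: dict) -> bytes:
    """Assumed canonical encoding: UTF-8 JSON, sorted keys, no extra whitespace."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def incident_id(record: dict) -> str:
    """Compute H("AZ:INCIDENT:" || canonical_incident_record), with H assumed SHA-256.

    The incident_id field is dropped before hashing (an assumption).
    """
    body = {k: v for k, v in record.items() if k != "incident_id"}
    return hashlib.sha256(b"AZ:INCIDENT:" + canonical_bytes(body)).hexdigest()
```

Because the encoding sorts keys, two records with the same fields in a different order yield the same identifier, which is the property a canonical form is meant to provide.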
8. Response profiles
8.1 Need
Not all incidents call for the same intensity of response.
8.2 Standard profiles
RP_OBSERVE_ONLY
RP_LOCAL_DEGRADED
RP_SAFE_MODE_SCOPED
RP_HALT_SCOPED
RP_EMERGENCY_RESTRICTION
RP_PROTOCOL_RECOVERY
RP_FORENSIC_ONLY
8.3 Rule
The response profile SHOULD be chosen from severity + incident class + estimated blast radius.
9. Generic incident response ladder
9.1 Canonical order
For any significant incident, the order SHOULD be:
- detect
- classify
- preserve evidence
- scope impact
- choose containment profile
- apply safe mode/halt/restriction if needed
- validate current finalized truth
- prepare recovery path
- perform recovery
- monitor stabilization
- publish postmortem and permanent fixes
9.2 Rule
Operators MUST NOT begin any "cleanup" that destroys evidence before that evidence has been preserved.
10. Evidence preservation requirements
10.1 For any SEV_3+
MUST preserve at minimum:
- relevant object refs
- local and finalized state roots
- active parameter state
- governance/emergency state
- relevant logs
- peer messages around event window where relevant
- slash/fraud evidence if present
- code/version identifiers
10.2 For consensus incidents
Also preserve:
- block candidate DAG view
- committee derivation inputs
- verifier/notary messages
- notarization candidates
- replay traces
10.3 For BVM incidents
Also preserve:
- module blob
- code hash
- args bytes
- prior state blob/root
- execution trace if available
- effect accumulator data
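The preservation lists in this section can be treated as a checklist that is evaluated before any cleanup starts. A rough Python sketch, assuming hypothetical key names for each evidence item and omitting the conditional items (peer messages "where relevant", slash/fraud evidence "if present", execution trace "if available"):

```python
# Key names below are illustrative labels, not identifiers defined by AZ-015.
BASE_EVIDENCE = frozenset({
    "object_refs", "local_state_root", "finalized_state_root",
    "parameter_state", "governance_emergency_state", "logs",
    "code_version_ids",
})

CLASS_EVIDENCE = {
    "INC_CONSENSUS": frozenset({
        "candidate_dag_view", "committee_derivation_inputs",
        "verifier_notary_messages", "notarization_candidates", "replay_traces",
    }),
    "INC_BVM": frozenset({
        "module_blob", "code_hash", "args_bytes",
        "prior_state", "effect_accumulator_data",
    }),
}


def missing_evidence(incident_class: str, severity_level: int, collected) -> frozenset:
    """Return the mandatory evidence items not yet preserved.

    Below SEV_3 this section mandates nothing, so the result is empty.
    """
    if severity_level < 3:
        return frozenset()
    required = BASE_EVIDENCE | CLASS_EVIDENCE.get(incident_class, frozenset())
    return required - frozenset(collected)
```

An empty result means the minimum set is in place; a non-empty result names what still has to be captured before containment steps that might overwrite it.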
11. Safe mode definitions
11.1 Local degraded mode
Node-local. May do:
- reduce RPC exposure
- stop proposer role
- stop agent submission integration
- keep full validation active
- keep observing network
Does NOT change protocol truth.
11.2 Protocol safe mode
Protocol-recognized restriction mode. Can do:
- tighten risk thresholds
- force extra approvals
- restrict agent classes
- restrict feature subsets
- move domains to exit-only
Requires protocol objects if consensus-relevant.
11.3 Rule
Every safe mode MUST be labeled either:
- local only or
- protocol active
Never mix them implicitly.
12. Halt definitions
12.1 Local service halt
Stops local service components:
- API
- proposer
- agent relayer
- indexing
without claiming a protocol-wide halt.
12.2 Scoped protocol halt
Stops:
- one machine,
- one mandate scope,
- one witness issuer class,
- one feature domain,
- one economic subsystem subset,
if protocol objects authorize it.
12.3 Systemic protocol halt
Highly exceptional. Should be considered only when:
- widespread invalid state risk exists,
- finality or execution integrity is threatened,
- less restrictive containment is insufficient.
12.4 Rule
Protocol halts MUST be scoped as narrowly as possible.
13. Quarantine model
13.1 Purpose
Quarantine is weaker than permanent disablement. Used when confidence is incomplete but risk is material.
13.2 Quarantine targets
- issuer
- oracle source
- committee member set
- machine family
- BVM module family
- agent operator
- governance actor stream
- node peer cluster locally
13.3 Effects
May include:
- ignore or down-rank objects from target
- require corroboration
- disallow new actions from target
- freeze role privileges pending review
13.4 Rule
Protocol quarantine with consensus impact requires protocol authority path. Local peer quarantine does not.
14. Consensus incident runbook
14.1 Trigger examples
- conflicting notarization observed
- repeated no-finality
- committee selection mismatch
- front selection divergence
- suspicious double-signing
14.2 Minimum actions
- preserve notarization candidates and committee derivation inputs
- stop local proposer/notary if risk of compounding fault
- continue validation if safe
- replay finalized and candidate front from last safe checkpoint
- identify whether issue is:
  - invalid object,
  - local implementation divergence,
  - genuine protocol fault,
  - adversarial equivocation
- generate fraud proof if possible
- if blast radius high, prepare emergency restriction request
- avoid accepting new dubious finality until verified
14.3 Recovery goals
- confirm last unquestionably finalized epoch
- isolate incompatible candidate branches
- ensure no false hard-final state is exposed as safe
- restore healthy finality path
15. Validation incident runbook
15.1 Trigger examples
- one implementation accepts, another rejects same tx
- canonical decode mismatch
- reference resolution mismatch
- expiry/revocation mismatch
15.2 Minimum actions
- preserve exact input bytes and state fixture
- run AZ-011/AZ-003 targeted replay
- classify whether:
  - node bug,
  - spec ambiguity,
  - malformed object,
  - feature activation mismatch
- if consensus-critical ambiguity exists, enter safe mode for affected roles
- publish temporary operational guidance
- prepare patched implementation and/or governance clarification if needed
15.3 Rule
Validation ambiguity is potentially systemic until proven otherwise.
16. BVM incident runbook
16.1 Trigger examples
- divergent execution results
- unexpected trap/revert pattern
- effect digest mismatch
- manifest accepted but should fail
- exec cost mis-accounting
- host call permission leak
16.2 Minimum actions
- preserve module, args, prior state, parameter state
- disable risky machine family or module scope if needed
- compare against semantic oracle/runtime reference
- run deterministic replay on multiple implementations
- classify:
  - verifier bug,
  - runtime bug,
  - malformed bytecode,
  - spec gap
- if exploitability exists, escalate to protocol restriction or halt for affected domain
- prepare patched verifier/runtime
- require explicit re-enable criteria
16.3 Rule
BVM incidents with possible boundedness bypass SHOULD default to fail closed for affected machine scope.
17. Witness / proof incident runbook
17.1 Trigger examples
- contradictory oracle claims
- revocation mismatch
- proof verifier bug
- stale claim accepted
- unauthorized witness emission
17.2 Minimum actions
- preserve witness/proof objects and issuer policies
- mark affected issuer/source/domain
- if high impact, quarantine issuer and require corroboration
- derive whether objects are:
  - invalid,
  - contradictory,
  - stale,
  - parser-bug dependent
- generate fraud proof or invalidation trail where possible
- if settlement or treasury flows depend on affected witness family, freeze those flows as needed
- define revalidation plan
18. Economic incident runbook
18.1 Trigger examples
- fee underpricing exploited
- rent bypass
- slash amount bug
- reward distortion
- no-finality reward leak
- spam overwhelms state or mempool economics
18.2 Minimum actions
- preserve parameter state and observed exploit path
- quantify exploit economics and blast radius
- apply local throttles if safe and non-consensus
- if protocol-level fix needed, prepare emergency restriction or fast governance path within constitution
- mark affected feature/domain
- stop claiming economic normality until patch active
- simulate patched parameters before activation
18.3 Rule
Economic incidents that do not yet corrupt finality can still justify rapid restriction if exploit path is cheap and repeatable.
19. Agent incident runbook
19.1 Trigger examples
- action outside mandate
- missing mandatory log
- action after halt/revoke
- cap bypass
- compromised operator key
- inconsistent decision/execution logs
19.2 Minimum actions
- preserve mandate snapshot and action refs
- halt or suspend affected mandate scope
- rotate agent/operator if compromise suspected
- move to exit-only where possible
- produce witness/audit observation or slashing evidence if applicable
- quantify open exposure and close/reduce if policy allows
- require re-authorization before restart
19.3 Rule
For high-impact agents, halt first and explain second.
20. Governance incident runbook
20.1 Trigger examples
- activation before timelock
- challenge window bypass
- missing required review
- emergency action out of scope
- proposal class mislabeling
- conflicting activations
20.2 Minimum actions
- preserve all proposal/review/vote/outcome/activation objects
- compute active governance state from last finalized good point
- identify whether incident is:
  - invalid object,
  - implementation bug,
  - procedural abuse,
  - constitutional violation
- suspend applying disputed future activations
- if already activated improperly, enter governance anomaly state and apply constitutionally allowed containment
- require constitutional review for restart of disputed path
20.3 Rule
Governance anomalies must never be hidden behind UI-level reinterpretation.
21. Key compromise runbook
21.1 Trigger examples
- validator key leak
- notary key compromise
- oracle issuer key compromise
- agent operator key compromise
- governance signer compromise
21.2 Minimum actions
- preserve compromise evidence and timeline
- rotate or revoke affected keys/policies
- quarantine recent messages if required by rules
- assess whether any signed objects become slashable evidence
- reduce privileges on remaining linked scopes
- force re-authorization for critical flows
- publish blast radius assessment
21.3 Rule
Key compromise response SHOULD prefer scoped revocation and role separation over broad panic shutdown where possible.
22. Infrastructure incident runbook
22.1 Trigger examples
- data corruption in indexer
- snapshot corruption
- disk or DB issues
- network partition local to operator
- telemetry pipeline failure
- time sync drift
22.2 Minimum actions
- distinguish consensus-critical vs non-critical impact
- if finalized truth at risk locally, stop producing role actions
- rebuild from last finalized checkpoint if needed
- quarantine corrupted local indexes
- verify state root against peers/reference checkpoints
- only re-enable proposer/verifier/notary roles after integrity checks pass
22.3 Rule
Infrastructure failure MUST NOT silently continue as if node were healthy in a consensus role.
23. Supply chain incident runbook
23.1 Trigger examples
- malicious dependency
- compromised build artifact
- divergent release binary
- compiler bug affecting determinism
23.2 Minimum actions
- preserve binary hashes and build metadata
- compare reproducible builds
- suspend rollout and possibly role participation
- identify affected versions
- require rebuilt, verified artifacts
- consider protocol-level quarantine of known-bad implementation families only if constitutionally allowed and technically grounded
24. Containment decision matrix
24.1 Containment choices
- observe only
- local degraded mode
- local service halt
- scoped protocol safe mode
- scoped protocol halt
- issuer/committee quarantine
- emergency restriction
- recovery mode
24.2 Selection guidance
Use the least restrictive measure that:
- stops further damage,
- preserves evidence,
- does not create protocol ambiguity,
- is actually fast enough.
24.3 Rule
Containment MUST be scoped and reversible where possible.
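The selection guidance amounts to walking the containment choices from least to most restrictive and taking the first one that meets all four conditions. The sketch below assumes that the order of the list in 24.1 is the restrictiveness order, and the `Assessment` type and `assess` callback are hypothetical stand-ins for an operator's judgment.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumption: listed from least to most restrictive, following 24.1.
CONTAINMENT_LADDER = (
    "observe only",
    "local degraded mode",
    "local service halt",
    "scoped protocol safe mode",
    "scoped protocol halt",
    "issuer/committee quarantine",
    "emergency restriction",
    "recovery mode",
)


@dataclass(frozen=True)
class Assessment:
    stops_further_damage: bool
    preserves_evidence: bool
    avoids_protocol_ambiguity: bool
    fast_enough: bool

    def acceptable(self) -> bool:
        return (self.stops_further_damage and self.preserves_evidence
                and self.avoids_protocol_ambiguity and self.fast_enough)


def choose_containment(assess: Callable[[str], Assessment]) -> Optional[str]:
    """Return the least restrictive measure that satisfies all four conditions."""
    for measure in CONTAINMENT_LADDER:
        if assess(measure).acceptable():
            return measure
    return None  # nothing qualifies; escalate for a human decision
```

Returning `None` rather than defaulting to the most restrictive measure keeps the sketch from silently choosing a broad halt when no option has actually been judged adequate.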
25. Emergency governance escalation criteria
25.1 Consider emergency path when:
- protocol-level exploit is active or imminent;
- less restrictive containment is insufficient;
- affected scope is clearly definable;
- evidence exists or risk is extreme;
- delay from normal governance would materially worsen damage.
25.2 Do not use emergency path when:
- issue is purely local/operator-side;
- impact is observability-only;
- evidence is too weak and restriction would be broader than justified;
- normal path is fast enough and adequate.
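One way to read these criteria is as a set of required conditions plus a set of vetoes. The following Python sketch takes that reading; treating the items in 25.1 as a conjunction is an interpretation rather than something the text states, and the field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationFacts:
    # Conditions from 25.1
    exploit_active_or_imminent: bool
    lesser_containment_insufficient: bool
    scope_clearly_definable: bool
    evidence_exists_or_risk_extreme: bool
    normal_delay_worsens_damage: bool
    # Vetoes from 25.2
    purely_local: bool = False
    observability_only: bool = False
    evidence_too_weak_for_scope: bool = False
    normal_path_adequate: bool = False


def consider_emergency_path(f: EscalationFacts) -> bool:
    """True only if every 25.1 condition holds and no 25.2 veto applies."""
    vetoed = (f.purely_local or f.observability_only
              or f.evidence_too_weak_for_scope or f.normal_path_adequate)
    if vetoed:
        return False
    return (f.exploit_active_or_imminent
            and f.lesser_containment_insufficient
            and f.scope_clearly_definable
            and f.evidence_exists_or_risk_extreme
            and f.normal_delay_worsens_damage)
```

A `True` result only means the emergency path may be considered; the decision itself still has to go through the governance process.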
26. Recovery modes
26.1 Recovery mode classes
RM_LOCAL_REBUILD
RM_FINALIZED_REPLAY
RM_SPECULATIVE_BRANCH_RESET
RM_PROTOCOL_SAFE_RESTART
RM_DOMAIN_REENABLE
RM_POST_EMERGENCY_NORMALIZATION
26.2 Rule
Recovery mode must specify:
- target scope
- entry condition
- expected outputs
- exit criteria
27. Replay and rebuild runbook
27.1 Use when
- local state corruption
- suspected deterministic divergence
- uncertain candidate branch correctness
- governance state ambiguity after bug
- BVM runtime discrepancy
27.2 Steps
- identify last trusted finalized checkpoint
- freeze consensus-role outputs locally
- load canonical object sets and parameter state
- replay finalized path
- compare derived roots and receipts
- rebuild speculative branches if needed
- revalidate current active restrictions and governance activations
- only rejoin role participation after consistency checks
27.3 Rule
Replay source of truth MUST be finalized objects and canonical fixtures, not opportunistic local cache state.
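The core of the replay steps (start from a trusted checkpoint, replay finalized objects, compare derived roots) can be sketched with a toy state transition. Everything below is a simplification for illustration: `apply_object` stands in for real validation and execution by hashing the previous root with the object bytes, and SHA-256 is an arbitrary choice.

```python
import hashlib
from typing import List, Optional


def apply_object(root: bytes, obj: bytes) -> bytes:
    """Toy state transition: new root = SHA-256(previous root || object bytes)."""
    return hashlib.sha256(root + obj).digest()


def replay_finalized(checkpoint_root: bytes, finalized_objects: List[bytes]) -> List[bytes]:
    """Replay the finalized path and return the root derived after each object."""
    roots = []
    root = checkpoint_root
    for obj in finalized_objects:
        root = apply_object(root, obj)
        roots.append(root)
    return roots


def first_divergence(derived: List[bytes], expected: List[bytes]) -> Optional[int]:
    """Index of the first root that differs, or None if the sequences match."""
    for i, (d, e) in enumerate(zip(derived, expected)):
        if d != e:
            return i
    if len(derived) != len(expected):
        return min(len(derived), len(expected))
    return None
```

A node would rejoin role participation only when `first_divergence` returns `None` against reference roots; a non-`None` index points at the first object where local replay and the reference disagree.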
28. Exit-only recovery path
28.1 Use when
- agents or machine families may be unsafe for new positions;
- open exposure must be reduced before deeper restart.
28.2 Behavior
- allow close/reduce actions
- disallow increase of exposure
- require extra journaling
- tighten caps
- possibly require manual approvals
28.3 Rule
Exit-only mode SHOULD be preferred over full freeze when user protection benefits from orderly de-risking.
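The central check of exit-only mode is that an action may close or reduce exposure but never increase it. A minimal sketch, assuming exposure is tracked as a non-negative integer amount and that a no-op (unchanged exposure) is rejected because it neither closes nor reduces; both are assumptions for illustration.

```python
def exit_only_allows(current_exposure: int, proposed_exposure: int) -> bool:
    """Allow an action only if it strictly reduces exposure, down to zero (close)."""
    if current_exposure < 0 or proposed_exposure < 0:
        raise ValueError("exposure is modeled as a non-negative amount")
    return proposed_exposure < current_exposure
```

For example, moving from 100 to 40 or from 100 to 0 is allowed, while moving from 100 to 120 is refused. The extra journaling, tighter caps, and manual approvals listed in 28.2 would sit around this check rather than inside it.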
29. Restart criteria
29.1 A subsystem SHOULD NOT restart normal operation until:
- root cause category is at least bounded;
- containment is holding;
- replay/rebuild checks pass where relevant;
- active emergency restrictions are understood;
- key roles are safe or rotated;
- required reviews/approvals are complete;
- telemetry signals are back within thresholds;
- explicit restart decision is logged.
29.2 Rule
“Seems fine now” is not sufficient restart criteria.
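The restart criteria lend themselves to an explicit checklist in which anything not positively confirmed counts as unmet. A short Python sketch with hypothetical key names; criteria marked "where relevant" in the text are expected to be set to `True` by the caller when they do not apply.

```python
# Key names are illustrative labels for the criteria in 29.1.
RESTART_CRITERIA = (
    "root_cause_bounded",
    "containment_holding",
    "replay_rebuild_checks_passed",
    "emergency_restrictions_understood",
    "key_roles_safe_or_rotated",
    "reviews_and_approvals_complete",
    "telemetry_within_thresholds",
    "restart_decision_logged",
)


def unmet_restart_criteria(status: dict) -> list:
    """Return the criteria that are missing or not confirmed as True."""
    return [c for c in RESTART_CRITERIA if status.get(c) is not True]


def may_restart(status: dict) -> bool:
    return not unmet_restart_criteria(status)
```

Defaulting missing entries to "unmet" encodes the rule above: quiet telemetry alone leaves seven criteria unconfirmed, so `may_restart` stays `False`.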
30. Re-enable phases
30.1 Recommended order
- observer/indexer paths
- RPC read-only
- validation-only participation
- verifier role if safe
- proposer role if safe
- notary role last
- agent execution domains last among app domains if incident touched them
30.2 Rule
The more authority a component has, the later it should be re-enabled.
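The recommended order can be encoded as a sequence in which a component may be re-enabled only after everything earlier in the sequence is already running. The sketch below uses hypothetical component names and leaves out the agent execution item, which the text orders relative to other application domains rather than to node roles.

```python
from typing import Iterable, Optional

# Lowest-authority components first, following 30.1.
REENABLE_ORDER = (
    "observer_indexer",
    "rpc_read_only",
    "validation_only",
    "verifier",
    "proposer",
    "notary",
)


def may_enable(component: str, already_enabled: Iterable[str]) -> bool:
    """True if every earlier phase is already enabled."""
    enabled = set(already_enabled)
    idx = REENABLE_ORDER.index(component)  # raises ValueError for unknown names
    return all(c in enabled for c in REENABLE_ORDER[:idx])


def next_to_enable(already_enabled: Iterable[str]) -> Optional[str]:
    """Return the earliest phase not yet enabled, or None when all are enabled."""
    enabled = set(already_enabled)
    for c in REENABLE_ORDER:
        if c not in enabled:
            return c
    return None
```

This is stricter than the text, which qualifies some roles with "if safe"; a deployment that skips a role would need to relax the prerequisite check accordingly.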
31. Communication states
31.1 Public status classes
Operational communications SHOULD distinguish:
- healthy
- degraded
- safe mode
- scoped halt
- recovery in progress
- monitoring after recovery
31.2 Rule
Never present:
- speculative state as final;
- local node issue as protocol-wide issue without basis;
- protocol-wide restriction as mere local maintenance.
32. Mandatory audit trail
32.1 For any SEV_3+ incident, audit trail SHOULD contain:
- incident id
- timeline of events
- who declared severity
- who declared containment profile
- affected scopes
- evidence references
- protocol objects used for restriction/recovery
- replay/rebuild results
- restart criteria and sign-off
- postmortem link/hash
32.2 Rule
Missing audit trail in critical incidents is itself an operational failure.
33. Postmortem requirements
33.1 Every SEV_4+ incident SHOULD produce:
- root cause class
- exact triggering condition
- timeline
- blast radius
- why earlier controls did or did not work
- immediate fix
- long-term fix
- spec/implementation/process changes required
- whether governance change is needed
- whether conformance vectors should be added
33.2 Rule
Postmortem should distinguish:
- protocol bug,
- implementation bug,
- ops failure,
- governance failure,
- external dependency failure.
34. Runbook catalog
34.1 Each runbook SHOULD have:
- runbook_id
- title
- incident classes covered
- trigger signatures
- preconditions
- immediate actions
- escalation paths
- recovery steps
- exit criteria
- evidence checklist
34.2 Suggested IDs
RB-CONSENSUS-001
RB-VALIDATION-001
RB-BVM-001
RB-WITNESS-001
RB-ECON-001
RB-AGENT-001
RB-GOV-001
RB-KEY-001
RB-INFRA-001
RB-SUPPLY-001
35. Example runbook skeleton
35.1 Abstract structure
Runbook {
  runbook_id
  title
  incident_class
  severity_min
  trigger_conditions_hash
  evidence_checklist_hash
  immediate_actions_hash
  containment_matrix_hash
  recovery_steps_hash
  exit_criteria_hash
}
35.2 Rule
Runbook content SHOULD be versioned and auditable. Critical operational changes to runbooks SHOULD be reviewed.
36. Drills and exercises
36.1 Required drills
Serious deployments SHOULD practice:
- no-finality drill
- conflicting notarization drill
- BVM divergence drill
- oracle contradiction drill
- agent key compromise drill
- governance activation anomaly drill
- snapshot rebuild drill
- exit-only recovery drill
36.2 Purpose
Unexercised runbooks are weaker than they look.
37. Interaction with fraud proofs and slashing
37.1 Rule
Incident response and slashing are related but distinct.
37.2 Operational order
Often:
- contain
- preserve evidence
- derive fraud proof
- apply slash or invalidation path
- continue recovery
37.3 Caution
Do not wait for slash execution before containing a live exploit.
38. Interaction with governance
38.1 Rule
Runbooks may recommend emergency governance escalation, but MUST NOT silently execute governance powers outside allowed process.
38.2 Requirement
If emergency action is used, the runbook SHOULD specify:
- why normal path insufficient,
- why scope is minimal,
- expected expiry,
- what post-activation review is required.
39. Interaction with conformance suites
39.1 Every critical incident SHOULD trigger review of AZ-011 coverage.
Questions to ask:
- should a new vector be added?
- was an existing vector insufficient?
- do multiple implementations now diverge on a missed edge case?
39.2 Rule
Incidents should improve the protocol test corpus, not just operations.
40. Runbook anti-patterns
Operators and implementers SHOULD avoid:
- reboot first, preserve evidence later
- local patch that changes consensus behavior silently
- protocol-wide panic halt for a local-only bug
- waiting for full root cause before containment
- restoring proposer/notary roles before replay checks
- unclear distinction between safe mode and halt
- emergency action with no expiry or no audit trail
- trusting dashboards over canonical replay
- re-enabling compromised keys without rotation
- closing incident because telemetry is quiet but root cause unknown
41. Formal goals
AZ-015 pursues four goals:
41.1 Containment soundness
Incidents can be contained quickly without creating further protocol ambiguity.
41.2 Recovery replayability
Recovery can reconstruct trusted state from canonical finalized truth.
41.3 Operational auditability
Incident decisions can be traced and evaluated after the fact.
41.4 Controlled return to service
The return to normal is based on verifiable criteria, not on assumptions.
42. Document formula
Incident Response = classify + preserve evidence + contain minimally + replay canonical truth + recover in stages + re-enable by criteria + publish audit trail
43. Relationship to the rest of the suite
- AZ-010 defined the security model.
- AZ-014 defined proof and penalty.
- AZ-015 defines the human and node operations around incidents.
In short: if AZ-014 says how you prove the fault, AZ-015 says how you survive it.
44. What comes next
After AZ-015, the next document is:
AZ-016 — Genesis Specification
The following must be fixed there:
- genesis parameters,
- initial assets,
- initial validators,
- initial seed,
- initial registries,
- initial param state,
- initial feature flags.
Closing
A serious protocol is not one that assumes incidents will never happen. It is one that already knows how to recognize them, how not to lose evidence, how not to make the situation worse through chaotic reactions, and how to return to operation without lying about the state of the system.
That is where real operational resilience begins.