AZ-015 — Incident Response and Recovery Runbooks v1

Status

This document defines the operational runbooks for incidents and recovery in ATLAS ZERO.

Previous specifications defined:

  • protocol rules,
  • consensus,
  • BVM,
  • witness,
  • economics,
  • agents,
  • governance,
  • security,
  • conformance,
  • node architecture,
  • fraud proofs and slashing evidence.

AZ-015 answers the practical question: what does the system do, and what do operators do, when things go wrong?

Its purpose is to fix:

  • incident classes;
  • severity levels;
  • standard operational steps;
  • the conditions for safe mode, halt, quarantine, and recovery;
  • traceability of decisions;
  • the criteria for returning to normal operation.

This document builds on:

  • AZ-002 through AZ-014.

Terms:

  • MUST = mandatory
  • MUST NOT = forbidden
  • SHOULD = strongly recommended
  • MAY = optional

1. Objective

AZ-015 answers 10 operational questions:

  1. What is an incident in the protocol?
  2. How is it classified and prioritized?
  3. Who can declare safe mode, halt, or quarantine?
  4. What is the exact order of response steps?
  5. What evidence and logs must be preserved?
  6. When is emergency governance used?
  7. How are replay, rebuild, and recovery performed?
  8. How is system state communicated without ambiguity?
  9. When is a return to normal permitted?
  10. How do we keep recovery from becoming a source of arbitrariness or divergence?

2. Principles

2.1 Incident response is protocol-aware

Incident response MUST respect:

  • finality;
  • boundedness;
  • constitutional governance;
  • the safe mode and emergency powers model.

2.2 Contain first, explain second

In critical incidents, the priority order is:

  1. detection,
  2. minimal confirmation,
  3. containment,
  4. evidence preservation,
  5. then the full explanation.

2.3 No hidden fixes

Nodes and operators MUST NOT apply “invisible fixes” that change protocol truth without a verifiable trail.

2.4 Recovery is staged

Recovery SHOULD be gradual:

  • local degradation,
  • protocol safe mode,
  • scoped halt,
  • emergency restriction,
  • replay/rebuild,
  • controlled re-enable.

2.5 Evidence preservation

Every relevant incident MUST preserve:

  • the objects involved,
  • relevant state roots,
  • structured logs,
  • parameter state,
  • evidence objects,
  • who decided what and when.

3. Incident taxonomy

3.1 Primary incident classes

ATLAS ZERO SHOULD classify incidents into:

  1. INC_CONSENSUS
  2. INC_VALIDATION
  3. INC_BVM
  4. INC_WITNESS_PROOF
  5. INC_ECONOMIC
  6. INC_AGENT_CONTROL
  7. INC_GOVERNANCE
  8. INC_KEY_COMPROMISE
  9. INC_INFRASTRUCTURE
  10. INC_SUPPLY_CHAIN
  11. INC_OBSERVABILITY_ONLY

3.2 Why this matters

The incident class determines:

  • who must be alerted;
  • whether safe mode is sufficient;
  • whether protocol-level proof is required;
  • whether recovery affects consensus or only operations.

4. Severity model

4.1 Standard severity levels

Protocol operations SHOULD use:

  • SEV_0_INFO
  • SEV_1_LOW
  • SEV_2_MEDIUM
  • SEV_3_HIGH
  • SEV_4_CRITICAL
  • SEV_5_SYSTEMIC

4.2 Guidance

SEV_0_INFO

Observability only, or a warning with no operational impact.

SEV_1_LOW

Minor degradation with no impact on finality or funds.

SEV_2_MEDIUM

Local or temporary impact; local safe mode is possible.

SEV_3_HIGH

May affect:

  • liveness,
  • critical journaling,
  • local oracle correctness,
  • bounded agent operation.

SEV_4_CRITICAL

May affect:

  • finality,
  • validation,
  • bounded execution,
  • integrity of funds in certain domains.

SEV_5_SYSTEMIC

Broad threat to the protocol:

  • deterministic split,
  • incompatible notarizations,
  • systemically exploitable validation/VM bug,
  • dangerous capture or expansion of governance,
  • compromise of major keys or committees.
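
The severity levels form an ordered scale, and later sections key obligations off thresholds on it (evidence preservation and audit trail at SEV_3+, postmortem at SEV_4+). A minimal Python sketch of that ordering follows; the helper function names are hypothetical and not part of the specification.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from section 4.1, ordered so comparisons work."""
    SEV_0_INFO = 0
    SEV_1_LOW = 1
    SEV_2_MEDIUM = 2
    SEV_3_HIGH = 3
    SEV_4_CRITICAL = 4
    SEV_5_SYSTEMIC = 5


def requires_evidence_preservation(sev: Severity) -> bool:
    # Sections 10.1 and 32.1 apply to SEV_3 and above.
    return sev >= Severity.SEV_3_HIGH


def requires_postmortem(sev: Severity) -> bool:
    # Section 33.1 applies to SEV_4 and above.
    return sev >= Severity.SEV_4_CRITICAL
```

Using an integer-backed enum keeps "SEV_3+" style rules as plain comparisons instead of string matching.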

5. Incident states

5.1 Lifecycle states

An incident SHOULD pass through:

  • DETECTED
  • TRIAGED
  • CONFIRMED
  • CONTAINING
  • STABILIZED
  • RECOVERING
  • MONITORING
  • CLOSED
  • POSTMORTEM_PENDING
  • POSTMORTEM_PUBLISHED

5.2 Rule

A critical incident MUST NOT jump directly to CLOSED without evidence, a recovery state, and review.
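
The lifecycle and the closing rule can be sketched as a small transition check. This is an illustration under two assumptions that the text does not state: transitions only move forward through the listed order, and "critical" means SEV_4 or higher. The function and parameter names are hypothetical.

```python
LIFECYCLE = [
    "DETECTED", "TRIAGED", "CONFIRMED", "CONTAINING", "STABILIZED",
    "RECOVERING", "MONITORING", "CLOSED",
    "POSTMORTEM_PENDING", "POSTMORTEM_PUBLISHED",
]


def can_transition(current, target, *, critical,
                   has_evidence=False, has_recovery_state=False,
                   has_review=False):
    """Return True if moving from current to target is acceptable."""
    i, j = LIFECYCLE.index(current), LIFECYCLE.index(target)
    if j <= i:
        return False  # assumption: no backward or no-op transitions
    if critical and j >= LIFECYCLE.index("CLOSED"):
        # Rule 5.2: closing a critical incident needs evidence,
        # a recovery state, and review.
        return has_evidence and has_recovery_state and has_review
    return True
```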


6. Incident roles

6.1 Roles

Operationally, the system SHOULD separate:

  • Incident Reporter
  • On-call Operator
  • Incident Commander
  • Consensus Lead
  • Execution/BVM Lead
  • Security Lead
  • Governance Liaison
  • Recovery Operator
  • Audit Scribe

6.2 Rule

In small systems, some roles may coincide. In serious deployments, critical roles SHOULD be separated.

6.3 Audit Scribe

An explicit role for journaling decisions is recommended. Without it, the postmortem becomes imprecise.


7. Incident object model

7.1 Abstract structure

IncidentRecord {
  incident_id
  incident_class
  severity
  detection_time
  detected_by
  status
  affected_scopes_hash
  initial_evidence_hash
  current_response_profile
  related_object_refs
  related_runbook_id
  metadata_hash?
}

7.2 incident_id

incident_id = H("AZ:INCIDENT:" || canonical_incident_record)

7.3 Rule

Incident records SHOULD be canonical and append-only in audit systems. Normative protocol effects still require protocol objects, not just incident records.
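
A minimal Python sketch of the record and its identifier follows. Several choices here are assumptions, since the text leaves them open: H is taken to be SHA-256, the canonical encoding is taken to be compact JSON with sorted keys, and incident_id itself is left out of the hashed form to avoid a circular definition. Whether mutable fields such as status belong in the hashed form is also not stated; this sketch hashes the record as first created.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class IncidentRecord:
    incident_class: str
    severity: str
    detection_time: int            # assumption: integer timestamp
    detected_by: str
    status: str
    affected_scopes_hash: str
    initial_evidence_hash: str
    current_response_profile: str
    related_object_refs: Tuple[str, ...]
    related_runbook_id: str
    metadata_hash: Optional[str] = None


def canonical_bytes(record: IncidentRecord) -> bytes:
    # Assumed canonical form: sorted keys, no insignificant whitespace.
    return json.dumps(asdict(record), sort_keys=True,
                      separators=(",", ":")).encode("utf-8")


def incident_id(record: IncidentRecord) -> str:
    # incident_id = H("AZ:INCIDENT:" || canonical_incident_record)
    return hashlib.sha256(b"AZ:INCIDENT:" + canonical_bytes(record)).hexdigest()
```

The domain-separation prefix keeps incident identifiers from colliding with hashes of other object types that share the same encoding.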


8. Response profiles

8.1 Need

Not all incidents call for the same intensity of response.

8.2 Standard profiles

  • RP_OBSERVE_ONLY
  • RP_LOCAL_DEGRADED
  • RP_SAFE_MODE_SCOPED
  • RP_HALT_SCOPED
  • RP_EMERGENCY_RESTRICTION
  • RP_PROTOCOL_RECOVERY
  • RP_FORENSIC_ONLY

8.3 Rule

The response profile SHOULD be chosen from severity + incident class + estimated blast radius.
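
As an illustration only, profile selection can be written as a pure function of the three inputs. The mapping below is a hypothetical policy, not one defined by this document, and the blast radius labels ("local", "scoped", "systemic") are invented for the sketch.

```python
def choose_response_profile(severity: int, incident_class: str,
                            blast_radius: str) -> str:
    """Pick a response profile; the thresholds are illustrative assumptions.

    severity: 0..5, matching SEV_0..SEV_5.
    blast_radius: "local", "scoped" or "systemic" (hypothetical labels).
    """
    if incident_class == "INC_OBSERVABILITY_ONLY" or severity == 0:
        return "RP_OBSERVE_ONLY"
    if blast_radius == "local":
        return "RP_LOCAL_DEGRADED"
    if severity <= 3:
        return "RP_SAFE_MODE_SCOPED"
    if severity == 4:
        return "RP_HALT_SCOPED"
    return "RP_EMERGENCY_RESTRICTION"
```

Keeping the choice deterministic means two operators looking at the same inputs reach the same starting profile, which can then be adjusted with a logged decision.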


9. Generic incident response ladder

9.1 Canonical order

For any significant incident, the order SHOULD be:

  1. detect
  2. classify
  3. preserve evidence
  4. scope impact
  5. choose containment profile
  6. apply safe mode/halt/restriction if needed
  7. validate current finalized truth
  8. prepare recovery path
  9. perform recovery
  10. monitor stabilization
  11. publish postmortem and permanent fixes

9.2 Rule

Operators MUST NOT begin any “cleanup” that destroys evidence before it has been preserved.


10. Evidence preservation requirements

10.1 For any SEV_3+

MUST preserve at minimum:

  • relevant object refs
  • local and finalized state roots
  • active parameter state
  • governance/emergency state
  • relevant logs
  • peer messages around event window where relevant
  • slash/fraud evidence if present
  • code/version identifiers

10.2 For consensus incidents

Also preserve:

  • block candidate DAG view
  • committee derivation inputs
  • verifier/notary messages
  • notarization candidates
  • replay traces

10.3 For BVM incidents

Also preserve:

  • module blob
  • code hash
  • args bytes
  • prior state blob/root
  • execution trace if available
  • effect accumulator data
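
The preservation lists above lend themselves to a checklist that reports what is still missing. The sketch below is one possible shape; the item keys are hypothetical identifiers for the listed items, conditional items ("if present", "if available", "where relevant") are left out of the required sets, and applying the class-specific lists only at SEV_3+ is an assumption.

```python
BASE_EVIDENCE = frozenset({
    "object_refs", "local_state_root", "finalized_state_root",
    "parameter_state", "governance_emergency_state", "logs",
    "code_version_ids",
})

CLASS_EVIDENCE = {
    "INC_CONSENSUS": frozenset({
        "candidate_dag_view", "committee_derivation_inputs",
        "verifier_notary_messages", "notarization_candidates",
        "replay_traces",
    }),
    "INC_BVM": frozenset({
        "module_blob", "code_hash", "args_bytes",
        "prior_state", "effect_accumulator",
    }),
}


def missing_evidence(incident_class, severity, preserved):
    """Return the set of required evidence items not yet preserved."""
    if severity < 3:
        return set()  # section 10.1 only binds SEV_3 and above
    required = BASE_EVIDENCE | CLASS_EVIDENCE.get(incident_class, frozenset())
    return set(required) - set(preserved)
```

An empty result is a reasonable gate before any cleanup step is allowed to run.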

11. Safe mode definitions

11.1 Local degraded mode

Node-local. May do:

  • reduce RPC exposure
  • stop proposer role
  • stop agent submission integration
  • keep full validation active
  • keep observing network

Does NOT change protocol truth.

11.2 Protocol safe mode

Protocol-recognized restriction mode. Can do:

  • tighten risk thresholds
  • force extra approvals
  • restrict agent classes
  • restrict feature subsets
  • move domains to exit-only

Requires protocol objects if consensus-relevant.

11.3 Rule

Every safe mode MUST be labeled either:

  • local only or
  • protocol active

Never mix them implicitly.
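
One way to make the labeling rule hard to violate is to make the label a required, typed field. The sketch below assumes that every protocol-active declaration is consensus-relevant and therefore carries a protocol object reference; the class and field names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SafeModeScope(Enum):
    LOCAL_ONLY = "local only"
    PROTOCOL_ACTIVE = "protocol active"


@dataclass(frozen=True)
class SafeModeDeclaration:
    scope: SafeModeScope
    reason: str
    protocol_object_ref: Optional[str] = None

    def __post_init__(self):
        if not isinstance(self.scope, SafeModeScope):
            raise ValueError("safe mode must be labeled explicitly")
        if self.scope is SafeModeScope.PROTOCOL_ACTIVE and not self.protocol_object_ref:
            # Assumption: protocol-active mode is backed by a protocol object.
            raise ValueError("protocol-active safe mode needs a protocol object")
        if self.scope is SafeModeScope.LOCAL_ONLY and self.protocol_object_ref:
            raise ValueError("local-only safe mode must not cite a protocol object")
```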


12. Halt definitions

12.1 Local service halt

Stops local service components without claiming a protocol-wide halt:

  • API
  • proposer
  • agent relayer
  • indexing

12.2 Scoped protocol halt

Stops, if protocol objects authorize it:

  • one machine,
  • one mandate scope,
  • one witness issuer class,
  • one feature domain,
  • one economic subsystem subset.

12.3 Systemic protocol halt

Highly exceptional. Should be considered only when:

  • widespread invalid state risk exists,
  • finality or execution integrity is threatened,
  • less restrictive containment is insufficient.

12.4 Rule

Protocol halts MUST be scoped as narrowly as possible.


13. Quarantine model

13.1 Purpose

Quarantine is weaker than permanent disablement. Used when confidence is incomplete but risk is material.

13.2 Quarantine targets

  • issuer
  • oracle source
  • committee member set
  • machine family
  • BVM module family
  • agent operator
  • governance actor stream
  • node peer cluster locally

13.3 Effects

May include:

  • ignore or down-rank objects from target
  • require corroboration
  • disallow new actions from target
  • freeze role privileges pending review

13.4 Rule

Protocol quarantine with consensus impact requires protocol authority path. Local peer quarantine does not.


14. Consensus incident runbook

14.1 Trigger examples

  • conflicting notarization observed
  • repeated no-finality
  • committee selection mismatch
  • front selection divergence
  • suspicious double-signing

14.2 Minimum actions

  1. preserve notarization candidates and committee derivation inputs
  2. stop local proposer/notary if risk of compounding fault
  3. continue validation if safe
  4. replay finalized and candidate front from last safe checkpoint
  5. identify whether issue is:
    • invalid object,
    • local implementation divergence,
    • genuine protocol fault,
    • adversarial equivocation
  6. generate fraud proof if possible
  7. if blast radius high, prepare emergency restriction request
  8. avoid accepting new dubious finality until verified

14.3 Recovery goals

  • confirm last unquestionably finalized epoch
  • isolate incompatible candidate branches
  • ensure no false hard-final state is exposed as safe
  • restore healthy finality path

15. Validation incident runbook

15.1 Trigger examples

  • one implementation accepts, another rejects same tx
  • canonical decode mismatch
  • reference resolution mismatch
  • expiry/revocation mismatch

15.2 Minimum actions

  1. preserve exact input bytes and state fixture
  2. run AZ-011/AZ-003 targeted replay
  3. classify whether:
    • node bug,
    • spec ambiguity,
    • malformed object,
    • feature activation mismatch
  4. if consensus-critical ambiguity exists, enter safe mode for affected roles
  5. publish temporary operational guidance
  6. prepare patched implementation and/or governance clarification if needed

15.3 Rule

Validation ambiguity is potentially systemic until proven otherwise.


16. BVM incident runbook

16.1 Trigger examples

  • divergent execution results
  • unexpected trap/revert pattern
  • effect digest mismatch
  • manifest accepted but should fail
  • exec cost mis-accounting
  • host call permission leak

16.2 Minimum actions

  1. preserve module, args, prior state, parameter state
  2. disable risky machine family or module scope if needed
  3. compare against semantic oracle/runtime reference
  4. run deterministic replay on multiple implementations
  5. classify:
    • verifier bug,
    • runtime bug,
    • malformed bytecode,
    • spec gap
  6. if exploitability exists, escalate to protocol restriction or halt for affected domain
  7. prepare patched verifier/runtime
  8. require explicit re-enable criteria

16.3 Rule

BVM incidents with possible boundedness bypass SHOULD default to fail closed for affected machine scope.


17. Witness / proof incident runbook

17.1 Trigger examples

  • contradictory oracle claims
  • revocation mismatch
  • proof verifier bug
  • stale claim accepted
  • unauthorized witness emission

17.2 Minimum actions

  1. preserve witness/proof objects and issuer policies
  2. mark affected issuer/source/domain
  3. if high impact, quarantine issuer and require corroboration
  4. derive whether objects are:
    • invalid,
    • contradictory,
    • stale,
    • parser-bug dependent
  5. generate fraud proof or invalidation trail where possible
  6. if settlement or treasury flows depend on affected witness family, freeze those flows as needed
  7. define revalidation plan

18. Economic incident runbook

18.1 Trigger examples

  • fee underpricing exploited
  • rent bypass
  • slash amount bug
  • reward distortion
  • no-finality reward leak
  • spam overwhelms state or mempool economics

18.2 Minimum actions

  1. preserve parameter state and observed exploit path
  2. quantify exploit economics and blast radius
  3. apply local throttles if safe and non-consensus
  4. if protocol-level fix needed, prepare emergency restriction or fast governance path within constitution
  5. mark affected feature/domain
  6. stop claiming economic normality until patch active
  7. simulate patched parameters before activation

18.3 Rule

Economic incidents that do not yet corrupt finality can still justify rapid restriction if exploit path is cheap and repeatable.


19. Agent incident runbook

19.1 Trigger examples

  • action outside mandate
  • missing mandatory log
  • action after halt/revoke
  • cap bypass
  • compromised operator key
  • inconsistent decision/execution logs

19.2 Minimum actions

  1. preserve mandate snapshot and action refs
  2. halt or suspend affected mandate scope
  3. rotate agent/operator if compromise suspected
  4. move to exit-only where possible
  5. produce witness/audit observation or slashing evidence if applicable
  6. quantify open exposure and close/reduce if policy allows
  7. require re-authorization before restart

19.3 Rule

For high-impact agents, halt first and explain second.


20. Governance incident runbook

20.1 Trigger examples

  • activation before timelock
  • challenge window bypass
  • missing required review
  • emergency action out of scope
  • proposal class mislabeling
  • conflicting activations

20.2 Minimum actions

  1. preserve all proposal/review/vote/outcome/activation objects
  2. compute active governance state from last finalized good point
  3. identify whether incident is:
    • invalid object,
    • implementation bug,
    • procedural abuse,
    • constitutional violation
  4. suspend applying disputed future activations
  5. if already activated improperly, enter governance anomaly state and apply constitutionally allowed containment
  6. require constitutional review for restart of disputed path

20.3 Rule

Governance anomalies must never be hidden behind UI-level reinterpretation.


21. Key compromise runbook

21.1 Trigger examples

  • validator key leak
  • notary key compromise
  • oracle issuer key compromise
  • agent operator key compromise
  • governance signer compromise

21.2 Minimum actions

  1. preserve compromise evidence and timeline
  2. rotate or revoke affected keys/policies
  3. quarantine recent messages if required by rules
  4. assess whether any signed objects become slashable evidence
  5. reduce privileges on remaining linked scopes
  6. force re-authorization for critical flows
  7. publish blast radius assessment

21.3 Rule

Key compromise response SHOULD prefer scoped revocation and role separation over broad panic shutdown where possible.


22. Infrastructure incident runbook

22.1 Trigger examples

  • data corruption in indexer
  • snapshot corruption
  • disk or DB issues
  • network partition local to operator
  • telemetry pipeline failure
  • time sync drift

22.2 Minimum actions

  1. distinguish consensus-critical vs non-critical impact
  2. if finalized truth at risk locally, stop producing role actions
  3. rebuild from last finalized checkpoint if needed
  4. quarantine corrupted local indexes
  5. verify state root against peers/reference checkpoints
  6. only re-enable proposer/verifier/notary roles after integrity checks pass

22.3 Rule

Infrastructure failure MUST NOT silently continue as if node were healthy in a consensus role.


23. Supply chain incident runbook

23.1 Trigger examples

  • malicious dependency
  • compromised build artifact
  • divergent release binary
  • compiler bug affecting determinism

23.2 Minimum actions

  1. preserve binary hashes and build metadata
  2. compare reproducible builds
  3. suspend rollout and possibly role participation
  4. identify affected versions
  5. require rebuilt, verified artifacts
  6. consider protocol-level quarantine of known-bad implementation families only if constitutionally allowed and technically grounded

24. Containment decision matrix

24.1 Containment choices

  • observe only
  • local degraded mode
  • local service halt
  • scoped protocol safe mode
  • scoped protocol halt
  • issuer/committee quarantine
  • emergency restriction
  • recovery mode

24.2 Selection guidance

Use the least restrictive measure that:

  • stops further damage,
  • preserves evidence,
  • does not create protocol ambiguity,
  • is actually fast enough.

24.3 Rule

Containment MUST be scoped and reversible where possible.
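
The selection guidance amounts to scanning the containment choices from least to most restrictive and taking the first one that satisfies all four conditions. The sketch below assumes the order of the list in 24.1 is that restrictiveness order, which the text does not state outright; the type and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainmentOption:
    name: str
    stops_damage: bool
    preserves_evidence: bool
    avoids_protocol_ambiguity: bool
    fast_enough: bool


# Assumed to run from least to most restrictive.
CONTAINMENT_ORDER = [
    "observe only", "local degraded mode", "local service halt",
    "scoped protocol safe mode", "scoped protocol halt",
    "issuer/committee quarantine", "emergency restriction", "recovery mode",
]


def select_containment(options):
    """Return the least restrictive adequate option name, or None."""
    by_name = {o.name: o for o in options}
    for name in CONTAINMENT_ORDER:
        o = by_name.get(name)
        if o and o.stops_damage and o.preserves_evidence \
                and o.avoids_protocol_ambiguity and o.fast_enough:
            return name
    return None
```

A None result signals that no assessed option is adequate, which is itself an escalation trigger rather than a reason to do nothing.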


25. Emergency governance escalation criteria

25.1 Consider emergency path when:

  • protocol-level exploit is active or imminent;
  • less restrictive containment is insufficient;
  • affected scope is clearly definable;
  • evidence exists or risk is extreme;
  • delay from normal governance would materially worsen damage.

25.2 Do not use emergency path when:

  • issue is purely local/operator-side;
  • impact is observability-only;
  • evidence is too weak and restriction would be broader than justified;
  • normal path is fast enough and adequate.
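
The two lists can be read as a predicate with positive criteria and blockers. The sketch below assumes that all positive criteria must hold together and that any single blocker rules the emergency path out; the text lists the conditions without saying how they combine. Field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationAssessment:
    # 25.1 criteria
    exploit_active_or_imminent: bool
    lesser_containment_insufficient: bool
    scope_clearly_definable: bool
    evidence_exists_or_risk_extreme: bool
    normal_governance_delay_harmful: bool
    # 25.2 blockers
    purely_local: bool = False
    observability_only: bool = False
    restriction_broader_than_evidence: bool = False
    normal_path_adequate: bool = False


def should_consider_emergency_path(a: EscalationAssessment) -> bool:
    blockers = (a.purely_local, a.observability_only,
                a.restriction_broader_than_evidence, a.normal_path_adequate)
    if any(blockers):
        return False
    return all((a.exploit_active_or_imminent,
                a.lesser_containment_insufficient,
                a.scope_clearly_definable,
                a.evidence_exists_or_risk_extreme,
                a.normal_governance_delay_harmful))
```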

26. Recovery modes

26.1 Recovery mode classes

  • RM_LOCAL_REBUILD
  • RM_FINALIZED_REPLAY
  • RM_SPECULATIVE_BRANCH_RESET
  • RM_PROTOCOL_SAFE_RESTART
  • RM_DOMAIN_REENABLE
  • RM_POST_EMERGENCY_NORMALIZATION

26.2 Rule

Recovery mode must specify:

  • target scope
  • entry condition
  • expected outputs
  • exit criteria
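
A recovery plan that carries these four fields can be validated before it is accepted. The following is a small sketch with hypothetical type and function names; it only checks that the fields are present and that the mode is one of the listed classes.

```python
from dataclasses import dataclass
from typing import List, Tuple

RECOVERY_MODES = frozenset({
    "RM_LOCAL_REBUILD", "RM_FINALIZED_REPLAY", "RM_SPECULATIVE_BRANCH_RESET",
    "RM_PROTOCOL_SAFE_RESTART", "RM_DOMAIN_REENABLE",
    "RM_POST_EMERGENCY_NORMALIZATION",
})


@dataclass(frozen=True)
class RecoveryPlan:
    mode: str
    target_scope: str
    entry_condition: str
    expected_outputs: Tuple[str, ...]
    exit_criteria: Tuple[str, ...]


def plan_problems(plan: RecoveryPlan) -> List[str]:
    """Return human-readable problems; an empty list means well-formed."""
    problems = []
    if plan.mode not in RECOVERY_MODES:
        problems.append("unknown recovery mode")
    for name in ("target_scope", "entry_condition",
                 "expected_outputs", "exit_criteria"):
        if not getattr(plan, name):
            problems.append("missing " + name)
    return problems
```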

27. Replay and rebuild runbook

27.1 Use when

  • local state corruption
  • suspected deterministic divergence
  • uncertain candidate branch correctness
  • governance state ambiguity after bug
  • BVM runtime discrepancy

27.2 Steps

  1. identify last trusted finalized checkpoint
  2. freeze consensus-role outputs locally
  3. load canonical object sets and parameter state
  4. replay finalized path
  5. compare derived roots and receipts
  6. rebuild speculative branches if needed
  7. revalidate current active restrictions and governance activations
  8. only rejoin role participation after consistency checks

27.3 Rule

Replay source of truth MUST be finalized objects and canonical fixtures, not opportunistic local cache state.
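
The core of steps 4 and 5 is to re-derive roots from finalized objects and compare them with the expected roots. The sketch below uses a toy transition (a SHA-256 hash chain) in place of the real state transition function, which this document does not define; all names are hypothetical.

```python
import hashlib
from typing import List, Optional


def apply_object(root: bytes, obj: bytes) -> bytes:
    # Toy stand-in for the real state transition: chain the hash.
    return hashlib.sha256(root + obj).digest()


def first_divergence(checkpoint_root: bytes,
                     finalized_objects: List[bytes],
                     expected_roots: List[bytes]) -> Optional[int]:
    """Replay from a trusted checkpoint.

    Returns the index of the first object whose derived root differs
    from the expected one, or None if the whole path is consistent.
    """
    if len(finalized_objects) != len(expected_roots):
        raise ValueError("objects and expected roots must align")
    root = checkpoint_root
    for i, (obj, expected) in enumerate(zip(finalized_objects, expected_roots)):
        root = apply_object(root, obj)
        if root != expected:
            return i
    return None
```

A node would rejoin role participation only when the result is None; any index returned marks where investigation should start.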


28. Exit-only recovery path

28.1 Use when

  • agents or machine families may be unsafe for new positions;
  • open exposure must be reduced before deeper restart.

28.2 Behavior

  • allow close/reduce actions
  • disallow increase of exposure
  • require extra journaling
  • tighten caps
  • possibly require manual approvals

28.3 Rule

Exit-only mode SHOULD be preferred over full freeze when user protection benefits from orderly de-risking.
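
The central exit-only check is that an action may shrink or close exposure but never grow it. A minimal sketch follows, assuming exposure can be modeled as a signed integer (positive and negative for opposite sides) and that flipping sides counts as opening new exposure; both are assumptions of the example.

```python
def allowed_in_exit_only(current: int, proposed: int) -> bool:
    """True if moving from current to proposed exposure only reduces risk."""
    if current == 0:
        return proposed == 0  # nothing open, so nothing may be opened
    same_side = proposed == 0 or (proposed > 0) == (current > 0)
    return same_side and abs(proposed) <= abs(current)
```

For example, `allowed_in_exit_only(100, 40)` accepts a reduction, while `allowed_in_exit_only(100, 120)` and `allowed_in_exit_only(100, -10)` are both rejected.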


29. Restart criteria

29.1 A subsystem SHOULD NOT restart normal operation until:

  1. root cause category is at least bounded;
  2. containment is holding;
  3. replay/rebuild checks pass where relevant;
  4. active emergency restrictions are understood;
  5. key roles are safe or rotated;
  6. required reviews/approvals are complete;
  7. telemetry signals are back within thresholds;
  8. explicit restart decision is logged.

29.2 Rule

“Seems fine now” is not a sufficient restart criterion.
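
The eight criteria can be enforced as an explicit checklist in which anything not affirmatively marked true blocks the restart. The keys below are hypothetical names for the criteria in 29.1.

```python
RESTART_CRITERIA = (
    "root_cause_bounded",
    "containment_holding",
    "replay_checks_passed",
    "emergency_restrictions_understood",
    "key_roles_safe_or_rotated",
    "reviews_complete",
    "telemetry_within_thresholds",
    "restart_decision_logged",
)


def restart_blockers(checks: dict) -> list:
    """Return the criteria that are missing or false, in section order."""
    return [c for c in RESTART_CRITERIA if not checks.get(c, False)]
```

Treating a missing entry as a blocker means silence never counts as approval.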


30. Re-enable phases

30.1 Recommended order

  1. observer/indexer paths
  2. RPC read-only
  3. validation-only participation
  4. verifier role if safe
  5. proposer role if safe
  6. notary role last
  7. agent execution domains last among app domains if incident touched them

30.2 Rule

The more authority a component has, the later it should be re-enabled.
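
The recommended order can be applied mechanically: sort the components to be restored by their rank, and refuse to enable one while an earlier-ranked requested component is still off. The component keys are hypothetical names for the entries in 30.1.

```python
REENABLE_ORDER = (
    "observer_indexer", "rpc_read_only", "validation_only",
    "verifier", "proposer", "notary", "agent_execution",
)


def reenable_plan(requested):
    """Order the requested components as recommended in 30.1."""
    unknown = set(requested) - set(REENABLE_ORDER)
    if unknown:
        raise ValueError("unknown components: " + ", ".join(sorted(unknown)))
    return sorted(set(requested), key=REENABLE_ORDER.index)


def may_enable(component, enabled, requested):
    """True only if every earlier-ranked requested component is enabled."""
    rank = REENABLE_ORDER.index(component)
    return all(c in enabled for c in requested
               if REENABLE_ORDER.index(c) < rank)
```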


31. Communication states

31.1 Public status classes

Operational communications SHOULD distinguish:

  • healthy
  • degraded
  • safe mode
  • scoped halt
  • recovery in progress
  • monitoring after recovery

31.2 Rule

Never present:

  • speculative state as final;
  • local node issue as protocol-wide issue without basis;
  • protocol-wide restriction as mere local maintenance.

32. Mandatory audit trail

32.1 For any SEV_3+ incident, audit trail SHOULD contain:

  • incident id
  • timeline of events
  • who declared severity
  • who declared containment profile
  • affected scopes
  • evidence references
  • protocol objects used for restriction/recovery
  • replay/rebuild results
  • restart criteria and sign-off
  • postmortem link/hash

32.2 Rule

Missing audit trail in critical incidents is itself an operational failure.


33. Postmortem requirements

33.1 Every SEV_4+ incident SHOULD produce:

  • root cause class
  • exact triggering condition
  • timeline
  • blast radius
  • why earlier controls did or did not work
  • immediate fix
  • long-term fix
  • spec/implementation/process changes required
  • whether governance change is needed
  • whether conformance vectors should be added

33.2 Rule

Postmortem should distinguish:

  • protocol bug,
  • implementation bug,
  • ops failure,
  • governance failure,
  • external dependency failure.

34. Runbook catalog

34.1 Each runbook SHOULD have:

  • runbook_id
  • title
  • incident classes covered
  • trigger signatures
  • preconditions
  • immediate actions
  • escalation paths
  • recovery steps
  • exit criteria
  • evidence checklist

34.2 Suggested IDs

  • RB-CONSENSUS-001
  • RB-VALIDATION-001
  • RB-BVM-001
  • RB-WITNESS-001
  • RB-ECON-001
  • RB-AGENT-001
  • RB-GOV-001
  • RB-KEY-001
  • RB-INFRA-001
  • RB-SUPPLY-001

35. Example runbook skeleton

35.1 Abstract structure

Runbook {
  runbook_id
  title
  incident_class
  severity_min
  trigger_conditions_hash
  evidence_checklist_hash
  immediate_actions_hash
  containment_matrix_hash
  recovery_steps_hash
  exit_criteria_hash
}

35.2 Rule

Runbook content SHOULD be versioned and auditable. Critical operational changes to runbooks SHOULD be reviewed.
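
Because the skeleton stores content hashes rather than content, any edit to a runbook section shows up as a changed hash, which supports versioning and review. A minimal sketch follows, assuming SHA-256 over newline-joined section lines; the hash function and encoding are assumptions, and the helper names are hypothetical.

```python
import hashlib
from typing import Dict, List

SECTIONS = (
    "trigger_conditions", "evidence_checklist", "immediate_actions",
    "containment_matrix", "recovery_steps", "exit_criteria",
)


def content_hash(lines: List[str]) -> str:
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def build_runbook(runbook_id: str, title: str, incident_class: str,
                  severity_min: str,
                  sections: Dict[str, List[str]]) -> Dict[str, str]:
    """Build a Runbook record with one <section>_hash field per section."""
    missing = [s for s in SECTIONS if s not in sections]
    if missing:
        raise ValueError("missing sections: " + ", ".join(missing))
    record = {
        "runbook_id": runbook_id,
        "title": title,
        "incident_class": incident_class,
        "severity_min": severity_min,
    }
    for s in SECTIONS:
        record[s + "_hash"] = content_hash(sections[s])
    return record
```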


36. Drills and exercises

36.1 Required drills

Serious deployments SHOULD practice:

  • no-finality drill
  • conflicting notarization drill
  • BVM divergence drill
  • oracle contradiction drill
  • agent key compromise drill
  • governance activation anomaly drill
  • snapshot rebuild drill
  • exit-only recovery drill

36.2 Purpose

Unexercised runbooks are weaker than they look.


37. Interaction with fraud proofs and slashing

37.1 Rule

Incident response and slashing are related but distinct.

37.2 Operational order

Often:

  1. contain
  2. preserve evidence
  3. derive fraud proof
  4. apply slash or invalidation path
  5. continue recovery

37.3 Caution

Do not wait for slash execution before containing a live exploit.


38. Interaction with governance

38.1 Rule

Runbooks may recommend emergency governance escalation, but MUST NOT silently execute governance powers outside allowed process.

38.2 Requirement

If emergency action is used, the runbook SHOULD specify:

  • why normal path insufficient,
  • why scope is minimal,
  • expected expiry,
  • what post-activation review is required.

39. Interaction with conformance suites

39.1 Every critical incident SHOULD trigger review of AZ-011 coverage.

Questions to ask:

  • should a new vector be added?
  • was an existing vector insufficient?
  • do multiple implementations now diverge on a missed edge case?

39.2 Rule

Incidents should improve the protocol test corpus, not just operations.


40. Runbook anti-patterns

Operators and implementers SHOULD avoid:

  1. reboot first, preserve evidence later
  2. local patch that changes consensus behavior silently
  3. protocol-wide panic halt for a local-only bug
  4. waiting for full root cause before containment
  5. restoring proposer/notary roles before replay checks
  6. unclear distinction between safe mode and halt
  7. emergency action with no expiry or no audit trail
  8. trusting dashboards over canonical replay
  9. re-enabling compromised keys without rotation
  10. closing incident because telemetry is quiet but root cause unknown

41. Formal goals

AZ-015 pursues four goals:

41.1 Containment soundness

Incidents can be contained quickly without creating further protocol ambiguity.

41.2 Recovery replayability

Recovery can reconstruct trusted state from canonical finalized truth.

41.3 Operational auditability

Incident decisions can be traced and evaluated after the fact.

41.4 Controlled return to service

The return to normal is driven by verifiable criteria, not by assumptions.


42. Document formula

Incident Response = classify + preserve evidence + contain minimally + replay canonical truth + recover in stages + re-enable by criteria + publish audit trail


43. Relationship to the rest of the suite

  • AZ-010 defined the security model.
  • AZ-014 defined proof and penalty.
  • AZ-015 defines human and node operations around incidents.

In short: if AZ-014 says how you prove the fault, AZ-015 says how you survive it.


44. What comes next

After AZ-015, the next document is:

AZ-016 — Genesis Specification

That document must fix:

  • genesis parameters,
  • initial assets,
  • initial validators,
  • the initial seed,
  • initial registries,
  • initial param state,
  • initial feature flags.

Closing

A serious protocol is not one that assumes incidents will never happen. It is one that already knows how to recognize them, how to avoid losing evidence, how to avoid making things worse through chaotic reactions, and how to return to operation without lying about the state of the system.

That is where real operational resilience begins.