AZ-031 — Early Mainnet Monitoring Profiles v1

Status

This document defines the monitoring profiles for early mainnet in ATLAS ZERO.

With AZ-001 through AZ-030 in place, the following already exist:

  • the specification of the protocol and its subsystems;
  • readiness, launch ceremony, and launch window;
  • the release/genesis packages;
  • the operator manuals and checklists;
  • the formal ledger of launch decisions.

AZ-031 answers the question: how do we monitor the first epochs and the first critical intervals of the network so that we can quickly distinguish healthy behavior, controlled degradation, and a real incident, without mistaking normal bootstrap noise for signals of systemic risk?

The purpose of this document is to fix:

  • the monitoring profiles for early mainnet;
  • the minimum mandatory metrics;
  • the signal and alert classes;
  • the thresholds for healthy, degraded, incident-open, and emergency-escalation;
  • the linkage with restricted posture, incident response, and the launch decision ledger.

This document builds on:

  • AZ-002 through AZ-030, with direct emphasis on AZ-015, AZ-017, AZ-025, AZ-028, AZ-029, and AZ-030.

Terms:

  • MUST = mandatory
  • MUST NOT = forbidden
  • SHOULD = strongly recommended
  • MAY = optional

1. Objective

AZ-031 answers 10 critical questions:

  1. What does early mainnet monitoring mean?
  2. Which monitoring profiles must exist?
  3. Which metrics are mandatory in the first epochs?
  4. Which thresholds define healthy, degraded, and incident?
  5. Which alerts are informational and which are blocking?
  6. How do we correlate local operator signals with network signals?
  7. How do we decide whether restricted posture can continue or must be tightened?
  8. How do we feed the Launch Decision Ledger and the incident runbooks with real signals?
  9. How do we avoid both excessive panic and the ignoring of systemic signals?
  10. When do we have enough stability to leave the early mainnet monitoring profile?

2. Principii

2.1 Monitoring is part of launch control, not decorative observability

In early mainnet, monitoring MUST be treated as an operational control mechanism.

2.2 Signals must be typed

Not every anomaly carries the same meaning. Monitoring SHOULD separate:

  • informational,
  • caution,
  • degraded,
  • incident-open,
  • emergency-escalation.

2.3 Network truth and local truth must both be observed

An operator must observe:

  • the local health of the node;
  • the perceived health of the network;
  • and the relationship between the two.

2.4 Thresholds must be predeclared

The main thresholds SHOULD be defined before launch. They must not be invented after an anomaly has appeared.

2.5 Monitoring must feed action

A monitoring profile is incomplete if it does not state:

  • what is alerted on,
  • to whom,
  • and which action class it suggests.

2.6 Early mainnet is stricter than steady state

The first epochs and the first live intervals MUST have more conservative thresholds and faster escalation than steady state.


3. Monitoring scope

3.1 Early mainnet monitoring covers at minimum:

  • consensus/finality health
  • validation correctness signals
  • BVM execution health
  • witness/proof health
  • governance/activation anomalies
  • operator node health
  • artifact/release/genesis mismatch signals
  • role participation health
  • incident/recovery control paths

3.2 Rule

An early mainnet profile without finality, role health, and artifact scope checks is insufficient.


4. Monitoring profile classes

4.1 Standard profiles

ATLAS ZERO SHOULD define at least:

  • MP_BOOTSTRAP
  • MP_FIRST_BLOCKS
  • MP_FIRST_EPOCHS
  • MP_RESTRICTED_POSTURE
  • MP_POST_RESTRICTED_STABILIZATION

4.2 Meaning

MP_BOOTSTRAP

Intensive monitoring in the interval immediately following node startup and the first peer checks.

MP_FIRST_BLOCKS

Focus on the first proposals, the first validations, and startup anomalies.

MP_FIRST_EPOCHS

Focus on the first finalizations and the first signals of systemic behavior.

MP_RESTRICTED_POSTURE

Profile active while the network is live but operating under a strict regime.

MP_POST_RESTRICTED_STABILIZATION

Transition profile used before the return to steady state.

4.3 Rule

Transitions between profiles SHOULD be explicit and logged.
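
As an illustration of rule 4.3, the sketch below keeps the active profile in a small holder that refuses silent changes and records every transition. It is not part of the specification: the `ProfileTracker` class, the `reason` and `approved_by` parameters, and the log entry layout are hypothetical names chosen for this example, and no ordering between profiles is enforced because the text does not mandate one.

```python
import time
from dataclasses import dataclass, field

# Profile classes listed in section 4.1.
PROFILES = (
    "MP_BOOTSTRAP",
    "MP_FIRST_BLOCKS",
    "MP_FIRST_EPOCHS",
    "MP_RESTRICTED_POSTURE",
    "MP_POST_RESTRICTED_STABILIZATION",
)


@dataclass
class ProfileTracker:
    """Holds the active profile; every change must be explicit and is logged."""
    active: str = "MP_BOOTSTRAP"
    log: list = field(default_factory=list)

    def transition(self, target: str, reason: str, approved_by: str) -> None:
        if target not in PROFILES:
            raise ValueError(f"unknown profile class: {target}")
        if not reason.strip():
            raise ValueError("a transition needs an explicit reason")
        # Hypothetical log entry layout; the specification does not define one.
        self.log.append({
            "from": self.active,
            "to": target,
            "reason": reason,
            "approved_by": approved_by,
            "timestamp_unix_ms": int(time.time() * 1000),
        })
        self.active = target


tracker = ProfileTracker()
tracker.transition("MP_FIRST_BLOCKS", "first proposals observed", "launch_coordination")
```

A rejected transition leaves both the active profile and the log untouched, so the log stays a faithful history of what actually happened.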


5. Signal classes

5.1 Standard signal classes

  • SIG_INFO
  • SIG_CAUTION
  • SIG_DEGRADED
  • SIG_INCIDENT_OPEN
  • SIG_EMERGENCY_ESCALATE

5.2 Meaning

SIG_INFO

Useful information with no immediate operational impact.

SIG_CAUTION

A small anomaly or a trend worth watching.

SIG_DEGRADED

Behavior below expectations, but still controllable without a mandatory formal incident.

SIG_INCIDENT_OPEN

A signal strong enough to open an incident or to force local safe mode/hold decisions.

SIG_EMERGENCY_ESCALATE

A signal of exceptional severity, compatible with urgent escalation and possibly an emergency action workflow.

5.3 Rule

Signals MUST be mapped to actions and to notification roles.


6. Health classes

6.1 Standard health classes

  • HEALTHY
  • WATCH
  • DEGRADED
  • UNSTABLE
  • INCIDENT
  • EMERGENCY

6.2 Rule

Each profile SHOULD be able to derive an aggregated health class from metrics and signals.
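
One possible way to derive an aggregated health class from the signals active in a window is sketched below. The specification does not define the mapping, so the policy here is an assumption: the worst active signal decides the class, and several concurrent SIG_DEGRADED signals are read as UNSTABLE. The function name `aggregate_health` and the parameter `unstable_degraded_count` are hypothetical.

```python
from enum import IntEnum


class Signal(IntEnum):
    # Signal classes from section 5.1, ordered by severity.
    SIG_INFO = 0
    SIG_CAUTION = 1
    SIG_DEGRADED = 2
    SIG_INCIDENT_OPEN = 3
    SIG_EMERGENCY_ESCALATE = 4


def aggregate_health(signals, unstable_degraded_count=3):
    """Derive one health class (section 6.1) from the active signals.

    Assumed policy, not defined by the specification: the worst signal
    decides, and `unstable_degraded_count` or more concurrent
    SIG_DEGRADED signals are classified as UNSTABLE.
    """
    if not signals:
        return "HEALTHY"
    worst = max(signals)
    if worst == Signal.SIG_EMERGENCY_ESCALATE:
        return "EMERGENCY"
    if worst == Signal.SIG_INCIDENT_OPEN:
        return "INCIDENT"
    if worst == Signal.SIG_DEGRADED:
        degraded = sum(1 for s in signals if s == Signal.SIG_DEGRADED)
        return "UNSTABLE" if degraded >= unstable_degraded_count else "DEGRADED"
    if worst == Signal.SIG_CAUTION:
        return "WATCH"
    return "HEALTHY"
```

Keeping the policy in a single pure function makes it easy to freeze alongside the thresholds before launch.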


7. Monitoring dimensions

7.1 Core dimensions

  1. artifact integrity
  2. chain identity consistency
  3. peer compatibility
  4. validation correctness
  5. consensus/finality
  6. BVM execution
  7. witness/proof correctness
  8. governance activation correctness
  9. node resource health
  10. role participation
  11. logging/telemetry health
  12. operator action path health

7.2 Rule

Every dimension that can block restricted posture exit SHOULD have explicit metrics or signals.


8. Artifact integrity profile

8.1 Purpose

Detects whether a node is running with wrong artifacts or with artifacts not aligned with the launch scope.

8.2 Required checks

  • binary hash match to authorized release
  • release_package_id match
  • genesis_package_id match
  • chain_id match
  • genesis_hash match
  • no active revocation on critical artifacts known locally

8.3 Signal mapping

  • mismatch on binary hash => SIG_INCIDENT_OPEN
  • mismatch on genesis_hash or chain_id => SIG_EMERGENCY_ESCALATE
  • unknown artifact provenance in launch-critical role => SIG_INCIDENT_OPEN

8.4 Rule

Artifact mismatch in early mainnet SHOULD be treated as severe until disproven.
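
The checks in 8.2 and the mapping in 8.3 can be sketched as a single comparison pass. This is a minimal illustration under stated assumptions: the function `check_artifacts`, the dictionary layout, and the `provenance_known` flag are hypothetical, and the severity assigned to `release_package_id` and `genesis_package_id` mismatches is a guess based on rule 8.4, since 8.3 does not map them.

```python
def check_artifacts(local: dict, authorized: dict, launch_critical: bool = True) -> list:
    """Return (field, signal_class) pairs for every failed artifact check."""
    severity = {
        # Mapped explicitly in section 8.3.
        "binary_hash": "SIG_INCIDENT_OPEN",
        "chain_id": "SIG_EMERGENCY_ESCALATE",
        "genesis_hash": "SIG_EMERGENCY_ESCALATE",
        # Not mapped in 8.3; assumed severe following rule 8.4.
        "release_package_id": "SIG_INCIDENT_OPEN",
        "genesis_package_id": "SIG_INCIDENT_OPEN",
    }
    findings = []
    for name, signal in severity.items():
        expected = authorized[name]  # KeyError if the authorized set is incomplete
        if local.get(name) != expected:
            findings.append((name, signal))
    # Unknown provenance only matters here for launch-critical roles (8.3).
    if launch_critical and not local.get("provenance_known", False):
        findings.append(("provenance", "SIG_INCIDENT_OPEN"))
    return findings


authorized = {
    "binary_hash": "aa11", "chain_id": "atlas-zero-example",
    "genesis_hash": "bb22", "release_package_id": "rel-1",
    "genesis_package_id": "gen-1",
}
node = dict(authorized, provenance_known=True, genesis_hash="ffff")
findings = check_artifacts(node, authorized)  # one finding, on genesis_hash
```

The identifiers and hash values in the usage lines are placeholders, not real ATLAS ZERO values.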


9. Peer compatibility profile

9.1 Purpose

Observes whether the surrounding peers belong to the same network reality.

9.2 Required signals

  • peer chain_id mismatch count
  • peer genesis_hash mismatch count
  • unsupported protocol version count
  • peer handshake failure rate
  • peer diversity health

9.3 Threshold guidance

  • isolated mismatch peers => SIG_CAUTION
  • a recurring majority of mismatched peers or an unexpected cluster => SIG_DEGRADED or worse
  • widespread mismatch in launch-critical peers => SIG_INCIDENT_OPEN

9.4 Rule

Peer incompatibility MUST NOT be ignored as mere noise during launch.


10. Validation correctness profile

10.1 Purpose

Detects signals that the node or the network is processing objects incorrectly.

10.2 Required metrics

  • invalid object rate by class
  • tx reject rate by category
  • unexpected parser/canonicalization failures
  • receipt mismatch signals if available
  • replay mismatch signals if available

10.3 Escalation guidance

  • mild expected invalids from public traffic => SIG_INFO or SIG_CAUTION
  • sudden spike in well-formed but unexpectedly rejected objects => SIG_DEGRADED
  • deterministic replay mismatch => SIG_INCIDENT_OPEN or SIG_EMERGENCY_ESCALATE

10.4 Rule

Any sign of deterministic validation divergence in early mainnet is extremely serious.


11. Consensus and finality profile

11.1 Purpose

This is the central profile of early mainnet.

11.2 Required metrics

  • block proposal cadence
  • block acceptance rate
  • verifier vote participation
  • notary participation
  • finality latency
  • finalized epoch cadence
  • no-finality interval length
  • conflicting notarization signals
  • committee derivation mismatch signals

11.3 Health guidance

Healthy

Cadence and participation inside expected launch bands.

Degraded

Transient slower finality or reduced participation, but network still understandable and recoverable.

Incident

Sustained no-finality, contradictory notarization, or unexplained participation collapse.

11.4 Rule

Consensus/finality profile SHOULD dominate launch health classification during first epochs.
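
The health guidance in 11.3 could be turned into a classifier along the following lines. The numeric bands are placeholders invented for this sketch, not values from the specification; real bands would come from the launch scope. The names `FinalityBands` and `classify_finality` are likewise hypothetical, and the code cannot tell an explained participation drop from an unexplained one, so that judgment stays with the operator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FinalityBands:
    # Placeholder numbers for illustration only.
    max_latency_s: float = 30.0
    min_participation: float = 0.80
    incident_no_finality_s: float = 300.0
    incident_participation: float = 0.50


def classify_finality(latency_s, participation, no_finality_s,
                      conflicting_notarization, bands=FinalityBands()):
    """Map finality metrics to Healthy / Degraded / Incident (section 11.3)."""
    # Incident: sustained no-finality, contradictory notarization,
    # or participation collapse.
    if (conflicting_notarization
            or no_finality_s >= bands.incident_no_finality_s
            or participation < bands.incident_participation):
        return "INCIDENT"
    # Degraded: slower finality or reduced participation, still recoverable.
    if latency_s > bands.max_latency_s or participation < bands.min_participation:
        return "DEGRADED"
    # Healthy: cadence and participation inside the expected launch bands.
    return "HEALTHY"
```

For example, `classify_finality(45.0, 0.90, 60.0, False)` falls outside the latency band but stays below every incident condition, so it is classified as degraded under these placeholder bands.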


12. BVM execution profile

12.1 Purpose

Observes the health of machine execution and of the bounded runtime.

12.2 Required metrics if BVM active at launch

  • machine call success rate
  • trap/revert rate by class
  • exec unit exhaustion rate
  • permission surface violations
  • effect bound exceeded count
  • state write failures
  • unexpected verifier/runtime mismatch signals

12.3 Escalation guidance

  • normal revert patterns from user logic => often SIG_INFO
  • repeated trap clusters from same module family => SIG_CAUTION or SIG_DEGRADED
  • cross-node execution mismatch => SIG_INCIDENT_OPEN
  • boundedness bypass indication => SIG_EMERGENCY_ESCALATE

12.4 Rule

A BVM mismatch that appears consensus-relevant MUST be escalated immediately.


13. Witness / proof profile

13.1 Purpose

Observes the health of the statement, proof, revocation, and contradiction subsystems.

13.2 Required metrics if witness/proof active

  • witness validation failure rate
  • proof verification failure rate
  • stale/expired witness usage attempts
  • revocation mismatches
  • contradiction detections
  • unauthorized witness emission signals

13.3 Escalation guidance

  • noisy invalid witness spam => SIG_CAUTION to SIG_DEGRADED
  • repeated valid-looking unauthorized witness emission => SIG_INCIDENT_OPEN
  • contradiction in critical operational witness family => potentially SIG_EMERGENCY_ESCALATE

13.4 Rule

Witness/proof anomalies tied to settlement, halt, treasury or governance scopes SHOULD have stricter thresholds.


14. Governance activation profile

14.1 Purpose

Observes that active governance does not deviate from expectations.

14.2 Required signals

  • unexpected activation event count
  • timelock boundary mismatch signals
  • challenge window mismatch signals
  • unauthorized emergency action appearance
  • governance state derivation mismatch signals

14.3 Rule

Unexpected governance activation in early mainnet SHOULD be treated as at least SIG_INCIDENT_OPEN.


15. Operator node health profile

15.1 Purpose

Observes the local health of the node, without confusing it with network truth.

15.2 Required metrics

  • process liveness
  • restart count
  • signer health
  • disk pressure
  • memory pressure
  • CPU saturation
  • network connectivity health
  • queue backlogs
  • snapshot success/failure
  • log/metric sink health

15.3 Escalation guidance

  • transient resource spikes => SIG_CAUTION
  • repeated restarts or signer failures in validator role => SIG_DEGRADED
  • inability to validate or safe-sign => SIG_INCIDENT_OPEN

15.4 Rule

Local node degradation SHOULD often trigger local safe mode before network-wide escalation.


16. Role participation profile

16.1 Purpose

Observes whether the expected actors actually participate according to plan.

16.2 Required metrics

  • validator online count estimate
  • proposer participation rate
  • verifier participation rate
  • notary participation rate
  • expected operator readiness vs actual live behavior
  • unexpected inactive critical role count

16.3 Rule

Drops in notary or verifier participation during early mainnet SHOULD escalate quickly.


17. Monitoring pipeline health profile

17.1 Purpose

Observes whether the observability system itself is working.

17.2 Required metrics

  • metric ingest lag
  • alert delivery success
  • log sink errors
  • dashboard query freshness
  • tracing or event bus health if used
  • monitoring blind spot count

17.3 Rule

Telemetry blindness in early mainnet SHOULD be treated as degradation or incident depending on severity.


18. Monitoring profile object

18.1 Canonical structure

EarlyMainnetMonitoringProfile {
  profile_id
  profile_class
  target_network_class
  target_chain_id
  target_genesis_hash
  metric_rule_root
  alert_rule_root
  escalation_rule_root
  restricted_posture_binding_hash?
  version
}

18.2 Rule

Profiles SHOULD be versioned and immutable per launch scope.
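
Rule 18.2 asks for versioned, immutable profiles. In Python, one way to approximate this is a frozen dataclass, as sketched below. The field names follow the canonical structure in 18.1; the field types and all example values are assumptions made for this illustration.

```python
from dataclasses import dataclass, replace, FrozenInstanceError
from typing import Optional


@dataclass(frozen=True)
class EarlyMainnetMonitoringProfile:
    profile_id: str
    profile_class: str
    target_network_class: str
    target_chain_id: str
    target_genesis_hash: str
    metric_rule_root: str
    alert_rule_root: str
    escalation_rule_root: str
    version: int
    # Optional field, marked with "?" in the canonical structure.
    restricted_posture_binding_hash: Optional[str] = None


# Placeholder values, not real identifiers.
profile = EarlyMainnetMonitoringProfile(
    profile_id="profile-example",
    profile_class="MP_BOOTSTRAP",
    target_network_class="mainnet",
    target_chain_id="atlas-zero-example",
    target_genesis_hash="bb22",
    metric_rule_root="m-root",
    alert_rule_root="a-root",
    escalation_rule_root="e-root",
    version=1,
)

try:
    profile.version = 2  # in-place edits are rejected
except FrozenInstanceError:
    pass

# A change is expressed as a new object with a new version.
profile_v2 = replace(profile, version=2, metric_rule_root="m-root-2")
```

The original object stays unchanged after `replace`, which matches the idea that a published profile version is never edited in place.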


19. Metric rule object

19.1 Canonical structure

MetricRule {
  metric_rule_id
  metric_class
  metric_name_hash
  observation_window_class
  warning_threshold_hash
  degraded_threshold_hash
  incident_threshold_hash
  emergency_threshold_hash?
  aggregation_mode
}

19.2 aggregation_mode examples

  • instant
  • rolling_mean
  • rolling_max
  • percentile
  • count_over_window
  • ratio_over_window

19.3 Rule

Threshold semantics MUST be defined clearly enough to avoid operator reinterpretation.
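
To make threshold semantics concrete, the sketch below evaluates the aggregation modes from 19.2 over a window of samples and compares the result against resolved thresholds. Several choices are assumptions rather than specification: thresholds are plain numbers (the canonical structure stores hashes), higher values are worse, comparisons are inclusive, the warning threshold maps to SIG_CAUTION, the percentile uses the nearest-rank method, and the count and ratio modes treat each non-zero sample as one event.

```python
import math


def aggregate(mode, samples, percentile=95.0):
    """Reduce a window of numeric samples according to aggregation_mode."""
    if not samples:
        raise ValueError("empty observation window")
    if mode == "instant":
        return samples[-1]  # most recent sample
    if mode == "rolling_mean":
        return sum(samples) / len(samples)
    if mode == "rolling_max":
        return max(samples)
    if mode == "percentile":
        ordered = sorted(samples)
        rank = max(1, math.ceil(percentile / 100 * len(ordered)))  # nearest rank
        return ordered[rank - 1]
    if mode == "count_over_window":
        return sum(1 for s in samples if s)  # non-zero samples count as events
    if mode == "ratio_over_window":
        return sum(1 for s in samples if s) / len(samples)
    raise ValueError(f"unknown aggregation_mode: {mode}")


def classify(value, warning, degraded, incident, emergency=None):
    """Map an aggregated value to a signal class; higher is assumed worse."""
    if emergency is not None and value >= emergency:
        return "SIG_EMERGENCY_ESCALATE"
    if value >= incident:
        return "SIG_INCIDENT_OPEN"
    if value >= degraded:
        return "SIG_DEGRADED"
    if value >= warning:
        return "SIG_CAUTION"
    return "SIG_INFO"
```

Writing down details such as inclusive versus exclusive comparison is exactly the kind of decision that rule 19.3 wants fixed in advance rather than left to the operator on duty.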


20. Alert rule object

20.1 Canonical structure

AlertRule {
  alert_rule_id
  signal_class
  source_metric_rule_refs
  dedup_window_hash
  routing_class
  required_ack_role_classes
}

20.2 routing_class examples

  • local_operator
  • validator_cluster
  • launch_coordination
  • incident_commander
  • security_triage
  • emergency_escalation

20.3 Rule

Alerts SHOULD route differently depending on severity and scope.


21. Escalation rule object

21.1 Canonical structure

EscalationRule {
  escalation_rule_id
  trigger_signal_class
  target_action_class
  required_roles_notified
  decision_ledger_entry_required
  incident_open_required
}

21.2 target_action_class examples

  • observe_only
  • operator_investigate
  • local_safe_mode
  • disable_signing
  • open_incident
  • hold_launch_flow
  • escalate_emergency

21.3 Rule

Severe signals SHOULD map deterministically to action classes.
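
A deterministic mapping, in the sense of rule 21.3, can be enforced by allowing at most one escalation rule per trigger signal class. The sketch below shows this under assumptions: the two example rules, their identifiers, and the notified roles (borrowed from the routing class examples in 20.2) are illustrative, and `build_index` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationRule:
    # Field names follow the canonical structure in 21.1; types are assumed.
    escalation_rule_id: str
    trigger_signal_class: str
    target_action_class: str
    required_roles_notified: tuple
    decision_ledger_entry_required: bool
    incident_open_required: bool


# Illustrative rules only; the real table belongs to the launch scope.
RULES = (
    EscalationRule("esc-incident", "SIG_INCIDENT_OPEN", "open_incident",
                   ("incident_commander", "security_triage"), True, True),
    EscalationRule("esc-emergency", "SIG_EMERGENCY_ESCALATE", "escalate_emergency",
                   ("emergency_escalation", "incident_commander"), True, True),
)


def build_index(rules):
    """Index rules by trigger signal class, rejecting ambiguous mappings."""
    index = {}
    for rule in rules:
        if rule.trigger_signal_class in index:
            raise ValueError(
                f"ambiguous escalation for {rule.trigger_signal_class}")
        index[rule.trigger_signal_class] = rule
    return index


index = build_index(RULES)
action = index["SIG_INCIDENT_OPEN"].target_action_class
```

Rejecting duplicates at load time means a given severe signal can never resolve to two different actions at run time.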


22. Observation windows

22.1 Recommended windows

  • OW_BOOTSTRAP_SECONDS
  • OW_FIRST_BLOCKS_SHORT
  • OW_FIRST_EPOCH
  • OW_FIRST_3_EPOCHS
  • OW_FIRST_10_EPOCHS
  • OW_RESTRICTED_POSTURE_ROLLING

22.2 Rule

Thresholds SHOULD be tuned to observation windows, not reused blindly.


23. Bootstrap profile specifics

23.1 MP_BOOTSTRAP SHOULD emphasize

  • artifact integrity
  • peer compatibility
  • local process/signer health
  • validation-only correctness
  • first connectivity and chain identity checks

23.2 Alert posture

Thresholds SHOULD be highly sensitive. This is a phase where small anomalies can matter a lot.

23.3 Rule

A bootstrap profile SHOULD prefer false positives over false negatives for critical launch signals.


24. First blocks profile specifics

24.1 MP_FIRST_BLOCKS SHOULD emphasize

  • proposal cadence
  • invalid object spikes
  • verifier readiness
  • early BVM and witness anomalies
  • artifact mismatches discovered only under live load

24.2 Rule

Repeated anomalies across multiple nodes in first blocks SHOULD quickly escalate above local issue classification.


25. First epochs profile specifics

25.1 MP_FIRST_EPOCHS SHOULD emphasize

  • finalized roots
  • finality cadence
  • validator participation trends
  • repeated no-finality windows
  • deterministic replay anomalies
  • governance activation surprises

25.2 Rule

This profile SHOULD determine whether restricted posture can remain stable or needs tightening.


26. Restricted posture profile specifics

26.1 MP_RESTRICTED_POSTURE SHOULD emphasize

  • sustained healthy finality
  • absence of repeated critical anomalies
  • restart/rejoin anomaly counts
  • operator cluster health
  • alert fatigue avoidance while preserving high sensitivity to real regressions

26.2 Rule

Restricted posture SHOULD have lower incident thresholds than steady state, but SHOULD be slightly less noisy than the raw bootstrap profile.


27. Healthy baseline model

27.1 Need

Without an expected baseline, alerts become arbitrary.

27.2 Each profile SHOULD define:

  • expected block cadence range
  • acceptable validation error background
  • acceptable participation floor
  • acceptable restart background rate
  • acceptable telemetry lag
  • acceptable first-epoch convergence pattern

27.3 Rule

Healthy baseline MUST be tied to current launch scope and early-mainnet phase, not generic chain folklore.


28. Correlation rules

28.1 Need

Single metrics may be misleading. Correlated anomalies are stronger signals.

28.2 Examples

  • no-finality + notary participation drop + signer errors => likely operator/infrastructure cluster issue
  • no-finality + contradictory notarization signal => protocol/security critical
  • BVM trap spike + one module family concentration => module-specific issue
  • invalid object spike + chain_id mismatch peers => scope contamination

28.3 Rule

Monitoring SHOULD support correlated signal interpretation, not only isolated alerts.
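
The examples in 28.2 can be expressed as a small rule table in which an interpretation fires only when all of its constituent anomalies are active together. This is a sketch: the anomaly tag names and the `interpret` function are hypothetical, and the interpretation strings are taken from the examples above.

```python
# Each rule: (set of anomaly tags that must all be active, interpretation).
CORRELATION_RULES = [
    ({"no_finality", "contradictory_notarization"},
     "protocol/security critical"),
    ({"no_finality", "notary_participation_drop", "signer_errors"},
     "likely operator/infrastructure cluster issue"),
    ({"bvm_trap_spike", "module_family_concentration"},
     "module-specific issue"),
    ({"invalid_object_spike", "chain_id_mismatch_peers"},
     "scope contamination"),
]


def interpret(active_anomalies):
    """Return every interpretation whose required anomalies are all active."""
    active = set(active_anomalies)
    return [label for required, label in CORRELATION_RULES if required <= active]
```

A single anomaly such as `no_finality` on its own matches no rule, which reflects the point in 28.1 that isolated metrics may be misleading.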


29. Decision ledger linkage

29.1 Monitoring SHOULD feed LDL with evidence for decisions like:

  • HOLD
  • PROCEED
  • ABORT
  • RESTRICTED_POSTURE_ENTER
  • RESTRICTED_POSTURE_EXIT
  • REJOIN_APPROVED
  • SCOPE_QUARANTINE

29.2 Rule

Critical monitoring events SHOULD produce evidence refs consumable by AZ-030 Launch Decision Ledger.


30. Incident linkage

30.1 When signals reach SIG_INCIDENT_OPEN, monitoring SHOULD:

  • open or recommend incident
  • preserve relevant metric snapshots
  • preserve logs and state roots if relevant
  • bind affected observation windows
  • link anomaly classes to runbooks

30.2 Rule

Monitoring without an incident handoff path is incomplete for early mainnet.


31. Alert fatigue protections

31.1 Need

Too many alerts can blind operators.

31.2 Controls SHOULD include

  • dedup windows
  • correlation grouping
  • phase-specific thresholds
  • routing by severity
  • explicit suppression only with traceable justification

31.3 Rule

Suppression of critical launch signals SHOULD be extremely conservative and auditable.
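
A dedup window that stays conservative about critical signals might look like the sketch below. The policy is an assumption, not something the specification prescribes: SIG_INCIDENT_OPEN and SIG_EMERGENCY_ESCALATE are never deduplicated, and every suppressed duplicate is kept in an audit list. The class name `AlertDeduplicator` and its fields are hypothetical.

```python
class AlertDeduplicator:
    """Suppress repeats of the same alert rule inside a dedup window."""

    CRITICAL = ("SIG_INCIDENT_OPEN", "SIG_EMERGENCY_ESCALATE")

    def __init__(self, window_s: float):
        self.window_s = window_s
        self._last_sent = {}   # alert_rule_id -> time of last delivered alert
        self.suppressed = []   # audit trail of suppressed duplicates

    def should_send(self, alert_rule_id: str, signal_class: str, now_s: float) -> bool:
        # Assumed policy: critical classes always go through.
        if signal_class in self.CRITICAL:
            return True
        last = self._last_sent.get(alert_rule_id)
        if last is not None and now_s - last < self.window_s:
            self.suppressed.append((alert_rule_id, signal_class, now_s))
            return False
        self._last_sent[alert_rule_id] = now_s
        return True


dedup = AlertDeduplicator(window_s=60)
first = dedup.should_send("disk-pressure", "SIG_CAUTION", now_s=0)    # delivered
repeat = dedup.should_send("disk-pressure", "SIG_CAUTION", now_s=30)  # suppressed, audited
```

The window is measured from the last delivered alert, so a steady stream of duplicates cannot extend the suppression indefinitely.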


32. Monitoring records

32.1 Recommended objects

  • MonitoringSnapshotRecord
  • MonitoringAnomalyRecord
  • MonitoringHealthAssessment
  • MonitoringEscalationRecord

32.2 MonitoringSnapshotRecord

MonitoringSnapshotRecord {
  snapshot_id
  launch_window_id?
  profile_id
  observation_window_class
  metric_root
  aggregated_health_class
  timestamp_unix_ms
}

32.3 MonitoringAnomalyRecord

MonitoringAnomalyRecord {
  anomaly_id
  profile_id
  anomaly_class
  signal_class
  evidence_root?
  timestamp_unix_ms
}

32.4 Rule

Critical anomalies SHOULD be recordable as canonical objects, not only dashboard events.


33. Restricted posture exit criteria linkage

33.1 A profile SHOULD support answering:

  • has finality been stable enough?
  • have critical anomalies stayed absent long enough?
  • have operator restarts/rejoins stabilized?
  • is telemetry healthy enough?
  • are alert classes back within acceptable baseline?

33.2 Rule

Restricted posture exit SHOULD depend partly on monitoring evidence, not only on subjective confidence.
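
The five questions in 33.1 can be captured as an evidence checklist that distinguishes a negative answer from a missing one. The sketch below is illustrative: the key names and the `exit_evidence` function are hypothetical, and the result is meant as one input to the exit decision, since 33.2 says the decision depends only partly on monitoring evidence.

```python
# One key per question in section 33.1 (names are hypothetical).
EXIT_QUESTIONS = (
    "finality_stable",
    "critical_anomalies_absent",
    "restarts_rejoins_stabilized",
    "telemetry_healthy",
    "alert_classes_within_baseline",
)


def exit_evidence(answers: dict) -> dict:
    """Summarize whether monitoring evidence supports restricted posture exit."""
    missing = [q for q in EXIT_QUESTIONS if q not in answers]
    failing = [q for q in EXIT_QUESTIONS if answers.get(q) is False]
    return {
        "supports_exit": not missing and not failing,
        "missing": missing,   # no evidence collected yet
        "failing": failing,   # evidence collected and negative
    }


summary = exit_evidence({
    "finality_stable": True,
    "critical_anomalies_absent": True,
    "restarts_rejoins_stabilized": False,
    "telemetry_healthy": True,
})
```

In this usage example one answer is negative and one is absent, so the summary does not support exit and names both gaps separately.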


34. Public vs internal visibility

34.1 Some monitoring views MAY be internal only:

  • raw node health
  • sensitive security indicators
  • operator-specific degradation

34.2 Some summaries SHOULD be shareable:

  • general health class
  • known active major incidents
  • early-mainnet posture status
  • advisories relevant to participants

34.3 Rule

Visibility policy MUST NOT deprive launch-critical operators of necessary truth.


35. Profile evolution

35.1 Monitoring profiles MAY evolve across launches or network maturity phases.

35.2 Rule

For a given launch scope, active monitoring profiles SHOULD be frozen before use, versioned and archived.

35.3 Rule

Threshold changes during restricted posture SHOULD require explicit review and evidence.


36. Anti-patterns

Systems SHOULD avoid:

  1. reusing steady-state dashboards as-is for launch
  2. no explicit thresholds for early epochs
  3. alerting only on infrastructure and not on protocol semantics
  4. only local node monitoring with no network-view metrics
  5. only network metrics with no local signer/process health
  6. no correlation logic for major anomalies
  7. suppressing repeated critical signals because they are noisy
  8. ending restricted posture with no monitoring evidence
  9. monitoring that cannot feed incident or decision processes
  10. undocumented threshold changes during launch scope

37. Formal goals

AZ-031 pursues the following goals:

37.1 Early anomaly visibility

The system detects launch-critical anomalies quickly enough to matter.

37.2 Actionable classification

Signals map to health classes and escalation actions clearly.

37.3 Evidence-producing observability

Monitoring produces evidence usable by operators, by incident response, and by the decision ledger.

37.4 Safe transition to normal operations

Monitoring supports explicit exit from restricted posture rather than a guess-based transition.


38. Document formula

Early Mainnet Monitoring = profile-bound metrics + typed signals + predeclared thresholds + escalation mappings + evidence records + restricted-posture exit support


39. Relationship to the rest of the suite

  • AZ-028 defines the launch window procedure.
  • AZ-030 defines the decision ledger.
  • AZ-031 defines the concrete signals on which those decisions can be made responsibly during the first live stages.

In short: AZ-028 says when and where to look; AZ-031 says exactly what to look at and what the things you see mean.


40. What comes next

After AZ-031, the next document is:

AZ-032 — Post-Launch Stabilization Review Protocol

That document must fix:

  • how we evaluate the first days/epochs after launch;
  • which formal review we perform;
  • how we classify stabilization as sufficient or insufficient;
  • what gets archived;
  • and which changes, fixes, or restrictions remain active or are lifted after this evaluation.

Closing

In early mainnet, the problem is not merely having charts and alerts. The real problem is knowing which metric matters, which threshold changes the risk class, and when an anomaly is just noise versus the first sign of a systemic deviation.

That is where launch monitoring with real value begins.