AZ-031 — Early Mainnet Monitoring Profiles v1

Status

This document defines the monitoring profiles for early mainnet in ATLAS ZERO.

AZ-001 through AZ-030 already provide:

the specification of the protocol and its subsystems;
readiness, launch ceremony and launch window;
the release/genesis packages;
the operator manuals and checklists;
the formal ledger of launch decisions.

AZ-031 answers the question: how do we monitor the network's first epochs and first critical intervals so that we can quickly distinguish healthy behavior, controlled degradation and a real incident, without mistaking normal bootstrap noise for signals of systemic risk?

The purpose of this document is to fix:

the monitoring profiles for early mainnet;
the minimum mandatory metrics;
the signal and alert classes;
the thresholds for healthy, degraded, incident-open and emergency-escalation;
the link to restricted posture, incident response and the launch decision ledger.

This document builds on:

AZ-002 through AZ-030, with direct emphasis on AZ-015, AZ-017, AZ-025, AZ-028, AZ-029 and AZ-030.

Terms:

MUST = mandatory
MUST NOT = forbidden
SHOULD = strongly recommended
MAY = optional

1. Objective

AZ-031 answers 10 critical questions:

What does early mainnet monitoring mean?
Which monitoring profiles must exist?
Which metrics are mandatory in the first epochs?
Which thresholds define healthy, degraded and incident?
Which alerts are informational and which are blocking?
How do we correlate local operator signals with network signals?
How do we decide whether restricted posture can continue or must be tightened?
How do we feed the Launch Decision Ledger and the incident runbooks with real signals?
How do we avoid both excessive panic and ignoring systemic signals?
When do we have enough stability to exit the early mainnet monitoring profile?

2. Principles

2.1 Monitoring is part of launch control, not decorative observability

In early mainnet, monitoring MUST be treated as an operational control mechanism.

2.2 Signals must be typed

Not every anomaly carries the same meaning. Monitoring SHOULD separate:

informational,
caution,
degraded,
incident-open,
emergency-escalation.

2.3 Network truth and local truth must both be observed

An operator must observe:

the local health of the node;
the perceived health of the network;
and the relationship between the two.

2.4 Thresholds must be predeclared

The main thresholds SHOULD be defined before launch. They must not be invented after an anomaly appears.

2.5 Monitoring must feed action

A monitoring profile is incomplete if it does not state:

what is alerted on,
to whom,
and which action class it suggests.

2.6 Early mainnet is stricter than steady state

The first epochs and the first live intervals MUST have more conservative thresholds and faster escalation than steady state.

3. Monitoring scope

3.1 Early mainnet monitoring covers at minimum:

consensus/finality health
validation correctness signals
BVM execution health
witness/proof health
governance/activation anomalies
operator node health
artifact/release/genesis mismatch signals
role participation health
incident/recovery control paths

3.2 Rule

An early mainnet profile without finality, role health and artifact scope checks is insufficient.

4. Monitoring profile classes

4.1 Standard profiles

ATLAS ZERO SHOULD define at least:

MP_BOOTSTRAP
MP_FIRST_BLOCKS
MP_FIRST_EPOCHS
MP_RESTRICTED_POSTURE
MP_POST_RESTRICTED_STABILIZATION

4.2 Meaning

MP_BOOTSTRAP

Intensive monitoring in the interval immediately around node startup and the first peer checks.

MP_FIRST_BLOCKS

Focus on the first proposals, the first validations and startup anomalies.

MP_FIRST_EPOCHS

Focus on the first finalizations and the first signals of systemic behavior.

MP_RESTRICTED_POSTURE

Profile active while the network is live but under a strict regime.

MP_POST_RESTRICTED_STABILIZATION

Transition profile before the return to steady state.

4.3 Rule

Transitions between profiles SHOULD be explicit and logged.
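
One way to make profile transitions explicit and logged is to route every change through a single function that appends a record. The sketch below is illustrative only: the record fields (`from`, `to`, `reason`, `timestamp_unix_ms`) and the function name `transition` are assumptions, not part of AZ-031.

```python
# Profile classes as listed in section 4.1.
PROFILE_CLASSES = (
    "MP_BOOTSTRAP",
    "MP_FIRST_BLOCKS",
    "MP_FIRST_EPOCHS",
    "MP_RESTRICTED_POSTURE",
    "MP_POST_RESTRICTED_STABILIZATION",
)


def transition(log, current, target, reason, now_ms):
    """Record an explicit profile transition and return the new profile.

    Hypothetical helper: rejects unknown profiles, no-op transitions and
    transitions without a stated reason, then appends a record to `log`.
    """
    if current not in PROFILE_CLASSES or target not in PROFILE_CLASSES:
        raise ValueError("unknown profile class")
    if current == target:
        raise ValueError("a transition must change the active profile")
    if not reason:
        raise ValueError("a transition needs a stated reason")
    log.append({
        "from": current,
        "to": target,
        "reason": reason,
        "timestamp_unix_ms": now_ms,
    })
    return target
```

Because the log is the only side effect, the active profile at any time can be reconstructed from the records alone.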

5. Signal classes

5.1 Standard signal classes

SIG_INFO
SIG_CAUTION
SIG_DEGRADED
SIG_INCIDENT_OPEN
SIG_EMERGENCY_ESCALATE

5.2 Meaning

SIG_INFO

Useful information with no immediate operational impact.

SIG_CAUTION

A minor anomaly or a trend to watch.

SIG_DEGRADED

Behavior below expectations, but still controllable without a mandatory formal incident.

SIG_INCIDENT_OPEN

A signal strong enough to open an incident or to force local safe mode/hold decisions.

SIG_EMERGENCY_ESCALATE

A signal of exceptional severity, compatible with urgent escalation and possibly the emergency action workflow.

5.3 Rule

Signals MUST be mapped to actions and to notification roles.

6. Health classes

6.1 Standard health classes

HEALTHY
WATCH
DEGRADED
UNSTABLE
INCIDENT
EMERGENCY

6.2 Rule

Each profile SHOULD be able to derive an aggregated health class from metrics and signals.
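
A minimal sketch of such a derivation follows. AZ-031 does not fix the mapping between signal classes and health classes, so both the one-to-one base mapping and the rule that several concurrent `SIG_DEGRADED` signals yield `UNSTABLE` (with a hypothetical `unstable_after` parameter) are assumptions made for illustration.

```python
from enum import IntEnum


class Signal(IntEnum):
    SIG_INFO = 0
    SIG_CAUTION = 1
    SIG_DEGRADED = 2
    SIG_INCIDENT_OPEN = 3
    SIG_EMERGENCY_ESCALATE = 4


class Health(IntEnum):
    HEALTHY = 0
    WATCH = 1
    DEGRADED = 2
    UNSTABLE = 3
    INCIDENT = 4
    EMERGENCY = 5


# Assumed base mapping from the worst active signal to a health class.
_BASE = {
    Signal.SIG_INFO: Health.HEALTHY,
    Signal.SIG_CAUTION: Health.WATCH,
    Signal.SIG_DEGRADED: Health.DEGRADED,
    Signal.SIG_INCIDENT_OPEN: Health.INCIDENT,
    Signal.SIG_EMERGENCY_ESCALATE: Health.EMERGENCY,
}


def aggregate_health(signals, unstable_after=3):
    """Derive one health class from the signals active in a window."""
    signals = list(signals)
    if not signals:
        return Health.HEALTHY
    health = _BASE[max(signals)]  # the worst signal dominates
    # Assumption: many simultaneous degraded signals count as UNSTABLE.
    if health == Health.DEGRADED and signals.count(Signal.SIG_DEGRADED) >= unstable_after:
        return Health.UNSTABLE
    return health
```

Taking the worst signal as the starting point keeps the aggregate from masking a single severe signal behind many benign ones.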

7. Monitoring dimensions

7.1 Core dimensions

artifact integrity
chain identity consistency
peer compatibility
validation correctness
consensus/finality
BVM execution
witness/proof correctness
governance activation correctness
node resource health
role participation
logging/telemetry health
operator action path health

7.2 Rule

Every dimension that can block restricted posture exit SHOULD have explicit metrics or signals.

8. Artifact integrity profile

8.1 Purpose

Detects whether a node is running with artifacts that are wrong or not aligned with the launch scope.

8.2 Required checks

binary hash match to authorized release
release_package_id match
genesis_package_id match
chain_id match
genesis_hash match
no active revocation on critical artifacts known locally

8.3 Signal mapping

mismatch on binary hash => SIG_INCIDENT_OPEN
mismatch on genesis_hash or chain_id => SIG_EMERGENCY_ESCALATE
unknown artifact provenance in launch-critical role => SIG_INCIDENT_OPEN

8.4 Rule

Artifact mismatch in early mainnet SHOULD be treated as severe until disproven.
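
The signal mapping in 8.3 can be sketched as a small classifier. The `ArtifactScope` type, the function name and the string-typed fields are hypothetical. The source does not map a `release_package_id` or `genesis_package_id` mismatch to a signal, so treating it as `SIG_INCIDENT_OPEN` is an assumption in the spirit of 8.4. The revocation check from 8.2 is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactScope:
    binary_hash: str
    release_package_id: str
    genesis_package_id: str
    chain_id: str
    genesis_hash: str


def classify_artifacts(expected, observed, provenance_known=True, launch_critical=False):
    """Return the signal class for an artifact check, or None if all match."""
    # 8.3: chain identity mismatches are the most severe case.
    if observed.chain_id != expected.chain_id or observed.genesis_hash != expected.genesis_hash:
        return "SIG_EMERGENCY_ESCALATE"
    # 8.3: binary hash mismatch.
    if observed.binary_hash != expected.binary_hash:
        return "SIG_INCIDENT_OPEN"
    # 8.3: unknown provenance matters in launch-critical roles.
    if launch_critical and not provenance_known:
        return "SIG_INCIDENT_OPEN"
    # Assumption: package id mismatches are severe until disproven (8.4).
    if (observed.release_package_id != expected.release_package_id
            or observed.genesis_package_id != expected.genesis_package_id):
        return "SIG_INCIDENT_OPEN"
    # Unknown provenance outside launch-critical roles is not mapped by 8.3.
    return None
```

The checks are ordered from most to least severe so that a node with several mismatches reports the strongest applicable signal.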

9. Peer compatibility profile

9.1 Purpose

Observes whether surrounding peers belong to the same network reality.

9.2 Required signals

peer chain_id mismatch count
peer genesis_hash mismatch count
unsupported protocol version count
peer handshake failure rate
peer diversity health

9.3 Threshold guidance

isolated mismatch peers => SIG_CAUTION
repeated mismatch majority or unexpected cluster => SIG_DEGRADED or worse
widespread mismatch in launch-critical peers => SIG_INCIDENT_OPEN

9.4 Rule

Peer incompatibility MUST NOT be ignored as mere noise during launch.
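
As an illustration of the threshold guidance in 9.3, the sketch below classifies mismatch counts. The numeric cut-offs (a strict majority of all peers, and a hypothetical `widespread_ratio` of launch-critical peers) are assumptions. AZ-031 leaves the actual values to the predeclared thresholds of each profile.

```python
def classify_peer_mismatch(total, mismatched, critical_total=0,
                           critical_mismatched=0, widespread_ratio=0.5):
    """Map peer mismatch counts to a signal class, or None if there are none.

    Counts may cover chain_id or genesis_hash mismatches; how they are
    collected is outside this sketch.
    """
    # Widespread mismatch among launch-critical peers.
    if critical_total and critical_mismatched / critical_total >= widespread_ratio:
        return "SIG_INCIDENT_OPEN"
    # A mismatching majority among all observed peers ("or worse" is left
    # to correlation with other signals).
    if total and mismatched / total > 0.5:
        return "SIG_DEGRADED"
    # Isolated mismatching peers.
    if mismatched > 0 or critical_mismatched > 0:
        return "SIG_CAUTION"
    return None
```

Even the lowest branch returns a signal, which reflects the rule that peer incompatibility is never dropped as mere noise during launch.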

10. Validation correctness profile

10.1 Purpose

Detects signals that the node or the network is processing objects incorrectly.

10.2 Required metrics

invalid object rate by class
tx reject rate by category
unexpected parser/canonicalization failures
receipt mismatch signals if available
replay mismatch signals if available

10.3 Escalation guidance

mild expected invalids from public traffic => SIG_INFO or SIG_CAUTION
sudden spike in well-formed but unexpectedly rejected objects => SIG_DEGRADED
deterministic replay mismatch => SIG_INCIDENT_OPEN or SIG_EMERGENCY_ESCALATE

10.4 Rule

Any sign of deterministic validation divergence in early mainnet is extremely serious.

11. Consensus and finality profile

11.1 Purpose

This is the central profile of early mainnet.

11.2 Required metrics

block proposal cadence
block acceptance rate
verifier vote participation
notary participation
finality latency
finalized epoch cadence
no-finality interval length
conflicting notarization signals
committee derivation mismatch signals

11.3 Health guidance

Healthy

Cadence and participation inside expected launch bands.

Degraded

Transient slower finality or reduced participation, but network still understandable and recoverable.

Incident

Sustained no-finality, contradictory notarization, or unexplained participation collapse.

11.4 Rule

Consensus/finality profile SHOULD dominate launch health classification during first epochs.

12. BVM execution profile

12.1 Purpose

Observes the health of machine execution and of the bounded runtime.

12.2 Required metrics if BVM active at launch

machine call success rate
trap/revert rate by class
exec unit exhaustion rate
permission surface violations
effect bound exceeded count
state write failures
unexpected verifier/runtime mismatch signals

12.3 Escalation guidance

normal revert patterns from user logic => often SIG_INFO
repeated trap clusters from same module family => SIG_CAUTION or SIG_DEGRADED
cross-node execution mismatch => SIG_INCIDENT_OPEN
boundedness bypass indication => SIG_EMERGENCY_ESCALATE

12.4 Rule

A BVM mismatch that appears consensus-relevant MUST be escalated immediately.

13. Witness / proof profile

13.1 Purpose

Observes the health of the statement, proof, revocation and contradiction subsystems.

13.2 Required metrics if witness/proof active

witness validation failure rate
proof verification failure rate
stale/expired witness usage attempts
revocation mismatches
contradiction detections
unauthorized witness emission signals

13.3 Escalation guidance

noisy invalid witness spam => SIG_CAUTION to SIG_DEGRADED
repeated valid-looking unauthorized witness emission => SIG_INCIDENT_OPEN
contradiction in critical operational witness family => potentially SIG_EMERGENCY_ESCALATE

13.4 Rule

Witness/proof anomalies tied to settlement, halt, treasury or governance scopes SHOULD have stricter thresholds.

14. Governance activation profile

14.1 Purpose

Observes that active governance does not deviate from expectations.

14.2 Required signals

unexpected activation event count
timelock boundary mismatch signals
challenge window mismatch signals
unauthorized emergency action appearance
governance state derivation mismatch signals

14.3 Rule

Unexpected governance activation in early mainnet SHOULD be treated as at least SIG_INCIDENT_OPEN.

15. Operator node health profile

15.1 Purpose

Observes the local health of the node, without confusing it with network truth.

15.2 Required metrics

process liveness
restart count
signer health
disk pressure
memory pressure
CPU saturation
network connectivity health
queue backlogs
snapshot success/failure
log/metric sink health

15.3 Escalation guidance

transient resource spikes => SIG_CAUTION
repeated restarts or signer failures in validator role => SIG_DEGRADED
inability to validate or safe-sign => SIG_INCIDENT_OPEN

15.4 Rule

Local node degradation SHOULD often trigger local safe mode before network-wide escalation.

16. Role participation profile

16.1 Purpose

Observes whether the expected actors actually participate according to plan.

16.2 Required metrics

validator online count estimate
proposer participation rate
verifier participation rate
notary participation rate
expected operator readiness vs actual live behavior
unexpected inactive critical role count

16.3 Rule

Drops in notary or verifier participation during early mainnet SHOULD escalate quickly.

17. Monitoring pipeline health profile

17.1 Purpose

Observes whether the observability system itself is working.

17.2 Required metrics

metric ingest lag
alert delivery success
log sink errors
dashboard query freshness
tracing or event bus health if used
monitoring blind spot count

17.3 Rule

Telemetry blindness in early mainnet SHOULD be treated as degradation or incident depending on severity.

18. Monitoring profile object

18.1 Canonical structure

EarlyMainnetMonitoringProfile {
  profile_id
  profile_class
  target_network_class
  target_chain_id
  target_genesis_hash
  metric_rule_root
  alert_rule_root
  escalation_rule_root
  restricted_posture_binding_hash?
  version
}

18.2 Rule

Profiles SHOULD be versioned and immutable per launch scope.
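
In application code, immutability per launch scope can be approximated with a frozen record type. This is a sketch only: the field types (strings and an integer version) and the `matches_scope` helper are assumptions, since AZ-031 specifies field names but not encodings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EarlyMainnetMonitoringProfile:
    profile_id: str
    profile_class: str
    target_network_class: str
    target_chain_id: str
    target_genesis_hash: str
    metric_rule_root: str
    alert_rule_root: str
    escalation_rule_root: str
    version: int
    # Optional in the canonical structure (marked with "?").
    restricted_posture_binding_hash: Optional[str] = None


def matches_scope(profile, chain_id, genesis_hash):
    """Check that a profile is bound to the chain a node actually runs."""
    return (profile.target_chain_id == chain_id
            and profile.target_genesis_hash == genesis_hash)
```

With `frozen=True`, assigning to a field after construction raises an error, so a changed threshold set has to be published as a new profile version rather than edited in place.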

19. Metric rule object

19.1 Canonical structure

MetricRule {
  metric_rule_id
  metric_class
  metric_name_hash
  observation_window_class
  warning_threshold_hash
  degraded_threshold_hash
  incident_threshold_hash
  emergency_threshold_hash?
  aggregation_mode
}

19.2 aggregation_mode examples

instant
rolling_mean
rolling_max
percentile
count_over_window
ratio_over_window

19.3 Rule

Threshold semantics MUST be defined clearly enough to avoid operator reinterpretation.
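
To show what unambiguous threshold semantics could look like, the sketch below gives one possible reading of each aggregation mode and of threshold comparison. Several choices are assumptions: the nearest-rank percentile method, the `predicate` argument for the counting modes, the "higher is worse" direction, and the mapping of the warning threshold to `SIG_CAUTION`. Thresholds are taken as already-resolved numbers, although the canonical structure references them by hash.

```python
import math


def aggregate(values, mode, percentile=95.0, predicate=None):
    """Reduce the samples of one observation window to a single number."""
    if not values:
        raise ValueError("empty observation window")
    if mode == "instant":
        return values[-1]  # most recent sample
    if mode == "rolling_mean":
        return sum(values) / len(values)
    if mode == "rolling_max":
        return max(values)
    if mode == "percentile":
        ordered = sorted(values)
        rank = max(1, math.ceil(percentile / 100 * len(ordered)))  # nearest rank
        return ordered[rank - 1]
    if mode in ("count_over_window", "ratio_over_window"):
        if predicate is None:
            raise ValueError("counting modes need a predicate")
        hits = sum(1 for v in values if predicate(v))
        return hits if mode == "count_over_window" else hits / len(values)
    raise ValueError(f"unknown aggregation_mode: {mode}")


def classify(value, warning, degraded, incident, emergency=None):
    """Compare an aggregated value with thresholds (higher means worse)."""
    if emergency is not None and value >= emergency:
        return "SIG_EMERGENCY_ESCALATE"
    if value >= incident:
        return "SIG_INCIDENT_OPEN"
    if value >= degraded:
        return "SIG_DEGRADED"
    if value >= warning:
        return "SIG_CAUTION"
    return None
```

For metrics where lower is worse, such as participation rates, the comparison direction would have to be part of the rule definition as well.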

20. Alert rule object

20.1 Canonical structure

AlertRule {
  alert_rule_id
  signal_class
  source_metric_rule_refs
  dedup_window_hash
  routing_class
  required_ack_role_classes
}

20.2 routing_class examples

local_operator
validator_cluster
launch_coordination
incident_commander
security_triage
emergency_escalation

20.3 Rule

Alerts SHOULD route differently depending on severity and scope.
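
The dedup window referenced by `dedup_window_hash` can be illustrated with a small delivery gate. This sketch assumes the window has already been resolved to a duration in milliseconds and that alerts are deduplicated per rule and per scope. The `scope` key and the function name are hypothetical.

```python
def should_deliver(last_sent_ms, alert_rule_id, scope, now_ms, dedup_window_ms):
    """Return True if the alert should be delivered, updating the state dict.

    `last_sent_ms` maps (alert_rule_id, scope) to the last delivery time.
    """
    key = (alert_rule_id, scope)
    previous = last_sent_ms.get(key)
    if previous is not None and now_ms - previous < dedup_window_ms:
        return False  # still inside the dedup window for this rule and scope
    last_sent_ms[key] = now_ms
    return True
```

Keying on scope as well as rule means the same alert firing on a second node is still delivered, which matters when repeated anomalies across nodes are themselves a signal.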

21. Escalation rule object

21.1 Canonical structure

EscalationRule {
  escalation_rule_id
  trigger_signal_class
  target_action_class
  required_roles_notified
  decision_ledger_entry_required
  incident_open_required
}

21.2 target_action_class examples

observe_only
operator_investigate
local_safe_mode
disable_signing
open_incident
hold_launch_flow
escalate_emergency

21.3 Rule

Severe signals SHOULD map deterministically to action classes.
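
A deterministic mapping can be approximated with a lookup table that refuses ambiguity. The sketch assumes at most one escalation rule per signal class, which AZ-031 does not state. The helper names and field types are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationRule:
    escalation_rule_id: str
    trigger_signal_class: str
    target_action_class: str
    required_roles_notified: tuple
    decision_ledger_entry_required: bool
    incident_open_required: bool


def build_escalation_table(rules):
    """Index rules by trigger signal class, rejecting duplicate triggers."""
    table = {}
    for rule in rules:
        if rule.trigger_signal_class in table:
            raise ValueError(f"ambiguous mapping for {rule.trigger_signal_class}")
        table[rule.trigger_signal_class] = rule
    return table


def resolve_action(table, signal_class):
    """Return the single rule for a signal class; KeyError if unmapped."""
    return table[signal_class]
```

Failing loudly on duplicate or missing mappings at load time surfaces gaps before launch instead of during an incident.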

22. Observation windows

22.1 Recommended windows

OW_BOOTSTRAP_SECONDS
OW_FIRST_BLOCKS_SHORT
OW_FIRST_EPOCH
OW_FIRST_3_EPOCHS
OW_FIRST_10_EPOCHS
OW_RESTRICTED_POSTURE_ROLLING

22.2 Rule

Thresholds SHOULD be tuned to observation windows, not reused blindly.

23. Bootstrap profile specifics

23.1 MP_BOOTSTRAP SHOULD emphasize

artifact integrity
peer compatibility
local process/signer health
validation-only correctness
first connectivity and chain identity checks

23.2 Alert posture

Thresholds SHOULD be highly sensitive. This is a phase where small anomalies can matter a lot.

23.3 Rule

A bootstrap profile SHOULD prefer false positives over false negatives for critical launch signals.

24. First blocks profile specifics

24.1 MP_FIRST_BLOCKS SHOULD emphasize

proposal cadence
invalid object spikes
verifier readiness
early BVM and witness anomalies
artifact mismatches discovered only under live load

24.2 Rule

Repeated anomalies across multiple nodes in first blocks SHOULD quickly escalate above local issue classification.

25. First epochs profile specifics

25.1 MP_FIRST_EPOCHS SHOULD emphasize

finalized roots
finality cadence
validator participation trends
repeated no-finality windows
deterministic replay anomalies
governance activation surprises

25.2 Rule

This profile SHOULD determine whether restricted posture can remain stable or needs tightening.

26. Restricted posture profile specifics

26.1 MP_RESTRICTED_POSTURE SHOULD emphasize

sustained healthy finality
absence of repeated critical anomalies
restart/rejoin anomaly counts
operator cluster health
alert fatigue avoidance while preserving high sensitivity to real regressions

26.2 Rule

Restricted posture SHOULD have lower incident thresholds than steady state, but SHOULD be slightly less noisy than the raw bootstrap profile.

27. Healthy baseline model

27.1 Need

Without an expected baseline, alerts become arbitrary.

27.2 Each profile SHOULD define:

expected block cadence range
acceptable validation error background
acceptable participation floor
acceptable restart background rate
acceptable telemetry lag
acceptable first-epoch convergence pattern

27.3 Rule

Healthy baseline MUST be tied to current launch scope and early-mainnet phase, not generic chain folklore.

28. Correlation rules

28.1 Need

Single metrics may be misleading. Correlated anomalies are stronger signals.

28.2 Examples

no-finality + notary participation drop + signer errors => likely operator/infrastructure cluster issue
no-finality + contradictory notarization signal => protocol/security critical
BVM trap spike + one module family concentration => module-specific issue
invalid object spike + chain_id mismatch peers => scope contamination

28.3 Rule

Monitoring SHOULD support correlated signal interpretation, not only isolated alerts.
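
The examples in 28.2 can be expressed as set-based rules: an interpretation applies when all of its anomalies are active together. The anomaly tag names below are hypothetical labels chosen for this sketch. Only the interpretations come from 28.2.

```python
# Each rule: (anomalies that must all be active, interpretation from 28.2).
CORRELATION_RULES = [
    ({"no_finality", "contradictory_notarization"},
     "protocol/security critical"),
    ({"no_finality", "notary_participation_drop", "signer_errors"},
     "likely operator/infrastructure cluster issue"),
    ({"bvm_trap_spike", "module_family_concentration"},
     "module-specific issue"),
    ({"invalid_object_spike", "peer_chain_id_mismatch"},
     "scope contamination"),
]


def interpret(active_anomalies):
    """Return every interpretation whose required anomalies are all active."""
    active = set(active_anomalies)
    return [label for required, label in CORRELATION_RULES if required <= active]
```

A single active anomaly such as `no_finality` matches no rule here, so it would still be handled by its own metric thresholds rather than by correlation.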

29. Decision ledger linkage

29.1 Monitoring SHOULD feed LDL with evidence for decisions like:

HOLD
PROCEED
ABORT
RESTRICTED_POSTURE_ENTER
RESTRICTED_POSTURE_EXIT
REJOIN_APPROVED
SCOPE_QUARANTINE

29.2 Rule

Critical monitoring events SHOULD produce evidence refs consumable by AZ-030 Launch Decision Ledger.

30. Incident linkage

30.1 When signals reach `SIG_INCIDENT_OPEN`, monitoring SHOULD:

open or recommend incident
preserve relevant metric snapshots
preserve logs and state roots if relevant
bind affected observation windows
link anomaly classes to runbooks

30.2 Rule

Monitoring without an incident handoff path is incomplete for early mainnet.

31. Alert fatigue protections

31.1 Need

Too many alerts can blind operators.

31.2 Controls SHOULD include

dedup windows
correlation grouping
phase-specific thresholds
routing by severity
explicit suppression only with traceable justification

31.3 Rule

Suppression of critical launch signals SHOULD be extremely conservative and auditable.

32. Monitoring records

32.1 Recommended objects

MonitoringSnapshotRecord
MonitoringAnomalyRecord
MonitoringHealthAssessment
MonitoringEscalationRecord

32.2 MonitoringSnapshotRecord

MonitoringSnapshotRecord {
  snapshot_id
  launch_window_id?
  profile_id
  observation_window_class
  metric_root
  aggregated_health_class
  timestamp_unix_ms
}

32.3 MonitoringAnomalyRecord

MonitoringAnomalyRecord {
  anomaly_id
  profile_id
  anomaly_class
  signal_class
  evidence_root?
  timestamp_unix_ms
}

32.4 Rule

Critical anomalies SHOULD be recordable as canonical objects, not only dashboard events.

33. Restricted posture exit criteria linkage

33.1 A profile SHOULD support answering:

has finality been stable enough?
have critical anomalies stayed absent long enough?
have operator restarts/rejoins stabilized?
is telemetry healthy enough?
are alert classes back within acceptable baseline?

33.2 Rule

Restricted posture exit SHOULD depend partly on monitoring evidence, not only on subjective confidence.
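
The questions in 33.1 can be turned into an evidence checklist. The sketch below reports which questions lack a positive answer. The key names are hypothetical, and the function deliberately returns gaps rather than an exit decision, because the rule above makes monitoring evidence only one input to that decision.

```python
# One key per question in 33.1 (names are illustrative).
EXIT_QUESTIONS = (
    "finality_stable",
    "critical_anomalies_absent",
    "restarts_rejoins_stabilized",
    "telemetry_healthy",
    "alert_classes_within_baseline",
)


def exit_evidence_gaps(answers):
    """List the exit questions not explicitly answered True."""
    return [q for q in EXIT_QUESTIONS if answers.get(q) is not True]
```

A missing answer counts as a gap, so an unanswered question cannot silently pass as satisfied.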

34. Public vs internal visibility

34.1 Some monitoring views MAY be internal only:

raw node health
sensitive security indicators
operator-specific degradation

34.2 Some summaries SHOULD be shareable:

general health class
known active major incidents
early-mainnet posture status
advisories relevant to participants

34.3 Rule

Visibility policy MUST NOT deprive launch-critical operators of necessary truth.

35. Profile evolution

35.1 Monitoring profiles MAY evolve across launches or network maturity phases.

35.2 Rule

For a given launch scope, active monitoring profiles SHOULD be frozen before use, versioned and archived.

35.3 Rule

Threshold changes during restricted posture SHOULD require explicit review and evidence.

36. Anti-patterns

Systems SHOULD avoid:

reusing steady-state dashboards as-is for launch
no explicit thresholds for early epochs
alerting only on infrastructure and not on protocol semantics
only local node monitoring with no network-view metrics
only network metrics with no local signer/process health
no correlation logic for major anomalies
suppressing repeated critical signals because they are noisy
ending restricted posture with no monitoring evidence
monitoring that cannot feed incident or decision processes
undocumented threshold changes during launch scope

37. Formal goals

AZ-031 pursues the following goals:

37.1 Early anomaly visibility

The system detects launch-critical anomalies quickly enough to matter.

37.2 Actionable classification

Signals map to health classes and escalation actions clearly.

37.3 Evidence-producing observability

Monitoring produces evidence usable by operators, incidents and decision ledger.

37.4 Safe transition to normal operations

Monitoring supports explicit exit from restricted posture rather than a guess-based transition.

38. Document formula

Early Mainnet Monitoring = profile-bound metrics + typed signals + predeclared thresholds + escalation mappings + evidence records + restricted-posture exit support

39. Relationship to the rest of the suite

AZ-028 defines the launch window procedure.
AZ-030 defines the decision ledger.
AZ-031 defines the concrete signals on which those decisions can be made responsibly during the first live stages.

In short: AZ-028 says when and where you look; AZ-031 says exactly what you look at and what it means.

40. What comes next

After AZ-031, the correct next document is:

AZ-032 — Post-Launch Stabilization Review Protocol

It must fix:

how we evaluate the first days/epochs after launch;
which formal review we perform;
how we classify stabilization as sufficient or insufficient;
what gets archived;
and which changes, fixes or restrictions remain active or are lifted after this evaluation.

Closing

In early mainnet, the problem is not merely having charts and alerts. The real problem is knowing which metric matters, which threshold changes the risk class, when an anomaly is just noise and when it is the first sign of a systemic deviation.

That is where launch monitoring with real value begins.