AZ-031 — Early Mainnet Monitoring Profiles v1
Status
This document defines the monitoring profiles for early mainnet in ATLAS ZERO.
After AZ-001 through AZ-030, the following already exist:
- the specification of the protocol and its subsystems;
- readiness, launch ceremony and launch window;
- the release/genesis packages;
- the operator manuals and checklists;
- the formal ledger of launch decisions.
AZ-031 answers the question: how do we monitor the network's first epochs and first critical intervals so that we can quickly distinguish healthy behavior, controlled degradation and a real incident, without confusing normal bootstrap noise with signals of systemic risk?
The purpose of this document is to fix:
- the monitoring profiles for early mainnet;
- the minimum mandatory metrics;
- the signal and alert classes;
- the thresholds for healthy, degraded, incident-open and emergency-escalation;
- the link to restricted posture, incident response and the launch decision ledger.
This document builds on:
- AZ-002 through AZ-030, with direct emphasis on AZ-015, AZ-017, AZ-025, AZ-028, AZ-029 and AZ-030.
Terms:
- MUST = mandatory
- MUST NOT = forbidden
- SHOULD = strongly recommended
- MAY = optional
1. Objective
AZ-031 answers 10 critical questions:
- What does early mainnet monitoring mean?
- Which monitoring profiles must exist?
- Which metrics are mandatory in the first epochs?
- Which thresholds define healthy, degraded and incident?
- Which alerts are informational and which are blocking?
- How do we correlate local operator signals with network signals?
- How do we decide whether restricted posture can continue or must be tightened?
- How do we feed the Launch Decision Ledger and incident runbooks with real signals?
- How do we avoid both excessive panic and ignoring systemic signals?
- When do we have enough stability to exit the early mainnet monitoring profile?
2. Principles
2.1 Monitoring is part of launch control, not decorative observability
In early mainnet, monitoring MUST be treated as an operational control mechanism.
2.2 Signals must be typed
Not every anomaly carries the same meaning. Monitoring SHOULD separate:
- informational,
- caution,
- degraded,
- incident-open,
- emergency-escalation.
2.3 Network truth and local truth must both be observed
An operator must observe:
- the local health of the node;
- the perceived health of the network;
- and the relationship between them.
2.4 Thresholds must be predeclared
The main thresholds SHOULD be defined before launch. They must not be invented after an anomaly appears.
2.5 Monitoring must feed action
A monitoring profile is incomplete if it does not say:
- what is alerted,
- to whom,
- and which class of action it suggests.
2.6 Early mainnet is stricter than steady state
The first epochs and the first live intervals MUST have more conservative thresholds and faster escalation than steady state.
3. Monitoring scope
3.1 Early mainnet monitoring covers at minimum:
- consensus/finality health
- validation correctness signals
- BVM execution health
- witness/proof health
- governance/activation anomalies
- operator node health
- artifact/release/genesis mismatch signals
- role participation health
- incident/recovery control paths
3.2 Rule
An early mainnet profile without finality, role health and artifact scope checks is insufficient.
4. Monitoring profile classes
4.1 Standard profiles
ATLAS ZERO SHOULD define at least:
- MP_BOOTSTRAP
- MP_FIRST_BLOCKS
- MP_FIRST_EPOCHS
- MP_RESTRICTED_POSTURE
- MP_POST_RESTRICTED_STABILIZATION
4.2 Meaning
MP_BOOTSTRAP
Intensive monitoring in the interval immediately around node startup and the first peer checks.
MP_FIRST_BLOCKS
Focus on the first proposals, the first validations and startup anomalies.
MP_FIRST_EPOCHS
Focus on the first finalizations and the first signals of systemic behavior.
MP_RESTRICTED_POSTURE
Profile active while the network is live but under a strict regime.
MP_POST_RESTRICTED_STABILIZATION
Transition profile before the return to steady state.
4.3 Rule
Transitions between profiles SHOULD be explicit and logged.
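A minimal sketch of what explicit, logged profile transitions could look like. It assumes the five profiles are traversed strictly in the order listed in 4.1, which this document does not state outright. `ProfileTracker`, `advance` and the journal entry fields are hypothetical names.

```python
# Assumed linear order of the profile classes from section 4.1.
PROFILE_ORDER = [
    "MP_BOOTSTRAP",
    "MP_FIRST_BLOCKS",
    "MP_FIRST_EPOCHS",
    "MP_RESTRICTED_POSTURE",
    "MP_POST_RESTRICTED_STABILIZATION",
]


class ProfileTracker:
    """Tracks the active profile and journals every transition."""

    def __init__(self) -> None:
        self.current = PROFILE_ORDER[0]
        self.journal: list[dict] = []

    def advance(self, to_profile: str, reason: str, timestamp_unix_ms: int) -> None:
        """Move to the next profile; reject skipped, backward or unexplained moves."""
        idx = PROFILE_ORDER.index(self.current)
        if idx + 1 >= len(PROFILE_ORDER) or PROFILE_ORDER[idx + 1] != to_profile:
            raise ValueError(f"illegal transition {self.current} -> {to_profile}")
        if not reason.strip():
            raise ValueError("a transition must carry an explicit reason")
        # Journal first, then switch, so the log never lags the active profile.
        self.journal.append({
            "from": self.current,
            "to": to_profile,
            "reason": reason,
            "timestamp_unix_ms": timestamp_unix_ms,
        })
        self.current = to_profile
```

A deployment that lets profiles overlap would need a different transition table. The point here is only that every change is validated and leaves a journal entry.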
5. Signal classes
5.1 Standard signal classes
- SIG_INFO
- SIG_CAUTION
- SIG_DEGRADED
- SIG_INCIDENT_OPEN
- SIG_EMERGENCY_ESCALATE
5.2 Meaning
SIG_INFO
Useful information, with no immediate operational impact.
SIG_CAUTION
Small anomaly or a trend to watch.
SIG_DEGRADED
Behavior below expectations, but still controllable without a mandatory formal incident.
SIG_INCIDENT_OPEN
Signal strong enough to open an incident or to force local safe mode/hold decisions.
SIG_EMERGENCY_ESCALATE
Signal of exceptional severity, compatible with urgent escalation and possibly the emergency action workflow.
5.3 Rule
Signals MUST be mapped to actions and to notification roles.
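The signal classes form a severity ladder, which an ordered enum can express directly. The sketch below is illustrative only: `worst` and `is_blocking` are hypothetical helpers, and the cut-off used for "blocking" is an assumption, not something this document fixes.

```python
from enum import IntEnum


class SignalClass(IntEnum):
    """Signal classes from section 5.1, ordered by increasing severity."""
    SIG_INFO = 0
    SIG_CAUTION = 1
    SIG_DEGRADED = 2
    SIG_INCIDENT_OPEN = 3
    SIG_EMERGENCY_ESCALATE = 4


def worst(signals: list[SignalClass]) -> SignalClass:
    """Return the most severe signal; an empty list counts as SIG_INFO."""
    return max(signals, default=SignalClass.SIG_INFO)


def is_blocking(signal: SignalClass) -> bool:
    """Assumed cut-off: SIG_INCIDENT_OPEN and above block the launch flow."""
    return signal >= SignalClass.SIG_INCIDENT_OPEN
```

Because the members are integers, comparisons such as `signal >= SignalClass.SIG_DEGRADED` work without a separate ranking table.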
6. Health classes
6.1 Standard health classes
- HEALTHY
- WATCH
- DEGRADED
- UNSTABLE
- INCIDENT
- EMERGENCY
6.2 Rule
Each profile SHOULD be able to derive an aggregated health class from metrics and signals.
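One way to derive an aggregated health class is a worst-signal-wins rule. The sketch below assumes a specific signal-to-health mapping and an extra rule for UNSTABLE, neither of which this document defines. `aggregate_health` and `degraded_limit` are hypothetical names.

```python
# Health classes from section 6.1, least to most severe.
HEALTH_ORDER = ["HEALTHY", "WATCH", "DEGRADED", "UNSTABLE", "INCIDENT", "EMERGENCY"]

# Assumed signal -> health mapping; the document does not fix one.
SIGNAL_TO_HEALTH = {
    "SIG_INFO": "HEALTHY",
    "SIG_CAUTION": "WATCH",
    "SIG_DEGRADED": "DEGRADED",
    "SIG_INCIDENT_OPEN": "INCIDENT",
    "SIG_EMERGENCY_ESCALATE": "EMERGENCY",
}


def aggregate_health(active_signals: list[str], degraded_limit: int = 3) -> str:
    """Worst-signal-wins aggregation with one assumed extra rule:
    several concurrent SIG_DEGRADED signals are reported as UNSTABLE."""
    if not active_signals:
        return "HEALTHY"
    worst = max((SIGNAL_TO_HEALTH[s] for s in active_signals), key=HEALTH_ORDER.index)
    if worst == "DEGRADED" and active_signals.count("SIG_DEGRADED") >= degraded_limit:
        return "UNSTABLE"
    return worst
```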
7. Monitoring dimensions
7.1 Core dimensions
- artifact integrity
- chain identity consistency
- peer compatibility
- validation correctness
- consensus/finality
- BVM execution
- witness/proof correctness
- governance activation correctness
- node resource health
- role participation
- logging/telemetry health
- operator action path health
7.2 Rule
All dimensions that can block restricted posture exit SHOULD have explicit metrics or signals.
8. Artifact integrity profile
8.1 Purpose
Detects whether a node is running with artifacts that are wrong or not aligned with the launch scope.
8.2 Required checks
- binary hash match to authorized release
- release_package_id match
- genesis_package_id match
- chain_id match
- genesis_hash match
- no active revocation on critical artifacts known locally
8.3 Signal mapping
- mismatch on binary hash => SIG_INCIDENT_OPEN
- mismatch on genesis_hash or chain_id => SIG_EMERGENCY_ESCALATE
- unknown artifact provenance in launch-critical role => SIG_INCIDENT_OPEN
8.4 Rule
Artifact mismatch in early mainnet SHOULD be treated as severe until disproven.
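The checks in 8.2 and the mapping in 8.3 can be sketched as a comparison between the authorized scope and what a node reports. `ArtifactScope` and `classify_artifacts` are hypothetical names, and the handling of package-id mismatches is an assumption, since 8.3 does not map them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ArtifactScope:
    binary_hash: str
    release_package_id: str
    genesis_package_id: str
    chain_id: str
    genesis_hash: str


def classify_artifacts(expected: ArtifactScope, observed: ArtifactScope) -> Optional[str]:
    """Return the signal class for the worst mismatch, or None if all fields match."""
    # Chain identity is checked first because it maps to the highest class.
    if observed.chain_id != expected.chain_id or observed.genesis_hash != expected.genesis_hash:
        return "SIG_EMERGENCY_ESCALATE"
    if observed.binary_hash != expected.binary_hash:
        return "SIG_INCIDENT_OPEN"
    # Assumption: package-id mismatches are not mapped in 8.3; following rule 8.4
    # ("severe until disproven") they are treated here as SIG_INCIDENT_OPEN.
    if (observed.release_package_id != expected.release_package_id
            or observed.genesis_package_id != expected.genesis_package_id):
        return "SIG_INCIDENT_OPEN"
    return None
```

Revocation status and artifact provenance are left out of this sketch because they depend on data sources this document does not describe.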
9. Peer compatibility profile
9.1 Purpose
Observes whether the surrounding peers belong to the same network reality.
9.2 Required signals
- peer chain_id mismatch count
- peer genesis_hash mismatch count
- unsupported protocol version count
- peer handshake failure rate
- peer diversity health
9.3 Threshold guidance
- isolated mismatch peers => SIG_CAUTION
- repeated mismatch majority or unexpected cluster => SIG_DEGRADED or worse
- widespread mismatch in launch-critical peers => SIG_INCIDENT_OPEN
9.4 Rule
Peer incompatibility MUST NOT be ignored as mere noise during launch.
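The threshold guidance in 9.3 is qualitative, so any code version has to pick numbers. The sketch below assumes that "majority" means more than half of all observed peers and that "widespread" means at least `critical_fraction` of launch-critical peers. `PeerView` and `classify_peers` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PeerView:
    chain_id: str
    genesis_hash: str
    launch_critical: bool


def classify_peers(peers: list[PeerView], chain_id: str, genesis_hash: str,
                   critical_fraction: float = 0.5) -> Optional[str]:
    """Classify chain identity mismatches among observed peers; None means no mismatch."""
    mismatched = [p for p in peers
                  if p.chain_id != chain_id or p.genesis_hash != genesis_hash]
    if not mismatched:
        return None
    critical = [p for p in peers if p.launch_critical]
    critical_bad = [p for p in mismatched if p.launch_critical]
    # Assumed reading of "widespread mismatch in launch-critical peers".
    if critical and len(critical_bad) / len(critical) >= critical_fraction:
        return "SIG_INCIDENT_OPEN"
    # Assumed reading of "mismatch majority": more than half of all peers.
    if len(mismatched) * 2 > len(peers):
        return "SIG_DEGRADED"
    return "SIG_CAUTION"
```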
10. Validation correctness profile
10.1 Purpose
Detects signals that the node or the network is processing objects incorrectly.
10.2 Required metrics
- invalid object rate by class
- tx reject rate by category
- unexpected parser/canonicalization failures
- receipt mismatch signals if available
- replay mismatch signals if available
10.3 Escalation guidance
- mild expected invalids from public traffic => SIG_INFO or SIG_CAUTION
- sudden spike in well-formed but unexpectedly rejected objects => SIG_DEGRADED
- deterministic replay mismatch => SIG_INCIDENT_OPEN or SIG_EMERGENCY_ESCALATE
10.4 Rule
Any sign of deterministic validation divergence in early mainnet is extremely serious.
11. Consensus and finality profile
11.1 Purpose
This is the central profile of early mainnet.
11.2 Required metrics
- block proposal cadence
- block acceptance rate
- verifier vote participation
- notary participation
- finality latency
- finalized epoch cadence
- no-finality interval length
- conflicting notarization signals
- committee derivation mismatch signals
11.3 Health guidance
Healthy
Cadence and participation inside expected launch bands.
Degraded
Transient slower finality or reduced participation, but network still understandable and recoverable.
Incident
Sustained no-finality, contradictory notarization, or unexplained participation collapse.
11.4 Rule
Consensus/finality profile SHOULD dominate launch health classification during first epochs.
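The health guidance in 11.3 can be sketched as a classifier over predeclared launch bands. This document gives no numeric bands, so the caller must supply them. `FinalityBands` and `classify_finality` are hypothetical names, and any participation collapse is treated as an incident here, because the sketch cannot tell whether a collapse has an explanation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FinalityBands:
    """Predeclared launch bands; values come from the launch scope, not from this sketch."""
    max_healthy_finality_latency_s: float
    max_no_finality_interval_s: float
    min_participation: float       # healthy floor, 0..1
    collapse_participation: float  # below this counts as collapse, 0..1


def classify_finality(bands: FinalityBands, finality_latency_s: float,
                      no_finality_interval_s: float, participation: float,
                      conflicting_notarization: bool) -> str:
    """Map consensus observations to the Healthy / Degraded / Incident guidance."""
    if (conflicting_notarization
            or no_finality_interval_s > bands.max_no_finality_interval_s
            or participation < bands.collapse_participation):
        return "INCIDENT"
    if (finality_latency_s > bands.max_healthy_finality_latency_s
            or participation < bands.min_participation):
        return "DEGRADED"
    return "HEALTHY"
```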
12. BVM execution profile
12.1 Purpose
Observes the health of machine execution and of the bounded runtime.
12.2 Required metrics if BVM active at launch
- machine call success rate
- trap/revert rate by class
- exec unit exhaustion rate
- permission surface violations
- effect bound exceeded count
- state write failures
- unexpected verifier/runtime mismatch signals
12.3 Escalation guidance
- normal revert patterns from user logic => often SIG_INFO
- repeated trap clusters from same module family => SIG_CAUTION or SIG_DEGRADED
- cross-node execution mismatch => SIG_INCIDENT_OPEN
- boundedness bypass indication => SIG_EMERGENCY_ESCALATE
12.4 Rule
A BVM mismatch that appears consensus-relevant MUST be escalated immediately.
13. Witness / proof profile
13.1 Purpose
Observes the health of the statement, proof, revocation and contradiction subsystems.
13.2 Required metrics if witness/proof active
- witness validation failure rate
- proof verification failure rate
- stale/expired witness usage attempts
- revocation mismatches
- contradiction detections
- unauthorized witness emission signals
13.3 Escalation guidance
- noisy invalid witness spam => SIG_CAUTION to SIG_DEGRADED
- repeated valid-looking unauthorized witness emission => SIG_INCIDENT_OPEN
- contradiction in critical operational witness family => potentially SIG_EMERGENCY_ESCALATE
13.4 Rule
Witness/proof anomalies tied to settlement, halt, treasury or governance scopes SHOULD have stricter thresholds.
14. Governance activation profile
14.1 Purpose
Observes that active governance does not deviate from expectations.
14.2 Required signals
- unexpected activation event count
- timelock boundary mismatch signals
- challenge window mismatch signals
- unauthorized emergency action appearance
- governance state derivation mismatch signals
14.3 Rule
Unexpected governance activation in early mainnet SHOULD be treated as at least SIG_INCIDENT_OPEN.
15. Operator node health profile
15.1 Purpose
Observes the local health of the node, without confusing it with network truth.
15.2 Required metrics
- process liveness
- restart count
- signer health
- disk pressure
- memory pressure
- CPU saturation
- network connectivity health
- queue backlogs
- snapshot success/failure
- log/metric sink health
15.3 Escalation guidance
- transient resource spikes => SIG_CAUTION
- repeated restarts or signer failures in validator role => SIG_DEGRADED
- inability to validate or safe-sign => SIG_INCIDENT_OPEN
15.4 Rule
Local node degradation SHOULD often trigger local safe mode before network-wide escalation.
16. Role participation profile
16.1 Purpose
Observes whether the expected actors actually participate according to plan.
16.2 Required metrics
- validator online count estimate
- proposer participation rate
- verifier participation rate
- notary participation rate
- expected operator readiness vs actual live behavior
- unexpected inactive critical role count
16.3 Rule
Drops in notary or verifier participation during early mainnet SHOULD escalate quickly.
17. Monitoring pipeline health profile
17.1 Purpose
Observes whether the observability system itself is working.
17.2 Required metrics
- metric ingest lag
- alert delivery success
- log sink errors
- dashboard query freshness
- tracing or event bus health if used
- monitoring blind spot count
17.3 Rule
Telemetry blindness in early mainnet SHOULD be treated as degradation or incident depending on severity.
18. Monitoring profile object
18.1 Canonical structure
EarlyMainnetMonitoringProfile {
  profile_id
  profile_class
  target_network_class
  target_chain_id
  target_genesis_hash
  metric_rule_root
  alert_rule_root
  escalation_rule_root
  restricted_posture_binding_hash?
  version
}
18.2 Rule
Profiles SHOULD be versioned and immutable per launch scope.
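As an illustration of "versioned and immutable", the canonical structure can be modeled as a frozen dataclass. The field types and the validation rules below are assumptions, since this document specifies only field names.

```python
from dataclasses import dataclass
from typing import Optional

PROFILE_CLASSES = frozenset({
    "MP_BOOTSTRAP",
    "MP_FIRST_BLOCKS",
    "MP_FIRST_EPOCHS",
    "MP_RESTRICTED_POSTURE",
    "MP_POST_RESTRICTED_STABILIZATION",
})


@dataclass(frozen=True)  # frozen: attribute assignment after creation raises
class EarlyMainnetMonitoringProfile:
    profile_id: str
    profile_class: str
    target_network_class: str
    target_chain_id: str
    target_genesis_hash: str
    metric_rule_root: str
    alert_rule_root: str
    escalation_rule_root: str
    restricted_posture_binding_hash: Optional[str]  # the "?" field in 18.1
    version: int

    def __post_init__(self) -> None:
        # Assumed validation: known profile class and a positive version number.
        if self.profile_class not in PROFILE_CLASSES:
            raise ValueError(f"unknown profile_class: {self.profile_class}")
        if self.version < 1:
            raise ValueError("version must be >= 1")
```

Changing a threshold then means publishing a new object with a higher `version`, which keeps the old profile available for archiving.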
19. Metric rule object
19.1 Canonical structure
MetricRule {
  metric_rule_id
  metric_class
  metric_name_hash
  observation_window_class
  warning_threshold_hash
  degraded_threshold_hash
  incident_threshold_hash
  emergency_threshold_hash?
  aggregation_mode
}
19.2 aggregation_mode examples
- instant
- rolling_mean
- rolling_max
- percentile
- count_over_window
- ratio_over_window
19.3 Rule
Threshold semantics MUST be defined clearly enough to avoid operator reinterpretation.
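To show what unambiguous threshold semantics can look like, the sketch below evaluates one window of samples under each aggregation mode from 19.2 and then classifies the result. Several points are assumptions: the thresholds are taken as already resolved to numbers (the structure stores only hashes), the metric is higher-is-worse, thresholds are inclusive, and the count and ratio modes treat any sample above zero as one event.

```python
import math
from statistics import mean
from typing import Optional


def aggregate(samples: list[float], mode: str, percentile: float = 95.0) -> float:
    """Reduce one observation window to a single value."""
    if not samples:
        raise ValueError("empty observation window")
    if mode == "instant":
        return samples[-1]  # most recent sample
    if mode == "rolling_mean":
        return mean(samples)
    if mode == "rolling_max":
        return max(samples)
    if mode == "percentile":
        # Nearest-rank percentile over the window.
        ordered = sorted(samples)
        rank = max(1, math.ceil(percentile / 100.0 * len(ordered)))
        return ordered[rank - 1]
    if mode == "count_over_window":
        return float(sum(1 for s in samples if s > 0))
    if mode == "ratio_over_window":
        return sum(1 for s in samples if s > 0) / len(samples)
    raise ValueError(f"unknown aggregation_mode: {mode}")


def classify(value: float, warning: float, degraded: float, incident: float,
             emergency: Optional[float] = None) -> str:
    """Higher-is-worse metric; each threshold is an inclusive lower bound."""
    if emergency is not None and value >= emergency:
        return "SIG_EMERGENCY_ESCALATE"
    if value >= incident:
        return "SIG_INCIDENT_OPEN"
    if value >= degraded:
        return "SIG_DEGRADED"
    if value >= warning:
        return "SIG_CAUTION"
    return "SIG_INFO"
```

A lower-is-worse metric such as participation rate would need the comparisons inverted, which is exactly the kind of detail a rule definition has to state explicitly.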
20. Alert rule object
20.1 Canonical structure
AlertRule {
  alert_rule_id
  signal_class
  source_metric_rule_refs
  dedup_window_hash
  routing_class
  required_ack_role_classes
}
20.2 routing_class examples
- local_operator
- validator_cluster
- launch_coordination
- incident_commander
- security_triage
- emergency_escalation
20.3 Rule
Alerts SHOULD route differently depending on severity and scope.
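A dedup window can be sketched as a small stateful filter. The example assumes the window referenced by `dedup_window_hash` has been resolved to a duration in seconds. `AlertDeduplicator` is a hypothetical name.

```python
from typing import Optional


class AlertDeduplicator:
    """Suppresses repeats of the same (rule, signal class) pair inside a time window."""

    def __init__(self, dedup_window_s: float) -> None:
        self.window_s = dedup_window_s
        self._last_sent: dict[tuple[str, str], float] = {}

    def should_deliver(self, alert_rule_id: str, signal_class: str, now_s: float) -> bool:
        # The signal class is part of the key, so an escalation to a higher
        # class is never hidden by an earlier, lower-severity alert.
        key = (alert_rule_id, signal_class)
        last: Optional[float] = self._last_sent.get(key)
        if last is not None and now_s - last < self.window_s:
            return False
        self._last_sent[key] = now_s
        return True
```

Suppressed repeats do not extend the window in this sketch, so a persistent condition is re-announced once per window rather than silenced indefinitely.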
21. Escalation rule object
21.1 Canonical structure
EscalationRule {
  escalation_rule_id
  trigger_signal_class
  target_action_class
  required_roles_notified
  decision_ledger_entry_required
  incident_open_required
}
21.2 target_action_class examples
- observe_only
- operator_investigate
- local_safe_mode
- disable_signing
- open_incident
- hold_launch_flow
- escalate_emergency
21.3 Rule
Severe signals SHOULD map deterministically to action classes.
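Deterministic mapping can be enforced when the rule set is loaded, as in the sketch below. The field types, the choice of which signal classes count as severe, and the `build_escalation_table` helper are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationRule:
    escalation_rule_id: str
    trigger_signal_class: str
    target_action_class: str
    required_roles_notified: tuple[str, ...]
    decision_ledger_entry_required: bool
    incident_open_required: bool


# Assumed set of signal classes that must always have a rule.
SEVERE = ("SIG_INCIDENT_OPEN", "SIG_EMERGENCY_ESCALATE")


def build_escalation_table(rules: list[EscalationRule]) -> dict[str, EscalationRule]:
    """Index rules by trigger signal; reject ambiguous or incomplete rule sets."""
    table: dict[str, EscalationRule] = {}
    for rule in rules:
        if rule.trigger_signal_class in table:
            raise ValueError(f"ambiguous escalation for {rule.trigger_signal_class}")
        table[rule.trigger_signal_class] = rule
    missing = [s for s in SEVERE if s not in table]
    if missing:
        raise ValueError(f"severe signals without escalation rule: {missing}")
    return table
```

Rejecting duplicates at load time means an operator never has to choose between two candidate actions while an incident is unfolding.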
22. Observation windows
22.1 Recommended windows
- OW_BOOTSTRAP_SECONDS
- OW_FIRST_BLOCKS_SHORT
- OW_FIRST_EPOCH
- OW_FIRST_3_EPOCHS
- OW_FIRST_10_EPOCHS
- OW_RESTRICTED_POSTURE_ROLLING
22.2 Rule
Thresholds SHOULD be tuned to observation windows, not reused blindly.
23. Bootstrap profile specifics
23.1 MP_BOOTSTRAP SHOULD emphasize
- artifact integrity
- peer compatibility
- local process/signer health
- validation-only correctness
- first connectivity and chain identity checks
23.2 Alert posture
Thresholds SHOULD be highly sensitive. This is a phase where small anomalies can matter a lot.
23.3 Rule
A bootstrap profile SHOULD prefer false positives over false negatives for critical launch signals.
24. First blocks profile specifics
24.1 MP_FIRST_BLOCKS SHOULD emphasize
- proposal cadence
- invalid object spikes
- verifier readiness
- early BVM and witness anomalies
- artifact mismatches discovered only under live load
24.2 Rule
Repeated anomalies across multiple nodes in first blocks SHOULD quickly escalate above local issue classification.
25. First epochs profile specifics
25.1 MP_FIRST_EPOCHS SHOULD emphasize
- finalized roots
- finality cadence
- validator participation trends
- repeated no-finality windows
- deterministic replay anomalies
- governance activation surprises
25.2 Rule
This profile SHOULD determine whether restricted posture can remain stable or needs tightening.
26. Restricted posture profile specifics
26.1 MP_RESTRICTED_POSTURE SHOULD emphasize
- sustained healthy finality
- absence of repeated critical anomalies
- restart/rejoin anomaly counts
- operator cluster health
- alert fatigue avoidance while preserving high sensitivity to real regressions
26.2 Rule
Restricted posture SHOULD use lower incident thresholds than steady state, while being slightly less noisy than the raw bootstrap profile.
27. Healthy baseline model
27.1 Need
Without an expected baseline, alerts become arbitrary.
27.2 Each profile SHOULD define:
- expected block cadence range
- acceptable validation error background
- acceptable participation floor
- acceptable restart background rate
- acceptable telemetry lag
- acceptable first-epoch convergence pattern
27.3 Rule
Healthy baseline MUST be tied to current launch scope and early-mainnet phase, not generic chain folklore.
28. Correlation rules
28.1 Need
Single metrics may be misleading. Correlated anomalies are stronger signals.
28.2 Examples
- no-finality + notary participation drop + signer errors => likely operator/infrastructure cluster issue
- no-finality + contradictory notarization signal => protocol/security critical
- BVM trap spike + one module family concentration => module-specific issue
- invalid object spike + chain_id mismatch peers => scope contamination
28.3 Rule
Monitoring SHOULD support correlated signal interpretation, not only isolated alerts.
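The examples in 28.2 can be encoded as patterns over a set of active anomaly tags. The tag names below are hypothetical labels for those anomalies, and `interpret` is an illustrative helper rather than a defined interface.

```python
# Each rule pairs a set of anomaly tags with the interpretation from section 28.2.
CORRELATION_RULES = [
    (frozenset({"no_finality", "conflicting_notarization"}),
     "protocol/security critical"),
    (frozenset({"no_finality", "notary_participation_drop", "signer_errors"}),
     "likely operator/infrastructure cluster issue"),
    (frozenset({"bvm_trap_spike", "single_module_family_concentration"}),
     "module-specific issue"),
    (frozenset({"invalid_object_spike", "chain_id_mismatch_peers"}),
     "scope contamination"),
]


def interpret(active_anomalies: set[str]) -> list[str]:
    """Return every interpretation whose full pattern is present in the active set."""
    return [label for pattern, label in CORRELATION_RULES
            if pattern <= active_anomalies]
```

A single tag such as `no_finality` matches nothing on its own, which mirrors the point that isolated metrics can mislead while correlated anomalies carry meaning.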
29. Decision ledger linkage
29.1 Monitoring SHOULD feed LDL with evidence for decisions like:
- HOLD
- PROCEED
- ABORT
- RESTRICTED_POSTURE_ENTER
- RESTRICTED_POSTURE_EXIT
- REJOIN_APPROVED
- SCOPE_QUARANTINE
29.2 Rule
Critical monitoring events SHOULD produce evidence refs consumable by AZ-030 Launch Decision Ledger.
30. Incident linkage
30.1 When signals reach SIG_INCIDENT_OPEN, monitoring SHOULD:
- open or recommend incident
- preserve relevant metric snapshots
- preserve logs and state roots if relevant
- bind affected observation windows
- link anomaly classes to runbooks
30.2 Rule
Monitoring without an incident handoff path is incomplete for early mainnet.
31. Alert fatigue protections
31.1 Need
Too many alerts can blind operators.
31.2 Controls SHOULD include
- dedup windows
- correlation grouping
- phase-specific thresholds
- routing by severity
- explicit suppression only with traceable justification
31.3 Rule
Suppression of critical launch signals SHOULD be extremely conservative and auditable.
32. Monitoring records
32.1 Recommended objects
- MonitoringSnapshotRecord
- MonitoringAnomalyRecord
- MonitoringHealthAssessment
- MonitoringEscalationRecord
32.2 MonitoringSnapshotRecord
MonitoringSnapshotRecord {
  snapshot_id
  launch_window_id?
  profile_id
  observation_window_class
  metric_root
  aggregated_health_class
  timestamp_unix_ms
}
32.3 MonitoringAnomalyRecord
MonitoringAnomalyRecord {
  anomaly_id
  profile_id
  anomaly_class
  signal_class
  evidence_root?
  timestamp_unix_ms
}
32.4 Rule
Critical anomalies SHOULD be recordable as canonical objects, not only dashboard events.
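To make an anomaly recordable as a canonical object, it needs a stable byte encoding that can be hashed and referenced. This document does not define that encoding, so the sketch below uses sorted-key compact JSON with SHA-256 purely as a stand-in. The field types are also assumed.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class MonitoringAnomalyRecord:
    anomaly_id: str
    profile_id: str
    anomaly_class: str
    signal_class: str
    evidence_root: Optional[str]  # the "?" field in 32.3
    timestamp_unix_ms: int


def canonical_bytes(record: MonitoringAnomalyRecord) -> bytes:
    """Stand-in canonical form: JSON with sorted keys and no extra whitespace."""
    return json.dumps(asdict(record), sort_keys=True, separators=(",", ":")).encode("utf-8")


def record_digest(record: MonitoringAnomalyRecord) -> str:
    """Hex SHA-256 of the canonical bytes, usable as an evidence reference."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()
```

Two nodes that observe the same anomaly fields produce the same digest, which is what lets a ledger entry point at the record instead of at a dashboard screenshot.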
33. Restricted posture exit criteria linkage
33.1 A profile SHOULD support answering:
- has finality been stable enough?
- have critical anomalies stayed absent long enough?
- have operator restarts/rejoins stabilized?
- is telemetry healthy enough?
- are alert classes back within acceptable baseline?
33.2 Rule
Restricted posture exit SHOULD depend partly on monitoring evidence, not only on subjective confidence.
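The questions in 33.1 can be turned into an explicit evidence check. In this sketch `ExitEvidence`, `ExitCriteria` and `exit_blockers` are hypothetical names, and the criteria values must come from the launch scope. An empty result means only that monitoring evidence does not block exit, not that exit is approved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExitEvidence:
    stable_finality_epochs: int
    epochs_since_critical_anomaly: int
    restarts_in_window: int
    telemetry_healthy: bool
    alerts_within_baseline: bool


@dataclass(frozen=True)
class ExitCriteria:
    min_stable_finality_epochs: int
    min_quiet_epochs: int
    max_restarts_in_window: int


def exit_blockers(ev: ExitEvidence, crit: ExitCriteria) -> list[str]:
    """Return one entry per unmet question from section 33.1."""
    blockers = []
    if ev.stable_finality_epochs < crit.min_stable_finality_epochs:
        blockers.append("finality not stable long enough")
    if ev.epochs_since_critical_anomaly < crit.min_quiet_epochs:
        blockers.append("critical anomalies too recent")
    if ev.restarts_in_window > crit.max_restarts_in_window:
        blockers.append("operator restarts/rejoins not stabilized")
    if not ev.telemetry_healthy:
        blockers.append("telemetry not healthy enough")
    if not ev.alerts_within_baseline:
        blockers.append("alert classes above baseline")
    return blockers
```

Returning the list of blockers rather than a boolean gives the decision ledger something concrete to cite.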
34. Public vs internal visibility
34.1 Some monitoring views MAY be internal only:
- raw node health
- sensitive security indicators
- operator-specific degradation
34.2 Some summaries SHOULD be shareable:
- general health class
- known active major incidents
- early-mainnet posture status
- advisories relevant to participants
34.3 Rule
Visibility policy MUST NOT deprive launch-critical operators of necessary truth.
35. Profile evolution
35.1 Monitoring profiles MAY evolve across launches or network maturity phases.
35.2 Rule
For a given launch scope, active monitoring profiles SHOULD be frozen before use, versioned and archived.
35.3 Rule
Threshold changes during restricted posture SHOULD require explicit review and evidence.
36. Anti-patterns
Systems SHOULD avoid:
- reusing steady-state dashboards as-is for launch
- no explicit thresholds for early epochs
- alerting only on infrastructure and not on protocol semantics
- only local node monitoring with no network-view metrics
- only network metrics with no local signer/process health
- no correlation logic for major anomalies
- suppressing repeated critical signals because they are noisy
- ending restricted posture with no monitoring evidence
- monitoring that cannot feed incident or decision processes
- undocumented threshold changes during launch scope
37. Formal goals
AZ-031 pursues these goals:
37.1 Early anomaly visibility
The system detects launch-critical anomalies quickly enough to matter.
37.2 Actionable classification
Signals map to health classes and escalation actions clearly.
37.3 Evidence-producing observability
Monitoring produces evidence usable by operators, incidents and decision ledger.
37.4 Safe transition to normal operations
Monitoring supports explicit exit from restricted posture rather than a guess-based transition.
38. Document formula
Early Mainnet Monitoring = profile-bound metrics + typed signals + predeclared thresholds + escalation mappings + evidence records + restricted-posture exit support
39. Relationship to the rest of the suite
- AZ-028 defines the launch window procedure.
- AZ-030 defines the decision ledger.
- AZ-031 defines the concrete signals on which those decisions can be made responsibly in the first live stages.
In short: AZ-028 says when and where you look; AZ-031 says exactly what you look at and what the things you see mean.
40. What comes next
After AZ-031, the next document is:
AZ-032 — Post-Launch Stabilization Review Protocol
It must fix:
- how we evaluate the first days/epochs after launch;
- which formal review we perform;
- how we classify stabilization as sufficient or insufficient;
- what gets archived;
- and which changes, fixes or restrictions remain active or are lifted after this evaluation.
Closing
In early mainnet, the problem is not just having graphs and alerts. The real problem is knowing which metric matters, which threshold changes the risk class, when an anomaly is only noise, and when it is the first sign of a systemic deviation.
That is where launch monitoring with real value begins.