The suite
When monitoring fails silently, the rest of your stack fails noisily.
Three fabrics for cloud reliability — built to handle the three categories of monitoring failure that hit most AWS accounts. Each fabric is a multi-agent system in its own right; together they cover the full loop from "what's broken" to "who's been paged."
How it works
Built on THREAD.
Three phases — Audit, Remediate, Respond — woven from six pillars of telemetry, health, response, evolution, automation and defense.
Audit
Map every alarm to the resource it protects in your AWS account. Score coverage. Flag DEGRADED alarms before they bite.
- Telemetry: Unify alarms, metrics, logs, and events into one signal plane.
- Health: Continuously score coverage. Catch alarms that exist but won't notify.
Remediate
Close the gaps the audit found. Create the missing alarms via your role, wire SNS / PagerDuty / Opsgenie targets, tag everything for one-call rollback.
- Defense: Proactively close the gaps. Defense before incident, not just after.
- Automation: Safe, gated automation. Tagged + reversible writes only.
Respond
When alarms fire, triage from Slack, run RCA, and execute remediation in Copilot, Autonomous, or Human-in-the-loop (HIL) mode. Drive Jira + Confluence to close the loop.
- Response: Triage, RCA, remediation. Copilot / Autonomous / HIL, your call.
- Evolution: Every incident sharpens playbooks. The fabric learns over time.
Reliability Audit
Find which of your alarms are broken — and which services have no alarm at all. Read-only AWS audit that maps alarms to resources and surfaces the gaps.
The Reliability Audit (open-source as DiscoveryFabric) scans your AWS account through Resource Explorer 2 and maps every CloudWatch alarm to the ECS service, Lambda function, RDS instance, Aurora cluster, or SQS queue it actually protects. Five matching strategies (exact dimension, ALB → ECS target-group bridge, namespace + partial dimension, log-group linkage, and naming heuristic) catch alarms that simple dimension-match would miss.
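As a rough illustration of how layered matching might work, the sketch below implements only two of the five strategies (exact dimension and naming heuristic) on plain dictionaries. The field names (`dimensions`, `name`, `id`) and the function `match_alarm` are assumptions made for this example, not DiscoveryFabric's actual data model or API.

```python
from typing import Optional


def match_alarm(alarm: dict, resources: list[dict]) -> Optional[tuple[str, str]]:
    """Return (resource_id, strategy) for the first strategy that matches, else None."""
    # Strategy: exact dimension match, e.g. FunctionName == "checkout".
    dimension_values = set(alarm.get("dimensions", {}).values())
    for resource in resources:
        if resource["id"] in dimension_values:
            return resource["id"], "exact_dimension"

    # Fallback: naming heuristic, the resource id appears inside the alarm name.
    alarm_name = alarm.get("name", "").lower()
    for resource in resources:
        if resource["id"].lower() in alarm_name:
            return resource["id"], "naming_heuristic"

    return None


# Hypothetical inventory and alarms for illustration only.
resources = [{"id": "checkout"}, {"id": "orders-queue"}]
exact = match_alarm(
    {"name": "high-errors", "dimensions": {"FunctionName": "checkout"}}, resources
)
fuzzy = match_alarm({"name": "prod-orders-queue-depth", "dimensions": {}}, resources)
unmatched = match_alarm({"name": "cpu-high", "dimensions": {}}, resources)
```

Trying the strictest strategy first and falling back to looser ones keeps the heuristic from overriding a match that the dimensions already prove.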
It also flags DEGRADED alarms — alarms that exist but won't actually notify anyone, because actions are disabled, no SNS target is configured, or the metric is in INSUFFICIENT_DATA. The output is a JSON file and a self-contained HTML report you can open in any browser or send to a CTO or auditor unedited.
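The three DEGRADED conditions above can be sketched as a check over the alarm fields that CloudWatch's DescribeAlarms returns (`ActionsEnabled`, `AlarmActions`, `StateValue`). The function name, the reason labels, and the test for an SNS ARN are assumptions for illustration, not the tool's own implementation.

```python
def degraded_reasons(alarm: dict) -> list[str]:
    """List the reasons an alarm would fail to notify; empty means it looks healthy."""
    reasons = []
    # Actions can be switched off even when targets are configured.
    if not alarm.get("ActionsEnabled", True):
        reasons.append("actions_disabled")
    # Assumption: a target counts as SNS if its ARN names the sns service.
    actions = alarm.get("AlarmActions", [])
    if not any(":sns:" in arn for arn in actions):
        reasons.append("no_sns_target")
    # An alarm with no data points never transitions to ALARM.
    if alarm.get("StateValue") == "INSUFFICIENT_DATA":
        reasons.append("insufficient_data")
    return reasons


# Hypothetical alarm records shaped like DescribeAlarms output.
healthy = {
    "ActionsEnabled": True,
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:oncall"],
    "StateValue": "OK",
}
silent = {"ActionsEnabled": False, "AlarmActions": [], "StateValue": "INSUFFICIENT_DATA"}
```

Returning reasons rather than a single boolean lets a report say why an alarm is DEGRADED, which is usually what the person fixing it needs.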
- Read-only — only describe / list AWS calls, no writes ever
- Runs entirely on your laptop, no telemetry, no phone-home
- Five-strategy alarm-to-resource matching (covers the edge cases)
- DEGRADED alarm detection (the ones that look fine but won't page)
- Cross-account audits via STS AssumeRole + external ID
- --demo mode runs against a synthetic account in ~3 seconds
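For the cross-account case, a minimal sketch of the STS AssumeRole request is shown below. `RoleArn`, `RoleSessionName`, and `ExternalId` are real AssumeRole parameters; the role name, session name, account ID, and external ID are placeholders, and the helper `assume_role_params` is hypothetical.

```python
def assume_role_params(account_id: str, role_name: str, external_id: str) -> dict:
    """Build the keyword arguments for an STS AssumeRole call."""
    return {
        "RoleArn": f"arn:aws:iam::{account_id}:role/{role_name}",
        "RoleSessionName": "reliability-audit",  # placeholder session name
        "ExternalId": external_id,  # must match the condition in the role's trust policy
    }


params = assume_role_params("123456789012", "ReadOnlyAuditRole", "example-external-id")

# With boto3 installed and credentials configured, the call would look like:
#   import boto3
#   creds = boto3.client("sts").assume_role(**params)["Credentials"]
```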
$ pip install opsfabric-discovery && opsfabric-discovery audit --demo
Alarm Engineering
Close the gaps the audit found — in one deploy. Creates the missing alarms, wires the targets, at your chosen trust level.
Alarm Engineering (AlarmFabric) reads the audit's JSON output and creates the missing CloudWatch alarms in your account, using the audit's read-only role extended with the PutMetricAlarm permission. Every alarm is tagged with its source audit ID, the resource it protects, and the rule it satisfies, so rollback is one tag-filtered DeleteAlarms call away.
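The tagging and rollback described above might look roughly like the sketch below. The request keys are real PutMetricAlarm parameters, but the tag keys (`audit-id`, `protects`, `rule`), the metric and threshold choices, and both helper functions are assumptions for illustration. DeleteAlarms takes alarm names (up to 100 per call), so a tag-filtered rollback first selects names by tag and then batches them; how tagged alarms are looked up in a live account is not shown.

```python
def tagged_alarm_request(
    name: str, audit_id: str, resource_id: str, rule_id: str, sns_topic_arn: str
) -> dict:
    """Build PutMetricAlarm keyword arguments for a Lambda error alarm with provenance tags."""
    return {
        "AlarmName": name,
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": resource_id}],
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 5,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
        "Tags": [  # hypothetical tag keys
            {"Key": "audit-id", "Value": audit_id},
            {"Key": "protects", "Value": resource_id},
            {"Key": "rule", "Value": rule_id},
        ],
    }


def rollback_batches(requests: list[dict], audit_id: str, batch_size: int = 100) -> list[list[str]]:
    """Group the names of alarms created by one audit into DeleteAlarms-sized batches."""
    wanted = {"Key": "audit-id", "Value": audit_id}
    names = [r["AlarmName"] for r in requests if wanted in r["Tags"]]
    return [names[i : i + batch_size] for i in range(0, len(names), batch_size)]


topic = "arn:aws:sns:us-east-1:123456789012:oncall"  # placeholder ARN
alarm_requests = [
    tagged_alarm_request("checkout-errors", "audit-1", "checkout", "lambda-errors", topic),
    tagged_alarm_request("billing-errors", "audit-2", "billing", "lambda-errors", topic),
]
batches = rollback_batches(alarm_requests, "audit-1")

# With boto3, each request would go to cloudwatch.put_metric_alarm(**request)
# and each batch to cloudwatch.delete_alarms(AlarmNames=batch).
```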
Pick the remediation mode that matches your trust level: Copilot proposes alarms you click to create, Autonomous applies the whole pack on a schedule, Human-in-the-loop waits for approval on every PutMetricAlarm. It wires SNS / PagerDuty / Opsgenie targets, re-audits on a schedule, and tells you when resources show up without coverage or alarms drift.
- Three remediation modes: Copilot, Autonomous, Human-in-the-loop
- Creates missing alarms via your role + PutMetricAlarm
- Tagged + reversible — every alarm carries its provenance
- SNS / PagerDuty / Opsgenie target wiring
- Scheduled re-audits with drift detection between scans
- Multi-tenant managed SaaS — one console for every account
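Drift detection between scans can be pictured as a comparison of two coverage maps. The sketch below assumes each scan is reduced to a mapping from resource ID to the set of alarm names covering it; that shape, the function name, and the category labels are illustrative assumptions, not AlarmFabric's actual format.

```python
def coverage_drift(
    previous: dict[str, set[str]], current: dict[str, set[str]]
) -> dict[str, list[str]]:
    """Compare two scans that map resource id -> names of the alarms covering it."""
    return {
        # Resources that appeared since the last scan and have no alarms.
        "new_uncovered": sorted(
            r for r in current if r not in previous and not current[r]
        ),
        # Resources that had alarms last time and have none now.
        "lost_coverage": sorted(
            r for r in current if previous.get(r) and not current[r]
        ),
        # Resources still covered, but by a different set of alarms.
        "alarms_changed": sorted(
            r for r in current
            if previous.get(r) and current[r] and current[r] != previous[r]
        ),
    }


# Hypothetical scan results for illustration only.
previous = {"checkout": {"checkout-errors"}, "orders-queue": {"orders-depth"}}
current = {"checkout": {"checkout-errors"}, "orders-queue": set(), "billing-db": set()}
drift = coverage_drift(previous, current)
```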
Incident Operations
Run incidents end to end when alarms actually fire, across AWS or Azure, from page through post-mortem.
Incident Operations is the layer above Alarm Engineering. When an alarm fires on AWS or Azure, it picks up the incident from Slack, CloudWatch, or Azure Monitor, pulls logs and traces, generates a root-cause analysis, and proposes remediations. You choose the trust level: Copilot suggests, Autonomous executes pre-approved playbooks, Human-in-the-loop gates every infrastructure-touching step.
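One way to picture the three trust levels is as a gate in front of each remediation step. In the sketch below, the enum, the function, and the returned labels are hypothetical. Two behaviors are assumptions the text does not specify: Autonomous falls back to a suggestion when a playbook is not pre-approved, and Human-in-the-loop lets steps that do not touch infrastructure run without approval.

```python
from enum import Enum


class Mode(Enum):
    COPILOT = "copilot"
    AUTONOMOUS = "autonomous"
    HUMAN_IN_THE_LOOP = "human_in_the_loop"


def next_action(mode: Mode, playbook_pre_approved: bool, touches_infrastructure: bool) -> str:
    """Decide what happens to a proposed remediation step under a given trust level."""
    if mode is Mode.COPILOT:
        # Copilot only ever suggests; a person carries out the step.
        return "suggest"
    if mode is Mode.AUTONOMOUS:
        # Assumption: anything outside a pre-approved playbook is only suggested.
        return "execute" if playbook_pre_approved else "suggest"
    # Human-in-the-loop: every infrastructure-touching step waits for approval.
    # Assumption: steps that do not touch infrastructure run directly.
    return "await_approval" if touches_infrastructure else "execute"
```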
It drives the Jira lifecycle (create, update, transition, close), opens a GitHub PR for app-side fixes when the RCA points at code, and writes the Confluence post-mortem at the end. The whole incident has one correlated thread you can replay later in LangSmith.
- Three remediation modes: Copilot, Autonomous, Human-in-the-loop
- Multi-cloud incident response: AWS + Azure today, GCP on the roadmap
- Slack-native triage with full incident context inline
- Automated RCA from logs + traces + topology
- Jira lifecycle + Confluence post-mortems
- GitHub PR for app-side fixes when RCA points at code
Open source vs commercial
Feature matrix
The Reliability Audit is a complete audit tool, not a crippled version of the paid products. Alarm Engineering and Incident Operations cover different parts of the reliability loop.
| Capability | Reliability Audit (open source) | Alarm Engineering (paid) | Incident Operations (paid) |
|---|---|---|---|
| Read-only audit | Yes | Yes | Yes |
| Resource discovery (ECS / Lambda / RDS / Aurora / SQS) | Yes | Yes | Yes |
| Five-strategy alarm matching | Yes | Yes | Yes |
| DEGRADED alarm detection | Yes | Yes | Yes |
| Executive PDF + JSON output | Yes | Yes | Yes |
| --demo synthetic walkthrough | Yes | Yes | Yes |
| Cross-account audits via STS AssumeRole | Yes | Yes | Yes |
| Create missing alarms in your account | No | Yes | Yes |
| Tagged + reversible alarm provenance | No | Yes | Yes |
| SNS / PagerDuty / Opsgenie wiring | No | Yes | Yes |
| Scheduled / continuous audits | No | Yes | Yes |
| Drift detection between scans | No | Yes | Yes |
| Auto-cleanup of orphan alarms | No | Yes | Yes |
| Copilot / Autonomous / HIL remediation modes | No | Yes | Yes |
| Multi-cloud (AWS + Azure) | No | No | Yes |
| Slack-based incident triage | No | No | Yes |
| Automated root-cause analysis | No | No | Yes |
| Jira / Confluence incident lifecycle | No | No | Yes |
| GitHub PR for app-side fixes | No | No | Yes |
| Multi-tenant managed SaaS | No | Yes | Yes |
Start with the audit. Talk to us about the rest.
The Reliability Audit is free forever. Alarm Engineering and Incident Operations are pilot-priced during the first customer cohort.
$ pip install opsfabric-discovery && opsfabric-discovery audit --demo