The suite

When monitoring fails silently, the rest of your stack fails noisily.

Three fabrics for cloud reliability — built to handle the three categories of monitoring failure that hit most AWS accounts. Each fabric is a multi-agent system in its own right; together they cover the full loop from "what's broken" to "who's been paged."

How it works

Built on THREAD.

Three phases (Audit, Remediate, Respond) woven from six pillars: Telemetry, Health, Response, Evolution, Automation, and Defense.

Audit

Map every alarm to the resource it protects across AWS and Azure. Score coverage. Flag DEGRADED alarms before they bite.

Pillars

Telemetry

Unify alarms, metrics, logs, and events into one signal plane.

Health

Continuously score coverage. Catch alarms that exist but won't notify.

Remediate

Close the gaps the audit found. Create the missing alarms via your role, wire SNS / PagerDuty / Opsgenie targets, tag everything for one-call rollback.

Pillars

Defense

Proactively close the gaps. Defense before incident, not just after.

Automation

Safe, gated automation. Tagged + reversible writes only.

Respond

When alarms fire, triage from Slack, run RCA, and execute remediation in Copilot, Autonomous, or Human-in-the-loop (HIL) mode. Drive Jira + Confluence to close the loop.

Pillars

Response

Triage, RCA, remediation. Copilot / Autonomous / HIL: your call.

Evolution

Every incident sharpens playbooks. The fabric learns over time.

DiscoveryFabric · Open source · MIT · on PyPI

Reliability Audit

Find which of your alarms are broken — and which services have no alarm at all. Read-only AWS audit that maps alarms to resources and surfaces the gaps.

The Reliability Audit (open-sourced as DiscoveryFabric) scans your AWS account through AWS Resource Explorer (the resource-explorer-2 API) and maps every CloudWatch alarm to the ECS service, Lambda function, RDS instance, Aurora cluster, or SQS queue it actually protects. Five matching strategies (exact dimension, ALB → ECS target-group bridge, namespace + partial dimension, log-group linkage, and naming heuristic) catch alarms that a simple dimension match would miss.
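
To illustrate how a cascade of matching strategies can work, here is a minimal sketch that tries an exact dimension match first and falls back to a naming heuristic. The data shapes, function names, and the heuristic itself are assumptions made for this example; they are not DiscoveryFabric's actual code or schema.

```python
from typing import Optional

# Hypothetical, simplified shapes: an alarm carries CloudWatch-style
# Dimensions; a resource carries the dimensions that identify it.

def match_exact(alarm: dict, resources: list) -> Optional[dict]:
    """Strategy 1: every identifying dimension of the resource is on the alarm."""
    alarm_dims = {(d["Name"], d["Value"]) for d in alarm.get("Dimensions", [])}
    for res in resources:
        res_dims = set(res["dimensions"].items())
        if res_dims and res_dims <= alarm_dims:
            return res
    return None

def match_by_name(alarm: dict, resources: list) -> Optional[dict]:
    """Fallback: the resource name appears inside the alarm name."""
    alarm_name = alarm["AlarmName"].lower()
    candidates = [r for r in resources if r["name"].lower() in alarm_name]
    # Prefer the longest name so "orders-api" wins over "orders".
    return max(candidates, key=lambda r: len(r["name"]), default=None)

def match_alarm(alarm: dict, resources: list) -> Optional[tuple]:
    """Return (resource name, strategy label) for the first strategy that hits."""
    strategies = (("exact_dimension", match_exact),
                  ("naming_heuristic", match_by_name))
    for label, strategy in strategies:
        hit = strategy(alarm, resources)
        if hit is not None:
            return hit["name"], label
    return None
```

Recording which strategy produced each match lets a report separate high-confidence matches from heuristic ones.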

It also flags DEGRADED alarms: alarms that exist but won't actually notify anyone, because actions are disabled, no SNS target is configured, or the alarm is stuck in INSUFFICIENT_DATA. The output is a JSON file and a self-contained HTML report you can open in any browser or send to a CTO or auditor unedited.
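
The three DEGRADED conditions can be checked against the fields CloudWatch returns for an alarm. The sketch below assumes each alarm is a dict shaped like one MetricAlarms entry from DescribeAlarms (ActionsEnabled, AlarmActions, StateValue); the function name and reason labels are hypothetical.

```python
def degraded_reasons(alarm: dict) -> list:
    """Return why an alarm would not notify anyone; an empty list means none found.

    `alarm` is assumed to look like one MetricAlarms entry from
    CloudWatch DescribeAlarms.
    """
    reasons = []
    if not alarm.get("ActionsEnabled", True):
        reasons.append("actions_disabled")
    # In this sketch only SNS topic ARNs count as a notification target.
    if not any(":sns:" in arn for arn in alarm.get("AlarmActions", [])):
        reasons.append("no_sns_target")
    if alarm.get("StateValue") == "INSUFFICIENT_DATA":
        reasons.append("insufficient_data")
    return reasons
```

An alarm with a non-empty result exists on paper but is unlikely to page anyone, which is what the DEGRADED label is meant to surface.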


Key features

Read-only — only describe / list AWS calls, no writes ever
Runs entirely on your laptop, no telemetry, no phone-home
Five-strategy alarm-to-resource matching (covers the edge cases)
DEGRADED alarm detection (the ones that look fine but won't page)
Cross-account audits via STS AssumeRole + external ID
--demo mode runs against a synthetic account in ~3 seconds
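
For the cross-account case, the general pattern is to call STS AssumeRole with an external ID and build a session from the temporary credentials. The sketch below shows that pattern only; the function name and session name are assumptions, and it takes the STS client as a parameter rather than showing how the tool itself wires it up.

```python
def assume_audit_role(sts_client, role_arn: str, external_id: str) -> dict:
    """Exchange the caller's identity for short-lived credentials in the target account.

    `sts_client` is expected to behave like boto3.client("sts").
    Returns keyword arguments suitable for boto3.Session(**kwargs).
    """
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName="reliability-audit",  # hypothetical session name
        ExternalId=external_id,  # must match the condition in the role's trust policy
    )
    creds = response["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }
```

With boto3 installed, `boto3.Session(**assume_audit_role(boto3.client("sts"), role_arn, external_id))` would yield a session scoped to the audited account.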

Install

$ pip install opsfabric-discovery && opsfabric-discovery audit --demo

AlarmFabric · Paid · Managed SaaS

Alarm Engineering

Close the gaps the audit found — in one deploy. Creates the missing alarms, wires the targets, at your chosen trust level.

Alarm Engineering (AlarmFabric) reads the audit's JSON output and creates the missing CloudWatch alarms in your account via the audit's role with cloudwatch:PutMetricAlarm added. Every alarm is tagged with its source audit ID, the resource it protects, and the rule it satisfies, so rollback is a tag-filtered lookup plus one DeleteAlarms call.
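
A rough sketch of how provenance tags can make rollback mechanical. The tag keys and function names are hypothetical, not AlarmFabric's real schema. The tag list uses the Key/Value shape that PutMetricAlarm accepts in its Tags parameter, and the batches respect the 100-name limit of a single DeleteAlarms call. Alarm tags are assumed to have been fetched already, since DescribeAlarms does not return them.

```python
# Hypothetical tag keys, chosen for this example only.
TAG_AUDIT_ID = "opsfabric:audit-id"
TAG_RESOURCE = "opsfabric:resource"
TAG_RULE = "opsfabric:rule"

def provenance_tags(audit_id: str, resource_arn: str, rule_id: str) -> list:
    """Build a Tags list in the shape PutMetricAlarm accepts."""
    return [
        {"Key": TAG_AUDIT_ID, "Value": audit_id},
        {"Key": TAG_RESOURCE, "Value": resource_arn},
        {"Key": TAG_RULE, "Value": rule_id},
    ]

def rollback_batches(tagged_alarms: list, audit_id: str, batch_size: int = 100) -> list:
    """Group the names of alarms created by one audit into DeleteAlarms-sized batches.

    `tagged_alarms` is a list of {"AlarmName": ..., "Tags": [...]} dicts.
    """
    wanted = {"Key": TAG_AUDIT_ID, "Value": audit_id}
    names = sorted(a["AlarmName"] for a in tagged_alarms if wanted in a["Tags"])
    return [names[i:i + batch_size] for i in range(0, len(names), batch_size)]
```

Each returned batch could then be passed as AlarmNames to a DeleteAlarms call, leaving alarms from other audits and hand-made alarms untouched.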

Pick the remediation mode that matches your trust level: Copilot proposes alarms you click to create, Autonomous applies the whole pack on a schedule, Human-in-the-loop waits for approval on every PutMetricAlarm. It wires SNS / PagerDuty / Opsgenie targets, re-audits on a schedule, and tells you when resources show up without coverage or alarms drift.
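
One way to picture the three modes is as a gate in front of each alarm creation. The sketch below is an assumption about how such a gate could look, not the product's implementation; the enum, function, and decision labels are all hypothetical.

```python
from enum import Enum

class Mode(Enum):
    COPILOT = "copilot"
    AUTONOMOUS = "autonomous"
    HUMAN_IN_THE_LOOP = "human_in_the_loop"

def decide(mode: Mode, alarm_name: str, approved: set) -> str:
    """Decide what happens to one proposed alarm under the chosen mode.

    Returns "apply", "propose", or "await_approval".
    """
    if mode is Mode.AUTONOMOUS:
        return "apply"  # the whole pack is applied without a human step
    if alarm_name in approved:
        return "apply"  # a human has clicked or approved this alarm
    # Copilot only suggests; Human-in-the-loop holds the write until approval.
    return "propose" if mode is Mode.COPILOT else "await_approval"
```

Keeping the gate a pure function of mode and approvals makes each decision easy to log and audit later.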


Key features

Three remediation modes: Copilot, Autonomous, Human-in-the-loop
Creates missing alarms via your role + PutMetricAlarm
Tagged + reversible — every alarm carries its provenance
SNS / PagerDuty / Opsgenie target wiring
Scheduled re-audits with drift detection between scans
Multi-tenant managed SaaS — one console for every account
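
Drift detection between scans amounts to comparing two audit snapshots. The sketch below assumes each snapshot maps a resource ID to the list of alarm names covering it; that shape and the field names in the result are hypothetical and do not describe the audit's real JSON schema.

```python
def detect_drift(previous: dict, current: dict) -> dict:
    """Compare two snapshots of {resource_id: [alarm names]} and report drift."""
    # Resources that appeared since the last scan and have no alarm at all.
    new_uncovered = sorted(r for r in current if r not in previous and not current[r])
    # Resources that had at least one alarm before and have none now.
    lost_coverage = sorted(r for r in current if previous.get(r) and not current[r])
    # Alarm names present in the last scan that no longer cover anything.
    before = {a for alarms in previous.values() for a in alarms}
    after = {a for alarms in current.values() for a in alarms}
    return {
        "new_uncovered": new_uncovered,
        "lost_coverage": lost_coverage,
        "removed_alarms": sorted(before - after),
    }
```

For example, a queue whose only alarm was deleted between scans shows up under both lost_coverage and removed_alarms, while a newly deployed function with no alarm shows up under new_uncovered.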

OpsFabric · Paid · Multi-cloud SaaS

Incident Operations

Run the incidents end to end — when alarms actually fire. Across AWS or Azure, from page through post-mortem.

Incident Operations is the layer above Alarm Engineering. When an alarm fires on AWS or Azure, it picks up the incident from Slack, CloudWatch, or Azure Monitor, pulls logs and traces, generates a root-cause analysis, and proposes remediations. You choose the trust level: Copilot suggests, Autonomous executes pre-approved playbooks, Human-in-the-loop gates every infrastructure-touching step.

It drives the Jira lifecycle (create, update, transition, close), opens a GitHub PR for app-side fixes when the RCA points at code, and writes the Confluence post-mortem at the end. The whole incident has one correlated thread you can replay later in LangSmith.


Key features

Three remediation modes: Copilot, Autonomous, Human-in-the-loop
Multi-cloud incident response: AWS + Azure today, GCP on the roadmap
Slack-native triage with full incident context inline
Automated RCA from logs + traces + topology
Jira lifecycle + Confluence post-mortems
GitHub PR for app-side fixes when RCA points at code

Open source vs commercial

Feature matrix

The Reliability Audit is a complete audit, not a crippled version of a paid tier. The paid products cover different parts of the reliability loop.

Capability	Reliability Audit (open source)	Alarm Engineering (paid)
Read-only audit	Yes	Yes
Resource discovery (ECS / Lambda / RDS / Aurora / SQS)	Yes	Yes
Five-strategy alarm matching	Yes	Yes
DEGRADED alarm detection	Yes	Yes
Executive PDF + JSON output	Yes	Yes
--demo synthetic walkthrough	Yes	Yes
Cross-account audits via STS AssumeRole	Yes	Yes
Create missing alarms in your account	No	Yes
Tagged + reversible alarm provenance	No	Yes
SNS / PagerDuty / Opsgenie wiring	No	Yes
Scheduled / continuous audits	No	Yes
Drift detection between scans	No	Yes
Auto-cleanup of orphan alarms	No	Yes
Copilot / Autonomous / HIL remediation modes	No	Yes
Multi-cloud (AWS + Azure)	No	No
Slack-based incident triage	No	No
Automated root-cause analysis	No	No
Jira / Confluence incident lifecycle	No	No
GitHub PR for app-side fixes	No	No
Multi-tenant managed SaaS	No	Yes

Start with the audit. Talk to us about the rest.

The Reliability Audit is free forever. Alarm Engineering and Incident Operations are pilot-priced during the first customer cohort.

$ pip install opsfabric-discovery && opsfabric-discovery audit --demo
