The suite

When monitoring fails silently, the rest of your stack fails noisily.

Three fabrics for cloud reliability — built to handle the three categories of monitoring failure that hit most AWS accounts. Each fabric is a multi-agent system in its own right; together they cover the full loop from "what's broken" to "who's been paged."

How it works

Built on THREAD.

Three phases (Audit, Remediate, Respond) woven from the six pillars that spell THREAD: Telemetry, Health, Response, Evolution, Automation, and Defense.

01 · Audit

Map every alarm to the resource it protects across AWS and Azure. Score coverage. Flag DEGRADED alarms before they bite.

Pillars

  • Telemetry

    Unify alarms, metrics, logs, and events into one signal plane.

  • Health

    Continuously score coverage. Catch alarms that exist but won't notify.

02 · Remediate

Close the gaps the audit found. Create the missing alarms via your role, wire SNS / PagerDuty / Opsgenie targets, tag everything for one-call rollback.

Pillars

  • Defense

    Proactively close the gaps. Defense before incident, not just after.

  • Automation

    Safe, gated automation. Tagged + reversible writes only.

03 · Respond

When alarms fire, triage from Slack, run RCA, and execute remediation in Copilot, Autonomous, or Human-in-the-loop (HIL) mode. Drive Jira + Confluence to close the loop.

Pillars

  • Response

    Triage, RCA, remediation. Copilot / Autonomous / HIL — your call.

  • Evolution

    Every incident sharpens playbooks. The fabric learns over time.

DiscoveryFabric · Open source · MIT · on PyPI

Reliability Audit

Find which of your alarms are broken — and which services have no alarm at all. Read-only AWS audit that maps alarms to resources and surfaces the gaps.

The Reliability Audit (open source as DiscoveryFabric) scans your AWS account through AWS Resource Explorer (the resource-explorer-2 API) and maps every CloudWatch alarm to the ECS service, Lambda function, RDS instance, Aurora cluster, or SQS queue it actually protects. Five matching strategies (exact dimension, ALB → ECS target-group bridge, namespace + partial dimension, log-group linkage, and naming heuristic) catch alarms that a simple dimension match would miss.
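To make the layered matching concrete, here is a minimal sketch of two of the five strategies (exact dimension and naming heuristic) tried in order. It is an illustration, not DiscoveryFabric's actual code: the function names, the `id` field on resources, and the first-hit-wins ordering are assumptions. Only the alarm fields `AlarmName` and `Dimensions` follow the shape CloudWatch returns.

```python
def match_exact_dimension(alarm, resources):
    """An alarm dimension value equals a resource's identifier."""
    values = {d["Value"] for d in alarm.get("Dimensions", [])}
    return [r for r in resources if r["id"] in values]


def match_naming_heuristic(alarm, resources):
    """Fallback: the resource identifier appears inside the alarm name."""
    name = alarm["AlarmName"].lower()
    return [r for r in resources if r["id"].lower() in name]


# Hypothetical ordering: most precise strategy first, loosest last.
STRATEGIES = [match_exact_dimension, match_naming_heuristic]


def match_alarm(alarm, resources):
    """Return (strategy_name, matches) for the first strategy that hits."""
    for strategy in STRATEGIES:
        matches = strategy(alarm, resources)
        if matches:
            return strategy.__name__, matches
    return None, []


resources = [
    {"id": "orders-queue", "type": "sqs"},
    {"id": "checkout", "type": "lambda"},
]
alarm_a = {
    "AlarmName": "queue-depth",
    "Dimensions": [{"Name": "QueueName", "Value": "orders-queue"}],
}
alarm_b = {"AlarmName": "prod-checkout-errors", "Dimensions": []}

print(match_alarm(alarm_a, resources)[0])  # match_exact_dimension
print(match_alarm(alarm_b, resources)[0])  # match_naming_heuristic
```

Recording which strategy produced a match is useful in a report: a naming-heuristic hit deserves more scrutiny than an exact dimension hit.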

It also flags DEGRADED alarms: alarms that exist but won't actually notify anyone, because actions are disabled, no SNS target is configured, or the alarm is sitting in the INSUFFICIENT_DATA state. The output is a JSON file and a self-contained HTML report you can open in any browser or send to a CTO or auditor unedited.
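The three DEGRADED conditions above can be sketched as a small check over one alarm record. This is an assumption about how such a check could look, not the tool's implementation: `degraded_reasons` and its reason strings are hypothetical, while `ActionsEnabled`, `AlarmActions`, and `StateValue` are the field names CloudWatch uses in its DescribeAlarms response.

```python
def degraded_reasons(alarm):
    """Return the reasons an alarm would fail to notify; empty list means healthy."""
    reasons = []
    if not alarm.get("ActionsEnabled", True):
        reasons.append("actions disabled")
    # Only SNS topic ARNs count as notification targets in this sketch.
    sns_targets = [arn for arn in alarm.get("AlarmActions", []) if ":sns:" in arn]
    if not sns_targets:
        reasons.append("no SNS target")
    if alarm.get("StateValue") == "INSUFFICIENT_DATA":
        reasons.append("INSUFFICIENT_DATA")
    return reasons


healthy = {
    "AlarmName": "checkout-errors",
    "ActionsEnabled": True,
    "AlarmActions": ["arn:aws:sns:us-east-1:111122223333:oncall"],
    "StateValue": "OK",
}
silent = {
    "AlarmName": "orders-queue-depth",
    "ActionsEnabled": False,
    "AlarmActions": [],
    "StateValue": "INSUFFICIENT_DATA",
}

print(degraded_reasons(healthy))  # []
print(degraded_reasons(silent))   # ['actions disabled', 'no SNS target', 'INSUFFICIENT_DATA']
```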

Key features

  • Read-only — only describe / list AWS calls, no writes ever
  • Runs entirely on your laptop, no telemetry, no phone-home
  • Five-strategy alarm-to-resource matching (covers the edge cases)
  • DEGRADED alarm detection (the ones that look fine but won't page)
  • Cross-account audits via STS AssumeRole + external ID
  • --demo mode runs against a synthetic account in ~3 seconds
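For the cross-account case, a sketch of the STS AssumeRole step with an external ID might look like the following. The helper name, the default session name, and the example role ARN are assumptions for illustration; `assume_role` with `RoleArn`, `RoleSessionName`, and `ExternalId` is the standard boto3 STS call, and the client is passed in so the function itself has no dependency beyond the standard library.

```python
def assume_audit_role(sts_client, role_arn, external_id,
                      session_name="reliability-audit"):
    """Assume a role in the target account and return boto3-style credential kwargs."""
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
        ExternalId=external_id,  # guards against the confused-deputy problem
    )
    creds = response["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }


# Usage with boto3 (not executed here; requires AWS credentials):
#
#   import boto3
#   kwargs = assume_audit_role(
#       boto3.client("sts"),
#       "arn:aws:iam::111122223333:role/example-audit-role",  # hypothetical role
#       "example-external-id",                                # hypothetical ID
#   )
#   cloudwatch = boto3.client("cloudwatch", **kwargs)
```

The returned credentials are temporary, so a long-running audit across many accounts would need to re-assume the role when they expire.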

Install

$ pip install opsfabric-discovery && opsfabric-discovery audit --demo

AlarmFabric · Paid · Managed SaaS

Alarm Engineering

Close the gaps the audit found — in one deploy. Creates the missing alarms, wires the targets, at your chosen trust level.

Alarm Engineering (AlarmFabric) reads the audit's JSON output and creates the missing CloudWatch alarms in your account, using the audit's read-only role extended with PutMetricAlarm permission. Every alarm is tagged with its source audit ID, the resource it protects, and the rule it satisfies, so rollback comes down to a tag-filtered lookup and a DeleteAlarms call.
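As a rough sketch of what tagged, reversible writes could look like: the first function builds the keyword arguments for a CloudWatch `put_metric_alarm` call with provenance tags, and the second picks out the alarms to delete for one audit. The tag keys (`audit-id`, `protects`, `rule`), the fixed metric settings, and both function names are assumptions, not AlarmFabric's actual schema. The batch size reflects the DeleteAlarms limit of 100 alarm names per call.

```python
def build_alarm_request(alarm_name, namespace, metric_name, dimensions,
                        threshold, audit_id, resource_arn, rule_id):
    """Build kwargs for cloudwatch.put_metric_alarm with provenance tags."""
    return {
        "AlarmName": alarm_name,
        "Namespace": namespace,
        "MetricName": metric_name,
        "Dimensions": dimensions,
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "Tags": [  # hypothetical tag keys
            {"Key": "audit-id", "Value": audit_id},
            {"Key": "protects", "Value": resource_arn},
            {"Key": "rule", "Value": rule_id},
        ],
    }


def rollback_batches(tagged_alarms, audit_id, batch_size=100):
    """Group alarm names created by one audit into DeleteAlarms-sized batches.

    tagged_alarms maps alarm name -> {tag key: tag value}.
    """
    names = [name for name, tags in tagged_alarms.items()
             if tags.get("audit-id") == audit_id]
    return [names[i:i + batch_size] for i in range(0, len(names), batch_size)]


tagged = {
    "checkout-errors": {"audit-id": "audit-1"},
    "legacy-cpu": {},  # pre-existing alarm, never touched by rollback
    "orders-queue-depth": {"audit-id": "audit-1"},
}
print(rollback_batches(tagged, "audit-1"))
# [['checkout-errors', 'orders-queue-depth']]
```

Each batch would then be passed as `AlarmNames` to `delete_alarms`. Alarms without the audit tag are left alone, which is the point of carrying provenance on every write.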

Pick the remediation mode that matches your trust level: Copilot proposes alarms you click to create, Autonomous applies the whole pack on a schedule, Human-in-the-loop waits for approval on every PutMetricAlarm. It wires SNS / PagerDuty / Opsgenie targets, re-audits on a schedule, and tells you when resources show up without coverage or alarms drift.
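One way to picture the three modes is as a gate in front of each proposed alarm. The sketch below is an assumption about how such a gate could be modeled, with hypothetical names (`Mode`, `plan`) and status labels. It encodes only what the description says: Autonomous applies everything, Copilot proposes until you click, Human-in-the-loop holds each write for approval.

```python
from enum import Enum


class Mode(Enum):
    COPILOT = "copilot"
    AUTONOMOUS = "autonomous"
    HUMAN_IN_THE_LOOP = "hil"


def plan(mode, proposals, approved=frozenset()):
    """Decide, per proposed alarm name, whether it is applied now or held."""
    decisions = {}
    for name in proposals:
        if mode is Mode.AUTONOMOUS or name in approved:
            decisions[name] = "apply"
        elif mode is Mode.COPILOT:
            decisions[name] = "proposed"           # waits for a click
        else:
            decisions[name] = "awaiting_approval"  # blocks the write
    return decisions


proposals = ["checkout-errors", "orders-queue-depth"]
print(plan(Mode.AUTONOMOUS, proposals))
# {'checkout-errors': 'apply', 'orders-queue-depth': 'apply'}
print(plan(Mode.COPILOT, proposals, approved={"checkout-errors"}))
# {'checkout-errors': 'apply', 'orders-queue-depth': 'proposed'}
```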

Key features

  • Three remediation modes: Copilot, Autonomous, Human-in-the-loop
  • Creates missing alarms via your role + PutMetricAlarm
  • Tagged + reversible — every alarm carries its provenance
  • SNS / PagerDuty / Opsgenie target wiring
  • Scheduled re-audits with drift detection between scans
  • Multi-tenant managed SaaS — one console for every account
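Drift detection between scans can be thought of as a diff of two audit snapshots. The sketch below assumes each snapshot is a mapping from resource ID to the alarm names covering it. That shape, the function name, and the result keys are hypothetical and not the product's actual JSON schema.

```python
def detect_drift(previous, current):
    """Compare two scans: resource id -> list of alarm names covering it."""
    # Resources that appeared since the last scan and have no alarm at all.
    new_uncovered = sorted(
        r for r, alarms in current.items() if r not in previous and not alarms
    )
    # Alarms that covered a resource last time but are gone now.
    removed = {}
    for r in current:
        if r in previous:
            lost = sorted(set(previous[r]) - set(current[r]))
            if lost:
                removed[r] = lost
    return {"new_uncovered_resources": new_uncovered, "removed_alarms": removed}


previous = {"orders-queue": ["queue-depth"], "checkout": ["checkout-errors"]}
current = {"orders-queue": [], "checkout": ["checkout-errors"], "billing-db": []}

print(detect_drift(previous, current))
# {'new_uncovered_resources': ['billing-db'], 'removed_alarms': {'orders-queue': ['queue-depth']}}
```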

OpsFabric · Paid · Multi-cloud SaaS

Incident Operations

Run the incidents end to end — when alarms actually fire. Across AWS or Azure, from page through post-mortem.

Incident Operations is the layer above Alarm Engineering. When an alarm fires on AWS or Azure, it picks up the incident from Slack, CloudWatch, or Azure Monitor, pulls logs and traces, generates a root-cause analysis, and proposes remediations. You choose the trust level: Copilot suggests, Autonomous executes pre-approved playbooks, Human-in-the-loop gates every infrastructure-touching step.

It drives the Jira lifecycle (create, update, transition, close), opens a GitHub PR for app-side fixes when the RCA points at code, and writes the Confluence post-mortem at the end. The whole incident has one correlated thread you can replay later in LangSmith.

Key features

  • Three remediation modes: Copilot, Autonomous, Human-in-the-loop
  • Multi-cloud incident response: AWS + Azure today, GCP on the roadmap
  • Slack-native triage with full incident context inline
  • Automated RCA from logs + traces + topology
  • Jira lifecycle + Confluence post-mortems
  • GitHub PR for app-side fixes when RCA points at code

Open source vs commercial

Feature matrix

The Reliability Audit does the audit well. The paid products are not the audit with its limits removed; they cover different parts of the reliability loop.

Capability · Reliability Audit (Open source) · Alarm Engineering (Paid) · Incident Operations (Paid)

  • Read-only audit
  • Resource discovery (ECS / Lambda / RDS / Aurora / SQS)
  • Five-strategy alarm matching
  • DEGRADED alarm detection
  • Executive PDF + JSON output
  • --demo synthetic walkthrough
  • Cross-account audits via STS AssumeRole
  • Create missing alarms in your account
  • Tagged + reversible alarm provenance
  • SNS / PagerDuty / Opsgenie wiring
  • Scheduled / continuous audits
  • Drift detection between scans
  • Auto-cleanup of orphan alarms
  • Copilot / Autonomous / HIL remediation modes
  • Multi-cloud (AWS + Azure)
  • Slack-based incident triage
  • Automated root-cause analysis
  • Jira / Confluence incident lifecycle
  • GitHub PR for app-side fixes
  • Multi-tenant managed SaaS

Start with the audit. Talk to us about the rest.

The Reliability Audit is free forever. Alarm Engineering and Incident Operations are pilot-priced during the first customer cohort.

$ pip install opsfabric-discovery && opsfabric-discovery audit --demo