Agentic ops · multi-cloud

Cloud reliability, run by agents — from audit to incident close.

Agents audit what your monitoring stack misses, close the gaps, and run incidents end-to-end across AWS and Azure — building an institutional memory so your reliability compounds over time instead of walking out the door with your senior engineers. Cloud reliability decays in five quiet ways; we close every one.

$ pip install opsfabric-discovery

~3 seconds against a synthetic AWS account. No credentials needed.

opsfabric-discovery · audit · demo
$ opsfabric-discovery audit --demo
Loading synthetic dataset (no AWS calls)…
Region: ca-central-1 · Profile: demo
Discovered 47 resources across ECS, Lambda, RDS, Aurora, SQS
Mapped alarms via 5 strategies → 112 alarms matched
Detected 9 DEGRADED alarms (actions disabled / no SNS target)
✓ Audit complete. Coverage 71% · 14 gaps
→ wrote ./audit.json
→ wrote ./audit-demo.html (executive report, 87 KB)
$

How reliability decays

Five quiet ways cloud reliability decays in production.

Downtime cost

Every hour of downtime is revenue you don't recover.

The cost

Mid-market SaaS bleeds $5k–$50k per hour of downtime. And you usually find out from a customer — by which point the churn email, the CSM scramble, and the tweet are already in motion.

What OpsFabric does

OpsFabric audits every alarm in your account and finds the ones that look healthy but won't notify anyone: alarms with actions disabled or no SNS target. Detection moves in-house, so you find out in minutes rather than from a customer ticket.
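To make the "looks healthy but won't notify anyone" check concrete, here is a minimal sketch in Python. The field names (`AlarmName`, `ActionsEnabled`, `AlarmActions`) follow the CloudWatch DescribeAlarms response; the rule itself, the helper names, and the sample data are illustrative assumptions, not OpsFabric's actual detection logic.

```python
from typing import Iterable


def is_sns_arn(arn: str) -> bool:
    """True if the ARN's service field is SNS (arn:partition:service:...)."""
    parts = arn.split(":")
    return len(parts) > 2 and parts[0] == "arn" and parts[2] == "sns"


def find_silent_alarms(alarms: Iterable[dict]) -> list[tuple[str, str]]:
    """Return (alarm name, reason) for alarms that cannot notify anyone.

    Assumed rule: an alarm is silent if its actions are disabled, or if
    none of its alarm actions points at an SNS topic.
    """
    silent = []
    for alarm in alarms:
        name = alarm["AlarmName"]
        if not alarm.get("ActionsEnabled", True):
            silent.append((name, "actions disabled"))
        elif not any(is_sns_arn(arn) for arn in alarm.get("AlarmActions", [])):
            silent.append((name, "no SNS target"))
    return silent


# Hypothetical sample data. Against a real account, dicts of this shape come
# from boto3.client("cloudwatch").get_paginator("describe_alarms").
TOPIC = "arn:aws:sns:ca-central-1:123456789012:oncall"
sample_alarms = [
    {"AlarmName": "api-5xx", "ActionsEnabled": True, "AlarmActions": [TOPIC]},
    {"AlarmName": "db-cpu", "ActionsEnabled": False, "AlarmActions": [TOPIC]},
    {"AlarmName": "queue-depth", "ActionsEnabled": True, "AlarmActions": []},
]

for name, reason in find_silent_alarms(sample_alarms):
    print(f"DEGRADED {name}: {reason}")
```

Note that this rule also flags an alarm whose only action is, say, an auto scaling policy; whether that counts as a gap is a policy choice.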

Reliability Audit → Alarm Engineering (closes the gaps)

Reporting

There's no single number to report your reliability posture.

The cost

A 12-tab dashboard isn't a coverage score. Right now you have screenshots and a story you tell from memory.

What OpsFabric does

A self-contained HTML report with one coverage score, exposure ranked by business impact, and a trend line per quarter. Opens in any browser. Hand it upstairs unedited.

Reliability Audit (the executive report)

Drift

New services keep shipping without monitoring.

The cost

The org gets bigger, services multiply, and nobody can hold every team to a checklist. You only notice the gap during the outage.

What OpsFabric does

OpsFabric runs the audit on a schedule. Every new service that ships without alarms is flagged the next day. Coverage stops drifting.
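One way to picture the scheduled check is as a set difference between consecutive runs. The sketch below is an assumption about how such a comparison could work, not a description of OpsFabric's internals; the function names and resource identifiers are hypothetical.

```python
def uncovered(resources: set[str], alarmed: set[str]) -> set[str]:
    """Resources that have no alarm mapped to them."""
    return resources - alarmed


def new_gaps(previous_gaps: set[str], current_gaps: set[str]) -> list[str]:
    """Gaps that appeared since the previous run, sorted for stable output."""
    return sorted(current_gaps - previous_gaps)


# Hypothetical snapshots from two consecutive scheduled runs.
yesterday = uncovered(
    resources={"ecs/checkout", "rds/orders", "sqs/emails"},
    alarmed={"ecs/checkout", "rds/orders"},
)
today = uncovered(
    resources={"ecs/checkout", "rds/orders", "sqs/emails", "lambda/resize"},
    alarmed={"ecs/checkout", "rds/orders"},
)

for resource in new_gaps(yesterday, today):
    print(f"NEW GAP {resource}: shipped without alarms")
```

Comparing against the previous run, rather than reporting every gap each time, keeps the daily signal limited to what changed.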

Reliability Audit (scheduled) → Alarm Engineering (close the new gaps)

Engineering time

Outages drag engineers off product work for days.

The cost

A senior SRE costs $250k a year. Three of them on a P2 for 48 hours is roughly $17k in payroll at a standard 2,080-hour work year, plus a sprint your roadmap won't recover.
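The payroll figure depends on how an annual salary is converted to an hourly rate, so it is worth running with your own numbers. A minimal sketch, assuming 2,080 paid hours a year (40 hours × 52 weeks); that default and the function name are assumptions for illustration, and the result ignores benefits, overhead, and opportunity cost.

```python
def incident_payroll_cost(
    annual_salary: float,
    engineers: int,
    incident_hours: float,
    paid_hours_per_year: float = 2080,  # assumption: 40 h/week * 52 weeks
) -> float:
    """Direct payroll cost of engineers working an incident."""
    hourly_rate = annual_salary / paid_hours_per_year
    return hourly_rate * engineers * incident_hours


cost = incident_payroll_cost(250_000, engineers=3, incident_hours=48)
print(f"Payroll cost: ${cost:,.0f}")
```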

What OpsFabric does

Routine incidents (scale events, rollbacks, disk-full, memory leaks) are closed by agents. Copilot proposes a fix for you to run; Autonomous runs pre-approved playbooks end-to-end. Your engineers stay on the roadmap.

Incident Operations (multi-cloud, agentic)

AI in production

AI initiatives stall at the slide deck.

The cost

Pilots prove nothing if they don't ship. Most “AI transformation” stories don't survive a real outcome question.

What OpsFabric does

OpsFabric is AI-in-production with measurable outcomes — agents handling alarms, RCA, and remediation, with a Jira lifecycle you can audit. Bring receipts to the next strategy review.

Incident Operations (the AI-in-production story)

Five pains. One platform. Here's how each one closes — step by step.

Three remediation modes

Copilot, Autonomous, and Human-in-the-loop.

Same fabric. Different trust levels. Start in Human-in-the-loop (HIL) on day one, graduate to Autonomous once you trust the playbooks.

01

Copilot

AI proposes the fix inline in Slack with full context: logs, RCA, and a suggested AWS or Azure CLI command. You decide when to run it.

AI suggests  ·  you execute

02

Autonomous

Pre-approved playbooks run end-to-end. Alarms created, services scaled, deployments rolled back — no human in the loop. For patterns the fabric has learned from past incidents and you've validated.

AI suggests  ·  AI executes

03

Human-in-the-loop

Every infrastructure-touching step waits for your approval. The fabric does the analysis and proposes the action; you sign off before anything runs.

AI suggests  ·  you approve each step
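The three trust levels can be read as a small decision gate in front of every infrastructure-touching step. The sketch below is a hypothetical model of that gate, not OpsFabric's implementation: the `Mode` enum, the function name, and the fallback for playbooks that are not pre-approved are all assumptions.

```python
from enum import Enum


class Mode(Enum):
    COPILOT = "copilot"
    AUTONOMOUS = "autonomous"
    HUMAN_IN_THE_LOOP = "hil"


def next_action(mode: Mode, playbook_pre_approved: bool, step_approved: bool) -> str:
    """Decide what the agent may do with a proposed remediation step."""
    if mode is Mode.COPILOT:
        # AI suggests, the human runs the command themselves.
        return "suggest"
    if mode is Mode.AUTONOMOUS:
        # Only pre-approved playbooks run unattended. Falling back to a
        # suggestion otherwise is an assumption made for this sketch.
        return "execute" if playbook_pre_approved else "suggest"
    # Human-in-the-loop: every step waits for explicit sign-off.
    return "execute" if step_approved else "wait_for_approval"


print(next_action(Mode.HUMAN_IN_THE_LOOP, playbook_pre_approved=True, step_approved=False))
print(next_action(Mode.AUTONOMOUS, playbook_pre_approved=True, step_approved=False))
```

Keeping the gate as one pure function makes the trust policy easy to audit and to test separately from the code that actually touches infrastructure.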

The flywheel

Every other tool forgets. This one compounds.

Three fabrics do the work. Each incident they resolve feeds the memory at the core — so the whole platform gets sharper every week.

The work · three acting fabrics

01

Audit

Find the coverage gaps

02

Remediate

Close them automatically

03

Respond

Run the incident end-to-end

grounded by ↓

The core · KnowledgeFabric

Institutional Memory

Authored

Your existing SOPs, runbooks & ownership. Trusted by default.

Earned

Playbooks the agents write from resolved incidents. Graded before one runs.

writes back ↑ every incident

Month one, it cites your docs. By month six, it's citing playbooks it wrote itself. Outages stop being lessons you pay for and forget.

Threads through your stack

Weaves together the tools you already use.

OpsFabric isn't a tool you add to your stack. It's the layer that connects what you already have — turning alarms, tickets, and chat into one coherent incident lifecycle.

Azure AWS Slack GitHub Confluence Jira PagerDuty GCP

Don't see your tool? Ask us about it. We add integrations based on customer requests.

The output

One command. One report your CTO can read.

The audit produces a JSON file you can pipe into anything, and a self-contained HTML report you can open in any browser or send to a CTO or auditor unedited. No dashboard required.

$ pip install opsfabric-discovery && opsfabric-discovery audit --demo

~3 seconds against a synthetic account · no AWS credentials needed

Open the sample report

Why we built this

I lived all five of those.

Every team I joined had at least three of those five, usually all five. Hundreds of monitoring alarms, and nobody knew which ones actually worked. New services launched without anyone setting one up. Engineers stuck on a P2 for two days while the roadmap slipped. Leadership wanted a reliability number we couldn't produce.

The Reliability Audit is the tool I wish I'd had on day one. It's free (open source as DiscoveryFabric) because nobody should pay to learn their alarms are broken. Alarm Engineering and Incident Operations are how I make a living, and how the rest of the work gets done after the audit hands you the gaps.

— Vaishal, building OpsFabric

Read-only by default

The audit only reads your cloud. Nothing in your account changes unless you choose the paid platform to fix it.

Runs on your laptop

No telemetry, no phone-home. Your AWS data goes to your terminal and nowhere else.

MIT — fork it, ship it

Use it however you want. The audit shouldn't cost anything. The platform is where we make money.

See your gaps in 3 seconds. We close the hard ones together.

The free audit answers the first of the five: which of your alarms actually work. The rest of the platform handles the other four. No credit card to start.

$ pip install opsfabric-discovery && opsfabric-discovery audit --demo

Book a demo