OpsFabric · Reliability Audit
2026-05-22 00:35 UTC
Sample audit — synthetic data. This report was rendered from a baked-in fixture; no AWS resources were inspected. Run opsfabric-discovery audit --profile <your-profile> --regions all against your AWS account to see real coverage.

Reliability audit

Reliability monitoring coverage

AWS account 123456789012 (blazecommerce) · scanned 2026-05-22 00:35 UTC

How to read this report

1. What was checked

Every service this AWS account is running — applications, background jobs, databases, queues — and the alarms configured to detect their failures.

2. What was found

Where alarms are missing or broken. A missing alarm means a failure will go undetected until customers report it. A broken alarm means you've already paid for monitoring that isn't actually working.

3. What to do

See "Where the business is exposed" for the categories of damage, then "Recommended next steps" for the actions to take this week.

The risk

13 resources are undefended.

35 of 47 required checks are missing. The team is flying without instruments on these workloads.

Of the 35 required gaps: 33 we can auto-create · 2 already exist but aren't wired.

First-audit baseline. Most accounts score in this range before monitoring has been configured systematically: there is little monitoring debt yet because there is little monitoring yet. Engineering teams typically reach the 90% industry baseline within a single AlarmFabric engagement; see Recommended next steps for the order of operations.

  • 35 monitoring gaps: checks that should exist but don't
  • 13 services at risk: have at least one missing alarm
  • 9 critical-impact gaps: cause customer-visible outages
  • 2 broken alarms: exist but won't actually notify

Where the business is exposed

35 technical gaps grouped into the categories of damage they create. The per-check engineering detail follows the summary.

13 resources affected

Workloads that will fail silently

These services and jobs have no failure detection. Customers will report broken features or missing data before your on-call team sees them.

Time to detect: Risk: Brand, churn, SLA misses

Including

  • payments-worker Application service
  • search-api Application service
  • events-router Background job / serverless function
  • daily-report-cron Background job / serverless function
  • + 9 more

2 resources affected

Alarms you've paid for that won't actually notify

These alarms exist but are disabled, missing their notification target, or stuck in INSUFFICIENT_DATA. They will not page anyone when something fails.

Time to detect: Risk: False sense of safety

Including

  • daily-report-cron Background job / serverless function
  • audit-log-shipper Background job / serverless function

Every gap, grouped first by category of damage, then by the service it affects, then by the specific check. Use this view to assign work to engineers — the same data lives in alarm-coverage-missing.json.

Workloads that will fail silently

No alarm exists. Failures will go undetected until someone reports them.

13 services · 33 checks
audit-log-shipper Background job / serverless function us-east-1 URGENT

1,180,000 invocations in the last 30 days — no error alarm configured to catch failures.

  • Lambda Errors P1

    Errors mean failed customer requests or dropped events. Without this alarm, outages go undetected until users complain.

  • Lambda ConcurrentExecutions high P2

    Approaching the regional concurrency ceiling causes throttling without warning. This alarm is the only early warning.

  • Lambda Duration high P2

    Slow invocations rack up cost and risk hitting timeout cliffs that fail entire requests.
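
As an illustration of what closing the Lambda Errors gap could involve, the sketch below builds the parameters for a CloudWatch `PutMetricAlarm` call. The namespace, metric name, and dimension are the standard ones for Lambda; the alarm name, the 5-minute period, the threshold of one error, and the SNS topic ARN are assumptions for illustration, not values taken from this audit or from AlarmFabric.

```python
from typing import Any


def lambda_errors_alarm_params(function_name: str, sns_topic_arn: str) -> dict[str, Any]:
    """Build keyword arguments for a CloudWatch PutMetricAlarm call on Lambda Errors."""
    return {
        "AlarmName": f"{function_name}-errors",  # hypothetical naming scheme
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 300,  # assumed: 5-minute evaluation window
        "EvaluationPeriods": 1,
        "Threshold": 1,  # assumed: alarm on any error in the window
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no datapoints is treated as healthy
        "ActionsEnabled": True,
        "AlarmActions": [sns_topic_arn],  # without a target, nobody is notified
    }


params = lambda_errors_alarm_params(
    "audit-log-shipper",
    "arn:aws:sns:us-east-1:123456789012:on-call",  # hypothetical topic ARN
)
print(params["AlarmName"], params["MetricName"])
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)`. A real threshold should be tuned to the function's normal error volume.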

daily-report-cron Background job / serverless function us-east-1 URGENT

320,000 invocations in the last 30 days, with no throttle alarm configured and an error alarm that exists but cannot notify.

  • Lambda Throttles P1

    Throttles mean Lambda hit the concurrency ceiling and dropped invocations — silent data loss for async workflows.

  • Lambda ConcurrentExecutions high P2

    Approaching the regional concurrency ceiling causes throttling without warning. This alarm is the only early warning.

  • Lambda Duration high P2

    Slow invocations rack up cost and risk hitting timeout cliffs that fail entire requests.

events-buffer Message queue us-east-1 WATCH

2,140,000 messages sent in the last 30 days on a queue with missing required alarms.

  • SQS visible message count high P2

    A backlog of visible messages means producers are outpacing consumers. Without an alarm, the team only finds out when latency-sensitive downstream systems start failing or when SQS retention quietly drops messages.

events-buffer-dlq Message queue us-east-1 URGENT

The oldest message in the queue is 192.0 hours (8 days) old, and no age-of-oldest alarm is configured. A stuck consumer will not page anyone.

  • SQS oldest message age high P1

    A growing oldest-message age means consumers are dead, slow, or misconfigured — the queue is silently absorbing work the system is supposed to be doing. This is the canonical 'something downstream broke' signal that infra-level CPU alarms can't see.

  • SQS visible message count high P2

    A backlog of visible messages means producers are outpacing consumers. Without an alarm, the team only finds out when latency-sensitive downstream systems start failing or when SQS retention quietly drops messages.
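
A minimal sketch of an age-of-oldest-message alarm for this queue follows. `ApproximateAgeOfOldestMessage` is the standard SQS metric, reported in seconds; the one-hour threshold, the alarm name, and the SNS topic ARN are assumptions chosen for illustration. The sketch also converts the observed 192.0 hours to seconds to show how far past such a threshold the queue already is.

```python
from typing import Any


def hours_to_seconds(hours: float) -> int:
    """Convert hours to whole seconds, the unit CloudWatch uses for this metric."""
    return int(hours * 3600)


def sqs_oldest_age_alarm_params(
    queue_name: str, sns_topic_arn: str, max_age_seconds: int
) -> dict[str, Any]:
    """Build keyword arguments for a PutMetricAlarm call on oldest-message age."""
    return {
        "AlarmName": f"{queue_name}-oldest-message-age",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,  # assumed: 5-minute evaluation window
        "EvaluationPeriods": 1,
        "Threshold": max_age_seconds,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


threshold = hours_to_seconds(1)  # assumed: page once a message is over an hour old
observed = hours_to_seconds(192.0)  # age reported in this audit
params = sqs_oldest_age_alarm_params(
    "events-buffer-dlq",
    "arn:aws:sns:us-east-1:123456789012:on-call",  # hypothetical topic ARN
    threshold,
)
print(observed, observed > params["Threshold"])
```

For a dead-letter queue, some teams prefer to alarm on any visible message at all; the age-based threshold here is only one reasonable choice.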

events-router Background job / serverless function us-east-1 URGENT

58,200 invocations in the last 30 days on a function with missing required alarms.

  • Lambda ConcurrentExecutions high P2

    Approaching the regional concurrency ceiling causes throttling without warning. This alarm is the only early warning.

  • Lambda Duration high P2

    Slow invocations rack up cost and risk hitting timeout cliffs that fail entire requests.

image-resizer Background job / serverless function us-west-2 URGENT

18 errors out of 2,412,300 invocations in the last 30 days — a 0.001% error rate — no error alarm configured.

  • Lambda Errors P1

    Errors mean failed customer requests or dropped events. Without this alarm, outages go undetected until users complain.

  • Lambda Throttles P1

    Throttles mean Lambda hit the concurrency ceiling and dropped invocations — silent data loss for async workflows.

  • Lambda ConcurrentExecutions high P2

    Approaching the regional concurrency ceiling causes throttling without warning. This alarm is the only early warning.

  • Lambda Duration high P2

    Slow invocations rack up cost and risk hitting timeout cliffs that fail entire requests.
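
The error rate quoted for image-resizer can be reproduced from the two counts stated above. The short calculation below uses only those figures and shows that 0.001% is the value rounded to three decimal places.

```python
errors = 18
invocations = 2_412_300

# Error rate as a percentage of all invocations in the 30-day window.
error_rate_pct = 100 * errors / invocations

print(f"{error_rate_pct:.3f}%")  # 0.001%, the rounding used in the report
print(f"{error_rate_pct:.5f}%")  # 0.00075%, the same rate with more precision
```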

orders-db-prod Database us-east-1 URGENT

Database peaked at 91% CPU in the last 30 days — no CPU alarm configured.

  • RDS DatabaseConnections high P2

    Approaching max_connections causes new requests to fail to connect entirely — full outage signal.

  • RDS CPU high P2

    Sustained DB CPU pressure cascades into request queueing across every service that depends on this database.

  • RDS FreeableMemory low P2

    Memory pressure forces the DB to hit disk for queries it should serve from cache, multiplying latency.

payments-buffer Message queue us-east-1 WATCH

380,000 messages sent in the last 30 days on a queue with missing required alarms.

  • SQS visible message count high P2

    A backlog of visible messages means producers are outpacing consumers. Without an alarm, the team only finds out when latency-sensitive downstream systems start failing or when SQS retention quietly drops messages.

payments-worker Application service us-east-1 WATCH

Peak CPU reached 65% in the last 30 days — no CPU alarm configured.

  • ECS Service CPU high P2

    Sustained CPU saturation degrades request latency and can cascade into 5xx errors before autoscaling catches up.

  • ECS Service Memory high P2

    Memory pressure leads to OOM-killed tasks; without an alarm, services fail silently between health checks.

search-api Application service us-west-2 URGENT

Active service with no task-count alarm — failures to keep the desired task count running will go undetected.

  • ECS RunningTaskCount below DesiredTaskCount P1

    Tasks failing to stay running means lost capacity. Without this alarm, partial outages persist until a customer reports them.

  • ECS Service CPU high P2

    Sustained CPU saturation degrades request latency and can cascade into 5xx errors before autoscaling catches up.

sessions-cluster Database cluster us-east-1 WATCH

Active database (peak 220 connections) with missing required alarms.

  • Aurora cluster FreeLocalStorage low P1

    Aurora local storage exhaustion causes write failures. Recovery requires emergency scaling under load.

  • Aurora cluster DatabaseConnections high P2

    Approaching max_connections at the cluster level causes new requests to fail — full outage signal.

  • Aurora cluster FreeableMemory low P2

    Memory pressure forces the DB to hit disk for queries it should serve from cache, multiplying latency.

users-db-prod Database us-east-1 URGENT

Active database with no free-storage alarm. Storage exhaustion is hard to recover from and will go undetected until queries fail.

  • RDS FreeStorageSpace low P1

    Running out of storage takes the database read-only or fully offline. Recovery requires emergency scaling under load.

  • RDS DatabaseConnections high P2

    Approaching max_connections causes new requests to fail to connect entirely — full outage signal.

  • RDS CPU high P2

    Sustained DB CPU pressure cascades into request queueing across every service that depends on this database.

  • RDS FreeableMemory low P2

    Memory pressure forces the DB to hit disk for queries it should serve from cache, multiplying latency.

webhook-handler Background job / serverless function eu-west-1 URGENT

845,700 invocations in the last 30 days — no error alarm configured to catch failures.

  • Lambda Errors P1

    Errors mean failed customer requests or dropped events. Without this alarm, outages go undetected until users complain.

  • Lambda ConcurrentExecutions high P2

    Approaching the regional concurrency ceiling causes throttling without warning. This alarm is the only early warning.

  • Lambda Duration high P2

    Slow invocations rack up cost and risk hitting timeout cliffs that fail entire requests.

Alarms you've paid for that won't actually notify

Alarm exists but is disabled, missing its target, or not receiving data.

2 services · 2 checks
audit-log-shipper Background job / serverless function us-east-1 URGENT

1,180,000 invocations in the last 30 days. The throttle alarm exists but is stuck in INSUFFICIENT_DATA, and no error alarm is configured.

  • Lambda Throttles P1 in INSUFFICIENT_DATA — not receiving the metric, cannot fire

    Throttles mean Lambda hit the concurrency ceiling and dropped invocations — silent data loss for async workflows.

daily-report-cron Background job / serverless function us-east-1 URGENT

320,000 invocations in the last 30 days. The error alarm exists, but its actions are disabled, so failures will not page anyone.

  • Lambda Errors P1 actions disabled — no one will be paged

    Errors mean failed customer requests or dropped events. Without this alarm, outages go undetected until users complain.
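
The three failure modes named in this section (actions disabled, no notification target, stuck in INSUFFICIENT_DATA) can be checked against the fields that CloudWatch's `DescribeAlarms` response returns for each alarm. The sketch below is one possible check, not the audit's actual logic; the two sample alarm records and their names are hypothetical and only mirror the findings above.

```python
def broken_reasons(alarm: dict) -> list[str]:
    """Return the reasons an existing alarm would fail to notify anyone."""
    reasons = []
    if not alarm.get("ActionsEnabled", True):
        reasons.append("actions disabled")
    if not alarm.get("AlarmActions"):
        reasons.append("no notification target")
    if alarm.get("StateValue") == "INSUFFICIENT_DATA":
        reasons.append("not receiving data")
    return reasons


# Hypothetical records shaped like entries in DescribeAlarms["MetricAlarms"].
alarms = [
    {
        "AlarmName": "daily-report-cron-errors",
        "ActionsEnabled": False,
        "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:on-call"],
        "StateValue": "OK",
    },
    {
        "AlarmName": "audit-log-shipper-throttles",
        "ActionsEnabled": True,
        "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:on-call"],
        "StateValue": "INSUFFICIENT_DATA",
    },
]

report = {a["AlarmName"]: broken_reasons(a) for a in alarms}
print(report)
```

INSUFFICIENT_DATA can also be a transient state for a newly created alarm, so a production check would likely consider how long the alarm has been in that state.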

Coverage by service type

Each row is a category of your infrastructure. Coverage = the percentage of standard monitoring checks (service up/down, error rates, capacity) that have working alarms today. Gap = what's missing. Anything under 80% is a known incident class.

Resource type | Resources | Required | Met | Coverage | Gap
Application service | 3 | 9 | 5 | 55.6% | 44.4%
Background job / serverless function | 5 | 20 | 3 | 15.0% | 85.0%
Database cluster | 1 | 4 | 1 | 25.0% | 75.0%
Database | 2 | 8 | 1 | 12.5% | 87.5%
Message queue | 3 | 6 | 2 | 33.3% | 66.7%
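
The coverage figures can be reproduced from the Required and Met columns. The sketch below assumes coverage is simply met checks divided by required checks, which matches every row in the table as well as the account-wide totals of 47 required checks and 35 gaps.

```python
# (resource type, resources, required checks, met checks), copied from the table.
rows = [
    ("Application service", 3, 9, 5),
    ("Background job / serverless function", 5, 20, 3),
    ("Database cluster", 1, 4, 1),
    ("Database", 2, 8, 1),
    ("Message queue", 3, 6, 2),
]


def coverage_pct(required: int, met: int) -> float:
    """Percentage of required checks that have working alarms."""
    return round(100 * met / required, 1)


per_type = {name: coverage_pct(required, met) for name, _, required, met in rows}

total_required = sum(required for _, _, required, _ in rows)
total_met = sum(met for _, _, _, met in rows)
total_gaps = total_required - total_met
overall = coverage_pct(total_required, total_met)

print(per_type)
print(total_required, total_met, total_gaps, overall)  # 47 12 35 25.5
```

Rounded to a whole number, the overall 25.5% is the 26% shown as the account score.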

How you compare

Industry baseline = the typical coverage we see across mid-market cloud engineering teams. Your account = the percentage of standard checks that have working alarms today. A 90+ score means most failures will be detected automatically; below 60 means most failures will be reported by customers first.

Industry baseline: 90%
Your account: 26%

Recommended next steps

In priority order. The first action is the highest-leverage thing to do this week; the last is the engagement that closes the loop.

  1. This week: fix the 2 alarms that already exist but are broken (disabled, missing notification target, or receiving no data). Your team already authored these — they just aren't wired correctly. AlarmFabric fixes them in one deploy.
  2. This week: close the 33 standard monitoring gaps where no judgment call is needed. These are off-the-shelf checks (service up/down, error rates, capacity) that don't need custom thresholds. AlarmFabric implements the entire pack through a role that is read-only apart from permission to create alarms.
  3. Engage: have AlarmFabric close these gaps in your account within 24 hours. Your team grants a read-only role with permission to create alarms — no broader access, no application code changes, fully reversible in one command.

What happens next

Two paid fabrics close this loop.

DiscoveryFabric is the free audit you just read. AlarmFabric and OpsFabric are the paid multi-agent fabrics that act on what it found.

AlarmFabric · paid

Close these 35 monitoring gaps in your account. One deploy.

  • We deploy the missing alarms using a role you grant that is read-only apart from permission to create alarms; no broader access is required
  • Each alarm is connected to your team's on-call tool (Slack, PagerDuty, Opsgenie, email) so it actually notifies someone
  • Every alarm is reversible — your team can remove the entire set with one command if needed
  • We re-run the audit on a schedule so newly-deployed services don't quietly drift below the bar

OpsFabric · paid

When the alarms fire, we run the incident — end to end.

  • Your team gets the incident in Slack with the failing service, recent logs, and the most likely cause already analyzed
  • Suggested fix arrives at the same time — your team approves it, or you let it run automatically once you've built trust
  • The ticket is created and tracked in Jira; the post-mortem is drafted in Confluence — no after-hours documentation work
  • Three trust levels: AI suggests / your team approves / fully autonomous — your call, per category of incident

Pilot pricing during the first customer cohort [email protected]