audit-log-shipper
Background job / serverless function
us-east-1
URGENT
1,180,000 invocations in the last 30 days — no error alarm configured to catch failures.
-
Lambda Errors
P1
Errors mean failed customer requests or dropped events. Without this alarm, outages go undetected until users complain.
-
Lambda ConcurrentExecutions high
P2
Approaching the regional concurrency ceiling causes throttling without warning. This alarm is the only early warning.
-
Lambda Duration high
P2
Slow invocations rack up cost and risk hitting timeout cliffs that fail entire requests.
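As an illustration of closing the Lambda Errors gap on audit-log-shipper, the sketch below builds the keyword arguments for CloudWatch's put_metric_alarm call. The alarm name, the threshold of one error, and the 5-minute period are assumptions chosen for the example, not values from this report; tune them to the function's normal error volume.

```python
def lambda_errors_alarm(function_name: str, threshold: int = 1, period_s: int = 300) -> dict:
    """Build put_metric_alarm keyword arguments for the Lambda Errors metric."""
    return {
        "AlarmName": f"{function_name}-errors",  # hypothetical naming scheme
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",                      # total errors per period
        "Period": period_s,
        "EvaluationPeriods": 1,
        "Threshold": threshold,                  # assumed: alarm on any error
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",      # no invocations is not a failure
    }


params = lambda_errors_alarm("audit-log-shipper")

# To create the alarm (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)
```

Building the arguments separately from the API call keeps the definition reviewable and testable without AWS access.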
daily-report-cron
Background job / serverless function
us-east-1
URGENT
320,000 invocations in the last 30 days — no error alarm configured to catch failures.
-
Lambda Throttles
P1
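Throttles mean Lambda hit a concurrency limit and rejected invocations. Async events that are still throttled when their retry window ends are discarded unless a dead-letter queue or failure destination is configured, which is silent data loss.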
-
Lambda ConcurrentExecutions high
P2
Approaching the regional concurrency ceiling causes throttling without warning. This alarm is the only early warning.
-
Lambda Duration high
P2
Slow invocations rack up cost and risk hitting timeout cliffs that fail entire requests.
events-buffer
Message queue
us-east-1
WATCH
2,140,000 messages sent in the last 30 days on a queue with missing required alarms.
-
SQS visible message count high
P2
A backlog of visible messages means producers are outpacing consumers. Without an alarm, the team only finds out when latency-sensitive downstream systems start failing or when SQS retention quietly drops messages.
events-buffer-dlq
Message queue
us-east-1
URGENT
The oldest message in the queue is 192.0 hours old (8 days), and no age-of-oldest alarm is configured. A stuck consumer will not page.
-
SQS oldest message age high
P1
A growing oldest-message age means consumers are dead, slow, or misconfigured — the queue is silently absorbing work the system is supposed to be doing. This is the canonical 'something downstream broke' signal that infra-level CPU alarms can't see.
-
SQS visible message count high
P2
A backlog of visible messages means producers are outpacing consumers. Without an alarm, the team only finds out when latency-sensitive downstream systems start failing or when SQS retention quietly drops messages.
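The oldest-message-age gap on events-buffer-dlq can be sketched as follows. SQS reports ApproximateAgeOfOldestMessage in seconds, so the 192.0-hour observation has to be converted before it can be compared with a threshold. The 1-hour threshold and the alarm name are assumptions for illustration only.

```python
def hours_to_seconds(hours: float) -> int:
    """Convert hours to whole seconds (the unit SQS uses for message age)."""
    return int(hours * 3600)


def sqs_oldest_age_alarm(queue_name: str, max_age_hours: float) -> dict:
    """Build put_metric_alarm keyword arguments for oldest-message age."""
    return {
        "AlarmName": f"{queue_name}-oldest-message-age",  # hypothetical name
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": hours_to_seconds(max_age_hours),
        "ComparisonOperator": "GreaterThanThreshold",
    }


observed_age_s = hours_to_seconds(192.0)                # age reported above
params = sqs_oldest_age_alarm("events-buffer-dlq", 1)   # assumed 1-hour limit
already_breaching = observed_age_s > params["Threshold"]
```

With these assumed values the alarm would already be firing, since 192.0 hours is far beyond a 1-hour limit.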
events-router
Background job / serverless function
us-east-1
URGENT
58,200 invocations in the last 30 days on a function with missing required alarms.
-
Lambda ConcurrentExecutions high
P2
Approaching the regional concurrency ceiling causes throttling without warning. This alarm is the only early warning.
-
Lambda Duration high
P2
Slow invocations rack up cost and risk hitting timeout cliffs that fail entire requests.
image-resizer
Background job / serverless function
us-west-2
URGENT
18 errors out of 2,412,300 invocations in the last 30 days — a 0.001% error rate — no error alarm configured.
-
Lambda Errors
P1
Errors mean failed customer requests or dropped events. Without this alarm, outages go undetected until users complain.
-
Lambda Throttles
P1
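Throttles mean Lambda hit a concurrency limit and rejected invocations. Async events that are still throttled when their retry window ends are discarded unless a dead-letter queue or failure destination is configured, which is silent data loss.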
-
Lambda ConcurrentExecutions high
P2
Approaching the regional concurrency ceiling causes throttling without warning. This alarm is the only early warning.
-
Lambda Duration high
P2
Slow invocations rack up cost and risk hitting timeout cliffs that fail entire requests.
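The error rate quoted for image-resizer can be checked with a short calculation. The sketch below recomputes it from the two counts in the finding; the function name is illustrative.

```python
def error_rate_percent(errors: int, invocations: int) -> float:
    """Return errors as a percentage of invocations (0.0 when there is no traffic)."""
    if invocations == 0:
        return 0.0
    return 100.0 * errors / invocations


rate = error_rate_percent(18, 2_412_300)
rounded = round(rate, 3)  # three decimal places, as in the report
```

The unrounded value is roughly 0.00075%, which rounds to the 0.001% shown above. A rate this low argues for an alarm on a sudden rise in the error count rather than on the long-run percentage.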
orders-db-prod
Database
us-east-1
URGENT
Database peaked at 91% CPU in the last 30 days — no CPU alarm configured.
-
RDS DatabaseConnections high
P2
Reaching max_connections causes new connections to be refused, which is a full outage for any client that does not already hold a connection. The alarm should fire while there is still headroom.
-
RDS CPU high
P2
Sustained DB CPU pressure cascades into request queueing across every service that depends on this database.
-
RDS FreeableMemory low
P2
Memory pressure forces the DB to hit disk for queries it should serve from cache, multiplying latency.
payments-buffer
Message queue
us-east-1
WATCH
380,000 messages sent in the last 30 days on a queue with missing required alarms.
-
SQS visible message count high
P2
A backlog of visible messages means producers are outpacing consumers. Without an alarm, the team only finds out when latency-sensitive downstream systems start failing or when SQS retention quietly drops messages.
payments-worker
Application service
us-east-1
WATCH
Peak CPU reached 65% in the last 30 days — no CPU alarm configured.
-
ECS Service CPU high
P2
Sustained CPU saturation degrades request latency and can cascade into 5xx errors before autoscaling catches up.
-
ECS Service Memory high
P2
Memory pressure leads to OOM-killed tasks; without an alarm, services fail silently between health checks.
search-api
Application service
us-west-2
URGENT
Active service with no task-count alarm — failures to keep the desired task count running will go undetected.
-
ECS RunningTaskCount below DesiredTaskCount
P1
Tasks failing to stay running means lost capacity. Without this alarm, partial outages persist until a customer reports them.
-
ECS Service CPU high
P2
Sustained CPU saturation degrades request latency and can cascade into 5xx errors before autoscaling catches up.
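One way to express "RunningTaskCount below DesiredTaskCount" for search-api is a metric math alarm on the difference between the two counts. The sketch below assumes Container Insights is enabled on the cluster (that is where these two metrics are published) and uses a hypothetical cluster name, since the report does not give one.

```python
def ecs_task_shortfall_alarm(cluster: str, service: str, period_s: int = 60) -> dict:
    """Build put_metric_alarm arguments that fire when running < desired."""
    dims = [
        {"Name": "ClusterName", "Value": cluster},
        {"Name": "ServiceName", "Value": service},
    ]

    def stat(metric_id: str, metric_name: str) -> dict:
        # Input series for the expression; not alarmed on directly.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "ECS/ContainerInsights",
                    "MetricName": metric_name,
                    "Dimensions": dims,
                },
                "Period": period_s,
                "Stat": "Average",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"{service}-running-below-desired",  # hypothetical name
        "Metrics": [
            stat("desired", "DesiredTaskCount"),
            stat("running", "RunningTaskCount"),
            {
                "Id": "shortfall",
                "Expression": "desired - running",
                "Label": "Missing tasks",
                "ReturnData": True,  # the alarm evaluates this series
            },
        ],
        "EvaluationPeriods": 3,      # assumed: tolerate brief deploy churn
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
    }


# "prod-cluster" is a placeholder; substitute the real cluster name.
params = ecs_task_shortfall_alarm("prod-cluster", "search-api")
```

Requiring three consecutive periods is a guess at how long a normal deployment leaves the service short of tasks; adjust it to match real rollout times.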
sessions-cluster
Database cluster
us-east-1
WATCH
Active database (peak 220 connections) with missing required alarms.
-
Aurora cluster FreeLocalStorage low
P1
Aurora local storage exhaustion causes write failures. Recovery requires emergency scaling under load.
-
Aurora cluster DatabaseConnections high
P2
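Reaching max_connections at the cluster level causes new connections to be refused, which is a full outage for any client that does not already hold a connection. The alarm should fire while there is still headroom.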
-
Aurora cluster FreeableMemory low
P2
Memory pressure forces the DB to hit disk for queries it should serve from cache, multiplying latency.
users-db-prod
Database
us-east-1
URGENT
Active database with no free-storage alarm. Storage exhaustion takes the database read-only or offline and will go undetected until queries fail.
-
RDS FreeStorageSpace low
P1
Running out of storage takes the database read-only or fully offline. Recovery requires emergency scaling under load.
-
RDS DatabaseConnections high
P2
Reaching max_connections causes new connections to be refused, which is a full outage for any client that does not already hold a connection. The alarm should fire while there is still headroom.
-
RDS CPU high
P2
Sustained DB CPU pressure cascades into request queueing across every service that depends on this database.
-
RDS FreeableMemory low
P2
Memory pressure forces the DB to hit disk for queries it should serve from cache, multiplying latency.
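For the FreeStorageSpace gap on users-db-prod, note that RDS reports the metric in bytes, so a percentage target has to be converted using the instance's allocated storage. The sketch below assumes 100 GiB allocated and a 10% floor; neither figure comes from this report.

```python
GIB = 1024 ** 3  # bytes per GiB


def rds_free_storage_alarm(db_instance_id: str, allocated_gib: int,
                           min_free_percent: int = 10) -> dict:
    """Build put_metric_alarm arguments for low free storage on an RDS instance."""
    # Integer arithmetic keeps the byte threshold exact.
    threshold_bytes = allocated_gib * GIB * min_free_percent // 100
    return {
        "AlarmName": f"{db_instance_id}-free-storage-low",  # hypothetical name
        "Namespace": "AWS/RDS",
        "MetricName": "FreeStorageSpace",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Minimum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": threshold_bytes,
        "ComparisonOperator": "LessThanThreshold",
    }


# allocated_gib=100 is an assumed size; read the real value from the instance.
params = rds_free_storage_alarm("users-db-prod", allocated_gib=100)
```

A fixed percentage is only a starting point. On a fast-growing database, a threshold derived from the growth rate (for example, the space consumed in a few days) gives more useful lead time.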
webhook-handler
Background job / serverless function
eu-west-1
URGENT
845,700 invocations in the last 30 days — no error alarm configured to catch failures.
-
Lambda Errors
P1
Errors mean failed customer requests or dropped events. Without this alarm, outages go undetected until users complain.
-
Lambda ConcurrentExecutions high
P2
Approaching the regional concurrency ceiling causes throttling without warning. This alarm is the only early warning.
-
Lambda Duration high
P2
Slow invocations rack up cost and risk hitting timeout cliffs that fail entire requests.
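A Duration alarm is most useful when its threshold is tied to the function's configured timeout. The sketch below alarms when p99 duration exceeds a percentage of that timeout. The 30-second timeout and the 80% margin are assumptions for illustration; the report does not state the timeout for webhook-handler.

```python
def lambda_duration_alarm(function_name: str, timeout_s: int,
                          percent_of_timeout: int = 80) -> dict:
    """Build put_metric_alarm arguments for p99 Duration nearing the timeout."""
    # Lambda reports Duration in milliseconds.
    threshold_ms = timeout_s * 1000 * percent_of_timeout // 100
    return {
        "AlarmName": f"{function_name}-duration-high",  # hypothetical name
        "Namespace": "AWS/Lambda",
        "MetricName": "Duration",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "ExtendedStatistic": "p99",  # percentile, used in place of Statistic
        "Period": 300,
        "EvaluationPeriods": 3,
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


# timeout_s=30 is an assumed value; use the function's actual timeout.
params = lambda_duration_alarm("webhook-handler", timeout_s=30)
```

Alarming on a percentile rather than the average keeps a handful of slow invocations from hiding behind many fast ones.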