What should be monitored in a SaaS application?

The signals that actually matter for SaaS reliability — and how to avoid alert fatigue.

C—02 · Cloud Production Care · By ThinkByAI Engineering · 6 min read

Monitoring everything is the same as monitoring nothing. This article focuses on the signals that genuinely predict and detect problems in a SaaS application — and how to keep alerts meaningful.

The four golden signals

When teams ask what to monitor, the honest answer is to start with four signals and resist the urge to add more until those are solid. Latency: how long requests take. Traffic: how much demand the system is handling. Errors: the rate of requests that fail. Saturation: how full your resources are — CPU, memory, connections, disk. These four describe the health of almost any service in language a non-specialist can follow.
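To make the four definitions concrete, here is a minimal sketch that reduces one window of request records to the four signals. The `Request` record, the `golden_signals` function, and the choice of a connection pool as the saturation measure are illustrative assumptions, not part of any particular monitoring product. Counting only 5xx responses as errors is also an assumption; your service may define failure differently.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took
    status: int         # HTTP status code returned


def golden_signals(requests, window_seconds, pool_in_use, pool_size):
    """Summarize one time window into the four golden signals."""
    if not requests:
        raise ValueError("need at least one request in the window")

    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile, computed with integer arithmetic.
    rank = -(-95 * len(durations) // 100)  # ceil(0.95 * n)
    p95 = durations[rank - 1]

    # Assumption: only server-side failures (5xx) count as errors.
    failed = sum(1 for r in requests if r.status >= 500)

    return {
        "latency_p95_ms": p95,
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": failed / len(requests),
        "saturation": pool_in_use / pool_size,
    }
```

A percentile is used for latency rather than a mean because a mean hides the slow tail that users actually notice.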

The appeal of the golden signals is that they stay close to what users experience. A customer never sees your CPU graph, but they feel latency and errors directly, and traffic and saturation tell you whether demand or a resource limit is behind what they feel. By watching these four first, you anchor your monitoring to what actually affects the people paying you, rather than to whichever metric happened to be easy to collect.

Application errors and traces

The golden signals tell you something is wrong; application errors and traces tell you what. You want every unhandled exception captured with enough context — the request, the user action, the stack — to reproduce it without guesswork. An aggregated error rate hides the one bug hitting ten percent of signups; grouped, deduplicated error reporting surfaces it.
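One way to picture grouped, deduplicated error reporting is to give each exception a fingerprint and count occurrences per fingerprint. The sketch below is a simplified assumption about how grouping can work, not a description of any specific error tracker. `fingerprint` and `group_errors` are hypothetical names.

```python
import hashlib
import traceback
from collections import Counter


def fingerprint(exc):
    """Build a grouping key from the exception type and its call path.

    Messages and line numbers are deliberately left out, so the same bug
    triggered by different user input still lands in the same group.
    """
    frames = traceback.extract_tb(exc.__traceback__)
    parts = [type(exc).__name__] + [frame.name for frame in frames]
    return hashlib.sha1("|".join(parts).encode()).hexdigest()[:12]


def group_errors(exceptions):
    """Count captured exceptions per fingerprint."""
    return Counter(fingerprint(exc) for exc in exceptions)
```

Sorting the resulting counter by count shows which single bug accounts for most failures, which a single aggregated error rate cannot tell you.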

Tracing connects the dots across services. When a single user action passes through several components, a trace shows you where the time went and where the failure started, instead of leaving you to correlate timestamps by hand. For a small team, this is the difference between resolving an incident in minutes and spending an afternoon spelunking through logs.
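As a toy model of "where the time went", the sketch below takes the spans of one trace and computes how much time each span spent on its own work, excluding its direct children. The `Span` shape and `self_times` are assumptions for illustration; real tracing systems use span IDs and richer metadata. The sketch also assumes span names are unique within the trace and that sibling spans do not overlap in time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    name: str               # assumed unique within one trace
    parent: Optional[str]   # name of the parent span, None for the root
    start_ms: float
    end_ms: float


def self_times(spans):
    """Return time spent in each span, excluding its direct children."""
    totals = {s.name: s.end_ms - s.start_ms for s in spans}
    for s in spans:
        if s.parent is not None:
            # Subtract the child's duration from its parent's total.
            totals[s.parent] -= s.end_ms - s.start_ms
    return totals
```

The span with the largest self time is usually the first place to look during an incident.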

Database health

The database is usually the first thing to buckle under load, and it is often the least watched. Keep an eye on connection counts, because exhausting the pool takes the whole application down even when every server is healthy. Watch query latency and the slowest queries, since a single unindexed query can quietly drag everything down with it as your data grows.
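Both checks in this paragraph can be sketched in a few lines, assuming you already collect query timings as `(query_name, duration_ms)` pairs and can read the pool's current usage. The function names and the 0.8 warning ratio are illustrative assumptions, not recommended values.

```python
from collections import defaultdict


def slowest_queries(samples, top=3):
    """Rank queries by mean latency.

    `samples` is an iterable of (query_name, duration_ms) pairs.
    Returns a list of (query_name, mean_ms, call_count), slowest first.
    """
    durations = defaultdict(list)
    for query, ms in samples:
        durations[query].append(ms)
    ranked = sorted(
        ((q, sum(v) / len(v), len(v)) for q, v in durations.items()),
        key=lambda row: row[1],
        reverse=True,
    )
    return ranked[:top]


def pool_nearly_exhausted(in_use, pool_size, warn_ratio=0.8):
    """True when pool usage has reached the warning ratio (assumed 0.8)."""
    return in_use / pool_size >= warn_ratio
```

Warning before the pool is completely full is the point: at 100 percent, new requests are already queuing or failing.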

Replication lag, disk usage, and cache hit rates round out the picture. These are not signals you check daily; they are signals you alert on, so that a database approaching its limits warns you while you still have time to act. The pattern to avoid is discovering your storage is full at the moment writes start failing.

Infrastructure and capacity

Underneath the application sits the compute, storage, and network it runs on, and these have hard ceilings. Track CPU and memory across your servers, disk space on anything that writes data, and the saturation of any managed component you depend on. The aim is to see a resource trending toward its limit days before it gets there.

Capacity monitoring is what lets you scale on purpose rather than in a panic. When you can see usage climbing steadily, adding capacity is a calm planned change. When you cannot, your first warning is an outage, and your first response is improvised. The cost of watching is trivial; the cost of not watching is measured in downtime.
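Seeing a resource trend toward its limit can be as simple as fitting a straight line to recent daily readings. The sketch below assumes one usage sample per day and roughly linear growth, which will not hold for every workload. `days_until_full` is a hypothetical helper name.

```python
def days_until_full(daily_usage_gb, capacity_gb):
    """Estimate days until capacity is reached, using a least-squares line.

    `daily_usage_gb` holds one reading per day, oldest first.
    Returns None when usage is flat or shrinking.
    """
    n = len(daily_usage_gb)
    if n < 2:
        raise ValueError("need at least two daily readings")

    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage_gb) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage_gb))
    slope = sxy / sxx  # growth in GB per day

    if slope <= 0:
        return None
    remaining = capacity_gb - daily_usage_gb[-1]
    return max(0.0, remaining / slope)
```

For example, readings of 100, 102, 104, 106, and 108 GB against a 128 GB disk grow at 2 GB per day, leaving about 10 days of headroom.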

Designing alerts people trust

Monitoring everything is the same as monitoring nothing, because an alert that fires constantly is an alert people learn to ignore. The discipline is to alert only on conditions that are both real and actionable — something is genuinely wrong, and a human needs to do something about it now. Everything else belongs on a dashboard, not in a pager.

A few rules keep alerts trustworthy and keep your team responsive when one fires:

  • Alert on symptoms users feel — errors and latency — before alerting on internal causes.
  • Set thresholds against normal behavior, not arbitrary round numbers, to cut false alarms.
  • Make every alert carry a clear next step, so the person paged knows what to check first.
  • Route low-urgency signals to a channel for review, and reserve paging for things that cannot wait.
  • Delete or tune any alert that has fired repeatedly without ever requiring action.
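The rule about setting thresholds against normal behavior can be sketched as a baseline computed from recent history, with paging reserved for sustained breaches. The multiplier `k=3.0` and the requirement of three consecutive samples are illustrative assumptions to tune against your own data, and the function names are hypothetical.

```python
from statistics import mean, stdev


def baseline_threshold(history, k=3.0):
    """Threshold derived from normal behavior: mean plus k standard deviations."""
    return mean(history) + k * stdev(history)


def should_page(history, recent, k=3.0, sustained=3):
    """Page only when the last `sustained` samples all exceed the baseline.

    `history` is a list of values from a period considered normal.
    `recent` is a list of the newest samples, oldest first.
    """
    if len(recent) < sustained:
        return False
    limit = baseline_threshold(history, k)
    return all(value > limit for value in recent[-sustained:])
```

Requiring several consecutive breaches trades a little detection speed for far fewer false alarms from a single noisy sample.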