Monitoring Node.js Apps with Prometheus and Grafana

After flying blind in production one too many times, I set up Prometheus and Grafana for our Node.js services. It took some trial and error. Here's what the setup actually looks like once it's working.

Here's a fun story. Last year we deployed a Node.js service that passed every test, ran great in staging, and promptly ate 2GB of memory in production over the course of six hours. Nobody noticed until the OOM killer took it out at 3am. No alerts, no dashboards, nothing. Just a PagerDuty call from our infra team asking why the container kept restarting.

That's what "running without monitoring" actually looks like. Not some hypothetical risk -- a real, middle-of-the-night incident that could have been caught with a single memory gauge and a threshold alert. If you're running Node.js in production without Prometheus and Grafana (or something equivalent), you're not being brave. You're being reckless.

So let's fix that. I'm going to walk you through the entire setup -- from instrumenting your Node.js app, to configuring Prometheus scraping, to building Grafana dashboards that actually tell you something useful, to alert rules that will save your sleep.

Why Prometheus and Grafana (And Not Something Else)

There are a lot of monitoring options out there. Datadog, New Relic, AWS CloudWatch, Elastic APM. They're all fine. But Prometheus + Grafana has become the default open-source monitoring stack for good reason:

  • Pull-based architecture. Your app exposes a /metrics endpoint. Prometheus scrapes it. Your app doesn't need to know where Prometheus lives, doesn't need outbound network access, and doesn't need to handle failed pushes. This is a genuinely better model for most setups.
  • It's free. Not "free tier with gotchas" free. Actually free. This matters a lot when you're shipping 50 microservices and Datadog wants to charge you per host per month.
  • PromQL is powerful. Once you learn it, you can answer questions like "what's my p99 latency for POST requests to /api/orders, broken down by status code, over the last hour?" in a single query.
  • Kubernetes-native. If you're on k8s, Prometheus service discovery just works. Annotate your pod, and it starts getting scraped automatically.

The tradeoff? You host it yourself. You're responsible for storage, retention, and HA. For small teams, something like Grafana Cloud's free tier can bridge that gap. But if you're already running infrastructure, adding Prometheus is not a big ask.

Setting Up prom-client (The Part Everyone Gets Right)

The prom-client library is the official Prometheus client for Node.js. Installation is the easy part:

npm install prom-client

Create a metrics module. This is the foundation everything else builds on:

// src/metrics.js
const client = require('prom-client');

const register = new client.Registry();

// Default metrics: event loop lag, memory, CPU, GC duration, active handles.
// Note: `prefix` is prepended to every default metric name -- even the ones
// that already start with nodejs_ -- so defaults come out as nodejs_nodejs_*.
// Slightly ugly, but it also keeps them from colliding with the custom
// nodejs_eventloop_lag_seconds gauge we define later.
client.collectDefaultMetrics({
  register,
  prefix: 'nodejs_',
  gcDurationBuckets: [0.001, 0.01, 0.1, 1, 2, 5]
});

module.exports = { client, register };

That collectDefaultMetrics call gives you a surprising amount for free -- heap usage, event loop lag, GC pauses, active handles. Don't skip it. I've seen people write custom versions of metrics that prom-client already collects out of the box.

Now expose the metrics endpoint. If you're using Express:

// src/app.js
const express = require('express');
const { register } = require('./metrics');

const app = express();

app.get('/metrics', async (req, res) => {
  try {
    res.set('Content-Type', register.contentType);
    res.end(await register.metrics());
  } catch (err) {
    res.status(500).end(err.message);
  }
});

app.listen(3000, () => {
  console.log('Server running on port 3000');
});

Hit http://localhost:3000/metrics and you'll see a wall of text in Prometheus exposition format. Every metric has a HELP line and a TYPE line. If that endpoint works, you're 30% of the way there. Seriously -- most of the value comes from what you do next.
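
If you've never seen the exposition format, here's an abridged example of what comes back (the values are illustrative, and the exact names depend on your prom-client version and on whether you set a prefix in collectDefaultMetrics -- these are shown unprefixed):

```
# HELP process_cpu_user_seconds_total Total user CPU time spent in seconds.
# TYPE process_cpu_user_seconds_total counter
process_cpu_user_seconds_total 1.42

# HELP nodejs_heap_size_used_bytes Process heap size used from Node.js in bytes.
# TYPE nodejs_heap_size_used_bytes gauge
nodejs_heap_size_used_bytes 18874368
```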

Custom Metrics: This Is Where It Actually Matters

Default metrics tell you about the Node.js runtime. That's important, but it doesn't tell you if your users are having a bad time. For that, you need custom application metrics. Prometheus gives you four types, and you need to understand when to use each one.

Counters only go up. Total requests, total errors, total jobs processed. The name should end with _total by convention:

const { client, register } = require('./metrics');

const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register]
});

// In your middleware
app.use((req, res, next) => {
  res.on('finish', () => {
    httpRequestsTotal.inc({
      method: req.method,
      route: req.route ? req.route.path : req.path,
      status_code: res.statusCode
    });
  });
  next();
});

A word on labels: they're incredibly useful for slicing data, but they're also a trap. Every unique combination of label values creates a separate time series in Prometheus. If you add a label for userId, you just created a time series for every user. Your Prometheus storage will explode and your queries will crawl. Stick to low-cardinality labels: HTTP method, route pattern (not the actual URL with IDs), status code, environment. That's it.
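
One way to keep route labels low-cardinality when req.route isn't available (404s, requests that never matched a router) is to normalize the raw path before using it as a label value. This is a hypothetical helper, not part of prom-client -- adjust the patterns to your own URL scheme:

```javascript
// Collapse high-cardinality path segments (UUIDs, numeric IDs) into stable
// placeholders so each logical route produces exactly one time series.
function normalizeRoute(path) {
  return path
    .replace(/\/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, '/:uuid')
    .replace(/\/\d+(?=\/|$)/g, '/:id');
}

// normalizeRoute('/api/orders/12345')  -> '/api/orders/:id'
// normalizeRoute('/users/42/orders/7') -> '/users/:id/orders/:id'
```

Drop it into the middleware as `route: req.route ? req.route.path : normalizeRoute(req.path)` and unmatched paths stop multiplying your series count.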

Gauges go up and down. Current active connections, queue depth, cache size:

const activeConnections = new client.Gauge({
  name: 'http_active_connections',
  help: 'Number of active HTTP connections',
  registers: [register]
});

const queueSize = new client.Gauge({
  name: 'job_queue_size',
  help: 'Current number of jobs in the processing queue',
  registers: [register]
});

app.use((req, res, next) => {
  activeConnections.inc();
  res.on('finish', () => activeConnections.dec());
  next();
});

// Poll your queue periodically
setInterval(async () => {
  const size = await jobQueue.count();
  queueSize.set(size);
}, 5000);

Histograms are what you want for latency. They bucket observed values so you can calculate percentiles later with PromQL. This is the single most important custom metric you'll add:

const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [register]
});

app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    end({
      method: req.method,
      route: req.route ? req.route.path : req.path,
      status_code: res.statusCode
    });
  });
  next();
});

Those bucket boundaries matter. The defaults (0.005 to 10 seconds) work for a typical web API, but if you're running something that does sub-millisecond work or something that routinely takes 30 seconds, adjust them. If all your requests fall into one bucket, your percentile calculations will be useless.
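
Rather than hand-picking boundaries, you can generate them -- prom-client ships client.linearBuckets and client.exponentialBuckets for exactly this. Here's the exponential version sketched as a plain function so you can see what it produces:

```javascript
// Equivalent in spirit to prom-client's client.exponentialBuckets(start, factor, count):
// `count` boundaries starting at `start`, each one `factor` times the previous.
function exponentialBuckets(start, factor, count) {
  const buckets = [];
  for (let i = 0; i < count; i++) {
    buckets.push(start * Math.pow(factor, i));
  }
  return buckets;
}

// For a service whose requests land roughly between 50ms and 6s:
// exponentialBuckets(0.05, 2, 8) -> [0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4]
```

Exponential spacing is usually the right call for latency: you want fine resolution near your typical response time and coarse resolution in the long tail.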

Summaries calculate quantiles on the client side. Honestly? Use histograms unless you have a specific reason not to. Summaries can't be aggregated across instances, which makes them nearly useless in a multi-replica deployment. The only time I reach for summaries is when I need exact percentiles from a single instance and I know I'll never scale horizontally:

const dbQueryDuration = new client.Summary({
  name: 'db_query_duration_seconds',
  help: 'Duration of database queries in seconds',
  labelNames: ['operation', 'collection'],
  percentiles: [0.5, 0.9, 0.95, 0.99],
  maxAgeSeconds: 300,
  ageBuckets: 5,
  registers: [register]
});

async function findUser(id) {
  const end = dbQueryDuration.startTimer({ operation: 'find', collection: 'users' });
  try {
    return await User.findById(id);
  } finally {
    end();
  }
}

Monitoring the Event Loop, Memory, and CPU (The Stuff That Bites You)

The event loop is the heart of Node.js. When it lags, your entire application becomes unresponsive. Not slow -- unresponsive. Every connected client waits. This is the metric that would have caught our memory leak at 3am instead of the OOM killer.

The default metrics from prom-client already track event loop lag, but I like adding a more granular custom measurement:

const eventLoopLag = new client.Gauge({
  name: 'nodejs_eventloop_lag_seconds',
  help: 'Event loop lag in seconds',
  registers: [register]
});

const eventLoopLagHistogram = new client.Histogram({
  name: 'nodejs_eventloop_lag_duration_seconds',
  help: 'Event loop lag distribution in seconds',
  buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1],
  registers: [register]
});

function measureEventLoopLag() {
  const start = process.hrtime.bigint();
  setImmediate(() => {
    const lag = Number(process.hrtime.bigint() - start) / 1e9;
    eventLoopLag.set(lag);
    eventLoopLagHistogram.observe(lag);
    setTimeout(measureEventLoopLag, 1000);
  });
}

measureEventLoopLag();

The trick here is simple: schedule something with setImmediate, then measure how long it actually took to run. In a healthy app, that's under a millisecond. If it's consistently above 50ms, something is blocking your event loop and you need to investigate. Above 100ms? You have an active incident whether you know it or not.

For memory, track all the different types. RSS, heap total, heap used, external, array buffers. The relationship between these tells a story:

const memoryUsage = new client.Gauge({
  name: 'nodejs_memory_usage_bytes',
  help: 'Memory usage in bytes by type',
  labelNames: ['type'],
  registers: [register]
});

setInterval(() => {
  const mem = process.memoryUsage();
  memoryUsage.set({ type: 'rss' }, mem.rss);
  memoryUsage.set({ type: 'heapTotal' }, mem.heapTotal);
  memoryUsage.set({ type: 'heapUsed' }, mem.heapUsed);
  memoryUsage.set({ type: 'external' }, mem.external);
  memoryUsage.set({ type: 'arrayBuffers' }, mem.arrayBuffers);
}, 5000);

If heapUsed keeps climbing and never comes back down after GC cycles, you have a memory leak. If external is growing, you're probably leaking Buffers. If rss is high but heapUsed is normal, you might have native addon issues or fragmentation. These distinctions matter when you're debugging at 3am.
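
If you want an in-process early warning before the Prometheus alerting below is wired up, a crude heuristic is to keep a rolling window of heapUsed samples and flag sustained growth. This is a simplified sketch -- the window size is arbitrary, and it's no substitute for a real alert:

```javascript
// Track the last N heapUsed samples; report a suspected leak when every
// sample in the window is strictly larger than the one before it.
// Healthy heaps dip after GC cycles, so unbroken growth is suspicious.
function makeLeakDetector(windowSize = 12) {
  const samples = [];
  return function record(heapUsedBytes) {
    samples.push(heapUsedBytes);
    if (samples.length > windowSize) samples.shift();
    if (samples.length < windowSize) return false;
    return samples.every((v, i) => i === 0 || v > samples[i - 1]);
  };
}

// Wire it into the same 5-second sampling loop used above:
// const suspectLeak = makeLeakDetector();
// if (suspectLeak(process.memoryUsage().heapUsed)) console.warn('heapUsed climbing');
```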

CPU tracking rounds it out:

const cpuUsage = new client.Gauge({
  name: 'nodejs_cpu_usage_percentage',
  help: 'CPU usage percentage',
  labelNames: ['type'],
  registers: [register]
});

let previousCpuUsage = process.cpuUsage();
let previousTime = Date.now();

setInterval(() => {
  // process.cpuUsage(prev) returns the delta since `prev`, in microseconds
  const currentCpuUsage = process.cpuUsage(previousCpuUsage);
  const currentTime = Date.now();
  // Wall-clock elapsed time, converted from ms to microseconds so it's
  // in the same unit as the CPU figures
  const elapsedMicros = (currentTime - previousTime) * 1000;

  cpuUsage.set({ type: 'user' }, (currentCpuUsage.user / elapsedMicros) * 100);
  cpuUsage.set({ type: 'system' }, (currentCpuUsage.system / elapsedMicros) * 100);

  previousCpuUsage = process.cpuUsage();
  previousTime = currentTime;
}, 5000);

Configuring Prometheus to Actually Scrape Your App

Your app is emitting metrics. Now you need Prometheus to come get them. Here's a prometheus.yml that covers the common cases:

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'nodejs-app'
    metrics_path: '/metrics'
    scrape_interval: 10s
    static_configs:
      - targets:
          - 'app-server-1:3000'
          - 'app-server-2:3000'
          - 'app-server-3:3000'
        labels:
          environment: 'production'
          service: 'myapp'

  - job_name: 'nodejs-app-staging'
    metrics_path: '/metrics'
    scrape_interval: 30s
    static_configs:
      - targets:
          - 'staging-server:3000'
        labels:
          environment: 'staging'
          service: 'myapp'

A few things people get wrong here:

  • Scrape interval too short. I've seen teams set 1-second scrape intervals "for accuracy." Congratulations, you've turned your monitoring into a DDoS attack on your own app. 10-15 seconds is fine for almost everything.
  • No labels on targets. Always add environment and service labels. Future-you will be grateful when you're trying to filter production from staging in a Grafana dashboard.
  • Scrape timeout >= scrape interval. Prometheus will refuse to load a config where scrape_timeout exceeds scrape_interval, and any scrape that hits the timeout counts as a failed scrape -- which shows up as gaps in your graphs. Keep the timeout well below the interval.

If you're on Kubernetes, forget static configs entirely. Use service discovery:

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port, __meta_kubernetes_pod_ip]
        action: replace
        target_label: __address__
        regex: (\d+);(([A-Fa-f0-9]*:)*[A-Fa-f0-9]+|\d+\.\d+\.\d+\.\d+)
        replacement: $2:$1

Now just add prometheus.io/scrape: "true" as a pod annotation and Prometheus picks it up automatically. No config changes, no restarts. This is the correct way to do it in k8s.
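
With that relabel config in place, the pod side is just three annotations on the pod template (the port and path here match the example app above):

```yaml
# deployment.yaml (pod template excerpt)
template:
  metadata:
    annotations:
      prometheus.io/scrape: "true"
      prometheus.io/port: "3000"
      prometheus.io/path: "/metrics"
```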

Spin up Prometheus with Docker to get going quickly:

docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  -v "$(pwd)/alert_rules.yml:/etc/prometheus/alert_rules.yml" \
  prom/prometheus

(The $(pwd) matters: docker run wants absolute host paths for bind mounts.)

Building Grafana Dashboards That Actually Help

Grafana is where metrics become useful. But I need to be honest: most Grafana dashboards I've seen in the wild are terrible. They're either empty (someone set them up and never added panels) or they're covered in 47 graphs that nobody looks at. Good dashboards answer specific questions at a glance.

First, get Grafana running:

docker run -d \
  --name grafana \
  -p 3001:3000 \
  -e GF_SECURITY_ADMIN_PASSWORD=your_secure_password \
  grafana/grafana

Add Prometheus as a data source (Configuration -> Data Sources -> add Prometheus pointing to http://prometheus:9090), then build your dashboard around these core PromQL queries.

Request rate (how busy are we?):

rate(http_requests_total{service="myapp"}[5m])

Error rate as a percentage (are users having a bad time?):

sum(rate(http_requests_total{service="myapp", status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="myapp"}[5m])) * 100

Latency percentiles (how slow is slow?):

histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service="myapp"}[5m])) by (le))
histogram_quantile(0.90, sum(rate(http_request_duration_seconds_bucket{service="myapp"}[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="myapp"}[5m])) by (le))

Memory and event loop:

nodejs_memory_usage_bytes{service="myapp", type="heapUsed"}
nodejs_eventloop_lag_seconds{service="myapp"}
http_active_connections{service="myapp"}

Here's how I organize dashboards that people actually use: top row has the three things you check first -- request rate, error rate, p99 latency. Big stat panels, green/yellow/red thresholds. Second row has resource metrics -- memory, CPU, event loop lag. Time series graphs so you can see trends. Third row has the detailed breakdowns -- per-route latency, error counts by endpoint, queue depths.

Use Grafana's template variables to add an environment dropdown and an instance dropdown. That way one dashboard serves staging and production, and you can drill into individual instances when something looks wrong.
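
For the variables themselves, Grafana's label_values function against the Prometheus data source does the work. Two variable queries along these lines (the variable names are up to you) cover the dropdowns above:

```
environment: label_values(http_requests_total{service="myapp"}, environment)
instance:    label_values(http_requests_total{environment="$environment"}, instance)
```

Then filter every panel query with {environment="$environment", instance=~"$instance"} and the same dashboard serves every environment.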

Alerting Rules That Won't Ruin Your Life

Dashboards are great for investigation. But nobody stares at dashboards all day (and if someone on your team does, they're doing it wrong). You need alerts. The trick is setting thresholds that catch real problems without waking you up for noise.

Here's an alert_rules.yml that I've refined through actual production use:

# alert_rules.yml
groups:
  - name: nodejs_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) * 100 > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | printf \"%.1f\" }}% (threshold: 5%)"

      - alert: HighResponseTime
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time detected"
          description: "p95 response time is {{ $value | printf \"%.2f\" }}s"

      - alert: HighEventLoopLag
        expr: nodejs_eventloop_lag_seconds > 0.1
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Event loop lag is high"
          description: "Event loop lag is {{ $value | printf \"%.3f\" }}s on {{ $labels.instance }}"

      - alert: HighMemoryUsage
        expr: |
          nodejs_memory_usage_bytes{type="heapUsed"}
          /
          nodejs_memory_usage_bytes{type="heapTotal"} * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High heap memory usage"
          description: "Heap usage is {{ $value | printf \"%.0f\" }}% on {{ $labels.instance }}"

      - alert: InstanceDown
        expr: up{job="nodejs-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node.js instance is down"
          description: "{{ $labels.instance }} has been unreachable for more than 1 minute"

The for clause is critical. It's the difference between a useful alert and alert fatigue that makes your team ignore every notification. HighErrorRate needs to sustain for 5 minutes before firing -- this prevents a single burst of retries from waking someone up. HighMemoryUsage waits 10 minutes because memory can spike temporarily during GC cycles and that's fine.

A common mistake: people set the HighResponseTime threshold based on what they think latency should be, not what it actually is. Run your app for a week first. Look at the real p95. Set your threshold at 2-3x that baseline. Otherwise you'll get paged constantly and eventually start ignoring alerts, which is worse than having no alerts at all.
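
One way to pull that baseline out of Prometheus itself is a single instant query over a long window. Note this averages bucket rates across the whole week, which smooths out spikes -- eyeball the p95 graph over the same range too:

```
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{service="myapp"}[7d])) by (le)
)
```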

Route alerts through Alertmanager:

# alertmanager.yml
route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'

The routing here is intentional: warnings go to Slack where someone will see them during business hours. Critical alerts go to PagerDuty and wake someone up. If you're sending everything to PagerDuty, your on-call engineer will burn out in a month. If you're sending everything to Slack, critical alerts will get buried under a pile of warnings. Split them.

That's the whole pipeline: prom-client instruments your app, Prometheus scrapes and stores the metrics, Grafana visualizes them, and Alertmanager routes notifications when things go wrong. Start with the basics I've described here -- request rate, error rate, latency, memory, event loop lag. Get those running, learn what "normal" looks like for your app over a couple of weeks, and then iterate on your thresholds. Don't try to monitor everything on day one. Monitor the five things that will actually tell you when your users are having a bad experience, and add more as you learn what questions you keep asking during incidents.

Written by Anurag Kumar

Full-stack developer passionate about Node.js and building fast, scalable web applications. Writing about what I learn every day.
