A few weeks ago, our production API started crashing around 3 AM every night. The symptoms were consistent: memory usage climbed steadily throughout the day until the process exhausted its available RAM and the OOM killer terminated the container.

The Investigation Begins

The first step was to capture a heap snapshot during peak memory usage. I deployed a debug build that exposes an endpoint for writing a snapshot on demand:

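const v8 = require('v8');

// Writes a snapshot to the current working directory and returns its filename.
// Note: v8.writeHeapSnapshot() is synchronous and blocks the event loop while it runs.
function takeHeapSnapshot() {
  const filename = `heap-${Date.now()}.heapsnapshot`;
  const snapshot = v8.writeHeapSnapshot(filename);
  console.log(`Heap snapshot written to ${snapshot}`);
  return snapshot;
}

// Trigger snapshot via API endpoint (debug builds only; do not expose publicly).
// `app` is the application's existing HTTP router instance.
app.get('/debug/heap', (req, res) => {
  const snapshot = takeHeapSnapshot();
  res.send(`Snapshot written to ${snapshot}`);
});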
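
Because writing a snapshot pauses the process, it can help to confirm the upward trend with cheap sampling first. The following is a minimal sketch, not the code we deployed: heapUsageRatio and startHeapLogger are hypothetical helper names, and the 60-second default is an arbitrary choice. It relies only on the built-in v8.getHeapStatistics() and process.memoryUsage() calls.

```javascript
const v8 = require('v8');

// Current heap usage as a fraction of V8's configured heap size limit.
function heapUsageRatio() {
  const { used_heap_size, heap_size_limit } = v8.getHeapStatistics();
  return used_heap_size / heap_size_limit;
}

// Hypothetical helper: logs heap usage periodically so a steady climb
// becomes visible in the logs. Returns a function that stops the logger.
function startHeapLogger(intervalMs = 60000, log = console.log) {
  const timer = setInterval(() => {
    const { heapUsed, rss } = process.memoryUsage();
    const toMB = (bytes) => (bytes / 1024 / 1024).toFixed(1);
    log(`heapUsed=${toMB(heapUsed)}MB rss=${toMB(rss)}MB ratio=${heapUsageRatio().toFixed(3)}`);
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for logging
  return () => clearInterval(timer);
}
```

Calling startHeapLogger() once at startup is enough. The returned function clears the interval, so the logger itself does not become another forgotten timer.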

Finding the Culprit

After analyzing the heap snapshot in Chrome DevTools, I noticed something unusual: thousands of event listeners attached to our WebSocket connections. The smoking gun was in our connection handler:

// BEFORE (leaky code)
io.on('connection', (socket) => {
  const checkAuth = setInterval(() => {
    validateToken(socket.token);
  }, 60000);

  socket.on('disconnect', () => {
    console.log('Client disconnected');
    // BUG: interval never cleared!
  });
});

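Every time a client disconnected, its interval timer kept running. The timer's callback closes over the socket variable, so each live timer kept its socket object reachable and prevented it from being garbage collected. With clients connecting and disconnecting all day, the retained sockets accumulated until the process ran out of memory.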

The Fix

The solution was simple once identified:

// AFTER (fixed)
io.on('connection', (socket) => {
  const checkAuth = setInterval(() => {
    validateToken(socket.token);
  }, 60000);

  socket.on('disconnect', () => {
    clearInterval(checkAuth); // stop the timer so its callback no longer holds the socket
    console.log('Client disconnected');
  });
});
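
One way to make this harder to forget is to register the timer and its teardown in the same place. The sketch below is an illustration under assumptions, not our production code: intervalUntilDisconnect is a hypothetical helper, and a plain EventEmitter stands in for a real socket so the example runs on its own.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper: runs `task` every `ms` milliseconds until `socket`
// emits 'disconnect', then clears the timer automatically.
function intervalUntilDisconnect(socket, task, ms) {
  const timer = setInterval(task, ms);
  socket.once('disconnect', () => clearInterval(timer));
  return timer;
}

// Stand-in for a real socket, used only to exercise the helper.
const fakeSocket = new EventEmitter();
let ticks = 0;
intervalUntilDisconnect(fakeSocket, () => { ticks += 1; }, 10);

// Simulate the client going away before the first tick fires.
fakeSocket.emit('disconnect');
```

Since the interval is cleared as soon as 'disconnect' fires, the task never runs again and nothing keeps the socket reachable through the timer.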

Lessons Learned

Always clean up resources: Event listeners, timers, and database connections all hold references, so each needs explicit cleanup when its owner goes away.
Monitor memory in production: We now have Prometheus metrics tracking heap usage, with alerts at an 80% threshold.
Use WeakMap for object-keyed caches: A WeakMap does not keep its keys alive, so an entry can be garbage collected once nothing else references its key. Keys must be objects, so caches keyed by strings or numbers need a size limit or expiry instead.
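
To illustrate the WeakMap point, here is a small sketch of a cache keyed by object identity. The names profileCache, buildProfile, and getProfile are made up for this example, and the "expensive" work is a placeholder.

```javascript
// Cache keyed by object identity. An entry becomes collectable once its
// key object is no longer referenced anywhere else.
const profileCache = new WeakMap();
let computeCount = 0;

// Hypothetical expensive derivation, used only for illustration.
function buildProfile(user) {
  computeCount += 1;
  return { displayName: user.name.toUpperCase() };
}

// Returns the cached profile for `user`, computing it on first access.
function getProfile(user) {
  let profile = profileCache.get(user);
  if (profile === undefined) {
    profile = buildProfile(user);
    profileCache.set(user, profile);
  }
  return profile;
}

const user = { name: 'ada' };
getProfile(user);
getProfile(user); // second call is served from the cache
```

A WeakMap only accepts objects as keys, so passing a string key to profileCache.set throws a TypeError. For string-keyed caches, a Map with an explicit size limit or expiry is the usual alternative.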

Prevention Tools

I've added these tools to our development workflow:

clinic.js: Automated performance profiling
node --inspect: Built-in debugging with heap snapshots
Artillery: Load testing to catch leaks before production

The API has been stable for two weeks now, with memory usage remaining flat even under heavy load. Sometimes the best debugging sessions are the ones that teach you to write better code in the first place.

Valerii Pohorzhelskyi

Debugging a Production Memory Leak in Node.js
