The Nodesource Blog

Fix Production Problems (in Your Sleep) With N|Solid Threshold Actions

Have you ever had a tricky performance problem in production code that only seems to occur when you're sleeping? Most of the time, when you find out about these the next morning, the only thing you have to go on is a spike in a graph somewhere, and maybe a log entry. Figuring out what’s wrong with “rare” problems like these can be hard.

N|Solid's threshold-based monitoring functionality is designed to help you with problems like these. These features bring the type of information available to you during development into your production environment, without code modification or performance degradation.

As of N|Solid v1.3, thresholds come in two flavors: CPU and heap. When an application exceeds one of these thresholds, N|Solid can trigger one or more actions (heap snapshots, CPU profiles, email notifications). In this post, we'll be focusing on CPU profiling.

Let's begin with some code that uses regular expressions to cause performance problems. Although regular expressions are extremely useful and common, they have the potential to block the event loop and burn CPU if they aren’t constructed carefully. When an endpoint containing such a defect is exposed to the internet, there is potential for malicious clients to exploit it (this is known as a regular expression denial-of-service attack, or regex DoS).
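To get a feel for why a carelessly built pattern is dangerous, here is a minimal sketch that times the same nested-quantifier regex used later in this post against inputs of growing length. The names `safeRegex`, `timeMatch`, and `buildInput` are illustrative and not part of the original demo, and the absolute timings will vary by machine and Node.js version. On a backtracking regex engine, the time for the bad pattern should grow sharply with each extra pair of characters, while the simpler pattern stays flat.

```javascript
// Sketch only: compares a pattern with nested quantifiers against a simpler equivalent.
var badRegex = /^((xy)*)+$/;  // nested quantifiers: heavy backtracking when the match fails
var safeRegex = /^(xy)*$/;    // accepts the same strings without the nested quantifier

// Run one match and report the result plus the elapsed time in milliseconds.
function timeMatch(regex, input) {
  var start = process.hrtime();
  var matched = regex.test(input);
  var elapsed = process.hrtime(start); // [seconds, nanoseconds]
  return { matched: matched, ms: elapsed[0] * 1e3 + elapsed[1] / 1e6 };
}

// Build 'xy' repeated `pairs` times, plus a trailing 'x' that forces the match to fail.
function buildInput(pairs) {
  return new Array(pairs + 1).join('xy') + 'x';
}

[12, 14, 16, 18].forEach(function (pairs) {
  var input = buildInput(pairs);
  var bad = timeMatch(badRegex, input);
  var safe = timeMatch(safeRegex, input);
  console.log('%d pairs: bad %s ms, safe %s ms',
    pairs, bad.ms.toFixed(3), safe.ms.toFixed(3));
});
```

The trailing `x` matters: the input almost matches, so the engine tries every way of splitting the `xy` pairs between the inner and outer groups before giving up.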

First, let’s get some sample code running under N|Solid:

  • If you don't have it running already, install and start up N|Solid

  • Copy the snippet below and save it in a file called bad-regex.js:

    // reminder: this is demo code specifically written to cause problems
    var badRegex = /^((xy)*)+$/;
    
    
    function repeat (unit, times) {
      var result = '';
      var counter = 0;
      while (counter++ < times) {
        result += unit;
      }
      return result;
    }
    
    
    function random (min, max) {
      return min + (Math.random() * (max - min));
    }
    
    
    function timedRegex () {
      var regexInput = repeat('xy', random(20, 26)) + 'x';
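      var start = process.hrtime();
      badRegex.test(regexInput);
      // hrtime returns [seconds, nanoseconds]; combine both so runs over 1 s are reported correctly
      var elapsed = process.hrtime(start);
      console.log('regex input:', regexInput);
      console.log('regex took %d ns', elapsed[0] * 1e9 + elapsed[1]);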
    }
    
    
    setInterval(timedRegex, 1000);
    
  • Start the demo application: NSOLID_APPNAME="Regex Trouble" NSOLID_HUB=2379 nsolid bad-regex.js

Now let's configure a threshold and some threshold actions in the N|Solid console:

  • From the main application page, click on the Regex Trouble application to enter the cluster view.

  • Click on Threshold Settings, located near the upper-right corner of the page.

  • Switch to the CPU Used threshold parameter, then flip the switch immediately below to enable it.

  • Drag the CPU threshold slider immediately below until the value is set to 75%. You may have to set this to a lower percentage if you are following along on a more powerful machine or instance.

  • Set the threshold duration to 0. Increasing this value allows you to prevent short blips of activity from triggering a threshold at all.

  • Set the threshold interval to 30 minutes. The code above has the potential to generate a CPU spike every second, and setting this will ensure that your inbox does not become clogged with repeated notifications.

  • Select Initiate CPU Profile; when enabled, N|Solid will start a CPU profile on your behalf, leaving an entry in the CPU profiles list in the process view.

  • Select Send a Notification, then enter a valid email address into the box that appears.

  • Click the Update button to save these settings, then click the DONE button in the upper-right corner to close the threshold settings panel.

  • Watch the process dot move and, if necessary, adjust the threshold bar so that the dot crosses over the bar when it jumps to the right. Once it crosses, the CPU threshold bar will turn orange.

  • Check your email and look for a message from nsolid-notifications@nodesource.com. Click on the link to the CPU profile, which will take you directly to the profile taken on your behalf when the spike happened.

  • Switch to the Flamegraph visualization type, and click on the widest bar at the top of the graph, which should be labeled timedRegex.

  • The call stack table in the upper-right corner should tell you that this call was invoked thousands of times and is responsible for the vast majority of the program's running time. Hovering over the top entry will reveal the full path to the file containing the problematic function.
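The threshold duration and interval settings configured in the steps above can be pictured as a simple gate in front of the threshold actions. The sketch below is a hypothetical model written for illustration, not N|Solid's actual implementation; `createThreshold` and its option names are invented here. It assumes that the duration is how long usage must stay above the limit before anything fires, and that the interval is the minimum gap between two firings.

```javascript
// Hypothetical model of threshold gating; not N|Solid's actual implementation.
function createThreshold(options) {
  var limit = options.limit;           // e.g. 75 (% CPU)
  var durationMs = options.durationMs; // how long usage must stay above the limit
  var intervalMs = options.intervalMs; // minimum gap between two triggers
  var aboveSince = null;               // when the current run above the limit began
  var lastFired = null;                // when the actions last ran

  // Feed one CPU sample; returns true when the actions should run.
  return function check(cpuPercent, nowMs) {
    if (cpuPercent <= limit) {
      aboveSince = null; // usage dropped back, so the duration clock resets
      return false;
    }
    if (aboveSince === null) aboveSince = nowMs;
    var sustained = nowMs - aboveSince >= durationMs;
    var cooledDown = lastFired === null || nowMs - lastFired >= intervalMs;
    if (sustained && cooledDown) {
      lastFired = nowMs;
      return true;
    }
    return false;
  };
}

// Settings from this walkthrough: 75% limit, duration 0, interval 30 minutes.
var check = createThreshold({ limit: 75, durationMs: 0, intervalMs: 30 * 60 * 1000 });
console.log(check(90, 0));              // true: first spike fires immediately
console.log(check(95, 1000));           // false: still inside the 30-minute interval
console.log(check(95, 31 * 60 * 1000)); // true: the interval has elapsed
```

With a duration of 0, a single sample above the limit is enough to fire, which is why the demo triggers so quickly. A longer duration would ignore the one-second blips this demo produces.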

Finally, let’s clean up:

  • Go back to the terminal session running bad-regex.js and press Ctrl+C to shut it down.

  • Try running your own Node.js project under N|Solid, or shut down N|Solid by going to the terminal sessions used to start it and pressing Ctrl+C in each one to stop the console, proxy, and etcd.

Although this example was short and contrived, hopefully you've now seen how N|Solid can help you easily diagnose a serious problem affecting your application's performance and reliability in production.

Learn more about N|Solid