How to Tame “Monitorture” and Build a Developer-Friendly Monitoring Environment

Incapsula DDoS mitigation and web application security protects millions of websites, applications and networks from many of the bad things that happen online.

What does that mean? It means we provide the services you need to run your web application on the internet, including a web application firewall, load balancing, a CDN and DDoS protection.

Incapsula sits between your clients (i.e. the people trying to access your website) and your servers. We are positioned in the middle, and our points of presence (PoPs) are deployed around the world for faster delivery. We see all the traffic coming in through our PoPs, filter out the bad stuff and let the good stuff through, while caching as much as possible to speed up access to your site. That’s the basic idea behind Incapsula.

[Image: Incapsula concept diagram]

The Incapsula service consists of thousands of servers running dozens of software components that integrate with several third-party systems. As you can imagine, there are a lot of moving parts, and these parts have to work in harmony to deliver the level of service and user experience our customers expect. When things break we have to restore them as soon as possible, so we invest a lot in monitoring.

I recently had the pleasure of delivering a short presentation at Velocity in N.Y. about how we do monitoring at Incapsula and I’d like to share it with you here.

Common Approach Toward Monitoring

When most people think of monitoring, they think of metrics. That means defining lots of metrics, reporting them to a server running Zabbix, Nagios or some other dinosaur monitoring software and (after reading the book, taking the course or spending half your life working with these systems) defining alerts for when these metrics cross their thresholds. That’s pretty simple to implement but, over time, it doesn’t scale.

Let’s say you have 1,000 servers to monitor and you monitor 100 different metrics. You’ve just created a system with 100,000 unique outputs. What happens when just one metric crosses its threshold across the fleet? The system spews out 1,000 alerts. As the number of simultaneous production events grows, it gets harder and harder to make sense of the alerts.

I was never satisfied with this approach, and we never really invested in it. I was also unable to bring myself, or any of my developers, to work with Zabbix, and we never got past monitoring the CPU temperatures on the servers and the amount of network traffic on the NICs.

Developers aren’t thrilled to work on legacy monitoring systems because configuring and monitoring services with them is enormously complex. Take, for example, a simple Zabbix expression for checking the amount of free disk space:

({TRIGGER.VALUE}=0 and {server:vfs.fs.size[/,free].max(5m)}<10G) or
({TRIGGER.VALUE}=1 and {server:vfs.fs.size[/,free].min(10m)}<40G)

Alert if free disk space has stayed below 10 GB for the last five minutes, and recover once it has stayed above 40 GB for the last 10 minutes.
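To make that hysteresis concrete, here is a minimal sketch in plain Java (my illustration, not Zabbix code): the method mirrors the expression above, and the pre-aggregated maxFree5m and minFree10m inputs are assumed to be computed elsewhere from recent samples.

// Illustrative hysteresis matching the Zabbix trigger above:
// raise below 10 GB sustained for 5 minutes, recover above 40 GB sustained for 10 minutes
public class DiskSpaceTrigger {

   private static final long GB = 1024L * 1024 * 1024;

   // alerting corresponds to the current TRIGGER.VALUE
   static boolean nextTriggerState(boolean alerting, long maxFree5m, long minFree10m) {
      if (!alerting) {
         // TRIGGER.VALUE = 0: raise only if free space never exceeded 10 GB in the last 5 minutes
         return maxFree5m < 10 * GB;
      }
      // TRIGGER.VALUE = 1: stay in problem state while free space dipped below 40 GB in the last 10 minutes
      return minFree10m < 40 * GB;
   }
}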

When monitoring production services that span dozens of servers and several software components, the configuration becomes unbearable. As a result, this sort of monitoring gets passed along to the sys admins.

What’s Wrong With the Sys Admins Doing the Monitoring? Isn’t That Part of Their Job Anyway?

At Incapsula we deploy a new release to production every week. Each release introduces new code we need to monitor. Having someone other than the developers do the monitoring means handing these tasks over to the sys admins on a weekly basis. By the time the sys admins get around to the monitoring, it’s time for the next release, so monitoring is always behind schedule. In addition, having the developers do their own monitoring increases accountability and engagement (a whole other topic to discuss in another post).

By contrast, there are lots of other components to monitor that don’t change every week, such as servers, disks and network devices. Monitoring each of these devices independently was a good strategy and fit well with the kind of monitoring system we had. This led us to separate infrastructure monitoring from higher-level application monitoring, which had to evolve to fit the way developers think and like to work.

Developer-Friendly Monitoring

Developers like working on software projects, right? So our monitoring system is just another service the developers work on, using the same development environment and processes. We call it NetControl.

NetControl is a Java-based, globally distributed system that is deployed as a cluster, with one of the nodes serving as the leader. The leader is in charge of collecting data from the nodes, processing it and sending alerts or taking action as needed. The cluster is managed using an Apache ZooKeeper cluster, an industry-standard service for operations that require synchronization, like electing a leader or managing runtime configuration.
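The post doesn’t spell out how the leader election is implemented; one common way to do it on top of ZooKeeper is Apache Curator’s LeaderLatch recipe. Here is a minimal sketch along those lines (the ensemble address and the latch path are placeholders, not NetControl’s actual configuration):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Illustrative leader election with Curator's LeaderLatch recipe
public class LeaderElectionSketch {
   public static void main(String[] args) throws Exception {
      CuratorFramework client = CuratorFrameworkFactory.newClient(
            "zk1:2181,zk2:2181,zk3:2181",              // placeholder ZooKeeper ensemble
            new ExponentialBackoffRetry(1000, 3));
      client.start();

      LeaderLatch latch = new LeaderLatch(client, "/netcontrol/leader");
      latch.start();

      Thread.sleep(5000);                               // give the election a moment to settle
      if (latch.hasLeadership()) {
         // Only the elected leader collects results from the other nodes and sends alerts
         System.out.println("This node is the leader");
      }
   }
}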

[Image: NetControl monitoring architecture]

Monitoring scenarios are defined using a combination of JSON configuration and JavaScript code that implements a simple interface. Most basic scenarios only require the JSON configuration, as we’ve already implemented a few common templates. We chose JavaScript as the programming language to make it easy for all developers and operations engineers to create their monitoring scenarios. Trust me, nothing is funnier than seeing your C developers look up JavaScript questions on Stack Overflow!

NetControl’s model consists of Scenarios, Checks and Events. Checks are simple operations like performing an HTTP or DNS request to a service running on multiple servers. Scenarios run the checks, aggregate their results and report Events if needed to one or more Notification Channels.

For example, here’s the configuration for the scenario that checks the availability of our management console, my.incapsula.com:

{
   "name": "MY",
   "federated": true,
   "checks": [
      {
         "name": "my", "type": "HTTP", 
         "config": {
            "url": "https://my.incapsula.com/admin/login",
            "intervalUnit": "SECONDS", "intervalDuration": 30
         }
      }
   ],
   "statsDescriptors": {
      "responseTime": {"aggFunction": "AVERAGE"}
   },
   "eventsManager": {
      "clazz": "events.CheckBasedEventsManager",
      "criteria": "MAJORITY"
   },
   "notificationChannels": [
      {"id": "slack", "group": "server"},
      {"id": "email", "group": "server"},
      {"id": "pagerduty", "group": "server"}
   ]
}

Setting the federated flag to true tells NetControl to run the scenario from all nodes and aggregate the results, sending out an alert only if a majority of the nodes think our application is down. This eliminates false positives caused by network connectivity issues at a single location.
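The MAJORITY criteria boils down to a simple vote over the per-node results. A minimal sketch (the names are illustrative, not NetControl’s internals):

import java.util.Map;

// Illustrative majority vote over the per-node check results of a federated scenario
public class MajorityCriteria {
   static boolean shouldRaiseEvent(Map<String, Boolean> checkFailedByNode) {
      long failures = checkFailedByNode.values().stream().filter(failed -> failed).count();
      return failures > checkFailedByNode.size() / 2;   // strictly more than half of the nodes
   }
}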

If a more advanced aggregation is required, developers can implement a JavaScript object that overrides the default logic:

var scenario = {

   state: {
      "x": 1,
      "y": 2
   },

   // Called when a check completes: extract information from its response
   // and report custom metrics if needed
   processCheck: function(check) {

   },

   // Called after all checks have been processed: aggregate the results
   // and produce events if needed
   generateEvents: function(check) {

   }
};

The processCheck function is called when a check completes running and allows the developer to extract information from its response if necessary or report custom metrics. After a scenario completes processing all its checks, generateEvents is called to aggregate the results and produce events if necessary.

At a high level, the processing of scenarios looks like this:

for (Scenario scenario : scenarios) {
   for (Check check : scenario.checks) {
      if (check.shouldCheck()) {
         Checker checker = getChecker(check);
         checker.check(check);   // Non-blocking asynchronous execution here
      }
   }
   if (isLeader()) {
      List<NCEvent> events = scenario.processEvents();
      for (NCEvent event : events) {
         event.notify();
      }
   }
}

NetControl Architecture

NetControl is built on top of Play Framework, a Java-based web application framework. We mainly use Play as an application server to render NetControl’s user interface. We use a single thread to run the scenarios, but since all checks are asynchronous the thread never blocks and can scale to run many scenarios.
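As an illustration of what a non-blocking check can look like, here is a minimal sketch using the JDK’s HttpClient (my example for brevity, not NetControl’s actual checker code): the scheduling thread fires the request and moves on, and the result is handled on a callback when the response arrives.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CompletableFuture;

// Illustrative non-blocking HTTP check
public class AsyncCheckSketch {
   public static void main(String[] args) {
      HttpClient client = HttpClient.newHttpClient();
      HttpRequest request = HttpRequest.newBuilder(
            URI.create("https://my.incapsula.com/admin/login")).build();

      CompletableFuture<Void> inFlight = client
            .sendAsync(request, HttpResponse.BodyHandlers.discarding())
            .thenAccept(response -> System.out.println("status=" + response.statusCode()));

      inFlight.join();   // only so this standalone sketch waits before exiting
   }
}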

NetControl uses the Rhino JavaScript engine to execute the JavaScript scenarios, and we are planning to upgrade to Nashorn for better debugging capabilities and support for newer ECMAScript features.
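For reference, here is a minimal sketch of how a scenario object can be evaluated and invoked through Rhino from Java. The script and hook name follow the interface shown earlier; the embedding details are my assumptions, not NetControl’s actual code.

import org.mozilla.javascript.Context;
import org.mozilla.javascript.Function;
import org.mozilla.javascript.Scriptable;
import org.mozilla.javascript.ScriptableObject;

// Illustrative Rhino embedding: evaluate a scenario script and call its processCheck hook
public class RhinoSketch {
   public static void main(String[] args) {
      Context cx = Context.enter();
      try {
         Scriptable scope = cx.initStandardObjects();
         String script = "var scenario = { processCheck: function(check) { return 'processed ' + check; } };";
         cx.evaluateString(scope, script, "scenario.js", 1, null);

         Scriptable scenario = (Scriptable) scope.get("scenario", scope);
         Function processCheck = (Function) ScriptableObject.getProperty(scenario, "processCheck");
         Object result = processCheck.call(cx, scope, scenario, new Object[] { "http-check" });
         System.out.println(result);   // prints: processed http-check
      } finally {
         Context.exit();
      }
   }
}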

NetControl at Incapsula

For Incapsula we run three NetControl clusters:

  • NetControl STAGE for testing
  • NetControl BACKOFFICE for monitoring our backend and management services
  • NetControl PRODUCTION for monitoring the customer production networks

Overall we run a fleet of 20 NetControl servers installed in various data centers around the globe.

We integrated NetControl with Slack, PagerDuty and email to make sure alerts reach the on-call developers at any time.
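A notification channel can be as simple as an HTTP POST. Here is a minimal sketch of a Slack incoming-webhook channel (the webhook URL is a placeholder and the class is illustrative, not NetControl’s actual integration):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative Slack notification channel: post the alert text to an incoming webhook
public class SlackChannelSketch {
   private static final String WEBHOOK_URL =
         "https://hooks.slack.com/services/T000/B000/XXXX";   // placeholder

   static void notifyChannel(String message) throws Exception {
      String payload = "{\"text\": \"" + message.replace("\"", "\\\"") + "\"}";
      HttpRequest request = HttpRequest.newBuilder(URI.create(WEBHOOK_URL))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(payload))
            .build();
      HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
   }

   public static void main(String[] args) throws Exception {
      notifyChannel("MY scenario: my.incapsula.com failed its HTTP check on a majority of nodes");
   }
}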

[Image: Slack and PagerDuty notifications]

Using NetControl to Manage Our Global Network

NetControl was not conceived as a general-purpose monitoring tool. It was actually built as a network management service to automate the process of routing traffic away from PoPs experiencing connectivity issues. At the time we also suffered from frequent data-processing outages that, had they been detected earlier, could have been resolved without impacting customers. When we saw the opportunity, we quickly rewrote NetControl as a framework for developers to implement their own monitoring scenarios.

NetControl has proven itself to be very valuable for Incapsula. It’s in active development, and we are constantly adding more scenarios to expand its operational responsibility. At some point it will transform into NocBot, a full-time robot that manages our network.

[Image: NocBot]

Since NetControl’s launch about two years ago, Incapsula has deployed dozens of monitoring scenarios without falling into alert fatigue. In fact, when we detect false alarms, developers are more inclined to fix them because they consider them bugs in their code. And because many scenarios trigger PagerDuty incidents automatically, that bug may wake you up in the middle of the night. All the more reason to fix it!