Spikes, Slopes, and SQL

linear regressionYou have data you’re tracking every day. Maybe a lot of data. And maybe you’re tracking it a lot of different ways. You even create graphs of the data over time, so that you can spot trends and anomalies. And then one day you realize that you have too many graphs…

That’s when you know it’s time to let the data tell you when it needs attention.

What management calls a “trend”, statisticians calls a “slope”. If your data happens to fall in a line, the slope is how much is goes up (or down) each unit of time. $2 dollars per day. 100+ page views per hour. +5 accounts per month. Real data doesn’t always line up nicely for you, though, so what’s the best way to get a clear picture of what’s going on?

Some flavors of SQL will make this pretty easy to get, through linear regression functions. Oracle PL/SQL, for instance:

It’s a little less obvious in flavors without those built-in functions, such as Microsoft T‑SQL:

(In the sample queries and code, replace [measurements] with the table (or view, or subquery) that has your rolled-up data, [value] and [date] with the columns you want to use for your x and y axes. Multiple series are distinguished by [aspectID].)

Here are some additional things you may want to try:

  • Use the WHERE clause to restrict the data to the recent past (30 days or 90 days for dailies, perhaps, or 72 hours for hourlies, etc.)
  • Scale the slopes. Something that’s going from 1 to 100 is probably more interesting than something that’s going from 50,001 to 50,100, so divide the slopes by the MAX(value) for the period. This will give you a more useful column to sort by than the “raw” slope, for some problem spaces.

Another thing you may want to know is when the current value is spiking up or down. A brief spike may not have a big impact on the linear regression, but it’s still a point of interest. Oracle PL/SQL first again:

code3

“Spike” here is the number of standard deviations above (or below) average the most recent data point is. I like to look for instances that are more than two standard deviations from the current average (which should mean I’m getting hits 5% of the time for each measurement, or hits on 5% of the metrics with any given run) by adding:

code4

(Note the addition of ABS there—you want sudden drops as well as sudden increases.) If you put that into a view, your WHERE clause will be a little prettier. Restrict the time scale as needed, just like with slopes.

Getting to the most recent value as well as aggregations of the values in Microsoft T-SQL is a little more cumbersome, but I don’t find either particularly more readable than the other:

Once you employ these basic tools, get them into a view or schedule a stored procedure to run periodically and email you the metrics that trip your thresholds. And if you have any other more readable approaches in these or other SQL flavors, let us know by sharing your comments below, especially if you’ve put these kinds of measurements to interesting uses.



« Back to blog