March 22, 2014

Dealing with Outliers

The art of forecasting is, on many levels, the art of averaging.  And when it comes to averaging, outliers are a pain in the ass.  

Outliers wreak havoc on seemingly level-headed forecasts, getting them to predict crazy things.  Yet it's often unclear what counts as an outlier -- and if you're not careful, you end up throwing the baby out with the bath water.  That is, in eliminating your outliers, you risk eliminating material data (i.e., information you want to factor into your forecast).

So what are you supposed to do about outliers?  Well, first and foremost:
  • There are many ways to handle outliers.
  • The optimal method is very circumstance-specific, in about a dozen ways.
  • Dealing with outliers generally falls into two categories: elimination and mitigation.
If you spy any weird values, maybe you can just knock out those values from your averages altogether.
  • Eliminate values above/below a fixed value.
"When calculating the average coffee consumption per employee, ignore employees who drinks more than six cups per day."
This is wonderful, when it works -- it's simple and explainable. However, it assumes that (a) you really can ignore this subset, and (b) you know the precise limit to set. 
Example: Caffeinated Carl averages a dozen cups of coffee per day.  Do we really want to let him skew the average?  (Damn that guy.)
Plus, it assumes that this fixed value doesn't change over time:  Maybe in 1970 drinking three cups would be unusual, but by 2014, that's par for the course.  ...Keep in mind that abruptly changing your criteria (e.g., "for the 1970s, eliminate more than three cups; in the 1980s, eliminate more than four...") will lead to weird discontinuities in your averages.  Smarmy analysts will mock it: "How come you ignored Kona Karen's 3.2 cups in 1979, and then suddenly in 1980 you're including her?"
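Here's a quick sketch of the fixed-cutoff approach in Python (the numbers, and Carl's twelve cups, are made up for illustration):

```python
# Hypothetical cups-per-day figures; 12.0 is Caffeinated Carl.
cups = [1.5, 2.0, 2.5, 3.0, 12.0]

CUTOFF = 6.0  # the pre-defined fixed limit

# Keep only values at or below the cutoff, then average what's left.
kept = [c for c in cups if c <= CUTOFF]
average = sum(kept) / len(kept)
print(average)  # 2.25 -- Carl no longer skews the result
```

Simple and explainable, just like the text says -- but notice how everything hinges on that one hard-coded `CUTOFF`.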
  • Eliminate values based upon relative extremes.
"When calculating the average coffee consumption per employee, ignore the top and bottom 5%.  ...Or ignore any employees whose coffee consumption is more than two standard deviations from the group mean."
On the bright side, this is more flexible.  It elegantly handles natural progressions over time (i.e., as people drink more coffee, the exclusion threshold rises with them), and doesn't give you discontinuity issues like the Kona Karen problem introduced above.

On the down side, it's not quite as explainable:  You won't intuitively know precisely which drinkers would be included/excluded without observing the population set they're being compared to.   Even more vexing, you now need a certain population size or else you can get screwy results.  What happens if you're averaging a group of 3 drinkers, and one of those people is Caffeinated Carl?   Do you want to ignore him or not?   I dunno -- you tell me.  I suppose it depends upon what you're using your forecast for.
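A sketch of both relative-cutoff flavors in Python, using the standard library's statistics module (the data is invented; note how 5% of a ten-person office rounds down to zero, so the trim is forced to at least one value per end):

```python
import statistics

# Hypothetical cups-per-day figures; 12.0 is Caffeinated Carl again.
cups = sorted([1.0, 1.5, 2.0, 2.0, 2.5, 2.5, 3.0, 3.0, 3.5, 12.0])

# Option A: trim a fixed fraction off each end -- at least one value,
# since int(10 * 0.05) would otherwise trim nothing at all.
k = max(1, int(len(cups) * 0.05))
trimmed = cups[k:-k]
avg_trimmed = sum(trimmed) / len(trimmed)
print(avg_trimmed)  # 2.5

# Option B: drop anything more than 2 standard deviations from the mean.
mu = statistics.mean(cups)
sigma = statistics.stdev(cups)
kept = [c for c in cups if abs(c - mu) <= 2 * sigma]
avg_sigma = sum(kept) / len(kept)
print(avg_sigma)
```

That `max(1, ...)` guard is exactly the small-population problem at work: with only a handful of drinkers, percentage-based rules stop behaving the way you'd expect.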
Sometimes, for one reason or another, you can't just eliminate extreme values -- especially if your data set is too small.  For example, consider trying to average -40, 10, 14, and 100, where both -40 and 100 look like outliers.  Average all four and the result of 21 might seem too high.  Yet eliminate the 100 and the average of -5.3 might seem too low.

In these circumstances, you need some way to acknowledge high and low values, but without letting them run roughshod over your average.
  • Take the Median.
There is some charm in just taking the middle value of a set.  This way, you're not ignoring high or low values entirely -- an additional high value still nudges the median upward, and an additional low value nudges it down.

However, if your data is unevenly distributed, it can be misleading.  For example, consider the median in the set -4, -3, -2, -1, 0, 201, 202, 203, 204.   Is a value of 0 really a representative average?  Maybe so, maybe not.
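Python's statistics module handles this directly; here it is on the two sets from above:

```python
import statistics

# With an even count, median() returns the mean of the two middle values.
print(statistics.median([-40, 10, 14, 100]))  # 12.0

# The unevenly-distributed set: the middle value is 0, representative or not.
print(statistics.median([-4, -3, -2, -1, 0, 201, 202, 203, 204]))  # 0
```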
  • Take the Mode.
I've honestly never used the mode once, ever, in any forecast.  The mode is a statistically cool-sounding way of saying "Erm... what do people usually do?  Do that."  I suspect it's good for non-additive values, like the "average" favorite flavor of coffee.  But, hey, it's an option.
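For what it's worth, the mode is a one-liner for exactly that non-additive case (the flavor poll here is made up):

```python
import statistics

# Hypothetical poll of favorite coffee flavors -- a non-additive value,
# so mean and median don't even apply.
flavors = ["hazelnut", "house blend", "house blend", "mocha", "house blend"]
print(statistics.mode(flavors))  # house blend
```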
  • Take the Harmonic Mean.
The harmonic mean is supposed to be a good way to average rates, and suppresses the impact of outliers.  You won't get far in forecasting without some would-be statistician recommending the harmonic mean (although they're probably just statistic-name-dropping to up their math cred). 

I've actually already written about how to calculate the harmonic mean (vs the geometric and arithmetic mean).  Check it out here.

I've used the harmonic mean, but sparingly.  For me, it's unclear exactly how much it mitigates your outliers, and personally I don't like using averages when I can't intuitively fathom how they work.  But, hey, it has its own Wikipedia entry, so it's pretty legit.
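If you want to see it in action, it's one call in Python (the brew rates are invented for illustration):

```python
import statistics

# Hypothetical: three baristas brew at 10, 40, and 60 cups per hour.
# The harmonic mean is dominated by the smallest values, which is how
# it tames high outliers in rate data.
rates = [10, 40, 60]
print(statistics.harmonic_mean(rates))  # ~21.18, vs. an arithmetic mean of ~36.67
```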
  • Use the Akamai Outlier Method.
This is a method that I co-developed with Tom Leighton, back when he was Chief Science Officer at Akamai.  ("Co-developed" might be a bit overstated -- he wrote the pseudocode, and I just programmed it.  But I was definitely in the room, so that should count for something.)   Results may vary by use, but it works wonderfully for mitigating outliers in very small data sets.  

I should mention that we didn't define this method just to express our creativity:  we tried more traditional methods (such as those described above), but each had edge cases where our forecast went screwy.  This method delivered consistently good results.
Here's how it works: 
  1. Take the arithmetic average of the set.  
  2. Identify the most extreme value in your data set. 
  3. Reduce this extreme value until it is no more than X% farther from the average than the next most extreme value (where X is a pre-defined amount; we settled upon 10%.)
This is a bit tricky, because you'll notice that as you reel in your extreme value, your average is changing, too.   And sometimes (this is a bit of a brain-bender), the second-most extreme value changes as you reel in your most extreme value.
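In Python, the loop might look something like this.  To be clear, this is an illustrative reconstruction of the three steps above, not the actual Akamai code -- the step size and stopping details are choices I've made for the sketch:

```python
def rein_in_outlier(values, x=0.10, step=0.01, max_iter=100_000):
    """Shrink the most extreme value toward the mean until it is no more
    than (1 + x) times as far from the mean as the next most extreme value."""
    vals = list(values)
    for _ in range(max_iter):
        mean = sum(vals) / len(vals)
        # Rank indices by distance from the *current* average.
        ranked = sorted(range(len(vals)),
                        key=lambda i: abs(vals[i] - mean), reverse=True)
        worst, runner_up = ranked[0], ranked[1]
        if abs(vals[worst] - mean) <= (1 + x) * abs(vals[runner_up] - mean):
            break  # step 3's condition is satisfied
        # Reel the extreme value a small step toward the mean; on the
        # next pass the mean (and maybe the runner-up) will have shifted.
        vals[worst] += step if vals[worst] < mean else -step
    return vals

print(rein_in_outlier([-40, 10, 14, 100]))
```

Recomputing the mean and the ranking on every pass is what handles both wrinkles from above: the moving average, and the possibility that the second-most extreme value changes mid-reel.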
So which method of eliminating or mitigating outliers is best?
If you're new to forecasting, you might think that there's a right answer.  Like, an answer that statisticians everywhere could agree upon.  There's not.  Now, granted, they can tell you if you're applying a method incorrectly, but that's about it.
(Image caption: Good in curries, less so pancakes.)
A good forecast is like a recipe.  I can't just say "Come hell or high water, you had better add a teaspoon of paprika, or your meal is going to taste like crap."  It just doesn't work like that.  The best I can do is say, "Cardamom has a smoky flavor that's good on fish.  Crushed red pepper is great on pizza."  You are the chef -- I don't even know what the heck you're making.
As unpleasant as it might be to hear, this is your best bet:
  1. Try a bunch of different things.  
  2. Inspect the results closely, from a variety of perspectives.
  3. Decide which outlier-handling method(s) give you the results you prefer.
That's the best I've got.  Good luck!
