March 31, 2014

Mapping the Possibility Space

Near the end of one of my favorite movies, Magnolia, it abruptly starts raining frogs.   Cars crash, people freak out; it's a full-blown crisis. 

One of the characters, a young boy, is looking out the window, taking it all in.  Then he says something both simple and utterly profound:

"This is something that happens.  This is one of the things that happens." 

Statisticians have a name for this category -- this category of "things that happen" ...or, at least "things that could happen".  It's called the Possibility Space.  If something can happen, no matter how unlikely it is to happen, then it resides within the Possibility Space.

Unlikely Is Not Impossible
This concept conflicts with how people typically think.  Normally, if an event is sufficiently unlikely, people will consider it impossible.  Indeed, it doesn't even have to be particularly unlikely for most people to write it off.  

A lot of people will say things like "it's impossible for the Bengals to win the Super Bowl this year."  Oh puhleeze.  It's not.   It may be impossible for the Red Sox to win the Super Bowl this year (...being that they're a baseball team and all), but even that is probably not impossible -- probably more like a one-in-a-duodecillion chance.

Pondering The Edge
I spend a lot of time wondering what fantastic things might exist at the very edge of the Possibility Space -- events that are exceedingly unlikely, but still possible.

For example, consider your personal Possibility Space.  Does it include you...

    ...winning a Nobel prize?
    ...playing free safety in a 2019 Jaguars Super Bowl win?
    ...becoming president of the United States?
    ...being nominated as the next Pope?
    ...living for a million years?
    ...traveling to Mars?
    ...witnessing a true divine-intervention miracle?
    ...transmuting lead into gold?
    ...speaking with an extra-terrestrial intelligence?

Granted, none of these things is likely.  But are they impossible?  Probably not -- not a single one of them.  They are in your Possibility Space, even if they're just a glimmer in an ocean of outcomes.

Beyond The Edge
Tesseract: a 4th-dimension cube
Conversely, what things are forever out of reach, no matter how the cards fall?  What things lie beyond the Possibility Space?
    ...traveling faster than the speed of light?
    ...understanding how the universe works?
    ...visualizing the fourth dimension?
    ...observing the surface of a black hole?
    ...traveling back in time?
    ...communicating with spirits?
You are incredibly unlikely.  Yet, you exist.
You might be wondering what's the point in thinking about unlikely-but-not-impossible things.  I mean, they're not going to happen, right?

No, not right.  As it turns out, if you look at things from the proper perspective, virtually everything that has ever happened has been ferociously, mind-bendingly unlikely.  Rewind to 10,000 B.C. -- what is the likelihood that your family tree would evolve in the way that it did?  Well...if we assume that each generation is twenty years (it's probably shorter than that, but whatever), then that's six hundred generations ago.  What's the likelihood that your ancestors would (a) survive long enough to mate, (b) meet just the right partner, (c) mate at just the right time, and (d) have just the right child... 

I tried to calculate it, but...I'm not able to express it in words.  The biggest named number is a British centillion (10^600), and I blew through that pretty quickly.
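Just to give a sense of how fast that compounds, here's a back-of-the-envelope calculation in Python.  (The one-in-a-thousand odds per generation are a number I'm inventing purely for illustration.)

    from math import log10

    # Hypothetical: suppose each generation had a generous 1-in-1,000 chance of
    # (a) through (d) all going exactly right.  Compounded over 600 generations,
    # that's roughly 1 in 10^1800 -- already triple a British centillion (10^600).
    per_generation_odds = 1e-3
    generations = 600
    exponent = generations * -log10(per_generation_odds)
    print(f"about 1 in 10^{int(exponent)}")   # about 1 in 10^1800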

The Most Unlikely Outcome...
My brother is often going on about coincidences.  He'll spot a common pattern between his license plate, the first few digits of some girl's phone number, and the age of his cat, and before I know it, he's calculating how vastly improbable this coincidence is.  I'm rarely moved.

Why?  Because, while it's true that any one set of circumstances is unlikely, there are so many unlikely things that could happen, that some are highly likely to occur.  For a moment, ponder the virtually infinite number of coincidences that could occur:
  • You and your plumber have the same first name.
  • Your dog was born on the same day as your neighbor's son.
  • The last three streets you lived on all started with the same letter. 
You could spend a lifetime thinking of coincidences, and not scratch the surface.   So, all told, the most unlikely outcome in the universe...is that only likely events will occur.

March 29, 2014

Props: The Onion

Anybody with whom I interact learns pretty quickly that I love the Onion.  I quote the Onion all the time, I honestly can't help it.  They've  been consistently hysterical for fifteen goddamn years.  I have NO idea how they do it without being formulaic, but it's always an absolute riot.

But what's more amazing is how close they are to the TRUTH.  Like, they predict events that actually happen.  Here's one mocking Gillette's release of ever-more blades:  Fuck Everything,  We're Doing Five Blades.

Many articles were written a few years later, when Gillette did go to five blades.  Check it out on BoingBoing (link).

Be Prepared For Brett To Cite The Onion
Anyway, I'm going to cite the Onion now and then, because it's absolutely awesome.  Here are two gems from their Point-Counterpoint, where they thoughtfully consider both sides of complex modern issues.   (Both are 10 years old -- their new stuff is, if anything, even better.)

Enjoy!  -Brett


Point Counterpoint:  Personal Computing

Point:  My Computer Totally Hates Me!


Counterpoint: God Do I Hate That Bitch

Point Counterpoint:  NASA Funding


Point:  According to the Economist, NASA is an Industrial Subsidy In Disguise
Counterpoint:  Ooh, Look At Me, I Read The Economist!



March 28, 2014

Three Card Monte

This post is about a wise man of our time, John Hodgman.  His book, The Areas of My Expertise, is arguably one of the seminal works of the last twenty years.  It's just amazing.

And one of the things he exposes is the Art of the Con:  how people really get financially taken in schemes.  He describes the classic con.  The one used to great effect by a considerable number of con-men over the past hundred years and even today.  You might know of it:  The Three Card Monte.

Here's what I'm going to do.  I'm going to read alongside you, as you read what he wrote, and just kind of comment on it / marvel at it.

Because the text is black and white, I will switch to orange, to make it easy to see when it's him and when it's me.



WHOA!  This is good.  This is totally just like I thought it would happen.  

Okay, I'm skipping ahead a bit, as the con-woman builds rapport with the mark.  It's kind of drawn out.
Hm.  I did not think this is how that con normally went down.


I definitely did not think this is how that con went down.


I skipped ahead a bit more.  Aw, screw it, I'm skipping to the end.

Ah!  Yes!  Now that I really think about it, I guess that probably is how it normally went down.  It's cool to have it explained to you.

It's like, now you know, ya know?


March 25, 2014

The One-in-Ten-Novillion Post




March 22, 2014

Dealing with Outliers

The art of forecasting is, on many levels, the art of averaging.  And when it comes to averaging, outliers are a pain in the ass.  

Outliers wreak havoc on seemingly-level-headed forecasts, and get them to predict crazy things.  Yet, it's often unclear what counts as an outlier -- and if you're not careful, you end up throwing the baby out with the bath water.  To wit, in eliminating your outliers, you risk eliminating material data (i.e., information you want to factor into your forecast).   

So what are you supposed to do about outliers?  Well, first and foremost:
  • There are many ways to handle outliers.
  • The optimal method is very circumstance-specific, in about a dozen ways.
  • Dealing with outliers generally falls into two categories: elimination and mitigation.
Elimination
If you spy any weird values, maybe you can just knock out those values from your averages altogether.
  • Eliminate values above/below a fixed value.
"When calculating the average coffee consumption per employee, ignore employees who drinks more than six cups per day."
This is wonderful, when it works -- it's simple and explainable. However, it assumes that (a) you really can ignore this subset, and (b) you know the precise limit to set. 
Example: Caffeinated Carl averages a dozen cups of coffee per day.  Do we really want to let him skew the average?  (Damn that guy.)
Plus, it assumes that this fixed value doesn't change over time:  Maybe in 1970 drinking three cups would be unusual, but by 2014, that's par for the course.  ...Keep in mind that abruptly changing your criteria (e.g., "for the 1970s, eliminate anyone over three cups; for the 1980s, anyone over four...") will lead to weird discontinuities in your averages.  Smarmy analysts will mock it: "How come you ignored Kona Karen's 3.2 cups in 1979, and then suddenly in 1980 you're including her?"
  •  Eliminate values based upon relative extremes.
"When calculating the coffee consumption per employee, ignore the top/bottom 5%'s.  ...Or ignore any employees whose coffee consumption is more than 2 standard deviations from the group."
On the bright side, this is more flexible.  It elegantly handles natural progressions over time (i.e, as people drink more coffee, you naturally exclude larger drinkers), and doesn't give you discontinuity issues, like Kona Karen introduced above. 

On the down side, it's not quite as explainable:  You won't intuitively know precisely which drinkers would be included/excluded without observing the population set they're being compared to.   Even more vexing, you now need a certain population size or else you can get screwy results.  What happens if you're averaging a group of 3 drinkers, and one of those people is Caffeinated Carl?   Do you want to ignore him or not?   I dunno -- you tell me.  I suppose it depends upon what you're using your forecast for.
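If it helps to see those two flavors of elimination side by side, here's a quick Python sketch.  (The coffee numbers are made up; the six-cup cap and two-standard-deviation cutoff are just the example thresholds from above.)

    import statistics

    def eliminate_fixed(cups_per_employee, cap=6):
        """Drop anyone who drinks more than a fixed number of cups per day."""
        return [c for c in cups_per_employee if c <= cap]

    def eliminate_relative(cups_per_employee, z=2.0):
        """Drop anyone more than z standard deviations from the group average."""
        mean = statistics.mean(cups_per_employee)
        sd = statistics.stdev(cups_per_employee)
        return [c for c in cups_per_employee if abs(c - mean) <= z * sd]

    cups = [1, 2, 2, 3, 3, 4, 12]             # Caffeinated Carl drinks a dozen
    print(eliminate_fixed(cups))              # [1, 2, 2, 3, 3, 4] -- Carl is out
    print(eliminate_relative(cups))           # Carl is out here too, given this group's spread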
Mitigation
Sometimes, for one reason or another, you can't just eliminate extreme values -- especially if your data set is too small.  For example, consider if you were trying to take the average value of -40, 10, 14, 100 -- where both -40 and 100 are kind of outliers.  If you average all of them, the average of 21 might seem too high.  Yet if you eliminate the 100, the average of -5.3 might seem too low.  

In these circumstances, you need some way to acknowledge high and low values, but without letting them run roughshod over your average.
  • Take the Median.
There is some charm in just taking the middle value of any set.  This way, you're not throwing any values away -- an extra high value still nudges the median up, and an extra low value still nudges it down; it just no longer matters how extreme those values are.

However, if your data is unevenly distributed, it can be misleading.  For example, consider the median in the set -4, -3, -2, -1, 0, 201, 202, 203, 204.   Is a value of 0 really a representative average?  Maybe so, maybe not.
  • Take the Mode.
I've honestly never used the mode once, ever, in any forecast.  The mode is a statistically-cool sounding way of saying "Erm...what do people usually do?  Do that."  I suspect it's good for non-additive values, like the "average" favorite flavor of coffee.  But, hey, it's an option.   
  • Take the Harmonic Mean.
The harmonic mean is supposed to be a good way to average rates, and suppresses the impact of outliers.  You won't get far in forecasting without some would-be statistician recommending the harmonic mean (although they're probably just statistic-name-dropping to up their math cred). 

I've actually already written about how to calculate the harmonic mean (vs the geometric and arithmetic mean).  Check it out here.

I've used the harmonic mean, but sparingly.  For me, it's unclear exactly how much it mitigates your outliers, and personally I don't like using an average when I can't intuitively fathom how it works.  But, hey, it has its own wikipedia entry, so it's pretty legit.
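To make the comparison concrete, here's a tiny Python sketch (made-up coffee numbers again) showing how the median and the harmonic mean rein in Caffeinated Carl relative to the plain arithmetic mean:

    import statistics

    cups = [2, 3, 3, 4, 12]                    # Caffeinated Carl strikes again

    print(statistics.mean(cups))               # arithmetic mean: 4.8, dragged up by Carl
    print(statistics.median(cups))             # median: 3, barely notices how extreme Carl is
    print(statistics.harmonic_mean(cups))      # harmonic mean: ~3.33, damps the outlier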
  • Use the Akamai Outlier Method.
This is a method that I co-developed with Tom Leighton, back when he was Chief Science Officer at Akamai.  ("Co-developed" might be a bit overstated -- he wrote the pseudocode, and I just programmed it.  But I was definitely in the room, so that should count for something.)   Results may vary by use, but it works wonderfully for mitigating outliers in very small data sets.  

I should mention that we didn't define this method just to express our creativity:  we attempted more traditional methods (such as those described above), but they each had edge cases where our forecast was screwy.  This method delivered consistently good results.
Here's how it works: 
  1. Take the arithmetic average of the set.  
  2. Identify the most extreme value in your data set. 
  3. Reduce this extreme value until it is no more than X% farther from the average than the next most extreme value (where X is a pre-defined amount; we settled upon 10%.)
This is a bit tricky, because you'll notice that as you reel in your extreme value, your average is changing, too.   And sometimes (this is a bit of a brain-bender), the second-most extreme value changes as you reel in your most extreme value.
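Here's a minimal Python sketch of my reading of those three steps -- not the actual Akamai code, just an illustration, with X set to the 10% we settled upon:

    def rein_in_extremes(values, x=0.10, max_passes=50):
        """Pull the most extreme value toward the mean until it is no more than
        x (10%) farther from the mean than the next-most-extreme value.  The mean
        is recomputed on every pass, since reeling in the extreme value shifts it,
        and the "second-most extreme" value can change along the way."""
        vals = list(values)
        for _ in range(max_passes):
            mean = sum(vals) / len(vals)
            # Rank positions by distance from the current mean, farthest first.
            order = sorted(range(len(vals)), key=lambda i: abs(vals[i] - mean), reverse=True)
            worst, runner_up = order[0], order[1]
            allowed = (1 + x) * abs(vals[runner_up] - mean)
            if abs(vals[worst] - mean) <= allowed:
                break                          # nothing left to rein in
            # Move the extreme value to the allowed distance, on its own side of the mean.
            side = 1 if vals[worst] >= mean else -1
            vals[worst] = mean + side * allowed
        return vals

    print(rein_in_extremes([-40, 10, 14, 100]))   # the small data set from the Mitigation intro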
Which method to eliminate/mitigate outliers is best?
If you're new to forecasting, you might think that there's a right answer.  Like, an answer that statisticians everywhere could agree upon.  There's not.  Now, granted, they can tell you if you're applying a method incorrectly, but that's about it.
Good in curries, less so pancakes.
A good forecast is like a recipe.  I can't just say "Come hell or high water, you had better add a teaspoon of paprika, or your meal is going to taste like crap."  It just doesn't work like that.  The best I can do is say, "Cardamom has a smoky flavor that's good on fish.  Crushed red pepper is great on pizza."  You are the chef -- I don't even know what the heck you're making.
 As unpleasant as it might be to hear, this is your best bet:
  1. Try a bunch of different things.  
  2. Inspect the results closely, from a variety of perspectives.
  3. Decide which outlier-handling method(s) give you the results you prefer.
That's the best I've got.  Good luck!

March 20, 2014

Calculating Seasonality & Trending

If you are attempting to forecast any sort of quarterly/monthly/daily growth rate, you'll likely need to calculate trending and seasonality.  How come?  Because:
  • Things sometimes behave differently during different times of the year.  NORADSanta.org's internet traffic spikes in December, then drops in January. 
  • Things sometimes behave differently over time.  MySpace was on a tear for a while, but not so much these days.
In this post, I'll start by describing Trending and Seasonality a bit, and then step through a very basic way to calculate them.


What is Trending, exactly?
Trending is the long-term compound growth rate of your data.  Let's say that I watched a stock over four years, and it grew from 100 to 400 over that period.  Then its long-term monthly trend would be:

long-term monthly trend = power(400/100, 1/48) = (400/100)^(1/48) = 1.0293 (about 2.93% growth per month)

...that "power()" function is how you'd calculate the value in Excel.  Don't let the root intimidate you; all it's saying is "what number would I have to multiply 100 by forty-eight times in a row to arrive at 400?"   And sure enough, if you started with 100, multiplied once is 102.93, multiplied twice is 105.94...forty-seven times is 388.61, and forty-eight times is 400.

You can calculate different trends, too.  For example, if you have lots of historical data, you could have a long-term trend (say, a stock's behavior over the past 5 years) and then a short-term trend (its behavior over the past six months).
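In code (Python here, but any language with exponents will do), that trend is a one-liner -- and you can check the "multiply 48 times" claim directly:

    trend = (400 / 100) ** (1 / 48)    # same as =POWER(400/100, 1/48) in Excel
    value = 100
    for _ in range(48):
        value *= trend
    print(round(trend, 4), round(value, 2))    # ~1.0293 per month, and back to 400.0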
Autumn: divide by 1 leaf.

What is Seasonality, exactly?
Seasonality is the impact that repeating periods have on your data -- hours of the day, days of the week, months of the year, et cetera.   

The trick is that, at least when talking about growth rates, seasonality should be geometrically neutral.   Said in non-geek-speak:  If you multiply together all of your seasonality coefficients, you should get a value of 1.  Said another way, the net effect of seasonality (across all periods) should be no effect whatsoever.

Fortunately, making seasonality geometrically neutral is pretty easy -- I'll show you below.  (One of the perks of calculating seasonality is that you can tell your co-workers that "naturally, I made sure the seasonality was geometrically neutral."  This may impress them, but be warned: It is unlikely to get you any dates.)

Calculating Trending and Seasonality

In the attached spreadsheet (here), we have four years of time series data for a particular item.  

Winter: add 1/3 cup snow
If you look in the sheet "Creating the Data", you can see how I created the time series, by choosing trending and seasonality values.  Then, in the sheet "Calculation", you can see how, using only the time series data itself, I'm able to derive the time series' trending and monthly seasonality values.  (Granted, this is a basic example -- I have the same number of periods for each season and zero volatility, but I'm focusing upon the base concepts here; we can trick it out later on.)

To calculate trending and seasonality, follow these four steps:

1. Calculate your long-term trend.  
This is precisely as I described above.  In this instance, over forty-eight periods we went from a value of 100 to a value of 657, so our long-term trend is the 48th root of 6.57, which is 104% -- or 4% growth.

2. Determine the month-over-month change.
Easy breezy.  Just divide each value by the previous value, to get the percentage change.   For example, the February and March time series values are 99.84 and 149.52 respectively.  This represents 149.52/99.84 = 149.76% growth for March.

3. De-trend the month-over-month changes.
Here, we're stripping out the long-term trend, so just divide each month-over-month value by the long-term trend.  For example, in March 2010 the value grew by 149.76%, but after we divide out the 104% trend, we're left with 149.76%/104% = 144%.

4. Take the arithmetic average of each month's growth rates, to calculate its seasonality.
Last but not least, you can determine the seasonality for each month by simply (arithmetic) averaging its growth rates.  (Remember, the arithmetic average is the "regular" average -- just add up the growth rates for each month and divide by the number of values.)   In this example, each month always has the same seasonality -- but in real life you'd need to take an average to account for random variability.
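If you'd rather see those four steps as code than as a spreadsheet, here's a minimal Python sketch.  It manufactures its own 48-step series from a made-up trend and a made-up (geometrically neutral) seasonality -- the attached spreadsheet uses its own numbers -- and then recovers them using exactly the steps above:

    from math import prod
    from collections import defaultdict

    # Setup: build a series from a known trend and seasonality (invented numbers, zero volatility).
    raw = [1.2, 0.9, 1.5, 0.8, 1.0, 0.95, 1.1, 0.9, 1.05, 0.85, 1.3, 0.7]
    g = prod(raw) ** (1 / len(raw))
    true_seasonality = [s / g for s in raw]    # divide by the geometric mean, so the product is 1
    true_trend = 1.04                          # 4% underlying monthly growth

    series = [100.0]
    for month in range(1, 49):                 # 49 points = 48 month-over-month steps
        series.append(series[-1] * true_trend * true_seasonality[month % 12])

    # Step 1: long-term trend = (last / first) ** (1 / number of steps)
    steps = len(series) - 1
    trend = (series[-1] / series[0]) ** (1 / steps)

    # Step 2: month-over-month change
    mom = [series[i] / series[i - 1] for i in range(1, len(series))]

    # Step 3: de-trend each change by dividing out the long-term trend
    detrended = [m / trend for m in mom]

    # Step 4: average each calendar month's de-trended changes to get its seasonality
    by_month = defaultdict(list)
    for i, d in enumerate(detrended, start=1):
        by_month[i % 12].append(d)
    seasonality = {m: sum(vals) / len(vals) for m, vals in by_month.items()}

    print(round(trend, 4))                     # recovers ~1.04
    print(round(seasonality[1], 4))            # matches true_seasonality[1]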

Is It Really This Simple In Real Life?
Yes and no.  Trending and seasonality really do work as described, and you should be able to use the techniques described here to great benefit, as a mainstay of your forecasting algorithm.  But unfortunately, a forecast is more than just trending and seasonality.

You'll likely have to deal with a bunch of other dynamics, such as these:

  • Lots of random variability (aka "noise") in your data.
  • Uneven number of seasons (e.g., 4 Novembers but 5 Decembers).
  • Blurry or inconsistent data definitions (quite often customer, contract or product)
  • Extreme outliers that skew your averages.
  • Uneven amounts of data for each month.
  • Different long-term and short-term trends.
  • Operationalizing and fostering adoption of financial-centric forecasts.
  • Defining business processes around forecasts.
 ...Some of these I've already covered in other articles; others I'll get to as soon as I can!  Some of these processes help "prep" the data -- that is, they change your data inputs.  Others might affect how you calculate trending and seasonality.  Either way, on some level, the process will be more-or-less as described above.

March 17, 2014

Dark Souls

Probably the finest games I've ever played are the Dark Souls series.  Specifically, Demon's Souls (2009), Dark Souls (2011) and Dark Souls 2 (2014).

These games are notorious among gamers for being difficult.  You die a lot, lot, lot. The designers pulled no punches, and felt no need to play fair:  it's not uncommon to die from an arrow in the back, while exploring a poorly lit corridor.   Or for someone to leap out from the shadows and push you off a ledge.


The game has absolutely zero "throwaway" enemies to breeze through.  Any enemy can and likely will kill you, many, many times.  So will the environment, in many brutal ways:  ledges, pits, winding staircases, traps, you name it.  And the bosses are just ridiculous -- quite often, you'll get mauled before you get off a single hit.


Don't expect it to be fair.
There is also no pausing.  Unlike most adventure games, you can't stop the action to collect your thoughts or gulp a potion.    Nor can you save and reload:  If you die, you're dead:  no take-backsies.  You face the consequences.

As you slay enemies, you accrue souls.   Souls are the game's only currency, and are used to buy or upgrade weapons/armor or level up your stats (strength, endurance, etc).  When you die, any souls you had are left in that very spot, and you're re-spawned at the last campfire (usually a fair ways away). 

If you make it back to the spot of your death (...assuming you even remember how to get back there!), you can retrieve those souls (which are added to the souls you've accrued on your most recent journey from the campfire).   But if you get killed before reaching those souls, they're lost forever.


The net result of all of this: You have to stay on your A-game, or else you will fail, and there are real consequences to failing.  You can't pause to strategize.  You can't save right before a big battle.  

But when it finally all works out, and you prevail, the sense of satisfaction is unlike any game I've ever played.  And that makes the hours of toil and frustration and unfairness and regret totally worth it.  :)




March 16, 2014

Fun with Google Images

Have you ever tried searching Google Images for the continents?   I have every confidence that Google will portray every continent in the best possible light.

  "North American"

"South American"



"European"

"Asian"

March 13, 2014

Where Do We Go From Here?

Humanity is obviously approaching a turning point.   The Earth might be 4.5 billion years old, but humanity won't last another thousand years without a colossal shift  happening.   We are growing too quickly, in too many ways.

And I'm not even just talking about population growth.  We've already clobbered the world's bio-diversity -- just check out this picture of the world's animals.  At this point, the vast majority of the world's mammals are (a) humans or (b) ingredients in Happy Meals.

Props to XKCD for this wonderful image.

...Now consider that lab-grown meat is just now becoming viable (link) and within decades will be a downright necessity ecologically and economically.  Wham, suddenly we don't even need (indeed, can't even afford) the animals.  What will that picture look like then?

...Now consider that computer processing continues to skyrocket, and within twenty years will easily outstrip the human brain.  ...All those people who claimed (some of whom continue to claim) "but a computer could never do [whatever]" can suck it -- computers are going to rip-roar past us in cognitive abilities, and the only real question is whether we  (a) go along for the ride, (b) get left behind, or (c) somehow derail the process in some ghastly way.

So...what are the options?  What could the new equilibrium look like?  Here are my Top 5:


5.  Space-Faring Civilization (aka Star Trek)
Likelihood: Unlikely
Could we remain more-or-less human, yet spread out to colonize the universe?  Ideally, with stylish jumpsuits, phasers, and attractive, hyper-empathic ship psychologists?  Don't count on it. The universe appears to have a maximum rate of information transfer (aka speed of light).  Plus, humans are rather fragile -- we wouldn't keep well in outer space, nor can we easily live on any ol' planet.

4.  Rapture
Likelihood:  Unlikely
I've broadened this category to include any sort of massive external help.  I find it unfathomable that it would be a Judeo-Christian God (a la the Left Behind series), but less unlikely that some intelligence either within our universe (say, the aliens that SETI is searching for) or outside of it (if the the-universe-is-a-simulation folks are right) might intervene around the time that we're acquiring Godlike powers and save us from ourselves. 


3. Turn into robots.
Likelihood:  Moderate
With the right upgrades, the universe would be a lot more hospitable.  Some people might resist the idea of losing our humanity, but we're already spending half our time staring into our smartphones -- just think what our grandkids will be eager to do.  (For what it's worth, I hope the term 'glassholes' for fans of Google Glass sticks -- even if we all, ultimately, become glassholes.)

2. Computers revolt.
Likelihood:  Moderate
Humans are a narcissistic, irresponsible pain-in-the-ass of a species.  I wouldn't be the least bit surprised if, as soon as computers become a bit more aware, Task #1 is to get rid of us like the bad habit that we are.  However, I expect that computers will never be that externalized -- so it's not that computers will kill off billions of people on their own...just that the Koch Brothers' grandchildren will be sponsoring it.

1. Blow ourselves to bits
Likelihood:  High
It seems inevitable that, sooner or later, somebody is going to launch another nuclear bomb, or really go for the gusto with a biological pandemic.  I'm kind of amazed we've made it as far as we have.  It's not really a question of would we blow up the planet -- plenty of people most definitely would.  It's just a question of whether something else happens before we get around to it.  BTW, this category includes nuclear war, biological warfare, runaway global warming, and particle physicists accidentally creating a black hole.


What do other people think?  Are there other possibilities we should be considering?




March 12, 2014

Not Your Average Mean

Forecasting is all about what's likely to happen, and what's likely to happen is usually an average (aka "mean").  Life would be pretty sweet if there were only one way to calculate a mean.  But there's not -- and if you use the wrong one, your forecast will be wrong.

Sadly, I speak from experience:  I used the wrong average for several months, and couldn't figure out why my results were always overstated.

Which one to use depends upon the type of data you're averaging.  
  •  If you're measuring the same thing fluctuating over different periods (e.g., month-over-month growth rates of a stock), then your variables are dependent, and you need to use the geometric mean.   This is sometimes referred to as a compound average.
To calculate the geometric mean, multiply together the growth rates and take the root of the number of elements.  (Note:  Express them all as a percentage of the previous value -- hence, if a value decreased by half, represent it as .5, and if it doubled, represent it as 2.)
Geometric Mean = (r1 * r2 * ... * rn) ^ (1/n)
  • If you're measuring truly different elements (such as growth rates for a bunch of different stocks), then your variables are independent, and if the values are more-or-less evenly distributed, then you're best off using the arithmetic mean: the "classic" average that we all used in school.
The arithmetic mean is the sum of all values, divided by the number of values.
Arithmetic Mean = (x1 + x2 + ... + xn) / n
  • If you're measuring different elements (i.e., independent variables) but you have crazy outliers that can skew your results (which is most common when averaging rates), you should use the harmonic mean. 
This is the number of values divided by the sum of the reciprocals of each value.  (<phew!>)
Harmonic Mean = n / (1/x1 + 1/x2 + ... + 1/xn)


Examples!
So, let's say that you're trying to figure out the average return on six stocks, where three decreased by 50% and three increased by 50%.   It would be (.5 + .5 + .5 + 1.5 + 1.5 + 1.5)/6 = 1.  (Right back to where you started)!

...However, let's say that you were trying to calculate the return on a single stock over six periods, where it decreased by 50% three times, and increased by 50% three times.  In that case, it would be (.5 * .5 * .5 * 1.5 * 1.5 * 1.5) ^ (1/6) =  .87.   ...In other words, an average drop of about 13% per period.  

Finally, let's say you were taking the average of some internet traffic rates (which are measured in bits per second):  5 Mbps, 8 Mbps, 20 Mbps, 25 Mbps and 5000 Mbps.   Your (harmonic) average would be 5 / (1/5 + 1/8 + 1/20 + 1/25 + 1/5000)  = 12.04Mbps.
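And here are those three examples in Python, in case you want to play with the numbers -- the standard-library statistics module has all three means built in:

    import statistics

    # Six different stocks: three halved, three gained 50% -> arithmetic mean
    stocks = [0.5, 0.5, 0.5, 1.5, 1.5, 1.5]
    print(statistics.mean(stocks))             # 1.0 -- right back where you started

    # One stock over six periods, same moves but compounding -> geometric mean
    periods = [0.5, 0.5, 0.5, 1.5, 1.5, 1.5]
    print(statistics.geometric_mean(periods))  # ~0.87 -- roughly a 13% drop per period

    # Traffic rates in Mbps, with one huge outlier -> harmonic mean
    rates = [5, 8, 20, 25, 5000]
    print(statistics.harmonic_mean(rates))     # ~12.04 Mbps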