Reviewing some basic probability and statistics, I rediscovered the geometric distribution, which gives the probability that your first win in a game comes on a given attempt. With it, we can figure out the number of times you should expect to play the game until the outcome is “success” – or, equivalently, the number of plays you’d need in order to expect one success among them.

Why is this important? They say the odds of winning the current record-setting Powerball lottery are 1 in 176 million. I want to be rich. How many tickets do I need to buy to expect a winning one?

The intuition is straightforward. For instance, if there is a p = 25% chance of success on any one trial, the expected ratio of successes to trials is 1/4, or one out of every four, which means that any four trials should yield one success on average. Mathematically, if we let X be the number of trials required until the first success occurs, then E[X] = 1/p.
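
To make the intuition concrete, here is a quick simulation sketch: play many games with p = 0.25 and average the number of trials until the first success; the average should hover around 1/p = 4.

    import random

    def trials_until_first_success(p):
        # Play independent trials with success probability p; count how many are needed.
        trials = 1
        while random.random() >= p:
            trials += 1
        return trials

    p = 0.25
    n_games = 100_000
    avg = sum(trials_until_first_success(p) for _ in range(n_games)) / n_games
    print(avg)  # hovers around 1/p = 4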

So I should expect one winner among 176 million randomly generated tickets (though, since duplicates are possible, the chance of at least one winner is only about 1 − 1/e ≈ 63%). Who wants to loan me $176,000,000?

The proof is neat.

If p is the probability of success, and q = (1-p) is the probability of failure, then the probability of succeeding on the i’th trial after (i-1) failures is:

Pr(X=i) = q^(i-1)p

So

E[X] = sum_{i=1 to ∞} i · Pr(X=i)
     = sum_{i=1 to ∞} i · q^(i-1) · p
     = p · sum_{i=1 to ∞} i · q^(i-1)
     = p · 1/(1-q)^2
     = p/p^2 = 1/p

Lines three to four depend on some magic with infinite series. For a converging geometric series with |x| < 1,

sum_{i=0 to ∞} x^i = 1/(1-x)

and differentiating both sides with respect to x gives

sum_{i=1 to ∞} i · x^(i-1) = 1/(1-x)^2

We use this last equation in the derivation above.

So what else can we do with this? Suppose we want to know how many rolls of a die are needed to see all six sides come up at least once (given a fair six-sided die, where each roll is independent). On the first roll, the probability of getting a side we haven't seen is 1. The next roll then has a 5/6 probability of yielding a new side; after that side comes up, the probability drops to 4/6 ... and so on. Each stage is its own little geometric distribution, so the expected number of rolls spent in each stage is 1/p.

So, 1 + 1/(5/6) + 1/(4/6) + 1/(3/6) + 1/(2/6) + 1/(1/6) = 1 + 6/5 + 6/4 + 6/3 + 6/2 + 6/1 ≈ 14.7

Of course, I distrust math: I need evidence. And so Python to the rescue:
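
A minimal simulation sketch along those lines: roll a fair die until every face has shown up, repeat many times, and average the roll counts.

    import random

    def rolls_until_all_sides(sides=6):
        # Roll a fair die until every side has appeared at least once.
        seen = set()
        rolls = 0
        while len(seen) < sides:
            seen.add(random.randint(1, sides))
            rolls += 1
        return rolls

    n_runs = 100_000
    avg = sum(rolls_until_all_sides() for _ in range(n_runs)) / n_runs
    print(avg)  # should land near 14.7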


PyCon 2012 was a blast. One of the themes running through the conference was Python’s rising importance in data analysis, particularly as an exploratory and interactive platform. David Beazley points this out in his blog, taking note of the pandas and ipython projects.

pandas has always had time series functionality out of necessity: it was originally developed for quantitative finance at AQR Capital Management, where Wes and I worked together several years back. There’s a rich DateOffset class for date-time arithmetic (with subclasses such as Day, BDay, and MonthEnd), and DateRange is a pandas Index composed of regularly recurring datetime objects, used for constructing time-indexed data structures.
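
For example, DateOffset arithmetic looks roughly like this (a sketch using the offsets module as it exists in released pandas):

    from datetime import datetime
    from pandas.tseries.offsets import BDay, MonthEnd

    d = datetime(2012, 3, 15)
    print(d + BDay(2))      # two business days later: 2012-03-19
    print(d + MonthEnd(1))  # roll forward to the month end: 2012-03-31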

However, we’ve also got a long-running branch on GitHub to take pandas time series capabilities to the next level.

(FAIR WARNING: API subject to change!!!)

scikits.timeseries

I’m currently merging the scikits.timeseries core functionality into pandas. This library is built around the concept of a Date object that carries with it frequency information. To me, it is helpful to think of Date as a particular interval of time. So for instance, a Date instance with daily frequency represents a particular day in which you have an associated observation; with a monthly frequency, the Date object is that particular month. For example:
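
A minimal sketch of the idea, using the name this type eventually shipped under in released pandas (Period) rather than the Date/Interval naming described here:

    import pandas as pd

    # Monthly frequency: this object stands for the whole month of March 2012
    month = pd.Period('2012-03', freq='M')

    # Daily frequency: this one stands for the particular day March 15, 2012
    day = pd.Period('2012-03-15', freq='D')

    print(month, day)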

As Date instances are represented internally as ordinals (essentially, the integer number of elapsed intervals since the interval containing the proleptic Gregorian date Jan 1, 0001), arithmetic and interval conversions can be blazingly fast. It also makes for idiomatic date arithmetic. For instance, to get the last business day of this month:
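
With the released names (Period in place of Date), the idiom looks roughly like this; asfreq converts between frequencies, and how='E' anchors at the end of the interval:

    import pandas as pd

    # Current month as a monthly-frequency period, then convert to business-day
    # frequency anchored at the end ('E') of the month: the last business day.
    this_month = pd.Timestamp.now().to_period('M')
    last_bday = this_month.asfreq('B', how='E')
    print(last_bday)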

The scikits.timeseries implementation turns out to be too rigid in certain ways: there’s no way to define half-day intervals or, say, a daily set of intervals offset by one hour. One immediate difference in the pandas implementation is that it will allow multiples of base intervals. For instance, in the pandas time series branch:
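
A sketch of what interval multiples look like with the released API (Period rather than Interval); the '5H' frequency string below assumes multiples of the hourly base frequency, as released pandas accepts:

    import pandas as pd

    # A single span covering five hours, built from a multiple of the hourly base frequency
    p = pd.Period('2012-01-01 19:00', freq='5H')
    print(p)      # Period('2012-01-01 19:00', '5H')
    print(p + 1)  # the next five-hour span, starting 2012-01-02 00:00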

And then of course, there is the associated IntervalIndex, which gives you a new index type to play with:
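
In released pandas this index type ended up being called PeriodIndex rather than IntervalIndex; a minimal sketch:

    import pandas as pd
    import numpy as np

    # A monthly PeriodIndex and a Series indexed by it
    idx = pd.period_range('2012-01', periods=6, freq='M')
    ts = pd.Series(np.random.randn(6), index=idx)
    print(ts)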

You can alter the interval as you might in scikits.timeseries (say for instance you need to coerce the index to the final hour of each observed interval):
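
With the released names, that coercion is an asfreq call on the index; a sketch, where how='E' anchors each period at its end:

    import pandas as pd

    idx = pd.period_range('2012-01', periods=3, freq='M')

    # Coerce each monthly period to the final hour of that month
    print(idx.asfreq('H', how='E'))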

DatetimeIndex

Another large change is that DateRange will be deprecated in favor of DatetimeIndex. Rather than a numpy array of (pointers to) Python datetime objects, the representation is an array of 64-bit ints compatible with the numpy datetime64 dtype … well, the standard UTC-microseconds-from-unix-epoch-ignoring-leap-seconds resolution, to be exact. Accessing the index will yield a Timestamp object: a datetime subclass that carries frequency information (if any). I’ve worked hard to keep backward compatibility with the old DateRange, so hopefully this change will be fairly transparent.

For instance:
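
(A sketch using the factory function as it later shipped, date_range; indexing into the resulting DatetimeIndex hands back a Timestamp.)

    import pandas as pd

    rng = pd.date_range('2012-03-01', periods=5, freq='D')
    print(rng)        # a DatetimeIndex backed by datetime64 values

    stamp = rng[0]
    print(stamp)      # a Timestamp, which is a datetime subclass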

Unlike DateRange, a DatetimeIndex can be composed of arbitrary timestamps, making it irregular:
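
A minimal sketch (constructor usage as in released pandas):

    import pandas as pd

    # Arbitrary, irregularly spaced timestamps
    idx = pd.DatetimeIndex(['2012-01-03 09:30', '2012-01-05 16:00', '2012-02-14 12:00'])
    print(idx)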

Additionally, there will be new slicing features:
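
A sketch of partial-string indexing and string-based slicing as they work in released pandas (the branch's exact slicing features may have differed):

    import pandas as pd
    import numpy as np

    ts = pd.Series(np.random.randn(366),
                   index=pd.date_range('2012-01-01', periods=366, freq='D'))

    # Partial string indexing: select all of February 2012
    print(ts.loc['2012-02'])

    # Slice between two dates given as strings
    print(ts.loc['2012-01-15':'2012-02-15'])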

Another new feature will be conversion and resampling using the pandas group-by machinery. For instance, we can do OHLC downsampling:
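
A sketch using the resample API as it later stabilized (the method-chaining form post-dates this post):

    import pandas as pd
    import numpy as np

    ts = pd.Series(np.random.randn(90).cumsum(),
                   index=pd.date_range('2012-01-01', periods=90, freq='D'))

    # Downsample daily data to monthly open/high/low/close bars
    monthly_ohlc = ts.resample('M').ohlc()
    print(monthly_ohlc)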

And upsampling:
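
And a corresponding upsampling sketch, again with the later resample API:

    import pandas as pd

    monthly = pd.Series([1.0, 2.0, 3.0],
                        index=pd.date_range('2012-01-31', periods=3, freq='M'))

    # Upsample to daily frequency, forward-filling each monthly value
    daily = monthly.resample('D').ffill()
    print(daily.head())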

Performance

Besides features, there are wins in deriving DatetimeIndex from Int64Index. We get khash-based hashing for free. The int64-compatible dtype allows for vectorized, cython-optimized datetime arithmetic and timezone conversions.
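
For a feel of what that enables, a sketch of vectorized operations on a DatetimeIndex (pd.Timedelta and the tz methods shown are from later released pandas):

    import pandas as pd

    rng = pd.date_range('2012-03-01', periods=1000, freq='H')

    # Vectorized datetime arithmetic over the underlying int64 data
    shifted = rng + pd.Timedelta(hours=1)

    # Vectorized timezone localization and conversion
    eastern = rng.tz_localize('UTC').tz_convert('US/Eastern')
    print(shifted[:2], eastern[:2])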

Release

The timeline on stable release is still fuzzy, as there is still plenty to do:

  • API: Crafting an elegant, cohesive interface for the new features. Heavy dogfooding!
  • Internals: the DataFrame block manager needs to become datetime64-aware (perhaps interval-aware as well)
  • Plotting: scikits.timeseries had some great matplotlib plotting infrastructure we are doing our best to port

With all the fundamentals falling in place, pandas time series analysis will be a force to be reckoned with!
