PyCon 2012 was a blast. One of the themes running through the conference was Python’s rising importance in data analysis, particularly as an exploratory and interactive platform. David Beazley points this out in his blog, taking note of the pandas and IPython projects.
Out of necessity, pandas has always had time series functionality. It was originally developed for quantitative finance at AQR Capital Management, where Wes and I worked together several years back. There’s a rich DateOffset class for date-time arithmetic (subclasses include Day, BDay, MonthEnd, etc.), and DateRange is a pandas Index composed of regularly recurring datetime objects for constructing time-indexed data structures.
However, we’ve also got a long-running branch on GitHub to take pandas time series capabilities to the next level.
(FAIR WARNING: API subject to change!!!)
scikits.timeseries
I’m currently merging the scikits.timeseries core functionality into pandas. This library is built around the concept of a Date object that carries with it frequency information. To me, it is helpful to think of Date as a particular interval of time. So for instance, a Date instance with daily frequency represents a particular day in which you have an associated observation; with a monthly frequency, the Date object is that particular month. For example,
In [1]: from scikits.timeseries import Date
In [2]: Date('D', '3/10/12')
Out[2]: <D : 10-Mar-2012>
In [3]: Date('D', '3/10/12').asfreq('M')
Out[3]: <M : Mar-2012>
As Date instances are represented internally as ordinals (essentially, the integer number of intervals elapsed since the interval containing the proleptic Gregorian date Jan 1, 0001), arithmetic and interval conversions can be blazingly fast. This also makes for an idiomatic way to do date arithmetic. For instance, to get the last business day of this month:
In [4]: Date('D', '3/10/12').asfreq('M').asfreq('B','END')
Out[4]: <B : 30-Mar-2012>
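To make the ordinal idea concrete, here is a minimal pure-Python sketch. It is not how scikits.timeseries is implemented; the helper names (`month_ordinal`, `month_from_ordinal`, `last_business_day`) are made up for illustration, and "business day" here just means Monday to Friday, with holidays ignored.

```python
from datetime import date, timedelta
import calendar

def month_ordinal(d):
    """Whole months elapsed since January of year 1 (hypothetical helper)."""
    return (d.year - 1) * 12 + (d.month - 1)

def month_from_ordinal(n):
    """Inverse of month_ordinal: return (year, month)."""
    year0, month0 = divmod(n, 12)
    return year0 + 1, month0 + 1

def last_business_day(year, month):
    """Last Monday-to-Friday day of the month (holidays ignored)."""
    last = date(year, month, calendar.monthrange(year, month)[1])
    while last.weekday() >= 5:        # 5 = Saturday, 6 = Sunday
        last -= timedelta(days=1)
    return last

d = date(2012, 3, 10)
m = month_ordinal(d)                        # daily -> monthly is integer arithmetic
next_month = month_from_ordinal(m + 1)      # adding 1 moves one whole month: (2012, 4)
lbd = last_business_day(*month_from_ordinal(m))   # date(2012, 3, 30)
```

Once a period is just an integer, shifting it or converting it to another frequency is cheap integer math, which is where the speed comes from.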
The scikits.timeseries implementation turns out to be too rigid in certain ways. There’s no way to define half-day intervals or, say, a daily set of intervals offset by one hour. One immediate difference in the pandas implementation is that it will allow for multiples of base intervals. For instance, in the pandas time series branch:
In [19]: i = Interval('3/10/12', '12H')
In [20]: i
Out[20]: Interval('10-Mar-2012 00:00', '12H')
In [21]: i + 1
Out[21]: Interval('10-Mar-2012 12:00', '12H')
In [22]: i + 2
Out[22]: Interval('11-Mar-2012 00:00', '12H')
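One way to picture how multiples of a base interval can work is to store an integer count of whole spans since some origin. The class below is a hypothetical stand-in, not the Interval class from the branch; the name `SimpleInterval` and the choice of origin are assumptions made for the sketch.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1, 1, 1)   # assumed origin for the sketch

class SimpleInterval:
    """Hypothetical interval of fixed span, stored as an integer ordinal."""

    def __init__(self, start, span):
        self.span = span
        # Integer count of whole spans elapsed since EPOCH.
        self.ordinal = (start - EPOCH) // span

    @property
    def start(self):
        return EPOCH + self.ordinal * self.span

    def __add__(self, n):
        # Shifting by n intervals is just integer addition on the ordinal.
        new = SimpleInterval(self.start, self.span)
        new.ordinal += n
        return new

i = SimpleInterval(datetime(2012, 3, 10), timedelta(hours=12))
print((i + 1).start)   # 2012-03-10 12:00:00
print((i + 2).start)   # 2012-03-11 00:00:00
```

Because the span is an arbitrary timedelta, a 12-hour interval needs no special casing compared to a daily one.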
And then of course, there is the associated IntervalIndex, which gives you a new index type to play with:
In [26]: ii = IntervalIndex(start='3/10/12', end='3/12/12', freq='12H')
In [27]: s = Series(np.random.rand(len(ii)), index=ii)
In [28]: s
Out[28]:
10-Mar-2012 00:00 0.566687
10-Mar-2012 12:00 0.937349
11-Mar-2012 00:00 0.031451
11-Mar-2012 12:00 0.729145
12-Mar-2012 00:00 0.212382
You can alter the interval as you might in scikits.timeseries (say for instance you need to coerce the index to the final hour of each observed interval):
In [30]: s.index = s.index.resample('H', 'E')
In [31]: s
Out[31]:
10-Mar-2012 11:00 0.566687
10-Mar-2012 23:00 0.937349
11-Mar-2012 11:00 0.031451
11-Mar-2012 23:00 0.729145
12-Mar-2012 11:00 0.212382
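The relabelling above amounts to simple arithmetic on the interval boundaries. Here is a rough plain-Python sketch of that arithmetic; `end_hour` is a hypothetical helper, not part of the branch API.

```python
from datetime import datetime, timedelta

def end_hour(start, span, unit=timedelta(hours=1)):
    """Start of the last unit-sized sub-interval inside [start, start + span)."""
    return start + span - unit

half_day = timedelta(hours=12)
starts = [datetime(2012, 3, 10) + k * half_day for k in range(5)]
ends = [end_hour(s, half_day) for s in starts]
print(ends[0])    # 2012-03-10 11:00:00
print(ends[-1])   # 2012-03-12 11:00:00
```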
DatetimeIndex
Another large change is that DateRange will be deprecated in favor of DatetimeIndex. Rather than a numpy array of (pointers to) Python datetime objects, the underlying representation is an array of 64-bit ints compatible with the numpy datetime64 dtype (the standard UTC-microseconds-from-unix-epoch-ignoring-leap-seconds resolution, to be exact). Indexing into it yields a Timestamp object, a subclass of datetime that carries frequency information (if any). I’ve worked hard to keep backward compatibility with the old DateRange, so hopefully this change will be fairly transparent.
For instance:
In [54]: dti = DatetimeIndex(start='1/1/2001', end='12/1/2001', freq='M')
In [55]: s = Series(np.random.rand(len(dti)), dti)
In [56]: s
Out[56]:
2001-01-31 0.479981
2001-02-28 0.742215
2001-03-31 0.846832
2001-04-30 0.388186
2001-05-31 0.398850
2001-06-30 0.628447
2001-07-31 0.743980
2001-08-31 0.466137
2001-09-30 0.762505
2001-10-31 0.300981
2001-11-30 0.164802
In [57]: ts = s.index[0]
In [58]: ts
Out[58]: Timestamp(2001, 1, 31, 0, 0)
In [59]: ts.offset
Out[59]: <1 MonthEnd>
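The integer representation described above can be illustrated without numpy at all. This is only a sketch of the encoding (UTC microseconds since the Unix epoch, leap seconds ignored); the helper names `to_us` and `from_us` are made up for the example.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_us(dt):
    """UTC datetime -> integer microseconds since the Unix epoch."""
    return (dt - UNIX_EPOCH) // timedelta(microseconds=1)

def from_us(n):
    """Integer microseconds since the Unix epoch -> UTC datetime."""
    return UNIX_EPOCH + timedelta(microseconds=n)

stamps = [datetime(2001, 1, 31, tzinfo=timezone.utc),
          datetime(2001, 2, 28, tzinfo=timezone.utc)]
as_ints = [to_us(s) for s in stamps]

# Differences between timestamps become plain integer subtraction.
gap_days = (as_ints[1] - as_ints[0]) // (86400 * 10**6)   # 28
```

Storing plain integers like these is what lets an index compare and hash timestamps without touching Python datetime objects.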
Unlike DateRange, a DatetimeIndex can be composed of arbitrary timestamps, making it irregular:
In [74]: x = DatetimeIndex(['1/1/2010', '2/4/2011 16:24:45.123'])
In [75]: x
Out[75]: DatetimeIndex([2010-01-01 00:00:00, 2011-02-04 16:24:45.123000], dtype=datetime64[us])
In [76]: s = Series(np.random.rand(len(x)), x)
In [77]: s
Out[77]:
2010-01-01 00:00:00 0.917592
2011-02-04 16:24:45.123000 0.926309
Additionally, there will be new slicing features:
In [4]: dti = DatetimeIndex(start='1/1/2001', end='6/1/2001', freq='D')
In [5]: s = Series(np.random.rand(len(dti)), dti)
In [6]: s.head()
Out[6]:
2001-01-01 0.571153
2001-01-02 0.685178
2001-01-03 0.915576
2001-01-04 0.185320
2001-01-05 0.141775
In [8]: s['Mar 2001']
Out[8]:
2001-03-01 0.602943
2001-03-02 0.129155
2001-03-03 0.932543
2001-03-04 0.534650
2001-03-05 0.311436
2001-03-06 0.746531
2001-03-07 0.655023
2001-03-08 0.399160
2001-03-09 0.024832
2001-03-10 0.178017
2001-03-11 0.059523
2001-03-12 0.209618
2001-03-13 0.722570
2001-03-14 0.149654
2001-03-15 0.932882
2001-03-16 0.309638
2001-03-17 0.567673
2001-03-18 0.794191
2001-03-19 0.608495
2001-03-20 0.051329
2001-03-21 0.262767
2001-03-22 0.316965
2001-03-23 0.832362
2001-03-24 0.118094
2001-03-25 0.687097
2001-03-26 0.741237
2001-03-27 0.082728
2001-03-28 0.984076
2001-03-29 0.563734
2001-03-30 0.656616
2001-03-31 0.319237
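Conceptually, a lookup like 'Mar 2001' can be thought of as turning the string into a half-open date range and bisecting a sorted index. The sketch below shows that idea in plain Python; it assumes the year and month are already parsed, and `month_slice` is a hypothetical helper rather than anything in pandas.

```python
from bisect import bisect_left
from datetime import date, timedelta

# Daily keys from Jan 1 through Jun 1, 2001 (152 days), with placeholder values.
days = [date(2001, 1, 1) + timedelta(days=k) for k in range(152)]
values = list(range(len(days)))

def month_slice(keys, vals, year, month):
    """Return (key, value) pairs falling in the given month; keys must be sorted."""
    lo_key = date(year, month, 1)
    hi_key = date(year + month // 12, month % 12 + 1, 1)   # first day of next month
    lo = bisect_left(keys, lo_key)
    hi = bisect_left(keys, hi_key)
    return list(zip(keys[lo:hi], vals[lo:hi]))

march = month_slice(days, values, 2001, 3)
print(len(march))        # 31
print(march[0][0])       # 2001-03-01
```

Two binary searches locate the slice, so the cost does not grow with the number of rows selected beyond copying them out.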
Another new feature will be conversion and resampling using the pandas group-by machinery. For instance, we can downsample to OHLC (open, high, low, close) bars:
In [11]: dti = DatetimeIndex(start='1/1/2001', end='1/31/2001 23:59:59', freq='S')
In [12]: s = Series(np.random.rand(len(dti)), dti)
In [13]: s2 = s.convert('5Min', how='ohlc')
In [14]: s2.head()
Out[14]:
close high low open
2001-01-01 00:00:00 0.328680 0.328680 0.328680 0.328680
2001-01-01 00:05:00 0.612600 0.998840 0.002629 0.907757
2001-01-01 00:10:00 0.264069 0.997718 0.000011 0.698224
2001-01-01 00:15:00 0.387993 0.999033 0.002355 0.357982
2001-01-01 00:20:00 0.508035 0.999880 0.005871 0.367008
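For readers unfamiliar with OHLC bars, here is a small pure-Python sketch of the grouping step. It is not the pandas group-by machinery: the function name `ohlc` is made up, and the sketch assumes bins closed on the left and labelled by their left edge, which may not match the binning convention behind the output above.

```python
from datetime import datetime, timedelta

def ohlc(samples, width):
    """Group time-sorted (timestamp, value) pairs into fixed-width OHLC bars."""
    origin = datetime(1970, 1, 1)
    bars = {}
    for ts, v in samples:
        # Left edge of the bin containing ts (assumed convention).
        key = origin + ((ts - origin) // width) * width
        if key not in bars:
            bars[key] = {"open": v, "high": v, "low": v, "close": v}
        else:
            bar = bars[key]
            bar["high"] = max(bar["high"], v)
            bar["low"] = min(bar["low"], v)
            bar["close"] = v          # last value seen so far in this bin
    return bars

# Ten minutes of one-second samples with deterministic placeholder values.
samples = [(datetime(2001, 1, 1) + timedelta(seconds=k), float(k % 7))
           for k in range(600)]
bars = ohlc(samples, timedelta(minutes=5))
print(bars[datetime(2001, 1, 1)])
# {'open': 0.0, 'high': 6.0, 'low': 0.0, 'close': 5.0}
```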
And upsampling:
In [21]: dti = DatetimeIndex(start='1/1/2001', end='1/3/2001', freq='D')
In [22]: s = Series(np.random.rand(len(dti)), dti)
In [23]: s
Out[23]:
2001-01-01 0.790748
2001-01-02 0.827149
2001-01-03 0.836007
In [24]: s.convert('12H', method='pad')
Out[24]:
2001-01-01 00:00:00 0.790748
2001-01-01 12:00:00 0.790748
2001-01-02 00:00:00 0.827149
2001-01-02 12:00:00 0.827149
2001-01-03 00:00:00 0.836007
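The 'pad' method above is a forward fill: each new grid point takes the most recent observation at or before it. Below is a minimal plain-Python sketch of that rule, reusing the three values from the session above; `upsample_pad` is a hypothetical name, not the branch API.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

def upsample_pad(times, vals, step):
    """Forward-fill sorted observations onto a regular grid from first to last time."""
    out = []
    t = times[0]
    while t <= times[-1]:
        k = bisect_right(times, t) - 1    # last observation at or before t
        out.append((t, vals[k]))
        t += step
    return out

times = [datetime(2001, 1, d) for d in (1, 2, 3)]
vals = [0.790748, 0.827149, 0.836007]
filled = upsample_pad(times, vals, timedelta(hours=12))
for t, v in filled:
    print(t, v)
```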
Performance
Besides features, there are performance wins in deriving DatetimeIndex from Int64Index. We get khash-based hashing for free. The int64-compatible dtype allows for vectorized, Cython-optimized datetime arithmetic and timezone conversions.
Release
The timeline for a stable release is still fuzzy, as there is still plenty to do:
- API: Crafting an elegant, cohesive interface for the new features. Heavy dogfooding!
- Internals: the DataFrame block manager needs to become datetime64-aware (perhaps interval-aware as well)
- Plotting: scikits.timeseries had some great matplotlib plotting infrastructure that we are doing our best to port
With all the fundamentals falling in place, pandas time series analysis will be a force to be reckoned with!