PyCon 2012 was a blast. One of the themes running through the conference was Python’s rising importance in data analysis, particularly as an exploratory and interactive platform. David Beazley points this out in his blog, taking note of the pandas and IPython projects.

pandas has always had time series functionality out of necessity. It was originally developed for quantitative finance at AQR Capital Management, where Wes and I worked together several years back. There’s a rich DateOffset class for date-time arithmetic (subclasses include Day, BDay, MonthEnd, etc.), and DateRange is a pandas Index composed of regularly recurring datetime objects for constructing time-indexed data structures.

However, we’ve also got a long-running branch on GitHub to take pandas’ time series capabilities to the next level.

(FAIR WARNING: API subject to change!!!)

scikits.timeseries

I’m currently merging the scikits.timeseries core functionality into pandas. This library is built around the concept of a Date object that carries with it frequency information. To me, it is helpful to think of Date as a particular interval of time. So for instance, a Date instance with daily frequency represents a particular day in which you have an associated observation; with a monthly frequency, the Date object is that particular month. For example,
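A minimal sketch of the interval idea, using `pandas.Period` from released pandas rather than the scikits.timeseries Date class itself (an assumption: Period is the closest analogue available today). The dates are made up for illustration.

```python
import pandas as pd

# A daily-frequency period stands for one whole day...
day = pd.Period("2012-03-15", freq="D")
# ...and a monthly-frequency period stands for one whole month.
month = pd.Period("2012-03", freq="M")

# Each period is an interval with a start and an end.
print(day.start_time, day.end_time)
print(month.start_time, month.end_time)

# Converting the day to monthly frequency gives the month that contains it.
print(day.asfreq("M") == month)
```

The same calendar date means different things depending on the frequency attached to it, which is the point of carrying the frequency around with the object.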

As Date instances are represented internally as ordinals (essentially, the integer number of elapsed intervals from the interval in which the proleptic Gregorian date Jan 1, 0001 occurs), arithmetic and interval conversions can be blazingly fast. It can also be an idiomatic way to do date arithmetic. For instance, to get the last business day of this month:
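To make the ordinal idea concrete, here is a rough sketch in plain Python rather than scikits.timeseries. The helpers `month_ordinal`, `month_start` and `last_business_day` are hypothetical names invented for this illustration, and "business day" here just means Monday to Friday, with holidays ignored.

```python
from datetime import date, timedelta


def month_ordinal(d):
    """Months counted from the month containing 0001-01-01 (that month is 1)."""
    return (d.year - 1) * 12 + d.month


def month_start(ordinal):
    """First day of the month with the given month ordinal."""
    year0, month0 = divmod(ordinal - 1, 12)
    return date(year0 + 1, month0 + 1, 1)


def last_business_day(d):
    """Last Monday-to-Friday day of the month containing d (holidays ignored)."""
    # Step to the next month with integer arithmetic, then back up one day.
    last = month_start(month_ordinal(d) + 1) - timedelta(days=1)
    # weekday(): Monday is 0, Saturday is 5, Sunday is 6.
    while last.weekday() >= 5:
        last -= timedelta(days=1)
    return last


# The standard library's daily ordinal uses the same proleptic Gregorian origin.
print(date(1, 1, 1).toordinal())
print(last_business_day(date(2012, 3, 15)))
```

Moving between months is a single integer addition, which is why this kind of arithmetic is cheap.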

The scikits.timeseries implementation turns out to be too rigid in certain ways. There’s no way to define half-day intervals or, say, a daily set of intervals offset by one hour. One immediate difference with the pandas implementation is that it will allow for multiples of base intervals. For instance, in pandas timeseries:
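As a hedged illustration of multiples of a base interval, the sketch below uses `pandas.Period` with a multiplied frequency string as found in released pandas. The branch API at the time of writing may have looked different, and the dates are illustrative.

```python
import pandas as pd

# A period spanning two calendar months, starting January 2012.
two_months = pd.Period("2012-01", freq="2M")

# Adding an integer steps by whole multiples of the period's own span.
following = two_months + 1
print(following)

# For comparison, a plain monthly period steps one month at a time.
print(pd.Period("2012-01", freq="M") + 1)
```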

And then of course, there is the associated IntervalIndex, which gives you a new index type to play with:
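A small sketch of an interval-based index, assuming released pandas, where (as far as I can tell) this role is played by `PeriodIndex`; the name IntervalIndex refers to something different in current pandas. The values are invented for the example.

```python
import pandas as pd

# Four consecutive monthly intervals starting January 2012.
index = pd.period_range("2012-01", periods=4, freq="M")

# Illustrative observations, one per month.
series = pd.Series([10.0, 11.5, 9.8, 12.1], index=index)
print(series)

# Look up the observation for March by its period label.
march_value = series.loc[pd.Period("2012-03", freq="M")]
print(march_value)
```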

You can alter the interval as you might in scikits.timeseries (say for instance you need to coerce the index to the final hour of each observed interval):
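One way this conversion can be written with released pandas is `PeriodIndex.asfreq`. This is a sketch under the assumption that the branch API described here corresponds to it; the dates are illustrative.

```python
import pandas as pd

# Three consecutive daily intervals.
days = pd.period_range("2012-03-01", periods=3, freq="D")

# Re-express each day as the last hourly interval it contains.
final_hours = days.asfreq("h", how="end")
print(final_hours)
```

Passing `how="start"` instead would pick the first hour of each day.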

DatetimeIndex

Another large change is that DateRange will be deprecated in favor of DatetimeIndex. Rather than a numpy array of (pointers to) Python datetime objects, the representation is an array of 64-bit ints compatible with the numpy datetime64 dtype … well, the standard UTC-microseconds-from-unix-epoch-ignoring-leap-seconds resolution, to be exact. Accessing an element of the index yields a Timestamp object, a datetime subclass that carries frequency information (if any). I’ve worked hard to keep backward compatibility with the old DateRange, so hopefully this change will be fairly transparent.

For instance:
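A brief sketch of what this looks like in released pandas, using `pd.date_range` to build the index. Note that released versions have generally stored nanoseconds rather than the microseconds mentioned above, so the sketch only inspects the integer dtype, not the raw values.

```python
import datetime

import pandas as pd

# A regular daily index (the role DateRange used to play).
index = pd.date_range("2012-03-01", periods=3, freq="D")

# The values are stored as 64-bit integers counted from the Unix epoch.
print(index.dtype, index.asi8.dtype)

# Element access returns a Timestamp, which subclasses datetime.datetime.
first = index[0]
print(type(first).__name__, isinstance(first, datetime.datetime))

# Frequency information, when there is any, is kept on the index.
print(index.freqstr)
```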

Unlike DateRange, a DatetimeIndex can be composed of arbitrary timestamps, making it irregular:
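A minimal sketch of an irregular index in released pandas; the timestamps and values are invented for the example.

```python
import pandas as pd

# Timestamps with no regular spacing between them.
stamps = ["2012-03-01 09:30:00", "2012-03-01 09:47:12", "2012-03-05 16:00:00"]
index = pd.DatetimeIndex(stamps)
series = pd.Series([1.0, 2.0, 3.0], index=index)

# With no regular spacing, no frequency is attached to the index.
print(index.freq)
print(series)
```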

Additionally, there will be new slicing features:
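The post does not spell out which slicing features were planned. One that exists in released pandas is selection by partial date strings, sketched here with illustrative data; treat it as an example of the general direction rather than the exact feature set of the branch.

```python
import pandas as pd

# Daily observations from February through April 2012.
index = pd.date_range("2012-02-01", "2012-04-30", freq="D")
series = pd.Series(range(len(index)), index=index)

# Select a whole month with a partial date string.
march = series.loc["2012-03"]

# Slice between two dates; both endpoints are included.
window = series.loc["2012-03-10":"2012-03-20"]

print(len(march), len(window))
```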

Another new feature will be conversion and resampling using the pandas group-by machinery. For instance, we can do OHLC downsampling:
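A hedged sketch of OHLC downsampling, written against the `resample(...).ohlc()` spelling in released pandas; the API on the branch at the time may have differed. The prices are just the numbers 0 through 9 so the bars are easy to check by eye.

```python
import pandas as pd

# Ten one-minute observations starting at 09:30.
index = pd.date_range("2012-03-01 09:30", periods=10, freq="min")
prices = pd.Series(range(10), index=index, dtype="float64")

# Collapse into five-minute bars: first, max, min and last value per bar.
bars = prices.resample("5min").ohlc()
print(bars)
```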

And upsampling:
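Going the other way, here is a small sketch of upsampling with forward fill, again using the `resample` spelling from released pandas as an assumption; the data is illustrative.

```python
import pandas as pd

# One observation every two days.
index = pd.date_range("2012-03-01", periods=3, freq="2D")
sparse = pd.Series([1.0, 2.0, 3.0], index=index)

# Upsample to daily frequency, carrying the last known value forward.
daily = sparse.resample("D").ffill()
print(daily)
```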

Performance

Besides features, there are performance wins in deriving DatetimeIndex from Int64Index. We get khash-based hashing for free. The int64-compatible dtype allows for vectorized, Cython-optimized datetime arithmetic and timezone conversions.
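To give a feel for what vectorized means here, this sketch applies arithmetic, a label lookup and a time zone conversion to a whole index at once, using calls from released pandas. It illustrates the style of operation and is not a benchmark; the time zone name is an arbitrary choice.

```python
import pandas as pd

index = pd.date_range("2012-03-01", periods=4, freq="D")

# Vectorized arithmetic: shift every timestamp by one hour in one operation.
shifted = index + pd.Timedelta(hours=1)

# Label-to-position lookup, served by the index's lookup engine.
position = index.get_loc(pd.Timestamp("2012-03-03"))

# Vectorized time zone handling: label the values as UTC, then convert.
eastern = index.tz_localize("UTC").tz_convert("America/New_York")

print(shifted[0], position, eastern[0])
```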

Release

The timeline for a stable release is still fuzzy, as there is still plenty to do:

  • API: Crafting an elegant, cohesive interface for the new features. Heavy dogfooding!
  • Internals: the DataFrame block manager needs to become datetime64-aware (perhaps interval-aware as well)
  • Plotting: scikits.timeseries had some great matplotlib plotting infrastructure we are doing our best to port

With all the fundamentals falling in place, pandas time series analysis will be a force to be reckoned with!

  • http://twitter.com/timClicks Tim McNamara

    This is really great to see. I’m sure that the unification work is difficult, but it’s very helpful. As a minor thing, I would really like to be able to specify the date format. I find MM/DD/YYYY really baffling.

    • AdamDKlein

      I was thinking about this at one point. I agree, should definitely support other date formats.

  • fawce

    Looks superb, especially the internal representation.
    How much timezone support will you provide? Locale affects everything from date format to business days (Sun through Thu vs. Mon through Fri, Holidays, etc).

    • AdamDKlein

      For DatetimeIndex, you can attach tzinfo and carry it around with the time stamps, and there are cython methods to do vectorized time zone conversions. There’s probably more work to do here, but time zone support is an important component and hasn’t been forgotten.

    • http://blog.wesmckinney.com Wes McKinney

      Need to look more at API requirements / convenience for time zone support– it would be nice if pandas did all the hard work there

  • Iain McFadyen

    I’m also glad to hear that timezone support is on deck.
    On another topic, how easy would it be to specify custom, irregular frequencies? I’ve done this in pandas 0.7.1 using custom DateOffset classes with appropriately crafted onOffset, rollforward, rollback, and apply methods.

    For example: every Monday at noon except when Monday is a US Federal holiday, in which case Tuesday at 1pm.

    Thanks for all your hard work!

    • AdamDKlein

      We’re not removing any capabilities, just adding new ones. Out of curiosity, do you have any use cases not satisfied (or too slow, etc)?

© 2014 Adam Klein's Blog