Reviewing some basic probability and statistics, I rediscovered the geometric distribution, which gives the probability that the first success in a sequence of independent trials occurs on a given attempt. With it, we can figure out the number of times you should expect to play a game until the outcome is “success” – or, equivalently, the number of plays you’d need in order to expect one success among them.

Why is this important? They say the odds of winning the current record-setting powerball lottery are 1 in 176 million. I want to be rich. How many tickets do I need to buy to expect a winning one?

The intuition is straightforward. For instance, if there is a p=25% chance of success on any one trial, the expected ratio of successes to trials is 1/4, or one out of every four. That means any four trials should yield one success on average. Mathematically, if we let X be the number of trials required until a success occurs, then E[X] = 1/p.
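
A quick empirical check (a minimal sketch, standard library only):

import random

def trials_until_success(p):
    # count independent trials until the first success
    n = 1
    while random.random() >= p:
        n += 1
    return n

samples = [trials_until_success(0.25) for _ in range(100000)]
print(sum(samples) / float(len(samples)))  # hovers around 1/p = 4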

So the odds are in my favor after buying 176 million randomly generated tickets. Who wants to loan me $176,000,000?

The proof is neat.

If p is the probability of success, and q = (1-p) is the probability of failure, then the probability of succeeding on the i’th trial after (i-1) failures is:

Pr(X=i) = q^(i-1)p

So the expected number of trials works out as follows:

E[X] = Σ_i i · Pr(X=i)        (1)
     = Σ_i i · q^(i-1) · p    (2)
     = p · Σ_i i · q^(i-1)    (3)
     = p · 1/(1-q)^2          (4)
     = p/p^2 = 1/p            (5)

(where the sums run over i = 1, 2, 3, …)

Lines three to four depend on some magic with infinite series. If you have a converging geometric series with |x| < 1:

Σ_i x^i = 1/(1-x)    (sum over i = 0, 1, 2, …)

then differentiating both sides with respect to x gives:

Σ_i i · x^(i-1) = 1/(1-x)^2

We use this last equation in the derivation above. So what else can we do with this? Suppose we want to know how many rolls of a die are needed to see all six sides come up at least once (given a six-sided, fair die, where each roll is an independent event). On the first roll, the probability of getting a side we haven't seen is 1. The next roll then has a 5/6 probability of yielding a new side; after that one comes up, the probability of another new side becomes 4/6, and so on. Summing the expected waits for each new side: 1 + 1/(5/6) + 1/(4/6) + 1/(3/6) + 1/(2/6) + 1/(1/6) = 14.7. Of course, I distrust math: I need evidence. And so Python to the rescue:

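(A minimal simulation sketch; the helper name is mine.)

import random

def rolls_to_see_all_sides(n_sides=6):
    # roll a fair die until every side has appeared at least once
    seen = set()
    rolls = 0
    while len(seen) < n_sides:
        seen.add(random.randint(1, n_sides))
        rolls += 1
    return rolls

trials = [rolls_to_see_all_sides() for _ in range(100000)]
print(sum(trials) / float(len(trials)))  # comes out near 14.7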

PyCon 2012 was a blast. One of the themes running through the conference was Python’s rising importance in data analysis, particularly as an exploratory and interactive platform. David Beazley points this out in his blog, taking note of the pandas and IPython projects.

pandas has always had time series functionality out of necessity. It was originally developed for quantitative finance at AQR Capital Management, where Wes and I worked together several years back. There’s a rich DateOffset class hierarchy to provide date-time arithmetic (instances of which include Day, BDay, MonthEnd, etc.), and DateRange is a pandas Index composed of regularly recurring datetime objects for constructing time-indexed data structures.

However, we’ve also got a long-running branch on GitHub to take pandas time series capabilities to the next level.

(FAIR WARNING: API subject to change!!!)

scikits.timeseries

I’m currently merging the scikits.timeseries core functionality into pandas. This library is built around the concept of a Date object that carries with it frequency information. To me, it is helpful to think of Date as a particular interval of time. So for instance, a Date instance with daily frequency represents a particular day in which you have an associated observation; with a monthly frequency, the Date object is that particular month. For example,
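in scikits.timeseries it looks something like this (from memory, so treat the exact syntax as illustrative):

import scikits.timeseries as ts

# daily frequency: one particular day
d = ts.Date('D', year=2012, month=3, day=16)

# monthly frequency: the entire month of March 2012
m = ts.Date('M', year=2012, month=3)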

As Date instances are represented internally as ordinals (essentially, the integer number of intervals elapsed since the interval containing the proleptic Gregorian date Jan 1, 0001), arithmetic and interval conversions can be blazingly fast. It also makes for idiomatic date arithmetic. For instance, to get the last business day of this month:
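
(From memory; the keyword for the anchoring relation has varied across scikits.timeseries versions.)

import scikits.timeseries as ts

# take the current month, convert to business-day frequency,
# anchored at the end of the monthly interval
last_bday = ts.now('M').asfreq('B', relation='END')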

The scikits.timeseries implementation turns out to be too rigid in certain ways. There’s no way to define half-day intervals or, say, a daily set of intervals offset by one hour. One immediate difference in the pandas implementation is that it allows for multiples of base intervals. For instance, in pandas timeseries:
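
(Names and signatures below are assumptions; the branch API was in flux.)

from pandas import Interval

# one particular 5-minute interval: a multiple of a base frequency
# that scikits.timeseries cannot express
iv = Interval('2012-03-16 09:30', freq='5min')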

And then of course, there is the associated IntervalIndex, which gives you a new index type to play with:
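
(Same caveat: a sketch of the flavor, with constructor keywords assumed.)

import numpy as np
from pandas import Series, IntervalIndex

# six consecutive monthly intervals as an index
idx = IntervalIndex(start='2012-01', periods=6, freq='M')
s = Series(np.random.randn(6), index=idx)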

You can alter the interval as you might in scikits.timeseries (say for instance you need to coerce the index to the final hour of each observed interval):
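
(Continuing the sketch from above:)

# coerce each monthly interval to its final hour
hourly_idx = idx.asfreq('H', how='END')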

DatetimeIndex

Another large change is that DateRange will be deprecated in favor of DatetimeIndex. Rather than a numpy array of (pointers to) Python datetime objects, the representation is an array of 64-bit ints compatible with the numpy datetime64 dtype … well, the standard UTC-microseconds-from-unix-epoch-ignoring-leap-seconds resolution, to be exact. Accessing the index yields a Timestamp object: a datetime-derived subclass that carries frequency information (if any). I’ve worked hard to keep backward compatibility with the old DateRange, so hopefully this change will be fairly transparent.

For instance:
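
(A sketch: old-style DateRange construction, with the new Timestamp behavior.)

from pandas import DateRange, datetools

rng = DateRange('2012-01-01', periods=5, offset=datetools.bday)
stamp = rng[0]  # a Timestamp: a datetime subclass carrying frequency info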

Unlike DateRange, a DatetimeIndex can be composed of arbitrary timestamps, making it irregular:
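
(Sketch:)

from pandas import DatetimeIndex

# arbitrary, irregularly spaced timestamps
idx = DatetimeIndex(['2012-01-01', '2012-01-05', '2012-02-29'])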

Additionally, there will be new slicing features:
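
(For instance, indexing by date strings, partially specified or otherwise; a sketch:)

import numpy as np
from pandas import Series, DateRange

rng = DateRange('2012-01-01', periods=10)
s = Series(np.random.randn(10), index=rng)

s['2012-01']                  # everything falling in January 2012
s['2012-01-03':'2012-01-06']  # endpoints given as date strings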

Another new feature will be conversion and resampling using the pandas group-by machinery. For instance, we can do OHLC downsampling:
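
(The resample API was still settling, so treat this as a sketch:)

import numpy as np
from pandas import Series, DateRange, datetools

# one fake price per minute, downsampled to 5-minute open/high/low/close bars
rng = DateRange('2012-03-16 09:30', periods=60, offset=datetools.Minute())
px = Series(np.random.randn(60).cumsum(), index=rng)
bars = px.resample('5min', how='ohlc')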

And upsampling:
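
(Same caveat:)

import numpy as np
from pandas import Series, DateRange

# daily observations up to hourly, forward-filling in between
daily = Series(np.random.randn(5), index=DateRange('2012-01-02', periods=5))
hourly = daily.resample('H', fill_method='ffill')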

Performance

Besides features, there are performance wins in deriving DatetimeIndex from Int64Index. We get khash-based hashing for free. The int64-compatible dtype allows for vectorized, Cython-optimized datetime arithmetic and timezone conversions.

Release

The timeline on stable release is still fuzzy, as there is still plenty to do:

  • API: Crafting an elegant, cohesive interface for the new features. Heavy dogfooding!
  • Internals: the DataFrame block manager needs to become datetime64-aware (perhaps interval-aware as well)
  • Plotting: scikits.timeseries had some great matplotlib plotting infrastructure that we are doing our best to port

With all the fundamentals falling in place, pandas time series analysis will be a force to be reckoned with!

As my colleague Wes McKinney likes to say (quoting Matthew Goodman): “Are you using IPython? If not, you’re doing it wrong!”

You shouldn’t have to wait for an exception to invoke the interactive debugger, and you definitely should be using the IPython debugger. One convenience function in the pandas codebase (pandas.util.testing) is this:

def debug(f, *args, **kwargs):
    """Run f under a debugger, preferring IPython's Pdb if available."""
    from pdb import Pdb as OldPdb
    try:
        # IPython's Pdb subclass adds syntax coloring and tab completion
        from IPython.core.debugger import Pdb
        kw = dict(color_scheme='Linux')
    except ImportError:
        # fall back to the standard library debugger
        Pdb = OldPdb
        kw = {}
    pdb = Pdb(**kw)
    return pdb.runcall(f, *args, **kwargs)

You can invoke it on a function and arguments like so:

debug(test_function, arg1, arg2, named_arg1='hello')

You will get all the interactive IPython goodness as you step through your code. Funnily enough, the qtconsole version doesn’t seem to support tab completion. Maybe I’ll file a bug report…

The Python path determines how the Python interpreter locates modules. How exactly does Python construct the path?

Using the official docs on sys.path, with their footnote reference to the site module, I’ll recap the process.

If a script is executed, the interpreter sets the first entry of sys.path to that script’s directory. If Python is launched interactively, the first entry is the empty string (“”), meaning Python will scan the current working directory first. The next entries of sys.path are the contents of the PYTHONPATH environment variable, if it is set. Then, installation-dependent entries are appended (example below).
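
To see that first entry in action, a two-line script will do (the file name is arbitrary):

# show_path.py
import sys
print(sys.path[0])  # prints the directory containing show_path.py

Typed at an interactive prompt instead, sys.path[0] is the empty string.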

When initializing, the interpreter normally imports the site module automatically. On import, the module executes code to find .pth files in known site-packages locations; these files contain entries that are either paths to append to sys.path or import statements to execute. If we really want to trace what’s going on, we can launch a Python interpreter with -S to prevent loading the site module automatically, and instead trace the import ourselves.

(Note, I am working within a virtualenv called py27.)

(py27) ~$ python -S
Python 2.7.2+ (default, Oct 4 2011, 20:06:09)
[GCC 4.6.1] on linux2
>>> import sys
>>> for p in sys.path:
...     print p
...

/home/adam/.virtualenvs/py27/lib/python2.7/
/home/adam/.virtualenvs/py27/lib/python2.7/plat-linux2
/home/adam/.virtualenvs/py27/lib/python2.7/lib-tk
/home/adam/.virtualenvs/py27/lib/python2.7/lib-old
/home/adam/.virtualenvs/py27/lib/python2.7/lib-dynload

I have no PYTHONPATH, so these are just my installation-dependent paths. Now, we need to add the directory where the pdb module lives, so we can import it:

>>> sys.path += ["/usr/lib/python2.7"]
>>> import pdb
>>> pdb.run("import site")
> <string>(1)<module>()
(Pdb) s
--Call--
> /home/adam/.virtualenvs/py27/lib/python2.7/site.py(64)<module>()
-> """

I’ll spare you the debugging session details, and summarize what I see:

– site.py grabs orig-prefix.txt from <VIRTUAL_ENV>/lib/python2.7, which for me contains “/usr”, and extends the sys.path array to contain additional “/usr”-based paths.

– site.py then scans the site-packages (in lib/python2.7). For each .pth file (in alphabetical order), step through its entries. If an entry begins with “import”, call exec() on the line; otherwise append the (absolute) path to sys.path. Then do the same in the user site-packages directory (in local/lib/python2.7).

Note, easy-install.pth contains executable code, e.g.:

import sys; sys.__plen = len(sys.path)
./setuptools-0.6c11-py2.7.egg
./pip-1.0.2-py2.7.egg
/home/adam/code/ipython
...
import sys; new=sys.path[sys.__plen:]; del sys.path[sys.__plen:]; p=getattr(sys,'__egginsert',0); sys.path[p:p]=new; sys.__egginsert = p+len(new)

The executable lines move all the entries (some of which are .egg zipped packages) up to the top of the path.

– After stepping through all .pth files, add the existing site-packages directories themselves.

– Finally, attempt “import sitecustomize” (which doesn’t do anything on my install).
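
For completeness: a sitecustomize.py anywhere on the resulting path can run arbitrary startup code. A trivial, hypothetical example:

# sitecustomize.py: imported automatically at interpreter startup
import sys
sys.path.insert(0, '/opt/my/overrides')  # hypothetical path tweak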

Cython is my new favorite tool. It lets you write compiled C extension modules for the CPython interpreter using annotated Python for speed-critical parts and pure Python for non-critical parts. Further, you can import and call C functions directly. The user guide is (surprisingly?) well-written.

In particular, it lets you do blazingly fast computations using NumPy. See this excellent whitepaper.

But what about that old Python extension module you have lying around? What if you want to utilize Cython to call into it, fast, bypassing its Python API? You don’t want to rip out all the C(++) code you care about from that module and recompile it into a new Cython extension module. Or maybe you do. But suppose you don’t.

You’ll just have to give that rickety old extension a C API and expose it properly!

Let’s imagine you’ve got a function “myfunc” in your old extension module called “myold”. So, for example, in the file myoldmodule.cpp you may have:

static float64_t myfunc(float64_t x) { ... }

We need to create a new header file, myold_capi.h, that declares and exports the relevant symbols that live in the compiled myold module, and that we would like to import into the new Cython module to call. We use the Python Capsule mechanism for this, and the following comes right out of the Python documentation.

#ifndef _MYOLD_CAPI_H_
#define _MYOLD_CAPI_H_

/* import required header files here */

#ifdef __cplusplus
extern "C" {
#endif

/* Total number of C API functions to export */
#define MYOLD_CAPI_pointers 1

/* C API functions to export */
#define MYOLD_myfunc_NUM 0
#define MYOLD_myfunc_RETURN float64_t
#define MYOLD_myfunc_PROTO (float64_t x)

#ifdef MYOLD_MODULE
/* This section is used when compiling myold */

static MYOLD_myfunc_RETURN myfunc MYOLD_myfunc_PROTO;
#else
/* This section is used in modules that compile against myold's C API */

static void **MYOLD_CAPI;

#define myfunc \
     (*(MYOLD_myfunc_RETURN (*)MYOLD_myfunc_PROTO) MYOLD_CAPI[MYOLD_myfunc_NUM])


/* Return -1 on error, 0 on success.
   PyCapsule_Import will set an exception if there's an error.  */

static int
import_myold(void)
{
    MYOLD_CAPI = (void **)PyCapsule_Import("myold._C_API", 0);
    return (MYOLD_CAPI != NULL) ? 0 : -1;
}

#endif

#ifdef __cplusplus
}
#endif

#endif /* !defined(_MYOLD_CAPI_H_) */

Now, we have to include this header in our old module, myoldmodule.cpp. So right before, say,

PyObject* pModule = 0;

Add these lines:

#define MYOLD_MODULE
#include "myold_capi.h"

Finally, in your PyInit_myold() or initmyold() function that initializes your module, you need to create the Capsule holding the array of function pointers you are exporting:

    /* start capsule creation for C API */

    static void *MYOLD_CAPI[MYOLD_CAPI_pointers];

    /* populate the array of exported function pointers */
    MYOLD_CAPI[MYOLD_myfunc_NUM] = (void *)myfunc;

    /* Create a Capsule containing the API pointer array's address */
    PyObject *c_api_object = PyCapsule_New((void *)MYOLD_CAPI, "myold._C_API", NULL);

    if (c_api_object != NULL)
        PyModule_AddObject(pModule, "_C_API", c_api_object);

    /* end capsule creation */

Awesome. Now, does your old C module still compile? I hope so!

Next, we need to create a new Cython header file, myold.pxd. It should look something like this:

# float64_t comes from numpy's Cython declarations
from numpy cimport float64_t

cdef extern from "myold_capi.h":
    # C-API exports via the myold capsule
    float64_t myfunc(float64_t x)
    # must call this before using the module
    int import_myold()

Now, go ahead and write your new Cython module, for example mynew.pyx:

from myold cimport *

# The following call is required to initialize the static capsule
# variable holding the pointers to the myold C API functions;
# it returns -1 (with a Python exception set) on failure
if import_myold() != 0:
    raise ImportError("could not load the myold C API")

cdef class NewClass:
    cpdef float64_t mynewfunc(self, float64_t x):
        return myfunc(x)
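
Once both extensions are built (module names here are from our running example), the new class dispatches straight into the old C code:

import mynew

obj = mynew.NewClass()
print(obj.mynewfunc(42.0))  # calls myfunc from myold with no Python-API overhead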

Not too bad!

I’ve got a spanking new install of Kubuntu 11.10, and I need to get it set up for Python data hacking.  Sure, I could spring for an Enthought Python Distribution, but where would be the masochism in that?

Inspired by Zed, let’s do this the hard way.

The Linux distro comes with Python 2.7.2. Perfect! Or, use Pythonbrew to set up a local Python build of your choosing. I presume you know how to get to the command line, as well as how to edit text files using emacs, vim, pico, whatever.

Let’s get some tools:

sudo apt-get install git gfortran g++

We need to compile against Python headers and get setuptools and pip:

sudo apt-get install python-dev python-pip

Let’s isolate our Python distro from carnage:

sudo apt-get install python-virtualenv
sudo pip install virtualenvwrapper

Now add these lines to your ~/.bashrc:

export WORKON_HOME=$HOME/.virtualenvs
source /usr/local/bin/virtualenvwrapper.sh

Now open a new terminal and establish a virtual environment, say “py27”:

mkvirtualenv py27
workon py27

We need some math libraries (ATLAS + LAPACK):

sudo apt-get install libatlas-base-dev liblapack-dev

Ok, now to install and build all the scientific python hotness:

pip install numpy scipy

For matplotlib, we need lots of libraries; this one is dependency-heavy. Note that we can ask Ubuntu what we need, what’s installed, and what is not:

apt-cache depends python-matplotlib | awk '/Depends:/{print $2}' | xargs dpkg --get-selections

The easiest thing to do is just build all the dependencies (say yes if it asks to install the dependencies of matplotlib rather than python-matplotlib):

sudo apt-get build-dep python-matplotlib

Ok, now this should work:

pip install matplotlib

Now, of course, we need the IPython interpreter. Don’t settle for 0.11!

pip install -e git+https://github.com/ipython/ipython.git#egg=ipython
cd ~/.virtualenvs/py27/src/ipython
python setupegg.py install

Note, you may need to sudo rm /usr/bin/ipython.py if there is a conflict.

Ok, let’s beef up the IPython interpreter. Note that the pip commands below FAIL. This is ok: pip still downloads the source into the build directory, and we’ll build by hand.

sudo apt-get install qt4-dev-tools

pip install sip
cd ~/.virtualenvs/py27/build/sip
python configure.py
make
sudo make install

pip install pyqt
cd ~/.virtualenvs/py27/build/pyqt
python configure.py
make
sudo make install

# clean up
cd ~/.virtualenvs/py27/
rm -rf build

Just a few more things; you won’t be disappointed.

sudo apt-get install libzmq-dev
pip install tornado pygments pyzmq

Alright, let’s get pandas. It’s under heavy development (Wes is a beast), so let’s pull the latest from git.

pip install nose cython
pip install -e git+https://github.com/wesm/pandas#egg=pandas

# we should get statsmodels too
pip install -e git+https://github.com/statsmodels/statsmodels#egg=statsmodels

By the way, you’ll note this git stuff goes into your ~/.virtualenvs/py27/src directory, if you want to git pull and update later.

OK! Phew! For the grand finale:

Run the amazing qtconsole:

ipython qtconsole --pylab=inline

Or the even more amazing WEB BROWSER:

ipython notebook --pylab=inline

Launch a browser and point to http://localhost:8888/. For kicks, try opening one of Wes’s tutorial workbooks, here. You may have to fiddle a bit, but it should work.

Enjoy!
