Reviewing some basic probability and statistics, I rediscovered the geometric distribution, which tells you the likelihood of winning a game after some number of attempts. With it, we can figure out the number of times you should expect to play the game until the outcome is “success” – or, equivalently, the number of plays you’d need in order to expect one success among them.

Why is this important? They say the odds of winning the current record-setting powerball lottery are 1 in 176 million. I want to be rich. How many tickets do I need to buy to expect a winning one?

The intuition is straightforward. For instance, if there is a p=25% chance of success on any one trial, the expected ratio of success to failures is 1/4, or one out of every four. Which means that any four trials should yield one success on average. Mathematically, if we let X be the number of trials required until a success occurs, then E[X] = 1/p.

So the odds are in my favor after buying 176 million randomly generated tickets. Who wants to loan me $176,000,000?

The proof is neat.

If p is the probability of success, and q = (1-p) is the probability of failure, then the probability of succeeding on the i’th trial after (i-1) failures is:

Pr(X=i) = q^(i-1)p

So

1
2
3
4
5
6
E[X] = Sum over i of i*Pr(i)
= p + 2pq + 3pq^2 ...
= p(1 + 2q + 3q^2 + ...)
= p/(1-q)^2
= p/p^2
= 1/p

Lines three to four depend on some magic with infinite series. If you have a converging series where x < 1:

1
2
3
4
5
6
7
1 + x + x^2 + ... = S
(1 - x)(1 + x + x^2 + ...) = (1 - x)S
(1 + x + x^2 + ...) - (x + x^2 + ...) = (1 - x)S
1 = (1 - x)S
1/(1-x) = S
1/(1-x) = 1 + x + x^2 + ...
1/(1-x)^2 = 1 + 2x + 3x^2 + ... (taking the derivative of both sides)

We use this last equation in the derivation above.

So what else can we do with this? Suppose we want to know how many rolls of a die are needed to see all six sides come up at least once (given a six-sided, fair die, where each role is an independent event). On the first roll, the probability of getting a side we haven't seen is 1. The next role then has a 5/6 probability of yielding a unique side; and then, after the next unique number comes up, the probability becomes 4/6 ... and so on.

So, 1 + 1/(5/6) + 1/(4/6) + ... = 14.7

Of course, I distrust math: I need evidence. And so Python to the rescue:

Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
import numpy as np
from pandas import Series
def reroll():
return np.random.randint(6)
class Trial:
def __init__(self):
self._count = 0
self._seen = {}
def run(self):
while len(self._seen) < 6:
self._count += 1
roll = reroll()
self._seen[roll] = roll
def count(self):
return self._count
trial_count = 10000
trials = np.empty(trial_count, dtype='i4')
for i in range(trial_count):
trial = Trial()
trial.run()
trials[i] = trial.count()
s = Series(trials)
print s.describe()

=>

1
2
3
4
5
6
7
8
count 10000.000000
mean 14.659600
std 6.218472
min 6.000000
25% 10.000000
50% 13.000000
75% 18.000000
max 58.000000

PyCon 2012 was a blast. One of the themes running through the conference was Python’s rising importance in data analysis, particularly as an exploratory and interactive platform. David Beazley points this out in his blog, taking note of the pandas and ipython projects.

pandas has always had time series functionality of necessity. It was originally developed for quantitative finance at AQR Capital Management, where Wes and I worked together several years back. There’s a rich DateOffset class to provide for date-time arithmetic (instances of which include Day, Bday, MonthEnd, etc.), and DateRange is a pandas Index composed of regularly recurring datetime objects for constructing time-indexed data structures.

However, we’ve also got a long-running branch in github to take pandas time series capabilities to the next level.

(FAIR WARNING: API subject to change!!!)

scikits.timeseries

I’m currently merging the scikits.timeseries core functionality into pandas. This library is built around the concept of a Date object that carries with it frequency information. To me, it is helpful to think of Date as a particular interval of time. So for instance, a Date instance with daily frequency represents a particular day in which you have an associated observation; with a monthly frequency, the Date object is that particular month. For example,

1
2
3
4
5
6
7
In [1]: from scikits.timeseries import Date
In [2]: Date('D', '3/10/12')
Out[2]: <D : 10-Mar-2012>
In [3]: Date('D', '3/10/12').asfreq('M')
Out[3]: <M : Mar-2012>

As Date instances are represented internally as ordinals (essentially, the integer number of elapsed intervals from the interval in which the Gregorian proleptic date Jan 1, 0001 occurs), arithmetic and interval conversions can be blazingly fast. It can also be an idiomatic way to do date arithmetic. For instance, to get the last business day of this month:

1
2
In [4]: Date('D', '3/10/12').asfreq('M').asfreq('B','END')
Out[4]: <B : 30-Mar-2012>

The scikits.timeseries implementation turns out to be too rigid in certain ways. There’s no way to define half-day intervals, or say a daily set of intervals offset by one hour. One immediate difference with the pandas implementation is that it will allow for multiples of base intervals. For instance, in pandas timeseries:

1
2
3
4
5
6
7
8
9
10
In [19]: i = Interval('3/10/12', '12H')
In [20]: i
Out[20]: Interval('10-Mar-2012 00:00', '12H')
In [21]: i + 1
Out[21]: Interval('10-Mar-2012 12:00', '12H')
In [22]: i + 2
Out[22]: Interval('11-Mar-2012 00:00', '12H')

And then of course, there is the associated IntervalIndex, which gives you a new index type to play with:

1
2
3
4
5
6
7
8
9
10
11
In [26]: ii = IntervalIndex(start='3/10/12', end='3/12/12', freq='12H')
In [27]: s = Series(np.random.rand(len(ii)), index=ii)
In [28]: s
Out[28]:
10-Mar-2012 00:00 0.566687
10-Mar-2012 12:00 0.937349
11-Mar-2012 00:00 0.031451
11-Mar-2012 12:00 0.729145
12-Mar-2012 00:00 0.212382

You can alter the interval as you might in scikits.timeseries (say for instance you need to coerce the index to the final hour of each observed interval):

1
2
3
4
5
6
7
8
9
In [30]: s.index = s.index.resample('H', 'E')
In [31]: s
Out[31]:
10-Mar-2012 11:00 0.566687
10-Mar-2012 23:00 0.937349
11-Mar-2012 11:00 0.031451
11-Mar-2012 23:00 0.729145
12-Mar-2012 11:00 0.212382

DatetimeIndex

Another large change is DateRange will be deprecated in favor of DatetimeIndex. Rather than a numpy array of (pointers to) Python datetime objects, the representation is an array of 64-bit ints compatible with the numpy datetime64 dtype … well, the standard UTC-microseconds-from-unix-epoch-ignoring-leap-seconds resolution, to be exact. Accessing the index will yield a Timestamp object, which is a datetime-derived subclass, which will carry frequency information (if any). I’ve worked hard to keep backward compatibility with the old DateRange, so hopefully this change will be fairly transparent.

For instance:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
In [54]: dti = DatetimeIndex(start='1/1/2001', end='12/1/2001', freq='M')
In [55]: s = Series(np.random.rand(len(dti)), dti)
In [56]: s
Out[56]:
2001-01-31 0.479981
2001-02-28 0.742215
2001-03-31 0.846832
2001-04-30 0.388186
2001-05-31 0.398850
2001-06-30 0.628447
2001-07-31 0.743980
2001-08-31 0.466137
2001-09-30 0.762505
2001-10-31 0.300981
2001-11-30 0.164802
In [57]: ts = s.index[0]
In [58]: ts
Out[58]: Timestamp(2001, 1, 31, 0, 0)
In [59]: ts.offset
Out[59]: <1 MonthEnd>

Unlike DateRange, a DatetimeIndex can be composed of arbitrary timestamps, making it irregular:

1
2
3
4
5
6
7
8
9
10
11
In [74]: x = DatetimeIndex(['1/1/2010', '2/4/2011 16:24:45.123'])
In [75]: x
Out[75]: DatetimeIndex([2010-01-01 00:00:00, 2011-02-04 16:24:45.123000], dtype=datetime64[us])
In [76]: s = Series(np.random.rand(len(x)), x)
In [77]: s
Out[77]:
2010-01-01 00:00:00 0.917592
2011-02-04 16:24:45.123000 0.926309

Additionally, there will be new slicing features:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
In [4]: dti = DatetimeIndex(start='1/1/2001', end='6/1/2001', freq='D')
In [5]: s = Series(np.random.rand(len(dti)), dti)
In [6]: s.head()
Out[6]:
2001-01-01 0.571153
2001-01-02 0.685178
2001-01-03 0.915576
2001-01-04 0.185320
2001-01-05 0.141775
In [8]: s['Mar 2001']
Out[8]:
2001-03-01 0.602943
2001-03-02 0.129155
2001-03-03 0.932543
2001-03-04 0.534650
2001-03-05 0.311436
2001-03-06 0.746531
2001-03-07 0.655023
2001-03-08 0.399160
2001-03-09 0.024832
2001-03-10 0.178017
2001-03-11 0.059523
2001-03-12 0.209618
2001-03-13 0.722570
2001-03-14 0.149654
2001-03-15 0.932882
2001-03-16 0.309638
2001-03-17 0.567673
2001-03-18 0.794191
2001-03-19 0.608495
2001-03-20 0.051329
2001-03-21 0.262767
2001-03-22 0.316965
2001-03-23 0.832362
2001-03-24 0.118094
2001-03-25 0.687097
2001-03-26 0.741237
2001-03-27 0.082728
2001-03-28 0.984076
2001-03-29 0.563734
2001-03-30 0.656616
2001-03-31 0.319237

Another new feature will be conversion and resampling using the pandas group-by machinery. For instance, we can do OHLC downsampling:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
In [11]: dti = DatetimeIndex(start='1/1/2001', end='1/31/2001 23:59:59', freq='S')
In [12]: s = Series(np.random.rand(len(dti)), dti)
In [13]: s2 = s.convert('5Min', how='ohlc')
In [14]: s2.head()
Out[14]:
close high low open
2001-01-01 00:00:00 0.328680 0.328680 0.328680 0.328680
2001-01-01 00:05:00 0.612600 0.998840 0.002629 0.907757
2001-01-01 00:10:00 0.264069 0.997718 0.000011 0.698224
2001-01-01 00:15:00 0.387993 0.999033 0.002355 0.357982
2001-01-01 00:20:00 0.508035 0.999880 0.005871 0.367008

And upsampling:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
In [21]: dti = DatetimeIndex(start='1/1/2001', end='1/3/2001', freq='D')
In [22]: s = Series(np.random.rand(len(dti)), dti)
In [23]: s
Out[23]:
2001-01-01 0.790748
2001-01-02 0.827149
2001-01-03 0.836007
In [24]: s.convert('12H', method='pad')
Out[24]:
2001-01-01 00:00:00 0.790748
2001-01-01 12:00:00 0.790748
2001-01-02 00:00:00 0.827149
2001-01-02 12:00:00 0.827149
2001-01-03 00:00:00 0.836007

Performance

Besides features, there are wins in deriving DatetimeIndex from Int64Index. We get khash-based hashing for free. The int64-compatible dtype allows for vectorized, cython-optimized datetime arithmetic and timezone conversions.

Release

The timeline on stable release is still fuzzy, as there is still plenty to do:

  • API: Crafting an elegant, cohesive interface for the new features. Heavy dogfooding!
  • Internals: the DataFrame block manager needs to become datetime64-aware (perhaps interval-aware as well)
  • Plotting: scikits.timeseries had some great matplotlib plotting infrastructure we are doing our best to port

With all the fundamentals falling in place, pandas time series analysis will be a force to be reckoned with!

I like to think I know C well enough. I’ve read K&R cover to cover, and I’ve done my fair share of C hacking over the years. But I’ve learned that with C, you have to be prepared, always, and at all times, to eat humble pie. I feel compelled to share this whopper of a slice that came steaming hot from Van Der Linden – which I’ve now determined should be required reading for anyone even thinking about ever going near C.

#include "stdio.h"

int main(int argc, char *argv[]) {
    unsigned int x = 1;
    if(-1 < x)
        printf("yes\n");
    else
        printf("no\n");
    return 0;
}

Simple enough? WRONG. When a binary operation is invoked on two incompatible types (in this case, an unsigned int and an int), the “usual arithmetic conversions” apply.

Time to put on that language lawyer hat. According to the C99 rules:

If the operand that has unsigned integer type has rank greater than or equal to the rank of the type of the other operand, the operand with signed integer type is converted to the type of the operand with unsigned integer type.

Rank?

The rank of any unsigned integer type is equal to the rank of the corresponding signed integer type, if any.

Which means, of course, that -1 is “promoted”, at least on the x86-64 platform. And you can be damn sure that 0xFFFFFFFF when considered 4,294,967,295 decimal, is greater than one. Now imagine that your unsigned int definition lives in some other translation unit, long, long ago and far, far away. No, -Wall will not save you.

Another psychopathic killer: automatic string concatenation. Take the following code:

#include "stdio.h"

int main(int argc, char *argv[]) {
    char *config[] = {
        "a man ",
        "a plan ",
        "a canal "
        "panama ",
    };
 
    printf("%lu\n", sizeof(config)/sizeof(config[0]));
    printf("%s\n", config[2]);

    return 0;
}

Do you see the bug? I’ve misplaced the comma, which should come after “a canal ” instead of “panama “. Don’t let that final comma fool you: the C compiler just ignores it. However, the C compiler is just jonesing to remove that NUL terminating char ‘\0′ from ‘a canal \0′ and append ‘panama \0′, in which case you’ve just wound up with three, not four, config parameters; and with ‘a canal panama\0′ as your third config entry.

Happy bug hunting!

My friend was kind enough to give me a 64 gig solid-state disk drive to throw into my development box. I was curious about hard disk speed compared to what I’ve been running: a four-disk RAID5 array of Seagate Barracuda 7200 320GB 7200 RPM SATA (3.0Gb/s) on an Areca ARC-1210 PCI-Express x8 SATA II (3.0Gb/s) Controller Card. Just checked my Newegg order history: the two hard drives are from January 2006, the other two from Sept 2006. I didn’t know they had gotten so old! Haven’t had any failures yet, though.

A few tests (sdb5 = raid5, sda5 = sdd):

sudo hdparm -tT /dev/sdb5

/dev/sdb5:
 Timing cached reads:   15534 MB in  1.99 seconds = 7788.19 MB/sec
 Timing buffered disk reads: 456 MB in  3.01 seconds = 151.39 MB/sec

sudo hdparm -tT /dev/sda5

/dev/sda5:
 Timing cached reads:   14776 MB in  1.99 seconds = 7407.07 MB/sec
 Timing buffered disk reads: 558 MB in  3.00 seconds = 185.80 MB/sec

Interesting: the cached throughput on the RAID controller is very good, beating my motherboard controller by a sound 400MB/sec. The reads on the SDD are faster by about 30MB/sec. What about random access seek times? I found some C code to do random seeks. Save the code as seeker.c, and compile with gcc seeker.c -o seeker.

#define _LARGEFILE64_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <time.h>
#include <signal.h>
#include <sys/fcntl.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

#define BLOCKSIZE 4096
#define TIMEOUT 30

int count;
time_t start;

void done()
{
    time_t end;

    time(&end);

    if (end < start + TIMEOUT) {
        printf(".");
        alarm(1);
        return;
    }

    if (count) {
      printf(".\nResults: %d seeks/second, %.2f ms random access time\n",
         count / TIMEOUT, 1000.0 * TIMEOUT / count);
    }
    exit(EXIT_SUCCESS);
}

void handle(const char *string, int error)
{
    if (error) {
        perror(string);
        exit(EXIT_FAILURE);
    }
}

int main(int argc, char **argv)
{
    char buffer[BLOCKSIZE];
    int fd, retval;
    unsigned long numblocks;
    off64_t offset;

    setvbuf(stdout, NULL, _IONBF, 0);

    printf("Seeker v2.0, 2007-01-15, "
           "http://www.linuxinsight.com/how_fast_is_your_disk.html\n");

    if (argc != 2) {
        printf("Usage: seeker <raw disk device>\n");
        exit(EXIT_SUCCESS);
    }

    fd = open(argv[1], O_RDONLY);
    handle("open", fd < 0);

    retval = ioctl(fd, BLKGETSIZE, &numblocks);
    handle("ioctl", retval == -1);
    printf("Benchmarking %s [%luMB], wait %d seconds",
           argv[1], numblocks / 2048, TIMEOUT);

    time(&start);
    srand(start);
    signal(SIGALRM, &done);
    alarm(1);

    for (;;) {
        offset = (off64_t) numblocks * random() / RAND_MAX;
        retval = lseek64(fd, BLOCKSIZE * offset, SEEK_SET);
        handle("lseek64", retval == (off64_t) -1);
        retval = read(fd, buffer, BLOCKSIZE);
        handle("read", retval < 0);
        count++;
    }
    /* notreached */
}

You need an accurate blocksize; for me, that was 4096.

$ sudo dumpe2fs /dev/sdb5 | grep 'Block size'
dumpe2fs 1.41.14 (22-Dec-2010)
Block size:               4096

So the results?

RAID5:

Seeker v2.0, 2007-01-15, http://www.linuxinsight.com/how_fast_is_your_disk.html
Benchmarking /dev/sdb5 [420909MB], wait 30 seconds..............................
Results: 85 seeks/second, 11.73 ms random access time

SDD:

Seeker v2.0, 2007-01-15, http://www.linuxinsight.com/how_fast_is_your_disk.html
Benchmarking /dev/sda5 [27235MB], wait 30 seconds..............................
Results: 5429 seeks/second, 0.18 ms random access time

The SSD is two orders of magnitude improvement over the RAID5. Not surprising, but still pretty awesome.

Finally, I wanted to migrate my /home directory onto the SSD. This process worked for me:

sudo mkdir /mnt/ocz
sudo mount /dev/sda5 /mnt/ocz
sudo find . -depth -print0 | sudo cpio --null --sparse -pvd /mnt/ocz
sudo mv /home /home_old
sudo mkdir /home
sudo mount /dev/sda5 /home

Don’t forget to add the mount to /etc/fstab so it stays there! Best to use the UUID of the device, so:

11:29 ~ $ sudo blkid
[sudo] password for adam:
/dev/sda5: LABEL="Linux_OCZ" UUID="..." TYPE="ext4"

So /etc/fstab gets another entry, e.g.:

# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point>   <type>  <options>       <dump>  <pass>
# /home => ocz ssd disk
UUID=...        /home           ext4    errors=remount-ro 0       1

As my colleague Wes McKinney likes to say (quoting Matthew Goodman): “Are you using IPython? If not, you’re doing it wrong!”

You shouldn’t have to wait for an exception to invoke the interactive debugger, and you definitely should be using the IPython debugger. One convenience function in the pandas (pandas.util.testing) code base is this:

def debug(f, *args, **kwargs):
    from pdb import Pdb as OldPdb
    try:
        from IPython.core.debugger import Pdb
        kw = dict(color_scheme='Linux')
    except ImportError:
        Pdb = OldPdb
        kw = {}
    pdb = Pdb(**kw)
    return pdb.runcall(f, *args, **kwargs)

You can invoke it on a function and arguments like so:

debug(test_function, arg1, arg2, named_arg1='hello')

You will get all the interactive IPython goodness as you step through your code. Funny enough, doesn’t seem like qtconsole version supports tab completion. Maybe will file a bug report…

The Python path determines how the Python interpreter locates modules. How exactly does Python construct the path?

Using the official docs on sys.path, with its footnote reference to the site module, I’ll recap the process.

If a script is executed, the interpreter sets the first entry of sys.path to that script’s directory. If Python is launched interactively, the first entry is the empty string (“”), meaning Python will scan the present working directory first. The next entries of sys.path are the contents of the PYTHONPATH environment variable, if it exists. Then, installation-dependent entries are appended (example below).

When initializing, the interpreter normally imports the site module automatically. The module, on import, executes code to find .pth files in known site-packages directory locations, which themselves contain entries which are either paths to add to sys.path, or import calls. If we really want to trace what’s going on, we can launch a Python interpreter with -S to prevent loading the site module automatically, and instead trace the import.

(Note, I am working within a virtualenv called py27.)

(py27) ~$ python -S
Python 2.7.2+ (default, Oct 4 2011, 20:06:09)
[GCC 4.6.1] on linux2
>>> import sys
>>> for p in sys.path:
... print p
...

/home/adam/.virtualenvs/py27/lib/python2.7/
/home/adam/.virtualenvs/py27/lib/python2.7/plat-linux2
/home/adam/.virtualenvs/py27/lib/python2.7/lib-tk
/home/adam/.virtualenvs/py27/lib/python2.7/lib-old
/home/adam/.virtualenvs/py27/lib/python2.7/lib-dynload

I have no PYTHONPATH, so these are just my installation-dependent paths. Now, we need to add the directory where the pdb module lives, so we can import it:

>>> sys.path += ["/usr/lib/python2.7"]
>>> import pdb
>>> pdb.run("import site")
> <string>(1)<module>()
(Pdb) s
--Call--
> /home/adam/.virtualenvs/py27/lib/python2.7/site.py(64)()
-> """

I’ll spare you the debugging session details, and summarize what I see:

- site.py grabs orig-prefix.txt from <VIRTUAL_ENV>/lib/python2.7, which for me contains “/usr”, and extends the sys.path array to contain additional “/usr”-based paths.

- site.py then scans the site-packages (in lib/python2.7). For each .pth file (in alphabetical order), step through its entries. If an entry begins with “import”, call exec() on the line; otherwise append the (absolute) path to sys.path. Then do the same in the user site-packages directory (in local/lib/python2.7).

Note, easy-install.pth contains executable code, eg:

import sys; sys.__plen = len(sys.path)
./setuptools-0.6c11-py2.7.egg
./pip-1.0.2-py2.7.egg
/home/adam/code/ipython
...
import sys; new=sys.path[sys.__plen:]; del sys.path[sys.__plen:]; p=getattr(sys,'__egginsert',0); sys.path[p:p]=new; sys.__egginsert = p+len(new)

The executable lines move all the entries (some of which are .egg zipped packages) up to the top of the path.

- After stepping through all .pth files, add the existing site-packages directories themselves.

- Finally, attempt to call “import sitecustomize.py” (which doesn’t do anything on my install).

Cython is my new favorite tool. It lets you write compiled C extension modules for the CPython interpreter using annotated Python for speed-critical parts and pure Python for non-critical parts. Further, you can import and call C functions directly. The user guide is (surprisingly?) well-written.

In particular, it lets you do blazing computations using Numpy. See this excellent whitepaper.

But what about that old Python extension module you have lying around? What if you want to utilize Cython to call into it, fast, bypassing its Python API? You don’t want to rip out all the C(++) code you care about from that module and recompile it into a new Cython extension module. Or maybe you do. But suppose you don’t.

You’ll just have to give that rickety old extension a C API and expose it properly!

Let’s imagine you’ve got a function “myfunc” in your old extension module called “myold”. So for example in the file myoldmodule.cpp you may have:

static float64_t myfunc(float64_t x) { ... }

We need to create a new header file, myold_capi.h, that declares and exports the relevant symbols that live in the compiled myold module, and that we would like to import into the new Cython module to call. We use the Python Capsule mechanism for this, and the following comes right out of the Python documentation.

#ifndef _MYOLD_CAPI_H_
#define _MYOLD_CAPI_H_

/* import required header files here */

#ifdef __cplusplus
extern "C" {
#endif

/* Total number of C API functions to export */
#define MYOLD_CAPI_pointers 1

/* C API functions to export */
#define MYOLD_myfunc_NUM 0
#define MYOLD_myfunc_RETURN float64_t
#define MYOLD_myfunc_PROTO (float64_t x)

#ifdef MYOLD_MODULE
/* This section is used when compiling myold */

static MYOLD_myfunc_RETURN myfunc MYOLD_myfunc_PROTO;
#else
/* This section is used in modules that compile against myold's C API */

static void **MYOLD_CAPI;

#define myfunc \
     (*(MYOLD_myfunc_RETURN (*)MYOLD_myfunc_PROTO) MYOLD_CAPI[MYOLD_myfunc_NUM])


/* Return -1 on error, 0 on success.
   PyCapsule_Import will set an exception if there's an error.  */

static int
import_myold(void)
{
    MYOLD_CAPI = (void **)PyCapsule_Import("myold._C_API", 0);
    return (myold_CAPI != NULL) ? 0 : -1;
}

#endif

#ifdef __cplusplus
}
#endif

#endif /* !defined(_MYOLD_CAPI_H_) */

Now, we have to include this header in our old module, myoldmodule.cpp. So right before, say,

PyObject* pModule = 0;

Add these lines:

#define MYOLD_MODULE
#include "myold_capi.h"

Finally, in your PyInit_myold() or initmyold() function that initializes your module, you need to create the Capsule holding the array of function pointers you are exporting:

    // start capsule creation for C API
   
    static void *MYOLD_CAPI[MYOLD_CAPI_pointers];

    MYOLD_CAPI[MYOLD_myfunc_NUM] = (void *)myfunc;

    /* Create a Capsule containing the API pointer array's address */
    PyObject *c_api_object = PyCapsule_New((void *)MYOLD_CAPI, "myold._C_API", NULL);

    if (c_api_object != NULL)
        PyModule_AddObject(pModule, "_C_API", c_api_object);

    // end capsule creation

Awesome. Now, does your old C module still compile? I hope so!

Next, we need to create a new Cython header file, myold.pxd. It should look something like this:

cdef extern from "myold_capi.h":
    # C-API exports via the myold capsule
    float64_t myfunc(float64_t x)
    # must call this before using module
    int import_myold()

Now, go ahead and write your new Cython module, for example mynew.pyx:

from myold cimport *

# The following call is required to initialize the
# static capsule variable that holds the pointers
# to the myold C API functions
import_myold()

cdef class NewClass():
    cpdef float64_t mynewfunc(self, float64_t x):
        return myfunc(x)

Not too bad!

Sadly, this is often the case:

X11

But not today! I wanted to get my dual monitors set up with my good ole’ GTS 250 and got a cringe-inducing nvidia-settings error:

“Failed to set MetaMode (1) ‘DFP-0: nvidia-auto-select@1680×1050 +0+0, DFP-1: nvidia-auto-select @1680×1050 +0+0 (Mode 1680×1050, id: 52) on X screen 0.”

Could a recently-released, updated driver solve this issue? Yes.

From the command line:

sudo add-apt-repository ppa:ubuntu-x-swat/x-updates
sudo apt-get update
sudo apt-get install nvidia-current

Next: reboot. Update additional packages per any suggestions.

No tears!

After reading this post about equations every computer scientist should know, I got to thinking. I love clear, plain-English explanations of powerful mathematical concepts. Not the what, but the why. I’d like to share my understanding of the binomial formula:

This is usually stated “N choose K”, and provides a way to calculate the number of ways in which you may select (or draw) K items from a collection of N items in total, where you don’t care about the order in which you draw them.

Now, if we have N elements, there are N ways to choose a first element; and, if you don’t put that first element back, N-1 ways to choose a second; and so on. That means the total number of ways to draw N elements (counting the order in which we draw them) is N x (N-1) x … x 1. We call this N!, “N factorial”.

But, we don’t care about drawing N elements. We actually care about just K elements. We don’t care about the remaining (N-K) elements. For example, if we are trying to figure out 5 choose 2, we draw from five choices, and then from four, but that’s it. No need to worry about the other (5-2) = 3 draws. So, instead of calculating 5x4x3x2x1, we calculate just 5×4 by “removing” 3x2x1, by dividing. That is, we calculate (5x4x3x2x1)/(3x2x1), or, 5!/(5-2)!. That’s where (N-K)! in the denominator comes from.

Finally, when we draw K elements, we don’t care the order in which we draw them. Following our example, with 5 choose 2, if we calculate 5×4 = 20 choices, we double-count the cases where we choose the same two items in a different order. That means we have to “remove” (divide by) the ways in which we choose K elements in different orders. This is K! ways.

So, there’s the prize: N choose K = N! / [K!x(N-K)!]

I’ve got a spanking new install of Kubuntu 11.10, and I need to get it set up for Python data hacking.  Sure, I could spring for an Enthought Python Distrubution, but where would be the masochism in that?

Inspired by Zed, let’s do this the hard way.

The linux distro comes with Python 2.7.2. Perfect! Or, use Pythonbrew to set up a local Python build that you want to use. I presume you know how to get to the command line, as well as how to edit text files using emacs, vim, pico, whatever.

Let’s get some tools:

sudo apt-get install git gfortran g++

We need to compile against Python headers and get setuptools and pip:

sudo apt-get install python-dev python-pip

Let’s isolate our Python distro from carnage:

sudo apt-get python-virtualenv
sudo pip install virtualenvwrapper

Now these lines to your ~/.bashrc:

source /usr/local/bin/virtualenvwrapper.sh
export WORKON_HOME=$HOME/.virtualenvs

Now open a new terminal and establish a virtual environment, say “py27″:

mkvirtualenv py27
workon py27

We need some math libraries (ATLAS + LAPACK):

sudo apt-get install libatlas-base-dev liblapack-dev

Ok, now to install and build all the scientific python hotness:

pip install numpy scipy

For matplotlib, we need lots of libraries. This one is dependency heavy. Note we can ask Ubuntu what we need, what’s installed, and what is not:

apt-cache depends python-matplotlib | awk '/Depends:/{print $2}' | xargs dpkg --get-selections

Easiest thing to do is just build all the dependencies (just say yes if it asks to install deps of matplotlib instead of python-matplotlib):

sudo apt-get build-dep python-matplotlib

Ok, now this should work:

pip install matplotlib

Now, of course, we need the IPython interpreter. Don’t settle for 0.11!

pip install -e git+https://github.com/ipython/ipython.git#egg=ipython
cd ~/.virtualenvs/py27/src/ipython
python setupegg.py install

Note, you may need to sudo rm /usr/bin/ipython.py if there is a conflict.

Ok, let’s beef up the IPython interpreter. Note the pip commands FAIL. This is ok. We’ll do it by hand.

sudo apt-get install qt4-dev-tools

pip install sip
cd ~/.virtualenvs/py27/build/sip
python configure.py
make
sudo make install

pip install pyqt
cd ~/.virtualenvs/py27/build/pyqt
python configure.py
make
sudo make install

# clean up
cd ~/.virtualenvs/py27/
rm -rf build

Just a few more things, you won’t be disappointed.

sudo apt-get install libzmq-dev
pip install tornado pygments pyzmq

Alright, let’s get pandas. It’s under heavy development (Wes is a beast); so lets pull the latest from git.

pip install nose cython
pip install -e git+https://github.com/wesm/pandas#egg=pandas

# we should get statsmodels too
pip install -e git+https://github.com/statsmodels/statsmodels#egg=statsmodels

Btw, you’ll note this git stuff goes into your ~/.virtualenvs/py27/src directory, if you want to git pull and update.

OK! Phew! For the grand finale:

Run the amazing qtconsole:

ipython qtconsole --pylab=inline

Or the even more amazing WEB BROWSER:

ipython notebook --pylab=inline

Launch a browser and point to http://localhost:8888/. For kicks, try opening one of Wes’s tutorial workbooks, here. You may have to fiddle a bit, but it should work.

Enjoy!

© 2012 Adam Klein's Blog Suffusion theme by Sayontan Sinha, modified by Adam :)