Switching from R to Python

I’ve lately switched from R to Python and its companion scientific libraries, such as numpy, scipy, pandas, and matplotlib, and after a few months of being really immersed in it, I have to agree with John Cook: “I’d rather do math in a general-purpose language than try to do general-purpose programming in a math language.”

Some of the advantages I’ve seen in using Python for my data analysis have been:

  • Increased speed (BLAS/LAPACK routines, of course, run at the same speed)
  • Better memory usage
  • A sane object system
  • Less “magic” in the language and standard library syntax (YMMV, of course)

R’s main advantages are its huge library of statistical packages (including a great graphing package in ggplot), its nice language introspection, and its convenient (in a relative sense) syntax for working with tabular data. I have to admit, the package library is quite the advantage, but I’ve found replacements for most of the packages I used, and I’m content to write ports of the few esoteric things I still need, like John Storey’s qvalue package.
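To give a sense of what such a port involves, here is a minimal sketch of the core q-value computation in numpy. It assumes Storey’s simplest pi0 estimator with a single fixed lambda tuning parameter (the real qvalue package smooths pi0 over a grid of lambda values with a spline), so treat it as an illustration, not the actual package:

import numpy as np

def qvalues(pvals, lam=0.5):
    # Estimate the proportion of true nulls, pi0, from the
    # p-values above lam.
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    pi0 = min(1.0, np.mean(p > lam) / (1.0 - lam))

    # q(p_(i)) = min over j >= i of pi0 * m * p_(j) / j: sort,
    # scale by rank, then take running minima from the largest p.
    order = np.argsort(p)
    ranks = np.arange(1, m + 1)
    q_sorted = pi0 * m * p[order] / ranks
    q_sorted = np.minimum.accumulate(q_sorted[::-1])[::-1]

    # Return the q-values in the original order of pvals.
    q = np.empty(m)
    q[order] = q_sorted
    return q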

R has better language introspection due to its basis in Scheme, allowing for wholesale invention of new syntax and domain-specific languages, but Python’s language introspection is good enough for what I use day-to-day. I can monkey-patch modules and objects, do dynamic method dispatching, access the argspec of functions, walk up live stack traces, and eval strings if I really, really need to. I can live with only having that much power.
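For instance, here is a quick sketch of the kind of introspection I mean; the function and class are just stand-ins:

import inspect

def power(base, exp=2):
    return base ** exp

# Pull a function's argspec at runtime.
print(inspect.getargspec(power))
# ArgSpec(args=['base', 'exp'], varargs=None, keywords=None, defaults=(2,))

class Model(object):
    def fit(self):
        return 'fit'

# Monkey-patch: attach a new method to an existing class.
def summary(self):
    return 'summary of ' + self.fit()
Model.summary = summary

# Dynamic method dispatch by name.
m = Model()
for name in ('fit', 'summary'):
    print(getattr(m, name)())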

For tabular data, pandas is not a bad replacement for R data frames. There’s an initial hurdle in learning how indexes work and interact with the rest of the data frame, but once you get over that, it isn’t so bad. Of course, in some respects “getting over that” is the worst part of pandas, because the documentation is more a set of tutorials than a complete API reference, but after some trial-and-error learning, I feel I’ve gotten to the point where I’m about as productive munging data sets in pandas as I was in R. More so, in fact, since for working with raw strings and numbers I can always fall back on Python’s native data structures and numpy, which are pretty great. In the past, even when I was using R for most of my analysis, I would still do data processing, cleanup, and munging in Python. Now I don’t have to switch mental and computer contexts anymore, and I gain the advantages listed above.
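To give a flavor of the index hurdle, a minimal sketch (the column names are arbitrary): every DataFrame and Series carries an index, lookups go through it by label rather than position, and operations align on it:

import pandas

df = pandas.DataFrame({'a': range(6), 'b': range(3, 9)})
print(df.index)        # the default index: integers 0 through 5

# After setting a column as the index, lookups go by *label*,
# not by position.
by_a = df.set_index('a')
print(by_a['b'][5])    # the row labeled 5, i.e. b == 8

# Arithmetic aligns on index labels, padding mismatches with NaN.
s1 = pandas.Series([1, 2], index=['x', 'y'])
s2 = pandas.Series([10, 20], index=['y', 'z'])
print(s1 + s2)         # x and z are NaN; only y lines up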

The documentation issues can really bite you, though. numpy and scipy are great, but the lack of documentation in pandas, combined with its attempts to be cute and magical, makes it a huge pain to debug sometimes. For example, running the following contrived example gives a difficult-to-understand error:

>>> import pandas
>>> x = pandas.DataFrame({'a': range(6), 'b': range(3,9),
...     'c': [0,0,1,1,2,2]})
>>> def f(df):
...     a = df['a'] + 'string'
...     raise Exception('Error')
...
>>> x.groupby('c').aggregate(f)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "[...]/lib/python2.7/site-packages/pandas/core/groupby.py", line 1591, in aggregate
    result = self._aggregate_generic(arg, *args, **kwargs)
  File "[...]/lib/python2.7/site-packages/pandas/core/groupby.py", line 1644, in _aggregate_generic
    return self._aggregate_item_by_item(func, *args, **kwargs)
  File "[...]/lib/python2.7/site-packages/pandas/core/groupby.py", line 1669, in _aggregate_item_by_item
    result[item] = colg.aggregate(func, *args, **kwargs)
  File "[...]/lib/python2.7/site-packages/pandas/core/groupby.py", line 1309, in aggregate
    result = self._aggregate_named(func_or_funcs, *args, **kwargs)
  File "[...]/lib/python2.7/site-packages/pandas/core/groupby.py", line 1391, in _aggregate_named
    output = func(group, *args, **kwargs)
  File "<stdin>", line 2, in f
  File "[...]/lib/python2.7/site-packages/pandas/core/series.py", line 470, in __getitem__
    return self.index.get_value(self, key)
  File "[...]/lib/python2.7/site-packages/pandas/core/index.py", line 678, in get_value
    return self._engine.get_value(series, key)
  File "engines.pyx", line 81, in pandas.lib.IndexEngine.get_value (pandas/src/tseries.c:123878)
  File "engines.pyx", line 89, in pandas.lib.IndexEngine.get_value (pandas/src/tseries.c:123693)
  File "engines.pyx", line 135, in pandas.lib.IndexEngine.get_loc (pandas/src/tseries.c:124485)
KeyError: 'a'

KeyError?? The data frame obviously has a column labeled 'a', so what’s going on? Well, if you delve into the source code with a debugger, you find that pandas catches any and all exceptions raised while calling f and just tries executing it in several completely different ways until something works, even if that’s not what you want. The actual exception then gets thrown far past the original bug (the exception raised inside f itself), and any helpful information it carried is completely masked.
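One way to surface the real error is to pull out a single group by hand and call f on it directly, outside of groupby’s try/except machinery:

>>> f(x[x['c'] == 0])

This raises the TypeError from df['a'] + 'string' with its traceback intact, pointing at the actual bug instead of the red-herring KeyError.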

In this sense, pandas is really non-Pythonic, in that it feels very magical and non-explicit. It would be better for pandas to have multiple methods covering different ways of dispatching functions on the DataFrame than to have one function that can potentially misinterpret what you’re trying to do. To quote from the Zen of Python, “explicit is better than implicit”.
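In the meantime, a stopgap that doesn’t require a debugger is to wrap the aggregation function so that every exception gets printed before pandas can swallow it and retry. noisy here is my own little helper, not anything from pandas:

import functools
import traceback

def noisy(func):
    # Print each real traceback before re-raising, so groupby's
    # internal try/except can't hide it from us.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            traceback.print_exc()
            raise
    return wrapper

x.groupby('c').aggregate(noisy(f))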

But that’s a small cost to pay for a dramatically better, faster language to program in.