# Physician career satisfaction vs. salary

My friend @lesterleung posted a link to this really interesting report by Medscape on physician compensation in the USA.

The whole set of slides is pretty interesting, but I was curious in particular about how much career satisfaction correlates with salary, at least when physicians are broken down by specialty. I plotted each specialty's average response to "if you had to do it all over again, would you choose your own specialty again?" against its average salary in 2012. The line of best fit isn't too bad:

# Quick BibTeX capitalization-preserving one-liner

While writing my thesis, I realized that I needed to preserve the capitalization of some words, like proper nouns or gene names, in the titles of references I had imported from the PubMed database. Manually combing through my BibTeX file and surrounding everything with curly braces was not my idea of fun, though. vim to the rescue!

This regex is a bit hairy, but it works and it's idempotent: every once in a while, I just rerun the command to make sure all the capitalized words in the titles are "protected". One side effect is that it adds braces around the first word of every title. That seems relatively harmless and isn't worth the effort to fix, since reliably distinguishing names at the beginning of a title from ordinary capitalized first words can itself get hairy.

Anyway, here is the one-liner:

```vim
:%g/^\s*Title\s*=/s/\v(<Title>)@!([^{]<\w+>|<\w+>[^}])&.@=(<\w*\u\w*>).@=/{\3}/g
```

I limited its substitutions to the Title field, but you can substitute any other field name for Title (note that it appears twice in the command), as long as that field fits on one line, as it does in my BibTeX file.
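
For those who'd rather do this outside the editor, here's a rough Python approximation of the same idea. This is my own sketch, not a translation of the vim command or a real BibTeX parser; the function name and regex are mine, and it only handles the simple one-field-per-line case:

```python
import re

def protect_caps(line, field="Title"):
    """Wrap words containing an uppercase letter in braces, skipping words
    that already sit next to a brace. Rough sketch, not a BibTeX parser."""
    # Only touch lines that define the requested field
    if not re.match(r'\s*' + field + r'\s*=', line):
        return line
    # Match a word containing a capital letter, unless it is the field
    # name itself, follows a '{' or word character, or precedes a '}'
    pattern = r'(?<![{\w])\b(?!' + field + r'\b)\w*[A-Z]\w*\b(?![}\w])'
    return re.sub(pattern, lambda m: '{' + m.group(0) + '}', line)
```

Because already-braced words are skipped, rerunning the function is a no-op, which mirrors the idempotence of the vim one-liner.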

# Switching from R to Python

I've lately switched from R to Python and its companion scientific libraries (numpy, scipy, pandas, and matplotlib), and after a few months of being really immersed in it, I have to agree with John Cook: "I'd rather do math in a general-purpose language than try to do general-purpose programming in a math language."

Some of the advantages I've seen in using Python for my data analysis have been:

• Increased speed (BLAS/LAPACK routines, of course, run at the same speed)
• Better memory usage
• A sane object system
• Less "magic" in the language and standard library syntax (YMMV, of course)

R's main advantages are its huge library of statistical packages (including a great graphing package in ggplot2), its nice language introspection, and its convenient (in a relative sense) syntax for working with tabular data. I have to admit, the first is quite the advantage, but I've found replacements for many packages, and I'm content to write ports of the few esoteric things I've needed, like John Storey's qvalue package.
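
To give a flavor of what such a port involves, here is a minimal sketch of the Storey q-value idea. This is my own simplification, not the actual qvalue package: the real package estimates the null proportion pi0 by smoothing over a grid of lambda values, whereas this uses a single crude lambda cutoff.

```python
import numpy as np

def qvalues(pvals, pi0=None, lam=0.5):
    """Minimal sketch of Storey-style q-values (not the real qvalue port)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    if pi0 is None:
        # crude single-lambda estimate of the proportion of true nulls
        pi0 = min(1.0, np.mean(p > lam) / (1.0 - lam))
    order = np.argsort(p)
    ranked = p[order]
    # Benjamini-Hochberg-style step-up, scaled by pi0
    q = pi0 * ranked * m / np.arange(1, m + 1)
    # enforce monotonicity from the largest p-value down
    q = np.minimum.accumulate(q[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.minimum(q, 1.0)
    return out
```

With pi0 fixed at 1.0 this reduces to ordinary Benjamini-Hochberg adjusted p-values, which makes it easy to sanity-check against other implementations.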

R has better language introspection due to its basis in Scheme, allowing for wholesale invention of new syntax and domain-specific languages, but Python's language introspection is good enough for what I use day-to-day. I can monkey-patch modules and objects, do dynamic method dispatching, access the argspec of functions, walk up live stack traces, and eval strings if I really, really need to. I can live with only having that much power.
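
A couple of those capabilities in a toy sketch of my own (the function and patch here are made up for illustration):

```python
import inspect
import math

def greet(name, punctuation='!'):
    return 'Hello, ' + name + punctuation

# Read a function's argspec at runtime
params = list(inspect.signature(greet).parameters)  # ['name', 'punctuation']

# Monkey-patch a module attribute, then restore it
original_sqrt = math.sqrt
math.sqrt = lambda x: original_sqrt(x) * 2  # patched: doubles the result
doubled = math.sqrt(4.0)                    # 4.0 instead of 2.0
math.sqrt = original_sqrt                   # always restore after patching
```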

For tabular data, pandas is not a bad replacement for R data frames. There's an initial hurdle in learning how indexes work and interact with the rest of the data frame, but once you get over that, it isn't so bad. Of course, in some respects "getting over that" is the worst part of pandas, because the documentation is more a set of tutorials than a complete API reference, but after some trial-and-error learning, I feel like I've gotten to the point where I can be about as productive munging data sets in pandas as I was in R. More so, in fact, since for working with raw strings and numbers I can always fall back on Python's native data structures and numpy, which are pretty great.

In the past, even when I was using R for most of my analysis, I would still do data processing, cleanup, and munging in Python. Now I don't have to switch my mental and computer contexts anymore, and I gain the advantages I listed above.
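
The index hurdle mostly comes down to the fact that pandas aligns on labels rather than positions. A small sketch:

```python
import math
import pandas

s = pandas.Series([1, 2, 3], index=['a', 'b', 'c'])
t = pandas.Series([10, 20, 30], index=['b', 'c', 'd'])

# Arithmetic aligns on index labels, not positions: 'b' and 'c' pair up,
# while 'a' and 'd' have no partner and come out as NaN.
u = s + t
```

Coming from R, where vector arithmetic is positional (with recycling), this label-based alignment is the single biggest mental shift.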

The documentation issues can really bite you, though. numpy and scipy are great, but the lack of documentation in pandas, combined with its attempts to be cute and magical, can make it a huge pain to debug. For example, running the following contrived code produces a difficult-to-understand error:

```python
>>> import pandas
>>> x = pandas.DataFrame({'a': range(6), 'b': range(3, 9),
...                       'c': [0, 0, 1, 1, 2, 2]})
>>> def f(df):
...     a = df['a'] + 'string'
...     raise Exception('Error')
...
>>> x.groupby('c').aggregate(f)
```

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "[...]/lib/python2.7/site-packages/pandas/core/groupby.py", line 1591, in aggregate
    result = self._aggregate_generic(arg, *args, **kwargs)
  File "[...]/lib/python2.7/site-packages/pandas/core/groupby.py", line 1644, in _aggregate_generic
    return self._aggregate_item_by_item(func, *args, **kwargs)
  File "[...]/lib/python2.7/site-packages/pandas/core/groupby.py", line 1669, in _aggregate_item_by_item
    result[item] = colg.aggregate(func, *args, **kwargs)
  File "[...]/lib/python2.7/site-packages/pandas/core/groupby.py", line 1309, in aggregate
    result = self._aggregate_named(func_or_funcs, *args, **kwargs)
  File "[...]/lib/python2.7/site-packages/pandas/core/groupby.py", line 1391, in _aggregate_named
    output = func(group, *args, **kwargs)
  File "<stdin>", line 2, in f
  File "[...]/lib/python2.7/site-packages/pandas/core/series.py", line 470, in __getitem__
    return self.index.get_value(self, key)
  File "[...]/lib/python2.7/site-packages/pandas/core/index.py", line 678, in get_value
    return self._engine.get_value(series, key)
  File "engines.pyx", line 81, in pandas.lib.IndexEngine.get_value (pandas/src/tseries.c:123878)
  File "engines.pyx", line 89, in pandas.lib.IndexEngine.get_value (pandas/src/tseries.c:123693)
  File "engines.pyx", line 135, in pandas.lib.IndexEngine.get_loc (pandas/src/tseries.c:124485)
KeyError: 'a'
```

KeyError?? The data frame obviously has a column labeled 'a', so what's going on? Well, if you delve into the source code with a debugger, you find that pandas catches any and all exceptions raised when calling f and just tries executing it in several completely different ways until something works, even if that's not what you want. The actual exception then gets thrown far past where the original bug is (that is, the exception raised inside f itself), and any helpful information from that exception is completely masked.

In this sense, pandas is really non-Pythonic, in that it feels very magical and non-explicit. It would be better for pandas to have multiple methods covering different ways of dispatching functions on the DataFrame than to have one function that can potentially misinterpret what you're trying to do. To quote from the Zen of Python, "explicit is better than implicit".
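
One workaround I've settled on (my own debugging habit, not an official pandas API) is to pull a single group out of the groupby by hand and call the function on it directly, so the genuine exception surfaces instead of being swallowed by aggregate's fallbacks:

```python
import pandas

x = pandas.DataFrame({'a': range(6), 'b': range(3, 9),
                      'c': [0, 0, 1, 1, 2, 2]})

def f(df):
    a = df['a'] + 'string'   # adding a string to an int column
    raise Exception('Error')

# Iterating over a groupby yields (key, sub-frame) pairs; call f on one
# group yourself instead of going through aggregate's dispatch machinery.
key, group = next(iter(x.groupby('c')))
caught = None
try:
    f(group)
except Exception as exc:
    caught = exc
# caught is now the real TypeError from df['a'] + 'string', not a KeyError
```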

But that's a small cost to pay for a dramatically better, faster language to program in.

# Publication Bias

I thought this post by Justin Esarey was a fun exploration of "publication bias", which is the bias academics feel in favor of publishing new and interesting results over negative results:

> As far as I know, no one’s really tried to formally assess the impact of these phenomena or to propose any kind of diagnostic of how susceptible any particular result is to these threats to inference.

Check it out, it's pretty neat.