Doing science with Python 3

Up until recently, I basically ignored Python 3 in my day-to-day Python practice. Sure, I listened to some podcasts and read some articles, but Python 2.7 is doing everything I want, so why add another item to the load of things to think about? Turns out, I’m currently writing a little library, and the question arises, should I support Python 3? If yes, how, and how hard is it? Or maybe I can claim that the scientific Python tool set is not quite ready for Python 3 and can ignore it for a little longer?

Well, no such luck — once I went ahead and installed it to see for myself, Python 3 with the packages I use most intensely turned out to be astonishingly well-behaved. Here is how I proceeded, both for my own records and in case this is useful for someone.

0. Background

System before install: Apple OS X 10.6.8 (Snow Leopard) with Python 2.7.5 from python.org installed as the default Python. I use Doug Hellmann’s virtualenvwrapper to manage my virtual environments, but up to now I didn’t use –no-site-packages, and some packages (scipy, for example) are installed globally. As far as easily possible, packages are installed with pip. However, the underlying shared libraries that are prerequisites for some of the scientific Python packages [1] are mostly managed with Homebrew.

Intended situation after install:

  • Python 2.7.5 remains the default Python
  • Python 3.3.3 available via the python3 command
  • A whole virtual environment using python3, with all the most common science tools

1. Install Python 3 from python.org

I downloaded the DMG file called Python 3.3.3 Mac OS X 64-bit/32-bit x86-64/i386 Installer (for Mac OS X 10.6 and later) and ran it. This didn’t overwrite the python command. Python 3 is, as expected, in /Library/Frameworks/Python.framework/Versions/3.3/, and the python3 executable is symlinked to /usr/local/bin/python3.

2. Install pip for Python 3

The easiest way, I believe:

curl http://python-distribute.org/distribute_setup.py | python3
curl https://raw.github.com/pypa/pip/master/contrib/get-pip.py | python3

3. Set up the virtual environment

I installed the virtualenv libraries for Python 3 rather than trying to use those for Python 2.7.5. (Python3 comes with its own tool to manage virtual environments, pyvenv, but I prefer to continue using my existing Python2.7 virtual environments rather than learn at this stage how the new tool works.)

/Library/Frameworks/Python.framework/Versions/3.3/bin/pip install virtualenv
/Library/Frameworks/Python.framework/Versions/3.3/bin/pip install virtualenvwrapper
/Library/Frameworks/Python.framework/Versions/3.3/bin/virtualenv --no-site-packages -p /usr/local/bin/python3 --distribute .virtualenvs/science3
workon science3
which pip
/Users/[username]/.virtualenvs/science3/bin/pip

The last command is to check that the pip command is indeed the one in our new virtual environment.

4. Get installing

pip install numpy
pip install pyzmq
pip install tornado
pip install jinja2
pip install ipython
pip install GDAL
pip install pyproj
pip instal h5py
pip install netcdf4
pip install matplotlib
...

Note that most of these require shared libraries to be installed beforehand. pyzmq requires zeromq for example; pyzmq, tornado and jinja2 are required for iPython (which is called afterwards as ipython3). The Geospatial Data Abstraction Library can be quite tricky to compile if you need support for many scientific data file formats (the HDF family, netCDF, ….), but luckily it doesn’t care if it is bound into Python 2 or Python 3.   Matplotlib will also install some prerequisites.

In the end, the following Python 3 packages are installed via pip:

(science3)$ pip freeze
Cython==0.19.2
GDAL==1.10.0
Jinja2==2.7.1
MarkupSafe==0.18
basemap==1.0.3
h5py==2.2.0
ipython==1.1.0
matplotlib==1.3.1
netCDF4==1.0.7
nose==1.3.0
numpy==1.8.0
pyparsing==2.0.1
pyproj==1.9.3
python-dateutil==2.2
pyzmq==14.0.0
readline==6.2.4.1
scikit-image==0.9.3
scikit-learn==0.14.1
scipy==0.13.1
six==1.4.1
tornado==3.1.1

5. What didn’t quite work

There were two glitches, one to do with the Matplotlib Basemap toolkit, the other with scipy.

The Basemap package from mpl_toolkits is a 120 MB download. That’s why I keep a version (not the newest one) saved locally and install from there:

pip install basemap-1.0.3.tar.gz

On a sidenote, this and some package installs (mostly those with code hosted in Google Code) came back with this warning:

You are installing a potentially insecure and unverifiable file. Future versions of pip will default to disallowing insecure files.

It installed fine, but importing Basemap (“from mpl_toolkits.basempap import Basemap”) fails with the error “ValueError: level must be >= 0”. Some googling shows that this has happened for a few Python packages with Python 3.3.3. Maybe upgrading the Basemap toolkit to the newest version will fix it. Right now this isn’t the highest priority.

As for scipy, the issue was quite different: A C code file (implementing a highly specialized numerical linear algebra algorithm — unsymmetric multifrontal sparse LU factorization) refused to compile (_umfpack_wrap.c). I am doubtful the issue even has anything to do with Python 3. In any event, I had been using a binary scipy package with Python 2.7, so I wouldn’t have seen the issue.

The solution was provided on a mailing list, to remove  UMFPACK altogether (“export UMFPACK=None”), and indeed scipy installed just fine without it. There is a related issue open for scipy on Github.

6. Conclusions

Python 3 feels just like Python always did! I don’t think the upgrade will change my way I go about designing software in Python, which is a relief. I made an iPython3 Notebook showing off some basic tasks (“open some weird scientific data files, read some metadata, plot the contents”).

[1] The top of my list consists of zeromq for iPython; gdal, geos, proj and maybe udunits for projected geospatial data; libpng, libtiff, libgeotiff for imagery; hdf4, hdf5, netcdf to access the scientific file formats I use most often — your list may be slightly different.

Sankey diagrams, bad charts, and science careers

Yesterday, a friend posted this chart to Facebook, noting that the topic was “uk ph.d. graduate career paths” and that in their experience (as an academic in North America), the percentages looked pretty close. I share my friend’s concern about career options for PhDs, but looking at the diagram, the thing that stands out to me is how terrible it is — as a chart.

Screen shot 2013-11-10 at 11.57.09

Its source is a 2010 Royal Society policy report (PDF) entitled “The Scientific Century: securing our future prosperity”. In the original, Fig. 1.6 has a caption:

This diagram illustrates the transition points in typical academic scientific careers following a PhD and shows the flow of scientifically-trained people into other sectors. It is a simplified snapshot based on recent data from HEFCE[33], the Research Base Funders Forum[34] and from the Higher Education Statistics Agency’s (HESA) Annual Destinations of Leavers from Higher Education’ (DLHE) survey. It also draws on Vitae’s analysis of the DLHE survey[35]. It does not show career breaks or moves back into academic science from other sectors.

So what’s so bad about the chart? Some obvious issues:

  • It is unclear what goes in on the left and to a lesser degree what is covered by the end points. The report indicates in a footnote that the term “science” is used “as shorthand for disciplines in  the natural sciences, technology, engineering and mathematics,” but the three documents used for input categorise the fields in different ways, and there is no indication which fields exactly would have been selected.
  • Line thickness is not proportional to percentage weight. The 26.5% and 30% streams have the same thickness, and the 17% stream is much less  than half the thickness of either. The 3.5% stream is more than half the thickness of the 17% stream. 
  • Why does “Permanent Research Staff” not end in an arrow? And why does the arrow from “Permanent Research Staff” to “Careers Outside Science” bend backwards (to suggest it is a step back in one’s career, that is, an implicit value judgement?) and then not even merge with the output stream?
  • Does it really mean to suggest that no one goes from “Early Career Research” (that is, a post-doc) to “Career Outside Science” (or to industry research)? In my experience, watching post-docs, that is quite a common choice for post-docs precisely because non-academic jobs may be offering better pay and conditions, or because they don’t have a choice at that stage.

A graph like this is called a Sankey Diagram. They are very common to illustrate flows of energy, or of any quantity that is overall conserved (like here, the cohort of PhD. I wondered if I could make a better one (except for the flaws in content itself), even though I’ve never made one. I like to use R for data visualization tasks (or Python of course), so I quickly found out about a) Ramnath Vaidyanathan’s rather intriguing rCharts library, which provides interfaces from R to a variety of JavaScript plotting libraries and b) the implementation of the Sankey plugin for d3.js by someone called timelyportfolio. The integration is still a little rough for the newbie, but some crucial remarks at the end of  someone else’s tutorial got me started. (I’ve long been wanting to play with d3.js anyway, as it has impressive capabilities for geographic visualizations.)

Here’s my version:

Screen shot 2013-11-10 at 11.56.56

Well, the fonts are too small. Click for full-sized image.

One advantage of plotting directly to HTML5/JavaScript is that sharing charts is extremely easy. As produced by d3.js, the chart isn’t too impressive, with several links overlapping. But as it is interactive, I manually cleaned it up and took the above screenshot.[1]

The cleaner chart illustrates most of the issues with the original one. Clearly it is unrealistic that any post-doc who later ends up in a career outside science or in non-academic research goes through another academic research staff position first. (And some go from post-doc directly to professor.) A bigger problem is the absence of differentiation by discipline. What does it mean that maybe 25% of STEM PhDs go through a period as temporary academic researchers before ending up outside science? I completely agree that this part of a researcher’s career is currently highly problematic in most Western countries (keywords: low compensation, high job insecurity, high expectations of personal investment in research), but there is a huge difference between a graduate from many engineering disciplines, where highly qualified people are finding highly satisfying “outside science” jobs, and fields where not staying in academia or public research after a PhD is the equivalent of a career change (think of astrophysics or pure mathematics). Also, the longer I think about it and look at some of the source documents (Vitae report, PDF) the more questions come up. Does Medicine count? Is teaching part of “career outside science”? What about higher education lectureships?

So in the end I remain with the feeling that no graph would have been more useful than this graph. The only thing it illustrates is confusion and uncertainty in the career paths, and as such, wouldn’t using a work of art to make the point have been more honest than what I can only call the illusion of science?

[1] For anyone interested, the code is here. It was also an opportunity to try out graphs in R.