April 2013 – Boreal Perspectives

The news has a way of trotting around my mind while I wait for the snow to melt. One recent item was that Elsevier, the academic publishing giant, has acquired Mendeley, a startup that runs one of the most popular academic citation management applications.

This news comes against the background of the persistent rumblings in the journal publishing space. There are many ways of approaching this debate, which revolves around the keyword “open access” and by now has matured to the point of touching on business models (who should pay? how much? what’s the role of institutional libraries and repositories?), pitfalls (what about academic vanity presses? how to make peer review both effective and transparent?) and a variety of interests (the taxpayer, who funds research that is locked behind paywalls; the corporations who are making large profits based on researchers’ volunteer work; the individual scientists who want to do the right thing without jeopardising their career).

Reference management is only a tiny aspect of the whole mess, but an interesting one. If we’re changing how scientific publications are disseminated, the workflows and habits of researchers, who are both readers and authors, and the data we produce about them, become very relevant. Software can be a tool to empower the individual. I use Zotero, which makes my collection searchable, retrieves and manages metadata and PDF files from countless journal sites, and lets me export citations in any format I could wish for, saving me oodles of time and hassle.

Mendeley and Zotero are often mentioned in the same breath as they are similar in capabilities, and the basic offering is free of charge. Now some will disapprove of Mendeley’s decision to sell the company to Elsevier simply on the basis of Elsevier’s role as the principal bad guy in the story of large publishing corporations’ abusive practices towards academic scientific research. There’s a boycott going on after all. Others speculate about consolidation within Elsevier’s services.

But to understand why the unhappiness goes even deeper, it is useful to step back and consider what’s at the bottom of it, namely the ways to make such software viable and the trade-offs they involve:

Charge for the software license itself. This is the strategy pursued by EndNote (which belongs to Thomson Reuters and has a list price of US $ 250-300) as well as RefWorks (ProQuest, US $ 100/year) and the slightly cheaper Papers (a small startup acquired by Springer, $ 80). Such a strategy requires a sales and marketing team and substantial overhead, which drive your costs up. The competition for software licensing fees from institutions is crowded, however, and the pot is limited.
Charge for storage of documents you save within the software. Zotero and Mendeley both do this. This is a straightforward, useful service offer, but I believe somewhat limited in how much revenue can be generated. Online storage is getting cheaper and cheaper, unlike staff to manage the offering. In the case of Zotero, a technically savvy user (or their institution) can get around storage limitations by setting up their own WebDAV service, as described in the product documentation. (This is what I do.)
Charge for premium features, such as collaboration tools, and thereby target institutions or teams. Mendeley does this.
Get supported by institutions and non-profits. This is Zotero’s case: it is free and open-source software that started as a simple Firefox plug-in and is now a project of George Mason University’s Roy Rosenzweig Center for History and New Media, which receives funding from various big-name charitable foundations; it also incorporates volunteer contributions from the user community.
Or get supported by a large commercial company. What’s in it for them? Access to the data you’re collecting: any reference application that offers synchronisation of metadata is sitting on a large heap of information about what researchers (active scientists at the cusp of their careers) are reading in the scientific literature and how they are using it. If you’re an academic publisher this information is highly valuable.

So clearly Mendeley isn’t the first to be associated with a publishing house, and all of the major reference management tools pursue multiple approaches at once. But the last two of them are pretty much mutually exclusive: An open-source project can’t simply sell its users’ data to the highest bidder. If they tried, they might get forked and see their users run away. Zotero’s privacy policy is admirably clear. Conversely, if you start out with the intent to mine user-generated data you can hardly claim you’re serving the public good, and can’t very well open up your source code, which is your very competitive advantage. Not even Google, who do support free software projects, do that.

So this is where the rub is: Mendeley didn’t just sell themselves out; they sold their users out. Even if contrary to expectations the assurances that “nothing changes” turn out to be the truth, they still turned their users into cogs in Elsevier’s profit-generating machine (36% margin, last time I checked). Like Papers is, most likely, for Springer, or RefWorks for ProQuest, or even the expensive EndNote for Thomson Reuters. There’s a saying that if you aren’t the customer you are the product, but I think this is not quite correct: you can also be the means of production.

And this, too, is why I did pay attention to who is behind the reference management software I was choosing. It may be naive and sometimes premature, but I’m more inclined to trust a project that has a strong non-profit institution behind it. If it’s a big corporation, I want to know up-front and make a choice about sharing my data with them. If it’s a start-up, the question is how much can I trust in their integrity and ability to develop a sustainable plan without selling me out, as well as themselves?

NCL stands for the NCAR Command Language, a free programming language that specializes in scientific data visualization. It is one of those rather idiosyncratic niche languages that few outside meteorology, climate science, oceanography or earth science have heard of but that can be a viable alternative to, say, Python with NumPy and Matplotlib, R, or for that matter the non-free but widespread Matlab.

It was during a pretty awesome NCL workshop that I started appreciating NCL more fully. Its main advantage, beyond the pretty powerful (and well documented) graphics capabilities, is that it provides a unified interface to some unwieldy scientific and geospatial data formats. Coming from the world of commercial relational databases, it took me quite a while to understand how scientists think about data: a wide and deep topic for another post. For now, suffice it to say that scientific data often comes in large files that contain multiple series of data, plus metadata (data that describes either all or some of that data), and all need to be automatically processed. A typical task might be to read in such a file, retrieve several of the data series, select a sub-set of them (say, for some time span or some geographic area), calculate a new quantity, and plot the result on a map, with a proper map projection and latitude/longitude coordinates. As I work daily with satellite data files (in formats that go by names such as HDF-EOS, NetCDF or HDF5), I appreciate a programming framework that swallows my downloaded GB and gives me this with a few lines of code and without costing an arm and a leg.

So for the scientist, trying out NCL might be a good idea, and on Mac OS X (10.6) this didn’t take more than a large binary download and a few clicks. Not so last weekend when I set up an Ubuntu system, unfortunately. So I apologize that this post now becomes very boring, and describe how I got it to run and what decisions I made on the way.

0. The general approach

The following took place in a VirtualBox VM in order to create as reproducible a base as possible, that is, a completely fresh install of Ubuntu 11.10 (Oneiric Ocelot), 32-bit desktop version.

What I’m describing is my third attempt, finally successful, at getting NCL up and running. Attempt 1 was to use the pre-compiled binaries, but the available 32-bit Debian Linux version just threw odd error messages that told me that at the very least some libraries and tools weren’t what it expected. This may have been fixable, but the task would have been frustrating and ultimately not very instructive.

Better to install from source, I thought, thinking I could use packages from the Ubuntu repositories for most of the dependencies: libcairo, zlib or bison, or even widespread specialized stuff like GDAL (the Geospatial Data Abstraction Library) are pretty standard, and the specialized file format libraries are available as well. However, this path also turned out to be fraught, mainly because the repository versions of the libraries related to these file types tended to not play nicely with each other, and also because the configuration file for the compilation of NCL itself expected certain libraries (hdf4, hdf5, netcdf, hdf-eos, hdf-eos5…) to be in certain locations relative to each other. I couldn’t get it right.

So at last here is what worked: Install only some of the dependencies, the unproblematic ones, from Ubuntu repositories, and install all of the finicky libraries and everything they expect to be just-so through compilation from source. So here’s the recipe:

1. Install dependencies through the package manager

merian> sudo apt-get install build-essential gfortran csh ksh\
    flex bison openssl vim libcairo2-dev libjpeg62-dev libbz2-dev\
    libudunits2-dev libxaw7-dev libxt-dev libxft-dev libxpm-dev 
merian> sudo ln -s /usr/include/freetype2/freetype /usr/include/freetype

This takes care of: compilers, shells, tools; X11, graphics (including libpng and libjpeg), and some of the required compression libraries (bz2); and the Unidata Units package for unit transformations. It also corrects an odd discrepancy in where exactly the freetype library headers are expected by other source packages.
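
Before moving on, it can be worth confirming that the tools from this step really are on the PATH. Here is a minimal sketch, assuming a POSIX-style shell; the function name missing_tools is my own invention for illustration and not part of any package:

```shell
# Hypothetical helper: print every named command that is NOT on the PATH.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage: an empty result means all the listed build tools are present.
# missing_tools gcc gfortran csh ksh flex bison
```

This only checks executables; missing development headers will still only show up at configure time.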

2. Download source code for the rest of the dependencies

There are a number of libraries NCL can optionally support or use. I decided not to install the whole kitchen sink, but to leave a few as an exercise to the reader, if needed: GRIB (a file format widely used for distributing meteorological data), Triangle (for triangular meshes used in climate modeling) and Vis5D+ (which I’ve never heard of). If you’re not interested in NASA satellite data, you can leave both of the HDF-EOS versions and HDF4 out as well. GDAL and PROJ4 are optional, too, but you really want nice mapping capabilities. This leaves the following tarballs.

http://zlib.net/zlib-1.2.7.tar.gz
http://curl.haxx.se/download/curl-7.29.0.tar.gz
http://www.hdfgroup.org/ftp/lib-external/szip/2.1/src/szip-2.1.tar.gz
http://www.hdfgroup.org/ftp/HDF5/current/src/hdf5-1.8.10-patch1.tar.gz
http://www.hdfgroup.org/ftp/HDF/HDF_Current/src/hdf-4.2.9.tar.gz
http://www.unidata.ucar.edu/downloads/netcdf/ftp/netcdf-4.2.1.1.tar.gz
http://www.unidata.ucar.edu/downloads/netcdf/ftp/netcdf-fortran-4.2.tar.gz
http://download.osgeo.org/gdal/gdal-1.9.2.tar.gz
ftp://edhs1.gsfc.nasa.gov/edhs/hdfeos/latest_release/HDF-EOS2.18v1.00.tar.Z
ftp://edhs1.gsfc.nasa.gov/edhs/hdfeos5/latest_release/HDF-EOS5.1.14.tar.Z
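
Note that the two HDF-EOS tarballs use the older compress format (.tar.Z) rather than gzip. A small sketch of how one might unpack both kinds the same way, assuming the files have already been downloaded into the current directory; unpack is a hypothetical helper name:

```shell
# Hypothetical helper: unpack a downloaded tarball based on its file suffix.
unpack() {
    case "$1" in
        *.tar.gz|*.tgz|*.tar.Z)
            # gzip -dc reads both gzip files and the old compress (.Z) format
            gzip -dc "$1" | tar -xf - ;;
        *)
            echo "unpack: unknown archive type: $1" >&2
            return 1 ;;
    esac
}

# Usage, once the downloads are in place:
# for f in *.tar.gz *.tar.Z; do unpack "$f"; done
```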

3. Install dependencies from source

I’m following the NCAR instructions pretty closely and am installing in order: zlib first, then szip, libcurl and HDF5, which are needed by NetCDF. I was unable to persuade NetCDF to compile based on a mix of hand-compiled dependencies (in /usr/local/include) and Ubuntu header packages (in /usr/include), so all of the prerequisites had to be compiled, though some of their dependencies, in particular libjpeg and libpng, can be pre-installed. You may want to set the following environment variable, using setenv or export depending on whether you prefer a csh or bash type shell:

merian> export LD_LIBRARY_PATH=/usr/local/lib
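
The plain assignment above replaces whatever LD_LIBRARY_PATH already contained. If the variable might already be set on your system, a slightly more defensive sketch (assuming a POSIX-style shell such as bash) prepends /usr/local/lib instead:

```shell
# Prepend /usr/local/lib, keeping any existing entries.
# ${VAR:+text} expands to text only when VAR is set and non-empty,
# so no stray trailing colon is added when the variable was empty.
LD_LIBRARY_PATH="/usr/local/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export LD_LIBRARY_PATH

# csh/tcsh equivalent of the simple form:
# setenv LD_LIBRARY_PATH /usr/local/lib
```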

Install zlib:

merian@ubuntu> ./configure --prefix=/usr/local
merian@ubuntu> make check && make install

Then, once zlib is installed, for example for libcurl, the configure command becomes:

merian@ubuntu> ./configure --prefix=/usr/local --with-zlib=/usr/local --with-pic

Pay attention to the libraries that you haven’t installed in /usr/local, and use --with-prefix=/usr in this case.
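
Since the same configure, make, install sequence repeats for most of the libraries, it can be wrapped in a small function. The following is only a sketch under my own assumptions: build_pkg, PREFIX and DRY_RUN are hypothetical names, the plain "make && make install" is a generic stand-in for whatever targets each package's instructions call for, and the directory names are assumed to match the tarball names.

```shell
# Hypothetical helper: run the usual configure/make/install sequence in a
# source directory, always passing --prefix so it need not be retyped.
# Set DRY_RUN=1 to print the command line instead of running it.
PREFIX="${PREFIX:-/usr/local}"

build_pkg() {
    dir="$1"; shift
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "cd $dir && ./configure --prefix=$PREFIX $* && make && make install"
        return 0
    fi
    # The subshell keeps the cd from leaking into the caller
    ( cd "$dir" && ./configure --prefix="$PREFIX" "$@" && make && make install )
}

# Usage (directory names assumed from the tarball names):
# build_pkg zlib-1.2.7
# build_pkg curl-7.29.0 --with-zlib=/usr/local --with-pic
```

Running with DRY_RUN=1 first is a cheap way to check the flags before committing to a long compile.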

The biggest single task, and the only one where the NCAR/NCL site is incomplete, is putting NetCDF together. Very helpful are the installation instructions on the NetCDF site, both for the C libraries and the Fortran libraries. Do not worry about HDF4 support for NetCDF – we will give NCL HDF4 support through installing HDF4 directly.

Once you have NetCDF compiled and installed, the rest of the file formats come next: HDF4, HDF-EOS2 and HDF-EOS5. If you don’t need the last two, my suggestion is to simply not install support for them, but if you do, follow the NCAR/NCL instructions to the letter. For HDF4, the most important point is not to forget --includedir=/usr/local/include/hdf as the software has its own version of the NetCDF header file (netcdf.h), which would otherwise overwrite the existing file. For HDF-EOS2, it is strongly advised not to use ./configure but to proceed as described and run bin/INSTALL-HDFEOS. The process involves manually editing makefiles and moving compiled libraries and header files to /usr/local/lib and /usr/local/include.
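
After installing HDF4 it is easy to check that the two netcdf.h files have not collided. This is a rough sketch, assuming the include locations used in this post; check_netcdf_headers is a hypothetical name, and the "identical files" test is only a heuristic of mine:

```shell
# Hypothetical check: confirm that HDF4's netcdf.h went into the hdf/
# subdirectory and that NetCDF's own header is still in place.
check_netcdf_headers() {
    inc="${1:-/usr/local/include}"
    [ -f "$inc/netcdf.h" ]     || { echo "missing $inc/netcdf.h"; return 1; }
    [ -f "$inc/hdf/netcdf.h" ] || { echo "missing $inc/hdf/netcdf.h"; return 1; }
    # Heuristic: byte-identical files suggest one copy has replaced the other
    if cmp -s "$inc/netcdf.h" "$inc/hdf/netcdf.h"; then
        echo "warning: both netcdf.h files are identical"
        return 1
    fi
    echo "ok"
}

# Usage: check_netcdf_headers /usr/local/include
```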

GDAL and PROJ4 will typically be installed on any system that includes geospatial tools. The reason we install them from source is that GDAL comes with its own versions of HDF and other libraries, but needs to be linked against the same versions we built here. For the GDAL libraries, this configuration command worked well for me:

merian> ./configure --with-static-proj4=/usr/local --prefix=/usr/local \
    --without-pam --with-png=/usr --with-gif=internal \
    --with-libtiff=internal --with-geotiff=internal \
    --with-jpeg=/usr --with-libz=/usr/local \
    --with-sqlite3=no --with-expat=no --with-curl=no \
    --without-ld-shared --with-hdf4=no --with-hdf5=no\
    --with-pg=no --without-grib --with-freexl=no --with-geos=no

4. Install NCL

Downloading an NCL package requires a free registration at Earth System Grid/NCAR, which I’ve been told serves for NCAR to be able to count downloads and justify their funding. I do wonder if there isn’t a better way of achieving this.

Again, the instructions are spot-on nearly everywhere. The process likes the environment variable NCARG to be set: I untar the source distribution, cd into the ncl_ncarg-[version] directory and run

merian> export NCARG=`pwd`

The SYSTEM_INCLUDE line returns “LINUX”, so that’s our makefile template. One change needs to be made in this file, regarding the Fortran compiler options:

#define FcOptions  -fPIC -fno-second-underscore -fno-range-check

Without the last option I got an arithmetic overflow error. It looked like it was only in some typecasting routine or other, but I guess I could have fixed that bug instead. In the configuration dialog, I used /usr/local/ncarg as the installation location, and at the very end I like to create symbolic links for the binary executables:

merian> ln -s /usr/local/ncarg/bin/* /usr/local/bin/
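
The one-line ln command complains about any name that already exists in /usr/local/bin, for instance on a second install. A more forgiving variant might look like this sketch; link_bins is a hypothetical helper name of my own:

```shell
# Hypothetical helper: symlink every file in SRC into DST,
# skipping names that already exist there.
link_bins() {
    src="$1"; dst="$2"
    for f in "$src"/*; do
        [ -e "$f" ] || continue          # empty directory: glob did not match
        name=$(basename "$f")
        if [ -e "$dst/$name" ] || [ -L "$dst/$name" ]; then
            echo "skipping $name (already present)"
        else
            ln -s "$f" "$dst/$name"
        fi
    done
}

# Usage: link_bins /usr/local/ncarg/bin /usr/local/bin
```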

During the configuration dialogue, answer “yes” to all the optional dependencies you have compiled and “no” to those you haven’t. And then compile. The make target “Everything” cleans out all previously compiled products, so it takes longest if you only have to fix a single error. “All” is faster, but what can happen is that the actual ncl executable may have got compiled but not moved to its final destination. So if your compilation output has no error but you can’t find the ncl executable, take a look in $NCARG/ni/src/ncl/. Otherwise it should be in /usr/local/ncarg/bin/ (or wherever your target location is).
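
The hunt for the ncl executable described above can be sketched as a small function. The two locations are the ones named in this post; find_ncl itself and its argument order are hypothetical:

```shell
# Hypothetical helper: report where the ncl executable ended up.
# Arguments: the source tree (defaults to $NCARG) and the install location.
find_ncl() {
    src="${1:-${NCARG:-.}}"; dest="${2:-/usr/local/ncarg}"
    if [ -x "$dest/bin/ncl" ]; then
        echo "installed: $dest/bin/ncl"
    elif [ -x "$src/ni/src/ncl/ncl" ]; then
        echo "built but not installed: $src/ni/src/ncl/ncl"
    else
        echo "not found"
        return 1
    fi
}

# Usage: find_ncl "$NCARG" /usr/local/ncarg
```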

Finally, test as described.

5. Final remarks

Frankly, I found this process painful. And I wonder why, much too often, we scientists put up with this; it’s not the first time that I’m seeing it. This is not a question of skills. I have the highest regard for the people at NCAR who maintain and develop NCL and provide free and high-quality support, thereby also spreading good coding practices and efficiency to newcomers. Maybe it’s the opposite: Because a typical computational scientist has the skills to deal with such an installation process, we don’t make the effort of fixing, say, the broken makefile of HDF-EOS2 once and for all, and don’t write install scripts that do away with the need to tell the software 592 times that we want it to be installed under /usr/local/ . Maybe there are license restrictions that would make it hard to repackage everything nicely, I don’t know. If so, we should make it clear that there’d be an advantage to getting rid of them.


Elsevier acquires Mendeley: some thoughts

Installing NCL on Ubuntu GNU/Linux 11.10