## December 27, 2014

### Matthew Rocklin

#### Towards Out-of-core ND-Arrays

tl;dr We propose a system for task-centered computation, show an example with out-of-core nd-arrays, and ask for comments.

Note: This post is not user-focused. It is intended for library writers.

## Motivation

Recent notebooks (links 1, 2) describe how Blaze handles out-of-core single-dataset tabular computations in the following stages.

1. Partition the dataset into chunks
2. Apply some computation on each chunk
3. Concatenate the results (hopefully smaller)
4. Apply another computation to the concatenated result

Steps 2 and 4 require symbolic analysis of what work should be done; Blaze does this well. Steps 1 and 3 are more about coordinating where data goes and when computation executes.

This setup is effective for a broad class of single-dataset tabular computations. It fails for more complex cases. Blaze doesn’t currently have a good target for describing complex inter-block data dependencies. The model for breaking apart data and arranging computations (1 and 3) is too simple.

A good example of a complex case is an nd-array matrix-matrix multiply / dot product / tensor contraction. In this case a blocked approach has a more complex communication pattern. This post is about finding a simple framework that allows us to express these patterns. It’s about finding a replacement for steps 1 and 3 above.

The common solution to this problem is to describe the computation as a bipartite directed acyclic graph where nodes are computations and data and edges indicate which pieces of data a computation takes as input and delivers as output.

Many solutions to this problem exist, both theoretical algorithms and implemented software. Forgive me for describing yet another system.

## dask

We use a low-tech representation of a task dependency graph. We use a dictionary of key-value pairs where keys are any hashable identifier and values are one of the following:

1. A value, like 1
2. A tuple containing a function and arguments, like (inc, 1). This is like an s-expression and should be interpreted as an unevaluated inc(1)
3. A tuple containing a function and arguments. Arguments may include other keys, (inc, 'my_key')

This is more clear with an example.

d = {'x': 1,
     'y': (inc, 'x'),
     'z': (add, 'y', 10)}

The dask library contains a small reference implementation to get values associated to keys in this task graph.

>>> import dask
>>> dask.get(d, 'x')
1
>>> dask.get(d, 'y')  # Triggers computation
2
>>> dask.get(d, 'z')  # Triggers computation
12

In principle this could be executed by a variety of different implementations each with different solutions for distributed computing, caching, etc..
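The reference implementation really is small. Here is a minimal, hypothetical sketch of such a get (not dask's actual code) that recursively evaluates task tuples:

```python
def inc(x):
    return x + 1

def add(x, y):
    return x + y

def get(d, key):
    # Minimal sketch of a scheduler: look up the key; if the value is a
    # task tuple, evaluate each argument (recursing on keys) and apply.
    task = d[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, args = task[0], task[1:]
        args = [get(d, arg) if arg in d else arg for arg in args]
        return func(*args)
    return task

d = {'x': 1,
     'y': (inc, 'x'),
     'z': (add, 'y', 10)}

get(d, 'z')  # evaluates inc(1), then add(2, 10), giving 12
```

This sketch recomputes shared intermediates and runs serially; a real scheduler would add caching, parallelism, and memory management on top of the same graph.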

Dask also includes convenience functions to help build this graph.

>>> dask.set(d, 'a', add, args=['x', 'y'])

Although this is mainly to help those who feel uncomfortable putting the parentheses on the left side of a function call to avoid immediate execution:

>>> # d['a'] =  add( 'x', 'y')  # intend this
>>> d['a'] = (add, 'x', 'y')    # but write this to avoid immediate execution

## Why low tech?

These “graphs” are just dictionaries of tuples. Notably, we imported dask after we built our graph. The framework investment is very light.

• Q: Why don’t we build Task and Data classes and construct a Python framework to represent these things formally?
• A: Because people have to learn and buy in to that framework and that’s hard to sell. Dictionaries are easier to sell. They’re also easy to translate into other systems. Additionally, I was able to write a reference implementation in a couple dozen lines.

It’s easy to build functions that create dicts like this for various applications. There is a decent chance that, if you’ve made it this far in this blogpost, you already understand the spec.

## ND-Arrays

I want to encode out-of-core ND-Array algorithms as data. I’ve written a few functions that create dask-style dictionaries to help me describe a decent class of blocked nd-array computations.

The following section is a specific example applying these ideas to the domain of array computing. This is just one application and not core to the idea of task scheduling. The core ideas of task scheduling and the dask implementation have already been covered above.

### Getting blocks from an array

First, we break apart a large possibly out-of-core array into blocks. For convenience in these examples we work with in-memory numpy arrays rather than on-disk arrays. Jump to the end if you'd like to see a real OOC dot product on on-disk data.

We make a function ndget to pull out a single block

>>> x = np.arange(24).reshape((4, 6))
>>> x
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23]])

>>> # Cutting into (2, 3) shaped blocks, get the (0, 0)th block
>>> ndget(x, (2, 3), 0, 0)
array([[0, 1, 2],
       [6, 7, 8]])

>>> # Cutting into (2, 3) shaped blocks, get the (1, 0)th block
>>> ndget(x, (2, 3), 1, 0)
array([[12, 13, 14],
       [18, 19, 20]])
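An ndget with this behavior can be sketched in a couple of lines of slicing. This is a hypothetical implementation, not the actual one, and it assumes the array shape divides evenly by the blocksize:

```python
import numpy as np

def ndget(x, blocksize, *index):
    # Slice out the block at the given block-index, assuming the array
    # shape divides evenly into blocks of the given blocksize.
    slices = tuple(slice(i * size, (i + 1) * size)
                   for i, size in zip(index, blocksize))
    return x[slices]

x = np.arange(24).reshape((4, 6))
ndget(x, (2, 3), 1, 0)
# array([[12, 13, 14],
#        [18, 19, 20]])
```

Because it only slices, the same sketch works on anything that supports numpy-style indexing.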

We now make a function getem that makes a dict that uses this ndget function to pull out all of the blocks. This creates more keys in our dictionary, one for each block. We name each key by the key of the array followed by a block-index.

• getem: Given a large possibly out-of-core array and a blocksize, pull apart that array into many small blocks
>>> d = {'X': x}  # map the key 'X' to the data x

>>> getem('X', blocksize=(2, 3), shape=(4, 6))
{('X', 0, 0): (ndget, 'X', (2, 3), 0, 0),
 ('X', 1, 0): (ndget, 'X', (2, 3), 1, 0),
 ('X', 1, 1): (ndget, 'X', (2, 3), 1, 1),
 ('X', 0, 1): (ndget, 'X', (2, 3), 0, 1)}

>>> d.update(_)  # dump in getem dict

So we have a single original array, x, and using getem we describe how to get many blocks out of x using the function ndget on each block.

• ndget actually does work on data
• getem creates a dask dict that describes the data on which ndget should operate
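A getem with this behavior can be sketched as a dict comprehension over all block indices. This is a hypothetical implementation, again assuming the shape divides evenly by the blocksize:

```python
from itertools import product

def ndget(x, blocksize, *index):
    # Block getter as sketched earlier (hypothetical)
    return x[tuple(slice(i * s, (i + 1) * s)
                   for i, s in zip(index, blocksize))]

def getem(name, blocksize, shape):
    # One task per block, keyed by the array's key plus a block-index
    numblocks = [s // bs for s, bs in zip(shape, blocksize)]
    return {(name,) + ind: (ndget, name, blocksize) + ind
            for ind in product(*map(range, numblocks))}

getem('X', blocksize=(2, 3), shape=(4, 6))
# {('X', 0, 0): (ndget, 'X', (2, 3), 0, 0), ...}
```

Note that getem builds only the dict; no data is touched until a scheduler evaluates one of these keys.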

We haven’t done work yet. We only do work when we finally call dask.get on the appropriate key for one of the blocks.

>>> dask.get(d, ('X', 1, 0))  # Get block corresponding to key ('X', 1, 0)
array([[12, 13, 14],
       [18, 19, 20]])

We use numpy.ndarrays for convenience. This would have worked with anything that supports numpy-style indexing, including out-of-core structures like h5py.Dataset, tables.Array, or bcolz.carray.

### Example: Embarrassingly Parallel Computation

If we have a simple function

def inc(x):
    return x + 1

that we want to apply to all blocks of the dataset, we could, in principle, add the following to our dictionary.

>>> d.update({('X-plus-1', i, j): (inc, ('X', i, j))
...           for i in range(2) for j in range(2)})

>>> dask.get(d, ('X-plus-1', 0, 0))
array([[1, 2, 3],
       [7, 8, 9]])

Our use of keys like ('name', i, j) to refer to the i,jth block of an array is an incidental convention and not intrinsic to dask itself. This use of tuples as keys should not be confused with the use of tuples in values to encode unevaluated functions.
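To make the distinction concrete, here is a toy entry (hypothetical, not from the examples above) showing both uses side by side:

```python
def inc(x):
    return x + 1

d = {('X', 0, 0): 5,                          # tuple as a key: just a block name
     ('X-plus-1', 0, 0): (inc, ('X', 0, 0))}  # tuple as a value: deferred inc(...)

# Only a value whose first element is callable is treated as a task;
# a key is never interpreted as a call.
callable(d[('X-plus-1', 0, 0)][0])  # True
```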

### Index expressions

A broad class of array computations can be written with index expressions

• Matrix transpose: Z[i, j] = X[j, i]
• Matrix-matrix multiply: Z[i, k] = sum_j X[i, j] * Y[j, k]

Fortunately, the blocked versions of these algorithms look pretty much the same. To leverage this structure we made the function top for tensor operations (ideas for a better name welcome). It writes index operations like the following blocked transpose:

>>> top(np.transpose, 'Z', 'ji', 'X', 'ij', numblocks={'X': (2, 2)})
{('Z', 0, 0): (numpy.transpose, ('X', 0, 0)),
 ('Z', 0, 1): (numpy.transpose, ('X', 1, 0)),
 ('Z', 1, 0): (numpy.transpose, ('X', 0, 1)),
 ('Z', 1, 1): (numpy.transpose, ('X', 1, 1))}

The first argument, np.transpose, is the function to apply to each block. The second and third arguments are the name and index pattern of the output. The succeeding arguments are the name and index pattern of each input. In this case the output index pattern is the reverse of the input's: we map the ijth block of the input to the jith block of the output after calling np.transpose on it. Finally, the numblocks keyword argument gives the block structure of the inputs. Index structure can be any iterable.

### Matrix Multiply

We represent tensor contractions like matrix-matrix multiply with indices that are repeated in the inputs and missing in the output. In the following example the index 'j' is a contracted dummy index.

>>> top(..., 'Z', 'ik', 'X', 'ij', 'Y', 'jk', numblocks=...)

In this case the function receives iterators of blocks of data that range over the dummy index, j. We make such a function that takes iterators of square array blocks, takes the dot product of each pair, and then sums the results. This is the inner-most loop of a conventional blocked matrix-matrix multiply algorithm.

def dotmany(A, B):
    return sum(map(np.dot, A, B))

By combining this per-block function with top we get an out-of-core dot product.

>>> top(dotmany, 'Z', 'ik', 'X', 'ij', 'Y', 'jk', numblocks={'X': (2, 2),
...                                                          'Y': (2, 2)})
{('Z', 0, 0): (dotmany, [('X', 0, 0), ('X', 0, 1)],
                        [('Y', 0, 0), ('Y', 1, 0)]),
 ('Z', 0, 1): (dotmany, [('X', 0, 0), ('X', 0, 1)],
                        [('Y', 0, 1), ('Y', 1, 1)]),
 ('Z', 1, 0): (dotmany, [('X', 1, 0), ('X', 1, 1)],
                        [('Y', 0, 0), ('Y', 1, 0)]),
 ('Z', 1, 1): (dotmany, [('X', 1, 0), ('X', 1, 1)],
                        [('Y', 0, 1), ('Y', 1, 1)])}

The top function inspects the index structure of the inputs and outputs and constructs dictionaries that reflect this structure, matching indices between keys and creating lists of keys over dummy indices like j.
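A simplified top can be sketched as follows. This hypothetical version handles exactly the two patterns shown above, index permutations and contracted indices gathered into lists of keys; the real function would need to be more general:

```python
from itertools import product

def top(func, out_name, out_ind, *args, numblocks):
    # Pair input names with their index patterns: ('X', 'ij'), ('Y', 'jk'), ...
    pairs = list(zip(args[::2], args[1::2]))
    # Map each index letter to its number of blocks
    dims = {}
    for name, ind in pairs:
        for letter, nb in zip(ind, numblocks[name]):
            dims[letter] = nb
    # Dummy indices appear in the inputs but not in the output
    dummies = [i for i in dims if i not in out_ind]
    dsk = {}
    for out_pos in product(*[range(dims[i]) for i in out_ind]):
        coords = dict(zip(out_ind, out_pos))
        task = [func]
        for name, ind in pairs:
            contracted = [i for i in ind if i in dummies]
            if contracted:
                # Contracted input: a list of keys ranging over the dummies
                keys = []
                for dvals in product(*[range(dims[i]) for i in contracted]):
                    c = {**coords, **dict(zip(contracted, dvals))}
                    keys.append((name,) + tuple(c[i] for i in ind))
                task.append(keys)
            else:
                task.append((name,) + tuple(coords[i] for i in ind))
        dsk[(out_name,) + out_pos] = tuple(task)
    return dsk
```

Running it on the transpose and matrix-multiply patterns above reproduces the dictionaries shown.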

And that's it: we have an out-of-core dot product. Calling dask.get on these keys results in out-of-core execution.

## Full example

Here is a tiny proof of concept for an out-of-core dot product. I wouldn’t expect users to write this. I would expect libraries like Blaze to write this.

### Create random array on disk

import bcolz
import numpy as np
b = bcolz.carray(np.empty(shape=(0, 1000), dtype='f8'),
                 rootdir='A.bcolz', chunklen=1000)
for i in range(1000):
    b.append(np.random.rand(1000, 1000))
b.flush()

### Define computation A.T * A

d = {'A': b}
d.update(getem('A', blocksize=(1000, 1000), shape=b.shape))

# Add A.T into the mix
d.update(top(np.transpose, 'At', 'ij', 'A', 'ji', numblocks={'A': (1000, 1)}))

# Dot product A.T * A
d.update(top(dotmany, 'AtA', 'ik', 'At', 'ij', 'A', 'jk',
             numblocks={'A': (1000, 1), 'At': (1, 1000)}))

### Do work

>>> %time dask.get(d, ('AtA', 0, 0))
CPU times: user 2min 57s, sys: 6.59 s, total: 3min 4s
Wall time: 2min 49s
array([[ 334071.93541158,  250297.16968262,  250404.87729587, ...,
         250436.85274716,  250330.64262904,  250590.98832611],
       [ 250297.16968262,  333451.72293343,  249978.2751824 , ...,
         250103.20601281,  250014.96660956,  250251.0146828 ],
       [ 250404.87729587,  249978.2751824 ,  333279.76376277, ...,
         249961.44796719,  250061.8068036 ,  250125.80971858],
       ...,
       [ 250436.85274716,  250103.20601281,  249961.44796719, ...,
         333444.797894  ,  250021.78528189,  250147.12015207],
       [ 250330.64262904,  250014.96660956,  250061.8068036 , ...,
         250021.78528189,  333240.10323875,  250307.86236815],
       [ 250590.98832611,  250251.0146828 ,  250125.80971858, ...,
         250147.12015207,  250307.86236815,  333467.87105673]])

Three minutes for a 7GB dot product. This runs at about half the FLOPS of a normal in-memory matmul; I'm not yet sure what causes the discrepancy. Also, this isn't using an optimized BLAS, and we have yet to leverage multiple cores.

This isn’t trivial to write, but it’s not bad either.

## Complexity and Usability

This system is not appropriate for users; it’s unPythonic, low-level, and LISP-y. However I believe that something like this would be an appropriate standard for infrastructural libraries. It’s a simple and easy standard for code to target.

Using projects like into and blaze we can build a usable high-level front-end onto dask for the subproblems of arrays and tables. Blaze could generate these dictionaries and then hand them off to other systems to execute.

## Execution

Using the reference implementation, multithreading, HDF5/BColz, and out-of-core caching systems like chest I think that we can build a decent out-of-core ndarray solution that fully leverages a large workstation.

Ideally other people come along and build better execution engines / task schedulers. This might be an appropriate application for IPython parallel.

## Help

This could use design and technical feedback. What would encourage community buy-in to a system like this?

## December 25, 2014

### Titus Brown

#### On gaining tenure as an open scientist

On December 10th, 2014, I was formally awarded tenure at UC Davis, where I will start as an Associate Professor in the School of Veterinary Medicine on January 5th, 2015. In my research statement for my job application, I wrote:

Open science and scientific reproducibility: I am a strong advocate of open science, open source, open data, open access, and the use of social media in research as a way to advance research more broadly.

I've been an advocate of "open" for decades, and since becoming an Assistant Professor at Michigan State University in spring 2008, I have explored a variety of approaches to doing science more openly:

While being open, I achieved several career milestones, including publishing several senior author papers, graduating several doctoral students, giving an invited talk at a Gordon Conference, keynoting several international conferences, having several highly ranked universities recruit me, getting an NIH R01 (and overall bringing in more than $2m in grants while at MSU), getting recruited and hired at an Associate Professor level, gaining tenure, and becoming a Moore Investigator with sustained medium-term funding. There are certainly many more successful people than me out there, but I personally don't know what else I could wish for - my plate is full and I'm pursuing research I believe to be important.

Here are some things I've observed over the years.

It is possible to achieve some measure of traditional success while being open. Grants; publications; tenure. 'nuff said.

Being open has become an increasingly positive driver of my career. Over the years, the benefit of openness has become increasingly obvious. I can't actually point to many negatives - the biggest specific problem I saw was that various people in my administrative chain at MSU didn't understand why I was pursuing open, but it's not clear that that had any negative consequences on my career. Meanwhile, thousands of people are using our software, I have a reasonably large and very enthusiastic blog and Twitter following, a number of papers with lots of citations, more invitations to speak than I can take, and excellent applications for grad student and postdoc positions.

My current research program developed independently of open. I think I would have been doing approximately the same things at a research level whether or not I had been so open about it. (With my Moore award, that's changing in the future.) In particular, some grant awards resulted from connections made through my blog, but most did not.

Open is not the point, and maybe shouldn't be. I would still rather hire an excellent closed scientist than a mediocre open scientist. But I would definitely push hiring an excellent open scientist over an excellent closed scientist. In my case, Davis definitely didn't hire me because I was an open scientist; they hired me because they liked my research and training efforts, and because there was a need for them at Davis. Open simply didn't figure in.

Some people will argue that your time is better spent elsewhere... Of the six or so letters evaluating my tenure case at Davis, I reportedly had one letter that was rather negative and argued that I was not living up to my potential. My chair agreed with my suspicion that one of the reasons for this backhanded compliment was that the author of the letter felt I was wasting my time by investing in social media, open science, and reproducible research.

...but people can always find a reason to criticize you. There are very few science faculty who I think are doing everything right, and even fewer to whom I wouldn't offer friendly advice if asked. Certainly, if I wanted to tear someone down I could probably do so. The majority of my tenure letter writers, the entirety of my new department, and (presumably) the rest of the hierarchy at Davis all strongly supported my tenure case.

Most of my colleagues have been very supportive. Virtually none of the active research faculty in my departments at MSU question the need for change, or the utility of open science/access/source/data. However, they do get stuck on the details of how to incentivize, drive, and implement open science. The majority of negative or "WTF?" comments I've received have been from a subset of administrators; my most charitable perspective on this is that administrators are generally much more concerned with policy and incentives than individual faculty, and consequently weight the obstacles higher than the opportunities.

Open science will (eventually) win out in basic research... There are too many reasons for granting agencies to support open science for science to remain closed, absent significant monetary drivers for it to remain closed.

...but there's a fundamental asymmetry in closed vs open that's retarding progress towards open, and scientific progress more broadly. A closed scientist can make use of preprints, open source and open data; an open scientist cannot make use of closed science products until they are published. See: Prisoner's Dilemma. I will have more to say on this over the next years...

By remaining so closed, science is ignoring the role of serendipity in progress. I regularly read articles bemoaning the cost of openness, and I think many of these people are choosing the somewhat certain (but suboptimal) consequences of being closed over the insecurity of the uncertain (but potentially very positive) consequences of being open. As scientific funding becomes ever more stringent and competitiveness grows, the advantages of being conservative will evaporate for all but the academic 1%. (As Shirley Tilghman says, for many "who have succeeded in the system, there appears to be little to be gained from messing with it". That's going to change quickly as the data science asteroid hits science, among other things; I expect to see fairly rapid turnover in that 1%.)

Open science needs more practitioners. A few years back I made a conscious decision to be less of a cheerleader and more of a practitioner. I enjoy doing science, and I enjoy talking about it, and I think the experiments we do in my lab on how to promulgate and accelerate our science through openness are just as important as the policy making, the grant funding, the infrastructure building, and yes, the publicity and cheerleading that is going on. We need it all!

Scientific culture is generational; change will come, but slowly.
Most senior scientists (the ones who sit on editorial boards, review panels, and tenure committees) are already overwhelmed and are unlikely to change -- as Mike Eisen says, "the publishing behavior of most established scientists has proven [...] to be beyond amendment." But there's hope; here's a great quote in the Atlantic article, from Virginia Barbour:

"There's a generation of scientists who are running labs and running tenure committees who were brought up on a very different system," said Barbour. "I have huge hopes on the generation that's coming up, because they're a generation built on openness and availability, and for a lot of things we're talking about it may well take a generational change. It may not be until those people are in positions of authority that things will really change," she said.

Word.

---titus

Thanks to Tracy Teal for reading and commenting on a draft of this post!

## December 23, 2014

### Luis Pedro Coelho

#### luispedro

Just released mahotas 1.2.4. The single improvement over 1.2.3 is that PIL (or Pillow) based IO is automatically used as a fallback. This will hopefully make the package easier to use for new users, who have one fewer dependency to install.

## December 22, 2014

### Titus Brown

#### Three movies, plus two more

Here are three movies that I can recommend, starting with the best.

Jiro Dreams of Sushi - a great movie about the pursuit of perfection, one slice of fish at a time. Truly inspiring. Software engineers and scientists will recognize and identify with Jiro's aspirations.

The Angels' Share - a fun, laid back movie about a scotch whisky caper. Family fun. Plus the lead male character is just plain cute!

Somm - a somewhat intense movie about preparing for the Master Sommelier exam. I was hoping it would be another Jiro Dreams of Sushi, but instead it was much more about competition and study.
On that front, anyone who has done a PhD will be a bit amused by the statements of hard labor - sure, it's more intense, but for a shorter period of time. Of the three, Somm was my least favorite, but it was still very good.

I'm also halfway through two more movies that I hope to finish soon:

Chef - Jon Favreau moves from being a restaurant chef to running a food truck. Great chemistry, good characters, delicious-looking food.

To Be Takei - a surprisingly engrossing movie about George Takei's life and career.

--titus

## December 19, 2014

### Gaël Varoquaux

#### PRNI 2016: call for organization

PRNI (Pattern Recognition for NeuroImaging) is an IEEE conference about applying pattern recognition and machine learning to brain imaging. It is a mid-sized conference (about 150 attendees), and is a satellite of OHBM (the annual "Human Brain Mapping" meeting). The steering committee is calling for bids to organize the conference in June 2016, in Europe, as a satellite of the OHBM meeting in Geneva.

## December 18, 2014

### Stefan Pfenninger

#### Calliope, an open-source energy systems modeling framework

Energy system models are used by analysts in academia, government and industry. They allow researchers to build internally coherent simulations and scenarios of the extraction, conversion, transportation and use of energy, either today or in the future, at scales ranging from cities to entire countries or continents. These models are particularly important because they can help with understanding and planning the transition from a fossil-fuel dominated energy system to one primarily consisting of clean and renewable energy.

For about a year, I've been working on Calliope, a framework to develop energy system models using a modern and open source Python-based toolchain (the name derives from the idea of "multi-scale energy systems modeling" which leads to the clever abbreviation MUSES).
Basically, I started to work on Calliope because I wanted something that ticks all these boxes:

• Capable of using data with arbitrary (to some degree) resolution in space and time in an intelligent way
• Able to define a number of model parameters to iterate over, generate a range of models to do so, and deploy these to a computing cluster, then process the results in a unified manner
• Allows the specification of energy models in a text-based, modular, hierarchical data format (based on YAML and CSV files)
• Written in a modern high-level language (Python)
• Modular and easily extensible
• Uses a permissive Open Source license (Apache 2.0)

The goal of Calliope is not to compete with established large modeling systems such as TIMES. Rather, it is to provide a framework that enables the rapid construction of (small or large) models built to answer specific research questions. Furthermore, by combining its model specification format and the open code base, its aim is also to contribute to more open sharing of data and code in the energy modeling community.

It has its own website at www.callio.pe (alas, with little content so far). It's still work in progress, with significant further functionality to be added in future releases. Currently, I'm the only developer, but contributions from others for whom this might be a useful tool are very welcome. I'm using Calliope for several ongoing research projects and it is undergoing continued development, so more material including academic papers will become available over the coming months. For now, there is a brief tutorial in the documentation that explains some of the things that Calliope does.

Comments, criticism and contributions are very welcome!

## December 17, 2014

### Titus Brown

#### My review of "Determining the quality and complexity of NGS..."

I was a reviewer on Determining the quality and complexity of next-generation sequencing data without a reference genome by Anvar et al., PDF here.
Here is the top bit of my review. One interesting side note - the authors originally named their tool kMer and I complained about it in my review. And they renamed it to kPal! Which is much less confusing.

The authors show that a specific set of low-k k-mer profile analysis tools can identify biases and errors in sequencing samples as well as determine sample distances between metagenomic samples. All of this is done independently of reference genomes/transcriptomes, which is very important. The paper is well written and quite clear. I found it easy to read and easy to understand. The work is also novel, I believe.

Highlights of the paper for me included a solid discussion of k-mer size selection, a thorough exploration of how to compare various k-mer-based statistics, and the excellent quality evaluation bit (Figure 3). I was a bit surprised by the shift from quality assessment to metagenomic analysis, but there is an underlying continuity in the approach that makes this a reasonable transition. There might be a way to update the text to make this transition easier for the non-bioinformatic reader.

It's hard to pick out one particularly important result; the two biggest results are (a) k-mer based and reference free quality evaluation works quite well, and (b) k-mer analysis does a great job of grouping metagenome samples. The theory work on transitioning between k-mer sizes is potentially of great technical interest as well.

### PythonXY

#### Python(x, y) 2.7.9.0 Released!

Hi All,

I'm happy to announce that Python(x, y) 2.7.9.0 is available for immediate download from any of the mirrors. The full change log can be viewed here. Please post your comments and suggestions to the mailing list.

Work on the 64 bit & 3.x versions progresses slowly but surely. They will happen eventually :)

### What's new in general:

• The official Visual C++ for Python 2.7 has replaced MinGW as the bundled compiler. Finally everyone can enjoy native Python binary extensions for win32.
MinGW is still available as an additional plugin.
• Console2 has been replaced with ConsoleZ - an actively developed fork which adds split views and many other improvements.
• Maintaining a 2/3 code base is much easier with the inclusion of python-modernize. Also all backports have been updated and expanded with unittest2.
• All packages updated to their latest versions (at the time) - IPython, GDAL, SciPy, ITK, VTK, sqlalchemy, Astropy, SWIG etc.
• User Local install support has been revived - it is still experimental though.
• Numpy has been held back at 1.8.2 to avoid potential compatibility issues.

### New noteworthy packages:

• cvxpy - A domain-specific language for modeling convex optimization problems in Python.
• grin - Utility which searches directories of source code better than grep or find.

Have fun!

-Gabi Davar

## December 16, 2014

### Titus Brown

#### The post-apocalyptic world of binary containers

The apocalypse is nigh. Soon, binary executables and containers in object stores will join the many Web-based pipelines and the several virtual machine images on the dystopic wasteland of "reproducible science."

Anyway. I had a conversation a few weeks back with a senior colleague about container-based approaches (like Docker) wherein they advocated the shipping around of executable binary blobs with APIs. I pointed out that blobs of executable code were not a good replacement for understanding or a path to progress (see my blog post on that) and they vehemently disagreed, saying that they felt it was an irrelevant point to the progress of science. That made me sad.

One of the things I really like about Docker is that the community emphasizes devops-style clean installs and configurations over deployment and distribution of binary blobs (images, VMs, etc.). Let's make sure to keep that; I think it's important for scientific progress to be able to remix software.
I'll just close with this comment: the issue of whether I can use your algorithm is largely orthogonal to the issue of whether I can understand your algorithm. The former is engineering progress; the latter is scientific progress.

--titus

p.s. While I do like to pick on the Shock/Awe/MG-RAST folk because their pipeline is utterly un-reusable by anyone, anywhere, ever, I am being extremely unfair in linking to their paper as part of this blog post. They're doing something neat that I am afraid will ultimately lead in bad directions, but they're not espousing a binary-only view of the world at all. I'm just being funny. Honest.

p.p.s. Bill Howe and I also agree. So I'm being multiply unfair. I know.

## December 15, 2014

### Enthought

#### Plotting in Excel with PyXLL and Matplotlib

Author: Tony Roberts, creator of PyXLL, a Python library that makes it possible to write add-ins for Microsoft Excel in Python. Download a FREE 30 day trial of PyXLL here. Python has a broad range of tools for data analysis and visualization. While Excel is able to produce various types of plots, sometimes it's either […]

## December 11, 2014

### Filipe Saraiva

#### Moving to Stockholm

[This post is off-topic for some Planet readers, sorry for it. I just expect to get some help from free software communities.]

Exciting times are coming to me before the end of the year. Next week (probably) I am going to Stockholm to live for 5 months. I am getting a visiting researcher position at KTH – Royal Institute of Technology as part of my PhD course. I will work with the PSMIX group – Power System Management with related Information Exchange, led by Prof. Lars Nordström. The subject of my project is the modelling and simulation of smart grid (power system plus a layer of communication and decision making automation) features using multi-agent systems. I expect to work with the simulation platform developed by PSMIX, based on Raspberry Pi and the SPADE framework.
The platform is described in this paper.

Well, I am very anxious about this trip because of two things: communicating in English, and the Swedish winter. The second is my main concern. Gosh, going from the everyday Brazilian summer to the Nordic winter! :'( But ♫ I will survive ♪ (I expect).

If you know someone who could help me with tips on renting an apartment, I would be very glad. Renting accommodation in Stockholm is very hard and expensive.

Thanks and I will send news!

## December 08, 2014

### Philip Herron

#### redbrain812

Cython is an epic project, let's just be up front about it. I have made some use of it professionally in previous projects. There is one thing that keeps coming up and it really bugs me: "Cython to speed up your Python programs". This, in my opinion, is a misrepresentation of what Cython is, but I am, however, being "hyper-critical".

Cython has many uses; I have illustrated them, probably more than most, in different cases aimed at the hacker. Extending pure C/C++ projects directly, without writing any C/C++ and doing it in Python, is where it can seriously shine and be insanely powerful, if you set up your project accordingly. My frustration comes from some small criticism of how I don't really talk about speeding up _your_ Python code. There are several reasons why I seriously avoid talking about this in the book and my recent Cython article for Linux Format. But to be clear, you need to understand what Cython is and what it does.

Cython provides a way of natively working with C/C++ types and code. The side-effect is that you can compile Cython code, which just so happens to look a lot like Python, to be really efficient and strip out the use of the Python runtime, since you are directly using C types. This is fine, but let's be clear: Cython and Python are two very different programming languages; it just so happens that Cython can compile (most) Python code, but Python cannot compile Cython code. This is essentially the problem.
So when you brand Cython as a way to speed up Python programs, you are essentially saying that you need to rewrite your code in Cython. And outside of toy examples of single Cython files, that will not work. Remember, Cython compiles single files into single Python modules. You cannot, therefore, point Cython at a package or a set of modules and have it all magically compile down without serious downsides for project structure and common sense.

Cython just so happens to fit extremely well into scientific computing, which is an entirely different topic from normal software engineering; not only that, scientific code is an entirely different beast of engineering. And in the scientific world, having some Python scripts alongside single files that do heavy computation, using Cython to work with C types and compiling them as shared modules to be imported into Python code, works extremely well. Even on lists or maps, scientific people will use the libpython API to directly access data.

I feel there are two sides to the coin: those on the scientific side, who will want this section, and another set of people who want to know "what can I do with Cython in my code" (games, servers, applications), really anything outside of statistical computation programs. Not only that, but having a solid understanding of the uses I demonstrate will make this side seem trivial, apart from the more fine-grained ways of making code just that little bit faster using Cython and libpython specifics, which would fit better in a paper than in a technical book since it would be subject to much more change.

### Continuum Analytics

#### Bokeh 0.7 Released!

We are excited to announce the release of version 0.7 of Bokeh, an interactive web plotting library for Python!
This release includes many major new features:

- IPython widgets and animations without a Bokeh server
- Touch UI working for tools on mobile devices
- Vastly improved linked data table
- More new (and improving) bokeh.charts (high level charting interface)
- Color mappers on the python side
- Improved toolbar
- Many new tools: lasso, poly, and point selection, crosshair inspector

## December 07, 2014

### Titus Brown

#### Letter of resignation

Dear <chairs>,

I am resigning my Assistant Professor position at Michigan State University effective January 2nd, 2015.

Sincerely, CTB.

Anticipated FAQ:

- Why? I'm moving to UC Davis.
- Do you have an employment contract with UC Davis?? Nope. But I'm starting there in January, anyway. Or that's the plan. And yes, that's how this kind of thing happens. I'm carefully refraining from geopolitical commentary on Twitter at the moment :).
- Do you have tenure there? Nope. Wheels are still turning.
- Aren't you nervous about that? Not really. Even if I don't get tenure, they've promised to hire me. And if that doesn't work out, someone will hire me... right? Guys? Guys? Why did it go silent all of a sudden?
- Why didn't you use 'I am hereby...' at the beginning? I don't know, seemed pretentious.

--titus

## December 04, 2014

### Fabian Pedregosa

#### Data-driven hemodynamic response function estimation

My latest research paper [1] deals with the estimation of the hemodynamic response function (HRF) from fMRI data. This is an important topic: knowledge of the hemodynamic response function is what makes it possible to extract the brain activation maps that are used in most of the impressive applications of machine learning to fMRI, such as (but not limited to) the reconstruction of visual images from brain activity [2] [3] or the decoding of numbers [4]. Besides the more traditional paper that describes the method, I've put online the code I used for the experiments.
The code at this stage is far from perfect, but it should help in reproducing the results or improving the method. I've also put online an IPython notebook with the analysis of a small piece of data. I'm obviously glad to receive feedback/bug reports/patches for this code.

1. Pedregosa, Fabian, et al. "Data-driven HRF estimation for encoding and decoding models." NeuroImage (2014). Also available here as an arXiv preprint.
2. Kay, Kendrick N., et al. "Identifying natural images from human brain activity." Nature 452.7185 (2008).
3. Eger, Evelyn, et al. "Deciphering cortical number coding from human brain activity patterns." Current Biology 19.19 (2009).

## December 03, 2014

### Spyder

Hi all,

On behalf of Spyder's development team, I'm pleased to announce that Spyder 2.3.2 has been released and is available for Windows XP/Vista/7/8, GNU/Linux and MacOS X: https://bitbucket.org/spyder-ide/spyderlib/downloads

This release represents more than 2 months of development since 2.3.1 and introduces major enhancements and new features:

- Editor
  - Improve cells visualization
  - Add support for drag selection and improve look of line number area
  - Open on it any text file present in the Variable Explorer
  - View and edit IPython notebooks as Json files
  - Syntax highlighting for Json and Yaml files
- Variable Explorer
  - Import csv files as Pandas DataFrames
  - Improve browsing speed for NumPy arrays and DataFrames with more than 1e5 elements
- IPython Console
  - Add a stop button to easily stop computations

We fixed almost 40 bugs, merged 13 pull requests from 8 authors and added about 150 commits between these two releases. This is a very important bugfix release which solves a lot of unicode problems in our consoles, the variable explorer and the main interface, so everyone is encouraged to update.

For a full list of fixes see our changelog.

## December 01, 2014

### Titus Brown

#### What version of Python should we use for computational science courses?
Brian O'Shea (a physics prof at Michigan State) asked me the following, and I thought I'd post it on my blog to get a broader set of responses. I know the answer is "Python 3", but I would appreciate specific thoughts from people with experience either with the specific packages below, or in teaching Python 3 more generally. (For the record, I continue to teach and develop in Python 2, but each year comes closer to switching to Python 3...)

We're going to start teaching courses for the new CMSE [computational science] department next year, and we're using Python as the language. I need to decide whether to teach it using Python 2.x or Python 3.x. I'm curious about which one you have chosen to use when teaching your own courses, and why you made that choice. (Also, it'd be interesting to know if/why you regret that choice.)

The intro courses are aimed at students who plan to use computational and data science for research, other classes, and ultimately in their academic/industrial careers. We anticipate that it'll mostly be science/math and engineering students in the beginning, but there's significant interest from social science and humanities folks on campus. Given the audience and goals, my choice of programming language is fairly utilitarian: we want to introduce students to standard numerical packages (numpy, scipy, matplotlib, h5py, pandas, ipython) and also some of the discipline-specific packages, and get them into the mindset of "I have a problem; somebody has probably already written a tool to address this; let's not reinvent the wheel." So, I want to choose the version of Python that's likely to work well with pretty much any relatively widely-used Python package. My impression, based on a variety of blog posts and articles that I've found, is that the mainstream libraries work just fine with Python 3 (e.g., matplotlib), but a lot of other stuff simply doesn't work at this point.
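One practical hedge, not raised in Brian's question but standard at the time, is to write course materials in a version-agnostic style so that they survive either choice. A minimal sketch:

```python
# A sketch of version-agnostic teaching code: these __future__ imports
# make Python 2.7 behave like Python 3 for printing and division, and
# they are harmless no-ops under Python 3 itself.
from __future__ import print_function, division

import sys

def interpreter_name():
    """Return a short label for the running Python version."""
    return "Python %d.%d" % sys.version_info[:2]

print(interpreter_name())   # e.g. "Python 3.4" under a Python 3 kernel
print(3 / 2)                # true division: 1.5 under both interpreters
```

Materials written this way run unchanged on either interpreter, which takes some of the pressure off the 2-vs-3 decision.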
This course is going to be the gateway course for our new minor/major, and a lot of later courses will be based on it (the graduate version of the course will be the gateway course for the certificate, and presumably taken by lots of grad students here at MSU). I'd like to make the most sensible choice, given that we'll be creating course materials that will be used by other faculty and may stick around for a while... Anyway, any thoughts you have would be appreciated!

Brian sent me these links as examples of the discussion:

- http://sjbyrnes.com/?page_id=67
- http://cyrille.rossant.net/whats-wrong-with-scientific-python/
- https://jakevdp.github.io/blog/2013/01/03/will-scientists-ever-move-to-python-3/
- http://nbviewer.ipython.org/github/jrjohansson/scientific-python-lectures/blob/master/Lecture-0-Scientific-Computing-with-Python.ipynb

My strongest personal advice at this point is that Brian should invest in the Anaconda distribution as a teaching foundation. Thoughts? Comments?

--titus

## November 29, 2014

### Titus Brown

#### Annotating papers with pipeline steps - suggestions?

A few months ago, I wrote a short description of how we make our papers replicable in the lab. One problem with this process is that for complex pipelines, it's not always obvious how to connect a number in the paper to the steps in the pipeline that produced it -- there are lots of files and outputs involved in all of this. You can sometimes figure it out by looking at the numbers and correlating with what's in the paper, and/or trying to understand the process from the bottom up, but that's quite difficult. Even my students and I (who can meet in person) sometimes have trouble tracking things down quickly.

Now, on my latest paper, I've started annotating results in the paper source code with Makefile targets. For example, if you are interested in how we got these results, you can use the comment in the LaTeX file to go straight to the associated target in the Makefile.
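For the related problem of getting computed numbers into the paper without hand-copying, one stdlib sketch is to have the pipeline dump its results and then template the LaTeX source. The placeholder names and the results dict here are hypothetical, not what the lab actually uses:

```python
from string import Template

# Hypothetical pipeline output: a Makefile target would dump computed
# numbers (e.g. via a JSON file) into a dict like this one.
results = {"n_reads": 1200345, "pct_kept": 87.4}

# LaTeX source with $-placeholders where the numbers should appear;
# the backslash after a placeholder ends the identifier, so \% is safe.
latex_template = Template(
    r"We retained $pct_kept\% of the $n_reads input reads."
)

filled = latex_template.substitute(results)
print(filled)
# -> We retained 87.4\% of the 1200345 input reads.
```

Run as the last step of the pipeline, this regenerates the LaTeX fragment whenever the numbers change, so the paper can never drift out of sync with the scripts.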
That's pretty convenient, but then I got to thinking -- how do we communicate this to the reader more directly? Not everyone wants to go to the github repo, read the LaTeX source, and then go find the relevant target in the Makefile (and "not everyone" is a bit of an understatement :). But there's no reason we couldn't link directly to the Makefile target in the PDF, is there? And, separately, right now it is a reasonably large burden to copy the results from the output of the scripts into the LaTeX file. Surely there's a way to get the computer to do this, right?

So, everyone -- two questions!

First, assuming that I'm going the nouveau traditional publishing route of producing a PDF for people to download from a journal site, is there a good, or suggested, or even better yet journal-supported way to link directly to the actual computational methods? (Yes, yes, I know it's suboptimal to publish into PDFs. Let me know when you've got that figured out, and I'll be happy to try it out. 'til then kthxbye.) I'd like to avoid intermediaries like individual DOIs for each method, if at all possible; direct links to github FTW. Ideally, it would make it through the publishing process. I could, of course, make it available as a preprint, and point interested people at that.

Second, what's a good way to convey results from a script into a LaTeX file (or more generally any text document)? With LaTeX I could make use of the 'include' directive; or I could do something with templating; or...?

Well, and third, is there any (good, scriptable) way to do both (1) and (2) at the same time?

Anyway, hit me with your wisdom; would love to hear that there's already a good solution!

--titus

## November 21, 2014

### Filipe Saraiva

#### Presenting a Season of KDE 2014 student – Minh Ngo

Season of KDE is an outreach program hosted by the KDE community. This year I am working as a mentor on a long-requested project related to Cantor: the development of a Python 3 backend.
You can read more about Cantor on my blog (texts in English and Portuguese). So, let's say welcome and good luck to Minh Ngo, the student behind this project!

Hi, my name is Minh, Minh Ngo. I am a BSc graduate. I'm Vietnamese but, unlike other Vietnamese students, I have spent most of my life in Ukraine. Currently, I'm preparing for the Master's degree that I will start in the next semester. Open source is my free-time hobby, so I would like to make something that is useful for the community. Previously, I participated in the GSoC 2013 program and in several open source projects. Some of my personal projects are available on my GitHub page https://github.com/Ignotus; they are not as popular as other cool projects, but several are used by other people, and this fact makes me very happy. Cantor is one of the opportunities to spend time creating a useful thing and to win an exclusive KDE T-shirt :).

I decided to start my contribution with the Python 3 backend because a few months ago I studied several courses related to machine learning, and so I was looking for a stable desktop backend for IPython. I don't entirely like the notebook version of IPython, and its qtconsole version doesn't satisfy me in terms of functionality, so I decided to find an existing frontend for IPython that I could tune for myself. And that is how the story with Cantor began.

Happy hacking!

## November 19, 2014

### Matthew Rocklin

#### Blaze Datasets

tl;dr Blaze aids exploration by supporting full databases and collections of datasets.

This post was composed using Blaze version 0.6.6. You can follow along with the following conda command.

```
conda install -c blaze blaze=0.6.6
```

When we encounter new data we need to explore broadly to find what exists before we can meaningfully perform analyses. The tools for this task are often overlooked.
This post outlines how Blaze explores collections and hierarchies of datasets, cases where one entity like a database or file or directory might hold many tables or arrays. We use examples from HDF5 files and SQL databases. Blaze understands how the underlying libraries work so that you don't have to.

## Motivating problem - Intuitive HDF5 File Navigation

For example, if we want to understand the contents of this set of HDF5 files encoding meteorological data, then we need to navigate a hierarchy of arrays. This is common among HDF5 files. Typically we navigate these files in Python with h5py or pytables.

```python
>>> import h5py
>>> f = h5py.File('OMI-Aura_L2-OMAERO_2014m1105t2304-o54838_v003-2014m1106t215558.he5')
```

HDF5 files organize datasets with an internal file system. The h5py library accesses this internal file system through successive calls to `.keys()` and item access.

```python
>>> f.keys()
['HDFEOS', 'HDFEOS INFORMATION']

>>> f['/HDFEOS'].keys()
['ADDITIONAL', 'SWATHS']

>>> f['/HDFEOS/SWATHS'].keys()
['ColumnAmountAerosol']

>>> f['/HDFEOS/SWATHS/ColumnAmountAerosol'].keys()
['Data Fields', 'Geolocation Fields']

>>> f['/HDFEOS/SWATHS/ColumnAmountAerosol/Data Fields'].keys()
['AerosolIndexUV',
 'AerosolIndexVIS',
 'AerosolModelMW',
 'AerosolModelsPassedThreshold',
 'AerosolOpticalThicknessMW',
 ...

>>> f['/HDFEOS/SWATHS/ColumnAmountAerosol/Data Fields/TerrainPressure']
<HDF5 dataset "TerrainPressure": shape (1643, 60), type "<i2">

>>> f['/HDFEOS/SWATHS/ColumnAmountAerosol/Data Fields/TerrainPressure'][:]
array([[1013, 1013, 1013, ..., 1013, 1013, 1013],
       [1013, 1013, 1013, ..., 1013, 1013, 1013],
       [1013, 1013, 1013, ..., 1013, 1013, 1013],
       ...,
       [1010,  992,  986, ..., 1013, 1013, 1013],
       [ 999,  983,  988, ..., 1013, 1013, 1013],
       [1006,  983,  993, ..., 1013, 1013, 1013]], dtype=int16)
```

This interaction between programmer and interpreter feels like a long and awkward conversation. Blaze improves the exploration user experience by treating the entire HDF5 file as a single Blaze variable.
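The repeated `.keys()` conversation above can also be collapsed into a single pass by walking the hierarchy recursively. A rough sketch of the idea, using a plain nested dict as a stand-in for an h5py File (real h5py groups support a similar `.items()` protocol):

```python
def walk(group, prefix=""):
    """Yield 'path: description' for every leaf dataset in a nested
    mapping, mimicking a one-shot overview of an HDF5 tree."""
    for name, value in sorted(group.items()):
        path = prefix + "/" + name
        if isinstance(value, dict):      # treat dicts as HDF5 "groups"
            for leaf in walk(value, path):
                yield leaf
        else:                            # anything else is a "dataset"
            yield "%s: %s" % (path, value)

# A toy stand-in for the structure of the OMI file above.
f = {"HDFEOS": {"SWATHS": {"ColumnAmountAerosol": {
        "Data Fields": {"TerrainPressure": "1643 x 60 int16"},
        "Geolocation Fields": {"Latitude": "1643 x 60 float32"}}}}}

for line in walk(f):
    print(line)
# -> /HDFEOS/SWATHS/ColumnAmountAerosol/Data Fields/TerrainPressure: 1643 x 60 int16
# -> /HDFEOS/SWATHS/ColumnAmountAerosol/Geolocation Fields/Latitude: 1643 x 60 float32
```

This is essentially the overview that Blaze's datashape rendering gives you for free, as shown in the rest of the post.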
This allows users to both explore and compute on larger collections of data in the same workflow. This isn't specific to HDF5 but works anywhere many datasets are bundled together.

## Exploring a SQL Database

For example, a SQL database can be viewed as a collection of tables. Blaze makes it easy to access a single table in a database using a string URI specifying both the database and the table name.

```python
>>> iris = Data('sqlite:////my.db::iris')  # protocol://database::table-name
>>> iris
    sepal_length  sepal_width  petal_length  petal_width      species
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
...
```

This only works if we know what table we want ahead of time; the approach above assumes that the user is already familiar with their data. To resolve this problem we omit the table name and access the database as a variable instead. We use the same interface to access the entire database as we would a specific table.

```python
>>> db = Data('sqlite:////my.db')  # protocol://database
>>> db
Data: Engine(sqlite:////home/mrocklin/workspace/blaze/blaze/examples/data/iris.db)
DataShape: {
  iris: var * {
    sepal_length: ?float64,
    sepal_width: ?float64,
    petal_length: ?float64,
    petal_width: ?float64,
    species: ?string
  ...
```

The db expression points to a SQLAlchemy engine. We print the engine alongside a truncated datashape showing the metadata of the tables in that database. Note that the datashape maps table names to the datashapes of those tables, in this case a variable-length collection of records with fields for measurements of flowers.

Blaze isn't doing any of the grunt work here; SQLAlchemy is. SQLAlchemy is a mature Python library that interacts with a wide variety of SQL databases. It provides both database reflection (as we see above) and general querying (as we see below). Blaze provides a convenient front-end. We seamlessly transition from exploration to computation.
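The reflection that SQLAlchemy performs here can be mimicked with the stdlib sqlite3 module. A minimal sketch with a throwaway in-memory table (not Blaze's actual code path, just the raw metadata queries that reflection boils down to on SQLite):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE iris (sepal_length REAL, species TEXT)")

# List every table, then every column name and declared type within it --
# the same information a reflected datashape summarizes.
tables = [row[0] for row in
          conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]
for table in tables:
    cols = conn.execute("PRAGMA table_info(%s)" % table).fetchall()
    print(table, [(c[1], c[2]) for c in cols])
# -> iris [('sepal_length', 'REAL'), ('species', 'TEXT')]
```

SQLAlchemy does exactly this kind of querying, per backend, and normalizes the results; Blaze then renders that metadata as a datashape.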
We query for the shortest and longest sepal per species.

```python
>>> by(db.iris.species, shortest=db.iris.sepal_length.min(),
...                     longest=db.iris.sepal_length.max())
           species  longest  shortest
0      Iris-setosa      5.8       4.3
1  Iris-versicolor      7.0       4.9
2   Iris-virginica      7.9       4.9
```

Blaze doesn't pull data into local memory; instead it generates SQLAlchemy, which generates SQL, which executes on the foreign database. The (much smaller) result is pulled back into memory and rendered nicely using Pandas.

## A Larger Database

Improved metadata discovery on SQL databases overlaps with the excellent work done by yhat on db.py. We steal their example, the Lahman Baseball statistics database, as an example of a richer database with a greater variety of tables.

```python
>>> lahman = Data('sqlite:///baseball-archive-2012.sqlite')
>>> lahman.dshape  # ask for the entire datashape
dshape("""{
  battingpost: var * { yearID: ?int32, round: ?string, playerID: ?string, teamID: ?string, lgID: ?string, G: ?int32, AB: ?int32, R: ?int32, H: ?int32, 2B: ?int32, 3B: ?int32, HR: ?int32, RBI: ?int32, SB: ?int32, CS: ?int32, BB: ?int32, SO: ?int32, IBB: ?int32, HBP: ?int32, SH: ?int32, SF: ?int32, GIDP: ?int32 },
  awardsmanagers: var * { managerID: ?string, awardID: ?string, yearID: ?int32, lgID: ?string, tie: ?string, notes: ?string },
  hofold: var * { hofID: ?string, yearid: ?int32, votedBy: ?string, ballots: ?int32, votes: ?int32, inducted: ?string, category: ?string },
  salaries: var * { yearID: ?int32, teamID: ?string, lgID: ?string, playerID: ?string, salary: ?float64 },
  pitchingpost: var * { playerID: ?string, yearID: ?int32, round: ?string, teamID: ?string, lgID: ?string, W: ?int32, L: ?int32, G: ?int32, GS: ?int32, CG: ?int32, SHO: ?int32, SV: ?int32, IPouts: ?int32, H: ?int32, ER: ?int32, HR: ?int32, BB: ?int32, SO: ?int32, BAOpp: ?float64, ERA: ?float64, IBB: ?int32, WP: ?int32, HBP: ?int32, BK: ?int32, BFP: ?int32, GF: ?int32, R: ?int32, SH: ?int32, SF: ?int32, GIDP: ?int32 },
  managers: var * { managerID: ?string, yearID: ?int32, teamID: ?string, lgID: ?string, inseason: ?int32, G: ?int32, W: ?int32, L: ?int32, rank: ?int32, plyrMgr: ?string },
  teams: var * { yearID: ?int32, lgID: ?string, teamID: ?string, franchID: ?string, divID: ?string, Rank: ?int32, G: ?int32, Ghome: ?int32, W: ?int32, L: ?int32, DivWin: ?string, WCWin: ?string, LgWin: ?string, WSWin: ?string, R: ?int32, AB: ?int32, H: ?int32, 2B: ?int32, 3B: ?int32, HR: ?int32, BB: ?int32, SO: ?int32, SB: ?int32, CS: ?int32, HBP: ?int32, SF: ?int32, RA: ?int32, ER: ?int32, ERA: ?float64, CG: ?int32, SHO: ?int32, SV: ?int32, IPouts: ?int32, HA: ?int32, HRA: ?int32, BBA: ?int32, SOA: ?int32, E: ?int32, DP: ?int32, FP: ?float64, name: ?string, park: ?string, attendance: ?int32, BPF: ?int32, PPF: ?int32, teamIDBR: ?string, teamIDlahman45: ?string, teamIDretro: ?string },
  teamshalf: var * { yearID: ?int32, lgID: ?string, teamID: ?string, Half: ?string, divID: ?string, DivWin: ?string, Rank: ?int32, G: ?int32, W: ?int32, L: ?int32 },
  fieldingpost: var * { playerID: ?string, yearID: ?int32, teamID: ?string, lgID: ?string, round: ?string, POS: ?string, G: ?int32, GS: ?int32, InnOuts: ?int32, PO: ?int32, A: ?int32, E: ?int32, DP: ?int32, TP: ?int32, PB: ?int32, SB: ?int32, CS: ?int32 },
  master: var * { lahmanID: ?int32, playerID: ?string, managerID: ?string, hofID: ?string, birthYear: ?int32, birthMonth: ?int32, birthDay: ?int32, birthCountry: ?string, birthState: ?string, birthCity: ?string, deathYear: ?int32, deathMonth: ?int32, deathDay: ?int32, deathCountry: ?string, deathState: ?string, deathCity: ?string, nameFirst: ?string, nameLast: ?string, nameNote: ?string, nameGiven: ?string, nameNick: ?string, weight: ?int32, height: ?float64, bats: ?string, throws: ?string, debut: ?string, finalGame: ?string, college: ?string, lahman40ID: ?string, lahman45ID: ?string, retroID: ?string, holtzID: ?string, bbrefID: ?string },
  fieldingof: var * { playerID: ?string, yearID: ?int32, stint: ?int32, Glf: ?int32, Gcf: ?int32, Grf: ?int32 },
  pitching: var * { playerID: ?string, yearID: ?int32, stint: ?int32, teamID: ?string, lgID: ?string, W: ?int32, L: ?int32, G: ?int32, GS: ?int32, CG: ?int32, SHO: ?int32, SV: ?int32, IPouts: ?int32, H: ?int32, ER: ?int32, HR: ?int32, BB: ?int32, SO: ?int32, BAOpp: ?float64, ERA: ?float64, IBB: ?int32, WP: ?int32, HBP: ?int32, BK: ?int32, BFP: ?int32, GF: ?int32, R: ?int32, SH: ?int32, SF: ?int32, GIDP: ?int32 },
  managershalf: var * { managerID: ?string, yearID: ?int32, teamID: ?string, lgID: ?string, inseason: ?int32, half: ?int32, G: ?int32, W: ?int32, L: ?int32, rank: ?int32 },
  appearances: var * { yearID: ?int32, teamID: ?string, lgID: ?string, playerID: ?string, G_all: ?int32, G_batting: ?int32, G_defense: ?int32, G_p: ?int32, G_c: ?int32, G_1b: ?int32, G_2b: ?int32, G_3b: ?int32, G_ss: ?int32, G_lf: ?int32, G_cf: ?int32, G_rf: ?int32, G_of: ?int32, G_dh: ?int32, G_ph: ?int32, G_pr: ?int32 },
  halloffame: var * { hofID: ?string, yearid: ?int32, votedBy: ?string, ballots: ?int32, needed: ?int32, votes: ?int32, inducted: ?string, category: ?string },
  awardsplayers: var * { playerID: ?string, awardID: ?string, yearID: ?int32, lgID: ?string, tie: ?string, notes: ?string },
  schoolsplayers: var * { playerID: ?string, schoolID: ?string, yearMin: ?int32, yearMax: ?int32 },
  schools: var * { schoolID: ?string, schoolName: ?string, schoolCity: ?string, schoolState: ?string, schoolNick: ?string },
  teamsfranchises: var * { franchID: ?string, franchName: ?string, active: ?string, NAassoc: ?string },
  allstarfull: var * { playerID: ?string, yearID: ?int32, gameNum: ?int32, gameID: ?string, teamID: ?string, lgID: ?string, GP: ?int32, startingPos: ?int32 },
  tmp_batting: var * { playerID: ?string, yearID: ?int32, stint: ?int32, teamID: ?string, lgID: ?string, G: ?int32, G_batting: ?int32, AB: ?int32, R: ?int32, H: ?int32, 2B: ?int32, 3B: ?int32, HR: ?int32, RBI: ?int32, SB: ?int32, CS: ?int32, BB: ?int32, SO: ?int32, IBB: ?int32, HBP: ?int32, SH: ?int32, SF: ?int32, GIDP: ?int32, G_old: ?int32 },
  seriespost: var * { yearID: ?int32, round: ?string, teamIDwinner: ?string, lgIDwinner: ?string, teamIDloser: ?string, lgIDloser: ?string, wins: ?int32, losses: ?int32, ties: ?int32 },
  awardssharemanagers: var * { awardID: ?string, yearID: ?int32, lgID: ?string, managerID: ?string, pointsWon: ?int32, pointsMax: ?int32, votesFirst: ?int32 },
  awardsshareplayers: var * { awardID: ?string, yearID: ?int32, lgID: ?string, playerID: ?string, pointsWon: ?float64, pointsMax: ?int32, votesFirst: ?float64 },
  fielding: var * { playerID: ?string, yearID: ?int32, stint: ?int32, teamID: ?string, lgID: ?string, POS: ?string, G: ?int32, GS: ?int32, InnOuts: ?int32, PO: ?int32, A: ?int32, E: ?int32, DP: ?int32, PB: ?int32, WP: ?int32, SB: ?int32, CS: ?int32, ZR: ?float64 }
}""")
```

Seeing at once all the tables in the database, all the columns in those tables, and all the types of those columns provides a clear and comprehensive overview of our data. We represent this information as a datashape, a type system that covers everything from numpy arrays to SQL databases and Spark collections. We use standard Blaze expressions to navigate more deeply. Things like auto-complete work fine.
```python
>>> lahman.<tab>
lahman.allstarfull          lahman.fieldingpost    lahman.pitchingpost
lahman.appearances          lahman.fields          lahman.relabel
lahman.awardsmanagers       lahman.halloffame      lahman.salaries
lahman.awardsplayers        lahman.hofold          lahman.schema
lahman.awardssharemanagers  lahman.isidentical     lahman.schools
lahman.awardsshareplayers   lahman.like            lahman.schoolsplayers
lahman.battingpost          lahman.managers        lahman.seriespost
lahman.data                 lahman.managershalf    lahman.teams
lahman.dshape               lahman.map             lahman.teamsfranchises
lahman.fielding             lahman.master          lahman.teamshalf
lahman.fieldingof           lahman.pitching        lahman.tmp_batting
```

And we finish with a fun multi-table computation, joining two tables on year, team, and league and then computing the average salary by team name and year.

```python
>>> joined = join(lahman.teams, lahman.salaries, ['yearID', 'teamID', 'lgID'])
>>> by(joined[['name', 'yearID']], average_salary=joined.salary.mean())
                    name  yearID  average_salary
0         Anaheim Angels    1997  1004370.064516
1         Anaheim Angels    1998  1214147.058824
2         Anaheim Angels    1999  1384704.150000
3         Anaheim Angels    2000  1715472.233333
4         Anaheim Angels    2001  1584505.566667
5         Anaheim Angels    2002  2204345.250000
6         Anaheim Angels    2003  2927098.777778
7         Anaheim Angels    2004  3723506.185185
8   Arizona Diamondbacks    1998   898527.777778
9   Arizona Diamondbacks    1999  2020705.852941
...
```

Looks good; we compute and store to a CSV file with into.

```python
>>> into('salaries.csv',
...      by(joined[['name', 'yearID']], average_salary=joined.salary.mean()))
```

(Final result here: salaries.csv)

## Beyond SQL

SQL isn't unique; many systems hold collections or hierarchies of datasets. Blaze supports navigation over Mongo databases, Blaze servers, HDF5 files, or even just dictionaries of pandas DataFrames or CSV files.

```python
>>> d = {'accounts 1': CSV('accounts_1.csv'),
...      'accounts 2': CSV('accounts_2.csv')}
>>> accounts = Data(d)
>>> accounts.dshape
dshape("""{
  accounts 1: var * {id: ?int64, name: string, amount: ?int64},
  accounts 2: var * {id: ?int64, name: string, amount: ?int64}
}""")
```

None of this behavior was special-cased in Blaze. The same mechanisms that select a table from a SQL database select a column from a Pandas DataFrame.

## Finish with HDF5 example

To conclude, we revisit our motivating problem, HDF5 file navigation.

### Raw H5Py

Recall that we previously had a long back-and-forth conversation with the Python interpreter.

```python
>>> import h5py
>>> f = h5py.File('OMI-Aura_L2-OMAERO_2014m1105t2304-o54838_v003-2014m1106t215558.he5')
>>> f.keys()
['HDFEOS', 'HDFEOS INFORMATION']
>>> f['/HDFEOS'].keys()
['ADDITIONAL', 'SWATHS']
...
```

### H5Py with Blaze expressions

With Blaze we get a quick high-level overview and an API that is shared with countless other systems.

```python
>>> from blaze import Data
>>> d = Data(f)  # Wrap h5py file with Blaze interactive expression
>>> d
Data: <HDF5 file "OMI-Aura_L2-OMAERO_2014m1105t2304-o54838_v003-2014m1106t215558.he5" (mode r+)>
DataShape: {
  HDFEOS: {
    ADDITIONAL: {FILE_ATTRIBUTES: {}},
    SWATHS: {
      ColumnAmountAerosol: {
        Data Fields: {
          AerosolIndexUV: 1643 * 60 * int16,
          ...
```

By default we see the data and a truncated datashape. We ask for the datashape explicitly to see the full picture.
```python
>>> d.dshape
dshape("""{
  HDFEOS: {
    ADDITIONAL: {FILE_ATTRIBUTES: {}},
    SWATHS: {
      ColumnAmountAerosol: {
        Data Fields: {
          AerosolIndexUV: 1643 * 60 * int16,
          AerosolIndexVIS: 1643 * 60 * int16,
          AerosolModelMW: 1643 * 60 * uint16,
          AerosolModelsPassedThreshold: 1643 * 60 * 10 * uint16,
          AerosolOpticalThicknessMW: 1643 * 60 * 14 * int16,
          AerosolOpticalThicknessMWPrecision: 1643 * 60 * int16,
          AerosolOpticalThicknessNUV: 1643 * 60 * 2 * int16,
          AerosolOpticalThicknessPassedThreshold: 1643 * 60 * 10 * 9 * int16,
          AerosolOpticalThicknessPassedThresholdMean: 1643 * 60 * 9 * int16,
          AerosolOpticalThicknessPassedThresholdStd: 1643 * 60 * 9 * int16,
          CloudFlags: 1643 * 60 * uint8,
          CloudPressure: 1643 * 60 * int16,
          EffectiveCloudFraction: 1643 * 60 * int8,
          InstrumentConfigurationId: 1643 * uint8,
          MeasurementQualityFlags: 1643 * uint8,
          NumberOfModelsPassedThreshold: 1643 * 60 * uint8,
          ProcessingQualityFlagsMW: 1643 * 60 * uint16,
          ProcessingQualityFlagsNUV: 1643 * 60 * uint16,
          RootMeanSquareErrorOfFitPassedThreshold: 1643 * 60 * 10 * int16,
          SingleScatteringAlbedoMW: 1643 * 60 * 14 * int16,
          SingleScatteringAlbedoMWPrecision: 1643 * 60 * int16,
          SingleScatteringAlbedoNUV: 1643 * 60 * 2 * int16,
          SingleScatteringAlbedoPassedThreshold: 1643 * 60 * 10 * 9 * int16,
          SingleScatteringAlbedoPassedThresholdMean: 1643 * 60 * 9 * int16,
          SingleScatteringAlbedoPassedThresholdStd: 1643 * 60 * 9 * int16,
          SmallPixelRadiancePointerUV: 1643 * 2 * int16,
          SmallPixelRadiancePointerVIS: 1643 * 2 * int16,
          SmallPixelRadianceUV: 6783 * 60 * float32,
          SmallPixelRadianceVIS: 6786 * 60 * float32,
          SmallPixelWavelengthUV: 6783 * 60 * uint16,
          SmallPixelWavelengthVIS: 6786 * 60 * uint16,
          TerrainPressure: 1643 * 60 * int16,
          TerrainReflectivity: 1643 * 60 * 9 * int16,
          XTrackQualityFlags: 1643 * 60 * uint8
        },
        Geolocation Fields: {
          GroundPixelQualityFlags: 1643 * 60 * uint16,
          Latitude: 1643 * 60 * float32,
          Longitude: 1643 * 60 * float32,
          OrbitPhase: 1643 * float32,
          SolarAzimuthAngle: 1643 * 60 * float32,
          SolarZenithAngle: 1643 * 60 * float32,
          SpacecraftAltitude: 1643 * float32,
          SpacecraftLatitude: 1643 * float32,
          SpacecraftLongitude: 1643 * float32,
          TerrainHeight: 1643 * 60 * int16,
          Time: 1643 * float64,
          ViewingAzimuthAngle: 1643 * 60 * float32,
          ViewingZenithAngle: 1643 * 60 * float32
        }
      }
    }
  },
  HDFEOS INFORMATION: {
    ArchiveMetadata.0: string[65535, 'A'],
    CoreMetadata.0: string[65535, 'A'],
    StructMetadata.0: string[32000, 'A']
  }
}""")
```

When we see metadata for everything in this dataset at once, the full structure becomes apparent. We have two collections of arrays, all shaped (1643, 60); the collection in Data Fields holds what appear to be weather measurements, while the collection in Geolocation Fields holds coordinate information. Moreover, we can quickly navigate this structure to compute relevant information.

```python
>>> d.HDFEOS.SWATHS.ColumnAmountAerosol.Geolocation_Fields.Latitude
array([[-67., -67., -67., ..., -61., -60., -60.],
       [-67., -67., -67., ..., -61., -60., -60.],
       [-67., -67., -68., ..., -61., -61., -60.],
       ...,
       [ 69.,  70.,  71., ...,  85.,  84.,  84.],
       [ 69.,  70.,  71., ...,  85.,  85.,  84.],
       [ 69.,  70.,  71., ...,  85.,  85.,  84.]], dtype=float32)

>>> d.HDFEOS.SWATHS.ColumnAmountAerosol.Geolocation_Fields.Longitude
array([[  46.,   43.,   40., ...,   -3.,   -5.,   -7.],
       [  46.,   43.,   40., ...,   -3.,   -5.,   -7.],
       [  46.,   43.,   40., ...,   -4.,   -5.,   -7.],
       ...,
       [ 123.,  124.,  124., ..., -141., -131., -121.],
       [ 123.,  124.,  124., ..., -141., -130., -120.],
       [ 123.,  123.,  124., ..., -140., -130., -119.]], dtype=float32)

>>> d.HDFEOS.SWATHS.ColumnAmountAerosol.Data_Fields.TerrainPressure
array([[1013, 1013, 1013, ..., 1013, 1013, 1013],
       [1013, 1013, 1013, ..., 1013, 1013, 1013],
       [1013, 1013, 1013, ..., 1013, 1013, 1013],
       ...,
       [1010,  992,  986, ..., 1013, 1013, 1013],
       [ 999,  983,  988, ..., 1013, 1013, 1013],
       [1006,  983,  993, ..., 1013, 1013, 1013]], dtype=int16)

>>> d.HDFEOS.SWATHS.ColumnAmountAerosol.Data_Fields.TerrainPressure.min()
620
```

## Final Thoughts

Often the greatest challenge is finding what you already have.
Discovery and exploration are just as important as computation. By extending Blaze's expression system to hierarchies of datasets we create a smooth user experience from first introductions to data all the way to analytic queries and saving results.

This work was easy. The pluggable architecture of Blaze made it surprisingly simple to extend the Blaze model from tables and arrays to collections of tables and arrays. We wrote about 40 significant lines of code for each supported backend.

## November 18, 2014

### Titus Brown

#### RFC: The khmer project: what are we, and what are our goals?

As we think about the next few years of khmer development, it is helpful to explore what khmer is, and what our goals for khmer development are. This can provide guiding principles for development, refactoring, extension, funding requests, and collaborations. Comments solicited!

Links:

## Definition

khmer is an open source project that serves as:

- a stable research platform for novel CS/bio research on data structures and algorithms, mostly k-mer based;
- a set of command line tools for various kinds of data transformations;
- a test bed for software engineering practice in science;
- a Python library for working with k-mers and graph structures;
- an exercise in community building in scientific software engineering;
- an effort to participate in and sustainably grow the bioinformatics ecosystem.

## Goals

Our long term goals for khmer, in some rough order of priority, are:

- Keep khmer versatile and agile enough to easily enable the CS and bio we want to do. Practical implications: limit complexity of internals as much as possible.
- Continue community building. Practical implications: run khmer as a real open source project, with everything done in the open; work nicely with other projects.
- Build, sustain, and maintain a set of protocols and recipes around khmer. Practical implications: take workflow design into account.
• Improve the efficiency (time/memory) of khmer implementations. Practical implications: optimize, but not at the expense of clean code. Some specifics: streaming; variable-sized counters.
• Lower barriers to an increasing user base. Practical implications: find actual pain points; address them if it's easy or makes good sense. Some specifics: hash functions for k > 32, a stranded hash function, integrating efficient k-mer cardinality counting, implementing dynamically sized data structures.
• Keep khmer technologically up to date. Practical implications: transition to Python 3.

--titus

p.s. Thanks to Qingpeng Zhang, Travis Collier, and Matt MacManes for comments in the draft stage!

### Thomas Wiecki

#### A modern guide to getting started with Data Science and Python

Python has an extremely rich and healthy ecosystem of data science tools. Unfortunately, to outsiders this ecosystem can look like a jungle (cue snake joke). In this blog post I will provide a step-by-step guide to venturing into this PyData jungle.

What's wrong with the many lists of PyData packages out there already, you might ask? I think that providing too many options can easily overwhelm someone who is just getting started. So instead, I will keep a very narrow scope and focus on the 10% of tools that allow you to do 90% of the work. After you have mastered these essentials you can browse the long lists of PyData packages to decide which to try next. The upside is that the few tools I will introduce already allow you to do most things a data scientist does in their day-to-day work (i.e. data i/o, data munging, and data analysis).

## Installation

It has happened quite a few times that people came up to me and said "I heard Python is amazing for data science so I wanted to start learning it but spent two days installing Python and all the other modules!".
It's quite reasonable to think that you have to install Python if you want to use it, but installing the full PyData stack manually when you don't know which tools you actually need is quite an undertaking. So I strongly recommend against doing that. Fortunately for you, the fine folks at Continuum have created the Anaconda Python distribution, which installs most of the PyData stack for you, and the modules it doesn't provide out of the box can easily be installed via a GUI. The distribution is also available for all major platforms, so save yourself the two days and just use that!

## IPython Notebook

After Python is installed, most people start by launching it. Again, very reasonable but unfortunately dead wrong. I don't know a single SciPythonista who uses the Python command shell directly (YMMV). Instead, IPython, and specifically the IPython Notebook, are incredibly powerful Python shells that are used ubiquitously in PyData. I strongly recommend you start directly with the IPython Notebook (IPyNB) and don't bother with anything else; you won't regret it. In brief, the IPyNB is a Python shell that you access via your web browser. It allows you to mix code, text, and graphics (even interactive ones). This blog post was written in an IPyNB, and it's rare to find a talk at a Python conference that does not use the IPython Notebook. It comes preinstalled with Anaconda, so you can just start using it. Here's an example of what it looks like:

```python
In [1]: print('Hello World')
Hello World
```

This thing is a rocket -- every time I hear one of the core devs talk at a conference I am flabbergasted by all the new things they cooked up. To get an idea of some of the advanced capabilities, check out this short tutorial on IPython widgets. These allow you to attach sliders to control a plot interactively:

```python
In [1]: from IPython.display import YouTubeVideo
        YouTubeVideo('wxVx54ax47s')  # Yes, it can also embed youtube videos.
```
Out[1]:

## Pandas

Normally, people recommend you start by learning NumPy (pronounced num-pie, not num-pee!), which is the library that provides multi-dimensional arrays. Certainly this was the way to go a few years ago, but I hardly use NumPy at all today. The reason is that NumPy has become more of a core library that's used by other libraries which provide a much nicer interface. Thus, the main library to use for working with data is Pandas. It can input and output data from all kinds of formats (including databases), do joins and other SQL-like functions for shaping the data, handle missing values with ease, support time series, and has basic plotting capabilities and basic statistical functionality, and much more. There is certainly a learning curve to all its features, but I strongly suggest you go through most of the documentation as a first step. Trust me, the time you invest will be repaid a thousandfold by being more efficient in your data munging. Here are a few quick tricks to whet your appetite:

```python
In [18]: import pandas as pd
         df = pd.DataFrame({ 'A' : 1.,
                             'B' : pd.Timestamp('20130102'),
                             'C' : pd.Series(1, index=list(range(4)), dtype='float32'),
                             'D' : pd.Series([1, 2, 1, 2], dtype='int32'),
                             'E' : pd.Categorical(["test", "train", "test", "train"]),
                             'F' : 'foo' })

In [19]: df
Out[19]:
   A          B  C  D      E    F
0  1 2013-01-02  1  1   test  foo
1  1 2013-01-02  1  2  train  foo
2  1 2013-01-02  1  1   test  foo
3  1 2013-01-02  1  2  train  foo
```

Columns can be accessed by name:

```python
In [17]: df.B
Out[17]:
0   2013-01-02
1   2013-01-02
2   2013-01-02
3   2013-01-02
Name: B, dtype: datetime64[ns]
```

Compute the sum of D for each category in E:

```python
In [21]: df.groupby('E').sum().D
Out[21]:
E
test     2
train    4
Name: D, dtype: int32
```

Doing this in NumPy (or *gasp* Matlab!) would be much more clunky. There's a ton more. If you're not convinced, check out 10 minutes to pandas, where I borrowed this from.

## Seaborn

The main plotting library of Python is Matplotlib.
However, I don't recommend using it directly for the same reason I don't recommend spending time learning NumPy initially. While Matplotlib is very powerful, it is its own jungle and requires lots of tweaking to make your plots look shiny. So instead, I recommend starting with Seaborn. Seaborn essentially treats Matplotlib as a core library (just like Pandas does with NumPy). I will briefly illustrate the main benefits of seaborn. Specifically, it:

1. creates aesthetically pleasing plots by default (for one thing, it does not default to the jet colormap),
2. creates statistically meaningful plots, and
3. understands the pandas DataFrame, so the two work well together.

While pandas comes prepackaged with Anaconda, seaborn is not directly included but can easily be installed with `conda install seaborn`.

### Statistically meaningful plots

```python
In [5]: %matplotlib inline  # IPython magic to create plots within cells

In [7]: import seaborn as sns

        # Load one of the data sets that come with seaborn
        tips = sns.load_dataset("tips")
        sns.jointplot("total_bill", "tip", tips, kind='reg');
```

As you can see, with just one line we create a pretty complex statistical plot, including the best-fitting linear regression line along with confidence intervals, marginals, and the correlation coefficients. Recreating this plot in matplotlib would take quite a bit of (ugly) code, including calls to scipy to run the linear regression and manually applying the linear regression formula to draw the line (and I don't even know how to do the marginal plots and confidence intervals off the top of my head). This and the next example are taken from the tutorial on quantitative linear models.

### Works well with Pandas DataFrame

Data has structure. Often, there are different groups or categories we are interested in (pandas' groupby functionality is amazing in this case).
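As a tiny, self-contained illustration of that split-apply-combine pattern (my own sketch with made-up numbers, not from the original post):

```python
import pandas as pd

# Made-up numbers, purely illustrative
df = pd.DataFrame({
    'group': ['a', 'b', 'a', 'b'],
    'value': [1.0, 2.0, 3.0, 4.0],
})

# Split rows by `group`, then take the mean of `value` within each group
means = df.groupby('group')['value'].mean()
print(means['a'], means['b'])  # 2.0 3.0
```

This one-liner replaces a manual loop over categories, and it is exactly the kind of grouping seaborn performs under the hood when you facet a plot by a column.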
For example, the tips data set looks like this:

```python
In [9]: tips.head()
Out[9]:
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
```

We might ask if smokers tip differently than non-smokers. Without seaborn, this would require a pandas groupby together with the complex code for plotting a linear regression. With seaborn, we can provide the column name we wish to split by as the col keyword argument:

```python
In [11]: sns.lmplot("total_bill", "tip", tips, col="smoker");
```

Pretty neat, eh? As you dive deeper, you might want to control certain details of these plots at a more fine-grained level. Because seaborn is just calling into matplotlib, you will probably want to start learning that library at that point. For most things I'm pretty happy with what seaborn provides, however.

# Conclusions

The idea of this blog post was to provide a very select number of packages which maximize your efficiency when starting with data science in Python.

# Further reading

## Acknowledgements

Thanks to Katie Green and Andreas Dietzel for helpful feedback on an earlier draft.

### Continuum Analytics

#### Blaze Datasets

tl;dr Blaze aids exploration by supporting full databases and collections of datasets.

## November 14, 2014

### Titus Brown

#### A Saturday morning conversation about publishing inconclusions

Here's an excerpt from an e-mail to a student whose committee I'm on; they were asking me about a comment their advisor had made that they shouldn't put a result in a paper because "It'll confuse the reviewer." One thing to keep in mind is that communicating the results _is_ important.
"It'll confuse the reviewer" is not always shorthand for "well, it's true but we can't say it because it's too confusing", which is how I think you're interpreting your advisor's comment; it is also shorthand for "well, we don't really have the depth of data/information to understand why the result is this way, so there's something more complicated going on, but we can't talk about it because we don't understand it ourselves." These can lead to the most productive bits of a project, if followed up (but it takes a lot of time to do so...)

Their response:

I'm curious, however: if we don't understand what's going on ourselves, shouldn't that be all the more reason to publish something? Because then other people with more knowledge can read it, and they may know what's going on? Or at least help us?

And my response:

Well, that's definitely not how it works. Let me see if I can think of some reasons why...

First, it's much easier to get confused than it is to fight your way out of confusion - and you usually learn something in the process, obviously. So you have a lot more people who are confused than have learned something, and the latter is considered more interesting and worthy of communication than the former.

Second, writing papers is a lot of work and takes a lot of time. If you're confused (and hence probably wrong about any conclusions) it's probably not worth the effort...

Third, reading papers is also a lot of work! Most papers are never read seriously by anyone other than the reviewers. So doubling or tripling the number of papers to include confusing or inconclusive data would probably not be helpful.

Fourth, most conclusions require multiple lines of evidence, most of which are often not developed until you have at least some solid hypotheses about why you're seeing what you're seeing from one line of evidence. So a lot of those pubs would be wrong.
Fifth, failure is frequently underdetermined -- that is, we may have wrong or incorrect results and not know it, or not know why. It's rarely worth chasing down exactly why something didn't work unless it fails reproducibly and it's critical and important to your results.

That having been said, there's a movement to make publishing data easier and bring data citations into standard practice, which I think is partly targeted at the same problem you're asking about.

What am I missing?

thanks,
--titus

### William Stein

#### SageMathCloud Notifications are Now Better

I just made live a new notifications system for SageMathCloud, which I spent all week writing. These notifications are what you see when you click the bell in the upper right. This new system replaces the one I made live two weeks ago.

Whenever somebody actively *edits* (using the web interface) any file in any project you collaborate on, a notification will get created or updated. If a person *comments* on any file in any project you collaborate on (using the chat interface to the right), then not only does the notification get updated, but there is also a little red counter on top of the bell and in the title of the SageMathCloud tab. In particular, people will now be much more likely to see the chats you make on files.

NOTE: I have not yet enabled any sort of daily email notification summary, but that is planned.

Some technical details: Why did this take all week? It's because the technology that makes it work behind the scenes was fairly difficult for me to figure out how to implement. I implemented a way to create an object that can be used simultaneously by many clients and supports realtime synchronization... but is stored by the distributed Cassandra database instead of in a file in a project. Any changes to that object get synchronized around very quickly.
It's similar to how synchronized text editing (with several people at once) works, but I rethought differential synchronization carefully, and also figured out how to synchronize using an eventually consistent database. This will be useful for implementing a lot of other things in SageMathCloud that operate at a different level than "one single project". For example, I plan to add functions so you can access these same "synchronized databases" from Python processes -- then you'll be able to have sage worksheets (say) running on several different projects, but all saving their data to some common synchronized place (backed by the database). Another application will be a listing of the last 100 (say) files you've opened, with easy ways to store extra info about them. It will also be easy to make account and project settings more realtime, so when you change something, it automatically takes effect and is also synchronized across other browser tabs you may have open. If you're into modern Single Page App web development, this might remind you of Angular or React or Hoodie or Firebase -- what I did this week is probably kind of like some of the sync functionality of those frameworks, but I use Cassandra (instead of MongoDB, say) and differential synchronization. I BSD-licensed the differential synchronization code that I wrote as part of the above.

## November 12, 2014

### Jake Vanderplas

#### The Hipster Effect: An IPython Interactive Exploration

This week I started seeing references all over the internet to this paper: The Hipster Effect: When Anticonformists All Look The Same. It describes a simple mathematical model of conformity and non-conformity among a mutually interacting population, and finds some interesting results: namely, conformity among a population of self-conscious non-conformists is similar to a phase transition in a time-delayed thermodynamic system.
In other words, with enough hipsters around responding to delayed fashion trends, a plethora of facial hair and fixed-gear bikes is a natural result. Also naturally, upon reading the paper I wanted to try to reproduce the work. The paper solves the problem analytically for a continuous system and shows the precise values of certain phase transitions within the long-term limit of the postulated system. Though such theoretical derivations are useful, I often find it more intuitive to simulate systems like this in a more approximate manner to gain hands-on understanding. By the end of this notebook, we'll be using IPython's incredible interactive widgets to explore how the inputs to this model affect the results.

## Mathematically Modeling Hipsters

We'll start by defining the problem, and going through the notation suggested in the paper. We'll consider a group of $$N$$ people, and define the following quantities:

• $$\epsilon_i$$ : this value is either $$+1$$ or $$-1$$. $$+1$$ means person $$i$$ is a hipster, while $$-1$$ means they're a conformist.
• $$s_i(t)$$ : this is also either $$+1$$ or $$-1$$. This indicates person $$i$$'s choice of style at time $$t$$. For example, $$+1$$ might indicate a bushy beard, while $$-1$$ indicates clean-shaven.
• $$J_{ij}$$ : The influence matrix. This is a value greater than zero which indicates how much person $$j$$ influences person $$i$$.
• $$\tau_{ij}$$ : The delay matrix. This is an integer telling us the length of delay for the style of person $$j$$ to affect the style of person $$i$$.

The idea of the model is this: on any given day, person $$i$$ looks at the world around him or her, and sees some previous day's version of everyone else. This information is $$s_j(t - \tau_{ij})$$.
The amount that person $$j$$ influences person $$i$$ is given by the influence matrix, $$J_{ij}$$, and after putting all the information together, we see that person $$i$$'s mean impression of the world's style is

$m_i(t) = \frac{1}{N} \sum_j J_{ij} \cdot s_j(t - \tau_{ij})$

Given the problem setup, we can quickly check whether this impression matches their own current style:

• if $$m_i(t) \cdot s_i(t) > 0$$, then person $$i$$ matches those around them
• if $$m_i(t) \cdot s_i(t) < 0$$, then person $$i$$ looks different than those around them

A hipster who notices that their style matches that of the world around them will risk giving up all their hipster cred if they don't change quickly; a conformist will have the opposite reaction. Because $$\epsilon_i$$ = $$+1$$ for a hipster and $$-1$$ for a conformist, we can encode this observation in a single value which tells us which way the person will lean that day:

$x_i(t) = -\epsilon_i m_i(t) s_i(t)$

Simple! If $$x_i(t) > 0$$, then person $$i$$ will more likely switch their style that day, and if $$x_i(t) < 0$$, person $$i$$ will more likely maintain the same style as the previous day. So we have a formula for how to update each person's style based on their preferences, their influences, and the world around them. But the world is a noisy place. Each person might have other things going on that day, so instead of using this value directly, we can turn it into a probabilistic statement.
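Before adding noise, the two formulas can be exercised on a tiny hand-built example (my own sketch, not code from the notebook): three people, a uniform influence matrix, and zero delay, so everyone sees today's styles.

```python
import numpy as np

eps = np.array([+1, -1, -1])   # person 0 is a hipster; 1 and 2 are conformists
s   = np.array([+1, +1, -1])   # current styles
J   = np.ones((3, 3))          # uniform influence matrix

# m[i]: person i's mean impression of the world's style
m = (J * s).sum(axis=1) / len(s)

# x[i]: per the text's convention, x[i] > 0 means person i
# is more likely to switch styles today
x = -eps * m * s

print(m)  # each entry is (1 + 1 - 1) / 3 = 1/3
print(x)  # [-1/3, +1/3, -1/3]
```

With uniform $$J$$ the normalization by $$N$$ matches the class implementation below, which divides by the row sums of $$J$$ instead.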
Consider the function

$\phi(x;\beta) = \frac{1 + \tanh(\beta \cdot x)}{2}$

We can plot this function quickly:

```python
In [1]: %matplotlib inline
        import numpy as np
        import matplotlib.pyplot as plt
        # Use seaborn styles for nice-looking plots
        import seaborn; seaborn.set()

In [2]: x = np.linspace(-1, 1, 1000)
        for beta in [0.5, 1, 5]:
            plt.plot(x, 0.5 * (1 + np.tanh(beta * x)),
                     label=r'$\beta = {0:.1f}$'.format(beta))
        plt.legend(loc='upper left', fontsize=18)
        plt.xlabel('$x$', size=18); plt.ylabel(r'$\phi(x;\beta)$', size=18);
```

This gives us a nice way to move from our preference $$x_i$$ to a probability of switching styles. Here $$\beta$$ is inversely related to noise. For large $$\beta$$, the noise is small and we basically map $$x > 0$$ to a 100% probability of switching, and $$x < 0$$ to a 0% probability of switching. As $$\beta$$ gets smaller, the probabilities get less and less distinct.

## The Code

Let's see this model in action. We'll start by defining a class which implements everything we've gone through above:

```python
In [3]: class HipsterStep(object):
            """Class to implement hipster evolution

            Parameters
            ----------
            initial_style : length-N array
                values > 0 indicate one style, while values <= 0
                indicate the other.
            is_hipster : length-N array
                True or False, indicating whether each person is a hipster
            influence_matrix : N x N array
                Array of non-negative values. influence_matrix[i, j] indicates
                how much influence person j has on person i
            delay_matrix : N x N array
                Array of positive integers. delay_matrix[i, j] indicates the
                number of days delay between person j's influence on person i.
            """
            def __init__(self, initial_style, is_hipster,
                         influence_matrix, delay_matrix,
                         beta=1, rseed=None):
                self.initial_style = initial_style
                self.is_hipster = is_hipster
                self.influence_matrix = influence_matrix
                self.delay_matrix = delay_matrix

                self.rng = np.random.RandomState(rseed)
                self.beta = beta

                # make s array consisting of -1 and 1
                self.s = -1 + 2 * (np.atleast_2d(initial_style) > 0)
                N = self.s.shape[1]

                # make eps array consisting of -1 and 1
                self.eps = -1 + 2 * (np.asarray(is_hipster) > 0)

                # create influence_matrix and delay_matrix
                self.J = np.asarray(influence_matrix, dtype=float)
                self.tau = np.asarray(delay_matrix, dtype=int)

                # validate all the inputs
                assert self.s.ndim == 2
                assert self.s.shape[1] == N
                assert self.eps.shape == (N,)
                assert self.J.shape == (N, N)
                assert np.all(self.J >= 0)
                assert np.all(self.tau > 0)

            @staticmethod
            def phi(x, beta):
                return 0.5 * (1 + np.tanh(beta * x))

            def step_once(self):
                N = self.s.shape[1]

                # iref[i, j] gives the index for the j^th individual's
                # time-delayed influence on the i^th individual
                iref = np.maximum(0, self.s.shape[0] - self.tau)

                # sref[i, j] gives the previous state of the j^th individual
                # which affects the current state of the i^th individual
                sref = self.s[iref, np.arange(N)]

                # m[i] is the mean of weighted influences of other individuals
                m = (self.J * sref).sum(1) / self.J.sum(1)

                # From m, we use the sigmoid function to compute
                # a transition probability
                transition_prob = self.phi(-self.eps * m * self.s[-1],
                                           beta=self.beta)

                # Now choose steps stochastically based on this probability
                new_s = np.where(transition_prob > self.rng.rand(N),
                                 -1, 1) * self.s[-1]

                # Add this to the results, and return
                self.s = np.vstack([self.s, new_s])
                return self.s

            def step(self, N):
                for i in range(N):
                    self.step_once()
                return self.s

            def trend(self, hipster=True, return_std=True):
                if hipster:
                    subgroup = self.s[:, self.eps > 0]
                else:
                    subgroup = self.s[:, self.eps < 0]
                return subgroup.mean(1), subgroup.std(1)
```

Now we'll create a function which
plots the trend for a certain number of time steps:

```python
In [4]: def plot_results(Npeople=500, Nsteps=200,
                         hipster_frac=0.8, initial_state_frac=0.5,
                         delay=20, log10_beta=0.5, rseed=42):
            rng = np.random.RandomState(rseed)

            initial_state = (rng.rand(1, Npeople) > initial_state_frac)
            is_hipster = (rng.rand(Npeople) > hipster_frac)

            influence_matrix = abs(rng.randn(Npeople, Npeople))
            influence_matrix.flat[::Npeople + 1] = 0

            delay_matrix = 1 + rng.poisson(delay, size=(Npeople, Npeople))

            h = HipsterStep(initial_state, is_hipster,
                            influence_matrix, delay_matrix=delay_matrix,
                            beta=10 ** log10_beta, rseed=rseed)
            h.step(Nsteps)

            def beard_formatter(y, loc):
                if y == 1:
                    return 'bushy-\nbeard'
                elif y == -1:
                    return 'clean-\nshaven'
                else:
                    return ''

            t = np.arange(Nsteps + 1)

            fig, ax = plt.subplots(2, sharex=True, figsize=(8, 6))

            ax[0].imshow(h.s.T, cmap='binary', interpolation='nearest')
            ax[0].set_ylabel('individual')
            ax[0].axis('tight')
            ax[0].grid(False)

            mu, std = h.trend(True)
            ax[1].plot(t, mu, c='red', label='hipster')
            ax[1].fill_between(t, mu - std, mu + std, color='red', alpha=0.2)

            mu, std = h.trend(False)
            ax[1].plot(t, mu, c='blue', label='conformist')
            ax[1].fill_between(t, mu - std, mu + std, color='blue', alpha=0.2)

            ax[1].set_xlabel('time')
            ax[1].set_ylabel('Trend')
            ax[1].legend(loc='best')
            ax[1].set_ylim(-1.1, 1.1)
            ax[1].set_xlim(0, Nsteps)
            ax[1].yaxis.set_major_formatter(plt.FuncFormatter(beard_formatter))
```

## Exploring the Result

With this code in place, we can now explore the result. We'll start by seeing what happens when just 10% of the population is made up of non-conformist hipsters:

```python
In [5]: plot_results(hipster_frac=0.1)
```

Let's describe this plot briefly: the top panel has 500 rows and 200 columns; each row represents an individual person, and the color (white or black) along the row represents the style of that person at that time. In the bottom panel, we see the mean and standard deviation of the styles of all hipsters (red) and all conformists (blue).
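To make the dynamics concrete without the plotting machinery, here is a miniature, self-contained re-implementation of the update rule (my own sketch, not the notebook's code), run on a tiny population with uniform influence and zero delay:

```python
import numpy as np

rng = np.random.RandomState(0)
N, steps = 20, 50

eps = np.where(rng.rand(N) < 0.1, 1, -1)   # roughly 10% hipsters (+1)
s = np.where(rng.rand(N) < 0.5, 1, -1)     # random initial styles
J = np.ones((N, N))                        # uniform influence, no delay
beta = 10.0                                # low noise

history = [s.copy()]
for _ in range(steps):
    m = (J * s).sum(1) / J.sum(1)          # mean impression of the world
    # phi(-eps * m * s): probability each person switches style today
    p_switch = 0.5 * (1 + np.tanh(beta * -eps * m * s))
    s = np.where(rng.rand(N) < p_switch, -s, s)
    history.append(s.copy())

history = np.array(history)
print(history.shape)  # (51, 20): one row per day, one column per person
```

This is the same loop `HipsterStep.step_once` performs, minus the delay bookkeeping; the full class above adds the time-delayed lookups and trend summaries.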
This plot shows something relatively unsurprising: when there are only a few hipsters in the population, we quickly reach an equilibrium where hipsters all have one style (a bushy beard) while the norm-core conformists have the opposite (clean-shaven faces). Let's see what happens if there are more hipsters in the population:

```python
In [6]: plot_results(hipster_frac=0.5)
```

With half the population made up of hipsters, the trend washes out. There is so much noise and confusion about styles that both the hipsters and the conformists have a wide range of styles at any given time. Now let's see what happens when we have even more hipsters:

```python
In [7]: plot_results(hipster_frac=0.8)
```

Now this is getting interesting! With a population dominated by hipsters, we end up approaching steady cycles in trends. The hipsters start trying to distance themselves from the "normal" style, and then the normal style moves to catch up with them. The hipsters then swing the other way, and the process repeats. This is an example of the "phase transition" that the author of the original paper talked about in analogy to thermodynamic systems: above certain critical values of the model parameters, a qualitatively new type of behavior appears out of the noise. This oscillation can be thought of as a rough and simplistic mathematical model for the recurrent cycles of cultural and fashion trends that anyone more than a couple of decades old has probably noticed over time. But let's explore this even more.

## Fully Interactive

One of the nice pieces of the IPython notebook is the ability to quickly create interactive visualizations. Unfortunately this only works when you're viewing the notebook live (i.e. a static HTML view on a blog post won't give you any interaction). If you're reading this on my blog or on nbviewer, then you can download the notebook and run it with IPython to see these features.
What we'll do is to call IPython's interactive tools on our Python function, which will create javascript sliders allowing us to explore the parameter space of this hipster conformity model. I'd encourage you to download the notebook and try it out!

```python
In [8]: from IPython.html.widgets import interact, fixed

        interact(plot_results, hipster_frac=(0.0, 1.0), delay=(1, 50),
                 initial_state_frac=(0.0, 1.0), log10_beta=(-2.0, 2.0),
                 Npeople=fixed(500), Nsteps=fixed(200), rseed=fixed(42));
```

Again, unless you download the notebook and run it on a local Python kernel, all you'll see is a static graphic above. But with the interactive version, you can really start to see how these various parameters affect the system.

## Conclusion

This has been a lot of fun, and if you've read this far I hope this helped you understand the mathematics of Hipster-dom! For more information and analysis, go read the paper. It goes much deeper than the rough, discrete approximation I've used here. For further ideas, I'd love to see a simulation of how this looks if we add in spatial information and create a delay related to that information. Would you start to see pockets of people adopting similar styles? My guess is yes, but I'm not entirely sure... there's only one way to find out. Happy coding!

This post was written entirely in the IPython notebook. You can download this notebook, or see a static view here.

## November 10, 2014

### Titus Brown

#### Some thoughts on Journal Impact Factor

A colleague just e-mailed me to ask me how I felt about journal impact factor being such a big part of the Academic Ranking of World Universities - they say that 20% of the ranking weight comes from the number of papers published in Nature and Science. So what do I think?

## On evaluations

I'm really not a big fan of rankings and evaluations in the first place. This is largely because I feel that evaluations are rarely objective.
For one very specific example, last year at MSU I got formal evaluations from both of my departments. Starting with the same underlying data (papers published, students graduated, grants submitted/awarded, money spent, classes taught, body mass/height ratio, gender weighting, eye color, miles flown, talks given, liquid volume of students' tears generated, committees served on, etc.), one department gave me a "satisfactory/satisfactory/satisfactory" while the other department gave me an "excellent/excellent/excellent." What fun! (I don't think the difference was the caliber of the departments.) Did I mention that these rankings helped determine my raise for the year?

Anyhoo, I find the rating and ranking scheme within departments at MSU to be largely silly. It's done in an ad hoc manner by untrained people, and as far as I can tell, is biased against people who are not willing to sing their own praises. (I brought up the Dunning-Kruger effect in my last evaluation meeting. Heh.) That's not to say there's no serious intent -- the rankings are factored into raises, and at least one other purpose of evaluating assistant professors is so that once you fire their ass (aka "don't give them tenure") there's a paper trail of critical evaluations where you explained that they were in trouble.

Metrics are part of the job, though; departments evaluate their faculty so they can see who, if anyone, needs help or support or mentoring, and to do this, they rely at least in part on metrics. Basically, if someone has lots of funding and lots of papers, they're probably not failing miserably at being a research professor; if they're grant-poor and paper-poor, they're targets for further investigation. There are lots of ways to evaluate, but metrics seem like an inextricable part of it.

## Back to the impact factor

Like faculty evaluations, ranking by the impact factor of the journals that university faculty publish in is an attempt to predict future performance using current data.
But impact factor is extremely problematic, for many reasons. It's based on citations, which (over the long run) may be an OK measure of impact, but are subject to many confounding factors, including field-specific citation patterns. It's an attempt to predict the success of individual papers on a whole-journal basis, which falls apart in the face of variable editorial decisions. High-impact journals are also often read more widely than low-impact journals, which yields a troubling circularity in terms of citation numbers (you're more likely to cite a paper you've read!). Worse, the whole system is prone to being gamed in various ways, which is leading to high rates of retractions for high-impact journals, as well as outright fraud. Impact factor is probably a piss-poor proxy for paper impact, in other words.

If impact factor were just a thing that didn't matter, I wouldn't worry. The real trouble is that impact factors have real-world effects - many countries use the impact factor of publications as a very strong weight in funding and promotion decisions. Interestingly, the US is not terribly heavy-handed here - most universities seem pretty enlightened about considering the whole portfolio of a scientist, at least anecdotally. But I can name a dozen countries that care deeply about impact factor for promotions and raises. And apparently impact factor affects university rankings, too!

Taking a step back, it's not clear that any good ranking scheme can exist, and if it does, we're certainly not using it. All of this is a big problem if you care about fostering good science. The conundrum is that many people like rankings, and it seems futile to argue against measuring and ranking people and institutions. However, any formalized ranking system can be gamed and perverted, which ends up sometimes rewarding the wrong kind of people and shutting out some of the right kind of people.
(The Reed College position on the US News & World Report ranking system is worth reading here.)

More generally, in any ecosystem the competitive landscape is evolving, and a sensible measure today may become a lousy measure tomorrow as the players evolve their strategies; the stricter the rules of evaluation, and the more entrenched the evaluation system, the less likely it is to adapt, and the more misranking will result. So ranking systems need to evolve continuously.

At its heart, this is a scientific management challenge. Rankings and metrics do pretty explicitly set the landscape of incentives and competition. If our goal in science is to increase knowledge for the betterment of mankind, then the challenge for scientific management is to figure out how to incentivize behaviors that trend in that direction in the long term. If you use bad or outdated metrics, then you incentivize the wrong kind of behavior, and you waste precious time, energy, and resources. Complicating this is the management structure of academic science, which is driven by many things that include rankings and reputation - concepts that range from precise to fuzzy.

My position on all of this is always changing, but it's pretty clear that the journal system is kinda dumb and rewards the wrong behavior. (For the record, I'm actually a big fan of publications, and I think citations are probably not a terribly bad measure of impact when measured on papers and individuals, although I'm always happy to engage in discussions on why I'm wrong.) But the impact factor is especially horrible. The disproportionate effect that high-IF glamour mags like Cell, Nature and Science have on our scientific culture is clearly a bad thing - for example, I'm hearing more and more stories about editors at these journals warping scientific stories directly or indirectly to be more press-worthy - and when combined with the reproducibility crisis I'm really worried about the short-term future of science.
Journal Impact Factor and other simple metrics are fundamentally problematic and are contributing to the problem, along with the current peer review culture and a whole host of other things. (Mike Eisen has written about this a lot; see e.g. this post.) In the long term I think a much more experimental culture of peer review and alternative metrics will emerge. But what do we do for now? More importantly:

## How can we change?

I think my main advice to faculty is "lead, follow, or get out of the way." Unless you're a recognized big shot, or willing to take somewhat insane gambles with your career, "leading" may not be productive - following or getting out of the way might be best. But there are a lot of things you can do here that don't put you at much risk, including:

• be open to a broader career picture when hiring and evaluating junior faculty;
• argue on behalf of alternative metrics in meetings on promotion and tenure;
• use sites like Google Scholar to pick out some recent papers to read in depth when hiring faculty and evaluating grants;
• avoid making (or push back at) cheap shots at people who don't have a lot of high-impact-factor papers;
• invest in career mentoring that is more nuanced than "try for lots of C-N-S papers or else" - you'd be surprised how often this is the main advice assistant professors take away...
• believe in and help junior faculty that seem to have a plan, even if you don't agree with the plan (or at least leave them alone ;)

What if you are a recognized big shot? Well, there are lots of things you can do. You are the people who set the tone in the community and in your department, and it behooves you to think scientifically about the culture and reward system of science. The most important thing you can do is think and investigate. What evidence is there behind the value of peer review? Are you happy with C-N-S editorial policies, and have you talked to colleagues who get rejected at the editorial review stage more than you do?
Have you thought about per-article metrics? Do you have any better thoughts on how to improve the system than 'fund more people', and how would you effect changes in this direction by recognizing alternate metrics during tenure and grant review?

The bottom line is that the current evaluation systems are the creation of scientists, for scientists. It's our responsibility to critically evaluate them, and perhaps evolve them when they're inadequate; we shouldn't just complain about how the current system is broken and wait for someone else to fix it.

## Addendum: what would I like to see?

Precisely predicting the future importance of papers is obviously kind of silly - see this great 1994 paper by Gans and Shepherd on rejected classics papers, for example - and is subject to all sorts of confounding effects. But this is nonetheless what journals are accustomed to doing: editors at most journals, especially the high impact factor ones, select papers based on projected impact before sending them out for review, and/or ask the reviewers to review impact as well. So I think we should do away with impact review and review for correctness instead. This is why I'm such a big fan of PLOS One and PeerJ, who purport to do exactly that.

But then, I get asked, what do we do about selecting out papers to read? Some (many?) scientists claim that they need the filtering effect of these selective journals to figure out what they should be reading. There are a few responses to this.

First, it's fundamentally problematic to outsource your attention to editors at journals, for reasons mentioned above. There's some evidence that you're being drawn into a manipulated and high-retraction environment by doing that, and that should worry you.

But let's say you feel you need something to tell you what to read. Well, second, this is technologically solvable - that's what search engines already do.
There's a whole industry of search engines that give great results based on integrating free text search, automatic content classification, and citation patterns. Google Scholar does a great job here, for example.

Third, social media (aka "people you know") provides some great recommendation systems! People who haven't paid much attention to Twitter or blogging may not have noticed, but in addition to person-to-person recommendations, there are increasingly good recommendation systems coming on line. I personally get most of my paper recs from online outlets (mostly people I follow, but I've found some really smart people to follow on Twitter!). It's a great solution!

Fourth, if one of the problems is that many journals review for correctness AND impact together, why not separate them? For example, couldn't journals like Science or Nature evolve into literature overlays that highlight papers published in impact-blind journals like PLOS One or PeerJ? I can imagine a number of ways that this could work, but if we're so invested in having editors pick papers for us, why not have them pick papers that have been reviewed for scientific correctness first, and then elevate them to our attention with their magic editorial pen? I don't see too many drawbacks to this vs the current approach, and many improvements. (Frankly this is where I see most of the scientific literature going, once preprint archives become omnipresent.)

So that's where I want and expect to see things going. I don't see ranking based on predicted impact going away, but I'd like to see it become more reflective of actual impact (and be measured in more diverse ways).

--titus

p.s. People looking for citations of high retraction rates, problematic peer review, and the rest could look at one of my earlier blog posts on problems with peer review. I'd be interested in more citations, though!
## November 06, 2014

### Fabian Pedregosa

#### Plot memory usage as a function of time

One of the lesser-known features of the memory_profiler package is its ability to plot memory consumption as a function of time. This was implemented by my friend Philippe Gervais, previously a colleague at INRIA and now working for Google.

With this feature it is very easy to generate a plot of memory consumption as a function of time. The result will be something like this:

where you can see the memory used (on the y-axis) as a function of time (x-axis). Furthermore, we have used two functions, test1 and test2, and the colored brackets show at what time these functions start and finish.

This plot was generated with the following simple script:

import time

@profile
def test1():
    n = 10000
    a = [1] * n
    time.sleep(1)
    return a

@profile
def test2():
    n = 100000
    b = [1] * n
    time.sleep(1)
    return b

if __name__ == "__main__":
    test1()
    test2()

What happens here is that we have two functions, test1() and test2(), in which we create two lists of different sizes (the one in test2 is bigger). We call time.sleep() for one second so that the functions do not return too soon and we have time to get reliable memory measurements. The decorator @profile is optional: it lets memory_profiler know when each function is called so it can plot the brackets indicating that. If you don't use the decorator, the example will work just fine, except that the brackets will not appear in your plot.

Suppose we have saved the script as test1.py. We run the script as

$ mprof run test1.py


where mprof is an executable provided by memory_profiler. If the above command was successful it will print something like this

$ mprof run test1.py
mprof: Sampling memory every 0.1s
running as a Python program...

The above command will create a .dat file in your current working directory, something like mprofile_20141108113511.dat. This file (you can inspect it; it's a text file) contains the memory measurements for your program. You can now plot the memory measurements with the command

$ mprof plot


This will open a matplotlib window and show you the plot:

As you see, attention has been paid to the default values, so the plot it generates already looks decent without much effort. The not-so-nice part is that, at least as of November 2014, if you want to customize the plot you'll have to look at and modify the mprof script. Some refactoring is still needed to make the plots easier to customize (work in progress).

## November 02, 2014

### Titus Brown

#### Doing biology, however it needs to be done

Sean Eddy wrote an interesting blog post on how scripting is something every biologist should learn to do. This spurred a few discussions on Twitter and elsewhere, most of which devolved into the usual arguments about what, precisely, biologists should be taught.

I always find these discussions not merely predictable but rather beside the point. Statisticians will always complain that people need a better appreciation of stats; bioinformaticians will point to alignment or sequence comparison or whatnot; evolutionary biologists will talk about how important evolution is; physicists will point to more math; etc. But the truth is there's very, very little that is necessary to be a biologist.

My perspective on this is informed by my own background, which is a tad idiosyncratic. Despite being an assistant professor in a Microbiology department and (soon) a professor in a VetMed program, I have taken more physics courses than biology courses; heck, I've taught more biology courses than I've taken. I've never taken a chemistry course and know virtually nothing about biochemistry, in particular. Most of my evolutionary knowledge derives from my reading for my research on Avida; I've never been taught anything about ecology and still know very little. I failed my first qual exam at Caltech because I didn't actually know anything about cells (which event, ahem, spurred me to learn). Despite this utter lack of formal background in biology, I spent 5-10 years of my life doing experimental molecular biology for developmental biology research (after 3-5 years learning basic molecular biology), and I've learned what I think is a reasonable amount about cell biology into the bargain. I know a fair bit about developmental biology, evo-devo, molecular biology, and genomics. My PhD is actually in Biology, with a Developmental Biology focus. But I learned it all as I needed it. (My undergrad degree is in Unapplied Mathematics.)

My weakest formal training is probably in stats, where I know enough to point out where whatever system I'm studying is violating standard statistical requirements, but not enough to point out how to rescue our approach.

Despite having "run" a bioinformatics lab for the last few years, my formal bioinformatics background is basically nil - I took a strong programming background, learned a bunch of biology and genomics, and then realized that much of bioinformatics is fairly obvious at that point. I don't really understand Hidden Markov Models or sequence alignment (but shh, don't tell anyone!)

With all of this, what do I call myself? Well, I definitely consider myself a biologist, as do at least a few different hiring panels, including one at UC Davis :). And when I talk to other biologists, I think that at least some of them agree - I'm focused on answering biological questions. I do so primarily in collaboration at this point, and primarily from the standpoint of data, but: biology.

So here's my conclusion: to be a biologist, one must be seriously trying to study biology. Period. Clearly you must know something about biology in order to be effective, and critical thinking is presumably pretty important; I think "basic competency in scientific practice" is probably the minimum bar, but even then you can imagine lab techs or undergraduates putting in useful work at a pretty introductory level. I think there are many useful skills to have, but I have a hard time concluding that any of them are strictly necessary.

The more interesting question, to my mind, is what should we be teaching undergraduates and graduate students in biology? And there I unequivocally agree with the people who prioritize some reasonable background in stats, and some reasonable background in data analysis (with R or Python - something more than Excel). What's more important than teaching any one thing in specific, though, is that the whole concept that biologists can avoid math or computing in their training (be it stats, modeling, simulation, programming, data science/data analysis, or whatever) needs to die. That is over. Dead, done, over.

One particular challenge we are facing now is that we don't have many people capable of teaching these younger biologists the appropriate data analysis skills, because most biologists (including the non-research-active faculty that do most teaching) don't know anything about them, and data analysis in biology is about data analysis in biology -- you can't just drop in a physicist or an engineer to teach this stuff.

At the end of the day, though, a scientist either learns what they need to know in order to do their research, or they collaborate with others to do it. As data becomes ever more important in biology, I expect more and more biologists will learn how to do their own analysis. One of my interests is in figuring out how to help biologists to make this transition if they want to.

So perhaps we can shift from talking about what you must know in order to practice biology, and talk about what we're going to teach, to whom, and when, to people who are the biologists of the future?

--titus

## October 27, 2014

### Juan Nunez-Iglesias

It’s time to draw my “continuous integration in Python” series to a close. This final post ties all six previous posts together and is the preferred write-up to share more widely and on which to provide feedback.

Almost everything I know about good Python development I’ve learned from Stéfan van der Walt, Tony Yu, and the rest of the scikit-image team. But a few weeks ago, I was trying to emulate the scikit-image CI process for my own project: cellom2tif, a tool to liberate images from a rather useless proprietary format. (I consider this parenthetical comment sufficient fanfare to announce the 0.2 release!) As I started copying and editing config files, I found that even from a complete template, getting started was not very straightforward. First, scikit-image has much more complicated requirements, so that a lot of the .travis.yml file was just noise for my purposes. And second, as detailed in the previous posts, a lot of the steps are not found or recorded anywhere in the repository, but rather must be navigated to on the webpages of GitHub, Travis, and Coveralls. I therefore decided to write this series as both a notetaking exercise and a guide for future CI novices. (Such as future me.)

To recap, here are my six steps to doing continuous integration in Python with pytest, Travis, and Coveralls:

If you do all of the above at the beginning of your projects, you’ll be in a really good place one, two, five years down the line, when many academic projects grow far beyond their original scope in unpredictable ways and end up with much broken code. (See this wonderful editorial by Zeeya Merali for much more on this topic.)

## Reducing the boilerplate with PyScaffold

But it’s a lot of stuff to do for every little project. I was about to make myself some minimal setup.cfg and .travis.yml template files so that I could have these ready for all new projects, when I remembered PyScaffold, which sets up a Python project’s basic structure automatically (setup.py, package_name/__init__.py, etc.). Sure enough, PyScaffold has a --with-travis option that implements all my recommendations, including pytest, Travis, and Coveralls. If you set up your projects with PyScaffold, you’ll just have to turn on Travis-CI on your GitHub repo admin and Coveralls on coveralls.io, and you’ll be good to go.

## When Travises attack

I’ve made a fuss about how wonderful Travis-CI is, but it breaks more often than I’d like. You’ll make some changes locally, and ensure that the tests pass, but when you push them to GitHub, Travis fails. This can happen for various reasons:

• your environment is different (e.g. NumPy versions differ between your local build and Travis’s VMs).
• you’re testing a function that depends on random number generation and have failed to set the seed.
• you depend on some web resource that was temporarily unavailable when you pushed.
• Travis has updated its VMs in some incompatible way.
• you have more memory/CPUs locally than Travis allows.
• some other, not-yet-understood-by-me reason.

Of these, the first three are acceptable. You can use conda to match your environments both locally and on Travis, and you should always set the seed for randomised tests. For network errors, Travis provides a special function, travis_retry, that you can prefix your commands with.
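The seeding point can be made concrete with a small stdlib-only sketch (the function and names here are illustrative, not from any particular test suite):

```python
import random

def jittered(values, rng):
    # Add small Gaussian noise to a list of values, as a test fixture might.
    return [v + rng.gauss(0, 0.1) for v in values]

# Seeding the generator makes the "random" data identical on every machine,
# so the test passes (or fails) the same way locally and on Travis's VMs.
rng_local = random.Random(0)
rng_ci = random.Random(0)
assert jittered([1.0, 2.0, 3.0], rng_local) == jittered([1.0, 2.0, 3.0], rng_ci)
```

Without the fixed seed, the two runs would produce different noise and any exact comparison against stored expected values would fail intermittently.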

Travis VM updates should theoretically be benign and not cause any problems, but, in recent months, they have been a significant source of pain for the scikit-image team: every monthly update by Travis broke our builds. That’s disappointing, to say the least. For simple builds, you really shouldn’t run into this. But for major projects, this is an unnecessary source of instability.

Further, Travis VMs don’t have unlimited memory and disk space for your builds (naturally), but the limits are not strictly defined (unnaturally). This means that builds requiring “some” memory or disk space randomly fail. Again, disappointing. Travis could, for example, guarantee some minimal specs that everyone could program against — and request additional space either as special exemptions or at a cost.

Finally, there’s the weird failures. I don’t have any examples on hand but I’ll just note that sometimes Travis builds fail, where your local copy works fine every single time. Sometimes rebuilding fixes things, and other times you have to change some subtle but apparently inconsequential thing before the build is fixed. These would be mitigated if Travis allowed you to clone their VM images so you could run them on a local VM or on your own EC2 allocation.

A too-common Travis occurrence: randomly failing tests

In all though, Travis is a fantastic resource, and you shouldn’t let my caveats stop you from using it. They are just something to keep in mind before you pull all your hair out.

## The missing test: performance benchmarks

Testing helps you maintain the correctness of your code. However, as Michael Droettboom eloquently argued at SciPy 2014, all projects are prone to feature creep, which can progressively slow code down. Airspeed Velocity is to benchmarks what pytest is to unit tests, and allows you to monitor your project’s speed over time. Unfortunately, benchmarks are a different beast to tests, because you need to keep the testing computer’s specs and load constant for each benchmark run. Therefore, a VM-based CI service such as Travis is out of the question.

If your project has any performance component, it may well be worth investing in a dedicated machine only to run benchmarks. The machine could monitor your GitHub repo for changes and PRs, check them out when they come in, run the benchmarks, and report back. I have yet to do this for any of my projects, but will certainly consider this strongly in the future.

The above tools all work great as part of GitHub’s pull request (PR) development model. It’s a model that is easy to grok, works well with new programmers, and has driven massive growth in the open-source community. Lately, I recommend it with a bit more trepidation than I used to, because it does have a few high-profile detractors, notably Linux and git creator Linus Torvalds, and OpenStack developer Julien Danjou. To paraphrase Julien, there are two core problems with GitHub’s chosen workflow, both of which are longstanding and neither of which shows any sign of improving.

First, comments on code diffs are buried by subsequent changes, whether the changes are a rebase or they simply change the diff. This makes it very difficult for an outside reviewer to assess what discussion, if any, resulted in the final/latest design of a PR. This could be a fairly trivial fix (colour-code outdated diffs, rather than hiding them), so I would love to see some comments from GitHub as to what is taking so long.

Expect to see a lot of these when using pull requests.

Second, bisectability is broken by fixup commits. The GitHub development model is not only geared towards small, incremental commits being piled on to a history, but it actively encourages these with their per-commit badging of a user's contribution calendar. Fixup commits make bug hunting with git bisect more difficult, because some commits will not be able to run a test suite at all. This could be alleviated by considering only commits merging GitHub PRs, whose commit messages start with Merge pull request #, but I don't know how to get git to do this automatically (ideas welcome in the comments).
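Listing just those PR merge commits is at least straightforward; restricting bisect to them is the part that lacks automation. A sketch, run in a throwaway repository (the user/feature names are made up for the demo):

```shell
# Set up a toy repository with one feature branch merged GitHub-style.
tmp=$(mktemp -d) && cd "$tmp" && git init -q .
git config user.email demo@example.com && git config user.name demo
echo a > f && git add f && git commit -qm 'initial commit'
git checkout -qb feature && echo b >> f && git commit -qam 'tweak'
git checkout -q - && git merge --no-ff -q feature -m 'Merge pull request #1 from user/feature'

# List only merge commits whose message matches GitHub's PR merge format;
# each of these should correspond to a state where the full test suite passed,
# so they are the candidates worth bisecting over.
git log --merges --grep='^Merge pull request #' --oneline
```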

I disagree with Julien that there is “no value in the social hype [GitHub] brings.” In fact, GitHub has dramatically improved my coding skills, and no doubt countless others’. For many, it is their first experience with code review. Give credit where it is due: GitHub is driving the current, enormous wave of open-source development. But there is no doubt it needs improvement, and it’s sad to see GitHub’s developers apparently ignoring their critics. I hope the latter will be loud enough soon that GitHub will have no choice but to take notice.

This series, including this post, sums up my current thinking on CI in Python. It’s surely incomplete: I recently came across a curious “Health: 88%” badge on Mitchell Stanton-Cook’s BanzaiDB README. Clicking it took me to the project’s landscape.io page, which appears to do for coding style what Travis does for builds/tests and Coveralls does for coverage. How it measures “style” is not yet clear to me, but it might be another good CI tool to keep track of. Nevertheless, since it’s taken me a few years to get to this stage in my software development practice, I hope this series will help other scientists get there faster.

If any more experienced readers think any of my advice is rubbish, please speak up in the comments! I’ll update the post(s) accordingly. CI is a big rabbit hole and I’m still finding my way around.

## October 22, 2014

### Titus Brown

#### Infrastructure for Data Intensive Biology - a statement of work

Since being chosen as a Moore Foundation Data Driven Discovery Investigator, I've been putting together the paperwork at UC Davis to actually receive the money. Part of that is putting together a budget and a Statement of Work to help guide the conversation between me, Davis, and the Moore Foundation. Here's what I sent Chris Mentzel at Moore:

Title: Infrastructure for Data Intensive Biology

OUTCOME: In support of demonstrating the high level of scientific impact that data scientists deliver through their focus on interdisciplinary data-driven research, funds from this award will be used to better understand gene function in non-model organisms through the development of new workflows and better data sharing technology for large-scale data analysis.

Research direction 1: Develop and extend protocols for non-model genomic and transcriptomic analysis.

Research direction 2: Integrate and extend existing workflow and data analysis software into a cloud-enabled deployment system with a Web interface for executing protocols.

Research direction 3: Investigate and develop a distributed graph database system and distributed query functionality to support distributed data-driven discovery. (DDDD :)

For more of the background, see my full award submission, my presentation, and a science fiction story that I would like to enable.

--titus

## October 17, 2014

### Juan Nunez-Iglesias

We’re finally ready to wrap up this topic. By now you can:

But, much as exercise is wasted if your bathroom scale doesn’t automatically tweet about it, all this effort is for naught if visitors to your GitHub page can’t see it!

Most high-profile open-source projects these days advertise their CI efforts. Above, I cheekily called this showing off, but it’s truly important: anyone who lands on your GitHub page is a potential user or contributor, and if they see evidence that your codebase is stable and well-tested, they are more likely to stick around.

You’ll get an overlay, with a pull-down menu for all the different options for getting the badge. You can grab the image URL, or you can directly grab the Markdown to put into your markdown-formatted README, or a bunch of other options, including RST:

Meanwhile, on your repository’s Coveralls page, on the right-hand side, you will find another pull-down menu with the appropriate URLs:
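As a sketch, the two Markdown badge lines for a hypothetical myuser/myrepo would look something like this (URL patterns as served by Travis and Coveralls in late 2014; your own repository's pages show the exact ones to copy):

```markdown
[![Build Status](https://travis-ci.org/myuser/myrepo.svg?branch=master)](https://travis-ci.org/myuser/myrepo)
[![Coverage Status](https://coveralls.io/repos/myuser/myrepo/badge.png?branch=master)](https://coveralls.io/r/myuser/myrepo?branch=master)
```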

And you’re done! You’ve got automated tests, tons of test coverage, you’re running everything correctly thanks to configuration files, and all this is getting run on demand thanks to Travis and Coveralls. And thanks to badging, the whole world knows about it:

### William Stein

#### A Non-technical Overview of the SageMathCloud Project

What problems is the SageMathCloud project trying to solve? What pain points does it address? Who are the competitors and what is the state of the technology right now?

## What problems you’re trying to solve and why are these a problem?

• Computational Education: How can I teach a course that involves mathematical computation and programming?
• Computational Research: How can I carry out collaborative computational research projects?
• Cloud computing: How can I get easy user-friendly collaborative access to a remote Linux server?

## What are the pain points of the status quo and who feels the pain?

• Student/Teacher pain:
• Getting students to install software needed for a course on their computers is a major pain; sometimes it is just impossible, because no major math software (not even Sage) supports all recent versions of Windows/Linux/OS X/iOS/Android.
• Getting university computer labs to install the software you need for a course is frustrating and expensive (time and money).
• Even if computer labs worked, they are often being used by another course, stuffy, and students can't possibly do all their homework there, so computation gets short shrift. Lab keyboards, hardware, etc., all hard to get used to. Crappy monitors.
• Painful confusing problems copying files around between teachers and students.
• Helping a student or collaborator with their specific problem is very hard without physical access to their computer.
• Researcher pain:
• Making backups every few minutes of the complete state of everything when doing research is often hard and distracting, but important for reproducibility.
• Copying around documents, emailing or pushing/pulling them to revision control is frustrating and confusing.
• Installing obscure software is frustrating and distracting from the research they really want to do.
• Everybody:
• It is frustrating not having LIVE working access to your files wherever you are. (Dropbox/Github doesn't solve this, since files are static.)
• It is difficult to leave computations running remotely.

## Why is your technology poised to succeed?

• When it works, SageMathCloud solves every pain point listed above.
• The timing is right, due to massive improvements in web browsers during the last 3 years.
• I am on full sabbatical this year, so at least success isn't totally impossible due to not working on the project.
• I have been solving the above problems in less scalable ways for myself, colleagues and students since the 1990s.
• SageMathCloud has many users that provide valuable feedback.
• We have already solved difficult problems since I started this project in Summer 2012 (and launched first version in April 2013).

There are no competitors with a similar range of functionality. However, there are many webapps that have some overlap in capabilities:
• Mathematical overlap: Online Mathematica: "Bring Mathematica to life in the cloud"
• Python overlap: Wakari: "Web-based Python Data Analysis"; also PythonAnywhere
• Latex overlap: ShareLaTeX, WriteLaTeX
• Web-based IDE's/terminals: target writing webapps (not research or math education): c9.io, nitrous.io, codio.com, terminal.com
• Homework: WebAssign and WebWork
Right now, SageMathCloud gives away for free far more than any other similar site, due to very substantial temporary financial support from Google, the NSF and others.

## What’s the total addressable market?

Though our primary focus is the college mathematics classroom, there is a larger market:
• Students: all undergrad/high school students in the world taking a course involving programming or mathematics
• Teachers: all teachers of such courses
• Researchers: anybody working in areas that involve programming or data analysis
Moreover, the web-based platform for computing that we're building lends itself to many other collaborative applications.

## What stage is your technology at?

• The site is up and running and has 28,413 monthly active users
• There are still many bugs
• I have a precise todo list that would take me at least 2 months fulltime to finish.

## Is your solution technically feasible at this point?

• Yes. It is only a matter of time until the software is very polished.
• Moreover, we have compute resources to support significantly more users.
• But without money (from paying customers or investment), if growth continues at the current rate then we will have to clamp down on free quotas for users.

## What technical milestones remain?

• Infrastructure for creating automatically-graded homework problems.
• Fill in lots of details and polish.

## Do you have external credibility with technical/business experts and customers?

• Technical experts: I founded the Sage math software, which is 10 years old and relatively well known by mathematicians.
• Customers: We have no customers; we haven't offered anything for sale.

## Market research?

• I know about math software and its users as a result of founding the Sage open source math software project, NSF-funded projects I've been involved in, etc.

## Is the intellectual property around your technology protected?

• The IP is software.
• The website software is mostly new Javascript code we wrote that is copyright Univ. of Washington and mostly not open source; it depends on various open source libraries and components.
• The Sage math software is entirely open source.

## Who are the team members to move this technology forward?

I am the only person working on this project fulltime right now.
• Everything: William Stein -- UW professor
• Browser client code: Jon Lee, Andy Huchala, Nicholas Ruhland -- UW undergrads
• Web design, analytics: Harald Schilly -- Austrian grad student
• Hardware: Keith Clawson

## Why are you the ideal team?

• We are not the ideal team.
• If I had money maybe I could build the ideal team, leveraging my experience and connections from the Sage project...

## October 16, 2014

### Jake Vanderplas

(Or, Why People Hate Jet – and You Should Too)

I made a little code snippet that I find helpful, and you might too:

In [1]:
import numpy as np
import matplotlib.pyplot as plt

def grayify_cmap(cmap):
    """Return a grayscale version of the colormap"""
    cmap = plt.cm.get_cmap(cmap)
    colors = cmap(np.arange(cmap.N))

    # convert RGBA to perceived greyscale luminance
    # cf. http://alienryderflex.com/hsp.html
    RGB_weight = [0.299, 0.587, 0.114]
    luminance = np.sqrt(np.dot(colors[:, :3] ** 2, RGB_weight))
    colors[:, :3] = luminance[:, np.newaxis]

    return cmap.from_list(cmap.name + "_grayscale", colors, cmap.N)


What this function does is give you a luminance-correct grayscale version of any matplotlib colormap. I've found this useful for quickly checking how my plots might appear if printed in black and white, but I think it's probably even more useful for stoking the flame of the internet's general rant against jet.

If you want to take a step toward joining the in-crowd of chromatically-sensitive data viz geeks, your best bet is to start by bashing jet. Even if you don't know it by name, I can guarantee that if you've read many scientific papers, you've seen jet before. For example, here's a snapshot of a plot from a neuroscience journal which is skewered by an appropriately ranty blogpost on the subject:

Jet is the default colormap originally used by MATLAB, and this default was inherited in the early days of Python's matplotlib package. The reasons not to use jet are numerous, and you can find good arguments against it across the web. For some more subdued and nuanced arguments, I'd start with the paper Rainbow Color Map (Still) Considered Harmful and, for more general visualization tips, Ten Simple Rules for Better Figures.

So what do I have to add to this discussion that hasn't been already said? Well, nothing really, except the code snippet I shared above. Let me show you what it does.

## Taking the Color Out of Jet

Let's start by defining some data and seeing how it looks with the default jet colormap:

In [2]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 6)
y = np.linspace(0, 3)[:, np.newaxis]
z = 10 * np.cos(x ** 2) * np.exp(-y)


We'll use matplotlib's imshow command to visualize this. By default, it will use the "jet" colormap:

In [4]:
plt.imshow(z)
plt.colorbar();


At first glance this might look OK. But upon closer examination, you might notice that jet's luminance profile is incredibly complicated. Because your eye has different levels of sensitivity to light of different colors, the luminance is not simply the sum of the RGB values as you might naively expect, but a weighted Euclidean sum of the individual values. You can find more information than you'd ever need to know on imagemagick's website.
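As a quick illustration of that weighted Euclidean sum, here is a minimal stand-alone sketch (the HSP weights are the same ones used in the grayify_cmap snippet above; the function name is mine, and no matplotlib is needed):

```python
import math

# HSP weights for the perceived brightness of the R, G, B channels,
# as used in grayify_cmap above.
RGB_WEIGHT = (0.299, 0.587, 0.114)

def perceived_luminance(r, g, b):
    """Perceived luminance of an RGB color, channels in [0, 1]."""
    return math.sqrt(sum(w * c ** 2 for w, c in zip(RGB_WEIGHT, (r, g, b))))

# Pure green reads far brighter than pure blue, even though the naive
# channel sums are identical -- which is exactly why jet's greens and
# yellows jump out of a plot.
print(round(perceived_luminance(0.0, 1.0, 0.0), 3))  # 0.766
print(round(perceived_luminance(0.0, 0.0, 1.0), 3))  # 0.338
```

Note that white (1, 1, 1) comes out at exactly 1.0, since the three weights sum to one.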

When you take the jet colormap used above and convert it to luminance using the code snippet above, you get this:

In [5]:
plt.imshow(z, cmap=grayify_cmap('jet'))
plt.colorbar();


It's a mess! The greyscale-only version of this colormap has strange luminance spikes in the middle, and makes it incredibly difficult to figure out what's going on in a plot with a modicum of complexity. Much better is to use a colormap with a uniform luminance gradient, such as the built-in grayscale colormap. Let's plot this beside the previous two:

In [6]:
cmaps = [plt.cm.jet, grayify_cmap('jet'), plt.cm.gray]
fig, axes = plt.subplots(1, 3, figsize=(12, 3))

for cmap, ax in zip(cmaps, axes):
    im = ax.imshow(z, cmap=cmap)
    ax.set_title(cmap.name)
    fig.colorbar(im, ax=ax)


In particular, notice that in the left panel, your eye is drawn to the yellow and cyan regions, because the luminance is higher. This can have the unfortunate side-effect of highlighting "features" in your data which may not actually exist!

We can see these luminance spikes more clearly if we look at the color profile of jet up close:

In [7]:
def show_colormap(cmap):
    im = np.outer(np.ones(10), np.arange(100))
    fig, ax = plt.subplots(2, figsize=(6, 1.5),
                           subplot_kw=dict(xticks=[], yticks=[]))
    ax[0].imshow(im, cmap=cmap)
    ax[1].imshow(im, cmap=grayify_cmap(cmap))

show_colormap('jet')


Once you have the grayscale lined-up with the color version, it's easy to point out these luminance spikes in the jet spectrum. By comparison, take a look at the Cube Helix colormap:

In [8]:
show_colormap('cubehelix')


This is a rainbow-like colormap which – by design – has a uniform luminance gradient across its progression of colors. It's certainly not the best choice in all situations, but you could easily argue that it's always a better choice than jet.

## All the Colormaps!

It's useful to see this sort of visualization for all the available colormaps in matplotlib. Here's a quick script that does this:

In [18]:
fig, axes = plt.subplots(36, 6, figsize=(10, 7))
fig.subplots_adjust(hspace=0.1, wspace=0.1)

im = np.outer(np.ones(10), np.arange(100))

cmaps = [m for m in plt.cm.datad if not m.endswith("_r")]
cmaps.sort()

axes = axes.T.ravel()
for ax in axes:
    ax.axis('off')

for cmap, color_ax, gray_ax, null_ax in zip(cmaps, axes[1::3], axes[2::3], axes[::3]):
    del null_ax
    color_ax.set_title(cmap, fontsize=10)
    color_ax.imshow(im, cmap=cmap)
    gray_ax.imshow(im, cmap=grayify_cmap(cmap))


There are some colormaps in here that have very nice, linear luminance gradients, and this is something you should keep in mind when choosing your color map.

Much more could be written about choosing an appropriate color map for any given data; for a more in-depth discussion of matplotlib's maps (and some interesting luminance illustrations), you can refer to matplotlib's choosing a colormap documentation.

If you're interested in streamlined statistical plotting in Python with well thought-out default color choices, I'd suggest taking a look at Michael Waskom's seaborn project, and especially the associated Color Palette Tutorial.

I hope you find this grayify_cmap snippet helpful, and thanks for reading!

This post was written entirely in the IPython notebook. You can download this notebook, or see a static view here.

### William Stein

#### Public Sharing in SageMathCloud, Finally

SageMathCloud (SMC) is a free (NSF, Google and UW supported) website that lets you collaboratively work with Sage worksheets, IPython notebooks, LaTeX documents and much, much more. All work is snapshotted every few minutes, and copied out to several data centers, so if something goes wrong with a project running on one machine (right before your lecture begins or homework assignment is due), it will pop up on another machine. We designed the backend architecture from the ground up to be very horizontally scalable and have no single points of failure.

This post is about an important new feature: You can now mark a folder or file so that all other users can view it, and very easily copy it to their own project.

This solves several problems:
• Problem: You create a "template" project, e.g., with pre-installed software, worksheets, IPython notebooks, etc., and want other users to easily be able to clone it as a new project. Solution: Mark the home directory of the project public, and share the link widely.

• Problem: You create a syllabus for a course, an assignment, a worksheet full of 3d images, etc., that you want to share with a group of students. Solution: Make the syllabus or worksheet public, and share the link with your students via an email and on the course website. (Note: You can also use a course document to share files with all students privately.) For example...

• Problem: You run into a problem using SMC and want help. Solution: Make the worksheet or code that isn't working public, and post a link in a forum asking for help.
• Problem: You write a blog post explaining how to solve a problem and write related code in an SMC worksheet, which you want your readers to see. Solution: Make that code public and post a link in your blog post.
Here's a screencast.

Each SMC project has its own local "project server", which takes some time to start up, and which serves files and coordinates Sage, terminal, and IPython sessions. Public sharing bypasses the project server entirely: it works even when the project server is not running, so it is always fast, with no startup time. Moreover, public sharing reads the live files from your project, so you can update the files in a publicly shared directory, add new files, etc., and users will see these changes (when they refresh; it's not automatic).
As an example, here is the cloud-examples github repo as a share. If you click on it (and have a SageMathCloud account), you'll see this:

## What Next?

There is an enormous amount of natural additional functionality to build on top of public sharing.

For example, not all document types can be previewed in read-only mode right now; in particular, IPython notebooks, task lists, LaTeX documents, images, and PDF files must be copied from the public share to another project before people can view them. It is better to release a first usable version of public sharing before systematically going through and implementing the additional features needed to support all of the above. You can make complicated Sage worksheets with embedded images and 3d graphics, and those can be previewed before copying them to a project.
Right now, the only way to visit a public share is to paste the URL into a browser tab and load it. Soon the projects page will be re-organized so you can search for publicly shared paths, see all public shares that you have previously visited, who shared them, how many +1's they've received, comments, etc.

Also, I plan to eventually make it so public shares will be visible to people who have not logged in, and when viewing a publicly shared file or directory, there will be an option to start it running in a limited project, which will vanish from existence after a period of inactivity (say).

There are also dozens of details that are not yet implemented. For example, it would be nice to be able to directly download files (and directories!) to your computer from a public share. And it's also natural to share a folder or file with a specific list of people, rather than sharing it publicly. If somebody is viewing a public file and you change it, they should likely see the update automatically. Right now when viewing a share, you don't even know who shared it, and if you open a worksheet it can automatically execute Javascript, which is potentially unsafe.  Once public content is easily found, if somebody posts "evil" content publicly, there needs to be an easy way for users to report it.

## Sharing will permeate everything

Sharing has been thought about a great deal during the last few years in the context of sites such as Github, Facebook, Google+ and Twitter. With SMC, we've developed a foundation for interactive collaborative computing in a browser, and will introduce sharing on top of that in a way that is motivated by your problems. For example, as with Github or Google+, when somebody makes a copy of your publicly shared folder, this copy should be listed (under "copies") and it could start out public by default. There is much to do.

One reason it took so long to release the first version of public sharing is that I kept imagining that sharing would happen at the level of complete projects, just like sharing in Github. However, when thinking through your problems, it makes way more sense in SMC to share individual directories and files. Technically, sharing at this level works well for read-only access, but not for read-write access, since projects are mapped to Linux accounts. Another reason I have been very hesitant to support sharing is that I've had enormous trouble over the years with spammers posting content that gets me in trouble (with my University -- it is illegal for UW to host advertisements). However, by not letting search engines index content, the motivation for spammers to post nasty content is greatly reduced.

Imagine publicly sharing recipes for automatically gradable homework problems, which use the full power of everything installed in SMC, get forked, improved, used, etc.

### Juan Nunez-Iglesias

#### Coveralls dashboard

In this series of posts, we’ve covered:

Today I will show you how to continuously check your test coverage using Coveralls.

Travis runs whatever commands you tell it to run in your .travis.yml file. Normally, that’s just installing your program and its requirements, and running your tests. If you wanted instead to launch some nuclear missiles, you could do that. (Assuming you were happy to put the launch keys in a public git repository… =P)

The Coveralls service, once again free for open-source repositories, takes advantage of this: you just need to install an extra piece of software from PyPI, and run it after your tests have passed. Do so by adding the line pip install coveralls to your before_install section, and just the coveralls command to a new after_success section:

language: python
python:
- "2.7"
- "3.4"
before_install:
- pip install pytest pytest-cov
- pip install coveralls
script:
- py.test
after_success:
- coveralls


This will push your coverage report to Coveralls every time Travis is run, i.e., every time you push to your GitHub repository. You will need to turn on the service by:

• going to coveralls.io and signing in with your GitHub credentials
• clicking on “Repositories”, and “Add Repo” (and perhaps on “Sync GitHub Repos”, top right)
• switching your repository’s switch to “On”

That’s it! With just two small changes to your .travis.yml and the flick of a switch, you will get continuous monitoring of your test coverage with each push, and also with each pull request. By default, Coveralls will even comment on your pull requests, so that you instantly know whether someone is not testing their submitted code!

Plus, you get a swanky web dashboard to monitor your coverage!

Tune in tomorrow for the final chapter in continuous integration in Python: showing off! No, it’s not a blog post about blogging about it. =P

(Addendum: the above .travis.yml will run Coveralls twice, once per Python version, so it will also comment twice for your PRs, notify you twice, etc. If you want to test more Python versions, the problem multiplies. Therefore, you can choose to call Coveralls only for a specific Python version, like so:

after_success:
- if [[ $ENV == python=3.4* ]]; then coveralls; fi

This is the approach taken by the scikit-image project.)

## October 15, 2014

### Continuum Analytics news

#### The Leading Enterprise Python Distribution for Data Analytics Is Moving to the Hadoop and Spark Infrastructures

Continuum Analytics, the premier provider of Python-based data analytics solutions and services, today announced the release of Anaconda Cluster. Continuum’s enterprise Python distribution for data analytics, Anaconda Server, is moving to the Hadoop and Spark infrastructures. Anaconda Cluster is a highly scalable cluster resource management tool that lets data scientists streamline deployment without rework.

“We would love to pass off clustering to the engineering and IT team, but we are the only ones that know how to manage it right now.” - A Hadoop/Python user, in response to the pain points that Anaconda Cluster addresses

### Juan Nunez-Iglesias

#### Travis build page

This is the fourth post in my Continuous integration (CI) in Python series, and the one that puts the “continuous” in CI! For the introductory three posts, see:

## Introduction to Travis-CI

Once you’ve set up your tests locally, it does you no good if you don’t remember to run them! Travis-CI makes this seamless, because it will check out your code and run your tests for each and every commit you push to GitHub! (This is even more important when you are receiving pull requests on GitHub: the tests will be run online, without you having to individually check out each PR and run the tests on your machine!) This is what continuous integration is all about.

Once upon a time, the common practice was to pile new features onto a codebase. Then, come release time, there would be a feature freeze, and some time would be spent cleaning up code and removing bugs. In continuous integration, instead, no new feature is allowed into the codebase until it is bug free, as demonstrated by the test suite.
## What to do

You need to first add a .travis.yml file to the root of your project. This tells Travis how to install your program’s dependencies, install your program, and run the tests. Here’s an example file to run tests and coverage on our maths.py sample project:

language: python
python:
- "2.7"
- "3.4"
before_install:
- pip install pytest pytest-cov
script:
- py.test

Pretty simple: tell Travis the language of your project, the Python versions to test (you can specify multiple versions, one per line, to test on multiple Python versions!), and how to install test dependencies. Finally, the command to run your tests. (This can be anything, not just pytest or another testing framework, as long as a shell exit status of 0 means “success” and anything else means “failure”.)

You can read more about the syntax and options for your .travis.yml in the Travis documentation. There are other sections you can add, such as “virtualenv” to set up Python virtual environments, “install” to add compilation and other installation steps for your library before testing, and “after_success” to enable e.g. custom notifications. (We will use this last one in the next post.)

Once you have created .travis.yml for your project, you need to turn it on for your repo. This is, currently, a fairly painful process, and I can’t wait for the day that it’s enabled by default on GitHub. [Update: see below] In the meantime, though:

[Update 2014-10-28: Thanks to @hugovk for pointing out that the first four points above can be skipped. It turns out that when you first log in to Travis-CI using your GitHub account, you give them write access to your webhooks. So, when you add a repo from their end, they go ahead and add themselves on the GitHub end! Boom. Way easier.]

Voilà! Every push and pull-request to your repository will trigger a job on Travis’s servers, building your dependencies and your software, running your tests, and emailing you if anything went wrong!
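The exit-status convention Travis relies on (0 means success, anything else means failure) is easy to see from Python itself; a tiny sketch, independent of any particular test runner:

```python
import subprocess
import sys

# Travis (and every POSIX shell) treats exit status 0 as success and
# any non-zero status as failure; test runners like py.test follow the
# same convention.
ok = subprocess.run([sys.executable, "-c", "pass"]).returncode
failed = subprocess.run([sys.executable, "-c", "raise SystemExit(1)"]).returncode
print(ok, failed)  # 0 1
```

Any command honoring this convention can therefore stand in for py.test in the script section.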
Amazingly, this is completely free for open source projects, so you really have no excuse for not using it! Follow this blog to learn how to continuously check test coverage using Coveralls, coming in the next post!

## October 14, 2014

### Matthieu Brucher

#### Book review: The Prize – The Epic Quest for Oil, Money & Power

Yet another book that a colleague suggested I read. I also discussed it with another colleague who told me the same thing: this is a book that anyone in the oil and gas field should read. And what about people not in this industry?

#### Content and opinions

The book is primarily focused on the big scale, from a geopolitical and strategic point of view, with anecdotes on what happened behind the scenes. But even at the global scale, there are many things in the book that I didn’t know about: how Rockefeller got so rich in the first place, the role of gasoline in World War 1, what drove the two oil shocks…

The book is quite big, and split into five parts, mainly driven by fundamental changes in the approaches to oil. Sometimes it is difficult to follow the main thread of the story, as the author has to go back in time to follow a different side of history, and sometimes the names of the biggest companies are confusing, albeit precise.

The first part reads like a Far West story. It starts in the middle of the 19th century and finishes before World War 1. Basically this is the story of Rockefeller's fortune and the fall of his company, as well as a nice description of what happens when everyone looks for oil without regulation. It is “funny” to compare this to the current rush for shale gas in the US and how it is exploited (i.e. without regard for the global economy).

The second chapter describes World War 1 as well as the Depression that followed. Here, the politicians start noticing the importance of oil, how it played a crucial role in World War 1, and how it was a great help in the recession.
But it also showed that oil needed a minimum price, which is what I think is currently forgotten in shale gas exploitation: the gas companies are sacrificed on the altar of growth. At least Roosevelt understood that the whole economy has to grow, and the oil and gas industry couldn’t be sacrificed…

The third chapter tackles World War 2 and the different theatres of war. Clearly the different opponents knew about the importance of oil, and the winners were those who could manage to keep a steady supply of it. Also mentioned here is an issue we still face: what energy do you use when you don’t have oil anymore?

The fourth chapter is about the growth years before the first oil shock. Clearly the balance then shifted between companies and countries, and even though this part is less interesting in terms of events (there is still the birth of OPEC and the Suez crisis), it brings useful background for the current power balance that is not even hinted at in the media. The corruption of countries by money is also introduced.

The fifth chapter describes the current balance, with the oil sword that was so important during the second oil shock and the war between Israel and the Arab nations. The way it was used is not something that I had linked with the second oil shock at first, although it seems so blatant. Here is also where the Middle East story ends, with the first Iraq War. Of course, the Middle East is actually the central piece of the whole book, and there are so many pieces that gravitate around it in the book that it is sometimes incredible.

There is a small epilogue to sum up what came after the first Iraq war and then the advent of the war on terrorism. This epilogue lacks some depth, but it is really difficult to get perspective on events that are happening while you are writing the book!

#### Conclusion

I thought I understood the story of the 20th century quite well. I was very mistaken.
Even though it is a long book, there is so much to discuss that it is sometimes not enough. It is simple enough that anyone can read it. And everyone definitely should read it.

### Juan Nunez-Iglesias

#### jnuneziglesias

This is the third post in my series on continuous integration (CI) in Python. For the first two posts, see 1: automated tests with pytest, and 2: measuring test coverage.

By now, you’ve got a bunch of tests and doctests set up in your project, which you run with the command:

$ py.test --doctest-modules --cov . --cov-report term-missing


If you happen to have a few script and setup files lying around that you don’t want to test, this might expand to

py.test --doctest-modules --cov . --cov-report term-missing --ignore script.py

As you can see, this is getting cumbersome. You can fix this by adding a setup.cfg file to the root of your project, containing configuration options for your testing. This follows standard INI file syntax:

[pytest]
addopts = --ignore script.py --doctest-modules --cov-report term-missing --cov .

Note that the “ignore” flag only tells pytest not to run that file during testing. You also have to tell the coverage module that you don’t want those lines counted for coverage. Do this in a .coveragerc file, with similar syntax:

[run]
omit = *script.py

Once you’ve done this, you can invoke your tests with a simple, undecorated call to just py.test.

To find out more about your configuration options, see the pytest basic test configuration page, and Ned Batchelder’s excellent .coveragerc syntax guide.

That’s it for this entry of my CI series. Follow this blog for the next two entries, setting up Travis CI and Coveralls.

## October 13, 2014

### Titus Brown

#### The emerging discipline of Data Intensive Biology

Yesterday I gave my third keynote address ever, at the Australasian Genomics Technology Association's annual meeting in Melbourne (talk slides here). On my personal scale of talks, it was a 7 or 8 out of 10: I gave it a lot of energy, and I think the main messages got across, but I ended up conveying a few things that I regretted - in particular, the Twitter feed pointed out that I'd slagged on biologists a few times (slag, v: British informal. Criticize (someone) in an abusive and insulting manner). Absolutely not my intent! I consider myself a biologist, albeit one who works primarily on biological data analysis. So I'm starting to call myself a data-intensive biologist. And I thought now would be a good time to talk about the emerging discipline of Data Intensive Biology.

Before I go on, this post is dedicated to my Australian hosts.
I've had a wonderful time so far, and I've been tremendously impressed with the Australasian genomics and bioinformatics research communities. They've got something great going on here and I look forward to further interactions!

## Who are Data Intensive Biologists?

In my talk, I included the de rigueur picture of a tidal wave, reproduced below. This tidal wave is intended to represent the data deluge in biology, the combination of lots of -omics and sensor data that is starting to hit the field of biology.

If you look closely at the picture, you'll see three groups of researchers.

• the inland researchers, who are up on the shore well away from the boardwalk. They are looking at the tidal wave, wondering if the water's going to reach them; they're far away from it, though, so they're not really too worried about it.
• the boardwalk researchers, who are on the little walkway at the base of the crashing wave. Some of them are looking at the wave in horror, aware that this is going to be painful; others are busy putting on little lifevests, in the vain hope that this will save them; and the third group are looking the other way, wondering what everyone is talking about.
• the surfer dudes and dudettes, who are having the time of their lives, surfing down the face of the wave. Every now and then they fall off, but the water's deep and they can get right back on the board.

(OK, it's an imperfect analogy.) The surfers are the data intensive biologists: they love the data, they love the thrill of data analysis, and they're going somewhere fast, although maybe not in a forward direction.

## The anatomy of a Data Intensive Biologist

In my ABiC 2014 keynote (keynote #2), I listed out five character types that participate in bioinformatics. These were:

1. Computer scientists
2. Software engineers
3. Data scientists
4. Statisticians
5. Biologists

(I missed a 6th, database maintainers, and possibly a 7th, data curators.)
In another miscommunication, I meant to say (but did not say during my talk) that almost every effective bioinformatics researcher is some linear combination of these seven characters. I think that data intensive biologists can be defined on this set of axes, too. So: data intensive biologists are biology researchers who are focused on biological questions, but who have substantial footings in many or all of the other fields above. That is, their focus is on making biological progress, but they are using tools from computer science, software engineering, data science, statistics, and databases to study biology.

## Some additional characteristics

Data Intensive Biologists:

• are usually well grounded in at least one field of biology;
• understand that data is not information, but that it's a darn good start;
• know that most of our current biological knowledge is limited or wrong, but that we've got to rely on it anyway;
• are aware that investing in automation is sometimes repaid 10x in efficiency, except when it's not;
• realize that reproducible computational analyses are a great idea;
• write software when they have to, but only because they have to;
• think data science training is neat, because it gives an enduring competitive edge;
• constantly rebalance themselves between the CS, software engineering, data science, and stats perspectives -- almost as if they were surfing across those fields;
• get that open science is obvious if not always practical;
• are confused as to why biology graduate programs aren't teaching more data analysis;

and, finally, data intensive biologists

• shouldn't worry about their career options.

## Concluding thoughts

If you had to nail down a definition - people like to do that, for some reason :) - I would go with:

Data intensive biology: research focused on addressing biological questions primarily through large scale data analysis or integration.

I don't have any interest in being exclusionary with this definition.
If you're tackling biological questions in any way, shape, or form, and you're using lots of data to do it, you fit my definition! Oh, and by the way? There are already workshops and faculty positions in this area. Although they're all at Hopkins, so maybe the best you can say is that James Taylor and I agree on terminology :).

--titus

## October 10, 2014

### Titus Brown

#### How good is MEGAHIT?

A few weeks back, Nick Loman (via Manoj Samanta) brought MEGAHIT to our attention on Twitter. MEGAHIT promises "an ultra-fast single-node solution for large and complex metagenome assembly", and the authors provide a preprint and some open source software. This is a topic near and dear to my heart (see Pell et al., 2012 and Howe et al., 2014), so I was immediately interested - especially since the paper used our Howe et al. data set to prove out their results. (The twitterati also pointed out that the preprint engaged in some bashing of this previous work, presumably to egg me on. ;)

So I thought, heck! Let's take MEGAHIT out for a spin! So my postdoc Sherine Awad and I tried it out.

tl;dr? MEGAHIT seems pretty awesome to me, although IDBA and SPAdes still seem to beat it by a bit on the actual assembly results.

## Installing MEGAHIT

We ran into some small compilation problems but got it working on an Ubuntu 12.04 system easily enough. Running it was also a snap. It took a few minutes to work through the required options, and voila, we got it running and producing results. (You can see some example command lines here.)

First question --

## How does it do on E. coli?

One of the claims made in the paper is that this approach performs well on low-coverage data. To evaluate this, I took a 5m read subset from the E. coli MG1655 dataset (see Chitsaz et al., 2011) and further subsetted it to 1m reads and 500k reads, to get (roughly) 100x, 20x, and 10x data sets.
I then ran MEGAHIT with default parameters, specifying 1 GB of memory and limiting only the upper k size used (because otherwise it crashed) -- again, see the Makefile. For comparison, I also ran SPAdes on the lowest-coverage data, looking only at the contigs (not the scaffolds). After it was all done assembling, I ran QUAST on the results.

Measure | 100x | 20x | 10x | 10x (SPAdes)
--- | --- | --- | --- | ---
N50 | 73736 | 52352 | 9067 | 18124
Largest alignment | 221kb | 177kb | 31kb | 62kb
bp in contigs > 1kb | 4.5mb | 4.5mb | 4.5mb | 4.5mb
Genome fraction | 98.0% | 98.0% | 97.4% | 97.9%
Misassembled length | 2kb | 40.8kb | 81.3kb | 63.6kb

(Data: MEGAHIT 100x, 20x, and 10x; and SPAdes 10x.)

In summary, it does pretty well - even with pretty low coverage, you're getting 97.4% of the genome in contigs > 500bp (QUAST's default cutoff). Misassemblies grow significantly at low coverage, but you're still only at 2% in misassembled contigs.

In comparison to SPAdes at low coverage, the results are OK also. SPAdes performs better in every category, which I would expect -- it's a great assembler! -- but MEGAHIT performs well enough to be usable. MEGAHIT is also much, much faster - seconds vs minutes.

Next question --

## How well does it do on a metagenomic data set?

Sherine has been running benchmarks for Velvet, SPAdes, and IDBA on the data set from Shakya et al., 2013, a mock community data set. So I asked her to add some MEGAHIT results. She did quality trimming as specified in Kalamazoo, and ran MEGAHIT with 10 GB of RAM. She then used QUAST to evaluate the results against the known good genomes.

Measure | MEGAHIT | SPAdes | IDBA
--- | --- | --- | ---
# contigs > 1kb | 19,982 | 16,387 | 16,191
length in contigs > 1kb | 190.7mb | 192.5mb | 194.8mb
# misassemblies | 698 | 894 | 1223
bp in misassemblies | 12.7mb | 28.2mb | 21.8mb
Metagenome fraction | 89.96% | 90.42% | 90.97%

Again, the answer is "MEGAHIT works pretty well." Fewer misassemblies, but also more contigs and a bit less coverage of the known genome.

Third question --

## How fast and memory efficient was MEGAHIT?

Very.
We didn't actually measure it, but, like, really fast. And low memory, also. We're doing systematic benchmarking on this front for our own paper, and we'll provide details as we get them. (We didn't measure MEGAHIT's performance because we don't have numbers for SPAdes and IDBA yet. We didn't measure SPAdes and IDBA yet because actually doing the benchmarking well is really painful - they take a long time to run. 'nuff said :)

## So, what are your conclusions?

So far, +1. Highly recommended to people who are good at command line stuff and general all-around UNIX-y folk. I'd want to play around with it a bit more before strongly recommending it to anyone who wasn't a seasoned bioinformatician. It's rough around the edges, and I haven't looked at the code much yet. It also breaks in various edge cases, but at least it's pretty robust when you just hand it a straight-up FASTQ file!

That having been said, it works shockingly well and is quite fast and memory efficient. If you're having trouble achieving an assembly any other way, I would definitely recommend investing the time to try out MEGAHIT.

--titus

p.p.s. Thanks to Rayan Chikhi and Lex Nederbragt for reading and commenting on a draft version of this post!

## Appendix: MEGAHIT and digital normalization

In the MEGAHIT paper, they commented that they believed that digital normalization could lead to loss of information. So I thought I'd compare MEGAHIT on 100x against MEGAHIT and SPAdes running on digitally normalized 100x:

Measure | 100x | DN (w/MEGAHIT) | DN (w/SPAdes)
--- | --- | --- | ---
N50 | 73736 | 82753 | 132872
Largest alignment | 221kb | 222kb | 224kb
bp in contigs > 1kb | 4.5mb | 4.5mb | 4.6mb
Genome fraction | 98.0% | 98.1% | 98.2%
Misassembled length | 2kb | 120kb | 48kb

(Data: MEGAHIT 100x, MEGAHIT DN, and SPAdes DN.)

The short version is, I don't see any evidence that diginorm leads to incompleteness, but clearly diginorm leads to lots of misassemblies when used in conjunction with MEGAHIT or SPAdes on high-coverage genomes.
(We have some (ok, lots of) evidence that this doesn't happen with lower-coverage genomes, or metagenomes.) That having been said, it's clearly rather assembler-specific, since SPAdes does a much better job than MEGAHIT on the diginormed data.

The shorter version? You probably won't need to use diginorm with MEGAHIT, and you shouldn't, and that's OK. (There are lots of reasons why you shouldn't use diginorm.) I still don't have any evidence that diginorm drops information in non-polyploid situations. Let me know if you've seen this happen!

## Appendix II: Running your own evaluation

All of the E. coli numbers above are available in the 2014-megahit-evaluation github repo. See README.md in that repo for basic install instructions, and the Makefile for what I ran and how to run it. Feel free to reproduce, extend, and update!

## October 09, 2014

### Continuum Analytics

#### Tips for More Effective Numba Usage

Last week, I wrote about the features Numba has and what our long-term development goals are. Today, I want to dig into the details of how to use Numba more effectively on your programs.

## October 07, 2014

### Matthieu Brucher

#### Book review: Fundamentals of Reservoir Engineering

I worked for a long time in the seismic department of my company, and switched to the reservoir department only last year. The problems that are tackled are quite different, and the way they are solved as well -- so this has nothing to do with the book I reviewed a long time ago. After two trainings in reservoir simulation, I also read this book, which a colleague of mine labeled as the reference book.

#### Content and opinions

The book has a nice progression in complexity. Although it won't tackle anything like model generation (that is someone else's job, after all), it addresses the basic questions that a reservoir engineer has to ask himself. The progression is also close to what I had during my reservoir engineering training.
The first four chapters are an introduction to the basic physics: how to compute the volume of oil in the reservoir, how the reservoir works, how much you can get out of it, the basic thermodynamics of oil (the black oil model), and Darcy's law. The next four chapters deal with wells and pressure, which are the most important things in the job. The main values you can get to describe your reservoir come from well tests and the analysis of their results. They may seem basic, with a lot of approximations, but they are still the first maths you do when appraising a reservoir -- 40 years later!

The last two chapters are a little bit different, but equally important. I didn't think much of aquifers before I realized that they are the main way of recovering oil... Without them, you quite rapidly have to use enhanced techniques. And they behave quite differently depending on their size and their location. The last chapter deals with relative permeability. It's something I found "missing" in the previous chapters, as it was always mentioned during my trainings; the last chapter tries to cover everything known about the topic.

#### Conclusion

Although the book is quite old and reservoir simulation has made tremendous progress since, the basic approaches are still used today, and this is what the book is about. The placement of the exercises, and the fact that they are solved immediately afterwards, is also well appreciated.

## October 04, 2014

### Titus Brown

#### Putting together an online presence for a diffuse academic community - how?

I would like to build a community site. Or, more precisely, I would like to recognize, collect, and collate information from an already existing but rather diffuse community. The focus of the community will be academic data science, or "data driven discovery".
This is spurred largely by the recent selection of the Moore Data Driven Discovery Investigators, as well as the earlier Moore and Sloan Data Science Environments, and more broadly by the recognition that academia is broken when it comes to data science.

So, where to start? For a variety of reasons -- including the main practical one, that most academics are not terribly social-media integrated and we don't want to try to force them to learn new habits -- I am focusing on aggregating blog posts and Twitter. So, the main question is: how can we most easily collect and broadcast blog posts and articles via a Web site? And how well can we integrate with Twitter?

## First steps and initial thoughts

Following Noam Ross's suggestions in the above storify, I put together a WordPress blog that uses the RSS Multi Importer to aggregate RSS feeds as blog posts (hosted on NFSN). I'd like to set this up for the DDD Investigators who have blogs; those who don't can be given accounts if they want to post something. This site also uses a Twitter feed plugin to pull in tweets from the list of DDD Investigators.

The resulting RSS feed from the DDDI can be pulled into a River of News site that aggregates a much larger group of feeds.

The WordPress setup was fairly easy, and I'm going to see how stable it is (I assume it will be very stable, but shrug, time will tell :). I'm upgrading my own hosting setup, and once that's done, I'll try out River4.

## Next steps and longer-term thoughts

Ultimately, a data-driven-discovery site that has a bit more information would be nice; I could set up a mostly static site, post it on github, authorize a few people to merge, and then solicit pull requests when people want to add their info or feeds.

One thing to make sure we do is track only a portion of feeds for prolific bloggers, so that I, for example, have to tag a post specifically with 'ddd' to make it show up on the group site. This will avoid post overload.
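That per-post tag filter could be sketched as below. This is purely illustrative -- the function name is mine, and the entry dicts only mimic the shape a feed parser like feedparser produces; it is not code from any actual WordPress plugin:

```python
# Hypothetical sketch of the 'ddd'-tag filter described above: only
# posts explicitly tagged 'ddd' make it onto the group site, so
# prolific bloggers don't flood it with unrelated posts.
def ddd_posts(entries, tag="ddd"):
    selected = []
    for entry in entries:
        # Each entry may carry a list of {"term": ...} tag dicts,
        # as feedparser-style entries do; match case-insensitively.
        terms = {t.get("term", "").lower() for t in entry.get("tags", [])}
        if tag in terms:
            selected.append(entry)
    return selected
```

A post with tags `['python', 'DDD']` would pass the filter; an untagged post from the same feed would simply be dropped from the aggregate.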
I'd particularly like to get a posting setup that integrates well with how I consume content. In particular, I read a lot of things via my phone and tablet, and the ability to post directly from there -- probably via e-mail? -- would be really handy. Right now I mainly post to Twitter (and largely by RTing), which is too ephemeral, or I post to Facebook, which is a different audience. (Is there a good e-mail-to-RSS feed? Or should I just do this as a WordPress blog with the postie plug-in?)

The same overall setup could potentially work for a Software Carpentry Instructor community site, a Data Carpentry Instructor community site, trainee info sites for SWC/DC trainees, and maybe also a bioinformatics trainee info site. But I'd like to avoid anything that involves a lot of administration.

## Things I want to avoid

- Public forums.
- Private forums that I have to administer, or that aren't integrated with my e-mail (which is where I get most notifications, in the end).
- Overly centralized solutions; I'm much more comfortable with light moderation ("what feeds do I track?") than anything else.

Thoughts?

--titus

## October 02, 2014

### Continuum Analytics

#### Advanced Features of Conda Part 2

Conda is the package manager that comes with Continuum's Anaconda distribution. Conda makes it easy to install and manage all your packages and environments on Windows, Mac OS X, and Linux. If you are new to conda, I recommend starting at the documentation and watching my presentation from SciPy 2014.

In this post, I will assume that you are already familiar with conda and its basic usage, both for installing and building packages. I will look at a few advanced features or tips that may not be well known, even among advanced users of conda. These features will help you with feature discovery and customization, and will allow you to manage your packages and environments in more advanced ways.
In part 1, we looked at --help, configuration, conda update --all, conda list --export and conda create --file, conda clean, and pinning packages. In this post, we will look at some tools that make building packages easier, and some more features that help you manage your environments and use conda more effectively from the command line.

### Juan Nunez-Iglesias

#### jnuneziglesias

(Edit: I initially thought it would be cute to number from 0. But it turns out it becomes rather obnoxious to relate English (first, second, ...) to 0-indexing. So this was formerly volume 1. But everything else remains the same.)

This is the second post in a series about setting up continuous integration for a Python project from scratch. For the first post, see Automated tests with pytest.

After you've written some test cases for a tiny project, it's easy to check what code you have automatically tested. For even moderately big projects, you will need tools that automatically check what parts of your code are actually tested. The proportion of lines of code that are run at least once during your tests is called your test coverage.

For the same reasons that testing is important, measuring coverage is important. Pytest can measure coverage for you with the coverage plugin, found in the pytest-cov package. Once you've installed the extension, a test coverage measurement is just a command-line option away:

~/projects/maths py.test --doctest-modules --cov .
============================= test session starts ==============================
platform darwin -- Python 2.7.8 -- py-1.4.25 -- pytest-2.6.3
plugins: cov
collected 2 items

maths.py .
test_maths.py .
--------------- coverage: platform darwin, python 2.7.8-final-0 ----------------
Name         Stmts   Miss  Cover
--------------------------------
maths            2      0   100%
test_maths       4      0   100%
--------------------------------
TOTAL            6      0   100%

=========================== 2 passed in 0.07 seconds ===========================
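The post doesn't show the contents of the two files being measured, but as a rough sketch, contents consistent with the session above might look like this (the `add` function and both file layouts are my illustration, not from the original series):

```python
# Hypothetical sketch of the two files in the coverage report above;
# the original post doesn't show them, so this is illustrative only.

# --- maths.py: --doctest-modules runs the doctest in the docstring ---
def add(x, y):
    """Return the sum of x and y.

    >>> add(2, 3)
    5
    """
    return x + y

# --- test_maths.py: would begin with `from maths import add` ---
def test_add():
    # pytest collects functions named test_*; the coverage plugin
    # counts which statements execute while they run.
    assert add(2, 3) == 5
```

With every statement in both files executed at least once by the doctest and the test function, the `Cover` column reports 100% for each module, as in the session above.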


(The --cov takes a directory as i