October 22, 2014

Titus Brown

Infrastructure for Data Intensive Biology - a statement of work

Since being chosen as a Moore Foundation Data Driven Discovery Investigator, I've been putting together the paperwork at UC Davis to actually receive the money. Part of that is putting together a budget and a Statement of Work to help guide the conversation between me, Davis, and the Moore Foundation. Here's what I sent Chris Mentzel at Moore:

Title: Infrastructure for Data Intensive Biology

OUTCOME: In support of demonstrating the high level of scientific impact that data scientists deliver through their focus on interdisciplinary data-driven research, funds from this award will be used to better understand gene function in non-model organisms through the development of new workflows and better data sharing technology for large-scale data analysis.

Research direction 1: Develop and extend protocols for non-model genomic and transcriptomic analysis.

Research direction 2: Integrate and extend existing workflow and data analysis software into a cloud-enabled deployment system with a Web interface for executing protocols.

Research direction 3: Investigate and develop a distributed graph database system and distributed query functionality to support distributed data-driven discovery. (DDDD :)

For more of the background, see my full award submission, my presentation, and a science fiction story that I would like to enable.

Comments and pointers welcome!


by C. Titus Brown at October 22, 2014 10:00 PM

October 17, 2014

Juan Nunez-Iglesias

Readme badges

We’re finally ready to wrap up this topic. By now you can:

But, much as exercise is wasted if your bathroom scale doesn’t automatically tweet about it, all this effort is for naught if visitors to your GitHub page can’t see it!

Most high-profile open-source projects these days advertise their CI efforts. Above, I cheekily called this showing off, but it’s truly important: anyone who lands on your GitHub page is a potential user or contributor, and if they see evidence that your codebase is stable and well-tested, they are more likely to stick around.

Badging your README is easy. (You do have a README, don’t you?) In Travis, go to your latest build. Near the top right, click on the “build passing” badge:

Travis-CI badge

You’ll get an overlay, with a pull-down menu for all the different options for getting the badge. You can grab the image URL, or you can directly grab the Markdown to put into your markdown-formatted README, or a bunch of other options, including RST:

Travis-CI badge URLs

Just copy and paste the appropriate code and add it to your README file wherever you please.

Meanwhile, on your repository’s Coveralls page, on the right-hand side, you will find another pull-down menu with the appropriate URLs:

Coveralls badge URLs

Again, just grab whichever URL is appropriate for your needs (I prefer Markdown-formatted READMEs), and add it to your README, usually next to the Travis badge.

And you’re done! You’ve got automated tests, tons of test coverage, you’re running everything correctly thanks to configuration files, and all this is getting run on demand thanks to Travis and Coveralls. And thanks to badging, the whole world knows about it:

Readme badges

by Juan Nunez-Iglesias at October 17, 2014 01:37 PM

William Stein

A Non-technical Overview of the SageMathCloud Project

What problems is the SageMathCloud project trying to solve? What pain points does it address? Who are the competitors and what is the state of the technology right now?

What problems you’re trying to solve and why are these a problem?

  • Computational Education: How can I teach a course that involves mathematical computation and programming?
  • Computational Research: How can I carry out collaborative computational research projects?
  • Cloud computing: How can I get easy user-friendly collaborative access to a remote Linux server?

What are the pain points of the status quo and who feels the pain?

  • Student/Teacher pain:
    • Getting students to install software needed for a course on their computers is a major pain; sometimes it is just impossible, due to no major math software (not even Sage) supporting all recent versions of Windows/Linux/OS X/iOS/Android.
    • Getting university computer labs to install the software you need for a course is frustrating and expensive (time and money).
    • Even if computer labs worked, they are often being used by another course, stuffy, and students can't possibly do all their homework there, so computation gets short shrift. Lab keyboards, hardware, etc., all hard to get used to. Crappy monitors.
    • Painful confusing problems copying files around between teachers and students.
    • Helping a student or collaborator with their specific problem is very hard without physical access to their computer.
  • Researcher pain:
    • Making backups every few minutes of the complete state of everything when doing research often hard and distracting, but important for reproducibility.
    • Copying around documents, emailing or pushing/pulling them to revision control is frustrating and confusing.
    • Installing obscuring software is frustarting and distracting from the research they really want to do.
  • Everybody:
    • It is frustrating not having LIVE working access to your files wherever you are. (Dropbox/Github doesn't solve this, since files are static.)
    • It is difficult to leave computations running remotely.

Why is your technology poised to succeed?

  • When it works, SageMathCloud solves every pain point listed above.
  • The timing is right, due to massive improvements in web browsers during the last 3 years.
  • I am on full sabbatical this year, so at least success isn't totally impossible due to not working on the project.
  • I have been solving the above problems in less scalable ways for myself, colleagues and students since the 1990s.
  • SageMathCloud has many users that provide valuable feedback.
  • We have already solved difficult problems since I started this project in Summer 2012 (and launched first version in April 2013).

Who are your competitors?

There are no competitors with a similar range of functionality. However, there are many webapps that have some overlap in capabilities:
  • Mathematical overlap: Online Mathematica: "Bring Mathematica to life in the cloud"
  • Python overlap: Wakari: "Web-based Python Data Analysis"; also PythonAnywhere
  • Latex overlap: ShareLaTeX, WriteLaTeX
  • Web-based IDE's/terminals: target writing webapps (not research or math education): c9.io, nitrous.io, codio.com, terminal.com
  • Homework: WebAssign and WebWork
Right now, SageMathCloud gives away for free far more than any other similar site, due to very substantial temporary financial support from Google, the NSF and others.

What’s the total addressable market?

Though our primary focus is the college mathematics classroom, there is a larger market:
  • Students: all undergrad/high school students in the world taking a course involving programming or mathematics
  • Teachers: all teachers of such courses
  • Researchers: anybody working in areas that involve programming or data analysis
Moreover, the web-based platform for computing that we're building lends itself to many other collaborative applications.

What stage is your technology at?

  • The site is up and running and has 28,413 monthly active users
  • There are still many bugs
  • I have a precise todo list that would take me at least 2 months fulltime to finish.

Is your solution technically feasible at this point?

  • Yes. It is only a matter of time until the software is very polished.
  • Morever, we have compute resources to support significantly more users.
  • But without money (from paying customers or investment), if growth continues at the current rate then we will have to clamp down on free quotas for users.

What technical milestones remain?

  • Infrastructure for creating automatically-graded homework problems.
  • Fill in lots of details and polish.

Do you have external credibility with technical/business experts and customers?

  • Business experts: I don't even know any business experts.
  • Technical experts: I founded the Sage math software, which is 10 years old and relatively well known by mathematicians.
  • Customers: We have no customers; we haven't offered anything for sale.

Market research?

  • I know about math software and its users as a result of founding the Sage open source math software project, NSF-funded projects I've been involved in, etc.

Is the intellectual property around your technology protected?

  • The IP is software.
  • The website software is mostly new Javascript code we wrote that is copyright Univ. of Washington and mostly not open source; it depends on various open source libraries and components.
  • The Sage math software is entirely open source.

Who are the team members to move this technology forward?

I am the only person working on this project fulltime right now.
  • Everything: William Stein -- UW professor
  • Browser client code: Jon Lee, Andy Huchala, Nicholas Ruhland -- UW undergrads
  • Web design, analytics: Harald Schilly -- Austrian grad student
  • Hardware: Keith Clawson

Why are you the ideal team?

  • We are not the ideal team.
  • If I had money maybe I could build the ideal team, leveraging my experience and connections from the Sage project...

by William Stein (noreply@blogger.com) at October 17, 2014 01:04 PM

October 16, 2014

Jake Vanderplas

How Bad Is Your Colormap?

(Or, Why People Hate Jet – and You Should Too)

I made a little code snippet that I find helpful, and you might too:

In [1]:
def grayify_cmap(cmap):
    """Return a grayscale version of the colormap"""
    cmap = plt.cm.get_cmap(cmap)
    colors = cmap(np.arange(cmap.N))
    # convert RGBA to perceived greyscale luminance
    # cf. http://alienryderflex.com/hsp.html
    RGB_weight = [0.299, 0.587, 0.114]
    luminance = np.sqrt(np.dot(colors[:, :3] ** 2, RGB_weight))
    colors[:, :3] = luminance[:, np.newaxis]
    return cmap.from_list(cmap.name + "_grayscale", colors, cmap.N)

What this function does is to give you a lumninance-correct grayscale version of any matplotlib colormap. I've found this useful for quickly checking how my plots might appear if printed in black and white, but I think it's probably even more useful for stoking the flame of the internet's general rant against jet.

If you want to take a step toward joining the in-crowd of chromatically-sensitive data viz geeks, your best bet is to start by bashing jet. Even if you don't know it by name, I can guarantee that if you've read many scientific papers, you've seen jet before. For example, here's a snapshot of a plot from neuroscience journal which is skewered by an appropriately ranty blogpost on the subject:

Jet is the default colorbar originally used by matlab, and this default was inherited in the early days of Python's matplotlib package. The reasons not to use jet are numerous, and you can find good arguments against it across the web. For some more subdued and nuanced arguments, I'd start with the paper Rainbow Color Map (Still) Considered Harmful and, for more general visualization tips, Ten Simple Rules for Better Figures.

So what do I have to add to this discussion that hasn't been already said? Well, nothing really, except the code snippet I shared above. Let me show you what it does.

Taking the Color Out of Jet

Let's start by defining some data and seeing how it looks with the default jet colormap:

In [2]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 6)
y = np.linspace(0, 3)[:, np.newaxis]
z = 10 * np.cos(x ** 2) * np.exp(-y)

We'll use matplotlib's imshow command to visualize this. By default, it will use the "jet" colormap:

In [4]:

At first glance this might look OK. But upon closer examination, you might notice that jet's Luminance profile is incredibly complicated. Because your eye has different levels of sensitivity to light of different color, the luminance is not simply the sum of the RGB values as you might naively expect, but some weighted Euclidean sum of the individual values. You can find more information than you'd ever need to know on imagemagick's website.

When you take the jet colormap used above and convert it to luminance using the code snippet above, you get this:

In [5]:
plt.imshow(z, cmap=grayify_cmap('jet'))

It's a mess! The greyscale-only version of this colormap has strange luminance spikes in the middle, and makes it incredibly difficult to figure out what's going on in a plot with a modicum of complexity. Much better is to use a colormap with a uniform luminance gradient, such as the built-in grayscale colormap. Let's plot this beside the previous two:

In [6]:
cmaps = [plt.cm.jet, grayify_cmap('jet'), plt.cm.gray]
fig, axes = plt.subplots(1, 3, figsize=(12, 3))

for cmap, ax in zip(cmaps, axes):
    im = ax.imshow(z, cmap=cmap)
    fig.colorbar(im, ax=ax)

In particular, notice that in the left panel, your eye is drawn to the yellow and cyan regions, because the luminance is higher. This can have the unfortunate side-effect of highlighting "features" in your data which may not actually exist!

We can see this Luminance spike more clearly if we look at the color profile of jet up close:

In [7]:
def show_colormap(cmap):
    im = np.outer(np.ones(10), np.arange(100))
    fig, ax = plt.subplots(2, figsize=(6, 1.5),
                           subplot_kw=dict(xticks=[], yticks=[]))
    ax[0].imshow(im, cmap=cmap)
    ax[1].imshow(im, cmap=grayify_cmap(cmap))

Once you have the grayscale lined-up with the color version, it's easy to point out these luminance spikes in the jet spectrum. By comparison, take a look at the Cube Helix colormap:

In [8]:

This is a rainbow-like colormap which – by design – has a uniform luminance gradient across its progression of colors. It's certainly not the best choice in all situations, but you could easily argue that it's always a better choice than jet.

All the Colormaps!

It can be useful to see this sort of visualization for all the available colormaps in matplotlib. Here's a quick script that does this:

In [18]:
fig, axes = plt.subplots(36, 6, figsize=(10, 7))
fig.subplots_adjust(left=0, right=1, bottom=0, top=1,
                    hspace=0.1, wspace=0.1)

im = np.outer(np.ones(10), np.arange(100))

cmaps = [m for m in plt.cm.datad if not m.endswith("_r")]

axes = axes.T.ravel()
for ax in axes:

for cmap, color_ax, gray_ax, null_ax in zip(cmaps, axes[1::3], axes[2::3], axes[::3]):
    del null_ax
    color_ax.set_title(cmap, fontsize=10)
    color_ax.imshow(im, cmap=cmap)
    gray_ax.imshow(im, cmap=grayify_cmap(cmap))

There are some colormaps in here that have very nice, linear luminance gradients, and this is something you should keep in mind when choosing your color map.

Much more could be written about choosing an appropriate color map for any given data; for a more in-depth discussion of matplotlib's maps (and some interesting luminance illustrations), you can refer to matplotlib's choosing a colormap documentation.

If you're interested in streamlined statistical plotting in Python with well thought-out default color choices, I'd suggest taking a look at Michael Waskom's seaborn project, and especially the associated Color Palette Tutorial.

I hope you find this grayify_cmap snippet helpful, and thanks for reading!

This post was written entirely in the IPython notebook. You can download this notebook, or see a static view here.

by Jake Vanderplas at October 16, 2014 03:00 PM

William Stein

Public Sharing in SageMathCloud, Finally

SageMathCloud (SMC) is a free (NSF, Google and UW supported) website that lets you collaboratively work with Sage worksheets, IPython notebooks, LaTeX documents and much, much more. All work is snapshotted every few minutes, and copied out to several data centers, so if something goes wrong with a project running on one machine (right before your lecture begins or homework assignment is due), it will pop up on another machine. We designed the backend architecture from the ground up to be very horizontally scalable and have no single points of failure.

This post is about an important new feature: You can now mark a folder or file so that all other users can view it, and very easily copy it to their own project.

This solves problems:
  • Problem: You create a "template" project, e.g., with pre-installed software, worksheets, IPython notebooks, etc., and want other users to easily be able to clone it as a new project. Solution: Mark the home directory of the project public, and share the link widely.

  • Problem: You create a syllabus for a course, an assignment, a worksheet full of 3d images, etc., that you want to share with a group of students. Solution: Make the syllabus or worksheet public, and share the link with your students via an email and on the course website. (Note: You can also use a course document to share files with all students privately.) For example...

  • Problem: You run into a problem using SMC and want help. Solution: Make the worksheet or code that isn't working public, and post a link in a forum asking for help.
  • Problem: You write a blog post explaining how to solve a problem and write related code in an SMC worksheet, which you want your readers to see. Solution: Make that code public and post a link in your blog post.
Here's a screencast.

Each SMC project has its own local "project server", which takes some time to start up, and serves files, coordinates Sage, terminal, and IPython sessions, etc. Public sharing completely avoids having anything to do with the project server -- it works fine even if the project server is not running -- it's always fast and there is no startup time if the project server isn't running. Moreover, public sharing reads the live files from your project, so you can update the files in a public shared directory, add new files, etc., and users will see these changes (when they refresh, since it's not automatic).
As an example, here is the cloud-examples github repo as a share. If you click on it (and have a SageMathCloud account), you'll see this:

What Next?

There is an enormous amount of natural additional functionality to build on top of public sharing.

For example, not all document types can be previewed in read-only mode right now; in particular, IPython notebooks, task lists, LaTeX documents, images, and PDF files must be copied from the public share to another project before people can view them. It is better to release a first usable version of public sharing before systematically going through and implementing the additional features needed to support all of the above. You can make complicated Sage worksheets with embedded images and 3d graphics, and those can be previewed before copying them to a project.
Right now, the only way to visit a public share is to paste the URL into a browser tab and load it. Soon the projects page will be re-organized so you can search for publicly shared paths, see all public shares that you have previously visited, who shared them, how many +1's they've received, comments, etc.

Also, I plan to eventually make it so public shares will be visible to people who have not logged in, and when viewing a publicly shared file or directory, there will be an option to start it running in a limited project, which will vanish from existence after a period of inactivity (say).

There are also dozens of details that are not yet implemented. For example, it would be nice to be able to directly download files (and directories!) to your computer from a public share. And it's also natural to share a folder or file with a specific list of people, rather than sharing it publicly. If somebody is viewing a public file and you change it, they should likely see the update automatically. Right now when viewing a share, you don't even know who shared it, and if you open a worksheet it can automatically execute Javascript, which is potentially unsafe.  Once public content is easily found, if somebody posts "evil" content publicly, there needs to be an easy way for users to report it.

Sharing will permeate everything

Sharing has been thought about a great deal during the last few years in the context of sites such as Github, Facebook, Google+ and Twitter. With SMC, we've developed a foundation for interactive collaborative computing in a browser, and will introduce sharing on top of that in a way that is motivated by your problems. For example, as with Github or Google+, when somebody makes a copy of your publicly shared folder, this copy should be listed (under "copies") and it could start out public by default. There is much to do.

One reason it took so long to release the first version of public sharing is that I kept imagining that sharing would happen at the level of complete projects, just like sharing in Github. However, when thinking through your problems, it makes way more sense in SMC to share individual directories and files. Technically, sharing at this level works works well for read-only access, not for read-write access, since projects are mapped to Linux accounts. Another reason I have been very hesitant to support sharing is that I've had enormous trouble over the years with spammers posting content that gets me in trouble (with my University -- it is illegal for UW to host advertisements). However, by not letting search engines index content, the motivation for spammers to post nasty content is greatly reduced.

Imagine publicly sharing recipes for automatically gradable homework problems, which use the full power of everything installed in SMC, get forked, improved, used, etc.

by William Stein (noreply@blogger.com) at October 16, 2014 02:29 PM

Juan Nunez-Iglesias

Coveralls dashboard

In this series of posts, we’ve covered:

Today I will show you how to continuously check your test coverage using Coveralls.

Travis runs whatever commands you tell it to run in your .travis.yml file. Normally, that’s just installing your program and its requirements, and running your tests. If you wanted instead to launch some nuclear missiles, you could do that. (Assuming you were happy to put the launch keys in a public git repository… =P)

The Coveralls service, once again free for open-source repositories, takes advantage of this: you just need to install an extra piece of software from PyPI, and run it after your tests have passed. Do so by adding the line pip install coveralls to your before_install section, and just the coveralls command to a new after_success section:

language: python
    - "2.7"
    - "3.4"
    - pip install pytest pytest-cov
    - pip install coveralls
    - py.test
    - coveralls

This will push your coverage report to Coveralls every time Travis is run, i.e., every time you push to your GitHub repository. You will need to turn on the service by:

  • going to coveralls.io and signing in with your GitHub credentials
  • clicking on “Repositories”, and “Add Repo” (and perhaps on “Sync GitHub Repos”, top right)
  • switching your repository’s switch to “On”

Adding a repo to Coveralls

That’s it! With just two small changes to your .travis.yml and the flick of a switch, you will get continuous monitoring of your test coverage with each push, and also with each pull request. By default, Coveralls will even comment on your pull requests, so that you instantly know whether someone is not testing their submitted code!

Plus, you get a swanky web dashboard to monitor your coverage!

Coveralls dashboard

Tune in tomorrow for the final chapter in continuous integration in Python: showing off! No, it’s not a blog post about blogging about it. =P

(Addendum: the above .travis.yml will run Coveralls twice, once per Python version, so it will also comment twice for your PRs, notify you twice, etc. If you want to test more Python versions, the problem multiplies. Therefore, you can choose to call Coveralls only for a specific Python version, like so:

    - if [[ $ENV == python=3.4* ]]; then

This is the approach taken by the scikit-image project.)

by Juan Nunez-Iglesias at October 16, 2014 06:11 AM

October 15, 2014

Continuum Analytics news

The Leading Enterprise Python Distribution for Data Analytics Is Moving to the Hadoop and Spark Infrastructures

Continuum Analytics, the premier provider of Python-based data analytics solutions and services, today announced the release of Anaconda Cluster. Continuum’s enterprise Python distribution for data analytics, Anaconda Server, is moving to the Hadoop and Spark infrastructures. Anaconda Cluster is a highly-scalable cluster resource management tool that lets data scientists streamline deployment without rework. “We would love to pass off clustering to the engineering and IT team, but we are the only ones that know how to manage it right now.” - A Hadoop/Python user in response to the pain points that Anaconda Cluster addresses

by Continuum at October 15, 2014 12:09 PM

Juan Nunez-Iglesias

Travis build page

This is the fourth post in my Continuous integration (CI) in Python series, and the one that puts the “continuous” in CI! For the introductory three posts, see:

Introduction to Travis-CI

Once you’ve set up your tests locally, it does you no good if you don’t remember to run them! Travis-CI makes this seamless, because it will check out your code and run you tests for each and every commit you push to GitHub! (This is even more important when you are receiving pull requests on GitHub: the tests will be run online, without you having to individually check out each PR and run the tests on your machine!)

This is what continuous integration is all about. Once upon a time, the common practice was to pile on new features on a codebase. Then, come release time, there would be a feature freeze, and some time would be spent cleaning up code and removing bugs. In continuous integration, instead, no new feature is allowed into the codebase until it is bug free, as demonstrated by the test suite.

What to do

You need to first add a .travis.yml file to the root of your project. This tells Travis how to install your program’s dependencies, install your program, and run the tests. Here’s an example file to run tests and coverage on our maths.py sample project:

language: python
    - "2.7"
    - "3.4"
    - pip install pytest pytest-cov
    - py.test

Pretty simple: tell Travis the language of your project, the Python version (you can specify multiple versions, one per line, to test on multiple Python versions!), how to install test dependencies. Finally, the command to run your tests. (This can be anything, not just pytest or another testing framework, as long as a shell exit status of 0 means “success” and anything else means “failure”.)

You can read more about the syntax and options for your .travis.yml in the Travis documentation. There are other sections you can add, such as “virtualenv” to set up Python virtual environments, “install” to add compilation and other installation steps for your library, before testing, and “after_success” to enable e.g. custom notifications. (We will use this last one in the next post.)

Once you have created .travis.yml for your project, you need to turn it on for your repo. This is, currently, a fairly painful process, and I can’t wait for the day that it’s enabled by default on GitHub. In the meantime though:

Voilà! Every push and pull-request to your repository will trigger a job on Travis’s servers, building your dependencies and your software, running your tests, and emailing you if anything went wrong! Amazingly, this is completely free for open source projects, so you really have no excuse for not using it!

Travis build page

Follow this blog to learn how to continuously check test coverage using Coveralls, coming in the next post!

by Juan Nunez-Iglesias at October 15, 2014 11:00 AM

October 14, 2014

Matthieu Brucher

Book review: The Prize – The Epic Quest for Oil, Money & Power

Yet another book that my colleague suggested me to read. I also discussed it with another colleague who told me the same: this is a book that anyone in the oil and gas field should read. And what about people not in this industry?

Content and opinions

The book is primarily focused on the big scale, from a geopolitical and strategic point of view, with anecdotes on what happen behind the scenes. But even for the global scale, there are many things in the book that I didn’t know about: how Rockfeller make so rich in the first place, the role of gasoline in World War 1, what drove the 2 oil shocks… The book is quite big, and split in five parts, mainly driven by fundamental changes in the approaches to oil place. Sometimes it is difficult to follow the main thread of the story as the author has to go back in time to follow a different side of History, and also sometimes the name of the biggest companies is confusing, albeit precise.

The first part was like a Far West story. It starts in the middle of the 19th century and finishes before World War 1. Basically this is the story of Rockfeller fortune, the fall of his company, as well as a nice description of what happens when everyone looks for oil without regulation. It is “funny” to compare this to the current rush for gas shale in the US and how it is exploited (i.e. without regard for the global economy).

The second chapter describes World War 1 as well as the Depression that follows. Here, the politicians start noticing the importance of oil, and how it plays a crucial role in World War 1, as well as how it was a great helper in the recession. But it also indicated that oil needed a minimum price, which is what I think is currently forgotten in the gas shale exploitation: the gas companies are sacrified on the altar of growth. At leat Roosevelt understood that all the economy has to grow, and the oil and gas industry couldn’t be sacrified…

The third chapter tackles World War 2, and the different theatres of war. Clearly the different opponents knew about the importance of oil, and also the winners were those who could manage to keep a steady supply of oil. Also here is mentioned the issue we still face: what energy to use when you don’t have oil anymore?

Fourth chapter is about growth years before the first oil shock. Clearly the balance shifted then between companies and countries, and even though this part is less interesting in terms of events (there is still the birth of OPEC and the Suez crisis), it still brings some background for the current power balance that is not even hinted in the medias. Also the corruption of the countries by money is also introduced.

The fifth chapter describes the current balance, with the oil sword that was so important during the second oil shock and the war between Israel and the Arab nations. The way it was used is not something that I linked with the second oil shock at first, although it seems so blatant. Here is also where the Middle-East story ends with the first Iraq War. Of course, the Middle East is actually the central piece of the whole book, and there are so many pieces that gravitated around it int he book that it was sometimes incredible.

There is a small epilogue to sum up what could be after the first Iraq war and then the advent of the war on terrorism. This epilogue is lacking some depth, but it is really difficult to get perspective of events that are happening when you are writing the book!


I thought I understood the story of the 20th century quite well. I was very mistaken. Even though it is a long book, there is so much to discuss that it is sometimes not enough. It is simple enough so that anyone can read it. And everyone should read definitely read it.

by Matthieu Brucher at October 14, 2014 07:11 AM

Juan Nunez-Iglesias


This is the third post in my series on continuous integration (CI) in Python. For the first two posts, see 1: automated tests with pytest, and 2: measuring test coverage.

By now, you’ve got a bunch of tests and doctests set up in your project, which you run with the command:

$ py.test --doctest-modules --cov . --cov-report term-missing

If you happen to have a few script and setup files lying around that you don’t want to test, this might expand to

$ py.test --doctest-modules --cov . --cov-report term-missing --ignore script.py

As you can see, this is getting cumbersome. You can fix this by adding a setup.cfg file to the root of your project, containing configuration options for your testing. This follows standard INI file syntax.

addopts = –ignore script.py –doctest-modules –cov-report term-missing –cov .

Note that the “ignore” flag only tells pytest not to run that file during testing. You also have to tell the coverage module that you don’t want those lines counted for coverage. Do this in a .coveragerc file, with similar syntax:

omit = *script.py

Once you’ve done this, you can invoke your tests with a simple, undecorated call to just py.test.

To find out more about your configuration options, see the pytest basic test configuration page, and Ned Batchelder’s excellent .coveragerc syntax guide.

That’s it for this entry of my CI series. Follow this blog for the next two entries, setting up Travis CI and Coveralls.

by Juan Nunez-Iglesias at October 14, 2014 02:57 AM

October 13, 2014

Titus Brown

The emerging discipline of Data Intensive Biology

Yesterday I gave my third keynote address ever, at the Australasian Genomics Technology Association's annual meeting in Melbourne (talk slides here). On my personal scale of talks, it was a 7 or 8 out of 10: I gave it a lot of energy, and I think the main messages got across, but I ended up conveying a few things that I regretted - in particular, the Twitter feed pointed that I'd slagged on biologists a few times (slag, v: British informal. Criticize (someone) in an abusive and insulting manner). Absolutely not my intent!

I consider myself a biologist, albeit one who works primarily on biological data analysis. So I'm starting to call myself a data-intensive biologist. And I thought now would be a good time to talk about the emerging discipline of Data Intensive Biology.

Before I go on, this post is dedicated to my Australian hosts. I've had a wonderful time so far, and I've been tremendously impressed with the Australasian genomics and bioinformatics research communities. They've got something great going on here and I look forward to further interactions!

Who are Data Intensive Biologists?

In my talk, I included the de rigeur picture of a tidal wave, reproduced below.


This tidal wave is intended to represent the data deluge in biology, the combination of lots of -omics and sensor data that is starting to hit the field of biology. If you look closely at the picture, you'll see three groups of researchers.

  • the inland researchers, who are up on the shore well away from the boardwalk. They are looking at the tidal wave, wondering if the water's going to reach them; they're far away from it, though, so they're not really too worried about it.
  • the boardwalk researchers, who are on the little walkway at the base of the crashing wave. Some of them are looking at the wave in horror, aware that this is going to be painful; others are busy putting on little lifevests, in the vain hope that this will save them; and the third group are looking the other way, wondering what everyone is talking about.
  • the surfer dudes and dudettes, who are having the time of their lives, surfing down the face of the wave. Every now and then they fall off, but the water's deep and they can get right back on the board. (OK, it's an imperfect analogy.)

The surfers are the data intensive biologists: they love the data, they love the thrill of data analysis, and they're going somewhere fast, although maybe not in a forward direction.

The anatomy of a Data Intensive Biologist

In my ABiC 2014 keynote (keynote #2), I listed out five character types that participate in bioinformatics. These were:

  1. Computer scientists
  2. Software engineers
  3. Data scientists
  4. Statisticians
  5. Biologists

(I missed a 6th, database maintainers, and possibly a 7th, data curators.)

In another miscommunication, I meant to say (but did not say during my talk) that almost every effective bioinformatics researcher is some linear combination of these seven characters. I think that data intensive biologists can be defined on this set of axes, too.

So: data intensive biologists are biology researchers who are focused on biological questions, but who have substantial footings in many or all of the other fields above. That is, their focus is on making biological progress, but they are using tools from computer science, software engineering, data science, statistics, and databases to study biology.

Some additional characteristics

Data Intensive Biologists:

  • are usually well grounded in at least one field of biology;
  • understand that data is not information, but that it's a darn good start;
  • know that most of our current biological knowledge is limited or wrong, but that we've got to rely on it anyway;
  • are aware that investing in automation is sometimes repaid 10x in efficiency, except when it's not;
  • realize that reproducible computational analyses are a great idea;
  • write software when they have to, but only because they have to;
  • think data science training is neat, because it gives an enduring competitive edge;
  • constantly rebalance themselves between the CS, software engineering, data science, and stats perspectives -- almost as if they were surfing across those fields;
  • get that open science is obvious if not always practical;
  • are confused as to why biology graduate programs aren't teaching more data analysis;

and, finally, data intensive biologists

  • shouldn't worry about their career options.

Concluding thoughts

If you had to nail down a definition - people like to do that, for some reason :) - I would go with:

Data intensive biology: a researcher focused on addressing biological research questions primarily through large scale data analysis or integration.

I don't have any interest in being exclusionary with this definition. If you're tackling biological questions in any way, shape, or form, and you're using lots of data to do it, you fit my definition!

Oh, and by the way? There are already workshops and faculty positions in this area. Although they're all at Hopkins, so maybe the best you can say is that James Taylor and I agree on terminology :).


by C. Titus Brown at October 13, 2014 10:00 PM

October 10, 2014

Titus Brown

How good is MEGAHIT?

A few weeks back, Nick Loman (via Manoj Samanta) brought MEGAHIT to our attention on Twitter. MEGAHIT promised "an ultra-fast single-node solution for large and complex metagenome assembly" and they provided a preprint and some open source software. This is a topic near and dear to my heart (see Pell et al., 2012 and Howe et al., 2014), so I was immediately interested - especially since the paper used our Howe et al. data set to prove out their results. (The twitterati also pointed out that the preprint engaged in some bashing of this previous work, presumably to egg me on. ;)

So I thought, heck! Let's take MEGAHIT out for a spin! So my postdoc Sherine Awad and I tried it out.

tl; dr? MEGAHIT seems pretty awesome to me, although IDBA and SPAdes still seem to beat it by a bit on the actual assembly results.

Installing MEGAHIT

We ran into some small compilation problems but got it working on an Ubuntu 12.04 system easily enough.

Running it was also a snap. It took a few minutes to work through the required options, and voila, we got it running and producing results. (You can see some example command lines here.)

First question --

How does it do on E. coli?

One of the claims made in the paper is that this approach performs well on low-coverage data. To evaluate this, I took a 5m read subset from the E. coli MG1655 dataset (see Chitsaz et al., 2011) and further subsetted it to 1m reads and 500k reads, to get (roughly) 100x, 20x, and 10x data sets. I then ran MEGAHIT with default parameters, specifying 1 GB of memory, and limiting only the upper k size used (because otherwise it crashed) -- again, see the Makefile.

For comparison, I also ran SPAdes on the lowest-coverage data, looking only at the contigs (not the scaffolds).

After it was all done assembling, I ran QUAST on the results.

Measure 100x 20x 10x 10x (SPAdes)
N50 73736 52352 9067 18124
Largest alignment 221kb 177kb 31kb 62kb
bp in contigs > 1kb 4.5mb 4.5mb 4.5mb 4.5mb
Genome fraction 98.0% 98.0% 97.4% 97.9%
Misassembled length 2kb 40.8kb 81.3kb 63.6kb

(Data: MEGAHIT 100x, 20x, and 10x; and SPAdes 10x.)

In summary, it does pretty well - with even pretty low coverage, you're getting 97.4% of the genome in contigs > 500bp (QUAST's default cutoff). Misassemblies grow significantly at low coverage, but you're still only at 2% in misassembled contigs.

In comparison to SPAdes at low coverage, the results are ok also. SPAdes performs better in every category, which I would expect -- it's a great assembler! - but MEGAHIT performs well enough to be usable. MEGAHIT is also much, much faster - seconds vs minutes.

Next question -

How well does it do on on a metagenomic data set?

Sherine has been running benchmarks for Velvet, SPAdes, and IDBA on the data set from Shakya et al, 2013, a mock community data set. So I asked her to add some MEGAHIT results. She did quality trimming as specified in Kalamazoo, and ran MEGAHIT with 10 GB of RAM. She then used QUAST to evaluate the results against the known good genomes.

# contigs > 1kb 19,982 16,387 16,191
length in contigs >1kb 190.7mb 192.5mb 194.8
# misassemblies 698 894 1223
Bp in misassemblies 12.7mb 28.2mb 21.8mb
Metagenome fraction 89.96% 90.42% 90.97%

Again, the answer is "MEGAHIT works pretty well." Fewer misassemblies, but also more contigs and a bit less coverage of the known genome.

Third question --

How fast and memory efficient was MEGAHIT?

Very. We didn't actually measure it, but, like, really fast. And low memory, also. We're doing systematic benchmarking on this front for our own paper, and we'll provide details as we get them.

(We didn't measure MEGAHIT's performance because we don't have numbers for SPAdes and IDBA yet. We didn't measure SPAdes and IDBA yet because actually doing the benchmarking well is really painful - they take a long time to run. 'nuff said :)

So, what are your conclusions?

So far, +1. Highly recommended to people who are good at command line stuff and general all-around UNIX-y folk. I'd want to play around with it a bit more before strongly recommending it to anyone who wasn't a seasoned bioinformatician. It's rough around the edges, and I haven't looked at the code much yet. It also breaks in various edge cases, but at least it's pretty robust when you just hand it a straight up FASTQ file!

That having been said, it works shockingly well and is quite fast and memory efficient. If you're having trouble achieving an assembly any other way I would definitely recommend investing the time to try out MEGAHIT.


p.p.s. Thanks to Rayan Chikhi and Lex Nederbragt for reading and commenting on
a draft version of this post!

Appendix: MEGAHIT and digital normalization

In the MEGAHIT paper, they commented that they believed that digital normalization could lead to loss of information. So I thought I'd compare MEGAHIT on 100x against MEGAHIT and SPAdes running on digitally normalized 100x:

Measure 100x DN (w/MEGAHIT) DN (w/SPAdes)
N50 73736 82753 132872
Largest alignment 221kb 222kb 224kb
bp in contigs > 1kb 4.5mb 4.5mb 4.6mb
Genome fraction 98.0% 98.1% 98.2%
Misassembled length 2kb 120kb 48kb

(Data: MEGAHIT 100x, MEGAHIT DN, and SPAdes DN.)

The short version is, I don't see any evidence that diginorm leads to incompleteness, but clearly diginorm leads to lots of misassemblies when used in conjunction with MEGAHIT or SPAdes on high-coverage genomes. (We have some (ok, lots) of evidence that this doesn't happen with lower coverage genomes, or metagenomes.) That having been said, it's clearly rather assembler-specific, since SPAdes does a much better job than MEGAHIT on dn data.

The shorter version? You probably won't need to use diginorm with MEGAHIT, and you shouldn't. That's OK. (There are lots of reasons why you shouldn't use diginorm.)

I still don't have any evidence that diginorm drops information in non-polyploid situations. Let me know if you've seen this happen!

Appendix II: Running your own evaluation

All of the E. coli numbers above are available in the 2014-megahit-evaluation github repo. See README.md in that repo for basic install instructions, and Makefile for what I ran and how to run it. Feel free to reproduce, extend, and update!

by C. Titus Brown at October 10, 2014 10:00 PM

October 09, 2014

Continuum Analytics

Tips for More Effective Numba Usage

Last week, I wrote about the features Numba has, and what our long term development goals are. Today, I want to dig into the details of how to more effectively use Numba on your programs.

by Stanley Seibert at October 09, 2014 03:15 PM

October 07, 2014

Matthieu Brucher

Book review: Fundamentals of Reservoir Engineering

I worked for a long time for the seismic department of my company, and switched to the reservoir department only last year. The problems that are tackled are quite different, and the way they are solved as well. So nothing to do with the book I reviewed a long time ago. So after 2 trainings in reservoir simulation, I also read this book that a colleague of mine labeled as the reference book.

Content and opinions

The book has a nice progression in complexity. Although it won’t tackle anything like model generation (it is someone else’s job after all), it tackles the basic questions that a reservoir engineer has to ask himself. The progression is also close to what I had during my reservoir engineering training.

The first four chapters are an introduction to the basic physics, namely how to compute the volume of oil in the reservoir, how does it work, how much can you get out of it, basic thermodynamics of oil (black oil model), or the Darcy law.

Then, the four next chapters deal with wells, pressure, which are the most important things in the job. The main values you can get to describe your reservoir come from well tests and the analysis of the results. They may seem basic, with a lot of approximations, but they are still the first maths you do when appreciating a reservoir! 40 years later…

The last two chapters are a little bit different, but equally important. I didn’t think much of aquifers before I actually realized that they are the main way of recovering oil… Without them, you have to use quite rapidly enhanced techniques. And they behave quite differently depending on their size, their location. The last chapter deals with relative permeability. It’s something that I found “missing” int he previous chapters, as it was always mentioned during my trainings, and then I noticed that the last chapter tries to cover everything they knew about this topic in the last chapter.


Although the book is quite old and although reservoir simulation has made tremendous progress, the basic approaches are still used today, and this is what the book is about. The location of the exercises and the fact that they are solved just after is also well appreciated.

by Matthieu Brucher at October 07, 2014 07:37 AM

October 04, 2014

Titus Brown

Putting together an online presence for a diffuse academic community - how?

I would like to build a community site. Or, more precisely, I would like to recognize, collect, and collate information from an already existing but rather diffuse community.

The focus of the community will be academic data science, or "data driven discovery". This is spurred largely by the recent selection of the Moore Data Driven Discovery Investigators, as well as the earlier Moore and Sloan Data Science Environments, and more broadly by the recognition that academia is broken when it comes to data science.

So, where to start?

For a variety of reasons -- including the main practical one, that most academics are not terribly social media integrated and we don't want to try to force them to learn new habits -- I am focusing on aggregating blog posts and Twitter.

So, the main question is... how can we most easily collect and broadcast blog posts and articles via a Web site? And how well can we integrate with Twitter?

First steps and initial thoughts

Following Noam Ross's suggestions in the above storify, I put together a WordPress blog that uses the RSS Multi Importer to aggregate RSS feeds as blog posts (hosted on NFSN). I'd like to set this up for the DDD Investigators who have blogs; those who don't can be given accounts if they want to post something. This site also uses a Twitter feed plugin to pull in tweets from the list of DDD Investigators.

The resulting RSS feed from the DDDI can be pulled into a River of News site that aggregates a much larger group of feeds.

The WordPress setup was fairly easy and I'm going to see how stable it is (I assume it will be very stable, but shrug time will tell :). I'm upgrading my own hosting setup and once that's done, I'll try out River4.

Next steps and longer-term thoughts

Ultimately a data-driven-discovery site that has a bit more information would be nice; I could set up a mostly static site, post it on github, authorize a few people to merge, and then solicit pull requests when people want to add their info or feeds.

One thing to make sure we do is track only a portion of feeds for prolific bloggers, so that I, for example, have to tag a post specifically with 'ddd' to make it show up on the group site. This will avoid post overload.

I'd particularly like to get a posting set up that integrates well with how I consume content. In particular, I read a lot of things via my phone and tablet, and the ability to post directly from there -- probably via e-mail? -- would be really handy. Right now I mainly post to Twitter (and largely by RTing) which is too ephemeral, or I post to Facebook, which is a different audience. (Is there a good e-mail-to-RSS feed? Or should I just do this as a WordPress blog with the postie plug-in?)

The same overall setup could potentially work for a Software Carpentry Instructor community site, a Data Carpentry Instructor community site, trainee info sites for SWC/DC trainees, and maybe also a bioinformatics trainee info site. But I'd like to avoid anything that involves a lot of administration.

Things I want to avoid

Public forums.

Private forums that I have to administer or that aren't integrated with my e-mail (which is where I get most notifications, in the end).

Overly centralized solutions; I'm much more comfortable with light moderation ("what feeds do I track?") than anything else.



by C. Titus Brown at October 04, 2014 10:00 PM

October 02, 2014

Continuum Analytics

Advanced Features of Conda Part 2

Conda is the package manager that comes with Continuum’s Anaconda distribution. Conda makes it easy to install and manage all your packages and environments on Windows, Mac OS X, and Linux.

If you are new to conda, I recommend starting at the documentation, and watching my presentation from SciPy 2014.

In this post, I will assume that you are already familiar with conda and its basic usage, both for installing and building packages. I will look at a few advanced features or tips that may not be well known, even among advanced users of conda. These features will help you with feature discovery, customization, and allow you to manage your packages and environments in more advanced ways.

In part 1, we looked at --help, configuration, conda update --all, conda list --export and conda create --file, conda clean, and pinning packages. In this post, we will look at some tools that make building packages easier and some more features that help manage your environments and use conda more effectively from the command line.

by Aaron Meurer at October 02, 2014 12:44 PM

Juan Nunez-Iglesias


(Edit: I initially thought it would be cute to number from 0. But it turns out it becomes rather obnoxious to relate English (first, second, …) to 0-indexing. So this was formerly volume 1. But everything else remains the same.)

This is the second post in a series about setting up continuous integration for a Python project from scratch. For the first post, see Automated tests with pytest.

After you’ve written some test cases for a tiny project, it’s easy to check what code you have automatically tested. For even moderately big projects, you will need tools that automatically check what parts of your code are actually tested. The proportion of lines of code that are run at least once during your tests is called your test coverage.

For the same reasons that testing is important, measuring coverage is important. Pytest can measure coverage for you with the coverage plugin, found in the pytest-cov package. Once you’ve installed the extension, a test coverage measurement is just a command-line option away:

 ~/projects/maths $ py.test --doctest-modules --cov .
============================= test session starts ==============================
platform darwin -- Python 2.7.8 -- py-1.4.25 -- pytest-2.6.3
plugins: cov
collected 2 items 

maths.py .
test_maths.py .
--------------- coverage: platform darwin, python 2.7.8-final-0 ----------------
Name         Stmts   Miss  Cover
maths            2      0   100%
test_maths       4      0   100%
TOTAL            6      0   100%

=========================== 2 passed in 0.07 seconds ===========================

(The --cov takes a directory as input, which I find obnoxious, given that py.test so naturally defaults to the current directory. But it is what it is.)

Now, if I add a function without a test, I’ll see my coverage drop:

def sqrt(x):
    """Return the square root of `x`."""
    return x * 0.5

(The typo is intentional.)

--------------- coverage: platform darwin, python 2.7.8-final-0 ----------------
Name         Stmts   Miss  Cover
maths            4      1    75%
test_maths       4      0   100%
TOTAL            8      1    88%

With one more option, --cov-report term-missing, I can see which lines I haven’t covered, so I can try to design tests specifically for that code:

--------------- coverage: platform darwin, python 2.7.8-final-0 ----------------
Name         Stmts   Miss  Cover   Missing
maths            4      1    75%   24
test_maths       4      0   100%   
TOTAL            8      1    88%   

Do note that 100% coverage does not ensure correctness. For example, suppose I test my sqrt function like so:

def sqrt(x):
    """Return the square root of `x`.

    >>> sqrt(4.0)
    return x * 0.5

Even though my test is correct, and I now have 100% test coverage, I haven’t detected my mistake. Oops!

But, keeping that caveat in mind, full test coverage is a wonderful thing, and if you don’t test something, you’re guaranteed not to catch errors. Further, my example above is quite contrived, and in most situations full test coverage will spot most errors.

That’s it for part 1. Tune in next time to learn how to turn on Travis continuous integration for your GitHub projects!

by Juan Nunez-Iglesias at October 02, 2014 08:04 AM

October 01, 2014

Titus Brown

Announcement: Moore Data Driven Discovery Investigator Award

I am very, very happy to announce that I have been selected to be one of the fourteen Moore Data Driven Discovery Investigators.

This is a signal investment by the Moore Foundation into the burgeoning area of data-intensive science, and it is quite a career booster. It will provide my lab with $1.5m over a period of 5 years, enabling us to dive into certain aspects of infrastructure building that we would otherwise be unable to support. I'm also particularly enthusiastic about the implicit recognition of our efforts in open science, open source, and replicability!

My talk and proposal for the competition are linked here:


You can see the list of all fourteen DDD Investigators, along with various and sundry information that includes talk titles, proposals, and Twitter handles, here.


Affiliations and links:

Dr. C. Titus Brown

Visiting Associate Professor, Population Health and Reproduction, UC Davis School of Veterinary Medicine.

Member, UC Davis Data Science Initiative

Member, UC Davis Genome Center

External faculty, BEACON Center for the Study of Evolution in Action.

Member, Software Carpentry Scientific Advisory Board.

Blog: http://ivory.idyll.org/blog/

Twitter: @ctitusbrown

Slightly outdated Lab Web site: http://ged.msu.edu/

by C. Titus Brown at October 01, 2014 10:00 PM

Introducing the Moore Foundation's Data Driven Discovery (DDD) Investigators

Note: the source data for this is available on github at https://github.com/ctb/dddi

Today, the Moore Foundation announced that they have selected fourteen Moore Data Driven Discovery Investigators.

In reverse alphabetical order, they are:

Dr. Ethan White, University of Florida

Proposal: Data-intensive forecasting and prediction for ecological systems

Source code repository: https://github.com/weecology

Google Scholar: http://scholar.google.com/citations?user=pHmra8cAAAAJ&hl=en

ImpactStory: https://impactstory.org/EthanWhite

Blog: http://jabberwocky.weecology.org/

Twitter: @ethanwhite

Dr. Laura Waller, UC Berkeley

Source code repository: https://github.com/Waller-Lab/CalOptrics

Dr. Matt Turk, University of Illinois at Urbana-Champaign

Proposal: Changing How We Conduct Inquiry

Source code repository: http://bitbucket.org/MatthewTurk/

Google Scholar: http://scholar.google.com/citations?user=QTmv2p0AAAAJ&hl=en

Blog: https://plus.google.com/+MatthewTurk

Twitter: @powersoffour

Dr. Blair D. Sullivan, North Carolina State University

Proposal title: Enabling Science via Structural Graph Algorithms

Source code repository: https://github.com/bdsullivan/

Google Scholar: http://scholar.google.com/citations?user=oZBtvZMAAAAJ&hl=en&oi=ao

Twitter: @BlairDSullivan

Dr. Matthew Stephens, University of Chicago

Proposal title: Gene Regulation and Dynamic Statistical Comparisons

Source code repository: https://github.com/stephens999

Google Scholar: http://scholar.google.com/citations?user=qOQFhUkAAAAJ&hl=en

Blog: http://randomdeviation.blogspot.com/

Twitter: @mstephens999

Dr. Amit Singer, Princeton University

Proposal title: Maximum Likelihood Estimation by Semidefinite Programming: Application to Cryo-Electron Microscopy

Google Scholar: http://scholar.google.com/citations?user=h67w8QsAAAAJ&hl=en

Dr. Kimberly Reynolds, UT Southwestern

Proposal title: Decoding the Genome: Finding Effective Variables from Evolutionary Ensembles

Google Scholar: http://scholar.google.com/citations?user=6bWFU7MAAAAJ&hl=en&oi=ao

Dr. Chris Re, Stanford

Google Scholar: http://scholar.google.com/citations?user=DnnCWN0AAAAJ&hl=en

Dr. Laurel Larsen, UC Berkeley

Proposal title: Developing the informationscape approach to environmental change detection.

Dr. Carl Kingsford, Carnegie Mellon University

Proposal title: Lightweight Algorithms for Petabase-scale Genomics

Google Scholar: http://scholar.google.com/citations?user=V_cvqKcAAAAJ

Twitter: @ckingsford

Dr. Jeffrey Heer, U. Washington Seattle

Proposal: Interactive Data Analysis

Google Scholar: http://scholar.google.com/citations?user=vlgs4G4AAAAJ

Twitter: @jeffrey_heer

Dr. Casey Greene, Dartmouth

Proposal title: Learning the context of publicly available genome-wide data

Google Scholar: http://scholar.google.com/citations?user=ETJoidYAAAAJ&hl=en

Twitter: @GreeneScientist

Dr. C. Titus Brown, UC Davis

Proposal: Infrastructure for Data Intensive Biology

Source code repository: http://github.com/ged-lab/

Google Scholar: http://scholar.google.com/citations?user=O4rYanMAAAAJ

ImpactStory: https://impactstory.org/TitusBrown

Blog: http://ivory.idyll.org/blog/

Twitter: @ctitusbrown

Dr. Joshua Bloom, UC Berkeley

Proposal title: Efficient Data-Driven Astrophysical Inquiry with Machine Learning

Google Scholar: http://scholar.google.com/citations?user=fHkUYk0AAAAJ

Blog: http://5nf5.blogspot.com/

Twitter: @profjsb


by C. Titus Brown at October 01, 2014 10:00 PM

Filipe Saraiva

Cantor: new features in KDE 4.14

KDE 4.14 was released in August 2014 but I did not have time to write about new features in Cantor for that release.

So, let’s fix it now!

New backend: Lua

Cantor family of backends have a new member: Lua, using luajit implementation.

This backend have a lot of features: syntax highlighting, tab complete, figures in worksheet, script editor, and more.

Cantor + Lua in action

Lua backend was developed by Lucas Negri, a Brazilian guy, and this is a reason for me to be very happy. Welcome aboard Lucas!

You can read more about this backend in a text of Lucas blog.

Use utf8 on LaTeX entries

When you to export the worksheet to LaTeX, utf8 will be used as default. This improvement was developed by Lucas.

Support to packaging extension in Sage and Octave backends

Now these backends have an assistant to import packages/modules/libraries.

Support to auto run scripts


Auto run scripts/commands in Python 2 backend

Now Python 2, Scilab, Octave, Sage, Maxima, Qalculate, and KAlgebra backends have support to auto run scripts. You can configure a set of scripts or commands and they will run automatically after the worksheet launch!

Add CTRL+Space as alternative default code completion to worksheet

Default code completion command in worksheet is TAB key, but now we have an alternative command too: CTRL + Space. It will maintain consistence between script editor (where the default code completion is CTRL + Space) and worksheet.

Initial support to linear algebra and plot assistants in Python 2

I developed the initial support to 2 amazing plugins in Python 2 backend: the linear algebra plugin and the plot plugin.

First, let’s see the linear algebra plugin. In menu bar go to Linear Algebra > Create Matrix. A window to matrix creation will be open, as below. You must to put the values in the cells.

python3_linearalgebraMatrix creation assistant

After push ‘Ok’ button, the matrix command from numpy  module will be loaded in the worksheet, automatically.

python2_linearalgebra_resultNew matrix created

For now this plugin have implemented just the matrix creation.

Let’s see the plot plugin now. You can use it to create 2D and 3D plot. Let’s to do x = numpy.arange(0.0, 2.0, 0.01) and, in menu bar, go to Graphics > Graphics 2D. The window below will be open.

python2_graphicPloting 2D assistant

You can set some expression to be the Y axis (in this case I am using numpy.sin) and a variable name to X axis (this case, 2 * x * numpy.pi). You could to put just x in variable name to do a plot with the values of x.

After push ‘Ok’ button, the command using pylab will be load in worksheet to make the graphic.

python2_graphic_result3D plotting assistant have a similar way to create the pictures.

How you can see, to use this assistants we need to have some python modules in the workspace, and they must to have the same name used in the plugins. There are a lot of ways to import modules in python environment (import foo; import foo as [anyname]; from foo import *; etc), so to do a generic way to use it is impossible (well, if you have some idea I would like to hear it).

My choice was to import numpy, scipy, matplotlib and pylab when Python 2 backend is loaded by Cantor. Well, I intent to change it because that modules will be mandatory to use Python 2 backend correctly, and pylab is not longer recommended in recent matplotlib version. So, wait for some changes in this plugin soon.

In any case, I would like to hear the opinions of scientific python community about this features.


For now we are working in Cantor port to Qt5/KF5. You can follow the work in ‘frameworks‘ branch on Cantor repository.


If you use or appreciate my work in Cantor or another free software project, please consider to make a donation for me, then I can to continue my contributions to improve Cantor.

You can consider make a donation to KDE too, and help with the maintenance of this great free software community and their products.

by Filipe Saraiva at October 01, 2014 05:58 PM

William Stein

SageMathCloud Course Management

by William Stein (noreply@blogger.com) at October 01, 2014 01:05 PM

Juan Nunez-Iglesias


(Edit: I initially thought it would be cute to number from 0. But it turns out it becomes rather obnoxious to relate English (first, second, …) to 0-indexing. So this was formerly volume 0. But everything else remains the same.)

I just finished the process of setting up continuous integration from scratch for one of my projects, cellom2tif, a simple image file converter/liberator. I thought I would write a blog post about that process, but it has slowly mutated into a hefty document that I thought would work better as a series. I’ll cover automated testing, test coverage, and how to get these to run automatically for your project with Travis-CI and Coveralls.

Without further ado, here goes the first post: how to set up automated testing for your Python project using pytest.

Automated tests, and why you need them

Software engineering is hard, and it’s incredibly easy to mess things up, so you should write tests for all your functions, which ensure that nothing obviously stupid is going wrong. Tests can take a lot of different forms, but here’s a really basic example. Suppose this is a function in your file, maths.py:

def square(x):
    return x ** 2

Then, elsewhere in your package, you should have a file test_maths.py with the following function definition:

from maths import square
def test_square():
    x = 4
    assert square(x) == 16

This way, if someone (such as your future self) comes along and messes with the code in square(), test_square will tell you whether they broke it.

Testing in Python with pytest

A whole slew of testing frameworks, such as nose, will then traverse your project, look for files and functions whose names begin with test_, run them, and report any errors or assertion failures.

I’ve chosen to use pytest as my framework because:

  • it is a strict superset of both nose and Python’s built-in unittest, so that if you run it on projects set up with those, it’ll work out of the box;
  • its output is more readable than nose’s; and
  • its fixtures provide a very nice syntax for test setup and cleanup.

For a quick intro to pytest, you can watch Andreas Pelme’s talk at this year’s EuroPython conference:

But the basics are very simple: sprinkle files named test_something.py throughout your project, each containing one or more test_function() definition; then type py.test on the command line (at your project root directory), and voila! Pytest will traverse all your subdirectories, gather up all the test files and functions, and run your tests.

Here’s the output for the minimal maths project described above:

~/projects/maths $ py.test
============================= test session starts ==============================
platform darwin -- Python 2.7.8 -- py-1.4.20 -- pytest-2.5.2
collected 1 items

test_maths.py .

=========================== 1 passed in 0.03 seconds ===========================

In addition to the test functions described above, there is a Python standard called doctest in which properly formatted usage examples in your documentation are automatically run as tests. Here's an example:

def square(x):
    """Return the square of a number `x`.


    >>> square(5)
    return x ** 2

(See my post on NumPy docstring conventions for what should go in the ellipsis above.)

Depending the complexity of your code, doctests, test functions, or both will be the appropriate route. Pytest supports doctests with the --doctest-modules flag. (This runs both your standard tests and doctests.)

~/projects/maths $ py.test --doctest-modules
============================= test session starts ==============================
platform darwin -- Python 2.7.8 -- py-1.4.20 -- pytest-2.5.2
collected 2 items

maths.py .
test_maths.py .

=========================== 2 passed in 0.06 seconds ===========================

Test-driven development

That was easy! And yet most people, my past self included, neglect tests, thinking they’ll do them eventually, when the software is ready. This is backwards. You’ve probably heard the phrase “Test-driven development (TDD)”; this is what they’re talking about: writing your tests before you’ve written the functionality to pass them. It might initially seem like wasted effort, like you’re not making progress in what you actually want to do, which is write your software. But it’s not:

By spending a bit of extra effort to prevent bugs down the road, you will get to where you want to go faster.

That’s it for volume 0! Watch out for the next post: ensuring your tests are thorough by measuring test coverage.

by Juan Nunez-Iglesias at October 01, 2014 08:40 AM

September 30, 2014

Matthieu Brucher

Book review: Intelligent Systems in Oil Field Development under Uncertainty

How to know whether or not to produce an oil field, and how to know how? The oil industry is used to make a lot of simulations to satisfy the SEC rules, and to know how much oil they can drill today and in the future. This book goes further than just the usual how much from a field under specific condition, by adding the development plan as well as the oil price in the equation.

Content and opinions

The book seems to be based on things Petrobras, the Brazilian national oil company, does, so it’s fair to assume they actually do far more than what is presented in the book.

There are four chapters that make the core of the book, exposing the different algorithms, explaining bits of them (you will still need to dig further to understand what they actually are, and obviously even more to duplicate the results), but also giving lots of examples.

Let’s start with the first two chapters, they are not in the “core” of the book. The first one explains why you need to do uncertainties in the oil world. Main reason: why would you spend billions of dollars when you are not sure you will get them back? The second chapters explains some financial theory. Usual uncertainty analysis (the one for the SEC for instance) don’t involve financial studies. They just require you to give your P1/proven oil and gas reserves (10% chance that you get this oil out of the field). Then for your management, you want to get the P2 (50%). Here, you go one step further, because these 2 values don’t show how much money you will get. So the financial framework is important to understand how you can get to the final step of the simulation.

The third chapter deals with the Machine Learning algorithms that will be used through the book. Several evolutionary algorithms (genetic algorithms and genetic programming) are now mainstream, but some of them are not (cultural algorithms or quantum-inspired GA). It’s always interesting to see new algorithms, especially as they are applied to a concrete problem in one of the biggest industries. Then the authors tackle neural networks, with a simple approach (compared to my previous review), fuzzy logic (I hadn’t heard about it for quite some time) and the mix between the two of them, neuro-fuzzy models.

The fourth chapter is about the exploration process and mainly uses evolutionary algorithms to optimize the development plan, that is the number of wells and their type to optimize the number of oil barrels at the end of the life of the field. It is interesting to see that they don’t simulate an actual complete development plan (when are the wells drilled and then started). Injection of prior knowledge is addressed, and it isn’t always making the result better, interesting to know!

Fifth chapter starts introducing the market behavior by adding the option to activate new parts of the development plan, depending on several parameters. Here, the authors use genetic algorithms and fuzzy models, adding also the simulation of the oil price during the life of the field, ending up with Monte Carlo methods to define the “option” price. I have to say that I didn’t understand everything in their examples, but I understood the global ideas, without knowing much of the financial side. The explanation of how the fuzzy numbers and statistics differ in the end is lacking, although they are quite close. It’s a different approach, with a different theory anyway.

The sixth chapter and the last one in the “core” takes everything together and tries to make it even further. It introduced additional mathematical tools but wasn’t as thorough as the last three. It was still interesting, but obviously taking advantage of this one requires mastering the other ones before.

The last chapter present the tools used in production and the architecture decisions behind. It has less value as you can’t get the tool…


I picked up the book because I’m currently in the oil industry, and I have interests in intelligent systems. Even though I can’t use the tools now, it gives some good leads to build such a system, and hopefully I may work on such systems some day (OK, I’ve already designed the core framework around the reservoir simulator, before I even read the book).

by Matthieu Brucher at September 30, 2014 03:42 PM

Continuum Analytics news

Continuum Analytics Releases Anaconda 2.1

Continuum Analytics, the premier provider of Python-based data analytics solutions and services, announced today the release of the latest version of Anaconda, its free, enterprise-ready collection of libraries for Python.

Anaconda enables big data management, analysis, and cross-platform visualization for business intelligence, scientific analysis, engineering, machine learning, and more. The latest release, version 2.1, adds a new version of the Anaconda Launcher and PyOpenSSL, as well as updates NumPy, Blaze, Bokeh, Numba, and 50 other packages.

by Corinna Bahr at September 30, 2014 08:00 AM

September 28, 2014

Gaël Varoquaux

Improving your programming style in Python

Here are some references on software development techniques and patterns to help write better code. They are intended for the casual programmer, and certainly not an advanced developer.

They are listed in order of difficulty.

Software carpentry


These are the original notes from Greg Wilson’s course on software engineering at the university of Toronto. This course is specifically intended for scientists, but not computer science students. It is very basic and does not cover design issues.

A tutorial introduction to Python


This tutorial is easier to follow than Guido’s tutorial, thought it does not go as much in depth.

Python Essential Reference



These are two chapters out of David Beazley’s excellent book Python Essential Reference. They allow to understand more deeply how python works. I strongly recommend this book to anybody serious about python.

An Introduction to Regular Expressions


If you are going to do any sort of text manipulation, you absolutely need to know how to use regular expressions: powerful search and replace patterns.

Software design for maintainability

My own post

A case of shameless plug: this is a post that I wrote a few years ago. I think that it is still relevant.

Writing a graphical application for scientific programming using TraitsUI


Building interactive graphical application is a difficult problem. I have found that the traitsUI module provides a great answer to this problem. This is a tutorial intended for the non programmer.

An introduction to Python iterators


This article may not be terribly easy to follow, but iterator are a great feature of Python, so this is definitely worth reading.

Functional programming


Functional programming is a programming style where mathematical functions are successively applied to immutable objects to go from the inputs of the program to its outputs in a succession of transformation. It is appreciated by some because it is easy to analyze and prove. In certain cases it can be very readable.

Patterns in Python


This document exposes a few design patterns in Python. Design patterns are solutions to recurring development problems using object oriented programming. I suggest this reading only if you are familiar with OOP.

General Object-Oriented programming advice

Designing Object-oriented code actually requires some care: when you are building your set of abstractions, you are designing the world in which you are going to be condemned to living (or actually coding). I would advice people to keep things as simple as possible, and follow the SOLID principles:


Using decorators to do meta-programming in Python


A very beautiful article for the advanced python user. Meta-programming is a programming technique that involves changing the program at the run-time. This allows to add new abstractions to the code the programmer writes, thus creating a “meta-language”. This article shows this very well.

A Primer on Python Metaclass Programming


Metaclasses allow to define new style of objects, that can have different calling, creation or inheritance rules. This is way over my head, but I am referencing it here for the record.

Iterators in Python


Learn to use the itertools (but don’t abuse them)!

Related to the producer/consumer problem with iterators, see:


by Gaël Varoquaux at September 28, 2014 10:00 PM

September 23, 2014

Continuum Analytics

Introducing Blaze - HMDA Practice

We look at data from the Home Mortgage Disclosure Act, a collection of actions taken on housing loans by various governmental agencies (gzip-ed csv file here) (thanks to Aron Ahmadia for the pointer). Uncompressed this dataset is around 10GB on disk, so we don’t want to load it up into memory with a modern commercial notebook.

Instead, we use Blaze to investigate the data, select down to the data we care about, and then migrate that data into a suitable computational backend.

by Matt Rocklin at September 23, 2014 10:00 AM

September 19, 2014

Gaël Varoquaux

Hiring an engineer to mine large functional-connectivity databases

Work with us to leverage leading-edge machine learning for neuroimaging

At Parietal, my research team, we work on improving the way brain images are analyzed, for medical diagnostic purposes, or to understand the brain better. We develop new machine-learning tools and investigate new methodologies for for quantifying brain function from MRI scans.

One of our important alley of contributions is in deciphering “functional connectivity”: analysis the correlation of brain activity to measure interactions across the brain. This direction of research is exciting because it can be used to probe the neural-support of functional deficits in incapacitated patients, and thus lead to new biomarkers on functional pathologies, such as autism. Indeed, functional connectivity can be computed without resorting to complicated cognitive tasks, unlike most functional imaging approaches. The flip side is that exploiting such “resting-state” signal requires advanced multivariate statistics tools, something at which the Parietal team excels.

For such multivariate processing of brain imaging data, Parietal has an ecosystem of leading-edge high-quality tools. In particular we have built the foundations of the most successful Python machine learning library, scikit-learn, and we are growing a dedicate software, nilearn, that leverages machine-learning for neuroimaging. To support this ecosystem, we have dedicated top-notch programmers, lead by the well-known Olivier Grisel.

We are looking for a data-processing engineer to join our team and work on applying our tools on very large neuroimaging databases to learn specific biomarkers of pathologies. For this, the work will be shared with the CATI, the Fench platform for multicentric neuroimaging studies, located in the same building as us. The general context of the job is the NiConnect project, a multi-organisational research project that I lead and that focuses on improving diagnostic tools on resting-state functional connectivity. We have access to unique algorithms and datasets, before they are published. What we are now missing between those two, and that link could be you.

If you want more details, they can be found on the job offer. This post is to motivate the job in a personal way, that I cannot give in an official posting.

Why take this job?

I don’t expect some to take this job only because it pays the bill. To be clear, the kind of person I am looking for has no difficulties finding a job elsewhere. So, if you are that person, why would you take the job?

  • To join a great team with many experts, focused on finding elegant solutions to hard problems at the intersection of machine learning, cognitive science, and software. Choose to work with great people, knowledgeable, passionate, and fun.
  • To work on interesting problems, that matter. They are interesting because they are challenging but we have the skills to solve them. They matter because they can make brain research better.
  • To learn. NeuroImaging + Machine learning is a quickly growing topic. If you come from a NeuroImaging background and want to add to your CV an actual expertise in machine learning for NeuroImaging. This is the place to be.

What would make me excited in a resume?

  • A genuine experience in neuroimaging data processing, especially large databases.
  • Talent with computers and ideally some Python experience.
  • The unlikely combination of research training (graduate or undergraduate) and experience in a non academic setting.
  • A problem-solving mindset.
  • A good ability to write about neuroimaging and data processing in English: who knows, if everything goes to plan, you could very well be publishing about new biomarkers.

Now if you are interested and feel up for the challenge, read the real job offer, and send me your resume.

by Gaël Varoquaux at September 19, 2014 10:00 PM

September 18, 2014

Continuum Analytics

Numba 0.14 Release and Project Status

Continuing our goal to have a new release of Numba every month, we are celebrating the month of September with Numba version 0.14! Along with a number of bug fixes, it brings support for many new features, including:

  • Support for nearly all of the NumPy math functions (including comparison, logical, bitwise and some previously missing float functions) in nopython mode.
  • The NumPy datetime64 and timedelta64 dtypes are supported in nopython mode with NumPy 1.7 and later.
  • Support for NumPy math functions on complex numbers in nopython mode.
  • ndarray.sum() is supported in nopython mode.
  • Improved warnings and error messages.
  • Support for NumPy record arrays on the GPU.

by Stanley Seibert at September 18, 2014 10:00 AM

September 16, 2014

Continuum Analytics

Introducing Blaze - Migrations

tl;dr Blaze migrates data efficiently between a variety of data stores.

In our last post on Blaze expressions we showed how Blaze can execute the same tabular query on a variety of computational backends. However, this ability is only useful if you can migrate your data to the new computational system in the first place.

To help with this, Blaze provides the into function which moves data from one container type to another.

by Matt Rocklin at September 16, 2014 10:00 AM

Matthieu Brucher

Book review: Machine Learning for Adaptive Many-Core Machines – A Practical Approach

Nice title, surfing on the many core hype, and with a practical approach! What more could one expect from a book on such an interesting subject?

Content and opinions

Let’s start first with the table of contents. What the authors call many-core machines is GPUs. You won’t see anything else, like MPI or OpenMP. One of the central topics is also big data, but expect for the part that GPU can process data for some algorithms faster than a CPU, it’s not really big data. I would call it large datasets, but not big data, as in terabytes of data!

OK, so let’s start with the core review of the book.

The book starts with an introduction on Machine Learning, what will be addressed, and the authors’ GPU library for Machine Learning. If you don’t know anything about GPUs, you will be given a short introduction, but it may not be enough for you to understand the difference between CPUs and GPUs, and why we don’t run everything on GPUs.

The next part tackles supervised learning. Each time, you get a general description of algorithms, then the GPU implementation, and then results and discussions. The authors’ favorite algorithms seem to be neural networks, as everything is compared to these. It’s the purpose of the first chapter in this part, with the classic Back-Propagation algorithm, or the Multiple BP algorithm. They also cover their own tool to automate training, which is nice, as it is usually the issue with neural networks. The next chapter handles missing data and the different kind of missing data. The way it is handled by the library is through an activation mechanism that seems to work, but I don’t know how reliable the training of such NN is, although the results are quite good, so the training system must work. Then we have Support Vector Machines, and curiously, it’s not the “usual” kernel that is mainly used in the book. Also, for once, you don’t the full definition of a kernel, but in a way, you don’t need it for the purpose of this book. The last algorithm of this part is the Incremental Hypersphere Classifier. After a shorter presentation, the authors deal with the experimental setups, and they also chain IHC with SVM. All things considered, this is a small set of supervised learning algorithms. Compared to what you can find in scikit-learn, it is really a small subset.

The third part is about unsupervised and semi-supervised learning. The first algorithm is the Non-Negative Matrix Factorization, presented as a non linear algorithm (when it obviously is), so quite strange. The semi-supervised approach is to divide the set of original features in subsets that have meaning together (eyes, mouth, nose…). Two different implementation are provided in the library, and presented with results in the book. The second chapter is about the current hype in neural networks: deep learning. After the presentation of the basic tool of Deep Belief Networks, the Restricted Boltzmann Machines, you directly jump to implementation and results. I have to say that the fact that the book didn’t describe the architecture of DBNs properly, and it quite difficult to know what DBNs actually are and how they can give such nice results.

The last part consists of the final chapter, a conclusion and a sum-up of the algorithms described in the book.


The book doesn’t display all the code that was developed. I certainly didn’t want it to do that, because the code is available on SourceForge. It was a good description of several machine learning algorithms ported on the GPU, with results and comparisons, but it didn’t live up to the expectation of the book title and the introduction. A GPU is kind of many-cores, but we know that the future many-cores won’t be that kind of cores only. It also falls short for the big data approach.

by Matthieu Brucher at September 16, 2014 07:53 AM

September 14, 2014

Jan Hendrik Metzen

Unsupervised feature learning with Sparse Filtering

An illustration of unsupervised learning of features for images from the Olivetti faces dataset using the sparse filtering algorithm. This work is based on the paper "Sparse Filtering" by the authors Jiquan Ngiam, Pang Wei Koh, Zhenghao Chen, Sonia Bhaskar, and Andrew Y. Ng published in NIPS 2011.

The sparse filtering algorithm does not try to model the data's distribution but rather to learn features which are sparsely activated, in the sense that

  • for each example, only a small subset of features is activated ("Population Sparsity")
  • each feature is only activated on a small subset of he examples ("Lifetime Sparsity")
  • features are roughly activated equally often ("High Dispersal")

This sparsity is encoded as an objective function and L-BFGS is used to minimize this function.

This notebook illustrates the algorithm on the Olivetti faces dataset.

The source code of the sparse-filtering implementation is avalaible at https://github.com/jmetzen/sparse-filtering

In [2]:
import numpy as np
import pylab as plt

%matplotlib inline
In [3]:
# Install with "pip install sparse_filtering"
from sparse_filtering import SparseFiltering

Prepare dataset

This will load the Olivetti faces dataset, normalize the examples (global and local centering), and convert each example into a 2D structure (64*64 pixel image). Thereupon, 25 patches of size 16x16pixels are extracted randomly from each image.

In [4]:
from sklearn.datasets import fetch_olivetti_faces
dataset = fetch_olivetti_faces(shuffle=True)
faces = dataset.data

n_samples, _ = faces.shape

faces_centered = faces - faces.mean(axis=0)  # global centering

faces_centered -= \
    faces_centered.mean(axis=1).reshape(n_samples, -1)  # local centering

faces_centered = \
    faces_centered.reshape(n_samples, 64, 64)  # Reshaping to 64*64 pixel images
plt.imshow(faces_centered[0], cmap=plt.get_cmap('gray'))
_ = plt.title("One example from dataset with n=%s example" % n_samples)
In [5]:
# Extract 25 16x16 patches randomly from each image
from sklearn.feature_extraction.image import extract_patches_2d
patch_width = 16

patches = [extract_patches_2d(faces_centered[i], (patch_width, patch_width),
                              max_patches=25, random_state=i)
              for i in range(n_samples)]
patches = np.array(patches).reshape(-1, patch_width * patch_width)
In [6]:
# Show 25 exemplary patches
plt.figure(figsize=(8, 8))
for i in range(25):
    plt.subplot(5, 5, i+1)
    plt.imshow(patches[i].reshape(patch_width, patch_width), cmap=plt.get_cmap('gray'))
_ = plt.suptitle("25 exemplary extracted patches")

Train Sparse Filtering estimator

The Sparse Filtering estimator is trained on entires dataset and 64 features extractors are learned. Note that this is computationally expensive and might take several minutes. After training, the corresponding features are extracted for the whole dataset.

In [7]:
n_features = 64   # How many features are learned
estimator = SparseFiltering(n_features=n_features, 
                            maxfun=200,  # The maximal number of evaluations of the objective function
                            iprint=10)  # after how many function evaluations is information printed
                                        # by L-BFGS. -1 for no information
features = estimator.fit_transform(patches)

The following graphic illustrates the learned feature detectors. Note that most of these correspond to Gabor-like edge detectors, while some seem to correspond to noise. The latter is potentially either because L-BFGS was stopped before convergence or because the number of features (64) was chosen too large.

In [8]:
plt.figure(figsize=(12, 10))
for i in range(estimator.w_.shape[0]):
    plt.subplot(int(np.sqrt(n_features)), int(np.sqrt(n_features)), i + 1)
    plt.pcolor(estimator.w_[i].reshape(patch_width, patch_width),
               cmap=plt.cm.RdYlGn, vmin=estimator.w_.min(),
    plt.title("Feature %4d" % i)

Activation histogram

We check if the learned feature extractors have the desired sparsity properties:

In [9]:
plt.hist(features.flat, bins=50)
_ = plt.title("Feature activation histogram")

The graphic confirms that features have small activation typically.

In [10]:
activated_features = (features > 0.1).mean(0)
plt.xlabel("Feature activation ratio over all examples")
_ = plt.title("Lifetime Sparsity Histogram")

Lifetime Sparsity: The figure confirms that each feature is only active (i.e., has an activation above 0.1) for a small ratio of the examples (approx. 25%)

In [11]:
activated_features = (features > 0.1).mean(1)
plt.hist(activated_features, bins=10)
plt.xlabel("Ratio of active features in example")
_ = plt.title("Population Sparsity Histogram")

Population Sparsity: Each example activates only few features (approx. 25-30% of the features)

In [12]:
plt.xlabel("Mean Squared Feature Activation")
_ = plt.title("Dispersal Histogram")

High Dispersal: All features have similar statistics; no one feature has significantly more “activity” than the other features. This is checked by plotting the mean squared activations of each feature obtained by averaging the squared values in the feature matrix across the examples.

Learn a Second Layer

Learning a second layer of features using greedy layer-wise stacking on the features produced by the first layer.

In [13]:
n_features2 = 10   # How many features are learned
estimator2 = SparseFiltering(n_features=n_features2, 
                             maxfun=200,  # The maximal number of evaluations of the objective function
                             iprint=10)  # after how many function evaluations is information printed
                                        # by L-BFGS. -1 for no information
features2 = estimator2.fit_transform(features)
In [14]:
plt.figure(figsize=(10, 6))
for i in range(n_features2):
    # Select the top-6 layer-1 features used within the layer-2 features
    indices = np.argsort(estimator2.w_[i])[-6:][::-1]
    for j, ind in enumerate(indices):
        plt.subplot(6, n_features2, j*n_features2 + i + 1)
        plt.pcolor(estimator.w_[ind].reshape(patch_width, patch_width),"
                   cmap=plt.cm.RdYlGn, vmin=estimator.w_.min(),
plt.subplots_adjust(wspace=0.4, hspace=0.01)

Each column of the graphic shows the top-6 layer-1 feature extractors used within one of the 10 layer-2 features. According to the paper, this "discovers meaningful features that pool the first layer features". This, however, is only partially reproducable on the Olivetti dataset.


In summary

  • sparse filtering is a unsupervsied feature learning algorithm that does not try to model the data's distribution but rather to learn features directly which are sparsely activated
  • the learned feature extractors obey specific sparsity conditions (lifetime sparsity, population sparsity, and high dispersal)
  • deep archictectures can be trained using greedy layer-wise stacking
  • the only parameter requiring tuning is the number of features per layer

We will analyse the quality of the learned features in a predictive setting in a follow-up post.

This post was written as an IPython notebook. You can download this notebook.

by Jan Hendrik Metzen at September 14, 2014 09:18 AM

September 11, 2014

Continuum Analytics

Bokeh 0.6 Released!

We are excited to announce the release of version 0.6 of Bokeh, an interactive web plotting library for Python!

This release includes some major new features:

  • Abstract Rendering recipes for large data sets: isocontour, heatmap
  • New charts in bokeh.charts: Time Series and Categorical Heatmap
  • Sophisticated Hands-on Table widget
  • Complete Python 3 support for bokeh-server
  • Much expanded User Guide and Dev Guide
  • Multiple axes and ranges now supported
  • Object query interface to help with plot styling
  • Hit-testing (hover tool support) for patch glyphs

by Bryan Van de Ven at September 11, 2014 11:30 AM

September 09, 2014

Titus Brown

The MHacks IV hackathon - a student report

Note: this hackathon report was written by ArisKnight Winfree.

This past weekend was the University of Michigan’s hackathon, MHacks IV. Over 1100 students from universities and high schools across the United States took over Ann Arbor from September 5th through September 7th to participate in one of the most anticipated hackathons in the country.

A hackathon is an event where students come together to dream up and build amazing things together -- but under a strict time limit. At MHacks, students spent 36 hours straight collaborating with each other and working with mentors (alongside representatives of some of the most competitive companies in the world) to learn new things and solve interesting problems.

Over $15,000 dollars worth of prizes were awarded in MHacks IV.

We had over 21 students representing Michigan State University in Ann Arbor this weekend -- their names are highlighted in bold.

An all Michigan State team that made GoalKeeper ended up winning FOUR prizes: Best URL from Namecheap, Best use of Venmo API, Best use of Ziggeo API, and a prize from IBM for using their hosting platform.

Goalkeeper is a social platform for holding yourself accountable, whether you want to get in shape, establish good study habits, or write the next great American novel. Users establish goals, then post videos showing that they're keeping up, both logging their achievements and encouraging their friends to keep them on track. In addition to peer pressure, users are incentivized to stick with their goals because every failure to check in means the user is expected to donate $5 to charity.

URL: http://challengepost.com/software/goalkeeper

Team Members: Caitlin McDonald, Erin Hoffman

Backseat Driver is a real time manual transmission driving instructor that could be displayed on a Chrysler vehicle's radio/head unit, reading information from the car's OBD II sensors using the OpenXC platform tools from Chrysler. They won awards for most connected car from Chrysler and best use of mobile sensor data from FarmLogs.

URL: http://mhacks-iv.challengepost.com/submissions/26116-backseat-driver

Team Members: Katelyn Dunaski, Michael Ray, Colin Szechy, Riyu Banerjee.

DrinkNoDrive is an app that prevents drinking and driving by monitoring a person’s alcohol use. The app keeps track of your consumption, calculate your BAC, and will order an Uber for you when it’s time to go home. They were awarded the Best Use of the Uber API prize.

URL: http://mhacks-iv.challengepost.com/submissions/26188-drinknodrive

Team Members: Chris McGrath, Drew Laske, Tommy Truong, Anh Nguyen.

First place winner: Power Glove - Power Glove 2.0 is an affordable and customizable virtual reality system, focused on finger and hand movement interaction with the computer.

URL: http://mhacks-iv.challengepost.com/submissions/26338-power-glove-2-0

Second place winner: Android for iPhone - We built a full Android environment running inside your iPhone. We even used the newest version of Android possible, the L developer preview!

URL: http://mhacks-iv.challengepost.com/submissions/26315-android-for-all

Third Place winner: Smash Connect – while Super Smash Bros. 64 is a lot of fun to play with a controller, our project uses your body as a controller. Two Microsoft Kinect cameras are used to [Camera 1] Capture user specific commands (Kick, Punch, Special Move, Jump) and [Camera 2] is used to watch both players fight each other to play the game.

URL: http://mhacks-iv.challengepost.com/submissions/26052-super-smash-bros-64-kinect-controller

List of Companies at Mhacks: Quixey. Mashery. Nest. Thiel fellowship. IBM. Apple. Chrysler. Facebook. Thalmiclabs. Union Pacific. Codility. FarmLogs. Twillio. Ziggeo. Capital One. Microsoft. MLH. Yo. Pillar. Meetup. Sales force. Jawbone. State Farm. Namecheap. Castle. Firebase. Sendgrid. MongoDb, Ebay, Paypal. stackoverflow. Bloomberg. Goldman Sachs. Chrysler. Moxtra. Pillar. Andreessen Horowitz. Uber. Techsmith. Indeed.

The students of Michigan State University were so inspired by the amazing energy and innovation at the University of Michigan’s hackathon (and others) that we are planning on hosting our own student-run hackathon. It is tentatively planned for Spring of 2015.

by ArisKnight Winfree at September 09, 2014 10:00 PM

Continuum Analytics

New Anaconda Launcher - 1.0

We are pleased to announce a new version of the Anaconda Launcher, a graphical user interface that allows Anaconda users to easily discover, install, update, and launch applications with conda.

by David Li at September 09, 2014 12:33 PM

Matthieu Brucher

Book review: It Sounded Good When We Started – A Project Manager’s Guide to Working with People on Projects

There are a lot of books on software project management best practices. But usually, they are not guides to work with people. And it is people who make projects, not money or computers.

Content and opinions

Strangely, the book is not about software projects, its content mainly stems from a hardware project. I find this great, as there are no reasons why people issues are different between different types of project.

I won’t dwell in the details of all chapters of the book, as it is quite long (25 chapters), with no subpart. The basic content of a chapter is an introduction based on stories that happened mainly during a project the authors call Delphi. Based on these stories, the authors talk about what was badly done, then advice on what should have been done, ending in a conclusion for signs to detect before everything goes bad and solutions.

Of course, during the 25 chapters, there are some common good practices, mainly around communication that help solving several issues. But there are enough stories inside that you don’t get bored when reading the same piece of advice twice.


For once, a book on software management (I found it on the IEEE website, so I consider it as such) that can be applied on any project. It’s refreshing, still accurate, although the book is almost a decade old.

by Matthieu Brucher at September 09, 2014 07:24 AM

September 07, 2014

Titus Brown

The NIH #ADDSup meeting - training breakout report

At the NIH ADDS meeting, we had several breakout sessions. Michelle Dunn and I led the training session. For this breakout, we had the following agenda:

First, build "sticky-note clouds", with one sticky-note cloud with notes for each of the following topics:

  1. Desired outcomes of biomedical data science training;
  2. Other relevant training initiatives and connections;
  3. Topics that are part of data science;
  4. Axes of issues (e.g. online training, vs in person?)

Second, coalesce and summarize each topic;

Third, refine concepts and issues;

and fourth, come back with:

  1. three recommended actionable items that NIH can do _now_
  2. three things that NIH should leave to others (but be aware of)

This was not the greatest agenda, because we ended up spending only a very little time at the end talk about specific actionable items, but otherwise it was a great brainstorming session!

You can read the output document here; below, I expand and elaborate on what _I_ think everything meant.

Desired outcomes:

In no particular order,

  • Increased appreciation by biomedical scientists of the role that data science could play;
  • More biomedical scientists biologists using data science tools;
  • More data scientists developing tools for biology;
  • Increased cross-training of data scientists in biology;
  • Map(s) and guidebooks to the data science landscape;
  • Increased communication between biomedical scientists and data scientists;
  • Making sure that all biomed grad students have a basic understanding of data science;
  • Increased data science skills for university staff who provide services

Data science topics list

Data science topics list:

Stat & experimental design, modeling, math, prob, stochastic processing, NLP/text mining, Machine learning, inference, viz for exploration, regression, classification, Graphical models, data integration

CS - principles of metadata, metadata format algorithms, databases, optimization, complexity, programming, scripting, data management, data wrangling, distributed processing, cloud computing, software development, scripting, Workflows, tool usage

Presentation of results, computational thinking, statistical thinking

We arranged these topics on two diagrams, one set on the "data analysis workflow" chart and one set on the "practical to abstract skill" chart.


Patches welcome -- I did these in OmniGraffle, and you can grab the files here and here.

Axes/dichotomies of training:

The underlying question here was, how multidimensional is the biomedical training subspace? (Answer: D is around 10.)

  • Clinical to biomedical to basic research - what topics do you target with your training?
  • Just-in-time vs. background training - do you train for urgent research needs, or for broad background skills?
  • Formal vs. informal - is the training formal and classroom oriented, or more informal and unstructured?
  • Beginner vs. advanced - do you target beginner level or advanced level students?
  • In-person vs. online - do you develop online courses or in-person courses?
  • Short vs. long - short (two-day, two week) or long (semester, longer) or longer (graduate career) training?
  • Centralized physically vs. federated/distributed physically - training centers, to which everyone travels? Or do you fly the trainers to the trainees? Or do you train the trainers and then have them train locally?
  • Andragogy v. pedagogy (Full professor through undergrads) - do your target more senior or more junior people?
  • Project-based vs. structured - do you assign open projects, or fairly stereotyped ones, or just go with pure didactic teaching?
  • Individual learning vs team learning - self-explanatory.

I should say that I, at least, have no strong evidence that any one point on any of these spectra will be a good fit for even a large minority of potential trainees and goals; we're just trying to lay out the options here.

Existing training initiatives and data science connections:


(The first two were from the group; the last three were from the meeting more broadly)

  1. Solicit “crazier” R25s in existing BD2K framework; maybe open up another submission round before April.

    Provide a list of topics and axes from our discussion; fund things like good software development practice, group-project camp for biomedical data driven discovery, development of good online data sets and explicit examples, hackathons, data management and metadata training, software development training, and communication training.

  2. Solicit proposals for training coordination cores (2-3) to gather, catalog, and annotate existing resources, cooperate nationally and internationally, and collaborate on offering training. Also, develop evaluation and assessment criteria that can be applied across workshops.

  3. CTB: fund early stage more broadly; don’t just limit training to graduate students in biomedical fields. For example, the biology grad student of today is the biomedical post doc of tomorrow.

  4. Mercè Crosas: Fund internships, hands-on collaborative contributions in multi-disciplinary Data Science Labs

  5. Leverage T-32 existing by adding req to include data science. (Addendum: there are already supplements available to do this)

  6. Right now we're evaluating education grants individually; it would be good to also have a wide and deep evaluation done across grants.

A few other comments --

Training should cover senior and junior people

About 3% of the NIH budget is spent on training

Also, Daniel Mietchen (via online document comment) said: Consider using open formats. A good example is the Knight Challenge, currently asking for proposals around the future of libraries: https://www.newschallenge.org/challenge/libraries/brief.html

and that's all, folks!


by C. Titus Brown at September 07, 2014 10:00 PM

September 06, 2014

Titus Brown

Recipe: Estimate genome size and coverage from shotgun sequencing data

In Extracting shotgun reads based on coverage in the data set, we showed how to get a read coverage spectrum for a shotgun data set. This is a useful diagnostic tool that can be used to estimate total genome size, average coverage, and repetitive content.

Uses for this recipe include estimating the sequencing depth of a particular library, estimating the repeat content, and generally characterizing your shotgun data.

See Estimating (meta)genome size from shotgun data for a recipe that estimates the single-copy genome size; that recipe can be used for any metagenomic or genomic data set, including ones subjected to whole genome amplification, while this recipe can only be used for data from unamplified/unbiased whole genome shotgun sequencing.

Let's assume you have a genome that you've sequenced to some unknown coverage, and you want to estimate the total genome size, repeat content, and sequencing coverage. If your reads are in reads.fa, you can generate a read coverage spectrum from your genome like so:

load-into-counting.py -x 1e8 -k 20 reads.kh reads.fa
~/dev/khmer/sandbox/calc-median-distribution.py reads.kh reads.fa reads-cov.dist
./plot-coverage-dist.py reads-cov.dist reads-cov.png --xmax=600 --ymax=500

and the result looks like this:


For this (simulated) data set, you can see three peaks: one on the far right, which contains the high-abundance reads from repeats; one in the middle, which contains the reads from the single-copy genome; and one all the way on the left at ~1, which contains all of the highly erroneous reads

Pick a coverage number near the peak that you think represents the single-copy genome - with sufficient coverage, and in the absence of significant heterozygosity, it should be the first peak that's well to the right of zero. Here, we'll pick 100.

Then run

./estimate-total-genome-size.py reads.fa reads-cov.dist 100

and you should see:

Estimated total genome size is: 7666 bp
Estimated size > 5x repeats is: 587 bp
Estimated single-genome coverage is: 109.8

The true values (for this simulated genome) are 7500bp genome size, 500 bp > 5x repeat, and 150x coverage, so this isn't a terribly bad set of estimates. The single-genome coverage estimate is probably low because we're estimating read coverage with a k-mer based approach, which is sensitive to highly erroneous reads.

A few notes:

  • Yes, you could automate the selection of the first peak, but I think it's always a good idea to look at your data!
  • In the case of highly polymorphic diploid genomes, you would have to make sure to include the haploid reads but only count them 50% towards the final genome size.

by C. Titus Brown at September 06, 2014 10:00 PM

September 03, 2014

Titus Brown

The NIH #ADDSup meeting on the next five years of data science at the NIH

Here is a links roundup and some scattered thoughts on the recent meeting on "the next five years of data science at the NIH"; this meeting was hosted by Phil Bourne, the new Associate Director for Data Science at the NIH.

Before I go any further, let me make it clear that I do not speak in any way, shape or form for the NIH, or for Phil Bourne, or anyone else other than myself!

Introduction and commentary:

Phil Bourne took on the role of Associate Director for Data Science at the NIH in March 2014, with the mission of "changing the culture of the NIH" with respect to data science. The #ADDSup meeting was convened on September 3rd with about 50 people attending, after the previous night's dinner and sauna party at Phil's house. (One highlight of dinner was the NIH director, Francis Collins, leading a data science singalong/kumbaya session. I kid you not.) The meeting was incredibly diverse, with a range of academic faculty attending along with representatives from funders, tech companies, biotech, and publishers/data infrastructure folk.

It's very hard to summarize the information-dense discussion in any meaningful way, but notes were taken throughout the meeting if you're interested. I should also note that (like a previous meeting on Software Discovery, #bd2kSDW) tweeting was encouraged with the hashtag #ADDSup -- here's the storify.

I ended up co-leading the training breakout session with Michelle Dunn (NIH), and I am writing a blog post on that separately.

Background links:

  1. Data and Informatics Working Group report
  2. Phil Bourne's job application statement.
  3. Phil Bourne's "universities as Big Data".
  4. Phil Bourne's 10 week report.
  5. The final report from the May Software Discovery Workshop (storify here, with video links.)
  6. Uduak Thomas' excellent BioInform article (open PDF here) summarizing Phil Bourne's keynote on the NIH Commons, from the 2014 Bioinformatics Open Source Conference. Also see video and slides;

Meeting links:

  1. Cover letter for meeting
  2. Agenda
  3. Vision statement
  4. ADDS guiding principles
  5. Draft training guidelines
  6. NIH Commons
  7. Vision statement

During-meeting coverage:

  1. Running notes
  2. A storify of the Twitter conversation for #ADDSup

Some fragmented thoughts.

Again, all opinions my own :).

It's on us. The NIH can fund things, and mandate things, but cannot _do_ all that much. If you want biomedical data science to advance, figure out what needs to be done and talk to Phil to propose it.

Two overwhelming impressions: the NIH moves very slowly. And the NIH has an awful lot of money.

Nobody in the academic community is interested in computational infrastructure building, unless there's a lot of money involved, in which case we will do it badly (closed source, monolithic architecture, closed data, etc.) Contractors would love to do it for us, but the odds are poor. There may be exceptions but it was hard to think of any extramural infrastructure project that had, long term, met community expectations and been sustainable (counter-examples in comments, please!)

Very few people in the biomedical community are particularly interested in training, either, although they will feign interest if it supports their graduate students in doing research (see: T-32s).

Because of this, the NIH (and the ADDS more specifically) is left carrying water with a sieve. Data science depends critically on software, data sharing, computational infrastructure, and training.

Open was missing. (Geoffrey Bilder pointed this out to me midway through the morning.) That having been said, most of the meeting attendees clearly "got it", but oops?

Use cases! The NIH ADDS is looking for use cases! What do you want to enable, and what would it look like?

The point was made that the commercial data science sector is way more active and advanced than the academic data science sector. There are lots of links, of course, but are we taking advantage of them? I would also counter that this is IMO not true in the case of biomedical data science, where I am unimpressed with what I have seen commercially so far. But maybe I'm just picky.


by C. Titus Brown at September 03, 2014 10:00 PM

Continuum Analytics

Introducing Blaze - Expressions

tl;dr Blaze abstracts tabular computation, providing uniform access to a variety of database technologies


NumPy and Pandas couple a high level interface with fast low-level computation. They allow us to manipulate data intuitively and efficiently.

Occasionally we run across a dataset that is too big to fit in our computer’s memory. In this case NumPy and Pandas don’t fit our needs and we look to other tools to manage and analyze our data. Popular choices include databases like Postgres and MongoDB, out-of-disk storage systems like PyTables and BColz and the menagerie of tools on top of the Hadoop File System (Hadoop, Spark, Impala and derivatives.)

by Matt Rocklin at September 03, 2014 10:00 AM

September 02, 2014

Jake Vanderplas

On Frequentism and Fried Chicken

My recent series of posts on Frequentism and Bayesianism have drawn a lot of comments, but recently Frederick J. Ross, a UW colleague whom I have not yet had the pleasure of meeting, penned a particularly strong-worded critique: Bayesian vs frequentist: squabbling among the ignorant. Here I want to briefly explore and respond to the points he makes in the post.

Mr. Ross doesn't mince words. He starts as follows:

Every so often some comparison of Bayesian and frequentist statistics comes to my attention. Today it was on a blog called Pythonic Perambulations. It's the work of amateurs.

He goes on to lodge specific complaints about subtleties I glossed-over in the four posts, all of which seem to miss one salient detail: the posts were an explicit response to my observation that "many scientific researchers never have opportunity to learn the distinctions between Frequentist and Bayesian methods and the different practical approaches that result..." That is, I aimed the discussion not toward someone with a deep background in statistics, but at someone who can't even name the fundamental differences between frequentism and Bayesianism.

Did I gloss over advanced subtleties in this introductory primer? Certainly. As interesting as it may have been for Mr. Ross and other well-read folks had I delved into, say, the deeper questions of assumptions implicit in frequentist constraints vs. Bayesian priors, it would have distracted from the purpose of the posts, and would have lost the very readers for whom the posts were written.

Rethinking the Debate

Thus, we see that his first set of complaints can be chalked-up to a simple misunderstanding of the intended audience: that's an honest mistake, and I won't make more of it. But he goes beyond this, and proposes his own final answer to the centuries-old debate between frequentists and Bayesians. As he writes: "Which one is right? The answer, as usual when faced with a dichotomy, is neither."

This should pique your interest: he's claiming that not only am I, a humble blogger, an ignorant amateur (which may be true), but that luminaries of the science and statistics world — people like Neyman, Pearson, Fisher, Jaynes, Jeffries, Savage, and many others who sometimes ardently addressed this question — are simply ignorant squabblers within the field which they all but created. I doubt I'm alone in finding this sweeping indictment a bit suspect.

But let's skip these charges and dig further: what third route does Mr. Ross propose to trample all this precedent? The answer is decision theory:

Probability, as a mathematical theory, has no need of an interpretation... the real battleground is statistics, and the real purpose is to choose an action based on data. The formulation that everyone uses for this, from machine learning to the foundations of Bayesian statistics, is decision theory.

His argument is that frequentist and Bayesian methods, in a reductionist sense, are both simply means of reaching a decision based on data, and can therefore be viewed as related branches of decision theory. He goes on to define some notation which explains how any statistical procedure can be formulated as a question of progressing from data, via some loss function, to a particular decision. Frequentist and Bayesian approaches are simply manifestations of this unified theory which use particular loss functions, and thus squabbling about them is the pastime of the ignorant.

I'd like to offer an analogy in response to this idea.

Baked or Fried?

One day in the kitchen, two chefs begin arguing about who makes the best chicken. Chef Hugh prefers his chicken fried: the quick action of the hot oil results light, crispy spiced outer breading complementing the tender meat it encloses. Chef Wolfgang, on the other hand, swears by baked chicken, asserting that its gentler process leaves more moisture, and allows more time for complex flavors to seep into the meat. They decide to have a cook-off: Fried vs. Baked, to decide once and for all which method is the best.

They're just beginning their preparations in the test kitchen when Rick, the local Food Theorist, storms through the door. He follows these chefs on Twitter, and has heard about this great Fried vs. Baked debate. Given his clear expertise on the matter, he wants to offer his final say on the question. As Food Theorists are wont to do, he starts lecturing them:

"Truly, I'm not really sure what this whole contest is about. Don't you know that baking and frying are both doing essentially the same thing? Proteins denature as they heat. Water evaporates, sugar caramelizes, and the Maillard Reaction turns carbohydrates and amino acids into a crispy crust. If you could just get into your ignorant heads that any cooking method is simply a manifestation of these simple principles, you'd realize that neither method is better, and we wouldn't need to waste our time on this silly competition."

At this point, Chef Hugh and Chef Wolfgang pause, catch each other's gaze for a moment, and burst into a hearty laughter. They turn around continue the task of actually turning the raw chicken meat into an edible meal, enjoying the craft and camaraderie of cooking even in the midst of their friendly disagreement. Rick slips out the door and heads home to gaze at his navel while eating a dinner of microwaved pizza bagels.

Baked or Fried? Bayesian or Frequentist?

So what's my point here? The fact is that everything our Food Theorist has said is technically correct: from a completely reductionist perspective, cooking meat is nothing more than controlled denaturing of proteins, evaporation of water, and other well-understood chemical processes. But to actually prepare a meal, you can't stop with the theory. You have to figure out how to apply that knowledge in practice, and that requires decisions about whether to use an oven, a deep fryer, or a charcoal grill.

Similarly, everything Mr. Ross said in his blog post is more or less true, but you can't stop there. Applying his decision theory in practice requires making some choices: despite his protests, you actually do have to decide how to map your theory of probability onto reality reflected in data, and that requires some actual philosophical choices about how you treat probability, which lead to fundamentally different questions being answered.

Frequentism vs. Bayesianism, Again

This brings us back to the original question Mr. Ross (not I) posed: Frequentism vs. Bayesianism: which is correct? As I've maintained throughout my posts (and as Mr. Ross seems to have overlooked when reading them): neither is correct. Or both. It really depends on the situation. As I have attempted to make clear, if you're asking questions about long-term limiting frequencies of repeated processes, classical frequentist approaches are probably your best bet. If you're hoping to update your knowledge about the world based on a finite set of data, Bayesian approaches are more appropriate.

While I have argued that Frequentist approaches answer the wrong question in most scientific settings, I have never claimed that frequentism is fundamentally flawed, or that it is "wrong": on the contrary, in that particular post I went to great length to use Monte Carlo simulations to show that the frequentist approach does in fact give the correct answer to the question it asks. Frequentist and Bayesian approaches answer different statistical questions, and that is a fact you must realize in order to use them.

So where does this leave us? Well, Mr. Ross seems to have based his critique largely on misunderstandings: my intended audience was novices rather than experts, and despite his claims otherwise I have never held either the frequentist or Bayesian approach as globally correct at the expense of the other. His protests notwithstanding, I maintain that in practice, frequentism and Bayesianism remain as different as fried and baked chicken: you can huff and puff about unified theoretical frameworks until your face is blue, but at the end of the day you need to choose between the oven and the fryer.

by Jake Vanderplas at September 02, 2014 11:00 PM

Matthieu Brucher

Book review: A Brief Introduction to Continuous Evolutionary Optimization

When I developed my first tool based on genetic algorithms, it was to replace local optimization algorithm (“Simulated Annealing”, as advertised by Numerical Recipes) as a global optimization algorithm. Now a couple of years later, I found a small book on GE that seemed on topic with what I had to implement (I relied at the time on another book that I never reviewed ; perhaps another time). Is it a good brief introduction as the book says?

Content and opinions

First of all, the book is really small, only 100 pages. At more than 50$, that’s quite a high page rate, especially when there are only 70 pages of actual content…

There are three parts in the book, the foundation, advanced optimization and learning. Let’s start with the first one.

The first part also contains the introduction. It mainly describes what evolutionary optimization means, and how to compare it to the other techniques, and a description of the other chapters. So the first interesting/technical chapter is the second one. It explains what a genetic algorithm actually is, and the way the algorithm is described is better that what I found in bigger books. There are also additional variants of algorithms used by GE (different types of recombinations for instance), other usual may be missing (like selection by tournament). As the book is about evolutionary optimization, and not GE, there is also the introduction of other algorithms, namely Particle Swarm Optimization, and Covariance Matrix Adaptation Evolution Strategies. Those were properly defined, better than most online resources as well. The third chapter is about defining the parameters of the algorithms, and this is an interesting chapter as the main topic is to allow this evolutionary algorithm to evolve the parameters during the optimization. This was really interesting to see, and I’m looking forward to implementing this in my tool.

More evolution in part 2, with using evolution to handle constraints (really funny, yet something else to try on my case), and then a variant of the Powell method mixed with evolutionary optimization. It is less useful perhaps, as it seems to be a local optimization algorithm, related to simulated annealing in some way, but with the population approach. Still interesting, but you won’t apply this as often as what you learned in the first part or the chapter before. The last chapter in this part tackles multiple objective optimization. This is really a good approach (a colleague of mine made his PhD on this kind of problems, now I understand his thesis better!) to multiple objective optimization. Usually, you have to combine them in some way, but as we use populations, we get actual borders, and possibly an infinite number of possible solutions, without having to choose which function to favor.

The last chapter is called “Learning”, but I would have called it “Applications”. The applications are model learning, but that’s about it. It is nice though that the book tackles more or less textbook functions in the first parts, and then tackles a problem that makes some actual sense (no, I don’t think the Rosenbrock cost function optimization can compare to solving an actual problem). Someone thought about what was missing in the other books!


In the end, the main issue is the cost of the book, although compared to other scientific book, it is not that expensive. Otherwise, I found it really clear, efficient, covering several subjects
properly. I think I found my new reference for people wanting to start evolutionary computation!

by Matthieu Brucher at September 02, 2014 07:30 AM

September 01, 2014

Titus Brown

Running Software Carpentry instructor training at UC Davis in January, 2015

I am pleased to announce that Dr. Greg Wilson will be giving a two-day Software Carpentry Instructor Training workshop at UC Davis, January 6-7, 2015. This will be an in-person version of the instructor training that Greg runs every quarter; see my blog post about the first such instructor training, here to get some of idea of what it's like. Note that the training is necessary for those who want to become accredited Software Carpentry or Data Carpentry instructors.

If you're interested in further information, go subscribe to this github issue; we should have room for a dozen extra people to attend. Anyone is welcome, including people unaffiliated with UC Davis.

NOTE: This workshop is hosted by my new lab, but I am seeking additional support; drop me a line if you're interested in co-sponsoring the workshop.

On January 8th, I plan to run a third day of activities. The three things I have tentatively planned are,

  1. run people through the GitHub Flow process of contributing to a github repository, as in our hackathon (also see our writeup here);
  2. walk people through our process for generating maintainable command-line tutorials, recipes, and protocols (more on that at ABIC);
  3. have an open discussion about what and how to do training in data intensive biology at Davis.

This third day of activities is going to be more restricted in attendance than the first two days, but if you're into teaching computation and biology and want to see how we're doing things, drop me a note (t@idyll.org) and I'll see if we can fit you in.


by C. Titus Brown at September 01, 2014 10:00 PM

Matthew Rocklin

Introducing Blaze - Migrations

tl;dr Blaze migrates data efficiently between a variety of data stores.

In our last post on Blaze expressions we showed how Blaze can execute the same tabular query on a variety of computational backends. However, this ability is only useful if you can migrate your data to the new computational system in the first place. To help with this, Blaze provides the into function which moves data from one container type to another:

>>> into(list, (1, 2, 3))
[1, 2, 3]

>>> into(set, (1, 2, 3))
{1, 2, 3}

>>> into(np.ndarray, [1, 2, 3])
array([1, 2, 3])

The into function takes two arguments, a and b, and it puts the data in b into a container like a. For example, if we have the class iris dataset in a CSV file (iris.csv includes measurements and species of various flowers)

$ head iris.csv

We can load this csv file into a Python list, a numpy array, and a Pandas DataFrame, all using the into function.

List $\leftarrow$ CSV

csv = CSV('iris.csv')
>>> L = into(list, csv)
>>> L[:4]
[(5.1, 3.5, 1.4, 0.2, u'Iris-setosa'),
 (4.9, 3.0, 1.4, 0.2, u'Iris-setosa'),
 (4.7, 3.2, 1.3, 0.2, u'Iris-setosa'),
 (4.6, 3.1, 1.5, 0.2, u'Iris-setosa')]

NumPy $\leftarrow$ CSV

>>> x = into(np.ndarray, csv)
>>> x[:4]
rec.array([(5.1, 3.5, 1.4, 0.2, 'Iris-setosa'),
           (4.9, 3.0, 1.4, 0.2, 'Iris-setosa'),
           (4.7, 3.2, 1.3, 0.2, 'Iris-setosa'),
           (4.6, 3.1, 1.5, 0.2, 'Iris-setosa')],
           dtype=[('SepalLength', '<f8'), ('SepalWidth', '<f8'),
           ('PetalLength', '<f8'), ('PetalWidth', '<f8'), ('Species', 'O')])

Pandas $\leftarrow$ CSV

>>> df = into(DataFrame, csv)
>>> df[:4]
     SepalLength  SepalWidth  PetalLength  PetalWidth         Species
     0            5.1         3.5          1.4         0.2     Iris-setosa
     1            4.9         3.0          1.4         0.2     Iris-setosa
     2            4.7         3.2          1.3         0.2     Iris-setosa
     3            4.6         3.1          1.5         0.2     Iris-setosa

Again, Blaze isn’t doing any of the work, it just calls out to the read_csv function of the appropriate library with the right inputs.

Demonstrating Breadth

We demonstrate breadth by moving data between more exotic backends

SQL $\leftarrow$ CSV

>>> sql = SQL('sqlite:///iris.db', 'iris', schema=csv.schema)  # SQLite database

>>> into(sql, csv)                  # CSV to SQL migration
<blaze.data.sql.SQL at 0x7fb4305423d0>

MongoDB $\leftarrow$ Pandas

into doesn’t work just with csv files. We can use it to convert between any pair of data types.

>>> import pymongo
>>> db = pymongo.MongoClient().db
>>> into(db.iris, df)               # Migrate from Pandas DataFrame to MongoDB
Collection(Database(MongoClient('localhost', 27017), u'db'), u'iris')

And to demonstrate that it’s there

>>> next(db.iris.find())  # First entry from database
{u'PetalLength': 1.4,
 u'PetalWidth': 0.2,
 u'SepalLength': 5.1,
 u'SepalWidth': 3.5,
 u'Species': u'Iris-setosa',
 u'_id': ObjectId('53f6913ff0470512a4e782e6')}

BColz $\leftarrow$ MongoDB

Finally we migrate from a Mongo database to a BColz out-of-core array.

>>> into(bcolz.ctable, db.iris) # Migrate between MongoDB and BColz
ctable((150,), [('PetalLength', '<f8'), ('PetalWidth', '<f8'), ('SepalLength',
'<f8'), ('SepalWidth', '<f8'), ('Species', '<U15')])
  nbytes: 13.48 KB; cbytes: 319.98 KB; ratio: 0.04
    cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
    [(1.4, 0.2, 5.1, 3.5, u'Iris-setosa') (1.4, 0.2, 4.9, 3.0, u'Iris-setosa')
     (1.3, 0.2, 4.7, 3.2, u'Iris-setosa') (1.5, 0.2, 4.6, 3.1, u'Iris-setosa')
     (1.4, 0.2, 5.0, 3.6, u'Iris-setosa') (1.7, 0.4, 5.4, 3.9, u'Iris-setosa')
     (1.4, 0.3, 4.6, 3.4, u'Iris-setosa') (1.5, 0.2, 5.0, 3.4, u'Iris-setosa')

Robustness and Performance

Blaze leverages known solutions where they exist, for example migrating from CSV files to SQL databases we use fast the built-in loaders for that particular database.

Blaze manages solutions where they don’t exist, for example when migrating from a MongoDB to a BColz out-of-core array we stream the database through Python, translating types as necessary.

More Information

September 01, 2014 12:00 AM

Introducing Blaze - Practice

We look at data from the Home Mortgage Disclosure Act, a collection of actions taken on housing loans by various governmental agencies (gzip-ed csv file here) (thanks to Aron Ahmadia for the pointer). Uncompressed this dataset is around 10GB on disk and so we don’t want to load it up into memory with a modern commercial notebook.

Instead, we use Blaze to investigate the data, select down to the data we care about, and then migrate that data into a suitable computational backend.

In this post we’re going to use the interactive Table object, which wraps up a dataset and calls compute whenever we ask for something to be printed to the screen. It retains the abstract delayed-evaluation nature of Blaze with the interactive feel of NumPy and Pandas.

from blaze import CSV, Table
csv = CSV('hmda_lar-2012.csv')  # Open the CSV file
t = Table(csv)                  # Interact with CSV file using interactive Table object
action_taken action_taken_name agency_code agency_abbr agency_name applicant_ethnicity applicant_ethnicity_name applicant_income_000s applicant_race_1 applicant_race_2 applicant_race_3 applicant_race_4 applicant_race_5 applicant_race_name_1 applicant_race_name_2 applicant_race_name_3 applicant_race_name_4 applicant_race_name_5 applicant_sex applicant_sex_name application_date_indicator as_of_year census_tract_number co_applicant_ethnicity co_applicant_ethnicity_name co_applicant_race_1 co_applicant_race_2 co_applicant_race_3 co_applicant_race_4 co_applicant_race_5 co_applicant_race_name_1 co_applicant_race_name_2 co_applicant_race_name_3 co_applicant_race_name_4 co_applicant_race_name_5 co_applicant_sex co_applicant_sex_name county_code county_name denial_reason_1 denial_reason_2 denial_reason_3 denial_reason_name_1 denial_reason_name_2 denial_reason_name_3 edit_status edit_status_name hoepa_status hoepa_status_name lien_status lien_status_name loan_purpose loan_purpose_name loan_type loan_type_name msamd msamd_name owner_occupancy owner_occupancy_name preapproval preapproval_name property_type property_type_name purchaser_type purchaser_type_name respondent_id sequence_number state_code state_abbr state_name hud_median_family_income loan_amount_000s number_of_1_to_4_family_units number_of_owner_occupied_units minority_population population rate_spread tract_to_msamd_income
0 1 Loan originated 7 HUD Department of Housing and Urban Development 2 Not Hispanic or Latino 173 5 White 1 Male 0 2012 8803.06 2 Not Hispanic or Latino 5 White 2 Female 197 Will County None NaN 2 Not a HOEPA loan 1 Secured by a first lien 3 Refinancing 1 Conventional 16974 Chicago, Joliet, Naperville - IL 1 Owner-occupied as a principal dwelling 3 Not applicable 1 One-to-four family dwelling (other than manufa... 6 Commercial bank, savings bank or savings assoc... 36-4176531 2712 17 IL Illinois 77300 264 2153 1971 45.820000 7894 NaN 170.679993
1 1 Loan originated 5 NCUA National Credit Union Administration 2 Not Hispanic or Latino 83 5 White 1 Male 0 2012 2915.00 5 No co-applicant 8 No co-applicant 5 No co-applicant 111 Midland County None NaN 2 Not a HOEPA loan 1 Secured by a first lien 3 Refinancing 1 Conventional NaN 1 Owner-occupied as a principal dwelling 3 Not applicable 1 One-to-four family dwelling (other than manufa... 0 Loan was not originated or was not sold in cal... 0000060137 328 26 MI Michigan 52100 116 1662 1271 3.340000 4315 NaN 102.760002
2 6 Loan purchased by the institution 9 CFPB Consumer Financial Protection Bureau 4 Not applicable 70 7 Not applicable 4 Not applicable 2 2012 212.01 4 Not applicable 7 Not applicable 4 Not applicable 7 Benton County None 6 Quality edit failure only 2 Not a HOEPA loan 4 Not applicable 3 Refinancing 1 Conventional 22220 Fayetteville, Springdale, Rogers - AR, MO 1 Owner-occupied as a principal dwelling 3 Not applicable 1 One-to-four family dwelling (other than manufa... 8 Affiliate institution 0000476810 43575 5 AR Arkansas 58200 159 1194 708 21.870001 4239 NaN 127.639999
3 6 Loan purchased by the institution 9 CFPB Consumer Financial Protection Bureau 2 Not Hispanic or Latino 108 5 White 2 Female 2 2012 407.06 5 No co-applicant 8 No co-applicant 5 No co-applicant 123 Ramsey County None NaN 2 Not a HOEPA loan 4 Not applicable 3 Refinancing 1 Conventional 33460 Minneapolis, St. Paul, Bloomington - MN, WI 1 Owner-occupied as a principal dwelling 3 Not applicable 1 One-to-four family dwelling (other than manufa... 1 Fannie Mae (FNMA) 0000451965 2374657 27 MN Minnesota 83900 100 1927 1871 13.680000 4832 NaN 137.669998
4 1 Loan originated 3 FDIC Federal Deposit Insurance Corporation 2 Not Hispanic or Latino NaN 5 White 1 Male 0 2012 104.00 2 Not Hispanic or Latino 5 White 2 Female 3 Allen County None 6 Quality edit failure only 2 Not a HOEPA loan 1 Secured by a first lien 2 Home improvement 1 Conventional 23060 Fort Wayne - IN 1 Owner-occupied as a principal dwelling 3 Not applicable 1 One-to-four family dwelling (other than manufa... 0 Loan was not originated or was not sold in cal... 0000013801 11 18 IN Indiana 63800 267 1309 1160 4.680000 3612 NaN 139.100006
5 1 Loan originated 7 HUD Department of Housing and Urban Development 2 Not Hispanic or Latino 144 5 White 1 Male 0 2012 8057.01 2 Not Hispanic or Latino 5 White 1 Male 31 Cook County None NaN 2 Not a HOEPA loan 1 Secured by a first lien 3 Refinancing 1 Conventional 16974 Chicago, Joliet, Naperville - IL 1 Owner-occupied as a principal dwelling 3 Not applicable 1 One-to-four family dwelling (other than manufa... 3 Freddie Mac (FHLMC) 36-4327855 22594 17 IL Illinois 77300 260 1390 1700 6.440000 5074 NaN 140.550003
6 1 Loan originated 7 HUD Department of Housing and Urban Development 2 Not Hispanic or Latino 51 3 Black or African American 1 Male 0 2012 17.00 5 No co-applicant 8 No co-applicant 5 No co-applicant 19 Calcasieu Parish None NaN 2 Not a HOEPA loan 1 Secured by a first lien 1 Home purchase 1 Conventional 29340 Lake Charles - LA 1 Owner-occupied as a principal dwelling 3 Not applicable 1 One-to-four family dwelling (other than manufa... 1 Fannie Mae (FNMA) 7056000000 34177 22 LA Louisiana 62400 115 3500 2797 29.260000 8745 NaN 86.739998
7 1 Loan originated 7 HUD Department of Housing and Urban Development 2 Not Hispanic or Latino 162 5 White 1 Male 0 2012 2.01 2 Not Hispanic or Latino 5 White 2 Female 49 Grand County None NaN 2 Not a HOEPA loan 1 Secured by a first lien 3 Refinancing 2 FHA-insured NaN 1 Owner-occupied as a principal dwelling 3 Not applicable 1 One-to-four family dwelling (other than manufa... 6 Commercial bank, savings bank or savings assoc... 87-0623581 2141 8 CO Colorado 61000 283 5706 1724 9.650000 4817 NaN 128.559998
8 1 Loan originated 3 FDIC Federal Deposit Insurance Corporation 2 Not Hispanic or Latino 32 5 White 2 Female 0 2012 103.04 5 No co-applicant 8 No co-applicant 5 No co-applicant 3 Allen County None NaN 2 Not a HOEPA loan 1 Secured by a first lien 3 Refinancing 1 Conventional 23060 Fort Wayne - IN 1 Owner-occupied as a principal dwelling 3 Not applicable 1 One-to-four family dwelling (other than manufa... 0 Loan was not originated or was not sold in cal... 0000013801 10 18 IN Indiana 63800 40 2384 2210 6.640000 6601 3.33 149.690002
9 1 Loan originated 9 CFPB Consumer Financial Protection Bureau 2 Not Hispanic or Latino 38 5 White 1 Male 0 2012 9608.00 2 Not Hispanic or Latino 5 White 2 Female 41 Talbot County None NaN 2 Not a HOEPA loan 1 Secured by a first lien 3 Refinancing 1 Conventional NaN 1 Owner-occupied as a principal dwelling 3 Not applicable 1 One-to-four family dwelling (other than manufa... 3 Freddie Mac (FHLMC) 0000504713 18459 24 MD Maryland 72600 108 1311 785 6.040000 1920 NaN 77.279999
10 1 Loan originated 7 HUD Department of Housing and Urban Development 2 Not Hispanic or Latino 47 5 White 1 Male 0 2012 22.04 1 Hispanic or Latino 5 White 2 Female 19 Calcasieu Parish None NaN 2 Not a HOEPA loan 1 Secured by a first lien 1 Home purchase 3 VA-guaranteed 29340 Lake Charles - LA 1 Owner-occupied as a principal dwelling 3 Not applicable 1 One-to-four family dwelling (other than manufa... 6 Commercial bank, savings bank or savings assoc... 7056000000 33791 22 LA Louisiana 62400 158 1854 1463 12.410000 4955 NaN 112.010002

Reducing the Dataset

That’s a lot of columns, most of which are redundant or don’t carry much information. Let’s clean up our dataset a bit by selecting a smaller subset of columns. Already this quick investigation improves our comprehension and reduces the size of the dataset.

columns = ['action_taken_name', 'agency_abbr', 'applicant_ethnicity_name',
           'applicant_race_name_1', 'applicant_sex_name', 'county_name',
           'loan_purpose_name', 'state_abbr']

t = t[columns]
action_taken_name agency_abbr applicant_ethnicity_name applicant_race_name_1 applicant_sex_name county_name loan_purpose_name state_abbr
0 Loan originated HUD Not Hispanic or Latino White Male Will County Refinancing IL
1 Loan originated NCUA Not Hispanic or Latino White Male Midland County Refinancing MI
2 Loan purchased by the institution CFPB Not applicable Not applicable Not applicable Benton County Refinancing AR
3 Loan purchased by the institution CFPB Not Hispanic or Latino White Female Ramsey County Refinancing MN
4 Loan originated FDIC Not Hispanic or Latino White Male Allen County Home improvement IN
5 Loan originated HUD Not Hispanic or Latino White Male Cook County Refinancing IL
6 Loan originated HUD Not Hispanic or Latino Black or African American Male Calcasieu Parish Home purchase LA
7 Loan originated HUD Not Hispanic or Latino White Male Grand County Refinancing CO
8 Loan originated FDIC Not Hispanic or Latino White Female Allen County Refinancing IN
9 Loan originated CFPB Not Hispanic or Latino White Male Talbot County Refinancing MD
10 Loan originated HUD Not Hispanic or Latino White Male Calcasieu Parish Home purchase LA

More Complex Computation

Now that we can more clearly see what’s going on let’s ask a simple question:

How many times does each action occur in the state of New York?

t2 = t[t.state_abbr == 'NY']
from blaze import into, by
from pandas import DataFrame
# Group on action_taken_name, count each group
into(DataFrame, by(t2.action_taken_name,
CPU times: user 13min 50s, sys: 5.23 s, total: 13min 55s
Wall time: 13min 55s
action_taken_name action_taken_name_count
0 Loan originated 285106
1 Application denied by financial institution 109423
2 Loan purchased by the institution 75241
3 Application withdrawn by applicant 50563
4 Application approved but not accepted 25632
5 File closed for incompleteness 20585
6 Preapproval request approved but not accepted 259
7 Preapproval request denied by financial instit... 171

Great! Sadly, because it was reading through the CSV file and because it was using a Pure Python backend, that computation took fourteen minutes.

Moving to a Faster Backend

By default computations on CSV files use the streaming Python backend. While robust for large files and decently fast, this backend parses the CSV file each time we do a full-data operation, and this parsing is very slow. Let’s move our reduced dataset to a more efficient and widely accessible backend, sqlite.

from blaze import SQL
sql = SQL('sqlite:///hmda.db', 'data', schema=t.schema) # A SQLite database
into(sql, t)  # Migrate data

Yup, a little sqlite database just arrived

$ ls -lh hmda*

-rw-r--r-- 1 mrocklin mrocklin 2.7G Aug 25 13:38 hmda.db
-rw-r--r-- 1 mrocklin mrocklin  12G Jul 10 12:15 hmda_lar-2012.csv

Working with SQL

Now that we’ve migrated our csv file into a sqlite database let’s redefine t to use the SQL backend and repeat our computation.

# t = Table(csv)
t = Table(sql)
t2 = t[t.state_abbr == 'NY']
into(DataFrame, by(t2.action_taken_name,
CPU times: user 5.55 s, sys: 1.64 s, total: 7.19 s
Wall time: 7.46 s
action_taken_name action_taken_name_count
0 Loan originated 285106
1 Application denied by financial institution 109423
2 Loan purchased by the institution 75241
3 Application withdrawn by applicant 50563
4 Application approved but not accepted 25632
5 File closed for incompleteness 20585
6 Preapproval request approved but not accepted 259
7 Preapproval request denied by financial instit... 171

We’re about to repeat this same computation many times. We’ll omit the table result from here on out. It will always be the same.

Create an index on state name

This was much faster, largely because the data was stored in a binary format. We can improve the query speed significantly by placing an index on the state_abbr field. This will cause the selection t[t.state_abbr == 'NY'] to return more quickly, eliminating the need for an expensive full table scan.

from blaze import create_index
create_index(sql, 'state_abbr', name='state_abbr_index')

Now we can ask this same query for many states at interactive timescales.

t2 = t[t.state_abbr == 'NY']
into(DataFrame, by(t2.action_taken_name,
CPU times: user 1.74 s, sys: 430 ms, total: 2.17 s
Wall time: 2.17 s

Comparing against MongoDB

Because moving between computational backends is now easy, we can quickly compare performance between backends. SQLite and MongoDB are similarly available technologies, each being trivial to set up on a personal computer. However they’re also fairly different technologies with varying communities.

Which performs faster for our sample computation?

import pymongo
db = pymongo.MongoClient().db

into(db.hmda, sql)  # Migrate to Mongo DB from SQLite database
# t = Table(csv)
# t = Table(sql)
t = Table(db.hmda)
t2 = t[t.state_abbr == 'NY']
into(DataFrame, by(t2.action_taken_name,
CPU times: user 4.05 ms, sys: 701 µs, total: 4.76 ms
Wall time: 7.61 s

Almost exactly the same time as for SQLite.

We just did a complex thing easily. If we weren’t familiar with MongoDB we would need to learn how to set up a database, how to migrate data from SQL to MongoDB, and finally how to perform queries. Blaze eased that process considerably.

Create an index on state name

Again we create an index on the state name and observe the performance difference.

create_index(db.hmda, 'state_abbr', name='state_abbr_index')
t2 = t[t.state_abbr == 'NY']
into(DataFrame, by(t2.action_taken_name,
CPU times: user 4.13 ms, sys: 844 µs, total: 4.97 ms
Wall time: 954 ms

Here the indexed MongoDB system seems about twice as fast as the comparably indexed SQLite system.


Disclaimer: These results come from a single run. No attempt was made to optimize the backend configuration, nor was any consideration taken into account about databases being warmed up. These numbers are far from conclusive, and are merely here to present the ease with which intuitive-building experiments are easy with Blaze and the value of choosing the right backend.

Backend Duration
NumPy/Pandas need bigger machine
Python/CSV 14 minutes
SQLite 7 seconds
MongoDB 7 seconds
SQLite (indexed) 2 seconds
MongoDB (indexed) 1 second


Blaze enables you to investigate, transform, and migrate large data intuitively. You can choose the right technology for your application without having to worry about learning a new syntax.

We hope that by lowering this barrier more users will use the right tool for the job.

More Information

September 01, 2014 12:00 AM

Introducing Blaze - Expressions

tl;dr Blaze abstracts tabular computation, providing uniform access to a variety of database technologies


NumPy and Pandas couple a high level interface with fast low-level computation. They allow us to manipulate data intuitively and efficiently.

Occasionally we run across a dataset that is too big to fit in our computer’s memory. In this case NumPy and Pandas don’t fit our needs and we look to other tools to manage and analyze our data. Popular choices include databases like Postgres and MongoDB, out-of-disk storage systems like PyTables and BColz and the menagerie of tools on top of the Hadoop File System (Hadoop, Spark, Impala and derivatives.)

Each of these systems has their own strengths and weaknesses and an experienced data analyst will choose the right tool for the problem at hand. Unfortunately learning how each system works and pushing data into the proper form often takes most of the data scientist’s time.

The startup costs of learning to munge and migrate data between new technologies often dominate biggish-data analytics.

Blaze strives to reduce this friction. Blaze provides a uniform interface to a variety of database technologies and abstractions for migrating data.


At its core, Blaze is a way to express data and computations.

In the following example we build an abstract table for accounts in a simple bank. We then describe a query, deadbeats, to find the names of the account holders with a negative balance.

from blaze import TableSymbol, compute
accounts = TableSymbol('accounts', '{id: int, name: string, amount: int}')

# The names of account holders with negative balance
deadbeats = accounts[accounts['amount'] < 0]['name']

Programmers familiar with Pandas should find the syntax to create deadbeats familiar. Note that we haven’t actually done any work yet. The table accounts is purely imaginary and so the deadbeats expression is just an expression of intent. The Pandas-like syntax builds up a graph of operations to perform later.

However, if we happen to have some similarly shaped data lying around

L = [[1, 'Alice',   100],
     [2, 'Bob',    -200],
     [3, 'Charlie', 300],
     [4, 'Dennis',  400],
     [5, 'Edith',  -500]]

We can combine our expression, deadbeats with our data L to compute an actual result

>>> compute(deadbeats, L) # an iterator as a result
<itertools.imap at 0x7fdd7fcde450>

>>> list(compute(deadbeats, L))
['Bob', 'Edith']

So in its simplest incarnation, Blaze is a way to write down computations abstractly which can later be applied to real data.

Multiple Backends - Pandas

Fortunately the deadbeats expression can run against many different kinds of data. We just computed deadbeats against Python lists, here we compute it against a Pandas DataFrame

from pandas import DataFrame
df = DataFrame([[1, 'Alice',   100],
                [2, 'Bob',    -200],
                [3, 'Charlie', 300],
                [4, 'Dennis',  400],
                [5, 'Edith',  -500]],
               columns=['id', 'name', 'amount'])

>>> compute(deadbeats, df)
1      Bob
4    Edith
Name: name, dtype: object

Note that Blaze didn’t perform the computation here, Pandas did (it’s good at that), Blaze just told Pandas what to do. Blaze doesn’t compute results; Blaze drives other systems to compute results.

Multiple Backends - MongoDB

To demonstrate some breadth, let’s show Blaze driving a Mongo Database.

$ # We install and run MongoDB locally
$ sudo apt-get install mongodb-server
$ mongod &
$ pip install pymongo
import pymongo
db = pymongo.MongoClient().db
db.mycollection.insert([{'id': 1, 'name': 'Alice',   'amount':  100},
                        {'id': 2, 'name': 'Bob',     'amount': -200},
                        {'id': 3, 'name': 'Charlie', 'amount':  300},
                        {'id': 4, 'name': 'Dennis',  'amount':  400},
                        {'id': 5, 'name': 'Edith',   'amount': -500}])

>>> compute(deadbeats, db.mycollection)
[u'Bob', u'Edith']


To remind you we created a single Blaze query

accounts = TableSymbol('accounts', '{id: int, name: string, amount: int}')
deadbeats = accounts[accounts['amount'] < 0]['name']

And then executed that same query against multiple backends

>>> list(compute(deadbeats, L))          # Python
['Bob', 'Edith']

>>> compute(deadbeats, df)               # Pandas
1      Bob
4    Edith
Name: name, dtype: object

>>> compute(deadbeats, db.mycollection)  # MongoDB
[u'Bob', u'Edith']

At the time of this writing Blaze supports the following backends

  • Pure Python
  • Pandas
  • MongoDB
  • SQL
  • PySpark
  • PyTables
  • BColz


The separation of expressions and computation is core to Blaze. It’s also confusing for new Blaze users. NumPy and Pandas demonstrated the value of immediate data interaction and having to explicitly call compute is a step backward from that goal.

To this end we create the Table abstraction, a TableSymbol that knows about a particular data resource. Operations on this Table object produce abstract expressions just like normal, but statements that would normally print results to the screen initiate calls to compute and then print those results, giving an interactive feel in a console or notebook

>>> from blaze import Table
>>> t = Table(db.mycollection)  # give MongoDB resource to Table
>>> t
   amount  id     name
0     100   1    Alice
1    -200   2      Bob
2     300   3  Charlie
3     400   4   Dennis
4    -500   5    Edith

>>> t[t.amount < 0]
   amount  id   name
0    -200   2    Bob
1    -500   5  Edith

>>> type(t[t.amount < 0])        # Still just an expression

>>> print repr(t[t.amount < 0])  # repr triggers call to compute
   amount  id   name
0    -200   2    Bob
1    -500   5  Edith

These expressions generate the appropriate MongoDB queries, call compute only when we print a result to the screen, and then push the result into a DataFrame to use Pandas’ excellent tabular printing. For large datasets we always append a .head(10) call to the expression to only retrieve the sample of the output necessary to print to the screen; this avoids large data transfers when not necessary.

Using the interactive Table object we can interact with a variety of computational backends with the familiarity of a local DataFrame.

More Information

September 01, 2014 12:00 AM

August 30, 2014

Titus Brown

Some naive ideas about training efforts at UC Davis

As I mentioned, I am hoping to significantly scale up my training efforts at UC Davis; it's one of the reasons they hired me, it's a big need in biology, and I'm enthusiastic about the whole thing! A key point is that, at least at the beginning, it may replace some or all of my for-credit teaching. (Note that the first four years of Analyzing Next-Generation Sequencing Data counted as outreach, not teaching, at MSU.)

I don't expect to fully spool up before fall 2015, but I wanted to start outlining my thoughts.

The ideas below came in large part from conversations with Tracy Teal, a Software Carpentry instructor who is one of the people driving Data Carpentry, and who also was one of the EDAMAME course instructors.

How much training, how often, and to whom?

I think my initial training efforts will center on Software Carpentry-style workshops, on a variety of (largely bio-specific) topics. These would be two-day in-person workshops, 9-5am, each focused on a specific topic.

I think I can sustainably lead one a month, with perhaps a few months where I organize two in the same week (M/Tu and Th/Fri, perhaps).

These would be on top of at least one NGS course a year, too. I also expect I will participate in various Genome Center training workshops.

The classes would be targeted at grad students, postdocs, and faculty -- same as the current NGS course. I would give attendees from VetMed some priority, followed by attendees with UC Davis affiliations, and then open to anyone. I imagine doing this in a tiered way, so that some outsiders could always come; variety and a mixed audience are good things!

On what topics?

I have a laundry list of ideas, but I'm not sure what to start with or how to make decisions about what to teach when. ...suggestions welcome. (I also can't teach all of these myself, but I want to get the list of ideas down!)

I'd like to preface this list with a few comments: I've been teaching and training in these topics for five years (at least) now, so I'm not naive about how hard (or easy) it is to teach this to computationally inexperienced biologists. It's clear that there's a progression of skills that need to be taught for most of these, as well as a need for careful lesson planning, tutorial design, and pre/post assessment. These workshops would also be but one arrow in the quiver -- I have many other efforts that contribute to my lab's teaching and training.

With that having been said, here's a list of general things I'd like to teach:

  • Shell and UNIX (long running commands, remote commands, file and path management)
  • Scripting and automation (writing scripts, make, etc.)
  • Bioinformatics and algorithms
  • "Big data" statistics
  • Data integration for sequencing data
  • Software engineering (testing, version control, code review, etc.) on the open source model
  • Practical bioinformatics (See topics below)
  • Modeling and simulations
  • Workflows and replication tracking
  • Software Carpentry
  • Data Carpentry

I have many specific topics that I think people know they want to learn:

  • Mapping and variant calling
  • Genome assembly and evaluation (microbial & large genomes both)
  • Transcriptome assembly and evaluation (reference free & reference based)
  • Genome annotation
  • Differential expression analysis
  • ChIP-seq
  • Metagenome assembly
  • Microbial ecology and 16s approaches
  • Functional inference (pathway annotations)
  • Phylogenomics
  • Marker development
  • Genotyping by sequencing
  • Population genomics

And finally, here are two shorter workshop ideas that I find particularly neat: experimental design (from sample prep through validation), and sequencing case studies (success and failure stories). In the former, I would get together a panel of two or three people to talk through the issues involved in doing a particular experiment, with the goal of helping them write a convincing grant For the latter, I would find both success and failure stories and then talk about what other approaches could have rescued the failures, as well as what made the successful stories successful.

To what end? Community building and collaborations.

Once I started focusing in on NGS data at MSU as an assistant professor, I quickly realized that I could spend all my time in collaborations. I learned to say "no" fairly fast :). But all those people still need to do data analysis. What to do? I had no clear answer at MSU, but this was one reason I focused on training.

At Davis, I hope to limit my formal collaborations to research topics, and concentrate on training everybody to deal with their own data; in addition to being the only scalable approach, this is career-building for them. This means not only investing in training, but trying to build a community around the training topics. So I'd like to do regular (weekly? fortnightly?) "help desk" afternoons for the campus, where people can come talk about their issue du jour. Crucially, I would limit this to people that have gone through some amount of training - hopefully both incentivizing people to do the training, and making sure that some minimal level of effort has been applied. The goal would be to move towards a self-sustaining community of people working in bioinformatic data analysis across multiple levels.

Cost and materials.

Since UCD VetMed is generously supporting my salary, I am naively expecting to charge nothing more than a nominal fee -- something that would discourage people from frivolously signing up or canceling. Perhaps lunch money? (This might have to be modified for people from outside of VetMed, or off-campus attendees.)

All materials would continue to be CC0 and openly available, of course. 'cause life's too short to limit the utility of materials.

Other thoughts

I'd love to put together a slush fund so that I can invite out speakers to run workshops on topics that I don't know that well (most of 'em).

How about a workshop focused on teaching people how to teach with the materials we put together? (I would expect most of these workshops to be cloud-based.)


p.s. In addition to Tracy, thanks to Keith Bradnam, Aaron Darling, Matt MacManes and Ethan White, for their comments and critiques on a draft.

by C. Titus Brown at August 30, 2014 10:00 PM

The Critical Assessment of Metagenome Interpretation and why I'm not a fan

If you're into metagenomics, you may have heard of CAMI, the Critical Assessment of Metagenome Interpretation. I've spoken to several people about it in varying amounts of detail, and it seems like the CAMI group is working to generate some new shotgun metagenome data sets and will then encourage tool developers to bang on them. (You can also read a short Methagora blog on CAMI.)

I've been asked by about a dozen people now what I think of it, so I'm blogging about it now rather than answering people individually :).

tl; dr? I'm not a fan.

First, what's this subscription nonsense? (We'll keep you in the loop if you register here.) There are a plethora of ways to keep people in the loop without asking them to subscribe to anything (blogs, Twitter, Web sites, e-mail lists...). Is there a reason to keep the details of the contest secret or hidden behind a subscriber wall? (Answer: no.)

Second, I am told that the contest will be run once, with software that is blind to the content of the metagenome. I understand the idea of blind software evaluation, but this is not a long-term sustainable approach; we need to generate a nice set of orthogonal data sets and then LART people who fine-tune their software against one or another.

Third, it looks like the CAMI folk will make the same mistake as the Assemblathon 2 folk, and not require that the analyses be completely replicable. So in the end there will be a massive expenditure of effort that results in a paper, which will then be a nice static record of how things were back in 2014. Given the pace of tool and technology change, this will have a very short shelf-life (although no doubt tool developers will cite it for years to come, to prove that IDBA was once worse than their own assembler is now). Why not re-run it every 6 months with the latest versions of softwares X, Y, and Z? We have plenty of ways to automate analyses, and there is simply no excuse for not doing so at this point. (ht to Aaron Darling for alerting me to nucleotid.es.)

Fourth, there are several mock metagenome data sets that are already underanalyzed. For example, we're currently working with the Shakya et al. (2013) data set, but I don't think anyone else is (and it's pretty clear most people don't realize what a stinging indictment of 16s this paper is - or, at least, don't care. Discussion for another time.).

So why are we not poking at existing data first? I don't know, but I suspect it speaks to an underlying bias against re-analyzing published data sets, which we really need to get over in this field.

As I said during the Q&A for my BOSC 2014 keynote, I'm not a big fan of how we do benchmarks in bioinformatics; I think this is a fine exemplar.

I probably won't participate in the CAMI benchmark, not just for the above reasons but because I simply don't have the people to do so at the moment. ...plus, as with Assemblathon 2, they're going to need reviewers, right? ;)

Note that Aaron Darling raised many similar points at ISME, and they were apparently very well received. Maybe CAMI will adapt their approach in response, which would be great!


p.s. Thanks to Alice McHardy and Aaron Darling for their detailed discussions of this with me!

by C. Titus Brown at August 30, 2014 10:00 PM

August 29, 2014

Titus Brown

Some software development plans for Davis

I've been thinking a lot about what I want to do at UC Davis, and I would like to announce the following plans:

  1. I hope to write new and better transcriptome and metagenome assemblers. The current assemblers are hopelessly bad and I think we can probably eke out a 1% improvement, at least, by starting from scratch.
  2. I've been very disappointed with the workflow systems that are out there, so for people that want to run our new assembler, we'll be packaging that with a new workflow management system.
  3. We'll also be developing a new Web site for running analyses of biological data. Galaxy may have market penetration, but I really dislike their default CSS, and I think my lab can probably do a better job if we start from scratch.
  4. Needless to say, I'll need my own physical cloud hardware to run it all. So I'm planning a considerable expansion of our lab's cluster. I think we can probably make a big impact by writing our own virtualization management software, too; the existing systems are written by amateurs like Amazon and OpenStack, after all!

The bad news is that after some serious consideration, I have decided not to invest in a new mapping program; on top of the above it's a bit too much. I think I trust Heng Li's bwa, for now; we may revisit in a few months.

I will be seeking large scale NSF, NIH, USDA, and DOE funding for the above. Let me know if you want to sign on!


by C. Titus Brown at August 29, 2014 10:00 PM

Stefan Pfenninger

IPython notebook extensions to ease day-to-day work

Since I use the IPython notebook for a lot of explorative data analysis, I wanted some extra functionality to make day-to-day notebook work just a little bit more pleasant. I use several of Min RK’s excellent notebook extensions, particularly the floating table of contents and the Gist button.

This post describes some additional extensions I created to do things I need in day-to-day work, as well as how to leverage the customization possibilities of the CodeMirror editor to add vertical rulers to code cells. The notebook extensions described here are available on GitHub.


Sometimes larger analysis tasks take more than a couple of seconds to compute, and I want to alt-tab to do something else while I wait. The IPython-notify extension ensures that you don’t forget what you were originally doing, by displaying a notification once the kernel becomes idle again after a set minimum busy time:

Notification screenshot

When the plugin is activated, a button to request notification permissions is shown in the toolbar. After notification permissions have been granted, this button is replaced by a dropdown menu with five choices: Disabled, 0, 5, 10 and 30.

To activate notifications, select a minimum kernel busy time required to trigger a notification (e.g. if selecting 5, a notification will only be shown if the kernel was busy for more than 5 seconds). The selection is saved in the notebook’s metadata and restored when the notebook is re-opened.

Theme switching

As described in a previous post, depending on ambient lighting and time of day I prefer seeing my code in light text on a dark background, so I created a simple extension to switch between two themes.

The default CSS included only changes code cells (this has the added effect that it’s easier to tell code cells apart from markdown or raw cells):

Dark IPython Notebook

The CSS files can easily be exchanged for a different ones, for example a Base16 theme.

Customizing the CodeMirror editor

IPython notebooks use the CodeMirror editor, which includes a wide range of addons. IPython also allows the user to add additional javascript and CSS via the custom.js and custom.css files in the static/custom directory inside the IPython profile (likely ~/.ipython/profile_default/static/custom on Linux or OS X).

I want to display some rulers to indicate line lengths I shouldn’t exceed to conform with PEP8. Thanks to CodeMirror’s flexible configuration system, this can be achieved by adding the following code to custom.js, making use of the rulers addon:


var clsname = "ipynb_ruler";
var rulers = [{column: 79, className: clsname},
              {column: 99, className: clsname}];

IPython.Cell.options_default.cm_config.rulers = rulers;

By adding the class ipynb_ruler to custom.css, the look of the rulers can be tweaked, for example to use a lighter color:

.ipynb_ruler {
    border-color: #DFDFDF;

Unfortunately, there is a bug in older CodeMirror versions that in some browsers causes extra scrollbars when enabling rulers. For the time being, this can be fixed by placing a fixed rulers.js into ~/.ipython/profile_default/static/components/codemirror/addon/display (adjust as needed if your IPython profile directory is different). I backported a fix into the rulers.js shipped with IPython 2.2, available here.

by Stefan Pfenninger at August 29, 2014 02:31 PM

August 27, 2014

Titus Brown

Estimating (meta)genome size from shotgun data

This is a recipe that provides a time- and memory- efficient way to loosely estimate the likely size of your assembled genome or metagenome from the raw reads alone. It does so by using digital normalization to assess the size of the coverage-saturated de Bruijn assembly graph given the reads provided by you. It does take into account coverage, so you need to specify a desired assembly coverage - we recommend starting with a coverage of 20.

Uses for this recipe include estimating the amount of memory required to achieve an assembly and providing a lower bound for metagenome assembly size and single-copy genome diversity.

This recipe will provide inaccurate estimates on transcriptomes (where splice variants will end up confusing the issue - this looks at single-copy sequence only) or for metagenomes with high levels of strain variation (where the assembler may collapse strain variants that this estimate will split).

Note: at the moment, the khmer script normalize-by-median.py needs to be updated from the master branch of khmer to run this code properly. Once we've cut a new release, we'll remove this note and simply specify the khmer release required.

Let's assume you have a simple genome with some 5x repeats, and you've done some shotgun sequencing to a coverage of 150 or higher. If your reads are in reads.fa, you can get an estimate of the single-copy genome size (here known to be 5500 bp) by running normalize-by-median.py

~/dev/khmer/scripts/normalize-by-median.py -x 1e8 -k 20 -C 20 -R report.txt reads.fa
./estimate-genome-size.py -C 20 -k 20 reads.fa.keep report.txt

This yields the output:

Estimated (meta)genome size is: 8727 bp

This is off by about 50% for reasons that we don't completely understand. Note that you can get more accurate estimates for this data set by increasing C and decreasing k, but 20/20 should work about this well for most data sets. (For an E. coli data set, it returns 6.5 Mbp, which is only about 25% off.)

by C. Titus Brown at August 27, 2014 10:00 PM

Estimate whether your sequencing has saturated your sample to a given coverage

This recipe provides a time-efficient way to determine whether you've saturated your sequencing depth, i.e. how much new information is likely to arrive with your next set of sequencing reads. It does so by using digital normalization to generate a "collector's curve" of information collection.

Uses for this recipe include evaluating whether or not you should do more sequencing.

This approach is more accurate for low coverage than normalize-by-median's reporting, because it collects the redundant reads rather than discarding them.

Note: at the moment, the khmer script normalize-by-median.py needs to be updated from the master branch of khmer to run this code properly. Once we've cut a new release, we'll remove this note and simply specify the khmer release required.

Let's assume you have a simple genome with some 5x repeats, and you've done some shotgun sequencing, and you want to know whether or not you've saturated to a coverage of 5 with your sequencing. You can use a variant of digital normalization, saturate-by-median, to run a collector's curve:

~/dev/khmer/sandbox/saturate-by-median.py -x 1e8 -k 20 -C 5 -R report.txt --report-frequency 10 reads.fa

Then, plot the resulting saturation curve:

./plot-saturation-curve.py report.txt saturation.png --xmin 0 --ymin 0 --xmax 1500

The x axis here is the number of reads examined (column 1 in report.txt), while the y axis (column 2) is the number of reads that are below a coverage of 5 in the data set at that point. You can see here that by the time you had sampled 1000 reads, you'd stopped seeing new coverage=5 reads, which suggests that further sequencing is unnecessary.

If you zoom out on the graph, you'll see that the curve keeps on climbing, albeit much more slowly. This is due to the influence of error rate on prediction of "novel" reads, and is something we have to fix.

by C. Titus Brown at August 27, 2014 10:00 PM

William Stein

What is SageMathCloud: let's clear some things up

[PDF version of this blog post]
"You will have to close source and commercialize Sage. It's inevitable." -- Michael Monagan, cofounder of Maple, told me this in 2006.
SageMathCloud (SMC) is a website that I first launched in April 2013, through which you can use Sage and all other open source math software online, edit Latex documents, IPython notebooks, Sage worksheets, track your todo items, and many other types of documents. You can write, compile, and run code in most programming languages, and use a color command line terminal. There is realtime collaboration on everything through shared projects, terminals, etc. Each project comes with a default quota of 5GB disk space and 8GB of RAM.

SMC is fun to use, pretty to look at, frequently backs up your work in many ways, is fault tolerant, encourages collaboration, and provides a web-based way to use standard command-line tools.

The Relationship with the SageMath Software

The goal of the SageMath software project, which I founded in 2005, is to create a viable free open source alternative to Magma, Mathematica, Maple, and Matlab. SMC is not mathematics software -- instead, SMC is best viewed by analogy as a browser-based version of a Linux desktop environment like KDE or Gnome. The vast majority of the code we write for SMC involves text editor issues (problems similar to those confronted by Emacs or Vim), personal information management, support for editing LaTeX documents, terminals, file management, etc. There is almost no mathematics involved at all.

That said, the main software I use is Sage, so of course support for Sage is a primary focus. SMC is a software environment that is being optimized for its users, who are mostly college students and teachers who use Sage (or Python) in conjunction with their courses. A big motivation for the existence of SMC is to make Sage much more accessible, since growth of Sage has stagnated since 2011, with the number one show-stopper obstruction being the difficulty of students installing Sage.

Sage is Failing

Measured by the mission statement, Sage has overall failed. The core goal is to provide similar functionality to Magma (and the other Ma's) across the board, and the Sage development model and community has failed to do this across the board, since after 9 years, based on our current progress, we will never get there. There are numerous core areas of research mathematics that I'm personally familiar with (in arithmetic geometry), where Sage has barely moved in years and Sage does only a few percent of what Magma does. Unless there is a viable plan for the areas to all be systematically addressed in a reasonable timeframe, not just with arithmetic geometry in Magma, but with everything in Mathematica, Maple., etc, we are definitely failing at the main goal I have for the Sage math software project.

I have absolutely no doubt that money combined with good planning and management would make it possible to achieve our mission statement. I've seen this hundreds of times over at a small scale at Sage Days workshops during the last decade. And let's not forget that with very substantial funding, Linux now provides a viable free open source alternative to Microsoft Windows. Just providing Sage developers with travel expenses (and 0 salary) is enough to get a huge amount done, when possible. But all my attempts with foundations and other clients to get any significant funding, at even the level of 1% of the funding that Mathematica gets each year, has failed. For the life of the Sage project, we've never got more than maybe 0.1% of what Mathematica gets in revenue. It's just a fact that the mathematics community provides Mathematica $50+ million a year, enough to fund over 600 fulltime positions, and they won't provide enough to fund one single Sage developer fulltime.

But the Sage mission statement remains, and even if everybody else in the world gives up on it, I HAVE NOT. SMC is my last ditch strategy to provide resources and visibility so we can succeed at this goal and give the world a viable free open source alternative to the Ma's. I wish I were writing interesting mathematical software, but I'm not, because I'm sucking it up and playing the long game.

The Users of SMC

During the last academic year (e.g., April 2014) there were about 20K "monthly active users" (as defined by Google Analytics), 6K weekly active users, and usually around 300 simultaneous connected users. The summer months have been slower, due to less teaching.

Numerically most users are undergraduate students in courses, who are asked to use SMC in conjunction with a course. There's also quite a bit of usage of SMC by people doing research in mathematics, statistics, economics, etc. -- pretty much all computational sciences. Very roughly, people create Sage worksheets, IPython notebooks, and Latex documents in somewhat equal proportions.

What SMC runs on

Technically, SMC is a multi-datacenter web application without specific dependencies on particular cloud provider functionality. In particular, we use the Cassandra database, and custom backend services written in Node.js (about 15,000 lines of backend code). We also use Amazon's Route 53 service for geographically aware DNS. There are two racks containing dedicated computers on opposites sides of campus at University of Washington with 19 total machines, each with about 1TB SSD, 4TB+ HDD, and 96GB RAM. We also have dozens of VM's running at 2 Google data centers to the east.

A substantial fraction of the work in implementing SMC has been in designing and implementing (and reimplementing many times, in response to real usage) a robust replicated backend infrastructure for projects, with regular snapshots and automatic failover across data centers. As I write this, users have created 66677 projects; each project is a self-contained Linux account whose files are replicated across several data centers.

The Source Code of SMC

The underlying source of SMC, both the backend server and frontend client, is mostly written in CoffeeScript. The frontend (which is nearly 20,000 lines of code) is implemented using the "progressive refinement" approach to HTML5/CSS/Javascript web development. We do not use any Javascript single page app frameworks, though we make heavy use of Bootstrap3 and jQuery. All of the library dependencies of SMC, e.g., CodeMirror, Bootstrap, jQuery, etc. for SMC are licensed under very permissive BSD/MIT, etc. libraries. In particular, absolutely nothing in the Javascript software stack is GPL or AGPL licensed. The plan is that any SMC source code that will be open sourced will be released under the BSD license. Some of the SMC source code is not publicly available, and is owned by University of Washington. But other code, e.g., the realtime sync code, is already available.
Some of the functionality of SMC, for example Sage worksheets, communicate with a separate process via a TCP connection. That separate process is in some cases a GPL'd program such as Sage, R, or Octave, so the viral nature of the GPL does not apply to SMC. Also, of course the virtual machines are running the Linux operating system, which is mostly GPL licensed. (There is absolutely no AGPL-licensed code anywhere in the picture.)

Note that since none of the SMC server and client code links (even at an interpreter level) with any GPL'd software, that code can be legally distributed under any license (e.g., from BSD to commercial).
Also we plan to create a fully open source version of the Sage worksheet server part of SMC for inclusion with Sage. This is not our top priority, since there are several absolutely critical tasks that still must be finished first on SMC, e.g., basic course management.

The SMC Business Model

The University of Washington Center for Commercialization (C4C) has been very involved and supportive since the start of the projects. There are no financial investors or separate company; instead, funding comes from UW, some unspent grant funds that were about to expire, and a substantial Google "Academic Education Grant" ($60K). Our first customer is the "US Army Engineer Research and Development Center", which just started a support/license agreement to run their own SMC internally. We don't currently offer a SaaS product for sale yet -- the options for what can be sold by UW are constrained, since UW is a not-for-profit state university. Currently users receive enhancements to their projects (e.g., increased RAM or disk space) in exchange for explaining to me the interesting research or teaching they are doing with SMC.

The longterm plan is to start a separate for-profit company if we build a sufficient customer base. If this company is successful, it would also support fulltime development of Sage (e.g., via teaching buyouts for faculty, support of students, etc.), similar to how Magma (and Mathematica, etc.) development is funded.

In conclusion, in response to Michael Monagan, you are wrong. And you are right.

by William Stein (noreply@blogger.com) at August 27, 2014 07:55 AM

You don't really think that Sage has failed, do you?

I just received an email from a postdoc in Europe, and very longtime contributor to the Sage project.  He's asking for a letter of recommendation, since he has to leave the world of mathematical software development (after a decade of training and experience), so that he can take a job at hedge fund.  He ends his request with the question:

> P.S. You don't _really_ think that Sage has failed, do you?

After almost exactly 10 years of working on the Sage project, I absolutely do think it has failed to accomplish the stated goal of the mission statement: "Create a viable free open source alternative to Magma, Maple, Mathematica and Matlab.".     When it was only a few years into the project, it was really hard to evaluate progress against such a lofty mission statement.  However, after 10 years, it's clear to me that not only have we not got there, we are not going to ever get there before I retire.   And that's definitely a failure.   

Here's a very rough quote I overheard at lunch today at Sage Days 61, from John Voight, who wrote much quaternion algebra code in Magma: "I'm making a list of what is missing from Sage that Magma has for working with quaternion algebras.  However, it's so incredibly daunting, that I don't want to put in everything.  I've been working on Magma's quaternion algebras for over 10 years, as have several other people.  It's truly daunting how much functionality Magma has compared to Sage."

The only possible way Sage will not fail at the stated mission is if I can get several million dollars a year in money to support developers to work fulltime on implementing interesting core mathematical algorithms.  This is something that Magma has had for over 20 years, and that Maple, Matlab, and Mathematica also have.   That I don't have such funding is probably why you are about to take a job at a hedge fund.    If I had the money, I would try to hire a few of the absolute best people (rather than a bunch of amateurs), people like you, Robert Bradshaw, etc. -- we know who is good. (And clearly I mean serious salaries, not grad student wages!)

So yes, I think the current approach to Sage has failed.    I am going to try another approach, namely SageMathCloud.  If it works, maybe the world will get a free open source alternative to Magma, Mathematica, etc.  Otherwise, maybe the world never ever will.      If you care like I do about having such a thing, and you're teaching course, or whatever, maybe try using SageMathCloud.   If enough people use SageMathCloud for college teaching, then maybe a business model will emerge, and Sage will get proper funding.   

by William Stein (noreply@blogger.com) at August 27, 2014 07:52 AM

August 26, 2014

Matthieu Brucher

Announcement: ATKUniversalDelay 1.1.0

I’m happy to announce the release of a mono fixed delay line on the Audio Toolkit. It is available on Windows and OS X (min. 10.8) in different formats. The three knobs adjust the direct signal (blend), the delayed signal (feedforward) as well as the feedback signal from the delay line injected in the input. The delay can be set from 0 ms to 1 s by steps of 0.1 ms.



The supported formats are:

  • VST2 (32bits/64bits on Windows, 64bits on OS X)
  • VST3 (32bits/64bits on Windows, 64bits on OS X)
  • Audio Unit (64bits, OS X)

Direct link for ATKUniversalDelay

The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code.

Buy Me a Coffee!

Other Amount:

Your Email Address :

by Matthieu Brucher at August 26, 2014 07:31 AM

August 25, 2014

Paul Ivanov

pedestrian musings

I walk in monologue 
    through Berkeley's Hills
Feet pressing into sidewalk firmly
I eat the pensive mood 
    solitude brings
And bite into the juiciness of
I write, first time in years,
    free verse impromptu
Taking few dozen steps
    between each pair of lines
I yearn, on tip-toes
    stretching high, to be expressive
A mode of being longtime
I'm walking home - from job
    I'll soon be leaving
To find myself believing once 
That which I do defines 
    me not and feeling
That which I am is
    good. enough. a lot.

by Paul Ivanov at August 25, 2014 07:00 AM

August 24, 2014

Titus Brown

The first six years of faculty life

Inspired by Sarah Bisbing's excellent post on her first year as a faculty member, here are the questions I remember asking myself during my first six years:

Year 0: What science do I want to do?

Year 1: What the hell am I doing all day and why am I always so exhausted?

Year 2: Why don't my grad students know how to do anything yet?

Year 3: What does this data mean?

Year 4: Why are grant and paper reviewers so dumb?

Year 5: Why are grant and paper authors so dumb?

Year 6: Will I get tenure, and if so, do I really want to be doing this for the rest of my life?


by C. Titus Brown at August 24, 2014 10:00 PM

August 23, 2014

Titus Brown

Downsampling shotgun reads to a given average coverage (assembly-free)

The below is a recipe for subsetting a high-coverage data set to a given average coverage. This differs from digital normalization because the relative abundances of reads should be maintained -- what changes is the average coverage across all the reads.

Uses for this recipe include subsampling reads from a super-high coverage data set for the purpose of assembly, as well as more esoteric reasons (see the bottom of the post). This approach won't work on digitally normalized reads, and is primarily intended for genomes and low-complexity metagenomes. For high-complexity metagenomes we recommend partitioning.

Note: at the moment, the khmer scripts collect-reads.py and slice-reads-by-coverage.py are in the khmer repository under branch feature/collect_reads. Once we've merged it into the master branch and cut a release, we'll remove this note and simply specify the khmer release required.

This recipe uses code from khmer-recipes and dbg-graph-null.

Let's assume you have a simple genome with some 5x repeats, and you've done some shotgun sequencing to a coverage of 150 or higher. If your reads are in reads.fa

load-into-counting.py -x 1e8 -k 20 reads.kh reads.fa
~/dev/khmer/sandbox/calc-median-distribution.py reads.kh reads.fa reads-cov.dist
./plot-coverage-dist.py reads-cov.dist reads-cov.png --xmax=600 --ymax=500

and looks like this:


You see the same peaks at roughly the same places.

Now use collect-reads.py to subset the data to a lower average coverage of 50

~/dev/khmer/sandbox/collect-reads.py -x 1e8 -C 50 -k 20 reads.subset.kh reads.fa -o reads.subset.fa

Here, collect-reads.py is walking through the data set and computing a running average of the coverage of the last 1000 reads. Once it hits the specified average coverage of 50 (-C 50) it stops collecting the reads. Take a look at the read coverage spectrum for the subsetted data:

~/dev/khmer/sandbox/calc-median-distribution.py reads.subset.kh reads.subset.fa reads-subset.dist
./plot-coverage-dist.py reads-subset.dist reads-subset.png --xmax=600 --ymax=500

and compare the resulting plot with the one above --


Here you can see that the coverage spectrum has been shifted left and down by the subsampling (which is what you'd expect).

Note that picking the coverage that you want is a bit tricky, because it will be the average across the reads. If you have a highly repetitive genome you may need to go for something higher than your desired single-copy genome coverage, because the repeats will skew your average to the right.


If the peaks look good, you can use the output counting table reads.subset.kh as an argument to slice-reads-by-coverage (see this post). If you use the original reads, this will then give you _all_ the reads that cluster by coverage with that peak. For example,

~/dev/khmer/sandbox/slice-reads-by-coverage.py reads.subset.kh reads.fa reads-repeats.fa -m 100 -M 200

will give you all the reads from the repetitive component, which will be much higher coverage in the combined data set; take a look:

load-into-counting.py -x 1e8 -k 20 reads-repeats.kh reads-repeats.fa
~/dev/khmer/sandbox/calc-median-distribution.py reads-repeats.kh reads-repeats.fa reads-repeats.dist
./plot-coverage-dist.py reads-repeats.dist reads-repeats.png --xmax=600 --ymax=500

Here the slice specified (-m and -M) is with respect to the read abundances in reads.subset.kh). This allows you to more explore and subset large data sets than you would otherwise be able to, and also avoids some khmer-specific issues with counting k-mers that are higher abundance than 255.

by C. Titus Brown at August 23, 2014 10:00 PM

Extracting shotgun reads based on coverage in the data set (assembly-free)

In recent days, we've gotten several requests, including two or three on the khmer mailing list, for ways to extract shotgun reads based on their coverage with respect to the reference. This is fairly easy if you have an assembled genome, but what if you want to avoid doing an assembly? khmer can do this fairly easily using techniques taken from the digital normalization work, which allows you to estimate read coverage directly from the data.

The below is a recipe for computing coverage spectra and slicing reads out of a data set based on their coverage, with no assembly required. It's the first of what I hope to be many practical and useful recipes for working with shotgun data. Let me know what you think!

Uses for extracting reads by coverage include isolating repeats or pulling out mitochondrial DNA. This approach won't work on digitally normalized reads, and is primarily intended for genomes and low-complexity metagenomes. For high-complexity metagenomes we recommend partitioning.

Note: at the moment, the khmer script slice-reads-by-coverage is in the khmer repository under branch feature/collect_reads. Once we've merged it into the master branch and cut a release, we'll remove this note and simply specify the khmer release required.

This recipe uses code from khmer-recipes and dbg-graph-null.

Let's assume you have a simple genome with some 5x repeats, and you've done some shotgun sequencing to a coverage of 150. If your reads are in reads.fa, you can generate a k-mer spectrum from your genome with k=20

load-into-counting.py -x 1e8 -k 20 reads.kh reads.fa
abundance-dist.py -s reads.kh reads.fa reads.dist
./plot-abundance-dist.py reads.dist reads-dist.png --ymax=300

and it would look something like this:


For this (simulated) data set, you can see three peaks: one on the far right, which contains the high-abundance k-mers from your repeats; one in the middle, which contains the k-mers from the single-copy genome; and one all the way on the left at ~1, which contains all of the erroneous k-mers.

This is a useful diagnostic tool, but if you wanted to extract one peak or another, you'd have to compute a summary statistic of some sort on the reads. The khmer package includes just such a 'read coverage' estimator. On this data set, the read coverage spectrum can be generated like so::

~/dev/khmer/sandbox/calc-median-distribution.py reads.kh reads.fa reads-cov.dist
./plot-coverage-dist.py reads-cov.dist reads-cov.png --xmax=600 --ymax=500

and looks like this:


You see the same peaks at roughly the same places. While superficially similar to the k-mer spectrum, this is actually more useful in its own right -- because now you can grab the reads and do things with them.

We provide a script in khmer to extract the reads; slice-reads-by-coverage will take either a min coverage, or a max coverage, or both, and extract the reads that fall in the given interval.

First, let's grab the reads between 50 and 200 coverage -- these are the single-copy genome components. We'll put them in reads-genome.fa.

~/dev/khmer/sandbox/slice-reads-by-coverage.py reads.kh reads.fa reads-genome.fa -m 50 -M 200

Next, grab the reads greater in abundance than 200; these are the repeats. We'll put them in reads-repeats.fa.

~/dev/khmer/sandbox/slice-reads-by-coverage.py reads.kh reads.fa reads-repeats.fa -m 200

Now let's replot the read coverage spectra, first for the genome:

load-into-counting.py -x 1e8 -k 20 reads-genome.kh reads-genome.fa
~/dev/khmer/sandbox/calc-median-distribution.py reads-genome.kh reads-genome.fa reads-genome.dist
./plot-coverage-dist.py reads-genome.dist reads-genome.png --xmax=600 --ymax=500

and then for the repeats:

load-into-counting.py -x 1e8 -k 20 reads-repeats.kh reads-repeats.fa
~/dev/khmer/sandbox/calc-median-distribution.py reads-repeats.kh reads-repeats.fa reads-repeats.dist
./plot-coverage-dist.py reads-repeats.dist reads-repeats.png --xmax=600 --ymax=500
images/slice/reads-genome.png images/slice/reads-repeats.png

and voila! As you can see we have the reads of high coverage in reads-repeats.fa, and the reads of intermediate coverage in reads-genome.fa.

If you look closely, you might note that some reads seem to fall outside the specified slice categories above -- that's presumably because their coverage was predicated on the coverage of other reads in the whole data set, and now that we've sliced out various reads their coverage has dropped.

by C. Titus Brown at August 23, 2014 10:00 PM

August 22, 2014

Titus Brown

A Review: Neanderthal Man: In Search of Lost Genomes

I just finished reading Svante Paabo's autobiography, Neanderthal Man: In Search of Lost Genomes. The book is perfect -- if you're a biologist of any kind, you'll understand most of it without any trouble, and even physicists can probably get a lot out of the story (heh).

The book describes Svante Paabo's journey towards sequencing the Neanderthal and Denisovan genomes, and the attendant scientific and popular science implications. It's a fantastic portrayal of how science really works, from the perspective of a driven, charismatic, and successful scientist.

Beyond the scientific story, which I had not known much about at all, there were two particularly interesting parts of the book.

First, I was surprised at the candor and simultaneous shallowness with which Dr. Paabo discussed his various relationships and bisexuality. Throughout the book there is occasional mention of men, women, marriage, and relationships, and while I don't get much of a sense of how impactful all of this was on him personally, it is striking how little it seems to have impacted his scientific life. There was one particular bit in the middle that I found very understated in reporting, where he and a couple all move into the same apartment building, and then depart with different spouses. I guess I was surprised at the choice to report the personal relationships while avoiding any depth whatsoever. (Craig Venter still takes the cake with a single paragraph in A Life Decoded where he starts the paragraph with a divorce and ends with an engagement. It's a long paragraph, but still.)

Second, it was somewhat dispiriting to watch Dr. Paabo's transition from an enthusiastic young scientist concerned primarily with getting accurate scientific knowledge out there to one who was very focused not only on scientific correctness but on publicity. There's a great section at the beginning where he talks about publishing the first mtDNA sequence from Neanderthals, and he chooses Cell because it allowed longer articles and a better representation of the science; he also spends a lot of time making sure his mtDNA sequence can be reproducibly generated by another lab. By the end, however, he's publishing in Science and Nature, and agonizing over whether or not he'll be first with the publication. (The latter approach seem much more harmful to good scientific practice to me; both the secrecy & competitiveness and the desire to push out papers in Science and Nature are problematic.)

I don't know how much self-reflection went into the narrative, but it would be interesting to know why his approach towards publication changed so much during his career. Is it because science changed as a whole towards requiring the splashy high impact papers? Or is it because he grew in stature and funding requirements to the point where he needed the splashy high impact publications to justify the funding? Or (as he says towards the end) is it because of the perceived career needs of his junior colleagues? Or all three? Or more?

Anyway, a great read. Highly recommended for the sci-curious.


by C. Titus Brown at August 22, 2014 10:00 PM

Jake Vanderplas

Hacking Academia: Data Science and the University

A reflection on our SciFoo breakout session, where we discussed issues of data science within academia.

Almost a year ago, I wrote a post I called the Big Data Brain Drain, lamenting the ways that academia is neglecting the skills of modern data-intensive research, and in doing so is driving away many of the men and women who are perhaps best equipped to enable progress in these fields. This seemed to strike a chord with a wide range of people, and has led me to some incredible opportunities for conversation and collaboration on the subject. One of those conversations took place at the recent SciFoo conference, and this article is my way of recording some reflections on that conversation.

SciFoo is an annual gathering of several hundred scientists, writers, and thinkers sponsored by Digital Science, Nature, O'Reilly Media & Google. SciFoo brings together an incredibly eclectic group of people: I met philosophers, futurists, alien hunters, quantum phys