## November 21, 2014

### Filipe Saraiva

#### Presenting a Season of KDE 2014 student – Minh Ngo

Season of KDE is an outreach program hosted by the KDE community. This year I am working as a mentor on a long-requested project related to Cantor – the development of a Python 3 backend. You can read more about Cantor in my blog (texts in English and Portuguese). So, let’s say welcome and good luck to Minh Ngo, the student behind this project!

Hi,

My name is Minh,

Minh Ngo

I’m a BSc graduate. I’m Vietnamese, but unlike many other Vietnamese students I have spent most of my life in Ukraine. Currently, I’m preparing for the Master’s degree that I will start next semester.

Open source is my free-time hobby, so I would like to make something that is useful for the community. Previously, I participated in the GSoC 2013 program and in several open source projects. Some of my personal projects are available on my GitHub page https://github.com/Ignotus – not as popular as other cool projects, but several are used by other people, and this fact makes me very happy.

Cantor is one of the opportunities to spend time creating a useful thing and to win an exclusive KDE T-shirt :). I decided to start my contribution with the Python 3 backend because a few months ago I studied several courses related to Machine Learning, so I was looking for a stable desktop backend for IPython. I don’t entirely like the notebook version of IPython, and its qtconsole version doesn’t satisfy me in terms of functionality, so I decided to find an existing frontend for IPython that I could tune for myself. And that is how the story with Cantor began.

Happy hacking!

## November 19, 2014

### Matthew Rocklin

#### Blaze Datasets

tl;dr Blaze aids exploration by supporting full databases and collections of datasets.

This post was composed using Blaze version 0.6.6. You can follow along with the following conda command.

conda install -c blaze blaze=0.6.6


When we encounter new data we need to explore broadly to find what exists before we can meaningfully perform analyses. The tools for this task are often overlooked.

This post outlines how Blaze explores collections and hierarchies of datasets, cases where one entity like a database or file or directory might hold many tables or arrays. We use examples from HDF5 files and SQL databases. Blaze understands how the underlying libraries work so that you don’t have to.

## Motivating problem - Intuitive HDF5 File Navigation

For example, if we want to understand the contents of this set of HDF5 files encoding meteorological data then we need to navigate a hierarchy of arrays. This is common among HDF5 files.

Typically we navigate these files in Python with h5py or pytables.

>>> import h5py
>>> f = h5py.File('OMI-Aura_L2-OMAERO_2014m1105t2304-o54838_v003-2014m1106t215558.he5')

HDF5 files organize datasets with an internal file system. The h5py library accesses this internal file system through successive calls to .keys() and item access.

>>> f.keys()
['HDFEOS', 'HDFEOS INFORMATION']

>>> f['/HDFEOS'].keys()

>>> f['/HDFEOS/SWATHS'].keys()
['ColumnAmountAerosol']

>>> f['/HDFEOS/SWATHS/ColumnAmountAerosol'].keys()
['Data Fields', 'Geolocation Fields']

>>> f['/HDFEOS/SWATHS/ColumnAmountAerosol/Data Fields'].keys()
['AerosolIndexUV',
'AerosolIndexVIS',
'AerosolModelMW',
'AerosolModelsPassedThreshold',
'AerosolOpticalThicknessMW',
...

>>> f['/HDFEOS/SWATHS/ColumnAmountAerosol/Data Fields/TerrainPressure']
<HDF5 dataset "TerrainPressure": shape (1643, 60), type "<i2">

>>> f['/HDFEOS/SWATHS/ColumnAmountAerosol/Data Fields/TerrainPressure'][:]
array([[1013, 1013, 1013, ..., 1013, 1013, 1013],
[1013, 1013, 1013, ..., 1013, 1013, 1013],
[1013, 1013, 1013, ..., 1013, 1013, 1013],
...,
[1010,  992,  986, ..., 1013, 1013, 1013],
[ 999,  983,  988, ..., 1013, 1013, 1013],
[1006,  983,  993, ..., 1013, 1013, 1013]], dtype=int16)

This interaction between programmer and interpreter feels like a long and awkward conversation.

Blaze improves the exploration user experience by treating the entire HDF5 file as a single Blaze variable. This allows users to both explore and compute on larger collections of data in the same workflow. This isn’t specific to HDF5 but works anywhere many datasets are bundled together.

## Exploring a SQL Database

For example, a SQL database can be viewed as a collection of tables. Blaze makes it easy to access a single table in a database using a string URI specifying both the database and the table name.

>>> iris = Data('sqlite:////my.db::iris')  # protocol://database::table-name
>>> iris
sepal_length  sepal_width  petal_length  petal_width      species
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
...

This only works if we know ahead of time which table we want; the approach above assumes that the user is already familiar with their data. To resolve this problem we omit the table name and access the database as a variable instead, using the same interface for the entire database as we would for a specific table.

>>> db = Data('sqlite:////my.db')  # protocol://database
>>> db
Data:       Engine(sqlite:////home/mrocklin/workspace/blaze/blaze/examples/data/iris.db)
DataShape:  {
iris: var * {
sepal_length: ?float64,
sepal_width: ?float64,
petal_length: ?float64,
petal_width: ?float64,
species: ?string
...

The db expression points to a SQLAlchemy engine. We print the engine alongside a truncated datashape showing the metadata of the tables in that database. Note that the datashape maps table names to datashapes of those tables, in this case a variable length collection of records with fields for measurements of flowers.

Blaze isn’t doing any of the grunt work here; SQLAlchemy is. SQLAlchemy is a mature Python library that interacts with a wide variety of SQL databases. It provides both database reflection (as we see above) and general querying (as we see below). Blaze provides a convenient front-end.
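For the curious, here is a rough sketch of the kind of reflection Blaze delegates to SQLAlchemy. The in-memory database and single-table schema below are made up for illustration; with a real file you would pass its sqlite:/// URI instead.

```python
# Sketch: discover tables and columns with SQLAlchemy reflection directly.
# We build a throwaway in-memory SQLite database so the example is
# self-contained; the schema here is invented for illustration.
from sqlalchemy import create_engine, text, MetaData

engine = create_engine('sqlite://')  # in-memory database
with engine.begin() as conn:
    conn.execute(text(
        'CREATE TABLE iris (sepal_length REAL, sepal_width REAL, '
        'petal_length REAL, petal_width REAL, species TEXT)'))

meta = MetaData()
meta.reflect(bind=engine)  # inspect the database for tables, columns, types

for table in meta.sorted_tables:
    print(table.name, [column.name for column in table.columns])
```

Reflection like this is what lets Blaze print a full datashape for a database it has never seen before.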

We seamlessly transition from exploration to computation. Here we query for the shortest and longest sepal length per species.

>>> by(db.iris.species, shortest=db.iris.sepal_length.min(),
...                      longest=db.iris.sepal_length.max())
species  longest  shortest
0      Iris-setosa      5.8       4.3
1  Iris-versicolor      7.0       4.9
2   Iris-virginica      7.9       4.9

Blaze doesn’t pull data into local memory; instead it generates SQLAlchemy expressions, which generate SQL, which executes on the foreign database. The (much smaller) result is pulled back into memory and rendered nicely using Pandas.
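For comparison, the same split-apply-combine query expressed directly in pandas on an in-memory frame. This is not how Blaze executes the query above (Blaze compiles it to SQL); the tiny DataFrame here is made-up stand-in data, not the real iris table.

```python
# Sketch: the equivalent groupby/aggregate in pandas, on invented data.
import pandas as pd

iris = pd.DataFrame({
    'species': ['Iris-setosa', 'Iris-setosa', 'Iris-virginica'],
    'sepal_length': [5.1, 4.9, 7.9],
})

# Named aggregation: one output column per (name='function') pair
result = iris.groupby('species').sepal_length.agg(
    shortest='min', longest='max')
print(result)
```

The Blaze by() expression mirrors this shape exactly, which is what makes moving between in-memory and in-database workflows painless.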

## A Larger Database

Improved metadata discovery on SQL databases overlaps with the excellent work done by yhat on db.py. We steal their example, the Lahman Baseball statistics database, as an example of a richer database with a greater variety of tables.

>>> lahman = Data('sqlite:///baseball-archive-2012.sqlite')
>>> lahman.dshape  # ask for the entire datashape
dshape("""{
battingpost: var * {
yearID: ?int32,
round: ?string,
playerID: ?string,
teamID: ?string,
lgID: ?string,
G: ?int32,
AB: ?int32,
R: ?int32,
H: ?int32,
2B: ?int32,
3B: ?int32,
HR: ?int32,
RBI: ?int32,
SB: ?int32,
CS: ?int32,
BB: ?int32,
SO: ?int32,
IBB: ?int32,
HBP: ?int32,
SH: ?int32,
SF: ?int32,
GIDP: ?int32
},
awardsmanagers: var * {
managerID: ?string,
awardID: ?string,
yearID: ?int32,
lgID: ?string,
tie: ?string,
notes: ?string
},
hofold: var * {
hofID: ?string,
yearid: ?int32,
votedBy: ?string,
ballots: ?int32,
inducted: ?string,
category: ?string
},
salaries: var * {
yearID: ?int32,
teamID: ?string,
lgID: ?string,
playerID: ?string,
salary: ?float64
},
pitchingpost: var * {
playerID: ?string,
yearID: ?int32,
round: ?string,
teamID: ?string,
lgID: ?string,
W: ?int32,
L: ?int32,
G: ?int32,
GS: ?int32,
CG: ?int32,
SHO: ?int32,
SV: ?int32,
IPouts: ?int32,
H: ?int32,
ER: ?int32,
HR: ?int32,
BB: ?int32,
SO: ?int32,
BAOpp: ?float64,
ERA: ?float64,
IBB: ?int32,
WP: ?int32,
HBP: ?int32,
BK: ?int32,
BFP: ?int32,
GF: ?int32,
R: ?int32,
SH: ?int32,
SF: ?int32,
GIDP: ?int32
},
managers: var * {
managerID: ?string,
yearID: ?int32,
teamID: ?string,
lgID: ?string,
inseason: ?int32,
G: ?int32,
W: ?int32,
L: ?int32,
rank: ?int32,
plyrMgr: ?string
},
teams: var * {
yearID: ?int32,
lgID: ?string,
teamID: ?string,
franchID: ?string,
divID: ?string,
Rank: ?int32,
G: ?int32,
Ghome: ?int32,
W: ?int32,
L: ?int32,
DivWin: ?string,
WCWin: ?string,
LgWin: ?string,
WSWin: ?string,
R: ?int32,
AB: ?int32,
H: ?int32,
2B: ?int32,
3B: ?int32,
HR: ?int32,
BB: ?int32,
SO: ?int32,
SB: ?int32,
CS: ?int32,
HBP: ?int32,
SF: ?int32,
RA: ?int32,
ER: ?int32,
ERA: ?float64,
CG: ?int32,
SHO: ?int32,
SV: ?int32,
IPouts: ?int32,
HA: ?int32,
HRA: ?int32,
BBA: ?int32,
SOA: ?int32,
E: ?int32,
DP: ?int32,
FP: ?float64,
name: ?string,
park: ?string,
attendance: ?int32,
BPF: ?int32,
PPF: ?int32,
teamIDBR: ?string,
teamIDlahman45: ?string,
teamIDretro: ?string
},
teamshalf: var * {
yearID: ?int32,
lgID: ?string,
teamID: ?string,
Half: ?string,
divID: ?string,
DivWin: ?string,
Rank: ?int32,
G: ?int32,
W: ?int32,
L: ?int32
},
fieldingpost: var * {
playerID: ?string,
yearID: ?int32,
teamID: ?string,
lgID: ?string,
round: ?string,
POS: ?string,
G: ?int32,
GS: ?int32,
InnOuts: ?int32,
PO: ?int32,
A: ?int32,
E: ?int32,
DP: ?int32,
TP: ?int32,
PB: ?int32,
SB: ?int32,
CS: ?int32
},
master: var * {
lahmanID: ?int32,
playerID: ?string,
managerID: ?string,
hofID: ?string,
birthYear: ?int32,
birthMonth: ?int32,
birthDay: ?int32,
birthCountry: ?string,
birthState: ?string,
birthCity: ?string,
deathYear: ?int32,
deathMonth: ?int32,
deathDay: ?int32,
deathCountry: ?string,
deathState: ?string,
deathCity: ?string,
nameFirst: ?string,
nameLast: ?string,
nameNote: ?string,
nameGiven: ?string,
nameNick: ?string,
weight: ?int32,
height: ?float64,
bats: ?string,
throws: ?string,
debut: ?string,
finalGame: ?string,
college: ?string,
lahman40ID: ?string,
lahman45ID: ?string,
retroID: ?string,
holtzID: ?string,
bbrefID: ?string
},
fieldingof: var * {
playerID: ?string,
yearID: ?int32,
stint: ?int32,
Glf: ?int32,
Gcf: ?int32,
Grf: ?int32
},
pitching: var * {
playerID: ?string,
yearID: ?int32,
stint: ?int32,
teamID: ?string,
lgID: ?string,
W: ?int32,
L: ?int32,
G: ?int32,
GS: ?int32,
CG: ?int32,
SHO: ?int32,
SV: ?int32,
IPouts: ?int32,
H: ?int32,
ER: ?int32,
HR: ?int32,
BB: ?int32,
SO: ?int32,
BAOpp: ?float64,
ERA: ?float64,
IBB: ?int32,
WP: ?int32,
HBP: ?int32,
BK: ?int32,
BFP: ?int32,
GF: ?int32,
R: ?int32,
SH: ?int32,
SF: ?int32,
GIDP: ?int32
},
managershalf: var * {
managerID: ?string,
yearID: ?int32,
teamID: ?string,
lgID: ?string,
inseason: ?int32,
half: ?int32,
G: ?int32,
W: ?int32,
L: ?int32,
rank: ?int32
},
appearances: var * {
yearID: ?int32,
teamID: ?string,
lgID: ?string,
playerID: ?string,
G_all: ?int32,
G_batting: ?int32,
G_defense: ?int32,
G_p: ?int32,
G_c: ?int32,
G_1b: ?int32,
G_2b: ?int32,
G_3b: ?int32,
G_ss: ?int32,
G_lf: ?int32,
G_cf: ?int32,
G_rf: ?int32,
G_of: ?int32,
G_dh: ?int32,
G_ph: ?int32,
G_pr: ?int32
},
halloffame: var * {
hofID: ?string,
yearid: ?int32,
votedBy: ?string,
ballots: ?int32,
needed: ?int32,
inducted: ?string,
category: ?string
},
awardsplayers: var * {
playerID: ?string,
awardID: ?string,
yearID: ?int32,
lgID: ?string,
tie: ?string,
notes: ?string
},
schoolsplayers: var * {
playerID: ?string,
schoolID: ?string,
yearMin: ?int32,
yearMax: ?int32
},
schools: var * {
schoolID: ?string,
schoolName: ?string,
schoolCity: ?string,
schoolState: ?string,
schoolNick: ?string
},
teamsfranchises: var * {
franchID: ?string,
franchName: ?string,
active: ?string,
NAassoc: ?string
},
allstarfull: var * {
playerID: ?string,
yearID: ?int32,
gameID: ?string,
teamID: ?string,
lgID: ?string,
GP: ?int32,
startingPos: ?int32
},
tmp_batting: var * {
playerID: ?string,
yearID: ?int32,
stint: ?int32,
teamID: ?string,
lgID: ?string,
G: ?int32,
G_batting: ?int32,
AB: ?int32,
R: ?int32,
H: ?int32,
2B: ?int32,
3B: ?int32,
HR: ?int32,
RBI: ?int32,
SB: ?int32,
CS: ?int32,
BB: ?int32,
SO: ?int32,
IBB: ?int32,
HBP: ?int32,
SH: ?int32,
SF: ?int32,
GIDP: ?int32,
G_old: ?int32
},
seriespost: var * {
yearID: ?int32,
round: ?string,
teamIDwinner: ?string,
lgIDwinner: ?string,
teamIDloser: ?string,
lgIDloser: ?string,
wins: ?int32,
losses: ?int32,
ties: ?int32
},
awardssharemanagers: var * {
awardID: ?string,
yearID: ?int32,
lgID: ?string,
managerID: ?string,
pointsWon: ?int32,
pointsMax: ?int32
},
awardsshareplayers: var * {
awardID: ?string,
yearID: ?int32,
lgID: ?string,
playerID: ?string,
pointsWon: ?float64,
pointsMax: ?int32
},
fielding: var * {
playerID: ?string,
yearID: ?int32,
stint: ?int32,
teamID: ?string,
lgID: ?string,
POS: ?string,
G: ?int32,
GS: ?int32,
InnOuts: ?int32,
PO: ?int32,
A: ?int32,
E: ?int32,
DP: ?int32,
PB: ?int32,
WP: ?int32,
SB: ?int32,
CS: ?int32,
ZR: ?float64
}
}""")

Seeing at once all the tables in the database, all the columns in those tables, and all the types of those columns provides a clear and comprehensive overview of our data. We represent this information as a datashape, a type system that covers everything from NumPy arrays to SQL databases and Spark collections.

We use standard Blaze expressions to navigate more deeply. Things like auto-complete work fine.

>>> lahman.<tab>
lahman.allstarfull          lahman.fieldingpost         lahman.pitchingpost
lahman.appearances          lahman.fields               lahman.relabel
lahman.awardsmanagers       lahman.halloffame           lahman.salaries
lahman.awardsplayers        lahman.hofold               lahman.schema
lahman.awardssharemanagers  lahman.isidentical          lahman.schools
lahman.awardsshareplayers   lahman.like                 lahman.schoolsplayers
lahman.battingpost          lahman.managers             lahman.seriespost
lahman.data                 lahman.managershalf         lahman.teams
lahman.dshape               lahman.map                  lahman.teamsfranchises
lahman.fielding             lahman.master               lahman.teamshalf
lahman.fieldingof           lahman.pitching             lahman.tmp_batting

And we finish with a fun multi-table computation: joining two tables on year, team, and league, then computing the average salary by team name and year.

>>> joined = join(lahman.teams, lahman.salaries, ['yearID', 'teamID', 'lgID'])

>>> by(joined[['name', 'yearID']], average_salary=joined.salary.mean())
name  yearID  average_salary
0         Anaheim Angels    1997  1004370.064516
1         Anaheim Angels    1998  1214147.058824
2         Anaheim Angels    1999  1384704.150000
3         Anaheim Angels    2000  1715472.233333
4         Anaheim Angels    2001  1584505.566667
5         Anaheim Angels    2002  2204345.250000
6         Anaheim Angels    2003  2927098.777778
7         Anaheim Angels    2004  3723506.185185
8   Arizona Diamondbacks    1998   898527.777778
9   Arizona Diamondbacks    1999  2020705.852941
...

Looks good. We compute and store the result to a CSV file with into:

>>> into('salaries.csv',
...      by(joined[['name', 'yearID']], average_salary=joined.salary.mean()))

(Final result here: salaries.csv)

## Beyond SQL

SQL isn’t unique; many systems hold collections or hierarchies of datasets. Blaze supports navigation over Mongo databases, Blaze servers, HDF5 files, or even just dictionaries of pandas DataFrames or CSV files.

>>> d = {'accounts 1': CSV('accounts_1.csv'),
...      'accounts 2': CSV('accounts_2.csv')}

>>> accounts = Data(d)

>>> accounts.dshape
dshape("""{
accounts 1: var * {id: ?int64, name: string, amount: ?int64},
accounts 2: var * {id: ?int64, name: string, amount: ?int64}
}""")

None of this behavior was special-cased in Blaze. The same mechanisms that select a table from a SQL database select a column from a Pandas DataFrame.

## Finish with HDF5 example

### Raw H5Py

Recall that we previously had a long back-and-forth conversation with the Python interpreter.

>>> import h5py
>>> f = h5py.File('OMI-Aura_L2-OMAERO_2014m1105t2304-o54838_v003-2014m1106t215558.he5')

>>> f.keys()
['HDFEOS', 'HDFEOS INFORMATION']

>>> f['/HDFEOS'].keys()
...

### H5Py with Blaze expressions

With Blaze we get a quick high-level overview and an API that is shared with countless other systems.

>>> from blaze import Data
>>> d = Data(f)  # Wrap h5py file with Blaze interactive expression
>>> d
Data:       <HDF5 file "OMI-Aura_L2-OMAERO_2014m1105t2304-o54838_v003-2014m1106t215558.he5" (mode r+)>
DataShape:  {
HDFEOS: {
SWATHS: {
ColumnAmountAerosol: {
Data Fields: {
AerosolIndexUV: 1643 * 60 * int16,
...

By default we see the data and a truncated datashape.

We ask for the datashape explicitly to see the full picture.

>>> d.dshape
dshape("""{
HDFEOS: {
SWATHS: {
ColumnAmountAerosol: {
Data Fields: {
AerosolIndexUV: 1643 * 60 * int16,
AerosolIndexVIS: 1643 * 60 * int16,
AerosolModelMW: 1643 * 60 * uint16,
AerosolModelsPassedThreshold: 1643 * 60 * 10 * uint16,
AerosolOpticalThicknessMW: 1643 * 60 * 14 * int16,
AerosolOpticalThicknessMWPrecision: 1643 * 60 * int16,
AerosolOpticalThicknessNUV: 1643 * 60 * 2 * int16,
AerosolOpticalThicknessPassedThreshold: 1643 * 60 * 10 * 9 * int16,
AerosolOpticalThicknessPassedThresholdMean: 1643 * 60 * 9 * int16,
AerosolOpticalThicknessPassedThresholdStd: 1643 * 60 * 9 * int16,
CloudFlags: 1643 * 60 * uint8,
CloudPressure: 1643 * 60 * int16,
EffectiveCloudFraction: 1643 * 60 * int8,
InstrumentConfigurationId: 1643 * uint8,
MeasurementQualityFlags: 1643 * uint8,
NumberOfModelsPassedThreshold: 1643 * 60 * uint8,
ProcessingQualityFlagsMW: 1643 * 60 * uint16,
ProcessingQualityFlagsNUV: 1643 * 60 * uint16,
RootMeanSquareErrorOfFitPassedThreshold: 1643 * 60 * 10 * int16,
SingleScatteringAlbedoMW: 1643 * 60 * 14 * int16,
SingleScatteringAlbedoMWPrecision: 1643 * 60 * int16,
SingleScatteringAlbedoNUV: 1643 * 60 * 2 * int16,
SingleScatteringAlbedoPassedThreshold: 1643 * 60 * 10 * 9 * int16,
SingleScatteringAlbedoPassedThresholdMean: 1643 * 60 * 9 * int16,
SingleScatteringAlbedoPassedThresholdStd: 1643 * 60 * 9 * int16,
SmallPixelRadiancePointerUV: 1643 * 2 * int16,
SmallPixelRadiancePointerVIS: 1643 * 2 * int16,
SmallPixelRadianceUV: 6783 * 60 * float32,
SmallPixelRadianceVIS: 6786 * 60 * float32,
SmallPixelWavelengthUV: 6783 * 60 * uint16,
SmallPixelWavelengthVIS: 6786 * 60 * uint16,
TerrainPressure: 1643 * 60 * int16,
TerrainReflectivity: 1643 * 60 * 9 * int16,
XTrackQualityFlags: 1643 * 60 * uint8
},
Geolocation Fields: {
GroundPixelQualityFlags: 1643 * 60 * uint16,
Latitude: 1643 * 60 * float32,
Longitude: 1643 * 60 * float32,
OrbitPhase: 1643 * float32,
SolarAzimuthAngle: 1643 * 60 * float32,
SolarZenithAngle: 1643 * 60 * float32,
SpacecraftAltitude: 1643 * float32,
SpacecraftLatitude: 1643 * float32,
SpacecraftLongitude: 1643 * float32,
TerrainHeight: 1643 * 60 * int16,
Time: 1643 * float64,
ViewingAzimuthAngle: 1643 * 60 * float32,
ViewingZenithAngle: 1643 * 60 * float32
}
}
}
},
HDFEOS INFORMATION: {
}
}""")

When we see metadata for everything in this dataset all at once, the full structure becomes apparent. We have two collections of arrays, mostly shaped (1643, 60); the collection in Data Fields holds what appear to be weather measurements, while the collection in Geolocation Fields holds coordinate information.

Moreover we can quickly navigate this structure to compute relevant information.

>>> d.HDFEOS.SWATHS.ColumnAmountAerosol.Geolocation_Fields.Latitude
array([[-67., -67., -67., ..., -61., -60., -60.],
[-67., -67., -67., ..., -61., -60., -60.],
[-67., -67., -68., ..., -61., -61., -60.],
...,
[ 69.,  70.,  71., ...,  85.,  84.,  84.],
[ 69.,  70.,  71., ...,  85.,  85.,  84.],
[ 69.,  70.,  71., ...,  85.,  85.,  84.]], dtype=float32)

>>> d.HDFEOS.SWATHS.ColumnAmountAerosol.Geolocation_Fields.Longitude
array([[  46.,   43.,   40., ...,   -3.,   -5.,   -7.],
[  46.,   43.,   40., ...,   -3.,   -5.,   -7.],
[  46.,   43.,   40., ...,   -4.,   -5.,   -7.],
...,
[ 123.,  124.,  124., ..., -141., -131., -121.],
[ 123.,  124.,  124., ..., -141., -130., -120.],
[ 123.,  123.,  124., ..., -140., -130., -119.]], dtype=float32)

>>> d.HDFEOS.SWATHS.ColumnAmountAerosol.Data_Fields.TerrainPressure
array([[1013, 1013, 1013, ..., 1013, 1013, 1013],
[1013, 1013, 1013, ..., 1013, 1013, 1013],
[1013, 1013, 1013, ..., 1013, 1013, 1013],
...,
[1010,  992,  986, ..., 1013, 1013, 1013],
[ 999,  983,  988, ..., 1013, 1013, 1013],
[1006,  983,  993, ..., 1013, 1013, 1013]], dtype=int16)

>>> d.HDFEOS.SWATHS.ColumnAmountAerosol.Data_Fields.TerrainPressure.min()
620

## Final Thoughts

Often the greatest challenge is finding what you already have.

Discovery and exploration are just as important as computation. By extending Blaze’s expression system to hierarchies of datasets we create a smooth user experience from first introductions to data all the way to analytic queries and saving results.

This work was easy. The pluggable architecture of Blaze made it surprisingly simple to extend the Blaze model from tables and arrays to collections of tables and arrays. We wrote about 40 significant lines of code for each supported backend.

## November 18, 2014

### Titus Brown

#### RFC: The khmer project: what are we, and what are our goals?

As we think about the next few years of khmer development, it is helpful to explore what khmer is, and what our goals for khmer development are. This can provide guiding principles for development, refactoring, extension, funding requests, and collaborations.

## Definition

khmer is an open source project that serves as:

• a stable research platform for novel CS/bio research on data structures and algorithms, mostly k-mer based;
• a set of command line tools for various kinds of data transformations;
• a test bed for software engineering practice in science;
• a Python library for working with k-mers and graph structures;
• an exercise in community building in scientific software engineering;
• an effort to participate in and sustainably grow the bioinformatics ecosystem.

## Goals

Our long term goals for khmer, in some rough order of priority, are:

• Keep khmer versatile and agile enough to easily enable the CS and bio we want to do.

Practical implications: limit complexity of internals as much as possible.

• Continue community building.

Practical implications: run khmer as a real open source project, with everything done in the open; work nicely with other projects.

• Build, sustain, and maintain a set of protocols and recipes around khmer.

• Improve the efficiency (time/memory) of khmer implementations.

Practical implications: optimize, but not at expense of clean code.

Some specifics: streaming; variable sized counters.

• Lower barriers to an increasing user base.

Practical implications: find actual pain points, address if it’s easy or makes good sense.

Some specifics: hash function k > 32, stranded hash function, integrate efficient k-mer cardinality counting, implement dynamically sized data structures.

• Keep khmer technologically up to date.

Practical implications: transition to Python 3.

--titus

p.s. Thanks to Qingpeng Zhang, Travis Collier, and Matt MacManes for comments in the draft stage!

### Thomas Wiecki

#### A modern guide to getting started with Data Science and Python

Python has an extremely rich and healthy ecosystem of data science tools. Unfortunately, to outsiders this ecosystem can look like a jungle (cue snake joke). In this blog post I will provide a step-by-step guide to venturing into this PyData jungle.

What's wrong with the many lists of PyData packages already out there, you might ask? I think that providing too many options can easily overwhelm someone who is just getting started. So instead, I will keep a very narrow scope and focus on the 10% of tools that allow you to do 90% of the work. After you have mastered these essentials you can browse the long lists of PyData packages to decide which to try next.

The upside is that the few tools I will introduce already allow you to do most of the things a data scientist does in their day-to-day work (i.e. data i/o, data munging, and data analysis).

## Installation

It has happened quite a few times that people came up to me and said "I heard Python is amazing for data science so I wanted to start learning it, but I spent two days installing Python and all the other modules!". It's quite reasonable to think that you have to install Python if you want to use it, but installing the full PyData stack manually when you don't know which tools you actually need is quite an undertaking. So I strongly recommend against doing that.

Fortunately for you, the fine folks at Continuum have created the Anaconda Python distribution that installs most of the PyData stack for you, and the modules it doesn't provide out of the box can easily be installed via a GUI. The distribution is also available for all major platforms so save yourself the two days and just use that!

## IPython Notebook

After Python is installed, most people start by launching it. Again, very reasonable but unfortunately dead wrong. I don't know a single SciPythonista who uses the Python command shell directly (YMMV). Instead, IPython, and specifically the IPython Notebook, are incredibly powerful Python shells that are used ubiquitously in PyData. I strongly recommend you start directly with the IPython Notebook (IPyNB) and don't bother with anything else; you won't regret it. In brief, the IPyNB is a Python shell that you access via your web browser. It allows you to mix code, text, and graphics (even interactive ones). This blog post was written in an IPyNB, and it's rare to find a talk at a Python conference that does not use the IPython Notebook. It comes preinstalled with Anaconda so you can just start using it. Here's an example of what it looks like:

In [1]:
print('Hello World')

Hello World



This thing is a rocket -- every time I hear one of the core devs talk at a conference I am flabbergasted by all the new things they have cooked up. To get an idea of some of the advanced capabilities, check out this short tutorial on IPython widgets. These allow you to attach sliders to control a plot interactively:

In [1]:
from IPython.display import YouTubeVideo

Out[1]:

## Pandas

Normally, people recommend you start by learning NumPy (pronounced num-pie, not num-pee!), which is the library that provides multi-dimensional arrays. Certainly this was the way to go a few years ago, but I hardly use NumPy at all today. The reason is that NumPy has become more of a core library that's used by other libraries which provide a much nicer interface. Thus, the main library to use for working with data is Pandas. It can input and output data in all kinds of formats (including databases), do joins and other SQL-like operations for shaping the data, handle missing values with ease, support time series, and it has basic plotting capabilities, basic statistical functionality, and much more. There is certainly a learning curve to all its features, but I strongly suggest you go through most of the documentation as a first step. Trust me, the time you invest will be repaid a thousandfold by being more efficient in your data munging. Here are a few quick tricks to whet your appetite:

In [18]:
import pandas as pd

df = pd.DataFrame({'A': 1.,
                   'B': pd.Timestamp('20130102'),
                   'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                   'D': pd.Series([1, 2, 1, 2], dtype='int32'),
                   'E': pd.Categorical(["test", "train", "test", "train"]),
                   'F': 'foo'})

In [19]:
df

Out[19]:
A B C D E F
0 1 2013-01-02 1 1 test foo
1 1 2013-01-02 1 2 train foo
2 1 2013-01-02 1 1 test foo
3 1 2013-01-02 1 2 train foo

Columns can be accessed by name:

In [17]:
df.B

Out[17]:
0   2013-01-02
1   2013-01-02
2   2013-01-02
3   2013-01-02
Name: B, dtype: datetime64[ns]


Compute the sum of D for each category in E:

In [21]:
df.groupby('E').sum().D

Out[21]:
E
test     2
train    4
Name: D, dtype: int32


Doing this in NumPy (or *gasp* Matlab!) would be much more clunky.
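To see why, here is a minimal sketch of that same "sum of D per category in E" computed with bare NumPy arrays; we have to manage the grouping machinery ourselves.

```python
# Sketch: group-by-and-sum with plain NumPy, no pandas.
import numpy as np

E = np.array(['test', 'train', 'test', 'train'])
D = np.array([1, 2, 1, 2], dtype='int32')

# For each distinct category, mask the value array and sum by hand.
sums = {str(category): int(D[E == category].sum())
        for category in np.unique(E)}
print(sums)  # {'test': 2, 'train': 4}
```

One line of pandas replaces the masking, iteration, and bookkeeping above, and unlike this sketch it scales to multiple keys and multiple aggregations.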

There's a ton more. If you're not convinced, check out 10 minutes to pandas where I borrowed this from.

## Seaborn

The main plotting library of Python is Matplotlib. However, I don't recommend using it directly, for the same reason I don't recommend spending time learning NumPy initially. While Matplotlib is very powerful, it is its own jungle and requires lots of tweaking to make your plots look shiny. So instead, I recommend starting with Seaborn. Seaborn essentially treats Matplotlib as a core library (just like Pandas does with NumPy). I will briefly illustrate the main benefits of seaborn. Specifically, it:

1. creates aesthetically pleasing plots by default (for one thing, it does not default to the jet colormap),
2. creates statistically meaningful plots, and
3. understands the pandas DataFrame so the two work well together.

While pandas comes prepackaged with anaconda, seaborn is not directly included but can easily be installed with conda install seaborn.

### Statistically meaningful plots

In [5]:
# IPython magic to create plots within cells
%matplotlib inline

In [7]:
import seaborn as sns

# Load one of the data sets that come with seaborn
tips = sns.load_dataset("tips")

sns.jointplot("total_bill", "tip", tips, kind='reg');


As you can see, with just one line we create a pretty complex statistical plot including the best-fitting linear regression line along with confidence intervals, marginals and the correlation coefficients. Recreating this plot in matplotlib would take quite a bit of (ugly) code, including calls to scipy to run the linear regression and manually applying the linear regression formula to draw the line (and I don't even know how to do the marginal plots and confidence intervals off the top of my head). This and the next example are taken from the tutorial on quantitative linear models.

### Works well with Pandas DataFrame

Data has structure. Often, there are different groups or categories we are interested in (pandas' groupby functionality is amazing in this case). For example, the tips data set looks like this:

In [9]:
tips.head()

Out[9]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

We might ask if smokers tip differently than non-smokers. Without seaborn, this would require a pandas groupby together with the complex code for plotting a linear regression. With seaborn, we can provide the name of the column we wish to split by as the col keyword argument:

In [11]:
sns.lmplot("total_bill", "tip", tips, col="smoker");


Pretty neat, eh?

As you dive deeper you might want to control certain details of these plots at a more fine-grained level. Because seaborn just calls into matplotlib, you will probably want to start learning that library at that point. For most things I'm pretty happy with what seaborn provides, however.

# Conclusions

The idea of this blog post was to provide a very select number of packages which maximize your efficiency when starting with data science in Python.

## Acknowledgements

Thanks to Katie Green and Andreas Dietzel for helpful feedback on an earlier draft.

### Continuum Analytics

#### Blaze Datasets

tl;dr Blaze aids exploration by supporting full databases and collections of datasets.

## November 14, 2014

### Titus Brown

#### A Saturday morning conversation about publishing inconclusions

Here's an excerpt from an e-mail to a student whose committee I'm on; they were asking me about a comment their advisor had made that they shouldn't put a result in a paper because "It'll confuse the reviewer."

One thing to keep in mind is that communicating the results _is_ important. "It'll confuse the reviewer" is not always shorthand for "well, it's true but we can't say it because it's too confusing", which is how I think you're interpreting your advisor's comment; it is also shorthand for "well, we don't really have the depth of data/information to understand why the result is this way, so there's something more complicated going on, but we can't talk about it because we don't understand it ourselves." These can lead to the most productive bits of a project, if followed up (but it takes a lot of time to do so...)

Their response:

I'm curious, however, if we don't understand what's going on ourselves, shouldn't that be all the more reason to publish something? Because then other people with more knowledge can read it and they may know what's going on? Or at least help us?

And my response:

Well, that's definitely not how it works. Let me see if I can think of some reasons why...

First, it's much easier to get confused than it is to fight your way out of confusion - and you usually learn something in the process, obviously. So you have a lot more people who are confused than have learned something, and the latter is considered more interesting and worthy of communication than the former.

Second, writing papers is a lot of work and takes a lot of time. If you're confused (and hence probably wrong about any conclusions) it's probably not worth the effort...

Third, reading papers is also a lot of work! Most papers are never read seriously by anyone other than the reviewers. So doubling or tripling the number of papers to include confusing or inconclusive data would probably not be helpful.

Fourth, most conclusions require multiple lines of evidence, most of which are often not developed until you have at least some solid hypotheses about why you're seeing what you're seeing from one line of evidence. So a lot of those pubs would be wrong.

Fifth, failure is frequently underdetermined -- that is, we may have wrong or incorrect results and not know it, or not know why. It's rarely worth chasing down exactly why something didn't work unless it fails reproducibly and it's critical to your results.

That having been said, there's a movement to make publishing data easier and bring data citations into standard practice, which I think is partly targeted at the same problem you're asking about.

What am I missing?

thanks, --titus

### William Stein

#### SageMathCloud Notifications are Now Better

I just made live a new notification system for SageMathCloud, which I spent all week writing.

These notifications are what you see when you click the bell in the upper right.   This new system replaces the one I made live two weeks ago.     Whenever somebody actively *edits* (using the web interface) any file in any project you collaborate on, a notification will get created or updated.    If a person *comments* on any file in any project you collaborate on (using the chat interface to the right), then not only does the notification get updated, there is also a little red counter on top of the bell and also in the title of the  SageMathCloud tab.   In particular, people will now be much more likely to see the chats you make on files.

NOTE: I have not yet enabled any sort of daily email notification summary, but that is planned.

Some technical details: Why did this take all week? It's because the technology that makes it work behind the scenes was fairly difficult for me to figure out how to implement. I implemented a way to create an object that can be used simultaneously by many clients and supports realtime synchronization... but is stored by the distributed Cassandra database instead of a file in a project. Any changes to that object get synchronized around very quickly. It's similar to how synchronized text editing (with several people at once) works, but I rethought differential synchronization carefully, and also figured out how to synchronize using an eventually consistent database.

This will be useful for implementing a lot of other things in SageMathCloud that operate at a different level than "one single project". For example, I plan to add functions so you can access these same "synchronized databases" from Python processes -- then you'll be able to have Sage worksheets (say) running on several different projects, but all saving their data to some common synchronized place (backed by the database). Another application will be a listing of the last 100 (say) files you've opened, with easy ways to store extra info about them. It will also be easy to make account and project settings more realtime, so when you change something, it automatically takes effect and is also synchronized across other browser tabs you may have open.

If you're into modern Single Page App web development, this might remind you of Angular or React or Hoodie or Firebase -- what I did this week is probably kind of like some of the sync functionality of those frameworks, but I use Cassandra (instead of MongoDB, say) and differential synchronization.
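As an aside, the core trick of differential synchronization can be sketched in a few lines. This is only a toy illustration of the idea, not the SageMathCloud implementation (which also has to cope with an eventually consistent database): each side keeps a shadow copy of the shared object, sends diffs of its own copy against that shadow, and patches incoming diffs into its local copy.

```python
# Toy sketch of differential synchronization (illustration only; the real
# implementation is far more involved and database-backed).
import difflib

def make_patch(old, new):
    """Describe the edits turning `old` into `new` as (tag, i1, i2, replacement)."""
    sm = difflib.SequenceMatcher(a=old, b=new)
    return [(tag, i1, i2, new[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != 'equal']

def apply_patch(text, patch):
    """Apply a patch to `text`; offsets shift as earlier edits change lengths."""
    offset = 0
    for tag, i1, i2, repl in patch:
        text = text[:i1 + offset] + repl + text[i2 + offset:]
        offset += len(repl) - (i2 - i1)
    return text

# The client edits its copy; the diff against its shadow is sent over and
# applied to the server's (possibly concurrently edited) copy.
shadow = "hello world"         # last state both sides agreed on
client = "hello brave world"   # client inserted "brave "
server = "hello world!"        # server appended "!" concurrently

print(apply_patch(server, make_patch(shadow, client)))  # hello brave world!
```

The point of diffing against the shadow (rather than sending whole documents) is that non-overlapping concurrent edits merge cleanly, as in the example above.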

I BSD-licensed the differential synchronization code that I wrote as part of the above.

## November 12, 2014

### Jake Vanderplas

#### The Hipster Effect: An IPython Interactive Exploration

This week I started seeing references all over the internet to this paper: The Hipster Effect: When Anticonformists All Look The Same. It describes a simple mathematical model of conformity and non-conformity among a mutually interacting population, and finds some interesting results: namely, conformity among a population of self-conscious non-conformists is similar to a phase transition in a time-delayed thermodynamic system. In other words, with enough hipsters around responding to delayed fashion trends, a plethora of facial hair and fixed-gear bikes is a natural result.

Also naturally, upon reading the paper I wanted to try to reproduce the work. The paper solves the problem analytically for a continuous system and shows the precise values of certain phase transitions within the long-term limit of the postulated system. Though such theoretical derivations are useful, I often find it more intuitive to simulate systems like this in a more approximate manner to gain hands-on understanding. By the end of this notebook, we'll be using IPython's incredible interactive widgets to explore how the inputs to this model affect the results.

## Mathematically Modeling Hipsters

We'll start by defining the problem, and going through the notation suggested in the paper. We'll consider a group of $$N$$ people, and define the following quantities:

• $$\epsilon_i$$ : this value is either $$+1$$ or $$-1$$. $$+1$$ means person $$i$$ is a hipster, while $$-1$$ means they're a conformist.
• $$s_i(t)$$ : this is also either $$+1$$ or $$-1$$. This indicates person $$i$$'s choice of style at time $$t$$. For example, $$+1$$ might indicate a bushy beard, while $$-1$$ indicates clean-shaven.
• $$J_{ij}$$ : The influence matrix. This is a value greater than zero which indicates how much person $$j$$ influences person $$i$$.
• $$\tau_{ij}$$ : The delay matrix. This is an integer telling us the length of delay for the style of person $$j$$ to affect the style of person $$i$$.

The idea of the model is this: on any given day, person $$i$$ looks at the world around him or her, and sees some previous day's version of everyone else. This information is $$s_j(t - \tau_{ij})$$.

The amount that person $$j$$ influences person $$i$$ is given by the influence matrix, $$J_{ij}$$, and after putting all the information together, we see that person $$i$$'s mean impression of the world's style is

$m_i(t) = \frac{1}{N} \sum_j J_{ij} \cdot s_j(t - \tau_{ij})$

Given the problem setup, we can quickly check whether this impression matches their own current style:

• if $$m_i(t) \cdot s_i(t) > 0$$, then person $$i$$ matches those around them
• if $$m_i(t) \cdot s_i(t) < 0$$, then person $$i$$ looks different than those around them

A hipster who notices that their style matches that of the world around them will risk giving up all their hipster cred if they don't change quickly; a conformist will have the opposite reaction. Because $$\epsilon_i$$ = $$+1$$ for a hipster and $$-1$$ for a conformist, we can encode this observation in a single value which tells us which way the person will lean that day:

$x_i(t) = -\epsilon_i m_i(t) s_i(t)$

Simple! If $$x_i(t) > 0$$, then person $$i$$ will more likely switch their style that day, and if $$x_i(t) < 0$$, person $$i$$ will more likely maintain the same style as the previous day. So we have a formula for how to update each person's style based on their preferences, their influences, and the world around them.

But the world is a noisy place. Each person might have other things going on that day, so instead of using this value directly, we can turn it into a probabilistic statement. Consider the function

$\phi(x;\beta) = \frac{1 + \tanh(\beta \cdot x)}{2}$

We can plot this function quickly:

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Use seaborn styles for nice-looking plots
import seaborn; seaborn.set()

In [2]:
x = np.linspace(-1, 1, 1000)
for beta in [0.5, 1, 5]:
    plt.plot(x, 0.5 * (1 + np.tanh(beta * x)),
             label=r'$\beta = {0:.1f}$'.format(beta))
plt.legend(loc='upper left', fontsize=18)
plt.xlabel('$x$', size=18); plt.ylabel(r'$\phi(x;\beta)$', size=18);


This gives us a nice way to move from our preference $$x_i$$ to a probability of switching styles. Here $$\beta$$ is inversely related to noise. For large $$\beta$$, the noise is small and we basically map $$x > 0$$ to a 100% probability of switching, and $$x<0$$ to a 0% probability of switching. As $$\beta$$ gets smaller, the probabilities get less and less distinct.

## The Code

Let's see this model in action. We'll start by defining a class which implements everything we've gone through above:

In [3]:
class HipsterStep(object):
    """Class to implement hipster evolution

    Parameters
    ----------
    initial_style : length-N array
        values > 0 indicate one style, while values <= 0 indicate the other.
    is_hipster : length-N array
        True or False, indicating whether each person is a hipster
    influence_matrix : N x N array
        Array of non-negative values. influence_matrix[i, j] indicates
        how much influence person j has on person i
    delay_matrix : N x N array
        Array of positive integers. delay_matrix[i, j] indicates the
        number of days delay between person j's influence on person i.
    """
    def __init__(self, initial_style, is_hipster,
                 influence_matrix, delay_matrix,
                 beta=1, rseed=None):
        self.initial_style = initial_style
        self.is_hipster = is_hipster
        self.influence_matrix = influence_matrix
        self.delay_matrix = delay_matrix

        self.rng = np.random.RandomState(rseed)
        self.beta = beta

        # make s array consisting of -1 and 1
        self.s = -1 + 2 * (np.atleast_2d(initial_style) > 0)
        N = self.s.shape[1]

        # make eps array consisting of -1 and 1
        self.eps = -1 + 2 * (np.asarray(is_hipster) > 0)

        # create influence_matrix and delay_matrix
        self.J = np.asarray(influence_matrix, dtype=float)
        self.tau = np.asarray(delay_matrix, dtype=int)

        # validate all the inputs
        assert self.s.ndim == 2
        assert self.s.shape[1] == N
        assert self.eps.shape == (N,)
        assert self.J.shape == (N, N)
        assert np.all(self.J >= 0)
        assert np.all(self.tau > 0)

    @staticmethod
    def phi(x, beta):
        return 0.5 * (1 + np.tanh(beta * x))

    def step_once(self):
        N = self.s.shape[1]

        # iref[i, j] gives the index for the j^th individual's
        # time-delayed influence on the i^th individual
        iref = np.maximum(0, self.s.shape[0] - self.tau)

        # sref[i, j] gives the previous state of the j^th individual
        # which affects the current state of the i^th individual
        sref = self.s[iref, np.arange(N)]

        # m[i] is the mean of weighted influences of other individuals
        m = (self.J * sref).sum(1) / self.J.sum(1)

        # From m, we use the sigmoid function to compute a transition probability
        transition_prob = self.phi(-self.eps * m * self.s[-1], beta=self.beta)

        # Now choose steps stochastically based on this probability
        new_s = np.where(transition_prob > self.rng.rand(N), -1, 1) * self.s[-1]

        # Add this to the results, and return
        self.s = np.vstack([self.s, new_s])
        return self.s

    def step(self, N):
        for i in range(N):
            self.step_once()
        return self.s

    def trend(self, hipster=True, return_std=True):
        if hipster:
            subgroup = self.s[:, self.eps > 0]
        else:
            subgroup = self.s[:, self.eps < 0]

        if return_std:
            return subgroup.mean(1), subgroup.std(1)
        return subgroup.mean(1)


Now we'll create a function which plots the trend for a certain number of time steps:

In [4]:
def plot_results(Npeople=500, Nsteps=200,
                 hipster_frac=0.8, initial_state_frac=0.5,
                 delay=20, log10_beta=0.5, rseed=42):
    rng = np.random.RandomState(rseed)

    initial_state = (rng.rand(1, Npeople) > initial_state_frac)
    is_hipster = (rng.rand(Npeople) > hipster_frac)

    influence_matrix = abs(rng.randn(Npeople, Npeople))
    influence_matrix.flat[::Npeople + 1] = 0

    delay_matrix = 1 + rng.poisson(delay, size=(Npeople, Npeople))

    h = HipsterStep(initial_state, is_hipster,
                    influence_matrix, delay_matrix=delay_matrix,
                    beta=10 ** log10_beta, rseed=rseed)
    h.step(Nsteps)

    def beard_formatter(y, loc):
        if y == 1:
            return 'bushy-\nbeard'
        elif y == -1:
            return 'clean-\nshaven'
        else:
            return ''

    t = np.arange(Nsteps + 1)

    fig, ax = plt.subplots(2, sharex=True, figsize=(8, 6))
    ax[0].imshow(h.s.T, cmap='binary', interpolation='nearest')
    ax[0].set_ylabel('individual')
    ax[0].axis('tight')
    ax[0].grid(False)

    mu, std = h.trend(True)
    ax[1].plot(t, mu, c='red', label='hipster')
    ax[1].fill_between(t, mu - std, mu + std, color='red', alpha=0.2)

    mu, std = h.trend(False)
    ax[1].plot(t, mu, c='blue', label='conformist')
    ax[1].fill_between(t, mu - std, mu + std, color='blue', alpha=0.2)

    ax[1].set_xlabel('time')
    ax[1].set_ylabel('Trend')
    ax[1].legend(loc='best')
    ax[1].set_ylim(-1.1, 1.1)
    ax[1].set_xlim(0, Nsteps)
    ax[1].yaxis.set_major_formatter(plt.FuncFormatter(beard_formatter))


## Exploring the Result

With this code in place, we can now explore the result. We'll start by seeing what happens when just 10% of the population is made up of non-conformist hipsters:

In [5]:
plot_results(hipster_frac=0.1)


Let's describe this plot briefly: the top panel has 500 rows and 200 columns: each row represents an individual person, and the color (white or black) along the row represents the style of that person at that time.

In the bottom panel, we see the mean and standard deviation of the styles of all hipsters (red) and all conformists (blue).

This plot shows something relatively unsurprising: when there are only a few hipsters in the population, we quickly reach an equilibrium where hipsters all have one style (a bushy beard) while the norm-core conformists have the opposite (clean shaven faces).

Let's see what happens if there are more hipsters in the population:

In [6]:
plot_results(hipster_frac=0.5)


With half the population made up of hipsters, the trend washes out. There is so much noise and confusion about styles, that both the hipsters and the conformists have a wide range of styles at any given time.

Now let's see what happens when we have even more hipsters:

In [7]:
plot_results(hipster_frac=0.8)


Now this is getting interesting! With a population dominated by hipsters, we end up approaching steady cycles in trends. The hipsters start trying to distance themselves from the "normal" style, and then the normal style moves to catch up with them. The hipsters then swing the other way, and the process repeats. This is an example of the "phase transition" that the author of the original paper talked about in analogy to thermodynamic systems: above certain critical values of the model parameters, a qualitatively new type of behavior appears out of the noise. This oscillation can be thought of as a rough and simplistic mathematical model for recurrent cycles of cultural and fashion trends that anyone over a couple decades old has probably noticed over time.

But let's explore this even more.

## Fully Interactive

One of the nice pieces of the IPython notebook is the ability to quickly create interactive visualizations. Unfortunately this only works when you're viewing the notebook live (i.e. a static HTML view on a blog post won't give you any interaction). If you're reading this on my blog or on nbviewer, then you can download the notebook and run it with IPython to see these features.

What we'll do is to call IPython's interactive tools on our Python function, which will create javascript sliders allowing us to explore the parameter space of this hipster conformity model. I'd encourage you to download the notebook and try it out!

In [8]:
from IPython.html.widgets import interact, fixed

interact(plot_results, hipster_frac=(0.0, 1.0), delay=(1, 50),
initial_state_frac=(0.0, 1.0), log10_beta=(-2.0, 2.0),
Npeople=fixed(500), Nsteps=fixed(200), rseed=fixed(42));


Again, unless you download the notebook and run it on a local Python kernel, all you'll see is a static graphic above. But with the interactive version, you can really start to see how these various parameters affect the system.

## Conclusion

This has been a lot of fun, and if you've read this far I hope this helped you understand the mathematics of Hipster-dom! For more information and analysis, go read the paper. It goes much deeper than the rough, discrete approximation I've used here.

For further ideas, I'd love to see a simulation of how this looks if we add in spatial information, and create a delay related to that information. Would you start to see pockets of people adopting similar styles? My guess is yes, but I'm not entirely sure... there's only one way to find out.

Happy coding!

This post was written entirely in the IPython notebook. You can download this notebook, or see a static view here.

## November 10, 2014

### Titus Brown

#### Some thoughts on Journal Impact Factor

A colleague just e-mailed me to ask me how I felt about journal impact factor being such a big part of the Academic Ranking of World Universities - they say that 20% of the ranking weight comes from # of papers published in Nature and Science. So what do I think?

## On evaluations

I'm really not a big fan of rankings and evaluations in the first place. This is largely because I feel that evaluations are rarely objective. For one very specific example, last year at MSU I got formal evaluations from both of my departments. Starting with the same underlying data (papers published, students graduated, grants submitted/awarded, money spent, classes taught, body mass/height ratio, gender weighting, eye color, miles flown, talks given, liquid volume of students' tears generated, committees served on, etc.) one department gave me a "satisfactory/satisfactory/satisfactory" while the other department gave me an "excellent/excellent/excellent." What fun! (I don't think the difference was the caliber of the departments.)

Did I mention that these rankings helped determine my raise for the year?

Anyhoo, I find the rating and ranking scheme within departments at MSU to be largely silly. It's done in an ad hoc manner by untrained people, and as far as I can tell, is biased against people who are not willing to sing their own praises. (I brought up the Dunning-Kruger effect in my last evaluation meeting. Heh.) That's not to say there's not serious intent -- they are factored into raises, and at least one other purpose of evaluating assistant professors is so that once you fire their ass (aka "don't give them tenure") there's a paper trail of critical evaluations where you explained that they were in trouble.

Metrics are part of the job, though; departments evaluate their faculty so they can see who, if anyone, needs help or support or mentoring, and to do this, they rely at least in part on metrics. Basically, if someone has lots of funding and lots of papers, they're probably not failing miserably at being a research professor; if they're grant-poor and paper-poor, they're targets for further investigation. There are lots of ways to evaluate, but metrics seem like an inextricable part of it.

## Back to the impact factor

Like faculty evaluations, ranking by the impact factor of the journals that university faculty publish in is an attempt to predict future performance using current data.

But impact factor is extremely problematic for many reasons. It's based on citations, which (over the long run) may be an OK measure of impact, but are subject to many confounding factors, including field-specific citation patterns. It's an attempt to predict the success of individual papers on a whole-journal basis, which falls apart in the face of variable editorial decisions. High-impact journals are also often read more widely by people than low-impact journals, which yields a troubling circularity in terms of citation numbers (you're more likely to cite a paper you've read!) Worse, the whole system is prone to being gamed in various ways, which is leading to high rates of retractions for high-impact journals, as well as outright fraud.

Impact factor is probably a piss-poor proxy for paper impact, in other words.

If impact factor was just a thing that didn't matter, I wouldn't worry. The real trouble is that impact factors have real-world effect - many countries use impact factor of publications as a very strong weight in funding and promotion decisions. Interestingly, the US is not terribly heavy handed here - most universities seem pretty enlightened about considering the whole portfolio of a scientist, at least anecdotally. But I can name a dozen countries that care deeply about impact factor for promotions and raises.

And apparently impact factor affects university rankings, too!

Taking a step back, it's not clear that any good ranking scheme can exist, and if it does, we're certainly not using it. All of this is a big problem if you care about fostering good science.

The conundrum is that many people like rankings, and it seems futile to argue against measuring and ranking people and institutions. However, any formalized ranking system can be gamed and perverted, which ends up sometimes rewarding the wrong kind of people, and shutting out some of the right kind of people. (The Reed College position on the US News & World Report ranking system is worth reading here.) More generally, in any ecosystem, the competitive landscape is evolving, and a sensible measure today may become a lousy measure tomorrow as the players evolve their strategies; the stricter the rules of evaluation, and the more entrenched the evaluation system, the less likely it is to adapt, and the more misranking will result. So ranking systems need to evolve continuously.

At its heart, this is a scientific management challenge. Rankings and metrics do pretty explicitly set the landscape of incentives and competition. If our goal in science is to increase knowledge for the betterment of mankind, then the challenge for scientific management is to figure out how to incentivize behaviors that trend in that direction in the long term. If you use bad or outdated metrics, then you incentivize the wrong kind of behavior, and you waste precious time, energy, and resources. Complicating this is the management structure of academic science, which is driven by many things that include rankings and reputation - concepts that range from precise to fuzzy.

My position on all of this is always changing, but it's pretty clear that the journal system is kinda dumb and rewards the wrong behavior. (For the record, I'm actually a big fan of publications, and I think citations are probably not a terribly bad measure of impact when measured on papers and individuals, although I'm always happy to engage in discussions on why I'm wrong.) But the impact factor is especially horrible. The disproportionate effect that high-IF glamour mags like Cell, Nature and Science have on our scientific culture is clearly a bad thing - for example, I'm hearing more and more stories about editors at these journals warping scientific stories directly or indirectly to be more press-worthy - and when combined with the reproducibility crisis I'm really worried about the short-term future of science. Journal Impact Factor and other simple metrics are fundamentally problematic and are contributing to the problem, along with the current peer review culture and a whole host of other things. (Mike Eisen has written about this a lot; see e.g. this post.)

In the long term I think a much more experimental culture of peer review and alternative metrics will emerge. But what do we do for now?

More importantly:

## How can we change?

I think my main advice to faculty is "lead, follow, or get out of the way."

Unless you're a recognized big shot, or willing to take somewhat insane gambles with your career, "leading" may not be productive - following or getting out of the way might be best. But there are a lot of things you can do here that don't put you at much risk, including:

• be open to a broader career picture when hiring and evaluating junior faculty;
• argue on behalf of alternative metrics in meetings on promotion and tenure;
• use sites like Google Scholar to pick out some recent papers to read in depth when hiring faculty and evaluating grants;
• avoid making (or push back at) cheap shots at people who don't have a lot of high-impact-factor papers;
• invest in career mentoring that is more nuanced than "try for lots of C-N-S papers or else" - you'd be surprised how often this is the main advice assistant professors take away...
• believe in and help junior faculty that seem to have a plan, even if you don't agree with the plan (or at least leave them alone ;)

What if you are a recognized big shot? Well, there are lots of things you can do. You are the people who set the tone in the community and in your department, and it behooves you to think scientifically about the culture and reward system of science. The most important thing you can do is think and investigate. What evidence is there behind the value of peer review? Are you happy with C-N-S editorial policies, and have you talked to colleagues who get rejected at the editorial review stage more than you do? Have you thought about per-article metrics? Do you have any better thoughts on how to improve the system than 'fund more people', and how would you effect changes in this direction by recognizing alternate metrics during tenure and grant review?

The bottom line is that the current evaluation systems are the creation of scientists, for scientists. It's our responsibility to critically evaluate them, and perhaps evolve them when they're inadequate; we shouldn't just complain about how the current system is broken and wait for someone else to fix it.

## Addendum: what would I like to see?

Precisely predicting the future importance of papers is obviously kind of silly - see this great 1994 paper by Gans and Shepherd on rejected classics papers, for example -- and is subject to all sorts of confounding effects. But this is nonetheless what journals are accustomed to doing: editors at most journals, especially the high impact factor ones, select papers based on projected impact before sending them out for review, and/or ask the reviewers to review impact as well.

So I think we should do away with impact review and review for correctness instead. This is why I'm such a big fan of PLOS One and PeerJ, who purport to do exactly that.

But then, I get asked, what do we do about selecting out papers to read? Some (many?) scientists claim that they need the filtering effect of these selective journals to figure out what they should be reading.

There are a few responses to this.

First, it's fundamentally problematic to outsource your attention to editors at journals, for reasons mentioned above. There's some evidence that you're being drawn into a manipulated and high-retraction environment by doing that, and that should worry you.

But let's say you feel you need something to tell you what to read.

Well, second, this is technologically solvable - that's what search engines already do. There's a whole industry of search engines that give great results based on integrating free text search, automatic content classification, and citation patterns. Google Scholar does a great job here, for example.

Third, social media (aka "people you know") provides some great recommendation systems! People who haven't paid much attention to Twitter or blogging may not have noticed, but in addition to person-to-person recommendations, there are increasingly good recommendation systems coming on line. I personally get most of my paper recs from online outlets (mostly people I follow, but I've found some really smart people to follow on Twitter!). It's a great solution!

Fourth, if one of the problems is that many journals review for correctness AND impact together, why not separate them? For example, couldn't journals like Science or Nature evolve into literature overlays that highlight papers published in impact-blind journals like PLOS One or PeerJ? I can imagine a number of ways that this could work, but if we're so invested in having editors pick papers for us, why not have them pick papers that have been reviewed for scientific correctness first, and then elevate them to our attention with their magic editorial pen?

I don't see too many drawbacks to this vs the current approach, and many improvements. (Frankly this is where I see most of scientific literature going, once preprint archives become omnipresent.)

So that's where I want and expect to see things going. I don't see ranking based on predicted impact going away, but I'd like to see it more reflective of actual impact (and be measured in more diverse ways).

--titus

p.s. People looking for citations of high retraction rate, problematic peer review, and the rest could look at one of my earlier blog posts on problems with peer review. I'd be interested in more citations, though!

## November 06, 2014

### Fabian Pedregosa

#### Plot memory usage as a function of time

One of the lesser known features of the memory_profiler package is its ability to plot memory consumption as a function of time. This was implemented by my friend Philippe Gervais, previously a colleague at INRIA and now working for Google.

With this feature it is possible to generate very easily a plot of the memory consumption as a function of time. The result will be something like this:

where you can see the memory used (on the y-axis) as a function of time (x-axis). Furthermore, we have used two functions, test1 and test2, and the colored brackets make it possible to see at what time these functions start and finish.

This plot was generated with the following simple script:

import time

@profile
def test1():
    n = 10000
    a = [1] * n
    time.sleep(1)
    return a

@profile
def test2():
    n = 100000
    b = [1] * n
    time.sleep(1)
    return b

if __name__ == "__main__":
    test1()
    test2()


What happens here is that we have two functions, test1() and test2(), in which we create two lists of different sizes (the one in test2 is bigger). We call time.sleep() for one second so that the function does not return too soon, giving us time to get reliable memory measurements.

The @profile decorator is optional: it lets memory_profiler know when the function has been called, so it can plot the brackets indicating that. If you don't use the decorator, the example will work just fine, except that the brackets will not appear in your plot.

Suppose we have saved the script as test1.py. We run the script as

$ mprof run test1.py

where mprof is an executable provided by memory_profiler. If the above command was successful it will print something like this:

$ mprof run test1.py
mprof: Sampling memory every 0.1s
running as a Python program...


The above command will create a .dat file in your current working directory, something like mprofile_20141108113511.dat. This file (you can inspect it; it's a text file) contains the memory measurements for your program.
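Since the .dat file is plain text, you can also read it yourself. Here is a minimal sketch of a parser, assuming sample lines of the form MEM <memory_in_MiB> <unix_timestamp> (the layout I have seen memory_profiler write; it may vary between versions, so treat this as illustrative):

```python
# Minimal parser for mprof's .dat output. Assumes "MEM <mib> <timestamp>"
# sample lines plus CMDLINE/FUNC metadata lines; format may vary by version.
def parse_mprof_dat(lines):
    """Return (timestamps, mem_in_mib) lists from an iterable of .dat lines."""
    ts, mem = [], []
    for line in lines:
        fields = line.split()
        if fields and fields[0] == 'MEM':
            mem.append(float(fields[1]))   # memory sample, in MiB
            ts.append(float(fields[2]))    # unix timestamp of the sample
    return ts, mem

# A made-up sample in the assumed format, for illustration:
sample = """CMDLINE python test1.py
MEM 13.61 1415443200.10
MEM 14.02 1415443200.21
FUNC test1 13.61 1415443200.10 14.02 1415443200.21
"""
ts, mem = parse_mprof_dat(sample.splitlines())
print(mem)  # [13.61, 14.02]
```

With the timestamps and values in hand you could, for instance, plot the data yourself with matplotlib instead of relying on mprof plot.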

You can now plot the memory measurements with the command

mprof plot

This will open a matplotlib window and show you the plot.

As you see, attention has been paid to the default values so that the plot it generates already looks decent without much effort. The not-so-nice part is that, at least as of November 2014, if you want to customize the plot you'll have to look at and modify the mprof script. Some refactoring is still needed in order to make it easier to customize the plots (work in progress).

## November 02, 2014

### Titus Brown

#### Doing biology, however it needs to be done

Sean Eddy wrote an interesting blog post on how scripting is something every biologist should learn to do. This spurred a few discussions on Twitter and elsewhere, most of which devolved into the usual arguments about what, precisely, biologists should be taught.

I always find these discussions not merely predictable but rather beside the point. Statisticians will always complain that people need a better appreciation of stats; bioinformaticians will point to alignment or sequence comparison or whatnot; evolutionary biologists will talk about how important evolution is; physicists will point to more math; etc. But the truth is there's very, very little that is necessary to be a biologist.

My perspective on this is informed by my own background, which is a tad idiosyncratic. Despite being an assistant professor in a Microbiology department and (soon) a professor in a VetMed program, I have taken more physics courses than biology courses; heck, I've taught more biology courses than I've taken. I've never taken a chemistry course and know virtually nothing about biochemistry, in particular. Most of my evolutionary knowledge derives from my reading for my research on Avida; I've never been taught anything about ecology and still know very little. I failed my first qual exam at Caltech because I didn't actually know anything about cells (which event, ahem, spurred me to learn).
Despite this utter lack of formal background in biology, I spent 5-10 years of my life doing experimental molecular biology for developmental biology research (after 3-5 years learning basic molecular biology), and I've learned what I think is a reasonable amount about cell biology into the bargain. I know a fair bit about developmental biology, evo-devo, molecular biology, and genomics. My PhD is actually in Biology, with a Developmental Biology focus. But I learned it all as I needed it. (My undergrad degree is in Unapplied Mathematics.)

My weakest formal training is probably in stats, where I know enough to point out where whatever system I'm studying is violating standard statistical requirements, but not enough to point out how to rescue our approach. Despite having "run" a bioinformatics lab for the last few years, my formal bioinformatics background is basically nil - I took a strong programming background, learned a bunch of biology and genomics, and then realized that much of bioinformatics is fairly obvious at that point. I don't really understand Hidden Markov Models or sequence alignment (but shh, don't tell anyone!).

With all of this, what do I call myself? Well, I definitely consider myself a biologist, as do at least a few different hiring panels, including one at UC Davis :). And when I talk to other biologists, I think that at least some of them agree - I'm focused on answering biological questions. I do so primarily in collaboration at this point, and primarily from the standpoint of data, but: biology.

So here's my conclusion: to be a biologist, one must be seriously trying to study biology. Period. Clearly you must know something about biology in order to be effective, and critical thinking is presumably pretty important; I think "basic competency in scientific practice" is probably the minimum bar, but even there you can imagine lab techs or undergraduates putting in useful work at a pretty introductory level.
I think there are many useful skills to have, but I have a hard time concluding that any of them are strictly necessary.

The more interesting question, to my mind, is what should we be teaching undergraduates and graduate students in biology? And there I unequivocally agree with the people who prioritize some reasonable background in stats, and some reasonable background in data analysis (with R or Python - something more than Excel). What's more important than teaching any one thing in particular, though, is that the whole concept that biologists can avoid math or computing in their training (be it stats, modeling, simulation, programming, data science/data analysis, or whatever) needs to die. That is over. Dead, done, over.

One particular challenge we are facing now is that we don't have many people capable of teaching these younger biologists the appropriate data analysis skills, because most biologists (including the non-research-active faculty that do most teaching) don't know anything about them, and data analysis in biology is about data analysis in biology -- you can't just drop in a physicist or an engineer to teach this stuff.

At the end of the day, though, a scientist either learns what they need to know in order to do their research, or they collaborate with others to do it. As data becomes ever more important in biology, I expect more and more biologists will learn how to do their own analysis. One of my interests is in figuring out how to help biologists make this transition if they want to.

So perhaps we can shift from talking about what you must know in order to practice biology, and talk about what we're going to teach, to whom, and when, to the people who are the biologists of the future?

--titus

## October 27, 2014

### Juan Nunez-Iglesias

#### GitHub's hidden PR comments

It’s time to draw my “continuous integration in Python” series to a close.
This final post ties all six previous posts together and is the preferred write-up to share more widely and on which to provide feedback.

Almost everything I know about good Python development I’ve learned from Stéfan van der Walt, Tony Yu, and the rest of the scikit-image team. But a few weeks ago, I was trying to emulate the scikit-image CI process for my own project: cellom2tif, a tool to liberate images from a rather useless proprietary format. (I consider this parenthetical comment sufficient fanfare to announce the 0.2 release!) As I started copying and editing config files, I found that even from a complete template, getting started was not very straightforward. First, scikit-image has much more complicated requirements, so a lot of the .travis.yml file was just noise for my purposes. And second, as detailed in the previous posts, a lot of the steps are not found or recorded anywhere in the repository, but rather must be navigated to on the webpages of GitHub, Travis, and Coveralls. I therefore decided to write this series as both a notetaking exercise and a guide for future CI novices. (Such as future me.)

To recap, here are my six steps to doing continuous integration in Python with pytest, Travis, and Coveralls:

If you do all of the above at the beginning of your projects, you’ll be in a really good place one, two, five years down the line, when many academic projects grow far beyond their original scope in unpredictable ways and end up with much broken code. (See this wonderful editorial by Zeeya Merali for much more on this topic.)

## Reducing the boilerplate with PyScaffold

But it’s a lot of stuff to do for every little project. I was about to make myself some minimal setup.cfg and .travis.yml template files so that I could have these ready for all new projects, when I remembered PyScaffold, which sets up a Python project’s basic structure automatically (setup.py, package_name/__init__.py, etc.).
Sure enough, PyScaffold has a --with-travis option that implements all my recommendations, including pytest, Travis, and Coveralls. If you set up your projects with PyScaffold, you’ll just have to turn on Travis-CI on your GitHub repo admin and Coveralls on coveralls.io, and you’ll be good to go.

## When Travises attack

I’ve made a fuss about how wonderful Travis-CI is, but it breaks more often than I’d like. You’ll make some changes locally and ensure that the tests pass, but when you push them to GitHub, Travis fails. This can happen for various reasons:

- your environment is different (e.g. NumPy versions differ between your local build and Travis’s VMs).
- you’re testing a function that depends on random number generation and have failed to set the seed.
- you depend on some web resource that was temporarily unavailable when you pushed.
- Travis has updated its VMs in some incompatible way.
- you have more memory/CPUs locally than Travis allows.
- some other, not-yet-understood-by-me reason.

Of these, the first three are acceptable. You can use conda to match your environments both locally and on Travis, and you should always set the seed for randomised tests. For network errors, Travis provides a special function, travis_retry, that you can prefix your commands with.

Travis VM updates should theoretically be benign and not cause any problems, but, in recent months, they have been a significant source of pain for the scikit-image team: every monthly update by Travis broke our builds. That’s disappointing, to say the least. For simple builds, you really shouldn’t run into this. But for major projects, this is an unnecessary source of instability.

Further, Travis VMs don’t have unlimited memory and disk space for your builds (naturally), but the limits are not strictly defined (unnaturally). This means that builds requiring “some” memory or disk space randomly fail. Again, disappointing.
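Two of the fixes just mentioned — conda-matched environments and travis_retry for flaky network steps — can be sketched in a .travis.yml. This is a sketch, not from the original series: the Miniconda installer URL and the numpy pin are placeholder assumptions to adapt to your own project:

```yaml
language: python
python:
    - "2.7"
install:
    # travis_retry guards the download against transient network failures
    - travis_retry wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh
    - bash miniconda.sh -b -p $HOME/miniconda
    - export PATH="$HOME/miniconda/bin:$PATH"
    # pin the same versions you use locally so the environments match
    - conda create --yes -n testenv python=$TRAVIS_PYTHON_VERSION numpy=1.9 pytest
    - source activate testenv
script:
    - py.test
```

The pins are the important part: they turn "works on my machine" discrepancies into a single list of versions kept under version control.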
Travis could, for example, guarantee some minimal specs that everyone could program against, and request additional space either as special exemptions or at a cost.

Finally, there are the weird failures. I don’t have any examples on hand, but I’ll just note that sometimes Travis builds fail where your local copy works fine every single time. Sometimes rebuilding fixes things, and other times you have to change some subtle but apparently inconsequential thing before the build is fixed. These would be mitigated if Travis allowed you to clone their VM images so you could run them on a local VM or on your own EC2 allocation.

A too-common Travis occurrence: randomly failing tests

In all though, Travis is a fantastic resource, and you shouldn’t let my caveats stop you from using it. They are just something to keep in mind before you pull all your hair out.

## The missing test: performance benchmarks

Testing helps you maintain the correctness of your code. However, as Michael Droettboom eloquently argued at SciPy 2014, all projects are prone to feature creep, which can progressively slow code down. Airspeed Velocity is to benchmarks what pytest is to unit tests, and allows you to monitor your project’s speed over time.

Unfortunately, benchmarks are a different beast to tests, because you need to keep the testing computer’s specs and load constant for each benchmark run. Therefore, a VM-based CI service such as Travis is out of the question. If your project has any performance component, it may well be worth investing in a dedicated machine only to run benchmarks. The machine could monitor your GitHub repo for changes and PRs, check them out when they come in, run the benchmarks, and report back. I have yet to do this for any of my projects, but will certainly consider this strongly in the future.

## Some reservations about GitHub

The above tools all work great as part of GitHub’s pull request (PR) development model.
It’s a model that is easy to grok, works well with new programmers, and has driven massive growth in the open-source community. Lately, I recommend it with a bit more trepidation than I used to, because it does have a few high-profile detractors, notably Linux and git creator Linus Torvalds, and OpenStack developer Julien Danjou. To paraphrase Julien, there are two core problems with GitHub’s chosen workflow, both of which are longstanding and neither of which shows any sign of improving.

First, comments on code diffs are buried by subsequent changes, whether the changes are a rebase or they simply change the diff. This makes it very difficult for an outside reviewer to assess what discussion, if any, resulted in the final/latest design of a PR. This could be a fairly trivial fix (colour-code outdated diffs, rather than hiding them), so I would love to see some comments from GitHub as to what is taking so long.

Expect to see a lot of these when using pull requests.

Second, bisectability is broken by fixup commits. The GitHub development model is not only geared towards small, incremental commits being piled on to a history, but it actively encourages these with their per-commit badging of a user’s contribution calendar. Fixup commits make bug hunting with git bisect more difficult, because some commits will not be able to run a test suite at all. This could be alleviated by considering only commits merging GitHub PRs, whose commit messages start with Merge pull request #, but I don’t know how to get git to do this automatically (ideas welcome in the comments).

I disagree with Julien that there is “no value in the social hype [GitHub] brings.” In fact, GitHub has dramatically improved my coding skills, and no doubt countless others’. For many, it is their first experience with code review. Give credit where it is due: GitHub is driving the current, enormous wave of open-source development.
But there is no doubt it needs improvement, and it’s sad to see GitHub’s developers apparently ignoring their critics. I hope the latter will be loud enough soon that GitHub will have no choice but to take notice.

## Final comments

This series, including this post, sums up my current thinking on CI in Python. It’s surely incomplete: I recently came across a curious “Health: 88%” badge on Mitchell Stanton-Cook’s BanzaiDB README. Clicking it took me to the project’s landscape.io page, which appears to do for coding style what Travis does for builds/tests and Coveralls does for coverage. How it measures “style” is not yet clear to me, but it might be another good CI tool to keep track of.

Nevertheless, since it’s taken me a few years to get to this stage in my software development practice, I hope this series will help other scientists get there faster. If any more experienced readers think any of my advice is rubbish, please speak up in the comments! I’ll update the post(s) accordingly. CI is a big rabbit hole and I’m still finding my way around.

## October 22, 2014

### Titus Brown

#### Infrastructure for Data Intensive Biology - a statement of work

Since being chosen as a Moore Foundation Data Driven Discovery Investigator, I've been putting together the paperwork at UC Davis to actually receive the money. Part of that is putting together a budget and a Statement of Work to help guide the conversation between me, Davis, and the Moore Foundation. Here's what I sent Chris Mentzel at Moore:

Title: Infrastructure for Data Intensive Biology

OUTCOME: In support of demonstrating the high level of scientific impact that data scientists deliver through their focus on interdisciplinary data-driven research, funds from this award will be used to better understand gene function in non-model organisms through the development of new workflows and better data sharing technology for large-scale data analysis.
Research direction 1: Develop and extend protocols for non-model genomic and transcriptomic analysis.

Research direction 2: Integrate and extend existing workflow and data analysis software into a cloud-enabled deployment system with a Web interface for executing protocols.

Research direction 3: Investigate and develop a distributed graph database system and distributed query functionality to support distributed data-driven discovery. (DDDD :)

For more of the background, see my full award submission, my presentation, and a science fiction story that I would like to enable. Comments and pointers welcome!

--titus

## October 17, 2014

### Juan Nunez-Iglesias

#### Readme badges

We’re finally ready to wrap up this topic. By now you can:

But, much as exercise is wasted if your bathroom scale doesn’t automatically tweet about it, all this effort is for naught if visitors to your GitHub page can’t see it!

Most high-profile open-source projects these days advertise their CI efforts. Above, I cheekily called this showing off, but it’s truly important: anyone who lands on your GitHub page is a potential user or contributor, and if they see evidence that your codebase is stable and well-tested, they are more likely to stick around.

Badging your README is easy. (You do have a README, don’t you?) In Travis, go to your latest build. Near the top right, click on the “build passing” badge:

You’ll get an overlay, with a pull-down menu for all the different options for getting the badge. You can grab the image URL, or you can directly grab the Markdown to put into your markdown-formatted README, or a bunch of other options, including RST:

Just copy and paste the appropriate code and add it to your README file wherever you please.
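For reference, the Markdown that the Travis overlay gives you looks something like the following (user/repo is a placeholder here; copy the exact URL from the overlay rather than hand-writing it, since the badge URL format may differ for your account):

```markdown
[![Build Status](https://travis-ci.org/user/repo.svg?branch=master)](https://travis-ci.org/user/repo)
```

The image shows the latest build state, and the link takes visitors straight to your build logs.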
Meanwhile, on your repository’s Coveralls page, on the right-hand side, you will find another pull-down menu with the appropriate URLs:

Again, just grab whichever URL is appropriate for your needs (I prefer Markdown-formatted READMEs), and add it to your README, usually next to the Travis badge.

And you’re done! You’ve got automated tests, tons of test coverage, you’re running everything correctly thanks to configuration files, and all this is getting run on demand thanks to Travis and Coveralls. And thanks to badging, the whole world knows about it:

### William Stein

#### A Non-technical Overview of the SageMathCloud Project

What problems is the SageMathCloud project trying to solve? What pain points does it address? Who are the competitors, and what is the state of the technology right now?

## What problems are you trying to solve, and why are these a problem?

- Computational Education: How can I teach a course that involves mathematical computation and programming?
- Computational Research: How can I carry out collaborative computational research projects?
- Cloud computing: How can I get easy, user-friendly, collaborative access to a remote Linux server?

## What are the pain points of the status quo and who feels the pain?

Student/Teacher pain:

- Getting students to install software needed for a course on their computers is a major pain; sometimes it is just impossible, because no major math software (not even Sage) supports all recent versions of Windows/Linux/OS X/iOS/Android.
- Getting university computer labs to install the software you need for a course is frustrating and expensive (time and money).
- Even when computer labs work, they are often being used by another course, stuffy, and students can't possibly do all their homework there, so computation gets short shrift. Lab keyboards, hardware, etc. are all hard to get used to. Crappy monitors.
- Painful, confusing problems copying files around between teachers and students.
- Helping a student or collaborator with their specific problem is very hard without physical access to their computer.

Researcher pain:

- Making backups every few minutes of the complete state of everything when doing research is often hard and distracting, but important for reproducibility.
- Copying documents around, or emailing or pushing/pulling them to revision control, is frustrating and confusing.
- Installing obscure software is frustrating and distracting from the research they really want to do.

Everybody:

- It is frustrating not having LIVE working access to your files wherever you are. (Dropbox/Github doesn't solve this, since files are static.)
- It is difficult to leave computations running remotely.

## Why is your technology poised to succeed?

- When it works, SageMathCloud solves every pain point listed above.
- The timing is right, due to massive improvements in web browsers during the last 3 years.
- I am on full sabbatical this year, so at least success isn't totally impossible due to not working on the project.
- I have been solving the above problems in less scalable ways for myself, colleagues and students since the 1990s.
- SageMathCloud has many users that provide valuable feedback.
- We have already solved difficult problems since I started this project in Summer 2012 (and launched the first version in April 2013).

## Who are your competitors?

There are no competitors with a similar range of functionality.
However, there are many webapps that have some overlap in capabilities:

- Mathematical overlap: Online Mathematica: "Bring Mathematica to life in the cloud"
- Python overlap: Wakari: "Web-based Python Data Analysis"; also PythonAnywhere
- LaTeX overlap: ShareLaTeX, WriteLaTeX
- Web-based IDEs/terminals, which target writing webapps (not research or math education): c9.io, nitrous.io, codio.com, terminal.com
- Homework: WebAssign and WebWork

Right now, SageMathCloud gives away for free far more than any other similar site, due to very substantial temporary financial support from Google, the NSF and others.

## What’s the total addressable market?

Though our primary focus is the college mathematics classroom, there is a larger market:

- Students: all undergrad/high school students in the world taking a course involving programming or mathematics
- Teachers: all teachers of such courses
- Researchers: anybody working in areas that involve programming or data analysis

Moreover, the web-based platform for computing that we're building lends itself to many other collaborative applications.

## What stage is your technology at?

- The site is up and running and has 28,413 monthly active users.
- There are still many bugs.
- I have a precise todo list that would take me at least 2 months fulltime to finish.

## Is your solution technically feasible at this point?

- Yes. It is only a matter of time until the software is very polished.
- Moreover, we have compute resources to support significantly more users.
- But without money (from paying customers or investment), if growth continues at the current rate then we will have to clamp down on free quotas for users.

## What technical milestones remain?

- Infrastructure for creating automatically-graded homework problems.
- Fill in lots of details and polish.

## Do you have external credibility with technical/business experts and customers?

- Business experts: I don't even know any business experts.
- Technical experts: I founded the Sage math software, which is 10 years old and relatively well known by mathematicians.
- Customers: We have no customers; we haven't offered anything for sale.

## Market research?

- I know about math software and its users as a result of founding the Sage open source math software project, NSF-funded projects I've been involved in, etc.

## Is the intellectual property around your technology protected?

- The IP is software.
- The website software is mostly new Javascript code we wrote that is copyright Univ. of Washington and mostly not open source; it depends on various open source libraries and components.
- The Sage math software is entirely open source.

## Who are the team members to move this technology forward?

I am the only person working on this project fulltime right now.

- Everything: William Stein -- UW professor
- Browser client code: Jon Lee, Andy Huchala, Nicholas Ruhland -- UW undergrads
- Web design, analytics: Harald Schilly -- Austrian grad student
- Hardware: Keith Clawson

## Why are you the ideal team?

- We are not the ideal team.
- If I had money maybe I could build the ideal team, leveraging my experience and connections from the Sage project...

## October 16, 2014

### Jake Vanderplas

#### How Bad Is Your Colormap? (Or, Why People Hate Jet – and You Should Too)

I made a little code snippet that I find helpful, and you might too:

In [1]:

```python
def grayify_cmap(cmap):
    """Return a grayscale version of the colormap"""
    cmap = plt.cm.get_cmap(cmap)
    colors = cmap(np.arange(cmap.N))

    # convert RGBA to perceived greyscale luminance
    # cf. http://alienryderflex.com/hsp.html
    RGB_weight = [0.299, 0.587, 0.114]
    luminance = np.sqrt(np.dot(colors[:, :3] ** 2, RGB_weight))
    colors[:, :3] = luminance[:, np.newaxis]

    return cmap.from_list(cmap.name + "_grayscale", colors, cmap.N)
```

What this function does is give you a luminance-correct grayscale version of any matplotlib colormap.
I've found this useful for quickly checking how my plots might appear if printed in black and white, but I think it's probably even more useful for stoking the flame of the internet's general rant against jet. If you want to take a step toward joining the in-crowd of chromatically-sensitive data viz geeks, your best bet is to start by bashing jet.

Even if you don't know it by name, I can guarantee that if you've read many scientific papers, you've seen jet before. For example, here's a snapshot of a plot from a neuroscience journal which is skewered by an appropriately ranty blogpost on the subject:

Jet is the default colorbar originally used by matlab, and this default was inherited in the early days of Python's matplotlib package. The reasons not to use jet are numerous, and you can find good arguments against it across the web. For some more subdued and nuanced arguments, I'd start with the paper Rainbow Color Map (Still) Considered Harmful and, for more general visualization tips, Ten Simple Rules for Better Figures.

So what do I have to add to this discussion that hasn't already been said? Well, nothing really, except the code snippet I shared above. Let me show you what it does.

## Taking the Color Out of Jet

Let's start by defining some data and seeing how it looks with the default jet colormap:

In [2]:

```python
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 6)
y = np.linspace(0, 3)[:, np.newaxis]
z = 10 * np.cos(x ** 2) * np.exp(-y)
```

We'll use matplotlib's imshow command to visualize this. By default, it will use the "jet" colormap:

In [4]:

```python
plt.imshow(z)
plt.colorbar();
```

At first glance this might look OK. But upon closer examination, you might notice that jet's luminance profile is incredibly complicated. Because your eye has different levels of sensitivity to light of different colors, the luminance is not simply the sum of the RGB values as you might naively expect, but some weighted Euclidean sum of the individual values.
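To make that weighted Euclidean sum concrete, here is a small sketch (not from the original post) that applies the same HSP-model weights used in grayify_cmap to a couple of pure colors:

```python
import numpy as np

def perceived_luminance(rgb):
    """HSP-model perceived brightness of an RGB triple in [0, 1]."""
    weights = np.array([0.299, 0.587, 0.114])
    rgb = np.asarray(rgb, dtype=float)
    # weighted Euclidean (root-sum-square) combination of the channels
    return float(np.sqrt(np.dot(rgb ** 2, weights)))

# Yellow and blue both use "full" channel values,
# but yellow appears far brighter to the eye:
yellow = perceived_luminance([1, 1, 0])  # ~0.94
blue = perceived_luminance([0, 0, 1])    # ~0.34
```

This is exactly why jet's bright yellow and cyan bands pop out of a plot while its blue regions recede, regardless of the underlying data values.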
You can find more information than you'd ever need to know on imagemagick's website. When you take the jet colormap used above and convert it to luminance using the code snippet above, you get this:

In [5]:

```python
plt.imshow(z, cmap=grayify_cmap('jet'))
plt.colorbar();
```

It's a mess! The greyscale-only version of this colormap has strange luminance spikes in the middle, and makes it incredibly difficult to figure out what's going on in a plot with a modicum of complexity. Much better is to use a colormap with a uniform luminance gradient, such as the built-in grayscale colormap. Let's plot this beside the previous two:

In [6]:

```python
cmaps = [plt.cm.jet, grayify_cmap('jet'), plt.cm.gray]
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
fig.subplots_adjust(wspace=0)

for cmap, ax in zip(cmaps, axes):
    im = ax.imshow(z, cmap=cmap)
    ax.set_title(cmap.name)
    fig.colorbar(im, ax=ax)
```

In particular, notice that in the left panel, your eye is drawn to the yellow and cyan regions, because the luminance is higher. This can have the unfortunate side-effect of highlighting "features" in your data which may not actually exist!

We can see this luminance spike more clearly if we look at the color profile of jet up close:

In [7]:

```python
def show_colormap(cmap):
    im = np.outer(np.ones(10), np.arange(100))
    fig, ax = plt.subplots(2, figsize=(6, 1.5),
                           subplot_kw=dict(xticks=[], yticks=[]))
    fig.subplots_adjust(hspace=0.1)
    ax[0].imshow(im, cmap=cmap)
    ax[1].imshow(im, cmap=grayify_cmap(cmap))

show_colormap('jet')
```

Once you have the grayscale lined up with the color version, it's easy to point out these luminance spikes in the jet spectrum. By comparison, take a look at the Cube Helix colormap:

In [8]:

```python
show_colormap('cubehelix')
```

This is a rainbow-like colormap which – by design – has a uniform luminance gradient across its progression of colors. It's certainly not the best choice in all situations, but you could easily argue that it's always a better choice than jet.

## All the Colormaps!
It's useful to see this sort of visualization for all the available colormaps in matplotlib. Here's a quick script that does this:

In [18]:

```python
fig, axes = plt.subplots(36, 6, figsize=(10, 7))
fig.subplots_adjust(left=0, right=1, bottom=0, top=1,
                    hspace=0.1, wspace=0.1)

im = np.outer(np.ones(10), np.arange(100))

cmaps = [m for m in plt.cm.datad if not m.endswith("_r")]
cmaps.sort()

axes = axes.T.ravel()
for ax in axes:
    ax.axis('off')

for cmap, color_ax, gray_ax, null_ax in zip(cmaps, axes[1::3],
                                            axes[2::3], axes[::3]):
    del null_ax
    color_ax.set_title(cmap, fontsize=10)
    color_ax.imshow(im, cmap=cmap)
    gray_ax.imshow(im, cmap=grayify_cmap(cmap))
```

There are some colormaps in here that have very nice, linear luminance gradients, and this is something you should keep in mind when choosing your color map. Much more could be written about choosing an appropriate color map for any given data; for a more in-depth discussion of matplotlib's maps (and some interesting luminance illustrations), you can refer to matplotlib's choosing a colormap documentation. If you're interested in streamlined statistical plotting in Python with well thought-out default color choices, I'd suggest taking a look at Michael Waskom's seaborn project, and especially the associated Color Palette Tutorial.

I hope you find this grayify_cmap snippet helpful, and thanks for reading! This post was written entirely in the IPython notebook. You can download this notebook, or see a static view here.

### William Stein

#### Public Sharing in SageMathCloud, Finally

SageMathCloud (SMC) is a free (NSF, Google and UW supported) website that lets you collaboratively work with Sage worksheets, IPython notebooks, LaTeX documents and much, much more. All work is snapshotted every few minutes, and copied out to several data centers, so if something goes wrong with a project running on one machine (right before your lecture begins or homework assignment is due), it will pop up on another machine.
We designed the backend architecture from the ground up to be very horizontally scalable and have no single points of failure.

This post is about an important new feature: you can now mark a folder or file so that all other users can view it, and very easily copy it to their own project. This solves several problems:

- Problem: You create a "template" project, e.g., with pre-installed software, worksheets, IPython notebooks, etc., and want other users to easily be able to clone it as a new project. Solution: Mark the home directory of the project public, and share the link widely.
- Problem: You create a syllabus for a course, an assignment, a worksheet full of 3d images, etc., that you want to share with a group of students. Solution: Make the syllabus or worksheet public, and share the link with your students via email and on the course website. (Note: You can also use a course document to share files with all students privately.) For example...
- Problem: You run into a problem using SMC and want help. Solution: Make the worksheet or code that isn't working public, and post a link in a forum asking for help.
- Problem: You write a blog post explaining how to solve a problem and write related code in an SMC worksheet, which you want your readers to see. Solution: Make that code public and post a link in your blog post.

Here's a screencast.

Each SMC project has its own local "project server", which takes some time to start up, and serves files, coordinates Sage, terminal, and IPython sessions, etc. Public sharing completely avoids the project server: it works fine even if the project server is not running, it's always fast, and there is no startup time. Moreover, public sharing reads the live files from your project, so you can update the files in a publicly shared directory, add new files, etc., and users will see these changes (when they refresh, since it's not automatic).
As an example, here is the cloud-examples github repo as a share. If you click on it (and have a SageMathCloud account), you'll see this:

## What Next?

There is an enormous amount of natural additional functionality to build on top of public sharing. For example, not all document types can be previewed in read-only mode right now; in particular, IPython notebooks, task lists, LaTeX documents, images, and PDF files must be copied from the public share to another project before people can view them. It is better to release a first usable version of public sharing before systematically going through and implementing the additional features needed to support all of the above. You can make complicated Sage worksheets with embedded images and 3d graphics, and those can be previewed before copying them to a project.

Right now, the only way to visit a public share is to paste the URL into a browser tab and load it. Soon the projects page will be re-organized so you can search for publicly shared paths, see all public shares that you have previously visited, who shared them, how many +1's they've received, comments, etc. Also, I plan to eventually make it so public shares will be visible to people who have not logged in, and when viewing a publicly shared file or directory, there will be an option to start it running in a limited project, which will vanish from existence after a period of inactivity (say).

There are also dozens of details that are not yet implemented. For example, it would be nice to be able to directly download files (and directories!) to your computer from a public share. And it's also natural to share a folder or file with a specific list of people, rather than sharing it publicly. If somebody is viewing a public file and you change it, they should likely see the update automatically. Right now when viewing a share, you don't even know who shared it, and if you open a worksheet it can automatically execute Javascript, which is potentially unsafe.
Once public content is easily found, if somebody posts "evil" content publicly, there needs to be an easy way for users to report it.

## Sharing will permeate everything

Sharing has been thought about a great deal during the last few years in the context of sites such as Github, Facebook, Google+ and Twitter. With SMC, we've developed a foundation for interactive collaborative computing in a browser, and will introduce sharing on top of that in a way that is motivated by your problems. For example, as with Github or Google+, when somebody makes a copy of your publicly shared folder, this copy should be listed (under "copies") and it could start out public by default. There is much to do.

One reason it took so long to release the first version of public sharing is that I kept imagining that sharing would happen at the level of complete projects, just like sharing in Github. However, when thinking through your problems, it makes way more sense in SMC to share individual directories and files. Technically, sharing at this level works well for read-only access, but not for read-write access, since projects are mapped to Linux accounts.

Another reason I have been very hesitant to support sharing is that I've had enormous trouble over the years with spammers posting content that gets me in trouble (with my University -- it is illegal for UW to host advertisements). However, by not letting search engines index content, the motivation for spammers to post nasty content is greatly reduced.

Imagine publicly sharing recipes for automatically gradable homework problems, which use the full power of everything installed in SMC, get forked, improved, used, etc.

### Juan Nunez-Iglesias

#### Coveralls dashboard

In this series of posts, we’ve covered:

Today I will show you how to continuously check your test coverage using Coveralls.

Travis runs whatever commands you tell it to run in your .travis.yml file.
Normally, that’s just installing your program and its requirements, and running your tests. If you wanted instead to launch some nuclear missiles, you could do that. (Assuming you were happy to put the launch keys in a public git repository… =P)

The Coveralls service, once again free for open-source repositories, takes advantage of this: you just need to install an extra piece of software from PyPI, and run it after your tests have passed. Do so by adding the line pip install coveralls to your before_install section, and just the coveralls command to a new after_success section:

language: python
python:
- "2.7"
- "3.4"
before_install:
- pip install pytest pytest-cov
- pip install coveralls
script:
- py.test
after_success:
- coveralls

This will push your coverage report to Coveralls every time Travis is run, i.e., every time you push to your GitHub repository. You will need to turn on the service by:

• going to coveralls.io and signing in with your GitHub credentials
• clicking on “Repositories”, and “Add Repo” (and perhaps on “Sync GitHub Repos”, top right)
• switching your repository’s switch to “On”

That’s it! With just two small changes to your .travis.yml and the flick of a switch, you will get continuous monitoring of your test coverage with each push, and also with each pull request. By default, Coveralls will even comment on your pull requests, so that you instantly know whether someone is not testing their submitted code! Plus, you get a swanky web dashboard to monitor your coverage!

Tune in tomorrow for the final chapter in continuous integration in Python: showing off! No, it’s not a blog post about blogging about it. =P

(Addendum: the above .travis.yml will run Coveralls twice, once per Python version, so it will also comment twice for your PRs, notify you twice, etc. If you want to test more Python versions, the problem multiplies.
Therefore, you can choose to call Coveralls only for a specific Python version, like so:

after_success:
- if [[ $ENV == python=3.4* ]]; then
      coveralls;
  fi


This is the approach taken by the scikit-image project.)

## October 15, 2014

### Continuum Analytics news

#### The Leading Enterprise Python Distribution for Data Analytics Is Moving to the Hadoop and Spark Infrastructures

Continuum Analytics, the premier provider of Python-based data analytics solutions and services, today announced the release of Anaconda Cluster. Continuum’s enterprise Python distribution for data analytics, Anaconda Server, is moving to the Hadoop and Spark infrastructures. Anaconda Cluster is a highly scalable cluster resource management tool that lets data scientists streamline deployment without rework.

“We would love to pass off clustering to the engineering and IT team, but we are the only ones that know how to manage it right now.” – a Hadoop/Python user, on the pain points that Anaconda Cluster addresses

### Juan Nunez-Iglesias

#### Travis build page

This is the fourth post in my Continuous integration (CI) in Python series, and the one that puts the “continuous” in CI! For the introductory three posts, see:

## Introduction to Travis-CI

Once you’ve set up your tests locally, it does you no good if you don’t remember to run them! Travis-CI makes this seamless, because it will check out your code and run your tests for each and every commit you push to GitHub! (This is even more important when you are receiving pull requests on GitHub: the tests will be run online, without you having to individually check out each PR and run the tests on your machine!)

This is what continuous integration is all about. Once upon a time, the common practice was to pile new features onto a codebase. Then, come release time, there would be a feature freeze, and some time would be spent cleaning up code and removing bugs. In continuous integration, instead, no new feature is allowed into the codebase until it is bug free, as demonstrated by the test suite.

## What to do

You need to first add a .travis.yml file to the root of your project. This tells Travis how to install your program’s dependencies, install your program, and run the tests. Here’s an example file to run tests and coverage on our maths.py sample project:

language: python
python:
- "2.7"
- "3.4"
before_install:
- pip install pytest pytest-cov
script:
- py.test


Pretty simple: tell Travis the language of your project, the Python version (you can specify multiple versions, one per line, to test on several!), and how to install test dependencies. Finally, give the command to run your tests. (This can be anything, not just pytest or another testing framework, as long as a shell exit status of 0 means “success” and anything else means “failure”.)
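That exit-status convention is easy to check for yourself. Here is a small sketch (mine, not from the post) using Python's standard subprocess module; the `-c` snippets stand in for whatever your `script` command would be:

```python
import subprocess
import sys

# A stand-in for a passing test command: exits with status 0.
passing = subprocess.run([sys.executable, "-c", "raise SystemExit(0)"])
assert passing.returncode == 0   # Travis would mark the build as passing

# A stand-in for a failing test command: any non-zero status fails the build.
failing = subprocess.run([sys.executable, "-c", "raise SystemExit(1)"])
assert failing.returncode == 1
```

This is why any shell command works as a Travis `script` entry: the build status is decided purely by the exit code.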

You can read more about the syntax and options for your .travis.yml in the Travis documentation. There are other sections you can add, such as “virtualenv” to set up Python virtual environments, “install” to add compilation and other installation steps for your library, before testing, and “after_success” to enable e.g. custom notifications. (We will use this last one in the next post.)

Once you have created .travis.yml for your project, you need to turn it on for your repo. This is, currently, a fairly painful process, and I can’t wait for the day that it’s enabled by default on GitHub. [Update: see below] In the meantime though:

[Update 2014-10-28: Thanks to @hugovk for pointing out that the first four points above can be skipped. It turns out that when you first log in to Travis-CI using your GitHub account, you give them write access to your webhooks. So, when you add a repo from their end, they go ahead and add themselves on the GitHub end! Boom. Way easier.]

Voilà! Every push and pull-request to your repository will trigger a job on Travis’s servers, building your dependencies and your software, running your tests, and emailing you if anything went wrong! Amazingly, this is completely free for open source projects, so you really have no excuse for not using it!

Follow this blog to learn how to continuously check test coverage using Coveralls, coming in the next post!

## October 14, 2014

### Matthieu Brucher

#### Book review: The Prize – The Epic Quest for Oil, Money & Power

Yet another book that a colleague suggested I read. I also discussed it with another colleague, who told me the same: this is a book that anyone in the oil and gas field should read. And what about people not in this industry?

#### Content and opinions

The book is primarily focused on the big scale, from a geopolitical and strategic point of view, with anecdotes about what happens behind the scenes. But even at the global scale, there are many things in the book that I didn’t know about: how Rockefeller got so rich in the first place, the role of gasoline in World War 1, what drove the two oil shocks… The book is quite big, and split into five parts, mainly driven by fundamental changes in the approaches to oil. Sometimes it is difficult to follow the main thread of the story, as the author has to go back in time to follow a different side of history, and sometimes the names of the biggest companies are confusing, albeit precise.

The first part was like a Far West story. It starts in the middle of the 19th century and finishes before World War 1. Basically this is the story of the Rockefeller fortune and the fall of his company, as well as a nice description of what happens when everyone looks for oil without regulation. It is “funny” to compare this to the current rush for shale gas in the US and how it is exploited (i.e. without regard for the global economy).

The second chapter describes World War 1 as well as the Depression that followed. Here, politicians start noticing the importance of oil: how it played a crucial role in World War 1, and how it was a great help during the recession. But it also showed that oil needed a minimum price, which is what I think is currently forgotten in shale gas exploitation: the gas companies are sacrificed on the altar of growth. At least Roosevelt understood that the whole economy has to grow, and that the oil and gas industry couldn’t be sacrificed…

The third chapter tackles World War 2 and the different theatres of war. Clearly the different opponents knew about the importance of oil, and the winners were those who could manage to keep a steady supply of it. The issue we still face is also mentioned here: what energy do you use when you don’t have oil anymore?

The fourth chapter is about the growth years before the first oil shock. Clearly the balance then shifted between companies and countries, and even though this part is less interesting in terms of events (there is still the birth of OPEC and the Suez crisis), it still provides background for the current power balance that is not even hinted at in the media. The corruption of countries by money is also introduced.

The fifth chapter describes the current balance, with the oil sword that was so important during the second oil shock and the war between Israel and the Arab nations. The way it was used is not something that I had linked with the second oil shock at first, although it seems so blatant. Here is also where the Middle East story ends, with the first Iraq War. Of course, the Middle East is actually the central piece of the whole book, and there are so many pieces gravitating around it in the book that it was sometimes incredible.

There is a small epilogue to sum up what came after the first Iraq War, and then the advent of the war on terrorism. This epilogue lacks some depth, but it is really difficult to get perspective on events that are happening while you are writing the book!

#### Conclusion

I thought I understood the history of the 20th century quite well. I was very mistaken. Even though it is a long book, there is so much to discuss that it is sometimes not enough. It is simple enough that anyone can read it. And everyone should definitely read it.

### Juan Nunez-Iglesias

#### jnuneziglesias

This is the third post in my series on continuous integration (CI) in Python. For the first two posts, see 1: automated tests with pytest, and 2: measuring test coverage.

By now, you’ve got a bunch of tests and doctests set up in your project, which you run with the command:

$ py.test --doctest-modules --cov . --cov-report term-missing

If you happen to have a few script and setup files lying around that you don’t want to test, this might expand to

$ py.test --doctest-modules --cov . --cov-report term-missing --ignore script.py


As you can see, this is getting cumbersome. You can fix this by adding a setup.cfg file to the root of your project, containing configuration options for your testing. This follows standard INI file syntax.

[pytest]
addopts = --ignore script.py --doctest-modules --cov-report term-missing --cov .

Note that the “ignore” flag only tells pytest not to run that file during testing. You also have to tell the coverage module that you don’t want those lines counted for coverage. Do this in a .coveragerc file, with similar syntax:

[run]
omit = *script.py

Once you’ve done this, you can invoke your tests with a simple, undecorated call to just py.test.
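Because both setup.cfg and .coveragerc follow INI syntax, you can sanity-check them with Python's standard configparser module; this quick sketch (not part of the original post) just confirms the two files above parse as expected:

```python
from configparser import ConfigParser

# The two config files from the post, inlined as strings for illustration.
setup_cfg = """
[pytest]
addopts = --ignore script.py --doctest-modules --cov-report term-missing --cov .
"""

coveragerc = """
[run]
omit = *script.py
"""

cfg = ConfigParser()
cfg.read_string(setup_cfg)
print(cfg["pytest"]["addopts"])   # the options pytest will prepend to every run

cov = ConfigParser()
cov.read_string(coveragerc)
print(cov["run"]["omit"])         # files the coverage module will skip
```

pytest and coverage read these sections themselves, of course; the point is only that the files are ordinary INI, so any INI-aware tool can inspect them.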

To find out more about your configuration options, see the pytest basic test configuration page, and Ned Batchelder’s excellent .coveragerc syntax guide.

That’s it for this entry of my CI series. Follow this blog for the next two entries, setting up Travis CI and Coveralls.

## October 13, 2014

### Titus Brown

#### The emerging discipline of Data Intensive Biology

Yesterday I gave my third keynote address ever, at the Australasian Genomics Technology Association's annual meeting in Melbourne (talk slides here). On my personal scale of talks, it was a 7 or 8 out of 10: I gave it a lot of energy, and I think the main messages got across, but I ended up conveying a few things that I regretted - in particular, the Twitter feed pointed out that I'd slagged on biologists a few times (slag, v.: British informal, criticize (someone) in an abusive and insulting manner). Absolutely not my intent!

I consider myself a biologist, albeit one who works primarily on biological data analysis. So I'm starting to call myself a data-intensive biologist. And I thought now would be a good time to talk about the emerging discipline of Data Intensive Biology.

Before I go on, this post is dedicated to my Australian hosts. I've had a wonderful time so far, and I've been tremendously impressed with the Australasian genomics and bioinformatics research communities. They've got something great going on here and I look forward to further interactions!

## Who are Data Intensive Biologists?

In my talk, I included the de rigueur picture of a tidal wave, reproduced below.

This tidal wave is intended to represent the data deluge in biology, the combination of lots of -omics and sensor data that is starting to hit the field of biology. If you look closely at the picture, you'll see three groups of researchers.

• the inland researchers, who are up on the shore well away from the boardwalk. They are looking at the tidal wave, wondering if the water's going to reach them; they're far away from it, though, so they're not really too worried about it.
• the boardwalk researchers, who are on the little walkway at the base of the crashing wave. Some of them are looking at the wave in horror, aware that this is going to be painful; others are busy putting on little lifevests, in the vain hope that this will save them; and the third group are looking the other way, wondering what everyone is talking about.
• the surfer dudes and dudettes, who are having the time of their lives, surfing down the face of the wave. Every now and then they fall off, but the water's deep and they can get right back on the board. (OK, it's an imperfect analogy.)

The surfers are the data intensive biologists: they love the data, they love the thrill of data analysis, and they're going somewhere fast, although maybe not in a forward direction.

## The anatomy of a Data Intensive Biologist

In my ABiC 2014 keynote (keynote #2), I listed out five character types that participate in bioinformatics. These were:

1. Computer scientists
2. Software engineers
3. Data scientists
4. Statisticians
5. Biologists

(I missed a 6th, database maintainers, and possibly a 7th, data curators.)

In another miscommunication, I meant to say (but did not say during my talk) that almost every effective bioinformatics researcher is some linear combination of these seven characters. I think that data intensive biologists can be defined on this set of axes, too.

So: data intensive biologists are biology researchers who are focused on biological questions, but who have substantial footings in many or all of the other fields above. That is, their focus is on making biological progress, but they are using tools from computer science, software engineering, data science, statistics, and databases to study biology.

Data Intensive Biologists:

• are usually well grounded in at least one field of biology;
• understand that data is not information, but that it's a darn good start;
• know that most of our current biological knowledge is limited or wrong, but that we've got to rely on it anyway;
• are aware that investing in automation is sometimes repaid 10x in efficiency, except when it's not;
• realize that reproducible computational analyses are a great idea;
• write software when they have to, but only because they have to;
• think data science training is neat, because it gives an enduring competitive edge;
• constantly rebalance themselves between the CS, software engineering, data science, and stats perspectives -- almost as if they were surfing across those fields;
• get that open science is obvious if not always practical;
• are confused as to why biology graduate programs aren't teaching more data analysis;

and, finally, data intensive biologists

• shouldn't worry about their career options.

## Concluding thoughts

If you had to nail down a definition - people like to do that, for some reason :) - I would go with:

Data intensive biologist: a researcher focused on addressing biological research questions primarily through large scale data analysis or integration.

I don't have any interest in being exclusionary with this definition. If you're tackling biological questions in any way, shape, or form, and you're using lots of data to do it, you fit my definition!

Oh, and by the way? There are already workshops and faculty positions in this area. Although they're all at Hopkins, so maybe the best you can say is that James Taylor and I agree on terminology :).

--titus

## October 10, 2014

### Titus Brown

#### How good is MEGAHIT?

A few weeks back, Nick Loman (via Manoj Samanta) brought MEGAHIT to our attention on Twitter. MEGAHIT promised "an ultra-fast single-node solution for large and complex metagenome assembly" and they provided a preprint and some open source software. This is a topic near and dear to my heart (see Pell et al., 2012 and Howe et al., 2014), so I was immediately interested - especially since the paper used our Howe et al. data set to prove out their results. (The twitterati also pointed out that the preprint engaged in some bashing of this previous work, presumably to egg me on. ;)

So I thought, heck! Let's take MEGAHIT out for a spin! So my postdoc Sherine Awad and I tried it out.

tl; dr? MEGAHIT seems pretty awesome to me, although IDBA and SPAdes still seem to beat it by a bit on the actual assembly results.

## Installing MEGAHIT

We ran into some small compilation problems but got it working on an Ubuntu 12.04 system easily enough.

Running it was also a snap. It took a few minutes to work through the required options, and voila, we got it running and producing results. (You can see some example command lines here.)

First question --

## How does it do on E. coli?

One of the claims made in the paper is that this approach performs well on low-coverage data. To evaluate this, I took a 5m read subset from the E. coli MG1655 dataset (see Chitsaz et al., 2011) and further subsetted it to 1m reads and 500k reads, to get (roughly) 100x, 20x, and 10x data sets. I then ran MEGAHIT with default parameters, specifying 1 GB of memory, and limiting only the upper k size used (because otherwise it crashed) -- again, see the Makefile.
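The subsetting step can be sketched in a few lines of Python. This is a hypothetical helper, not the command actually used for the post; the idea is simply that dropping whole 4-line FASTQ records scales coverage proportionally (5m reads at roughly 100x means 1m reads is about 20x and 500k about 10x):

```python
import random

def subsample_fastq(lines, fraction, seed=42):
    """Keep roughly `fraction` of the reads in a FASTQ stream.

    Hypothetical sketch: each FASTQ record is 4 lines (header, sequence,
    '+', qualities), and records are kept or dropped as whole units so
    reads stay intact. A fixed seed keeps the subsample reproducible.
    """
    rng = random.Random(seed)
    out = []
    record = []
    for line in lines:
        record.append(line)
        if len(record) == 4:          # one complete read collected
            if rng.random() < fraction:
                out.extend(record)
            record = []
    return out

# Tiny made-up example: 10 fake reads, keep ~20% of them.
fake = []
for i in range(10):
    fake += ["@read%d" % i, "ACGT", "+", "IIII"]
kept = subsample_fastq(fake, 0.2)
print(len(kept) // 4, "reads kept of", len(fake) // 4)
```

Real subsetting of a 5m-read file works the same way, just streamed from disk rather than from a list.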

For comparison, I also ran SPAdes on the lowest-coverage data, looking only at the contigs (not the scaffolds).

After it was all done assembling, I ran QUAST on the results.

| Measure | 100x | 20x | 10x | 10x (SPAdes) |
|---|---|---|---|---|
| N50 | 73736 | 52352 | 9067 | 18124 |
| Largest alignment | 221kb | 177kb | 31kb | 62kb |
| bp in contigs > 1kb | 4.5mb | 4.5mb | 4.5mb | 4.5mb |
| Genome fraction | 98.0% | 98.0% | 97.4% | 97.9% |
| Misassembled length | 2kb | 40.8kb | 81.3kb | 63.6kb |

(Data: MEGAHIT 100x, 20x, and 10x; and SPAdes 10x.)

In summary, it does pretty well - with even pretty low coverage, you're getting 97.4% of the genome in contigs > 500bp (QUAST's default cutoff). Misassemblies grow significantly at low coverage, but you're still only at 2% in misassembled contigs.

In comparison to SPAdes at low coverage, the results are ok also. SPAdes performs better in every category, which I would expect -- it's a great assembler! - but MEGAHIT performs well enough to be usable. MEGAHIT is also much, much faster - seconds vs minutes.

Next question -

## How well does it do on a metagenomic data set?

Sherine has been running benchmarks for Velvet, SPAdes, and IDBA on the data set from Shakya et al, 2013, a mock community data set. So I asked her to add some MEGAHIT results. She did quality trimming as specified in Kalamazoo, and ran MEGAHIT with 10 GB of RAM. She then used QUAST to evaluate the results against the known good genomes.

# contigs > 1kb 19,982 16,387 16,191
length in contigs >1kb 190.7mb 192.5mb 194.8mb
# misassemblies 698 894 1223
Bp in misassemblies 12.7mb 28.2mb 21.8mb
Metagenome fraction 89.96% 90.42% 90.97%

Again, the answer is "MEGAHIT works pretty well." Fewer misassemblies, but also more contigs and a bit less coverage of the known genome.

Third question --

## How fast and memory efficient was MEGAHIT?

Very. We didn't actually measure it, but, like, really fast. And low memory, also. We're doing systematic benchmarking on this front for our own paper, and we'll provide details as we get them.

(We didn't measure MEGAHIT's performance because we don't have numbers for SPAdes and IDBA yet. We didn't measure SPAdes and IDBA yet because actually doing the benchmarking well is really painful - they take a long time to run. 'nuff said :)

## So, what are your conclusions?

So far, +1. Highly recommended to people who are good at command line stuff and general all-around UNIX-y folk. I'd want to play around with it a bit more before strongly recommending it to anyone who wasn't a seasoned bioinformatician. It's rough around the edges, and I haven't looked at the code much yet. It also breaks in various edge cases, but at least it's pretty robust when you just hand it a straight up FASTQ file!

That having been said, it works shockingly well and is quite fast and memory efficient. If you're having trouble achieving an assembly any other way I would definitely recommend investing the time to try out MEGAHIT.

--titus

p.p.s. Thanks to Rayan Chikhi and Lex Nederbragt for reading and commenting on
a draft version of this post!

## Appendix: MEGAHIT and digital normalization

In the MEGAHIT paper, they commented that they believed that digital normalization could lead to loss of information. So I thought I'd compare MEGAHIT on 100x against MEGAHIT and SPAdes running on digitally normalized 100x:

| Measure | 100x | DN (w/MEGAHIT) | DN (w/SPAdes) |
|---|---|---|---|
| N50 | 73736 | 82753 | 132872 |
| Largest alignment | 221kb | 222kb | 224kb |
| bp in contigs > 1kb | 4.5mb | 4.5mb | 4.6mb |
| Genome fraction | 98.0% | 98.1% | 98.2% |
| Misassembled length | 2kb | 120kb | 48kb |

(Data: MEGAHIT 100x, MEGAHIT DN, and SPAdes DN.)

The short version is, I don't see any evidence that diginorm leads to incompleteness, but clearly diginorm leads to lots of misassemblies when used in conjunction with MEGAHIT or SPAdes on high-coverage genomes. (We have some (ok, lots) of evidence that this doesn't happen with lower coverage genomes, or metagenomes.) That having been said, it's clearly rather assembler-specific, since SPAdes does a much better job than MEGAHIT on dn data.

The shorter version? You probably won't need to use diginorm with MEGAHIT, and you shouldn't. That's OK. (There are lots of reasons why you shouldn't use diginorm.)

I still don't have any evidence that diginorm drops information in non-polyploid situations. Let me know if you've seen this happen!

## Appendix II: Running your own evaluation

All of the E. coli numbers above are available in the 2014-megahit-evaluation github repo. See README.md in that repo for basic install instructions, and Makefile for what I ran and how to run it. Feel free to reproduce, extend, and update!

## October 09, 2014

### Continuum Analytics

#### Tips for More Effective Numba Usage

Last week, I wrote about the features Numba has, and what our long term development goals are. Today, I want to dig into the details of how to more effectively use Numba on your programs.

## October 07, 2014

### Matthieu Brucher

#### Book review: Fundamentals of Reservoir Engineering

I worked for a long time in the seismic department of my company, and switched to the reservoir department only last year. The problems that are tackled are quite different, and so is the way they are solved. So it has nothing to do with the book I reviewed a long time ago. After two trainings in reservoir simulation, I also read this book, which a colleague of mine labeled as the reference book.

#### Content and opinions

The book has a nice progression in complexity. Although it won’t tackle anything like model generation (it is someone else’s job after all), it tackles the basic questions that a reservoir engineer has to ask himself. The progression is also close to what I had during my reservoir engineering training.

The first four chapters are an introduction to the basic physics: how to compute the volume of oil in the reservoir, how it works, how much you can get out of it, the basic thermodynamics of oil (the black oil model), and Darcy's law.

Then, the next four chapters deal with wells and pressure, which are the most important things in the job. The main values you can get to describe your reservoir come from well tests and the analysis of their results. They may seem basic, with a lot of approximations, but they are still the first maths you do when appraising a reservoir! Even 40 years later…

The last two chapters are a little different, but equally important. I didn’t think much of aquifers before I realized that they are the main means of recovering oil… Without them, you have to turn quite rapidly to enhanced techniques. And they behave quite differently depending on their size and their location. The last chapter deals with relative permeability. It’s something I found “missing” from the previous chapters, as it was always mentioned during my trainings; then I noticed that this final chapter tries to cover everything known about the topic.

#### Conclusion

Although the book is quite old and reservoir simulation has made tremendous progress since, the basic approaches are still used today, and this is what the book is about. The placement of the exercises, and the fact that they are solved immediately after, is also much appreciated.

## October 04, 2014

### Titus Brown

#### Putting together an online presence for a diffuse academic community - how?

I would like to build a community site. Or, more precisely, I would like to recognize, collect, and collate information from an already existing but rather diffuse community.

The focus of the community will be academic data science, or "data driven discovery". This is spurred largely by the recent selection of the Moore Data Driven Discovery Investigators, as well as the earlier Moore and Sloan Data Science Environments, and more broadly by the recognition that academia is broken when it comes to data science.

So, where to start?

For a variety of reasons -- including the main practical one, that most academics are not terribly social media integrated and we don't want to try to force them to learn new habits -- I am focusing on aggregating blog posts and Twitter.

So, the main question is... how can we most easily collect and broadcast blog posts and articles via a Web site? And how well can we integrate with Twitter?

## First steps and initial thoughts

Following Noam Ross's suggestions in the above storify, I put together a WordPress blog that uses the RSS Multi Importer to aggregate RSS feeds as blog posts (hosted on NFSN). I'd like to set this up for the DDD Investigators who have blogs; those who don't can be given accounts if they want to post something. This site also uses a Twitter feed plugin to pull in tweets from the list of DDD Investigators.

The resulting RSS feed from the DDDI can be pulled into a River of News site that aggregates a much larger group of feeds.

The WordPress setup was fairly easy and I'm going to see how stable it is (I assume it will be very stable, but -- shrug -- time will tell :). I'm upgrading my own hosting setup and once that's done, I'll try out River4.

## Next steps and longer-term thoughts

Ultimately a data-driven-discovery site that has a bit more information would be nice; I could set up a mostly static site, post it on github, authorize a few people to merge, and then solicit pull requests when people want to add their info or feeds.

One thing to make sure we do is track only a portion of feeds for prolific bloggers, so that I, for example, have to tag a post specifically with 'ddd' to make it show up on the group site. This will avoid post overload.
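The tag-filtering idea can be sketched in a few lines of Python. The entry dicts below are made up, standing in for whatever a feed parser would return for each post; the point is just that only posts explicitly tagged 'ddd' make it onto the group site:

```python
def filter_by_tag(entries, tag):
    """Keep only feed entries carrying the given tag (hypothetical helper)."""
    return [e for e in entries if tag in e.get("tags", [])]

# Made-up feed entries: only explicitly tagged posts should be aggregated.
entries = [
    {"title": "A DDD update", "tags": ["ddd", "science"]},
    {"title": "Unrelated musings", "tags": ["travel"]},
    {"title": "No tags at all"},
]

for entry in filter_by_tag(entries, "ddd"):
    print(entry["title"])  # only "A DDD update" survives the filter
```

An aggregator plugin would apply the same predicate to each feed it polls, which is what keeps prolific bloggers from flooding the group site.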

I'd particularly like to get a posting set up that integrates well with how I consume content. In particular, I read a lot of things via my phone and tablet, and the ability to post directly from there -- probably via e-mail? -- would be really handy. Right now I mainly post to Twitter (and largely by RTing) which is too ephemeral, or I post to Facebook, which is a different audience. (Is there a good e-mail-to-RSS feed? Or should I just do this as a WordPress blog with the postie plug-in?)

The same overall setup could potentially work for a Software Carpentry Instructor community site, a Data Carpentry Instructor community site, trainee info sites for SWC/DC trainees, and maybe also a bioinformatics trainee info site. But I'd like to avoid anything that involves a lot of administration.

## Things I want to avoid

Public forums.

Private forums that I have to administer or that aren't integrated with my e-mail (which is where I get most notifications, in the end).

Overly centralized solutions; I'm much more comfortable with light moderation ("what feeds do I track?") than anything else.

Thoughts?

--titus

## October 02, 2014

### Continuum Analytics

#### Advanced Features of Conda Part 2

Conda is the package manager that comes with Continuum’s Anaconda distribution. Conda makes it easy to install and manage all your packages and environments on Windows, Mac OS X, and Linux.

If you are new to conda, I recommend starting at the documentation, and watching my presentation from SciPy 2014.

In this post, I will assume that you are already familiar with conda and its basic usage, both for installing and building packages. I will look at a few advanced features or tips that may not be well known, even among advanced users of conda. These features will help you with feature discovery, customization, and allow you to manage your packages and environments in more advanced ways.

In part 1, we looked at --help, configuration, conda update --all, conda list --export and conda create --file, conda clean, and pinning packages. In this post, we will look at some tools that make building packages easier and some more features that help manage your environments and use conda more effectively from the command line.

### Juan Nunez-Iglesias

#### jnuneziglesias

(Edit: I initially thought it would be cute to number from 0. But it turns out it becomes rather obnoxious to relate English (first, second, …) to 0-indexing. So this was formerly volume 1. But everything else remains the same.)

This is the second post in a series about setting up continuous integration for a Python project from scratch. For the first post, see Automated tests with pytest.

After you’ve written some test cases for a tiny project, it’s easy to check what code you have automatically tested. For even moderately big projects, you will need tools that automatically check what parts of your code are actually tested. The proportion of lines of code that are run at least once during your tests is called your test coverage.

For the same reasons that testing is important, measuring coverage is important. Pytest can measure coverage for you with the coverage plugin, found in the pytest-cov package. Once you’ve installed the extension, a test coverage measurement is just a command-line option away:

~/projects/maths $ py.test --doctest-modules --cov .
============================= test session starts ==============================
platform darwin -- Python 2.7.8 -- py-1.4.25 -- pytest-2.6.3
plugins: cov
collected 2 items

maths.py .
test_maths.py .

--------------- coverage: platform darwin, python 2.7.8-final-0 ----------------
Name         Stmts   Miss  Cover
--------------------------------
maths            2      0   100%
test_maths       4      0   100%
--------------------------------
TOTAL            6      0   100%

=========================== 2 passed in 0.07 seconds ===========================

(The --cov takes a directory as input, which I find obnoxious, given that py.test so naturally defaults to the current directory. But it is what it is.)

Now, if I add a function without a test, I’ll see my coverage drop:

def sqrt(x):
    """Return the square root of x."""
    return x * 0.5

(The typo is intentional.)

--------------- coverage: platform darwin, python 2.7.8-final-0 ----------------
Name         Stmts   Miss  Cover
--------------------------------
maths            4      1    75%
test_maths       4      0   100%
--------------------------------
TOTAL            8      1    88%

With one more option, --cov-report term-missing, I can see which lines I haven’t covered, so I can try to design tests specifically for that code:

--------------- coverage: platform darwin, python 2.7.8-final-0 ----------------
Name         Stmts   Miss  Cover   Missing
------------------------------------------
maths            4      1    75%   24
test_maths       4      0   100%
------------------------------------------
TOTAL            8      1    88%

Do note that 100% coverage does not ensure correctness. For example, suppose I test my sqrt function like so:

def sqrt(x):
    """Return the square root of x.

    Examples
    --------
    >>> sqrt(4.0)
    2.0
    """
    return x * 0.5

Even though my test is correct, and I now have 100% test coverage, I haven’t detected my mistake. Oops! But, keeping that caveat in mind, full test coverage is a wonderful thing, and if you don’t test something, you’re guaranteed not to catch errors.
Further, my example above is quite contrived, and in most situations full test coverage will spot most errors.

That’s it for part 1. Tune in next time to learn how to turn on Travis continuous integration for your GitHub projects!

## October 01, 2014

### Titus Brown

#### Announcement: Moore Data Driven Discovery Investigator Award

I am very, very happy to announce that I have been selected to be one of the fourteen Moore Data Driven Discovery Investigators. This is a signal investment by the Moore Foundation into the burgeoning area of data-intensive science, and it is quite a career booster. It will provide my lab with $1.5m over a period of 5 years, enabling us to dive into certain aspects of infrastructure building that we would otherwise be unable to support. I'm also particularly enthusiastic about the implicit recognition of our efforts in open science, open source, and replicability!

My talk and proposal for the competition are linked here:

http://ivory.idyll.org/blog/2014-moore-ddd-talk.html

You can see the list of all fourteen DDD Investigators, along with various and sundry information that includes talk titles, proposals, and Twitter handles, here.

--titus

Dr. C. Titus Brown

Visiting Associate Professor, Population Health and Reproduction, UC Davis School of Veterinary Medicine.

Member, UC Davis Genome Center

External faculty, BEACON Center for the Study of Evolution in Action.

Slightly outdated Lab Web site: http://ged.msu.edu/

#### Introducing the Moore Foundation's Data Driven Discovery (DDD) Investigators

Note: the source data for this is available on github at https://github.com/ctb/dddi

Today, the Moore Foundation announced that they have selected fourteen Moore Data Driven Discovery Investigators.

In reverse alphabetical order, they are:

Dr. Ethan White, University of Florida

Source code repository: https://github.com/weecology

ImpactStory: https://impactstory.org/EthanWhite

Dr. Laura Waller, UC Berkeley

Source code repository: https://github.com/Waller-Lab/CalOptrics

Dr. Matt Turk, University of Illinois at Urbana-Champaign

Proposal: Changing How We Conduct Inquiry

Source code repository: http://bitbucket.org/MatthewTurk/

Dr. Blair D. Sullivan, North Carolina State University

Proposal title: Enabling Science via Structural Graph Algorithms

Source code repository: https://github.com/bdsullivan/

Dr. Matthew Stephens, University of Chicago

Proposal title: Gene Regulation and Dynamic Statistical Comparisons

Source code repository: https://github.com/stephens999

Dr. Amit Singer, Princeton University

Proposal title: Maximum Likelihood Estimation by Semidefinite Programming: Application to Cryo-Electron Microscopy

Dr. Kimberly Reynolds, UT Southwestern

Proposal title: Decoding the Genome: Finding Effective Variables from Evolutionary Ensembles

Dr. Chris Re, Stanford

Dr. Laurel Larsen, UC Berkeley

Proposal title: Developing the informationscape approach to environmental change detection.

Dr. Carl Kingsford, Carnegie Mellon University

Proposal title: Lightweight Algorithms for Petabase-scale Genomics

Dr. Jeffrey Heer, U. Washington Seattle

Proposal: Interactive Data Analysis

Dr. Casey Greene, Dartmouth

Proposal title: Learning the context of publicly available genome-wide data

Dr. C. Titus Brown, UC Davis

Source code repository: http://github.com/ged-lab/

ImpactStory: https://impactstory.org/TitusBrown

Dr. Joshua Bloom, UC Berkeley

Proposal title: Efficient Data-Driven Astrophysical Inquiry with Machine Learning

--titus

### Filipe Saraiva

#### Cantor: new features in KDE 4.14

KDE 4.14 was released in August 2014 but I did not have time to write about new features in Cantor for that release.

So, let’s fix it now!

## New backend: Lua

The Cantor family of backends has a new member: Lua, using the luajit implementation.

This backend has a lot of features: syntax highlighting, tab completion, figures in the worksheet, a script editor, and more.

Cantor + Lua in action

The Lua backend was developed by Lucas Negri, a Brazilian guy, and this is a reason for me to be very happy. Welcome aboard, Lucas!

## Use UTF-8 in LaTeX export

When you export the worksheet to LaTeX, UTF-8 will be used by default. This improvement was developed by Lucas.

## Support for package importing in the Sage and Octave backends

Now these backends have an assistant for importing packages/modules/libraries.

## Support for auto-run scripts

Auto run scripts/commands in Python 2 backend

Now the Python 2, Scilab, Octave, Sage, Maxima, Qalculate, and KAlgebra backends support auto-run scripts. You can configure a set of scripts or commands and they will run automatically when the worksheet launches!

## CTRL+Space as an alternative code completion shortcut in the worksheet

The default code completion command in the worksheet is the TAB key, but now we have an alternative shortcut too: CTRL + Space. This maintains consistency between the script editor (where the default code completion shortcut is CTRL + Space) and the worksheet.

## Initial support for linear algebra and plot assistants in Python 2

I developed initial support for two amazing plugins in the Python 2 backend: the linear algebra plugin and the plot plugin.

First, let’s see the linear algebra plugin. In the menu bar, go to Linear Algebra > Create Matrix. A matrix creation window will open, as below. Put the values in the cells.

Matrix creation assistant

After you push the ‘Ok’ button, the matrix command from the numpy module will be loaded into the worksheet automatically.

New matrix created

For now this plugin implements just matrix creation.
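To make this concrete, here is a rough sketch (not Cantor’s exact generated code) of the kind of numpy command the assistant loads into the worksheet, with made-up values for the cells:

```python
import numpy

# Illustrative only: the assistant fills in whatever values you
# typed in the matrix creation dialog.
M = numpy.matrix([[1, 2],
                  [3, 4]])
print(M.shape)  # (2, 2)
```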

Let’s see the plot plugin now. You can use it to create 2D and 3D plots. Let’s do x = numpy.arange(0.0, 2.0, 0.01) and, in the menu bar, go to Graphics > Graphics 2D. The window below will open.

Plotting 2D assistant

You can set an expression for the Y axis (in this case I am using numpy.sin) and a variable name for the X axis (in this case, 2 * x * numpy.pi). You could put just x in the variable name to plot against the values of x.

After you push the ‘Ok’ button, a command using pylab will be loaded into the worksheet to make the graphic.
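As a sketch of the data this example produces (my own illustration, not Cantor’s exact generated code; the plotting call itself is omitted here):

```python
import numpy

# The x values from the example and the Y expression entered in the
# dialog: numpy.sin applied to 2 * x * numpy.pi.
x = numpy.arange(0.0, 2.0, 0.01)
y = numpy.sin(2 * x * numpy.pi)

# The generated worksheet entry would then do something like
# pylab.plot(x, y); pylab.show()
print(len(x))  # 200
```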

The 3D plotting assistant creates its pictures in a similar way.

As you can see, to use these assistants we need to have some Python modules in the workspace, and they must have the same names used in the plugins. There are a lot of ways to import modules in a Python environment (import foo; import foo as [anyname]; from foo import *; etc.), so a fully generic approach is impossible (well, if you have some idea I would like to hear it).

My choice was to import numpy, scipy, matplotlib, and pylab when the Python 2 backend is loaded by Cantor. Well, I intend to change this, because those modules become mandatory to use the Python 2 backend correctly, and pylab is no longer recommended in recent matplotlib versions. So, expect some changes in this plugin soon.

In any case, I would like to hear the scientific Python community’s opinions about these features.

## Future

For now we are working on the Cantor port to Qt5/KF5. You can follow the work in the ‘frameworks’ branch of the Cantor repository.

## Donations

If you use or appreciate my work in Cantor or another free software project, please consider making a donation to me, so I can continue my contributions to improve Cantor.

You can consider making a donation to KDE too, to help with the maintenance of this great free software community and its products.

### Juan Nunez-Iglesias

#### jnuneziglesias

(Edit: I initially thought it would be cute to number from 0. But it turns out it becomes rather obnoxious to relate English (first, second, …) to 0-indexing. So this was formerly volume 0. But everything else remains the same.)

I just finished the process of setting up continuous integration from scratch for one of my projects, cellom2tif, a simple image file converter/liberator. I thought I would write a blog post about that process, but it has slowly mutated into a hefty document that I thought would work better as a series. I’ll cover automated testing, test coverage, and how to get these to run automatically for your project with Travis-CI and Coveralls.

Without further ado, here goes the first post: how to set up automated testing for your Python project using pytest.

## Automated tests, and why you need them

Software engineering is hard, and it’s incredibly easy to mess things up, so you should write tests for all your functions, which ensure that nothing obviously stupid is going wrong. Tests can take a lot of different forms, but here’s a really basic example. Suppose this is a function in your file, maths.py:

def square(x):
    return x ** 2


Then, elsewhere in your package, you should have a file test_maths.py with the following function definition:

from maths import square

def test_square():
    x = 4
    assert square(x) == 16


This way, if someone (such as your future self) comes along and messes with the code in square(), test_square will tell you whether they broke it.

## Testing in Python with pytest

A whole slew of testing frameworks, such as nose, will then traverse your project, look for files and functions whose names begin with test_, run them, and report any errors or assertion failures.

I’ve chosen to use pytest as my framework because:

• it is a strict superset of both nose and Python’s built-in unittest, so that if you run it on projects set up with those, it’ll work out of the box;
• its output is more readable than nose’s; and
• its fixtures provide a very nice syntax for test setup and cleanup.

For a quick intro to pytest, you can watch Andreas Pelme’s talk at this year’s EuroPython conference:

But the basics are very simple: sprinkle files named test_something.py throughout your project, each containing one or more test_function() definitions; then type py.test on the command line (at your project root directory), and voila! Pytest will traverse all your subdirectories, gather up all the test files and functions, and run your tests.

Here’s the output for the minimal maths project described above:

~/projects/maths $ py.test
============================= test session starts ==============================
platform darwin -- Python 2.7.8 -- py-1.4.20 -- pytest-2.5.2
collected 1 items

test_maths.py .

=========================== 1 passed in 0.03 seconds ===========================

In addition to the test functions described above, there is a Python standard called doctest in which properly formatted usage examples in your documentation are automatically run as tests. Here's an example:

def square(x):
    """Return the square of a number x.

    [...]

    Examples
    --------
    >>> square(5)
    25
    """
    return x ** 2

(See my post on NumPy docstring conventions for what should go in the ellipsis above.)

Depending on the complexity of your code, doctests, test functions, or both will be the appropriate route. Pytest supports doctests with the --doctest-modules flag. (This runs both your standard tests and doctests.)

~/projects/maths $ py.test --doctest-modules
============================= test session starts ==============================
platform darwin -- Python 2.7.8 -- py-1.4.20 -- pytest-2.5.2
collected 2 items

maths.py .
test_maths.py .

=========================== 2 passed in 0.06 seconds ===========================


## Test-driven development

That was easy! And yet most people, my past self included, neglect tests, thinking they’ll do them eventually, when the software is ready. This is backwards. You’ve probably heard the phrase “Test-driven development (TDD)”; this is what they’re talking about: writing your tests before you’ve written the functionality to pass them. It might initially seem like wasted effort, like you’re not making progress in what you actually want to do, which is write your software. But it’s not:

By spending a bit of extra effort to prevent bugs down the road, you will get to where you want to go faster.

That’s it for volume 0! Watch out for the next post: ensuring your tests are thorough by measuring test coverage.

## September 30, 2014

### Matthieu Brucher

#### Book review: Intelligent Systems in Oil Field Development under Uncertainty

How do you know whether or not to produce an oil field, and how? The oil industry runs a lot of simulations to satisfy SEC rules, and to know how much oil it can drill today and in the future. This book goes further than the usual “how much from a field under specific conditions” by adding the development plan as well as the oil price to the equation.

#### Content and opinions

The book seems to be based on things Petrobras, the Brazilian national oil company, does, so it’s fair to assume they actually do far more than what is presented in the book.

There are four chapters that make the core of the book, exposing the different algorithms, explaining bits of them (you will still need to dig further to understand what they actually are, and obviously even more to duplicate the results), but also giving lots of examples.

Let’s start with the first two chapters; they are not in the “core” of the book. The first one explains why you need to handle uncertainties in the oil world. Main reason: why would you spend billions of dollars when you are not sure you will get them back? The second chapter explains some financial theory. The usual uncertainty analysis (the one for the SEC, for instance) doesn’t involve financial studies. It just requires you to give your P1/proven oil and gas reserves (10% chance that you get this oil out of the field). Then for your management, you want to get the P2 (50%). Here, you go one step further, because these two values don’t show how much money you will get. So the financial framework is important to understand how you can get to the final step of the simulation.

The third chapter deals with the Machine Learning algorithms that will be used through the book. Several evolutionary algorithms (genetic algorithms and genetic programming) are now mainstream, but some of them are not (cultural algorithms or quantum-inspired GA). It’s always interesting to see new algorithms, especially as they are applied to a concrete problem in one of the biggest industries. Then the authors tackle neural networks, with a simple approach (compared to my previous review), fuzzy logic (I hadn’t heard about it for quite some time) and the mix between the two of them, neuro-fuzzy models.

The fourth chapter is about the exploration process and mainly uses evolutionary algorithms to optimize the development plan, that is, the number of wells and their types, to maximize the number of oil barrels at the end of the life of the field. It is interesting to see that they don’t simulate an actual complete development plan (when the wells are drilled and then started). Injection of prior knowledge is addressed, and it doesn’t always make the result better; interesting to know!

The fifth chapter starts introducing the market behavior by adding the option to activate new parts of the development plan, depending on several parameters. Here, the authors use genetic algorithms and fuzzy models, also adding the simulation of the oil price during the life of the field, ending up with Monte Carlo methods to define the “option” price. I have to say that I didn’t understand everything in their examples, but I understood the global ideas, without knowing much of the financial side. The explanation of how fuzzy numbers and statistics differ in the end is lacking, although they are quite close. It’s a different approach, with a different theory anyway.

The sixth chapter, the last one in the “core”, takes everything together and tries to push it even further. It introduces additional mathematical tools but isn’t as thorough as the previous three chapters. It is still interesting, but obviously taking advantage of this one requires mastering the others first.

The last chapter presents the tools used in production and the architecture decisions behind them. It has less value, as you can’t get the tool…

#### Conclusion

I picked up the book because I’m currently in the oil industry, and I have interests in intelligent systems. Even though I can’t use the tools now, it gives some good leads to build such a system, and hopefully I may work on such systems some day (OK, I’ve already designed the core framework around the reservoir simulator, before I even read the book).

### Continuum Analytics news

#### Continuum Analytics Releases Anaconda 2.1

Continuum Analytics, the premier provider of Python-based data analytics solutions and services, announced today the release of the latest version of Anaconda, its free, enterprise-ready collection of libraries for Python.

Anaconda enables big data management, analysis, and cross-platform visualization for business intelligence, scientific analysis, engineering, machine learning, and more. The latest release, version 2.1, adds a new version of the Anaconda Launcher and PyOpenSSL, as well as updates NumPy, Blaze, Bokeh, Numba, and 50 other packages.

## September 28, 2014

### Gaël Varoquaux

#### Improving your programming style in Python

Here are some references on software development techniques and patterns to help write better code. They are intended for the casual programmer, and certainly not an advanced developer.

They are listed in order of difficulty.

## Software carpentry

These are the original notes from Greg Wilson’s course on software engineering at the University of Toronto. This course is specifically intended for scientists, rather than computer science students. It is very basic and does not cover design issues.

## A tutorial introduction to Python

This tutorial is easier to follow than Guido’s tutorial, though it does not go into as much depth.

## Python Essential Reference

http://www.informit.com/articles/article.asp?p=453682&rl=1

http://www.informit.com/articles/article.asp?p=459269&rl=1

These are two chapters out of David Beazley’s excellent book Python Essential Reference. They help you understand more deeply how Python works. I strongly recommend this book to anybody serious about Python.

## An Introduction to Regular Expressions

http://www.informit.com/articles/article.asp?p=20454&rl=1

If you are going to do any sort of text manipulation, you absolutely need to know how to use regular expressions: powerful search and replace patterns.
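As a small illustration (my own, not from the linked article), here is a search-and-replace pattern that rewrites day/month/year dates into ISO form:

```python
import re

# Normalize dates like 21/11/2014 to ISO format using capture groups.
text = "Posted on 21/11/2014, updated 03/12/2014."
iso = re.sub(r"(\d{2})/(\d{2})/(\d{4})", r"\3-\2-\1", text)
print(iso)  # Posted on 2014-11-21, updated 2014-12-03.
```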

## Software design for maintainability

My own post

A case of shameless plug: this is a post that I wrote a few years ago. I think that it is still relevant.

## Writing a graphical application for scientific programming using TraitsUI

http://gael-varoquaux.info/computers/traits_tutorial/index.html

Building interactive graphical applications is a difficult problem. I have found that the TraitsUI module provides a great answer to this problem. This is a tutorial intended for the non-programmer.

## An introduction to Python iterators

http://www.informit.com/articles/article.asp?p=26895&rl=1

This article may not be terribly easy to follow, but iterators are a great feature of Python, so it is definitely worth reading.

## Functional programming

http://www.ibm.com/developerworks/linux/library/l-prog.html?open&l=766,t=gr,p=PrmgPyth

Functional programming is a programming style in which mathematical functions are successively applied to immutable objects to go from the inputs of the program to its outputs through a succession of transformations. It is appreciated by some because it is easy to analyze and prove. In certain cases it can be very readable.
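A small sketch of this style in Python (my own example, not from the linked article): pure functions composed with filter, map, and reduce:

```python
from functools import reduce

def square(x):
    # A pure function: no side effects, output depends only on input.
    return x * x

def pipeline(numbers):
    # Successive transformations: keep evens, square them, sum them.
    evens = filter(lambda n: n % 2 == 0, numbers)
    squares = map(square, evens)
    return reduce(lambda a, b: a + b, squares, 0)

print(pipeline(range(6)))  # 0 + 4 + 16 = 20
```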

## Patterns in Python

This document exposes a few design patterns in Python. Design patterns are solutions to recurring development problems using object oriented programming. I suggest this reading only if you are familiar with OOP.

## Idiomatic Python

Designing object-oriented code actually requires some care: when you are building your set of abstractions, you are designing the world in which you are going to be condemned to live (or actually code). I would advise people to keep things as simple as possible and follow the SOLID principles:

http://mmiika.wordpress.com/oo-design-principles/

## Using decorators to do meta-programming in Python

A very beautiful article for the advanced Python user. Meta-programming is a programming technique that involves changing the program at run time. It lets you add new abstractions to the code the programmer writes, thus creating a “meta-language”. This article shows this very well.
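As a taste (my own minimal example, not from the article), a decorator that wraps a function to log its calls:

```python
import functools

# A decorator wraps a function at definition time -- a light form of
# meta-programming: the program modifies its own functions as it runs.
def logged(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        print("%s%r -> %r" % (func.__name__, args, result))
        return result
    return wrapper

@logged
def add(a, b):
    return a + b

add(2, 3)  # prints: add(2, 3) -> 5
```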

## A Primer on Python Metaclass Programming

http://www.onlamp.com/lpt/a/3388

Metaclasses let you define new styles of objects, which can have different calling, creation, or inheritance rules. This is way over my head, but I am referencing it here for the record.

## Iterators in Python

https://docs.python.org/2/library/itertools.html#recipes

Learn to use the itertools (but don’t abuse them)!
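For example, the pairwise() recipe from the itertools documentation, which yields overlapping pairs from an iterable:

```python
import itertools

def pairwise(iterable):
    # s -> (s0, s1), (s1, s2), (s2, s3), ...
    a, b = itertools.tee(iterable)
    next(b, None)  # advance the second iterator by one
    return zip(a, b)

print(list(pairwise([1, 2, 3, 4])))  # [(1, 2), (2, 3), (3, 4)]
```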

Related to the producer/consumer problem with iterators, see:

http://www.oluyede.org/blog/2007/04/09/producerconsumer-in-python/

## September 23, 2014

### Continuum Analytics

#### Introducing Blaze - HMDA Practice

We look at data from the Home Mortgage Disclosure Act, a collection of actions taken on housing loans by various governmental agencies (gzip-ed csv file here) (thanks to Aron Ahmadia for the pointer). Uncompressed, this dataset is around 10GB on disk, so we don’t want to load it into memory on a modern commercial notebook.

Instead, we use Blaze to investigate the data, select down to the data we care about, and then migrate that data into a suitable computational backend.

## September 19, 2014

### Gaël Varoquaux

#### Hiring an engineer to mine large functional-connectivity databases

Work with us to leverage leading-edge machine learning for neuroimaging

At Parietal, my research team, we work on improving the way brain images are analyzed, for medical diagnostic purposes, or to understand the brain better. We develop new machine-learning tools and investigate new methodologies for quantifying brain function from MRI scans.

One of our important avenues of contribution is in deciphering “functional connectivity”: analyzing the correlation of brain activity to measure interactions across the brain. This direction of research is exciting because it can be used to probe the neural support of functional deficits in incapacitated patients, and thus lead to new biomarkers of functional pathologies, such as autism. Indeed, functional connectivity can be computed without resorting to complicated cognitive tasks, unlike most functional imaging approaches. The flip side is that exploiting such “resting-state” signal requires advanced multivariate statistics tools, something at which the Parietal team excels.

For such multivariate processing of brain imaging data, Parietal has an ecosystem of leading-edge, high-quality tools. In particular, we have built the foundations of the most successful Python machine learning library, scikit-learn, and we are growing a dedicated package, nilearn, that leverages machine learning for neuroimaging. To support this ecosystem, we have dedicated top-notch programmers, led by the well-known Olivier Grisel.

We are looking for a data-processing engineer to join our team and work on applying our tools to very large neuroimaging databases to learn specific biomarkers of pathologies. For this, the work will be shared with the CATI, the French platform for multicentric neuroimaging studies, located in the same building as us. The general context of the job is the NiConnect project, a multi-organisational research project that I lead and that focuses on improving diagnostic tools based on resting-state functional connectivity. We have access to unique algorithms and datasets before they are published. What we are now missing is the link between those two, and that link could be you.

If you want more details, they can be found on the job offer. This post is to motivate the job in a personal way, that I cannot give in an official posting.

## Why take this job?

I don’t expect someone to take this job only because it pays the bills. To be clear, the kind of person I am looking for has no difficulty finding a job elsewhere. So, if you are that person, why would you take the job?

• To join a great team with many experts, focused on finding elegant solutions to hard problems at the intersection of machine learning, cognitive science, and software. Choose to work with great people, knowledgeable, passionate, and fun.
• To work on interesting problems, that matter. They are interesting because they are challenging but we have the skills to solve them. They matter because they can make brain research better.
• To learn. NeuroImaging + machine learning is a quickly growing topic. If you come from a NeuroImaging background and want to add to your CV an actual expertise in machine learning for NeuroImaging, this is the place to be.

## What would make me excited in a resume?

• A genuine experience in neuroimaging data processing, especially large databases.
• Talent with computers and ideally some Python experience.
• The unlikely combination of research training (graduate or undergraduate) and experience in a non academic setting.
• A problem-solving mindset.
• A good ability to write about neuroimaging and data processing in English: who knows, if everything goes to plan, you could very well be publishing about new biomarkers.

Now if you are interested and feel up for the challenge, read the real job offer, and send me your resume.

## September 18, 2014

### Continuum Analytics

#### Numba 0.14 Release and Project Status

Continuing our goal to have a new release of Numba every month, we are celebrating the month of September with Numba version 0.14! Along with a number of bug fixes, it brings support for many new features, including:

• Support for nearly all of the NumPy math functions (including comparison, logical, bitwise and some previously missing float functions) in nopython mode.
• The NumPy datetime64 and timedelta64 dtypes are supported in nopython mode with NumPy 1.7 and later.
• Support for NumPy math functions on complex numbers in nopython mode.
• ndarray.sum() is supported in nopython mode.
• Improved warnings and error messages.
• Support for NumPy record arrays on the GPU.

## September 16, 2014

### Continuum Analytics

#### Introducing Blaze - Migrations

tl;dr Blaze migrates data efficiently between a variety of data stores.

In our last post on Blaze expressions we showed how Blaze can execute the same tabular query on a variety of computational backends. However, this ability is only useful if you can migrate your data to the new computational system in the first place.

To help with this, Blaze provides the into function which moves data from one container type to another.
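Blaze’s actual into uses multiple dispatch over many container and storage types; purely as a toy sketch of the idea, a dictionary-dispatch version for built-in containers might look like:

```python
# Toy sketch only -- not Blaze's implementation. The real into() also
# handles DataFrames, SQL tables, CSV files, and much more.
def into(target_type, data):
    converters = {list: list, tuple: tuple, set: set}
    return converters[target_type](data)

print(into(tuple, [1, 2, 3]))  # (1, 2, 3)
```

The real into follows the same call shape, into(target, source), moving data from one container type to another.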

### Matthieu Brucher

#### Book review: Machine Learning for Adaptive Many-Core Machines – A Practical Approach

Nice title, surfing on the many core hype, and with a practical approach! What more could one expect from a book on such an interesting subject?

#### Content and opinions

Let’s start first with the table of contents. What the authors call many-core machines are GPUs. You won’t see anything else, like MPI or OpenMP. One of the central topics is also big data, but except for the fact that a GPU can process data for some algorithms faster than a CPU, it’s not really big data. I would call it large datasets, not big data as in terabytes of data!

The book starts with an introduction on Machine Learning, what will be addressed, and the authors’ GPU library for Machine Learning. If you don’t know anything about GPUs, you will be given a short introduction, but it may not be enough for you to understand the difference between CPUs and GPUs, and why we don’t run everything on GPUs.

The next part tackles supervised learning. Each time, you get a general description of the algorithms, then the GPU implementation, and then results and discussions. The authors’ favorite algorithms seem to be neural networks, as everything is compared to these. That is the purpose of the first chapter in this part, with the classic Back-Propagation algorithm, or the Multiple BP algorithm. They also cover their own tool to automate training, which is nice, as training is usually the issue with neural networks. The next chapter handles missing data and the different kinds of missing data. The way it is handled by the library is through an activation mechanism that seems to work; I don’t know how reliable the training of such an NN is, although the results are quite good, so the training system must work. Then we have Support Vector Machines, and curiously, it’s not the “usual” kernel that is mainly used in the book. Also, for once, you don’t get the full definition of a kernel, but in a way, you don’t need it for the purposes of this book. The last algorithm of this part is the Incremental Hypersphere Classifier. After a shorter presentation, the authors deal with the experimental setups, and they also chain IHC with SVM. All things considered, this is a small set of supervised learning algorithms; compared to what you can find in scikit-learn, it is really a small subset.

The third part is about unsupervised and semi-supervised learning. The first algorithm is Non-Negative Matrix Factorization, presented as a nonlinear algorithm (when it obviously is linear), which is quite strange. The semi-supervised approach is to divide the set of original features into subsets that have meaning together (eyes, mouth, nose…). Two different implementations are provided in the library, and presented with results in the book. The second chapter is about the current hype in neural networks: deep learning. After the presentation of the basic building block of Deep Belief Networks, the Restricted Boltzmann Machine, you jump directly to implementation and results. I have to say that the book didn’t describe the architecture of DBNs properly, and it is quite difficult to know what DBNs actually are and how they can give such nice results.
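As a reminder of why NMF is a linear decomposition: it factors a non-negative data matrix X into two non-negative factors W and H with X ≈ WH, so each sample is an additive (linear) combination of parts. A minimal sketch with scikit-learn, on synthetic data (the matrix sizes and component count here are arbitrary, not from the book):

```python
import numpy as np
from sklearn.decomposition import NMF

# Synthetic non-negative data: 100 samples, 20 features
X = np.abs(np.random.RandomState(0).randn(100, 20))

model = NMF(n_components=5, init="random", random_state=0, max_iter=500)
W = model.fit_transform(X)   # (100, 5) per-sample activations
H = model.components_        # (5, 20) non-negative "parts"
reconstruction = W @ H       # linear reconstruction, approximates X
```

Both factors are constrained to be non-negative, which is what yields the parts-based decompositions (eyes, mouth, nose…) mentioned above.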

The last part consists of the final chapter: a conclusion and a summary of the algorithms described in the book.

#### Conclusion

The book doesn’t display all the code that was developed. I certainly didn’t want it to, because the code is available on SourceForge. It gives a good description of several machine learning algorithms ported to the GPU, with results and comparisons, but it didn’t live up to the expectations set by the book title and the introduction. A GPU is a kind of many-core machine, but we know that future many-core architectures won’t consist of that kind of core only. It also falls short on the big data front.

## September 14, 2014

### Jan Hendrik Metzen

#### Unsupervised feature learning with Sparse Filtering

An illustration of unsupervised learning of features for images from the Olivetti faces dataset using the sparse filtering algorithm. This work is based on the paper "Sparse Filtering" by Jiquan Ngiam, Pang Wei Koh, Zhenghao Chen, Sonia Bhaskar, and Andrew Y. Ng, published at NIPS 2011.

The sparse filtering algorithm does not try to model the data's distribution but rather to learn features which are sparsely activated, in the sense that

• for each example, only a small subset of features is activated ("Population Sparsity")
• each feature is only activated on a small subset of the examples ("Lifetime Sparsity")
• features are activated roughly equally often ("High Dispersal")

This sparsity is encoded as an objective function and L-BFGS is used to minimize this function.
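The objective can be sketched in a few lines of NumPy: compute soft-absolute feature activations, L2-normalize each feature across examples (lifetime sparsity), L2-normalize each example across features (population sparsity), and sum the result as an L1 penalty. This is a minimal illustration (function names are mine, the analytic gradient used in practice is omitted, so L-BFGS falls back on numerical differentiation here):

```python
import numpy as np
from scipy.optimize import minimize

def sparse_filtering_objective(w_flat, X, n_features):
    # W: (n_features, n_inputs), X: (n_inputs, n_examples)
    W = w_flat.reshape(n_features, -1)
    F = np.sqrt(W.dot(X) ** 2 + 1e-8)                    # soft absolute value
    F = F / np.linalg.norm(F, axis=1, keepdims=True)     # normalize each feature (row)
    F = F / np.linalg.norm(F, axis=0, keepdims=True)     # normalize each example (column)
    return F.sum()                                       # L1 penalty on normalized features

rng = np.random.RandomState(0)
X = rng.randn(25, 200)        # toy data: 25 input dims, 200 examples
n_features = 10
w0 = rng.randn(n_features * X.shape[0])
res = minimize(sparse_filtering_objective, w0, args=(X, n_features),
               method="L-BFGS-B", options={"maxiter": 20})
```

Minimizing this sum over the doubly normalized feature matrix is what drives activations toward the three sparsity properties listed above.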

This notebook illustrates the algorithm on the Olivetti faces dataset.

The source code of the sparse-filtering implementation is available at https://github.com/jmetzen/sparse-filtering

In [2]:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
%matplotlib inline

In [3]:
# Install with "pip install sparse_filtering"
from sparse_filtering import SparseFiltering


### Prepare dataset¶

This will load the Olivetti faces dataset, normalize the examples (global and local centering), and convert each example into a 2D structure (64×64 pixel image). Then, 25 patches of size 16×16 pixels are extracted randomly from each image.

In [4]:
from sklearn.datasets import fetch_olivetti_faces
dataset = fetch_olivetti_faces(shuffle=True)
faces = dataset.data

n_samples, _ = faces.shape

faces_centered = faces - faces.mean(axis=0)  # global centering

faces_centered -= faces_centered.mean(axis=1).reshape(n_samples, -1)  # local centering

faces_centered = faces_centered.reshape(n_samples, 64, 64)  # reshape to 64×64 pixel images

plt.imshow(faces_centered[0], cmap=plt.get_cmap('gray'))
_ = plt.title("One example from dataset with n=%s examples" % n_samples)
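The random patch extraction described above (25 patches of 16×16 pixels per image) is not shown in the cell; a sketch using scikit-learn's `extract_patches_2d` could look like this (the synthetic `faces_centered` stand-in is only so the cell runs alone; in the notebook it is the (n_samples, 64, 64) array produced above):

```python
import numpy as np
from sklearn.feature_extraction.image import extract_patches_2d

# Stand-in for the centered face images computed earlier
faces_centered = np.random.RandomState(0).randn(10, 64, 64)

patches = np.array([
    extract_patches_2d(img, (16, 16), max_patches=25, random_state=0)
    for img in faces_centered
])                                       # (n_samples, 25, 16, 16)
patches = patches.reshape(-1, 16 * 16)   # one row per patch, ready for fitting
```

The flattened patch matrix is the kind of input an estimator such as `SparseFiltering` can then be fit on.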