July 06, 2015

Jake Vanderplas

The Model Complexity Myth

(or, Yes You Can Fit Models With More Parameters Than Data Points)

An oft-repeated rule of thumb in any sort of statistical model fitting is "you can't fit a model with more parameters than data points". This idea appears to be as wide-spread as it is incorrect. On the contrary, if you construct your models carefully, you can fit models with more parameters than datapoints, and this is much more than mere trivia with which you can impress the nerdiest of your friends: as I will show here, this fact can prove to be very useful in real-world scientific applications.

A model with more parameters than datapoints is known as an under-determined system, and it's a common misperception that such a model cannot be solved in any circumstance. In this post I will consider this misconception, which I call the model complexity myth. I'll start by showing where this model complexity myth holds true, first from from an intuitive point of view, and then from a more mathematically-heavy point of view. I'll build from this mathematical treatment and discuss how underdetermined models may be addressed from a frequentist standpoint, and then from a Bayesian standpoint. (If you're unclear about the general differences between frequentist and Bayesian approaches, I might suggest reading my posts on the subject). Finally, I'll discuss some practical examples of where such an underdetermined model can be useful, and demonstrate one of these examples: quantitatively accounting for measurement biases in scientific data.

The Root of the Model Complexity Myth

While the model complexity myth is not true in general, it is true in the specific case of simple linear models, which perhaps explains why the myth is so pervasive. In this section I first want to motivate the reason for the underdetermination issue in simple linear models, first from an intuitive view, and then from a more mathematical view.

I'll start by defining some functions to create plots for the examples below; you can skip reading this code block for now:

In [1]:
# Code to create figures
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

def plot_simple_line():
    rng = np.random.RandomState(42)
    x = 10 * rng.rand(20)
    y = 2 * x + 5 + rng.randn(20)
    p = np.polyfit(x, y, 1)
    xfit = np.linspace(0, 10)
    yfit = np.polyval(p, xfit)
    plt.plot(x, y, 'ok')
    plt.plot(xfit, yfit, color='gray')
    plt.text(9.8, 1,
             "y = {0:.2f}x + {1:.2f}".format(*p),
             ha='right', size=14);
def plot_underdetermined_fits(p, brange=(-0.5, 1.5), xlim=(-3, 3),
    rng = np.random.RandomState(42)
    x, y = rng.rand(2, p).round(2)
    xfit = np.linspace(xlim[0], xlim[1])
    for r in rng.rand(20):
        # add a datapoint to make model specified
        b = brange[0] + r * (brange[1] - brange[0])
        xx = np.concatenate([x, [0]])
        yy = np.concatenate([y, [b]])
        theta = np.polyfit(xx, yy, p)
        yfit = np.polyval(theta, xfit)
        plt.plot(xfit, yfit, color='#BBBBBB')
    plt.plot(x, y, 'ok')
    if plot_conditioned:
        X = x[:, None] ** np.arange(p + 1)
        theta = np.linalg.solve(np.dot(X.T, X)
                                + 1E-3 * np.eye(X.shape[1]),
                                np.dot(X.T, y))
        Xfit = xfit[:, None] ** np.arange(p + 1)
        yfit = np.dot(Xfit, theta)
        plt.plot(xfit, yfit, color='black', lw=2)

def plot_underdetermined_line():
def plot_underdetermined_cubic():
    plot_underdetermined_fits(3, brange=(-1, 2),
                            xlim=(0, 1.2))
def plot_conditioned_line():
    plot_underdetermined_fits(1, plot_conditioned=True)

Fitting a Line to Data

The archetypical model-fitting problem is that of fitting a line to data: A straight-line fit is one of the simplest of linear models, and is usually specified by two parameters: the slope m and intercept b. For any observed value \(x\), the model prediction for \(y\) under the model \(M\) is given by

\[ y_M = m x + b \]

for some particular choice of \(m\) and \(b\). Given \(N\) obverved data points \(\{x_i, y_i\}_{y=1}^N\), it is straightforward (see below) to compute optimal values for \(m\) and \(b\) which fit this data:

In [2]:

The simple line-plus-intercept is a two-parameter model, and it becomes underdetermined when fitting it to fewer than two datapoints. This is easy to understand intuitively: after all, you can draw any number of perfectly-fit lines through a single data point:

In [3]:

The single point simply isn't enough to pin-down both a slope and an intercept, and the model has no unique solution.

Fitting a More General Linear Model

While it's harder to see intuitively, this same notion extends to models with more terms. For example, let's think consider fitting a general cubic curve to data. In this case our model is

\[ y_M = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 \]

Note that this is still a linear model: the "linear" refers to the linearity of model parameters \(\theta\) rather than linearity of the dependence on the data \(x\). Our cubic model is a four-parameter linear model, and just as a two-parameter model is underdetermined for fewer than two points, a four-parameter model is underdetermined for fewer than four points. For example, here are some of the possible solutions of the cubic model fit to three randomly-chosen points:

In [4]:

For any such simple linear model, an underdetermined system will lead to a similar result: an infinite set of best-fit solutions.

The Mathematics of Underdetermined Models

To make more progress here, let's quickly dive into the mathematics behind these linear models. Going back to the simple straight-line fit, we have our model

\[ y_M(x~|~\theta) = \theta_0 + \theta_1 x \]

where we've replaced our slope \(m\) and intercept \(b\) by a more generalizable parameter vector \(\theta = [\theta_0, \theta_1]\). Given some set of data \(\{x_n, y_n\}_{n=1}^N\) we'd like to find \(\theta\) which gives the best fit. For reasons I'll not discuss here, this is usually done by minimizing the sum of squared residuals from each data point, often called the \(\chi^2\) of the model in reference to its expected theoretical distribution:

\[ \chi^2 = \sum_{n=1}^N [y_n - y_M(x_n~|~\theta)]^2 \]

We can make some progress by re-expressing this model in terms of matrices and vectors; we'll define the vector of \(y\) values:

\[ y = [y_1, y_2, y_3, \cdots y_N] \]

We'll also define the design matrix which we'll call \(X\); this contains all the information about the form of the model:

\[ X = \left[ \begin{array}{ll} 1 & x_1 \\ 1 & x_2 \\ \vdots &\vdots \\ 1 & x_N \\ \end{array} \right] \]

With this formalism, the vector of model values can be expressed as a matrix-vector product:

\[ y_M = X\theta \]

and the \(\chi^2\) can be expressed as a simple linear product as well:

\[ \chi^2 = (y - X\theta)^T(y - X\theta) \]

We'd like to minimize the \(\chi^2\) with respect to the parameter vector \(\theta\), which we can do by the normal means of differentiating with respect to the vector \(\theta\) and setting the result to zero (yes, you can take the derivative with respect to a vector!):

\[ \frac{d\chi^2}{d\theta} = -2X^T(y - X\theta) = 0 \]

Solving this for \(\theta\) gives the Maximum Likelihood Estimate (MLE) for the parameters,

\[ \hat{\theta}_{MLE} = [X^T X]^{-1} X^T y \]

Though this matrix formalism may seem a bit over-complicated, the nice part is that it straightforwardly generalizes to a host of more sophisticated linear models. For example, the cubic model considered above requires only a larger design matrix \(X\):

\[ X = \left[ \begin{array}{llll} 1 & x_1 & x_1^2 & x_1^3\\ 1 & x_2 & x_2^2 & x_2^3\\ \vdots & \vdots & \vdots & \vdots\\ 1 & x_N & x_N^2 & x_N^3\\ \end{array} \right] \]

The added model complexity is completely encapsulated in the design matrix, and the expression to compute \(\hat{\theta}_{MLE}\) from \(X\) is unchanged!

Why Underdetermined Models Break

Taking a look at this Maximum Likelihood solution for \(\theta\), we see that there is only one place that it might go wrong: the inversion of the matrix \(X^T X\). If this matrix is not invertible (i.e. if it is a singular matrix) then the maximum likelihood solution will not be well-defined.

The number of rows in \(X\) corresponds to the number of data points, and the number of columns in \(X\) corresponds to the number of parameters in the model. It turns out that a matrix \(C = X^TX\) will always be singular if \(X\) has fewer rows than columns, and this is the source of the problem. For underdetermined models, \(X^TX\) is a singular matrix, and so the maximum likelihood fit is not well-defined.

Let's take a look at this in the case of fitting a line to the single point shown above, \((x=0.37, y=0.95)\). For this value of \(x\), here is the design matrix:

In [5]:
X = np.array([[1, 0.37]])

We can use this to compute the normal matrix, which is the standard name for \(X^TX\):

In [6]:
C = np.dot(X.T, X)

If we try to invert this, we will get an error telling us the matrix is singular:

In [7]:
LinAlgError                               Traceback (most recent call last)
<ipython-input-7-a6d66d9f99af> in <module>()
----> 1 np.linalg.inv(C)

/Users/jakevdp/anaconda/envs/py34/lib/python3.4/site-packages/numpy/linalg/linalg.py in inv(a)
    518     signature = &aposD->D&apos if isComplexType(t) else &aposd->d&apos
    519     extobj = get_linalg_error_extobj(_raise_linalgerror_singular)
--> 520     ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
    521     return wrap(ainv.astype(result_t))

/Users/jakevdp/anaconda/envs/py34/lib/python3.4/site-packages/numpy/linalg/linalg.py in _raise_linalgerror_singular(err, flag)
     89 def _raise_linalgerror_singular(err, flag):
---> 90     raise LinAlgError("Singular matrix")
     92 def _raise_linalgerror_nonposdef(err, flag):

LinAlgError: Singular matrix

Evidently, if we want to fix the underdetermined model, we'll need to figure out how to modify the normal matrix so it is no longer singular.

Fixing an Underdetermined Model: Conditioning

One easy way to make a singular matrix invertible is to condition it: that is, you add to it some multiple of the identity matrix before performing the inversion (in many ways this is equivalent to "fixing" a divide-by-zero error by adding a small value to the denominator). Mathematically, that looks like this:

\[ C = X^TX + \sigma I \]

For example, by adding \(\sigma = 10^{-3}\) to the diagonal of the normal matrix, we condition the matrix so that it can be inverted:

In [8]:
cond = 1E-3 * np.eye(2)
np.linalg.inv(C + cond)
array([[ 121.18815362, -325.16038316],
       [-325.16038316,  879.69065823]])

Carrying this conditioned inverse through the computation, we get the following intercept and slope for our underdetermined problem:

In [9]:
b, m = np.linalg.solve(C + cond,
                       np.dot(X.T, [0.95]))
print("Conditioned best-fit model:")
print("y = {0:.3f} x + {1:.3f}".format(m, b))
Conditioned best-fit model:
y = 0.309 x + 0.835

In [10]:

This conditioning caused the model to settle on a particular one of the infinite possibilities for a perfect fit to the data. Numerically we have fixed our issue, but this arbitrary conditioning is more than a bit suspect: why is this particular result chosen, and what does it actually mean in terms of our model fit? In the next two sections, we will briefly discuss the meaning of this conditioning term from both a frequentist and Bayesian perspective.

Frequentist Conditioning: Regularization

In a frequentist approach, this type of conditioning is known as regularization. Regularization is motivated by a desire to penalize large values of model parameters. For example, in the underdetermined fit above (with \((x, y) = (0.37, 0.95)\)), you could fit the data perfectly with a slope of one billion and an intercept near negative 370 million, but in most real-world applications this would be a silly fit. To prevent this sort of canceling parameter divergence, in a frequentist setting you can "regularize" the model by adding a penalty term to the \(\chi^2\):

\[ \chi^2_{reg} = \chi^2 + \lambda~\theta^T\theta \]

Here \(\lambda\) is the degree of regularization, which must be chosen by the person implementing the model.

Using the expression for the regularized \(\chi^2\), we can minimize with respect to \(\theta\) by again taking the derivative and setting it equal to zero:

\[ \frac{d\chi^2}{d\theta} = -2[X^T(y - X\theta) - \lambda\theta] = 0 \]

This leads to the following regularized maximum likelihood estimate for \(\theta\):

\[ \hat{\theta}_{MLE} = [X^TX + \lambda I]^{-1} X^T y \]

Comparing this to our conditioning above, we see that the regularization degree \(\lambda\) is identical to the conditioning term \(\sigma\) that we considered above. That is, regulariation of this form is nothing more than a simple conditioning of \(X^T X\), with \(\lambda = \sigma\). The result of this conditioning is to push the absolute values of the parameters toward zero, and in the process make an ill-defined problem solvable.

I'll add that the above form of regularization is known variably as L2-regularization or Ridge Regularization, and is only one of the possible regularization approaches. Another useful form of regularization is L1-regularization, also known as Lasso Regularization, which has the interesting property that it favors sparsity in the model.

Bayesian Conditioning: Priors

Regularization illuminates the meaning of matrix conditioning, but it still sometimes seems like a bit of black magic. What does this penalization of model parameters within the \(\chi^2\) actually mean? Here, we can make progress in understanding the problem by examining regularization from a Bayesian perspective.

As I pointed out in my series of posts on Frequentism and Bayesianism, for many simple problems, the frequentist likelihood (proportional to the negative exponent of the \(\chi^2\)) is equivalent to the Bayesian posterior – albeit with a subtlely but fundamentally different interpretation.

The Bayesian posterior probability on the model parameters \(\theta\) is given by

\[ P(\theta~|~D, M) = \frac{P(D~|~\theta, M) P(\theta~|~M)}{P(D~|~M)} \]

where the most important terms are the likelihood \(P(D~|~\theta, M)\) and the prior \(P(\theta~|~M)\). From the expected correspondence with the frequentist result, we can write:

\[ P(D~|~\theta, M) P(\theta~|~M) \propto \exp[- \chi^2] \]

Because the term on the right-hand-side has no \(\theta\) dependence, we can immediately see that

\[ P(D~|~\theta, M) \propto \exp[-\chi^2]\\ P(\theta~|~M) \propto 1 \]

That is, the simple frequentist likelihood is equivalent to the Bayesian posterior for the model with an implicit flat prior on \(\theta\).

To understand the meaning of regularization, let's repeat this exercise with the regularized \(\chi^2\):

\[ P(D~|~\theta, M) P(\theta~|~M) \propto \exp[- \chi^2 - \lambda~|\theta|^2] \]

The regularization term in the \(\chi^2\) becomes a second term in the product which depends only on \(\theta\), thus we can immediately write

\[ P(D~|~\theta, M) \propto \exp[-\chi^2]\\ P(\theta~|~M) \propto \exp[- \lambda~|\theta|^2] \]

So we see that ridge regularization is equivalent to applying a Gaussian prior to your model parameters, centered at \(\theta=0\) and with a width \(\sigma_P = (2\lambda)^{-2}\). This insight lifts the cover off the black box of regularization, and reveals that it is simply a roundabout way of adding a Bayesian prior within a frequentist paradigm. The stronger the regularization, the narrower the implicit Gaussian prior is.

Returning to our single-point example above, we can quickly see how this intuition explains the particular model chosen by the regularized fit; it is equivalent to fitting the line while taking into account prior knowledge that both the intercept and slope should be near zero:

In [11]:

The benefit of the Bayesian view is that it helps us understand exactly what this conditioning means for our model, and given this understanding we can easily extend use more general priors. For example, what if you have reason to believe your slope is near 1, but have no prior information on your intercept? In the Bayesian approach, it is easy to add such information to your model in a rigorous way.

But regardless of which approach you use, this central fact remains clear: you can fit models with more parameters than data points, if you restrict your parameter space through the use of frequentist regularization or Bayesian priors.

Underdetermined Models in Action

There are a few places where these ideas about underdetermined models come up in real life. I'll briefly discuss a couple of them here, and then walk through the solution of a simple (but rather interesting) problem that demonstrates these ideas.

Compressed Sensing: Image Reconstruction

One area where underdetermined models are often used is in the field of Compressed Sensing. Compressed sensing comprises a set of models in which underdetermined linear systems are solved using a sparsity prior, the classic example of which is the reconstruction of a full image from just a handful of its pixels. As a simple linear model this would fail, because there are far more unknown pixels than known pixels. But by carefully training a model on the structure of typical images and applying priors based on sparsity, this seemingly impossible problem becomes tractable. This 2010 Wired article has a good popular-level discussion of the technique and its applications, and includes this image showing how a partially-hidden input image can be iteratively reconstructed from a handful of pixels using a sparsity prior:

In [12]:
from IPython.display import Image

Kernel-based Methods: Gaussian Processes

Another area where a classically underdetermined model is solved is in the case of Gaussian Process Regression. Gaussian Processes are an interesting beast, and one way to view them is that rather than fitting, say, a two-parameter line or four-parameter cubic curve, they actually fit an infinite-dimensional model to data! They accomplish this by judicious use of certain priors on the model, along with a so-called "kernel trick" which solves the infinite dimensional regression implicitly using a finite-dimensional representation constructed based on these priors.

In my opinion, the best resource to learn more about Gaussian Process methods is the Gaussian Processes for Machine Learning book, which is available for free online (though it is a bit on the technical side). You might also take a look at the scikit-learn Gaussian Process documentation. If you'd like to experiment with a rather fast Gaussian Process implementation in Python, check out the george library.

Imperfect Detectors: Extrasolar Planets

Another place I've seen effective use of the above ideas is in situations where the data collection process has unavoidable imperfections. There are two basic ways forward when working with such noisy and biased data:

  • you can pre-process the data to try to remove and/or correct data imperfections, and then fit a conventional model to the corrected data.
  • you can account for the data imperfections along with interesting model parameters as part of a unified model: this type of unified approach is often preferable in scientific settings, where even the most careful pre-processing can lead to biased results.

If you'd like to see a great example of this style of forward-modeling analysis, check out the efforts of David Hogg's group in finding extrasolar planets in the Kepler survey's K2 data; there's a nice astrobytes post which summarizes some of these results. While the group hasn't yet published any results based on truly underdetermined models, it is easy to imagine how this style of comprehensive forward-modeling analysis could be pushed to such an extreme.

Example: Fitting a Model to Biased Data

While it might be fun to dive into the details of models for noisy exoplanet searches, I'll defer that to the experts. Instead, as a more approachable example of an underdetermined model, I'll revisit a toy example in which a classically underdetermined model is used to account for imperfections in the input data.

Imagine you have some observed data you would like to model, but you know that your detector is flawed such that some observations (you don't know which) will have a bias that is not reflected in the estimated error: in short, there are outliers in your data. How can you fit a model to this data while accounting for the possibility of such outliers?

To make this more concrete, consider the following data, which is drawn from a line with noise, and includes several outliers:

In [13]:
rng = np.random.RandomState(42)
theta = [-10, 2]
x = 10 * rng.rand(10)
dy = 2 + 2 * rng.rand(10)
y = theta[0] + theta[1] * x + dy * rng.randn(10)
y[4] += 15
y[7] -= 10
plt.errorbar(x, y, dy, fmt='ok', ecolor='gray');

If we try to fit a line using the standard \(\chi^2\) minimization approach, we will find an obviously biased result:

In [14]:
from scipy import optimize

def chi2(theta, x=x, y=y, dy=dy):
    y_model = theta[0] + theta[1] * x
    return np.sum(0.5 * ((y - y_model) / dy) ** 2)

theta1 = optimize.fmin(chi2, [0, 0], disp=False)

xfit = np.linspace(0, 10)
plt.errorbar(x, y, dy, fmt='ok', ecolor='gray')
plt.plot(xfit, theta1[0] + theta1[1] * xfit, '-k')
plt.title('Standard $\chi^2$ Minimization');

This reflects a well-known deficiency of \(\chi^2\) minimization: it is not robust to the presence of outliers.

What we would like to do is propose a model which somehow accounts for the possibility that each of these points may be the result of a biased measurement. One possible route is to add \(N\) new model parameters: one associated with each point which indicates whether it is an outlier or not. If it is an outlier, we use the standard model likelihood; if not, we use a likelihood with a much larger error. The result for our straight-line fit will be a model with \(N + 2\) parameters, where \(N\) is the number of data points. An overzealous application of lessons from simple linear models might lead you to believe this model can't be solved. But, if carefully constructed, it can!

Let's show how it can be done.

Our linear model is:

\[ y_M(x~|~\theta) = \theta_0 + \theta_1 x \]

For a non-outlier (let's call it an "inlier") point at \(x\), \(y\), with error on \(y\) given by \(dy\), the likelihood is

\[ L_{in, i}(D~|~\theta) = \frac{1}{\sqrt{2\pi dy_i^2}} \exp\frac{-[y_i - y_M(x_i~|~\theta)]^2}{2 dy_i^2} \]

For an "outlier" point, the likelihood is

\[ L_{out, i}(D~|~\theta) = \frac{1}{\sqrt{2\pi \sigma_y^2}} \exp\frac{-[y_i - y_M(x_i~|~\theta)]^2}{2 \sigma_y^2} \]

where \(\sigma_y\) is the standard deviation of the \(y\) data: note that the only difference between the "inlier" and "outlier" likelihood is the width of the Gaussian distribution.

Now we'll specify \(N\) additional binary model parameters \(\{g_i\}_{i=1}^N\) which indicate whether point \(i\) is an outlier \((g_i = 1)\) or an inlier \((g_i = 0)\). With this, the overall Likelihood becomes:

\[ L(D~|~\theta, g) = \prod_i \left[(1 - g_i)~L_{in, i} + g_i~L_{out, i}\right] \]

We will put a prior on these indicator variables \(g\) which encourages sparsity of outliers; this can be accomplished with a simple L1 prior, which penalizes the sum of the \(g\) terms:

\[ P(g) = \exp\left[-\sum_i g_i\right] \]

where, recall, \(g_i \in \{0, 1\}\).

Though you could likely solve for a point estimate of this model, I find the Bayesian approach to be much more straightforward and interpretable for a model this complex. To fit this, I'll make use of the excellent emcee package. Because emcee doesn't have categorical variables, we'll instead allow \(g_i\) to range continuously between 0 and 1, so that any single point will be some mixture of "outlier" and "inlier".

We start by defining a function which computes the log-posterior given the data and model parameters, using some computational tricks for the sake of floating-point accuracy:

In [15]:
# theta will be an array of length 2 + N, where N is the number of points
# theta[0] is the intercept, theta[1] is the slope,
# and theta[2 + i] is the weight g_i

def log_prior(theta):
    g = theta[2:]
    #g_i needs to be between 0 and 1
    if (np.any(g < 0) or np.any(g > 1)):
        return -np.inf # recall log(0) = -inf
        return -g.sum()

def log_likelihood(theta, x, y, dy):
    sigma_y = np.std(y)
    y_model = theta[0] + theta[1] * x
    g = np.clip(theta[2:], 0, 1)  # g<0 or g>1 leads to NaNs in logarithm
    # log-likelihood for in-lier
    logL_in = -0.5 * (np.log(2 * np.pi * dy ** 2) + ((y - y_model) / dy)** 2)
    # log-likelihood for outlier
    logL_out = -0.5 * (np.log(2 * np.pi * sigma_y ** 2) + ((y - y_model) / sigma_y) ** 2)
    return np.sum(np.logaddexp(np.log(1 - g) + logL_in,
                               np.log(g) + logL_out))

def log_posterior(theta, x, y, dy):
    return log_prior(theta) + log_likelihood(theta, x, y, dy)

Now we use the emcee package to run this model. Note that because of the high dimensionality of the model, the run_mcmc command below will take a couple minutes to complete:

In [16]:
import emcee

ndim = 2 + len(x)  # number of parameters in the model
nwalkers = 50  # number of MCMC walkers
nburn = 10000  # "burn-in" period to let chains stabilize
nsteps = 15000  # number of MCMC steps to take

# set walkers near the maximum likelihood
# adding some random scatter
rng = np.random.RandomState(0)
starting_guesses = np.zeros((nwalkers, ndim))
starting_guesses[:, :2] = rng.normal(theta1, 1, (nwalkers, 2))
starting_guesses[:, 2:] = rng.normal(0.5, 0.1, (nwalkers, ndim - 2))

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=[x, y, dy])
sampler.run_mcmc(starting_guesses, nsteps)

sample = sampler.chain  # shape = (nwalkers, nsteps, ndim)
sample = sampler.chain[:, nburn:, :].reshape(-1, ndim)
-c:21: RuntimeWarning: divide by zero encountered in log
-c:22: RuntimeWarning: divide by zero encountered in log

The Runtime warnings here are normal – they just indicate that we've hit log(0) = -inf for some pieces of the calculation.

With the sample chain determined, we can plot the marginalized distribution of samples to get an idea of the value and uncertainty of slope and intercept with this model:

In [17]:
plt.plot(sample[:, 0], sample[:, 1], ',k', alpha=0.1)

Finally, we can make use of the marginalized values of all \(N + 2\) parameters and plot both the best-fit model, along with a model-derived indication of whether each point is an outlier:

In [18]:
theta2 = np.mean(sample[:, :2], 0)
g = np.mean(sample[:, 2:], 0)
outliers = (g > 0.5)

plt.errorbar(x, y, dy, fmt='ok', ecolor='gray')
plt.plot(xfit, theta1[0] + theta1[1] * xfit, color='gray')
plt.plot(xfit, theta2[0] + theta2[1] * xfit, color='black')
plt.plot(x[outliers], y[outliers], 'ro', ms=20, mfc='none', mec='red')
plt.title('Bayesian Fit');

The red circles mark the points that were determined to be outliers by our model, and the black line shows the marginalized best-fit slope and intercept. For comparison, the grey line is the standard maximum likelihood fit.

Notice that we have successfully fit an \((N + 2)\)-parameter model to \(N\) data points, and the best-fit parameters are actually meaningful in a deep way – the \(N\) extra parameters give us individual estimates of whether each of the \(N\) data points has misreported errors. I think this is a striking example of how a model that would be considered impossible under the model complexity myth can be solved in practice to produce very useful results!


I hope you will see after reading this post that the model complexity myth, while rooted in a solid understanding of simple linear models, should not be assumed to apply to all possible models. In fact, it is possible to fit models with more parameters than datapoints: and for the types of noisy, biased, and heterogeneous data we often encounter in scientific research, you can make a lot of progress by taking advantage of such models. Thanks for reading!

This post was written entirely in the IPython notebook. You can download this notebook, or see a static view here.

by Jake Vanderplas at July 06, 2015 03:00 PM

July 03, 2015


Sumatra 0.7 released

We would like to announce the release of version 0.7.0 of Sumatra, a tool for automated tracking of simulations and computational analyses so as to be able to easily replicate them at a later date.

This version of Sumatra brings some major improvements for users, including an improved web browser interface, improved support for the R language, Python 3 compatibility, a plug-in interface making Sumatra easier to extend and customize, and support for storing data using WebDAV.
In addition, there have been many changes under the hood, including a move to Github and improvements to the test framework, largely supported by the use of Docker.
Last but not least, we have changed licence from the CeCILL licence (GPL-compatible) to a BSD 2-Clause Licence, which should make it easier for other developers to use Sumatra in their own projects.

Updated and extended web interface

Thanks to Felix Hoffman’s Google Summer of Code project, the web browser interface now provides the option of viewing the history of your project either in a “process-centric” view, as in previous versions, where each row in the table represents a computation, or in a “data-centric” view, where each row is a data file. Where the output from one computation is the input to another, additional links make it possible to follow these connections.
The web interface has also had a cosmetic update and several other improvements, including a more powerful comparison view (see screenshot). Importantly, the interface layout no longer breaks in narrower browser windows.

BSD licence

The Sumatra project aims to provide not only tools for scientists as end users (such as smt and smtweb), but also library components for developers to add Sumatra’s functionality to their own tools. To support this second use, we have switched licence from CeCILL (GPL-compatible) to the BSD 2-Clause Licence.

Python 3 support

In version 0.6.0, Sumatra already supported provenance capture for projects using Python 3, but required Python 2.6 or 2.7 to run. Thanks to Tim Tröndle, Sumatra now also runs in Python 3.4.

Plug-in interface

To support the wide diversity of workflows in scientific computing, Sumatra has always had an extensible architecture. It is intended to be easy to add support for new database formats, new programming languages, new version control systems, or new ways of launching computations.
Until now, adding such extensions has required that the code be included in Sumatra’s code base. Version 0.7.0 adds a plug-in interface, so you can define your own local extensions, or use other people’s.
For more information, see Extending Sumatra with plug-ins.

WebDAV support

The option to archive output data files has been extended to allow archiving to a remote server using the WebDAV protocol.

Support for the R language

Sumatra will now attempt to determine the versions of all external packages loaded by an R script.

Other changes

For developers, there has been a significant change - the project has moved from Mercurial to Git, and is now hosted on Github. Testing has also been significantly improved, with more system/integration testing, and the use of Docker for testing PostgreSQL and WebDAV support.
Parsing of command-line parameters has been improved. The ParameterSet classes now have a diff() method, making it easier to see the difference between two parameter sets, especially for large and hierarchical sets.
Following the recommendation of the Mercurial developers, and to enable the change of licence to BSD, we no longer use the Mercurial internal API. Instead we use the Mercurial command line interface via the hgapi package.

Bug fixes

fair number of bugs have been fixed.

Download, support and documentation

The easiest way to get the latest version of Sumatra is

  $ pip install sumatra

Alternatively, Sumatra 0.7.0 may be downloaded from PyPI or from the INCF Software Center. Support is available from the sumatra-users Google Group. Full documentation is available on Read the Docs.

by Andrew Davison (noreply@blogger.com) at July 03, 2015 03:55 PM

Abraham Escalante

Mid-term summary

Hello all,

We're reaching the halfway mark for the GSoC and it's been a great journey so far.

I have had some off court issues. I was hesitant to write about them because I don't want my blog to turn into me ranting and complaining but I have decided to briefly mention them in this occasion because they are relevant and at this point they are all but overcome.

Long story short, I was denied the scholarship that I needed to be able to go to Sheffield so I had to start looking for financing options from scratch. Almost at the same time I was offered a place at the University of Toronto (which was originally my first choice). The reason why this is relevant to the GSoC is because it coincided with the beginning of the program so I was forced to cope with not just the summer of code but also with searching/applying for funding and paperwork for the U of T which combined to make for a lot of work and a tough first month.

I will be honest and say that I got a little worried at around week 3 and week 4 because things didn't seem to be going the way I had foreseen in my proposal to the GSoC. In my previous post I wrote about how I had to make a change to my approach and I knew I had to commit to it so it would eventually pay off.

At this point I am feeling pretty good with the way the project is shaping up. As I mentioned, I had to make some changes, but out of about 40 open issues, now only 23 remain, I have lined up PRs for another 8 and I have started discussion (either with the community or with my mentor) on almost all that remain, including some of the longer ones like NaN handling which will span over the entire scipy.stats module and is likely to become a long term community effort depending on what road Numpy and Pandas take on this matter in the future.

I am happy to look at the things that are still left and find that I at least have a decent idea of what I must do. This was definitely not the case three or four weeks ago and I'm glad with the decision that I made when choosing a community and a project. My mentor is always willing to help me understand unknown concepts and point me in the right direction so that I can learn for myself and the community is engaging and active which helps me keep things going.

My girlfriend, Hélène has also played a major role in helping me keep my motivation when it seems like things amount to more than I can handle.

I realise that this blog (since the first post) has been a lot more about my personal journey than technical details about the project. I do apologise if this is not what you expect but I reckon that this makes it easier to appreciate for a reader who is not familiarised with 'scipy.stats', and if you are familiarised you probably follow the issues or the developer's mailing list (where I post a weekly update) so technical details would be redundant to you.  I also think that the setup of the project, which revolves around solving many issues makes it too difficult to write about specific details without branching into too many tangents for a reader to enjoy.

If you would like to know more about the technical aspect of the project you can look at the PRs, contact me directly (via a comment here or the SciPy community) or even better, download SciPy and play around with it. If you find something wrong with the statistics module, chances are it's my fault, feel free to let me know. If you like it, you can thank guys like Ralf Gommers (my mentor), Evgeni Burovski and Josef Perktold (to name just a few of the most active members in 'scipy.stats') for their hard work and support to the community.

I encourage anyone who is interested enough to go here to see my proposal or go here to see currently open tasks to find out more about the project. I will be happy to fill you in on the details if you reach me personally.


by noreply@blogger.com (Abraham Escalante) at July 03, 2015 01:07 AM

July 02, 2015

Continuum Analytics

Find Continuum at SciPy Austin 2015

Continuum is a sponsor of this year’s SciPy Conference in Austin, TX. We invite you to join us at the fourteenth annual Scientific Computing with Python conference, and to check out some really fantastic talks and tutorials from our talented engineers and developers.

by Continuum at July 02, 2015 12:00 AM

Continuum Analytics - July Tech Events

The Continuum Team is hitting some big conferences this month, and we hope you’ll join us for exciting talks on Bokeh, Numba, Blaze, Anaconda, and more. From our hometown of Austin, Texas all the way to Spain, our team is ready to talk Python.

by Continuum at July 02, 2015 12:00 AM

July 01, 2015


Python(x, y) Released!

Hi All,

I'm happy to announce that Python(x, y) is available for immediate download.from any of the mirrors. The full change log can be viewed here. Please post your comments and suggestions to the mailing list.

Work on the Python 3.x 64 bit version will resume once Visual Studio 2015 RTM is released.

What's new in general:

  • All packages have been updated to their latest versions. Numpy, Scipy etc.
  • ipython has been held back at 2.4.1 to avoid potential compatibility issues.
  • OpenCV has been help back at 2.4.11 to avoid potential compatibility issues.

New noteworthy packages:

  • yappi - Yet Another Python Profiler.
  • libnacl - A ctypes wrapper around libsodium.
Have fun!

-Gabi Davar

by Gabi Davar (noreply@blogger.com) at July 01, 2015 11:16 PM

June 30, 2015

Matthieu Brucher

Audio Toolkit: Anatomy of a transient shaper

When I first about transient shaper, I was like “what’s the difference with a compressor? Is there one?”. And I tried to see how to get these transient without relying on the transient energy, with a relative power (ratio between the instant power and the mean power) filter, or its derivative, but nothing worked. Until someone explained that the gain was driven by the difference between a fast attack filtered power and a slower one. So here it goes.

First, let’s have a look on the filter graph:
Transient shaper pipelineTransient shaper pipeline

I’ve surrounded the specific transient shaper part in with a dotted line. This is the difference with a compressor/limiter/expander: the way the signal steered the gain computation is generated.

Let’s start from a kick. The fast envelope follower can then be generated (red curve) as well as the slow envelope follower (green curve). The difference is always positive (if the two follower have the same release time value), so we can use it to compute a gain through our usual GainCompressorFilter.
Depending on whether you want to increase the transient or attenuate it, the ratio will be below or higher than 1 (respectively), which is what is shown in the last two plots here:
Transient shaper plotTransient shaper plot

In the end, it’s all about the right algorithms. If you have a proper framework, you may already have everything you need for some filters. In the case of a transient shaper, I already had all the filters in my toolbox. Now I just need to make a plugin out of this simple pipeline!

The code for the last plots can be found on Github here: https://github.com/mbrucher/AudioTK/blob/master/Examples/dynamic/display_transient_shaper.py

Buy Me a Coffee!

Other Amount:

Your Email Address :

by Matt at June 30, 2015 07:52 AM

June 29, 2015

Wei Xue

GSoC Week 5

The week 5 began with a discussion with whether we should deprecate params. I fixed some bugs in checking functions, random number generator and one of covariance updating methods. In the following days, I completed the main functions of GaussianMixutre and all test cases, except AIC, BIC and sampling functions. The tests are some kind of challenging, sine the current implementation in the master branch contains very old test cases imported from Weiss's implementation which is never got improved. I simplified the test cases, and wrote more tests that are not covered by the current implementation, such as covariance estimation, ground truth parameter prediction, and other user-friendly warnings and errors.

Next week, I will begin to code BayesianGaussianMixture.

June 29, 2015 04:03 PM

June 26, 2015

Matthew Rocklin

Write Complex Parallel Algorithms

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: We discuss the use of complex dask graphs for non-trivial algorithms. We show off an on-disk parallel SVD.

Most Parallel Computation is Simple

Most parallel workloads today are fairly trivial:

>>> import dask.bag as db
>>> b = db.from_s3('githubarchive-data', '2015-01-01-*.json.gz')
          .map(lambda d: d['type'] == 'PushEvent')

Graphs for these computations look like the following:

Embarrassingly parallel dask graph

This is great; these are simple problems to solve efficiently in parallel. Generally these simple computations occur at the beginning of our analyses.

Sophisticated Algorithms can be Complex

Later in our analyses we want more complex algorithms for statistics , machine learning, etc.. Often this stage fits comfortably in memory, so we don’t worry about parallelism and can use statsmodels or scikit-learn on the gigabyte result we’ve gleaned from terabytes of data.

However, if our reduced result is still large then we need to think about sophisticated parallel algorithms. This is fresh space with lots of exciting academic and software work.

Example: Parallel, Stable, Out-of-Core SVD

I’d like to show off work by Mariano Tepper, who is responsible for dask.array.linalg. In particular he has a couple of wonderful algorithms for the Singular Value Decomposition (SVD) (also strongly related to Principal Components Analysis (PCA).) Really I just want to show off this pretty graph.

>>> import dask.array as da
>>> x = da.ones((5000, 1000), chunks=(1000, 1000))
>>> u, s, v = da.linalg.svd(x)

Parallel SVD dask graph

This algorithm computes the exact SVD (up to numerical precision) of a large tall-and-skinny matrix in parallel in many small chunks. This allows it to operate out-of-core (from disk) and use multiple cores in parallel. At the bottom we see the construction of our trivial array of ones, followed by many calls to np.linalg.qr on each of the blocks. Then there is a lot of rearranging of various pieces as they are stacked, multiplied, and undergo more rounds of np.linalg.qr and np.linalg.svd. The resulting arrays are available in many chunks at the top and second-from-top rows.

The dask dict for one of these arrays, s, looks like the following:

>>> s.dask
{('x', 0, 0): (np.ones, (1000, 1000)),
 ('x', 1, 0): (np.ones, (1000, 1000)),
 ('x', 2, 0): (np.ones, (1000, 1000)),
 ('x', 3, 0): (np.ones, (1000, 1000)),
 ('x', 4, 0): (np.ones, (1000, 1000)),
 ('tsqr_2_QR_st1', 0, 0): (np.linalg.qr, ('x', 0, 0)),
 ('tsqr_2_QR_st1', 1, 0): (np.linalg.qr, ('x', 1, 0)),
 ('tsqr_2_QR_st1', 2, 0): (np.linalg.qr, ('x', 2, 0)),
 ('tsqr_2_QR_st1', 3, 0): (np.linalg.qr, ('x', 3, 0)),
 ('tsqr_2_QR_st1', 4, 0): (np.linalg.qr, ('x', 4, 0)),
 ('tsqr_2_R', 0, 0): (operator.getitem, ('tsqr_2_QR_st2', 0, 0), 1),
 ('tsqr_2_R_st1', 0, 0): (operator.getitem,('tsqr_2_QR_st1', 0, 0), 1),
 ('tsqr_2_R_st1', 1, 0): (operator.getitem, ('tsqr_2_QR_st1', 1, 0), 1),
 ('tsqr_2_R_st1', 2, 0): (operator.getitem, ('tsqr_2_QR_st1', 2, 0), 1),
 ('tsqr_2_R_st1', 3, 0): (operator.getitem, ('tsqr_2_QR_st1', 3, 0), 1),
 ('tsqr_2_R_st1', 4, 0): (operator.getitem, ('tsqr_2_QR_st1', 4, 0), 1),
 ('tsqr_2_R_st1_stacked', 0, 0): (np.vstack,
                                   [('tsqr_2_R_st1', 0, 0),
                                    ('tsqr_2_R_st1', 1, 0),
                                    ('tsqr_2_R_st1', 2, 0),
                                    ('tsqr_2_R_st1', 3, 0),
                                    ('tsqr_2_R_st1', 4, 0)])),
 ('tsqr_2_QR_st2', 0, 0): (np.linalg.qr, ('tsqr_2_R_st1_stacked', 0, 0)),
 ('tsqr_2_SVD_st2', 0, 0): (np.linalg.svd, ('tsqr_2_R', 0, 0)),
 ('tsqr_2_S', 0): (operator.getitem, ('tsqr_2_SVD_st2', 0, 0), 1)}

So to write complex parallel algorithms we write down dictionaries of tuples of functions.

The dask schedulers take care of executing this graph in parallel using multiple threads. Here is a profile result of a larger computation on a 30000x1000 array:

Low Barrier to Entry

Looking at this graph you may think “Wow, Mariano is awesome” and indeed he is. However, he is more an expert at linear algebra than at Python programming. Dask graphs (just dictionaries) are simple enough that a domain expert was able to look at them say “Yeah, I can do that” and write down the very complex algorithms associated to his domain, leaving the execution of those algorithms up to the dask schedulers.

You can see the source code that generates the above graphs on GitHub.

Approximate SVD dask graph

Randomized Parallel Out-of-Core SVD

A few weeks ago a genomics researcher asked for an approximate/randomized variant to SVD. Mariano had a solution up in a few days.

>>> import dask.array as da
>>> x = da.ones((5000, 1000), chunks=(1000, 1000))
>>> u, s, v = da.linalg.svd_compressed(x, k=100, n_power_iter=2)

I’ll omit the full dict for obvious space reasons.

Final Thoughts

Dask graphs let us express parallel algorithms with very little extra complexity. There are no special objects or frameworks to learn, just dictionaries of tuples of functions. This allows domain experts to write sophisticated algorithms without fancy code getting in their way.

June 26, 2015 12:00 AM

June 24, 2015

Titus Brown

A review of &quot;Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees&quot;

(This is a review of Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees, Solomon and Kingsford, 2015.)

In this paper, Solomon and Kingsford present Sequence Bloom Trees (SBTs). SBT provides an efficient method for indexing multiple sequencing datasets and finding in which datasets a query sequence is present.

The new method is based on using multiple Bloom filters and organizing them in a binary tree, where leaves represent specific datasets and internal nodes contain all the k-mers present in their subtrees. A query starts by breaking the sequence into a set of k-mers and checking if they are present in the node Bloom filter at a specific threshold. If yes then the query is repeated for children nodes, but if it isn't the subtree is pruned and search proceeds on other nodes. If all searches are pruned before reaching a leaf then the sequence is not present in any dataset. They prove the false positive rate for a k-mer can be quite higher than traditional applications of Bloom filters, since they are interested in finding if the whole set of k-mers is over a threshold. This leads to very small data structures that remain capable of approximating the correct answer.

Compared to alternative software (like SRA-BLAST or STAR) it has both decreased runtime and memory consumption, and it also can be used as a filter to make these tools faster.

Overall review

The paper is well written, clear, mostly expert in the area (but see below), and lays out the approach and tool well.

The approach is novel within bioinformatics, as far as we know. More, we think it's a tremendously important approach; it's by far the most succinct representation of large data sets we've seen (and Bloom filters are notoriously efficient), and it permits efficient indexing, storage of indices, and queries of indices.

A strange omission is the work that has been done by our group and others with Bloom filters. Pell et al., 2012 (pmid 22847406), showed that implicit De Bruijn graphs could be stored in Bloom filters in exactly the way the authors are doing here; work by Chikhi and Rizk, 2013 (pmid 24040893) implemented exact De Bruijn graphs efficiently using Bloom filters; and Salikhov et al, 2014 (pmid 24565280) further used Cascading Bloom filters. Our group has also used the median k-mer abundance (which, in a Bloom filter, equals median k-mer presence) to estimate read presence and coverage in a very similar way to Solomon and Kingsford (Brown et al., 2012, "digital normalization"). We also showed experimentally that this is very robust to high false positive rates (Zhang et al., 2014, pmid 25062443, buried in the back).

There are three points to make here --

  1. Previous work has been done connecting Bloom filters and k-mer storage, in ways that seem to be ignored by this paper; the authors should cite some of this literature. Given citation space limitations, this doesn't need to be exhaustive, but either Salikhov or Pell seems particularly relevant.
  2. The connection between Bloom filters and implicit De Bruijn graphs should be explicitly made in the paper, as it's a powerful theoretical connection.
  3. All of our previous result support the conclusions reached in this paper, and this paper makes the false-positive robustness argument much more strongly, which is a nice conclusion!


We have found that users are often very confused about how to pick the size of Bloom filters. My sense here is that the RRR compression means that very large Bloom filters will be stored efficiently, so you might as well start big, because there's no way to do progressive size increases on the Bloom filter; do the authors agree with that conclusion, or am I missing something?

One possible writing improvement is to add another level under the leaves in Supp Fig 1 to make it clear that traditional alignment or other alternatives are required, since SBT only finds if the query is present in the dataset (but not where). The speed comparisons in the paper could be qualified a bit more to make it clear that this is only for basic search, although some of us think it's already clear enough so it's advice, not a requested or required change.

However, there is a solid point to be made that (in our opinion) the true value of the SBT approach is not necessarily in speeding up the overall process (3.5x speedup) but in doing the search in very low memory across an index that can be distributed independently of the data.

page 16, Theorem 2 says the probability that ... is nearly 0 when FPR is << theta, fraction threshold. But next in the example, theta is 0.5 and FPR is also 0.5, so here the FPR is NOT << theta, as in Theorem 2. How to conclude that "by Theorem 2, we will be unlikely to observe > theta fraction of false positive kmers in the filter."?

Software and tool publication

Bioinformatics paper checklist (http://ivory.idyll.org/blog/blog-review-criteria-for-bioinfo.html):

The software is directly available for download: Yes, https://github.com/Kingsford-Group/bloomtree

The software license lets readers download and run it: The license is not specified; this needs to be fixed. But 'bloomtree' makes use of several GPL toolkits.

The software source code is available to readers: Yes, https://github.com/Kingsford-Group/bloomtree

We successfully downloaded and ran the software.

The data for replication is available for download: Yes, public data from SRA; it's listed on supp materials, but could be added to the tool site too.

The data format is either standard, or straightforward, or documented. Yes

Other comments:


  • we strongly recommend that a lab-independent URL be used as the official URL for the software (e.g. the github page, instead of the CMU page). Lab Web sites tend to fall out of date or otherwise decay.
  • One of the big drawbacks to Bloom filters is that they are fixed in size. Guidance on choosing Bloom filter size would be welcome. One way to do this is to use an efficient method to calculate cardinality, and khmer has a BSD- licensed implementation of the HyperLogLog cardinality counter that they'd be welcome to copy wholesale.


C. Titus Brown

Luiz Irber

Qingpeng Zhang

by C. Titus Brown, Luiz Irber, Qingpeng Zhang at June 24, 2015 10:00 PM

Thoughts on Sequence Bloom Trees

We just submitted our review of the paper Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees., by Brad Solomon and Carl Kingsford.

The paper outlines a fairly simple and straightforward way to query massive amounts of sequence data (5 TB of mRNAseq!) in very small disk (~70 GB) and memory (~under a GB), fairly quickly (~2.5 days).

The short version is that I think this is an incredibly powerful approach that I am now planning to build on for our own Moore DDD project.

The review was a lot of fun; it's right up our alley, and we've thought a lot about related (but somewhat distinct) issues. Here are some extended comments that I thought were probably not appropriate for the official review because they're somewhat forward looking.

First, we did the review in approved Open Science style. Since Brad and Carl did us the favor of using GitHub (source code) and bioRxiv (preprint), and I sign my reviews, there need be no mystery in this process. Therefore I am posting our review publicly with only minor edits. Moreover, I filed three issues on GitHub (#1, #2, and #4) and submitted two pull requests (#3 and #5), both of which were merged.

Second, because we work in this area and are very interested in it, I put together both a demo of their software (see 2015-sbt-demo) and also did a simple reimplementation in khmer (see 2015-khmer-sequence-bloom-trees) to make sure that their software worked, and that I thoroughly understood the basic issues involved.

Note that we are unable to use their implementation as licensed, as it is under the GPL and would contaminate our source tree, which is under BSD :(.

Third, I have a lot of suggestions and thoughts! For example,

  • The use of RRR compression is awesome and is something we should look into for khmer (dib-lab/khmer#1074).
  • We get a 20% performance increase from a simple optimization applied to our k-mer lookups that could apply to SBTs -- see just-merged dib-lab/khmer#862, by Camille Scott.
  • The authors might also be interested in making use of our HyperLogLog implementation for k-mer cardinality counting, which could help users choose the right size for their Bloom filters.
  • Streaming diginorm/semi-streaming in general (see Crossing the streams) could be a very useful pre-filter for building SBTs. My guess is that with k-mer prefiltering a la digital normalization, there would be no loss to sensitivity but a substantial decrease in memory requirements.
  • It would be really interesting to brainstorm about how far this can be taken. We have reasonably strong evidence & intuition that you can do straight abundance estimation directly off of counting Bloom filters, and it doesn't seem like a stretch to say, "hey, let's store EVERYTHING in an sequence Bloom tree analog and do comparative expression analysis that way!" We don't have an immediate use case for this ourselves, but I'm sure one will present itself...
  • Qingpeng Zhang and I immediately started talking about how to apply this to metagenomics, and the main worry is that the method seems to depend on low diversity of true k-mers. This can partly be mitigated by diginorm and/or semi-streaming k-mer abundance trimming, but ultimately things are going to scale at best with the true size of the De Bruijn graph. It will be interesting to see how this plays out.


by C. Titus Brown at June 24, 2015 10:00 PM

June 23, 2015

Matthieu Brucher

Announcement: Audio TK 0.6.0

The main changes for this release are first trials at modulated filters, C++11 usage (nullptr, override and final), and some API changes (the main process_impl function is now const).

Download link: ATK 0.6.0


* Added override and final keywords in virtual calls
* Changed the API so that process_impl is now const
* Exposed full_setup to the user (direct reset of the internal state, already called when changing sample rate)
* Added LinkWitz-Riley second order low and high path filters
* Fix resetting the internal state of all delays by using full_setup
* Added a CustomFIRFilter with Python wrapper

* Added time-varying IIR filters (variable frequency, coded as transposed direct form II)
* Added second order time varying filter implementations
* Added a RelativePowerFilter with Python wrappers
* Added a DerivativeFilter with Python wrappers
* Added Python wrappers for the InWavFilter
* Fixed some warnings during compilation

Buy Me a Coffee!

Other Amount:

Your Email Address :

by Matt at June 23, 2015 07:01 AM

Matthew Rocklin

Distributed Scheduling

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: We evaluate dask graphs with a variety of schedulers and introduce a new distributed memory scheduler.

Dask.distributed is new and is not battle-tested. Use at your own risk and adjust expectations accordingly.

Evaluate dask graphs

Most dask users use the dask collections, Array, Bag, and DataFrame. These collections are convenient ways to produce dask graphs. A dask graph is a dictionary of tasks. A task is a tuple with a function and arguments.

The graph comprising a dask collection (like a dask.array) is available through its .dask attribute.

>>> import dask.array as da
>>> x = da.arange(15, chunks=(5,))  # 0..14 in three chunks of size five

>>> x.dask  # dask array holds the graph to create the full array
{('x', 0): (np.arange, 0, 5),
 ('x', 1): (np.arange, 5, 10),
 ('x', 2): (np.arange, 10, 15)}

Further operations on x create more complex graphs

>>> z = (x + 100).sum()
>>> z.dask
{('x', 0): (np.arange, 0, 5),
 ('x', 1): (np.arange, 5, 10),
 ('x', 2): (np.arange, 10, 15),
 ('y', 0): (add, ('x', 0), 100),
 ('y', 1): (add, ('x', 1), 100),
 ('y', 2): (add, ('x', 2), 100),
 ('z', 0): (np.sum, ('y', 0)),
 ('z', 1): (np.sum, ('y', 1)),
 ('z', 2): (np.sum, ('y', 2)),
 ('z',): (sum, [('z', 0), ('z', 1), ('z', 2)])}

Hand-made dask graphs

We can make dask graphs by hand without dask collections. This involves creating a dictionary of tuples of functions.

>>> def add(a, b):
...     return a + b

>>> # x = 1
>>> # y = 2
>>> # z = add(x, y)

>>> dsk = {'x': 1,
...        'y': 2,
...        'z': (add, 'x', 'y')}

We evaluate these graphs with one of the dask schedulers

>>> from dask.threaded import get
>>> get(dsk, 'z')   # Evaluate graph with multiple threads

>>> from dask.multiprocessing import get
>>> get(dsk, 'z')   # Evaluate graph with multiple processes

We separate the evaluation of the graphs from their construction.

Distributed Scheduling

The separation of graphs from evaluation allows us to create new schedulers. In particular there exists a scheduler that operates on multiple machines in parallel, communicating over ZeroMQ.

This system has a single centralized scheduler, several workers, and potentially several clients.

Clients send graphs to the central scheduler which farms out those tasks to workers and coordinates the execution of the graph. While the scheduler centralizes metadata, the workers themselves handle transfer of intermediate data in a peer-to-peer fashion. Once the graph completes the workers send data to the scheduler which passes it through to the appropriate user/client.


And so now we can execute our dask graphs in parallel across multiple machines.

$ ipython  # On your laptop                 $ ipython  # Remote Process #1:  Scheduler
>>> def add(a, b):                          >>> from dask.distributed import Scheduler
...     return a + b                        >>> s = Scheduler(port_to_workers=4444,
                                            ...               port_to_clients=5555,
>>> dsk = {'x': 1,                          ...               hostname='notebook')
...        'y': 2,
...        'z': (add, 'x', 'y')}            $ ipython  # Remote Process #2:  Worker
                                            >>> from dask.distributed import Worker
>>> from dask.threaded import get           >>> w = Worker('tcp://notebook:4444')
>>> get(dsk, 'z')  # use threads
3                                           $ ipython  # Remote Process #3:  Worker
                                            >>> from dask.distributed import Worker
                                            >>> w = Worker('tcp://notebook:4444')

>>> from dask.distributed import Client
>>> c = Client('tcp://notebook:5555')

>>> c.get(dsk, 'z') # use distributed network

Choose Your Scheduler

This graph is small. We didn’t need a distributed network of machines to compute it (a single thread would have been much faster) but this simple example can be easily extended to more important cases, including generic use with the dask collections (Array, Bag, DataFrame). You can control the scheduler with a keyword argument to any compute call.

>>> import dask.array as da
>>> x = da.random.normal(10, 0.1, size=(1000000000,), chunks=(1000000,))

>>> x.mean().compute(get=get)    # use threads
>>> x.mean().compute(get=c.get)  # use distributed network

Alternatively you can set the default scheduler in dask with dask.set_options

>>> import dask
>>> dask.set_options(get=c.get)  # use distributed scheduler by default

Known Limitations

We intentionally made the simplest and dumbest distributed scheduler we could think of. Because dask separates graphs from schedulers we can iterate on this problem many times; building better schedulers after learning what is important. This current scheduler learns from our single-memory system but is the first dask scheduler that has to think about distributed memory. As a result it has the following known limitations:

  1. It does not consider data locality. While linear chains of tasks will execute on the same machine we don’t think much about executing multi-input tasks on nodes where only some of the data is local.
  2. In particular, this scheduler isn’t optimized for data-local file-systems like HDFS. It’s still happy to read data from HDFS, but this results in unnecessary network communication. We’ve found that it’s great when paired with S3.
  3. This scheduler is new and hasn’t yet had its tires kicked. Vocal beta users are most welcome.
  4. We haven’t thought much about deployment. E.g. somehow you need to ssh into a bunch of machines and start up workers, then tear them down when you’re done. Dask.distributed can bootstrap off of an IPython Parallel cluster, and we’ve integrated it into anaconda-cluster but deployment remains a tough problem.

The dask.distributed module is available in the last release but I suggest using the development master branch. There will be another release in early July.

Further Information

Blake Griffith has been playing with dask.distributed and dask.bag together on data from http://githubarchive.org. He plans to write a blogpost to give a better demonstration of the use of dask.distributed on real world problems. Look for that post in the next week or two.

You can read more about the internal design of dask.distributed at the dask docs.


Special thanks to Min Regan-Kelley, John Jacobsen, Ben Zaitlen, and Hugo Shi for their advice on building distributed systems.

Also thanks to Blake Griffith for serving as original user/developer and for smoothing over the user experience.

June 23, 2015 12:00 AM

June 22, 2015

Titus Brown

We're throwing a Software Carpentry! And it's already full...

Yesterday morning, we announced a Software Carpentry workshop here at UC Davis, running July 6-7 -- see the Web site for more information. I'm organizing, and Easton White and Noam Ross are co-lead instructors. (This is the first workshop I'm running since we became an affiliate!)

I'd love it if you could all come!

...but it's already full-ish.

I announced the workshop on Monday morning via the dib-training mailing list. We opened up 30 seats to all comers, and reserved 30 seats for people from the UC Davis School of Veterinary Medicine, which is hosting the workshop (by employing me).

Within 12 hours, the 30 open seats were filled. Wow.

The VetMed seats will be opened to the waiting list next Monday, and (if no one else signs up ;) there is still plenty of room. So if you're in the area, take a look and sign yourself up!


by C. Titus Brown at June 22, 2015 10:00 PM

June 19, 2015

Titus Brown

Some personal perspectives on academic freedom and free speech

Some background: I'm a white, male, tenured faculty member at UC Davis, and a 3rd generation academic. I work in relatively uncontroversial areas of science (primarily bioinformatics & genomics) at a university that is about as protective of academic freedom as you can get these days. I also live in a country that is (at least formally and legally if not always in practice) more protective of free speech than most countries in the world. For these reasons, I have less to fear from expressing my opinions -- on my blog, on Twitter, or in person -- than virtually anyone, anywhere, ever has.

A week ago, I tweeted Dr. Alice Dreger's article, Wondering if I'm the Next Tim Hunt. I was surprised, frustrated, and upset by the response of a number of colleagues on Twitter, who essentially said "speech that I disagree with is not protected by academic freedom." (This is a bit of a paraphrase, but I think an accurate one; judge for yourselves.)

I must say that I don't really care about Dr. Tim Hunt per se, and that whole issue has been covered very well elsewhere. He made clearly sexist and harmful remarks and is not a credible role model. To the extent that I have anything useful to say, I am worried about the reported actions of UCL in this interview with Mary Collins.

What I am much more worried about is the degree to which academics seem oblivious to the issue of academic freedom. It takes a special kind of obliviousness and subjectivity to look at the history of science and argue that academic researchers should be restricted in the scope of their opinions by their employer. For more on the sordid history of academic freedom, see the wikipedia page.

But, you know what? I don't work on academic freedom, and I'm not a lawyer, so I can't comment on nuances of employment contracts vs teaching vs publication, speech vs action, etc. All I can do is tell you why I care about free speech and academic freedom, and what my personal experiences are in this area, and try to explain my perspective.

Communism and my family

In a very real sense, my brothers and sisters and I only exist because of practical limits on free speech.

My father left the United States after receiving his PhD because he'd briefly been a member of the Communist Party. From personal conversations with him, I believe he was keenly aware of the danger of staying in the US, and knew that his academic career would have been in danger had he remained here.

In England, he met his first wife with whom he had my three oldest siblings, all of whom were born in Europe. (He met my mother when he returned to Princeton.)

Ironically, he managed to run afoul of both the US Government (by being a former member of the Communist Party and leaving their reach) and the Communist Party (from which he was evicted for asking too many questions).

I grew up on stories of him being called into British government buildings to meet with US officials who wanted him to surrender his passport (which would have prevented him from traveling); when he wouldn't surrender his passport, the US officials tried to get the Brits to take it from him. That never happened; the Brit response was to ask for a letter from the US, which of course wasn't forthcoming because all of this was unofficial persecution.

I also grew up on stories of him being wined and dined by the Stasi when he was visiting his first wife's family in Eastern Europe, in attempts to recruit him. This, together with his habit of sending theoretical nuclear physics journals to colleagues in Russia, led to a frightening-in-hindsight visit by several serious men in dark sunglasses from the FBI in the early 80s (I was about 10).

Interestingly, despite having been very politically active early on, my father was completely apolitical during my life. In the last years of his life, I got some minimal insight into this from his recounting of the Peekskill riots, but I never got the whole story.

Academic freedom doesn't protect you from being sued personally

Some 20-odd years ago (I'm feeling old today) I mirrored a spam blacklist site on my employee account at Caltech. (This was back when the Internet was new enough that such things could be maintained somewhat manually. ;) One of the people on the spam blacklist got very upset and sent some very nasty e-mails threatening everyone and everything at Caltech with lawsuits unless we removed it.

The resulting conversation escalated all the way to the provost (who I was actually doing research for at the time - see below), and I had the awkward conversation where I was told:

  • we have no problems hosting this, if you make the following modifications to make it clear this isn't Caltech's official opinion;
  • we're not going to fire you or anything like that;
  • but we won't protect you yourself from libel litigation, so good luck!

Nothing ever happened, but it made the indelible point on me that academic freedom is a thin reed to clutch - people can still bring legal action against you for what you said, even if you face no blowback from your employer.

Climate studies and modeling

I worked for a few years on climate studies with Dr. Steven Koonin, back when he was provost at Caltech (and before he took up his position at BP). My scientific career exists in part because of the work I did with him. I regard him as a mentor, colleague, and friend.

In 2014, Dr. Koonin wrote an op-ed on climate change (here) that I think makes many good points. Knowing him personally, I trust his judgement; having worked (a small bit) in this area, and having a reasonable amount of experience in modeling, I am in agreement with many of his central points.

This measured response is a good example of true scientific debate in action. We need more of this, in general. (I'm a particular fan of Dr. Koonin's suggestion on model evaluation; he's a smart scientist.)

Several people privately told me that they thought Dr. Koonin was an idiot for writing this, and others told me it was our responsibility as scientists to toe the climate change line for fear of doing further damage to the environment. I disagree with both of these groups of people, even though I believe that climate change is anthropogenic and we need to do something about it. I think Dr. Koonin made some good points that needed to be made.

Blogging and intellectual community

About 2-3 times a year, I get a request to change something in a blog post. Very rarely is it because what I've said is wrong; it's usually because it makes someone uncomfortable or unhappy. As a matter of policy, I refuse to do so (plus my blog is under version control, and I'm certainly not going to rewrite my git history :).

(I don't have any problem with posting explicit corrections when I'm wrong, obviously.)

A key point is that I don't expect to be fired for anything I say in my blog posts. Completely apart from having an awful lot of privilege (white, male, tenure, supportive family, no health problems), there's an expectation that what I say on my blog is subject to academic freedom. I've never gotten any pushback from my employer and I don't expect to, no matter how critical I am of them.

Joe Pickrell makes a very good point that intellectual community is key to academia. How can we have robust discussion and without academic freedom? (Rebecca Schuman makes an excellent related point about adjuncts, job security and academic freedom, here, with which I greatly sympathize.)

Privilege, and free speech, and academic freedom

(I'm not a lawyer, so please correct me. This is my understanding.)

Free speech is a constitutional right in the US; as such it only applies to government action. If my employer is upset with my speech, they are free to fire me; Twitter is under no obligation to allow me to tweet whatever I want; etc.

Academic freedom is, essentially, free speech commuted to academic employees: basically, universities should not fire people for something they said. While I am still individually liable for what I say under the law of the country I'm in, my employer cannot fire me without some substantial process (if at all) for what I say.

There are a lot of tricky bits in there, though.

For example, when I wrote on Twitter, "academic ideal: I should be able to hold & defend ideas w/o fear of losing my job", I got a very important response from a colleague -- White men exercising their entitlement to this ideal seems to be at odds with marginalized people gaining the same privileges.

(Please read the rest of that Twitter commentary if you're at all interested in this!)

I don't have a sophisticated response to offer; as a tenured white guy whose research isn't in this area, I am only slowly learning about this area, and a large part of that learning is being open to colleagues who tell me about their experiences (latest horrific example, of many: Julie Libarkin, with whom I work on learning evaluation). For this reason I tend to simply stay quiet and do what I can to foster a welcoming environment. I certainly don't feel qualified to say anything intelligent on the specific question of marginalization.

I do have two tentative thoughts that I keep on coming back to, though, and I'd welcome feedback.

One thought is this: we can only have conversations about sexism and privilege and systemic oppression because of free speech, and, in the university, because discussions of these controversial topics are protected by academic freedom. I have colleagues and mentees who come from "free speech challenged" countries (I'm not being more specific in order to protect them), and the stories they tell me of government and institutional oppression are horrifying. With respect to one actual real-life example that happened to the family of a colleague, I can confirm that I would say virtually anything you want me to if you took my children, put them in a jail cell, and threatened them until I acquiesced. We are fairly far from that in the US (with national security and terrorism being one horrible counterexample), and I value that tremendously. I would hate to see that weakened even in the service of efforts that I believe in passionately.

My other thought is this: limits to academic freedom and free speech are and always have been a double edged sword. This is almost the definition of a "slippery slope" situation - it's very hard to enact precise limitations on free speech that don't have seriously unintended consequences. It's pretty easy to find pairs of examples to juxtapose -- consider gun rights vs animal rights. I bet relatively few people are sympathetic to both lawsuits on any grounds other than academic freedom! But most people will be sympathetic to at least one. How else to square this but academic freedom??

So inasmuch as I have anything to say, it's this: we should be careful what we wish for, because your well-intentioned limits on free speech and academic freedom today will be used used against you tomorrow. And if you don't agree that happens, you are taking an ahistorical position.

Concluding thoughts

There's a long and righteous history of defending the most disgusting and horrifying actions based on due process. For one example, Miranda rights rest on a despicable character, Ernesto Miranda, who was later convicted of some horrible crimes. Presumably most of my readers would agree that Miranda rights are a net win for the rights of the accused, but note that it was controversial -- for example, the Supreme Court decision was 5-4. (The wikipedia page is a very good read.)

So, ultimately, I don't think there's any conflict in arguing for due process or legal protections of free speech, academic freedom, or anything else, no matter how heinous the speech being protected is. And if you disagree, then I think you're not only wrong but dangerously so.

That having been said, I'm unsympathetic to people who want me to host their obnoxious speech. I can't see any reason why I, personally, am required to pay attention to what anyone else is saying. I don't have any reason to put up with (say) sexist speech within my lab, or on my blog. Nor do I have to engage with, pay attention to, or promote, those who have opinions I find to be silly or nonsensical. (One exception here - academic norms require me to engage with those opinions that bear on my own academic research.)


p.s. Respectful comments only, abiding by the Principle of Charity; others may be deleted without notice, and commenters may be banned. My blog, my rules. Read the above if you're confused :).

by C. Titus Brown at June 19, 2015 10:00 PM

Abraham Escalante

Scipy and the first few GSoC weeks

Hi all,

We're about three (and a half) weeks into the GSoC and it's been one crazy ride so far. Being my first experience working in OpenSource projects and not being much of an expert in statistics was challenging at first, but I think I might be getting the hang of it now.

First off, for those of you still wondering what I'm actually doing, here is an abridged version of the abstract from my proposal to the GSoC (or you can click here for the full proposal):

"scipy.stats is one of the largest and most heavily used modules in Scipy. [...] it must be ensured that the quality of this module is up to par and [..] there are still some milestones to be reached. [...] Milestones include a number of enhancements and [...] maintenance issues; most of the scope is already outlined and described by the community in the form of open issues or proposed enhancements."

So basically, the bulk of my project consists on getting to work on open issues for the StatisticsCleanup milestone within the statistics module of SciPy (a Python-based OpenSource library for scientific computing). I suppose this is an unusual approach for a GSoC project since it focuses on maintaining and streamlining an already stable module (in preparation for the release of SciPy 1.0), rather than adding a new module or a specific function within.

The unusual approach allows me to make several small contributions and it gives me a wide (although not as deep) scope, rather than a narrow one. This is precisely the reason why I chose it. I feel like I can benefit (and contribute) a lot more this way, while I get acquainted with the OpenSource way and it also helps me to find new personal interests (win-win).

However, there are also some nuances that may be uncommon. During the first few weeks I have discovered that my proposal did not account for the normal life-cycle of issues and PRs in scipy; my estimations we're too hopeful.

One of OpenSource's greatest strengths is the community getting involved in peer reviews; this allows a developer to "in the face of ambiguity, refuse the temptation to guess". If you didn't get that [spoiler alert] it was a reference to the zen of python (and if you're still reading this and your name is Hélène, I love you).

The problem with this is that even the smooth PRs can take much longer than one week to be merged because of the back and forth with feedback from the community and code update (if it's a controversial topic, discussions can take months). Originally, I had planned to work on four or five open issues a week, have the PRs merged and then continue with the next four or five issues for the next week but this was too naive so I have had to make some changes.

I spent the last week compiling a list of next steps for pretty much all of the open issues and I am now trying to work on as many as I can at a time, thus minimising the impact of waiting periods between feedback cycles for each PR. I can already feel the snowball effect it is having on the project and on my motivation. I am learning a lot more (and in less time) than before which was the whole idea behind doing the Summer of Code.

I will get back in touch soon. I feel like I have rambled on for too long, so I will stop and let you continue to be awesome and get on with your day.


by noreply@blogger.com (Abraham Escalante) at June 19, 2015 12:19 AM

June 18, 2015

Matthew Rocklin

Pandas Categoricals

tl;dr: Pandas Categoricals efficiently encode and dramatically improve performance on data with text categories

Disclaimer: Categoricals were created by the Pandas development team and not by me.

There is More to Speed Than Parallelism

I usually write about parallelism. As a result people ask me how to parallelize their slow computations. The answer is usually just use pandas in a better way

  • Q: How do I make my pandas code faster with parallelism?
  • A: You don’t need parallelism, you can use Pandas better

This is almost always simpler and more effective than using multiple cores or multiple machines. You should look towards parallelism only after you’ve made sane choices about storage format, compression, data representation, etc..

Today we’ll talk about how Pandas can represent categorical text data numerically. This is a cheap and underused trick to get an order of magnitude speedup on common queries.


Often our data includes text columns with many repeated elements. Examples:

  • Stock symbols – GOOG, APPL, MSFT, ...
  • Gender – Female, Male, ...
  • Experiment outcomes – Healthy, Sick, No Change, ...
  • States – California, Texas, New York, ...

We usually represent these as text. Pandas represents text with the object dtype which holds a normal Python string. This is a common culprit for slow code because object dtypes run at Python speeds, not at Pandas’ normal C speeds.

Pandas categoricals are a new and powerful feature that encodes categorical data numerically so that we can leverage Pandas’ fast C code on this kind of text data.

>>> # Example dataframe with names, balances, and genders as object dtypes
>>> df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie', 'Danielle'],
...                    'balance': [100.0, 200.0, 300.0, 400.0],
...                    'gender': ['Female', 'Male', 'Male', 'Female']},
...                    columns=['name', 'balance', 'gender'])

>>> df.dtypes                           # Oh no!  Slow object dtypes!
name        object
balance    float64
gender      object
dtype: object

We can represent columns with many repeats, like gender, more efficiently by using categoricals. This stores our original data in two pieces

  • Original data

     Female, Male, Male, Female
  1. Index mapping each category to an integer

    Female: 0
    Male: 1
  2. Normal array of integers

    0, 1, 1, 0

This integer array is more compact and is now a normal C array. This allows for normal C-speeds on previously slow object dtype columns. Categorizing a column is easy:

In [5]: df['gender'] = df['gender'].astype('category')  # Categorize!

Lets look at the result

In [6]: df                          # DataFrame looks the same
       name  balance  gender
0     Alice      100  Female
1       Bob      200    Male
2   Charlie      300    Male
3  Danielle      400  Female

In [7]: df.dtypes                   # But dtypes have changed
name         object
balance     float64
gender     category
dtype: object

In [8]: df.gender                   # Note Categories at the bottom
0    Female
1      Male
2      Male
3    Female
Name: gender, dtype: category
Categories (2, object): [Female, Male]

In [9]: df.gender.cat.categories    # Category index
Out[9]: Index([u'Female', u'Male'], dtype='object')

In [10]: df.gender.cat.codes        # Numerical values
0    0
1    1
2    1
3    0
dtype: int8                         # Stored in single bytes!

Notice that we can store our genders much more compactly as single bytes. We can continue to add genders (there are more than just two) and Pandas will use new values (2, 3, …) as necessary.

Our dataframe looks and feels just like it did before. Pandas internals will smooth out the user experience so that you don’t notice that you’re actually using a compact array of integers.


Lets look at a slightly larger example to see the performance difference.

We take a small subset of the NYC Taxi dataset and group by medallion ID to find the taxi drivers who drove the longest distance during a certain period.

In [1]: import pandas as pd
In [2]: df = pd.read_csv('trip_data_1_00.csv')

In [3]: %time df.groupby(df.medallion).trip_distance.sum().sort(ascending=False,
CPU times: user 161 ms, sys: 0 ns, total: 161 ms
Wall time: 175 ms

1E76B5DCA3A19D03B0FB39BCF2A2F534    870.83
6945300E90C69061B463CCDA370DE5D6    832.91
4F4BEA1914E323156BE0B24EF8205B73    811.99
191115180C29B1E2AF8BE0FD0ABD138F    787.33
B83044D63E9421B76011917CE280C137    782.78
Name: trip_distance, dtype: float64

That took around 170ms. We categorize in about the same time.

In [4]: %time df['medallion'] = df['medallion'].astype('category')
CPU times: user 168 ms, sys: 12.1 ms, total: 180 ms
Wall time: 197 ms

Now that we have numerical categories our computaion runs 20ms, improving by about an order of magnitude.

In [5]: %time df.groupby(df.medallion).trip_distance.sum().sort(ascending=False,
CPU times: user 16.4 ms, sys: 3.89 ms, total: 20.3 ms
Wall time: 20.3 ms

1E76B5DCA3A19D03B0FB39BCF2A2F534    870.83
6945300E90C69061B463CCDA370DE5D6    832.91
4F4BEA1914E323156BE0B24EF8205B73    811.99
191115180C29B1E2AF8BE0FD0ABD138F    787.33
B83044D63E9421B76011917CE280C137    782.78
Name: trip_distance, dtype: float64

We see almost an order of magnitude speedup after we do the one-time-operation of replacing object dtypes with categories. Most other computations on this column will be similarly fast. Our memory use drops dramatically as well.


Pandas Categoricals efficiently encode repetitive text data. Categoricals are useful for data like stock symbols, gender, experiment outcomes, cities, states, etc.. Categoricals are easy to use and greatly improve performance on this data.

We have several options to increase performance when dealing with inconveniently large or slow data. Good choices in storage format, compression, column layout, and data representation can dramatically improve query times and memory use. Each of these choices is as important as parallelism but isn’t overly hyped and so is often overlooked.

Jeff Reback gave a nice talk on categoricals (and other featuress in Pandas) at PyData NYC 2014 and is giving another this weekend at PyData London.

June 18, 2015 12:00 AM

June 17, 2015

Wei Xue

GSoC Week 4: Progress Report

Updated in Jun 24.

Here is the task check-list.

  1. [x] Completes derivation report.
  2. [x] Adds new classes. One abstract class _BasesMixture. Three derived classes GaussianMixture, BayesianGaussianMixture, DirichletProcessGaussianMixture
  3. [ ] Decouples large functions, especially in DirichletProcessGaussianMixture and BayesianGaussianMixture
  4. [x] Removes numerical stability fixes for HMM. It seems that whenever there is a numerical issue, the code always adds 10*EPS in the computation. I think in some cases there is a better way to address the problem, such as normalization the extremely small variables earlier, or we just simply remove 10*EPS which is only for HMM.
  5. [ ] Writes updating functions for BayesianGaussianMixtureModel and DirichletProcessGaussianMixtureModel according to the report.
  6. [x] Provides methods that allow users to initialize the model with user-provided data
  7. [x] Corrects kmeans initialization. It is weird when using kmeans initialization, only means is initialized. The weights and covariances are initialized by averaging.
  8. [x] Writes several checking functions for the initialization data
  9. [x] Adjusts the verbose messages. When verbose>1, it display log-likelihood and time used in the same line of the message Iteration x
  10. [ ] Adjusts the time to compute log-likelihood. The code in the current master branch compute the log-likelihood of the model after E-step which is actually the score of the last iteration, and misses the score immediately after the initialization.
  11. [x] Simplify fit_predict
  12. [x] Adds warning for params!='wmc'
  13. [ ] Studies and contrasts the convergence of classical MLE / EM GMM with Bayesian GMM against the number of samples and the number of components
  14. [ ] Friendly warning and error messages, or automatically addressing if possible (e.g. random re-init of singular components)
  15. [ ] Examples that shows how models can over-fit by comparing likelihood on training and validation sets (normalized by the number of samples). For instance extend the BIC score example with a cross-validated likelihood plot
  16. [ ] Testing on 1-D dimensions
  17. [ ] Testing on Degenerating cases
  18. [ ] AIC, BIC for VBGMM DPGMM
  19. [ ] Old faithful geyser data set
  20. [optional] add a partial_fit function for incremental / out-of-core fitting of (classical) GMM, for instance http://arxiv.org/abs/0712.4273
  21. [optional] ledoit_wolf covariance estimation

The most important progress I have done is the derivation report which include the updating functions, log-probability, and predictive distribution for all three models, and the implementation of the base class. Compared with the current scikit-learn math derivation documents, my report is consistent to PRML. It clearly depicts the updating functions of three models share a lot of patterns. We could abstract common functions into the abstract base class _MixtureBase. The three models could inherit it and override the updating methods.

Next week I will finish the GaussianMixture model with necessary testing functions.

June 17, 2015 08:20 PM

June 16, 2015

Matthieu Brucher

Audio Toolkit: Anatomy of a middle-side compressor

Sometimes images are worth a thousand words, so let’s look at some pictures of a middle-side compressor behavior.

A middle-side compressor like ATKStereoCompressor can work in a middle-side workflow. This means that the stereo signal is split in a center/middle channel and a side (L-R) channel. Then each channel is processed through the compressor independently and then recreated after:

Stereo compressor pipelineStereo compressor pipeline

So let’s take a stereo signal (I won’t extract the middle/side channels, it is easy to get them from AudioTK):
Stereo signal inputStereo signal input

From this, it is easy to generate the RMS power for each channel:
Middle-side power signals (RMS)Middle-side power signals (RMS)

Now, after an attack-release filter, we can apply the gain stage with the desired ratio, knee…
Middle-side gainMiddle-side gain
The gain processed for the two channels are similar, following the same trend, but with vastly different features.

Finally, we can apply the gain on the middle and side channels before regenerating the stereo channels and get the following result:
Stereo output signalStereo output signal

Of course, changing attack/release values will also change the shape of the signals, as well as the ratio: with a higher ratio on the side, the signal will sound more like a mono signal, whereas a higher ratio on the middle, the stereo image will be broadened.

The code to generate these images for any stereo signal is available on github.

Buy Me a Coffee!

Other Amount:

Your Email Address :

by Matt at June 16, 2015 07:01 AM

June 15, 2015

Wei Xue

GSoC Week 3

The week 3 has a very exciting start. I finished the derivation of DPGMM, as well as the lower bound and the predictive probability for each model.

The difference between my derivation and the current document is that the current models assume a simpler approximation. The model defined in PRML is more accurate and provides more knobs. The two approximations both appear in the literature. Maybe we should do some experiments to decide which one is better.

With regards to the new names of DPGMM and VBGMM, I think these two names are not suitable, just like someone calls SVM as SMO. Actually, the models are Bayesian GMM, Dirichlet Process Bayesian GMM (DPGMM is often used) respectively. Both of them are solved by variational inference. In other words, VBGMM is not a good name. The new names, I think, should have the meaning of 'Bayesian GMM solved by VB', 'DP(B)GMM solved by VB'.

I also took a close look at the code base. The code is not maintained well. The problem I am going to address is as follows.

  • decouple some large functions, such as _fit
  • use abstract class and inheritance to reduce code redundancy
  • numerical stability. It seems that whenever there is a numerical issue. The code always like to add EPS. I think in some place there is a better way to address the problem, such as normalization the extremely small variables earlier.
  • write updating functions for BayesianGaussianMixtureModel and DirichletProcessGaussianMixtureModel
  • provide methods that allow users to initialize the model before fit
  • correct kmeans initialization. It is weird when using kmean initialization, only means is initialized. The weights and covariances are initialized by averaging.
  • write several checking functions for the initialization data
  • [optional] add a partial_fit function for incremental / out-of-core fitting of (classical) GMM, for instance http://arxiv.org/abs/0712.4273
  • [optional] ledoit_wolf covariance estimation

The last days of this week I implemented the structure of new classes. _MixtureModelBase, GaussianMixtureModel, BayesianMixtureModel, DirichletProcessMixtureModel. It provides us a big picture of the classes I am going to implement. I am looking forward the feedback.

June 15, 2015 01:00 AM

June 13, 2015

Jake Vanderplas

Fast Lomb-Scargle Periodograms in Python

Image source: astroML. Source code here

The Lomb-Scargle periodogram (named for Lomb (1976) and Scargle (1982)) is a classic method for finding periodicity in irregularly-sampled data. It is in many ways analogous to the more familiar Fourier Power Spectral Density (PSD) often used for detecting periodicity in regularly-sampled data.

Despite the importance of this method, until recently there have not been any (in my opinion) solid implementations of the algorithm available for easy use in Python. That has changed with the introduction of the gatspy package, which I recently released. In this post, I will compare several available Python implementations of the Lomb-Scargle periodogram, and discuss some of the considerations required when using it to analyze data.

To cut to the chase, I'd recommend using the gatspy package for Lomb-Scargle periodograms in Python, and particularly its gatspy.periodic.LombScargleFast algorithm which implements an efficient pure-Python version of Press & Rybicki's \(O[N\log N]\) periodogram. Below, I'll dive into the reasons for this recommendation.

Example: Lomb-Scargle on Variable Stars

As an motivation, let's briefly consider some data from my own field: observations of an RR Lyrae-type variable star. RR Lyrae are small stars – about 50% the mass of our sun – which pulsate with a regular period on order half a day. Their relatively consistent peak intrinsic brightness allows for an accurate estimation of their distance from the sun, and thus they are important for studies such as understanding the substructure of the Milky Way galaxy. Because of this and other similar applications, detecting the telltale periodic variation of RR Lyrae stars within noisy data is an important statistical task for astronomers.

Here we will quickly demonstrate what this looks like in practice, using tools from the astroML package to download some data, and tools from the gatspy package to detect the periodicity.

We'll start with some typical Python import statements:

In [1]:
# Do preliminary imports and notebook setup
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# use seaborn for plot styles
import seaborn; seaborn.set()

Now we'll download some data from the LINEAR dataset, using tools in astroML. We'll plot the data to see what we're working with:

In [2]:
from astroML.datasets import fetch_LINEAR_sample
LINEAR_data = fetch_LINEAR_sample()
star_id = 10040133
t, mag, dmag = LINEAR_data.get_light_curve(star_id).T

fig, ax = plt.subplots()
ax.errorbar(t, mag, dmag, fmt='.k', ecolor='gray')
ax.set(xlabel='Time (days)', ylabel='magitude',
       title='LINEAR object {0}'.format(star_id))

This data has around 250 observations spread across about 2000 days, and we're hoping to detect a period of order 0.5 days. If the series were regularly-sampled, we'd be far above the Nyquist limit and all hope would be lost. Fortunately for astronomers, the assumptions behind the Nyquist sampling limit do not hold for irregular sampling rates, and we can proceed with no problem.

Let's start by computing and plotting the Lomb-Scargle Periodogram for this data, using tools from gatspy:

In [3]:
from gatspy.periodic import LombScargleFast
model = LombScargleFast().fit(t, mag, dmag)
periods, power = model.periodogram_auto(nyquist_factor=100)

fig, ax = plt.subplots()
ax.plot(periods, power)
ax.set(xlim=(0.2, 1.4), ylim=(0, 0.8),
       xlabel='period (days)',
       ylabel='Lomb-Scargle Power');

The periodogram gives a measure of periodic content as a function of period; we see here a strong peak at around 0.61 days. Other lower peaks are due to some combination of higher-order harmonics in the data and effects of the irregular survey window. While we could find this maximum manually from the above grid, gatspy provides a better way: a built-in two-stage grid-search that accurately determines the best period in a specified range:

In [4]:
# set range and find period
model.optimizer.period_range=(0.2, 1.4)
period = model.best_period
print("period = {0}".format(period))
Finding optimal frequency:
 - Estimated peak width = 0.0032
 - Using 5 steps per peak; omega_step = 0.00064
 - User-specified period range:  0.2 to 1.4
 - Computing periods at 42104 steps
Zooming-in on 5 candidate peaks:
 - Computing periods at 1000 steps
period = 0.6105387801103276

We see that the optimizer determined that it needed a grid of over 40,000 points to adequately cover the frequency grid (more on this below), and in the end arrived at a best period of 0.6105 days. Given this detected period, we can fold the input data and over-plot a best-fit empirical RR Lyrae template to see the fit:

In [5]:
# Compute phases of the obsevations
phase = (t / period) % 1

# Compute best-fit RR Lyrae template
from gatspy.periodic import RRLyraeTemplateModeler
model = RRLyraeTemplateModeler('r').fit(t, mag, dmag)
phase_fit = np.linspace(0, 1, 1000)
mag_fit = model.predict(period * phase_fit, period=period)

# Plot the phased data & model
fig, ax = plt.subplots()
ax.errorbar(phase, mag, dmag, fmt='.k', ecolor='gray', alpha=0.5)
ax.plot(phase_fit, mag_fit, '-k')
ax.set(xlabel='Phase', ylabel='magitude')

This very close template fit gives a strong indication that the star in question is an RR Lyrae.

Computational Considerations for Lomb-Scargle

The Lomb-Scargle periodogram involves the computation of a power \(P(\omega)\) at a set of frequencies \(\omega_i\). For data \(\{y_k\}\) pre-centered such that \(\sum_k y_k = 0\), the expression for the power is:

\[ P(\omega) \propto \frac{\left[\sum_k y_k \cos\omega(t_k - \tau)\right]^2} {\sum_k \cos^2\omega(t_k - \tau)} + \frac{\left[\sum_k y_k \sin\omega(t_k - \tau)\right]^2} {\sum_k \sin^2\omega(t_k - \tau)} \]

where \(\tau\) is an easily computed time-offset which orthogonalizes the model and makes \(P(\omega)\) independent of a translation in \(t\).

Rather than get lost in the math, I want to emphasize the key feature of this expression: for any frequency \(\omega\), the power is an \(O[N]\) computation involving simple trigonometric sums over the data, where \(N\) is the number of observed data points. The main computational question then becomes: how many frequencies must you compute? In my experience, the most common mistake people make when doing this sort of periodic analysis is not thinking hard enough about the frequency grid. It turns out that the grid-spacing question is very important. If you choose too fine a grid, you do much more computation than is required. Worse, if you choose too coarse a grid, the periodogram peak may fall between grid points and you'll miss it entirely!

Let's think about the required frequency range and frequency spacing for Lomb-Scargle.

Frequency spacing

First we'll choose the spacing of the frequency grid. If you're asking about a candidate frequency \(f\), then data with range \(T = t_{max} - t_{min}\) contains \(T \cdot f\) complete cycles. If our error in frequency is \(\delta f\), then \(T\cdot\delta f\) is the error in number of cycles between the endpoints of the data. It's clear that this error must not be a significant fraction of a cycle, or the fit could be drastically affected. This leads to an approximate grid-spacing criterion:

\[ T\cdot\delta f \ll 1 \]

Commonly, we'll choose some oversampling factor (say, 5) and use \(\delta f = (5T)^{-1}\) as our frequency grid spacing.

Frequency limits

Next, we need to choose the upper and lower limits of the frequency grid. On the low end, \(f=0\) is suitable, but causes some numerical problems – we'll go one step away and use \(\delta f\) as our minimum frequency. But on the high end, we need to make a choice: what's the highest frequency we'd trust our data to be sensitive to? At this point, many people are tempted to mis-apply the Nyquist-Shannon sampling theorem, and choose some version of the Nyquist limit for the data (based on, say, the minimum or mean spacing between observations). But this is entirely wrong! The Nyquist frequency is derived from special properties of regularly-sampled data, and does not apply – even approximately – to irregularly-sampled time-series. In fact, as we saw above, irregularly-sampled data can be sensitive to much, much higher frequencies than even the minimum spacing between observations. With this in mind, the upper limit for frequencies should be determined based on what kind of signal you are looking for.

Still, a common (if dubious) rule-of-thumb is that the high frequency is some multiple of what Press & Rybicki call the "average" Nyquist frequency,

\[ \hat{f}_{Ny} = \frac{N}{2T} \]

This means that the "typical" number of frequencies you'll need is

\[ N_{freq} \sim O\left[\frac{\hat{f}_{Ny}}{\delta f}\right] \sim O\left[\frac{N/(2T)}{1/T}\right] \sim O[N] \]

That is, the number of frequencies to search will scale with the number of data points!

Computational Complexity

From the above considerations, we see that the determination of the optimal Lomb-Scargle period within \(N\) points requires computing an \(O[N]\) expression for power across \(O[N]\) grid points; that is, Lomb-Scargle is naively an \(O[N^2]\) algorithm.

This computational complexity can be improved in one of several ways. Most notably, in a 1989 paper, Press and Rybicki proposed a clever method whereby a Fast Fourier Transform is used on a grid extirpolated from the original data, such that this naively \(O[N^2]\) problem can be solved in \(O[N\log N]\) time. The broad idea is that when you compute sums of sines and cosines for one frequency, this gives you some amount of information about those sums computed at another frequency, and by carefully using all information across a frequency grid, you can significantly reduce the number of required operations.

Thus the fundamental divide between Lomb-Scargle implementations is whether they use the naive \(O[N^2]\) algorithm or the \(O[N\log N]\) algorithm of Press & Rybicki and other similar approaches.

Lomb-Scargle Algorithms in Python

Now we get to the meat of this post: Lomb-Scargle implementations written in Python. If you search this on Google, you'll currently find links to several available implementations. Here I'm going to delve into and compare the following four implementations:

  • scipy.signal.lombscargle, an \(O[N^2]\) implementation from SciPy.
  • astroML.time_series.lomb_scargle, an \(O[N^2]\) implementation from astroML.
  • gatspy.periodic.LombScargle, an \(O[N^2]\) implementation from gatspy.
  • gatspy.periodic.LombScargleFast, an \(O[N\log N]\) implementation, also from gatspy.

Let's see some examples of the above tools:


The SciPy Lomb-Scargle periodogram is a C implementation of the naive \(O[N^2]\) algorithm. The algorithm cannot account for noise in the data, and has some other quirks as well:

  • it requires you to center your data (by subtracting the mean) before computing the periodogram. If you do not, the results will be garbage.
  • it computes the unnormalized periodogram, which can be normalized manually as we'll see below.
  • it takes angular frequencies as the argument.

Let's use scipy's algorithm to plot the periodogram of the data shown above. Note that the results will not be identical, because this algorithm ignores the noise in the data and doesn't fit for the data mean.

Against the above recommendations, we'll choose a simple regular grid in period for the plot:

In [6]:
from scipy.signal import lombscargle

# Choose a period grid
periods = np.linspace(0.2, 1.4, 4000)
ang_freqs = 2 * np.pi / periods

# compute the (unnormalized) periodogram
# note pre-centering of y values!
power = lombscargle(t, mag - mag.mean(), ang_freqs)

# normalize the power
N = len(t)
power *= 2 / (N * mag.std() ** 2)

# plot the results
fig, ax = plt.subplots()
ax.plot(periods, power)
ax.set(ylim=(0, 0.8), xlabel='period (days)',
       ylabel='Lomb-Scargle Power');

Comparing to the first periodogram plot, we see that becuase our period grid here is too coarse at low frequencies, some of the peak structure is missed by this visualization. Consider this a warning against arbitrarily choosing a period gridding!


AstroML has two \(O[N^2]\) implementations of Lomb-Scargle: one in astroML and one in astroML_addons, which is a collection of C extensions which replace slower functionality in the pure-python astroML package. In order to use the faster version, make sure you install both packages; e.g.

$pip install astroML$ pip install astroML_addons

Some important features of astroML's Lomb Scargle periodogram:

  • unlike scipy, it uses an extended periodogram model which can correctly account for uncorrelated Gaussian measurement error.
  • like scipy, it takes angular frequencies as its argument.
  • unlike scipy, it implements a floating mean periodogram, meaning that the data centering required for scipy is not required here, but it goes beyond simple centering: the mean of the data is fit as part of the model, which has advantages in many real-world scenarios. To directly compare to scipy's standard Lomb Scargle pass generalized=False.

Let's repeat the above plot with this periodogram:

In [7]:
from astroML.time_series import lomb_scargle
power = lomb_scargle(t, mag, dmag, ang_freqs)

# plot the results
fig, ax = plt.subplots()
ax.plot(periods, power)
ax.set(ylim=(0, 0.8), xlabel='period (days)',
       ylabel='Lomb-Scargle Power');


Gatspy's basic Lomb-Scargle algorithm is an \(O[N^2]\) implementation, but is implemented differently than either of the above versions. It uses a direct linear algebra approach which carries some additional computational and memory overhead. The reason for this approach is that it naturally accommodates several extensions to the periodogram, including floating mean, multiple terms, regularization, and multi-band models (more details in VanderPlas & Ivezic (2015), the paper that inspired gatspy).

Gatspy is a pure python package, and thus installation is easy and requires no compilation of C or Fortran code:

$ pip install gatspy

Some important features of this implementation:

  • like astroML, it uses an extended periodogram model which correctly accounts for uncorrelated Gaussian measurement error.
  • unlike astroML, it takes periods as its argument.
  • like astroML, it uses a floating mean model by default. To compare directly to scipy's non-floating-mean model, set fit_offset=False.
  • it has an API inspired by scikit-learn, where the model itself is a class instance, the model is applied to data with a fit() method, and the periodogram is computed via a score() method.

Let's repeat the above periodogram using this tool:

In [8]:
from gatspy.periodic import LombScargle

model = LombScargle(fit_offset=True).fit(t, mag, dmag)
power = model.score(periods)

# plot the results
fig, ax = plt.subplots()
ax.plot(periods, power)
ax.set(ylim=(0, 0.8), xlabel='period (days)',
       ylabel='Lomb-Scargle Power');


Gatspy's fast Lomb-Scargle is an \(O[N\log N]\) algorithm built on a pure Python/numpy implementation of the Press & Rybicki FFT/extirpolation method. Note that a requirement of this fast algorithm is that it be computed on a regular grid of frequencies (not periods), and so to attain this performance it provides the score_frequency_grid() method which takes 3 arguments: the minimum frequency f0, the frequency spacing df, and the number of grid points N.

Some features of the model

  • like astroML, it uses an extended periodogram model which correctly accounts for uncorrelated Gaussian measurement error.
  • it takes a regular frequency grid as its argument for the fast computation; note that the score() function itself falls back on the slower LombScargle approach above.
  • like astroML, it uses a floating mean model by default. To compare directly to scipy, set fit_offset=False.
  • it has an identical API to the LombScargle object above.

Let's take a look at computing the periodogram:

In [9]:
from gatspy.periodic import LombScargleFast

fmin = 1. / periods.max()
fmax = 1. / periods.min()
N = 10000
df = (fmax - fmin) / N

model = LombScargleFast().fit(t, mag, dmag)
power = model.score_frequency_grid(fmin, df, N)
freqs = fmin + df * np.arange(N)

# plot the results
fig, ax = plt.subplots()
ax.plot(1. / freqs, power)
ax.set(ylim=(0, 0.8), xlabel='period (days)',
       ylabel='Lomb-Scargle Power');

You'll notice here that this approach shows a lot more high-frequency peaks than any of the above versions. This is not because it is computing a different model; it is because we are using a finer frequency grid which does not miss these peaks. The above versions, with a regular grid of 4000 periods miss these important features, and give the user absolutely no warning that these features are missed! Keep this in mind as you choose grid parameters while following the above discussion.

If you want to make sure you're using a sufficient grid, you can use the periodogram_auto() method of LombScargleFast, which computes a sufficient frequency grid for you using the rules-of-thumb discussed in the previous section:

In [10]:
model = LombScargleFast().fit(t, mag, dmag)

period, power = model.periodogram_auto(nyquist_factor=200)

print("period range: ({0}, {1})".format(period.min(), period.max()))
print("number of periods: {0}".format(len(period)))
period range: (0.0764511670428014, 9823.97496499998)
number of periods: 128500

The model decided that we needed over 100,000 periods, between about 0.1 days (which was tuned by the nyquist_factor argument) and about 10,000 days (which is derived from the time-span of the data). Plotting the results as above, we see a similar periodogram:

In [11]:
# plot the results
fig, ax = plt.subplots()
ax.plot(period, power)
ax.set(xlim=(0.2, 1.4), ylim=(0, 1.0),
       xlabel='period (days)',
       ylabel='Lomb-Scargle Power');

The LombScargleFast algorithm computes these \(10^5\) periodogram steps very quickly; I wouldn't suggest any of the other methods with a grid of this size!

Benchmarking Lomb-Scargle Implementations

As a final piece of the picture, let's compare the execution speed of the four approaches. We can do this with IPython's %timeit magic function using the following script. Note that this script will take several minutes to run, as it automatically does multiple passes of each benchmark to minimize system timing variation. For efficiency, we cut-off the slower algorithms at high \(N\):

In [12]:
from scipy.signal import lombscargle as ls_scipy
from astroML.time_series import lomb_scargle as ls_astroML

def create_data(N, rseed=0, period=0.61):
    """Create noisy data"""
    rng = np.random.RandomState(rseed)
    t = 52000 + 2000 * rng.rand(N)
    dmag = 0.1 * (1 + rng.rand(N))
    mag = 15 + 0.6 * np.sin(2 * np.pi * t / period) + dmag * rng.randn(N)
    return t, mag, dmag

def compute_frequency_grid(t, oversampling=2):
    """Compute the optimal frequency grid (**not** angular frequencies)"""
    T = t.max() - t.min()
    N = len(t)
    df = 1. / (oversampling * T)
    fmax = N / (2 * T)
    return np.arange(df, fmax, df)

Nrange = 2 ** np.arange(2, 17)
t_scipy = []
t_astroML = []
t_gatspy1 = []
t_gatspy2 = []

for N in Nrange:
    t, mag, dmag = create_data(N)
    freqs = compute_frequency_grid(t)
    periods = 1 / freqs
    ang_freqs = 2 * np.pi * freqs
    f0, df, Nf = freqs[0], freqs[1] - freqs[0], len(freqs)
    # Don't compute the slow algorithms at very high N
    if N < 2 ** 15:
        t1 = %timeit -oq ls_scipy(t, mag - mag.mean(), ang_freqs)
        t2 = %timeit -oq ls_astroML(t, mag, dmag, ang_freqs)
        t3 = %timeit -oq LombScargle().fit(t, mag, dmag).score_frequency_grid(f0, df, Nf)
    t4 = %timeit -oq LombScargleFast().fit(t, mag, dmag).score_frequency_grid(f0, df, Nf)

When these timings are finished, we can plot the results to get an idea of how the algorithms compare:

In [13]:
fig = plt.figure()
ax = fig.add_subplot(111, xscale='log', yscale='log')
ax.plot(Nrange, t_scipy, label='scipy: lombscargle')
ax.plot(Nrange, t_astroML, label='astroML: lomb_scargle')
ax.plot(Nrange, t_gatspy1, label='gatspy: LombScargle')
ax.plot(Nrange, t_gatspy2, label='gatspy: LombScargleFast')
ax.set(xlabel='N', ylabel='time (seconds)',
       title='Comparison of Lomb-Scargle Implementations')
ax.legend(loc='upper left');

Each model has a characteristic performance curve:

  • The scipy and astroML algorithms show similar behavior: fast \(O[1]\) scaling at the small-\(N\) limit, and clear \(O[N^2]\) scaling at the large-\(N\) limit. SciPy is slightly faster, primarily due to the fact that it computes the simpler noiseless non-floating-mean model.
  • Gatspy's LombScargle also becomes \(O[N^2]\) at large \(N\), but is dominated at small \(N\) by an \(O[N]\) contribution which comes from allocating & building the matrices associated with its linear algebraic approach. As \(N\) grows larger than \(\sim 10^4\), however, gatspy's model begins to beat the performance of the other two \(O[N^2]\) algorithms.
  • Gatspy's LombScargleFast has an upfront \(O[1]\) cost that makes it slower than other approaches at small \(N\), but as \(N\) grows its \(O[N\log N]\) scaling means it dominates the performance of the other approaches by orders of magnitude.

If you'd like to push the speed of the computation even further, there may be some options available. For example, the pynfftls package implements an \(O[N\log N]\) Lomb-Scargle based on the NFFT algorithm, which is similar to the NUFFT that I discussed in a previous post. The pynfftls installation depends on prior installations of the NFFT and FFTW libraries. These libraries are best-in-class implementations of their respective algorithms, and from my past experience with them, I'd expect pynfftls to be around a factor of 10 faster than LombScargleFast with the same \(O[N\log N]\) scaling.

I should mention that I briefly tried installing pynfftls for this post, but ran into difficulties with linking the source to the appropriate C headers and library/shared object files. No doubt with a couple hours of tinkering it could be done, but in a conda world I've found my threshold of tolerance for such installation headaches has gone way down. Package developers take note: in most situations, ease of installation is easily worth a factor of a few in runtime performance. If any readers want to tackle the comparison between LombScargleFast and pynfftls, I'd be intrested to learn whether my factor-of-ten intuition is correct!


If there's anything I want you to take from the above discussion, it's these three points:

  • Naive application of Nyquist-style limits to irregularly-sampled data is 100% wrong. Don't be the next person to make this mistake in the published literature! I've been meaning to write a full rant/post on this subject for a while. Perhaps I will someday.
  • Selection of period/frequency grids for Lomb-Scargle analysis should not be taken lightly. It's very easy to inadvertently use too coarse of a grid, and entirely miss important periodogram peaks!
  • Use gatspy.periodic.LombScargleFast if you want any easy-to-install means of computing a fast, \(O[N\log N]\) Lomb-Scargle periodogram in Python.

This post was written entirely in the IPython notebook. You can download this notebook, or see a static view here.

by Jake Vanderplas at June 13, 2015 09:00 PM

June 11, 2015

Titus Brown

Notes from &quot;How to grow a sustainable software development process (for scientific software)&quot;

I gave a presentation at the BEACON Center's coding group this past Monday; here are my notes and followup links. Thanks to Luiz Irber for scribing!

My short slideshow: here

The khmer project is on github, and we have a tutorial for people who want to try out our development process. khmer has ~5-10 active developers and has had ~60 contributors overall.

Growing a development process

How can you grow a process that meets your needs?

  • use version control and develop on branches;
  • create a checklist to use when merging branches into master;
  • follow the checklist!

(For more checklist motivation, see The Checklist Manifesto by Atul Gawande.)

We do the above as part of our GitHub flow-based development approach.

tl;dr? Grow your checklist slowly, but make everyone adhere to it.

What goes on your checklist?

Ideas for things that could go on your checklist:

  • I ran the tests and they passed!
  • Someone else ran the tests and they passed!
  • A computer ran the tests automatically and they passed! (Continuous Integration)
  • The code formatting guidelines are met. (> 2 people with different coding styles? CHAOS.)
  • The code coverage guidelines are met.
  • A spellchecker was run.
  • Changes were described in a ChangeLog.
  • Commit messages make sense.
  • Code coverage didn't decrease.
  • Checks on specific types of features ("Script parameters should be documented").

I also strongly suggest a two-person merge rule (the primary developer of a feature can not merge that feature); this helps ensure the checklist is followed :)

You can see our checklist for khmer here.


It's important to make the checklist as lightweight as possible, and making sure it addresses useful "pain points" in your developer process; there's a line where people start ignoring the checklist because there's less direct value.

There's no reason to start heavy; you can grow your checklist slowly, as your project accrues experience and developers turn over.


Development process goals

Add features quickly (by using branches) while keeping technical debt manageable!

The concept of technical debt is key - if you let cruft accrue in your codebase, eventually your entire codebase will become unmaintainable and unmanageable.

Other useful links:


by C. Titus Brown at June 11, 2015 10:00 PM

Continuum Analytics

Xray + Dask: Out-of-Core, Labeled Arrays in Python

Xray provides labeled, multi-dimensional arrays. Dask provides a system for parallel computing. Together, they allow for easy analysis of scientific datasets that don’t fit into memory.

by Stephan Hoyer at June 11, 2015 12:00 AM

June 09, 2015

Titus Brown

The challenge for post-publication peer review

On Tuesday, I wrote a draft blog post in response to Michael Eisen's blog post on how Lior Pachter's blog post was a a model for post-publication peer review (PPPR). (My draft post suggested that scientific bloggers aim for inclusivity by adopting a code of conduct and posting explicit site commenting policies).

I asked several people for comments on my post, and a (female) friend who is a non-scientist responded:

My big thing is that if I'm doing something in my spare time, I don't want to be dealing with the same bullshit that I do at work.

While my friend is not a scientist, and her comment spoke specifically to the challenges of women in tech, this comment nails the broader challenge for post-publication peer review: how do we make post-publication peer review a welcoming experience? If it's an optional activity that turns people off, most people won't do it. And I'd love to see more assistant professors, grad students, and women engaging in PPPR. However, if PPPR mirrors the not infrequent assholery of anonymous peer reviews and is only done amongst high-profile tenured faculty (or anonymously), we won't engage with this broader community.

I don't have any one perfect answer, but there are a few things that the tech community has done that nudge things in the right direction. A per-site code of conduct and commenting policies are two obvious idea, and (on Twitter) Joe Pickrell suggested asking people to work on the Principle of Charity which I really like, too.

Michael Eisen had a great comment on a different issue that I think applies here --

it's not really about what "science needs" - it's about not being an asshole

Yep. Any inclusive community building effort should start somewhere near the "don't be an asshole" rule - while a low bar, it should at least be achievable :)

We've had a fairly large amount of high profile aggressive alpha-male behavior in bioinformatics blogging and post-pub peer review, and it's worth thinking hard about whether this is the direction we want to pursue. I hope it's not. Fundamentally, it's our community and we should grow it responsibly. Some ground rules about conduct and commenting would be a good start. There are many interesting questions, e.g. about how to retain intellectual rigor while being welcoming to junior participants who might fear retaliation for critical commentary; I hope we can figure that out. I'm already encouraged by the quiet subculture of preprints and preprint commentary that's sprung up completely apart from Lior's blog.

Other than a general attempt to be more welcoming, I'm not sure what else to do. I'd love to hear people's thoughts on how to grow the practice of post-publication peer review, and I'm happy to post them anonymously for people who don't want to comment themselves.


p.s. There is some irony in referring to Mike and Lior's blogs, which I personally find very unwelcoming in several ways; I encourage them to think about a code of conduct of some sort, along with a commenting policy, so we can know what kind of commentary they intentionally hope to foster on their blogs.

p.p.s. I haven't put a Code of Conduct or commenting policy on my blog yet, I know. I'll get to it soon :)

by C. Titus Brown at June 09, 2015 10:00 PM

Artem Sobolev

Week 1 + 2: Takeoff

The first two weeks have ended, and it's time for a weekly (ahem) report.

Basic implementation outlined in the previous post was rewritten almost from scratch. Now there are 2 implementations of cost function calculation: a fully vectorized (that doesn't scale, but should work fast) and a semi-vectorized (that loops through training samples, but all other operations are vectorized). Meanwhile I work on a large scale version. More on that below.

Also, I wrote a simple benchmark that shows improved accuracy of 1NN with the learned distance, and compares 2 implementations.

There are several issues to solve.

The first and the major one is scalability. It takes $O(N^2 M^2)$ time to compute  NCA's gradient, which is waaay too much even for medium-size datasets. Some ideas I have in mind:

  1. Stochastic Gradient Descent. NCA's loss is a sum of each sample's contribution, so we do stochastic optimization on it reducing computational complexity down to $O(w N M^2)$ where $w$ is a number of iterations.
  2. There's a paper Fast NCA. I briefly skimmed through the paper, but my concern is that they look for $K$ nearest neighbors, which takes them $O(K N^2)$ time — don't look like quite an improvement (Though it's certainly is if you want to project some high-dimensional data to a lower dimensional space).
Another thing to do which is not an issue, but still needs to be done is choosing an optimization algorithm. For now there're 3 methods: gradient descent, gradient descent with AdaGrad and scipy's scipy.optimize.minimize. I don't think it's a good idea to overwhelm a user by the variety of settings with no particular difference in the outcome, so we should get rid of features that are known to be useless.

Also, unit tests and documentation are planned, as well.

by noreply@blogger.com (B@rmaley.exe) at June 09, 2015 01:44 PM

Matthieu Brucher

Book review: C++ Multithreading Cookbook

C++ Multithreading Cookbook in 2014 (publication year), that seems quite interesting, with all the new stuff from the current C++ standard. Is it what the book delivers?

Content and opinions

Unfortunately, when you read the table of contents, there are already orange flags. Chapter 6 is about threads in the .NET framework. What?? This is a book on C++ multithreading, not Windows specific things, right?

OK, let’s start with the beginning. Chapter 1 is a poor presentation of C++. The author says that he will be using Hungarian notation. Actually even Microsoft says do not use it. It predates modern C++, so stop it. Period. I won’t talk about the issues with misunderstanding of a .lib in Windows or what precompiled headers are.

The second chapter is actually interesting. Processes and threads are quite complex beasts, and it is not always properly explained.

The bad gets worse with the third chapter, which is supposed to be about thread. Don’t forget this is a C++ book. C++11 brought native thread management, locks, mutex… But nothing of the sort is here. Only Windows specific C threads (not even C++). Not talking about specific unexplained pragmas. Chapter 4 addresses processes, which is actually not in the C++11 standard. This is where the book could have shined, but no, it has to talk about Message Passing Interface, but this has nothing to do with Message PAssing Interface, which is an official C/Fortran/C++ standard.

By then, I’m fed up with the book, although chapter 7 does pose some good code practices and warnings about concurrency. But then, it goes bad again, even talking about OpenMP (although there is a C++ “version”, no real HPC code actually uses this unusable interface in any properly designed code).


In the end, the book title has nothing to do with the content. It may have been interesting 10 years ago with a title like “Windows C threads and processes”, but it is definitely not worth it with C++11.

by Matt at June 09, 2015 07:45 AM

June 08, 2015

Pierre de Buyl

Understanding the Hilbert curve

The Hilbert curve fills space with good properties for sorting N-dimensional data in a linear fashion. I present an IPython notebook with the complete code to follow the algorithm of C. Hamilton CS-2006-07.

The notebook can be downloaded on my GitHub account: see the notebook | download the notebook (a right-click may help).

Update on 12 june 2015: addition of compact Hilbert indices.

In [17]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import math
from mpl_toolkits.mplot3d import Axes3D
from __future__ import print_function, division

Understanding the Hilbert curve

Author: Pierre de Buyl
This notebook is originally developed in a github repository and presented on my blog. The code is 3-clause BSD and content CC-BY.

In this notebook, I review the algorithm by C. H. Hamilton to compute Hilbert curves and indices. The Hilbert curve is a space-filling curve that has good data locality to map N-dimensional data onto a linear space. It is based on a basis curve that visits the $2^N$ vertices of the N-dimensional binary space and that is replicated recursively $M$ times to map $2^{N M}$ points. The curve is the series of points in N-dimensional space and the index of a point is its place on this curve, computing it is subtle. The algorithm is presented in Compact Hilbert Indices, Dalhousie University Technical Report CS-2006-07, July 2006 (here after [TR]).

My motivations (and yours, maybe) are:

  • Understand the algorithm to implement it in a simulation code. The point is to sort particles spatially for better memory access.
  • Provide a full Python-based implementation that can be used for reference and illustration purposes.
  • Have a clear explanation for later reference.

The notebook is self-contained: it contains all the code for all the intermediate operations defined in [TR]. Hold on for the graphic illustrations, they will come as the algorithm is transformed into code.

Bitwise operations

I review here the bitwise operations in Python, as they will be used throughout. Only integer numbers are used. We define a word length that corresponds to the dimension of the Hilbert curve.

The examples use N=3 so that each single integer in $[0:2^N-1]$ can be mapped to a point in 3 dimension that is one vertex of the unit cube. Bitwise operations act as transformations on the integers and equivalently as geometrical operations in 3 dimensions. The XOR operator represents a reflection with respect to one or several of the symmetry planes of the cube. The rotation operator represents a rotations of a third around the diagonal, i.e. the line joining (0,0,0) and (1,1,1).

The function bin_str helps to visualize bitwise operations by converting an integer to its three bits. The right-most bit represents the x-axis, the middle bit the y-axis and the left-most bit the z-axis. In Python, the AND &, OR | and XOR ^ operators are defined and also the bit shifts (>> to the right, or less significant bits and << to the left or more significant bits). The NOT operator ~ requires further masking: It changes the sign bit of the integer. We wish here to operate on a N-bit space and thus cancel any bit beyond the N less significant bits of the integer. The fact that we remain with N non-zero bits and that NOT is operating properly is checked below. We also need rotation operators (like the shift operators but that transfer the over- and under-flowing bits around the 3-bit space) and a function to extract the i-th bit from a number, bit_component.

In [2]:
N = 3

def bin_str(i):
    """Return a string representation of i with N bits."""
    out = ''
    for j in range(N-1,-1,-1):
        if (i>>j) & 1 == 1:
            out += '1'
            out += '0'
    return out

def rotate_right(x, d):
    """Rotate x by d bits to the right."""
    d = d % N
    out = x >> d
    for i in range(d):
        bit = (x & 2**i)>>i
        out |= bit << (N+i-d)
    return out

def rotate_left(x, d):
    """Rotate x by d bits to the left."""
    d = d % N
    out = x << d
    excess = out 
    out = out & (2**N-1)
    for i in range(d):
        bit = (x & 2**(N-1-d+1+i))>> (N-1-d+1+i)
        out |= bit << i
    return out

def bit_component(x, i):
    """Return i-th bit of x"""
    return (x & 2**i) >> i
In [3]:
print('AND  ', bin_str(1), '&', bin_str(5), '=', bin_str(1 & 5))
print('OR   ', bin_str(1), '|', bin_str(5), '=', bin_str(1 | 5))
print('XOR  ', bin_str(1), '^', bin_str(5), '=', bin_str(1 ^ 5))
print('shift', bin_str(2), '>> 1  =', bin_str(2 >> 1))
print('shift', bin_str(2), '<< 1  =', bin_str(2 << 1))
print('rot  ', bin_str(1), '↻', bin_str(rotate_right(1, 1)), '↻',
      bin_str(rotate_right(1,2)), '↻', bin_str(rotate_right(1,3)))
print('NOT  ', bin_str(1), '      =', bin_str(~1 & 2**N-1))

# verify that '~i & 2**N-1' performs the NOT operation in N-bit space
for i in range(2**N):
    not_i = ~i & 2**N-1
    assert not_i >=0
    assert not_i < 2**N
    assert i & not_i == 0
    assert i | not_i == 2**N-1
AND   001 & 101 = 001
OR    001 | 101 = 101
XOR   001 ^ 101 = 100
shift 010 >> 1  = 001
shift 010 << 1  = 100
rot   001 ↻ 100 ↻ 010 ↻ 001
NOT   001       = 110

The technical reports describes with good details all the intermediate steps of the algorithm using binary operations and sequences. The gray code index of a number i, for instance, is given by gc(i) = i XOR (i ≫ 1).

To understand the building of the Hilbert curve, other quantities are needed. The graphical representation use arrows in each sub-hypercube of the curve to represent the path. The definitions of the gray curve index and of the arrows is given and we use these defintions to illustrate the Hilbert curve afterwards.

The interpretation of gc, e and f below is a binary representation in base N of the vertices of a hypercube (a square for N=2, a cube for N=3, etc). The bits represent the z, y and x coordinates.

In [4]:
# Build all edges of a square and of a cube

edges = [((0,),(1,))]

def add_edges(edges):
    """Extend the list of edges from an hypercube to the list of
    edges of the hypercube of the next dimension."""

    old_edges = list(edges)
    old_points = set( [x[0] for x in old_edges] ) | set( [x[1] for x in old_edges] )
    edges = [ ( (0,)+x[0], (0,)+x[1] ) for x in old_edges ]
    edges.extend( [ ( (1,)+x[0], (1,)+x[1] ) for x in old_edges ] )
    for e in old_points:
        edges.append( ( (0,)+e, (1,)+e) )

    return edges

square_edges = add_edges(edges)
cube_edges = add_edges(square_edges)

cube_xyz = []
for single_edge in cube_edges:
    # The np.inf value breaks the line plot into disconnected parts
cube_xyz = np.array(cube_xyz)

def set_unit_cube(ax, side=1, set_view=(10,-67)):
    """Present the unit cube."""
    ax.set_xlabel('x'); ax.set_ylabel('y'); ax.set_zlabel('z');
    ax.set_xticks(range(side+1)); ax.set_yticks(range(side+1)); ax.set_zticks(range(side+1));
    ax.set_xlim(0, side); ax.set_ylim(0,side); ax.set_zlim(0,side);
    if set_view:

fig = plt.figure(figsize=(6, 5))
ax = fig.add_subplot(111,projection='3d')

ax.plot(*zip(*cube_xyz), color='b')

for i in range(2**N):
    x = bit_component(i, 0)
    y = bit_component(i, 1)
    z = bit_component(i, 2)
    ax.text(x, y+0.05, z+0.05, bin_str(i))

In [5]:
def gc(i):
    """Return the Gray code index of i."""
    return i ^ (i >> 1)

def e(i):
    """Return the entry point of hypercube i."""
    if i==0:
        return 0
        return gc(2*int(math.floor((i-1)//2)))

def f(i):
    """Return the exit point of hypercube i."""
    return e(2**N-1-i) ^ 2**(N-1)

def i_to_p(i):
    """Extract the 3d position from a 3-bit integer."""
    return [bit_component(i,j) for j in (0,1,2)]
In [6]:
print('Gray code', map(lambda i: (gc(i), bin_str(gc(i))), range(2**N)),
      'Entry point', map(lambda i: (e(i), bin_str(e(i))), range(2**N)),
      'Exit point', map(lambda i: (f(i), bin_str(f(i))), range(2**N)),
Gray code
[(0, &apos000&apos), (1, &apos001&apos), (3, &apos011&apos), (2, &apos010&apos), (6, &apos110&apos), (7, &apos111&apos), (5, &apos101&apos), (4, &apos100&apos)]
Entry point
[(0, &apos000&apos), (0, &apos000&apos), (0, &apos000&apos), (3, &apos011&apos), (3, &apos011&apos), (6, &apos110&apos), (6, &apos110&apos), (5, &apos101&apos)]
Exit point
[(1, &apos001&apos), (2, &apos010&apos), (2, &apos010&apos), (7, &apos111&apos), (7, &apos111&apos), (4, &apos100&apos), (4, &apos100&apos), (4, &apos100&apos)]

Visiting the Hilbert curve in the cube

The path of the Hilbert curve without recursion is encoded in the sequence gc(i), whose 3-bit values are interpreted a points in 3D (see above), and that visits the edges of the cube. These coordinates (scaled by 1/2) are stored in xyz.

The first figure displays the curve visiting the centers of the subcubes, one of the typical representations of the Hilbert curve. In [TR], the path of the Hilbert curve proceeds by entering subcube gc(i) at point e(i) and leaving it a point f(i). The entry and exit points e(i) and f(i) are stored in e_xyz and f_xyz and shown below. The last figure represents the full path as the sequences of arrows joining those points.

In [7]:
# Obtain the 3 d coordinates for gc(i), e(i) and f(i) to build the Hilbert curve
xyz = np.array(map(lambda i: i_to_p(gc(i)), range(2**N)))/2
e_xyz = np.array(map(lambda i: i_to_p(e(i)), range(2**N)))/2
f_xyz = np.array(map(lambda i: i_to_p(f(i)), range(2**N)))/2

fig = plt.figure(figsize=(14, 12))

ax1 = fig.add_subplot(221,projection='3d')

for i, (x,y,z) in enumerate((xyz+0.27)):
    ax1.text(x, y, z, str(i))
ax1.set_title('The Hilbert curve, subcube centered')

edges = []
for e_point, f_point, origin in zip(e_xyz, f_xyz, xyz):
    dir = (f_point-e_point)
    # For the 3d quiver plot, the origin of the arrow is shifted
    # which is why we use origin+e_point+dir instead of origin+e_point
    edges.append(np.concatenate((origin+e_point+dir, dir)))

ax2 = fig.add_subplot(222,projection='3d')
ax2.quiver(*zip(*edges), length=0.5)
ax2.set_title('Arrows to visit the Hilbert curve')

# A naive sequence of binary points in 3D
c = 0.25 + np.array(map(i_to_p, range(2**N)))/2

ax3 = fig.add_subplot(223,projection='3d')
ax3.plot(*zip(*c), color='b')
for i, (x,y,z) in enumerate(c):
    ax3.text(x, y, z, str(i))

ax3.set_title('A naive curve to visit the cube')

for ax in [ax1, ax2, ax3]:
    ax.plot(*zip(*cube_xyz), color='k', zorder=1, alpha=0.5)

Define the curve directions and transforms

To write the Hilbert curve algorithm, we still need to define the inter-subcube direction g and the intra-subcube direction d (the direction of the arrow that connect e and f).

In [8]:
def inverse_gc(g):
    """The inverse gray code."""
    i = g
    j = 1
    while j<N:
        i = i ^ (g >> j)
        j = j + 1
    return i

def g(i):
    """The direction between subcube i and the next one"""
    return int(np.log2(gc(i)^gc(i+1)))

def d(i):
    """The direction of the arrow whithin a subcube."""
    if i==0:
        return 0
    elif (i%2)==0:
        return g(i-1) % N
        return g(i) % N

def T(e, d, b):
    """Transform b."""
    out = b ^ e
    return rotate_right(out, d+1)

def T_inv(e, d, b):
    """Inverse transform b."""
    return T(rotate_right(e, d+1), N-d-2, b)

Validating the relations

The algorithm for the Hilbert curve is based on a series of relations between e, f, d and g. We can check those easily now. In order, we verify that:

  • Only the g(i)-th bit is changed between gc(i) and gc(i+1)
  • The tranform T sends an entry point e(i) to zero and the exit point f(i) to $2^{N-1}$: $T_{e(i),d(i)}(e(i))=0$ and $T_{e(i),d(i)}(f(i))=f(i)$
  • The entry point e(i) reflected along the direction d(i) is the exit point f(i): $e(i)$ ^ $d(i) = f(i)$
  • The inverse transform $T^{-1}$ composed with the direct transform $T$ is the identity.
In [9]:
# only g(i)-th bit changes from gc(i) to gc(i+1)
for i in range(2**N-1):
    assert gc(i) ^ (1 << g(i)) == gc(i+1)

# T sends e(i) to 0 and f(i) to 2**(N-1)
for i in range(2**N):
    assert T(e(i), d(i), e(i))==0
    assert T(e(i), d(i), f(i))==2**(N-1)

# e(i) reflected in direction d(i) is f(i)
for i in range(2**N):
    assert e(i) ^ (1<<d(i)) == f(i)

# T_inv composed with T (and vice versa) is the identity operator
for i in range(2**N):
    for b in range(2**N):
        assert T(e(i), d(i), T_inv(e(i), d(i), b)) == b
        assert T_inv(e(i), d(i), T(e(i), d(i), b)) == b

Transform the curve for the subcubes

According to [TR], one can use the transform T to map a subcube into the larger cube. Using the inverse transform, it is thus possible to generate the recursed Hilbert curve by mapping the main curve into each subcube. The coordinates of this refined Hilbert curve lie within the subcubes and must be translated by the corresponding origins to fill space correctly.

We construct below the refined Hilbert curve by this manual procedure and display the individual subcubes. The coordinates are then used to show the directed path followed by the curve.

In [10]:
fig = plt.figure(figsize=(12, 18))
ax = fig.add_subplot(211,projection='3d')

# store the Hilbert curve with no recursion
main_curve = [gc(i) for i in range(2**N)]
# list of the in-subcube curves
sub_curves = []
for i, h_idx in enumerate(main_curve):
    # append points from the subcube (obtained with the inverse transform T_inv
    points = np.array( [ i_to_p(T_inv(e(i), d(i), x)) for x in main_curve ] )
    # Translate the points by the origin of the subcube
    points += np.array(i_to_p(h_idx))*2
    sub_curves.append( points )

# plot all subcurves separately
[ax.plot(*zip(*c)) for c in sub_curves]
ax.set_title('Individual Hilbert "subcubes"')
set_unit_cube(ax, 3,  set_view=False)

ax = fig.add_subplot(212,projection='3d')

# prepare the data to draw arrows with the quiver command
cc = np.concatenate(sub_curves)
U = cc[1:,0]-cc[:-1,0]
X = cc[:-1,0]+U
V = cc[1:,1]-cc[:-1,1]
Y = cc[:-1,1]+V
W = cc[1:,2]-cc[:-1,2]
Z = cc[:-1,2]+W
ax.quiver(X,Y,Z,U,V,W,color=[matplotlib.cm.gnuplot(x) for x in np.linspace(0, 1,len(X)*3)], length=1)
ax.set_title('Connected Hilbert "subcubes"')
set_unit_cube(ax, 3, set_view=False)