Thursday, 13 February 2014

Refactor to Pandas example

A lot of bioinformatics involves loading and manipulating data, which often comes in spreadsheets or similar data files (CSV).

Pandas and NumPy are Python libraries for manipulating data efficiently. I recently gave a talk that walked through common data analysis tasks (filtering rows, calculations, sorting by column, etc.), written first in straight Python and then refactored to use Pandas (whose DataFrames are similar to R's data frames).
Benchmarking the initial version:
In [1]: %timeit do_experiment(SIZE)
1 loops, best of 3: 1.86 s per loop
The refactored version is over 14x faster:
In [1]: %timeit do_experiment(SIZE)
10 loops, best of 3: 129 ms per loop
Besides the performance improvements, I find Pandas code much simpler to read. With less boilerplate, the important operations are clearer.
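To make the readability point concrete, here is a small sketch that filters rows and sorts by a column, first in plain Python and then in Pandas. The sample rows and the 0.5 threshold are made up for illustration and are not taken from the talk; `sort_values` is the modern spelling and assumes a reasonably recent Pandas.

```python
import pandas as pd

# Hypothetical sample data, not from the original project.
rows = [
    {"one": 0.9, "two": 0.1},
    {"one": 0.2, "two": 0.8},
    {"one": 0.7, "two": 0.3},
]

# Plain Python: filter with a generator expression, sort with a key function.
plain = sorted((r for r in rows if r["one"] > 0.5), key=lambda r: r["two"])

# Pandas: a boolean mask selects the rows, sort_values orders them.
df = pd.DataFrame(rows)
result = df[df["one"] > 0.5].sort_values("two")
```

Both versions keep the same two rows in the same order; the Pandas version names the operations directly rather than spelling out the looping.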
Create a table with columns "one" and "two", filled with a million random rows:
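A minimal sketch of this step, assuming uniform random floats are acceptable as the "random rows". The value of `SIZE` and the seed are assumptions here; the original project may define them differently.

```python
import numpy as np
import pandas as pd

SIZE = 1000000  # assumed: "a million" rows

# A seeded generator keeps the example reproducible (the seed is arbitrary).
rng = np.random.default_rng(0)

# Each column is a NumPy array of SIZE floats drawn uniformly from [0, 1).
df = pd.DataFrame({"one": rng.random(SIZE), "two": rng.random(SIZE)})
```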

Create a new column based on calculations from other columns:
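As a sketch of what such a calculation can look like: the column name "three" and the product used below are hypothetical stand-ins, since the exact calculation from the talk is not shown here.

```python
import pandas as pd

# Small fixed table so the result is easy to check by eye.
df = pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]})

# Arithmetic on whole columns is applied element-wise, with no explicit loop.
# "three" is a hypothetical column name chosen for this example.
df["three"] = df["one"] * df["two"]
```

The whole-column expression is where much of the speed-up tends to come from, because the loop runs inside NumPy rather than in Python.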

Pull out a subset of columns:
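A short sketch of column selection, using a made-up three-column table (the "three" column is an assumption for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "one": [1.0, 2.0],
    "two": [3.0, 4.0],
    "three": [5.0, 6.0],  # hypothetical extra column
})

# Indexing with a list of names returns a new DataFrame holding only
# those columns, in the order listed.
subset = df[["one", "three"]]
```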

Save to CSV:
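A minimal sketch of writing a DataFrame out as CSV. The file name `output.csv` and the temporary directory are assumptions made for the example, as is the choice to leave out the row index.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"one": [0.1, 0.2], "three": [0.5, 0.6]})

# Hypothetical output location; a temp directory keeps the example tidy.
out_path = os.path.join(tempfile.mkdtemp(), "output.csv")

# index=False omits the row index, so only the data columns are written.
df.to_csv(out_path, index=False)
```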

Full diff and example project.
