Pandas and NumPy are Python libraries that let you manipulate data efficiently. I recently gave a talk that walked through common data analysis tasks (filtering rows, calculations, sorting by column, etc.), written first in plain Python and then refactored to use Pandas (whose DataFrames are similar to R's data frames).
Benchmarking the initial version:
In [1]: %timeit do_experiment(SIZE)
1 loops, best of 3: 1.86 s per loop
The refactored version is over 14x faster:
In [1]: %timeit do_experiment(SIZE)
10 loops, best of 3: 129 ms per loop
Besides the performance improvement, I find Pandas code much simpler to read. With less boilerplate, the important operations are clearer.
Create a table with two columns, "one" and "two", and a million random rows:
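A minimal sketch of this step, not the exact code from the talk. The value of SIZE, the fixed seed, and the choice of uniform random floats are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

SIZE = 1_000_000  # assumed value; the post only says "a million"

# A seeded generator keeps the example reproducible (the seed is an assumption).
rng = np.random.default_rng(seed=0)

# Each column is a NumPy array of uniform random floats in [0, 1).
df = pd.DataFrame({
    "one": rng.random(SIZE),
    "two": rng.random(SIZE),
})
```

Building the DataFrame from whole arrays avoids a Python-level loop over rows, which is where much of the speedup in this kind of refactoring tends to come from.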
Create a new column based on calculations from other columns:
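As a rough illustration, a derived column can be assigned from an expression over existing columns. The column name "three" and the formula below are hypothetical, since the original calculation is not shown here, and the tiny table stands in for the million-row one.

```python
import pandas as pd

# Small stand-in table so the example runs on its own.
df = pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [10.0, 20.0, 30.0]})

# Hypothetical calculation: the arithmetic is applied to whole columns at once,
# with no explicit loop over rows.
df["three"] = df["one"] + df["two"]
```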
Pull out a subset of columns:
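One way to do this is to index the DataFrame with a list of column names. The column "three" is a hypothetical name used only for this sketch.

```python
import pandas as pd

# Small stand-in table; "three" is an illustrative derived column.
df = pd.DataFrame({
    "one": [1.0, 2.0],
    "two": [10.0, 20.0],
    "three": [11.0, 22.0],
})

# Passing a list of names selects those columns, in that order,
# and returns a new DataFrame.
subset = df[["one", "three"]]
```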
Save to CSV:
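A short sketch using DataFrame.to_csv. The file name output.csv and the temporary directory are assumptions for the example; index=False leaves the row index out of the file, which may or may not match what the talk's version did.

```python
import os
import tempfile

import pandas as pd

# Small stand-in table so the example runs on its own.
df = pd.DataFrame({"one": [0.1, 0.2], "three": [0.4, 0.6]})

# Hypothetical output location: a file inside a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "output.csv")

# Write a header row plus one line per row, without the index column.
df.to_csv(out_path, index=False)
```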
Full diff and example project.