Day 6: The Fastest Way To Load A CSV
Achieve a 20x speedup loading CSVs using this one weird trick
Python Advent Calendar: Day 6!
Do you work with large CSV files? Do you handle CSV uploads into a web service? If you do, then knowing the fastest way to load a CSV into pandas is critical.
Some libraries are much faster at reading CSVs into a pandas DataFrame than the pd.read_csv function. pandas learned some new skills recently in pandas 2.0, but is that enough to catch up?
Contenders
What do you think is fastest?
pandas
polars
dask
datatable
Is it faster to read from a compressed CSV file? Which pandas engine is fastest? Let's find out! Here are the options we'll be running.
import dask.dataframe as dd
import datatable as dt
import pandas as pd
import polars as pl

# pandas
df = pd.read_csv("synthetic_christmas.csv")
df = pd.read_csv("synthetic_christmas.csv.tar.gz")
df = pd.read_csv("synthetic_christmas.csv", low_memory=False)
df = pd.read_csv("synthetic_christmas.csv.tar.gz", low_memory=False)
df = pd.read_csv("synthetic_christmas.csv", engine="c")
df = pd.read_csv("synthetic_christmas.csv", engine="python")
df = pd.read_csv("synthetic_christmas.csv", engine="pyarrow")

# polars
df = pl.read_csv("synthetic_christmas.csv")
df = pl.read_csv("synthetic_christmas.csv.tar.gz")

# dask
df = dd.read_csv("synthetic_christmas.csv").compute()

# datatable
df = dt.fread("synthetic_christmas.csv").to_pandas()
df = dt.fread("synthetic_christmas.csv.tar.gz").to_pandas()

print(f"{len(df):,} rows")
%%timeit -n2 -r10
df = pd.read_csv("synthetic_christmas.csv")
The -n2 argument executes the cell twice in a loop, and the -r10 argument repeats this double loop 10 times. This helps to get a more accurate estimate of the cell's runtime.
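If you are benchmarking outside a notebook, the same measurement can be reproduced with the standard-library timeit module. Here is a minimal sketch (not part of the original benchmark) that mirrors the -n2 -r10 settings for the default pandas reader:

import timeit

import pandas as pd

# Equivalent of %%timeit -n2 -r10: 10 repeats of a 2-iteration loop,
# reporting the best average time per read.
times = timeit.repeat(
    'pd.read_csv("synthetic_christmas.csv")',
    globals={"pd": pd},
    number=2,   # -n2: run the statement twice per repeat
    repeat=10,  # -r10: repeat that loop 10 times
)
print(f"best of 10 repeats: {min(times) / 2:.3f} s per read")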
Summarising this visually, you can see that polars is almost 20x faster than the default pandas pd.read_csv! Special mention goes to using the pyarrow engine with pandas, which achieves comparable results. Note that this comes with a warning that pyarrow is currently "an experimental engine, and some features are unsupported, or may not work correctly".
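As an aside, the "new skills" pandas picked up in 2.0 include Arrow-backed dtypes. If you are happy to opt in, you can pair the pyarrow parser with pyarrow-backed columns so the loaded data stays in Arrow memory rather than being converted to NumPy. A hedged sketch (requires pandas 2.0+ and pyarrow installed):

import pandas as pd

# pandas 2.0+: parse with the pyarrow engine and keep the resulting
# columns as Arrow-backed dtypes instead of NumPy ones.
df = pd.read_csv(
    "synthetic_christmas.csv",
    engine="pyarrow",
    dtype_backend="pyarrow",
)
print(df.dtypes.head())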
Until polars came along, the fastest method around was datatable's .fread() method, which memory-maps the file and iterates through it using pointers rather than reading it into a buffer. pyarrow builds on top of the Apache Arrow project and is designed for working with large datasets both in-memory and on-disk, with highly efficient memory management, columnar data storage, and CPU-level optimisations. The absolute winner here is clearly polars, which brings a multi-threaded query engine written in Rust, columnar storage, parallelised processing and cache-coherent algorithms.
If you would like to understand more about how Polars works under the hood, combining Rust with Apache Arrow, check out Nico Kreiling's talk at PyData Berlin 2023: Raised by Pandas, striving for more: An opinionated introduction to Polars.
Synthetic Dataset
Today's example used a synthetic dataset created using some of the techniques shared in our previous newsletters. The dataset comprises 20 columns and 2 million rows of uniquely generated Christmas-themed synthetic data using the faker library. Here's the function which generates a single row:
from faker import Faker

faker = Faker()

def make_row():
    return {
        "First Name": faker.first_name(),
        "Last Name": faker.last_name(),
        "Email": faker.email(),
        "Phone Number": faker.phone_number(),
        "Street Address": faker.street_address(),
        "City": faker.city(),
        "State": faker.state(),
        "Postal Code": faker.postcode(),
        "Country": faker.country(),
        "Company Name": faker.company(),
        "Job Title": faker.job(),
        "Department": faker.bs(),
        "Favourite Christmas Song": faker.sentence(),
        "Favourite Christmas Movie": faker.sentence(),
        "Christmas Wish": faker.sentence(),
        "Christmas Tradition": faker.sentence(),
        "Favourite Christmas Dish": faker.sentence(),
        "Christmas Travel Destination": faker.city(),
        "Christmas Shopping Budget": faker.random_number(digits=3),
        "Letter to Santa": faker.text(max_nb_chars=50),
    }
We then combine this with joblib to parallelise the generation across 8 workers, and tqdm to monitor progress. This takes around two minutes to run, at roughly 16,000 rows generated per second.
import pandas as pd
from joblib import Parallel, delayed
from tqdm import tqdm

data = Parallel(n_jobs=8)(
    delayed(make_row)() for _ in tqdm(range(2_000_000))
)

df = pd.DataFrame(data)
df.to_csv("synthetic_christmas.csv", index=False)
df.to_csv("synthetic_christmas.csv.tar.gz", index=False)

df.shape
(2000000, 20)
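As a quick sanity check (not part of the original snippet), you can compare how much the gzip-compressed copy saves on disk before deciding which file to benchmark against:

import os

# Compare the on-disk size of the plain and compressed copies.
for path in ["synthetic_christmas.csv", "synthetic_christmas.csv.tar.gz"]:
    size_mb = os.path.getsize(path) / 1e6
    print(f"{path}: {size_mb:.1f} MB")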
Do you use Polars?
We'll be doing a deep dive into Polars in a future update. If you have any Polars tips you'd like to share, please get in touch via LinkedIn or Twitter.
If you're enjoying this series, or have feedback or ideas on how we can make it better, please reach out to us via [email protected] or @CoefficientData on Twitter.
See you tomorrow!
Your Python Advent Calendar Team
Python Advent Calendar is brought to you by Coefficient, a data consultancy with expertise in data science, software engineering, devops, machine learning and other AI-related services. We code, we teach, we speak, we're part of the PyData London Meetup team, and we love giving back to the community. If you'd like to work with us, just email [email protected] and we'll set up a call to say hello.