Day 6: The Fastest Way To Load A CSV

Achieve a 20x speedup loading CSVs using this one weird trick

🎄 Python Advent Calendar: Day 6! 🎄

Do you work with large CSV files? Do you handle CSV uploads into a web service? If you do, then knowing the fastest way to load a CSV into pandas is critical.

Some libraries are much faster at reading CSVs into a pandas DataFrame than the pd.read_csv function. pandas itself learned some new skills in the recent 2.0 release, but is it enough to catch up?
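One of those new skills is optional pyarrow support: pandas 2.0 can parse a CSV with a pyarrow engine and, if you like, keep the result in pyarrow-backed columns. Here is a minimal sketch of those pandas 2.0 options (only the engine="pyarrow" variant is benchmarked below):

import pandas as pd

# pandas 2.0+: parse with the pyarrow engine and store pyarrow-backed dtypes
df = pd.read_csv(
    "synthetic_christmas.csv",
    engine="pyarrow",
    dtype_backend="pyarrow",
)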

Contenders

What do you think is fastest?

  1. pandas

  2. polars

  3. dask

  4. datatable

Is it faster to read from a compressed CSV file? Which pandas engine is fastest? Let's find out! Here are the options we'll be running.

import dask.dataframe as dd
import datatable as dt
import pandas as pd
import polars as pl

# pandas
df = pd.read_csv("synthetic_christmas.csv")
df = pd.read_csv("synthetic_christmas.csv.tar.gz")
df = pd.read_csv("synthetic_christmas.csv", low_memory=False)
df = pd.read_csv("synthetic_christmas.csv.tar.gz", low_memory=False)
df = pd.read_csv("synthetic_christmas.csv", engine="c")
df = pd.read_csv("synthetic_christmas.csv", engine="python")
df = pd.read_csv("synthetic_christmas.csv", engine="pyarrow")

# polars
df = pl.read_csv("synthetic_christmas.csv")
df = pl.read_csv("synthetic_christmas.csv.tar.gz")

# dask
df = dd.read_csv("synthetic_christmas.csv").compute()

# datatable
df = dt.fread("synthetic_christmas.csv").to_pandas()
df = dt.fread("synthetic_christmas.csv.tar.gz").to_pandas()

print(f"{len(df):,} rows")

We use the Jupyter %%timeit magic for performance profiling of each approach:

%%timeit -n2 -r10
df = pd.read_csv("synthetic_christmas.csv")

The -n2 argument executes the cell twice per loop, and the -r10 argument repeats that loop 10 times, giving 20 runs in total. This helps to get a more accurate estimate of the cell's runtime.
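If you want to reproduce the timings outside Jupyter, a rough equivalent (a sketch, not the exact harness we used) is the standard-library timeit module:

import timeit

import pandas as pd

# roughly equivalent to %%timeit -n2 -r10: 10 repeats of a 2-iteration loop
times = timeit.repeat(
    'pd.read_csv("synthetic_christmas.csv")',
    globals=globals(),
    number=2,
    repeat=10,
)
print(f"best of 10: {min(times) / 2:.2f} s per read")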

Summarising this visually, you can see that polars is almost 20x faster than pandas' default pd.read_csv! Special mention goes to the pyarrow engine in pandas, which achieves comparable results. Note that this comes with a warning that pyarrow is currently "an experimental engine, and some features are unsupported, or may not work correctly".

Until polars came along, the fastest method around was datatable's .fread(), which memory-maps the file and iterates through it using pointers rather than reading it into a buffer. pyarrow builds on the Apache Arrow project and is designed for working with large datasets both in-memory and on-disk, with highly efficient memory management, columnar data storage, and CPU-level optimisations. The clear winner here is polars, which brings a multi-threaded query engine written in Rust, columnar storage, parallelised processing and cache-coherent algorithms.
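If the rest of your pipeline expects a pandas DataFrame, you can still use polars purely as the fast reader and convert the result afterwards. A minimal sketch (note that .to_pandas() needs pyarrow installed, and the conversion itself copies data, so include it in any timing you do):

import polars as pl

# let polars do the fast parse, then hand a pandas DataFrame to downstream code
pdf = pl.read_csv("synthetic_christmas.csv").to_pandas()
print(type(pdf))  # pandas.core.frame.DataFrame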

If you would like to understand more about how Polars works under the hood, combining Rust with Apache Arrow, check out Nico Kreiling's talk at PyData Berlin 2023: Raised by Pandas, striving for more: An opinionated introduction to Polars.

Synthetic Dataset

Today's example used a synthetic dataset created using some of the techniques shared in our previous newsletters. The dataset comprises 20 columns and 2 million rows of uniquely generated Christmas-themed synthetic data using the faker library. Here's the function which generates a single row:

from faker import Faker

faker = Faker()


def make_row():
    return {
        "First Name": faker.first_name(),
        "Last Name": faker.last_name(),
        "Email": faker.email(),
        "Phone Number": faker.phone_number(),
        "Street Address": faker.street_address(),
        "City": faker.city(),
        "State": faker.state(),
        "Postal Code": faker.postcode(),
        "Country": faker.country(),
        "Company Name": faker.company(),
        "Job Title": faker.job(),
        "Department": faker.bs(),
        "Favourite Christmas Song": faker.sentence(),
        "Favourite Christmas Movie": faker.sentence(),
        "Christmas Wish": faker.sentence(),
        "Christmas Tradition": faker.sentence(),
        "Favourite Christmas Dish": faker.sentence(),
        "Christmas Travel Destination": faker.city(),
        "Christmas Shopping Budget": faker.random_number(digits=3),
        "Letter to Santa": faker.text(max_nb_chars=50),
    }

We then combine this with joblib to parallelise the generation across 8 workers, and tqdm to monitor its progress. This takes around two minutes to run, generating roughly 16,000 rows per second.

import pandas as pd
from joblib import Parallel, delayed
from tqdm import tqdm

# generate 2 million rows in parallel, with a tqdm progress bar
data = Parallel(n_jobs=8)(
    delayed(make_row)() for _ in tqdm(range(2_000_000))
)

# write both an uncompressed and a compressed copy of the dataset
df = pd.DataFrame(data)
df.to_csv("synthetic_christmas.csv", index=False)
df.to_csv("synthetic_christmas.csv.tar.gz", index=False)
df.shape

(2000000, 20)
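Part of the answer to "is it faster to read from a compressed CSV?" depends on how much smaller the compressed file actually is. Here's a quick sketch for checking the on-disk size of the two files we just wrote (assuming both are in the working directory):

from pathlib import Path

# compare the on-disk sizes of the plain and compressed CSVs
for name in ["synthetic_christmas.csv", "synthetic_christmas.csv.tar.gz"]:
    size_mb = Path(name).stat().st_size / 1e6
    print(f"{name}: {size_mb:.0f} MB")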

šŸ»ā€ā„ļøĀ Ā Do you use Polars? Ā ā„ļø

We'll be doing a deep dive into Polars in a future update. If you have any Polars tips you'd like to share, please get in touch via LinkedIn or Twitter.

If you're enjoying this series, or have feedback or ideas on how we can make it better, please reach out to us via [email protected] or @CoefficientData on Twitter.

See you tomorrow! 🎄

Your Python Advent Calendar Team 😃

🤖 Python Advent Calendar is brought to you by Coefficient, a data consultancy with expertise in data science, software engineering, devops, machine learning and other AI-related services. We code, we teach, we speak, we're part of the PyData London Meetup team, and we love giving back to the community. If you'd like to work with us, just email [email protected] and we'll set up a call to say hello. ☎️