Day 3: Transform Your Data Transformations

PyJanitor: data cleaning pandas extensions

🎄 Python Advent Calendar: Day 3! 🎄

We apologise for the delayed delivery of today's advent calendar update. It's been a long day of wifi-less planes and non-existent trains. Along the way, I discovered Pocket Coffee, a delightful Italian concoction best described as pd.concat([espresso_shot, praline_chocolate]).

Behind today's door is a remarkable library packed with so much functionality that I'm seriously considering writing a RAG-based LLM to help identify all the places my code could be simplified by switching out my hand-carved functions for its convenience methods. Introducing… PyJanitor!

PyJanitor: data cleaning pandas extensions

📆 Last updated: November 2023
⬇️ Downloads: 9,701/week
⚖️ License: MIT
🐍 PyPI | ⭐ GitHub Stars: 1.2k

šŸ” What is it?

PyJanitor extends pandas with many custom methods for common data cleaning & preprocessing tasks. Originally a port of R's janitor package, PyJanitor aims to create a consistent API for data cleaning with an emphasis on enabling the "method chaining" paradigm.

Beyond the built-in functions (of which only a small selection is highlighted below), PyJanitor also includes convenience data processing functions specific to biology, chemistry, engineering, finance, machine learning, mathematics, and time-series processing.
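
To give a flavour of the method-chaining style, here's a tiny, made-up example using two of the methods covered below (the toy DataFrame and its column names are purely illustrative):

import pandas as pd
import janitor  # importing janitor registers the extra methods on pandas DataFrames

toy = pd.DataFrame({"A Column": [1, None], "Empty": [None, None]})
toy = toy.clean_names().remove_empty()  # chain cleaning steps into one readable pipeline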

📦 Install

pip install pyjanitor

🛠️ Use

1. Load the data

Today's examples build off Mark Pack's PollBase, a database of UK voting intention opinion polls between 1943 and 2023. Here's how it looks once read into pandas:

import pandas as pd

df = pd.read_excel(
    "PollBase-Q3-2023.xlsx",
    sheet_name="Monthly average",
)

2. Remove empty rows/columns

It needs a lot of cleaning! Let's import janitor and use PyJanitor's .remove_empty() method to remove any rows/columns which are entirely empty:

import janitor  # importing janitor registers PyJanitor's methods on pandas DataFrames

df.shape  # (515, 20)
df = df.remove_empty()
df.shape  # (515, 19)

3. Eliminate extraneous columns

Next, let's remove all columns with a name like Unnamed: 0. These are mostly used to indicate the month-on-month change in polling for each party, which could easily be re-derived from the raw numbers using pandas' .diff() method if required. We rename the second column to Date so that it isn't dropped along with them.

Note: this operation, like most of PyJanitor's functions, can easily be done using pandas' native .drop() method; however, .select_columns() offers more powerful filtering options, such as the glob-style wildcard matching used below (and more, as sketched after the code).

df = (
    df.rename(columns={"Unnamed: 1": "Date"})
    .select_columns("Unnamed*", invert=True)
    .select_columns("Lead", invert=True)
)
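
If I'm reading the API right, .select_columns() also accepts several patterns in a single call, so the two calls above could likely be collapsed into one (same column names assumed):

df = (
    df.rename(columns={"Unnamed: 1": "Date"})
    .select_columns("Unnamed*", "Lead", invert=True)
)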

4. Auto-clean column names

This yields a much simpler DataFrame. Next, let's use the rather handy .clean_names() method to simplify & lower-case the column names:

df = df.clean_names()
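
As a rough illustration with made-up column names (not the actual PollBase headers), .clean_names() by default lower-cases names and replaces spaces with underscores:

toy = pd.DataFrame(columns=["Fieldwork End Date", "Con Lead"])
print(toy.clean_names().columns.tolist())
# ['fieldwork_end_date', 'con_lead']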

5. Filter out rows not pertaining to opinion polls

Now let's remove the Date rows containing the string "GE" (these indicate UK General Elections rather than opinion polls). We could do this using df.query("date != 'GE'"), but I'll use the opportunity to demo the convenient and very powerful .filter_column_isin() method. This is a method-chain-friendly equivalent to df = df[df["colour"].isin(["red", "green", "blue"])], with optional negation via the complement=True argument.

df = df.filter_column_isin(
    column_name="date",
    iterable=["GE"],
    complement=True,
)
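
For what it's worth, my understanding is that the default complement=False keeps only the matching rows, so the opposite subset could be isolated like this:

# On the unfiltered data, this would keep only the General Election rows
ge_rows = df.filter_column_isin(column_name="date", iterable=["GE"])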

6. Pivot to long

Pandas has powerful functionality for reshaping data, although personally I find the method names unintuitive and tricky to remember. It's my #1 reason for referencing the (highly recommended) official pandas cheatsheet:

"Reshaping Dataā€ section from the pandas cheatsheet

Visually, it appears we can use the pd.melt() function to turn our wide DataFrame into a long one, but this doesn't play nicely with the method-chaining paradigm. PyJanitor introduces two intuitively named methods: .pivot_longer() and .pivot_wider().

df = df.pivot_longer(
    index="date",
    names_to="party",
    values_to="voteshare",
)
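
If the arguments above behave as expected, the reshaped DataFrame should now hold one observation per row across exactly three columns:

print(df.columns.tolist())  # ['date', 'party', 'voteshare']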

7. Encode columns as Categorical

Our DataFrame is beginning to look model-ready! With the new party column containing a small set of strings, it makes sense to convert it to pandas' Categorical data type. This can be preferable for performance reasons, and it also signals the data type to other libraries, which can use that metadata to automatically adapt data visualisations or statistical tests.

df = df.encode_categorical(["party"])
print(df.party)
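
A quick way to see the benefit (plain pandas, nothing PyJanitor-specific) is to compare the column's memory footprint with and without the Categorical type:

# Compare memory usage of the same column as plain strings vs. Categorical
as_strings = df.party.astype("object").memory_usage(deep=True)
as_category = df.party.memory_usage(deep=True)
print(as_strings, as_category)  # the Categorical version is typically far smaller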

8. One-hot encoding with category expansion

Another modelling convenience function, similar to pandas' pd.get_dummies() or the one-hot encoder in scikit-learn:

df.expand_column("party")
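
For comparison, a roughly equivalent plain-pandas approach (the output column names may differ slightly from .expand_column()):

one_hot = pd.get_dummies(df["party"], prefix="party")
df_with_dummies = df.join(one_hot)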

9. Splitting data into features & target for machine learning

It's common to create a features DataFrame X = df.drop(columns=target) and then y = df[target] when working with pandas and modelling libraries such as scikit-learn or statsmodels. PyJanitor brings a convenience function for this.

X, y = df.get_features_targets(
    target_column_names=["voteshare"]
)
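
This is essentially a method-chain-friendly stand-in for the familiar two-liner from the paragraph above, shown here for comparison:

X = df.drop(columns=["voteshare"])
y = df["voteshare"]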

10. The best of the rest

The above gives an idea of the variety of convenience methods built into PyJanitor. It also includes functionality for:

  • Coalescing columns (like the SQL COALESCE() function; see the sketch after this list)

  • Concatenation (e.g. for creating a unique index) and deconcatenation (great for working with log data)

  • Date conversion helpers to convert from Unix epochs, Excel integer timestamps, and Matlab-style timestamps.

  • Handy tools like .complete() (the "opposite of dropna"), .conditional_join(), .count_cumulative_unique(), .currency_column_to_numeric(), .groupby_topk(), .round_to_fraction(), .sort_naturally() and .update_where()

  • Industry-specific functionality such as working with FASTA bioinformatics files, generating RDKit molecular descriptors, converting between various engineering units, currency converters, getting a company's name from its symbol, an inflation calculator, reading multiple CSVs into a single DataFrame, mathematical functions such as sigmoid() and softmax(), and flagging jumps exceeding a threshold in a time-series dataset.
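
As a taster, here's a minimal, hypothetical sketch of .coalesce() filling gaps in one column with values from another. The column names are made up and the exact signature may vary between PyJanitor versions, so do check the docs:

# Hypothetical columns: prefer "phone_work", fall back to "phone_home"
df = df.coalesce(
    "phone_work",
    "phone_home",
    target_column_name="phone",
)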

📝 Do you have feedback? 💬

Do you use PyJanitor? Is there a life-changing function that you use regularly that we missed? Let us know, and weā€™ll feature it (and credit you)!

If you're enjoying this series, or have feedback or ideas on how we can make it better, please reach out to us via [email protected] or @CoefficientData on Twitter.

See you tomorrow! 🐍

Your Python Advent Calendar Team 😃

🤖 Python Advent Calendar is brought to you by Coefficient, a data consultancy with expertise in data science, software engineering, devops, machine learning and other AI-related services. We code, we teach, we speak, we're part of the PyData London Meetup team, and we love giving back to the community. If you'd like to work with us, just email [email protected] and we'll set up a call to say hello. ☎️