Day 5: Cache Me If You Can

Achieving joblib satisfaction

🎄 Python Advent Calendar: Day 5! 🎄

Today, we're diving deeper into the world of object caching in Python. In yesterday's edition, we introduced the LRU cache available within the boltons library, but the same feature also exists in the Python standard library.

Our main focus for today, however, is a library that I use heavily for three reasons:

  1. Optimised caching and memoization for numpy arrays (and any complex object).

  2. Easy parallelisation that just works.

  3. Serialisation to disk for complex objects such as numpy arrays, DataFrames, fitted machine learning estimators, and tensors.

functools.cache: Caching built into Python

N.B. functools is part of Python's standard library

📆 Last updated: October 2023
⬇️ Downloads: ~10 million/week
⚖️ License: PSF License 2.0
⭐ GitHub Stars: 51.7k

🔍 What is it?

Caching allows you to store frequently accessed data in temporary in-memory storage. This simple yet powerful technique can significantly speed up your programs by avoiding repetitive, time-consuming computations.

Memoization is a caching technique that records a function's inputs and the outputs they produce. If the function is called again with the same inputs, the cached result is returned instead of being recomputed. This is useful whenever you call a function repeatedly with the same arguments, for example while developing new code. It can be a huge time and cost-saver when you are:

  • Performing any expensive deterministic computation.

  • Calling external APIs such as the GPT-4 API or geocoding APIs.

  • Building web scrapers.

functools.cache and functools.lru_cache are decorators built into Python's functools module that cache the results of function calls. The Least Recently Used (LRU) strategy keeps the most recently accessed items in the cache and discards the least recently used items first when the cache is full.

🛠️ Use

Simply decorate any function with functools.cache (for a lightweight, unbounded cache with no size limit) or functools.lru_cache (for a bounded cache that uses an LRU strategy to discard the least recently used items when the cache is full). You can specify the cache size using the maxsize parameter, for example @lru_cache(maxsize=128).
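
As a quick illustration, here's a minimal sketch of a bounded LRU cache on a toy function (the function itself is purely illustrative):

from functools import lru_cache


@lru_cache(maxsize=128)
def fibonacci(n):
    # Each distinct value of n is computed only once;
    # repeat calls are served straight from the cache.
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)


fibonacci(30)
print(fibonacci.cache_info())
# CacheInfo(hits=28, misses=31, maxsize=128, currsize=31)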

In the example below, we've created a simple cached wrapper around the pandas read_html() function and added a time.sleep(3) to introduce a three-second delay whenever the function body actually runs. This is purely to make it easier to spot when the cache is kicking in.

import time
import pandas as pd

from functools import cache


@cache
def get_tables(url):
    # Simulate an expensive call, then fetch every HTML table on the page
    time.sleep(3)
    return pd.read_html(url)
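
To see the cache kick in, call the function twice with the same URL. The URL below is just an example; any page containing an HTML table will do. The first call pays the three-second penalty (plus the download), while the repeat call returns almost instantly:

url = "https://en.wikipedia.org/wiki/The_Simpsons"  # example URL; any page with a table works

start = time.perf_counter()
tables = get_tables(url)  # cache miss: sleeps for 3s and downloads the page
print(f"First call: {time.perf_counter() - start:.1f}s")

start = time.perf_counter()
tables = get_tables(url)  # cache hit: result returned from memory immediately
print(f"Second call: {time.perf_counter() - start:.4f}s")

If you ever need to force a refresh, both decorators expose a cache_clear() method on the wrapped function (e.g. get_tables.cache_clear()).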

joblib: cache it, serialise it, parallelise it

📆 Last updated: December 2023
⬇️ Downloads: 9.3 million/week
⚖️ License: BSD 3-Clause
🐍 PyPI | ⭐ GitHub Stars: 3.5k

🔍 What is it?

Joblib offers three key features:

  1. NumPy-friendly caching & memoization.

  2. Simple parallelisation for embarrassingly parallelisable for loops.

  3. Efficient serialisation for complex or large objects such as NumPy arrays, machine learning models, DataFrames and tensors.

📦 Install

pip install joblib

🛠️ Use

1. Memoization with joblib

To memoize with joblib, we need three steps:

  1. Import the Memory class.

  2. Define where we want the cache to live (note: the cache is serialised to disk and is persistent, unlike the functools.cache decorator, which lives in memory). Don't forget to add this directory to your .gitignore file.

  3. Create an instance of the Memory class with our specified configuration.

from joblib import Memory

# Don't forget to .gitignore this directory
cache_directory = "./cache"

# Create an instance of the Memory class ready for caching
memory = Memory(cache_directory, verbose=0)
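
Any function decorated with memory.cache will have its results written to that directory and reloaded on repeat calls with the same arguments. Here's a minimal sketch (the function and array below are made up purely for illustration) showing the NumPy-friendly side of joblib's cache:

import time

import numpy as np


@memory.cache
def expensive_transform(data):
    # Stand-in for a slow, deterministic computation on a large array
    time.sleep(2)
    return np.sqrt(data) + data.mean()


data = np.random.rand(1_000_000)

expensive_transform(data)  # first call: runs and writes the result to ./cache
expensive_transform(data)  # repeat call: loaded from disk, no 2-second delay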

For our next example we're going to use the openai library to perform some light generative AI. You will need to get an OpenAI API key if you want to try this at home. You can specify your key within your notebook as follows (we recommend using libraries such as python-dotenv to do this securely... we'll cover this in more detail in a future newsletter).

import os

os.environ["OPENAI_API_KEY"] = "ENTER_KEY_HERE"

Let's create a cached function for interacting with OpenAI's GPT API! Here's a breakdown of what's happening here:

  1. Import the OpenAI class and create a client instance.

  2. Create a cached function using the joblib Memory class we created earlier.

  3. Generate a completion from the gpt-3.5-turbo model with a custom system and user prompt taken from the function's arguments.

  4. Extract and return the generated content.

from openai import OpenAI

client = OpenAI()


@memory.cache
def gpt_limerick(system, prompt):
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
    )
    response = completion.choices[0].message.content

    return response

Here's how it looks with a few different runs. I thoroughly recommend trying this out; you can have a lot of fun with generative AI! ✨
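
For example, something like the following (the prompts are purely illustrative, and your outputs will of course vary):

system = "You are a poet who responds only in limericks."

# The first call with these exact arguments hits the OpenAI API...
print(gpt_limerick(system, "Write a limerick about caching in Python."))

# ...while repeating the identical call returns the cached response from
# disk, with no API request (and no extra cost), even after a restart.
print(gpt_limerick(system, "Write a limerick about caching in Python."))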

2. Parallelisation with joblib

Joblib works especially well for "embarrassingly parallel" for loops. This isn't about code shaming! The term is used in computer science to describe any problem that can be easily separated into multiple, independent sub-tasks, which can then be executed simultaneously without the need for communication between them.

This can lead to significant reductions in processing time, as separate processors can handle different parts of the task simultaneously. The term "embarrassingly" is used only to contrast them against other, more tricky, forms of parallelisation.

The code below shows a Python for loop that counts as embarrassingly parallel. Each run of the loop takes in a single variable, person, and generates a gift idea for them. The results are then collected into a DataFrame.

from tqdm import tqdm

people = [
    "Homer Simpson",
    "Marge Simpson",
    "Bart Simpson",
    "Lisa Simpson",
    "Maggie Simpson",
    "Ned Flanders",
    "Montgomery Burns",
    "Apu Nahasapeemapetilon",
    "Moe Szyslak",
    "Krusty the Clown",
]

system_prompt = """
You are one of Santa's elves. I will give you a
character from The Simpsons, you must give me
a 1-2 word Christmas gift idea for that character.
"""

rows = []

for person in tqdm(people):
    gift = gpt_limerick(
        system=system_prompt,
        prompt=person,
    )

    rows.append({"name": person, "gift": gift})

pd.DataFrame(rows)

To parallelise this, we can do the following:

  1. Refactor the contents of the for loop into a single function. In our case this is already done; we just need to call the gpt_limerick function and store its returned value.

  2. Rewrite the for loop as a comprehension, e.g. (your_function_name(x) for x in your_list)

  3. Restructure the comprehension to use the Parallel() object and the delayed() function as follows:

    Parallel()(delayed(your_function_name)(x) for x in your_list)

This reduces the execution time from 8 seconds down to around 1.3 seconds. We specify n_jobs=8 and prefer="threads" to run the calls across 8 threads, which gives us this roughly 6x speedup.
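
Putting those steps together for our gift-generating loop, a sketch of the parallelised version might look like this (reusing the gpt_limerick function and people list from above):

from joblib import Parallel, delayed

# Run the API calls concurrently on 8 threads; threads are a good fit here
# because each task spends most of its time waiting on the network.
gifts = Parallel(n_jobs=8, prefer="threads")(
    delayed(gpt_limerick)(system=system_prompt, prompt=person)
    for person in people
)

pd.DataFrame({"name": people, "gift": gifts})

Because the results come back in the same order as the inputs, the names and gifts still line up in the DataFrame.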

3. Serialisation with joblib

"Serialisation" refers to the process of converting Python objects into formats better suited to storage or sharing, such as files on disk, binary data in a database, or JSON returned via a REST API.

Python includes the pickle module for doing exactly this, but joblib's implementation is a drop-in replacement designed for efficient serialisation of large data objects such as NumPy arrays or machine learning models.

from joblib import dump, load

# Serialise the gifts list produced by the parallel run above
dump(gifts, "gifts.joblib")
# Creates a file called gifts.joblib
# in the current directory

gifts = load("gifts.joblib")
print(gifts)

# ['Donut delivery',
#  'Baking set',
#  'Skateboard upgrade',
#  'Musical Instrument',
#  'Blanket Buddies',
#  'Religious calendar',
#  'Electric Blanket',
#  'Indian spices',
#  'New Bartender Kit',
#  'Makeup kit']

Please note that joblib's dump() and load() are still based on pickle and therefore come with the following caveats:

  • ⚠️ Beware calling load() on an untrusted file! Deserialisation can execute arbitrary (and potentially malicious) code on your machine, so load() should never be used on objects from an untrusted source; doing so introduces a security vulnerability into your program.

  • 🔢 Pickle and joblib can (and do) change their approach to serialisation over time to ensure the most efficient algorithms are used under the hood. This means you must not rely on an object dumped with one version of Python and/or joblib loading correctly once you have upgraded either of them. It might work, but don't rely on it. As a result, joblib should be used for temporary serialisation only, such as caching or saving work to be picked up again in the near future. If you require medium-term or long-term storage, consider Feather for datasets, or formats such as skops or MLflow for machine learning models.

  • 📄 We recommend reviewing scikit-learn's own guidance for model serialisation.

🤔 Do you know someone learning Python? 📚

We've got 20 daily updates to go, and all updates are available on our sign-up page here: https://py-advent-calendar.beehiiv.com

If you know someone who's learning Python, or studying data science, why not ❤️ share the link above ❤️ with them? You could also tell everyone you know, and tag us on LinkedIn or Twitter!

If you're enjoying this series, or have feedback or ideas on how we can make it better, please reach out to us via [email protected] or @CoefficientData on Twitter.

See you tomorrow! 🐍

Your Python Advent Calendar Team 😃

🤖 Python Advent Calendar is brought to you by Coefficient, a data consultancy with expertise in data science, software engineering, devops, machine learning and other AI-related services. We code, we teach, we speak, we're part of the PyData London Meetup team, and we love giving back to the community. If you'd like to work with us, just email [email protected] and we'll set up a call to say hello. ☎️