Day 5: Cache Me If You Can
Achieving joblib satisfaction
Python Advent Calendar: Day 5!
Today, we're diving deeper into the world of object caching in Python. In yesterday's edition, we introduced the LRU cache available within the boltons library, but the same feature also exists in the Python standard library.
Our main focus today, however, is a library that I use heavily for three reasons:
Optimised caching and memoization for numpy arrays (and any complex object).
Easy parallelisation that just works.
Serialisation to disk for complex objects such as numpy arrays, DataFrames, fitted machine learning estimators, and tensors.
functools.cache: Caching built into Python
N.B. functools is part of Python's standard library
Last updated: October 2023
Downloads: ~10 million/week
License: PSF License 2.0
GitHub Stars: 51.7k
What is it?
Caching allows you to store frequently accessed data in temporary in-memory storage. This simple yet powerful technique can significantly speed up your programs by avoiding repetitive and time-consuming computations.
Memoization is a technique that remembers the inputs and outputs of a function call. If the function is called again with the same inputs, the cached result is returned instead. This is useful in scenarios where you may call a function repeatedly with the same arguments, for example when developing any new code. It can be a huge time and cost saver when you are:
Performing any expensive deterministic computation.
Calling external APIs such as the GPT-4 API or geocoding APIs.
Building web scrapers.
The functools.cache and functools.lru_cache decorators, built into Python's functools module, cache the results of function calls. The Least Recently Used (LRU) strategy keeps the most recently accessed items available and discards the oldest items first when the cache is full.
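To make the LRU behaviour concrete, here's a minimal sketch with a deliberately tiny maxsize so the evictions are easy to see (the square function is just an illustration):

from functools import lru_cache

@lru_cache(maxsize=2)  # deliberately tiny cache so we can watch evictions
def square(n):
    print(f"computing square({n})")
    return n * n

square(1)  # miss: computed and cached
square(2)  # miss: computed and cached
square(1)  # hit: returned from the cache
square(3)  # miss: cache full, least recently used entry (2) is evicted
square(2)  # miss again: 2 was evicted, so it is recomputed

print(square.cache_info())
# CacheInfo(hits=1, misses=4, maxsize=2, currsize=2)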
Use
Simply decorate any function with functools.cache (for a lightweight unbounded cache with no limits) or functools.lru_cache (for a bounded cache that uses an LRU strategy to discard the least recently used items when the cache is full). You can specify the cache size using the maxsize parameter, for example @lru_cache(maxsize=128).
In the example below, we've created a simple cached wrapper around the pandas read_html() function and added a time.sleep(3) to introduce a three-second delay whenever the function is called. This is purely to make it easier to spot when the cache is kicking in.
import time

import pandas as pd
from functools import cache


@cache
def get_tables(url):
    time.sleep(3)
    return pd.read_html(url)
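A quick way to see the cache in action is to call the function twice with the same URL and time each call. This is a minimal sketch; the URL below is just an example, and any page containing an HTML table will do:

import time

url = "https://en.wikipedia.org/wiki/The_Simpsons"  # example only: any page with an HTML table

start = time.perf_counter()
tables = get_tables(url)  # first call: sleeps 3 seconds, then fetches and parses
print(f"first call:  {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
tables = get_tables(url)  # second call: same argument, so the cached result is returned instantly
print(f"second call: {time.perf_counter() - start:.2f}s")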
joblib: cache it, serialize it, parallelize it
Last updated: December 2023
Downloads: 9.3 million/week
License: BSD 3-Clause
PyPI | GitHub Stars: 3.5k
What is it?
Joblib offers three key features:
NumPy-friendly caching & memoization.
Simple parallelisation for embarrassingly parallelisable for loops.
Efficient serialisation for complex or large objects such as NumPy arrays, machine learning models, DataFrames and tensors.
Install
pip install joblib
Use
1. Memoization with joblib
To memoize using joblib, we require three steps:
1. Import the Memory class.
2. Define where we want the cache to live (note: the cache is serialised to disk and persistent, unlike the functools.cache decorator, which lives in memory). Don't forget to add this directory to your .gitignore file.
3. Create an instance of the Memory class with our specified configuration.
from joblib import Memory

# Don't forget to .gitignore this directory
cache_directory = "./cache"

# Create an instance of the Memory class ready for caching
memory = Memory(cache_directory, verbose=0)
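Joblib's headline feature is that this cache plays nicely with NumPy arrays. As a minimal sketch (the SVD here is just a stand-in for any slow, deterministic array computation), any expensive function can be memoized with the memory instance we just created:

import numpy as np

@memory.cache
def expensive_transform(arr):
    # stand-in for any slow, deterministic array computation
    return np.linalg.svd(arr, full_matrices=False)

data = np.random.rand(500, 500)
expensive_transform(data)  # first call: computed and written to ./cache on disk
expensive_transform(data)  # second call: loaded straight from the on-disk cache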
For our next example we're going to use the openai library to perform some light generative AI. You will need an OpenAI API key if you want to try this at home. You can specify your key within your notebook as follows (we recommend using libraries such as python-dotenv to do this securely; we'll cover this in more detail in a future newsletter).
import os

os.environ["OPENAI_API_KEY"] = "ENTER_KEY_HERE"
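If you prefer the python-dotenv route mentioned above, a minimal sketch looks like this, assuming you have python-dotenv installed and a .env file in your working directory containing a line such as OPENAI_API_KEY=your-key-here (keep .env out of version control too):

from dotenv import load_dotenv

load_dotenv()  # reads .env and populates os.environ, including OPENAI_API_KEY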
Let's create a cached function for interacting with OpenAI's GPT API! Here's a breakdown of what's happening:
1. Import the OpenAI class and create a client instance.
2. Create a cached function using the joblib Memory class we created earlier.
3. Generate a completion from the gpt-3.5-turbo API with a custom system and user prompt taken from the function's arguments.
4. Extract and return the generated content.
from openai import OpenAI

client = OpenAI()


@memory.cache
def gpt_limerick(system, prompt):
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
    )
    response = completion.choices[0].message.content
    return response
Here's how it looks with a few different runs. I thoroughly recommend trying this out; you can have a lot of fun with generative AI!
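If you want to reproduce something similar, a couple of illustrative calls might look like this (the prompts are just examples). The first call with a given pair of arguments hits the API; repeating it returns the cached response from disk:

system = "You are a cheerful poet who only answers in limericks."

print(gpt_limerick(system, "Write a limerick about Python caching"))  # calls the API
print(gpt_limerick(system, "Write a limerick about Python caching"))  # same arguments: served from the joblib cache
print(gpt_limerick(system, "Write a limerick about Christmas"))       # new prompt: calls the API again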
2. Parallelisation with joblib
Joblib works especially well for "embarrassingly parallel" for loops. This isn't about code shaming! The term is used in computer science to describe any problem that can be easily separated into multiple, independent sub-tasks, which can then be executed simultaneously without the need for communication between them.
This can lead to significant reductions in processing time, as separate processors can handle different parts of the task simultaneously. The term "embarrassingly" is used only to contrast these problems with other, trickier forms of parallelisation.
The code below shows a Python for loop that counts as embarrassingly parallel. Each run of the loop takes a single variable, person, and generates a gift idea for them. The results are then collected into a DataFrame.
from tqdm import tqdm

people = [
    "Homer Simpson",
    "Marge Simpson",
    "Bart Simpson",
    "Lisa Simpson",
    "Maggie Simpson",
    "Ned Flanders",
    "Montgomery Burns",
    "Apu Nahasapeemapetilon",
    "Moe Szyslak",
    "Krusty the Clown",
]

system_prompt = """
You are one of Santa's elves. I will give you a character from The Simpsons,
you must give me a 1-2 word Christmas gift idea for that character.
"""

df = []
for person in tqdm(people):
    gift = gpt_limerick(
        system=system_prompt,
        prompt=person,
    )
    df.append({"name": person, "gift": gift})

pd.DataFrame(df)
To parallelise this, we can do the following:
1. Refactor the contents of the for loop into a single function. In our case this is already done; we just need to call the gpt_limerick function and store its returned value.
2. Rewrite the for loop as a comprehension, e.g. (your_function_name(x) for x in your_list)
3. Restructure the comprehension to use the Parallel() object and the delayed() function, giving the pattern Parallel()(delayed(your_function_name)(x) for x in your_list), as in the sketch below.
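Applied to our gift-generation loop, a minimal sketch of this pattern (using the n_jobs and prefer settings discussed below) might look like this:

from joblib import Parallel, delayed

gifts = Parallel(n_jobs=8, prefer="threads")(
    delayed(gpt_limerick)(system=system_prompt, prompt=person)
    for person in people
)

df = pd.DataFrame({"name": people, "gift": gifts})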
This reduces the execution time from 8 seconds down to 1.3 seconds. We specify n_jobs=8 and prefer="threads" to use 8 CPU threads, achieving a roughly 6x speedup in execution.
3. Serialisation with joblib
"Serialisation" refers to the process of converting Python objects into formats more amenable to storage or sharing, such as files on disk, binary data in a database, or JSON returned via a REST API.
Python includes the pickle module for doing exactly this, but joblib's implementation is a drop-in replacement designed for efficient serialisation of large data objects such as NumPy arrays or machine learning models.
from joblib import dump, load

# Creates a file called gifts.joblib in the current directory
dump(gifts, "gifts.joblib")

gifts = load("gifts.joblib")
print(gifts)
# ['Donut delivery',
#  'Baking set',
#  'Skateboard upgrade',
#  'Musical Instrument',
#  'Blanket Buddies',
#  'Religious calendar',
#  'Electric Blanket',
#  'Indian spices',
#  'New Bartender Kit',
#  'Makeup kit']
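For larger objects such as NumPy arrays, it's also worth knowing that dump() accepts a compress argument. Here's a minimal sketch with a made-up array, purely for illustration:

import numpy as np
from joblib import dump, load

big_array = np.random.rand(1000, 1000)

# compress takes an integer from 0 to 9 (higher = smaller file, slower to write)
dump(big_array, "big_array.joblib", compress=3)

restored = load("big_array.joblib")
assert np.array_equal(big_array, restored)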
Please note that joblib's dump() and load() are still based on pickle and therefore come with the following caveats:
Beware calling load() on an untrusted file! Deserialisation can result in arbitrary or malicious code being run on your machine, so load() should never be used on objects from an untrusted source; otherwise you will introduce a security vulnerability into your program.
Pickle and joblib can (and do) change their approach to serialisation to ensure the most efficient algorithms are used under the hood. This means you must never rely on being able to dump() with one version of Python and/or joblib and then load() after upgrading either of them. It might work, but don't rely on it. As a result, joblib should be used for temporary serialisation only, such as caching or saving work to be picked up in the near future. If you require medium-term or long-term storage, consider Feather for datasets, or formats such as skops or mlflow for machine learning models.
We recommend reviewing scikit-learn's own guidance for model serialisation.
Do you know someone learning Python?
We've got 20 daily updates to go, and all updates are available on our sign-up page here: https://py-advent-calendar.beehiiv.com
If you know someone who's learning Python, or studying data science, why not share the link above with them? You could also tell everyone you know, and tag us on LinkedIn or Twitter!
If you're enjoying this series, or have feedback or ideas on how we can make it better, please reach out to us via [email protected] or @CoefficientData on Twitter.
See you tomorrow!
Your Python Advent Calendar Team
Python Advent Calendar is brought to you by Coefficient, a data consultancy with expertise in data science, software engineering, devops, machine learning and other AI-related services. We code, we teach, we speak, we're part of the PyData London Meetup team, and we love giving back to the community. If you'd like to work with us, just email [email protected] and we'll set up a call to say hello.