Day 4: Everyone's utils.py

"If it isn't built-in, it's a bolton"

🎄 Python Advent Calendar: Day 4! 🎄

Yesterday, we covered PyJanitor, a library that extends pandas with a variety of handy methods for data cleaning. Behind today’s door we introduce you to its pure-Python counterpart, boltons. With 250+ helper functions available, it’s likely that there’s a function somewhere inside that you’ve written yourself, and another that you’ll need soon. Don’t reinvent the wheel: check whether it already exists, and import it instead!

Boltons: “everyone’s utils.py”

📆 Last updated: December 2023
⬇️  Downloads: 687,184/week
⚖️  License: BSD 2-Clause
🐍 PyPI |  ⭐ GitHub Stars: 6.3k

🔍 What is it?

The goal of Boltons is to include functionality that “should be in the standard library” but isn’t: 250+ utility functions across 26 modules (a quick taster follows the list below):

  • 🐍  Enhancements to standard-library modules such as itertools (iterutils), functools (funcutils), datetime (timeutils), and urllib (urlutils).

  • 🧰  Convenience methods for specific types such as dictionaries (dictutils), strings (formatutils and strutils), lists (listutils), 2D tables (tableutils), containers (namedutils), and sets (setutils).

  • 🤖  Helper utils for working with the filesystem (fileutils and pathutils), I/O (ioutils), JSON (jsonutils), object caching (cacheutils), and type handling (typeutils).

  • 📊  Specialised functions for mathematics (mathutils) and statistics (statsutils).

  • 🐞  Advanced utilities for debugging (debugutils), garbage collection (gcutils), mailboxes (mboxutils), priority queues (queueutils), sockets (socketutils), and tracebacks (tbutils).
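
To give a flavour of a couple of these modules, here’s a minimal taster; the wishlist entries and the clamp bounds below are made up for illustration:

from boltons.dictutils import OrderedMultiDict
from boltons.mathutils import clamp

# dictutils: a dict that keeps several values per key, in insertion order
wishlist = OrderedMultiDict()
wishlist.add("books", "Fluent Python")
wishlist.add("books", "The Pragmatic Programmer")
print(wishlist.getlist("books"))
# ['Fluent Python', 'The Pragmatic Programmer']

# mathutils: clamp a value into a range
print(clamp(150, lower=0, upper=100))
# 100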

📦 Install

pip install boltons

🛠️ Use

1. Small batch processing

If you’ve read The Lean Startup by Eric Ries, you’ll know the most efficient way to prepare and send Christmas cards is in small batches. It’s a good time of year to test how “lean” your family & friends are by asking them this question:

What’s the fastest way to take 100 Christmas letters and for each to fold it, put it in an envelope, seal it, address it, and stamp it?

A) The whole process, one at a time.

B) Fold all the letters, put them in envelopes, seal them, and so on.

C) Option B, but in small batches of 5-10 letters.

The most common answer is B (batch process each step), yet multiple studies have confirmed that option A (do one letter at a time) gets there faster. If you’re sceptical, watch this video. My own hard-won advice from 13 years at the frontline of data science & software engineering is that:

  • Working in small batches is faster. This is a key principle of lean manufacturing: to reduce “WIP” (Work In Progress); otherwise, you end up spending valuable time moving around piles of WIP and discussing the WIP on your Jira board.

  • Adopt a “pipe cleaning” development approach. Eric Ries makes the excellent point that you may find out that the letters don’t fit in the envelopes. Would you rather find that out immediately or wait until you’ve spent time folding all the letters first?

Boltons doesn’t help with the philosophy of work management, but it does have a handy feature for chunking large lists into batches. Use cases include writing your own minibatch logic for deep learning, or batching requests for API calls.

from boltons.iterutils import chunked
from faker import Faker

fake = Faker()
christmas_cards = [fake.name() for _ in range(20)]

for batch in chunked(christmas_cards, 3):
    print(f"Mailing batch: {batch}")


Mailing batch: ['Diane Mcdowell', 'Tony Cross', 'Robert Yu']
Mailing batch: ['Thomas Jones', 'Lisa Johnson', 'Ralph Munoz']
Mailing batch: ['Peggy Berry', 'Marcus Martinez', 'Kathleen Walton']
Mailing batch: ['Daniel Stevens', 'Michael Cobb', 'Ashley Blair']
Mailing batch: ['Kathryn Jones', 'James Rodriguez', 'Christine Espinoza']
Mailing batch: ['Todd Kaiser', 'Ashley Pace', 'Tammy Dougherty']
Mailing batch: ['Evelyn Delgado', 'Scott Wang']
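
If your lists are huge (or come from a generator), the same module also offers chunked_iter, which yields batches lazily rather than building the full list up front. A minimal sketch:

from boltons.iterutils import chunked_iter

# Lazily pull just the first batch from a large range,
# without materialising all of the batches in memory
batches = chunked_iter(range(1_000_000), 3)
print(next(iter(batches)))
# [0, 1, 2]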

2. Window functions

Windowing operations come built-in with pandas but not Python itself. Applications range from calculating statistics such as a moving average, to time series modelling, to natural language processing (NLP) techniques such as n-gram detection. Here’s a simple approach for detecting n-grams in text:

from boltons.iterutils import windowed

lyrics = "I don't want a lot for Christmas"

for combo in windowed(lyrics.split(), 2):
    print(combo)

# ('I', "don't")
# ("don't", 'want')
# ('want', 'a')
# ('a', 'lot')
# ('lot', 'for')
# ('for', 'Christmas')

Let’s calculate the top 2-grams across the whole song:

# Assumes pandas is installed and that lyrics now holds the full song lyrics
import pandas as pd

words = lyrics.lower().split()
two_grams = windowed(words, 2)

(
    pd.Series(two_grams)
    .value_counts(ascending=True)
    .tail()
    .plot(kind='barh', color='red')
)

The most popular 2-grams in the most popular Christmas song

3. Unique values

You may already be a fan of using set() or np.unique() or pd.Series.unique() to identify the unique values within an iterable. The boltons unique() function is handy when you’re not using numpy/pandas and want to preserve the order in which each element is first encountered.

from boltons.iterutils import unique

print("".join(set(lyrics)))
# kpei-)dlrs'HoTSbaYD,NIcMPnuvOj?yhgfCw W(tmA

print("".join(unique(lyrics)))
# "I don'twalfrChismTejugcbpyMvkAY()DS,ONPHW?-"

4. Flattening lists of lists

If you’re like me, you keep lists upon lists of gift ideas. Lists of the best board games, soothing music, and other lists of cool stuff. Use the flatten() function from boltons.iterutils to flatten your lists-inside-lists-inside-lists into a single shopping list ready for prioritisation and allocation.

from boltons.iterutils import flatten

gift_ideas = [
    ["Lego lighthouse", "Azul board game"],
    [
        "Programmer socks",
        [
            "Python scarf",
            [
                "Type-hinted tshirt",
                "Suntrap album",
            ],
        ],
    ],
    "Raspberry Pi",
]
print(list(flatten(gift_ideas)))

# ['Lego lighthouse', 'Azul board game',
#  'Programmer socks', 'Python scarf',
#  'Type-hinted tshirt', 'Suntrap album',
#  'Raspberry Pi']

5. Exponential backoff

Perhaps you’ll be using the ChatGPT API to write customised cards and poems for everyone this Christmas? (We made a tutorial if you want to learn how.) For those of you with thousands of friends, you may need to add some defensive time.sleep() lines into your code alongside an “exponential backoff” within your retry logic. The boltons.iterutils.backoff function handles the number-crunching for you, returning a list of geometrically increasing floating-point delays (optionally randomised with jitter):

from boltons.iterutils import backoff

times = backoff(
    start=1,
    stop=10,
    count=5,
    factor=3,
    jitter=True,
)
print(times)

# [0.5496827240064363, 2.6675852562747178,
#  7.326622231605776, 0.14987974382568758,
#  9.861013287132742]

print(backoff(start=1, stop=30, count=5, factor=4))

# [1.0, 4.0, 16.0, 30.0, 30.0]
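
Here’s a rough sketch of how those delays might slot into your retry logic; call_api is a hypothetical stand-in for whatever request you’re actually making:

import time

from boltons.iterutils import backoff


def call_with_retries(call_api, max_attempts=5):
    # call_api: any zero-argument callable that raises on failure (hypothetical)
    delays = backoff(start=1, stop=30, count=max_attempts, factor=2, jitter=True)
    for attempt, delay in enumerate(delays, start=1):
        try:
            return call_api()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            time.sleep(delay)  # wait before trying again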

6. Create an in-memory cache

There are situations where you want to store information in-memory, but have to be selective about what you keep. When your kitchen fridge runs out of space, what do you discard? The items you touched least recently (“Least Recently Used”), or the items you put in earliest (“Least Recently Inserted”)?

Boltons provides a few utilities for caching: the LRU Cache (“Least Recently Used”) and the LRI Cache (“Least Recently Inserted”). The idea is to cache a preset number of items (say, 5 items) and, when the cache is full, to discard the item that was least recently used or least recently inserted. Here’s how the LRU Cache works:

from boltons.cacheutils import LRU

recipe_cache = LRU(max_size=2)
recipe_cache["Gingerbread"] = (
    "Flour, Ginger, Molasses, Sugar, "
    "Butter, Baking soda, Cinnamon, Cloves"
)
recipe_cache["Mince Pie"] = (
    "Mincemeat (dried fruits & spices), "
    "Pastry dough, Sugar (for dusting, optional)"
)

# What's in our recipe cache?
print(list(recipe_cache.keys()))
# ['Gingerbread', 'Mince Pie']

# We can access the gingerbread ingredients
print(recipe_cache["Gingerbread"])
# Flour, Ginger, Molasses, Sugar, Butter, Baking soda, Cinnamon, Cloves

# Inserting a new entry evicts the least recently used item: Mince Pie
# (we just accessed Gingerbread, so it survives)
recipe_cache["Christmas Pudding"] = (
    "Currants, Raisins, Suet, Brown sugar, "
    "Breadcrumbs, Flour, Mixed spice, "
    "Candied peel, Eggs, Stout"
)

print(recipe_cache.get("Mince Pie"))  # None
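
For comparison, here’s a quick sketch of the LRI cache, which evicts by insertion order regardless of how recently an entry was read; the drinks below are just for illustration:

from boltons.cacheutils import LRI

drinks_cache = LRI(max_size=2)
drinks_cache["Mulled wine"] = "Red wine, Orange, Cloves, Cinnamon"
drinks_cache["Eggnog"] = "Eggs, Cream, Sugar, Nutmeg"

# Reading an entry does not protect it from eviction in an LRI
print(drinks_cache["Mulled wine"])

# Inserting a third entry evicts the earliest insertion: Mulled wine
drinks_cache["Hot chocolate"] = "Milk, Cocoa, Sugar, Marshmallows"
print(drinks_cache.get("Mulled wine"))  # None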

7. Memoization

Whether you’re mass-producing Christmas gifts with generative AI, or implementing your own travelling salesman solver to optimise your Christmas party travel plans, developing against any API can be a slow experience unless you can cache your API calls. The concept of memoization is incredibly valuable to learn and apply:

Memoization is an optimization technique used to speed up computer programs by storing the results of expensive function calls to pure functions and returning the cached result when the same inputs occur again.

Wikipedia — Memoization

Boltons provides utilities for memoization that use your caching tool of choice. Here’s an example using a “slow” function that takes a long time to wrap presents. When it sees the same input, it remembers this and immediately returns the output calculated previously.

import time
from tqdm import tqdm

from boltons.cacheutils import cached, LRU

my_cache = LRU()


@cached(my_cache)
def wrap_gifts(total):
    # Assume some complex calculation
    for _ in tqdm(range(total)):
        time.sleep(0.5)
    return "🎁" * total


print(wrap_gifts(20))  # slow: runs the loop the first time
print(wrap_gifts(20))  # instant: same input, result comes straight from the cache

boltons also provides cachedmethod for memoizing methods and cachedproperty for caching class properties.
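
As a rough sketch of cachedproperty (the GiftList class and its prices are made up for illustration), the decorated method runs once per instance and the result is reused on later accesses:

from boltons.cacheutils import cachedproperty


class GiftList:
    def __init__(self, prices):
        self.prices = prices

    @cachedproperty
    def total_cost(self):
        # Runs on first access only; the result is cached on the instance
        print("Adding up prices...")
        return sum(self.prices)


gifts = GiftList([10, 25, 15])
print(gifts.total_cost)  # prints "Adding up prices..." then 50
print(gifts.total_cost)  # 50 again, without recomputing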

8. String utils

boltons includes a wide variety of helper utilities for so many things I’ve hand-written myself. Here’s a non-exhaustive selection, from slugifiers to pluralizers to a nicer solution for string replacement than a long chain of .replace(a, b).replace(c, d).replace(e, f)!

from boltons.strutils import (
    camel2under,
    under2camel,
    slugify,
    split_punct_ws,
    ordinalize,
    pluralize,
    singularize,
    find_hashtags,
    MultiReplace,
)

camel2under("AllIWantForChristmasIsYou")
# 'all_i_want_for_christmas_is_you'

under2camel("all_i_want_for_christmas_is_you")
# 'AllIWantForChristmasIsYou'

slugify("Day 4: Everyone's utils.py")
# 'day_4_everyone_s_utils_py'

split_punct_ws("Day 4: Everyone's utils.py")
# ['Day', '4', 'Everyone', 's', 'utils', 'py']

ordinalize(4)
# '4th'

pluralize("python")
# 'pythons'
pluralize("sheep")
# 'sheep'
pluralize("util")
# 'utils'
pluralize("utils")  # 🤔
# 'utilses'

singularize("pythons")
# 'python'
singularize("FEET")
# 'FOOT'

# Using .replace()
"Day 4: Everyone's utils.py".replace(
    "Day 4", "Today"
).replace("Everyone's", "Your new")
# 'Today: Your new utils.py'

# Using MultiReplace from boltons
MultiReplace(
    {
        "Day 4": "Today",
        "Everyone's": "Your new",
    }
).sub("Day 4: Everyone's utils.py")
# 'Today: Your new utils.py'

⚙️  What are your favourite boltons?  🔧

That’s it for today’s update! There’s so much included in boltons that we didn’t cover. If there’s a boltons feature you think we should have included, why not ❤️ share the link ❤️ to this newsletter on LinkedIn or Twitter and tell us about it!

If you’re enjoying this series, or have feedback or ideas on how we can make it better, please reach out to us via [email protected] or @CoefficientData on Twitter.

See you tomorrow! 🐍

Your Python Advent Calendar Team 😃 

🤖 Python Advent Calendar is brought to you by Coefficient, a data consultancy with expertise in data science, software engineering, devops, machine learning and other AI-related services. We code, we teach, we speak, we’re part of the PyData London Meetup team, and we love giving back to the community. If you’d like to work with us, just email [email protected] and we’ll set up a call to say hello. ☎️