Deleting characters matching a pattern

How to:

import re

# Example string
text = "Hello, World! 1234"

# Remove all digits
no_digits = re.sub(r'\d', '', text)
print(no_digits)  # Output: "Hello, World! "

# Remove punctuation (anything that is not a word character or whitespace; underscores are kept)
no_punctuation = re.sub(r'[^\w\s]', '', text)
print(no_punctuation)  # Output: "Hello World 1234"

# Remove vowels
no_vowels = re.sub(r'[aeiouAEIOU]', '', text)
print(no_vowels)  # Output: "Hll, Wrld! 1234"

My custom function

I do this frequently enough that I refactored it into this simple delete() function. It’s also a good demonstration of doctests:

import re


def delete(string: str, regex: str) -> str:
    """Return string with every match of regex removed.

    >>> delete("Hello, world!", "l")
    'Heo, word!'

    >>> delete("Hello, world!", "[a-z]")
    'H, !'
    """
    return re.sub(regex, "", string)
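Since the function is meant to demonstrate doctests, here is a minimal sketch of how those examples could actually be run. It assumes the function lives in a file executed as a script; the one-line docstring summary is added for illustration only.

```python
import doctest
import re


def delete(string: str, regex: str) -> str:
    """Return string with every match of regex removed.

    >>> delete("Hello, world!", "l")
    'Heo, word!'
    >>> delete("Hello, world!", "[a-z]")
    'H, !'
    """
    return re.sub(regex, "", string)


if __name__ == "__main__":
    # testmod() collects the >>> examples from every docstring in this
    # module, runs them, and compares the real output with the text
    # written underneath each example.
    results = doctest.testmod()
    print(results)
```

Running python -m doctest on the file should work as well, and adding -v lists each example as it is checked.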

Deep Dive

Deleting characters that match a pattern has deep roots in computer science: the operation goes back to early Unix tools such as sed, and the regular expressions behind it were popularized by tools such as grep. In Python, the re module provides this capability through regular expressions, a powerful and versatile tool for text processing.

Alternatives to the re module include:

  • String methods like replace() for simple cases.
  • Third-party libraries like regex for more complex patterns and better Unicode support.
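For the simple cases, the built-in string methods can look like the sketch below. replace() is the method named above; translate() with str.maketrans() is another standard string method, shown here as one possible option rather than a recommendation.

```python
import string

text = "Hello, World! 1234"

# replace() deletes every occurrence of one literal substring.
no_commas = text.replace(",", "")
print(no_commas)  # Output: "Hello World! 1234"

# translate() deletes a whole set of characters in one pass.
# The third argument of str.maketrans() lists the characters to drop.
no_digits = text.translate(str.maketrans("", "", string.digits))
print(no_digits)  # Output: "Hello, World! "
```

Neither method understands patterns, so anything beyond fixed characters or substrings still calls for re.sub().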

Under the hood, re.sub() first compiles the pattern. In CPython, the re module turns it into an internal sequence of opcodes, which a backtracking matching engine written in C then runs against the input text. Recently used patterns are cached, so repeated calls with the same pattern string do not recompile it. Matching can still be expensive for very large strings or for patterns that backtrack heavily, so measure before relying on it in performance-sensitive code.
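Because the pattern has to be compiled before it can be matched, one common habit is to compile it once and reuse the resulting object. The sketch below assumes a small list of sample lines invented for illustration. The re module already caches recently used patterns, so the saving is usually modest and worth measuring.

```python
import re

# Compile once, outside the loop, and reuse the pattern object.
DIGITS = re.compile(r"\d")

# Hypothetical sample data, made up for this example.
lines = ["order 66", "room 101", "no numbers here"]

# Pattern.sub() behaves like re.sub() without the per-call pattern lookup.
cleaned = [DIGITS.sub("", line) for line in lines]
print(cleaned)  # Output: ['order ', 'room ', 'no numbers here']
```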

See Also