## Python: Word filtering and regex

## The Setup

Someone posed the question to me yesterday “given a list of all the permutations of vulgarities, how would you filter a relatively short message (say, 200 characters long) for one of those permutations? This is time-sensitive, imagine on the order of potentially hundreds of checks a second.”

Asking for a bit more detail, I was told this list (array for you dirty non-python folk) was “over a million lines long”. I had two immediate thoughts:

- That is a terrible representation of those permutations. At least move to a basic trie and cut out (most) of your repetition. Huge space savings if you’ve got similar words (likely, since they’re small variations of the same letters) and great savings in inclusion checks- instead of , where is the number of permutations, and is the length of the longest permutation.
- Math senses tingling, primarily from combinatorics class. There is NO WAY that this list is on the order of “millions”. Since we can have multiple delimiters between characters, even conservative estimates for permutations of single words with restrictive assumptions and a small delimiter character set put us into the billions. For one word. Not buying it.

I mentioned that I’d probably first change how we represented the permutations, but he insisted that we consider the rest of our algorithm first. Fair enough.

## Naive Solution

I heard permutation and thought regular expressions, but I remembered this quote and since I’ve never written a regular expression before, decided I’d keep them as plan B. This sounded like a variable-length sliding window check (similar to the work from this stats post) so after thinking about that for a bit, I went with that.

*[Note: This isn’t what I had originally- my quick answer had some serious index errors that I’m ashamed to mention here. Quick testing (2 minutes) and I fixed those silly things up. I’ve also modified it to take in a list of filter words, so that it doesn’t need some global somewhere. *shudder*]*

def is_profane(msg, profanity_list): nchars = len(msg) for window_size in xrange(1,nchars+1): for start_index in xrange(nchars-window_size+1): check_msg = msg[start_index:start_index+window_size] if check_msg in profanity_list: return True return False

The complexity of that is on the order of with a nice coefficient, so not too bad for . The killer here is the check against the list of permutations, pushing us to , where is the number of permutations. So for “a million” permutations we go from 40,000 to 40,000,000,000, or 40 billion. Not… good. Never mind that “a million” is low for a decent list of permutations, and we’ve still got all that nasty substring and looping in there.

## Slightly Better- Tries

So how do we cut down on p? Well, going back to my earlier suggestion, we use a trie. See this post for a quick rundown on tries. Effectively, we now limit our p factor to k, where k is the length of the longest permutation. Just to be fair, we can even say that we have some crazy permutation that is the maximum message length, n. Now we’re at , which brings us back to 8,000,000 or 8 million. Better by 4 orders of magnitude. Just by moving to a trie! And we cut down on space, so that’s even better.

## Sizing up our problem space

### Assumptions

Now, to the second point- realistic permutation counts. When I first thought about how to better check this, I knew he was probably thinking regular expressions. I don’t know these well (actually, I learned them for this post) so I didn’t want to suggest them, but until I actually did out an example permutation count I didn’t realize how big of a difference we were talking about. So to demonstrate why I think they’re actually necessary for this problem, here we go.

Say that “doggy” is a bad word we want to filter. We’re going to make the following assumptions:

- People will use anywhere between 0 and 3 “trash” characters between each letter, and won’t always use the same number between letters. So if “.” is a trash character, we could see these:

doggy |

d.og.gy |

d.o..g…g.y |

…252 others… |

dog…gy |

- (This is a conservative count) There are only 6 different characters that people will use as “trash”, and they are: [” “, “.”, “,”, “*”, “^”, “|”] Which in text is: space, period, comma, asterisk, carrot, pipe.

- To keep the word recognizable, people will always keep the first and last letters in their spot, but will shuffle the middle letters. For doggy, these are:

doggy |

dgogy |

dggoy |

- And of course, for each of these they will put a variable number of various trash characters between some or all letters.

### Separator Combinations

Let’s grab an estimate for the number of possibilities.

To start, we need to know how many ways they can place spacers. If they will use between 0 and 3 spacers between each character, then we have choices- each space can have 4 different “trash counts” and there are 4 spaces, and order does matter to us (so we’re looking at permutations, not combinations).

so we have d[0-3trash]o[0-3trash]g[0-3trash]g[0-3trash]y options. Now, let’s say we only look at the case with 3 trash characters between each letter.

so we have d___o___g___g___y. We want to catch both of the following:

d.,.o…g…g…y

d,..o…g…g…y

so order once again matters. For each position we have 6 choices, so we have:

6*6*6 between d and o (or )

6*6*6 between o and g (or )

6*6*6 between g and g (or )

6*6*6 between g and y (or )

and from that, we have or .

6^12 = 2,176,782,336

Remember now, that we had 3 ways to arrange the characters in doggy, so we multiply by 3 and get:

6,530,347,008 permutations that have 3 separators between each letter.

That’s 1 of 256 () ways to arrange our separators in the word, and while every other arrangement of separators will give us less than ~2.1billion options, they’re still quite high. We’ll break the 10s of billions with our next grouping, and I think it’s not out of the question to hit 100s of billions, but the point is that we’re already at 6.5billion, for one word, for 1 method of placing 6 separators, which again, is a very conservative count for trash characters.

## Regular (gulp) Expressions

So… very cautiously, I turn to regular expressions, and came up with the following after reviewing the docs on my phone. This should satisfy all 256 combinations of separators, for one shuffling of the letters in doggy (so 1/3 of our cases):

group = “.*|/, ”

d[group]{0,3}o[group]{0,3}g[group]{0,3}g[group]{0,3}y

d[group]{0,3}g[group]{0,3}o[group]{0,3}g[group]{0,3}y

d[group]{0,3}g[group]{0,3}g[group]{0,3}o[group]{0,3}y

Now, you’d actually need to replace the word group with the values in group, so it’s going to be a lot uglier, but this is nice and easy to automate- we just create a function to build regular expressions for each permutation of letter shuffling!

## New Complexity Analysis

As far as the original question, this drops our complexity to (another huge leap) where is the number of regular expressions (assuming the cost of doing a pattern.search(msg) is trivial with respect to the iteration over the list of regular expressions). For the word ‘doggy’ we’re now doing three checks. I think this one’s pretty much decided.

# Code

First, a small helper for getting permutations with repeated values:

import itertools def permutations_with_repetition(iterable, r=None): for unique in set(itertools.permutations(iterable, r)): yield unique

And now something to inject a substring between each character of a string. There may be a better way to do this, but it’s a one-off cost if our regex permutations don’t change:

def inject_substring(string, substring): sections = [] for char in string: sections.append(char) sections.append(substring) sections.pop() #Last substring isn't needed return "".join(sections)

A generator that yields permutations of a string with the first and last characters held constant:

def anchored_permutation(string): if len(string) < 3: yield [string] else: first = string[0] last = string[-1] for permutation in permutations_with_repetition(string[1:-1]): permute_str = "".join(permutation) yield "".join((first, permute_str, last))

And here’s how we generate a list of regexp permutations for a string with a given set of optional delimeters numbering between 0 and max_delim_count:

import re def regex_permutations(string, delims, max_delim_count): sub_pattern = "[{d}]{{0,{m}}}".format(d=delims,m=max_delim_count) for string_permutation in anchored_permutation(string): yield inject_substring(string_permutation, sub_pattern) def compiled_permutations(string, delims, max_delims): permutations = regex_permutations(string, delims, max_delims) for permutation in permutations: yield re.compile(permutation)

## Testing

Here’s some sample usage, just to demonstrate how nasty the regexps would have been to write for our test case. Note that we’d use the pattern objects in code, but for display purposes I’m just printing the regexp.

>>> string = "doggy" >>> delims = r"., *|/&" >>> max_delims = 3 >>> for pattern in regex_permutations(string, delims, max_delims): print pattern d[., *|/]{0,3}g[., *|/]{0,3}g[., *|/]{0,3}o[., *|/]{0,3}y d[., *|/]{0,3}o[., *|/]{0,3}g[., *|/]{0,3}g[., *|/]{0,3}y d[., *|/]{0,3}g[., *|/]{0,3}o[., *|/]{0,3}g[., *|/]{0,3}y

And now, we’ll throw them in a list and test it out!

string = "doggy" delims = r"., *|/" max_delims = 3 bad_patterns = list(compiled_permutations(string, delims, max_delims)) msg_good = "Hello, I am a dog!" msg_bad = "Yo, I am a bad d.o./gg** y, how you doin'." for pattern in bad_patterns: if pattern.search(msg_good): print "Failed good test!" break else: print "Good test succeeded!" for pattern in bad_patterns: if pattern.search(msg_bad): print "Bad test succeeded!" break else: print "Failed bad test!"

# Full Source

The full source is available below, or here.

import re import itertools def inject_substring(string, substring): sections = [] for char in string: sections.append(char) sections.append(substring) sections.pop() return "".join(sections) def permutations_with_repetition(iterable, r=None): for unique in set(itertools.permutations(iterable, r)): yield unique def anchored_permutation(string): if len(string) < 3: yield [string] else: first = string[0] last = string[-1] for permutation in permutations_with_repetition(string[1:-1]): permute_str = "".join(permutation) yield "".join((first, permute_str, last)) def regex_permutations(string, delims, max_delims): sub_pattern = "[{d}]{{0,{m}}}".format(d=delims,m=max_delims) for string_permutation in anchored_permutation(string): yield inject_substring(string_permutation, sub_pattern) def compiled_permutation(string, delims, max_delims): permutations = regex_permutations(string, delims, max_delims) for permutation in permutations: yield re.compile(permutation)

# Update:

I added two functions this morning on the train- one generates the list of patterns from a list of words, and the other checks if a message contains profanity for a given list of patterns:

def generate_pattern_list(wordlist, delims, max_delims): patternlist = [] for word in wordlist: for pattern in compiled_permutations(word, delims, max_delims): patternlist.append(pattern) return patternlist def is_profane(msg, patternlist): for pattern in patternlist: matches = pattern.search(msg) if matches is not None: return True return False

## Leave a Reply