Autocompleting Regular Expressions

When designing forms to collect custom data types, you’ll sometimes have a regular expression for what you’re trying to collect, and for easy visual spacing you’d like to automatically inject separators into the form where appropriate, like this example for phone numbers:

>>> user_input = "415"
>>> phone_number = "\\d{3}-\\d{3}-\\d{4}"
>>> append_separators(user_input, regex=phone_number)
'415-'

But how do we implement append_separators without losing the regular expression as the single source of truth? There are existing implementations for phone numbers and credit cards, but what about company TIN numbers? Social security numbers? More exotic inputs? The closest Python libraries I’ve found for this are Exrex and Greenery.

Exrex allows for randomly sampling strings that match a given regular expression; hex color codes, for example:

>>> from exrex import getone
>>> color_hex = "#(?:[0-9a-fA-F]{3}){1,2}"
>>> getone(color_hex)
'#1D7EBb'

Greenery allows for applying logical operations to regular expressions:

>>> from greenery import parse
>>> five_plus_chars = ".{5,}"
>>> str(parse(five_plus_chars) & parse(color_hex))
'#[0-9A-Fa-f]{6}'

Considering that Greenery works by converting regular expressions to finite state machines, computing the intersection of the two FSMs as a third FSM, and using the Brzozowski algebraic method (q.v.) to convert the third FSM back to a regular expression¹, re-implementing this to find where a separator can be appended would seem much higher lift than necessary to achieve the desired effect.

But if you can randomly sample from strings satisfying a regular expression:

>>> for _ in range(10):
...     print(getone(phone_number))
... 
237-178-8887
264-194-8260
945-873-3099
515-809-7348
112-255-4732
936-552-0686
422-686-3399
313-372-8544
309-291-8909
159-619-4738

Then you can see from the samples which slots are always the same. Would it be possible to solve this probabilistically?

# Generate 1000 strings satisfying this regular expression
options = []
for _ in range(1000):
    options.append(getone(phone_number))

# Find all characters that appear at each index
seen_by_index: dict[int, set] = {}
for option in options:
    for i, char in enumerate(option):
        # Add a new set if we don't yet have one for this index
        if i not in seen_by_index:
            seen_by_index[i] = set()

        seen_by_index[i].add(char)

# Print seen characters for each character position
for i, char_set in seen_by_index.items():
    print(f"Character {i}: {sorted(char_set)}")

Output:

Character 0: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Character 1: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Character 2: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Character 3: ['-']
Character 4: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Character 5: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Character 6: ['-']
Character 7: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Character 8: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Character 9: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Character 10: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

Let’s call a character position unique if there is only one character allowed in that position.

Given this is a probabilistic algorithm, let’s establish some bounds for error rates. Which error modes are possible? Type I error would be a character position appearing to be unique, while actually having additional possibilities which didn’t appear due to random chance. Type II error can be safely disregarded, as there’s no circumstance in which a truly unique character position would have other characters appear in a string without violating the regular expression. No other error mode is possible, so long as the regex doesn’t allow for variable-length strings, which we’ll address shortly.

The more characters are valid for a given position, the less likely each is to appear, so a false positive is most likely when the position allows the fewest possible characters: exactly 2, since 1 would make the position truly unique.
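This worst case can be sanity-checked at a smaller scale: with 10 flips instead of 100, the chance of all flips showing the same face is 2 × (1/2)^10 = (1/2)^9, about 0.2%, which a quick simulation reproduces:

```python
import random

random.seed(0)  # deterministic for the example

trials = 200_000
flips_per_trial = 10

# Count trials where all 10 flips landed on the same face
all_same = sum(
    len(set(random.choices("ab", k=flips_per_trial))) == 1 for _ in range(trials)
)
rate = all_same / trials
print(f"observed {rate:.4%}, predicted {2 * 0.5 ** flips_per_trial:.4%}")
```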

If we flip a digital coin 100 times, the probability that a specific one of its two faces never appears across all 100 flips, assuming both faces are equally likely, is 1/2^100, or:

0.000000000000000000000000000000789

(Either of the two faces could be the unseen one, doubling this probability, but not by enough to matter at this scale.) With the benchmark set at 7.9e-31, or one false positive per 1.3e30 flips: if the entire human population of Earth were to flip coins once a second for 14 billion years, the approximate current age of the universe,² we’d accumulate 3.5e27 flips, and a roughly 0.3% chance of a single false positive.
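The back-of-the-envelope numbers check out in a few lines (population and year length rounded):

```python
population = 8_000_000_000              # roughly everyone on Earth
seconds = 14_000_000_000 * 31_536_000   # 14 billion years, in seconds
flips = population * seconds            # about 3.5e27

p_false_positive = 0.5 ** 100           # about 7.9e-31 per coin
expected = flips * p_false_positive     # about 0.0028, i.e. roughly 0.3%
print(f"{flips:.1e} flips, {expected:.2%} chance of a false positive")
```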

For some extra safety, we can flip 1000 times, for a false positive probability of 1/2^1000, or one false positive per 1e301 flips. At that point, the all-humans-every-second-since-the-big-bang flip count has no meaningful impact on the false positive probability, and an arbitrary-precision calculator I used to test this just rounded it to 0.³
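Python’s arbitrary-precision integers can confirm the 1e301 figure exactly:

```python
flips_per_false_positive = 2 ** 1000  # exact big-integer arithmetic

# 302 digits puts this at roughly 1.07e301
print(len(str(flips_per_false_positive)))  # 302
print(str(flips_per_false_positive)[:4])   # 1071
```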

So for our complete algorithm:

from exrex import getone

class SeparatorAppender:
    def __init__(self, regex: str):
        self.regex = regex
        self.index = self._index_separators()

    def append_separators(self, user_input: str) -> str:
        to_append = self.index.get(len(user_input), "")
        return user_input + to_append

    def _index_separators(self) -> dict[int, str]:
        # Generate 1000 strings satisfying this regular expression
        options = []
        for _ in range(1000):
            options.append(getone(self.regex))
        
        # Find the allowed character set at each index
        seen_by_index: dict[int, set] = {}
        for option in options:
            for i, char in enumerate(option):
                # Add a new set if we don't yet have one for this index
                if i not in seen_by_index:
                    seen_by_index[i] = set()
        
                seen_by_index[i].add(char)
        
        # Build the index-to-separator mapping wherever only one character is allowed
        separators_by_index = {
            i: list(charset)[0] for i, charset in seen_by_index.items() if len(charset) == 1
        }
        
        return separators_by_index
    

phone_number = "\\d{3}-\\d{3}-\\d{4}"
appender = SeparatorAppender(phone_number)
print(appender.append_separators("415"))

Output:

415-

Finally, handling regular expressions that match strings of varying length. For this, we can bring in Greenery, simply merging the existing user input into the regular expression:

>>> import re
>>> user_input = "415"
>>> user_input_as_regex = re.escape(user_input) + ".*"
>>> str(parse(phone_number) & parse(user_input_as_regex))
'415-\\d{3}-\\d{4}'

If you can’t definitively say whether a character position is unique until after the character preceding it is entered, then the corresponding character can’t be appended anyway. So by accounting for the existing user input, we’ve gone as far as we can without inserting symbols between already-entered characters.
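As a stdlib-only sanity check, re can verify (though not compute) the intersection: strings matching the combined pattern are exactly those that match the original regex and start with the user input. The combined pattern below is the one Greenery produced above:

```python
import re

phone_number = "\\d{3}-\\d{3}-\\d{4}"
combined = "415-\\d{3}-\\d{4}"  # the Greenery intersection from above

# Matches the prefix and the original regex, so it matches the intersection
assert re.fullmatch(combined, "415-555-0123")

# Matches the original regex but not the prefix, so it fails the intersection
assert re.fullmatch(phone_number, "212-555-0123")
assert not re.fullmatch(combined, "212-555-0123")
```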

import re
from exrex import getone
from greenery import parse

class SeparatorAppender:
    def __init__(self, regex: str):
        self.regex = regex

    def append_separators(self, user_input: str) -> str:
        index = self._index_separators(user_input)
        to_append = index.get(len(user_input), "")
        return user_input + to_append

    def _index_separators(self, user_input: str) -> dict[int, str]:
        # Combine the user input with the regex, as in the REPL example above
        user_input_as_regex = re.escape(user_input) + ".*"
        regex = str(parse(self.regex) & parse(user_input_as_regex))
        
        # Generate 1000 strings satisfying this regular expression
        options = []
        for _ in range(1000):
            options.append(getone(regex))
        
        # Find the allowed character set at each index
        seen_by_index: dict[int, set] = {}
        for option in options:
            for i, char in enumerate(option):
                # Add a new set if we don't yet have one for this index
                if i not in seen_by_index:
                    seen_by_index[i] = set()
        
                seen_by_index[i].add(char)
        
        # Build the index-to-separator mapping wherever only one character is allowed
        separators_by_index = {
            i: list(charset)[0] for i, charset in seen_by_index.items() if len(charset) == 1
        }
        
        return separators_by_index


phone_number = "\\d{3}-\\d{3}-\\d{4}"
appender = SeparatorAppender(phone_number)
print(appender.append_separators("415"))

Output:

415-

Note that we unfortunately lose the ability to pre-compute our index for the regular expression, so this approach carries a relatively significant performance cost. How much?

>>> phone_number = "\\d{3}-\\d{3}-\\d{4}"
>>> appender = SeparatorAppender(phone_number)
>>> %timeit appender.append_separators("415")
65.9 ms ± 3.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

And for the earlier version, which doesn’t intersect in the existing user input to handle variable-length inputs:

>>> phone_number = "\\d{3}-\\d{3}-\\d{4}"
>>> appender = SeparatorAppender(phone_number)
>>> %timeit appender.append_separators("415")
75.9 ns ± 0.721 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Reducing the number of samples for the original regular expression to 100, tolerating that 0.3% chance of a single false positive in 14 billion years of all-humanity coin flipping, helps only modestly:

>>> phone_number = "\\d{3}-\\d{3}-\\d{4}"
>>> appender = SeparatorAppender(phone_number)
>>> %timeit appender.append_separators("415")
47.9 ms ± 2.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Probably better to keep it to fixed-length strings.


  1. This is actually an exact quote. ↩︎

  2. I say current age because by the time you read this, the universe may be significantly older. ↩︎

  3. Really, this should be a count of false advertising against the “arbitrary precision” calculator. ↩︎
