When designing forms to collect custom data types, you’ll sometimes have a regular expression describing what you’re trying to collect, and for easy visual spacing you’d like to automatically inject separators into the form where appropriate, as in this example for phone numbers:
>>> user_input = "415"
>>> phone_number = "\\d{3}-\\d{3}-\\d{4}"
>>> append_separators(user_input, regex=phone_number)
'415-'
But how do you implement append_separators without losing the regular expression as the single source of truth? There are existing implementations for phone numbers and credit cards, but what about company TIN numbers? Social security numbers? More exotic inputs? The closest Python libraries I’ve found to solving this are Exrex and Greenery.
Exrex allows for randomly sampling strings matching a given regular expression, for example hex color codes:
>>> from exrex import getone
>>> color_hex = "#(?:[0-9a-fA-F]{3}){1,2}"
>>> getone(color_hex)
'#1D7EBb'
Greenery allows for applying logical operations to regular expressions:
>>> from greenery import parse
>>> five_plus_chars = ".{5,}"
>>> str(parse(five_plus_chars) & parse(color_hex))
'#[0-9A-Fa-f]{6}'
Greenery works by converting regular expressions to finite state machines, computing the intersection of the two FSMs as a third FSM, and using the Brzozowski algebraic method to convert the third FSM back to a regular expression.¹ Re-implementing all of that just to find where a separator can be appended would be a much higher lift than necessary to achieve the desired effect.
But if you can randomly sample from strings satisfying a regular expression:
>>> for _ in range(10):
... print(getone(phone_number))
...
237-178-8887
264-194-8260
945-873-3099
515-809-7348
112-255-4732
936-552-0686
422-686-3399
313-372-8544
309-291-8909
159-619-4738
Then you can see from the samples which slots in the generated output are always the same. Would it be possible to solve this probabilistically?
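A minimal sketch of that sampling approach, assuming only a zero-argument sampler callable that returns strings matching the pattern (with Exrex installed, `lambda: getone(phone_number)` would serve; a stdlib stand-in is included here so the sketch runs on its own):

```python
import random
from collections import defaultdict

def index_positions(sample, rounds=1000):
    """Record every character observed at each position across
    `rounds` random samples of the pattern."""
    seen = defaultdict(set)
    for _ in range(rounds):
        for i, ch in enumerate(sample()):
            seen[i].add(ch)
    return {i: sorted(chars) for i, chars in seen.items()}

# Stdlib stand-in for exrex.getone on the phone-number pattern,
# purely so the sketch is self-contained:
def sample_phone():
    digits = lambda n: "".join(random.choice("0123456789") for _ in range(n))
    return f"{digits(3)}-{digits(3)}-{digits(4)}"

for i, chars in sorted(index_positions(sample_phone).items()):
    print(f"Character {i}: {chars}")
```
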
Output:
Character 0: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Character 1: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Character 2: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Character 3: ['-']
Character 4: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Character 5: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Character 6: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Character 7: ['-']
Character 8: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Character 9: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Character 10: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Character 11: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Let’s call a character position unique if there is only one character allowed in that position.
Since this is a probabilistic algorithm, let’s establish some bounds on error rates. Which error modes are possible? A Type I error would be a character position appearing to be unique while actually having additional possibilities that didn’t appear due to random chance. Type II error can be safely disregarded: there’s no circumstance in which a truly unique character position would have other characters appear in a string without violating the regular expression. No other error mode is possible, so long as the regex doesn’t allow variable-length strings, which we’ll address shortly.
The more characters are valid for a given character position, the less likely it is that all but one of them fail to appear across every sample, so a false positive is most likely when a position allows the fewest possible characters: specifically 2, since 1 would make the position truly unique.
If we flip a digital coin 100 times, the probability that every flip comes up heads (assuming both faces are equally likely) is 1/2^100, or:
0.000000000000000000000000000000789
With the benchmark thus set at 7.9e-31, or one false positive in 1.3e30 flips: if the entire human population of the Earth were to flip coins once a second for 14 billion years, the approximate current age of the universe,² we’d have 3.5e27 flips, giving a 0.3% chance of a single false positive after 14 billion years.
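Those figures check out with some quick arithmetic (assuming a population of 8 billion and Julian years of 365.25 days):

```python
p = 2 ** -100                          # false-positive probability for one 100-flip run
seconds = 60 * 60 * 24 * 365.25 * 14e9  # seconds in ~14 billion years
flips = 8e9 * seconds                   # everyone flipping once per second

print(f"{p:.2g}")          # ~7.9e-31
print(f"{flips:.2g}")      # ~3.5e+27
print(f"{flips * p:.2%}")  # ~0.28%, i.e. roughly the 0.3% above
```
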
For some extra safety, we can flip 1000 times, for a false positive probability of 1/2^1000, or one false positive in roughly 1e301 flips. At that point the all-humans-every-second-since-the-big-bang flip count has no meaningful impact on the false positive probability; an arbitrary precision calculator I used to test this simply rounded it to 0.
So for our complete algorithm:
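Here is a sketch of what that complete algorithm might look like. The `SeparatorAppender` name matches the timing examples later in the post, but having the constructor take a sampler callable rather than the raw pattern is an assumption of this sketch (with Exrex, the sampler would be `lambda: getone(phone_number)`; a stdlib stand-in keeps it self-contained):

```python
import random

class SeparatorAppender:
    """Appends any characters that are forced to follow the input.

    `sample` is any zero-argument callable returning random strings
    matching the target pattern (exrex.getone, for example)."""

    def __init__(self, sample, rounds=1000):
        # Pre-compute which characters were ever observed at each position.
        self.allowed = {}
        for _ in range(rounds):
            for i, ch in enumerate(sample()):
                self.allowed.setdefault(i, set()).add(ch)

    def append_separators(self, user_input):
        result = user_input
        # While only one character was ever seen at the next position,
        # that position is unique, so append its character.
        while len(self.allowed.get(len(result), ())) == 1:
            (ch,) = self.allowed[len(result)]
            result += ch
        return result

# Stdlib stand-in sampler so the sketch runs without exrex:
def sample_phone():
    d = lambda n: "".join(random.choice("0123456789") for _ in range(n))
    return f"{d(3)}-{d(3)}-{d(4)}"

appender = SeparatorAppender(sample_phone)
print(appender.append_separators("415"))  # prints '415-'
```
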
Output:
415-
Finally, handling regular expressions matching strings of varying length. For this we can bring in Greenery, by simply intersecting the existing user input with the regular expression:
>>> import re
>>> user_input = "415"
>>> user_input_as_regex = re.escape(user_input) + ".*"
>>> str(parse(phone_number) & parse(user_input_as_regex))
'415-\\d{3}-\\d{4}'
If you can’t definitively say whether a character position is unique until after the character preceding it is entered, then the corresponding character can’t be appended anyway. So by accounting for the existing user input, we’ve gone as far as we can without inserting symbols between already-entered characters.
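A sketch of the per-call pipeline for the variable-length case. Both collaborators are injected so the sketch stays runnable without the third-party packages: with Greenery installed, `intersect` would be `lambda a, b: str(parse(a) & parse(b))`, and `positions` is any function that samples a regex and maps each character position to the characters observed there, as described above:

```python
import re

def append_separators(pattern, user_input, positions, intersect):
    """One call of the variable-length algorithm.

    intersect(a, b) -- regex intersection; with greenery installed,
        lambda a, b: str(parse(a) & parse(b))
    positions(regex) -- maps each character position to the characters
        observed there, built by random sampling.
    """
    # Constrain the pattern to strings extending what was typed.
    constrained = intersect(pattern, re.escape(user_input) + ".*")
    allowed = positions(constrained)
    result = user_input
    # Append characters while the next position is unique.
    while len(allowed.get(len(result), ())) == 1:
        (ch,) = allowed[len(result)]
        result += ch
    return result
```

Because the intersection depends on the user input, the sampling index has to be rebuilt on every call, which is where the performance cost measured below comes from.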
Output:
415-
Note that we unfortunately lose the ability to pre-compute our index for the regular expression, so this approach carries a relatively significant performance cost. How much?
>>> phone_number = "\\d{3}-\\d{3}-\\d{4}"
>>> appender = SeparatorAppender(phone_number)
>>> %timeit appender.append_separators("415")
65.9 ms ± 3.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
And for the version that doesn’t intersect the existing user input into the pattern to handle variable-length inputs:
>>> phone_number = "\\d{3}-\\d{3}-\\d{4}"
>>> appender = SeparatorAppender(phone_number)
>>> %timeit appender.append_separators("415")
75.9 ns ± 0.721 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
If you try reducing the number of rounds on the original regular expression to 100, tolerating that 0.3% chance of a single false positive in 14 billion years with the entire human population of Earth flipping a coin per second:
>>> phone_number = "\\d{3}-\\d{3}-\\d{4}"
>>> appender = SeparatorAppender(phone_number)
>>> %timeit appender.append_separators("415")
47.9 ms ± 2.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Probably better to keep it to fixed-length strings.