IntXeger (pronounced "integer") is a Python library for generating strings from regular expressions. Some of its core features include:
- Support for most common regular expression operations.
- Array-like indexing for mapping integers to matching strings.
- Generator interface for sequentially sampling matching strings.
- Sampling-without-replacement for generating a set of unique strings.
Compared to popular alternatives such as xeger and exrex, IntXeger is an order of magnitude faster at generating strings and offers unique functionality such as array-like indexing and sampling-without-replacement.
You can install the latest stable release of IntXeger by running:
pip install intxeger
Let's start with a simple example where our regex specifies a two-character string that only contains lowercase letters.
import intxeger
x = intxeger.build("[a-z]{2}")
You can check the number of strings that can be generated from this regex using the `length` attribute and generate the `i`-th matching string using the `get(i)` method.
assert x.length == 26**2 # there are 676 unique strings which match this regex
assert x.get(15) == 'ap' # the string at index 15 (counting from 0) is 'ap'
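To build some intuition for array-like indexing, here is a minimal sketch of how an integer could be mapped to a string for `[a-z]{2}`, by treating the index as a two-digit base-26 number. The helper `index_to_string` is hypothetical and written only for illustration; it is not IntXeger's API and may not reflect how the library works internally, although it does reproduce the `get(15) == 'ap'` result above.

```python
import string

ALPHABET = string.ascii_lowercase  # the 26 choices for each position in [a-z]


def index_to_string(i, width=2):
    """Decode integer i into a `width`-character string (hypothetical helper).

    The index is read as a base-26 number whose most significant digit
    selects the first character.
    """
    if not 0 <= i < len(ALPHABET) ** width:
        raise IndexError("index out of range")
    chars = []
    for _ in range(width):
        # Peel off the least significant digit first, then reverse at the end.
        i, digit = divmod(i, len(ALPHABET))
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


print(index_to_string(15))   # ap
print(index_to_string(26))   # ba
print(index_to_string(675))  # zz
```

Because every index in `range(26**2)` decodes to a different string, the mapping covers all 676 matches exactly once.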
Furthermore, you can generate `N` unique strings which match this regex using the `sample(N)` method. Note that `N` must be less than or equal to the length.
print(x.sample(N=10))
# ['xt', 'rd', 'jm', 'pj', 'jy', 'sp', 'cm', 'ag', 'cb', 'yt']
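One way to see why sampling-without-replacement pairs naturally with integer indexing: if every matching string has a distinct index, drawing unique indices yields unique strings. The sketch below illustrates that idea for `[a-z]{2}` using only the standard library. The names `nth_string` and `sample_unique` are hypothetical, and this is an assumption about the general approach rather than a description of IntXeger's implementation.

```python
import random
import string

ALPHABET = string.ascii_lowercase
LENGTH = len(ALPHABET) ** 2  # 676 strings match [a-z]{2}


def nth_string(i):
    """Map an index in range(LENGTH) to a two-letter string (hypothetical helper)."""
    first, second = divmod(i, len(ALPHABET))
    return ALPHABET[first] + ALPHABET[second]


def sample_unique(n, rng=random):
    """Return n distinct matching strings by sampling n distinct indices."""
    if n > LENGTH:
        raise ValueError("n must be less than or equal to the number of matches")
    # random.sample never repeats an element of the population,
    # so the decoded strings are guaranteed to be unique.
    indices = rng.sample(range(LENGTH), n)
    return [nth_string(i) for i in indices]


print(sample_unique(10))
```

This also shows why `N` cannot exceed the length: there are only that many distinct indices to draw from.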
Here's a more complicated regex which specifies a timestamp.
x = intxeger.build(r"(1[0-2]|0[1-9])(:[0-5]\d){2} (A|P)M")
print(x.sample(N=2))
# ['11:57:12 AM', '01:16:01 AM']
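As a sanity check on how many strings this regex can produce, the count can be worked out by hand: 12 hour values, 60 × 60 minute and second values, and 2 period markers give 12 × 60 × 60 × 2 = 86400 timestamps. The sketch below verifies that arithmetic with the standard library alone, assuming `\d` is restricted to the ASCII digits 0-9. It does not call IntXeger, so treat the figure as the expected count under that assumption rather than a confirmed value of `x.length`.

```python
import itertools
import re

PATTERN = re.compile(r"(1[0-2]|0[1-9])(:[0-5]\d){2} (A|P)M")

hours = [f"{h:02d}" for h in range(1, 13)]    # 01..12 -> 12 values
min_sec = [f"{v:02d}" for v in range(60)]     # 00..59 -> 60 values
periods = ["AM", "PM"]                        # 2 values

# Enumerate every combination and format it as a timestamp.
timestamps = [
    f"{h}:{m}:{s} {p}"
    for h, m, s, p in itertools.product(hours, min_sec, min_sec, periods)
]

# Every enumerated string matches the regex, and all of them are distinct.
assert all(PATTERN.fullmatch(t) for t in timestamps)
assert len(set(timestamps)) == len(timestamps)

print(len(timestamps))  # 86400
```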
You can also print matches on the command line.
$ intxeger --order=desc "[a-c]"
c
b
a
$ python3 -m intxeger -0 'base/[ab]/[12]' | xargs -0 mkdir -p
$ tree base/
base
├── a
│   ├── 1
│   └── 2
└── b
    ├── 1
    └── 2
To learn more about the functionality provided by IntXeger, check out our documentation!
This table, generated by `benchmark.py`, shows the amount of time in milliseconds required to generate `N` examples of each regular expression using `xeger`, `exrex`, and `intxeger`.
regex | N | xeger | exrex | intxeger |
---|---|---|---|---|
[a-zA-Z]+ | 100 | 7.36 | 3.17 | 1.09 |
[0-9]{3}-[0-9]{3}-[0-9]{4} | 100 | 11.59 | 6.25 | 0.8 |
[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4} | 1000 | 208.62 | 91.3 | 18.28 |
/json/([0-9]{4})/([a-z]{4}) | 1000 | 133.36 | 107.01 | 12.18 |
Have a regular expression that isn't represented here? Check out our Contributing Guide and submit a pull request!
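If you want to time a regex of your own before contributing it, a small harness along the following lines may help. This is a minimal sketch, not the project's `benchmark.py`, whose method may differ. The function `time_ms` and the stand-in generator `fake_generate` are hypothetical names introduced here for illustration.

```python
import random
import string
import time


def time_ms(generate, n, repeats=5):
    """Return the best wall-clock time in milliseconds for n calls to generate().

    `generate` is any zero-argument callable that produces one string.
    The fastest of `repeats` runs is reported to reduce timing noise.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(n):
            generate()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        best = min(best, elapsed_ms)
    return best


def fake_generate():
    """Stand-in generator (hypothetical): a random two-letter lowercase string."""
    return "".join(random.choice(string.ascii_lowercase) for _ in range(2))


print(f"{time_ms(fake_generate, n=100):.2f} ms")
```

To compare libraries, you would pass a callable wrapping each library's generation call in place of `fake_generate`, using the same `n` for each.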