Our RegEx filter allows you to extract text data from your PDF documents based on regular expression. This advanced filter comes in handy if you know how the data field which you want to extract looks like, but you don't know where it is located inside the document, for example, tracking numbers with a specific format, a number following a specific label, etc.
The regular expression which you provide to this filter needs to follow the PCRE syntax. Some basic examples are shown below which should get you started. For more in-depth information, please have a look at the PCRE syntax documentation.
Example #1: Find all integer numbers which are between seven and nine digits long
/[0-9]{7,9}/
Example #2: Find all text tokens which consist of two letters followed by a dash symbol and seven digits:
/[A-Z]{2}-[0-9]{7}/
Next to writing down the actual pattern, a delimiter character needs to be placed at the start and the end of the expression. Common delimiter characters are / or #.
Instead of defining the character sets like shown in the example above, one can also use specific letters representing character classes (for example, \d is the same as [0-9]). Below we will list the most common PCRE syntax elements.
Base Character Classes
\w | Any “word” character (a-z 0-9 _) |
\W | Any non “word” character |
\s | Whitespace (space, tab CRLF) |
\S | Any non whitepsace character |
\d | Digits (0-9) |
\D | Any non digit character |
. | (Period) – Any character except newline |
Meta Characters
^ | Start of subject (or line in multiline mode) |
$ | End of subject (or line in multiline mode) |
[ | Start character class definition |
] | End character class definition |
| | Alternates, eg (a|b) matches a or b |
( | Start subpattern |
) | End subpattern |
\ | Escape character |
Quantifiers
n* | Zero or more of n |
n+ | One or more of n |
n? | Zero or one occurrences of n |
{n} | n occurrences exactly |
{n,} | At least n occurrences |
{,m} | At most m occurrences |
{n,m} | Between n and m occurrences (inclusive) |
A great online resource for creating and testing Regular Expressions is https://regex101.com/. Here you can insert your strings and test the expression to ensure it is correct and working.
Another fantastic resource that can be used to help create a Regular Expression is ChatGPT. You can use their AI software to ask for a pattern to meet your parameters, and it will go ahead and create that for you. You can also ask for modifications should you require them