How to parse repeating text blocks? – Docparser

In general, our PDF table extraction filter does a great job of extracting "table-like" data. In some cases, however, you are dealing with lists that are built up from complex text blocks. These text blocks can be spread over multiple lines and contain irregularities. The method described in this article will help you parse this kind of data.

Our 'Repeating Text Block' filter lets you parse tabular data which is spread over several lines. The filter provides several options for recognising such repeating blocks. Individual text blocks can be recognised in one of three ways:

by defining how many lines one text block has
by text patterns, e.g. if a line starts with a certain word
by separating them whenever there are empty lines
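
The three strategies above can be pictured with a short Python sketch. This is only an illustration of the idea, not Docparser's implementation: the function names and the sample lines are hypothetical, and the filter itself is set up in Docparser rather than in code.

```python
import re

def split_by_line_count(lines, n):
    """Group every n consecutive lines into one block."""
    return [lines[i:i + n] for i in range(0, len(lines), n)]

def split_by_pattern(lines, pattern):
    """Start a new block whenever a line begins with the given pattern."""
    blocks = []
    for line in lines:
        # Open a new block on a match, or if no block exists yet.
        if re.match(pattern, line) or not blocks:
            blocks.append([])
        blocks[-1].append(line)
    return blocks

def split_by_empty_lines(lines):
    """Close the current block whenever one or more empty lines appear."""
    blocks, current = [], []
    for line in lines:
        if line.strip():
            current.append(line)
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks

# Hypothetical sample: two items, each spread over two lines.
sample = ["Item A", "qty 2", "Item B", "qty 5"]
print(split_by_line_count(sample, 2))
print(split_by_pattern(sample, r"Item"))
print(split_by_empty_lines(["Item A", "qty 2", "", "Item B", "qty 5"]))
```

For this sample all three calls produce the same two blocks. On real documents the choice matters: a fixed line count breaks as soon as one block has an extra line, whereas a text pattern or empty-line separator tolerates blocks of varying length.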

To obtain good results, it is important that you remove all text before and after your list. You can do this by adding text filters such as 'Define Start Position' and 'Define End Position'.
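
To see why trimming matters, the following Python sketch mimics the effect of a start and an end position on a list of lines. It is a simplified illustration under stated assumptions, not how the 'Define Start Position' and 'Define End Position' filters are implemented: the function name, the marker strings and the fallback behaviour for missing markers are all hypothetical.

```python
def trim_to_list(lines, start_marker, end_marker):
    """Keep the lines after the first line containing start_marker,
    up to (not including) the next line containing end_marker.

    Assumption for this sketch: a missing start marker means "begin at
    the first line", a missing end marker means "run to the last line".
    """
    start = 0
    for i, line in enumerate(lines):
        if start_marker in line:
            start = i + 1
            break
    end = len(lines)
    for i in range(start, len(lines)):
        if end_marker in lines[i]:
            end = i
            break
    return lines[start:end]

# Hypothetical document: header and footer surround the actual list.
page = ["Invoice 123", "Items:", "Item A", "qty 2", "Item B", "qty 5", "Total: 10"]
print(trim_to_list(page, "Items:", "Total"))
```

Without this step, the header and footer lines would be grouped into text blocks of their own and show up as spurious rows in the parsed result.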

Once your text blocks are recognised correctly, you can further refine the parsed data with table filters, for example by splitting or removing columns.
