How to parse tables with complex layouts? – Docparser

Docparser offers various tools to extract table data from PDF and scanned documents. The easiest way to extract tables is to use our "Table Data" parsing rule preset or, if you are processing invoices, our "Line Items" preset. Both presets let you define the column boundaries of your table with a simple point & click interface. This method works great if you are dealing with simple tables where each table entry is represented by one row.

But what if your tables are nested or have a complex layout?

If you are dealing with nested tables where one entry is spread over several rows, as shown in the screenshot below, you need to apply a workaround to obtain a clean representation of your table data. Below we guide you through the process of setting up multiple parsing rules, each of which extracts a specific part of your table.

Example of a nested table structure in a PDF document

1/ Extract the "Main Rows" using the table extraction tool

First we create a parsing rule which extracts all "main rows" of the table. We do this by choosing the "Table Data" preset and then defining matching column boundaries.

Defining column boundaries using the Table Data parsing rule in Docparser

After confirming the column boundaries, we are presented with the entire document text split into columns based on the boundaries we just defined. What we need to do now is filter out all rows which are not "main rows". We do this by adding "Keep rows where ..." filters.

Applying filters to retain only the main rows in table extraction

The result is a clean representation of all "main rows" of the table. Let's save this parsing rule and continue.
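To make the idea behind the "Keep rows where ..." filters more concrete, here is a minimal Python sketch of the same logic applied outside of Docparser. It is an illustration only, not Docparser's implementation: the function names, the sample rows, and the rule "a main row has a numeric quantity in the second column" are all assumptions made for this example.

```python
from typing import Callable

Row = list[str]

def keep_rows_where(rows: list[Row], predicate: Callable[[Row], bool]) -> list[Row]:
    """Return only the rows for which the predicate is true (hypothetical helper)."""
    return [row for row in rows if predicate(row)]

def is_main_row(row: Row) -> bool:
    # Assumption for this example: main rows carry a numeric quantity in column 2.
    return len(row) > 1 and row[1].strip().isdigit()

# Made-up sample data: the whole document text split into three columns.
raw_rows = [
    ["Description", "Qty", "Price"],
    ["Widget A", "2", "10.00"],
    ["Nr. ABC12345678", "", "0.5 kg"],
    ["Widget B", "1", "7.50"],
    ["Nr. XYZ87654321", "", "1.2 kg"],
]

main_rows = keep_rows_where(raw_rows, is_main_row)
```

In this sketch the header row and the secondary rows are dropped because their second column is not a number, which mirrors what a suitable filter condition would do in the parsing rule editor.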

2/ Create additional parsing rules to get the secondary rows

The goal of this step is to extract all data fields which are not included in the main rows of the table. In our example these are the product number (e.g. Nr. ABC12345678) and the unit weights. Both of them are located in a secondary row.

Depending on your data, you have different choices to get the remaining data. You can for example create a parsing rule looking for strings matching certain patterns, a parsing rule using anchor keywords to look for repeating text values, or simply create another table extraction rule. In our example, creating another table extraction rule seems to be the easiest way.
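As a rough illustration of the pattern-matching option, the following Python sketch pulls product numbers out of plain text with a regular expression. It assumes that every product number follows the prefix "Nr." and consists of uppercase letters and digits, as in the example value above; the pattern, the function name, and the sample text are hypothetical and would need adjusting to your documents.

```python
import re

# Assumed format: "Nr." followed by optional whitespace and an alphanumeric code.
PRODUCT_NUMBER = re.compile(r"Nr\.\s*([A-Z0-9]+)")

def find_product_numbers(text: str) -> list[str]:
    """Return every code that follows an 'Nr.' prefix, in document order."""
    return PRODUCT_NUMBER.findall(text)

# Made-up sample text resembling a nested table.
sample_text = (
    "Widget A 2 10.00\n"
    "Nr. ABC12345678 0.5 kg\n"
    "Widget B 1 7.50\n"
    "Nr. XYZ87654321 1.2 kg"
)

product_numbers = find_product_numbers(sample_text)
```

A pattern-based rule like this is most useful when the value has a stable, recognizable shape but does not sit at a fixed column position.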

Extracting secondary rows of data using another table parsing rule

We repeat the same procedure as in the first step and add further "Keep rows where ..." filters in the second step of the parsing rule editor to clean up the returned data.

Filtering secondary rows to clean up extracted table data

3/ Create additional "Merge Fields" parsing rule (Optional)

We have now created two table parsing rules, each extracting certain parts of the nested table. As an optional final step, we can create an additional parsing rule to merge the two parsed tables. A "Merge Fields" parsing rule is useful, for example, if you want to use a webhook integration and loop over all table data in a single run.

Select the "Merge Fields" parsing rule preset and set the first filter to "Append Table Data Horizontally". This will result in one single table containing all your table data. In addition, you can add another filter to name your column headers. Save this parsing rule and you are done!

Merging data from multiple parsing rules using Merge Fields in Docparser
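Conceptually, appending two tables horizontally means joining them row by row. The Python sketch below shows one way to picture this; it is not Docparser's implementation. The function names, the sample rows, and the column headers are made up, and the sketch assumes the two tables are paired by position and therefore must have the same number of rows.

```python
Row = list[str]

def append_horizontally(left: list[Row], right: list[Row]) -> list[Row]:
    """Join two tables side by side, pairing rows by position (hypothetical helper)."""
    if len(left) != len(right):
        # Assumption: positional pairing only makes sense for equal row counts.
        raise ValueError("Tables must have the same number of rows")
    return [l + r for l, r in zip(left, right)]

def name_columns(rows: list[Row], headers: list[str]) -> list[dict[str, str]]:
    """Attach column headers to each row, giving one record per table entry."""
    return [dict(zip(headers, row)) for row in rows]

# Made-up output of the two table parsing rules.
main_rows = [["Widget A", "2", "10.00"], ["Widget B", "1", "7.50"]]
secondary_rows = [["ABC12345678", "0.5 kg"], ["XYZ87654321", "1.2 kg"]]

merged = append_horizontally(main_rows, secondary_rows)
records = name_columns(
    merged, ["description", "qty", "price", "product_number", "unit_weight"]
)
```

Each record now holds the main-row and secondary-row fields of one table entry, so a webhook receiver could loop over a single list instead of two.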

Please note: The method of creating multiple parsing rules and merging the returned data works great when all table rows contain a similar data structure. Merging the data will however not work if your table rows have a variable number of sub-rows (e.g. a header row with N sub-items).

If you have any further questions, please don't hesitate to reach out to our support staff!
