Introduction to parsers

What is a parser?#

A parser defines how DocumentPro processes a specific type of document. It contains:

  • The document type (e.g. invoice, purchase order, W-8BEN, etc.)
  • The template schema, written in a query language, that describes the fields, tables and lists that you want DocumentPro to capture from your specific document.
  • Parser configs that allow you to set how the document is parsed and post-processed.
  • Integration settings that allow you to integrate the data and documents in this parser with other platforms.

Example parser object#

{
  "template_id": "710a20fc-e280-43eb-9a9f-5436e600c710",
  "template_title": "Custom Invoice Parser",
  "template_type": "Invoice",
  "template_category": "other",
  "template_schema": {
    "fields": [
      {
        "name": "buyer_name",
        "type": "text",
        "description": "name of the buyer"
      },
      {
        "name": "total_amount",
        "type": "number",
        "required": true
      },
      {
        "name": "line_items",
        "type": "table",
        "description": "invoice line items",
        "subFields": [
          {
            "name": "description",
            "type": "text",
            "description": "description of item"
          },
          {
            "name": "subtotal",
            "type": "number"
          }
        ]
      }
    ]
  },
  "email_id": "test_parser_22_fbb499@inbox.documentpro-ai.com",
  "webhook_url": "https://your-application.com/documentpro-webhook",
  "parser_config": {
    "date_format": "%Y-%m-%d",
    "outbound_integration": null,
    "ocr_config": {
      "engine": "aws_textract",
      "precision": "low",
      "auto_select_precision": true,
      "formatting_level": "high",
      "show_page_number": false,
      "split_by_page": false,
      "trim_spaces": true,
      "show_type_label": false,
      "detect_layout": true,
      "detect_tables": false,
      "remove_headers": true,
      "remove_footers": true,
      "remove_tables": false,
      "remove_extra_whitespace": false,
      "remove_extra_newlines": false,
      "table_format": "markdown",
      "tabulation_format": "github",
      "minimize_table_cell_borders": false
    },
    "query_config": {
      "query_model": "gpt-3.5-turbo-0125",
      "set_max_output_tokens": false,
      "include_example": false,
      "minimize_tokens": true,
      "selected_language": null
    }
  },
  "created_at": "2023-11-15T12:13:12.056281"
}

Parser attributes#

template_id#

The unique ID of the parser, generated automatically. It is interchangeable with parser_id.

template_title#

The title of the parser. It must be unique within your account.

template_type#

The type of the parser should describe the type of document it handles, e.g. an invoice, purchase order or W-8BEN.

template_category#

Options
finance
identity
logistics
human resources
medical
other (default)

template_schema#

The template_schema is the query language used by DocumentPro to extract information from your document. You use the query language to describe the fields, tables and lists that you want DocumentPro to capture.

A field can have the following attributes:

| Attributes | Type | Required | Constraints | Description |
| --- | --- | --- | --- | --- |
| name | string | Yes | 30 characters, only lower case letters, numbers or underscore | Name of the field. It should describe the information you want the AI to retrieve, e.g. invoice_number |
| description | string | No | 50 characters | A description of the field to help the AI interpret the field name better. You can also add data formats, e.g. '9 alphanumeric characters unique identifier' |
| type | enum, string | Yes | text, number, date, table | The type of the field as mentioned in the constraint. |
| required | boolean | No | true, false | DocumentPro will flag this field as missing if it is required but not extracted |
| subFields | array | Conditional | Required if type is table | subFields is a list of fields as shown below |

At least one subField is required if the parent field is a table. The attributes are:

| Attributes | Type | Required | Constraints | Description |
| --- | --- | --- | --- | --- |
| name | string | Yes | 30 characters, only lower case letters, numbers or underscore | Name of the field. It should describe the information you want the AI to retrieve, e.g. invoice_number |
| description | string | No | 50 characters | A description of the field to help the AI interpret the field name better. You can also add data formats, e.g. '9 alphanumeric characters unique identifier' |
| type | enum, string | Yes | text, number, date, table | The type of the field as mentioned in the constraint. |
| required | boolean | No | true, false | DocumentPro will flag this field as missing if it is required but not extracted |
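
The constraints above can be checked locally before a schema is submitted. The following is a minimal sketch, not DocumentPro's own validation: the function names `validate_field` and `validate_schema` are hypothetical, and it assumes the constraints mean "at most 30 characters" and "at most 50 characters".

```python
import re

# Constraints taken from the attribute tables above.
NAME_PATTERN = re.compile(r"[a-z0-9_]{1,30}")
MAX_DESCRIPTION_LENGTH = 50
FIELD_TYPES = ("text", "number", "date", "table")


def validate_field(field, path="fields"):
    """Return a list of human-readable problems found in one field definition."""
    problems = []
    name = field.get("name")
    label = f"{path}.{name}"

    if not isinstance(name, str) or not NAME_PATTERN.fullmatch(name):
        problems.append(f"{label}: name must be 1-30 lower case letters, numbers or underscores")

    description = field.get("description")
    if description is not None and (
        not isinstance(description, str) or len(description) > MAX_DESCRIPTION_LENGTH
    ):
        problems.append(f"{label}: description must be a string of at most 50 characters")

    field_type = field.get("type")
    if field_type not in FIELD_TYPES:
        problems.append(f"{label}: type must be one of {', '.join(FIELD_TYPES)}")

    if "required" in field and not isinstance(field["required"], bool):
        problems.append(f"{label}: required must be true or false")

    if field_type == "table":
        sub_fields = field.get("subFields") or []
        if not sub_fields:
            problems.append(f"{label}: a table needs at least one subField")
        for sub_field in sub_fields:
            # Sub-fields follow the same rules as top-level fields (assumption).
            problems.extend(validate_field(sub_field, path=label))

    return problems


def validate_schema(schema):
    """Collect the problems of every field in a template_schema dict."""
    problems = []
    for field in schema.get("fields", []):
        problems.extend(validate_field(field))
    return problems
```

Calling `validate_schema` on the `template_schema` of the example parser object returns an empty list; a field named `Line Items` of type `table` with no `subFields` is reported twice, once for its name and once for the missing sub-fields.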

email_id#

An email_id is automatically generated for each parser. Documents can be directly forwarded to the email_id as an attachment to be automatically parsed by the related parser.

webhook_url#

You can set a webhook URL that points to your server or to a third-party platform such as Zapier to receive parsed data from DocumentPro in real time.
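
A receiving endpoint on your own server could look roughly like the sketch below. It uses only the Python standard library and assumes that the webhook arrives as an HTTP POST with a JSON object in the body; the payload layout is not described on this page, so the handler only inspects the top-level keys. The names `parse_webhook_body`, `WebhookHandler` and `serve` are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_webhook_body(raw_body):
    """Decode a request body into a dict, or return None if it is not a JSON object."""
    try:
        payload = json.loads(raw_body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return None
    return payload if isinstance(payload, dict) else None


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = parse_webhook_body(self.rfile.read(length))
        if payload is None:
            self.send_response(400)
            self.end_headers()
            return
        # Hand the parsed data to your application here.
        # This sketch only logs the top-level keys it received.
        print("received keys:", sorted(payload))
        self.send_response(200)
        self.end_headers()


def serve(port=8000):
    """Listen for webhook calls until the process is stopped."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Run `serve()` behind a publicly reachable HTTPS address and use that address as the webhook_url.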

parser_config#

The parser config holds settings that let you change how the parser processes your document. Not all options are configurable through the API.

See below for the config options.

created_at#

The UTC date and time the parser was created.
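
Because the value carries no timezone offset, it is easy to misread as local time. A small sketch of parsing it as UTC in Python, assuming the value always has the ISO 8601 shape shown in the example parser object (`parse_created_at` is a hypothetical helper):

```python
from datetime import datetime, timezone


def parse_created_at(value):
    """Parse a created_at string and mark the result as UTC."""
    return datetime.fromisoformat(value).replace(tzinfo=timezone.utc)


created = parse_created_at("2023-11-15T12:13:12.056281")
print(created.isoformat())  # 2023-11-15T12:13:12.056281+00:00
```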

Parser config options#

date_format#

If a date_format is set, DocumentPro will parse all date fields using the specified format.

Acceptable date formats are:

| Format | Description |
| --- | --- |
| %d/%m/%y | DD_MM_YY |
| %d/%m/%Y | DD_MM_YYYY |
| %Y-%m-%d | YYYY_MM_DD |
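
The format strings use strftime-style directives. As an illustration only, the sketch below renders one date with each accepted format using Python's `strftime`; it assumes DocumentPro interprets the directives the same way, and `render_examples` is a hypothetical helper.

```python
from datetime import date

# The three accepted formats and their descriptions from the table above.
DATE_FORMATS = {
    "%d/%m/%y": "DD_MM_YY",
    "%d/%m/%Y": "DD_MM_YYYY",
    "%Y-%m-%d": "YYYY_MM_DD",
}


def render_examples(day):
    """Map each accepted format string to the given date rendered in that format."""
    return {fmt: day.strftime(fmt) for fmt in DATE_FORMATS}


examples = render_examples(date(2023, 11, 15))
for fmt, text in examples.items():
    print(fmt, "->", text)
# %d/%m/%y -> 15/11/23
# %d/%m/%Y -> 15/11/2023
# %Y-%m-%d -> 2023-11-15
```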

ocr_config#

OCR configs allow you to set how the OCR engine processes your document. Not all options are configurable through the API. See the OCR config options below.

query_config#

Query configs allow you to set how the query engine processes your document. Not all options are configurable through the API. See the query config options below.

outbound_integration#

Outbound integration settings cannot be configured by API. They can only be configured through the DocumentPro web app.

OCR config options#

| Configuration | Description | Values | Default |
| --- | --- | --- | --- |
| engine | OCR engine to use | aws_textract | aws_textract |
| detect_layout | Extracts document content in reading order | true, false | true |
| detect_tables | Detects table rows, columns and cells | true, false | false |
| remove_headers | Remove document headers before parsing | true, false | false |
| remove_footers | Remove document footers before parsing | true, false | false |
| remove_tables | Remove tables before parsing | true, false | false |
| remove_extra_whitespace | Remove extra white spaces | true, false | true |
| remove_extra_newlines | Remove extra new lines | true, false | true |
| table_format | Structured with cell spacing or plaintext | markdown, plaintext | markdown |
| tabulation_format | What tabulation type to use for markdown table format | github, plain, csv | github |
| minimize_table_cell_borders | Minimize borders to reduce character count | true, false | false |
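
One way to work with these options in client code is to start from the documented defaults and apply only validated overrides. The sketch below does that with a hypothetical helper, `build_ocr_config`; it mirrors the table above and says nothing about which options the API actually accepts.

```python
# Defaults and allowed values copied from the OCR config options table.
OCR_DEFAULTS = {
    "engine": "aws_textract",
    "detect_layout": True,
    "detect_tables": False,
    "remove_headers": False,
    "remove_footers": False,
    "remove_tables": False,
    "remove_extra_whitespace": True,
    "remove_extra_newlines": True,
    "table_format": "markdown",
    "tabulation_format": "github",
    "minimize_table_cell_borders": False,
}

OCR_CHOICES = {
    "engine": ("aws_textract",),
    "table_format": ("markdown", "plaintext"),
    "tabulation_format": ("github", "plain", "csv"),
}


def build_ocr_config(**overrides):
    """Return the default ocr_config with the given overrides applied."""
    config = dict(OCR_DEFAULTS)
    for key, value in overrides.items():
        if key not in OCR_DEFAULTS:
            raise KeyError(f"unknown OCR option: {key}")
        allowed = OCR_CHOICES.get(key)
        if allowed is not None:
            if value not in allowed:
                raise ValueError(f"{key} must be one of {', '.join(allowed)}")
        elif not isinstance(value, bool):
            # Every option without a choice list is a true/false switch.
            raise TypeError(f"{key} must be true or false")
        config[key] = value
    return config
```

For example, `build_ocr_config(detect_tables=True, tabulation_format="csv")` returns the defaults with those two options changed, while `build_ocr_config(tabulation_format="html")` raises a `ValueError`.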

Deprecated OCR config options#

| Configuration | Description | Values | Default |
| --- | --- | --- | --- |
| precision | The level of precision indicates whether tables are extracted with their cell structure | low, medium | |
| auto_select_precision | Engine will decide precision | true, false | true |
| formatting_level | The level of formatting indicates whether the OCR engine will try to preserve the formatting of the document | low, medium | low |
| show_page_number | Whether to show page number | true, false | false |
| split_by_page | Whether to split by page or by content type | true, false | false |
| trim_spaces | Whether to trim blank spaces | true, false | true |
| show_type_label | Whether to show type label | true, false | true |

Query config options#

| Configuration | Description | Values | Default |
| --- | --- | --- | --- |
| query_model | The model to use | gpt-3.5-turbo-0125, gpt-4-0125-preview | gpt-3.5-turbo-0125 |
| set_max_output_tokens | Whether to estimate tokens or set max tokens for the LLM query | true, false | false |
| include_example | Whether to include example output in the LLM query | true, false | false |
| minimize_tokens | Whether to minimize tokens for the LLM query by removing line breaks and white spaces | true, false | true |
| selected_language | Set a language for the query | | None |
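
When reviewing a parser, it can help to see which query options differ from the documented defaults. The sketch below is one hypothetical way to do that; `non_default_options` is not part of DocumentPro, and the defaults are copied from the table above.

```python
# Defaults and allowed models copied from the query config options table.
QUERY_DEFAULTS = {
    "query_model": "gpt-3.5-turbo-0125",
    "set_max_output_tokens": False,
    "include_example": False,
    "minimize_tokens": True,
    "selected_language": None,
}
QUERY_MODELS = ("gpt-3.5-turbo-0125", "gpt-4-0125-preview")


def non_default_options(query_config):
    """Return the entries of query_config whose value differs from the default."""
    unknown = sorted(set(query_config) - set(QUERY_DEFAULTS))
    if unknown:
        raise KeyError(f"unknown query options: {', '.join(unknown)}")
    model = query_config.get("query_model", QUERY_DEFAULTS["query_model"])
    if model not in QUERY_MODELS:
        raise ValueError(f"query_model must be one of {', '.join(QUERY_MODELS)}")
    return {
        key: value
        for key, value in query_config.items()
        if value != QUERY_DEFAULTS[key]
    }


changed = non_default_options(
    {"query_model": "gpt-4-0125-preview", "minimize_tokens": True}
)
print(changed)  # {'query_model': 'gpt-4-0125-preview'}
```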