Create Parser

If you are new to parsers, read the introduction to parsers first.

You can use the REST API to create a parser in DocumentPro. Once you've created a parser, you can upload documents against the parser and extract data from them.

Guide to creating a parser

API Endpoint

POST https://api.documentpro.ai/v1/templates

Required Fields

When creating a parser, the following fields are required:

  1. template_title (string): A unique title for your parser within your account.
  2. template_type (string): Describes the type of document (e.g., "Invoice", "Purchase Order", "W-8 BEN").
  3. template_schema (object): Defines the structure of data to be extracted. See below for details.

All other fields are optional.
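As a quick client-side sanity check, the required-field rule above can be sketched as follows. The helper name `missing_required_fields` and the sample title are hypothetical and not part of the DocumentPro API; the check only confirms that the three required keys are present and non-empty.

```python
# The three required request fields listed above.
REQUIRED_FIELDS = ("template_title", "template_type", "template_schema")


def missing_required_fields(payload: dict) -> list:
    """Return the required keys that are absent or empty in the payload.

    Hypothetical helper for illustration; the API performs its own validation.
    """
    return [key for key in REQUIRED_FIELDS if not payload.get(key)]


# Smallest payload that satisfies the required-field rule.
minimal_payload = {
    "template_title": "Minimal Invoice Parser",
    "template_type": "Invoice",
    "template_schema": {
        "fields": [{"name": "invoice_number", "type": "text"}],
    },
}

print(missing_required_fields(minimal_payload))  # []
print(missing_required_fields({"template_title": "Untitled"}))
# ['template_type', 'template_schema']
```

Running a check like this before the POST request can surface an incomplete payload without spending an API call.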

Example Implementation using Python

import json

import requests

url = "https://api.documentpro.ai/v1/templates"

# Request body: the three required fields plus optional webhook and parser settings.
# Python booleans (True/False) are serialized to JSON true/false by requests.
payload = {
    "template_title": "Custom Invoice Parser",
    "template_type": "Invoice",
    "template_schema": {
        "fields": [
            {
                "name": "buyer_name",
                "type": "text",
                "description": "name of the buyer",
            },
            {
                "name": "total_amount",
                "type": "number",
            },
            {
                "name": "line_items",
                "type": "table",
                "description": "invoice line items",
                "subFields": [
                    {
                        "name": "description",
                        "type": "text",
                        "description": "description of item",
                    },
                    {
                        "name": "subtotal",
                        "type": "number",
                    },
                ],
            },
        ]
    },
    "webhook_url": "https://your-application.com/documentpro-webhook",
    "parser_config": {
        "date_format": "%Y-%m-%d",
        "ocr_config": {
            "use_ocr": True,
            "detect_layout": True,
            "detect_tables": True,
        },
        "query_config": {
            "query_model": "gpt-4o-mini",
            "page_ranges": "1-3",
        },
    },
}

headers = {
    "x-api-key": "YOUR_API_KEY",
    "Content-Type": "application/json",
}

response = requests.post(url, headers=headers, json=payload)

# If the request was successful, status_code will be 200
if response.status_code == 200:
    print("Parser created successfully")
    print(json.dumps(response.json(), indent=2))
else:
    print("Failed to create parser")
    print(response.text)

Parser response

A successful request returns status code 200 with the following body structure.

{
  "template_id": "710a20fc-e280-43eb-9a9f-5436e600c710",
  "template_title": "Custom Invoice Parser",
  "template_type": "Invoice",
  "template_schema": {
    "fields": [
      {
        "name": "buyer_name",
        "type": "text",
        "description": "name of the buyer"
      },
      {
        "name": "total_amount",
        "type": "number"
      },
      {
        "name": "line_items",
        "type": "table",
        "description": "invoice line items",
        "subFields": [
          {
            "name": "description",
            "type": "text",
            "description": "description of item"
          },
          {
            "name": "subtotal",
            "type": "number"
          }
        ]
      }
    ]
  },
  "email_id": "test_parser_22_fbb499@inbox.documentpro-ai.com",
  "webhook_url": "https://your-application.com/documentpro-webhook",
  "parser_config": {
    "parse_email_attachments": true,
    "parse_email_body": false,
    "date_format": "%Y-%m-%d",
    "ocr_config": {
      "use_ocr": true,
      "detect_layout": true,
      "detect_tables": true,
      "remove_headers": false,
      "remove_footers": false,
      "remove_tables": false
    },
    "query_config": {
      "query_model": "gpt-4o-mini",
      "page_ranges": "1-3"
    }
  },
  "created_at": "2023-11-15T12:13:12.056281"
}

The template_id is the unique id for the parser that can be used when uploading documents against the parser. template_id and parser_id are used interchangeably in DocumentPro.

For status codes 400, 404 and 500 you will get the following response body.

{
  "success": false,
  "error": "error code",
  "message": "descriptive error message"
}
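The two response shapes above can be handled in one place. The following is a minimal sketch, assuming the decoded JSON body is already available as a dict; the function name `summarize_response` is hypothetical, and the sample values are copied from the examples on this page.

```python
def summarize_response(status_code: int, body: dict) -> str:
    """Turn a create-parser response into a one-line summary.

    Hypothetical helper: it relies only on the fields documented above.
    """
    if status_code == 200:
        return f"created parser {body.get('template_id')}"
    # Documented error bodies carry "success", "error" and "message".
    error = body.get("error", "unknown error")
    message = body.get("message", "no message provided")
    return f"request failed ({status_code}): {error}: {message}"


# Sample bodies shaped like the ones shown above (values are illustrative).
ok_body = {"template_id": "710a20fc-e280-43eb-9a9f-5436e600c710"}
error_body = {
    "success": False,
    "error": "error code",
    "message": "descriptive error message",
}

print(summarize_response(200, ok_body))
# created parser 710a20fc-e280-43eb-9a9f-5436e600c710
print(summarize_response(400, error_body))
# request failed (400): error code: descriptive error message
```

Using `.get` with fallbacks keeps the helper from raising if an error body arrives without one of the documented keys.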

Request attributes

template_title (required)

The title of the parser must be unique to your account.

template_type (required)

The type of the parser should describe the kind of document it handles, for example "Invoice", "Purchase Order", or "W-8 BEN".

template_schema (required)

The template_schema is the query language used by DocumentPro to extract information from your document. You use the query language to describe the fields, tables and lists that you want DocumentPro to capture.

A field can have the following attributes:

| Attribute | Type | Required | Constraints | Description |
| --- | --- | --- | --- | --- |
| name | string | Yes | 50 characters, only lowercase letters, numbers or underscore | Name of the field. It should describe the information you want the AI to retrieve, e.g. invoice_number |
| description | string | No | 150 characters | A description of the field to help the AI interpret the field name better. You can also add data formats, e.g. '9 alphanumeric characters unique identifier' |
| type | enum, string | Yes | text, number, date, table | The type of the field, one of the values listed under Constraints |
| required | boolean | No | true, false | DocumentPro will flag this field as missing if it is required but not extracted |
| subFields | array | Conditional | Required if type is table | A list of sub-fields with the attributes described below |

At least one subField is required when the parent field has type table. Each subField supports the following attributes:

| Attribute | Type | Required | Constraints | Description |
| --- | --- | --- | --- | --- |
| name | string | Yes | 50 characters, only lowercase letters, numbers or underscore | Name of the field. It should describe the information you want the AI to retrieve, e.g. invoice_number |
| description | string | No | 150 characters | A description of the field to help the AI interpret the field name better. You can also add data formats, e.g. '9 alphanumeric characters unique identifier' |
| type | enum, string | Yes | text, number, date | The type of the field, one of the values listed under Constraints |
| required | boolean | No | true, false | DocumentPro will flag this field as missing if it is required but not extracted |
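The constraints in the two attribute tables can be checked locally before a schema is submitted. The sketch below is an illustration, not DocumentPro's own validation: the helper `validate_field` is hypothetical, and it assumes the "50 characters" and "150 characters" constraints are maximum lengths.

```python
import re

# Assumption: names are 1 to 50 characters of lowercase letters, digits or "_".
NAME_PATTERN = re.compile(r"[a-z0-9_]{1,50}")
FIELD_TYPES = {"text", "number", "date", "table"}
SUBFIELD_TYPES = {"text", "number", "date"}  # subFields cannot be tables


def validate_field(field: dict, is_subfield: bool = False) -> list:
    """Return a list of human-readable problems found in one field."""
    problems = []
    name = field.get("name", "")
    if not isinstance(name, str) or not NAME_PATTERN.fullmatch(name):
        problems.append(f"invalid name: {name!r}")
    # Assumption: the description constraint is a 150-character maximum.
    if len(field.get("description") or "") > 150:
        problems.append(f"{name}: description exceeds 150 characters")
    allowed = SUBFIELD_TYPES if is_subfield else FIELD_TYPES
    if field.get("type") not in allowed:
        problems.append(f"{name}: type must be one of {sorted(allowed)}")
    if "required" in field and not isinstance(field["required"], bool):
        problems.append(f"{name}: required must be true or false")
    if field.get("type") == "table" and not is_subfield:
        sub_fields = field.get("subFields") or []
        if not sub_fields:
            problems.append(f"{name}: table needs at least one subField")
        for sub_field in sub_fields:
            problems.extend(validate_field(sub_field, is_subfield=True))
    return problems


line_items = {
    "name": "line_items",
    "type": "table",
    "subFields": [{"name": "description", "type": "text"}],
}
print(validate_field(line_items))  # []
print(validate_field({"name": "Total Amount", "type": "number"}))
# ["invalid name: 'Total Amount'"]
```

Returning a list of problems rather than raising on the first one makes it possible to report every issue in a schema in a single pass.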

webhook_url (optional)

You can set a webhook URL that points to your server or to a third-party platform such as Zapier to receive parsed data from DocumentPro in real time.

parser_config (optional)

Parser config holds settings that allow you to change how the parser processes your document. It includes:

  • parse_email_attachments: Boolean to determine if email attachments should be parsed (default: true).
  • parse_email_body: Boolean to determine if the email body should be parsed (default: false).
  • date_format: Specifies the format for parsing date fields (e.g., "%Y-%m-%d" for YYYY-MM-DD).
  • ocr_config: Controls the OCR process.
  • query_config: Configures the AI query process.

OCR Config Options

  • use_ocr: Boolean to determine if OCR should be used (default: false).
  • detect_layout: Boolean to determine if document layout should be detected (default: true).
  • detect_tables: Boolean to determine if tables should be detected (default: false).
  • remove_headers: Boolean to remove headers before parsing (default: false).
  • remove_footers: Boolean to remove footers before parsing (default: false).
  • remove_tables: Boolean to remove tables before parsing (default: false).

Query Config Options

  • query_model: Specifies the AI model to use for parsing (e.g., "gpt-4o-mini", "gpt-4o").
  • page_ranges: Optional string to specify which pages to process (e.g., "1-3,5,7-9").
  • start_regex: Optional regex pattern to define where parsing should begin.
  • end_regex: Optional regex pattern to define where parsing should end.
  • split_regex: Optional regex pattern to split the document into sections.
  • use_all_matches: Boolean to determine if all matches should be used (default: false).
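
To make the page_ranges format concrete, the sketch below expands a string such as "1-3,5,7-9" into individual page numbers. It assumes 1-based, inclusive ranges separated by commas, which is how the example reads; how DocumentPro itself interprets the string is not specified here, and `expand_page_ranges` is a hypothetical helper.

```python
def expand_page_ranges(page_ranges: str) -> list:
    """Expand a string such as "1-3,5,7-9" into a sorted list of page numbers.

    Assumes comma-separated entries, each a single page or an inclusive
    "start-end" range.
    """
    pages = set()
    for part in page_ranges.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(expand_page_ranges("1-3,5,7-9"))  # [1, 2, 3, 5, 7, 8, 9]
```

A helper like this is useful for previewing which pages a setting would cover before saving it in query_config.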

Remember, while these configurations provide powerful customization options, they are all optional. You can create a functional parser using only the required fields: template_title, template_type, and template_schema.
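
Because every setting is optional, it can help to preview what a partial parser_config would look like once the documented defaults are filled in. The sketch below overlays overrides on the defaults listed on this page. The helper `preview_parser_config` is hypothetical, and treating omitted options as falling back to these defaults is an assumption based on the sample response earlier on this page.

```python
import copy

# Defaults as listed in the parser_config section of this page.
DOCUMENTED_DEFAULTS = {
    "parse_email_attachments": True,
    "parse_email_body": False,
    "ocr_config": {
        "use_ocr": False,
        "detect_layout": True,
        "detect_tables": False,
        "remove_headers": False,
        "remove_footers": False,
        "remove_tables": False,
    },
    "query_config": {"use_all_matches": False},
}


def preview_parser_config(overrides: dict) -> dict:
    """Overlay overrides on the documented defaults, one nesting level deep."""
    merged = copy.deepcopy(DOCUMENTED_DEFAULTS)  # never mutate the defaults
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key].update(value)  # merge nested option groups
        else:
            merged[key] = value
    return merged


preview = preview_parser_config({"ocr_config": {"use_ocr": True}})
print(preview["ocr_config"]["use_ocr"])        # True
print(preview["ocr_config"]["detect_layout"])  # True
```

Only the overrides need to be sent in the request; the merged dict is just a local preview of the expected effective settings.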