Create Parser
To understand what parsers are, read the introduction to parsers first.
You can use the REST API to create a parser in DocumentPro. Once you've created a parser, you can upload documents against the parser and extract data from them.
Guide to creating a parser
API Endpoint
POST https://api.documentpro.ai/v1/templates
Required Fields
When creating a parser, the following fields are required:
template_title (string): A unique title for your parser within your account.
template_type (string): Describes the type of document (e.g., "Invoice", "Purchase Order", "W-8 BEN").
template_schema (object): Defines the structure of data to be extracted. See below for details.
All other fields are optional.
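Before sending a request, it can help to check the payload locally for the three required fields. The sketch below is a hypothetical client-side helper (`missing_required_fields` is not part of the DocumentPro API); it only checks that each required key is present and non-empty, not that the values are valid.

```python
REQUIRED_FIELDS = ("template_title", "template_type", "template_schema")


def missing_required_fields(payload: dict) -> list:
    """Return the required keys that are absent or empty in the payload."""
    return [key for key in REQUIRED_FIELDS if not payload.get(key)]


# Example: a payload that forgot the schema.
payload = {
    "template_title": "Custom Invoice Parser",
    "template_type": "Invoice",
}
print(missing_required_fields(payload))  # ['template_schema']
```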
Example Implementation using Python
    import json

    import requests

    url = "https://api.documentpro.ai/v1/templates"

    # Note: Python booleans are True/False; requests serializes them to JSON true/false.
    payload = {
        "template_title": "Custom Invoice Parser",
        "template_type": "Invoice",
        "template_schema": {
            "fields": [
                {"name": "buyer_name", "type": "text", "description": "name of the buyer"},
                {"name": "total_amount", "type": "number"},
                {
                    "name": "line_items",
                    "type": "table",
                    "description": "invoice line items",
                    "subFields": [
                        {"name": "description", "type": "text", "description": "description of item"},
                        {"name": "subtotal", "type": "number"},
                    ],
                },
            ]
        },
        "webhook_url": "https://your-application.com/documentpro-webhook",
        "parser_config": {
            "date_format": "%Y-%m-%d",
            "ocr_config": {"use_ocr": True, "detect_layout": True, "detect_tables": True},
            "query_config": {"query_model": "gpt-4o-mini", "page_ranges": "1-3"},
        },
    }

    headers = {
        'x-api-key': 'YOUR_API_KEY',
        'Content-Type': 'application/json',
    }

    response = requests.post(url, headers=headers, json=payload)

    # If the request was successful, status_code will be 200
    if response.status_code == 200:
        print('Parser created successfully')
        print(json.dumps(response.json(), indent=2))
    else:
        print('Failed to create parser')
        print(response.text)
Parser response
A 200 status code response has the following body structure.
    {
      "template_id": "710a20fc-e280-43eb-9a9f-5436e600c710",
      "template_title": "Custom Invoice Parser",
      "template_type": "Invoice",
      "template_schema": {
        "fields": [
          { "name": "buyer_name", "type": "text", "description": "name of the buyer" },
          { "name": "total_amount", "type": "number" },
          {
            "name": "line_items",
            "type": "table",
            "description": "invoice line items",
            "subFields": [
              { "name": "description", "type": "text", "description": "description of item" },
              { "name": "subtotal", "type": "number" }
            ]
          }
        ]
      },
      "email_id": "test_parser_22_fbb499@inbox.documentpro-ai.com",
      "webhook_url": "https://your-application.com/documentpro-webhook",
      "parser_config": {
        "parse_email_attachments": true,
        "parse_email_body": false,
        "date_format": "%Y-%m-%d",
        "ocr_config": {
          "use_ocr": true,
          "detect_layout": true,
          "detect_tables": true,
          "remove_headers": false,
          "remove_footers": false,
          "remove_tables": false
        },
        "query_config": { "query_model": "gpt-4o-mini", "page_ranges": "1-3" }
      },
      "created_at": "2023-11-15T12:13:12.056281"
    }
The template_id is the unique ID of the parser; use it when uploading documents against the parser. template_id and parser_id are used interchangeably in DocumentPro.
For status codes 400, 404 and 500, you will get the following response body.
    {
      "success": false,
      "error": "error code",
      "message": "descriptive error message"
    }
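One way to handle both response shapes in client code is to return the template_id on success and raise an exception carrying the error and message fields otherwise. This is a minimal sketch: `DocumentProError` and `parse_create_response` are hypothetical names introduced here for illustration, not part of any DocumentPro SDK.

```python
class DocumentProError(Exception):
    """Hypothetical exception wrapping the documented error body."""

    def __init__(self, status_code, error, message):
        super().__init__(f"{status_code} {error}: {message}")
        self.status_code = status_code
        self.error = error
        self.message = message


def parse_create_response(status_code: int, body: dict) -> str:
    """Return the template_id on a 200 response, raise DocumentProError otherwise."""
    if status_code == 200:
        return body["template_id"]
    # Fall back to placeholder values in case the body is missing a key.
    raise DocumentProError(
        status_code,
        body.get("error", "unknown"),
        body.get("message", ""),
    )
```

With the requests example above, this could be called as `parse_create_response(response.status_code, response.json())`.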
Request attributes
template_title (required)
The title of the parser must be unique to your account.
template_type (required)
The parser type should describe the kind of document it handles, e.g. an invoice, purchase order, or W-8 BEN.
template_schema (required)
The template_schema is the query language used by DocumentPro to extract information from your document. You use the query language to describe the fields, tables and lists that you want DocumentPro to capture.
A field can have the following attributes:
Attributes | Type | Required | Constraints | Description |
---|---|---|---|---|
name | string | Yes | Up to 50 characters; only lowercase letters, numbers or underscores | Name of the field. It should describe the information you want the AI to retrieve, e.g. invoice_number |
description | string | No | Up to 150 characters | A description of the field to help the AI interpret the field name better. You can also add data formats, e.g. '9 alphanumeric characters unique identifier' |
type | enum, string | Yes | text, number, date, table | The type of the field; must be one of the values listed under Constraints. |
required | boolean | No | true, false | DocumentPro will flag this field as missing if it is required but not extracted |
subFields | array | Conditional | Required if type is table | subFields is a list of fields as shown below |
At least one subField is required if the parent field is a table. The attributes of a subField are:
Attributes | Type | Required | Constraints | Description |
---|---|---|---|---|
name | string | Yes | Up to 50 characters; only lowercase letters, numbers or underscores | Name of the field. It should describe the information you want the AI to retrieve, e.g. invoice_number |
description | string | No | Up to 150 characters | A description of the field to help the AI interpret the field name better. You can also add data formats, e.g. '9 alphanumeric characters unique identifier' |
type | enum, string | Yes | text, number, date | The type of the field; must be one of the values listed under Constraints. |
required | boolean | No | true, false | DocumentPro will flag this field as missing if it is required but not extracted |
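The constraints in the two tables above can be checked locally before a schema is submitted. The following is a sketch of such a check, assuming the character counts are maximum lengths; `validate_field` and `validate_schema` are hypothetical helper names, and the server may apply additional rules not covered here.

```python
import re

# Lowercase letters, digits and underscores, 1 to 50 characters (assumed maximum).
NAME_PATTERN = re.compile(r"[a-z0-9_]{1,50}")
FIELD_TYPES = {"text", "number", "date", "table"}
SUBFIELD_TYPES = {"text", "number", "date"}  # subFields cannot be tables


def validate_field(field: dict, allowed_types=FIELD_TYPES) -> list:
    """Return a list of human-readable problems found in one field definition."""
    problems = []
    name = str(field.get("name", ""))
    if not NAME_PATTERN.fullmatch(name):
        problems.append(f"invalid name: {name!r}")
    if len(field.get("description") or "") > 150:
        problems.append(f"{name}: description longer than 150 characters")
    if field.get("type") not in allowed_types:
        problems.append(f"{name}: unsupported type {field.get('type')!r}")
    if "required" in field and not isinstance(field["required"], bool):
        problems.append(f"{name}: required must be a boolean")
    if field.get("type") == "table" and "table" in allowed_types:
        sub_fields = field.get("subFields") or []
        if not sub_fields:
            problems.append(f"{name}: a table needs at least one subField")
        for sub in sub_fields:
            problems.extend(validate_field(sub, SUBFIELD_TYPES))
    return problems


def validate_schema(schema: dict) -> list:
    """Validate every field in a template_schema dict."""
    problems = []
    for field in schema.get("fields", []):
        problems.extend(validate_field(field))
    return problems
```

An empty list means no problems were found by these local checks.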
webhook_url (optional)
You can set a webhook URL that points to your server or to a third-party platform like Zapier to receive parsed data from DocumentPro in real time.
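A webhook endpoint on your own server could look roughly like the sketch below, built on Python's standard `http.server`. The shape of the webhook payload is not described on this page, so the code only assumes a POST request with a UTF-8 JSON body and does not read any specific keys; `decode_webhook_body`, `WebhookHandler` and `serve` are illustrative names.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def decode_webhook_body(raw: bytes):
    """Decode a webhook request body, assuming it is UTF-8 encoded JSON."""
    return json.loads(raw.decode("utf-8"))


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            data = decode_webhook_body(self.rfile.read(length))
        except ValueError:
            # Covers both invalid UTF-8 and invalid JSON.
            self.send_response(400)
            self.end_headers()
            return
        print("Received webhook payload of type", type(data).__name__)
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Start a blocking local server; call this from your own entry point."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

In production you would typically run this behind HTTPS and verify that requests really originate from DocumentPro.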
parser_config (optional)
Parser config holds settings that allow you to change how the parser processes your document. It includes:
parse_email_attachments: Boolean to determine if email attachments should be parsed (default: true).
parse_email_body: Boolean to determine if the email body should be parsed (default: false).
date_format: Specifies the format for parsing date fields (e.g., "%Y-%m-%d" for YYYY-MM-DD).
ocr_config: Controls the OCR process.
query_config: Configures the AI query process.
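The date_format example uses percent directives that look like the strftime-style codes in Python's `datetime` module. Assuming DocumentPro interprets them the same way (this page does not confirm it), you can preview what a format string produces locally:

```python
from datetime import date

DATE_FORMAT = "%Y-%m-%d"  # the format used in the example payload

# Render a sample date with the chosen format string.
formatted = date(2023, 11, 15).strftime(DATE_FORMAT)
print(formatted)  # 2023-11-15
```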
OCR Config Options
use_ocr: Boolean to determine if OCR should be used (default: false).
detect_layout: Boolean to determine if document layout should be detected (default: true).
detect_tables: Boolean to determine if tables should be detected (default: false).
remove_headers: Boolean to remove headers before parsing (default: false).
remove_footers: Boolean to remove footers before parsing (default: false).
remove_tables: Boolean to remove tables before parsing (default: false).
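To make the effective OCR settings explicit in client code, the documented defaults can be merged with your overrides. This is a sketch only; `build_ocr_config` is a hypothetical helper, and it rejects option names that are not listed above to catch typos early.

```python
# Defaults as documented in the list above.
OCR_DEFAULTS = {
    "use_ocr": False,
    "detect_layout": True,
    "detect_tables": False,
    "remove_headers": False,
    "remove_footers": False,
    "remove_tables": False,
}


def build_ocr_config(**overrides) -> dict:
    """Merge overrides into the documented defaults, rejecting unknown keys."""
    unknown = set(overrides) - set(OCR_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown OCR options: {sorted(unknown)}")
    return {**OCR_DEFAULTS, **overrides}


# Enable OCR and table detection, keep everything else at its default.
ocr_config = build_ocr_config(use_ocr=True, detect_tables=True)
```

The resulting dict can be placed under `parser_config["ocr_config"]` in the request payload.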
Query Config Options
query_model: Specifies the AI model to use for parsing (e.g., "gpt-4o-mini", "gpt-4o").
page_ranges: Optional string to specify which pages to process (e.g., "1-3,5,7-9").
start_regex: Optional regex pattern to define where parsing should begin.
end_regex: Optional regex pattern to define where parsing should end.
split_regex: Optional regex pattern to split the document into sections.
use_all_matches: Boolean to determine if all matches should be used (default: false).
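To illustrate the page_ranges syntax, the sketch below expands a string such as "1-3,5,7-9" into individual page numbers. It assumes ranges are inclusive and pages are numbered from 1, which this page does not state explicitly; `expand_page_ranges` is a hypothetical local helper, not a DocumentPro function.

```python
def expand_page_ranges(page_ranges: str) -> list:
    """Expand a string like "1-3,5,7-9" into a list of page numbers."""
    pages = []
    for part in page_ranges.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.extend(range(start, end + 1))  # inclusive end (assumed)
        else:
            pages.append(int(part))
    return pages


print(expand_page_ranges("1-3,5,7-9"))  # [1, 2, 3, 5, 7, 8, 9]
```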
Remember, while these configurations provide powerful customization options, they are all optional. You can create a functional parser using only the required fields: template_title, template_type, and template_schema.