AI Parsers

1. What is an AI Parser?

An AI parser in DocumentPro can extract structured information from documents. It combines OCR (Optical Character Recognition) technology with advanced AI models to accurately interpret and extract relevant data from various document formats.

A key feature of DocumentPro's AI parsers is their ability to understand context and interpret information beyond exact matches. The AI doesn't just look for labels that match the field names you've defined. Instead, it uses the field names and descriptions to understand what type of information to extract, even if it's not explicitly labeled in the document.

For example, if you define a field named "invoice_number" with a description "Unique identifier for the invoice", the AI will look for information that fits this description anywhere in the document. It might find this information near labels like "Invoice #", "Reference Number", or even in a section without a specific label, using its understanding of what an invoice number typically looks like and where it's usually located.

This contextual understanding allows the AI parser to be flexible and accurate, capable of extracting information from documents with varying layouts and terminology.

2. Defining an AI Parser

An AI parser is defined by several key attributes:

  • template_id: A unique identifier for the parser (automatically generated).
  • template_title: A user-defined title for the parser (must be unique within your account).
  • template_type: The type of document this parser is designed for (e.g., invoice, purchase order, receipt).
  • template_schema: Defines the structure of data to be extracted (more details below).
  • email_id: An automatically generated email address to forward documents to for parsing.
  • webhook_url: An optional URL where DocumentPro will send parsed data in real-time.
  • parser_config: Contains settings for OCR and query processing (more details below).

3. Template Schema

The template_schema is the core of an AI parser, defining the structure of data to be extracted from the document. It consists of a list of fields, each with specific attributes that guide the AI in understanding what information to look for and how to interpret it.

Field Attributes

  • name: A string that identifies the field. It must be a maximum of 50 characters, using only lowercase letters, numbers, and underscores. No spaces are allowed. For example: "invoice_number", "total_amount", "shipping_address".

  • type: The data type of the field (options: text, number, date, table).

  • description: A description of the field, up to 150 characters. Although optional, it is strongly recommended because it gives the AI additional context about what information to look for.

  • default_value: An optional default value for the field.

The combination of name and description is particularly important. These attributes work together to tell the AI what information to extract and how to identify it within the document. The AI uses this guidance to interpret the document's content, allowing it to find relevant information even when it's not explicitly labeled or when it appears in unexpected locations or formats.
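The naming and length rules above can be checked before a schema is submitted. The following is a minimal sketch, not part of DocumentPro: the helper name check_field is hypothetical, and the limits (50-character names made of lowercase letters, numbers, and underscores; 150-character descriptions) are taken from the attribute list above.

```python
import re

# Limits come from the field attribute list; the helper itself is hypothetical.
NAME_PATTERN = re.compile(r"[a-z0-9_]{1,50}")
MAX_DESCRIPTION_LENGTH = 150


def check_field(name, description=""):
    """Return a list of problems found in a field's name and description."""
    problems = []
    if not NAME_PATTERN.fullmatch(name):
        problems.append(
            "name must be 1-50 characters: lowercase letters, numbers, underscores"
        )
    if len(description) > MAX_DESCRIPTION_LENGTH:
        problems.append("description must be at most 150 characters")
    return problems


print(check_field("invoice_number", "Unique identifier for the invoice"))
print(check_field("Invoice Number"))
```

An empty list means the field passes both checks; a name containing spaces or uppercase letters, such as "Invoice Number", is reported as a problem.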

Field Types

  1. Text: For extracting string data.
  2. Number: For extracting numeric data.
  3. Date: For extracting date information. If a date format is set (see date_format under Parser Configuration), the extracted date is formatted accordingly.
  4. Table: For extracting tabular data. Requires subFields to define the structure of each row.

Example 1: Extracting Facts from a Document

This example shows how to define a schema for extracting specific facts from a document, such as an invoice:

{
  "fields": [
    {
      "name": "invoice_number",
      "type": "text",
      "description": "Unique identifier for the invoice, usually at the top",
      "required": true
    },
    {
      "name": "issue_date",
      "type": "date",
      "description": "Date when the invoice was issued, often near the invoice number"
    },
    {
      "name": "total_amount",
      "type": "number",
      "description": "Total amount to be paid, usually at the bottom of the invoice"
    }
  ]
}

In this example, each field represents a specific piece of information (or "fact") to be extracted from the document. The AI will use the field names and descriptions to identify and extract this information, even if it's not labeled exactly as specified.

Example 2: Extracting Tables or Repeating Data

This example demonstrates how to define a schema for extracting tabular or repeating data, such as line items in an invoice:

{
  "fields": [
    {
      "name": "line_items",
      "type": "table",
      "description": "Individual items or services listed on the invoice",
      "subFields": [
        {
          "name": "item_description",
          "type": "text",
          "description": "Description of the item or service provided"
        },
        {
          "name": "quantity",
          "type": "number",
          "description": "Number of units of the item"
        },
        {
          "name": "unit_price",
          "type": "number",
          "description": "Price per unit of the item"
        }
      ]
    },
    {
      "name": "tax_items",
      "type": "table",
      "description": "List of taxes applied to the invoice",
      "subFields": [
        {
          "name": "tax_name",
          "type": "text",
          "description": "Name or type of the tax (e.g., VAT, Sales Tax)"
        },
        {
          "name": "tax_rate",
          "type": "number",
          "description": "Rate of the tax as a percentage"
        },
        {
          "name": "tax_amount",
          "type": "number",
          "description": "Amount of tax applied"
        }
      ]
    }
  ]
}

In this example, we define two table-type fields:

  1. line_items: This represents the repeating structure of items or services listed on the invoice. Each row in this table has a description, a quantity, and a unit price.

  2. tax_items: This represents a potentially repeating structure for different types of taxes applied to the invoice.

The AI will use these definitions to identify and extract structured, repeating data from the document, even if the exact format or layout varies between different invoices.

By combining these two types of field definitions - facts and tables - you can create comprehensive parser schemas that can extract both specific pieces of information and structured, repeating data from your documents.
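Before sending a schema that mixes facts and tables, it can help to check it locally against the rules described in this section. The sketch below is an illustration only: check_schema is a hypothetical helper, and it checks just two things stated above, namely that each type is one of text, number, date, or table, and that table fields define subFields.

```python
ALLOWED_TYPES = {"text", "number", "date", "table"}


def check_schema(schema):
    """Return a list of problems found in a template_schema dictionary."""
    problems = []
    for field in schema.get("fields", []):
        name = field.get("name", "<unnamed>")
        field_type = field.get("type")
        if field_type not in ALLOWED_TYPES:
            problems.append(f"{name}: unknown type {field_type!r}")
        elif field_type == "table":
            # Table fields need subFields to describe the columns of each row.
            sub_fields = field.get("subFields") or []
            if not sub_fields:
                problems.append(f"{name}: table fields require subFields")
            for sub in sub_fields:
                if sub.get("type") not in ALLOWED_TYPES:
                    problems.append(
                        f"{name}.{sub.get('name')}: unknown type {sub.get('type')!r}"
                    )
    return problems


schema = {
    "fields": [
        {"name": "invoice_number", "type": "text"},
        {"name": "line_items", "type": "table"},  # subFields missing on purpose
    ]
}
print(check_schema(schema))
```

The sample schema is deliberately incomplete, so the helper reports the line_items table as missing its subFields.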

4. Parser Configuration

The parser_config object contains various settings that control how the parser processes documents, including email parsing options:

  • parse_email_attachments: Boolean to determine if email attachments should be parsed (default: true).
  • parse_email_body: Boolean to determine if the email body should be parsed (default: false).
  • date_format: Specifies the format for parsing date fields (e.g., "%Y-%m-%d" for YYYY-MM-DD).

When a parser is created, it automatically generates a forwarding email address, which is set in the email_id attribute. This email can be used to send documents directly to the parser for processing.

Example parser configuration:

{
  "parser_config": {
    "parse_email_attachments": true,
    "parse_email_body": false,
    "date_format": "%Y-%m-%d"
  }
}
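The date_format value in the example uses the same directive syntax as Python's strftime and strptime. Assuming DocumentPro interprets these directives the same way (an assumption based only on the "%Y-%m-%d" example), a format string can be previewed locally before it is saved to a parser:

```python
from datetime import date, datetime

# Format string from the parser configuration example.
DATE_FORMAT = "%Y-%m-%d"

# Formatting: how a date is rendered with this format.
issued = date(2024, 7, 22)
formatted = issued.strftime(DATE_FORMAT)
print(formatted)  # 2024-07-22

# Parsing: the same format reads the string back into a date.
parsed = datetime.strptime(formatted, DATE_FORMAT).date()
print(parsed == issued)  # True
```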

5. Query Settings

Query settings control how the AI model processes and extracts information from the document. These settings are defined in the query_config object:

  • query_model: Specifies the AI model to use for parsing (e.g., "gpt-4o-mini", "gpt-4o").
  • page_ranges: Optional string to specify which pages to process (e.g., "1-3,5,7-9").

The following settings only apply if use_ocr is set to true:

  • start_regex: Optional regex pattern to define where parsing should begin.
  • end_regex: Optional regex pattern to define where parsing should end.
  • split_regex: Optional regex pattern to split the document into sections. This will only be used if start_regex and end_regex are not provided.
  • use_all_matches: Boolean to determine if all matches should be used (default: false).

Example query configuration:

{
  "query_config": {
    "query_model": "gpt-4o-mini",
    "page_ranges": "1-5",
    "start_regex": "^Invoice Details",
    "end_regex": "^Thank you for your business",
    "use_all_matches": true
  }
}

5.1 Page Ranges

The page_ranges parameter is an optional string that specifies which pages of a document should be processed, written as comma-separated page numbers and ranges (e.g., "1-3,5,7-9").

Important notes about page selection:

  • Different parsers can be applied to different page ranges within a single document by calling the API multiple times with different page ranges.
  • A single parser can be applied multiple times to different page ranges in the same document by calling the API multiple times with different page ranges.

This flexibility allows for efficient processing of complex documents with varying structures across different pages.
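To make the page_ranges syntax concrete, the sketch below expands a string such as "1-3,5,7-9" into individual page numbers. It is an illustration rather than DocumentPro's own parsing logic: expand_page_ranges is a hypothetical helper, and it assumes that ranges are inclusive at both ends.

```python
def expand_page_ranges(page_ranges):
    """Expand a string such as "1-3,5,7-9" into a sorted list of page numbers."""
    pages = set()
    for part in page_ranges.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range such as "7-9" is treated as inclusive at both ends.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(expand_page_ranges("1-3,5,7-9"))  # [1, 2, 3, 5, 7, 8, 9]
```

A helper like this is useful for checking, before making several API calls against the same document, that the page ranges passed to each call cover the intended pages.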

5.2 Regex Features

DocumentPro offers powerful regex-based settings for more precise control over which parts of a document are parsed. These features are explained in detail in the Document Segmentation section.
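As a rough illustration of what start_regex and end_regex are for, the sketch below trims OCR text to the span between two patterns using Python's re module. It does not reproduce DocumentPro's behavior: extract_segment is a hypothetical helper, and details such as multiline matching and excluding the line that matches end_regex are assumptions made for this example.

```python
import re


def extract_segment(text, start_regex=None, end_regex=None):
    """Return the text from the first start match up to the first later end match."""
    begin = 0
    if start_regex:
        start_match = re.search(start_regex, text, flags=re.MULTILINE)
        if start_match is None:
            return ""  # nothing to parse if the start pattern never appears
        begin = start_match.start()
    finish = len(text)
    if end_regex:
        # Only look for the end pattern after the point where parsing begins.
        end_match = re.compile(end_regex, flags=re.MULTILINE).search(text, begin)
        if end_match is not None:
            finish = end_match.start()
    return text[begin:finish]


ocr_text = (
    "ACME Corp\n"
    "Invoice Details\n"
    "Widget 2 x 10.00\n"
    "Thank you for your business\n"
    "Page 1"
)
segment = extract_segment(
    ocr_text, r"^Invoice Details", r"^Thank you for your business"
)
print(segment)
```

With the sample text, the segment keeps the "Invoice Details" line and the line item, and drops the company header and everything from the closing line onward.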

6. OCR Settings

OCR (Optical Character Recognition) settings control how the document is processed before being passed to the AI model. These settings are defined in the ocr_config object:

  • use_ocr: Boolean to determine if OCR should be used (default: false). It must be set to true if the query model is gpt-3.5-turbo.
  • detect_layout: Boolean to determine if document content (including tables) will be extracted in reading order (default: true).
  • detect_tables: Boolean to determine if table borders should be detected (default: false).
  • remove_headers: Boolean to remove headers before parsing (default: false). Only applies if detect_layout is enabled.
  • remove_footers: Boolean to remove footers before parsing (default: false). Only applies if detect_layout is enabled.
  • remove_tables: Boolean to remove tables before parsing (default: false). Only applies if detect_tables is enabled.

Example OCR configuration:

{
  "ocr_config": {
    "use_ocr": true,
    "detect_layout": true,
    "detect_tables": true,
    "remove_headers": true,
    "remove_footers": true,
    "remove_tables": false
  }
}
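Several OCR options depend on one another, so a local check can catch combinations that would have no effect. The following sketch encodes only the dependencies and defaults listed above; check_ocr_config is a hypothetical helper and not part of the DocumentPro API.

```python
def check_ocr_config(ocr_config, query_model=None):
    """Return warnings for option combinations that the settings list rules out."""
    warnings = []
    detect_layout = ocr_config.get("detect_layout", True)   # documented default: true
    detect_tables = ocr_config.get("detect_tables", False)  # documented default: false

    for option in ("remove_headers", "remove_footers"):
        if ocr_config.get(option, False) and not detect_layout:
            warnings.append(f"{option} only applies if detect_layout is enabled")
    if ocr_config.get("remove_tables", False) and not detect_tables:
        warnings.append("remove_tables only applies if detect_tables is enabled")
    if query_model == "gpt-3.5-turbo" and not ocr_config.get("use_ocr", False):
        warnings.append("use_ocr must be true when the query model is gpt-3.5-turbo")
    return warnings


config = {"use_ocr": False, "detect_layout": False, "remove_headers": True}
for warning in check_ocr_config(config, query_model="gpt-3.5-turbo"):
    print(warning)
```

For the sample configuration, the helper reports two issues: remove_headers is set while detect_layout is disabled, and use_ocr is off for gpt-3.5-turbo.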

7. Creating a Parser and Example Response

API Call to Create a Parser

Here's an example of how to create an AI parser using an API call:

import json

import requests

# Endpoint and API key placeholder for the create-parser request.
url = "https://api.documentpro.com/v1/parsers"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
}

# Parser definition. Python booleans (True/False) are serialized as JSON true/false.
parser_data = {
    "template_title": "Custom Invoice Parser",
    "template_type": "Invoice",
    "template_schema": {
        "fields": [
            {
                "name": "invoice_number",
                "type": "text",
                "description": "Unique identifier for the invoice",
                "required": True,
            },
            {
                "name": "total_amount",
                "type": "number",
                "description": "Total amount on the invoice",
            },
            {
                "name": "line_items",
                "type": "table",
                "description": "Individual items on the invoice",
                "subFields": [
                    {
                        "name": "description",
                        "type": "text",
                        "description": "Description of the item",
                    },
                    {
                        "name": "quantity",
                        "type": "number",
                        "description": "Quantity of the item",
                    },
                    {
                        "name": "unit_price",
                        "type": "number",
                        "description": "Price per unit of the item",
                    },
                ],
            },
        ]
    },
    "webhook_url": "https://your-app.com/webhook",
    "parser_config": {
        "parse_email_attachments": True,
        "parse_email_body": False,
        "date_format": "%Y-%m-%d",
        "ocr_config": {
            "use_ocr": True,
            "detect_layout": True,
            "detect_tables": True,
        },
        "query_config": {
            "query_model": "gpt-4o-mini",
            "page_ranges": "1-3",
        },
    },
}

# Send the definition as a JSON request body and print the result.
response = requests.post(url, headers=headers, data=json.dumps(parser_data))
print(response.status_code)
print(response.json())

Example Response Object

Upon successful creation of a parser, you might receive a response like this:

{
    "template_id": "8f7e6d5c-4b3a-2a1e-9f8e-7d6c5b4a3f2e",
    "template_title": "Custom Invoice Parser",
    "template_type": "Invoice",
    "template_schema": {
        "fields": [
            {
                "name": "invoice_number",
                "type": "text",
                "description": "Unique identifier for the invoice",
                "required": true
            },
            {
                "name": "total_amount",
                "type": "number",
                "description": "Total amount on the invoice"
            },
            {
                "name": "line_items",
                "type": "table",
                "description": "Individual items on the invoice",
                "subFields": [
                    {
                        "name": "description",
                        "type": "text",
                        "description": "Description of the item"
                    },
                    {
                        "name": "quantity",
                        "type": "number",
                        "description": "Quantity of the item"
                    },
                    {
                        "name": "unit_price",
                        "type": "number",
                        "description": "Price per unit of the item"
                    }
                ]
            }
        ]
    },
    "email_id": "custom_invoice_parser_8f7e6d5c@inbox.documentpro-ai.com",
    "webhook_url": "https://your-app.com/webhook",
    "parser_config": {
        "parse_email_attachments": true,
        "parse_email_body": false,
        "date_format": "%Y-%m-%d",
        "ocr_config": {
            "detect_layout": true,
            "detect_tables": true
        },
        "query_config": {
            "query_model": "gpt-4o-mini",
            "page_ranges": "1-3"
        }
    },
    "created_at": "2024-07-22T14:30:00Z"
}

This response includes the newly generated template_id, the automatically created email_id, and a created_at timestamp, along with all the information you provided when creating the parser.
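Once the response has been decoded, the server-generated values can be pulled out for later use, for example to store the template_id or to show the forwarding address to users. The sketch below works on a trimmed copy of the example response; summarize_parser is a hypothetical helper, and it assumes the response has the shape shown above.

```python
from datetime import datetime


def summarize_parser(parser):
    """Pick out the server-generated values from a create-parser response."""
    # Replace the trailing "Z" so fromisoformat also works on older Python versions.
    created = datetime.fromisoformat(parser["created_at"].replace("Z", "+00:00"))
    return {
        "template_id": parser["template_id"],
        "email_id": parser["email_id"],
        "created_at": created,
        "field_names": [f["name"] for f in parser["template_schema"]["fields"]],
    }


# Trimmed copy of the example response above.
example_response = {
    "template_id": "8f7e6d5c-4b3a-2a1e-9f8e-7d6c5b4a3f2e",
    "template_schema": {
        "fields": [
            {"name": "invoice_number", "type": "text"},
            {"name": "total_amount", "type": "number"},
        ]
    },
    "email_id": "custom_invoice_parser_8f7e6d5c@inbox.documentpro-ai.com",
    "created_at": "2024-07-22T14:30:00Z",
}

summary = summarize_parser(example_response)
print(summary["template_id"])
print(summary["email_id"])
print(summary["field_names"])
```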