Run Extract

After uploading a document, use this API endpoint to run an extract on it. This guide covers the request parameters, example calls, and the response format.

You can run multiple extracts on the same document, or run the same extract multiple times. Each run generates a unique request ID.

API Endpoint

Note that "parser" and "workflow" are used interchangeably here.

GET https://api.documentpro.ai/v1/documents/{document_id}/run_parser

Path Parameters

document_id (required): The unique identifier of the document you want to parse.

Query Parameters

template_id (required): The unique identifier of the Workflow you want to use.
use_ocr (conditional): Must be set to true if query_model is "gpt-3.5-turbo" or if using any OCR-related parameters.
query_model (optional): Specifies the AI model for parsing (e.g., "gpt-4o-mini", "gpt-4o").
detect_layout (optional): Set to true to detect document layout. Requires use_ocr=true.
detect_tables (optional): Set to true to detect tables in the document. Requires use_ocr=true.
page_ranges (optional): Specifies which pages to parse (e.g., "1-3,5,7-9").
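
To make the page_ranges format concrete, here is a minimal sketch that expands a value such as "1-3,5,7-9" into individual page numbers. It assumes ranges are 1-indexed and inclusive, which the example value suggests but this guide does not state outright. The function name expand_page_ranges is hypothetical and not part of the API; it can be useful for validating a value locally before sending it.

```python
from typing import List


def expand_page_ranges(page_ranges: str) -> List[int]:
    """Expand a string such as "1-3,5,7-9" into a sorted list of page numbers.

    Assumption: ranges are 1-indexed and inclusive on both ends.
    """
    pages = set()
    for part in page_ranges.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)


print(expand_page_ranges("1-3,5,7-9"))
```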

Document Segmentation

chunk_by_pages (optional): An integer specifying how many pages to use in each segment for method 1 segmentation.
rolling_window (optional): An integer specifying the window size for method 2 segmentation.
start_regex (optional): A regex pattern to define where parsing should begin for method 3 segmentation. Requires use_ocr=true.
end_regex (optional): A regex pattern to define where parsing should end for method 3 segmentation. Requires use_ocr=true.
split_regex (optional): A regex pattern to split the document into sections for method 4 segmentation. Requires use_ocr=true.
use_all_matches (optional): Set to true to use all regex matches instead of just the first for methods 3 and 4. Requires use_ocr=true.
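
Several of the parameters above only work when use_ocr=true. The sketch below shows one way to catch a missing use_ocr on the client before the request is sent. The helper build_run_parser_params is hypothetical and not part of any DocumentPro SDK; it only encodes the rules listed in this guide, and the server may apply further checks.

```python
# Boolean flags that this guide says require use_ocr=true when enabled.
OCR_FLAGS = ("detect_layout", "detect_tables", "use_all_matches")
# Regex parameters that this guide says require use_ocr=true when supplied.
OCR_REGEXES = ("start_regex", "end_regex", "split_regex")


def build_run_parser_params(template_id, **options):
    """Build the query-parameter dict and check the use_ocr dependencies."""
    params = {"template_id": template_id}
    for name, value in options.items():
        if value is None or value == "":
            continue  # skip unset options instead of sending empty values
        if isinstance(value, bool):
            value = "true" if value else "false"
        params[name] = str(value)

    needs_ocr = [name for name in OCR_FLAGS if params.get(name) == "true"]
    needs_ocr += [name for name in OCR_REGEXES if name in params]
    if params.get("query_model") == "gpt-3.5-turbo":
        needs_ocr.append("query_model=gpt-3.5-turbo")

    if needs_ocr and params.get("use_ocr") != "true":
        raise ValueError(
            "use_ocr must be true when using: " + ", ".join(needs_ocr)
        )
    return params


params = build_run_parser_params(
    "8e9beda9-5cba-42eb-a70a-b3e5eec9120a",
    use_ocr=True,
    detect_tables=True,
    chunk_by_pages=5,
)
print(params)
```

The returned dict can be passed as the params argument of requests.get.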

Headers

x-api-key (required): Your API key for authentication.
Accept (optional): Specify the desired response format (e.g., "application/json", "application/xml").

Example Implementation

Using cURL

curl --location 'https://api.documentpro.ai/v1/documents/15dadb3e-b177-4069-be85-9711bb8e3ed1/run_parser?template_id=8e9beda9-5cba-42eb-a70a-b3e5eec9120a&use_ocr=true&query_model=gpt-4o&detect_layout=true&detect_tables=true&page_ranges=1&chunk_by_pages=5&start_regex=&end_regex=&split_regex=&use_all_matches=true' \
--header 'x-api-key: YOUR_API_KEY' \
--header 'Accept: application/json'

Using Python

import requests

document_id = "15dadb3e-b177-4069-be85-9711bb8e3ed1"
url = f"https://api.documentpro.ai/v1/documents/{document_id}/run_parser"

headers = {
    'x-api-key': 'YOUR_API_KEY',
    'Accept': 'application/json'
}

params = {
    'template_id': '8e9beda9-5cba-42eb-a70a-b3e5eec9120a',
    'use_ocr': 'true',
    'query_model': 'gpt-4o',
    'detect_layout': 'true',
    'detect_tables': 'true',
    'page_ranges': '1',
    'chunk_by_pages': '5',
    'start_regex': '',
    'end_regex': '',
    'split_regex': '',
    'use_all_matches': 'true'
}

# A timeout keeps the call from hanging indefinitely if the server does not respond.
response = requests.get(url, headers=headers, params=params, timeout=30)

if response.status_code == 200:
    print('Parser run successfully initiated')
    print(response.json())
else:
    print('Failed to run parser')
    print(response.text)

Response

The response will contain information about the parsing job, including a request ID that you can use to check the status of the parsing process.

Successful Response (Status Code: 200)

{
    "request_id": "a7813466-6f9a-4c33-8128-427e7a4df755",
    "request_status": "pending",
    "response_body": {
        "file_name": "abcd.pdf",
        "file_presigned_url": null,
        "user_error_msg": null,
        "template_id": "8e9beda9-5cba-42eb-a70a-b3e5eec9120a",
        "template_type": "asbestos report",
        "template_title": "Acorn",
        "num_pages": 1,
        "human_verification_status": "pending",
        "has_missing_required_fields": false,
        "result_json_data": null
    },
    "created_at": "2024-07-25T16:29:40.102372",
    "updated_at": "2024-07-25T16:29:40.191219"
}

Error Response (Status Codes: 400, 401, 403, 404, 500)

{
    "success": false,
    "error": "error_code",
    "message": "descriptive error message"
}
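
One way to handle the error body shown above is to turn it into a Python exception. This is a sketch only: RunParserError and raise_for_error_body are hypothetical names, and the code assumes error responses always carry the "error" and "message" keys from the example, falling back to placeholders if they are missing.

```python
class RunParserError(Exception):
    """Raised when the run_parser call returns a non-200 status."""

    def __init__(self, status_code, error, message):
        super().__init__(f"{status_code} {error}: {message}")
        self.status_code = status_code
        self.error = error
        self.message = message


def raise_for_error_body(status_code, body):
    """Raise RunParserError for any non-200 response; do nothing on 200."""
    if status_code == 200:
        return
    raise RunParserError(
        status_code,
        body.get("error", "unknown_error"),
        body.get("message", ""),
    )


# Example with a made-up 404 body in the documented shape.
try:
    raise_for_error_body(
        404,
        {"success": False, "error": "error_code", "message": "descriptive error message"},
    )
except RunParserError as exc:
    print(exc)
```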

Response Fields Explained

request_id: Unique identifier for the parsing job. Use this to retrieve results later.
request_status: Can be "pending", "processing", "complete", "exception", or "failure".
- "exception" indicates an AI-related error.
- If the status is "failure" or "exception", user_error_msg will contain a human-readable error message.
file_presigned_url: URL for downloading the parsed file (only the selected pages). Available when processing is complete.
human_verification_status: Can be "pending", "approved", or "rejected".
result_json_data: Will be populated with the extracted data when processing is completed.
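
The fields above can be read from a response with a small helper like the one sketched below. The function summarize_job and its return labels are hypothetical. As a precaution the sketch treats both "failure" and "failed" as error states, and it assumes user_error_msg and result_json_data live under response_body as in the sample response.

```python
from typing import Any, Optional, Tuple

IN_PROGRESS = {"pending", "processing"}
ERRORED = {"exception", "failure", "failed"}


def summarize_job(job: dict) -> Tuple[str, Optional[Any]]:
    """Map a job response to (state, detail).

    state is "in_progress", "done", "error", or "unknown".
    detail is the extracted data, the user-facing error message,
    or the raw status when it is not recognised.
    """
    status = job.get("request_status")
    body = job.get("response_body") or {}
    if status in IN_PROGRESS:
        return ("in_progress", None)
    if status == "complete":
        return ("done", body.get("result_json_data"))
    if status in ERRORED:
        return ("error", body.get("user_error_msg"))
    return ("unknown", status)


sample = {
    "request_id": "a7813466-6f9a-4c33-8128-427e7a4df755",
    "request_status": "pending",
    "response_body": {"user_error_msg": None, "result_json_data": None},
}
print(summarize_job(sample))
```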

Important Notes

Only the template_id query parameter is always required. use_ocr is required in the cases listed next, and all other query parameters are optional.
use_ocr must be set to true if:
- The query_model is "gpt-3.5-turbo"
- You're using any OCR-related parameters (detect_layout, detect_tables, start_regex, end_regex, split_regex, use_all_matches)
The page_ranges parameter does not apply to image files.
Regex parameters are powerful tools for customizing the parsing process. Use them carefully.
The parsing process is asynchronous. This API call initiates the process but does not return the parsed results directly.
You can apply multiple parsers to the same document or the same parser multiple times. Each application will generate a unique request_id.
For long documents (more than 10 pages), consider using one of the segmentation methods to improve parsing performance.

Next Steps

After initiating a parsing job:

Retrieve Workflow results once the Workflow is complete.
If needed, update the Workflow configuration to adjust default settings for future Workflow jobs.
Consider applying additional Workflows to the same document if you need to extract different types of information.