Run Extract
After uploading a document, you can use this API endpoint to run an extract on the document. This guide explains how.
Note that you can run multiple extracts on the same document, or run the same extract multiple times. Each run generates a unique request ID.
API Endpoint
Note that "parser" and "workflow" are used interchangeably here.
GET https://api.documentpro.ai/v1/documents/{document_id}/run_parser
Path Parameters
document_id (required): The unique identifier of the document you want to parse.
Query Parameters
template_id (required): The unique identifier of the Workflow you want to use.
use_ocr (conditional): Must be set to true if query_model is "gpt-3.5-turbo" or if using any OCR-related parameters.
query_model (optional): Specifies the AI model for parsing (e.g., "gpt-4o-mini", "gpt-4o").
detect_layout (optional): Set to true to detect document layout. Requires use_ocr=true.
detect_tables (optional): Set to true to detect tables in the document. Requires use_ocr=true.
page_ranges (optional): Specifies which pages to parse (e.g., "1-3,5,7-9").
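To make the page_ranges format concrete, here is a minimal sketch that expands a value such as "1-3,5,7-9" into individual page numbers. It assumes ranges are 1-based and inclusive, which the example value suggests but this guide does not state outright. The helper name expand_page_ranges is hypothetical and is not part of the API.

```python
def expand_page_ranges(spec: str) -> list[int]:
    """Expand a page_ranges string such as "1-3,5,7-9" into sorted page numbers.

    Assumption: pages are 1-based and ranges are inclusive at both ends.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)


if __name__ == "__main__":
    print(expand_page_ranges("1-3,5,7-9"))  # [1, 2, 3, 5, 7, 8, 9]
```

Running a check like this locally before sending the request can catch a malformed value early.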
Document Segmentation
chunk_by_pages (optional): An integer specifying how many pages to use in each segment for method 1 segmentation.
rolling_window (optional): An integer specifying the window size for method 2 segmentation.
start_regex (optional): A regex pattern to define where parsing should begin for method 3 segmentation. Requires use_ocr=true.
end_regex (optional): A regex pattern to define where parsing should end for method 3 segmentation. Requires use_ocr=true.
split_regex (optional): A regex pattern to split the document into sections for method 4 segmentation. Requires use_ocr=true.
use_all_matches (optional): Set to true to use all regex matches instead of just the first for methods 3 and 4. Requires use_ocr=true.
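Before sending a regex parameter, it can help to preview locally which parts of your text a pattern matches. The sketch below uses Python's re module on sample text and mimics the "first match only" versus "all matches" distinction described for use_all_matches. It is only an approximation: the regex flavor used by the service and the exact text it matches against (the OCR output) are not documented here. The function name preview_matches is hypothetical.

```python
import re


def preview_matches(pattern: str, text: str, use_all_matches: bool = False):
    """Return (offset, matched_text) pairs for a pattern in sample text.

    With use_all_matches=False only the first match is returned, mirroring
    the documented default. This runs locally with Python's regex engine
    and does not call the API.
    """
    matches = [(m.start(), m.group(0)) for m in re.finditer(pattern, text)]
    return matches if use_all_matches else matches[:1]


if __name__ == "__main__":
    sample = "Invoice 001\nline\nInvoice 002\nline"
    print(preview_matches(r"Invoice \d+", sample))
    print(preview_matches(r"Invoice \d+", sample, use_all_matches=True))
```

If the preview returns no matches or far more than expected, adjust the pattern before using it as start_regex, end_regex, or split_regex.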
Headers
x-api-key (required): Your API key for authentication.
Accept (optional): Specify the desired response format (e.g., "application/json", "application/xml").
Example Implementation
Using cURL
curl --location 'https://api.documentpro.ai/v1/documents/15dadb3e-b177-4069-be85-9711bb8e3ed1/run_parser?template_id=8e9beda9-5cba-42eb-a70a-b3e5eec9120a&use_ocr=true&query_model=gpt-4o&detect_layout=true&detect_tables=true&page_ranges=1&chunk_by_pages=5&start_regex=&end_regex=&split_regex=&use_all_matches=true' \
--header 'x-api-key: YOUR_API_KEY' \
--header 'Accept: application/json'
Using Python
import requests

# Identifier of a previously uploaded document
document_id = "15dadb3e-b177-4069-be85-9711bb8e3ed1"
url = f"https://api.documentpro.ai/v1/documents/{document_id}/run_parser"

headers = {
    'x-api-key': 'YOUR_API_KEY',
    'Accept': 'application/json'
}

# Query parameters; only template_id is required
params = {
    'template_id': '8e9beda9-5cba-42eb-a70a-b3e5eec9120a',
    'use_ocr': 'true',
    'query_model': 'gpt-4o',
    'detect_layout': 'true',
    'detect_tables': 'true',
    'page_ranges': '1',
    'chunk_by_pages': '5',
    'start_regex': '',
    'end_regex': '',
    'split_regex': '',
    'use_all_matches': 'true'
}

# Start the parsing job; the response carries a request ID, not the results
response = requests.get(url, headers=headers, params=params)

if response.status_code == 200:
    print('Parser run successfully initiated')
    print(response.json())
else:
    print('Failed to run parser')
    print(response.text)
Response
The response will contain information about the parsing job, including a request ID that you can use to check the status of the parsing process.
Successful Response (Status Code: 200)
{
  "request_id": "a7813466-6f9a-4c33-8128-427e7a4df755",
  "request_status": "pending",
  "response_body": {
    "file_name": "abcd.pdf",
    "file_presigned_url": null,
    "user_error_msg": null,
    "template_id": "8e9beda9-5cba-42eb-a70a-b3e5eec9120a",
    "template_type": "asbestos report",
    "template_title": "Acorn",
    "num_pages": 1,
    "human_verification_status": "pending",
    "has_missing_required_fields": false,
    "result_json_data": null
  },
  "created_at": "2024-07-25T16:29:40.102372",
  "updated_at": "2024-07-25T16:29:40.191219"
}
Error Response (Status Codes: 400, 401, 403, 404, 500)
{
  "success": false,
  "error": "error_code",
  "message": "descriptive error message"
}
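One way to handle the error shape above is to turn it into a Python exception. The following is a minimal sketch, assuming the error body always carries the "error" and "message" keys shown and that any status other than 200 is an error. DocumentProError and raise_for_error are hypothetical names introduced for illustration, not part of any official client.

```python
class DocumentProError(Exception):
    """Hypothetical exception wrapping the documented error response."""

    def __init__(self, status_code: int, error: str, message: str):
        super().__init__(f"{status_code} {error}: {message}")
        self.status_code = status_code
        self.error = error
        self.message = message


def raise_for_error(status_code: int, body: dict) -> dict:
    """Return the body on success (200); otherwise raise DocumentProError."""
    if status_code == 200:
        return body
    raise DocumentProError(
        status_code,
        body.get("error", "unknown_error"),  # fallback if the key is absent
        body.get("message", ""),
    )
```

With the requests example, this could be called as raise_for_error(response.status_code, response.json()), provided the error body is valid JSON.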
Response Fields Explained
request_id: Unique identifier for the parsing job. Use this to retrieve results later.
request_status: Can be "pending", "processing", "complete", "exception", or "failure".
- "exception" indicates an AI-related error.
- If status is "failed" or "exception", user_error_msg will contain a human-readable error message.
file_presigned_url: URL for downloading the parsed file (only the selected pages). Available when processing is complete.
human_verification_status: Can be "pending", "approved", or "rejected".
result_json_data: Will be populated with the extracted data when processing is completed.
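Because the job starts as "pending", a caller typically polls until request_status reaches a final value. The sketch below shows one possible polling loop. This guide does not give the endpoint for retrieving results, so the loop takes a caller-supplied fetch_status function (hypothetical) that returns the job as a dict in the shape shown above. It treats both "failure" and "failed" as final, since both spellings appear in this guide.

```python
import time
from typing import Callable

# Statuses after which the job will not change (assumption based on this guide)
FINAL_STATUSES = {"complete", "exception", "failure", "failed"}


def wait_for_job(fetch_status: Callable[[], dict], interval: float = 2.0,
                 max_attempts: int = 30, sleep=time.sleep) -> dict:
    """Poll fetch_status() until the job reaches a final status."""
    for _ in range(max_attempts):
        job = fetch_status()
        if job.get("request_status") in FINAL_STATUSES:
            return job
        sleep(interval)
    raise TimeoutError("job did not finish within the allotted attempts")


def extract_result(job: dict):
    """Return result_json_data for a complete job, else raise with the error message."""
    body = job.get("response_body") or {}
    status = job.get("request_status")
    if status == "complete":
        return body.get("result_json_data")
    raise RuntimeError(body.get("user_error_msg") or f"job ended with status {status!r}")
```

The sleep argument exists so that the delay can be replaced in tests; in normal use the default time.sleep applies.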
Important Notes
- Only the template_id query parameter is required; all others are optional.
- use_ocr must be set to true if:
  - The query_model is "gpt-3.5-turbo"
  - You're using any OCR-related parameters (detect_layout, detect_tables, start_regex, end_regex, split_regex, use_all_matches)
- The page_ranges parameter does not apply to image files.
- Regex parameters are powerful tools for customizing the parsing process. Use them carefully.
- The parsing process is asynchronous. This API call initiates the process but does not return the parsed results directly.
- You can apply multiple parsers to the same document or the same parser multiple times. Each application will generate a unique request_id.
- For long documents (more than 10 pages), consider using one of the segmentation methods to improve parsing performance.
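The use_ocr rule in the notes above can be enforced on the client before the request is sent. The following is a minimal sketch under stated assumptions: build_params is a hypothetical helper, booleans are serialized as the strings "true" and "false" as in the examples in this guide, and an OCR-related parameter counts as "used" only when it is set to a non-empty value other than "false".

```python
# Parameters that this guide says require use_ocr=true
OCR_DEPENDENT = ("detect_layout", "detect_tables", "start_regex",
                 "end_regex", "split_regex", "use_all_matches")


def build_params(template_id: str, **options) -> dict:
    """Build query parameters and apply the documented use_ocr rule."""
    params = {"template_id": template_id}
    for name, value in options.items():
        if value is None:
            continue  # skip unset options
        params[name] = str(value).lower() if isinstance(value, bool) else str(value)

    needs_ocr = params.get("query_model") == "gpt-3.5-turbo" or any(
        params.get(name) not in (None, "", "false") for name in OCR_DEPENDENT
    )
    if needs_ocr:
        if params.get("use_ocr") == "false":
            raise ValueError("use_ocr must be true for the selected options")
        params["use_ocr"] = "true"
    return params


if __name__ == "__main__":
    print(build_params("8e9beda9-5cba-42eb-a70a-b3e5eec9120a",
                       query_model="gpt-4o", detect_tables=True))
```

The returned dict can be passed as the params argument of requests.get. Raising on an explicit use_ocr=False, rather than silently overriding it, is a design choice that surfaces conflicting settings early.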
Next Steps
After initiating a parsing job:
- Retrieve Workflow results once the Workflow is complete.
- If needed, update the Workflow configuration to adjust default settings for future Workflow jobs.
- Consider applying additional Workflows to the same document if you need to extract different types of information.