Parse Document
After uploading a document, you can use this API endpoint to run a specific parser on it.
Note that you can apply multiple different parsers to the same document, or apply the same parser multiple times to a single document. Each parser application will generate a unique request ID and result set.

API Endpoint
GET https://api.documentpro.ai/v1/documents/{document_id}/run_parser

Path Parameters
- document_id (required): The unique identifier of the document you want to parse.

Query Parameters
- template_id (required): The unique identifier of the parser you want to use.
- use_ocr (conditional): Must be set to true if query_model is "gpt-3.5-turbo" or if using any OCR-related parameters.
- query_model (optional): Specifies the AI model for parsing (e.g., "gpt-4o-mini", "gpt-4o").
- detect_layout (optional): Set to true to detect document layout. Requires use_ocr=true.
- detect_tables (optional): Set to true to detect tables in the document. Requires use_ocr=true.
- page_ranges (optional): Specifies which pages to parse (e.g., "1-3,5,7-9").
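
To make the page_ranges format concrete, the sketch below expands a string such as "1-3,5,7-9" into the individual page numbers it appears to describe. The helper name expand_page_ranges is hypothetical, and how the server actually interprets the string is an assumption inferred from the example value, not documented behavior.

```python
def expand_page_ranges(page_ranges: str) -> list[int]:
    """Expand a string such as "1-3,5,7-9" into a sorted list of page numbers.

    Client-side illustration only; the API's own parsing rules may differ.
    """
    pages = set()
    for part in page_ranges.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"Invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(expand_page_ranges("1-3,5,7-9"))  # [1, 2, 3, 5, 7, 8, 9]
```

A check like this can catch malformed values before the request is sent.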

Document Segmentation
- chunk_by_pages (optional): An integer specifying how many pages to use in each segment for method 1 segmentation.
- rolling_window (optional): An integer specifying the window size for method 2 segmentation.
- start_regex (optional): A regex pattern to define where parsing should begin for method 3 segmentation. Requires use_ocr=true.
- end_regex (optional): A regex pattern to define where parsing should end for method 3 segmentation. Requires use_ocr=true.
- split_regex (optional): A regex pattern to split the document into sections for method 4 segmentation. Requires use_ocr=true.
- use_all_matches (optional): Set to true to use all regex matches instead of just the first for methods 3 and 4. Requires use_ocr=true.
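
The four segmentation methods map onto the query parameters roughly as sketched below. The helper segmentation_params is hypothetical, and treating the methods as mutually exclusive is an assumption, since the reference does not say whether they can be combined.

```python
# Hypothetical mapping from the documented method numbers to their parameters.
ALLOWED_PARAMS = {
    1: {"chunk_by_pages"},
    2: {"rolling_window"},
    3: {"start_regex", "end_regex", "use_all_matches"},
    4: {"split_regex", "use_all_matches"},
}
REGEX_METHODS = {3, 4}  # these parameters are documented as requiring use_ocr=true


def segmentation_params(method: int, **options) -> dict:
    """Build query parameters for one segmentation method (assumed exclusive)."""
    if method not in ALLOWED_PARAMS:
        raise ValueError(f"Unknown segmentation method: {method}")
    unknown = set(options) - ALLOWED_PARAMS[method]
    if unknown:
        raise ValueError(f"Not valid for method {method}: {sorted(unknown)}")

    params = {}
    for name, value in options.items():
        # Query strings carry text, so booleans become "true"/"false".
        if isinstance(value, bool):
            params[name] = "true" if value else "false"
        else:
            params[name] = str(value)
    if method in REGEX_METHODS:
        params["use_ocr"] = "true"
    return params


print(segmentation_params(1, chunk_by_pages=5))
print(segmentation_params(4, split_regex=r"^Invoice", use_all_matches=True))
```

The returned dictionary can be merged into the other query parameters before the request is sent.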

Headers
- x-api-key (required): Your API key for authentication.
- Accept (optional): Specify the desired response format (e.g., "application/json", "application/xml").

Example Implementation

Using cURL
```bash
curl --location 'https://api.documentpro.ai/v1/documents/15dadb3e-b177-4069-be85-9711bb8e3ed1/run_parser?template_id=8e9beda9-5cba-42eb-a70a-b3e5eec9120a&use_ocr=true&query_model=gpt-4o&detect_layout=true&detect_tables=true&page_ranges=1&chunk_by_pages=5&start_regex=&end_regex=&split_regex=&use_all_matches=true' \
  --header 'x-api-key: YOUR_API_KEY' \
  --header 'Accept: application/json'
```

Using Python
```python
import requests

document_id = "15dadb3e-b177-4069-be85-9711bb8e3ed1"
url = f"https://api.documentpro.ai/v1/documents/{document_id}/run_parser"

headers = {
    'x-api-key': 'YOUR_API_KEY',
    'Accept': 'application/json'
}

params = {
    'template_id': '8e9beda9-5cba-42eb-a70a-b3e5eec9120a',
    'use_ocr': 'true',
    'query_model': 'gpt-4o',
    'detect_layout': 'true',
    'detect_tables': 'true',
    'page_ranges': '1',
    'chunk_by_pages': '5',
    'start_regex': '',
    'end_regex': '',
    'split_regex': '',
    'use_all_matches': 'true'
}

response = requests.get(url, headers=headers, params=params)

if response.status_code == 200:
    print('Parser run successfully initiated')
    print(response.json())
else:
    print('Failed to run parser')
    print(response.text)
```

Response
The response will contain information about the parsing job, including a request ID that you can use to check the status of the parsing process.

Successful Response (Status Code: 200)
```json
{
  "request_id": "a7813466-6f9a-4c33-8128-427e7a4df755",
  "request_status": "pending",
  "response_body": {
    "file_name": "abcd.pdf",
    "file_presigned_url": null,
    "user_error_msg": null,
    "template_id": "8e9beda9-5cba-42eb-a70a-b3e5eec9120a",
    "template_type": "asbestos report",
    "template_title": "Acorn",
    "num_pages": 1,
    "human_verification_status": "pending",
    "has_missing_required_fields": false,
    "result_json_data": null
  },
  "created_at": "2024-07-25T16:29:40.102372",
  "updated_at": "2024-07-25T16:29:40.191219"
}
```

Error Response (Status Codes: 400, 401, 403, 404, 500)
```json
{
  "success": false,
  "error": "error_code",
  "message": "descriptive error message"
}
```
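
One way to handle the two response shapes is to turn an error body into an exception. The sketch below works on an already decoded status code and JSON body; ParserRequestError and unwrap_response are hypothetical names, and the fallback values for missing keys are assumptions.

```python
class ParserRequestError(Exception):
    """Raised when the API answers with a non-200 status (hypothetical helper)."""

    def __init__(self, status_code: int, error: str, message: str):
        super().__init__(f"HTTP {status_code} [{error}]: {message}")
        self.status_code = status_code
        self.error = error
        self.message = message


def unwrap_response(status_code: int, body: dict) -> dict:
    """Return the job description on success, raise ParserRequestError otherwise."""
    if status_code == 200:
        return body
    # Error bodies are documented as {"success": false, "error": ..., "message": ...};
    # the defaults below only guard against missing keys.
    raise ParserRequestError(
        status_code,
        body.get("error", "unknown_error"),
        body.get("message", ""),
    )


try:
    unwrap_response(401, {"success": False, "error": "error_code",
                          "message": "descriptive error message"})
except ParserRequestError as exc:
    print(exc)  # HTTP 401 [error_code]: descriptive error message
```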

Response Fields Explained
- request_id: Unique identifier for the parsing job. Use this to retrieve results later.
- request_status: Can be "pending", "processing", "complete", "exception", or "failure".
  - "exception" indicates an AI-related error.
  - If the status is "failure" or "exception", user_error_msg will contain a human-readable error message.
- file_presigned_url: URL for downloading the parsed file (only the selected pages). Available when processing is complete.
- human_verification_status: Can be "pending", "approved", or "rejected".
- result_json_data: Populated with the extracted data when processing is complete.
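
The status values can be interpreted in client code along these lines. This is a minimal sketch over a decoded job dictionary with the field names from the sample response; extract_result is a hypothetical helper, and raising RuntimeError for unsuccessful jobs is a design choice made for illustration.

```python
IN_PROGRESS = {"pending", "processing"}
UNSUCCESSFUL = {"exception", "failure"}


def extract_result(job: dict):
    """Return result_json_data when the job is complete, None while it is running.

    Raises RuntimeError with user_error_msg for "exception" or "failure".
    """
    status = job.get("request_status")
    body = job.get("response_body") or {}
    if status == "complete":
        return body.get("result_json_data")
    if status in UNSUCCESSFUL:
        raise RuntimeError(
            body.get("user_error_msg") or f"Parsing ended with status {status!r}"
        )
    if status in IN_PROGRESS:
        return None
    raise ValueError(f"Unrecognized request_status: {status!r}")


pending_job = {"request_status": "pending",
               "response_body": {"result_json_data": None}}
print(extract_result(pending_job))  # None
```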

Important Notes
- Only the template_id query parameter is required; all others are optional.
- use_ocr must be set to true if:
  - The query_model is "gpt-3.5-turbo"
  - You're using any OCR-related parameters (detect_layout, detect_tables, start_regex, end_regex, split_regex, use_all_matches)
- The page_ranges parameter does not apply to image files.
- Regex parameters are powerful tools for customizing the parsing process. Use them carefully.
- The parsing process is asynchronous. This API call initiates the process but does not return the parsed results directly.
- You can apply multiple parsers to the same document or the same parser multiple times. Each application will generate a unique request_id.
- For long documents (more than 10 pages), consider using one of the segmentation methods to improve parsing performance.
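
The use_ocr rule above can be checked on the client before a request is sent. The sketch below is one possible check; requires_ocr and with_required_ocr are hypothetical helpers, and treating empty-string values as "parameter not used" is an assumption, not documented behavior.

```python
OCR_DEPENDENT_PARAMS = {
    "detect_layout", "detect_tables",
    "start_regex", "end_regex", "split_regex", "use_all_matches",
}


def requires_ocr(params: dict) -> bool:
    """Return True when the documented rules call for use_ocr=true."""
    if params.get("query_model") == "gpt-3.5-turbo":
        return True
    # Assumption: a missing or empty value means the parameter is not in use.
    return any(params.get(name) not in (None, "") for name in OCR_DEPENDENT_PARAMS)


def with_required_ocr(params: dict) -> dict:
    """Return a copy of params with use_ocr set to "true" when it is required."""
    result = dict(params)
    if requires_ocr(result):
        result["use_ocr"] = "true"
    return result


print(with_required_ocr({"template_id": "TEMPLATE_ID", "detect_tables": "true"}))
```

Returning a copy leaves the caller's dictionary unchanged, so the same base parameters can be reused across requests.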

Next Steps
After initiating a parsing job:
- Retrieve parsing results once the parsing is complete.
- If needed, update the parser configuration to adjust default settings for future parsing jobs.
- Consider applying additional parsers to the same document if you need to extract different types of information.