Parse Document

After uploading a document, use this endpoint to run a specific parser on it. This guide covers the request parameters, example requests, and the response format.

Note that you can apply multiple different parsers to the same document, or apply the same parser multiple times to a single document. Each parser application will generate a unique request ID and result set.

API Endpoint

GET https://api.documentpro.ai/v1/documents/{document_id}/run_parser

Path Parameters

  • document_id (required): The unique identifier of the document you want to parse.

Query Parameters

  • template_id (required): The unique identifier of the parser you want to use.
  • use_ocr (conditional): Must be set to true if query_model is "gpt-3.5-turbo" or if using any OCR-related parameters.
  • query_model (optional): Specifies the AI model for parsing (e.g., "gpt-4o-mini", "gpt-4o").
  • detect_layout (optional): Set to true to detect document layout. Requires use_ocr=true.
  • detect_tables (optional): Set to true to detect tables in the document. Requires use_ocr=true.
  • page_ranges (optional): Specifies which pages to parse (e.g., "1-3,5,7-9").
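The use_ocr rule above is easy to miss when assembling a request by hand. The following is a minimal sketch of a client-side helper that builds the query parameters and rejects combinations that need use_ocr=true. The function name build_parser_params is hypothetical and not part of the API; only the parameter names come from this page.

```python
def build_parser_params(template_id, query_model=None, use_ocr=False,
                        detect_layout=False, detect_tables=False,
                        page_ranges=None):
    """Build the run_parser query parameters, enforcing the use_ocr rule.

    Hypothetical client-side helper; the API itself may validate differently.
    """
    if not template_id:
        raise ValueError("template_id is required")

    # Per the parameter list: these options only work together with use_ocr=true.
    needs_ocr = detect_layout or detect_tables or query_model == "gpt-3.5-turbo"
    if needs_ocr and not use_ocr:
        raise ValueError("use_ocr must be true for this combination of parameters")

    params = {"template_id": template_id}
    if use_ocr:
        params["use_ocr"] = "true"
    if query_model:
        params["query_model"] = query_model
    if detect_layout:
        params["detect_layout"] = "true"
    if detect_tables:
        params["detect_tables"] = "true"
    if page_ranges:
        params["page_ranges"] = page_ranges  # e.g. "1-3,5,7-9"
    return params
```

The returned dictionary can be passed directly as the params argument of requests.get. Omitting unset options, rather than sending empty values, keeps the request URL short.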

Document Segmentation

  • chunk_by_pages (optional): An integer specifying how many pages to use in each segment for method 1 segmentation.
  • rolling_window (optional): An integer specifying the window size for method 2 segmentation.
  • start_regex (optional): A regex pattern to define where parsing should begin for method 3 segmentation. Requires use_ocr=true.
  • end_regex (optional): A regex pattern to define where parsing should end for method 3 segmentation. Requires use_ocr=true.
  • split_regex (optional): A regex pattern to split the document into sections for method 4 segmentation. Requires use_ocr=true.
  • use_all_matches (optional): Set to true to use all regex matches instead of just the first for methods 3 and 4. Requires use_ocr=true.
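Because a malformed pattern is only discovered after the job has been submitted, it can help to check regex segmentation parameters locally first. The sketch below assumes Python's re module is a reasonable approximation of the service's regex dialect, which this page does not specify. The function name check_segmentation_params is hypothetical.

```python
import re

REGEX_PARAMS = ("start_regex", "end_regex", "split_regex")


def check_segmentation_params(params):
    """Return a list of problems found in segmentation parameters.

    An empty list means no problem was detected locally. This is a
    hypothetical pre-flight check, not the service's own validation.
    """
    problems = []

    # Compile each non-empty pattern with Python's engine (an assumption:
    # the service may use a different regex dialect).
    for name in REGEX_PARAMS:
        pattern = params.get(name)
        if not pattern:
            continue
        try:
            re.compile(pattern)
        except re.error as exc:
            problems.append(f"{name}: invalid pattern ({exc})")

    # Regex-based segmentation (methods 3 and 4) requires use_ocr=true.
    uses_regex = (any(params.get(n) for n in REGEX_PARAMS)
                  or params.get("use_all_matches") == "true")
    if uses_regex and params.get("use_ocr") != "true":
        problems.append("use_ocr must be 'true' when regex segmentation parameters are set")

    return problems
```

A caller would typically run this on the params dictionary before sending the request and abort if the returned list is non-empty.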

Headers

  • x-api-key (required): Your API key for authentication.
  • Accept (optional): Specify the desired response format (e.g., "application/json", "application/xml").

Example Implementation

Using cURL

curl --location 'https://api.documentpro.ai/v1/documents/15dadb3e-b177-4069-be85-9711bb8e3ed1/run_parser?template_id=8e9beda9-5cba-42eb-a70a-b3e5eec9120a&use_ocr=true&query_model=gpt-4o&detect_layout=true&detect_tables=true&page_ranges=1&chunk_by_pages=5&start_regex=&end_regex=&split_regex=&use_all_matches=true' \
  --header 'x-api-key: YOUR_API_KEY' \
  --header 'Accept: application/json'

Using Python

import requests

document_id = "15dadb3e-b177-4069-be85-9711bb8e3ed1"
url = f"https://api.documentpro.ai/v1/documents/{document_id}/run_parser"

headers = {
    'x-api-key': 'YOUR_API_KEY',
    'Accept': 'application/json'
}

# Query parameters; boolean flags are sent as the strings 'true'/'false'.
params = {
    'template_id': '8e9beda9-5cba-42eb-a70a-b3e5eec9120a',
    'use_ocr': 'true',
    'query_model': 'gpt-4o',
    'detect_layout': 'true',
    'detect_tables': 'true',
    'page_ranges': '1',
    'chunk_by_pages': '5',
    'start_regex': '',
    'end_regex': '',
    'split_regex': '',
    'use_all_matches': 'true'
}

response = requests.get(url, headers=headers, params=params)

if response.status_code == 200:
    print('Parser run successfully initiated')
    print(response.json())
else:
    print('Failed to run parser')
    print(response.text)

Response

The response will contain information about the parsing job, including a request ID that you can use to check the status of the parsing process.

Successful Response (Status Code: 200)

{
    "request_id": "a7813466-6f9a-4c33-8128-427e7a4df755",
    "request_status": "pending",
    "response_body": {
        "file_name": "abcd.pdf",
        "file_presigned_url": null,
        "user_error_msg": null,
        "template_id": "8e9beda9-5cba-42eb-a70a-b3e5eec9120a",
        "template_type": "asbestos report",
        "template_title": "Acorn",
        "num_pages": 1,
        "human_verification_status": "pending",
        "has_missing_required_fields": false,
        "result_json_data": null
    },
    "created_at": "2024-07-25T16:29:40.102372",
    "updated_at": "2024-07-25T16:29:40.191219"
}

Error Response (Status Codes: 400, 401, 403, 404, 500)

{
    "success": false,
    "error": "error_code",
    "message": "descriptive error message"
}
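Since every error status shares the same body shape, a client can turn it into a single exception type. This is a minimal sketch under that assumption; ParserRequestError and raise_for_parser_error are hypothetical names, and the fallback strings cover the case where the body is missing or not JSON.

```python
class ParserRequestError(Exception):
    """Raised when run_parser returns a non-200 status (hypothetical helper)."""

    def __init__(self, status_code, code, message):
        super().__init__(f"HTTP {status_code} [{code}]: {message}")
        self.status_code = status_code
        self.code = code
        self.message = message


def raise_for_parser_error(status_code, body):
    """Raise ParserRequestError unless the status code is 200.

    `body` is the decoded JSON response, or None if decoding failed.
    """
    if status_code == 200:
        return
    body = body if isinstance(body, dict) else {}
    raise ParserRequestError(
        status_code,
        body.get("error", "unknown"),                 # "error" field from the error body
        body.get("message", "no message provided"),   # "message" field from the error body
    )
```

With the requests library, this could be called as raise_for_parser_error(response.status_code, response.json()) immediately after the request returns.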

Response Fields Explained

  • request_id: Unique identifier for the parsing job. Use this to retrieve results later.
  • request_status: Can be "pending", "processing", "complete", "exception", or "failure".
    • "exception" indicates an AI-related error.
    • If status is "failure" or "exception", user_error_msg will contain a human-readable error message.
  • file_presigned_url: URL for downloading the parsed file (only the selected pages). Available when processing is complete.
  • human_verification_status: Can be "pending", "approved", or "rejected".
  • result_json_data: Will be populated with the extracted data when processing is completed.
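The fields above can be reduced to a small summary that tells a caller whether the job has finished and, if so, whether it produced data or an error. The sketch below is illustrative: summarize_job is a hypothetical name, and it treats both "failure" and "failed" as error states as a precaution.

```python
ERROR_STATUSES = {"exception", "failure", "failed"}


def summarize_job(job):
    """Reduce a parsing-job response to done/data/error (hypothetical helper)."""
    status = job.get("request_status")
    body = job.get("response_body") or {}

    if status == "complete":
        # result_json_data is populated once processing has finished.
        return {"done": True, "data": body.get("result_json_data"), "error": None}
    if status in ERROR_STATUSES:
        # user_error_msg carries the human-readable explanation.
        return {"done": True, "data": None, "error": body.get("user_error_msg")}

    # "pending" or "processing": nothing to read yet.
    return {"done": False, "data": None, "error": None}
```

Applied to the successful response shown earlier, whose request_status is "pending", this returns done as False with no data and no error.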

Important Notes

  1. Only the template_id query parameter is always required. use_ocr becomes required in the cases listed in note 2; all other parameters are optional.
  2. use_ocr must be set to true if:
    • The query_model is "gpt-3.5-turbo"
    • You're using any OCR-related parameters (detect_layout, detect_tables, start_regex, end_regex, split_regex, use_all_matches)
  3. The page_ranges parameter does not apply to image files.
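  4. Regex parameters control where parsing begins, ends, and splits, so a pattern that matches too broadly or not at all changes which parts of the document are parsed. Test your patterns before using them.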
  5. The parsing process is asynchronous. This API call initiates the process but does not return the parsed results directly.
  6. You can apply multiple parsers to the same document or the same parser multiple times. Each application will generate a unique request_id.
  7. For long documents (more than 10 pages), consider using one of the segmentation methods to improve parsing performance.

Next Steps

After initiating a parsing job:

  1. Retrieve parsing results once the parsing is complete.
  2. If needed, update the parser configuration to adjust default settings for future parsing jobs.
  3. Consider applying additional parsers to the same document if you need to extract different types of information.
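Because parsing is asynchronous, retrieving results usually means polling until the job leaves the "pending" and "processing" states. The results endpoint is not described on this page, so the sketch below takes a caller-supplied fetch_job function that is assumed to return the job JSON for a request_id. The names wait_for_result and fetch_job, and the default interval and timeout, are hypothetical.

```python
import time


def wait_for_result(fetch_job, request_id, interval=2.0, timeout=120.0,
                    sleep=time.sleep):
    """Poll fetch_job(request_id) until the job is no longer in progress.

    fetch_job is supplied by the caller and must return a dict containing
    "request_status". Raises TimeoutError if the job is still in progress
    once `timeout` seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job(request_id)
        if job.get("request_status") not in ("pending", "processing"):
            return job  # complete, exception, or failure
        if time.monotonic() >= deadline:
            raise TimeoutError(f"parsing job {request_id} still in progress after {timeout}s")
        sleep(interval)
```

The sleep argument exists so the loop can be exercised in tests without real delays. In normal use it can be left at its default.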