Document Segmentation

Document segmentation lets the parser break a long or complex document into manageable chunks before processing it. It is especially helpful for documents longer than 10 pages. DocumentPro offers four methods for document segmentation:

Method 1: Chunk by Pages

  • chunk_by_pages: An integer specifying how many pages to use in each segment.

This method divides the document into fixed-size chunks based on the number of pages. For example, if chunk_by_pages is set to 5, the parser will process the document in 5-page segments.
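The fixed-size grouping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not DocumentPro's implementation: the helper name `chunk_by_pages` and its arguments are hypothetical, and keeping a shorter final chunk when the page count is not a multiple of the chunk size is an assumption.

```python
def chunk_by_pages(total_pages: int, pages_per_chunk: int) -> list[list[int]]:
    """Group 1-based page numbers into fixed-size, non-overlapping chunks."""
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be a positive integer")
    pages = list(range(1, total_pages + 1))
    # Assumption: the last chunk may be shorter than pages_per_chunk.
    return [
        pages[i:i + pages_per_chunk]
        for i in range(0, len(pages), pages_per_chunk)
    ]


# A 12-page document processed in 5-page segments.
print(chunk_by_pages(total_pages=12, pages_per_chunk=5))
# -> [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12]]
```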

Method 2: Rolling Window

  • rolling_window: An integer specifying the window size for segmentation.

This method creates overlapping segments of the specified size. For instance, if rolling_window is set to 3, the parser will process pages 1-3, then 2-4, then 3-5, and so on.
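To make the overlap concrete, here is a minimal Python sketch that lists the page ranges a rolling window would cover. The function name `rolling_windows` is hypothetical. The step of one page between windows is taken from the example above, and the handling of documents shorter than the window is an assumption.

```python
def rolling_windows(total_pages: int, window: int) -> list[list[int]]:
    """List overlapping windows of 1-based page numbers, advancing one page at a time."""
    if window < 1:
        raise ValueError("window must be a positive integer")
    if total_pages <= 0:
        return []
    # Assumption: a document shorter than the window yields a single segment.
    if total_pages <= window:
        return [list(range(1, total_pages + 1))]
    return [
        list(range(start, start + window))
        for start in range(1, total_pages - window + 2)
    ]


# A 5-page document with a window of 3 pages.
print(rolling_windows(total_pages=5, window=3))
# -> [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
```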

Method 3: Start and End Regex

  • start_regex: A regex pattern to define where parsing should begin.
  • end_regex: A regex pattern to define where parsing should end.
  • use_all_matches: A boolean that determines whether to use all matches or just the first.

This method allows you to define specific sections of a document for parsing based on regex patterns. It's particularly useful when you're interested in extracting information from specific parts of a document, ignoring headers, footers, or other irrelevant sections.

  • If start_regex is provided, the parser will start parsing from the first occurrence of this pattern.
  • If end_regex is provided, the parser will stop parsing when it encounters this pattern.
  • If use_all_matches is true, the parser will process all sections that match the start and end patterns. If false, it will only process the first matching section.
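The rules above can be approximated with Python's `re` module. This is a sketch under stated assumptions, not the product's actual behavior: the function `extract_sections` is hypothetical, it operates on plain text rather than OCR output, and it assumes each section includes the text matched by start_regex and excludes the text matched by end_regex.

```python
import re


def extract_sections(text, start_regex=None, end_regex=None, use_all_matches=False):
    """Return the section(s) of text delimited by the start and end patterns."""
    start_re = re.compile(start_regex) if start_regex else None
    end_re = re.compile(end_regex) if end_regex else None
    sections = []
    pos = 0
    while pos <= len(text):
        if start_re:
            start = start_re.search(text, pos)
            if start is None:
                break
            begin, search_from = start.start(), start.end()
        else:
            # No start pattern: parse from the beginning of the text.
            begin, search_from = pos, pos
        end = end_re.search(text, search_from) if end_re else None
        # No end pattern (or no match): parse to the end of the text.
        stop = end.start() if end else len(text)
        sections.append(text[begin:stop])
        if not use_all_matches or end is None or start_re is None:
            break
        # Always move forward so empty matches cannot loop forever.
        pos = max(end.end(), begin + 1)
    return sections


text = "Header\nBEGIN A\nline a\nEND\nnoise\nBEGIN B\nline b\nEND\nFooter"
print(extract_sections(text, r"BEGIN", r"END"))
# -> ['BEGIN A\nline a\n']
print(extract_sections(text, r"BEGIN", r"END", use_all_matches=True))
# -> ['BEGIN A\nline a\n', 'BEGIN B\nline b\n']
```

With use_all_matches left at false, only the first BEGIN/END pair is returned and the header, footer, and text between sections are ignored.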

Method 4: Split Regex

  • split_regex: A regex pattern for splitting the document into sections.
  • use_all_matches: Should always be true for this method.

This method is used when you want to divide a document into multiple sections for separate processing. It's useful when a document contains repeating structures, and you want to extract information from each instance separately.
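As an illustration of splitting on a repeating structure, the Python sketch below cuts a text at every match of a pattern. The helper `split_sections` and the invoice sample are hypothetical. The sketch assumes that each section starts at a match and runs to the next one, and that any text before the first match is dropped; the product may treat that text differently.

```python
import re


def split_sections(text: str, split_regex: str) -> list[str]:
    """Split text into sections, each beginning at a match of split_regex."""
    starts = [m.start() for m in re.finditer(split_regex, text)]
    if not starts:
        # No match: treat the whole text as a single section.
        return [text]
    bounds = starts + [len(text)]
    # Assumption: text before the first match is not part of any section.
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]


text = "Cover page\nInvoice #1\nTotal: 10\nInvoice #2\nTotal: 20\n"
for section in split_sections(text, r"Invoice #\d+"):
    print(repr(section))
# -> 'Invoice #1\nTotal: 10\n'
# -> 'Invoice #2\nTotal: 20\n'
```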

Important Notes

  1. Method 3 and Method 4 only work when use_ocr is set to true.
  2. The use_all_matches parameter determines how the parser handles multiple matches:
    • If false (default): The parser will only use the first match it finds.
    • If true: The parser will process all matches it finds in the document.
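These notes can be turned into a quick sanity check before submitting a configuration. The sketch below assumes the settings are held in a plain Python dict keyed by the parameter names used on this page. That dict shape and the helper `check_segmentation_config` are hypothetical and not part of any DocumentPro API.

```python
REGEX_KEYS = ("start_regex", "end_regex", "split_regex")


def check_segmentation_config(config: dict) -> list[str]:
    """Return a list of problems found in a segmentation config (empty if none)."""
    problems = []
    uses_regex = any(config.get(key) for key in REGEX_KEYS)
    # Note 1: the regex-based methods need OCR to be enabled.
    if uses_regex and not config.get("use_ocr", False):
        problems.append("use_ocr must be true when a regex method is used")
    # Method 4: use_all_matches should always be true with split_regex.
    if config.get("split_regex") and not config.get("use_all_matches", False):
        problems.append("use_all_matches should be true when split_regex is set")
    return problems


config = {"split_regex": r"Invoice #\d+", "use_ocr": False}
for problem in check_segmentation_config(config):
    print(problem)
```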

By using these segmentation methods, you can precisely control which parts of a document are parsed, allowing for more accurate and efficient information extraction, especially for longer or more complex documents.