Document Segmentation
Document segmentation is a crucial feature for processing long or complex documents. It allows the parser to break down the document into manageable chunks, which is helpful for documents longer than 10 pages. DocumentPro offers four methods for document segmentation:
Method 1: Chunk by Pages
- chunk_by_pages: An integer specifying how many pages to use in each segment.
This method divides the document into fixed-size chunks based on the number of pages. For example, if chunk_by_pages is set to 5, the parser will process the document in 5-page segments.
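The fixed-size chunking described here can be sketched in a few lines of Python. This is only an illustration of the idea, not DocumentPro's implementation: the function name, the list-of-pages input, and the choice to let the final segment be shorter when the page count is not a multiple of the chunk size are all assumptions.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

def chunk_by_pages(pages: Sequence[T], chunk_size: int) -> List[List[T]]:
    """Split pages into consecutive, non-overlapping segments.

    Hypothetical helper: the last segment may hold fewer pages
    (an assumption, the source does not say how leftovers are handled).
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    return [list(pages[i:i + chunk_size])
            for i in range(0, len(pages), chunk_size)]

# A 12-page document with a chunk size of 5 yields segments of 5, 5 and 2 pages.
segments = chunk_by_pages(list(range(1, 13)), 5)
print([len(s) for s in segments])  # [5, 5, 2]
```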
Method 2: Rolling Window
- rolling_window: An integer specifying the window size for segmentation.
This method creates overlapping segments of the specified size. For instance, if rolling_window is set to 3, the parser will process pages 1-3, then 2-4, then 3-5, and so on.
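To make the overlap concrete, here is a minimal Python sketch of a rolling window over page numbers. It assumes the window advances one page at a time, as in the example above. The function name and the handling of documents shorter than the window (returned as a single segment) are assumptions for illustration, not DocumentPro's documented behavior.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

def rolling_windows(pages: Sequence[T], window: int) -> List[List[T]]:
    """Return overlapping segments of `window` pages, advancing one page at a time.

    Hypothetical helper: a document shorter than the window is returned
    as one segment (an assumption).
    """
    if window < 1:
        raise ValueError("window must be a positive integer")
    if not pages:
        return []
    if len(pages) <= window:
        return [list(pages)]
    return [list(pages[i:i + window])
            for i in range(len(pages) - window + 1)]

# Five pages with a window of 3: pages 1-3, then 2-4, then 3-5.
print(rolling_windows([1, 2, 3, 4, 5], 3))
# [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
```

Because neighboring segments share pages, content that straddles a page boundary appears whole in at least one segment, at the cost of processing most pages more than once.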
Method 3: Start and End Regex
- start_regex: A regex pattern to define where parsing should begin.
- end_regex: A regex pattern to define where parsing should end.
- use_all_matches: A boolean that determines whether to use all matches or just the first.
This method allows you to define specific sections of a document for parsing based on regex patterns. It's particularly useful when you're interested in extracting information from specific parts of a document, ignoring headers, footers, or other irrelevant sections.
- If start_regex is provided, the AI will start parsing from the first occurrence of this pattern.
- If end_regex is provided, the AI will stop parsing when it encounters this pattern.
- If use_all_matches is true, the AI will process all sections that match the start and end patterns. If false, it will only process the first matching section.
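The three rules above can be approximated with Python's re module. This is a rough sketch under stated assumptions, not DocumentPro's actual logic: the function name is hypothetical, each section is assumed to include the text matched by start_regex and to exclude the text matched by end_regex, and a missing end pattern is assumed to mean "until the end of the document".

```python
import re
from typing import List, Optional

def extract_sections(text: str,
                     start_regex: Optional[str] = None,
                     end_regex: Optional[str] = None,
                     use_all_matches: bool = False) -> List[str]:
    """Return the section(s) of text delimited by the start and end patterns."""
    start_pat = re.compile(start_regex) if start_regex else None
    end_pat = re.compile(end_regex) if end_regex else None
    sections: List[str] = []
    pos = 0
    while pos <= len(text):
        if start_pat is not None:
            start = start_pat.search(text, pos)
            if start is None:
                break  # no further start match
            begin, search_from = start.start(), start.end()
        else:
            begin = search_from = pos  # no start pattern: begin at current position
        end = end_pat.search(text, search_from) if end_pat is not None else None
        stop = end.start() if end is not None else len(text)
        sections.append(text[begin:stop])
        # Stop after the first section unless all matches were requested.
        if not use_all_matches or start_pat is None or end is None:
            break
        pos = max(end.end(), begin + 1)  # always advance, even on empty matches
    return sections

doc = "Header\nBEGIN a END\nnoise\nBEGIN b END\nFooter"
print(extract_sections(doc, r"BEGIN", r"END"))
# ['BEGIN a ']
print(extract_sections(doc, r"BEGIN", r"END", use_all_matches=True))
# ['BEGIN a ', 'BEGIN b ']
```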
Method 4: Split Regex
- split_regex: A regex pattern for splitting the document into sections.
- use_all_matches: Should always be true for this method.
This method is used when you want to divide a document into multiple sections for separate processing. It's useful when a document contains repeating structures, and you want to extract information from each instance separately.
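As an illustration of splitting on a repeating structure, the Python sketch below starts a new section at every match of the pattern. The function name is hypothetical, and two behaviors are assumptions rather than documented DocumentPro semantics: the matched text stays at the top of its section, and any text before the first match is kept as its own section.

```python
import re
from typing import List

def split_sections(text: str, split_regex: str) -> List[str]:
    """Split text so that each match of split_regex begins a new section."""
    starts = [m.start() for m in re.finditer(split_regex, text)]
    if not starts:
        return [text]  # no match: the whole document is one section
    bounds = starts + [len(text)]
    sections = [text[a:b] for a, b in zip(bounds, bounds[1:])]
    if starts[0] > 0:
        sections.insert(0, text[:starts[0]])  # text before the first match
    # Drop sections that contain only whitespace.
    return [s for s in sections if s.strip()]

doc = "Invoice #1\nTotal: 10\nInvoice #2\nTotal: 20\n"
for section in split_sections(doc, r"Invoice #\d+"):
    print(repr(section))
# 'Invoice #1\nTotal: 10\n'
# 'Invoice #2\nTotal: 20\n'
```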
Important Notes
- For both Method 3 and Method 4 to work, use_ocr must be set to true.
- The use_all_matches parameter determines how the AI handles multiple matches:
  - If false (default): The AI will only use the first match it finds.
  - If true: The AI will process all matches it finds in the document.
By using these segmentation methods, you can precisely control which parts of a document are parsed, allowing for more accurate and efficient information extraction, especially for longer or more complex documents.