Module extract

Multi-format text extraction for binary documents (PDF, OOXML).

Conforms to FILE_SUPPORT.md. Extraction happens in the pipeline layer: connectors supply the raw bytes and a content type, and this module returns plain UTF-8 text.
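As a rough illustration of the bytes-plus-content-type contract, the sketch below maps a content type to one of the four supported formats. The constant values are the standard IANA media types for these formats; whether the module's own `MIME_*` constants hold exactly these strings is an assumption, and `Format` and `format_for` are hypothetical names that do not appear in this module. Ignoring parameters and ASCII case is also an assumption about how a caller might normalize the header.

```rust
/// Assumed values for the module's MIME constants (standard IANA media types).
pub const MIME_PDF: &str = "application/pdf";
pub const MIME_DOCX: &str =
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
pub const MIME_PPTX: &str =
    "application/vnd.openxmlformats-officedocument.presentationml.presentation";
pub const MIME_XLSX: &str =
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";

/// Hypothetical format tag; the real module may dispatch differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Pdf,
    Docx,
    Pptx,
    Xlsx,
}

/// Maps a content-type value to a supported format.
/// Parameters such as `; charset=...` and ASCII case are ignored.
pub fn format_for(content_type: &str) -> Option<Format> {
    // Keep only the part before the first ';' and trim whitespace.
    let essence = content_type.split(';').next().unwrap_or("").trim();
    if essence.eq_ignore_ascii_case(MIME_PDF) {
        Some(Format::Pdf)
    } else if essence.eq_ignore_ascii_case(MIME_DOCX) {
        Some(Format::Docx)
    } else if essence.eq_ignore_ascii_case(MIME_PPTX) {
        Some(Format::Pptx)
    } else if essence.eq_ignore_ascii_case(MIME_XLSX) {
        Some(Format::Xlsx)
    } else {
        None // unsupported type: the caller decides how to report it
    }
}
```

A caller holding a connector's bytes would first resolve the format this way and then hand the bytes to the matching extractor.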

Enums

ExtractError
Extraction error (spec §5.1: no panic; return error and pipeline skips item).
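The "return an error and let the pipeline skip the item" behaviour can be sketched as follows. The variants of `ExtractError` are not listed on this page, so the two shown here are placeholders, and `run_pipeline` is a hypothetical driver rather than part of this module. The extractor is passed in as a closure so the sketch does not assume the real signature of `extract_text`.

```rust
use std::fmt;

/// Placeholder error shape; the real `ExtractError` variants are not shown here.
#[derive(Debug, PartialEq, Eq)]
pub enum ExtractError {
    UnsupportedType(String),
    Malformed(String),
}

impl fmt::Display for ExtractError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ExtractError::UnsupportedType(t) => write!(f, "unsupported content type: {t}"),
            ExtractError::Malformed(why) => write!(f, "malformed document: {why}"),
        }
    }
}

impl std::error::Error for ExtractError {}

/// Hypothetical pipeline step: runs `extract` on each (bytes, content type)
/// pair, keeps successful results, and counts the items it skipped.
pub fn run_pipeline<F>(items: &[(&[u8], &str)], extract: F) -> (Vec<String>, usize)
where
    F: Fn(&[u8], &str) -> Result<String, ExtractError>,
{
    let mut texts = Vec::new();
    let mut skipped = 0;
    for &(bytes, content_type) in items {
        match extract(bytes, content_type) {
            Ok(text) => texts.push(text),
            // An error never aborts the batch; the item is simply skipped.
            // A real pipeline would probably log the error here.
            Err(_) => skipped += 1,
        }
    }
    (texts, skipped)
}
```

Returning `Result` instead of panicking keeps one bad document from taking down the whole batch, which appears to be the intent of spec §5.1.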

Constants

MAX_XML_ENTRY_BYTES 🔒
Maximum decompressed bytes to read from a single ZIP entry (zip-bomb protection).
MIME_DOCX
MIME_PDF
Supported MIME types for extraction (spec §1.1).
MIME_PPTX
MIME_XLSX
XLSX_MAX_CELLS_PER_SHEET 🔒
Maximum cells to process per sheet (avoids unbounded memory).
XLSX_MAX_SHEETS 🔒
Maximum sheets to process in an xlsx (spec §5.2: implementation MAY limit).
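A byte cap like `MAX_XML_ENTRY_BYTES` is commonly enforced with `std::io::Read::take`, which stops reading after a fixed number of decompressed bytes no matter what size the archive header declares. The sketch below applies that idea to any `Read` source. `read_bounded` is a hypothetical helper, and rejecting oversized input with an error (rather than truncating it) is an assumption; the page does not say which the real `read_zip_entry_bounded` does.

```rust
use std::io::{self, Read};

/// Reads everything from `reader` as long as it is at most `max_bytes` long.
/// Larger inputs produce an error instead of being buffered in full.
pub fn read_bounded<R: Read>(reader: R, max_bytes: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // Allow one extra byte so an over-limit source can be told apart
    // from one that is exactly `max_bytes` long.
    reader
        .take(max_bytes.saturating_add(1))
        .read_to_end(&mut buf)?;
    if buf.len() as u64 > max_bytes {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "entry exceeds decompressed size limit",
        ));
    }
    Ok(buf)
}
```

Because the limit applies to bytes actually read, even an endless source such as `std::io::repeat(0)` costs at most `max_bytes + 1` bytes of memory before the error is returned.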

Functions

extract_a_t_elements 🔒
extract_docx 🔒
extract_pdf 🔒
extract_pptx 🔒
extract_text
Extracts plain text from binary content. Returns UTF-8 string or error (spec §5, §6).
extract_w_t_elements 🔒
extract_xlsx 🔒
extract_xlsx_sheet_cells 🔒
The cell limit bounds parsing work (all cells considered), not only the text cells emitted.
list_worksheet_names 🔒
read_shared_strings 🔒
read_zip_entry_bounded 🔒
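The note on `extract_xlsx_sheet_cells` distinguishes between limiting the cells considered and limiting the text cells emitted. A minimal sketch of the first behaviour, using a hypothetical `Cell` type and `collect_text` helper (neither is part of this module), applies the cap before filtering:

```rust
/// Hypothetical cell model: a cell either carries text or holds something
/// else (number, formula, blank) that produces no text output.
pub enum Cell<'a> {
    Text(&'a str),
    Other,
}

/// Collects text from at most `max_cells` cells. The budget counts every
/// cell visited, so a sheet full of non-text cells still stops parsing.
pub fn collect_text<'a, I>(cells: I, max_cells: usize) -> Vec<&'a str>
where
    I: IntoIterator<Item = Cell<'a>>,
{
    cells
        .into_iter()
        .take(max_cells) // bound the work first
        .filter_map(|cell| match cell {
            Cell::Text(s) => Some(s), // then keep only text cells
            Cell::Other => None,
        })
        .collect()
}
```

With the cells `Other, Text("a"), Other, Text("b"), Text("c")` and a limit of 4, this yields `["a", "b"]`: the two non-text cells consume budget even though they emit nothing. Placing `take` after `filter_map` would instead cap emitted text cells and leave parsing work unbounded on sheets dominated by numeric cells.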