Multi-format text extraction for binary documents (PDF, OOXML).
Conforms to FILE_SUPPORT.md. Extraction is pipeline-layer: connectors supply bytes + content-type; this module returns plain UTF-8 text.
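The content-type routing described above might look roughly like the following. This is a minimal sketch, not the module's actual code: the constant names mirror the ones listed on this page, but their values are assumed to be the standard IANA media types, and `Format`, `SketchError`, and `detect_format` are hypothetical names introduced only for illustration.

```rust
// Assumed values: the standard IANA media types for each format.
const MIME_PDF: &str = "application/pdf";
const MIME_DOCX: &str =
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
const MIME_PPTX: &str =
    "application/vnd.openxmlformats-officedocument.presentationml.presentation";
const MIME_XLSX: &str =
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";

/// Hypothetical format tag; the real module may dispatch differently.
#[derive(Debug, PartialEq)]
enum Format {
    Pdf,
    Docx,
    Pptx,
    Xlsx,
}

/// Hypothetical stand-in for the module's error type.
#[derive(Debug, PartialEq)]
enum SketchError {
    UnsupportedType(String),
}

/// Maps a connector-supplied content-type to a format, returning an error
/// (never panicking) so the caller can skip the item.
fn detect_format(content_type: &str) -> Result<Format, SketchError> {
    // Drop parameters such as "; charset=binary" and normalize case.
    let essence = content_type
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();
    match essence.as_str() {
        MIME_PDF => Ok(Format::Pdf),
        MIME_DOCX => Ok(Format::Docx),
        MIME_PPTX => Ok(Format::Pptx),
        MIME_XLSX => Ok(Format::Xlsx),
        _ => Err(SketchError::UnsupportedType(essence)),
    }
}
```

A pipeline stage could call `detect_format` first and skip the item on `Err`, which matches the "return error and pipeline skips item" behavior the page describes.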
Enums
- ExtractError - Extraction error (spec §5.1: no panic; return error and pipeline skips item).
Constants
- MAX_XML_ENTRY_BYTES 🔒 - Maximum decompressed bytes to read from a single ZIP entry (zip-bomb protection).
- MIME_DOCX
- MIME_PDF - Supported MIME types for extraction (spec §1.1).
- MIME_PPTX
- MIME_XLSX
- XLSX_MAX_CELLS_PER_SHEET 🔒 - Maximum cells to process per sheet (avoids unbounded memory).
- XLSX_MAX_SHEETS 🔒 - Maximum sheets to process in an xlsx (spec §5.2: implementation MAY limit).
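As an illustration of the zip-bomb limit, a bounded read can be built on `std::io::Read::take`. This is a sketch under assumptions: `read_bounded` is a hypothetical helper, not the module's `read_zip_entry_bounded`, and it rejects oversized input, whereas the real function may truncate instead.

```rust
use std::io::{self, Read};

/// Reads at most `max_bytes` from `reader`, returning an error if the
/// stream holds more. Hypothetical helper; the limit is a parameter here
/// rather than the module's constant.
fn read_bounded<R: Read>(reader: R, max_bytes: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // Allow one byte past the limit so an oversized stream is detectable
    // without ever buffering more than max_bytes + 1 bytes.
    let read = reader
        .take(max_bytes.saturating_add(1))
        .read_to_end(&mut buf)?;
    if read as u64 > max_bytes {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "entry exceeds decompressed size limit",
        ));
    }
    Ok(buf)
}
```

Because the cap applies to bytes actually read from the decompressing reader, it holds even when the size declared in the archive header is wrong or forged.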
Functions
- extract_a_t_elements 🔒
- extract_docx 🔒
- extract_pdf 🔒
- extract_pptx 🔒
- extract_text - Extracts plain text from binary content. Returns UTF-8 string or error (spec §5, §6).
- extract_w_t_elements 🔒
- extract_xlsx 🔒
- extract_xlsx_sheet_cells 🔒 - Limit bounds parsing work (cells considered), not only text cells emitted.
- list_worksheet_names 🔒
- read_shared_strings 🔒
- read_zip_entry_bounded 🔒
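The note on the sheet-cell function, that the limit bounds cells considered rather than text cells emitted, can be shown with a small sketch. It assumes a simplified model in which each cell is an `Option<&str>` (`None` for a non-text cell); `collect_text_cells` is a hypothetical name, and the real parser works on worksheet XML instead.

```rust
/// Collects text from at most `max_cells` cells. The cap is applied before
/// filtering, so non-text cells count against it too.
fn collect_text_cells<'a, I>(cells: I, max_cells: usize) -> Vec<&'a str>
where
    I: IntoIterator<Item = Option<&'a str>>,
{
    cells
        .into_iter()
        .take(max_cells) // bound on cells considered
        .flatten() // keep only cells that carry text
        .collect()
}
```

Swapping the order (filter first, then `take`) would cap the output but still let a sheet with millions of empty or numeric cells drive unbounded parsing work, which is the situation the note rules out.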