Context Harness has always ingested plain text from the filesystem — Markdown, .txt, .rs, whatever you put in include_globs. Now the same connector can extract and index PDF, Word (.docx), PowerPoint (.pptx), and Excel (.xlsx). Add those extensions to your globs, run ctx sync, and the extracted text is chunked and searchable like everything else. No separate “enable binary” switch; if the file matches your globs and has a supported extension, it’s in.


What’s supported

| Format | Extension | What gets extracted |
| --- | --- | --- |
| PDF | .pdf | Text via pdf-extract (searchable PDFs) |
| Word | .docx | Text from word/document.xml (OOXML) |
| PowerPoint | .pptx | Text from slide XML |
| Excel | .xlsx | Cell text (shared strings + sheet order) |

Plain text (.md, .txt, .rs, etc.) is unchanged: it is still read as UTF-8 and indexed directly. For the four binary types above, the connector reads the file as raw bytes, the pipeline runs the matching extractor, and the result is stored as the document body with the original content type (e.g. application/pdf) so you can filter or display it correctly.
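To make the text-versus-binary split concrete, here is a minimal Python sketch of extension-based routing. It is an illustration only, not Context Harness's own code: the function name `classify`, the table name `BINARY_CONTENT_TYPES`, and the case-insensitive extension match are assumptions. Only application/pdf is named in this post; the three Office values are the standard registered media types for those formats.

```python
from pathlib import Path
from typing import Optional, Tuple

# Hypothetical lookup table. Only "application/pdf" appears in the post;
# the Office entries are the standard registered OOXML media types.
BINARY_CONTENT_TYPES = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def classify(path: str) -> Tuple[str, Optional[str]]:
    """Decide how a file would be read, based on its extension alone.

    Returns ("binary", content_type) for the four extracted formats and
    ("text", None) for everything else, which is read as UTF-8.
    """
    # Assumption: matching ignores case, so "Deck.PPTX" counts as .pptx.
    ext = Path(path).suffix.lower()
    content_type = BINARY_CONTENT_TYPES.get(ext)
    if content_type is not None:
        return ("binary", content_type)
    return ("text", None)
```

With this sketch, `classify("docs/report.pdf")` routes the file to the binary path with its content type attached, while `classify("notes/readme.md")` stays on the plain-text path.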


How to use it

Add the extensions you care about to include_globs:

[connectors.filesystem.docs]
root = "./docs"
include_globs = ["**/*.md", "**/*.txt", "**/*.pdf", "**/*.docx", "**/*.pptx", "**/*.xlsx"]

Then sync as usual:

$ ctx sync filesystem:docs
sync filesystem:docs
  fetched: 42 items
  upserted documents: 42
  chunks written: 198
  extraction skipped: 0
ok

Extraction is inferred from the file extension; no extra config flag is required. If a file matches a glob and has a supported binary extension, it’s read as bytes and passed to the extraction pipeline. Files that fail extraction, such as corrupt or password-protected PDFs, are skipped and counted on the extraction skipped line of the sync output; the rest of the sync still succeeds.
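The skip-and-continue behaviour can be sketched in a few lines of Python, using .docx as the example because it needs only the standard library. This is a hedged illustration, not the project's extractor: `extract_docx_text` and `sync_docx` are hypothetical names, and the real pipeline may treat paragraphs, tables, and whitespace differently. The one detail taken from the post is that Word text comes from word/document.xml.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of WordprocessingML elements inside word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(data: bytes) -> str:
    """Return the text of a .docx, one line per non-empty paragraph."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across <w:t> runs; join them back.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


def sync_docx(files: dict) -> tuple:
    """Extract each file; count failures instead of aborting the run.

    `files` maps a file name to its raw bytes. Returns (bodies, skipped),
    where `skipped` plays the role of the "extraction skipped" counter.
    """
    bodies, skipped = {}, 0
    for name, data in files.items():
        try:
            bodies[name] = extract_docx_text(data)
        except (zipfile.BadZipFile, KeyError, ET.ParseError):
            # Not a zip, missing word/document.xml, or malformed XML.
            skipped += 1
    return bodies, skipped
```

A file that is not a valid archive raises `zipfile.BadZipFile` here, increments the counter, and the loop moves on to the next file, which is the same shape as a sync that reports a non-zero extraction skipped count and still ends with ok.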


Size limit and config

Very large files can be skipped so they don’t exhaust memory or CPU. The optional max_extract_bytes setting (default 50 MB) caps the size of files that are extracted; anything larger is skipped and added to the extraction skipped count.

[connectors.filesystem.docs]
root = "./docs"
include_globs = ["**/*.md", "**/*.pdf"]
max_extract_bytes = 50_000_000   # optional; default 50MB

That’s the only extra knob. No per-format flags; the same rule applies to all four binary types.
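The size rule amounts to a single comparison. The Python sketch below shows one way to read it, assuming the limit is checked against the file's size in bytes and that a file exactly at the limit is still extracted (the post only says that anything larger is skipped). `should_extract` and `plan_extraction` are hypothetical names for illustration.

```python
# The documented default for max_extract_bytes: 50 MB.
DEFAULT_MAX_EXTRACT_BYTES = 50_000_000


def should_extract(size_bytes: int,
                   max_extract_bytes: int = DEFAULT_MAX_EXTRACT_BYTES) -> bool:
    """True when the file is within the cap; larger files are skipped."""
    return size_bytes <= max_extract_bytes


def plan_extraction(sizes: dict,
                    max_extract_bytes: int = DEFAULT_MAX_EXTRACT_BYTES) -> tuple:
    """Split files into those to extract and a count of oversized ones.

    `sizes` maps a file name to its size in bytes. The returned count is
    what would be added to the "extraction skipped" total.
    """
    to_extract = [name for name, size in sizes.items()
                  if should_extract(size, max_extract_bytes)]
    skipped = len(sizes) - len(to_extract)
    return to_extract, skipped
```

For example, with the default cap, a 1 KB PDF is extracted and an 80 MB PDF is counted as skipped; lowering `max_extract_bytes` tightens the same rule for every binary format at once.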


Under the hood


Docs and upgrading

For the full table of supported formats and a minimal example, see the built-in connectors doc. The configuration reference covers max_extract_bytes and other filesystem options; the quick-start shows a sample config. Just ensure your include_globs list the extensions you want, and you’re set.

If you’ve been waiting to point Context Harness at a folder of PDFs and Office docs and search across them, this release is it. Sync, search, and use the same MCP and agent workflows you already have — the new formats slot right in.