Paragraph-boundary text chunker.
Splits document body text into Chunks that respect a configurable
max_tokens limit. Splitting occurs on paragraph boundaries (\n\n)
to preserve semantic coherence within each chunk.
Each chunk receives a deterministic UUID derived from its document ID and index, plus a SHA-256 hash of its text for staleness detection in the embedding pipeline.
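The ID and hash derivation can be sketched as follows. This is an illustration only: the real make_chunk derives a UUID from the document ID and chunk index and a SHA-256 digest of the text, but since neither UUID generation nor SHA-256 is in the Rust standard library, std's DefaultHasher stands in for both here.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for the UUID derivation: the same (doc_id, index) pair
// always yields the same id, so re-chunking a document is stable.
fn chunk_id(doc_id: &str, index: usize) -> u64 {
    let mut h = DefaultHasher::new();
    (doc_id, index).hash(&mut h);
    h.finish()
}

// Stand-in for the SHA-256 content hash: changes whenever the chunk
// text changes, which is what the embedding pipeline checks to decide
// whether a stored embedding is stale.
fn content_hash(text: &str) -> u64 {
    let mut h = DefaultHasher::new();
    text.hash(&mut h);
    h.finish()
}
```

Determinism is the point: because the id depends only on (doc_id, index), re-ingesting an unchanged document produces identical ids, and the content hash alone signals which chunks need re-embedding.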
§Algorithm
- Convert max_tokens to max_chars using a 4 chars/token ratio.
- Split text on \n\n paragraph boundaries.
- Accumulate paragraphs into a buffer until adding the next paragraph would exceed max_chars.
- When exceeded, flush the buffer as a chunk and start a new one.
- If a single paragraph exceeds max_chars, perform a hard split at the nearest newline or space boundary.
- Guarantee at least one chunk per document (even for empty text).
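The accumulation steps above can be sketched as a free function. chunk_paragraphs is a hypothetical name for illustration; the real chunk_text additionally builds Chunk values with IDs and hashes, and this sketch omits the hard-split of oversized paragraphs.

```rust
// Greedy paragraph accumulation: pack paragraphs into a buffer and
// flush it as a chunk just before it would exceed max_chars.
fn chunk_paragraphs(text: &str, max_chars: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut buf = String::new();
    for para in text.split("\n\n") {
        // +2 accounts for the "\n\n" separator re-inserted below.
        let projected = buf.len() + 2 + para.len();
        if !buf.is_empty() && projected > max_chars {
            chunks.push(std::mem::take(&mut buf));
        }
        if !buf.is_empty() {
            buf.push_str("\n\n");
        }
        buf.push_str(para);
    }
    // Guarantee at least one chunk per document, even for empty text.
    if !buf.is_empty() || chunks.is_empty() {
        chunks.push(buf);
    }
    chunks
}
```

Flushing before appending (rather than after) keeps each emitted chunk at or under the limit whenever the individual paragraphs themselves fit.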
§Example
use context_harness_core::chunk::chunk_text;

let chunks = chunk_text("doc-123", "Hello world.\n\nSecond paragraph.", 700);
assert_eq!(chunks.len(), 1);
assert_eq!(chunks[0].chunk_index, 0);

Constants§
- CHARS_PER_TOKEN 🔒 - Approximate characters-per-token ratio.
Functions§
- chunk_text - Split text into chunks on paragraph boundaries, respecting max_tokens.
- make_chunk 🔒 - Create a single Chunk with a UUID and SHA-256 content hash.
- snap_to_char_boundary 🔒 - Snap a byte index back to the nearest valid UTF-8 char boundary.
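The boundary-snapping helper can be sketched in plain std Rust. This assumes the private function simply walks backwards until str::is_char_boundary holds, which is one straightforward way to meet the description above.

```rust
// Snap a byte index back to the nearest valid UTF-8 char boundary,
// so a hard split never lands in the middle of a multi-byte char.
fn snap_to_char_boundary(s: &str, mut idx: usize) -> usize {
    if idx >= s.len() {
        return s.len(); // clamp past-the-end indices
    }
    while !s.is_char_boundary(idx) {
        idx -= 1; // never underflows: byte index 0 is always a boundary
    }
    idx
}
```

This matters for the hard-split step of the algorithm: max_chars is a byte budget, and slicing a &str at an arbitrary byte offset panics unless the offset is first snapped to a char boundary.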