Paragraph-boundary text chunker.
Splits document body text into Chunks that respect a configurable
max_tokens limit. Splitting occurs on paragraph boundaries (\n\n)
to preserve semantic coherence within each chunk.
Each chunk receives a deterministic UUID derived from its document ID and index, plus a SHA-256 hash of its text for staleness detection in the embedding pipeline.
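The ID and hash derivation can be sketched as follows. This is an illustration only: the real make_chunk derives a UUID from the document ID and chunk index and a SHA-256 digest of the text, but since neither UUID generation nor SHA-256 is in the Rust standard library, std's DefaultHasher stands in for both here.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for the UUID derivation: the same (doc_id, index) pair
// always yields the same id, so re-chunking a document is stable.
fn chunk_id(doc_id: &str, index: usize) -> u64 {
    let mut h = DefaultHasher::new();
    (doc_id, index).hash(&mut h);
    h.finish()
}

// Stand-in for the SHA-256 content hash: changes whenever the chunk
// text changes, which is what the embedding pipeline checks to decide
// whether a stored embedding is stale.
fn content_hash(text: &str) -> u64 {
    let mut h = DefaultHasher::new();
    text.hash(&mut h);
    h.finish()
}
```

Determinism is the point: because the id depends only on (doc_id, index), re-ingesting an unchanged document produces identical ids, and the content hash alone signals which chunks need re-embedding.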
§Algorithm
- Convert max_tokens to max_chars using a 4 chars/token ratio.
- Split text on \n\n paragraph boundaries.
- Accumulate paragraphs into a buffer until adding the next paragraph would exceed max_chars.
- When exceeded, flush the buffer as a chunk and start a new one.
- If a single paragraph exceeds max_chars, perform a hard split at the nearest newline or space boundary.
- Guarantee at least one chunk per document (even for empty text).
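The accumulation steps above can be sketched as a free function. chunk_paragraphs is a hypothetical name for illustration; the real chunk_text additionally builds Chunk values with IDs and hashes, and this sketch omits the hard-split of oversized paragraphs.

```rust
// Greedy paragraph accumulation: pack paragraphs into a buffer and
// flush it as a chunk just before it would exceed max_chars.
fn chunk_paragraphs(text: &str, max_chars: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut buf = String::new();
    for para in text.split("\n\n") {
        // +2 accounts for the "\n\n" separator re-inserted below.
        let projected = buf.len() + 2 + para.len();
        if !buf.is_empty() && projected > max_chars {
            chunks.push(std::mem::take(&mut buf));
        }
        if !buf.is_empty() {
            buf.push_str("\n\n");
        }
        buf.push_str(para);
    }
    // Guarantee at least one chunk per document, even for empty text.
    if !buf.is_empty() || chunks.is_empty() {
        chunks.push(buf);
    }
    chunks
}
```

Flushing before appending (rather than after) keeps each emitted chunk at or under the limit whenever the individual paragraphs themselves fit.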
§Example
use context_harness_core::chunk::chunk_text;

let chunks = chunk_text("doc-123", "Hello world.\n\nSecond paragraph.", 700);
assert_eq!(chunks.len(), 1);
assert_eq!(chunks[0].chunk_index, 0);

Constants§
- CHARS_PER_TOKEN 🔒 - Approximate characters-per-token ratio.
Functions§
- chunk_text - Split text into chunks on paragraph boundaries, respecting max_tokens.
- make_chunk 🔒 - Create a single Chunk with a UUID and SHA-256 content hash.
- snap_to_char_boundary 🔒 - Snap a byte index back to the nearest valid UTF-8 char boundary.
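The boundary-snapping helper can be sketched in plain std Rust. This assumes the private function simply walks backwards until str::is_char_boundary holds, which is one straightforward way to meet the description above.

```rust
// Snap a byte index back to the nearest valid UTF-8 char boundary,
// so a hard split never lands in the middle of a multi-byte char.
fn snap_to_char_boundary(s: &str, mut idx: usize) -> usize {
    if idx >= s.len() {
        return s.len(); // clamp past-the-end indices
    }
    while !s.is_char_boundary(idx) {
        idx -= 1; // never underflows: byte index 0 is always a boundary
    }
    idx
}
```

This matters for the hard-split step of the algorithm: max_chars is a byte budget, and slicing a &str at an arbitrary byte offset panics unless the offset is first snapped to a char boundary.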