Module chunk


Paragraph-boundary text chunker.

Splits document body text into Chunks that respect a configurable max_tokens limit. Splitting occurs on paragraph boundaries (\n\n) to preserve semantic coherence within each chunk.

Each chunk receives a deterministic UUID derived from its document ID and index, plus a SHA-256 hash of its text for staleness detection in the embedding pipeline.

§Algorithm

  1. Convert max_tokens to max_chars using a 4 chars/token ratio.
  2. Split text on \n\n paragraph boundaries.
  3. Accumulate paragraphs into a buffer until adding the next paragraph would exceed max_chars.
  4. When exceeded, flush the buffer as a chunk and start a new one.
  5. If a single paragraph exceeds max_chars, perform a hard split at the nearest newline or space boundary.
  6. Guarantee at least one chunk per document (even for empty text).
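Steps 1–4 and 6 above can be sketched with a simple accumulation loop over `\n\n`-separated paragraphs. This is an illustrative stdlib-only version, not the crate's actual implementation; the hard-split fallback of step 5 is omitted for brevity:

```rust
// Sketch of the paragraph-accumulation loop. Names are illustrative.
fn chunk_paragraphs(text: &str, max_tokens: usize) -> Vec<String> {
    let max_chars = max_tokens * 4; // step 1: 4 chars/token ratio
    let mut chunks = Vec::new();
    let mut buf = String::new();
    for para in text.split("\n\n") { // step 2: paragraph boundaries
        // step 3: would appending this paragraph (plus the "\n\n"
        // separator) push the buffer past max_chars?
        if !buf.is_empty() && buf.len() + 2 + para.len() > max_chars {
            chunks.push(std::mem::take(&mut buf)); // step 4: flush
        }
        if !buf.is_empty() {
            buf.push_str("\n\n");
        }
        buf.push_str(para);
    }
    if !buf.is_empty() || chunks.is_empty() {
        chunks.push(buf); // step 6: at least one chunk, even if empty
    }
    chunks
}
```

Flushing only at paragraph boundaries means a chunk can come in under `max_chars` but never splits mid-paragraph, which is the semantic-coherence trade-off this module makes.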

§Example

use context_harness_core::chunk::chunk_text;

let chunks = chunk_text("doc-123", "Hello world.\n\nSecond paragraph.", 700);
assert_eq!(chunks.len(), 1);
assert_eq!(chunks[0].chunk_index, 0);

Constants§

CHARS_PER_TOKEN 🔒
Approximate characters-per-token ratio.

Functions§

chunk_text
Split text into chunks on paragraph boundaries, respecting max_tokens.
make_chunk 🔒
Create a single Chunk with a UUID and SHA-256 content hash.
snap_to_char_boundary 🔒
Snap a byte index back to the nearest valid UTF-8 char boundary.
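Boundary snapping matters because the hard split in step 5 works in bytes, and a byte index computed from `max_chars` can land inside a multi-byte UTF-8 character. A minimal stdlib-only sketch of such a helper (the crate's private version may differ):

```rust
// Illustrative sketch: walk a byte index back until it sits on a
// valid UTF-8 char boundary, so slicing with it cannot panic.
fn snap_to_char_boundary(s: &str, mut idx: usize) -> usize {
    if idx >= s.len() {
        return s.len(); // the end of the string is always a boundary
    }
    while !s.is_char_boundary(idx) {
        idx -= 1; // at most 3 steps back, since UTF-8 chars are <= 4 bytes
    }
    idx
}
```

For example, byte index 2 in `"héllo"` falls inside the two-byte `é`, so it snaps back to 1, where the character starts.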