chunk_text

context_harness::chunk

Function chunk_text

pub fn chunk_text(
    document_id: &str,
    text: &str,
    max_tokens: usize,
) -> Vec<Chunk>

Split text into chunks on paragraph boundaries, respecting max_tokens.

Returns chunks with contiguous indices starting at 0. Each chunk’s hash is the SHA-256 of its text content; it is used to detect stale embeddings.

§Arguments

document_id — The parent document’s UUID (used in chunk metadata).
text — The full document body to chunk.
max_tokens — Maximum tokens per chunk (converted to a character limit as max_tokens × 4).

§Guarantees

At least one chunk is always returned (even for empty text).
Chunk indices are contiguous: 0, 1, 2, …, N-1.
Chunks are split on \n\n boundaries when possible.
Oversized paragraphs are hard-split at space/newline boundaries.
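
§Example (illustrative sketch)

The guarantees above can be approximated with the standard library alone. The following is a minimal sketch, not the crate's actual implementation: SketchChunk, chunk_text_sketch, and hard_split are hypothetical names, the real Chunk struct's fields are not shown on this page, and the SHA-256 hash is omitted because it would need an external crate. The sketch also assumes the character limit counts Unicode scalar values and that whitespace inside an oversized paragraph may be collapsed to single spaces.

```rust
/// Hypothetical stand-in for the crate's `Chunk` (real fields are not documented here).
#[derive(Debug, Clone, PartialEq)]
pub struct SketchChunk {
    pub document_id: String,
    pub index: usize,
    pub text: String,
}

/// Assumed conversion factor: about 4 characters per token.
const CHARS_PER_TOKEN: usize = 4;

/// Split an oversized paragraph at whitespace so each piece fits `max_chars`.
/// Simplification: runs of whitespace (including newlines) become single spaces.
fn hard_split(para: &str, max_chars: usize) -> Vec<String> {
    let mut out = Vec::new();
    let mut current = String::new();
    let mut current_len = 0usize;

    for word in para.split_whitespace() {
        let word_len = word.chars().count();
        // Flush the current piece if this word would not fit after a space.
        if current_len > 0 && current_len + 1 + word_len > max_chars {
            out.push(std::mem::take(&mut current));
            current_len = 0;
        }
        if word_len > max_chars {
            // A single word longer than the limit is cut at character boundaries.
            let chars: Vec<char> = word.chars().collect();
            for piece in chars.chunks(max_chars) {
                out.push(piece.iter().collect());
            }
            continue;
        }
        if current_len > 0 {
            current.push(' ');
            current_len += 1;
        }
        current.push_str(word);
        current_len += word_len;
    }
    if !current.is_empty() {
        out.push(current);
    }
    out
}

/// Paragraph-first chunking: pack `\n\n`-separated paragraphs up to the limit,
/// hard-splitting any paragraph that is too long on its own.
pub fn chunk_text_sketch(document_id: &str, text: &str, max_tokens: usize) -> Vec<SketchChunk> {
    let max_chars = max_tokens.saturating_mul(CHARS_PER_TOKEN).max(1);
    let mut pieces: Vec<String> = Vec::new();
    let mut current = String::new();

    for para in text.split("\n\n").map(str::trim).filter(|p| !p.is_empty()) {
        let para_len = para.chars().count();
        if para_len > max_chars {
            if !current.is_empty() {
                pieces.push(std::mem::take(&mut current));
            }
            pieces.extend(hard_split(para, max_chars));
            continue;
        }
        // Account for the "\n\n" separator when appending to a non-empty chunk.
        let sep = if current.is_empty() { 0 } else { 2 };
        if current.chars().count() + sep + para_len > max_chars {
            pieces.push(std::mem::take(&mut current));
        }
        if !current.is_empty() {
            current.push_str("\n\n");
        }
        current.push_str(para);
    }
    if !current.is_empty() {
        pieces.push(current);
    }
    // Guarantee: at least one chunk, even for empty input.
    if pieces.is_empty() {
        pieces.push(String::new());
    }

    pieces
        .into_iter()
        .enumerate()
        .map(|(index, text)| SketchChunk {
            document_id: document_id.to_string(),
            index,
            text,
        })
        .collect()
}
```

In this sketch, chunk_text_sketch("doc", "aaaa\n\nbbbb", 1) yields two chunks with indices 0 and 1, because the 4-character limit cannot hold both paragraphs plus the separator. An empty string still yields a single empty chunk.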