chunk_text

Function chunk_text 

pub fn chunk_text(
    document_id: &str,
    text: &str,
    max_tokens: usize,
) -> Vec<Chunk>

Split text into chunks on paragraph boundaries, respecting max_tokens.

Returns chunks with contiguous indices starting at 0. Each chunk’s hash is the SHA-256 of its text content and is used to detect stale embeddings.

§Arguments

  • document_id — The parent document’s UUID (used in chunk metadata).
  • text — The full document body to chunk.
  • max_tokens — Maximum tokens per chunk (converted to a character limit as max_tokens × 4).

§Guarantees

  • At least one chunk is always returned (even for empty text).
  • Chunk indices are contiguous: 0, 1, 2, …, N-1.
  • Chunks are split on \n\n boundaries when possible.
  • Oversized paragraphs are hard-split at space/newline boundaries.
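
The guarantees above can be illustrated with a minimal sketch of the splitting logic. This is not the crate's implementation: `split_into_chunks` and `hard_split` are hypothetical names, the sketch returns plain `String`s instead of `Chunk` values, and it leaves out the document ID, indices, and SHA-256 hashing. It assumes only what the documentation states: a character budget of max_tokens × 4, splitting on \n\n where possible, and hard splits at spaces or newlines for oversized paragraphs.

```rust
/// Hypothetical helper: cut one oversized paragraph into pieces of at most
/// `max_chars` characters, preferring the last space or newline in each window.
fn hard_split(para: &str, max_chars: usize) -> Vec<&str> {
    let is_break = |c: char| c == ' ' || c == '\n';
    let mut pieces = Vec::new();
    let mut rest = para;
    while rest.chars().count() > max_chars {
        // Byte offset just past the first `max_chars` characters.
        let limit = rest
            .char_indices()
            .nth(max_chars)
            .map(|(i, _)| i)
            .unwrap_or(rest.len());
        // Prefer a whitespace boundary inside the window; otherwise cut at the limit.
        let cut = match rest[..limit].rfind(is_break) {
            Some(i) if i > 0 => i,
            _ => limit,
        };
        pieces.push(&rest[..cut]);
        rest = rest[cut..].trim_start_matches(is_break);
    }
    if !rest.is_empty() {
        pieces.push(rest);
    }
    pieces
}

/// Hypothetical sketch of paragraph-based chunking (assumed behavior only).
fn split_into_chunks(text: &str, max_tokens: usize) -> Vec<String> {
    // Assumption taken from the docs: one token is budgeted as 4 characters.
    // `.max(1)` guards against a zero budget, which would never make progress.
    let max_chars = max_tokens.saturating_mul(4).max(1);
    let mut chunks: Vec<String> = Vec::new();
    let mut current = String::new();

    for para in text.split("\n\n").filter(|p| !p.trim().is_empty()) {
        let para_len = para.chars().count();
        if para_len > max_chars {
            // Oversized paragraph: flush what we have, then hard-split it.
            if !current.is_empty() {
                chunks.push(std::mem::take(&mut current));
            }
            chunks.extend(hard_split(para, max_chars).into_iter().map(String::from));
            continue;
        }
        // Account for the "\n\n" separator when appending to a non-empty chunk.
        let sep = if current.is_empty() { 0 } else { 2 };
        if current.chars().count() + sep + para_len > max_chars {
            chunks.push(std::mem::take(&mut current));
        }
        if !current.is_empty() {
            current.push_str("\n\n");
        }
        current.push_str(para);
    }

    // At least one chunk is always returned, even for empty input.
    if !current.is_empty() || chunks.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

In this sketch a chunk's index is simply its position in the returned vector, so indices are contiguous from 0 by construction. The real function may pack paragraphs or trim whitespace differently.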