pub fn chunk_text(
document_id: &str,
text: &str,
max_tokens: usize,
) -> Vec<Chunk>
Split text into chunks on paragraph boundaries, respecting max_tokens.
Returns chunks with contiguous indices starting at 0. Each chunk’s
hash is the SHA-256 of its text content, used for embedding
staleness detection.
§Arguments
- document_id: The parent document’s UUID (used in chunk metadata).
- text: The full document body to chunk.
- max_tokens: Maximum tokens per chunk (converted to a character budget via max_tokens × 4).
§Guarantees
- At least one chunk is always returned (even for empty text).
- Chunk indices are contiguous: 0, 1, 2, …, N-1.
- Chunks are split on \n\n boundaries when possible.
- Oversized paragraphs are hard-split at space/newline boundaries.
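
The guarantees above can be illustrated with a minimal sketch. This is not the crate's actual implementation: sketch_chunks and hard_split are hypothetical names, the sketch returns plain strings instead of Chunk values (so indices are simply positions in the returned Vec), and it omits the document_id metadata and the SHA-256 hash. It also assumes that adjacent small paragraphs are packed into one chunk while they fit, which the documentation does not state. The only detail taken from the docs is the character budget of max_tokens × 4.

```rust
/// Hypothetical sketch of paragraph-based chunking; returns chunk texts only.
fn sketch_chunks(text: &str, max_tokens: usize) -> Vec<String> {
    // From the docs: the token limit is converted to characters via x 4.
    // The `.max(1)` guard (an assumption) avoids a zero-sized budget.
    let max_chars = max_tokens.saturating_mul(4).max(1);
    let mut chunks: Vec<String> = Vec::new();
    let mut current = String::new();

    for para in text.split("\n\n").filter(|p| !p.trim().is_empty()) {
        let para_len = para.chars().count();

        if para_len > max_chars {
            // Oversized paragraph: flush what we have, then hard-split it.
            if !current.is_empty() {
                chunks.push(std::mem::take(&mut current));
            }
            hard_split(para, max_chars, &mut chunks);
            continue;
        }

        // Start a new chunk if appending this paragraph would exceed the budget.
        let sep = if current.is_empty() { 0 } else { 2 };
        if !current.is_empty() && current.chars().count() + sep + para_len > max_chars {
            chunks.push(std::mem::take(&mut current));
        }
        if !current.is_empty() {
            current.push_str("\n\n");
        }
        current.push_str(para);
    }

    if !current.is_empty() {
        chunks.push(current);
    }
    // Guarantee: at least one chunk, even for empty text.
    if chunks.is_empty() {
        chunks.push(String::new());
    }
    chunks
}

/// Hypothetical helper: cut `para` into pieces of at most `max_chars`
/// characters, preferring the last space or newline inside each window.
fn hard_split(para: &str, max_chars: usize, out: &mut Vec<String>) {
    let is_break = |c: char| c == ' ' || c == '\n';
    let mut rest = para;
    while rest.chars().count() > max_chars {
        // Byte offset just past the first `max_chars` characters.
        let window_end = rest
            .char_indices()
            .nth(max_chars)
            .map(|(i, _)| i)
            .unwrap_or(rest.len());
        // Fall back to a cut at the window edge if no usable break exists.
        let cut = rest[..window_end]
            .rfind(is_break)
            .filter(|&i| i > 0)
            .unwrap_or(window_end);
        out.push(rest[..cut].to_string());
        rest = rest[cut..].trim_start_matches(is_break);
    }
    if !rest.is_empty() {
        out.push(rest.to_string());
    }
}
```

With max_tokens = 1 (a 4-character budget), the input "aaaa\n\nbbbb" yields two chunks split on the paragraph boundary, while the single oversized paragraph "one two three" is cut at spaces where possible and mid-word only when no space falls inside the window.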