context_harness/lib.rs
//! # Context Harness
//!
//! **A local-first context ingestion and retrieval framework for AI tools.**
//!
//! Context Harness provides a connector-driven pipeline for ingesting documents
//! from multiple sources (filesystem, Git repositories, S3 buckets, Lua scripts),
//! chunking and embedding them, and exposing hybrid search (keyword + semantic)
//! via a CLI and an MCP-compatible HTTP server.
//! ## Architecture
//!
//! ```text
//! ┌─────────────┐     ┌─────────────┐     ┌──────────┐
//! │ Connectors  │────▶│  Pipeline   │────▶│  SQLite  │
//! │  FS/Git/S3  │     │ Chunk+Embed │     │ FTS5+Vec │
//! └─────────────┘     └─────────────┘     └────┬─────┘
//!                                              │
//!                           ┌──────────────────┤
//!                           ▼                  ▼
//!                     ┌──────────┐       ┌──────────┐
//!                     │   CLI    │       │   HTTP   │
//!                     │  (ctx)   │       │  (MCP)   │
//!                     └──────────┘       └──────────┘
//! ```
//!
//! ## Data Flow
//!
//! 1. **Connectors** scan external sources and produce [`models::SourceItem`]s.
//! 2. The **ingestion pipeline** ([`ingest`]) normalizes items into [`models::Document`]s,
//!    computes deduplication hashes, and upserts them into SQLite.
//! 3. Documents are split into [`models::Chunk`]s by the paragraph-boundary
//!    chunker ([`chunk`]).
//! 4. Chunks are indexed in **FTS5** for keyword search and optionally
//!    embedded via the **embedding provider** ([`embedding`]) for vector search.
//! 5. The **query engine** ([`search`]) supports keyword, semantic, and hybrid
//!    retrieval with min-max normalized scoring.
//! 6. Results are exposed via the **CLI** (`ctx`) and the **MCP HTTP server** ([`server`]).
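//!
//! The paragraph-boundary chunking in step 3 can be sketched as follows. This
//! is an illustrative, self-contained version, not the actual [`chunk`]
//! implementation; the function name and the greedy `max_len` packing rule
//! are assumptions:
//!
//! ```rust
//! /// Split text on blank-line paragraph boundaries, greedily packing
//! /// paragraphs into chunks of at most `max_len` bytes (illustrative only).
//! fn chunk_paragraphs(text: &str, max_len: usize) -> Vec<String> {
//!     let mut chunks = Vec::new();
//!     let mut current = String::new();
//!     for para in text.split("\n\n").map(str::trim).filter(|p| !p.is_empty()) {
//!         // Flush the current chunk if adding this paragraph would overflow.
//!         if !current.is_empty() && current.len() + 2 + para.len() > max_len {
//!             chunks.push(std::mem::take(&mut current));
//!         }
//!         if !current.is_empty() {
//!             current.push_str("\n\n");
//!         }
//!         current.push_str(para);
//!     }
//!     if !current.is_empty() {
//!         chunks.push(current);
//!     }
//!     chunks
//! }
//! ```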
//!
//! ## Quick Start
//!
//! ```bash
//! ctx init              # create database
//! ctx sync all          # ingest all configured sources (parallel)
//! ctx sync git:platform # ingest a specific git connector
//! ctx embed pending     # generate embeddings
//! ctx search "deployment" --mode hybrid
//! ctx serve mcp         # start HTTP server
//! ```
//!
//! ## Connectors
//!
//! | Connector | Source | Module |
//! |-----------|--------|--------|
//! | Filesystem | Local directories | [`connector_fs`] |
//! | Git | Any Git repository (local or remote) | [`connector_git`] |
//! | S3 | Amazon S3 / S3-compatible buckets | [`connector_s3`] |
//! | Lua Script | Any source via custom Lua scripts | [`connector_script`] |
//!
//! ## Search Modes
//!
//! | Mode | Engine | Requires Embeddings |
//! |------|--------|---------------------|
//! | `keyword` | SQLite FTS5 (BM25) | No |
//! | `semantic` | Cosine similarity over vectors | Yes |
//! | `hybrid` | Weighted merge (configurable α) | Yes |
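//!
//! The hybrid mode can be sketched as min-max normalization of each result
//! list followed by a weighted blend. This is an illustrative stand-alone
//! version, not the actual [`search`] implementation; in particular, which
//! end of α weights the semantic score is an assumption here:
//!
//! ```rust
//! /// Min-max normalize scores into `[0, 1]` (illustrative only).
//! fn min_max_normalize(scores: &[f64]) -> Vec<f64> {
//!     let (min, max) = scores
//!         .iter()
//!         .fold((f64::INFINITY, f64::NEG_INFINITY), |(lo, hi), &s| {
//!             (lo.min(s), hi.max(s))
//!         });
//!     let range = max - min;
//!     scores
//!         .iter()
//!         .map(|&s| if range > 0.0 { (s - min) / range } else { 1.0 })
//!         .collect()
//! }
//!
//! /// Blend normalized keyword and semantic scores; here `alpha = 1.0`
//! /// means pure semantic and `alpha = 0.0` means pure keyword.
//! fn hybrid_score(keyword: f64, semantic: f64, alpha: f64) -> f64 {
//!     alpha * semantic + (1.0 - alpha) * keyword
//! }
//! ```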
//!
//! ## Modules
//!
//! | Module | Purpose |
//! |--------|---------|
//! | [`config`] | TOML configuration parsing and validation |
//! | [`models`] | Core data types: `SourceItem`, `Document`, `Chunk`, `SearchResult` |
//! | [`connector_fs`] | Filesystem connector: walk local directories |
//! | [`connector_git`] | Git connector: clone/pull repos with per-file metadata |
//! | [`connector_s3`] | S3 connector: list and download objects with SigV4 signing |
//! | [`connector_script`] | Lua scripted connectors: custom data sources via Lua 5.4 scripts |
//! | [`lua_runtime`] | Shared Lua 5.4 VM runtime: sandboxing, host APIs, value conversions |
//! | [`tool_script`] | Lua MCP tool extensions: load, validate, execute Lua tool scripts |
//! | [`traits`] | Extension traits: `Connector`, `Tool`, `ToolContext`, registries |
//! | [`agents`] | Agent system: `Agent` trait, `AgentPrompt`, `AgentRegistry`, `TomlAgent` |
//! | [`agent_script`] | Lua scripted agents: load, resolve, scaffold, test |
//! | [`chunk`] | Paragraph-boundary text chunker |
//! | [`embedding`] | Embedding provider trait, OpenAI implementation, vector utilities |
//! | [`embed_cmd`] | Embedding CLI commands: `pending` and `rebuild` |
//! | [`export`] | JSON export for static site search (`ctx export`) |
//! | [`stats`] | Database statistics: document, chunk, and embedding counts |
//! | [`ingest`] | Ingestion pipeline: connector → normalize → chunk → embed → store |
//! | [`search`] | Keyword, semantic, and hybrid search with score normalization |
//! | [`get`] | Document retrieval by UUID |
//! | [`sources`] | Connector health and status listing |
//! | [`server`] | MCP-compatible HTTP server (Axum) with CORS |
//! | [`db`] | SQLite connection pool with WAL mode |
//! | [`migrate`] | Database schema migrations (idempotent) |
//!
//! ## Configuration
//!
//! Context Harness is configured via a TOML file (default: `config/ctx.toml`).
//! See [`config`] for all available options and [`config::load_config`] for
//! validation rules.
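//!
//! A hypothetical `ctx.toml` sketch is shown below. Every table name, key, and
//! value here is an assumption for illustration only; see [`config`] for the
//! real schema and defaults:
//!
//! ```toml
//! # Illustrative sketch only: these keys are assumptions, not the real schema.
//! [database]
//! path = "ctx.db"
//!
//! [[sources]]
//! name = "docs"
//! type = "fs"
//! root = "./docs"
//!
//! [embedding]
//! provider = "openai"
//! model = "text-embedding-3-small"
//! ```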

pub mod agent_script;
pub mod agents;
pub mod chunk;
pub mod config;
pub mod connector_fs;
pub mod connector_git;
pub mod connector_s3;
pub mod connector_script;
pub mod db;
pub mod embed_cmd;
pub mod embedding;
pub mod export;
pub mod extract;
pub mod get;
pub mod ingest;
pub mod lua_runtime;
pub mod mcp;
pub mod migrate;
pub mod models;
pub mod progress;
pub mod registry;
pub mod search;
pub mod server;
pub mod sources;
pub mod sqlite_store;
pub mod stats;
pub mod tool_script;
pub mod traits;

pub use agents::{Agent, AgentPrompt, AgentRegistry, TomlAgent};
pub use context_harness_core::store;
pub use models::SourceItem;
pub use traits::{
    Connector, ConnectorRegistry, GetTool, SearchTool, SourcesTool, Tool, ToolContext, ToolRegistry,
};