context_harness/lib.rs
//! # Context Harness
//!
//! **A local-first context ingestion and retrieval framework for AI tools.**
//!
//! Context Harness provides a connector-driven pipeline for ingesting documents
//! from multiple sources (filesystem, Git repositories, S3 buckets, Lua scripts),
//! chunking and embedding them, and exposing hybrid search (keyword + semantic)
//! via a CLI and an MCP-compatible HTTP server.
//! ## Architecture
//!
//! ```text
//! ┌─────────────┐     ┌─────────────┐     ┌──────────┐
//! │ Connectors  │────▶│  Pipeline   │────▶│  SQLite  │
//! │  FS/Git/S3  │     │ Chunk+Embed │     │ FTS5+Vec │
//! └─────────────┘     └─────────────┘     └────┬─────┘
//!                                              │
//!                           ┌──────────────────┤
//!                           ▼                  ▼
//!                     ┌──────────┐       ┌──────────┐
//!                     │   CLI    │       │   HTTP   │
//!                     │  (ctx)   │       │  (MCP)   │
//!                     └──────────┘       └──────────┘
//! ```
//!
//! ## Data Flow
//!
//! 1. **Connectors** scan external sources and produce [`models::SourceItem`]s.
//! 2. The **ingestion pipeline** ([`ingest`]) normalizes items into [`models::Document`]s,
//!    computes deduplication hashes, and upserts them into SQLite.
//! 3. Documents are split into [`models::Chunk`]s by the paragraph-boundary
//!    chunker ([`chunk`]).
//! 4. Chunks are indexed in **FTS5** for keyword search and optionally
//!    embedded via the **embedding provider** ([`embedding`]) for vector search.
//! 5. The **query engine** ([`search`]) supports keyword, semantic, and hybrid
//!    retrieval with min-max normalized scoring.
//! 6. Results are exposed via the **CLI** (`ctx`) and the **MCP HTTP server** ([`server`]).
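//!
//! The paragraph-boundary chunking in step 3 can be sketched as follows. This
//! is an illustrative, self-contained version, not the actual [`chunk`]
//! implementation; the function name and the greedy `max_len` packing rule
//! are assumptions:
//!
//! ```rust
//! /// Split text on blank-line paragraph boundaries, greedily packing
//! /// paragraphs into chunks of at most `max_len` bytes (illustrative only).
//! fn chunk_paragraphs(text: &str, max_len: usize) -> Vec<String> {
//!     let mut chunks = Vec::new();
//!     let mut current = String::new();
//!     for para in text.split("\n\n").map(str::trim).filter(|p| !p.is_empty()) {
//!         // Flush the current chunk if adding this paragraph would overflow.
//!         if !current.is_empty() && current.len() + 2 + para.len() > max_len {
//!             chunks.push(std::mem::take(&mut current));
//!         }
//!         if !current.is_empty() {
//!             current.push_str("\n\n");
//!         }
//!         current.push_str(para);
//!     }
//!     if !current.is_empty() {
//!         chunks.push(current);
//!     }
//!     chunks
//! }
//! ```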
//!
//! ## Quick Start
//!
//! ```bash
//! ctx init              # create database
//! ctx sync all          # ingest all configured sources (parallel)
//! ctx sync git:platform # ingest a specific git connector
//! ctx embed pending     # generate embeddings
//! ctx search "deployment" --mode hybrid
//! ctx serve mcp         # start HTTP server
//! ```
//!
//! ## Connectors
//!
//! | Connector | Source | Module |
//! |-----------|--------|--------|
//! | Filesystem | Local directories | [`connector_fs`] |
//! | Git | Any Git repository (local or remote) | [`connector_git`] |
//! | S3 | Amazon S3 / S3-compatible buckets | [`connector_s3`] |
//! | Lua Script | Any source via custom Lua scripts | [`connector_script`] |
//!
//! ## Search Modes
//!
//! | Mode | Engine | Requires Embeddings |
//! |------|--------|---------------------|
//! | `keyword` | SQLite FTS5 (BM25) | No |
//! | `semantic` | Cosine similarity over vectors | Yes |
//! | `hybrid` | Weighted merge (configurable α) | Yes |
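//!
//! The hybrid mode can be sketched as min-max normalization of each result
//! list followed by a weighted blend. This is an illustrative stand-alone
//! version, not the actual [`search`] implementation; in particular, which
//! end of α weights the semantic score is an assumption here:
//!
//! ```rust
//! /// Min-max normalize scores into `[0, 1]` (illustrative only).
//! fn min_max_normalize(scores: &[f64]) -> Vec<f64> {
//!     let (min, max) = scores
//!         .iter()
//!         .fold((f64::INFINITY, f64::NEG_INFINITY), |(lo, hi), &s| {
//!             (lo.min(s), hi.max(s))
//!         });
//!     let range = max - min;
//!     scores
//!         .iter()
//!         .map(|&s| if range > 0.0 { (s - min) / range } else { 1.0 })
//!         .collect()
//! }
//!
//! /// Blend normalized keyword and semantic scores; here `alpha = 1.0`
//! /// means pure semantic and `alpha = 0.0` means pure keyword.
//! fn hybrid_score(keyword: f64, semantic: f64, alpha: f64) -> f64 {
//!     alpha * semantic + (1.0 - alpha) * keyword
//! }
//! ```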
//!
//! ## Modules
//!
//! | Module | Purpose |
//! |--------|---------|
//! | [`config`] | TOML configuration parsing and validation |
//! | [`models`] | Core data types: `SourceItem`, `Document`, `Chunk`, `SearchResult` |
//! | [`connector_fs`] | Filesystem connector: walk local directories |
//! | [`connector_git`] | Git connector: clone/pull repos with per-file metadata |
//! | [`connector_s3`] | S3 connector: list and download objects with SigV4 signing |
//! | [`connector_script`] | Lua scripted connectors: custom data sources via Lua 5.4 scripts |
//! | [`lua_runtime`] | Shared Lua 5.4 VM runtime: sandboxing, host APIs, value conversions |
//! | [`tool_script`] | Lua MCP tool extensions: load, validate, execute Lua tool scripts |
//! | [`traits`] | Extension traits: `Connector`, `Tool`, `ToolContext`, registries |
//! | [`agents`] | Agent system: `Agent` trait, `AgentPrompt`, `AgentRegistry`, `TomlAgent` |
//! | [`agent_script`] | Lua scripted agents: load, resolve, scaffold, test |
//! | [`chunk`] | Paragraph-boundary text chunker |
//! | [`embedding`] | Embedding provider trait, OpenAI implementation, vector utilities |
//! | [`embed_cmd`] | Embedding CLI commands: `pending` and `rebuild` |
//! | [`export`] | JSON export for static site search (`ctx export`) |
//! | [`stats`] | Database statistics: document, chunk, and embedding counts |
//! | [`ingest`] | Ingestion pipeline: connector → normalize → chunk → embed → store |
//! | [`search`] | Keyword, semantic, and hybrid search with score normalization |
//! | [`get`] | Document retrieval by UUID |
//! | [`sources`] | Connector health and status listing |
//! | [`server`] | MCP-compatible HTTP server (Axum) with CORS |
//! | [`db`] | SQLite connection pool with WAL mode |
//! | [`migrate`] | Database schema migrations (idempotent) |
//!
//! ## Configuration
//!
//! Context Harness is configured via a TOML file (default: `config/ctx.toml`).
//! See [`config`] for all available options and [`config::load_config`] for
//! validation rules.
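//!
//! A hypothetical `ctx.toml` sketch is shown below. Every table name, key, and
//! value here is an assumption for illustration only; see [`config`] for the
//! real schema and defaults:
//!
//! ```toml
//! # Illustrative sketch only: these keys are assumptions, not the real schema.
//! [database]
//! path = "ctx.db"
//!
//! [[sources]]
//! name = "docs"
//! type = "fs"
//! root = "./docs"
//!
//! [embedding]
//! provider = "openai"
//! model = "text-embedding-3-small"
//! ```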

pub mod agent_script;
pub mod agents;
pub mod chunk;
pub mod config;
pub mod connector_fs;
pub mod connector_git;
pub mod connector_s3;
pub mod connector_script;
pub mod db;
pub mod embed_cmd;
pub mod embedding;
pub mod export;
pub mod extract;
pub mod get;
pub mod ingest;
pub mod lua_runtime;
pub mod mcp;
pub mod migrate;
pub mod models;
pub mod progress;
pub mod registry;
pub mod search;
pub mod server;
pub mod sources;
pub mod sqlite_store;
pub mod stats;
pub mod tool_script;
pub mod traits;

pub use agents::{Agent, AgentPrompt, AgentRegistry, TomlAgent};
pub use context_harness_core::store;
pub use models::SourceItem;
pub use traits::{
    Connector, ConnectorRegistry, GetTool, SearchTool, SourcesTool, Tool, ToolContext, ToolRegistry,
};