CAMEODB
Engineering Blog

Insights & Updates

Technical deep-dives, release notes, and stories from the team building the next-generation hybrid database engine.

Featured Tutorial Dataset
Goran C. Apr 15, 2026 4 min read

From Zero to a Million Jokes: Loading Hugging Face Datasets into CameoDB

CameoDB is running and you've tested it with the books index from the Quickstart. Now let's load something bigger: one CLI command, one million Reddit jokes from Hugging Face, fully indexed and searchable in seconds.

Hugging Face CLI Data Loading

You've Got CameoDB Running. Now What?

If you followed the Quickstart, you already have CameoDB running and a books index loaded. That's your proof-of-concept. Now let's go from a handful of records to one million.

The dataset: SocialGrep/one-million-reddit-jokes on Hugging Face. A single CSV file with joke titles, body text, scores, subreddits, and timestamps. Perfect for demonstrating CameoDB's ability to detect schemas and ingest data directly from a URL.

Step 1: Detect the Schema

Point CameoDB's CLI at the raw CSV URL. The schema detect command reads the header row and samples data to infer field types automatically:

schema detect https://huggingface.co/datasets/SocialGrep/one-million-reddit-jokes/resolve/main/one-million-reddit-jokes.csv

CameoDB returns a full schema definition with 10 detected fields:

{
  "routing_field_name": "id",
  "fields": {
    "id":          { "field_type": "Text",    "indexed": true,  "stored": true,  "tokenizer": "raw" },
    "title":       { "field_type": "Text",    "indexed": true,  "stored": false },
    "selftext":    { "field_type": "Text",    "indexed": true,  "stored": false },
    "score":       { "field_type": "I64",     "indexed": true,  "fast": true  },
    "created_utc": { "field_type": "I64",     "indexed": true,  "fast": true  },
    "subreddit":   { "field_type": "Boolean", "indexed": true  },
    "type":        { "field_type": "Text",    "indexed": true  },
    "permalink":   { "field_type": "Text",    "indexed": true  },
    "domain":      { "field_type": "Text",    "indexed": true  },
    "url":         { "field_type": "Text",    "indexed": true  }
  }
}

Notice: score and created_utc are detected as I64 with fast fields enabled, meaning they support range queries and sorting. Text fields are fully indexed for search.
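
To make the idea of type inference from sampled rows concrete, here is a minimal Python sketch. It is not CameoDB's detection algorithm: the function names and the three inference rules are assumptions chosen for illustration, and only the type names (Text, I64, Boolean) come from the schema output above.

```python
def infer_field_type(samples):
    """Guess a field type from sampled string values (hypothetical rules)."""
    values = [s.strip() for s in samples if s.strip()]
    if not values:
        return "Text"  # nothing to go on, so fall back to the most general type
    if all(v.lower() in ("true", "false") for v in values):
        return "Boolean"
    try:
        for v in values:
            int(v)  # every sample must parse as an integer
        return "I64"
    except ValueError:
        return "Text"


def detect_schema(header, rows):
    """Map each column name in the header to an inferred type."""
    columns = list(zip(*rows)) if rows else [() for _ in header]
    return {name: infer_field_type(col) for name, col in zip(header, columns)}
```

A real detector would also need to pick tokenizers, the routing field, and the indexed/stored/fast flags, which this sketch leaves out.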

Step 2: Load One Million Records

Now the single command that does everything—creates the index, applies the schema, downloads the CSV, and streams all rows in batches:

data load jokes https://huggingface.co/datasets/SocialGrep/one-million-reddit-jokes/resolve/main/one-million-reddit-jokes.csv
Schema was missing; detected and applied schema to index 'jokes'
Ingestion complete for index 'jokes': loaded=1000000 failed=0 (batch size 4000)

That's it. One million records, zero failures. CameoDB auto-detected the schema on first contact and streamed the data in batches of 4,000. The jokes index didn't exist before this command—it was created on the fly.
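
Batched streaming is easy to picture with a small Python sketch. This is only an illustration of the technique, not CameoDB's loader: `batched_rows` is a hypothetical helper, and the only value taken from the output above is the batch size of 4,000. At that size, one million rows arrive as 250 batches.

```python
import csv
import io
from itertools import islice


def batched_rows(lines, batch_size=4000):
    """Yield lists of CSV rows (as dicts) holding at most batch_size rows each."""
    reader = csv.DictReader(lines)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# Usage: 10,000 synthetic rows are split into two full batches and one partial batch.
sample = io.StringIO("id,title\n" + "".join(f"{i},joke {i}\n" for i in range(10_000)))
sizes = [len(batch) for batch in batched_rows(sample)]
print(sizes)  # [4000, 4000, 2000]
```

Because the reader is consumed lazily, only one batch is held in memory at a time, which is what makes streaming a large CSV practical.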

Step 3: Search Instantly

The index is immediately queryable. Let's find football jokes:

search jokes title:football limit 5 return title, selftext
{
  "hits": [
    { "_score": 10.50, "title": "Football", "selftext": "[removed]" },
    { "_score": 10.41, "title": "Football", "selftext": "As a woman passed her daughter's closed bedroom door..." },
    { "_score": 9.91,  "title": "Fart Football", "selftext": "An old married couple no sooner hit the pillows..." },
    // ... 2 more results
  ],
  "hits_returned": 5,
  "total_hits": 1330,
  "took_ms": 11,
  "stats": { "shards": { "total": 4, "responded": 4, "failed": 0 } }
}

1,330 football jokes found across 4 shards in 11 milliseconds. The data was distributed automatically. No configuration, no manual sharding, no external tooling.
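
The response shape above (hits, hits_returned, total_hits, per-shard stats) suggests a scatter-gather search. As a rough sketch of the gather step, assuming each shard returns its own scored hits plus a match count, the merge could look like this in Python. The function name and the per-shard input format are assumptions, not CameoDB internals.

```python
import heapq


def merge_shard_hits(shard_results, limit):
    """Combine per-shard results into one global top-`limit` list by score."""
    all_hits = [hit for shard in shard_results for hit in shard["hits"]]
    top = heapq.nlargest(limit, all_hits, key=lambda hit: hit["_score"])
    return {
        "hits": top,
        "hits_returned": len(top),
        "total_hits": sum(shard["total_hits"] for shard in shard_results),
    }
```

Each shard only needs to return its own best `limit` hits, since no hit outside a shard's local top `limit` can appear in the global top `limit`.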

The Takeaway

Three commands. That's the entire workflow from discovering a dataset on Hugging Face to running full-text search queries against a million records. CameoDB handles schema inference, index creation, batch ingestion, and distributed search—all from the CLI.

MCP
Goran C. Apr 10, 2026 5 min read

Native MCP Server: Give AI Agents Direct Database Access

How CameoDB's built-in Model Context Protocol server enables Claude, Cursor, and Windsurf to query your data instantly.

MCP AI Agents

Built-In, Not Bolted-On

CameoDB ships with a native Model Context Protocol (MCP) server running in the same binary—no sidecars, no middleware. The moment you start CameoDB on port 9480, your data becomes queryable by AI agents like Claude Desktop, Cursor, and Windsurf through the /mcp/sse endpoint.

What's Exposed: 6 Read-Only Tools

The MCP server exposes six tools—all read-only for security. Agents can discover, query, and validate, but never modify your data:

search_index:     Full-text search on a single index with Tantivy query syntax.
search_indexes:   Federated search across multiple indexes with merged results.
list_indexes:     Lists all indexes with their schemas and queryable fields (discovery).
get_index:        Schema inspector with per-field operator hints and types.
validate_query:   Query linter with syntax validation and "did you mean" suggestions.
get_index_stats:  Document counts, index size, and cluster metadata.

Query Syntax: Tantivy-Powered

The search_index tool supports the full Tantivy query language:

// Field targeting
title:rust

// Phrases with proximity
body:"small bike"~2

// Boolean operators (UPPERCASE required)
title:rust AND author:doe
(title:rust OR title:go) AND year:[2020 TO 2024]

// Range queries
score:>=100
date:[2024-01-01 TO 2024-12-31]

// Boosting for relevance
title:rust^3 OR body:rust

// Set operations
status: IN [active pending review]

Anti-Hallucination Rule

Every search tool includes a critical instruction: "When answering questions based on CameoDB results, you MUST use ONLY the exact data returned by this tool. Do NOT combine database results with your own prior knowledge." This ensures agents provide factual, grounded responses based solely on your data.

Field Type Awareness

The MCP server provides per-field operator hints based on data types:

text:     all operators (phrases, slop, prefix, IN, boost, range)
string:   exact match, prefix, IN, exists (no phrases/slop)
numeric:  exact, comparisons (>, <), range, boost, exists
date:     exact, comparisons, range, exists
boolean:  true/false only, exists
json:     dot notation (field.sub:value), nested exists
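
One way to picture these hints is as a lookup table that a client or agent consults before building a query. The Python sketch below is an assumption about how such a table might be encoded, not the MCP server's actual data structure. The operator names are shorthand for the entries listed above, and the json type is left out because its hint describes path syntax rather than an operator set.

```python
# Hypothetical encoding of the per-field operator hints listed above.
OPERATOR_HINTS = {
    "string": {"exact", "prefix", "in", "exists"},
    "numeric": {"exact", "comparison", "range", "boost", "exists"},
    "date": {"exact", "comparison", "range", "exists"},
    "boolean": {"exact", "exists"},
}
# "text" is described as supporting all operators, including phrases and slop.
OPERATOR_HINTS["text"] = set().union(*OPERATOR_HINTS.values()) | {"phrase", "slop"}


def is_operator_allowed(field_type, operator):
    """Return True if the hint table lists `operator` for `field_type`."""
    return operator in OPERATOR_HINTS.get(field_type, set())
```

A check like `is_operator_allowed("string", "phrase")` returning False is the kind of signal that lets an agent avoid sending a phrase query against an untokenized field.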

Setup: Two Configuration Styles

The MCP server uses Server-Sent Events (SSE) transport. Configuration depends on your AI tool:

Windsurf & Cursor (Native SSE)

Add to .windsurf/mcp.json or Cursor MCP settings:

{
  "mcpServers": {
    "cameodb": {
      "url": "http://localhost:9480/mcp/sse",
      "transport": "sse"
    }
  }
}

Claude Desktop (Curl Bridge)

Claude currently requires a curl bridge for SSE transport:

{
  "mcpServers": {
    "cameodb": {
      "command": "curl",
      "args": [
        "-N",
        "-H", "Accept: text/event-stream",
        "http://localhost:9480/mcp/sse"
      ]
    }
  }
}

Restart your AI tool after configuration. The agent will automatically discover your indexes and schemas.
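
If you manage several machines or ports, the two configuration styles can be generated rather than hand-edited. The Python helper below is a hypothetical convenience, not part of CameoDB. It only reproduces the two JSON shapes shown above, with the host and port as parameters.

```python
import json


def mcp_config(style, host="localhost", port=9480):
    """Build the MCP client config for the 'sse' or 'curl' style shown above."""
    url = f"http://{host}:{port}/mcp/sse"
    if style == "sse":
        server = {"url": url, "transport": "sse"}
    elif style == "curl":
        server = {
            "command": "curl",
            "args": ["-N", "-H", "Accept: text/event-stream", url],
        }
    else:
        raise ValueError(f"unknown style: {style!r}")
    return {"mcpServers": {"cameodb": server}}


# Usage: render the native SSE variant as JSON text.
config_text = json.dumps(mcp_config("sse"), indent=2)
```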

Performance
Goran C. Apr 3, 2026 6 min read

Microsecond Latency: Benchmarking CameoDB's Zero-Copy Architecture

Deep dive into how Rust's ownership model and zero-copy serialization deliver sub-millisecond query performance at scale.

Rust Performance Architecture

The Zero-Copy Advantage

When you're processing millions of documents, every memory allocation matters. Traditional databases copy data through multiple layers (network buffers, serialization formats, intermediate structures), each adding latency and GC pressure. CameoDB takes a different approach: zero-copy ingestion.

Built with Rust 2024 Edition, CameoDB leverages the language's ownership model to pass data by reference without copying. When you stream a CSV from Hugging Face, parse JSON from a local file, or ingest NDJSON over HTTP, the bytes travel through a single buffer. No serde allocations. No intermediate clones. Just raw bytes flowing from source to storage.
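
CameoDB does this in Rust, where borrowed slices are checked by the compiler. The same idea can be illustrated in Python with `memoryview`, which exposes slices of a buffer without copying the underlying bytes. The sketch below is a conceptual analogue only, and `split_records_zero_copy` is a hypothetical name.

```python
def split_records_zero_copy(buffer):
    """Return one memoryview slice per newline-delimited record, copying no bytes."""
    view = memoryview(buffer)
    records = []
    start = 0
    while True:
        end = buffer.find(b"\n", start)
        if end == -1:
            break
        records.append(view[start:end])  # a window onto the buffer, not a copy
        start = end + 1
    if start < len(buffer):
        records.append(view[start:])  # trailing record without a final newline
    return records
```

Every returned slice points into the original buffer, so a change to the buffer is visible through the slice. That shared ownership is exactly what Rust's borrow checker makes safe at compile time.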

Hybrid Storage: Two Engines, One Pipeline

CameoDB's performance comes from its hybrid architecture, combining redb (ACID-compliant KV store) with Tantivy (full-text search engine). Each write operation follows an atomic sequence:

// 1. Generate sequence ID (AtomicU64, lock-free)
seq_id = counter.fetch_add(1)

// 2. Begin redb transaction (ACID isolation)
txn = kv.begin_write()

// 3. Write to WAL (durability before application)
wal.insert(seq_id, serialized_op)

// 4. Write to data table (complete JSON document)
data.insert(id, json_blob)

// 5. Update Tantivy index (in-memory buffer)
writer.add_document(id_only_doc)

// 6. Commit redb transaction (fsync if configured)
txn.commit()

// 7. Signal supervisor for smart commit
supervisor.reset_timer()

The key insight: Tantivy stores only indexed fields. The complete JSON document lives exclusively in redb. When you search, Tantivy returns matching document IDs, then we batch-fetch the full documents from redb. This split-storage strategy means smaller indices, faster searches, and zero data duplication.
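
The split-storage idea can be modelled in a few lines of Python. This is a toy under stated assumptions: a dict stands in for redb, a term-to-ID map stands in for Tantivy, there are no transactions or fsyncs, and the class and method names are invented for illustration.

```python
import itertools
import json


class HybridStore:
    """Toy model: full documents in a KV dict, only terms and IDs in the index."""

    def __init__(self, indexed_fields):
        self.indexed_fields = indexed_fields
        self.seq = itertools.count(1)  # stand-in for the atomic sequence counter
        self.wal = {}    # seq_id -> serialized operation
        self.data = {}   # doc id -> complete JSON document
        self.index = {}  # (field, term) -> set of doc ids

    def write(self, doc_id, doc):
        seq_id = next(self.seq)
        blob = json.dumps(doc)
        self.wal[seq_id] = ("put", doc_id, blob)  # log the operation first
        self.data[doc_id] = blob                  # full document lives in the KV store
        for field in self.indexed_fields:         # the index holds terms and IDs only
            for term in str(doc.get(field, "")).lower().split():
                self.index.setdefault((field, term), set()).add(doc_id)
        return seq_id

    def search(self, field, term):
        ids = sorted(self.index.get((field, term.lower()), set()))
        return [json.loads(self.data[i]) for i in ids]  # fetch full documents by ID
```

Note that `search` never reads document bodies from the index. It resolves IDs first and fetches the bodies from the KV side, which is the two-step lookup described above.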

Supervised Smart Commits: The 5-Second Guarantee

Committing to disk is expensive. Doing it on every write kills throughput. Skipping it risks data loss. CameoDB's Supervised Smart Commits thread the needle with an adaptive algorithm:

Smart Commits

Trigger when the operation count reaches an adaptive threshold (500 to 8,000 operations, depending on the memory budget), so write bursts are committed immediately.

Supervised Commits

After 5 seconds of write inactivity, a background supervisor commits. This guarantees durability for low-volume write patterns.

Every write signals a per-index supervisor, resetting a 5-second timer. If writes stop flowing, the supervisor fires an eventual commit. If writes keep coming, smart commits trigger at adaptive thresholds. Supervisors clean up after themselves once a commit succeeds, so there are no resource leaks and no lingering background tasks.
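
The interplay between the two triggers can be sketched as a small state machine. The Python class below is an illustration under assumptions, not CameoDB's supervisor: the threshold is fixed rather than adaptive, time is passed in explicitly so the logic is easy to test, and all names are hypothetical.

```python
class CommitPolicy:
    """Decide when to commit: on an operation-count threshold or after idle time."""

    def __init__(self, threshold=500, idle_seconds=5.0):
        self.threshold = threshold
        self.idle_seconds = idle_seconds
        self.pending = 0        # writes since the last commit
        self.last_write = None  # timestamp of the most recent write

    def on_write(self, now):
        """Record one write; return True if a smart commit should fire now."""
        self.pending += 1
        self.last_write = now  # every write resets the idle timer
        if self.pending >= self.threshold:
            self.pending = 0
            return True
        return False

    def on_tick(self, now):
        """Periodic check; return True if the supervised commit should fire."""
        idle = self.last_write is not None and now - self.last_write >= self.idle_seconds
        if self.pending > 0 and idle:
            self.pending = 0
            return True
        return False
```

With a threshold of 3, the third write in a burst triggers a smart commit, while a single stray write is picked up by `on_tick` once five seconds have passed without further writes.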

Tiered Cache Sizing: Fast Startup, Steady State

Opening a 2GB database with a 32MB cache is painful. CameoDB uses a two-phase initialization strategy:

// Phase 1: Init Boost (fast WAL recovery)
cache_size = database_tier * multiplier
// Small:  32MB  → 32MB   (1×)
// Medium: 128MB → 512MB  (4×)
// Large:  256MB → 2GB    (8×)

// Phase 2: Normal Operation (steady state)
cache_size = standard_tier
// Release init boost memory for multi-shard deployments

For a 2GB database on a node with 16GB RAM, CameoDB opens with 512MB cache for fast recovery, then drops to 256MB for steady operation. Per-shard memory is automatically divided across all active shards, so no manual tuning is required.
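
The tier arithmetic from the listing above can be written out as a small Python helper. The tier sizes and multipliers are taken from that listing. The function name, the tier keys, and the even split of the steady-state cache across shards are assumptions made for illustration.

```python
# Steady-state cache size in MB and init-boost multiplier, per the listing above.
TIERS_MB = {
    "small": (32, 1),
    "medium": (128, 4),
    "large": (256, 8),
}


def cache_plan(tier, active_shards=1):
    """Return init-boost and steady-state cache sizes (MB) for a tier."""
    steady_mb, multiplier = TIERS_MB[tier]
    return {
        "init_boost_mb": steady_mb * multiplier,
        "steady_mb": steady_mb,
        # Assumption: the steady-state budget is split evenly across shards.
        "steady_per_shard_mb": steady_mb // active_shards,
    }
```

For example, `cache_plan("large", active_shards=4)` gives a 2048MB init boost, a 256MB steady-state cache, and 64MB per shard under the even-split assumption.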

Benchmark Results: The Numbers

So what does this architecture deliver in practice? Here are real-world benchmarks from the storage engine:

Single operations:  0.5-3ms per operation (depends on fsync configuration)
Batch operations:   0.05-0.5ms per operation (10-60x faster than single ops)
Point queries:      ~0.1ms via redb B-tree (KV lookup by document ID)
Search queries:     10-100ms via Tantivy (depends on index size and query complexity)

Throughput scales dramatically with batching: 2,000-15,000 ops/sec individual versus 10,000-100,000 ops/sec batched. The difference comes from amortized commit overhead and reduced mutex contention.
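
Amortized commit overhead is simple arithmetic: if one commit is shared by a whole batch, each operation pays only its share. The Python sketch below uses made-up costs (0.05 ms per write, 1 ms per commit) purely to show the shape of the curve. These are not measured CameoDB figures.

```python
def per_op_ms(write_ms, commit_ms, batch_size):
    """Average cost per operation when one commit is shared by batch_size writes."""
    return write_ms + commit_ms / batch_size


# Hypothetical costs: 0.05 ms to apply a write, 1.0 ms to commit.
single = per_op_ms(0.05, 1.0, batch_size=1)     # every write pays a full commit
batched = per_op_ms(0.05, 1.0, batch_size=100)  # 100 writes share one commit
speedup = single / batched
```

With these illustrative numbers the per-operation cost falls from about 1.05 ms to about 0.06 ms, a speedup of roughly 17x. As the batch grows, the per-operation cost approaches the bare write cost.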

Async-Sync Isolation: Blocking Without Blocking

All redb and Tantivy I/O happens inside tokio::task::spawn_blocking. This is critical—storage operations are inherently blocking (disk I/O, B-tree traversals, index segment merges). By offloading to a dedicated blocking thread pool, CameoDB's async runtime stays responsive. No thread starvation. No latency spikes from blocking the event loop.
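
The same pattern exists outside Rust. As a rough Python analogue of `tokio::task::spawn_blocking`, `asyncio.to_thread` runs a blocking function on a worker thread while the event loop keeps serving other tasks. The sketch below is an illustration only: `blocking_lookup` is a hypothetical stand-in for a storage call, with a sleep simulating disk I/O.

```python
import asyncio
import time


def blocking_lookup(doc_id):
    """Stand-in for a blocking storage call (disk read, B-tree traversal)."""
    time.sleep(0.05)
    return {"id": doc_id}


async def handle_request(doc_id):
    # Run the blocking call on a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_lookup, doc_id)


async def main():
    # The four lookups overlap on worker threads instead of running back to back.
    return await asyncio.gather(*(handle_request(i) for i in range(4)))
```

Calling `asyncio.run(main())` returns the four results in request order. Had `blocking_lookup` been called directly inside the coroutine, each sleep would have stalled every other task on the loop.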

The Takeaway

Zero-copy ingestion, hybrid storage, supervised smart commits, tiered caching, and async-sync isolation are not isolated optimizations. They're an integrated architecture where each component amplifies the others. Rust's ownership model makes zero-copy safe. The hybrid split-storage strategy enables smaller indices. Smart commits reduce I/O while supervised commits guarantee durability. The result: sub-millisecond point lookups and batched writes at scale, without the complexity of manual tuning.