colvint/knowledge-compiler

Knowledge Compiler

SQLite-first knowledge compiler: immutable raw sources → canonical SQLite store → grounded LLM answers.

A riff on Andrej Karpathy's LLM Wiki pattern: instead of an agent-maintained markdown wiki alone, KC keeps a canonical SQLite store as source of truth and projects lean markdown views from it (current/ for browsing, audit/ for full ingest detail).

Setup

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
cp .env.example .env
# Set OPENAI_API_KEY or ANTHROPIC_API_KEY in .env

Quickstart

kc migrate
kc ingest --dir data/raw/
kc project
kc claims list --conflicts
kc query "When did Jane Doe stop being CEO?"

The demo corpus in data/raw/ includes three Acme Corp text files and acme-quarterly-revenue.csv.


Commands

kc migrate

Apply pending SQLite migrations.

kc migrate
Database is up to date.

kc ingest

Ingest one file or scan a directory. Text sources go through LLM extraction; CSV files become materialized tables with column semantics.

kc ingest --dir data/raw/              # batch: only changed files
kc ingest data/raw/acme-leadership.txt # single text source
kc ingest data/raw/acme-quarterly-revenue.csv
kc ingest data/raw/acme-leadership.txt --force   # re-extract despite unchanged hash

Unchanged files are a no-op (content-hash dedup):

$ kc ingest --dir data/raw/
acme-annual-report-2025.txt: noop — No changes for acme-annual-report-2025.txt
acme-corp-history.txt: noop — No changes for acme-corp-history.txt
acme-leadership.txt: noop — No changes for acme-leadership.txt
acme-quarterly-revenue.csv: noop — No changes for acme-quarterly-revenue.csv
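
The no-op behaviour above can be sketched as follows. This is a minimal illustration, not KC's implementation: the function names, the in-memory `seen` mapping (standing in for a lookup against the canonical store), and the choice of SHA-256 are all assumptions.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hex digest of the raw bytes (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def ingest_status(name: str, data: bytes, seen: dict[str, str], force: bool = False) -> str:
    """Return "noop" when the content hash is unchanged, else "success".

    `seen` maps a source name to the hash recorded at its last ingest;
    it is a hypothetical stand-in for the store's own bookkeeping.
    """
    digest = content_hash(data)
    if not force and seen.get(name) == digest:
        return "noop"  # same bytes as last time: skip extraction
    seen[name] = digest  # record the new hash before (re-)extracting
    return "success"
```

Hashing file content rather than comparing modification times means a touched but unedited file still counts as unchanged, while `force=True` mirrors the documented `--force` override.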

First CSV ingest:

$ kc ingest data/raw/acme-quarterly-revenue.csv
acme-quarterly-revenue.csv: success — Ingested acme-quarterly-revenue.csv: 24 rows, 5 columns
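
As a rough sketch of what "CSV files become materialized tables" could look like with only the standard library: the helper names and the all-integers-or-text type inference below are assumptions for illustration, and the sketch leaves out the column semantics (semantic_role, entity link) that KC records.

```python
import csv
import io
import sqlite3


def infer_sqlite_type(values: list[str]) -> str:
    """INTEGER if every value parses as an int, otherwise TEXT."""
    try:
        for v in values:
            int(v)
    except ValueError:
        return "TEXT"
    return "INTEGER"


def materialize_csv(conn: sqlite3.Connection, table: str, csv_text: str) -> int:
    """Create `table` from CSV text and return the number of rows inserted."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("CSV has no data rows")
    columns = list(rows[0].keys())
    types = {c: infer_sqlite_type([r[c] for r in rows]) for c in columns}
    col_defs = ", ".join(f'"{c}" {types[c]}' for c in columns)
    conn.execute(f'CREATE TABLE "{table}" ({col_defs})')
    placeholders = ", ".join("?" for _ in columns)
    # INTEGER column affinity converts numeric strings to integers on insert.
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        [tuple(r[c] for c in columns) for r in rows],
    )
    return len(rows)
```

Calling `materialize_csv(sqlite3.connect(":memory:"), "demo", text)` on a small CSV string is enough to try it out; the table name `demo` is arbitrary.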

kc project

Regenerate markdown projections from the canonical store.

kc project                  # current/ + audit/
kc project --audit-only     # audit/ only

Projections written to projections/current (audit view at projections/audit)

kc claims list

Browse claims from SQLite, grouped by conflict slot (predicate_norm).

kc claims list
kc claims list --entity "Jane Doe"
kc claims list --conflicts

$ kc claims list --conflicts

## completed_milestone
- [18] In March 2024, Orion completed its first successful ingest pipeline demo (disputed, conf=0.95) [acme-corp-history.txt]
- [37] Project Orion completed its first production ingest pipeline in June 2024 (disputed, conf=0.9) [acme-annual-report-2025.txt]

## founded_in_year
- [19] Acme Corp was founded in 2010 by Jane Doe in San Francisco (disputed, conf=0.95) [acme-leadership.txt]
- [31] Acme Corp was founded in 2011 by Jane Doe and John Smith in San Francisco (disputed, conf=0.95) [acme-annual-report-2025.txt]

## served_as_role:CEO
- [22] Jane Doe served as CEO of Acme Corp from 2010 until 2022 (disputed, conf=0.95) [acme-leadership.txt]
- [35] Jane Doe served as CEO from 2011 until December 2023 (disputed, conf=0.95) [acme-annual-report-2025.txt]
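
The grouping shown above can be approximated in a few lines. This is a sketch under assumptions: the `Claim` dataclass and its field names are hypothetical (only predicate_norm comes from this README), and "conflict" is taken to mean a slot holding at least one disputed claim.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Claim:
    claim_id: int
    predicate_norm: str  # the conflict slot
    text: str
    status: str  # e.g. "disputed", "accepted", "superseded"


def group_by_slot(claims: list[Claim], conflicts_only: bool = False) -> dict[str, list[Claim]]:
    """Group claims by conflict slot, optionally keeping only disputed slots."""
    slots: dict[str, list[Claim]] = defaultdict(list)
    for claim in claims:
        slots[claim.predicate_norm].append(claim)
    if conflicts_only:
        return {
            slot: members
            for slot, members in slots.items()
            if any(c.status == "disputed" for c in members)
        }
    return dict(slots)
```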

kc claims accept / kc claims supersede

Resolve disputes in the canonical store (with ingest audit trail).

kc claims accept 22                              # accept; supersede siblings by default
kc claims accept 22 --no-supersede-siblings
kc claims supersede 35 --reason "Outdated timeline"

$ kc claims accept 22
Claim 22: disputed -> accepted
Conflict group: edd02e0397bd2ed9...
Superseded siblings: 35

After manual edits, run kc project to refresh projections.
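
A minimal sketch of the accept-and-supersede step, assuming a hypothetical `claims(id, conflict_group, status)` table; KC's real schema is not shown in this README, and the audit trail is omitted here.

```python
import sqlite3


def accept_claim(conn: sqlite3.Connection, claim_id: int, supersede_siblings: bool = True) -> list[int]:
    """Mark a claim accepted and return the IDs of siblings that were superseded."""
    row = conn.execute(
        "SELECT conflict_group FROM claims WHERE id = ?", (claim_id,)
    ).fetchone()
    if row is None:
        raise ValueError(f"no such claim: {claim_id}")
    group = row[0]
    siblings: list[int] = []
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("UPDATE claims SET status = 'accepted' WHERE id = ?", (claim_id,))
        if supersede_siblings:
            siblings = [
                r[0]
                for r in conn.execute(
                    "SELECT id FROM claims "
                    "WHERE conflict_group = ? AND id != ? AND status = 'disputed' "
                    "ORDER BY id",
                    (group, claim_id),
                )
            ]
            conn.executemany(
                "UPDATE claims SET status = 'superseded' WHERE id = ?",
                [(s,) for s in siblings],
            )
    return siblings
```

Doing both updates inside one transaction avoids a half-resolved conflict group if the second update fails.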


kc reason

Plan and execute claim-logic steps (entities, claims, conflicts, assessment). Prints the trace JSON; does not synthesize prose unless --explain is passed.

kc reason "When did Jane Doe stop being CEO?"
kc reason "When did Jane Doe stop being CEO?" --explain

$ kc reason "When did Jane Doe stop being CEO?"
Plan:
  0. resolve_entities({'query': 'Jane Doe', 'limit': 10})
  1. get_entity_claims({'entity_id': '$0[0].entity_id', 'role': 'any', 'status': 'any', 'limit': 50})
  2. find_conflicts({'entity_id': '$0[0].entity_id'})
  3. derive_assessment({'target': {'entity_id': '$0[0].entity_id'}, 'question_type': 'role_timeline'})

The trace includes a deterministic verdict (disputed, supported, etc.). The LLM plans steps and narrates; it does not decide truth.
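
The plan above passes values between steps with placeholders such as '$0[0].entity_id'. Reading that as "result of step 0, first element, entity_id field" is an assumption, and the resolver below is only a sketch of that reading with hypothetical names, not KC's executor.

```python
import re

# "$<step>" followed by any number of "[index]" or ".key" parts.
_REF = re.compile(r"^\$(\d+)((?:\[\d+\]|\.\w+)*)$")
_PART = re.compile(r"\[(\d+)\]|\.(\w+)")


def resolve_refs(value, results):
    """Recursively replace '$<step><path>' strings with values from earlier step results."""
    if isinstance(value, dict):
        return {k: resolve_refs(v, results) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve_refs(v, results) for v in value]
    if not isinstance(value, str):
        return value
    match = _REF.match(value)
    if match is None:
        return value  # ordinary string argument, e.g. a query
    current = results[int(match.group(1))]
    for index, key in _PART.findall(match.group(2)):
        current = current[int(index)] if index else current[key]
    return current
```

With `results = [[{"entity_id": 7}]]` (a made-up entity ID), resolving `{"entity_id": "$0[0].entity_id"}` yields `{"entity_id": 7}`. Because substitution is plain lookup, the same plan always executes the same way once the step results are known.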


kc query

Route by intent — text, tabular, or hybrid — then plan, execute, and synthesize a grounded answer with citations.

| Intent  | Example question                            | Executor                     |
|---------|---------------------------------------------|------------------------------|
| text    | When did Jane Doe stop being CEO?           | claim-logic (+ FTS fallback) |
| tabular | Compare Acme and Globex revenue             | tabular (aggregate on CSV)   |
| hybrid  | Compare revenue and explain the CEO dispute | both                         |

kc query "When did Jane Doe stop being CEO?"
kc query "Compare Acme and Globex revenue"
kc query "Compare revenue and explain the CEO dispute"
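
The dispatch implied by the intent table can be sketched as below. How KC classifies intent is not described here, so the sketch takes the intent as given; the function name, the `Executor` type, and the shape of the returned dict are assumptions.

```python
from typing import Callable

# An executor takes the question and returns its trace as a dict (assumed shape).
Executor = Callable[[str], dict]


def route_query(intent: str, question: str, claim_logic: Executor, tabular: Executor) -> dict:
    """Run the executor(s) matching the intent and collect their traces."""
    if intent == "text":
        return {"intent": intent, "claims": claim_logic(question)}
    if intent == "tabular":
        return {"intent": intent, "table": tabular(question)}
    if intent == "hybrid":
        # Both executors run; synthesis would merge the two traces afterwards.
        return {"intent": intent, "claims": claim_logic(question), "table": tabular(question)}
    raise ValueError(f"unknown intent: {intent!r}")
```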

Tabular — totals from dataset execution (sum_revenue_usd grouped by company):

$ kc query "Compare Acme and Globex revenue"

Based on aggregated data from the quarterly revenue dataset [source:acme-quarterly-revenue.csv]:

| Company   | Total Revenue (USD) |
|-----------|---------------------|
| Acme Corp | $11,990,000         |
| Globex    | $20,010,000         |

Globex's total revenue significantly exceeds Acme Corp's by approximately $8,020,000 (~67% higher).

Note: Answer synthesized from tabular dataset execution.

Citations:
  - [acme-quarterly-revenue.csv]: company=Acme Corp, sum_revenue_usd=11990000

Text — claim trace + span citations (wording varies with dispute state):

Note: Some retrieved claims are disputed.

Citations:
  - [acme-leadership.txt] span 8: Jane Doe served as CEO of Acme Corp from 2010 until 2022...

Hybrid — merges tabular totals with claim/dispute narrative in one answer.

Note: Answer synthesized from tabular data and claim reasoning.

kc datasets list / kc datasets describe

Inspect ingested CSV datasets and column semantics (semantic_role, entity link).

kc datasets list
kc datasets describe acme-quarterly-revenue.csv
kc datasets describe 1                    # by numeric dataset ID

$ kc datasets list
- [1] acme-quarterly-revenue.csv (24 rows, 5 columns) -> d_4_acme_quarterly_revenue_csv

$ kc datasets describe acme-quarterly-revenue.csv
Dataset 1: acme-quarterly-revenue.csv
Table: d_4_acme_quarterly_revenue_csv (24 rows)
Columns:
  - company (TEXT) [entity link]
  - region (TEXT)
  - quarter (TEXT)
  - revenue_usd (INTEGER)
  - headcount (INTEGER)
Sample rows:
  {'company': 'Acme Corp', 'region': 'North America', 'quarter': '2024-Q1', 'revenue_usd': 1250000, 'headcount': 420}
  ...

kc table reason

Plan and execute tabular query steps (filter, aggregate, entity lookup). Prints plan + JSON trace — useful for debugging without LLM narration.

kc table reason "What is total revenue by company?"
kc table reason "Compare Acme and Globex revenue"

$ kc table reason "Compare Acme and Globex revenue"
Plan:
  0. describe_dataset({'dataset_id': 1})
  1. aggregate({'dataset_id': 1, 'group_by': ['company'], 'metrics': [{'column': 'revenue_usd', 'op': 'sum'}]})

  "result": [
    {"company": "Acme Corp", "sum_revenue_usd": 11990000},
    {"company": "Globex", "sum_revenue_usd": 20010000}
  ]

Tabular planning uses structured LLM output (LLMTabularPlan) with schema validation and a deterministic heuristic fallback. Column names come from the dataset catalog — not hardcoded demo fields.
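
To illustrate how an aggregate step like the one in the plan above might be validated against the catalog and executed, here is a small sketch. The function name, the allowed-operation set, and passing the catalog as a plain set of column names are assumptions; only the step shape (group_by, metrics with column and op) and the `sum_revenue_usd` naming come from the trace shown earlier.

```python
import sqlite3

ALLOWED_OPS = {"sum", "avg", "min", "max", "count"}  # assumed set of operations


def run_aggregate(conn, table, catalog_columns, group_by, metrics):
    """Validate an aggregate step against the catalog, then run it as GROUP BY SQL."""
    for col in list(group_by) + [m["column"] for m in metrics]:
        if col not in catalog_columns:
            raise ValueError(f"unknown column: {col}")
    for m in metrics:
        if m["op"] not in ALLOWED_OPS:
            raise ValueError(f"unsupported op: {m['op']}")
    # Identifiers are interpolated only after the catalog check above.
    select_parts = [f'"{c}" AS "{c}"' for c in group_by]
    for m in metrics:
        alias = f"{m['op']}_{m['column']}"  # e.g. sum_revenue_usd
        select_parts.append(f'{m["op"].upper()}("{m["column"]}") AS "{alias}"')
    sql = f'SELECT {", ".join(select_parts)} FROM "{table}"'
    if group_by:
        cols = ", ".join(f'"{c}"' for c in group_by)
        sql += f" GROUP BY {cols} ORDER BY {cols}"
    cursor = conn.execute(sql)
    names = [d[0] for d in cursor.description]
    return [dict(zip(names, row)) for row in cursor]
```

Rejecting any column that is not in the catalog keeps a malformed LLM plan from reaching the database, which is one way to read "schema validation" in the paragraph above.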


kc merge-entities

Merge duplicate entities by ID or name/alias.

kc merge-entities "Stanford AI Lab" "Stanford" --dry-run
kc merge-entities "Stanford AI Lab" "Stanford"

$ kc merge-entities "Stanford AI Lab" "Stanford" --dry-run
Would merge [4] Stanford AI Lab
         into [13] Stanford
Claims to repoint: 0
Aliases: Stanford AI Lab, Stanford's AI lab
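
One way a merge with a dry-run mode could work is sketched below. The three-table schema (`entities`, `aliases`, `claims`) and the function name are hypothetical, chosen only to mirror the fields in the dry-run output above (claims to repoint, aliases carried over).

```python
import sqlite3


def merge_entities(conn: sqlite3.Connection, from_id: int, to_id: int, dry_run: bool = False) -> int:
    """Merge entity `from_id` into `to_id`; return the number of claims to repoint."""
    row = conn.execute("SELECT name FROM entities WHERE id = ?", (from_id,)).fetchone()
    if row is None:
        raise ValueError(f"no such entity: {from_id}")
    old_name = row[0]
    count = conn.execute(
        "SELECT COUNT(*) FROM claims WHERE entity_id = ?", (from_id,)
    ).fetchone()[0]
    if dry_run:
        return count  # report only, write nothing
    with conn:  # single transaction for all writes
        conn.execute("UPDATE claims SET entity_id = ? WHERE entity_id = ?", (to_id, from_id))
        conn.execute("UPDATE aliases SET entity_id = ? WHERE entity_id = ?", (to_id, from_id))
        # Keep the merged entity's name searchable as an alias of the target.
        conn.execute("INSERT INTO aliases (entity_id, alias) VALUES (?, ?)", (to_id, old_name))
        conn.execute("DELETE FROM entities WHERE id = ?", (from_id,))
    return count
```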

Projections

Entity page (projections/current/)

Lean browse view: claims grouped by conflict slot, with source links. The (conflict) marker appears only on real disputes:

### served_as_role:CEO (conflict)
- Jane Doe served as CEO of Acme Corp from 2010 until 2022 (disputed, conf=0.95) [[acme-leadership.txt](../sources/acme-leadership.txt.md)]
- Jane Doe served as CEO from 2011 until December 2023 (disputed, conf=0.95) [[acme-annual-report-2025.txt](../sources/acme-annual-report-2025.txt.md)]
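
A section like the one above could be rendered with a small helper. This is a sketch: the function name and the dict keys are hypothetical, and "real dispute" is assumed to mean two or more disputed claims in the same slot.

```python
def render_slot(slot: str, claims: list[dict]) -> str:
    """Render one conflict slot as a markdown section with source links."""
    disputed = [c for c in claims if c["status"] == "disputed"]
    # Assumption: flag the slot only when at least two claims actually compete.
    header = f"### {slot}" + (" (conflict)" if len(disputed) >= 2 else "")
    lines = [header]
    for c in claims:
        src = c["source"]
        lines.append(
            f'- {c["text"]} ({c["status"]}, conf={c["conf"]}) '
            f"[[{src}](../sources/{src}.md)]"
        )
    return "\n".join(lines)
```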

Audit view (projections/audit/)

Verbose projection: relations, retired claims, span listings — for debugging ingest and reconcile.


Architecture

data/raw/  ──ingest──▶  data/knowledge.db  ──project──▶  projections/
                              │
                              ├── text path: LLM extract → claims, entities, spans
                              ├── csv path:  column roles → materialized tables + row evidence
                              └── query:
                                    intent (text | tabular | hybrid)
                                    → claim-logic executor  (reasoning/)
                                    → tabular executor      (tabular/)
                                    → LLM narration (deterministic trace in, prose out)

| Layer | Role |
|-------|------|
| Source adapters (ingest/adapters/) | Text → LLM claims; CSV → SQLite tables + semantic_role metadata |
| Claim-logic engine (reasoning/) | Typed LLM plans → 12 deterministic API functions → assessment verdict |
| Tabular engine (tabular/) | Catalog-driven plans → filter / aggregate / entity lookup on dataset tables |
| Query synthesis (query/) | Intent routing, hybrid merge, FTS fallback |
| Manual ops | claims accept/supersede, merge-entities: audited writes to canonical store |

Command reference

| Command | Purpose |
|---------|---------|
| kc migrate | Apply pending migrations |
| kc ingest <path> | Ingest one raw source |
| kc ingest --dir <dir> | Batch ingest changed sources |
| kc ingest --force | Re-extract despite unchanged hash |
| kc project | Regenerate markdown projections |
| kc project --audit-only | Regenerate audit view only |
| kc claims list [--entity] [--conflicts] | List claims |
| kc claims accept <id> | Accept claim; supersede siblings by default |
| kc claims supersede <id> | Mark claim superseded |
| kc reason "<q>" [--explain] | Claim-logic trace (no narration by default) |
| kc query "<q>" | Route + execute + grounded answer |
| kc datasets list | List ingested CSV datasets |
| kc datasets describe <key\|id> | Schema and sample rows |
| kc table reason "<q>" | Tabular plan + execution trace |
| kc merge-entities <from> <to> [--dry-run] | Merge entities |
