Knowledge Compiler

SQLite-first knowledge compiler: immutable raw sources → canonical SQLite store → grounded LLM answers.

A riff on Andrej Karpathy's LLM Wiki pattern: instead of an agent-maintained markdown wiki alone, KC keeps a canonical SQLite store as source of truth and projects lean markdown views from it (current/ for browsing, audit/ for full ingest detail).

Setup

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
cp .env.example .env
# Set OPENAI_API_KEY or ANTHROPIC_API_KEY in .env

Quickstart

kc migrate
kc ingest --dir data/raw/
kc project
kc claims list --conflicts
kc query "When did Jane Doe stop being CEO?"

The demo corpus in data/raw/ includes three Acme Corp text files and acme-quarterly-revenue.csv.

Commands

`kc migrate`

Apply pending SQLite migrations.

kc migrate

Database is up to date.

`kc ingest`

Ingest one file or scan a directory. Text sources go through LLM extraction; CSV files become materialized tables with column semantics.

kc ingest --dir data/raw/              # batch: only changed files
kc ingest data/raw/acme-leadership.txt # single text source
kc ingest data/raw/acme-quarterly-revenue.csv
kc ingest data/raw/acme-leadership.txt --force   # re-extract despite unchanged hash

Unchanged files are a no-op (content-hash dedup):

$ kc ingest --dir data/raw/
acme-annual-report-2025.txt: noop — No changes for acme-annual-report-2025.txt
acme-corp-history.txt: noop — No changes for acme-corp-history.txt
acme-leadership.txt: noop — No changes for acme-leadership.txt
acme-quarterly-revenue.csv: noop — No changes for acme-quarterly-revenue.csv
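The no-op behavior above can be pictured as a hash comparison. The following is a minimal sketch, not KC's actual code: the choice of SHA-256, the `content_hash` and `should_ingest` helpers, and the `known_hashes` mapping are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def content_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes (hash choice is assumed)."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large sources do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def should_ingest(path: Path, known_hashes: dict, force: bool = False) -> bool:
    """Decide whether a source needs (re-)ingestion.

    known_hashes is a hypothetical stand-in for whatever the store records
    per source; here it maps file name -> hash seen at the last ingest.
    """
    if force:
        return True  # mirrors the intent of --force: skip the dedup check
    return known_hashes.get(path.name) != content_hash(path)
```

Under these assumptions, a file whose bytes are unchanged since the last run returns `False` and is skipped, while `force=True` always re-extracts.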

First CSV ingest:

$ kc ingest data/raw/acme-quarterly-revenue.csv
acme-quarterly-revenue.csv: success — Ingested acme-quarterly-revenue.csv: 24 rows, 5 columns
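To make "CSV files become materialized tables" concrete, here is a rough sketch using only the standard library. The type inference rule, the helper names, and the table name `demo` are assumptions; KC's real adapter also records column semantics, which this sketch leaves out.

```python
import csv
import io
import sqlite3


def infer_sqlite_type(values):
    """Return INTEGER if every value parses as an int, else TEXT."""
    try:
        for value in values:
            int(value)
    except ValueError:
        return "TEXT"
    return "INTEGER"


def materialize_csv(conn, table, text):
    """Create `table` from CSV text and return the number of rows loaded."""
    rows = list(csv.DictReader(io.StringIO(text)))
    if not rows:
        raise ValueError("CSV has no data rows")
    columns = list(rows[0].keys())
    types = {c: infer_sqlite_type([r[c] for r in rows]) for c in columns}
    column_sql = ", ".join(f'"{c}" {types[c]}' for c in columns)
    # Identifiers are quoted but not otherwise sanitized; fine for a sketch only.
    conn.execute(f'CREATE TABLE "{table}" ({column_sql})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        [tuple(r[c] for c in columns) for r in rows],
    )
    return len(rows)
```

Because SQLite applies INTEGER affinity on insert, numeric strings from the CSV are stored as integers in columns inferred as `INTEGER`.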

`kc project`

Regenerate markdown projections from the canonical store.

kc project                  # current/ + audit/
kc project --audit-only     # audit/ only

Projections written to projections/current (audit view at projections/audit)

`kc claims list`

Browse claims in the canonical SQLite store, grouped by conflict slot (predicate_norm).

kc claims list
kc claims list --entity "Jane Doe"
kc claims list --conflicts

$ kc claims list --conflicts

## completed_milestone
- [18] In March 2024, Orion completed its first successful ingest pipeline demo (disputed, conf=0.95) [acme-corp-history.txt]
- [37] Project Orion completed its first production ingest pipeline in June 2024 (disputed, conf=0.9) [acme-annual-report-2025.txt]

## founded_in_year
- [19] Acme Corp was founded in 2010 by Jane Doe in San Francisco (disputed, conf=0.95) [acme-leadership.txt]
- [31] Acme Corp was founded in 2011 by Jane Doe and John Smith in San Francisco (disputed, conf=0.95) [acme-annual-report-2025.txt]

## served_as_role:CEO
- [22] Jane Doe served as CEO of Acme Corp from 2010 until 2022 (disputed, conf=0.95) [acme-leadership.txt]
- [35] Jane Doe served as CEO from 2011 until December 2023 (disputed, conf=0.95) [acme-annual-report-2025.txt]
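The grouping shown above can be sketched in a few lines. The `Claim` fields other than `predicate_norm` are hypothetical, and so is the rule that a slot counts as a conflict once it holds two or more disputed claims; KC's real conflict detection may differ.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    # Field names are assumptions, loosely following the CLI output.
    claim_id: int
    predicate_norm: str
    text: str
    status: str
    source: str


def group_by_slot(claims):
    """Group claims by their conflict slot (predicate_norm)."""
    slots = defaultdict(list)
    for claim in claims:
        slots[claim.predicate_norm].append(claim)
    return dict(slots)


def conflict_slots(claims):
    """Keep only slots holding at least two disputed claims (assumed rule)."""
    return {
        slot: members
        for slot, members in group_by_slot(claims).items()
        if sum(c.status == "disputed" for c in members) >= 2
    }


claims = [
    Claim(22, "served_as_role:CEO", "Jane Doe served as CEO of Acme Corp from 2010 until 2022",
          "disputed", "acme-leadership.txt"),
    Claim(35, "served_as_role:CEO", "Jane Doe served as CEO from 2011 until December 2023",
          "disputed", "acme-annual-report-2025.txt"),
    # Hypothetical undisputed claim, added only to show that it is filtered out.
    Claim(99, "example_slot", "An undisputed example claim", "accepted", "example.txt"),
]
```

With this input, `conflict_slots(claims)` keeps only the `served_as_role:CEO` slot.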

`kc claims accept` / `kc claims supersede`

Resolve disputes in the canonical store (with ingest audit trail).

kc claims accept 22                              # accept; supersede siblings by default
kc claims accept 22 --no-supersede-siblings
kc claims supersede 35 --reason "Outdated timeline"

$ kc claims accept 22
Claim 22: disputed -> accepted
Conflict group: edd02e0397bd2ed9...
Superseded siblings: 35

After manual edits, run kc project to refresh projections.
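As an illustration of "accept; supersede siblings by default", the sketch below runs against a toy in-memory table. The `claims` schema (`id`, `conflict_group`, `status`) and the `accept_claim` helper are assumptions, and the audit trail that the real command writes is omitted.

```python
import sqlite3

# Hypothetical minimal schema; the real store has more columns.
SCHEMA = "CREATE TABLE claims (id INTEGER PRIMARY KEY, conflict_group TEXT, status TEXT)"


def accept_claim(conn, claim_id, supersede_siblings=True):
    """Mark a claim accepted and return the IDs of superseded siblings."""
    with conn:  # commit on success, roll back on error
        row = conn.execute(
            "SELECT conflict_group FROM claims WHERE id = ?", (claim_id,)
        ).fetchone()
        if row is None:
            raise ValueError(f"unknown claim {claim_id}")
        conn.execute("UPDATE claims SET status = 'accepted' WHERE id = ?", (claim_id,))
        if not supersede_siblings:
            return []
        # Siblings: other disputed claims in the same conflict group.
        siblings = [
            r[0]
            for r in conn.execute(
                "SELECT id FROM claims WHERE conflict_group = ? AND id != ? "
                "AND status = 'disputed' ORDER BY id",
                (row[0], claim_id),
            )
        ]
        conn.executemany(
            "UPDATE claims SET status = 'superseded' WHERE id = ?",
            [(s,) for s in siblings],
        )
        return siblings
```

Passing `supersede_siblings=False` corresponds to the intent of `--no-supersede-siblings`: the accepted claim changes status and its siblings stay disputed.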

`kc reason`

Plan and execute claim-logic steps (entities, claims, conflicts, assessment). Prints the trace JSON; does not synthesize prose unless --explain is passed.

kc reason "When did Jane Doe stop being CEO?"
kc reason "When did Jane Doe stop being CEO?" --explain

$ kc reason "When did Jane Doe stop being CEO?"
Plan:
  0. resolve_entities({'query': 'Jane Doe', 'limit': 10})
  1. get_entity_claims({'entity_id': '$0[0].entity_id', 'role': 'any', 'status': 'any', 'limit': 50})
  2. find_conflicts({'entity_id': '$0[0].entity_id'})
  3. derive_assessment({'target': {'entity_id': '$0[0].entity_id'}, 'question_type': 'role_timeline'})

The trace includes a deterministic verdict (disputed, supported, etc.). The LLM plans steps and narrates; it does not decide truth.
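The plan above passes results between steps with references such as `$0[0].entity_id`. One plausible reading is "step 0's result, first item, field `entity_id`". The resolver below is a sketch under that assumption; the grammar it accepts and the helper names are not taken from KC's code.

```python
import re

# Assumed reference grammar: "$<step>" followed by "[index]" or ".key" parts.
_REF = re.compile(r"\$(\d+)((?:\[\d+\]|\.\w+)*)")
_PART = re.compile(r"\[(\d+)\]|\.(\w+)")


def resolve_ref(value, results):
    """Replace a "$N..." reference with the value it points to."""
    if not isinstance(value, str):
        return value
    match = _REF.fullmatch(value)
    if match is None:
        return value  # ordinary string argument, not a reference
    current = results[int(match.group(1))]
    for index, key in _PART.findall(match.group(2)):
        current = current[int(index)] if index else current[key]
    return current


def resolve_args(args, results):
    """Resolve references anywhere inside nested dict/list arguments."""
    if isinstance(args, dict):
        return {k: resolve_args(v, results) for k, v in args.items()}
    if isinstance(args, list):
        return [resolve_args(v, results) for v in args]
    return resolve_ref(args, results)
```

Resolution like this keeps the executor deterministic: the plan only names where a value comes from, and the executor looks it up in earlier step results.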

`kc query`

Route by intent — text, tabular, or hybrid — then plan, execute, and synthesize a grounded answer with citations.

| Intent | Example question | Executor |
|--------|------------------|----------|
| text | `When did Jane Doe stop being CEO?` | claim-logic (+ FTS fallback) |
| tabular | `Compare Acme and Globex revenue` | tabular (aggregate on CSV) |
| hybrid | `Compare revenue and explain the CEO dispute` | both |

kc query "When did Jane Doe stop being CEO?"
kc query "Compare Acme and Globex revenue"
kc query "Compare revenue and explain the CEO dispute"
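The intent-to-executor mapping in the table can be sketched as a small dispatcher. How KC actually classifies intent is not shown here; the `Intent` enum, `run_query`, and the executor callables are hypothetical names, and the executors are expected to be passed in by the caller.

```python
from enum import Enum


class Intent(Enum):
    TEXT = "text"
    TABULAR = "tabular"
    HYBRID = "hybrid"


def run_query(question, intent, claim_logic, tabular):
    """Run the executor(s) matching the intent and collect their traces.

    claim_logic and tabular are caller-supplied callables that take the
    question and return a trace; hybrid runs both.
    """
    traces = {}
    if intent in (Intent.TEXT, Intent.HYBRID):
        traces["claim_logic"] = claim_logic(question)
    if intent in (Intent.TABULAR, Intent.HYBRID):
        traces["tabular"] = tabular(question)
    return traces
```

In this reading, narration would happen after dispatch, with the collected traces as its only input.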

Tabular — totals from dataset execution (sum_revenue_usd grouped by company):

$ kc query "Compare Acme and Globex revenue"

Based on aggregated data from the quarterly revenue dataset [source:acme-quarterly-revenue.csv]:

| Company   | Total Revenue (USD) |
|-----------|---------------------|
| Acme Corp | $11,990,000         |
| Globex    | $20,010,000         |

Globex's total revenue significantly exceeds Acme Corp's by approximately $8,020,000 (~67% higher).

Note: Answer synthesized from tabular dataset execution.

Citations:
  - [acme-quarterly-revenue.csv]: company=Acme Corp, sum_revenue_usd=11990000

Text — claim trace + span citations (wording varies with dispute state):

Note: Some retrieved claims are disputed.

Citations:
  - [acme-leadership.txt] span 8: Jane Doe served as CEO of Acme Corp from 2010 until 2022...

Hybrid — merges tabular totals with claim/dispute narrative in one answer.

Note: Answer synthesized from tabular data and claim reasoning.

`kc datasets list` / `kc datasets describe`

Inspect ingested CSV datasets and column semantics (semantic_role, entity link).

kc datasets list
kc datasets describe acme-quarterly-revenue.csv
kc datasets describe 1                    # by numeric dataset ID

$ kc datasets list
- [1] acme-quarterly-revenue.csv (24 rows, 5 columns) -> d_4_acme_quarterly_revenue_csv

$ kc datasets describe acme-quarterly-revenue.csv
Dataset 1: acme-quarterly-revenue.csv
Table: d_4_acme_quarterly_revenue_csv (24 rows)
Columns:
  - company (TEXT) [entity link]
  - region (TEXT)
  - quarter (TEXT)
  - revenue_usd (INTEGER)
  - headcount (INTEGER)
Sample rows:
  {'company': 'Acme Corp', 'region': 'North America', 'quarter': '2024-Q1', 'revenue_usd': 1250000, 'headcount': 420}
  ...

`kc table reason`

Plan and execute tabular query steps (filter, aggregate, entity lookup). Prints plan + JSON trace — useful for debugging without LLM narration.

kc table reason "What is total revenue by company?"
kc table reason "Compare Acme and Globex revenue"

$ kc table reason "Compare Acme and Globex revenue"
Plan:
  0. describe_dataset({'dataset_id': 1})
  1. aggregate({'dataset_id': 1, 'group_by': ['company'], 'metrics': [{'column': 'revenue_usd', 'op': 'sum'}]})

  "result": [
    {"company": "Acme Corp", "sum_revenue_usd": 11990000},
    {"company": "Globex", "sum_revenue_usd": 20010000}
  ]

Tabular planning uses structured LLM output (LLMTabularPlan) with schema validation and a deterministic heuristic fallback. Column names come from the dataset catalog — not hardcoded demo fields.
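A sketch of what catalog-driven validation might look like for an `aggregate` step: columns named in the plan are checked against the dataset's catalog before any SQL is built. The function names, the `ALLOWED_OPS` whitelist, and the `<op>_<column>` alias rule (chosen to match `sum_revenue_usd` in the trace above) are assumptions.

```python
import sqlite3

# Hypothetical whitelist mapping plan ops to SQL aggregate functions.
ALLOWED_OPS = {"sum": "SUM", "avg": "AVG", "min": "MIN", "max": "MAX", "count": "COUNT"}


def build_aggregate_sql(table, catalog_columns, group_by, metrics):
    """Validate a plan step against the catalog and return a SELECT statement."""
    select_parts = []
    for column in group_by:
        if column not in catalog_columns:
            raise ValueError(f"unknown group_by column: {column}")
        select_parts.append(f'"{column}"')
    for metric in metrics:
        column, op = metric["column"], metric["op"]
        if column not in catalog_columns:
            raise ValueError(f"unknown metric column: {column}")
        if op not in ALLOWED_OPS:
            raise ValueError(f"unsupported op: {op}")
        select_parts.append(f'{ALLOWED_OPS[op]}("{column}") AS "{op}_{column}"')
    sql = "SELECT " + ", ".join(select_parts) + f' FROM "{table}"'
    if group_by:
        keys = ", ".join(f'"{c}"' for c in group_by)
        sql += f" GROUP BY {keys} ORDER BY {keys}"
    return sql


def run_aggregate(conn, table, catalog_columns, group_by, metrics):
    """Execute the validated aggregate and return rows as dicts."""
    cursor = conn.execute(build_aggregate_sql(table, catalog_columns, group_by, metrics))
    names = [d[0] for d in cursor.description]
    return [dict(zip(names, row)) for row in cursor]
```

Rejecting unknown columns before execution means a hallucinated field in an LLM plan fails fast instead of producing a confusing SQL error.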

`kc merge-entities`

Merge duplicate entities by ID or name/alias.

kc merge-entities "Stanford AI Lab" "Stanford" --dry-run
kc merge-entities "Stanford AI Lab" "Stanford"

$ kc merge-entities "Stanford AI Lab" "Stanford" --dry-run
Would merge [4] Stanford AI Lab
         into [13] Stanford
Claims to repoint: 0
Aliases: Stanford AI Lab, Stanford's AI lab
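The dry-run output above suggests a merge has three parts: repoint claims, carry over the old name and aliases, and retire the duplicate. The sketch below shows that shape on a toy schema. The tables, columns, and `merge_entities` function are hypothetical, and the audited write that the real command performs is omitted.

```python
import sqlite3

# Hypothetical minimal schema for illustration only.
SCHEMA = """
CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE aliases (entity_id INTEGER, alias TEXT);
CREATE TABLE claims (id INTEGER PRIMARY KEY, entity_id INTEGER);
"""


def merge_entities(conn, from_id, to_id, dry_run=False):
    """Merge entity from_id into to_id; return a summary of the change."""
    with conn:  # one transaction: all writes commit together or not at all
        source = conn.execute("SELECT name FROM entities WHERE id = ?", (from_id,)).fetchone()
        target = conn.execute("SELECT name FROM entities WHERE id = ?", (to_id,)).fetchone()
        if source is None or target is None:
            raise ValueError("both entities must exist")
        # The old name becomes an alias of the surviving entity.
        aliases = [source[0]] + [
            r[0]
            for r in conn.execute(
                "SELECT alias FROM aliases WHERE entity_id = ? ORDER BY alias", (from_id,)
            )
        ]
        claim_count = conn.execute(
            "SELECT COUNT(*) FROM claims WHERE entity_id = ?", (from_id,)
        ).fetchone()[0]
        summary = {"claims_to_repoint": claim_count, "aliases": aliases}
        if dry_run:
            return summary  # report only, no writes
        conn.execute("UPDATE claims SET entity_id = ? WHERE entity_id = ?", (to_id, from_id))
        conn.execute("DELETE FROM aliases WHERE entity_id = ?", (from_id,))
        conn.executemany(
            "INSERT INTO aliases (entity_id, alias) VALUES (?, ?)",
            [(to_id, alias) for alias in aliases],
        )
        conn.execute("DELETE FROM entities WHERE id = ?", (from_id,))
        return summary
```

Returning the same summary for dry and real runs makes it easy to preview a merge and then confirm that the applied change matches the preview.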

Projections

Entity page (`projections/current/`)

Lean browse view: claims grouped by conflict slot, with source links. The (conflict) marker appears only on slots that hold a real dispute:

### served_as_role:CEO (conflict)
- Jane Doe served as CEO of Acme Corp from 2010 until 2022 (disputed, conf=0.95) [[acme-leadership.txt](../sources/acme-leadership.txt.md)]
- Jane Doe served as CEO from 2011 until December 2023 (disputed, conf=0.95) [[acme-annual-report-2025.txt](../sources/acme-annual-report-2025.txt.md)]
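For illustration, a section like the one above could be rendered from claim records as follows. The dict keys, the `render_slot` helper, and the rule that two or more disputed claims trigger the `(conflict)` marker are assumptions; only the output format is taken from the example.

```python
def render_slot(slot, claims):
    """Render one conflict slot as a markdown section for an entity page."""
    disputed = [c for c in claims if c["status"] == "disputed"]
    # Assumed rule: flag the slot only when at least two claims are disputed.
    heading = f"### {slot}" + (" (conflict)" if len(disputed) >= 2 else "")
    lines = [heading]
    for claim in claims:
        source = claim["source"]
        link = f"[{source}](../sources/{source}.md)"
        lines.append(
            f"- {claim['text']} ({claim['status']}, conf={claim['conf']}) [{link}]"
        )
    return "\n".join(lines)


ceo_claims = [
    {"text": "Jane Doe served as CEO of Acme Corp from 2010 until 2022",
     "status": "disputed", "conf": 0.95, "source": "acme-leadership.txt"},
    {"text": "Jane Doe served as CEO from 2011 until December 2023",
     "status": "disputed", "conf": 0.95, "source": "acme-annual-report-2025.txt"},
]
```

Calling `render_slot("served_as_role:CEO", ceo_claims)` yields the heading with its `(conflict)` marker followed by one bullet per claim.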

Audit view (`projections/audit/`)

Verbose projection: relations, retired claims, span listings — for debugging ingest and reconcile.

Architecture

data/raw/  ──ingest──▶  data/knowledge.db  ──project──▶  projections/
                              │
                              ├── text path: LLM extract → claims, entities, spans
                              ├── csv path:  column roles → materialized tables + row evidence
                              └── query:
                                    intent (text | tabular | hybrid)
                                    → claim-logic executor  (reasoning/)
                                    → tabular executor      (tabular/)
                                    → LLM narration (deterministic trace in, prose out)

| Layer | Role |
|-------|------|
| Source adapters (`ingest/adapters/`) | Text → LLM claims; CSV → SQLite tables + `semantic_role` metadata |
| Claim-logic engine (`reasoning/`) | Typed LLM plans → 12 deterministic API functions → assessment verdict |
| Tabular engine (`tabular/`) | Catalog-driven plans → filter / aggregate / entity lookup on dataset tables |
| Query synthesis (`query/`) | Intent routing, hybrid merge, FTS fallback |
| Manual ops | `claims accept/supersede`, `merge-entities`: audited writes to canonical store |

Command reference

| Command | Purpose |
|---------|---------|
| `kc migrate` | Apply pending migrations |
| `kc ingest <path>` | Ingest one raw source |
| `kc ingest --dir <dir>` | Batch ingest changed sources |
| `kc ingest --force` | Re-extract despite unchanged hash |
| `kc project` | Regenerate markdown projections |
| `kc project --audit-only` | Regenerate audit view only |
| `kc claims list [--entity] [--conflicts]` | List claims |
| `kc claims accept <id>` | Accept claim; supersede siblings by default |
| `kc claims supersede <id>` | Mark claim superseded |
| `kc reason "<q>" [--explain]` | Claim-logic trace (no narration by default) |
| `kc query "<q>"` | Route + execute + grounded answer |
| `kc datasets list` | List ingested CSV datasets |
| `kc datasets describe <key\|id>` | Schema and sample rows |
| `kc table reason "<q>"` | Tabular plan + execution trace |
| `kc merge-entities <from> <to> [--dry-run]` | Merge entities |
