You open a file you didn’t write. There’s a line that looks wrong — a retry count of 5, a 30-second timeout, a guard clause nobody remembers. You want to change it, but first: why is this here, and what breaks if I touch it?
So you run git blame. It points at a commit from three weeks ago: “refactor: tidy up imports.” Useless. The line’s real origin — the PR where someone argued for it, the incident that caused it — is buried under years of renames, moves, and formatting passes. You spend twenty minutes spelunking through GitHub and give up.
Finding that origin is the entire job of archaeo. The terminal above is a real run against kubernetes/kubernetes: in about six seconds it found the PR that introduced the line and surfaced the actual design debate from code review, the thing you’d otherwise spend an afternoon digging out.
The one rule
There are a hundred “chat with your codebase” tools, and most of them hallucinate. archaeo has exactly one rule that makes it different:
The LLM never answers from its own knowledge. It only summarizes retrieved evidence, and every claim cites a concrete artifact. If the evidence isn’t there, it says “no recorded decision found.” A confident guess is a defect, not a feature.
Trust is the whole game. A tool that sometimes invents a plausible reason is worse than no tool, because you can’t tell the good answers from the bad ones. archaeo would rather tell you it doesn’t know.
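To make the rule concrete, here is a minimal TypeScript sketch of a citation check, assuming the summarizer returns claims tagged with the ids of the artifacts they rely on. None of these names come from archaeo itself: EvidenceArtifact, CitedClaim, groundClaims, and renderAnswer are hypothetical, and the ids in the usage note are made up.

```typescript
// Hypothetical shapes for illustration; not archaeo's real types.
interface EvidenceArtifact {
  id: string; // e.g. "pr:101" or "commit:abc123"
  text: string;
}

interface CitedClaim {
  text: string;
  cites: string[]; // ids of the artifacts the summarizer says support the claim
}

const NO_DECISION = "no recorded decision found";

// Keep only claims that cite at least one artifact and whose every
// citation resolves to something that was actually retrieved.
function groundClaims(
  claims: CitedClaim[],
  evidence: EvidenceArtifact[],
): CitedClaim[] {
  const known = new Set(evidence.map((a) => a.id));
  return claims.filter(
    (c) => c.cites.length > 0 && c.cites.every((id) => known.has(id)),
  );
}

// Render the surviving claims with their citations, or refuse outright
// when nothing survives the check.
function renderAnswer(
  claims: CitedClaim[],
  evidence: EvidenceArtifact[],
): string {
  const grounded = groundClaims(claims, evidence);
  if (grounded.length === 0) return NO_DECISION;
  return grounded.map((c) => `${c.text} [${c.cites.join(", ")}]`).join("\n");
}
```

With one retrieved artifact "pr:101", a claim citing it is rendered with its citation, while a claim with no citation or with an unknown id is dropped. If every claim is dropped, the caller gets the refusal string instead of a guess.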
The hard part: blame-through-time
This is where almost every tool is mediocre, and where archaeo spends its entire complexity budget. git blame shows the last commit that touched a line — usually cosmetic. archaeo follows the line backward, skipping the cosmetic strata, to the change that introduced the behavior:
git blame → last touching commit (rename / format / move — useless)
archaeo → renames → moves → refactors → squash → cherry-picks
↓
the commit that introduced the BEHAVIOR
It uses git log -L to trace the line through its in-file history, detects the “file-introduction wall” (where code was moved in from elsewhere), then uses git’s pickaxe across all history to jump files and find where the logic originally entered the repo, even if it was first written somewhere else. A deterministic classifier skips the cosmetic commits. Then it recovers the chain: commit → merged PR → linked issue → the review comments that argued about it, handling squash-merges and cherry-picks along the way.
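Two of the deterministic pieces described above can be sketched in a few lines of TypeScript. This is an illustration under stated assumptions, not archaeo's actual code: the LineChange shape and all function names are hypothetical, and the "identical after stripping whitespace" rule is a deliberately crude stand-in for a real cosmetic-commit classifier. The two message patterns are real conventions: GitHub's default squash-merge title ends in "(#N)", and git cherry-pick -x appends "(cherry picked from commit <sha>)".

```typescript
// Hypothetical record of how one commit changed the tracked line.
interface LineChange {
  sha: string;
  before: string | null; // line text before the commit; null if the line did not exist yet
  after: string; // line text after the commit
}

// Crude normalization (assumption): drop all whitespace so that pure
// formatting changes compare equal.
function stripWhitespace(line: string): string {
  return line.replace(/\s+/g, "");
}

// A change is cosmetic when the line existed before and only its
// whitespace changed.
function isCosmetic(change: LineChange): boolean {
  return (
    change.before !== null &&
    stripWhitespace(change.before) === stripWhitespace(change.after)
  );
}

// Changes are expected newest-first, the default order of git log.
// Returns the most recent change that is not cosmetic. If that change
// has before === null, the trace has hit the point where the line entered
// this file, and a cross-file search would have to take over.
function behavioralOrigin(changes: LineChange[]): LineChange | undefined {
  return changes.find((c) => !isCosmetic(c));
}

// Extract the PR number from a squash-merge subject such as "Some title (#101)".
function prNumberFromSubject(subject: string): number | null {
  const m = /\(#(\d+)\)\s*$/.exec(subject);
  return m ? Number(m[1]) : null;
}

// Extract the source commit recorded by `git cherry-pick -x`.
function cherryPickSource(body: string): string | null {
  const m = /\(cherry picked from commit ([0-9a-f]{7,40})\)/.exec(body);
  return m?.[1] ?? null;
}
```

Given a history of a formatting pass, a real value change, and the original introduction (newest first), behavioralOrigin skips the formatting pass and returns the value change. A production classifier would also need to recognize renames, moves, and import reordering, which this sketch does not attempt.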
Does it actually work? 150+ real runs.
A demo on a toy repo proves nothing. So I ran 151 real queries across kubernetes, react, cognee and others — target lines sampled programmatically, not hand-picked. The summary:
| Repo | Queries | Found the real PR | Notes |
|---|---|---|---|
| kubernetes/kubernetes | 30 | 28 (93%) | 6 HIGH · median 6s |
| topoteretes/cognee | 57 | 57 (100%) | 3.1s avg |
| facebook/react | 3 | 2 + 1 honest LOW | traced the 5ms scheduler slice |
| PR-driven, combined | 87 | 85 (97.7%) | 6 HIGH · 58 MED · 22 LOW |
| a direct-push repo | 19 | 0 — all honest LOW | refusing to fabricate = working |
A few rows from the line-level results (every row reproducible — full set on the evidence page):
| Target | Confidence | PR | Time (s) |
|---|---|---|---|
| …/handlers/responsewriters/compression.go:65 | HIGH | #139482 | 5 |
| …/scheduling/workload_aware_preemption.go:143 | HIGH | #139375 | 6 |
| …/retrieval/agentic_retriever.py:234 | MEDIUM | #2726 | 3 |
| …/loaders/core/text_loader.py:51 | MEDIUM | #1240 | 2 |
| …/legacy direct-commit line | LOW | — | 1 |
HIGH is deliberately rare — it needs a clear winning commit plus a PR plus a linked issue or a substantive human review comment. Most real PRs earn MEDIUM. The model never inflates its certainty; that’s the point.
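The HIGH rule above maps directly onto a small decision function. The TypeScript sketch below takes the HIGH condition from the text; the split between MEDIUM and LOW is an assumption made for illustration (a PR alone earns MEDIUM, no PR means LOW), and the names OriginSignals, ConfidenceVerdict, and scoreConfidence are hypothetical.

```typescript
type ConfidenceLevel = "HIGH" | "MEDIUM" | "LOW";

// Hypothetical signals gathered while tracing a line.
interface OriginSignals {
  clearWinningCommit: boolean;
  hasPullRequest: boolean;
  hasLinkedIssue: boolean;
  hasSubstantiveReview: boolean; // a human review comment with real content
}

interface ConfidenceVerdict {
  level: ConfidenceLevel;
  reasons: string[]; // shown to the user alongside the level
}

function scoreConfidence(s: OriginSignals): ConfidenceVerdict {
  // Record every signal, present or missing, so the verdict explains itself.
  const reasons = [
    s.clearWinningCommit ? "clear winning commit" : "no clear winning commit",
    s.hasPullRequest ? "merged PR found" : "no PR found",
    s.hasLinkedIssue ? "linked issue found" : "no linked issue",
    s.hasSubstantiveReview
      ? "substantive review comment found"
      : "no substantive review comment",
  ];

  // HIGH as stated in the text: clear commit + PR + (issue or review).
  if (
    s.clearWinningCommit &&
    s.hasPullRequest &&
    (s.hasLinkedIssue || s.hasSubstantiveReview)
  ) {
    return { level: "HIGH", reasons };
  }
  // Assumption: a PR without the extra corroboration earns MEDIUM.
  if (s.hasPullRequest) {
    return { level: "MEDIUM", reasons };
  }
  // Assumption: with no PR at all, the honest answer is LOW.
  return { level: "LOW", reasons };
}
```

Returning the reasons alongside the level keeps the verdict auditable: a reader can see which signal was missing instead of having to trust the label.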
How it’s built
local & git-only
Runs on your machine against a repo you’ve cloned. No server, no SaaS, no telemetry. The only network call fetches PR/issue text from GitHub with your token, cached in SQLite.
bring your own key
Anthropic / OpenAI / Gemini — or run fully offline with a deterministic summarizer. No inference bill on us, no vendor lock-in.
honest by construction
Summarize-only LLM layer that cannot invent evidence; three-tier confidence with the reasons shown.
self-hostable, MIT
Enterprises won’t hand their git history to someone’s cloud. They don’t have to. Node 22+, zero native builds.
Where this goes
V1 is intentionally narrow: why, risk, and explain-commit, GitHub-only. It widens from there — but only where it stays evidence-grounded.
Why a line exists
- archaeo why path:line — the behavioral-origin trace
- archaeo risk path — 0–10 blast-radius from churn, coupling, incidents
- archaeo explain-commit sha
The team’s memory
- who / expert — who actually knows this code, from authorship + reviews
- why <service> — the business purpose of a module, synthesized from its PRs
- Onboarding mode — “how does auth work here?”, fully cited
- Discovery (search / ask), plus GitLab & Bitbucket
Impact & dependencies
- impact <service> — “what breaks if I change this?” via a real dependency graph
- Multi-hop expertise & dependency reasoning — where a graph engine finally earns its place
The honest part
No tool earns trust by hiding its seams. archaeo is git-history-only: if your team’s “why” lives in Slack and Jira, it won’t see it, and it’ll tell you so rather than guess. A repo full of “fix stuff” commits has the historical value of a burnt library; archaeo surfaces “this part of your history is undocumented,” which is itself useful. It’s GitHub-only for now, and slower on partial/shallow clones (it warns you).
If you’ve ever been scared to touch code you didn’t write, point it at your own repo and tell me whether it finds something a senior engineer couldn’t in 30 seconds. That’s the bar.