Commit Graph

220 Commits

Author SHA1 Message Date
Tobi Lutke
6d399bc50a
release: v1.0.5 2026-02-16 08:47:23 -04:00
Tobi Lutke
e39848c030
chore: gitignore .claude/ 2026-02-16 08:47:20 -04:00
Tobi Lutke
614c8d6328
docs: write changelog for v1.0.5
Build now ships compiled JS, new release skill and tooling.
2026-02-16 08:46:46 -04:00
Tobi Lutke
7fb69a5ca2
feat: release skill with changelog-driven workflow and git hooks
- Add /release skill with full process: hook install, changelog
  validation, git history review, preview, and release execution
- Skill auto-populates [Unreleased] from git history when empty
- Install hook script symlinks pre-push for tag validation
- Register skills/ dir in .pi/settings.json for pi discovery
2026-02-16 08:46:10 -04:00
Tobi Lutke
09803a75b7
feat: compile to JS for npm, release system, full changelog
- Add tsc build step (tsconfig.build.json) so npm package ships
  compiled JS instead of raw TypeScript requiring tsx at runtime
- Update qmd wrapper and daemon spawn to use dist/qmd.js in
  production while keeping tsx for development
- Add self-installing pre-push hook validating v* tag pushes:
  package.json version match, changelog entry, CI status
- Add release.sh script that renames [Unreleased] to versioned
  entry, bumps package.json, commits, and tags
- Add extract-changelog.sh for cumulative GitHub release notes
- Update publish workflow with build step and GitHub release creation
- Flesh out CHANGELOG.md with full history from 0.1.0 through 1.0.0
  in Keep-a-Changelog format with PR/contributor attributions
- Add release standards and changelog guidelines to CLAUDE.md
2026-02-16 08:42:32 -04:00
Tobi Lutke
77c6eba159
fix: publish workflow bun test timeout and npm auth 2026-02-15 23:02:33 -04:00
Tobias Lütke
7acba1c451
Merge pull request #178 from tobi/release/v1.0.0
Release v1.0.0
2026-02-15 22:59:04 -04:00
Tobi Lutke
2780dfb5d0
fix: increase bun test timeout to 30s via CLI flag
The default 5s timeout is too short for CLI subprocess tests in CI.
2026-02-15 21:59:18 -04:00
Tobi Lutke
93f277c5e3
fix: MCP session support and cross-runtime test compat
- mcp.ts: add sessionIdGenerator to HTTP transport (fixes "stateless
  transport cannot be reused" error in CI)
- test-preload.ts: set 30s default timeout for bun test runner (matches
  vitest config, prevents CLI subprocess test timeouts)
- mcp.test.ts: use == null check instead of toBeUndefined for SQLite
  get() result (bun:sqlite returns null, better-sqlite3 returns undefined)
2026-02-15 21:54:25 -04:00
Tobi Lutke
edc9a87234
fix: correct test paths after moving to test/ directory
- cli.test.ts: fix qmdScript path from <root>/qmd.ts to <root>/src/qmd.ts
  (broke when tests moved from src/integration/ to test/)
- mcp.test.ts: forward Mcp-Session-Id header per MCP Streamable HTTP spec
2026-02-15 21:46:45 -04:00
Tobi Lutke
870d3aed3b
test: move all tests to flat test/ directory
No more src/models/ and src/integration/ subfolders to forget about.
All 9 test files live in test/, one command runs everything:

  npx vitest run test/
  bun test test/
2026-02-15 21:37:47 -04:00
Tobi Lutke
dcedfb5268
feat: cross-runtime SQLite compat layer (bun:sqlite + better-sqlite3)
Add src/db.ts that dynamically imports bun:sqlite under Bun and
better-sqlite3 under Node.js. Exports openDatabase(), loadSqliteVec(),
and a shared Database interface.

- sqlite-vec loading is now optional — FTS works without it, vector
  ops throw a clear error if unavailable
- CI tests both runtimes: Node 22/23 via vitest, Bun via bun test
- All 104 unit tests pass on both Node and Bun
2026-02-15 17:15:47 -04:00
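A minimal sketch of the runtime switch described in the commit above, assuming Bun is detected through its `Bun` global. The helper names `detectRuntime` and `sqliteModuleFor` are hypothetical and are not taken from src/db.ts; only the two module names come from the commit message.

```typescript
type Runtime = "bun" | "node";

// Assumption: Bun exposes a global named `Bun`; Node.js does not.
function detectRuntime(
  g: { Bun?: unknown } = globalThis as unknown as { Bun?: unknown },
): Runtime {
  return g.Bun !== undefined ? "bun" : "node";
}

// Pick the SQLite driver module to load for a given runtime.
function sqliteModuleFor(runtime: Runtime): "bun:sqlite" | "better-sqlite3" {
  return runtime === "bun" ? "bun:sqlite" : "better-sqlite3";
}

// A compat layer could then load the driver lazily, for example:
//   const driver = await import(sqliteModuleFor(detectRuntime()));
```

Keeping the decision in a pure function makes it testable on either runtime without loading a native addon.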
Tobi Lutke
c685f7ac71
ci: switch from bun test to vitest on Node.js
All test files now use vitest + better-sqlite3 imports.
bun test can't load the better-sqlite3 native addon (symbol
error on Linux, segfault on macOS). Run vitest on Node 22/23.
2026-02-15 17:04:58 -04:00
Tobi Lutke
dc64166a2a
release: v1.0.0
Node.js compatibility, parallel embedding/reranking, flash attention,
GPU auto-detection, and restructured test suite.
2026-02-15 17:02:00 -04:00
Tobi Lutke
294fc76d9f
Merge remote-tracking branch 'origin/nodejs' 2026-02-15 16:58:48 -04:00
Tobi Lutke
9b89a51d10
test: split integration/model suites
Split test suites for explicit runtime execution.

- Move model-related tests under `src/models/*`.
- Move CLI/integration tests under `src/integration/*`.
- Add `src/store.helpers.unit.test.ts` for helper unit coverage.
- Add shared Vitest config with default timeout and suite organization.
- Remove legacy flat test files from `src/` root.
- Keep core test commands in scripts supporting unit/models/integration runs.
2026-02-15 16:57:13 -04:00
Tobi Lutke
4df5505bd6
Merge origin/nodejs: Node.js compat, perf improvements, vitest
Brings in Node.js compatibility (tsx, vitest), GPU auto-detection,
parallel embedding/reranking contexts, and flash attention support.
Preserves @tobilu/qmd package scope and publish config from main.
2026-02-15 16:52:30 -04:00
Tobi Lutke
b88c10bf83
docs: show bun/node install and package scope
Document both Node and Bun execution paths.
- Update install examples to `@tobilu/qmd` for npm and bun.
- Add npx/bunx one-off usage examples.
- Reflect Bun as first-class supported runtime in requirements.
2026-02-15 16:45:35 -04:00
Tobi Lutke
13e8473455
docs: update node usage and bump version
Update README installation and quick-start commands to Node examples.
- replace bun install/link commands with npm-based Node workflow
- bump package version to 0.9.9 for CLI and MCP metadata
- keep Bun guidance as optional development/runtime note
2026-02-15 16:44:47 -04:00
Tobi Lutke
ee58a685de
ci: use trusted publishing (OIDC provenance) 2026-02-15 15:15:08 -04:00
Tobi Lutke
00ff084fd9
chore: fix bin path, add author, use token-based npm publish 2026-02-15 15:14:45 -04:00
Tobi Lutke
5d73752b47
chore: rename package scope to @tobilu/qmd 2026-02-15 15:07:26 -04:00
Tobi Lutke
53bf2ebf10
ci: use npm trusted publishing (OIDC) instead of token 2026-02-15 14:59:09 -04:00
Tobi Lutke
73985a2aaa
test: skip all model-dependent tests in CI
Token-based chunking, vector search, hybrid search, and store
LlamaCpp integration tests all require model downloads.
2026-02-15 14:46:41 -04:00
Tobi Lutke
ed4df97122
test: skip LLM integration tests in CI
Model download + GPU inference won't work on CI runners.
Uses describe.skipIf(CI) for LlamaCpp Integration, LLM Session
Management, vector search, and deep search tests.
2026-02-15 14:42:20 -04:00
Tobi Lutke
2279389415
chore: set up npm publishing as @tobi/qmd v0.9.0
- Scope package to @tobi/qmd, version 0.9.0
- Add files whitelist, publishConfig, repo metadata
- Add CI workflow (bun tests on ubuntu + macos, bun latest + 1.1.0)
- Add publish workflow (triggers on v* tags, publishes to npm)
- Add release script for version bumping + changelog generation
- Add LICENSE (MIT) and initial CHANGELOG.md
- Update install instructions to use @tobi/qmd
2026-02-15 14:31:23 -04:00
Tobi Lutke
31dd977c32
fix: handle dense content (code) that tokenizes to more than expected
The 4 chars/token estimate is accurate for prose but code can be
1.7-2 chars/token. This caused chunks to exceed the embedding
model's 2048 token context limit.

- Use 3 chars/token as initial estimate (balanced for mixed content)
- Add safety net: re-chunk any chunks that still exceed token limit
- Use actual chars/token ratio when re-chunking for accuracy

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-15 14:20:09 -04:00
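The estimate-then-verify approach from the commit above could look roughly like this. It is a sketch, not the project's chunker: `chunkWithinLimit` and the `CountTokens` callback are hypothetical names, and a real implementation would call the embedding model's tokenizer and respect smart break points.

```typescript
// Hypothetical tokenizer signature standing in for the model's real tokenizer.
type CountTokens = (text: string) => number;

function chunkWithinLimit(
  text: string,
  maxTokens: number,
  countTokens: CountTokens,
  charsPerToken = 3, // initial estimate for mixed prose and code
): string[] {
  const size = Math.max(1, Math.floor(maxTokens * charsPerToken));
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    const piece = text.slice(i, i + size);
    const tokens = countTokens(piece);
    if (tokens <= maxTokens || piece.length <= 1) {
      chunks.push(piece);
    } else {
      // Safety net: the estimate was too optimistic, so re-chunk this piece
      // using the chars/token ratio actually observed for it.
      chunks.push(
        ...chunkWithinLimit(piece, maxTokens, countTokens, piece.length / tokens),
      );
    }
  }
  return chunks;
}
```

Because an oversized piece always has an observed ratio below the estimate that produced it, each re-chunk pass uses a strictly smaller window and the recursion terminates.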
Tobi Lutke
537d15a9e6
fix: proper cleanup of Metal GPU resources in tests
Add test-preload.ts with global afterAll hook that ensures llama.cpp
Metal resources are properly disposed before process exit, avoiding
GGML_ASSERT failures.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-15 14:20:09 -04:00
Tobi Lutke
2d2f53034d
fix: use max chunk size for snippet search window
extractSnippet was using the snippet output length (500 chars) to
determine the search window, which was too small even for fixed
chunks. With variable-length smart chunks, this could miss relevant
content entirely.

Now uses CHUNK_SIZE_CHARS as fallback, ensuring the entire chunk
region is searched regardless of actual chunk length.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-15 14:20:09 -04:00
Tobi Lutke
32112256c1
docs: document smart chunking algorithm in README
Add Smart Chunking section explaining break point scoring, distance
decay formula, and code fence protection. Update token counts from
800 to 900 throughout.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-15 14:20:09 -04:00
Tobi Lutke
f0e87a454a
feat: smart chunking with scored markdown break points
Replace hard 800-token boundary chunking with scoring algorithm that
finds natural document break points. Chunks now end at headings,
code blocks, and paragraph boundaries when possible.

- Add break point scoring: h1=100, h2=90, h3=80, codeblock=80, blank=20
- Use squared distance decay so headings win even at window edge
- Protect code fences from being split
- Increase chunk size to 900 tokens to accommodate smart boundaries
- Add comprehensive tests for chunking functions

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-15 14:20:09 -04:00
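A sketch of how scored break points with squared distance decay might pick a cut position. The base scores are the ones listed in the commit above; the decay formula `1 - (distance / window)^2 * decay` and the `decay = 0.7` default are assumptions, since the commit does not spell them out, and code-fence protection is omitted.

```typescript
type BreakKind = "h1" | "h2" | "h3" | "codeblock" | "blank";

interface BreakPoint {
  pos: number; // character offset of the candidate cut
  kind: BreakKind;
}

// Base scores as listed in the commit message.
const BASE_SCORE: Record<BreakKind, number> = {
  h1: 100,
  h2: 90,
  h3: 80,
  codeblock: 80,
  blank: 20,
};

// Returns the best cut at or before `target`, looking back at most `window` chars.
function bestCutoff(
  points: BreakPoint[],
  target: number,
  window: number,
  decay = 0.7, // assumed decay factor
): number {
  let best = target; // fall back to a hard cut at the target
  let bestScore = 0;
  for (const p of points) {
    const dist = target - p.pos;
    if (dist < 0 || dist > window) continue;
    const norm = dist / window;
    const score = BASE_SCORE[p.kind] * (1 - norm * norm * decay);
    if (score > bestScore) {
      bestScore = score;
      best = p.pos;
    }
  }
  return best;
}
```

Under these assumptions an h1 at the far edge of the window still scores 30, which beats a blank line right at the target (at most 20), matching the stated goal that headings win even at the window edge.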
Tobi Lütke
392934e78a
perf: CPU parallelism via multi-context thread splitting
Our assumption that CPU can't benefit from multiple contexts was
wrong. The withLock in node-llama-cpp serializes within a single
context, but separate contexts with split threads run on different
cores in true parallel.

Key changes:
- computeParallelism() now returns >1 on CPU (cores / 4, max 4)
- threadsPerContext() splits math cores evenly across contexts
- Both embed and rerank contexts get proper thread counts
- Benchmark updated to test CPU parallelism

Before (CPU, 40 docs): 9.7s (4.1 docs/s) — 6 threads, 1 context
After  (CPU, 40 docs): 2.3s (17.2 docs/s) — 32 threads, 8 contexts

Two fixes stacked:
1. Thread count: default was 6 (library hardcode), now uses all
   math cores — 2× improvement alone
2. Multi-context: splitting cores across 8 contexts gives another
   2.2× on top

End-to-end 'qmd query' on CPU: 10.3s → 2.9s

CPU benchmark (Threadripper PRO 7975WX, 32 math cores):
  1 ctx: 5001ms (8.0 docs/s)
  2 ctx: 3585ms (11.2 docs/s)  1.4×
  4 ctx: 2874ms (13.9 docs/s)  1.7×
  8 ctx: 2323ms (17.2 docs/s)  2.2×
2026-02-15 11:21:45 -05:00
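The two helpers named under "Key changes" can be sketched as below. This follows the stated rule (cores / 4, max 4, even thread split) and is only an illustration; the function bodies are assumptions, and the benchmark figures in the commit use up to 8 contexts, so the real limits may differ.

```typescript
// Number of CPU contexts: one per four math cores, between 1 and 4.
function cpuParallelism(mathCores: number): number {
  return Math.max(1, Math.min(4, Math.floor(mathCores / 4)));
}

// Threads given to each context when math cores are split evenly.
function threadsPerContext(mathCores: number, contexts: number): number {
  return Math.max(1, Math.floor(mathCores / Math.max(1, contexts)));
}

const cores = 32;
const contexts = cpuParallelism(cores);
console.log(contexts, threadsPerContext(cores, contexts));
```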
Tobi Lütke
bf42223086
bench: add reranker benchmark (bench-rerank.ts)
Standalone benchmark for the reranking pipeline. Reports:
- System info (CPU, GPU, VRAM)
- Model VRAM usage
- Per-config: parallelism, flash attention, median time,
  throughput (docs/s), VRAM per context, total VRAM, peak RSS
- Speedup relative to baseline (1 context)

Usage:
  bun src/bench-rerank.ts              # full (40 docs, 3 iters, 1/2/4/8 ctx)
  bun src/bench-rerank.ts --quick      # quick (10 docs, 1 iter)
  bun src/bench-rerank.ts --docs 100   # custom doc count

Results on this machine:
  CUDA: 254ms/40 docs (8 ctx), 688ms (1 ctx) = 2.7x speedup
  CPU:  9697ms/40 docs (1 ctx) = 14x slower than a single GPU ctx, 38x slower than 8 GPU ctx
2026-02-15 10:51:09 -05:00
Tobi Lütke
0a941c442f
perf: flash attention, right-sized contexts, cleaner GPU detection
Holistic tuning pass on context and GPU configuration:

GPU detection:
- Use getLlamaGpuTypes() to discover available backends at runtime
  instead of try/catch loop. Prefer CUDA > Metal > Vulkan > CPU.
- getLlama({gpu:'auto'}) returns false even when CUDA is available
  (node-llama-cpp issue), so we can't rely on it.

Context tuning:
- Rerank context: 2048 tokens (was auto=40960). The Qwen3 reranker
  template adds ~200 tokens overhead, chunks are ~800, query ~50.
  Total ~1050 tokens, so 2048 gives comfortable margin.
  VRAM per context: ~960 MB (was 11.6 GB with auto).
- Flash attention enabled for rerank contexts (~20% less VRAM).
  Falls back gracefully if flash attention not supported.
- Embed context: kept at model default (2048 for nomic-embed).

Platform considerations:
- CUDA (server): up to 8 parallel contexts, flash attention
- Metal (MacBook): 1-4 contexts depending on unified memory
- Vulkan: detected and used if CUDA/Metal unavailable
- CPU: single context (parallelism has no benefit due to locks)

Context size was 1024 initially but Qwen3's reranker template is
verbose (system prompt + instruct + think tags) — some inputs
exceeded 1024 tokens. Bumped to 2048 for safety.
2026-02-15 10:34:39 -05:00
Tobi Lütke
4ac95b5e26
perf: adaptive parallel contexts for embed + rerank, fix VRAM waste
Holistic overhaul of context management:

1. Parallel embedding contexts: embedBatch now splits work across
   multiple EmbeddingContexts (same pattern as reranking). Each
   context is ~143 MB. Benchmarked 6x speedup on 20 texts with
   4 contexts vs 1.

2. Rerank context size: was using auto (40960 tokens = 11.6 GB per
   context!). Reranking chunks are ~800 tokens max, so 1024 is
   plenty. Now 711 MB per context — 16x less VRAM. 4 contexts went
   from 46 GB to 2.8 GB.

3. Adaptive parallelism via computeParallelism(): checks available
   VRAM and allocates at most 25% of free VRAM for contexts, capped
   at 8. Falls back to 1 on CPU (no benefit from multiple contexts
   with node-llama-cpp's withLock serialization). Gracefully handles
   allocation failures — uses however many contexts succeeded.

VRAM budget per operation:
- Embed:  N × 143 MB (nomic-embed, 2048 ctx)
- Rerank: N × 711 MB (Qwen3-Reranker-0.6B, 1024 ctx)
- Generate: ~1.1 GB (qmd-expansion-1.7B, fresh ctx per call)

Works across:
- Large GPU boxes (4x A6000, 190 GB): allocates up to 8 contexts
- Consumer GPUs (16 GB): 2-4 contexts fit comfortably
- Apple Metal (8-16 GB unified): 1-4 contexts depending on memory
- CPU-only: single context (parallelism has no benefit)
2026-02-15 10:27:01 -05:00
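The VRAM budgeting rule from item 3 of the commit above, as a small sketch. The name `contextsForVram` and its parameters are hypothetical; the 25% budget, the cap of 8, and the CPU fallback to 1 come from the commit message. Sizes are in MB.

```typescript
// freeVramMb is null when no GPU is available (CPU-only).
function contextsForVram(
  freeVramMb: number | null,
  perContextMb: number,
  maxContexts = 8,
  budgetFraction = 0.25,
): number {
  if (freeVramMb === null || perContextMb <= 0) return 1;
  const affordable = Math.floor((freeVramMb * budgetFraction) / perContextMb);
  // Always keep at least one context, never more than the cap.
  return Math.max(1, Math.min(maxContexts, affordable));
}

// Rerank contexts at 711 MB each on a 16 GB consumer GPU.
console.log(contextsForVram(16 * 1024, 711));
```

Allocation can still fail at runtime, so a caller would create contexts one at a time and keep however many succeeded, as the commit describes.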
Tobi Lütke
0a0e1e6f29
perf: parallel reranking with multiple contexts (2.7x speedup)
node-llama-cpp's LlamaRankingContext uses a single sequence with a
withLock() guard, making rankAll() effectively sequential despite
using Promise.all(). Each document evaluation erases the context,
evaluates tokens, and extracts the logit — all serialized.

Fix: create 4 parallel ranking contexts from the same model (model
weights are shared, only KV cache is duplicated). Split documents
across contexts and evaluate in parallel via Promise.all().

Benchmarks (40 chunks, CUDA, 4x A6000):
- 1 context:  898ms (baseline)
- 2 contexts: 460ms (2.0x)
- 4 contexts: 338ms (2.7x)  ← sweet spot
- 8 contexts: 458ms (VRAM contention)

End-to-end 'qmd query' time: 7.5s → 3.7s

Gracefully handles VRAM limits — if creating the Nth context fails,
falls back to however many were successfully created.
2026-02-15 10:19:55 -05:00
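The split-and-evaluate pattern from the commit above, sketched without node-llama-cpp. Each `Scorer` stands in for one ranking context that can only score one document at a time; `splitAcross` and `rankInParallel` are hypothetical names, not the project's API.

```typescript
// Stand-in for one ranking context: scores a single document per call.
type Scorer<T> = (doc: T) => Promise<number>;

// Split items into at most n contiguous, near-equal parts.
function splitAcross<T>(items: T[], n: number): T[][] {
  const parts: T[][] = [];
  const size = Math.max(1, Math.ceil(items.length / Math.max(1, n)));
  for (let i = 0; i < items.length; i += size) {
    parts.push(items.slice(i, i + size));
  }
  return parts;
}

// Each scorer works through its own part sequentially (mirroring the
// per-context lock) while the parts run concurrently via Promise.all.
async function rankInParallel<T>(docs: T[], scorers: Scorer<T>[]): Promise<number[]> {
  if (scorers.length === 0) throw new Error("at least one scorer is required");
  const parts = splitAcross(docs, scorers.length);
  const results = await Promise.all(
    parts.map(async (part, i) => {
      const scores: number[] = [];
      for (const doc of part) scores.push(await scorers[i](doc));
      return scores;
    }),
  );
  // Parts are contiguous, so concatenation restores the original order.
  return ([] as number[]).concat(...results);
}
```

If creating the Nth context fails, passing the shorter list of scorers that did succeed gives the graceful fallback the commit mentions.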
Tobi Lütke
ee86bba45e
feat: auto-detect GPU acceleration + device info in status
QMD was running all models on CPU even when CUDA/Vulkan/Metal
was available. The getLlama() call used no gpu option, defaulting
to false.

Now:
- ensureLlama() tries cuda → vulkan → metal → CPU fallback
- Prints warning to stderr if falling back to CPU
- 'qmd status' shows GPU type, device names, VRAM, and CPU cores
- On this machine: 7.5s query vs 5+ minutes on CPU (reranker)

The reranker (Qwen3-Reranker-0.6B) calls are serialized by a lock
in node-llama-cpp's rankAndSort() — each of the 40 chunks is
evaluated sequentially. This is inherent to the library's design
(single sequence context). GPU acceleration is the fix, not
batching — the lock prevents true parallelism regardless.
2026-02-15 10:13:07 -05:00
Tobi Lütke
b69fae7aa3
perf: batch vector embeddings + collection-aware FTS filtering
Three improvements to hybridQuery:

1. Collection filter pushed into SQL: searchFTS and searchVec now
   accept collectionName directly instead of filtering post-hoc.
   Reduces noise in FTS probe and all expanded-query FTS calls.
   Also fixes MCP server's FTS search to use SQL-level filtering.

2. Batch embed for vector searches: instead of embedding each
   vec/hyde query sequentially (one embed call per query), we now
   collect all texts that need vector search and embed them in a
   single embedBatch() call. The sqlite-vec lookups still run
   sequentially (they're fast), but the expensive LLM embed step
   is batched.

3. FTS-first ordering: all lex expansions run immediately (sync,
   no LLM needed) before the vector embedding batch. This means
   FTS results are ready while embeddings compute.

Also cleans up legacy collectionId parameter naming (was number,
now properly string collectionName throughout).
2026-02-15 09:53:28 -05:00
Tobi Lütke
03a25d69d9
Add QMD architecture diagram to README
Generated with PaperBanana (Gemini 3 Pro). Shows query expansion
fanning HyDE+Vec into vector searches, Lex into BM25, merged via
reciprocal rank fusion and LLM reranking.
2026-02-14 19:17:11 -05:00
Claude
73136e4f59
fix: verify sqlite-vec readiness after extension load. Closes #169 2026-02-14 19:15:21 -05:00
Claude
96643a28ed
fix: reactivate deactivated documents on re-index. Closes #168 2026-02-14 19:15:21 -05:00
Claude
0eabfe73db
fix: allow $ route filenames in handelize. Closes #162 2026-02-14 19:14:46 -05:00
Claude
da79e77d34
feat: add --version/-v flag. Closes #88 2026-02-14 19:14:46 -05:00
Claude
5dec3ab662
fix: disable following symlinks in glob.scan. Closes #134 2026-02-14 19:14:46 -05:00
Tobi Lütke
96634da39b
feat: promote query as primary search command, add CLI aliases
List query first in --help as the recommended search method. Add
vector-search and deep-search as undocumented CLI aliases matching
MCP tool names.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 00:34:29 -05:00
Tobi Lütke
993628e768
fix: add missing context to search results markdown and XML formatters
searchResultsToMarkdown and searchResultsToXml in formatter.ts were
silently dropping the context field. Added formatter.test.ts covering
context visibility across all output formats.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 00:34:23 -05:00
Ilya Grigorik
785bbcf319
MCP: Streamable HTTP, scoring fixes, tool improvements (#149)
* feat: MCP HTTP transport with daemon lifecycle

  Add streaming HTTP transport as an alternative to stdio for the MCP
  server. A long-lived HTTP server avoids reloading 3 GGUF models (~2GB)
  on every client connection, reducing warm query latency from ~16s (CLI)
  to ~10s.

  New CLI surface:
    qmd mcp --http [--port N]   # foreground, default port 3000
    qmd mcp --http --daemon     # background, PID in ~/.cache/qmd/mcp.pid
    qmd mcp stop                # stop daemon via PID file
    qmd status                  # now shows MCP daemon liveness

  Server implementation (mcp.ts):
  - Extract createMcpServer(store) shared by stdio and HTTP transports
  - HTTP transport uses WebStandardStreamableHTTPServerTransport with
    JSON responses (stateless, no SSE)
  - /health endpoint with uptime, /mcp for MCP protocol, 404 otherwise
  - Request logging to stderr with timestamps, tool names, query args

  Daemon lifecycle (qmd.ts):
  - PID file + log file management with stale PID detection
  - Absolute paths in Bun.spawn (process.execPath + import.meta.path)
    so daemon works regardless of cwd
  - mkdirSync for cache dir on fresh installs
  - Removes top-level SIGTERM/SIGINT handlers before starting HTTP
    server so async cleanup in mcp.ts actually runs

  Move hybridQuery() and vectorSearchQuery() into store.ts as standalone
  functions that take a Store as first argument. Both CLI and MCP now
  call the identical pipeline, eliminating the class of bugs where one
  copy drifts from the other.

  Shared pipeline (store.ts):
  - hybridQuery(): BM25 probe → expand → FTS+vec search → RRF →
    chunk → rerank (chunks only) → position-aware blending → dedup
  - vectorSearchQuery(): expand → vec search → dedup → sort
  - SearchHooks interface for optional progress callbacks
  - Constants: STRONG_SIGNAL_MIN_SCORE, STRONG_SIGNAL_MIN_GAP,
    RERANK_CANDIDATE_LIMIT (40), addLineNumbers()

  Bugs fixed by unification:
  - MCP now gets strong-signal short-circuit (was CLI-only)
  - Reranker candidate limit unified at 40 (MCP had 30)
  - File dedup added to hybrid query (MCP was missing it)
  - Collection filter pushed into searchVec DB query
  - Filter-then-slice ordering fixed (MCP was slice-then-filter)

* feat: type-routed query expansion — lex→FTS, vec/hyde→vector

  expandQuery() now returns typed ExpandedQuery[] instead of string[],
  preserving the lex/vec/hyde type info from the LLM's GBNF-structured
  output. hybridQuery() and vectorSearchQuery() route searches by type:
  lex queries go to FTS only, vec/hyde go to vector only.

  Previously, every expanded query ran through BOTH backends — keyword
  variants wasted embedding forward passes, semantic paraphrases wasted
  BM25 lookups. Type routing eliminates ~4 calls/query with zero quality
  loss (cross-backend noise actually hurt RRF fusion).

  Cache format changed from newline-separated text to JSON (preserves
  types). Old cache entries gracefully re-expand on first access.

  CLI expansion tree now shows query types:
    ├─ original query
    ├─ lex: keyword variant
    ├─ vec: semantic meaning
    └─ hyde: hypothetical document...

  Benchmark (5 queries, 1756-doc index, warm LLM, Apple Silicon):

    Metric              Old (untyped)  New (typed)  Delta
    Avg backend calls   10.0           6.0          -40%
    Total wall time     1278ms         549ms        -57%
    Avg saved/query     —              —            146ms

    "authentication setup"          12 → 7 calls   511 → 112ms
    "database migration strategy"   10 → 6 calls   182 → 106ms
    "how to handle errors in API"   10 → 6 calls   216 → 121ms
    "meeting notes from last week"  10 → 6 calls   228 → 110ms
    "performance optimization"       8 → 5 calls   141 → 100ms

  Savings come from skipped embed() calls (~30-80ms each). FTS is
  synchronous SQLite (~0ms), so lex→FTS routing is free while
  vec/hyde→vector-only avoids wasted embedding passes.

* fix: MCP query snippets now use reranker's best chunk, not full body

  extractSnippet() was scanning the entire document body for keyword
  matches to build the snippet. But hybridQuery() already identified
  the most relevant chunk via cross-attention reranking — rescanning
  the full body is redundant and can land on a less relevant section
  if the query terms appear elsewhere in the document.

  CLI was already using bestChunk (set during the refactor). MCP was
  still using body — a pre-existing inconsistency, not a regression.

* feat: dynamic MCP instructions + tool annotations

  The MCP server now generates instructions at startup from actual index
  state and injects them into the initialize response. LLMs see collection
  names, document counts, content descriptions, and search strategy
  guidance in their system prompt — zero tool calls needed for orientation.

  Previously, the only guidance was generic static tool descriptions and
  a user-invocable "query" prompt that no LLM would discover on its own.
  An LLM connecting to QMD had no idea what collections existed, what they
  contained, or how to scope searches effectively.

* change default port to 8181

* fix: BM25 score normalization was inverted

  The normalization formula `1 / (1 + |bm25|)` is a decreasing function of
  match strength. FTS5 BM25 scores are negative where more negative = better
  match (e.g., -10 is strong, -0.5 is weak). The formula mapped:

    strong match (raw -10) → 1/(1+10) =  9%   ← should be highest
    weak match   (raw -0.5) → 1/(1+0.5) = 67%  ← should be lowest

  Three downstream effects:
  1. `--min-score 0.5` (or MCP minScore: 0.5) filtered OUT strong matches
     and kept only weak ones. The MCP instructions recommend this threshold.
  2. CLI `formatScore()` color bands never showed green for BM25 results
     (best matches scored ~9%, green threshold is 70%).
  3. The strong signal optimization in hybridQuery (skip ~2s LLM expansion
     when BM25 already has a clear winner) was dead code — strong matches
     scored ~0.09, never reaching the 0.85 threshold.

  Fix: `|x| / (1 + |x|)` — same (0,1) range, monotonic, no per-query
  normalization needed, but now correctly maps strong → high, weak → low.

  The normalization was born broken (Math.max(0, x) clamped all
  negative BM25 to 0 → every score = 1.0), then PR #76 changed to
  Math.abs which made scores vary but inverted the direction. Neither
  state was ever correct.

* fix: rerank cache key ignores chunk content

  The rerank cache key was (query, file, model) but the actual text sent
  to the reranker is a keyword-selected chunk that varies by query terms.
  Two different queries hitting the same file can select different chunks,
  but the second query gets a stale cached score from the first chunk.

  Example:
    Query "auth flow" → selects chunk about authentication → score 0.92
    Query "auth tokens" → same file, selects chunk about tokens
      → cache HIT on (query, file, model) → returns 0.92 from wrong chunk

  Fix: include full chunk text in cache key. getCacheKey() already
  SHA-256 hashes its inputs, so this adds no key bloat — just
  disambiguation. Old cache entries become natural misses (different key
  shape) and re-warm on next query.

* rename MCP tools for clarity, rewrite descriptions for LLM tool selection

  Rename MCP tools: vsearch → vector_search, query → deep_search.
  LLMs see these names — self-documenting names reduce reliance on
  descriptions for tool selection. CLI commands stay unchanged
  (qmd vsearch, qmd query) — different namespace, users type those.

  Rewrite all search tool descriptions to be action-oriented:
    - search: "Search by keyword. Finds documents containing exact
      words and phrases in the query."
    - vector_search: "Search by meaning. Finds relevant documents even
      when they use different words than the query — handles synonyms,
      paraphrases, and related concepts."
    - deep_search: "Deep search. Auto-expands the query into variations,
      searches each by keyword and meaning, and reranks for top hits
      across all results."

  Rewrite instructions ladder — each tool says what it does, no
  "start here" / "escalate as needed" strategy language.

  Delete the "query" prompt (registerPrompt) — it restated what
  descriptions + instructions already cover. No LLM proactively
  calls prompts/get to learn how to use tools.

* suppress HTTP server logs during tests
2026-02-10 16:37:33 -05:00
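The type-routed expansion item in the commit above can be illustrated with a small router. `ExpandedQuery` is named in the commit, but its field names (`type`, `text`) and the `routeQueries` helper are assumptions made for this sketch.

```typescript
type QueryType = "lex" | "vec" | "hyde";

// Assumed shape; the commit only names the type, not its fields.
interface ExpandedQuery {
  type: QueryType;
  text: string;
}

interface RoutedQueries {
  fts: string[]; // keyword variants, sent to BM25/FTS only
  vector: string[]; // semantic variants, embedded and sent to vector search only
}

function routeQueries(expanded: ExpandedQuery[]): RoutedQueries {
  const routed: RoutedQueries = { fts: [], vector: [] };
  for (const q of expanded) {
    if (q.type === "lex") routed.fts.push(q.text);
    else routed.vector.push(q.text);
  }
  return routed;
}

const routed = routeQueries([
  { type: "lex", text: "auth setup config" },
  { type: "vec", text: "how to configure authentication" },
  { type: "hyde", text: "To set up authentication, first create..." },
]);
console.log(routed.fts.length, routed.vector.length);
```

Each expanded query now costs one backend call instead of two, which is where the reported drop in backend calls comes from.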
Matt Galligan
63028fd5e9
feat: add Claude Code plugin support with inline status check (#99)
- Add marketplace.json for Claude Code plugin installation
- Simplify skill status check to inline `qmd status` (portable across agents)
- Update SKILL.md MCP section, reference mcp-setup.md for manual config
- Clean up mcp-setup.md (remove redundant prerequisites)
- Rename MCP-SETUP.md to mcp-setup.md

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 14:14:24 -05:00
David Gil
47b705409e
fix: BM25 score normalization - use Math.abs instead of Math.max (#76)
BM25 scores in SQLite FTS5 are negative (lower = better match).
The previous code used Math.max(0, score) which clamped all negative
scores to 0, resulting in all results showing 100% (score = 1.0).

Fix: Use Math.abs(score) to properly convert negative BM25 scores
to positive values for the normalization formula.

Before: All results show Score: 100%
After:  Scores vary based on actual BM25 relevance (e.g., 16%, 5%, 6%)

Fixes #74
2026-02-01 16:38:52 -05:00
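For comparison, the three normalization formulas mentioned in this log can be placed side by side: the original Math.max clamp, the Math.abs change from this commit, and the `|x| / (1 + |x|)` form adopted later in 785bbcf319. The function names are hypothetical and chosen for this sketch.

```typescript
// Raw FTS5 BM25 scores are negative; more negative means a better match.

// Original: negatives clamp to 0, so every score becomes 1.0.
const clampedScore = (raw: number): number => 1 / (1 + Math.max(0, raw));

// This commit (#76): scores vary, but strong matches map to low values.
const invertedScore = (raw: number): number => 1 / (1 + Math.abs(raw));

// Later fix (#149): same (0, 1) range, strong matches map to high values.
const fixedScore = (raw: number): number => Math.abs(raw) / (1 + Math.abs(raw));

for (const raw of [-10, -0.5]) {
  console.log(
    raw,
    clampedScore(raw).toFixed(2),
    invertedScore(raw).toFixed(2),
    fixedScore(raw).toFixed(2),
  );
}
```

For a strong match (raw -10) and a weak match (raw -0.5), the Math.abs variant yields about 0.09 and 0.67, while the later formula yields about 0.91 and 0.33, so only the last one ranks the strong match higher.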
Christopher Stöckl
0f87e2429d
fix: work around Bun UTF-8 path corruption bug (#82)
Replace Bun.file() async calls with Node.js fs sync methods to work
around a Bun bug that corrupts UTF-8 file paths containing non-ASCII
characters.

Bug: Bun.file(filepath).stat() and Bun.file(filepath).text() internally
mangle UTF-8 encoding, causing ENOENT errors with mojibake paths when
accessing files in iCloud Drive and other locations.

Changes:
- src/qmd.ts: Use readFileSync instead of Bun.file().text()
- src/qmd.ts: Use statSync instead of Bun.file().stat() for file metadata
- src/store.ts: Use statSync for SQLite custom path detection
2026-02-01 16:37:04 -05:00