qmd/skills at 785bbcf319cf60d9e81b8307a1ea89bcdf3159c2 - qmd

Ilya Grigorik 785bbcf319 2026-02-10 16:37:33 -05:00
qmd	MCP: Streamable HTTP, scoring fixes, tool improvements (#149)	2026-02-10 16:37:33 -05:00

MCP: Streamable HTTP, scoring fixes, tool improvements (#149)

* feat: MCP HTTP transport with daemon lifecycle

  Add streaming HTTP transport as an alternative to stdio for the MCP
  server. A long-lived HTTP server avoids reloading 3 GGUF models (~2GB)
  on every client connection, reducing warm query latency from ~16s (CLI)
  to ~10s.

  New CLI surface:
    qmd mcp --http [--port N]   # foreground, default port 3000 (later changed to 8181, see below)
    qmd mcp --http --daemon     # background, PID in ~/.cache/qmd/mcp.pid
    qmd mcp stop                # stop daemon via PID file
    qmd status                  # now shows MCP daemon liveness

  Server implementation (mcp.ts):
  - Extract createMcpServer(store) shared by stdio and HTTP transports
  - HTTP transport uses WebStandardStreamableHTTPServerTransport with
    JSON responses (stateless, no SSE)
  - /health endpoint with uptime, /mcp for MCP protocol, 404 otherwise
  - Request logging to stderr with timestamps, tool names, query args

  Daemon lifecycle (qmd.ts):
  - PID file + log file management with stale PID detection
  - Absolute paths in Bun.spawn (process.execPath + import.meta.path)
    so daemon works regardless of cwd
  - mkdirSync for cache dir on fresh installs
  - Removes top-level SIGTERM/SIGINT handlers before starting HTTP
    server so async cleanup in mcp.ts actually runs
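
  The stale PID detection mentioned above might look roughly like the
  sketch below. It assumes Node-compatible `fs` and `process` APIs (which
  Bun provides); `isProcessAlive` and `readLivePid` are hypothetical names,
  not the functions in qmd.ts.

```typescript
import { existsSync, readFileSync, unlinkSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Signal 0 performs an existence/permission check without delivering a signal.
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as { code?: string }).code === "EPERM";
  }
}

// Returns the PID of a live daemon, or null. A stale or corrupt PID file is removed.
function readLivePid(pidFile: string): number | null {
  if (!existsSync(pidFile)) return null;
  const pid = Number.parseInt(readFileSync(pidFile, "utf8").trim(), 10);
  if (Number.isInteger(pid) && pid > 0 && isProcessAlive(pid)) return pid;
  unlinkSync(pidFile); // clean up so the next start is not blocked
  return null;
}

// Demo with a throwaway PID file (the path is illustrative only).
const demoFile = join(tmpdir(), `qmd-demo-${process.pid}.pid`);
writeFileSync(demoFile, String(process.pid));
console.log(readLivePid(demoFile) === process.pid); // true: this process is alive
unlinkSync(demoFile);
```

  Treating EPERM as "alive" is a deliberate choice in this sketch: a PID
  owned by another user still means something is running under that number.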

  Move hybridQuery() and vectorSearchQuery() into store.ts as standalone
  functions that take a Store as first argument. Both CLI and MCP now
  call the identical pipeline, eliminating the class of bugs where one
  copy drifts from the other.

  Shared pipeline (store.ts):
  - hybridQuery(): BM25 probe → expand → FTS+vec search → RRF →
    chunk → rerank (chunks only) → position-aware blending → dedup
  - vectorSearchQuery(): expand → vec search → dedup → sort
  - SearchHooks interface for optional progress callbacks
  - Constants: STRONG_SIGNAL_MIN_SCORE, STRONG_SIGNAL_MIN_GAP,
    RERANK_CANDIDATE_LIMIT (40); helper: addLineNumbers()
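
  For readers unfamiliar with the RRF step in the pipeline above, here is a
  minimal sketch of plain reciprocal rank fusion. The function name and the
  constant k = 60 are assumptions (60 is a commonly used default); the
  weighting qmd actually applies is not stated in this message.

```typescript
// Reciprocal rank fusion: each ranked list contributes 1 / (k + rank) per document.
function reciprocalRankFusion(rankedLists: string[][], k = 60): Array<[string, number]> {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((docId, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries()).sort((a, b) => b[1] - a[1]);
}

const fused = reciprocalRankFusion([
  ["a.md", "b.md", "c.md"], // e.g. FTS results, best first
  ["b.md", "d.md", "a.md"], // e.g. vector results, best first
]);
console.log(fused.map(([docId]) => docId)); // b.md, a.md, d.md, c.md
```

  Documents that appear in both lists accumulate score from each, so b.md
  (ranks 2 and 1) edges out a.md (ranks 1 and 3).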

  Bugs fixed by unification:
  - MCP now gets strong-signal short-circuit (was CLI-only)
  - Reranker candidate limit unified at 40 (MCP had 30)
  - File dedup added to hybrid query (MCP was missing it)
  - Collection filter pushed into searchVec DB query
  - Filter-then-slice ordering fixed (MCP was slice-then-filter)
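
  The last item is easy to miss, so here is a small illustration of why the
  order matters. The `Hit` shape and function names are hypothetical.

```typescript
type Hit = { file: string; collection: string; score: number };

// Buggy order: truncating first can discard every hit from the wanted collection.
function sliceThenFilter(hits: Hit[], collection: string, limit: number): Hit[] {
  return hits.slice(0, limit).filter((h) => h.collection === collection);
}

// Fixed order: filter first, then truncate to the requested limit.
function filterThenSlice(hits: Hit[], collection: string, limit: number): Hit[] {
  return hits.filter((h) => h.collection === collection).slice(0, limit);
}

// Hits sorted by score; the two best belong to a different collection.
const hits: Hit[] = [
  { file: "notes/a.md", collection: "notes", score: 0.9 },
  { file: "notes/b.md", collection: "notes", score: 0.8 },
  { file: "docs/c.md", collection: "docs", score: 0.7 },
  { file: "docs/d.md", collection: "docs", score: 0.6 },
];
console.log(sliceThenFilter(hits, "docs", 2).length); // 0
console.log(filterThenSlice(hits, "docs", 2).length); // 2
```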

* feat: type-routed query expansion — lex→FTS, vec/hyde→vector

  expandQuery() now returns typed ExpandedQuery[] instead of string[],
  preserving the lex/vec/hyde type info from the LLM's GBNF-structured
  output. hybridQuery() and vectorSearchQuery() route searches by type:
  lex queries go to FTS only, vec/hyde go to vector only.

  Previously, every expanded query ran through BOTH backends — keyword
  variants wasted embedding forward passes, semantic paraphrases wasted
  BM25 lookups. Type routing eliminates ~4 calls/query with zero quality
  loss (cross-backend noise actually hurt RRF fusion).
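
  A rough sketch of the routing rule described above. The field names on
  `ExpandedQuery` and the `routeQueries` helper are assumptions made for
  illustration; only the lex/vec/hyde types come from this message.

```typescript
type QueryType = "lex" | "vec" | "hyde";
type ExpandedQuery = { type: QueryType; text: string };

// lex queries go to full-text search only; vec and hyde go to vector search only.
function routeQueries(expanded: ExpandedQuery[]): { fts: string[]; vector: string[] } {
  const fts: string[] = [];
  const vector: string[] = [];
  for (const q of expanded) {
    if (q.type === "lex") fts.push(q.text);
    else vector.push(q.text);
  }
  return { fts, vector };
}

const routed = routeQueries([
  { type: "lex", text: "auth setup config" },
  { type: "vec", text: "how to configure authentication" },
  { type: "hyde", text: "To set up authentication, first create a token." },
]);
console.log(routed.fts.length, routed.vector.length); // 1 2
```

  With untyped expansion each of these three queries would hit both
  backends (six calls); routed by type they make three.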

  Cache format changed from newline-separated text to JSON (preserves
  types). Old cache entries gracefully re-expand on first access.
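
  One way the graceful fallback could work, sketched under the assumption
  that a failed parse is treated as a cache miss. `parseCachedExpansion` and
  the entry shape are hypothetical, not the code in store.ts.

```typescript
type QueryType = "lex" | "vec" | "hyde";
type ExpandedQuery = { type: QueryType; text: string };

const QUERY_TYPES: readonly string[] = ["lex", "vec", "hyde"];

// Returns the typed queries, or null when the entry is in the old format
// (or otherwise invalid), which tells the caller to re-expand and overwrite it.
function parseCachedExpansion(raw: string): ExpandedQuery[] | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // old newline-separated text typically fails to parse as JSON
  }
  if (!Array.isArray(parsed)) return null;
  const result: ExpandedQuery[] = [];
  for (const item of parsed) {
    if (typeof item !== "object" || item === null) return null;
    const { type, text } = item as { type?: unknown; text?: unknown };
    if (typeof type !== "string" || QUERY_TYPES.indexOf(type) === -1) return null;
    if (typeof text !== "string") return null;
    result.push({ type: type as QueryType, text });
  }
  return result;
}

console.log(parseCachedExpansion("auth login\noauth setup")); // null (old format)
console.log(parseCachedExpansion('[{"type":"lex","text":"auth login"}]')?.length); // 1
```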

  CLI expansion tree now shows query types:
    ├─ original query
    ├─ lex: keyword variant
    ├─ vec: semantic meaning
    └─ hyde: hypothetical document...

  Benchmark (5 queries, 1756-doc index, warm LLM, Apple Silicon):

    Metric              Old (untyped)  New (typed)  Delta
    Avg backend calls   10.0           6.0          -40%
    Total wall time     1278ms         549ms        -57%
    Avg saved/query     —              —            146ms

    "authentication setup"          12 → 7 calls   511 → 112ms
    "database migration strategy"   10 → 6 calls   182 → 106ms
    "how to handle errors in API"   10 → 6 calls   216 → 121ms
    "meeting notes from last week"  10 → 6 calls   228 → 110ms
    "performance optimization"       8 → 5 calls   141 → 100ms

  Savings come from skipped embed() calls (~30-80ms each). FTS is
  synchronous SQLite (~0ms), so lex→FTS routing is free while
  vec/hyde→vector-only avoids wasted embedding passes.

* fix: MCP query snippets now use reranker's best chunk, not full body

  extractSnippet() was scanning the entire document body for keyword
  matches to build the snippet. But hybridQuery() already identified
  the most relevant chunk via cross-attention reranking — rescanning
  the full body is redundant and can land on a less relevant section
  if the query terms appear elsewhere in the document.

  CLI was already using bestChunk (set during the refactor). MCP was
  still using body — a pre-existing inconsistency, not a regression.

* feat: dynamic MCP instructions + tool annotations

  The MCP server now generates instructions at startup from actual index
  state and injects them into the initialize response. LLMs see collection
  names, document counts, content descriptions, and search strategy
  guidance in their system prompt — zero tool calls needed for orientation.

  Previously, the only guidance was generic static tool descriptions and
  a user-invocable "query" prompt that no LLM would discover on its own.
  An LLM connecting to QMD had no idea what collections existed, what they
  contained, or how to scope searches effectively.
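
  As an illustration only, instructions of this kind could be assembled
  from index state along these lines. The `CollectionInfo` shape, the
  `buildInstructions` name, and the output wording are all assumptions; the
  text the server really emits is not reproduced in this message.

```typescript
type CollectionInfo = { name: string; documentCount: number; description?: string };

// Builds a short orientation text from the collections currently in the index.
function buildInstructions(collections: CollectionInfo[]): string {
  const total = collections.reduce((sum, c) => sum + c.documentCount, 0);
  const lines = [`This index holds ${total} documents across ${collections.length} collections:`];
  for (const c of collections) {
    const desc = c.description ? `: ${c.description}` : "";
    lines.push(`- ${c.name} (${c.documentCount} documents)${desc}`);
  }
  return lines.join("\n");
}

console.log(
  buildInstructions([
    { name: "notes", documentCount: 3, description: "Meeting notes" },
    { name: "docs", documentCount: 2 },
  ]),
);
```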

* change default port to 8181

* fix: BM25 score normalization was inverted

  The normalization formula `1 / (1 + |bm25|)` is a decreasing function of
  match strength. FTS5 BM25 scores are negative where more negative = better
  match (e.g., -10 is strong, -0.5 is weak). The formula mapped:

    strong match (raw -10) → 1/(1+10) =  9%   ← should be highest
    weak match   (raw -0.5) → 1/(1+0.5) = 67%  ← should be lowest

  Three downstream effects:
  1. `--min-score 0.5` (or MCP minScore: 0.5) filtered OUT strong matches
     and kept only weak ones. The MCP instructions recommend this threshold.
  2. CLI `formatScore()` color bands never showed green for BM25 results
     (best matches scored ~9%, green threshold is 70%).
  3. The strong signal optimization in hybridQuery (skip ~2s LLM expansion
     when BM25 already has a clear winner) was dead code — strong matches
     scored ~0.09, never reaching the 0.85 threshold.

  Fix: `|x| / (1 + |x|)` — same (0,1) range, monotonic, no per-query
  normalization needed, but now correctly maps strong → high, weak → low.
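
  The two formulas side by side, as a small sketch (function names are
  illustrative, not the ones in the codebase):

```typescript
// Previous formula: decreases as the match gets stronger.
function normalizeBm25Old(raw: number): number {
  return 1 / (1 + Math.abs(raw));
}

// Fixed formula: increases with match strength and stays within [0, 1).
function normalizeBm25(raw: number): number {
  const x = Math.abs(raw);
  return x / (1 + x);
}

for (const raw of [-10, -0.5]) {
  console.log(raw, normalizeBm25Old(raw).toFixed(2), normalizeBm25(raw).toFixed(2));
}
// -10  0.09 0.91
// -0.5 0.67 0.33
```

  The two values always sum to 1, so the fix is exactly the complement of
  the old score: the ordering flips while the range stays the same.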

  The normalization was born broken (Math.max(0, x) clamped all
  negative BM25 to 0 → every score = 1.0), then PR #76 changed to
  Math.abs which made scores vary but inverted the direction. Neither
  state was ever correct.
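
  To see why the strong-signal short-circuit was dead code, consider the
  sketch below. The 0.85 threshold comes from this message; the gap value
  0.15 and the `hasStrongSignal` helper are assumptions for illustration.

```typescript
const STRONG_SIGNAL_MIN_SCORE = 0.85; // threshold quoted in this commit message
const STRONG_SIGNAL_MIN_GAP = 0.15; // assumed value, for illustration only

// scoresDesc: normalized BM25 scores, best first.
function hasStrongSignal(scoresDesc: number[]): boolean {
  const [top, second = 0] = scoresDesc;
  if (top === undefined) return false;
  return top >= STRONG_SIGNAL_MIN_SCORE && top - second >= STRONG_SIGNAL_MIN_GAP;
}

const oldScore = (raw: number): number => 1 / (1 + Math.abs(raw));
const newScore = (raw: number): number => Math.abs(raw) / (1 + Math.abs(raw));

const rawScores = [-10, -1]; // a clear winner, then a weak runner-up
const oldScores = rawScores.map(oldScore).sort((a, b) => b - a); // about [0.50, 0.09]
const newScores = rawScores.map(newScore).sort((a, b) => b - a); // about [0.91, 0.50]
console.log(hasStrongSignal(oldScores)); // false: the winner scored ~0.09
console.log(hasStrongSignal(newScores)); // true: expansion can be skipped
```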

* fix: rerank cache key ignores chunk content

  The rerank cache key was (query, file, model) but the actual text sent
  to the reranker is a keyword-selected chunk that varies by query terms.
  Two different queries hitting the same file can select different chunks,
  but the second query gets a stale cached score from the first chunk.

  Example:
    Query "auth flow" → selects chunk about authentication → score 0.92
    Query "auth tokens" → same file, selects chunk about tokens
      → cache HIT on (query, file, model) → returns 0.92 from wrong chunk

  Fix: include full chunk text in cache key. getCacheKey() already
  SHA-256 hashes its inputs, so this adds no key bloat — just
  disambiguation. Old cache entries become natural misses (different key
  shape) and re-warm on next query.
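
  A minimal sketch of a key that includes the chunk text. It uses Node's
  `crypto` module (also available in Bun); `rerankCacheKey` is a
  hypothetical stand-in, not the actual getCacheKey() implementation.

```typescript
import { createHash } from "node:crypto";

// Hash every input that affects the reranker's score, including the chunk text.
function rerankCacheKey(query: string, file: string, model: string, chunkText: string): string {
  // JSON-encode the parts so field boundaries stay unambiguous before hashing.
  const payload = JSON.stringify([query, file, model, chunkText]);
  return createHash("sha256").update(payload).digest("hex");
}

const keyA = rerankCacheKey("auth", "docs/auth.md", "reranker-v1", "Chunk about the login flow");
const keyB = rerankCacheKey("auth", "docs/auth.md", "reranker-v1", "Chunk about token expiry");
console.log(keyA === keyB); // false: different chunks no longer share a cache entry
console.log(keyA.length); // 64 hex characters regardless of chunk size
```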

* rename MCP tools for clarity, rewrite descriptions for LLM tool selection

  Rename MCP tools: vsearch → vector_search, query → deep_search.
  LLMs see these names — self-documenting names reduce reliance on
  descriptions for tool selection. CLI commands stay unchanged
  (qmd vsearch, qmd query) — different namespace, users type those.

  Rewrite all search tool descriptions to be action-oriented:
    - search: "Search by keyword. Finds documents containing exact
      words and phrases in the query."
    - vector_search: "Search by meaning. Finds relevant documents even
      when they use different words than the query — handles synonyms,
      paraphrases, and related concepts."
    - deep_search: "Deep search. Auto-expands the query into variations,
      searches each by keyword and meaning, and reranks for top hits
      across all results."

  Rewrite instructions ladder — each tool says what it does, no
  "start here" / "escalate as needed" strategy language.

  Delete the "query" prompt (registerPrompt) — it restated what
  descriptions + instructions already cover. No LLM proactively
  calls prompts/get to learn how to use tools.

* suppress HTTP server logs during tests
