- Move model info from --help to `qmd status` with live HuggingFace
links derived from actual configured URIs
- Pre-push hook: handle non-interactive shells gracefully, resolve
annotated tags correctly for CI checks
- Add /release skill with full process: hook install, changelog
validation, git history review, preview, and release execution
- Skill auto-populates [Unreleased] from git history when empty
- Install hook script symlinks pre-push for tag validation
- Register skills/ dir in .pi/settings.json for pi discovery
- Add tsc build step (tsconfig.build.json) so npm package ships
compiled JS instead of raw TypeScript requiring tsx at runtime
- Update qmd wrapper and daemon spawn to use dist/qmd.js in
production while keeping tsx for development
- Add self-installing pre-push hook validating v* tag pushes:
package.json version match, changelog entry, CI status
- Add release.sh script that renames [Unreleased] to versioned
entry, bumps package.json, commits, and tags
- Add extract-changelog.sh for cumulative GitHub release notes
- Update publish workflow with build step and GitHub release creation
- Flesh out CHANGELOG.md with full history from 0.1.0 through 1.0.0
in Keep-a-Changelog format with PR/contributor attributions
- Add release standards and changelog guidelines to CLAUDE.md
- mcp.ts: add sessionIdGenerator to HTTP transport (fixes "stateless
transport cannot be reused" error in CI)
- test-preload.ts: set 30s default timeout for bun test runner (matches
vitest config, prevents CLI subprocess test timeouts)
- mcp.test.ts: use == null check instead of toBeUndefined for SQLite
get() result (bun:sqlite returns null, better-sqlite3 returns undefined)
- cli.test.ts: fix qmdScript path from <root>/qmd.ts to <root>/src/qmd.ts
(broke when tests moved from src/integration/ to test/)
- mcp.test.ts: forward Mcp-Session-Id header per MCP Streamable HTTP spec
No more src/models/ and src/integration/ subfolders to forget about.
All 9 test files live in test/; one command per runtime runs everything:
  npx vitest run test/
  bun test test/
Add src/db.ts that dynamically imports bun:sqlite under Bun and
better-sqlite3 under Node.js. Exports openDatabase(), loadSqliteVec(),
and a shared Database interface.
- sqlite-vec loading is now optional — FTS works without it, vector
ops throw a clear error if unavailable
- CI tests both runtimes: Node 22/23 via vitest, Bun via bun test
- All 104 unit tests pass on both Node and Bun
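A minimal sketch of the runtime switch described above, not the actual src/db.ts. The `pickDriver` helper, the `importModule` parameter, and the two-method `Database` interface are hypothetical, introduced so the selection logic can be exercised without either native driver installed.

```typescript
// Hypothetical sketch of runtime-based driver selection; not the real src/db.ts.
type DriverName = "bun:sqlite" | "better-sqlite3";

// Assumed minimal surface shared by both drivers for this sketch.
interface Database {
  exec(sql: string): unknown;
  close(): unknown;
}

type DatabaseCtor = new (path: string) => Database;

// Bun exposes a global `Bun` object; Node.js does not.
function pickDriver(
  globals: Record<string, unknown> = globalThis as unknown as Record<string, unknown>,
): DriverName {
  return typeof globals["Bun"] !== "undefined" ? "bun:sqlite" : "better-sqlite3";
}

// The importer is injected so that neither driver is loaded eagerly.
async function openDatabase(
  path: string,
  importModule: (name: string) => Promise<any>,
  driver: DriverName = pickDriver(),
): Promise<Database> {
  const mod = await importModule(driver);
  // bun:sqlite exports a named `Database` class; better-sqlite3 exports its class as the default.
  const Ctor: DatabaseCtor = driver === "bun:sqlite" ? mod.Database : (mod.default ?? mod);
  return new Ctor(path);
}
```

In real code the importer would typically be `(name) => import(name)`, which keeps the unused driver from ever being resolved.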
All test files now use vitest + better-sqlite3 imports.
bun test can't load the better-sqlite3 native addon (symbol
error on Linux, segfault on macOS). Run vitest on Node 22/23.
Split test suites for explicit runtime execution.
- Move model-related tests under `src/models/*`.
- Move CLI/integration tests under `src/integration/*`.
- Add `src/store.helpers.unit.test.ts` for helper unit coverage.
- Add shared Vitest config with default timeout and suite organization.
- Remove legacy flat test files from `src/` root.
- Keep core test commands in scripts supporting unit/models/integration runs.
Document both Node and Bun execution paths.
- Update install examples to `@tobilu/qmd` for npm and bun.
- Add npx/bunx one-off usage examples.
- Reflect Bun as first-class supported runtime in requirements.
Update README installation and quick-start commands to Node examples.
- Replace bun install/link commands with npm-based Node workflow
- Bump package version to 0.9.9 for CLI and MCP metadata
- Keep Bun guidance as optional development/runtime note
Model download + GPU inference won't work on CI runners.
Uses describe.skipIf(CI) for LlamaCpp Integration, LLM Session
Management, vector search, and deep search tests.
The 4 chars/token estimate is accurate for prose but code can be
1.7-2 chars/token. This caused chunks to exceed the embedding
model's 2048 token context limit.
- Use 3 chars/token as initial estimate (balanced for mixed content)
- Add safety net: re-chunk any chunks that still exceed token limit
- Use actual chars/token ratio when re-chunking for accuracy
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
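The estimate-then-verify approach from the message above can be sketched as follows. This is an illustration under assumptions, not the project's chunker: `countTokens` stands in for the real tokenizer, and `splitByChars` and `chunkWithinTokenLimit` are hypothetical names.

```typescript
// Hypothetical sketch: character-based chunking with a token-limit safety net.
type CountTokens = (text: string) => number;

const INITIAL_CHARS_PER_TOKEN = 3; // initial estimate for mixed prose and code

function splitByChars(text: string, maxChars: number): string[] {
  const size = Math.max(1, Math.floor(maxChars));
  const parts: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    parts.push(text.slice(i, i + size));
  }
  return parts;
}

function chunkWithinTokenLimit(
  text: string,
  maxTokens: number,
  countTokens: CountTokens,
): string[] {
  const result: string[] = [];
  // First pass uses the estimate; oversized chunks go back on the queue.
  const pending = splitByChars(text, maxTokens * INITIAL_CHARS_PER_TOKEN);
  while (pending.length > 0) {
    const chunk = pending.shift() as string;
    const tokens = countTokens(chunk);
    if (tokens <= maxTokens || chunk.length <= 1) {
      result.push(chunk);
      continue;
    }
    // Safety net: re-split using the chars/token ratio actually observed.
    const actualCharsPerToken = chunk.length / tokens;
    const maxChars = Math.min(chunk.length - 1, maxTokens * actualCharsPerToken);
    pending.unshift(...splitByChars(chunk, maxChars));
  }
  return result;
}
```

Capping the re-split size at one character less than the chunk length guarantees progress even if the tokenizer reports an unexpectedly dense chunk.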
Add test-preload.ts with global afterAll hook that ensures llama.cpp
Metal resources are properly disposed before process exit, avoiding
GGML_ASSERT failures.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
extractSnippet was using the snippet output length (500 chars) to
determine the search window, which was too small even for fixed
chunks. With variable-length smart chunks, this could miss relevant
content entirely.
Now uses CHUNK_SIZE_CHARS as fallback, ensuring the entire chunk
region is searched regardless of actual chunk length.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
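A rough sketch of the window fix described above. The signature, the 50-character lead-in, and the value of CHUNK_SIZE_CHARS (assumed here as 900 tokens at about 4 chars/token) are assumptions for illustration; the real extractSnippet may differ.

```typescript
// Hypothetical sketch; constants and signature are assumptions, not the real extractSnippet.
const CHUNK_SIZE_CHARS = 3600; // assumed: 900 tokens at ~4 chars/token
const SNIPPET_CHARS = 500; // length of the snippet that is returned

function extractSnippetSketch(
  body: string,
  query: string,
  chunkPos: number,
  chunkLen?: number,
): string {
  // Search the whole chunk region, not just one snippet's worth of text.
  const windowLen = chunkLen ?? CHUNK_SIZE_CHARS;
  const region = body.slice(chunkPos, chunkPos + windowLen);
  const hit = region.indexOf(query);
  // Start slightly before the match, or at the chunk start when nothing matches.
  const start = hit < 0 ? 0 : Math.max(0, hit - 50);
  return region.slice(start, start + SNIPPET_CHARS);
}
```

With a 500-character search window, a match 1000 characters into the chunk would never be found; with the chunk-sized window it is.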
Add Smart Chunking section explaining break point scoring, distance
decay formula, and code fence protection. Update token counts from
800 to 900 throughout.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace hard 800-token boundary chunking with scoring algorithm that
finds natural document break points. Chunks now end at headings,
code blocks, and paragraph boundaries when possible.
- Add break point scoring: h1=100, h2=90, h3=80, codeblock=80, blank=20
- Use squared distance decay so headings win even at window edge
- Protect code fences from being split
- Increase chunk size to 900 tokens to accommodate smart boundaries
- Add comprehensive tests for chunking functions
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
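The scoring idea above can be sketched as below. The base scores come from the message; the decay formula `base * (1 - (distance/window)^2 * 0.7)` and the 0.7 factor are assumptions chosen so that a heading at the window edge (100 * 0.3 = 30) still beats a blank line at the target (20). Code fence protection is not covered here.

```typescript
// Hypothetical sketch of break point scoring with squared distance decay.
type BreakKind = "h1" | "h2" | "h3" | "codeblock" | "blank";

const BASE_SCORE: Record<BreakKind, number> = {
  h1: 100,
  h2: 90,
  h3: 80,
  codeblock: 80,
  blank: 20,
};

const DECAY_FACTOR = 0.7; // assumed value, see lead-in

interface BreakPoint {
  pos: number; // character offset of the candidate break
  kind: BreakKind;
}

function scoreBreakPoint(bp: BreakPoint, target: number, windowChars: number): number {
  const distance = target - bp.pos;
  // Only candidates inside the look-back window are eligible.
  if (distance < 0 || distance > windowChars) return -Infinity;
  const normalized = distance / windowChars;
  return BASE_SCORE[bp.kind] * (1 - normalized * normalized * DECAY_FACTOR);
}

// Returns the best cut position, or the hard target when no candidate qualifies.
function findBestCutoff(breakPoints: BreakPoint[], target: number, windowChars: number): number {
  let bestPos = target;
  let bestScore = 0;
  for (const bp of breakPoints) {
    const score = scoreBreakPoint(bp, target, windowChars);
    if (score > bestScore) {
      bestScore = score;
      bestPos = bp.pos;
    }
  }
  return bestPos;
}
```

Because the decay is squared, candidates near the target lose almost nothing, while the penalty grows quickly only toward the far edge of the window.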
Our assumption that CPU can't benefit from multiple contexts was
wrong. The withLock in node-llama-cpp serializes within a single
context, but separate contexts with split threads run on different
cores in true parallel.
Key changes:
- computeParallelism() now returns >1 on CPU (cores / 4, max 4)
- threadsPerContext() splits math cores evenly across contexts
- Both embed and rerank contexts get proper thread counts
- Benchmark updated to test CPU parallelism
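As a sketch of the two helpers named above, assuming the "cores / 4, max 4" rule from the first bullet and an even integer split of threads. The function bodies are guesses at the shape, not the project's code; the benchmark figures in this message use 8 contexts, so the real cap may differ.

```typescript
// Hypothetical sketch of CPU parallelism sizing; not the project's implementation.

// One context per 4 math cores, at least 1 and at most 4 (rule quoted from the message).
function computeCpuParallelism(mathCores: number): number {
  return Math.max(1, Math.min(4, Math.floor(mathCores / 4)));
}

// Split the math cores evenly across contexts, never dropping below 1 thread.
function threadsPerContext(mathCores: number, contexts: number): number {
  return Math.max(1, Math.floor(mathCores / Math.max(1, contexts)));
}
```

For example, 8 math cores would give 2 contexts with 4 threads each under these assumptions.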
Before (CPU, 40 docs): 9.7s (4.1 docs/s) — 6 threads, 1 context
After (CPU, 40 docs): 2.3s (17.2 docs/s) — 32 threads, 8 contexts
Two fixes stacked:
1. Thread count: default was 6 (library hardcode), now uses all
math cores — 2× improvement alone
2. Multi-context: splitting cores across 8 contexts gives another
2.2× on top
End-to-end 'qmd query' on CPU: 10.3s → 2.9s
CPU benchmark (Threadripper PRO 7975WX, 32 math cores):
1 ctx: 5001ms (8.0 docs/s)
2 ctx: 3585ms (11.2 docs/s) 1.4×
4 ctx: 2874ms (13.9 docs/s) 1.7×
8 ctx: 2323ms (17.2 docs/s) 2.2×
Holistic tuning pass on context and GPU configuration:
GPU detection:
- Use getLlamaGpuTypes() to discover available backends at runtime
instead of try/catch loop. Prefer CUDA > Metal > Vulkan > CPU.
- getLlama({gpu:'auto'}) reports gpu as false even when CUDA is
available (node-llama-cpp issue), so we can't rely on it.
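A small sketch of the preference order above. `pickBackend` is a hypothetical helper; in real code the `available` list would come from the getLlamaGpuTypes() call mentioned in the message, and `false` stands for the CPU fallback.

```typescript
// Hypothetical sketch: choose a backend from a discovered list by fixed preference.
type GpuBackend = "cuda" | "metal" | "vulkan";

const PREFERENCE: readonly GpuBackend[] = ["cuda", "metal", "vulkan"];

// Returns the preferred available backend, or false to mean "run on CPU".
function pickBackend(available: readonly string[]): GpuBackend | false {
  for (const backend of PREFERENCE) {
    if (available.includes(backend)) return backend;
  }
  return false;
}
```

Selecting from a discovered list avoids paying for a failed initialization attempt per backend, which is the cost of a try/catch loop.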
Context tuning:
- Rerank context: 2048 tokens (was auto=40960). The Qwen3 reranker
template adds ~200 tokens overhead, chunks are ~800, query ~50.
Total ~1050 tokens, so 2048 gives comfortable margin.
VRAM per context: ~960 MB (was 11.6 GB with auto).
- Flash attention enabled for rerank contexts (~20% less VRAM).
Falls back gracefully if flash attention not supported.
- Embed context: kept at model default (2048 for nomic-embed).
Platform considerations:
- CUDA (server): up to 8 parallel contexts, flash attention
- Metal (MacBook): 1-4 contexts depending on unified memory
- Vulkan: detected and used if CUDA/Metal unavailable
- CPU: single context (parallelism has no benefit due to locks)
Context size was 1024 initially but Qwen3's reranker template is
verbose (system prompt + instruct + think tags) — some inputs
exceeded 1024 tokens. Bumped to 2048 for safety.
Holistic overhaul of context management:
1. Parallel embedding contexts: embedBatch now splits work across
multiple EmbeddingContexts (same pattern as reranking). Each
context is ~143 MB. Benchmarked 6x speedup on 20 texts with
4 contexts vs 1.
2. Rerank context size: was using auto (40960 tokens = 11.6 GB per
context!). Reranking chunks are ~800 tokens max, so 1024 is
plenty. Now 711 MB per context — 16x less VRAM. 4 contexts went
from 46 GB to 2.8 GB.
3. Adaptive parallelism via computeParallelism(): checks available
VRAM and allocates at most 25% of free VRAM for contexts, capped
at 8. Falls back to 1 on CPU (no benefit from multiple contexts
with node-llama-cpp's withLock serialization). Gracefully handles
allocation failures — uses however many contexts succeeded.
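The VRAM budgeting rule in item 3 can be sketched as follows, assuming free VRAM and per-context cost are known in bytes. The signature is hypothetical; only the 25% budget, the cap of 8, and the CPU fallback of 1 come from the message.

```typescript
// Hypothetical sketch of VRAM-based parallelism; signature is an assumption.
const MAX_CONTEXTS = 8;
const VRAM_BUDGET_FRACTION = 0.25;

// freeVramBytes is null when no GPU is in use (CPU fallback).
function computeParallelism(freeVramBytes: number | null, perContextBytes: number): number {
  if (freeVramBytes === null || perContextBytes <= 0) return 1;
  const budget = freeVramBytes * VRAM_BUDGET_FRACTION;
  const fit = Math.floor(budget / perContextBytes);
  // Always allow one context, never more than the cap.
  return Math.max(1, Math.min(MAX_CONTEXTS, fit));
}
```

With 8000 MB free and 711 MB per rerank context, the budget is 2000 MB, which fits 2 contexts.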
VRAM budget per operation:
- Embed: N × 143 MB (nomic-embed, 2048 ctx)
- Rerank: N × 711 MB (Qwen3-Reranker-0.6B, 1024 ctx)
- Generate: ~1.1 GB (qmd-expansion-1.7B, fresh ctx per call)
Works across:
- Large GPU boxes (4x A6000, 190 GB): allocates up to 8 contexts
- Consumer GPUs (16 GB): 2-4 contexts fit comfortably
- Apple Metal (8-16 GB unified): 1-4 contexts depending on memory
- CPU-only: single context (parallelism has no benefit)
node-llama-cpp's LlamaRankingContext uses a single sequence with a
withLock() guard, making rankAll() effectively sequential despite
using Promise.all(). Each document evaluation erases the context,
evaluates tokens, and extracts the logit — all serialized.
Fix: create 4 parallel ranking contexts from the same model (model
weights are shared, only KV cache is duplicated). Split documents
across contexts and evaluate in parallel via Promise.all().
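A sketch of the split-and-evaluate pattern just described. `ScoreFn` stands in for one ranking context and `rankAcrossContexts` is a hypothetical name; each context works through its share sequentially, mirroring the per-context lock, while the contexts run concurrently.

```typescript
// Hypothetical sketch: spread documents across several ranking contexts.
type ScoreFn = (query: string, doc: string) => Promise<number>;

async function rankAcrossContexts(
  query: string,
  docs: string[],
  contexts: ScoreFn[],
): Promise<number[]> {
  if (contexts.length === 0) throw new Error("need at least one context");
  const scores = new Array<number>(docs.length).fill(0);
  const work = docs.map((doc, index) => ({ doc, index }));
  await Promise.all(
    contexts.map(async (score, c) => {
      // Round-robin assignment: context c takes every Nth document.
      const share = work.filter((_, i) => i % contexts.length === c);
      for (const { doc, index } of share) {
        scores[index] = await score(query, doc);
      }
    }),
  );
  return scores; // scores[i] corresponds to docs[i]
}
```

Writing results by original index keeps the output aligned with the input regardless of which context finishes first.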
Benchmarks (40 chunks, CUDA, 4x A6000):
- 1 context: 898ms (baseline)
- 2 contexts: 460ms (2.0x)
- 4 contexts: 338ms (2.7x) ← sweet spot
- 8 contexts: 458ms (VRAM contention)
End-to-end 'qmd query' time: 7.5s → 3.7s
Gracefully handles VRAM limits — if creating the Nth context fails,
falls back to however many were successfully created.
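The fallback behaviour above might look like the sketch below. `createContexts` and its `create` factory are hypothetical; the factory stands in for whatever call allocates one ranking context.

```typescript
// Hypothetical sketch: allocate up to `wanted` contexts, keeping whatever succeeded.
async function createContexts<T>(wanted: number, create: () => Promise<T>): Promise<T[]> {
  const created: T[] = [];
  for (let i = 0; i < wanted; i++) {
    try {
      created.push(await create());
    } catch (err) {
      // Without even one context there is nothing to fall back to.
      if (created.length === 0) throw err;
      break; // e.g. out of VRAM: stop and use the contexts already created
    }
  }
  return created;
}
```

Allocating sequentially rather than with Promise.all makes the failure point well defined: everything before it is usable.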
QMD was running all models on CPU even when CUDA/Vulkan/Metal
was available. The getLlama() call used no gpu option, defaulting
to false.
Now:
- ensureLlama() tries cuda → vulkan → metal → CPU fallback
- Prints warning to stderr if falling back to CPU
- 'qmd status' shows GPU type, device names, VRAM, and CPU cores
- On this machine: 7.5s query vs 5+ minutes on CPU (reranker)
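A sketch of the fallback chain in the first two bullets. `initWithFallback` and the injected `init` function are hypothetical; `init` stands in for a call such as getLlama with the given gpu option, and the warning text is made up for illustration.

```typescript
// Hypothetical sketch: try GPU backends in order, then CPU, warning on CPU fallback.
type Gpu = "cuda" | "vulkan" | "metal" | false;

const ORDER: readonly Gpu[] = ["cuda", "vulkan", "metal", false];

async function initWithFallback<T>(
  init: (gpu: Gpu) => Promise<T>,
  warn: (message: string) => void = (message) => console.error(message),
): Promise<{ gpu: Gpu; instance: T }> {
  let lastError: unknown;
  for (const gpu of ORDER) {
    try {
      const instance = await init(gpu);
      if (gpu === false) warn("No GPU backend available; falling back to CPU.");
      return { gpu, instance };
    } catch (err) {
      lastError = err; // remember the failure and try the next backend
    }
  }
  throw lastError;
}
```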
The reranker (Qwen3-Reranker-0.6B) calls are serialized by a lock
in node-llama-cpp's rankAndSort() — each of the 40 chunks is
evaluated sequentially. This is inherent to the library's design
(single sequence context). GPU acceleration is the fix, not
batching — the lock prevents true parallelism regardless.
Three improvements to hybridQuery:
1. Collection filter pushed into SQL: searchFTS and searchVec now
accept collectionName directly instead of filtering post-hoc.
Reduces noise in FTS probe and all expanded-query FTS calls.
Also fixes MCP server's FTS search to use SQL-level filtering.
2. Batch embed for vector searches: instead of embedding each
vec/hyde query sequentially (one embed call per query), we now
collect all texts that need vector search and embed them in a
single embedBatch() call. The sqlite-vec lookups still run
sequentially (they're fast), but the expensive LLM embed step
is batched.
3. FTS-first ordering: all lex expansions run immediately (sync,
no LLM needed) before the vector embedding batch. This means
FTS results are ready while embeddings compute.
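The ordering in items 2 and 3 can be sketched as below. The `Expansion`, `Hit`, and `SearchDeps` shapes and the `runSearches` name are assumptions for illustration; only the function names searchFTS, searchVec, and embedBatch and the collectionName parameter come from the message.

```typescript
// Hypothetical sketch of FTS-first ordering with a single batched embed call.
interface Expansion {
  kind: "lex" | "vec" | "hyde";
  text: string;
}

interface Hit {
  docId: string;
  score: number;
}

interface SearchDeps {
  searchFTS(query: string, collectionName?: string): Hit[];
  embedBatch(texts: string[]): Promise<number[][]>;
  searchVec(embedding: number[], collectionName?: string): Hit[];
}

async function runSearches(
  expansions: Expansion[],
  deps: SearchDeps,
  collectionName?: string,
): Promise<Hit[][]> {
  // 1. Lexical expansions first: synchronous, no LLM needed.
  const lists: Hit[][] = expansions
    .filter((e) => e.kind === "lex")
    .map((e) => deps.searchFTS(e.text, collectionName));

  // 2. Collect every text that needs a vector and embed them in one call.
  const vecTexts = expansions.filter((e) => e.kind !== "lex").map((e) => e.text);
  if (vecTexts.length > 0) {
    const embeddings = await deps.embedBatch(vecTexts);
    // 3. Vector lookups run one after another; they are cheap next to embedding.
    for (const embedding of embeddings) {
      lists.push(deps.searchVec(embedding, collectionName));
    }
  }
  return lists;
}
```

Passing collectionName into both search functions reflects item 1: the filter is applied inside the query rather than on the merged results.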
Also cleans up legacy collectionId parameter naming (was number,
now properly string collectionName throughout).
Diagram generated with PaperBanana (Gemini 3 Pro). It shows query
expansion fanning HyDE+Vec into vector searches and Lex into BM25,
with results merged via reciprocal rank fusion and LLM reranking.
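For readers unfamiliar with reciprocal rank fusion, a generic sketch follows. It is the textbook form, where each list contributes 1 / (k + rank) per document; the constant k = 60 is the commonly used default and an assumption here, not a value taken from this project.

```typescript
// Generic reciprocal rank fusion sketch; k = 60 is an assumed default.
function reciprocalRankFusion(
  rankedLists: string[][],
  k: number = 60,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      // Ranks are 1-based, so the top item contributes 1 / (k + 1).
      const contribution = 1 / (k + index + 1);
      scores.set(id, (scores.get(id) ?? 0) + contribution);
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}
```

Documents that appear near the top of several lists accumulate the highest fused score, which is why the method needs no score normalization across BM25 and vector results.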