- Add missing subprocess import (NameError on any quantize path)
- Replace broken optimum-cli quantize calls with direct onnxruntime:
Q4 uses MatMulNBitsQuantizer, Q8 uses quantize_dynamic
- Add onnxconverter-common to deps for FP16 (was silently swallowed)
- Make FP16 fail loudly on missing dep instead of silently uploading FP32
- README and transformers_js_config now reflect actual quantize_type
instead of always hardcoding Q4
- Remove dead _convert_fp16_external function
- Use no_post_process=True for ONNX export to avoid protobuf serialize error
- Add --validate and --validate-only flags for inference verification
- Fix position_ids in validation feed (required by Qwen3 ONNX export)
- Use optimum-cli for quantization to handle external data format
- Fix optimum dependency to optimum[onnxruntime]
Tested: export + validation passes on CPU, KV cache present (56 tensors).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add convert_onnx.py that mirrors convert_gguf.py's structure:
- Loads base Qwen3 model, merges SFT + GRPO adapters
- Exports to ONNX via Optimum (text-generation-with-past task)
- Supports Q4 (MatMulNBits), Q8, FP16, and FP32 output
- Uploads to separate HF repo (e.g. tobil/qmd-query-expansion-1.7B-ONNX)
- Writes Transformers.js compatibility config
- Includes model card with usage example
Usage:
uv run convert_onnx.py --size 1.7B
uv run convert_onnx.py --size 1.7B --quantize q4 --no-upload
Also adds `just convert-onnx` and `just convert-gguf` tasks.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
bun.lock still resolved better-sqlite3 to 11.x after package.json was
bumped to ^12.4.5 in v2.0.0. This breaks sandboxed builds (e.g. Nix
with bun2nix) where network access is unavailable to resolve the
mismatch.
CI and the publish workflow now use --frozen-lockfile so drift is caught
immediately. The release script also validates lockfile consistency
before tagging.
Closes #386
When a chunk exceeds the embedding model's context window (trainContextSize),
node-llama-cpp's getEmbeddingFor() triggers a native SIGABRT in GGML/Metal,
crashing the entire process.
Fix: Add truncateToContextSize() guard in embed() and embedBatch() that uses
the model's own tokenizer to check token count before calling getEmbeddingFor().
Oversized text is truncated to (trainContextSize - 4) tokens with a warning,
preserving partial embedding coverage instead of crashing.
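The guard described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the `Tokenizer` interface, the `warn` callback, and the toy whitespace tokenizer are assumptions made for the example, and the real node-llama-cpp model API may look different. Only the (trainContextSize - 4) limit comes from the description.

```typescript
// Hypothetical tokenizer shape, assumed for this sketch only.
interface Tokenizer {
  tokenize(text: string): number[];
  detokenize(tokens: number[]): string;
}

// Tokens held back from the context window, per the description above.
const SAFETY_MARGIN = 4;

function truncateToContextSize(
  text: string,
  tokenizer: Tokenizer,
  trainContextSize: number,
  warn: (message: string) => void,
): string {
  const maxTokens = Math.max(0, trainContextSize - SAFETY_MARGIN);
  const tokens = tokenizer.tokenize(text);
  if (tokens.length <= maxTokens) {
    return text; // fits, pass through unchanged
  }
  warn(`Truncating input from ${tokens.length} to ${maxTokens} tokens`);
  return tokenizer.detokenize(tokens.slice(0, maxTokens));
}

// Toy whitespace tokenizer, for demonstration only.
const vocab: string[] = [];
const toyTokenizer: Tokenizer = {
  tokenize: (text) =>
    text
      .split(/\s+/)
      .filter((word) => word.length > 0)
      .map((word) => {
        vocab.push(word);
        return vocab.length - 1;
      }),
  detokenize: (tokens) => tokens.map((id) => vocab[id]).join(" "),
};
```

Counting with the model's own tokenizer matters because character or word counts do not map reliably onto token counts.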
Fixes #303
When Bun is installed on the system but QMD was installed via npm,
$BUN_INSTALL is always set (typically to ~/.bun), causing the launcher
to incorrectly run QMD under Bun. This leads to ABI mismatches with
native modules (better-sqlite3, sqlite-vec) that were compiled for Node,
breaking vector operations with "no such module: vec0".
Only check for bun.lock/bun.lockb files, which reliably indicate that
QMD was actually installed with Bun.
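The launcher itself is a shell wrapper, but the decision rule can be sketched in TypeScript as below. The function name, the `installDir` parameter, and the injected `fileExists` callback are assumptions for illustration; where the real wrapper looks for the lockfiles is not stated here.

```typescript
type Runtime = "bun" | "node";

// `fileExists` is injected so the sketch does not depend on any filesystem API.
function detectRuntime(
  installDir: string,
  fileExists: (path: string) => boolean,
): Runtime {
  // $BUN_INSTALL is deliberately not consulted: it is set whenever Bun is
  // installed on the machine, regardless of how QMD itself was installed.
  const lockfiles = ["bun.lock", "bun.lockb"];
  const hasBunLockfile = lockfiles.some((name) =>
    fileExists(`${installDir}/${name}`),
  );
  return hasBunLockfile ? "bun" : "node";
}
```

With this rule, an npm install on a machine that also has Bun resolves to `node`, so the native modules run under the runtime they were compiled for.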
Fixes #361
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add _ciMode flag to LlamaCpp that throws immediately on embedBatch,
generate, expandQuery, and rerank when CI=true — prevents silent 30s
timeouts. Skip MCP HTTP Transport tests in CI (they instantiate a real
LlamaCpp). Bump vitest/bun test timeouts to 60s for slower CI runners.
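A fail-fast guard of this kind might look like the sketch below. The class name, the injected `env` parameter, and the stubbed return values are assumptions for illustration; the actual LlamaCpp class and its error messages are not reproduced here.

```typescript
type Env = Record<string, string | undefined>;

class LlmSketch {
  private readonly ciMode: boolean;

  // The environment is passed in (rather than read from a global) so the
  // sketch is easy to exercise in isolation.
  constructor(env: Env) {
    this.ciMode = env["CI"] === "true";
  }

  // Throws synchronously, before any model loading could start.
  private failFastInCi(operation: string): void {
    if (this.ciMode) {
      throw new Error(`LLM operation '${operation}' is disabled when CI=true`);
    }
  }

  embedBatch(texts: string[]): Promise<number[][]> {
    this.failFastInCi("embedBatch");
    return Promise.resolve(texts.map(() => [] as number[])); // stubbed result
  }

  generate(prompt: string): Promise<string> {
    this.failFastInCi("generate");
    return Promise.resolve(`stub:${prompt}`); // stubbed result
  }
}
```

An immediate error names the operation that was attempted, whereas a timeout only shows that something hung.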
- Bump better-sqlite3 from ^11 to ^12.4.5 for Node 25 support (prebuilds
+ V8 API compat). Closes #257.
- Add bin/qmd shell wrapper that detects bun vs node install and execs
with the matching runtime, preventing native module ABI mismatches
when installed via bun. Closes #319.
Move frontends into src/cli/ and src/mcp/ to separate them from the
core library. The MCP server is fully rewritten to import only from
the SDK (src/index.ts) — zero direct store.ts/collections.ts/llm.ts
access.
- src/qmd.ts → src/cli/qmd.ts
- src/formatter.ts → src/cli/formatter.ts
- src/mcp.ts → src/mcp/server.ts (rewritten to use QMDStore SDK)
- New src/maintenance.ts: Maintenance class for CLI housekeeping
- SDK gains: getDocumentBody(), getDefaultCollectionNames(),
extractSnippet/addLineNumbers/DEFAULT_MULTI_GET_MAX_BYTES exports,
getDefaultDbPath re-export, InternalStore type export
- package.json bin/scripts updated for new paths
- All 692 tests pass
Replace three separate search methods (query, search, structuredSearch)
with a single search(options) that accepts either a query string
(auto-expanded) or pre-expanded queries. Add searchLex/searchVector
convenience methods and expandQuery for manual control.
Unify StructuredSubSearch and ExpandedQuery into a single ExpandedQuery
type with { type, query } used throughout the pipeline. Add skipRerank
option to hybridQuery and structuredSearch for fast no-LLM searches.
New SDK surface:
- search({ query, intent, rerank, limit, ... })
- search({ queries: expanded })
- searchLex(query, opts)
- searchVector(query, opts)
- expandQuery(query, { intent })
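One way to model the two call shapes of search(options) is a discriminated union, sketched below. The field names follow the list above, but the exact types, the `resolveQueries` helper, the stub expander, and the "lex" value for `type` are assumptions made for illustration, not the SDK's actual definitions.

```typescript
interface ExpandedQuery {
  type: string; // the allowed values are not listed here, so left as string
  query: string;
}

type SearchOptions =
  | { query: string; intent?: string; rerank?: boolean; limit?: number }
  | { queries: ExpandedQuery[]; rerank?: boolean; limit?: number };

// Hypothetical helper: expand only when the caller has not already done so.
function resolveQueries(
  options: SearchOptions,
  expand: (query: string, intent?: string) => ExpandedQuery[],
): ExpandedQuery[] {
  if ("queries" in options) {
    return options.queries; // pre-expanded, used as-is
  }
  return expand(options.query, options.intent);
}

// Stub expander for the example; "lex" is a placeholder type value.
const stubExpand = (query: string, intent?: string): ExpandedQuery[] => [
  { type: "lex", query: intent ? `${query} ${intent}` : query },
];

const autoExpanded = resolveQueries({ query: "performance" }, stubExpand);
const passedThrough = resolveQueries(
  { queries: [{ type: "lex", query: "page load time" }] },
  stubExpand,
);
```

The union lets the compiler reject calls that supply neither `query` nor `queries`.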
HuggingFace filenames are case-sensitive. The documented filename
'qwen3-embedding-0.6b-q8_0.gguf' (lowercase) returns 404. The correct
filename is 'Qwen3-Embedding-0.6B-Q8_0.gguf' (original case from the
HuggingFace repo).
Co-Authored-By: Oz <oz-agent@warp.dev>
Allow QMD to be used as a library (`import { createStore } from '@tobilu/qmd'`)
in addition to CLI and MCP modes. The constructor requires explicit dbPath and
either a configPath (YAML file) or inline config object — no defaults assumed,
making it safe to embed in any application.
- Add src/index.ts entry point with QMDStore interface exposing search,
retrieval, collection/context management, and index health
- Add setConfigSource() to collections.ts for inline config support
(in-memory config with no file I/O)
- Add main/types/exports fields to package.json
- Add SDK documentation section to README
- Add 56 unit tests covering constructor, collections, contexts, search,
document retrieval, config isolation, YAML persistence, and lifecycle
Add optional `intent` parameter that steers query expansion, reranking,
chunk selection, and snippet extraction without searching on its own.
When a query like "performance" is ambiguous (web-perf vs team health vs
fitness), intent provides background context that disambiguates results
across all pipeline stages:
- expandQuery: includes intent in LLM prompt ("Query intent: {intent}")
- rerank: prepends intent to rerank query for Qwen3-Reranker
- chunk selection: intent terms scored at 0.5x weight vs query terms
- snippet extraction: intent terms scored at 0.3x weight
- strong-signal bypass: disabled when intent provided
Available via CLI (--intent flag or intent: line in query documents),
MCP (intent field on query tool), and programmatic API.
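The chunk-selection weighting can be illustrated with a small scoring sketch. Only the 0.5x intent weight comes from the description above; the substring matching, the term splitting, and the function names are simplifying assumptions, and the real scorer is likely more involved.

```typescript
const INTENT_WEIGHT_CHUNK = 0.5; // intent terms count half as much as query terms

function splitTerms(text: string): string[] {
  return text
    .toLowerCase()
    .split(/\W+/)
    .filter((term) => term.length > 0);
}

// Assumed scoring rule: +1 per query term found, +0.5 per intent term found.
function scoreChunk(chunk: string, query: string, intent?: string): number {
  const haystack = chunk.toLowerCase();
  let score = 0;
  for (const term of splitTerms(query)) {
    if (haystack.indexOf(term) !== -1) score += 1;
  }
  if (intent) {
    for (const term of splitTerms(intent)) {
      if (haystack.indexOf(term) !== -1) score += INTENT_WEIGHT_CHUNK;
    }
  }
  return score;
}

// Returns the highest-scoring chunk; ties go to the earliest chunk.
function selectBestChunk(
  chunks: string[],
  query: string,
  intent?: string,
): string | undefined {
  let best: string | undefined;
  let bestScore = -Infinity;
  for (const chunk of chunks) {
    const score = scoreChunk(chunk, query, intent);
    if (score > bestScore) {
      best = chunk;
      bestScore = score;
    }
  }
  return best;
}
```

Because intent terms carry less weight than query terms, they break ties between chunks that match the query equally well without outweighing a direct query match.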
Adapted from PR #180 (thanks @vyalamar).
- Cap rerank contexts at 4 to avoid VRAM exhaustion on high-core machines
- Deduplicate identical chunk texts before sending to reranker
- Cache rerank scores by chunk content instead of file path — same text
from different files now shares a single reranker call
- Add truncation cache to avoid re-tokenizing duplicate documents
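The deduplication and content-keyed caching can be sketched as follows. This is an illustration under assumptions: the `Chunk` shape, the `scoreBatch` callback standing in for the reranker, and the use of raw chunk text as the cache key are all hypothetical, and a real cache would presumably also account for the query being reranked against.

```typescript
interface Chunk {
  file: string;
  text: string;
}

function rerankWithCache(
  chunks: Chunk[],
  scoreBatch: (texts: string[]) => number[],
  cache: Map<string, number>,
): number[] {
  // Collect each distinct, not-yet-cached text exactly once.
  const pending: string[] = [];
  const seen = new Set<string>();
  for (const { text } of chunks) {
    if (!cache.has(text) && !seen.has(text)) {
      seen.add(text);
      pending.push(text);
    }
  }

  // Only the deduplicated, uncached texts reach the reranker.
  if (pending.length > 0) {
    const scores = scoreBatch(pending);
    pending.forEach((text, i) => cache.set(text, scores[i] ?? 0));
  }

  // Scores are looked up by content, so identical text in different
  // files shares one entry.
  return chunks.map((chunk) => cache.get(chunk.text) ?? 0);
}
```

Keying on content rather than file path means a paragraph repeated across many files is scored once.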