Commit Graph

68 Commits

Author SHA1 Message Date
Kim Junmo
fee576bf98 fix: migrate legacy lowercase paths on reindex
When qmd update runs against an index created before case-preservation,
documents may exist under lowercase paths (e.g. "skill.md" for a file
actually named "SKILL.md"). Add findOrMigrateLegacyDocument() that:

- Falls back to a lowercase lookup when the canonical path is not found
- Renames the document path in-place via UPDATE OR IGNORE
- Manually rebuilds the FTS entry (FTS5 INSERT OR REPLACE does not
  reliably update existing rows via triggers)
- Handles UNIQUE conflicts gracefully (returns null on conflict)

Embeddings are keyed by content hash, so the rename preserves all
existing vectors — no re-embedding required.

Both the CLI indexer and the library reindexer share the same helper,
eliminating the duplication that a previous review flagged.

Includes integration tests for: successful migration, already-lowercase
no-op, and UNIQUE conflict handling.
2026-04-09 08:25:00 +09:00
Kim Junmo
9fb9de4fd2 fix: preserve original case in handelize()
The blanket .toLowerCase() in handelize() drops filename casing,
which breaks path resolution on case-sensitive filesystems (Linux).
Files like README.md, CHANGELOG.md, and SKILL.md become unreachable
when the index stores them as readme.md, changelog.md, skill.md.

Since FTS5 already performs case-insensitive matching via the
unicode61 tokenizer, lowercasing the stored path provides no search
benefit — it only corrupts the metadata used to locate files on disk.

Remove .toLowerCase() and update all affected test expectations.
2026-04-09 07:59:22 +09:00
Tobi Lutke
66e70c028e
fix(test): reset _productionMode in getDefaultDbPath test
Bun runs all test files in a single process, so module-level state
leaks between files. The getDefaultDbPath test now resets the
_productionMode flag before asserting it throws, fixing the flaky
failure on Bun (ubuntu-latest) in CI.
2026-04-05 18:39:51 -04:00
Tobi Lutke
32e504c883
fix(test): remove duplicate path/handelize tests from store.test.ts
These tests are already in store.helpers.unit.test.ts. The duplicates
in store.test.ts failed in CI because _productionMode module state
leaked from earlier tests in the same bun process, causing
getDefaultDbPath to return a path instead of throwing.
2026-04-05 18:31:17 -04:00
JohnRichardEnders
50ce17bbfa feat(llm): resolve models as config > env > default
Separate hardcoded default from env var in DEFAULT_EMBED_MODEL so the
constructor can resolve: config param > env var > hardcoded default.
Also add env var support for QMD_GENERATE_MODEL and QMD_RERANK_MODEL.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 18:00:08 -04:00
dan mackinlay
1bada2eba6 Add explicit TTY link output tests 2026-04-05 17:58:09 -04:00
dan mackinlay
06f5642252 Fix stale ls test expectation 2026-04-05 17:56:26 -04:00
dan mackinlay
636631225e Add clickable OSC8 editor links for CLI search results 2026-04-05 17:56:26 -04:00
James Risberg
33fae1c4f5 chore: migrate AST chunking tests to vitest
Replace standalone test-ast-chunking.mjs (823 lines, custom check()
harness, invisible to CI) with proper vitest integration tests.

All unique assertions preserved; duplicates already in ast.test.ts
dropped. Performance benchmarks and real-collection scanner removed
(dev tools, not regression tests).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 17:19:59 -04:00
John R Milinovich
b7a5a86a9b feat(cli): add qmd bench command for search quality benchmarks
Adds a benchmark harness that measures search quality across backends.
Given a fixture file with queries and expected results, it runs each
query through BM25, vector, hybrid (no rerank), and full pipeline,
then reports precision@k, recall, MRR, F1, and latency.

This is primarily a regression testing tool — users create fixtures
for their own vaults to catch quality regressions after config or
index changes. Ships with an example fixture against the eval-docs
test collection to demonstrate the format.

New files:
  src/bench/bench.ts       — main runner
  src/bench/score.ts       — precision, recall, MRR, F1, path matching
  src/bench/types.ts       — fixture and result types
  src/bench/fixtures/      — example fixture
  test/bench-score.test.ts — unit tests for scoring (16 tests)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 17:17:59 -04:00
Tobias Lütke
76a2f0fb31 Merge pull request #506 from danmackinlay/fix-505-json-line-output
feat: Include line in --json search output

# Conflicts:
#	CHANGELOG.md
2026-04-05 17:16:05 -04:00
Tobias Lütke
9c9de94bd8 fix(handelize): restore lowercase + convert dots to dashes
- Restore .toLowerCase() in handelize (was dropped, both test files
  expected it inconsistently)
- Convert dots to dashes in filename body (e.g. v2.0 -> v2-0), keeping
  only the extension dot. Tobi confirmed this is the intended behavior.
- Align both test/store.test.ts and test/store.helpers.unit.test.ts to
  match (they had diverged, one expected case-preserved, one lowercase)
- Adjust 'ensureVecTable recreates' test to expect throw behavior
  (matches #501 dimension-mismatch fix)
2026-04-05 17:12:53 -04:00
Surma
2de225c9e7
Test nix flake builds in CI (#487)
* Test nix flake builds in CI

* Update outdated bun.lock file

* fix: restore toLowerCase() in handelize and update tests

* Fix flake to use proper FODs

---------

Co-authored-by: Tobias Lütke <tobi@shopify.com>
2026-04-05 16:59:27 -04:00
Tobias Lütke
828823d20a fix: restore toLowerCase() in handelize + align tests with post-#501 behavior
- Restore .toLowerCase() in handelize (was dropped somewhere, tests expect it)
- Update dimension-mismatch test to expect throw instead of silent rebuild
  (matches new behavior from #501)
- Fix one stale test expectation for preserved dots in filenames
2026-04-05 16:56:06 -04:00
Antonio Mello
ef062e1b54
fix(multi-get): support brace expansion patterns in glob matching (#424)
Brace expansion patterns like `{doc1,doc2}.md` or `collection/{a,b}.md`
were incorrectly parsed as comma-separated file lists instead of being
passed to the glob matcher (picomatch). This happened because the
comma-detection heuristic only checked for `*` and `?` but not `{`.

Also adds `collection/path` matching in `matchFilesByGlob` so patterns
like `my-collection/{file1,file2}.md` work — previously the glob only
matched against `qmd://collection/path` (virtual) and `path` (relative
to collection root), missing the `collection/path` form.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 16:45:33 -04:00
LJY
698b44fe87
Fix qmd embed model selection (#494) 2026-04-05 16:45:04 -04:00
Matt Van Horn
1ad3388132
fix(store): preserve underscores in BM25 search terms (#404)
sanitizeFTS5Term stripped all non-letter/non-number characters including
underscores, causing snake_case identifiers like `my_variable` to become
`myvariable` and silently fail BM25 matches.

Add underscore to the preserved character set in the Unicode regex.
Export the function and add unit tests covering snake_case, contractions,
punctuation stripping, and unicode.

Fixes #305

Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 16:44:14 -04:00
dan mackinlay
c22d00829b Add line to JSON search output 2026-04-05 10:08:57 +00:00
Tobias Lütke
1fb2e2819e Merge origin/main into feat/ast-aware-chunking
Resolve conflicts: combine AST chunking args (filepath, chunkStrategy)
with abort signal parameter from #458.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-28 20:00:49 -04:00
Tobias Lütke
dd27f499c7
Merge pull request #463 from goldsr09/fix/hyphenated-lex-queries
Fix hyphenated tokens in FTS5 lex queries
2026-03-28 19:58:22 -04:00
Tobias Lütke
08566ec316
Merge pull request #462 from goldsr09/fix/bm25-field-weights
Fix BM25 field weights to include all 3 FTS columns
2026-03-28 19:56:04 -04:00
Tobias Lütke
8d343b9da1 Update handelize tests for case/dot preservation (#475)
PR #475 changed handelize() to preserve original case and dots,
but the tests still expected lowercase output. Update assertions
to match the new behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-28 19:54:18 -04:00
Ryan
7b9bd01226 fix: handle hyphenated tokens in FTS5 lex queries
Hyphenated terms like multi-agent, DEC-0054, gpt-4 were being stripped
of hyphens and concatenated (e.g., "multiagent") which missed matches.
Now they're split into FTS5 phrase queries ("multi agent") so the porter
tokenizer matches them correctly.
2026-03-24 20:13:52 -04:00
Ryan
fa214db367 fix: correct BM25 field weights to include all 3 FTS columns
The bm25() call only had 2 weights for 3 columns (filepath, title, body),
giving body an implicit weight of 0. Add proper weights: filepath=1.5,
title=4.0, body=1.0 so title matches are boosted and body content is scored.
2026-03-24 20:12:45 -04:00
James Risberg
244ddf5ecb feat: AST-aware chunking for code files via tree-sitter
Add opt-in AST-aware chunk boundary detection for code files using
web-tree-sitter. When enabled with `--chunk-strategy auto`, code files
(.ts, .tsx, .js, .jsx, .py, .go, .rs) are chunked at function, class,
and import boundaries instead of arbitrary text positions. Default
behavior (`regex`) is unchanged — no surprises on upgrade.

In testing on QMD's own codebase, AST mode split 42% fewer function
bodies across chunk boundaries compared to regex-only chunking.

Usage:
  qmd embed --chunk-strategy auto
  qmd query "search terms" --chunk-strategy auto

What's included:
- Language detection from file extension with support for TypeScript,
  JavaScript (including arrow functions and function expressions),
  Python, Go, and Rust
- Per-language tree-sitter queries with scored break points aligned to
  the existing markdown scale (class=100, function=90, type=80, import=60)
- AST break points merged with regex break points — highest score wins
  at each position, so embedded markdown (comments, docstrings) still
  benefits from regex patterns
- Refactored chunking core: chunkDocumentWithBreakPoints() extracted,
  mergeBreakPoints() added, async chunkDocumentAsync() wrapper for AST
- ChunkStrategy type ("auto" | "regex") threaded through
  generateEmbeddings(), hybridQuery(), structuredSearch(), CLI, and SDK
- getASTStatus() health check wired into `qmd status`
- Parse failures log a warning and fall back to regex — never crash

Hardening:
- Grammar packages are optionalDependencies with pinned versions to
  prevent ABI breaks from semver drift
- web-tree-sitter is a direct dependency (pinned)
- Errors are logged (not silently swallowed) for debuggability
- Tested on both Node.js and Bun (Bun is actually faster)

Testing:
- 26 unit tests (test/ast.test.ts) — all 4 languages, error handling
- 7 integration tests (test/store.test.ts) — merge, equivalence, bypass
- Standalone test-ast-chunking.mjs with 63 synthetic tests and a
  real-collection performance scanner (npx tsx test-ast-chunking.mjs ~/code)
- Validated end-to-end with qmd embed + qmd query on QMD's own codebase
- Zero markdown regressions across all test paths

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 01:22:39 -04:00
Tobias Lütke
5f6821629b
Merge pull request #385 from rymalia/fix/launcher-lockfile-priority
fix: prioritize package-lock.json in launcher to prevent Bun false positive
2026-03-14 08:08:03 -04:00
Tobias Lütke
5b48bcb6c1
Merge pull request #389 from sonwr/fix-issue-380-cleanup-no-sqlite-vec
fix: skip cleanup when sqlite-vec is unavailable
2026-03-14 08:07:11 -04:00
programcaicai
809aa36172 fix: bound memory usage during embed 2026-03-13 17:39:17 +08:00
sonwr
7df09e8235 fix: skip vector cleanup when sqlite-vec is unavailable 2026-03-12 13:51:20 +00:00
Ryan Malia
28903d8eba fix: prioritize package-lock.json in launcher to prevent Bun false positive
The bin/qmd wrapper checks for bun.lock to select the runtime, but since
bun.lock is committed to the repo, source builds using npm install are
incorrectly routed to Bun — causing native module ABI mismatches (#381)
and sqlite-vec crashes (#380).

Add package-lock.json as a higher-priority signal: if it exists, npm
installed the dependencies and Node should be used. Also fix
cleanupOrphanedVectors() to use the existing isSqliteVecAvailable()
guard instead of checking sqlite_master, which can report the virtual
table even when the vec0 module isn't loaded.

Fixes #381, fixes #380
Continuation of #362 (runtime detection false positives)
2026-03-12 01:46:38 -07:00
nkkko
b16d77146a feat(skill): install packaged qmd skill 2026-03-10 23:18:15 +01:00
Tobi Lutke
55f16460d0
fix(ci): guard LLM calls in CI and increase test timeouts
Add _ciMode flag to LlamaCpp that throws immediately on embedBatch,
generate, expandQuery, and rerank when CI=true — prevents silent 30s
timeouts. Skip MCP HTTP Transport tests in CI (they instantiate a real
LlamaCpp). Bump vitest/bun test timeouts to 60s for slower CI runners.
2026-03-10 13:28:37 -04:00
Tobi Lutke
ed0249fd6b
fix(test): increase timeout for SDK search tests that trigger LLM expansion
These tests load the query expansion model on first call, which
consistently exceeds the 30s timeout on CI runners.
2026-03-10 12:59:46 -04:00
Tobi Lutke
c68904fe08
refactor: move CLI and MCP to subdirectories, MCP consumes SDK
Move frontends into src/cli/ and src/mcp/ to separate them from the
core library. The MCP server is fully rewritten to import only from
the SDK (src/index.ts) — zero direct store.ts/collections.ts/llm.ts
access.

- src/qmd.ts → src/cli/qmd.ts
- src/formatter.ts → src/cli/formatter.ts
- src/mcp.ts → src/mcp/server.ts (rewritten to use QMDStore SDK)
- New src/maintenance.ts: Maintenance class for CLI housekeeping
- SDK gains: getDocumentBody(), getDefaultCollectionNames(),
  extractSnippet/addLineNumbers/DEFAULT_MULTI_GET_MAX_BYTES exports,
  getDefaultDbPath re-export, InternalStore type export
- package.json bin/scripts updated for new paths
- All 692 tests pass
2026-03-10 11:39:55 -04:00
Tobi Lutke
839d774a06
feat: redesign SDK search API with unified search() and ExpandedQuery type
Replace three separate search methods (query, search, structuredSearch)
with a single search(options) that accepts either a query string
(auto-expanded) or pre-expanded queries. Add searchLex/searchVector
convenience methods and expandQuery for manual control.

Unify StructuredSubSearch and ExpandedQuery into a single ExpandedQuery
type with { type, query } used throughout the pipeline. Add skipRerank
option to hybridQuery and structuredSearch for fast no-LLM searches.

New SDK surface:
- search({ query, intent, rerank, limit, ... })
- search({ queries: expanded })
- searchLex(query, opts)
- searchVector(query, opts)
- expandQuery(query, { intent })
2026-03-10 11:04:45 -04:00
Tobi Lutke
040c6fa904
feat: add SDK/library mode for programmatic access
Allow QMD to be used as a library (`import { createStore } from '@tobilu/qmd'`)
in addition to CLI and MCP modes. The constructor requires explicit dbPath and
either a configPath (YAML file) or inline config object — no defaults assumed,
making it safe to embed in any application.

- Add src/index.ts entry point with QMDStore interface exposing search,
  retrieval, collection/context management, and index health
- Add setConfigSource() to collections.ts for inline config support
  (in-memory config with no file I/O)
- Add main/types/exports fields to package.json
- Add SDK documentation section to README
- Add 56 unit tests covering constructor, collections, contexts, search,
  document retrieval, config isolation, YAML persistence, and lifecycle
2026-03-08 15:59:22 -04:00
Tobi Lutke
ad38c1f698
feat: add intent parameter for query disambiguation
Add optional `intent` parameter that steers query expansion, reranking,
chunk selection, and snippet extraction without searching on its own.

When a query like "performance" is ambiguous (web-perf vs team health vs
fitness), intent provides background context that disambiguates results
across all pipeline stages:

- expandQuery: includes intent in LLM prompt ("Query intent: {intent}")
- rerank: prepends intent to rerank query for Qwen3-Reranker
- chunk selection: intent terms scored at 0.5x weight vs query terms
- snippet extraction: intent terms scored at 0.3x weight
- strong-signal bypass: disabled when intent provided

Available via CLI (--intent flag or intent: line in query documents),
MCP (intent field on query tool), and programmatic API.

Adapted from PR #180 (thanks @vyalamar).
2026-03-07 19:27:29 -04:00
Tobi Lutke
e3549dab1a
perf(rerank): cap parallelism, deduplicate chunks, cache by content
- Cap rerank contexts at 4 to avoid VRAM exhaustion on high-core machines
- Deduplicate identical chunk texts before sending to reranker
- Cache rerank scores by chunk content instead of file path — same text
  from different files now shares a single reranker call
- Add truncation cache to avoid re-tokenizing duplicate documents
2026-03-07 15:57:36 -04:00
vyalamar
b068ad0dd6
feat(query): add --explain score traces for hybrid search 2026-03-07 14:35:10 -04:00
Tobias Lütke
8bd93366ad
Merge pull request #228 from amsminn/fix-empty-results-format
fix(cli): prevent parser breakage on empty results across output formats
2026-03-07 14:25:16 -04:00
Tobias Lütke
ee08997f23
Merge pull request #313 from 0xble/fix/expand-context-size-config
fix(llm): make query expansion context size configurable
2026-03-07 14:25:04 -04:00
Tobias Lütke
a28163fb2c
Merge pull request #304 from sebkouba/feature/collection-ignore
feat: add ignore patterns for collections
2026-03-07 14:25:02 -04:00
Tobias Lütke
e6b50cfca9
Merge pull request #308 from debugerman/fix/handelize-emoji-crash
fix(store): handle emoji-only filenames in handelize (#302)
2026-03-07 14:24:59 -04:00
Brian Le
0dec1df047
fix(llm): make expansion context size configurable 2026-03-06 16:35:33 -05:00
Brian Le
49d5b4f450
fix(index): deactivate stale docs on empty collection updates 2026-03-06 16:29:52 -05:00
Ning
dc777e3be0
fix(store): handle emoji-only filenames in handelize (#302)
Convert emoji codepoints to hex representation (e.g. 🐘 → 1f418) instead
of crashing, so files like 🐘.md can be indexed without halting the
entire update process.

Fixes #302
2026-03-06 14:24:24 +08:00
Sebastian Kouba
fde542cd0d feat: add ignore patterns for collections
Add an optional 'ignore' field to collection config that accepts an array
of glob patterns to exclude from indexing. This allows collections to skip
specific subdirectories without needing separate collections.

Example YAML config:
  personal:
    path: ~/personal_synced
    pattern: '**/*.md'
    ignore:
      - 'Sessions/**'
      - 'archive/**'

The ignore patterns are passed to fast-glob's ignore option alongside the
existing hardcoded excludes (node_modules, .git, etc). Already-indexed
files matching new ignore patterns are deactivated on the next update.

Changes:
- Add ignore?: string[] to Collection interface
- Pass ignore patterns through to fast-glob in indexFiles()
- Show ignore patterns in collection list/status output
- 5 new CLI integration tests covering the feature
2026-03-05 19:17:44 +01:00
CHAEWAN KIM
b024693f5d
Merge branch 'main' into fix-empty-results-format 2026-02-23 22:36:21 -08:00
Tobi Lütke
5233e676d9
fix(rerank): truncate documents exceeding 2048-token context size
node-llama-cpp throws a hard error when any document + query + template
overhead exceeds the ranking context size. Truncate oversized documents
using the rerank model's tokenizer before passing them to rankAll().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 12:41:59 -05:00
Tobi Lutke
64ef25e1f6
Document query grammar and add skill helpers 2026-02-22 13:36:08 -04:00