Commit Graph

13 Commits

Author SHA1 Message Date
Bek
e4990e470e Harden embedding overflow handling 2026-04-10 16:02:46 -04:00
Tobi Lutke
32e504c883
fix(test): remove duplicate path/handelize tests from store.test.ts
These tests are already in store.helpers.unit.test.ts. The duplicates
in store.test.ts failed in CI because _productionMode module state
leaked from earlier tests in the same bun process, causing
getDefaultDbPath to return a path instead of throwing.
2026-04-05 18:31:17 -04:00
Tobias Lütke
9c9de94bd8 fix(handelize): restore lowercase + convert dots to dashes
- Restore .toLowerCase() in handelize (was dropped, both test files
  expected it inconsistently)
- Convert dots to dashes in filename body (e.g. v2.0 -> v2-0), keeping
  only the extension dot. Tobi confirmed this is the intended behavior.
- Align both test/store.test.ts and test/store.helpers.unit.test.ts to
  match (they had diverged, one expected case-preserved, one lowercase)
- Adjust 'ensureVecTable recreates' test to expect throw behavior
  (matches #501 dimension-mismatch fix)
2026-04-05 17:12:53 -04:00
Tobias Lütke
828823d20a fix: restore toLowerCase() in handelize + align tests with post-#501 behavior
- Restore .toLowerCase() in handelize (was dropped somewhere, tests expect it)
- Update dimension-mismatch test to expect throw instead of silent rebuild
  (matches new behavior from #501)
- Fix one stale test expectation for preserved dots in filenames
2026-04-05 16:56:06 -04:00
Antonio Mello
ef062e1b54
fix(multi-get): support brace expansion patterns in glob matching (#424)
Brace expansion patterns like `{doc1,doc2}.md` or `collection/{a,b}.md`
were incorrectly parsed as comma-separated file lists instead of being
passed to the glob matcher (picomatch). This happened because the
comma-detection heuristic only checked for `*` and `?` but not `{`.

Also adds `collection/path` matching in `matchFilesByGlob` so patterns
like `my-collection/{file1,file2}.md` work — previously the glob only
matched against `qmd://collection/path` (virtual) and `path` (relative
to collection root), missing the `collection/path` form.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 16:45:33 -04:00
LJY
698b44fe87
Fix qmd embed model selection (#494) 2026-04-05 16:45:04 -04:00
Tobias Lütke
1fb2e2819e Merge origin/main into feat/ast-aware-chunking
Resolve conflicts: combine AST chunking args (filepath, chunkStrategy)
with abort signal parameter from #458.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-28 20:00:49 -04:00
Ryan
fa214db367 fix: correct BM25 field weights to include all 3 FTS columns
The bm25() call only had 2 weights for 3 columns (filepath, title, body),
giving body an implicit weight of 0. Add proper weights: filepath=1.5,
title=4.0, body=1.0 so title matches are boosted and body content is scored.
2026-03-24 20:12:45 -04:00
James Risberg
244ddf5ecb feat: AST-aware chunking for code files via tree-sitter
Add opt-in AST-aware chunk boundary detection for code files using
web-tree-sitter. When enabled with `--chunk-strategy auto`, code files
(.ts, .tsx, .js, .jsx, .py, .go, .rs) are chunked at function, class,
and import boundaries instead of arbitrary text positions. Default
behavior (`regex`) is unchanged — no surprises on upgrade.

In testing on QMD's own codebase, AST mode split 42% fewer function
bodies across chunk boundaries compared to regex-only chunking.

Usage:
  qmd embed --chunk-strategy auto
  qmd query "search terms" --chunk-strategy auto

What's included:
- Language detection from file extension with support for TypeScript,
  JavaScript (including arrow functions and function expressions),
  Python, Go, and Rust
- Per-language tree-sitter queries with scored break points aligned to
  the existing markdown scale (class=100, function=90, type=80, import=60)
- AST break points merged with regex break points — highest score wins
  at each position, so embedded markdown (comments, docstrings) still
  benefits from regex patterns
- Refactored chunking core: chunkDocumentWithBreakPoints() extracted,
  mergeBreakPoints() added, async chunkDocumentAsync() wrapper for AST
- ChunkStrategy type ("auto" | "regex") threaded through
  generateEmbeddings(), hybridQuery(), structuredSearch(), CLI, and SDK
- getASTStatus() health check wired into `qmd status`
- Parse failures log a warning and fall back to regex — never crash

Hardening:
- Grammar packages are optionalDependencies with pinned versions to
  prevent ABI breaks from semver drift
- web-tree-sitter is a direct dependency (pinned)
- Errors are logged (not silently swallowed) for debuggability
- Tested on both Node.js and Bun (Bun is actually faster)

Testing:
- 26 unit tests (test/ast.test.ts) — all 4 languages, error handling
- 7 integration tests (test/store.test.ts) — merge, equivalence, bypass
- Standalone test-ast-chunking.mjs with 63 synthetic tests and a
  real-collection performance scanner (npx tsx test-ast-chunking.mjs ~/code)
- Validated end-to-end with qmd embed + qmd query on QMD's own codebase
- Zero markdown regressions across all test paths

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 01:22:39 -04:00
programcaicai
809aa36172 fix: bound memory usage during embed 2026-03-13 17:39:17 +08:00
Tobi Lutke
839d774a06
feat: redesign SDK search API with unified search() and ExpandedQuery type
Replace three separate search methods (query, search, structuredSearch)
with a single search(options) that accepts either a query string
(auto-expanded) or pre-expanded queries. Add searchLex/searchVector
convenience methods and expandQuery for manual control.

Unify StructuredSubSearch and ExpandedQuery into a single ExpandedQuery
type with { type, query } used throughout the pipeline. Add skipRerank
option to hybridQuery and structuredSearch for fast no-LLM searches.

New SDK surface:
- search({ query, intent, rerank, limit, ... })
- search({ queries: expanded })
- searchLex(query, opts)
- searchVector(query, opts)
- expandQuery(query, { intent })
2026-03-10 11:04:45 -04:00
Tobi Lutke
e3549dab1a
perf(rerank): cap parallelism, deduplicate chunks, cache by content
- Cap rerank contexts at 4 to avoid VRAM exhaustion on high-core machines
- Deduplicate identical chunk texts before sending to reranker
- Cache rerank scores by chunk content instead of file path — same text
  from different files now shares a single reranker call
- Add truncation cache to avoid re-tokenizing duplicate documents
2026-03-07 15:57:36 -04:00
Tobi Lutke
870d3aed3b
test: move all tests to flat test/ directory
No more src/models/ and src/integration/ subfolders to forget about.
All 9 test files live in test/, one command runs everything:

  npx vitest run test/
  bun test test/
2026-02-15 21:37:47 -04:00