22 Commits

Author SHA1 Message Date
Tobias Lütke
e6b50cfca9
Merge pull request #308 from debugerman/fix/handelize-emoji-crash
fix(store): handle emoji-only filenames in handelize (#302)
2026-03-07 14:24:59 -04:00
Brian Le
49d5b4f450
fix(index): deactivate stale docs on empty collection updates
2026-03-06 16:29:52 -05:00
Ning
dc777e3be0
fix(store): handle emoji-only filenames in handelize (#302)
Convert emoji codepoints to hex representation (e.g. 🐘 → 1f418) instead
of crashing, so files like 🐘.md can be indexed without halting the
entire update process.

Fixes #302
2026-03-06 14:24:24 +08:00
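
The conversion described in dc777e3be0 could look roughly like the sketch below. `emojiToHex` is a hypothetical helper, not the repository's actual `handelize` function, and the choice to concatenate hex values with no separator when several emoji appear in a row is an assumption.

```typescript
// Hypothetical helper illustrating the emoji-to-hex idea; not QMD's handelize().
function emojiToHex(name: string): string {
  // \p{Extended_Pictographic} matches emoji code points; the "u" flag makes
  // the regex operate on whole code points rather than UTF-16 halves.
  return name.replace(/\p{Extended_Pictographic}/gu, (ch) =>
    // codePointAt(0) reads the full code point of the matched character.
    (ch.codePointAt(0) ?? 0).toString(16)
  );
}

console.log(emojiToHex("🐘.md")); // 1f418.md
```

Names without emoji pass through unchanged, so the helper can run on every filename before the rest of the slug logic.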
Tobi Lütke
5233e676d9
fix(rerank): truncate documents exceeding 2048-token context size
node-llama-cpp throws a hard error when any document + query + template
overhead exceeds the ranking context size. Truncate oversized documents
using the rerank model's tokenizer before passing them to rankAll().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 12:41:59 -05:00
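
A minimal sketch of the truncation idea in 5233e676d9, under stated assumptions: the `Tokenizer` interface, the `truncateForRerank` name, and the 64-token template reserve are all hypothetical and do not reflect node-llama-cpp's real types or QMD's code. Only the 2048-token context size comes from the commit message.

```typescript
// Minimal tokenizer shape assumed for this sketch (hypothetical).
interface Tokenizer {
  tokenize(text: string): number[];
  detokenize(tokens: number[]): string;
}

// Returns the document unchanged when it fits, otherwise cuts it to the
// token budget left after the query and an assumed template reserve.
function truncateForRerank(
  doc: string,
  query: string,
  tok: Tokenizer,
  contextSize = 2048,
  templateOverhead = 64, // assumed reserve for the rerank prompt template
): string {
  const budget = contextSize - tok.tokenize(query).length - templateOverhead;
  if (budget <= 0) return ""; // nothing fits next to this query
  const tokens = tok.tokenize(doc);
  if (tokens.length <= budget) return doc;
  return tok.detokenize(tokens.slice(0, budget));
}
```

Truncating at the token level, rather than by characters, keeps the check aligned with the limit the ranking context actually enforces.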
Tobi Lutke
64ef25e1f6
Document query grammar and add skill helpers
2026-02-22 13:36:08 -04:00
Tobi Lutke
c7e8ea02a5
test: restructure container smoke tests for interactive use
Replaces the inner test script with an outer driver that runs individual
podman/docker commands against a pre-built image. Tests sqlite-vec
loading and store unit tests under both node and bun runtimes.

Supports --build (image only), --shell (interactive), and -- CMD
(arbitrary command) for debugging install issues in isolation.
2026-02-22 11:09:36 -04:00
Tobi Lutke
0b57711d32
refactor: replace bash wrapper with standard #!/usr/bin/env node shebang
The qmd bin was a custom bash script that discovered node via hardcoded
fallback paths (mise, asdf, nvm, homebrew). This was nonstandard and
caused ABI mismatches when installed via bun (native modules compiled
for bun but executed with node).

Now uses the standard npm bin convention: dist/qmd.js with a node
shebang, added by the build script. The isMain guard resolves symlinks
so it works when npm/bun create symlinked bin entries.

Also converts all dynamic require() calls in tests to ESM imports, and
adds container-based smoke tests (test/smoke-install.sh) that verify
install + run under both node and bun via mise in a Debian container.
2026-02-22 11:09:36 -04:00
Tobi Lütke
3b87e3e224
feat: query document format, lex phrase/negation syntax, training data
The 'query document' is now a first-class concept in QMD: a structured
document with typed sub-queries that combine for best recall.

## Query types
- lex:    BM25 keyword search with phrase and negation syntax
- vec:    Semantic vector search (natural language questions)
- hyde:   Hypothetical document (write the expected answer)
- expand: Auto-expand via local LLM (max 1, default for plain queries)

## Lex syntax
Full BM25 operator support:
  "exact phrase"     verbatim match, no prefix
  -term              exclude documents containing term
  -"exact phrase"    exclude documents containing phrase

Examples:
  "C++ performance" optimization -sports -athlete
  "connection pool" timeout -redis
  "machine learning" -sports -athlete

## MCP tool description rewritten
The 'query' tool description now fully teaches AI agents the query
document format, lex syntax, and strategy for combining types.
Includes worked examples, among them intent-aware lex (C++ performance,
not sports), which is critical for disambiguation in dense corpora.

## Unit tests
11 new lex parser tests covering:
- plain terms, quoted phrases, negation, combined
- intent-aware disambiguation (performance -sports -athlete)
- only-negation returns null (FTS5 constraint)
- empty/whitespace handling

## Training data
12 new intent-aware examples for next model training round:
- Real technical topics with lex phrase+negation combinations
- Covers: C++ perf, Python memory, DB connections, rate limiting,
  SQL optimization, ML overfitting, Docker, JWT, async/await,
  git conflicts, Kubernetes, React state
- Each shows how context/intent shapes lex query construction
  (e.g. performance with C++ context → -sports -athlete exclusions)
2026-02-19 06:52:58 -05:00
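
The lex syntax from 3b87e3e224 (terms, "phrases", and -negation, with only-negation queries returning null) can be illustrated with a small translator to an SQLite FTS5 MATCH string. This is a sketch only: `buildFts5Query` and the exact output shape (AND-joined clauses, prefix matching via `*` on bare terms) are assumptions, not QMD's parser.

```typescript
// Hypothetical sketch: name and output format are assumptions, not QMD's parser.
function buildFts5Query(input: string): string | null {
  const positive: string[] = [];
  const negative: string[] = [];
  // Optional "-" prefix, then either a quoted phrase or a bare token.
  const tokenRe = /(-?)(?:"([^"]*)"|(\S+))/g;
  let m: RegExpExecArray | null;
  while ((m = tokenRe.exec(input)) !== null) {
    const negated = m[1] === "-";
    const quoted = m[2];
    let clause: string;
    if (quoted !== undefined) {
      const phrase = quoted.trim();
      if (phrase === "") continue; // ignore empty quotes
      clause = `"${phrase}"`; // exact phrase, no prefix match
    } else {
      const term = (m[3] ?? "").replace(/"/g, ""); // drop stray quotes
      if (!/[\p{L}\p{N}]/u.test(term)) continue; // skip punctuation-only tokens
      clause = `"${term}"*`; // bare terms match as prefixes
    }
    (negated ? negative : positive).push(clause);
  }
  // FTS5 NOT is a binary operator, so a query with only exclusions is invalid.
  if (positive.length === 0) return null;
  const required = positive.join(" AND ");
  if (negative.length === 0) return required;
  return `(${required}) NOT ${negative.join(" NOT ")}`;
}

console.log(buildFts5Query('"C++ performance" optimization -sports -athlete'));
// ("C++ performance" AND "optimization"*) NOT "sports"* NOT "athlete"*
```

Returning null instead of throwing lets a caller skip the lex leg of a query document while still running its vec or hyde parts.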
Tobi Lütke
4649069e62
feat: add expand: type, rename to query, document syntax
BREAKING CHANGES:
- MCP tool renamed: structured_search → query
- HTTP endpoint renamed: /search → /query

New features:
- expand: type auto-expands via local LLM (max 1 per query)
- docs/SYNTAX.md formal grammar for query documents
- lex syntax: "phrase", -negation documented

Query types: lex, vec, hyde, expand
Default (no prefix) = expand (backwards compatible)
2026-02-18 22:22:50 -05:00
Tobi Lütke
de3a83a553
refactor: remove OR operator from lex queries
Simplify to just: terms, "phrases", and -negation
2026-02-18 22:17:52 -05:00
Tobi Lütke
efb39616e6
feat(lex): add query syntax for exact phrases, negation, and OR
Lex queries now support:
- "exact phrase" - quoted exact matching (no prefix)
- -term or -"phrase" - exclude from results
- term1 OR term2 - match either term

Semantic queries (vec/hyde) validate and reject these operators
with helpful error messages.

Examples:
  performance -sports     → matches "performance" excluding "sports"
  "machine learning"      → exact phrase match
  auth OR authentication  → matches either term
2026-02-18 22:14:09 -05:00
Tobi Lütke
19284ddb80
refactor(mcp): remove deprecated search tools, keep only structured_search
BREAKING CHANGE: MCP tools search, vector_search, deep_search removed.
Use structured_search with lex/vec/hyde queries instead.

- Remove search, vector_search, deep_search MCP tool registrations
- Update MCP instructions to focus on structured_search
- Update skill docs to reflect simplified API
- Rename test describes to reflect they test store functions
- CLI commands (qmd search, vsearch, query) unchanged for backwards compat
2026-02-18 21:50:25 -05:00
Tobi Lütke
db44e1a5bc
test: add comprehensive tests for structured search
32 tests covering:
- parseStructuredQuery parser (24 tests)
  - plain queries returning null
  - single/multiple prefixed queries
  - mixed plain + prefixed lines
  - error on multiple plain lines
  - whitespace handling
  - edge cases (colons in text, etc.)
- StructuredSubSearch type validation (3 tests)
- structuredSearch function basics (5 tests)
  - empty searches
  - no matches
  - limit/minScore options
2026-02-18 21:39:40 -05:00
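
The parser behaviours listed in db44e1a5bc (plain queries returning null, prefixed lines, an error on multiple plain lines, colons inside query text) can be sketched as below. `parseQueryDocument` and its return shape are hypothetical and stand in for the real `parseStructuredQuery`; how the real parser treats a plain line mixed with prefixed lines is not stated in the commit, so this sketch simply returns it alongside the typed sub-queries.

```typescript
type SubQueryType = "lex" | "vec" | "hyde";

interface SubQuery {
  type: SubQueryType;
  query: string;
}

interface ParsedDocument {
  plain: string | null; // at most one unprefixed line (assumed handling)
  searches: SubQuery[];
}

// Hypothetical sketch of a query-document parser; not QMD's implementation.
function parseQueryDocument(input: string): ParsedDocument | null {
  const searches: SubQuery[] = [];
  const plain: string[] = [];
  for (const raw of input.split("\n")) {
    const line = raw.trim();
    if (line === "") continue; // whitespace-only lines are ignored
    // Only a known prefix at the start of the line counts; later colons are text.
    const m = /^(lex|vec|hyde):\s*(.*)$/.exec(line);
    if (m) {
      const query = (m[2] ?? "").trim();
      if (query !== "") searches.push({ type: m[1] as SubQueryType, query });
    } else {
      plain.push(line);
    }
  }
  if (plain.length > 1) {
    throw new Error("A query document may contain at most one plain line");
  }
  if (searches.length === 0) return null; // plain query: caller uses the default path
  return { plain: plain[0] ?? null, searches };
}
```

For example, `parseQueryDocument("lex: auth\nvec: how does login work")` yields two sub-queries, while `parseQueryDocument("hello world")` returns null.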
Tobi Lutke
648779a04d
fix(test): reset currentIndexName between test files
collections-config.test.ts set currentIndexName to "myindex" in its
last test but only restored env vars in afterEach, not the module
variable. Under bun test (single process), this leaked into mcp.test.ts,
causing it to look for myindex.yml instead of index.yml.

Fix: reset setConfigIndexName("index") in afterEach, and add defensive
reset in mcp.test.ts beforeAll.
2026-02-18 15:53:58 -04:00
Tobi Lutke
640ac13cd0
fix: support multiple -c collection filters in search commands
Closes #191 (thanks @openclaw)
2026-02-16 14:03:53 -04:00
Tobi Lutke
8c2282c979
fix: respect XDG_CONFIG_HOME in collection config path
Closes #190 (thanks @openclaw)
2026-02-16 14:03:49 -04:00
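
A sketch of what respecting XDG_CONFIG_HOME typically involves, assuming POSIX-style paths. The `collectionConfigPath` name, the `qmd` directory, and the parameter list are assumptions for illustration; only the `index.yml` file name appears elsewhere in this log.

```typescript
// Hypothetical resolver; directory layout is an assumption, not QMD's code.
function collectionConfigPath(
  env: Record<string, string | undefined>,
  homeDir: string,
  indexName = "index",
): string {
  const xdg = env["XDG_CONFIG_HOME"];
  // The XDG Base Directory spec treats relative paths as invalid, so fall
  // back to the default $HOME/.config when the value is unset or relative.
  const base = xdg && xdg.startsWith("/") ? xdg : `${homeDir}/.config`;
  return `${base}/qmd/${indexName}.yml`;
}

console.log(collectionConfigPath({ XDG_CONFIG_HOME: "/tmp/cfg" }, "/home/u"));
// /tmp/cfg/qmd/index.yml
```

Passing the environment and home directory as arguments keeps the function pure, which makes it easy to test without mutating `process.env`.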
Tobi Lutke
93f277c5e3
fix: MCP session support and cross-runtime test compat
- mcp.ts: add sessionIdGenerator to HTTP transport (fixes "stateless
  transport cannot be reused" error in CI)
- test-preload.ts: set 30s default timeout for bun test runner (matches
  vitest config, prevents CLI subprocess test timeouts)
- mcp.test.ts: use == null check instead of toBeUndefined for SQLite
  get() result (bun:sqlite returns null, better-sqlite3 returns undefined)
2026-02-15 21:54:25 -04:00
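
The cross-runtime check from 93f277c5e3 relies on loose equality: `== null` is true for both `null` and `undefined` and for nothing else. A tiny illustration, with `rowIsMissing` as a hypothetical helper name:

```typescript
// Treats both null (reported for bun:sqlite) and undefined (reported for
// better-sqlite3) as "no row found". Hypothetical helper for illustration.
function rowIsMissing(row: unknown): boolean {
  return row == null; // matches null and undefined only, not 0, "" or false
}

console.log(rowIsMissing(null), rowIsMissing(undefined), rowIsMissing(0));
// true true false
```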
Tobi Lutke
edc9a87234
fix: correct test paths after moving to test/ directory
- cli.test.ts: fix qmdScript path from <root>/qmd.ts to <root>/src/qmd.ts
  (broke when tests moved from src/integration/ to test/)
- mcp.test.ts: forward Mcp-Session-Id header per MCP Streamable HTTP spec
2026-02-15 21:46:45 -04:00
Tobi Lutke
870d3aed3b
test: move all tests to flat test/ directory
No more src/models/ and src/integration/ subfolders to forget about.
All 9 test files live in test/, one command runs everything:

  npx vitest run test/
  bun test test/
2026-02-15 21:37:47 -04:00
Tobi Lutke
431f6e505b
Fix qmd embed crash and resolve all TypeScript errors
- Fix ReferenceError in vectorIndex(): firstResult was used but never
  defined. Added code to embed first chunk to get embedding dimensions.

- Fix 87 TypeScript errors across codebase:
  - formatter.ts: Define MultiGetFile type locally (was missing from store.ts)
  - collections.ts: Add non-null assertion for array access
  - mcp.ts: Fix StatusResult type to match store.ts CollectionInfo,
    add list parameter to ResourceTemplate, fix undefined checks
  - qmd.ts: Fix boolean/string type coercions, undefined array access
  - llm.test.ts: Update expandQuery tests for Queryable[] return type,
    fix array access assertions
  - store.test.ts: Add non-null assertions for array access in tests
  - eval-harness.ts: Fix array access assertion
2025-12-31 13:32:30 -04:00
Tobi Lutke
945d4b4572
Add 6 synthetic evaluation documents
Topics covered:
- API design principles
- Startup fundraising memo
- Distributed systems overview
- Product launch retrospective
- Machine learning primer
- Remote work policy

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 13:10:35 -04:00
Tobi Lutke
7828566333
Add evaluation harness with synthetic test documents
- 6 public-style documents covering diverse topics
- 18 test queries: 6 easy, 6 medium, 6 hard
- Easy: exact keyword matches
- Medium: semantic/conceptual queries
- Hard: partial recall, indirect references
- Measures Hit@1, Hit@3, Hit@5 by difficulty
- Tests both search (BM25) and query (hybrid) modes

Run: bun test/eval-harness.ts

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 13:10:24 -04:00