qmd/CLAUDE.md
James Risberg 244ddf5ecb feat: AST-aware chunking for code files via tree-sitter
Add opt-in AST-aware chunk boundary detection for code files using
web-tree-sitter. When enabled with `--chunk-strategy auto`, code files
(.ts, .tsx, .js, .jsx, .py, .go, .rs) are chunked at function, class,
and import boundaries instead of arbitrary text positions. Default
behavior (`regex`) is unchanged — no surprises on upgrade.

In testing on QMD's own codebase, AST mode split 42% fewer function
bodies across chunk boundaries compared to regex-only chunking.

Usage:
  qmd embed --chunk-strategy auto
  qmd query "search terms" --chunk-strategy auto

What's included:
- Language detection from file extension with support for TypeScript,
  JavaScript (including arrow functions and function expressions),
  Python, Go, and Rust
- Per-language tree-sitter queries with scored break points aligned to
  the existing markdown scale (class=100, function=90, type=80, import=60)
- AST break points merged with regex break points — highest score wins
  at each position, so embedded markdown (comments, docstrings) still
  benefits from regex patterns
- Refactored chunking core: chunkDocumentWithBreakPoints() extracted,
  mergeBreakPoints() added, async chunkDocumentAsync() wrapper for AST
- ChunkStrategy type ("auto" | "regex") threaded through
  generateEmbeddings(), hybridQuery(), structuredSearch(), CLI, and SDK
- getASTStatus() health check wired into `qmd status`
- Parse failures log a warning and fall back to regex — never crash
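
The "highest score wins at each position" merge could look roughly like the sketch below. The `BreakPoint` shape, the `mergeBreakPointsSketch` name, and the regex score of 20 are illustrative assumptions, not the actual signature of `mergeBreakPoints()`.

```typescript
// Assumed shape for illustration; the real QMD type may differ.
interface BreakPoint {
  pos: number; // character offset in the document
  score: number; // higher means a better place to split
  source: "regex" | "ast";
}

// Hypothetical merge: keep one break point per position, preferring the higher score.
function mergeBreakPointsSketch(
  regexPoints: BreakPoint[],
  astPoints: BreakPoint[],
): BreakPoint[] {
  const best = new Map<number, BreakPoint>();
  for (const point of [...regexPoints, ...astPoints]) {
    const existing = best.get(point.pos);
    if (existing === undefined || point.score > existing.score) {
      best.set(point.pos, point);
    }
  }
  // Return break points in document order.
  return [...best.values()].sort((a, b) => a.pos - b.pos);
}

const regexPoints: BreakPoint[] = [
  { pos: 0, score: 20, source: "regex" }, // illustrative regex score
  { pos: 120, score: 20, source: "regex" },
];
const astPoints: BreakPoint[] = [
  { pos: 120, score: 90, source: "ast" }, // function boundary
  { pos: 300, score: 100, source: "ast" }, // class boundary
];
const merged = mergeBreakPointsSketch(regexPoints, astPoints);
```

In this example the AST function boundary replaces the lower-scored regex break at offset 120, while the regex-only break at offset 0 survives untouched.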

Hardening:
- Grammar packages are optionalDependencies with pinned versions to
  prevent ABI breaks from semver drift
- web-tree-sitter is a direct dependency (pinned)
- Errors are logged (not silently swallowed) for debuggability
- Tested on both Node.js and Bun (Bun is actually faster)

Testing:
- 26 unit tests (test/ast.test.ts) — all 4 languages, error handling
- 7 integration tests (test/store.test.ts) — merge, equivalence, bypass
- Standalone test-ast-chunking.mjs with 63 synthetic tests and a
  real-collection performance scanner (npx tsx test-ast-chunking.mjs ~/code)
- Validated end-to-end with qmd embed + qmd query on QMD's own codebase
- Zero markdown regressions across all test paths

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 01:22:39 -04:00


# QMD - Query Markup Documents
Use Bun instead of Node.js (`bun` not `node`, `bun install` not `npm install`).
## Commands
```sh
qmd collection add . --name <name> # Create/index collection
qmd collection list # List all collections with details
qmd collection remove <name> # Remove a collection by name
qmd collection rename <old> <new> # Rename a collection
qmd ls [collection[/path]] # List collections or files in a collection
qmd context add [path] "text" # Add context for path (defaults to current dir)
qmd context list # List all contexts
qmd context check # Check for collections/paths missing context
qmd context rm <path> # Remove context
qmd get <file> # Get document by path or docid (#abc123)
qmd multi-get <pattern> # Get multiple docs by glob or comma-separated list
qmd status # Show index status and collections
qmd update [--pull] # Re-index all collections (--pull: git pull first)
qmd embed # Generate vector embeddings (uses node-llama-cpp)
qmd query <query> # Search with query expansion + reranking (recommended)
qmd search <query> # Full-text keyword search (BM25, no LLM)
qmd vsearch <query> # Vector similarity search (no reranking)
qmd mcp # Start MCP server (stdio transport)
qmd mcp --http [--port N] # Start MCP server (HTTP, default port 8181)
qmd mcp --http --daemon # Start as background daemon
qmd mcp stop # Stop background MCP daemon
```
## Collection Management
```sh
# List all collections
qmd collection list
# Create a collection with explicit name
qmd collection add ~/Documents/notes --name mynotes --mask '**/*.md'
# Remove a collection
qmd collection remove mynotes
# Rename a collection
qmd collection rename mynotes my-notes
# List all files in a collection
qmd ls mynotes
# List files with a path prefix
qmd ls journals/2025
qmd ls qmd://journals/2025
```
## Context Management
```sh
# Add context to current directory (auto-detects collection)
qmd context add "Description of these files"
# Add context to a specific path
qmd context add /subfolder "Description for subfolder"
# Add global context to all collections (system message)
qmd context add / "Always include this context"
# Add context using virtual paths
qmd context add qmd://journals/ "Context for entire journals collection"
qmd context add qmd://journals/2024 "Journal entries from 2024"
# List all contexts
qmd context list
# Check for collections or paths without context
qmd context check
# Remove context
qmd context rm qmd://journals/2024
qmd context rm / # Remove global context
```
## Document IDs (docid)
Each document has a unique short ID (docid) - the first 6 characters of its content hash.
Docids are shown in search results as `#abc123` and can be used with `get` and `multi-get`:
```sh
# Search returns docid in results
qmd search "query" --json
# Output: [{"docid": "#abc123", "score": 0.85, "file": "docs/readme.md", ...}]
# Get document by docid
qmd get "#abc123"
qmd get abc123 # Leading # is optional
# Docids also work in multi-get comma-separated lists
qmd multi-get "#abc123, #def456"
```
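As a rough illustration of how a docid relates to a content hash, the sketch below takes the first 6 hex characters of a digest and normalizes the optional leading `#`. The choice of SHA-256 and the helper names `docidFromContent` and `normalizeDocid` are assumptions; this document only says "content hash".

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: SHA-256 is an assumption, not confirmed by this document.
function docidFromContent(content: string): string {
  const hex = createHash("sha256").update(content, "utf8").digest("hex");
  return hex.slice(0, 6); // docid = first 6 characters of the content hash
}

// Hypothetical helper: accept "#abc123" or "abc123", since the leading # is optional.
function normalizeDocid(input: string): string {
  return input.trim().replace(/^#/, "").toLowerCase();
}

const docid = docidFromContent("# My note\nSome content\n");
const sameDocid = normalizeDocid(`#${docid}`);
```

Because the ID is derived from content, two files with identical content would get the same docid under this scheme, and editing a file would change its docid.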
## Options
```sh
# Search & retrieval
-c, --collection <name> # Restrict search to a collection (matches pwd suffix)
-n <num> # Number of results
--all # Return all matches
--min-score <num> # Minimum score threshold
--full # Show full document content
--line-numbers # Add line numbers to output
# Multi-get specific
-l <num> # Maximum lines per file
--max-bytes <num> # Skip files larger than this (default 10KB)
# Output formats (search and multi-get)
--json, --csv, --md, --xml, --files
```
## Development
```sh
bun src/cli/qmd.ts <command> # Run from source
bun link # Install globally as 'qmd'
```
## Tests
All tests live in `test/`. Run everything:
```sh
npx vitest run --reporter=verbose test/
bun test --preload ./src/test-preload.ts test/
```
## Architecture
- SQLite FTS5 for full-text search (BM25)
- sqlite-vec for vector similarity search
- node-llama-cpp for embeddings (embeddinggemma), reranking (qwen3-reranker), and query expansion (Qwen3)
- Reciprocal Rank Fusion (RRF) for combining results
- Smart chunking: 900 tokens/chunk with 15% overlap, prefers markdown headings as boundaries
- AST-aware chunking: use `--chunk-strategy auto` to chunk code files (.ts/.tsx/.js/.jsx/.py/.go/.rs) at function/class/import boundaries via tree-sitter. Default is `regex` (existing behavior). Markdown and unknown file types always use regex chunking.
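For readers unfamiliar with Reciprocal Rank Fusion, here is a minimal sketch of the general technique: each ranked list contributes `1 / (k + rank)` to a document's fused score. The function name `reciprocalRankFusion` and the constant `k = 60` (a commonly used default) are assumptions; QMD's actual implementation and parameters are not shown in this file.

```typescript
// Generic RRF sketch; not QMD's actual code.
function reciprocalRankFusion(
  rankings: string[][],
  k = 60, // assumed constant, commonly used for RRF
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Illustrative docids: one list from BM25, one from vector search.
const bm25Results = ["#aaa111", "#bbb222", "#ccc333"];
const vectorResults = ["#bbb222", "#ddd444", "#aaa111"];
const fused = reciprocalRankFusion([bm25Results, vectorResults]);
```

Here `#bbb222` ranks first after fusion because it sits near the top of both lists, even though it is not first in the BM25 list.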
## Important: Do NOT run automatically
- Never run `qmd collection add`, `qmd embed`, or `qmd update` automatically
- Never modify the SQLite database directly
- Write out example commands for the user to run manually
- Index is stored at `~/.cache/qmd/index.sqlite`
## Do NOT compile
- Never run `bun build --compile` - it overwrites the shell wrapper and breaks sqlite-vec
- The `qmd` file is a shell script that runs compiled JS from `dist/` - do not replace it
- `npm run build` compiles TypeScript to `dist/` via `tsc -p tsconfig.build.json`
## Releasing
Use `/release <version>` to cut a release. Full changelog standards,
release workflow, and git hook setup are documented in the
[release skill](skills/release/SKILL.md).
Key points:
- Add changelog entries under `## [Unreleased]` **as you make changes**
- The release script renames `[Unreleased]` → `[X.Y.Z] - date` at release time
- Credit external PRs with `#NNN (thanks @username)`
- GitHub releases roll up the full minor series (e.g. 1.2.0 through 1.2.3)