Document query grammar and add skill helpers
parent 0e0feb6f2b
commit 64ef25e1f6
@@ -4,15 +4,15 @@

## [1.1.0] - 2026-02-20

QMD now speaks in **query documents** — structured multi-line queries where each line is typed (`lex:`, `vec:`, `hyde:`, `expand:`), combining keyword precision with semantic recall. A single plain query still works exactly as before. Lex now supports quoted phrases and negation (`"C++ performance" -sports -athlete`), making intent-aware disambiguation practical. The formal query grammar is documented in `docs/SYNTAX.md`.
QMD now speaks in **query documents** — structured multi-line queries where every line is typed (`lex:`, `vec:`, `hyde:`), combining keyword precision with semantic recall. A single plain query still works exactly as before (it's treated as an implicit `expand:` and auto-expanded by the LLM). Lex now supports quoted phrases and negation (`"C++ performance" -sports -athlete`), making intent-aware disambiguation practical. The formal query grammar is documented in `docs/SYNTAX.md`.

The npm package now uses the standard `#!/usr/bin/env node` bin convention, replacing the custom bash wrapper. This fixes native module ABI mismatches when installed via bun and works on any platform with node >= 22 on PATH.

### Changes

- **Query document format**: multi-line queries with typed sub-queries (`lex:`, `vec:`, `hyde:`, `expand:`). Plain queries remain the default (`expand:` implicit). First sub-query gets 2× fusion weight — put your strongest signal first. Formal grammar in `docs/SYNTAX.md`.
- **Query document format**: multi-line queries with typed sub-queries (`lex:`, `vec:`, `hyde:`). Plain queries remain the default (`expand:` implicit, but not written inside the document). First sub-query gets 2× fusion weight — put your strongest signal first. Formal grammar in `docs/SYNTAX.md`.
- **Lex syntax**: full BM25 operator support. `"exact phrase"` for verbatim matching; `-term` and `-"phrase"` for exclusions. Essential for disambiguation when a term is overloaded across domains (e.g. `performance -sports -athlete`).
- **`expand:` type**: explicit auto-expansion via local LLM. Max one per query document. Identical to the prior default behavior for plain queries.
- **`expand:` shortcut**: send a single plain query (or start the document with `expand:` on its only line) to auto-expand via the local LLM. Query documents themselves are limited to `lex`, `vec`, and `hyde` lines.
- **MCP `query` tool** (renamed from `structured_search`): rewrote the tool description to fully teach AI agents the query document format, lex syntax, and combination strategy. Includes worked examples with intent-aware lex.
- **HTTP `/query` endpoint** (renamed from `/search`; `/search` kept as silent alias).
- **`collections` array filter**: filter by multiple collections in a single query (`collections: ["notes", "brain"]`). Removed the single `collection` string param — array only.
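
As a rough illustration of the query rules in the entries above, here is a minimal TypeScript sketch. `classifyQuery` and its return shape are hypothetical names, not part of QMD's API. The checks only mirror what this changelog states: a single plain or `expand:` line is an expand query, and otherwise every non-empty line needs a `lex:`, `vec:`, or `hyde:` prefix.

```typescript
type SubQueryType = "lex" | "vec" | "hyde";

interface TypedLine {
  type: SubQueryType;
  text: string;
}

type ParsedQuery =
  | { kind: "expand"; text: string }
  | { kind: "document"; lines: TypedLine[] };

// Hypothetical helper (not QMD's actual parser): classifies a raw query
// according to the rules described in the changelog entry above.
function classifyQuery(raw: string): ParsedQuery {
  // Trim each line and ignore empty ones.
  const lines = raw
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);

  if (lines.length === 0) {
    throw new Error("Query is empty.");
  }

  const typedRe = /^(lex|vec|hyde):\s*/i;
  const expandRe = /^expand:\s*/i;

  // A single line that is plain or prefixed with `expand:` is an expand query.
  if (lines.length === 1 && !typedRe.test(lines[0] ?? "")) {
    const text = (lines[0] ?? "").replace(expandRe, "").trim();
    if (text.length === 0) {
      throw new Error("expand: query must include text.");
    }
    return { kind: "expand", text };
  }

  // Otherwise every non-empty line must be typed; `expand:` is rejected here
  // because it does not match the lex/vec/hyde prefix pattern.
  const typed = lines.map((line, idx): TypedLine => {
    const match = line.match(typedRe);
    if (!match) {
      throw new Error(`Non-empty line ${idx + 1} needs a lex:, vec:, or hyde: prefix.`);
    }
    const text = line.slice(match[0].length).trim();
    if (text.length === 0) {
      throw new Error(`Non-empty line ${idx + 1} must include text.`);
    }
    return { type: (match[1] ?? "").toLowerCase() as SubQueryType, text };
  });

  return { kind: "document", lines: typed };
}

console.log(classifyQuery("how does authentication work").kind); // prints: expand
console.log(classifyQuery("lex: CAP theorem\nvec: consistency tradeoffs").kind); // prints: document
```

The sketch deliberately returns a tagged union rather than `null` for expand queries, which is a design choice of the example and not necessarily how the CLI represents them.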
@@ -362,4 +362,3 @@ notes, journals, and meeting transcripts.
[Unreleased]: https://github.com/tobi/qmd/compare/v1.0.0...HEAD
[1.0.0]: https://github.com/tobi/qmd/releases/tag/v1.0.0
[0.9.0]: https://github.com/tobi/qmd/compare/v0.8.0...v0.9.0

@@ -5,9 +5,12 @@ QMD queries are structured documents with typed sub-queries. Each line specifies
## Grammar

```ebnf
query_document = { line } ;
line           = [ type ":" ] text newline ;
type           = "lex" | "vec" | "hyde" | "expand" ;
query          = expand_query | query_document ;
expand_query   = text | explicit_expand ;
explicit_expand= "expand:" text ;
query_document = { typed_line } ;
typed_line     = type ":" text newline ;
type           = "lex" | "vec" | "hyde" ;
text           = quoted_phrase | plain_text ;
quoted_phrase  = '"' { character } '"' ;
plain_text     = { character } ;
@@ -21,14 +24,13 @@ newline = "\n" ;
| `lex` | BM25 | Keyword search with exact matching |
| `vec` | Vector | Semantic similarity search |
| `hyde` | Vector | Hypothetical document embedding |
| `expand` | LLM | Auto-expand into lex/vec/hyde via local model |

## Default Behavior

A query without any type prefix is treated as `expand:` — it gets passed to the query expansion model which generates lex, vec, and hyde variations automatically.
A QMD query is either a single expand query or a multi-line query document. Any single-line query with no prefix is treated as an expand query and passed to the expansion model, which emits lex, vec, and hyde variants automatically.

```
# These are equivalent:
# These are equivalent and cannot be combined with typed lines:
how does authentication work
expand: how does authentication work
```
@@ -89,17 +91,20 @@ hyde: The API implements rate limiting using a token bucket algorithm...

## Expand Queries

Use `expand:` to leverage the local query expansion model. Limited to one per query document.
An expand query stands alone; it's not mixed with typed lines. You can either rely on the default untyped form or add the explicit `expand:` prefix:

```
expand: error handling best practices
# equivalent
error handling best practices
```

This generates lex, vec, and hyde variations automatically. Useful when you don't know the exact terms.
Both forms call the local query expansion model, which generates lex, vec, and hyde variations automatically.

## Constraints

- Maximum one `expand:` query per document
- Top-level query must be either a standalone expand query or a multi-line document
- Query documents allow only `lex`, `vec`, and `hyde` typed lines (no `expand:` inside)
- `lex` syntax (`-term`, `"phrase"`) only works in lex queries
- Empty lines are ignored
- Leading/trailing whitespace is trimmed

@@ -37,7 +37,6 @@ Local search engine for markdown content.
| `lex` | BM25 | Keywords — exact terms, names, code |
| `vec` | Vector | Question — natural language |
| `hyde` | Vector | Answer — hypothetical result (50-100 words) |
| `expand` | LLM | Auto-expand via local model (max 1 per query) |

### Writing Good Queries

@@ -57,16 +56,16 @@ Local search engine for markdown content.
- Use the vocabulary you expect in the result

**expand (auto-expand)**
- Let the local LLM generate lex/vec/hyde variations
- Good when you don't know exact terms
- Max one expand: per query
- Use a single-line query (implicit) or `expand: question` on its own line
- Lets the local LLM generate lex/vec/hyde variations
- Do not mix `expand:` with other typed lines — it's either a standalone expand query or a full query document

### Combining Types

| Goal | Approach |
|------|----------|
| Know exact terms | `lex` only |
| Don't know vocabulary | `vec` or `expand` |
| Don't know vocabulary | Use a single-line query (implicit `expand:`) or `vec` |
| Best recall | `lex` + `vec` |
| Complex topic | `lex` + `vec` + `hyde` |

@@ -107,6 +106,8 @@ qmd query $'lex: X\nvec: Y' # Structured
qmd query $'expand: question' # Explicit expand
qmd search "keywords" # BM25 only (no LLM)
qmd get "#abc123" # By docid
qmd multi-get "journals/2026-*.md" -l 40 # Batch pull snippets by glob
qmd multi-get notes/foo.md,notes/bar.md # Comma-separated list, preserves order
```
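
The structured form above passes a multi-line query document as one shell argument. If the document is assembled in code rather than by hand, a sketch like the following may help. `buildQueryDocument` and the `SubQuery` shape are hypothetical names for illustration and are not part of QMD.

```typescript
type SubQueryType = "lex" | "vec" | "hyde";

interface SubQuery {
  type: SubQueryType;
  query: string;
}

// Hypothetical helper (not part of QMD): joins typed sub-queries into a
// query document, one `type: text` line per sub-query, preserving order.
function buildQueryDocument(subQueries: SubQuery[]): string {
  return subQueries
    .map(({ type, query }) => {
      const text = query.trim();
      // Each sub-query has to fit on one line of the document.
      if (text.length === 0 || /[\r\n]/.test(text)) {
        throw new Error(`${type}: text must be non-empty and on a single line`);
      }
      return `${type}: ${text}`;
    })
    .join("\n");
}

const doc = buildQueryDocument([
  { type: "lex", query: '"C++ performance" -sports -athlete' },
  { type: "vec", query: "how to make C++ code run faster" },
]);
console.log(doc);
// lex: "C++ performance" -sports -athlete
// vec: how to make C++ code run faster
```

The resulting string can be passed as the single argument to `qmd query`. Order is preserved on purpose, since the changelog for this release notes that the first sub-query gets 2× fusion weight.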

## HTTP API

15
src/mcp.ts
@@ -120,7 +120,7 @@ function buildInstructions(store: Store): string {
|
||||
|
||||
// --- Search tool ---
|
||||
lines.push("");
|
||||
lines.push("Search: Use `query` with sub-queries (lex/vec/hyde/expand):");
|
||||
lines.push("Search: Use `query` with sub-queries (lex/vec/hyde):");
|
||||
lines.push(" - type:'lex' — BM25 keyword search (exact terms, fast)");
|
||||
lines.push(" - type:'vec' — semantic vector search (meaning-based)");
|
||||
lines.push(" - type:'hyde' — hypothetical document (write what the answer looks like)");
|
||||
@@ -229,10 +229,9 @@ function createMcpServer(store: Store): McpServer {
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
const subSearchSchema = z.object({
|
||||
type: z.enum(['lex', 'vec', 'hyde', 'expand']).describe(
|
||||
"lex = BM25 keywords (supports \"phrase\" and -negation), " +
|
||||
"vec = semantic question, hyde = hypothetical answer passage, " +
|
||||
"expand = auto-expand via LLM (max 1 per query)"
|
||||
type: z.enum(['lex', 'vec', 'hyde']).describe(
|
||||
"lex = BM25 keywords (supports \"phrase\" and -negation); " +
|
||||
"vec = semantic question; hyde = hypothetical answer passage"
|
||||
),
|
||||
query: z.string().describe(
|
||||
"The query text. For lex: use keywords, \"quoted phrases\", and -negation. " +
|
||||
@@ -266,8 +265,6 @@ Good lex examples:
|
||||
**hyde** — Hypothetical document. Write 50-100 words that look like the answer. Often the most powerful for nuanced topics.
|
||||
- \`The rate limiter uses a token bucket algorithm. When a client exceeds 100 req/min, subsequent requests return 429 until the window resets.\`
|
||||
|
||||
**expand** — Auto-expand via local LLM. Generates lex+vec+hyde variations automatically. Max one per query. Useful when you don't know the exact terms.
|
||||
|
||||
## Strategy
|
||||
|
||||
Combine types for best results. First sub-query gets 2× weight — put your strongest signal first.
|
||||
@@ -278,7 +275,7 @@ Combine types for best results. First sub-query gets 2× weight — put your str
|
||||
| Concept search | \`vec\` only |
|
||||
| Best recall | \`lex\` + \`vec\` |
|
||||
| Complex/nuanced | \`lex\` + \`vec\` + \`hyde\` |
|
||||
| Unknown vocabulary | \`expand\` |
|
||||
| Unknown vocabulary | Use a standalone natural-language query (no typed lines) so the server can auto-expand it |
|
||||
|
||||
## Examples
|
||||
|
||||
@@ -306,7 +303,7 @@ Intent-aware lex (C++ performance, not sports):
|
||||
annotations: { readOnlyHint: true, openWorldHint: false },
|
||||
inputSchema: {
|
||||
searches: z.array(subSearchSchema).min(1).max(10).describe(
|
||||
"Sub-queries to execute. First gets 2x weight. Max one expand: per query."
|
||||
"Typed sub-queries to execute (lex/vec/hyde). First gets 2x weight."
|
||||
),
|
||||
limit: z.number().optional().default(10).describe("Max results (default: 10)"),
|
||||
minScore: z.number().optional().default(0).describe("Min relevance 0-1 (default: 0)"),
|
||||
|
||||
209
src/qmd.ts
@@ -1950,46 +1950,53 @@ function filterByCollections<T extends { filepath?: string; file?: string }>(res
|
||||
* "CAP\nconsistency" -> throws (multiple plain lines)
|
||||
*/
|
||||
function parseStructuredQuery(query: string): StructuredSubSearch[] | null {
|
||||
const lines = query.split('\n').map(l => l.trim()).filter(l => l.length > 0);
|
||||
if (lines.length === 0) return null;
|
||||
const rawLines = query.split('\n').map((line, idx) => ({
|
||||
raw: line,
|
||||
trimmed: line.trim(),
|
||||
number: idx + 1,
|
||||
})).filter(line => line.trimmed.length > 0);
|
||||
|
||||
const prefixRe = /^(lex|vec|hyde|expand):\s*/i;
|
||||
const searches: StructuredSubSearch[] = [];
|
||||
const plainLines: string[] = [];
|
||||
if (rawLines.length === 0) return null;
|
||||
|
||||
for (const line of lines) {
|
||||
const match = line.match(prefixRe);
|
||||
if (match) {
|
||||
const type = match[1]!.toLowerCase() as 'lex' | 'vec' | 'hyde' | 'expand';
|
||||
const text = line.slice(match[0].length).trim();
|
||||
if (text.length > 0) {
|
||||
searches.push({ type, query: text });
|
||||
const prefixRe = /^(lex|vec|hyde):\s*/i;
|
||||
const expandRe = /^expand:\s*/i;
|
||||
const typed: StructuredSubSearch[] = [];
|
||||
|
||||
for (const line of rawLines) {
|
||||
if (expandRe.test(line.trimmed)) {
|
||||
if (rawLines.length > 1) {
|
||||
throw new Error(`Line ${line.number} starts with expand:, but query documents cannot mix expand with typed lines. Submit a single expand query instead.`);
|
||||
}
|
||||
} else {
|
||||
plainLines.push(line);
|
||||
const text = line.trimmed.replace(expandRe, '').trim();
|
||||
if (!text) {
|
||||
throw new Error('expand: query must include text.');
|
||||
}
|
||||
return null; // treat as standalone expand query
|
||||
}
|
||||
|
||||
const match = line.trimmed.match(prefixRe);
|
||||
if (match) {
|
||||
const type = match[1]!.toLowerCase() as 'lex' | 'vec' | 'hyde';
|
||||
const text = line.trimmed.slice(match[0].length).trim();
|
||||
if (!text) {
|
||||
throw new Error(`Line ${line.number} (${type}:) must include text.`);
|
||||
}
|
||||
if (/\r|\n/.test(text)) {
|
||||
throw new Error(`Line ${line.number} (${type}:) contains a newline. Keep each query on a single line.`);
|
||||
}
|
||||
typed.push({ type, query: text, line: line.number });
|
||||
continue;
|
||||
}
|
||||
|
||||
if (rawLines.length === 1) {
|
||||
// Single plain line -> implicit expand
|
||||
return null;
|
||||
}
|
||||
|
||||
throw new Error(`Line ${line.number} is missing a lex:/vec:/hyde: prefix. Each line in a query document must start with one.`);
|
||||
}
|
||||
|
||||
// All plain lines, no prefixes -> null (use normal expansion)
|
||||
if (searches.length === 0 && plainLines.length === 1) {
|
||||
return null;
|
||||
}
|
||||
|
||||
// Multiple plain lines without prefixes -> ambiguous, error
|
||||
if (plainLines.length > 1) {
|
||||
throw new Error(
|
||||
`Ambiguous query: multiple lines without lex:/vec:/hyde: prefix.\n` +
|
||||
`Either use a single line (for query expansion) or prefix each line.\n` +
|
||||
`Example:\n lex: keyword terms\n vec: natural language question\n hyde: hypothetical answer passage`
|
||||
);
|
||||
}
|
||||
|
||||
// Mix of prefixed and one plain line -> treat plain as lex
|
||||
if (plainLines.length === 1) {
|
||||
searches.unshift({ type: 'lex', query: plainLines[0]! });
|
||||
}
|
||||
|
||||
return searches.length > 0 ? searches : null;
|
||||
return typed.length > 0 ? typed : null;
|
||||
}
|
||||
|
||||
function search(query: string, opts: OutputOptions): void {
|
||||
@@ -2239,6 +2246,7 @@ function parseCLI() {
|
||||
},
|
||||
help: { type: "boolean", short: "h" },
|
||||
version: { type: "boolean", short: "v" },
|
||||
skill: { type: "boolean" },
|
||||
// Search options
|
||||
n: { type: "string" },
|
||||
"min-score": { type: "string" },
|
||||
@@ -2311,58 +2319,104 @@ function parseCLI() {
|
||||
};
|
||||
}
|
||||
|
||||
function showSkill(): void {
  const scriptDir = dirname(fileURLToPath(import.meta.url));
  const relativePath = pathJoin("skills", "qmd", "SKILL.md");
  const skillPath = pathJoin(scriptDir, "..", relativePath);

  console.log(`QMD Skill (${relativePath})`);
  console.log(`Location: ${skillPath}`);
  console.log("");

  if (!existsSync(skillPath)) {
    console.error("SKILL.md not found. If you built from source, ensure skills/qmd/SKILL.md exists.");
    return;
  }

  const content = readFileSync(skillPath, "utf-8");
  process.stdout.write(content.endsWith("\n") ? content : content + "\n");
}
|
||||
|
||||
function showHelp(): void {
|
||||
console.log("qmd — Quick Markdown Search");
|
||||
console.log("");
|
||||
console.log("Usage:");
|
||||
console.log(" qmd collection add [path] --name <name> --mask <pattern> - Create/index collection");
|
||||
console.log(" qmd collection list - List all collections with details");
|
||||
console.log(" qmd collection remove <name> - Remove a collection by name");
|
||||
console.log(" qmd collection rename <old> <new> - Rename a collection");
|
||||
console.log(" qmd ls [collection[/path]] - List collections or files in a collection");
|
||||
console.log(" qmd context add [path] \"text\" - Add context for path (defaults to current dir)");
|
||||
console.log(" qmd context list - List all contexts");
|
||||
console.log(" qmd context rm <path> - Remove context");
|
||||
console.log(" qmd get <file>[:line] [-l N] [--from N] - Get document (optionally from line, max N lines)");
|
||||
console.log(" qmd multi-get <pattern> [-l N] [--max-bytes N] - Get multiple docs by glob or comma-separated list");
|
||||
console.log(" qmd status - Show index status and collections");
|
||||
console.log(" qmd update [--pull] - Re-index all collections (--pull: git pull first)");
|
||||
console.log(" qmd embed [-f] - Create vector embeddings (900 tokens/chunk, 15% overlap)");
|
||||
console.log(" qmd cleanup - Remove cache and orphaned data, vacuum DB");
|
||||
console.log(" qmd query <query> - Search with query expansion + reranking (recommended)");
|
||||
console.log(" qmd query 'lex:..\\nvec:...' - Structured search (you provide lex/vec/hyde queries)");
|
||||
console.log(" qmd search <query> - Full-text keyword search (BM25, no LLM)");
|
||||
console.log(" qmd vsearch <query> - Vector similarity search (no reranking)");
|
||||
console.log(" qmd mcp - Start MCP server (stdio transport)");
|
||||
console.log(" qmd mcp --http [--port N] - Start MCP server (HTTP transport, default port 8181)");
|
||||
console.log(" qmd mcp --http --daemon - Start MCP server as background daemon");
|
||||
console.log(" qmd mcp stop - Stop background MCP daemon");
|
||||
console.log(" qmd <command> [options]");
|
||||
console.log("");
|
||||
console.log("Primary commands:");
|
||||
console.log(" qmd query <query> - Hybrid search with auto expansion + reranking (recommended)");
|
||||
console.log(" qmd query 'lex:..\\nvec:...' - Structured query document (you provide lex/vec/hyde lines)");
|
||||
console.log(" qmd search <query> - Full-text BM25 keywords (no LLM)");
|
||||
console.log(" qmd vsearch <query> - Vector similarity only");
|
||||
console.log(" qmd get <file>[:line] [-l N] - Show a single document, optional line slice");
|
||||
console.log(" qmd multi-get <pattern> - Batch fetch via glob or comma-separated list");
|
||||
console.log(" qmd mcp - Start the MCP server (stdio transport for AI agents)");
|
||||
console.log("");
|
||||
console.log("Collections & context:");
|
||||
console.log(" qmd collection add/list/remove/rename/show - Manage indexed folders");
|
||||
console.log(" qmd context add/list/rm - Attach human-written summaries");
|
||||
console.log(" qmd ls [collection[/path]] - Inspect indexed files");
|
||||
console.log("");
|
||||
console.log("Maintenance:");
|
||||
console.log(" qmd status - View index + collection health");
|
||||
console.log(" qmd update [--pull] - Re-index collections (optionally git pull first)");
|
||||
console.log(" qmd embed [-f] - Generate/refresh vector embeddings");
|
||||
console.log(" qmd cleanup - Clear caches, vacuum DB");
|
||||
console.log("");
|
||||
console.log("Query syntax (qmd query):");
|
||||
console.log(" QMD queries are either a single expand query (no prefix) or a multi-line");
|
||||
console.log(" document where every line is typed with lex:, vec:, or hyde:. This grammar");
|
||||
console.log(" matches the docs in docs/SYNTAX.md and is enforced in the CLI.");
|
||||
console.log("");
|
||||
const grammar = [
|
||||
`query = expand_query | query_document ;`,
|
||||
`expand_query = text | explicit_expand ;`,
|
||||
`explicit_expand= "expand:" text ;`,
|
||||
`query_document = { typed_line } ;`,
|
||||
`typed_line = type ":" text newline ;`,
|
||||
`type = "lex" | "vec" | "hyde" ;`,
|
||||
`text = quoted_phrase | plain_text ;`,
|
||||
`quoted_phrase = '"' { character } '"' ;`,
|
||||
`plain_text = { character } ;`,
|
||||
`newline = "\\n" ;`,
|
||||
];
|
||||
console.log(" Grammar:");
|
||||
for (const line of grammar) {
|
||||
console.log(` ${line}`);
|
||||
}
|
||||
console.log("");
|
||||
console.log(" Examples:");
|
||||
console.log(" qmd query \"how does auth work\" # single-line → implicit expand");
|
||||
console.log(" qmd query $'lex: CAP theorem\\nvec: consistency' # typed query document");
|
||||
console.log(" qmd query $'lex: \"exact matches\" sports -baseball' # phrase + negation lex search");
|
||||
console.log(" qmd query $'hyde: Hypothetical answer text' # hyde-only document");
|
||||
console.log("");
|
||||
console.log(" Constraints:");
|
||||
console.log(" - Standalone expand queries cannot mix with typed lines.");
|
||||
console.log(" - Query documents allow only lex:, vec:, or hyde: prefixes.");
|
||||
console.log(" - Each typed line must be single-line text with balanced quotes.");
|
||||
console.log("");
|
||||
console.log("AI agents & integrations:");
|
||||
console.log(" - Run `qmd mcp` to expose the MCP server (stdio) to agents/IDEs.");
|
||||
console.log(" - `qmd --skill` prints the packaged skills/qmd/SKILL.md (path + contents).");
|
||||
console.log(" - Advanced: `qmd mcp --http ...` and `qmd mcp --http --daemon` are optional for custom transports.");
|
||||
console.log("");
|
||||
console.log("Global options:");
|
||||
console.log(" --index <name> - Use custom index name (default: index)");
|
||||
console.log(" --index <name> - Use a named index (default: index)");
|
||||
console.log("");
|
||||
console.log("Search options:");
|
||||
console.log(" -n <num> - Number of results (default: 5, or 20 for --files)");
|
||||
console.log(" --all - Return all matches (use with --min-score to filter)");
|
||||
console.log(" -n <num> - Max results (default 5, or 20 for --files/--json)");
|
||||
console.log(" --all - Return all matches (pair with --min-score)");
|
||||
console.log(" --min-score <num> - Minimum similarity score");
|
||||
console.log(" --full - Output full document instead of snippet");
|
||||
console.log(" --line-numbers - Add line numbers to output");
|
||||
console.log(" --files - Output docid,score,filepath,context (default: 20 results)");
|
||||
console.log(" --json - JSON output with snippets (default: 20 results)");
|
||||
console.log(" --csv - CSV output with snippets");
|
||||
console.log(" --md - Markdown output");
|
||||
console.log(" --xml - XML output");
|
||||
console.log(" -c, --collection <name> - Filter results to a specific collection");
|
||||
console.log("");
|
||||
console.log("Structured queries (qmd query):");
|
||||
console.log(" Prefix lines with lex:, vec:, or hyde: to skip automatic expansion.");
|
||||
console.log(" lex: BM25 keyword search (exact terms)");
|
||||
console.log(" vec: Vector similarity (natural language question)");
|
||||
console.log(" hyde: Vector similarity (hypothetical answer passage)");
|
||||
console.log(" Example: qmd query $'lex: CAP theorem\\nvec: consistency vs availability tradeoff'");
|
||||
console.log(" --line-numbers - Include line numbers in output");
|
||||
console.log(" --files | --json | --csv | --md | --xml - Output format");
|
||||
console.log(" -c, --collection <name> - Filter by one or more collections");
|
||||
console.log("");
|
||||
console.log("Multi-get options:");
|
||||
console.log(" -l <num> - Maximum lines per file");
|
||||
console.log(" --max-bytes <num> - Skip files larger than N bytes (default: 10240)");
|
||||
console.log(" --json/--csv/--md/--xml/--files - Output format (same as search)");
|
||||
console.log(" --max-bytes <num> - Skip files larger than N bytes (default 10240)");
|
||||
console.log(" --json/--csv/--md/--xml/--files - Same formats as search");
|
||||
console.log("");
|
||||
console.log(`Index: ${getDbPath()}`);
|
||||
}
|
||||
@@ -2398,6 +2452,11 @@ if (isMain) {
|
||||
process.exit(0);
|
||||
}
|
||||
|
||||
if (cli.values.skill) {
|
||||
showSkill();
|
||||
process.exit(0);
|
||||
}
|
||||
|
||||
if (!cli.command || cli.values.help) {
|
||||
showHelp();
|
||||
process.exit(cli.values.help ? 0 : 1);
|
||||
|
||||
52
src/store.ts
@@ -2082,6 +2082,17 @@ export function validateSemanticQuery(query: string): string | null {
|
||||
return null;
|
||||
}
|
||||
|
||||
export function validateLexQuery(query: string): string | null {
  if (/[\r\n]/.test(query)) {
    return 'Lex queries must be a single line. Remove newline characters or split into separate lex: lines.';
  }
  const quoteCount = (query.match(/"/g) ?? []).length;
  if (quoteCount % 2 === 1) {
    return 'Lex query has an unmatched double quote ("). Add the closing quote or remove it.';
  }
  return null;
}
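
To make the two checks in `validateLexQuery` concrete, here is a small self-contained sketch. It repeats the function as it appears in this diff so the example runs on its own, then calls it on a few inputs. The sample queries are made up for illustration and are not taken from the test suite.

```typescript
// Copy of validateLexQuery as shown in this diff, repeated so the example
// is self-contained. Returns an error message, or null when the query is valid.
function validateLexQuery(query: string): string | null {
  if (/[\r\n]/.test(query)) {
    return 'Lex queries must be a single line. Remove newline characters or split into separate lex: lines.';
  }
  const quoteCount = (query.match(/"/g) ?? []).length;
  if (quoteCount % 2 === 1) {
    return 'Lex query has an unmatched double quote ("). Add the closing quote or remove it.';
  }
  return null;
}

// Illustrative inputs (assumptions, not from the project's tests).
console.log(validateLexQuery('"C++ performance" -sports')); // null: quotes are balanced
console.log(validateLexQuery('"C++ performance -sports'));  // unmatched-quote message
console.log(validateLexQuery("CAP theorem\nconsistency"));  // single-line message
```

Note that the quote check only counts double quotes, so it catches an odd number of quotes but says nothing about where they are placed.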
|
||||
|
||||
export function searchFTS(db: Database, query: string, limit: number = 20, collectionName?: string): SearchResult[] {
|
||||
const ftsQuery = buildFTS5Query(query);
|
||||
if (!ftsQuery) return [];
|
||||
@@ -3164,10 +3175,12 @@ export async function vectorSearchQuery(
|
||||
* Matches the format used in QMD training data.
|
||||
*/
|
||||
export interface StructuredSubSearch {
  /** Search type: 'lex' for BM25, 'vec' for semantic, 'hyde' for hypothetical, 'expand' for LLM expansion */
  type: 'lex' | 'vec' | 'hyde' | 'expand';
  /** Search type: 'lex' for BM25, 'vec' for semantic, 'hyde' for hypothetical */
  type: 'lex' | 'vec' | 'hyde';
  /** The search query text */
  query: string;
  /** Optional line number for error reporting (CLI parser) */
  line?: number;
}
|
||||
|
||||
export interface StructuredSearchOptions {
|
||||
@@ -3212,36 +3225,25 @@ export async function structuredSearch(
|
||||
|
||||
if (searches.length === 0) return [];
|
||||
|
||||
// Validate: max one expand query, semantic queries don't use lex syntax
|
||||
const expandSearches = searches.filter(s => s.type === 'expand');
|
||||
if (expandSearches.length > 1) {
|
||||
throw new Error('Maximum one expand: query per document');
|
||||
}
|
||||
// Validate queries before executing
|
||||
for (const search of searches) {
|
||||
if (search.type === 'vec' || search.type === 'hyde') {
|
||||
const location = search.line ? `Line ${search.line}` : 'Structured search';
|
||||
if (/[\r\n]/.test(search.query)) {
|
||||
throw new Error(`${location} (${search.type}): queries must be single-line. Remove newline characters.`);
|
||||
}
|
||||
if (search.type === 'lex') {
|
||||
const error = validateLexQuery(search.query);
|
||||
if (error) {
|
||||
throw new Error(`${location} (lex): ${error}`);
|
||||
}
|
||||
} else if (search.type === 'vec' || search.type === 'hyde') {
|
||||
const error = validateSemanticQuery(search.query);
|
||||
if (error) {
|
||||
throw new Error(`Invalid ${search.type} query: ${error}`);
|
||||
throw new Error(`${location} (${search.type}): ${error}`);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Process expand: queries by calling the query expansion model
|
||||
let processedSearches = searches.filter(s => s.type !== 'expand');
|
||||
if (expandSearches.length > 0) {
|
||||
const expandQuery = expandSearches[0]!.query;
|
||||
const expanded = await store.expandQuery(expandQuery);
|
||||
// Add expanded queries (lex, vec, hyde from the model)
|
||||
for (const exp of expanded) {
|
||||
processedSearches.push({ type: exp.type as 'lex' | 'vec' | 'hyde', query: exp.text });
|
||||
}
|
||||
// Also add original as lex for strong signal matching
|
||||
processedSearches.unshift({ type: 'lex', query: expandQuery });
|
||||
}
|
||||
|
||||
// Use processed searches from here on
|
||||
searches = processedSearches;
|
||||
|
||||
const rankedLists: RankedResult[][] = [];
|
||||
const docidMap = new Map<string, string>(); // filepath -> docid
|
||||
const hasVectors = !!store.db.prepare(
|
||||
|
||||
@@ -17,6 +17,7 @@ import {
|
||||
createStore,
|
||||
structuredSearch,
|
||||
validateSemanticQuery,
|
||||
validateLexQuery,
|
||||
type StructuredSubSearch,
|
||||
type Store,
|
||||
} from "../src/store.js";
|
||||
@@ -26,47 +27,53 @@ import { disposeDefaultLlamaCpp } from "../src/llm.js";
|
||||
// parseStructuredQuery Tests (CLI Parser)
|
||||
// =============================================================================
|
||||
|
||||
/**
|
||||
* Parse structured search query syntax.
|
||||
* This is a copy of the function from qmd.ts for isolated testing.
|
||||
*/
|
||||
function parseStructuredQuery(query: string): StructuredSubSearch[] | null {
|
||||
const lines = query.split('\n').map(l => l.trim()).filter(l => l.length > 0);
|
||||
if (lines.length === 0) return null;
|
||||
const rawLines = query.split('\n').map((line, idx) => ({
|
||||
raw: line,
|
||||
trimmed: line.trim(),
|
||||
number: idx + 1,
|
||||
})).filter(line => line.trimmed.length > 0);
|
||||
|
||||
if (rawLines.length === 0) return null;
|
||||
|
||||
const prefixRe = /^(lex|vec|hyde):\s*/i;
|
||||
const searches: StructuredSubSearch[] = [];
|
||||
const plainLines: string[] = [];
|
||||
const expandRe = /^expand:\s*/i;
|
||||
const typed: StructuredSubSearch[] = [];
|
||||
|
||||
for (const line of lines) {
|
||||
const match = line.match(prefixRe);
|
||||
for (const line of rawLines) {
|
||||
if (expandRe.test(line.trimmed)) {
|
||||
if (rawLines.length > 1) {
|
||||
throw new Error(`Line ${line.number} starts with expand:, but query documents cannot mix expand with typed lines. Submit a single expand query instead.`);
|
||||
}
|
||||
const text = line.trimmed.replace(expandRe, '').trim();
|
||||
if (!text) {
|
||||
throw new Error('expand: query must include text.');
|
||||
}
|
||||
return null;
|
||||
}
|
||||
|
||||
const match = line.trimmed.match(prefixRe);
|
||||
if (match) {
|
||||
const type = match[1]!.toLowerCase() as 'lex' | 'vec' | 'hyde';
|
||||
const text = line.slice(match[0].length).trim();
|
||||
if (text.length > 0) {
|
||||
searches.push({ type, query: text });
|
||||
const text = line.trimmed.slice(match[0].length).trim();
|
||||
if (!text) {
|
||||
throw new Error(`Line ${line.number} (${type}:) must include text.`);
|
||||
}
|
||||
} else {
|
||||
plainLines.push(line);
|
||||
if (/\r|\n/.test(text)) {
|
||||
throw new Error(`Line ${line.number} (${type}:) contains a newline. Keep each query on a single line.`);
|
||||
}
|
||||
typed.push({ type, query: text, line: line.number });
|
||||
continue;
|
||||
}
|
||||
|
||||
if (rawLines.length === 1) {
|
||||
return null;
|
||||
}
|
||||
|
||||
throw new Error(`Line ${line.number} is missing a lex:/vec:/hyde: prefix. Each line in a query document must start with one.`);
|
||||
}
|
||||
|
||||
// All plain lines, no prefixes -> null (use normal expansion)
|
||||
if (searches.length === 0 && plainLines.length === 1) {
|
||||
return null;
|
||||
}
|
||||
|
||||
// Multiple plain lines without prefixes -> ambiguous, error
|
||||
if (plainLines.length > 1) {
|
||||
throw new Error("Ambiguous query: multiple lines without lex:/vec:/hyde: prefix.");
|
||||
}
|
||||
|
||||
// Mix of prefixed and one plain line -> treat plain as lex
|
||||
if (plainLines.length === 1) {
|
||||
searches.unshift({ type: 'lex', query: plainLines[0]! });
|
||||
}
|
||||
|
||||
return searches.length > 0 ? searches : null;
|
||||
return typed.length > 0 ? typed : null;
|
||||
}
|
||||
|
||||
describe("parseStructuredQuery", () => {
|
||||
@@ -76,6 +83,10 @@ describe("parseStructuredQuery", () => {
|
||||
expect(parseStructuredQuery("distributed systems")).toBeNull();
|
||||
});
|
||||
|
||||
test("explicit expand line treated as plain query", () => {
|
||||
expect(parseStructuredQuery("expand: error handling best practices")).toBeNull();
|
||||
});
|
||||
|
||||
test("empty queries", () => {
|
||||
expect(parseStructuredQuery("")).toBeNull();
|
||||
expect(parseStructuredQuery(" ")).toBeNull();
|
||||
@@ -86,28 +97,28 @@ describe("parseStructuredQuery", () => {
|
||||
describe("single prefixed queries", () => {
|
||||
test("lex: prefix", () => {
|
||||
const result = parseStructuredQuery("lex: CAP theorem");
|
||||
expect(result).toEqual([{ type: "lex", query: "CAP theorem" }]);
|
||||
expect(result).toEqual([{ type: "lex", query: "CAP theorem", line: 1 }]);
|
||||
});
|
||||
|
||||
test("vec: prefix", () => {
|
||||
const result = parseStructuredQuery("vec: what is the CAP theorem");
|
||||
expect(result).toEqual([{ type: "vec", query: "what is the CAP theorem" }]);
|
||||
expect(result).toEqual([{ type: "vec", query: "what is the CAP theorem", line: 1 }]);
|
||||
});
|
||||
|
||||
test("hyde: prefix", () => {
|
||||
const result = parseStructuredQuery("hyde: The CAP theorem states that...");
|
||||
expect(result).toEqual([{ type: "hyde", query: "The CAP theorem states that..." }]);
|
||||
expect(result).toEqual([{ type: "hyde", query: "The CAP theorem states that...", line: 1 }]);
|
||||
});
|
||||
|
||||
test("uppercase prefix", () => {
|
||||
expect(parseStructuredQuery("LEX: keywords")).toEqual([{ type: "lex", query: "keywords" }]);
|
||||
expect(parseStructuredQuery("VEC: question")).toEqual([{ type: "vec", query: "question" }]);
|
||||
expect(parseStructuredQuery("HYDE: passage")).toEqual([{ type: "hyde", query: "passage" }]);
|
||||
expect(parseStructuredQuery("LEX: keywords")).toEqual([{ type: "lex", query: "keywords", line: 1 }]);
|
||||
expect(parseStructuredQuery("VEC: question")).toEqual([{ type: "vec", query: "question", line: 1 }]);
|
||||
expect(parseStructuredQuery("HYDE: passage")).toEqual([{ type: "hyde", query: "passage", line: 1 }]);
|
||||
});
|
||||
|
||||
test("mixed case prefix", () => {
|
||||
expect(parseStructuredQuery("Lex: test")).toEqual([{ type: "lex", query: "test" }]);
|
||||
expect(parseStructuredQuery("VeC: test")).toEqual([{ type: "vec", query: "test" }]);
|
||||
expect(parseStructuredQuery("Lex: test")).toEqual([{ type: "lex", query: "test", line: 1 }]);
|
||||
expect(parseStructuredQuery("VeC: test")).toEqual([{ type: "vec", query: "test", line: 1 }]);
|
||||
});
|
||||
});
|
||||
|
||||
@@ -115,65 +126,71 @@ describe("parseStructuredQuery", () => {
     test("lex + vec", () => {
       const result = parseStructuredQuery("lex: keywords\nvec: natural language");
       expect(result).toEqual([
-        { type: "lex", query: "keywords" },
-        { type: "vec", query: "natural language" },
+        { type: "lex", query: "keywords", line: 1 },
+        { type: "vec", query: "natural language", line: 2 },
       ]);
     });

     test("all three types", () => {
       const result = parseStructuredQuery("lex: keywords\nvec: question\nhyde: hypothetical doc");
       expect(result).toEqual([
-        { type: "lex", query: "keywords" },
-        { type: "vec", query: "question" },
-        { type: "hyde", query: "hypothetical doc" },
+        { type: "lex", query: "keywords", line: 1 },
+        { type: "vec", query: "question", line: 2 },
+        { type: "hyde", query: "hypothetical doc", line: 3 },
       ]);
     });

     test("duplicate types allowed", () => {
       const result = parseStructuredQuery("lex: term1\nlex: term2\nlex: term3");
       expect(result).toEqual([
-        { type: "lex", query: "term1" },
-        { type: "lex", query: "term2" },
-        { type: "lex", query: "term3" },
+        { type: "lex", query: "term1", line: 1 },
+        { type: "lex", query: "term2", line: 2 },
+        { type: "lex", query: "term3", line: 3 },
       ]);
     });

     test("order preserved", () => {
       const result = parseStructuredQuery("hyde: passage\nvec: question\nlex: keywords");
       expect(result).toEqual([
-        { type: "hyde", query: "passage" },
-        { type: "vec", query: "question" },
-        { type: "lex", query: "keywords" },
+        { type: "hyde", query: "passage", line: 1 },
+        { type: "vec", query: "question", line: 2 },
+        { type: "lex", query: "keywords", line: 3 },
       ]);
     });
   });

   describe("mixed plain and prefixed", () => {
-    test("single plain line with prefixed lines -> plain becomes lex first", () => {
-      const result = parseStructuredQuery("plain keywords\nvec: semantic question");
-      expect(result).toEqual([
-        { type: "lex", query: "plain keywords" },
-        { type: "vec", query: "semantic question" },
-      ]);
+    test("plain line with prefixed lines throws helpful error", () => {
+      expect(() => parseStructuredQuery("plain keywords\nvec: semantic question"))
+        .toThrow(/missing a lex:\/vec:\/hyde:/);
     });

-    test("plain line prepended before other prefixed", () => {
-      const result = parseStructuredQuery("keywords\nhyde: passage\nvec: question");
-      expect(result).toEqual([
-        { type: "lex", query: "keywords" },
-        { type: "hyde", query: "passage" },
-        { type: "vec", query: "question" },
-      ]);
+    test("plain line prepended before other prefixed throws", () => {
+      expect(() => parseStructuredQuery("keywords\nhyde: passage\nvec: question"))
+        .toThrow(/missing a lex:\/vec:\/hyde:/);
     });
   });

   describe("error cases", () => {
     test("multiple plain lines throws", () => {
-      expect(() => parseStructuredQuery("line one\nline two")).toThrow("Ambiguous query");
+      expect(() => parseStructuredQuery("line one\nline two")).toThrow(/missing a lex:\/vec:\/hyde:/);
     });

     test("three plain lines throws", () => {
-      expect(() => parseStructuredQuery("a\nb\nc")).toThrow("Ambiguous query");
+      expect(() => parseStructuredQuery("a\nb\nc")).toThrow(/missing a lex:\/vec:\/hyde:/);
     });
+
+    test("mixing expand: with other lines throws", () => {
+      expect(() => parseStructuredQuery("expand: question\nlex: keywords"))
+        .toThrow(/cannot mix expand with typed lines/);
+    });
+
+    test("expand: without text throws", () => {
+      expect(() => parseStructuredQuery("expand: ")).toThrow(/must include text/);
+    });
+
+    test("typed line without text throws", () => {
+      expect(() => parseStructuredQuery("lex: \nvec: real")).toThrow(/must include text/);
+    });
   });

@@ -181,58 +198,56 @@ describe("parseStructuredQuery", () => {
     test("empty lines ignored", () => {
       const result = parseStructuredQuery("lex: keywords\n\nvec: question\n");
       expect(result).toEqual([
-        { type: "lex", query: "keywords" },
-        { type: "vec", query: "question" },
+        { type: "lex", query: "keywords", line: 1 },
+        { type: "vec", query: "question", line: 3 },
       ]);
     });

     test("whitespace-only lines ignored", () => {
       const result = parseStructuredQuery("lex: keywords\n \nvec: question");
       expect(result).toEqual([
-        { type: "lex", query: "keywords" },
-        { type: "vec", query: "question" },
+        { type: "lex", query: "keywords", line: 1 },
+        { type: "vec", query: "question", line: 3 },
       ]);
     });

     test("leading/trailing whitespace trimmed from lines", () => {
       const result = parseStructuredQuery(" lex: keywords \n vec: question ");
       expect(result).toEqual([
-        { type: "lex", query: "keywords" },
-        { type: "vec", query: "question" },
+        { type: "lex", query: "keywords", line: 1 },
+        { type: "vec", query: "question", line: 2 },
       ]);
     });

     test("internal whitespace preserved in query", () => {
       const result = parseStructuredQuery("lex: multiple spaces ");
-      expect(result).toEqual([{ type: "lex", query: "multiple spaces" }]);
+      expect(result).toEqual([{ type: "lex", query: "multiple spaces", line: 1 }]);
     });

-    test("empty prefix value skipped", () => {
-      const result = parseStructuredQuery("lex: \nvec: actual query");
-      expect(result).toEqual([{ type: "vec", query: "actual query" }]);
+    test("empty prefix value throws", () => {
+      expect(() => parseStructuredQuery("lex: \nvec: actual query")).toThrow(/must include text/);
     });

-    test("only empty prefix values returns null", () => {
-      const result = parseStructuredQuery("lex: \nvec: \nhyde: ");
-      expect(result).toBeNull();
+    test("only empty prefix values throws", () => {
+      expect(() => parseStructuredQuery("lex: \nvec: \nhyde: ")).toThrow(/must include text/);
     });
   });

   describe("edge cases", () => {
     test("colon in query text preserved", () => {
       const result = parseStructuredQuery("lex: time: 12:30 PM");
-      expect(result).toEqual([{ type: "lex", query: "time: 12:30 PM" }]);
+      expect(result).toEqual([{ type: "lex", query: "time: 12:30 PM", line: 1 }]);
     });

     test("prefix-like text in query preserved", () => {
       const result = parseStructuredQuery("vec: what does lex: mean");
-      expect(result).toEqual([{ type: "vec", query: "what does lex: mean" }]);
+      expect(result).toEqual([{ type: "vec", query: "what does lex: mean", line: 1 }]);
     });

     test("newline in hyde passage (as single line)", () => {
       // If user wants actual newlines in hyde, they need to escape or use multiline syntax
       const result = parseStructuredQuery("hyde: The answer is X. It means Y.");
-      expect(result).toEqual([{ type: "hyde", query: "The answer is X. It means Y." }]);
+      expect(result).toEqual([{ type: "hyde", query: "The answer is X. It means Y.", line: 1 }]);
     });
   });
 });
@@ -318,6 +333,18 @@ describe("structuredSearch", () => {
       expect(r.score).toBeGreaterThanOrEqual(0.5);
     }
   });
+
+  test("throws when lex query contains newline characters", async () => {
+    await expect(structuredSearch(store, [
+      { type: "lex", query: "foo\nbar", line: 3 }
+    ])).rejects.toThrow(/Line 3 \(lex\):/);
+  });
+
+  test("throws when lex query has unmatched quote", async () => {
+    await expect(structuredSearch(store, [
+      { type: "lex", query: "\"unfinished phrase", line: 2 }
+    ])).rejects.toThrow(/unmatched double quote/);
+  });
 });

 // =============================================================================
@@ -346,6 +373,20 @@ describe("lex query syntax", () => {
       )).toBeNull();
     });
   });
+
+  describe("validateLexQuery", () => {
+    test("accepts basic lex query", () => {
+      expect(validateLexQuery("auth token")).toBeNull();
+    });
+
+    test("rejects newline", () => {
+      expect(validateLexQuery("foo\nbar")).toContain("single line");
+    });
+
+    test("rejects unmatched quote", () => {
+      expect(validateLexQuery("\"unfinished")).toContain("unmatched");
+    });
+  });
 });

// =============================================================================
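
The tests above exercise `validateLexQuery` and the `Line N (lex):` error raised by `structuredSearch`, but the bodies of those functions are not shown here. As a rough illustration of the contract the tests imply (null for an acceptable query, otherwise a message), here is a hypothetical sketch. The names `validateLexQuerySketch` and `formatLexError`, the quote-counting approach, and the full message wording are assumptions; only the substrings the tests check (`single line`, `unmatched double quote`, `Line 3 (lex):`) are taken from the diff.

```typescript
// Hypothetical stand-in for validateLexQuery: returns null when the query
// looks valid, otherwise a human-readable description of the problem.
function validateLexQuerySketch(query: string): string | null {
  // A lex query is one line of a query document, so it cannot span lines.
  if (/[\r\n]/.test(query)) {
    return "lex queries must stay on a single line";
  }

  // Assumption: a phrase is delimited by plain double quotes with no escape
  // syntax, so an odd number of quotes means one phrase was never closed.
  const quoteCount = (query.match(/"/g) ?? []).length;
  if (quoteCount % 2 !== 0) {
    return "lex query has an unmatched double quote";
  }

  return null;
}

// Hypothetical helper showing how a caller could attach the line number
// carried on each sub-query to the validation message.
function formatLexError(line: number, query: string): string | null {
  const problem = validateLexQuerySketch(query);
  return problem === null ? null : `Line ${line} (lex): ${problem}`;
}
```

Under these assumptions, `formatLexError(3, "foo\nbar")` yields a message starting with `Line 3 (lex):`, which is the shape the `structuredSearch` test matches on. Returning a string instead of throwing keeps the validator reusable by callers that want to collect several problems before reporting them.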