Document query grammar and add skill helpers

Tobi Lutke 2026-02-22 13:36:08 -04:00
parent 0e0feb6f2b
commit 64ef25e1f6
7 changed files with 312 additions and 208 deletions


@ -4,15 +4,15 @@
## [1.1.0] - 2026-02-20
QMD now speaks in **query documents** — structured multi-line queries where each line is typed (`lex:`, `vec:`, `hyde:`, `expand:`), combining keyword precision with semantic recall. A single plain query still works exactly as before. Lex now supports quoted phrases and negation (`"C++ performance" -sports -athlete`), making intent-aware disambiguation practical. The formal query grammar is documented in `docs/SYNTAX.md`.
QMD now speaks in **query documents** — structured multi-line queries where every line is typed (`lex:`, `vec:`, `hyde:`), combining keyword precision with semantic recall. A single plain query still works exactly as before (it's treated as an implicit `expand:` and auto-expanded by the LLM). Lex now supports quoted phrases and negation (`"C++ performance" -sports -athlete`), making intent-aware disambiguation practical. The formal query grammar is documented in `docs/SYNTAX.md`.
The npm package now uses the standard `#!/usr/bin/env node` bin convention, replacing the custom bash wrapper. This fixes native module ABI mismatches when installed via bun and works on any platform with node >= 22 on PATH.
### Changes
- **Query document format**: multi-line queries with typed sub-queries (`lex:`, `vec:`, `hyde:`, `expand:`). Plain queries remain the default (`expand:` implicit). First sub-query gets 2× fusion weight — put your strongest signal first. Formal grammar in `docs/SYNTAX.md`.
- **Query document format**: multi-line queries with typed sub-queries (`lex:`, `vec:`, `hyde:`). Plain queries remain the default (`expand:` implicit, but not written inside the document). First sub-query gets 2× fusion weight — put your strongest signal first. Formal grammar in `docs/SYNTAX.md`.
- **Lex syntax**: full BM25 operator support. `"exact phrase"` for verbatim matching; `-term` and `-"phrase"` for exclusions. Essential for disambiguation when a term is overloaded across domains (e.g. `performance -sports -athlete`).
- **`expand:` type**: explicit auto-expansion via local LLM. Max one per query document. Identical to the prior default behavior for plain queries.
- **`expand:` shortcut**: send a single plain query (or start the document with `expand:` on its only line) to auto-expand via the local LLM. Query documents themselves are limited to `lex`, `vec`, and `hyde` lines.
- **MCP `query` tool** (renamed from `structured_search`): rewrote the tool description to fully teach AI agents the query document format, lex syntax, and combination strategy. Includes worked examples with intent-aware lex.
- **HTTP `/query` endpoint** (renamed from `/search`; `/search` kept as silent alias).
- **`collections` array filter**: filter by multiple collections in a single query (`collections: ["notes", "brain"]`). Removed the single `collection` string param — array only.
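
The "first sub-query gets 2× fusion weight" rule can be illustrated with a weighted reciprocal rank fusion. This is only a sketch under stated assumptions: the changelog does not say which fusion formula QMD uses, and `fuseRankings` and the constant `k = 60` are hypothetical, chosen for illustration.

```typescript
// Hypothetical sketch of weighted reciprocal rank fusion (RRF).
// Neither `fuseRankings` nor k = 60 comes from the QMD source.
type Ranked = string[]; // document ids, best match first

function fuseRankings(lists: Ranked[], k = 60): Array<[string, number]> {
  const scores = new Map<string, number>();
  lists.forEach((list, listIdx) => {
    // The first sub-query's list counts double, every other list counts once.
    const weight = listIdx === 0 ? 2 : 1;
    list.forEach((docId, rank) => {
      const prev = scores.get(docId) ?? 0;
      scores.set(docId, prev + weight / (k + rank + 1));
    });
  });
  // Highest fused score first.
  return [...scores.entries()].sort((a, b) => b[1] - a[1]);
}

const fused = fuseRankings([
  ["a.md", "b.md"], // e.g. hits for the first (lex) sub-query
  ["b.md", "c.md"], // e.g. hits for a later (vec) sub-query
]);
console.log(fused.map(([id]) => id));
```

In this sketch a document found only by the first sub-query outranks one found only by a later sub-query at the same position, which is why putting the strongest signal first matters.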
@ -362,4 +362,3 @@ notes, journals, and meeting transcripts.
[Unreleased]: https://github.com/tobi/qmd/compare/v1.0.0...HEAD
[1.0.0]: https://github.com/tobi/qmd/releases/tag/v1.0.0
[0.9.0]: https://github.com/tobi/qmd/compare/v0.8.0...v0.9.0


@ -5,9 +5,12 @@ QMD queries are structured documents with typed sub-queries. Each line specifies
## Grammar
```ebnf
query_document = { line } ;
line = [ type ":" ] text newline ;
type = "lex" | "vec" | "hyde" | "expand" ;
query = expand_query | query_document ;
expand_query = text | explicit_expand ;
explicit_expand= "expand:" text ;
query_document = { typed_line } ;
typed_line = type ":" text newline ;
type = "lex" | "vec" | "hyde" ;
text = quoted_phrase | plain_text ;
quoted_phrase = '"' { character } '"' ;
plain_text = { character } ;
@ -21,14 +24,13 @@ newline = "\n" ;
| `lex` | BM25 | Keyword search with exact matching |
| `vec` | Vector | Semantic similarity search |
| `hyde` | Vector | Hypothetical document embedding |
| `expand` | LLM | Auto-expand into lex/vec/hyde via local model |
## Default Behavior
A query without any type prefix is treated as `expand:` — it gets passed to the query expansion model which generates lex, vec, and hyde variations automatically.
A QMD query is either a single expand query or a multi-line query document. Any single-line query with no prefix is treated as an expand query and passed to the expansion model, which emits lex, vec, and hyde variants automatically.
```
# These are equivalent:
# These are equivalent and cannot be combined with typed lines:
how does authentication work
expand: how does authentication work
```
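
The equivalence of the two forms can be sketched as a small normalization step. `expandText` is a hypothetical helper, not part of QMD. Unlike the CLI parser, which throws on an empty `expand:` line, this simplified version returns `null`.

```typescript
// Hypothetical helper, not part of QMD: returns the text that would be sent
// to the expansion model, or null when the input is not a standalone
// expand query (it is multi-line, typed, or empty).
function expandText(query: string): string | null {
  const lines = query
    .split("\n")
    .map((l) => l.trim())
    .filter((l) => l.length > 0);
  if (lines.length !== 1) return null; // expand queries stand alone
  const line = lines[0]!;
  if (/^(lex|vec|hyde):/i.test(line)) return null; // typed line, not expand
  const text = line.replace(/^expand:\s*/i, "").trim();
  return text.length > 0 ? text : null;
}

console.log(expandText("how does authentication work"));
console.log(expandText("expand: how does authentication work"));
```

Both calls yield the same text, mirroring the rule that the prefix is optional for a single-line query.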
@ -89,17 +91,20 @@ hyde: The API implements rate limiting using a token bucket algorithm...
## Expand Queries
Use `expand:` to leverage the local query expansion model. Limited to one per query document.
An expand query stands alone; it's not mixed with typed lines. You can either rely on the default untyped form or add the explicit `expand:` prefix:
```
expand: error handling best practices
# equivalent
error handling best practices
```
This generates lex, vec, and hyde variations automatically. Useful when you don't know the exact terms.
Both forms call the local query expansion model, which generates lex, vec, and hyde variations automatically.
## Constraints
- Maximum one `expand:` query per document
- Top-level query must be either a standalone expand query or a multi-line document
- Query documents allow only `lex`, `vec`, and `hyde` typed lines (no `expand:` inside)
- `lex` syntax (`-term`, `"phrase"`) only works in lex queries
- Empty lines are ignored
- Leading/trailing whitespace is trimmed
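
To make the lex-only syntax concrete, here is a minimal tokenizer sketch that splits a lex query into plain terms, quoted phrases, and exclusions. `parseLex` and `LexTerm` are hypothetical names used for illustration. QMD's own lex handling may differ, and this sketch assumes the quotes are already balanced.

```typescript
// Hypothetical sketch: `parseLex` and `LexTerm` are illustrative names,
// not QMD APIs. Assumes double quotes in the input are balanced.
interface LexTerm {
  text: string;
  phrase: boolean; // true when the term came from "quoted text"
  negated: boolean; // true when the term was prefixed with "-"
}

function parseLex(query: string): LexTerm[] {
  const terms: LexTerm[] = [];
  // Optional leading "-", then either a quoted phrase or a bare token.
  const tokenRe = /(-?)(?:"([^"]*)"|([^\s"]+))/g;
  let m: RegExpExecArray | null;
  while ((m = tokenRe.exec(query)) !== null) {
    const negated = m[1] === "-";
    const phrase = m[2] !== undefined;
    const text = phrase ? m[2]! : m[3]!;
    if (text.length > 0) terms.push({ text, phrase, negated });
  }
  return terms;
}

console.log(parseLex('"C++ performance" -sports -athlete'));
```

A hyphen inside a word (for example `well-known`) stays part of the term, because only a leading `-` is treated as negation.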


@ -37,7 +37,6 @@ Local search engine for markdown content.
| `lex` | BM25 | Keywords — exact terms, names, code |
| `vec` | Vector | Question — natural language |
| `hyde` | Vector | Answer — hypothetical result (50-100 words) |
| `expand` | LLM | Auto-expand via local model (max 1 per query) |
### Writing Good Queries
@ -57,16 +56,16 @@ Local search engine for markdown content.
- Use the vocabulary you expect in the result
**expand (auto-expand)**
- Let the local LLM generate lex/vec/hyde variations
- Good when you don't know exact terms
- Max one expand: per query
- Use a single-line query (implicit) or `expand: question` on its own line
- Lets the local LLM generate lex/vec/hyde variations
- Do not mix `expand:` with other typed lines — it's either a standalone expand query or a full query document
### Combining Types
| Goal | Approach |
|------|----------|
| Know exact terms | `lex` only |
| Don't know vocabulary | `vec` or `expand` |
| Don't know vocabulary | Use a single-line query (implicit `expand:`) or `vec` |
| Best recall | `lex` + `vec` |
| Complex topic | `lex` + `vec` + `hyde` |
@ -107,6 +106,8 @@ qmd query $'lex: X\nvec: Y' # Structured
qmd query $'expand: question' # Explicit expand
qmd search "keywords" # BM25 only (no LLM)
qmd get "#abc123" # By docid
qmd multi-get "journals/2026-*.md" -l 40 # Batch pull snippets by glob
qmd multi-get notes/foo.md,notes/bar.md # Comma-separated list, preserves order
```
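
When a script or agent assembles the query, the structured form can be built as a string before it is passed to `qmd query`. The sketch below is one way to do that. `buildQueryDocument` and `SubQuery` are hypothetical names, not part of the qmd CLI or library, and the checks only mirror the single-line and non-empty rules described in this document.

```typescript
// Hypothetical helper, not part of qmd: joins typed sub-queries into a
// query document, one "type: text" line per sub-query.
type SubQuery = { type: "lex" | "vec" | "hyde"; query: string };

function buildQueryDocument(subs: SubQuery[]): string {
  return subs
    .map(({ type, query }) => {
      const text = query.trim();
      if (text.length === 0) {
        throw new Error(`${type}: query must include text`);
      }
      if (/[\r\n]/.test(text)) {
        throw new Error(`${type}: query must be a single line`);
      }
      return `${type}: ${text}`;
    })
    .join("\n");
}

const doc = buildQueryDocument([
  { type: "lex", query: '"rate limiter" -sports' },
  { type: "vec", query: "how does rate limiting work" },
]);
console.log(doc);
```

The resulting string is meant to be passed as the single argument to `qmd query`, the same way the `$'lex: X\nvec: Y'` shell form does.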
## HTTP API


@ -120,7 +120,7 @@ function buildInstructions(store: Store): string {
// --- Search tool ---
lines.push("");
lines.push("Search: Use `query` with sub-queries (lex/vec/hyde/expand):");
lines.push("Search: Use `query` with sub-queries (lex/vec/hyde):");
lines.push(" - type:'lex' — BM25 keyword search (exact terms, fast)");
lines.push(" - type:'vec' — semantic vector search (meaning-based)");
lines.push(" - type:'hyde' — hypothetical document (write what the answer looks like)");
@ -229,10 +229,9 @@ function createMcpServer(store: Store): McpServer {
// ---------------------------------------------------------------------------
const subSearchSchema = z.object({
type: z.enum(['lex', 'vec', 'hyde', 'expand']).describe(
"lex = BM25 keywords (supports \"phrase\" and -negation), " +
"vec = semantic question, hyde = hypothetical answer passage, " +
"expand = auto-expand via LLM (max 1 per query)"
type: z.enum(['lex', 'vec', 'hyde']).describe(
"lex = BM25 keywords (supports \"phrase\" and -negation); " +
"vec = semantic question; hyde = hypothetical answer passage"
),
query: z.string().describe(
"The query text. For lex: use keywords, \"quoted phrases\", and -negation. " +
@ -266,8 +265,6 @@ Good lex examples:
**hyde** Hypothetical document. Write 50-100 words that look like the answer. Often the most powerful for nuanced topics.
- \`The rate limiter uses a token bucket algorithm. When a client exceeds 100 req/min, subsequent requests return 429 until the window resets.\`
**expand** Auto-expand via local LLM. Generates lex+vec+hyde variations automatically. Max one per query. Useful when you don't know the exact terms.
## Strategy
Combine types for best results. First sub-query gets 2× weight; put your strongest signal first.
@ -278,7 +275,7 @@ Combine types for best results. First sub-query gets 2× weight — put your str
| Concept search | \`vec\` only |
| Best recall | \`lex\` + \`vec\` |
| Complex/nuanced | \`lex\` + \`vec\` + \`hyde\` |
| Unknown vocabulary | \`expand\` |
| Unknown vocabulary | Use a standalone natural-language query (no typed lines) so the server can auto-expand it |
## Examples
@ -306,7 +303,7 @@ Intent-aware lex (C++ performance, not sports):
annotations: { readOnlyHint: true, openWorldHint: false },
inputSchema: {
searches: z.array(subSearchSchema).min(1).max(10).describe(
"Sub-queries to execute. First gets 2x weight. Max one expand: per query."
"Typed sub-queries to execute (lex/vec/hyde). First gets 2x weight."
),
limit: z.number().optional().default(10).describe("Max results (default: 10)"),
minScore: z.number().optional().default(0).describe("Min relevance 0-1 (default: 0)"),


@ -1950,46 +1950,53 @@ function filterByCollections<T extends { filepath?: string; file?: string }>(res
* "CAP\nconsistency" -> throws (multiple plain lines)
*/
function parseStructuredQuery(query: string): StructuredSubSearch[] | null {
const lines = query.split('\n').map(l => l.trim()).filter(l => l.length > 0);
if (lines.length === 0) return null;
const rawLines = query.split('\n').map((line, idx) => ({
raw: line,
trimmed: line.trim(),
number: idx + 1,
})).filter(line => line.trimmed.length > 0);
const prefixRe = /^(lex|vec|hyde|expand):\s*/i;
const searches: StructuredSubSearch[] = [];
const plainLines: string[] = [];
if (rawLines.length === 0) return null;
for (const line of lines) {
const match = line.match(prefixRe);
if (match) {
const type = match[1]!.toLowerCase() as 'lex' | 'vec' | 'hyde' | 'expand';
const text = line.slice(match[0].length).trim();
if (text.length > 0) {
searches.push({ type, query: text });
const prefixRe = /^(lex|vec|hyde):\s*/i;
const expandRe = /^expand:\s*/i;
const typed: StructuredSubSearch[] = [];
for (const line of rawLines) {
if (expandRe.test(line.trimmed)) {
if (rawLines.length > 1) {
throw new Error(`Line ${line.number} starts with expand:, but query documents cannot mix expand with typed lines. Submit a single expand query instead.`);
}
} else {
plainLines.push(line);
const text = line.trimmed.replace(expandRe, '').trim();
if (!text) {
throw new Error('expand: query must include text.');
}
return null; // treat as standalone expand query
}
const match = line.trimmed.match(prefixRe);
if (match) {
const type = match[1]!.toLowerCase() as 'lex' | 'vec' | 'hyde';
const text = line.trimmed.slice(match[0].length).trim();
if (!text) {
throw new Error(`Line ${line.number} (${type}:) must include text.`);
}
if (/\r|\n/.test(text)) {
throw new Error(`Line ${line.number} (${type}:) contains a newline. Keep each query on a single line.`);
}
typed.push({ type, query: text, line: line.number });
continue;
}
if (rawLines.length === 1) {
// Single plain line -> implicit expand
return null;
}
throw new Error(`Line ${line.number} is missing a lex:/vec:/hyde: prefix. Each line in a query document must start with one.`);
}
// All plain lines, no prefixes -> null (use normal expansion)
if (searches.length === 0 && plainLines.length === 1) {
return null;
}
// Multiple plain lines without prefixes -> ambiguous, error
if (plainLines.length > 1) {
throw new Error(
`Ambiguous query: multiple lines without lex:/vec:/hyde: prefix.\n` +
`Either use a single line (for query expansion) or prefix each line.\n` +
`Example:\n lex: keyword terms\n vec: natural language question\n hyde: hypothetical answer passage`
);
}
// Mix of prefixed and one plain line -> treat plain as lex
if (plainLines.length === 1) {
searches.unshift({ type: 'lex', query: plainLines[0]! });
}
return searches.length > 0 ? searches : null;
return typed.length > 0 ? typed : null;
}
function search(query: string, opts: OutputOptions): void {
@ -2239,6 +2246,7 @@ function parseCLI() {
},
help: { type: "boolean", short: "h" },
version: { type: "boolean", short: "v" },
skill: { type: "boolean" },
// Search options
n: { type: "string" },
"min-score": { type: "string" },
@ -2311,58 +2319,104 @@ function parseCLI() {
};
}
function showSkill(): void {
const scriptDir = dirname(fileURLToPath(import.meta.url));
const relativePath = pathJoin("skills", "qmd", "SKILL.md");
const skillPath = pathJoin(scriptDir, "..", relativePath);
console.log(`QMD Skill (${relativePath})`);
console.log(`Location: ${skillPath}`);
console.log("");
if (!existsSync(skillPath)) {
console.error("SKILL.md not found. If you built from source, ensure skills/qmd/SKILL.md exists.");
return;
}
const content = readFileSync(skillPath, "utf-8");
process.stdout.write(content.endsWith("\n") ? content : content + "\n");
}
function showHelp(): void {
console.log("qmd — Quick Markdown Search");
console.log("");
console.log("Usage:");
console.log(" qmd collection add [path] --name <name> --mask <pattern> - Create/index collection");
console.log(" qmd collection list - List all collections with details");
console.log(" qmd collection remove <name> - Remove a collection by name");
console.log(" qmd collection rename <old> <new> - Rename a collection");
console.log(" qmd ls [collection[/path]] - List collections or files in a collection");
console.log(" qmd context add [path] \"text\" - Add context for path (defaults to current dir)");
console.log(" qmd context list - List all contexts");
console.log(" qmd context rm <path> - Remove context");
console.log(" qmd get <file>[:line] [-l N] [--from N] - Get document (optionally from line, max N lines)");
console.log(" qmd multi-get <pattern> [-l N] [--max-bytes N] - Get multiple docs by glob or comma-separated list");
console.log(" qmd status - Show index status and collections");
console.log(" qmd update [--pull] - Re-index all collections (--pull: git pull first)");
console.log(" qmd embed [-f] - Create vector embeddings (900 tokens/chunk, 15% overlap)");
console.log(" qmd cleanup - Remove cache and orphaned data, vacuum DB");
console.log(" qmd query <query> - Search with query expansion + reranking (recommended)");
console.log(" qmd query 'lex:..\\nvec:...' - Structured search (you provide lex/vec/hyde queries)");
console.log(" qmd search <query> - Full-text keyword search (BM25, no LLM)");
console.log(" qmd vsearch <query> - Vector similarity search (no reranking)");
console.log(" qmd mcp - Start MCP server (stdio transport)");
console.log(" qmd mcp --http [--port N] - Start MCP server (HTTP transport, default port 8181)");
console.log(" qmd mcp --http --daemon - Start MCP server as background daemon");
console.log(" qmd mcp stop - Stop background MCP daemon");
console.log(" qmd <command> [options]");
console.log("");
console.log("Primary commands:");
console.log(" qmd query <query> - Hybrid search with auto expansion + reranking (recommended)");
console.log(" qmd query 'lex:..\\nvec:...' - Structured query document (you provide lex/vec/hyde lines)");
console.log(" qmd search <query> - Full-text BM25 keywords (no LLM)");
console.log(" qmd vsearch <query> - Vector similarity only");
console.log(" qmd get <file>[:line] [-l N] - Show a single document, optional line slice");
console.log(" qmd multi-get <pattern> - Batch fetch via glob or comma-separated list");
console.log(" qmd mcp - Start the MCP server (stdio transport for AI agents)");
console.log("");
console.log("Collections & context:");
console.log(" qmd collection add/list/remove/rename/show - Manage indexed folders");
console.log(" qmd context add/list/rm - Attach human-written summaries");
console.log(" qmd ls [collection[/path]] - Inspect indexed files");
console.log("");
console.log("Maintenance:");
console.log(" qmd status - View index + collection health");
console.log(" qmd update [--pull] - Re-index collections (optionally git pull first)");
console.log(" qmd embed [-f] - Generate/refresh vector embeddings");
console.log(" qmd cleanup - Clear caches, vacuum DB");
console.log("");
console.log("Query syntax (qmd query):");
console.log(" QMD queries are either a single expand query (no prefix) or a multi-line");
console.log(" document where every line is typed with lex:, vec:, or hyde:. This grammar");
console.log(" matches the docs in docs/SYNTAX.md and is enforced in the CLI.");
console.log("");
const grammar = [
`query = expand_query | query_document ;`,
`expand_query = text | explicit_expand ;`,
`explicit_expand= "expand:" text ;`,
`query_document = { typed_line } ;`,
`typed_line = type ":" text newline ;`,
`type = "lex" | "vec" | "hyde" ;`,
`text = quoted_phrase | plain_text ;`,
`quoted_phrase = '"' { character } '"' ;`,
`plain_text = { character } ;`,
`newline = "\\n" ;`,
];
console.log(" Grammar:");
for (const line of grammar) {
console.log(` ${line}`);
}
console.log("");
console.log(" Examples:");
console.log(" qmd query \"how does auth work\" # single-line → implicit expand");
console.log(" qmd query $'lex: CAP theorem\\nvec: consistency' # typed query document");
console.log(" qmd query $'lex: \"exact matches\" sports -baseball' # phrase + negation lex search");
console.log(" qmd query $'hyde: Hypothetical answer text' # hyde-only document");
console.log("");
console.log(" Constraints:");
console.log(" - Standalone expand queries cannot mix with typed lines.");
console.log(" - Query documents allow only lex:, vec:, or hyde: prefixes.");
console.log(" - Each typed line must be single-line text with balanced quotes.");
console.log("");
console.log("AI agents & integrations:");
console.log(" - Run `qmd mcp` to expose the MCP server (stdio) to agents/IDEs.");
console.log(" - `qmd --skill` prints the packaged skills/qmd/SKILL.md (path + contents).");
console.log(" - Advanced: `qmd mcp --http ...` and `qmd mcp --http --daemon` are optional for custom transports.");
console.log("");
console.log("Global options:");
console.log(" --index <name> - Use custom index name (default: index)");
console.log(" --index <name> - Use a named index (default: index)");
console.log("");
console.log("Search options:");
console.log(" -n <num> - Number of results (default: 5, or 20 for --files)");
console.log(" --all - Return all matches (use with --min-score to filter)");
console.log(" -n <num> - Max results (default 5, or 20 for --files/--json)");
console.log(" --all - Return all matches (pair with --min-score)");
console.log(" --min-score <num> - Minimum similarity score");
console.log(" --full - Output full document instead of snippet");
console.log(" --line-numbers - Add line numbers to output");
console.log(" --files - Output docid,score,filepath,context (default: 20 results)");
console.log(" --json - JSON output with snippets (default: 20 results)");
console.log(" --csv - CSV output with snippets");
console.log(" --md - Markdown output");
console.log(" --xml - XML output");
console.log(" -c, --collection <name> - Filter results to a specific collection");
console.log("");
console.log("Structured queries (qmd query):");
console.log(" Prefix lines with lex:, vec:, or hyde: to skip automatic expansion.");
console.log(" lex: BM25 keyword search (exact terms)");
console.log(" vec: Vector similarity (natural language question)");
console.log(" hyde: Vector similarity (hypothetical answer passage)");
console.log(" Example: qmd query $'lex: CAP theorem\\nvec: consistency vs availability tradeoff'");
console.log(" --line-numbers - Include line numbers in output");
console.log(" --files | --json | --csv | --md | --xml - Output format");
console.log(" -c, --collection <name> - Filter by one or more collections");
console.log("");
console.log("Multi-get options:");
console.log(" -l <num> - Maximum lines per file");
console.log(" --max-bytes <num> - Skip files larger than N bytes (default: 10240)");
console.log(" --json/--csv/--md/--xml/--files - Output format (same as search)");
console.log(" --max-bytes <num> - Skip files larger than N bytes (default 10240)");
console.log(" --json/--csv/--md/--xml/--files - Same formats as search");
console.log("");
console.log(`Index: ${getDbPath()}`);
}
@ -2398,6 +2452,11 @@ if (isMain) {
process.exit(0);
}
if (cli.values.skill) {
showSkill();
process.exit(0);
}
if (!cli.command || cli.values.help) {
showHelp();
process.exit(cli.values.help ? 0 : 1);


@ -2082,6 +2082,17 @@ export function validateSemanticQuery(query: string): string | null {
return null;
}
export function validateLexQuery(query: string): string | null {
if (/[\r\n]/.test(query)) {
return 'Lex queries must be a single line. Remove newline characters or split into separate lex: lines.';
}
const quoteCount = (query.match(/"/g) ?? []).length;
if (quoteCount % 2 === 1) {
return 'Lex query has an unmatched double quote ("). Add the closing quote or remove it.';
}
return null;
}
export function searchFTS(db: Database, query: string, limit: number = 20, collectionName?: string): SearchResult[] {
const ftsQuery = buildFTS5Query(query);
if (!ftsQuery) return [];
@ -3164,10 +3175,12 @@ export async function vectorSearchQuery(
* Matches the format used in QMD training data.
*/
export interface StructuredSubSearch {
/** Search type: 'lex' for BM25, 'vec' for semantic, 'hyde' for hypothetical, 'expand' for LLM expansion */
type: 'lex' | 'vec' | 'hyde' | 'expand';
/** Search type: 'lex' for BM25, 'vec' for semantic, 'hyde' for hypothetical */
type: 'lex' | 'vec' | 'hyde';
/** The search query text */
query: string;
/** Optional line number for error reporting (CLI parser) */
line?: number;
}
export interface StructuredSearchOptions {
@ -3212,36 +3225,25 @@ export async function structuredSearch(
if (searches.length === 0) return [];
// Validate: max one expand query, semantic queries don't use lex syntax
const expandSearches = searches.filter(s => s.type === 'expand');
if (expandSearches.length > 1) {
throw new Error('Maximum one expand: query per document');
}
// Validate queries before executing
for (const search of searches) {
if (search.type === 'vec' || search.type === 'hyde') {
const location = search.line ? `Line ${search.line}` : 'Structured search';
if (/[\r\n]/.test(search.query)) {
throw new Error(`${location} (${search.type}): queries must be single-line. Remove newline characters.`);
}
if (search.type === 'lex') {
const error = validateLexQuery(search.query);
if (error) {
throw new Error(`${location} (lex): ${error}`);
}
} else if (search.type === 'vec' || search.type === 'hyde') {
const error = validateSemanticQuery(search.query);
if (error) {
throw new Error(`Invalid ${search.type} query: ${error}`);
throw new Error(`${location} (${search.type}): ${error}`);
}
}
}
// Process expand: queries by calling the query expansion model
let processedSearches = searches.filter(s => s.type !== 'expand');
if (expandSearches.length > 0) {
const expandQuery = expandSearches[0]!.query;
const expanded = await store.expandQuery(expandQuery);
// Add expanded queries (lex, vec, hyde from the model)
for (const exp of expanded) {
processedSearches.push({ type: exp.type as 'lex' | 'vec' | 'hyde', query: exp.text });
}
// Also add original as lex for strong signal matching
processedSearches.unshift({ type: 'lex', query: expandQuery });
}
// Use processed searches from here on
searches = processedSearches;
const rankedLists: RankedResult[][] = [];
const docidMap = new Map<string, string>(); // filepath -> docid
const hasVectors = !!store.db.prepare(


@ -17,6 +17,7 @@ import {
createStore,
structuredSearch,
validateSemanticQuery,
validateLexQuery,
type StructuredSubSearch,
type Store,
} from "../src/store.js";
@ -26,47 +27,53 @@ import { disposeDefaultLlamaCpp } from "../src/llm.js";
// parseStructuredQuery Tests (CLI Parser)
// =============================================================================
/**
* Parse structured search query syntax.
* This is a copy of the function from qmd.ts for isolated testing.
*/
function parseStructuredQuery(query: string): StructuredSubSearch[] | null {
const lines = query.split('\n').map(l => l.trim()).filter(l => l.length > 0);
if (lines.length === 0) return null;
const rawLines = query.split('\n').map((line, idx) => ({
raw: line,
trimmed: line.trim(),
number: idx + 1,
})).filter(line => line.trimmed.length > 0);
if (rawLines.length === 0) return null;
const prefixRe = /^(lex|vec|hyde):\s*/i;
const searches: StructuredSubSearch[] = [];
const plainLines: string[] = [];
const expandRe = /^expand:\s*/i;
const typed: StructuredSubSearch[] = [];
for (const line of lines) {
const match = line.match(prefixRe);
for (const line of rawLines) {
if (expandRe.test(line.trimmed)) {
if (rawLines.length > 1) {
throw new Error(`Line ${line.number} starts with expand:, but query documents cannot mix expand with typed lines. Submit a single expand query instead.`);
}
const text = line.trimmed.replace(expandRe, '').trim();
if (!text) {
throw new Error('expand: query must include text.');
}
return null;
}
const match = line.trimmed.match(prefixRe);
if (match) {
const type = match[1]!.toLowerCase() as 'lex' | 'vec' | 'hyde';
const text = line.slice(match[0].length).trim();
if (text.length > 0) {
searches.push({ type, query: text });
const text = line.trimmed.slice(match[0].length).trim();
if (!text) {
throw new Error(`Line ${line.number} (${type}:) must include text.`);
}
} else {
plainLines.push(line);
if (/\r|\n/.test(text)) {
throw new Error(`Line ${line.number} (${type}:) contains a newline. Keep each query on a single line.`);
}
typed.push({ type, query: text, line: line.number });
continue;
}
if (rawLines.length === 1) {
return null;
}
throw new Error(`Line ${line.number} is missing a lex:/vec:/hyde: prefix. Each line in a query document must start with one.`);
}
// All plain lines, no prefixes -> null (use normal expansion)
if (searches.length === 0 && plainLines.length === 1) {
return null;
}
// Multiple plain lines without prefixes -> ambiguous, error
if (plainLines.length > 1) {
throw new Error("Ambiguous query: multiple lines without lex:/vec:/hyde: prefix.");
}
// Mix of prefixed and one plain line -> treat plain as lex
if (plainLines.length === 1) {
searches.unshift({ type: 'lex', query: plainLines[0]! });
}
return searches.length > 0 ? searches : null;
return typed.length > 0 ? typed : null;
}
describe("parseStructuredQuery", () => {
@ -76,6 +83,10 @@ describe("parseStructuredQuery", () => {
expect(parseStructuredQuery("distributed systems")).toBeNull();
});
test("explicit expand line treated as plain query", () => {
expect(parseStructuredQuery("expand: error handling best practices")).toBeNull();
});
test("empty queries", () => {
expect(parseStructuredQuery("")).toBeNull();
expect(parseStructuredQuery(" ")).toBeNull();
@ -86,28 +97,28 @@ describe("parseStructuredQuery", () => {
describe("single prefixed queries", () => {
test("lex: prefix", () => {
const result = parseStructuredQuery("lex: CAP theorem");
expect(result).toEqual([{ type: "lex", query: "CAP theorem" }]);
expect(result).toEqual([{ type: "lex", query: "CAP theorem", line: 1 }]);
});
test("vec: prefix", () => {
const result = parseStructuredQuery("vec: what is the CAP theorem");
expect(result).toEqual([{ type: "vec", query: "what is the CAP theorem" }]);
expect(result).toEqual([{ type: "vec", query: "what is the CAP theorem", line: 1 }]);
});
test("hyde: prefix", () => {
const result = parseStructuredQuery("hyde: The CAP theorem states that...");
expect(result).toEqual([{ type: "hyde", query: "The CAP theorem states that..." }]);
expect(result).toEqual([{ type: "hyde", query: "The CAP theorem states that...", line: 1 }]);
});
test("uppercase prefix", () => {
expect(parseStructuredQuery("LEX: keywords")).toEqual([{ type: "lex", query: "keywords" }]);
expect(parseStructuredQuery("VEC: question")).toEqual([{ type: "vec", query: "question" }]);
expect(parseStructuredQuery("HYDE: passage")).toEqual([{ type: "hyde", query: "passage" }]);
expect(parseStructuredQuery("LEX: keywords")).toEqual([{ type: "lex", query: "keywords", line: 1 }]);
expect(parseStructuredQuery("VEC: question")).toEqual([{ type: "vec", query: "question", line: 1 }]);
expect(parseStructuredQuery("HYDE: passage")).toEqual([{ type: "hyde", query: "passage", line: 1 }]);
});
test("mixed case prefix", () => {
expect(parseStructuredQuery("Lex: test")).toEqual([{ type: "lex", query: "test" }]);
expect(parseStructuredQuery("VeC: test")).toEqual([{ type: "vec", query: "test" }]);
expect(parseStructuredQuery("Lex: test")).toEqual([{ type: "lex", query: "test", line: 1 }]);
expect(parseStructuredQuery("VeC: test")).toEqual([{ type: "vec", query: "test", line: 1 }]);
});
});
@ -115,65 +126,71 @@ describe("parseStructuredQuery", () => {
test("lex + vec", () => {
const result = parseStructuredQuery("lex: keywords\nvec: natural language");
expect(result).toEqual([
{ type: "lex", query: "keywords" },
{ type: "vec", query: "natural language" },
{ type: "lex", query: "keywords", line: 1 },
{ type: "vec", query: "natural language", line: 2 },
]);
});
test("all three types", () => {
const result = parseStructuredQuery("lex: keywords\nvec: question\nhyde: hypothetical doc");
expect(result).toEqual([
{ type: "lex", query: "keywords" },
{ type: "vec", query: "question" },
{ type: "hyde", query: "hypothetical doc" },
{ type: "lex", query: "keywords", line: 1 },
{ type: "vec", query: "question", line: 2 },
{ type: "hyde", query: "hypothetical doc", line: 3 },
]);
});
test("duplicate types allowed", () => {
const result = parseStructuredQuery("lex: term1\nlex: term2\nlex: term3");
expect(result).toEqual([
{ type: "lex", query: "term1" },
{ type: "lex", query: "term2" },
{ type: "lex", query: "term3" },
{ type: "lex", query: "term1", line: 1 },
{ type: "lex", query: "term2", line: 2 },
{ type: "lex", query: "term3", line: 3 },
]);
});
test("order preserved", () => {
const result = parseStructuredQuery("hyde: passage\nvec: question\nlex: keywords");
expect(result).toEqual([
{ type: "hyde", query: "passage" },
{ type: "vec", query: "question" },
{ type: "lex", query: "keywords" },
{ type: "hyde", query: "passage", line: 1 },
{ type: "vec", query: "question", line: 2 },
{ type: "lex", query: "keywords", line: 3 },
]);
});
});
describe("mixed plain and prefixed", () => {
test("single plain line with prefixed lines -> plain becomes lex first", () => {
const result = parseStructuredQuery("plain keywords\nvec: semantic question");
expect(result).toEqual([
{ type: "lex", query: "plain keywords" },
{ type: "vec", query: "semantic question" },
]);
test("plain line with prefixed lines throws helpful error", () => {
expect(() => parseStructuredQuery("plain keywords\nvec: semantic question"))
.toThrow(/missing a lex:\/vec:\/hyde:/);
});
- test("plain line prepended before other prefixed", () => {
- const result = parseStructuredQuery("keywords\nhyde: passage\nvec: question");
- expect(result).toEqual([
- { type: "lex", query: "keywords" },
- { type: "hyde", query: "passage" },
- { type: "vec", query: "question" },
- ]);
+ test("plain line prepended before other prefixed throws", () => {
+ expect(() => parseStructuredQuery("keywords\nhyde: passage\nvec: question"))
+ .toThrow(/missing a lex:\/vec:\/hyde:/);
});
});
describe("error cases", () => {
test("multiple plain lines throws", () => {
- expect(() => parseStructuredQuery("line one\nline two")).toThrow("Ambiguous query");
+ expect(() => parseStructuredQuery("line one\nline two")).toThrow(/missing a lex:\/vec:\/hyde:/);
});
test("three plain lines throws", () => {
- expect(() => parseStructuredQuery("a\nb\nc")).toThrow("Ambiguous query");
+ expect(() => parseStructuredQuery("a\nb\nc")).toThrow(/missing a lex:\/vec:\/hyde:/);
});
+ test("mixing expand: with other lines throws", () => {
+ expect(() => parseStructuredQuery("expand: question\nlex: keywords"))
+ .toThrow(/cannot mix expand with typed lines/);
+ });
+ test("expand: without text throws", () => {
+ expect(() => parseStructuredQuery("expand: ")).toThrow(/must include text/);
+ });
+ test("typed line without text throws", () => {
+ expect(() => parseStructuredQuery("lex: \nvec: real")).toThrow(/must include text/);
+ });
});
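The assertions in this hunk pin down a contract: every non-blank line needs a `lex:`, `vec:`, or `hyde:` prefix, `expand:` must stand alone, typed lines need text, and each entry records its 1-based source line (blank lines still count). A minimal sketch consistent with those assertions might look like the following. `parseQueryDocumentSketch` and `TypedLine` are hypothetical names, the error wording beyond the asserted fragments is invented, and returning `null` for a lone plain or `expand:` line is an assumption, not the parser shipped in this commit.

```typescript
// Hypothetical sketch; not the actual parseStructuredQuery implementation.
type TypedLine = { type: "lex" | "vec" | "hyde"; query: string; line: number };

function parseQueryDocumentSketch(input: string): TypedLine[] | null {
  // Number lines before dropping blanks so `line` matches the source position.
  const nonBlank = input
    .split("\n")
    .map((text, i) => ({ text: text.trim(), line: i + 1 }))
    .filter((entry) => entry.text.length > 0);
  if (nonBlank.length === 0) return null;

  const typed: TypedLine[] = [];
  let sawExpand = false;
  for (const { text, line } of nonBlank) {
    // Only a prefix at the start of the line counts; later colons are query text.
    const match = /^(lex|vec|hyde|expand):\s*(.*)$/.exec(text);
    if (!match) {
      // Assumption: a single plain line is left for the caller to expand.
      if (nonBlank.length === 1) return null;
      throw new Error(`Line ${line} is missing a lex:/vec:/hyde: prefix`);
    }
    const prefix = match[1] ?? "";
    const query = match[2] ?? "";
    if (query.length === 0) {
      throw new Error(`Line ${line} (${prefix}) must include text`);
    }
    if (prefix === "expand") {
      sawExpand = true;
      continue;
    }
    typed.push({ type: prefix as TypedLine["type"], query, line });
  }
  if (sawExpand) {
    if (nonBlank.length > 1) throw new Error("cannot mix expand with typed lines");
    return null; // Assumption: a lone expand: line is also handled by the caller.
  }
  return typed;
}
```

Numbering lines before filtering is what makes `"lex: keywords\n\nvec: question"` report lines 1 and 3 rather than 1 and 2.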
@@ -181,58 +198,56 @@ describe("parseStructuredQuery", () => {
test("empty lines ignored", () => {
const result = parseStructuredQuery("lex: keywords\n\nvec: question\n");
expect(result).toEqual([
- { type: "lex", query: "keywords" },
- { type: "vec", query: "question" },
+ { type: "lex", query: "keywords", line: 1 },
+ { type: "vec", query: "question", line: 3 },
]);
});
test("whitespace-only lines ignored", () => {
const result = parseStructuredQuery("lex: keywords\n \nvec: question");
expect(result).toEqual([
- { type: "lex", query: "keywords" },
- { type: "vec", query: "question" },
+ { type: "lex", query: "keywords", line: 1 },
+ { type: "vec", query: "question", line: 3 },
]);
});
test("leading/trailing whitespace trimmed from lines", () => {
const result = parseStructuredQuery(" lex: keywords \n vec: question ");
expect(result).toEqual([
- { type: "lex", query: "keywords" },
- { type: "vec", query: "question" },
+ { type: "lex", query: "keywords", line: 1 },
+ { type: "vec", query: "question", line: 2 },
]);
});
test("internal whitespace preserved in query", () => {
const result = parseStructuredQuery("lex: multiple spaces ");
- expect(result).toEqual([{ type: "lex", query: "multiple spaces" }]);
+ expect(result).toEqual([{ type: "lex", query: "multiple spaces", line: 1 }]);
});
- test("empty prefix value skipped", () => {
- const result = parseStructuredQuery("lex: \nvec: actual query");
- expect(result).toEqual([{ type: "vec", query: "actual query" }]);
+ test("empty prefix value throws", () => {
+ expect(() => parseStructuredQuery("lex: \nvec: actual query")).toThrow(/must include text/);
});
- test("only empty prefix values returns null", () => {
- const result = parseStructuredQuery("lex: \nvec: \nhyde: ");
- expect(result).toBeNull();
+ test("only empty prefix values throws", () => {
+ expect(() => parseStructuredQuery("lex: \nvec: \nhyde: ")).toThrow(/must include text/);
});
});
describe("edge cases", () => {
test("colon in query text preserved", () => {
const result = parseStructuredQuery("lex: time: 12:30 PM");
- expect(result).toEqual([{ type: "lex", query: "time: 12:30 PM" }]);
+ expect(result).toEqual([{ type: "lex", query: "time: 12:30 PM", line: 1 }]);
});
test("prefix-like text in query preserved", () => {
const result = parseStructuredQuery("vec: what does lex: mean");
- expect(result).toEqual([{ type: "vec", query: "what does lex: mean" }]);
+ expect(result).toEqual([{ type: "vec", query: "what does lex: mean", line: 1 }]);
});
test("newline in hyde passage (as single line)", () => {
// If user wants actual newlines in hyde, they need to escape or use multiline syntax
const result = parseStructuredQuery("hyde: The answer is X. It means Y.");
- expect(result).toEqual([{ type: "hyde", query: "The answer is X. It means Y." }]);
+ expect(result).toEqual([{ type: "hyde", query: "The answer is X. It means Y.", line: 1 }]);
});
});
});
@@ -318,6 +333,18 @@ describe("structuredSearch", () => {
expect(r.score).toBeGreaterThanOrEqual(0.5);
}
});
+ test("throws when lex query contains newline characters", async () => {
+ await expect(structuredSearch(store, [
+ { type: "lex", query: "foo\nbar", line: 3 }
+ ])).rejects.toThrow(/Line 3 \(lex\):/);
+ });
+ test("throws when lex query has unmatched quote", async () => {
+ await expect(structuredSearch(store, [
+ { type: "lex", query: "\"unfinished phrase", line: 2 }
+ ])).rejects.toThrow(/unmatched double quote/);
+ });
});
// =============================================================================
@@ -346,6 +373,20 @@ describe("lex query syntax", () => {
)).toBeNull();
});
});
+ describe("validateLexQuery", () => {
+ test("accepts basic lex query", () => {
+ expect(validateLexQuery("auth token")).toBeNull();
+ });
+ test("rejects newline", () => {
+ expect(validateLexQuery("foo\nbar")).toContain("single line");
+ });
+ test("rejects unmatched quote", () => {
+ expect(validateLexQuery("\"unfinished")).toContain("unmatched");
+ });
+ });
});
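The new `validateLexQuery` tests imply a validator that returns `null` for an acceptable query and a human-readable message otherwise. One way to satisfy the three assertions above is sketched below. `validateLexQuerySketch` is a hypothetical name, the message wording beyond the asserted fragments ("single line", "unmatched double quote") is invented, and counting quotes is an assumed technique; the real function may enforce further rules.

```typescript
// Hypothetical sketch; the real validateLexQuery may check more than this.
function validateLexQuerySketch(query: string): string | null {
  // A lex query occupies exactly one line of a query document.
  if (/[\r\n]/.test(query)) {
    return "lex queries must be a single line";
  }
  // Quoted phrases need a closing quote, so the number of quotes must be even.
  const quoteCount = (query.match(/"/g) ?? []).length;
  if (quoteCount % 2 !== 0) {
    return "lex query has an unmatched double quote";
  }
  return null;
}
```

Returning a message rather than throwing lets a caller such as `structuredSearch` prepend its own context, for example the `Line 3 (lex):` prefix asserted in the tests earlier in this diff.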
// =============================================================================