qmd/finetune
Tobias Lütke eb1b77c8cb
Deploy fine-tuned GRPO model as default query expansion (#67)
* Add query expansion model finetuning infrastructure

- Training scripts for Qwen3-0.6B and 1.7B models
- Dataset generation from s-emanuilov/query-expansion
- Evaluation scripts comparing finetuned vs baseline models
- GRPO RL training script (optional improvement)
- Export script for GGUF conversion

Results:
- 0.6B finetuned: 95% format compliance (lex/vec/hyde)
- Baseline: 0% format compliance
- Dataset: 5,157 examples on HuggingFace Hub

Models available at:
- tobil/qmd-query-expansion-0.6B (recommended)
- tobil/qmd-query-expansion-train (dataset)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix GRPO training script for TRL API compatibility

- Use max_completion_length instead of max_new_tokens
- Use processing_class instead of tokenizer
- Use args instead of config for GRPOTrainer
- Add __name__ attribute to reward function class
- Accept **kwargs in reward function for extra TRL args
- Add new LoRA adapter after merging SFT weights

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Update README with final evaluation results

- 0.6B SFT: 95% format compliance (best)
- 0.6B GRPO: 0% (catastrophic forgetting from RL)
- 1.7B v2: training completed, evaluation pending
- Added GRPO evaluation results

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add comprehensive scoring system for query expansion

New scoring criteria (0-100 points):
- Format (30): Must have lex: and vec: prefixes
- Diversity (30): Multiple types, no echoing query, diverse expansions
- Hyde (20): Optional, concise, no newlines, no word repetition
- Quality (20): Lex=keywords, vec=natural language

See SCORING.md for full documentation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add HuggingFace login and comprehensive scoring to GRPO v2 training

- Add explicit HF_TOKEN login before training
- Use SCORING.md criteria as RL reward function
- Conservative training: LR 1e-6, LoRA rank 4
- Reward scores: good=0.94, bad=0.38

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Refactor finetune folder: train/rl scripts with YAML configs

Major changes:
- train.py: Generic SFT training script using YAML config
- rl.py: Generic GRPO training script using YAML config
- configs/: YAML configs per training run (sft_v4.yaml, grpo_v4.yaml)
- dataset/: Data preparation scripts moved here
- tui.py: Interactive model testing interface

Training results:
- SFT v4: 98.8% avg score (all Excellent)
- GRPO v4: 0% (failed - model drifted to verbose explanations)

Removed per-model scripts (train_0.6B.py, train_1.7B.py, etc)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add named entity extraction to GRPO reward function

Key changes:
- Extract named entities (acronyms, proper nouns, technical terms)
- Heavy penalty (-30) when lex queries miss named entities
- Penalty (-15) for generic filler phrases like "find information about"
- Compound entity detection (TDS motorsports -> both words)
- Update GRPO config with KL regularization (beta=0.04)
- Lower learning rate (5e-7) and add max_steps (200)

Test results:
- "who is TDS motorsports" good: 1.00, bad: 0.30 (was 0.75)
- "how to use React hooks" good: 0.87, bad: 0.45 (was 0.75)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add chat template leakage detection to reward function

Zero reward for outputs containing:
- <|im_start|>, <|im_end|> tokens
- <think>, </think> tags (Qwen3 thinking mode)
- Role markers like \nassistant\n, \nuser\n
- <|endoftext|> token

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Strict format validation: every line must be lex:/vec:/hyde:

Any line that doesn't start with a valid prefix now returns 0.0
instead of just counting as a penalty. This prevents any prose,
explanations, bullet points, or other invalid content.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Clean up evaluation files

- Remove old versioned evaluation files (0.6B, 1.7B, baseline)
- Rename evaluation_v4.json -> evaluation_sft.json
- Rename evaluation_v4_grpo.json -> evaluation_grpo_failed.json

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Refactor evals into separate run and score scripts

New structure:
- evals/run.py: Generate model outputs to JSONL
- evals/score.py: Score outputs with detailed breakdown
- evals/queries.txt: Test queries (26 total)

Features:
- Supports both HF Hub and local model paths
- Named entity preservation scoring
- Chat template leakage detection
- Strict format validation (every line must be lex:/vec:/hyde:)
- Generic phrase detection

Usage:
  uv run evals/run.py --model tobil/qmd-query-expansion-0.6B-v4
  uv run evals/score.py evals/results_*.jsonl

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix GRPO model loading to use SFT base first

The GRPO adapter was trained on merged SFT weights, so loading it
directly on the base model results in 0% score. Added --sft-model
parameter to evals/run.py to load SFT first, then apply GRPO adapter.

With correct loading: GRPO scores 89.7% (all 26 queries Excellent).

Updated README with correct GRPO score and loading instructions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix TUI to load GRPO models with SFT base first

GRPO adapters were trained on merged SFT weights, so they need SFT
loaded and merged first before applying the GRPO adapter.

Updated MODELS config to include sft_base path for GRPO models,
and load_model() now handles the SFT -> merge -> GRPO flow.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Update README for unified model repository structure

All models (0.6B, 1.7B, 4B) with SFT and GRPO variants now go into
a single HuggingFace repo (tobil/qmd-query-expansion) with subfolders
for each size and training method.

Updated loading examples to show subfolder-based model loading.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Update README with separate model repos

Changed from subfolder approach to separate repos per model since
trainer.push_to_hub() doesn't support subfolder argument.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add 1.7B and 4B GRPO training and GGUF conversion scripts

Training scripts for GRPO fine-tuning:
- train_1.7B_grpo.py: GRPO training for Qwen3-1.7B
- train_4B_grpo.py: GRPO training for Qwen3-4B

GGUF conversion scripts:
- convert_1.7B_gguf.py: Merge SFT+GRPO adapters and convert to GGUF
- convert_4B_gguf.py: Merge SFT+GRPO adapters and convert to GGUF

All scripts use PEP 723 inline dependencies for HuggingFace Jobs.

Models published:
- tobil/qmd-query-expansion-1.7B-sft
- tobil/qmd-query-expansion-1.7B-grpo
- tobil/qmd-query-expansion-1.7B-gguf
- tobil/qmd-query-expansion-4B-sft
- tobil/qmd-query-expansion-4B-grpo
- tobil/qmd-query-expansion-4B-gguf

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Remove beads issue tracking

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Remove beads reference from CLAUDE.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix GRPO reward function to handle think blocks and end tokens

- Strip <|im_end|> token from completions (model output includes it)
- Change think_penalty to skipped_think bonus (+20 for not using think)
- Adjust max_possible to account for bonus (120/140)
- Fix typo in chat template artifact check

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Make TUI model list dynamic from HuggingFace Hub

- Fetch available qmd-query-expansion models from tobil/ on Hub
- Auto-detect model size (0.6B, 1.7B, 4B) and use correct base model
- Group models by type (SFT vs GRPO) in menu
- Skip GGUF repos in model listing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix GRPO training: apply chat template to prompts

The SFT model was trained with chat template format but GRPO was
passing raw prompts. Now prompts are formatted with tokenizer.apply_chat_template()
so the model sees the same format it learned during SFT.

Also update extract_query_from_prompt to strip chat template artifacts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Finetune 2.0: consolidate and simplify the entire training pipeline

Consolidate ~2,800 lines of duplicated code across 12 files into 5 clean,
well-documented files targeting Qwen3-1.7B end-to-end.

Key changes:
- Extract reward function into single source of truth (reward.py)
  Previously duplicated 3x with divergent bugs across rl.py,
  train_1.7B_grpo.py, and train_4B_grpo.py
- Unify training into one script with sft/grpo subcommands (train.py)
  Replaces train.py + rl.py + train_1.7B_grpo.py + train_4B_grpo.py
- Merge eval generate+score into single eval.py
  Replaces evals/run.py + evals/score.py
- Parameterize GGUF conversion by --size (convert_gguf.py)
  Replaces convert_1.7B_gguf.py + convert_4B_gguf.py
- Fix critical bug: rl.py silently ignored beta/temperature from config,
  causing the exact catastrophic drift its own comments warned about
- Fix prompt consistency: all files use /no_think chat template format
- Retarget configs from 0.6B to 1.7B
- Comprehensive README documenting the full pipeline

Removed: rl.py, train_1.7B_grpo.py, train_4B_grpo.py, convert_1.7B_gguf.py,
convert_4B_gguf.py, tui.py, evals/run.py, evals/score.py

Net: -3,429 lines, +382 lines

Co-Authored-By: Claude (claude-fudge-eap-cc) <noreply@anthropic.com>

* Add HF Jobs scripts, temporal query examples, and training results

- jobs/sft.py and jobs/grpo.py: self-contained scripts for
  `hf jobs uv run` (no local GPU needed)
- 12 temporal/recency query examples in training data (e.g. "recent
  news about Shopify" -> lex with years 2025/2026)
- 4 temporal test queries in evals/queries.txt
- README updated with HF Jobs workflow, training results, and
  updated file structure
- Remove .beads tracking

SFT and GRPO successfully trained on A10G via HF Jobs:
  SFT: eval loss 0.321, token accuracy 92.4%
  GRPO: mean reward 0.757, 200 steps, KL 0.00048

Co-Authored-By: Claude (claude-fudge-eap-cc) <noreply@anthropic.com>

* Deploy fine-tuned GRPO model as default for query expansion

Switch from generic Qwen3-1.7B-Q8_0 (~2.2GB) to fine-tuned
qmd-query-expansion-1.7B-q4_k_m (~1.1GB). The fine-tuned Q4
scores 91.7% avg with 30/30 Excellent, outperforming the base Q8.

- Update default generate model in src/llm.ts
- Update README model table, architecture diagram, config block
- Add v2 training data, eval scripts, and quantize job
- Remove superseded v1 training data (5,742 → 1,000 examples)
- Update finetune README with v2 results and file structure

Co-Authored-By: Claude (claude-fudge-eap-cc) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 23:25:17 -08:00

QMD Query Expansion Fine-Tuning

Train small language models to expand search queries for QMD's hybrid retrieval pipeline.

What This Does

Given a raw search query like "auth config", the trained model produces structured expansions:

lex: authentication configuration
lex: auth settings setup
vec: how to configure authentication settings
vec: authentication configuration options
hyde: Authentication can be configured by setting the AUTH_SECRET environment variable.

These feed into QMD's three search backends:

  • lex: lines go to BM25 full-text search (short, keyword-focused)
  • vec: lines go to vector similarity search (natural language phrases)
  • hyde: lines hold a hypothetical document passage used for embedding-based retrieval (HyDE technique)
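The routing described above can be sketched as a small parser that groups output lines by prefix. This is only an illustration and not QMD's actual parsing code; the function name parse_expansion and the choice to skip unrecognized lines are assumptions.

```python
VALID_PREFIXES = ("lex", "vec", "hyde")


def parse_expansion(text: str) -> dict[str, list[str]]:
    """Group expansion lines by prefix (hypothetical helper, not part of QMD)."""
    groups: dict[str, list[str]] = {prefix: [] for prefix in VALID_PREFIXES}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        # Split on the first colon only, so colons inside the body survive.
        prefix, sep, body = line.partition(":")
        if sep and prefix in VALID_PREFIXES and body.strip():
            groups[prefix].append(body.strip())
        # Assumption: lines with an unknown prefix are silently skipped here.
    return groups


example = """lex: authentication configuration
lex: auth settings setup
vec: how to configure authentication settings
vec: authentication configuration options
hyde: Authentication can be configured by setting the AUTH_SECRET environment variable.
"""

routed = parse_expansion(example)
print(routed["lex"])   # candidates for BM25 full-text search
print(routed["vec"])   # candidates for vector similarity search
print(routed["hyde"])  # hypothetical passage to embed
```

Each group can then be handed to the matching backend independently.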

Quick Start

Cloud training via HuggingFace Jobs (no local GPU needed)

# 1. SFT: teach the model the output format (~45 min on A10G, ~$1.50)
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN --timeout 2h jobs/sft.py

# 2. GRPO: RL refinement on top of SFT (~20 min on A10G, ~$0.50)
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN --timeout 4h jobs/grpo.py

# 3. Evaluate against test queries (needs a local GPU, or use the eval job in jobs/eval.py)
uv run eval.py --model tobil/qmd-query-expansion-1.7B-grpo \
               --sft-model tobil/qmd-query-expansion-1.7B-sft

# 4. Convert to GGUF for local deployment (Ollama, llama.cpp)
uv run convert_gguf.py --size 1.7B

Local training (if you have a GPU)

uv run train.py sft  --config configs/sft.yaml
uv run train.py grpo --config configs/grpo.yaml

Monitoring HF Jobs

hf jobs ps                           # list running jobs
hf jobs inspect <job-id>             # check status
hf jobs logs <job-id>                # stream logs
hf jobs cancel <job-id>              # cancel a job

Prompt Format

All tools use the same prompt — Qwen3 chat template with /no_think:

<|im_start|>user
/no_think Expand this search query: {query}<|im_end|>
<|im_start|>assistant

The /no_think directive suppresses Qwen3's chain-of-thought mode, producing direct lex:/vec:/hyde: output without <think> blocks.
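As a rough illustration, the prompt above can be assembled with plain string formatting. The helper name build_prompt is hypothetical, and the trailing newline after the assistant marker is an assumption; the training scripts may instead rely on the tokenizer's chat template to produce the same text.

```python
def build_prompt(query: str) -> str:
    """Return the /no_think prompt for one query (hypothetical helper)."""
    return (
        "<|im_start|>user\n"
        f"/no_think Expand this search query: {query}<|im_end|>\n"
        # Assumption: generation starts on the line after the assistant marker.
        "<|im_start|>assistant\n"
    )


prompt = build_prompt("auth config")
print(prompt)
```

Keeping the prompt in one helper is one way to make sure training, evaluation, and inference all see the same format.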

File Structure

finetune/
├── reward.py          # Scoring/reward function (single source of truth)
├── train.py           # Unified SFT + GRPO training (two subcommands)
├── eval.py            # Generate expansions and score them
├── convert_gguf.py    # GGUF conversion for Ollama/llama.cpp
├── jobs/
│   ├── sft.py         # Self-contained SFT for HuggingFace Jobs
│   ├── grpo.py        # Self-contained GRPO for HuggingFace Jobs
│   ├── eval.py        # Self-contained eval for HuggingFace Jobs
│   ├── eval_common.py # Shared eval utilities
│   └── quantize.py    # GGUF quantization for HuggingFace Jobs
├── configs/
│   ├── sft.yaml       # SFT hyperparameters for Qwen3-1.7B
│   └── grpo.yaml      # GRPO hyperparameters for Qwen3-1.7B
├── evals/
│   └── queries.txt    # 31 test queries across 8 categories
├── data/
│   └── qmd_expansion_v2.jsonl  # Source training data (1,000 high-quality examples)
├── dataset/
│   ├── generate_data.py         # Generate data via Claude API
│   ├── generate_data_offline.py # Generate from existing HF dataset
│   ├── prepare_data.py          # Format for Qwen3 chat template
│   └── clean_data.py            # Detect technical term misinterpretations
├── SCORING.md         # Detailed scoring rubric reference
└── README.md          # This file

Training Pipeline

Stage 1: SFT (Supervised Fine-Tuning)

Teaches the model the lex:/vec:/hyde: output format from labeled examples.

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-1.7B |
| Method | LoRA (rank 16, alpha 32) |
| Target modules | All projection layers (q/k/v/o/gate/up/down) |
| Dataset | ~2,290 examples (train split) |
| Effective batch size | 16 (4 × 4 gradient accumulation) |
| Epochs | 5 |
| Learning rate | 2e-4 (cosine schedule) |

uv run train.py sft --config configs/sft.yaml
uv run train.py sft --config configs/sft.yaml --dry-run  # preview config
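To get a feel for these SFT settings, the sketch below estimates the number of optimizer steps and the learning rate along a cosine schedule. It assumes a single GPU, reads "4 × 4" as a per-device batch of 4 with 4 accumulation steps, ignores warmup, and decays to zero; the real trainer may round or schedule differently.

```python
import math


def optimizer_steps(num_examples: int, per_device_batch: int,
                    grad_accum: int, epochs: int) -> tuple[int, int]:
    """Return (effective batch size, approximate total optimizer steps)."""
    effective = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective)
    return effective, steps_per_epoch * epochs


def cosine_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Cosine decay from peak_lr to 0 (assumption: no warmup phase)."""
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * step / total_steps))


effective, total = optimizer_steps(2290, per_device_batch=4, grad_accum=4, epochs=5)
print(effective, total)
for step in (0, total // 2, total):
    print(step, cosine_lr(step, total, peak_lr=2e-4))
```

Under these assumptions the run takes roughly 720 optimizer steps, with the learning rate halving around the midpoint.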

Stage 2: GRPO (Group Relative Policy Optimization)

Reinforcement learning on top of the merged SFT weights. The model generates multiple expansions per query, they are scored by the reward function, and the model is updated to prefer higher-scoring outputs.

| Parameter | Value |
|---|---|
| Base | Merged SFT checkpoint |
| Method | LoRA (rank 4, alpha 8), smaller for RL stability |
| Target modules | q_proj, v_proj only |
| Reward | reward.py (rule-based, 5 dimensions) |
| KL beta | 0.04 (prevents drift from the SFT checkpoint) |
| Generations per prompt | 4 |
| Max steps | 200 |
| Learning rate | 5e-7 |

Important: beta > 0 is critical. With beta=0 the model experiences catastrophic drift and scores drop to 0%.
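The two ideas at work here, scoring completions relative to their group and penalizing drift from the reference model, can be sketched in a few lines. This is a simplified illustration of the GRPO recipe, not the code in train.py or the training library's implementation: it uses the population standard deviation, a hypothetical eps value, and the per-token KL estimator from the GRPO paper. The example rewards are made up.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize rewards within one prompt's group of completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def kl_penalty(logp_policy: float, logp_ref: float) -> float:
    """Per-token KL estimate: exp(d) - d - 1 with d = logp_ref - logp_policy."""
    diff = logp_ref - logp_policy
    return math.exp(diff) - diff - 1.0


# Four completions for one prompt (generations per prompt = 4).
rewards = [0.94, 0.38, 0.75, 0.75]
advantages = group_relative_advantages(rewards)
print(advantages)

beta = 0.04
# The penalty is zero when policy and reference agree, positive otherwise.
print(beta * kl_penalty(-1.0, -1.0), beta * kl_penalty(-1.0, -2.0))
```

Completions above the group mean get a positive advantage and are reinforced; the beta-weighted KL term grows as the policy moves away from the SFT checkpoint, which is why setting beta to 0 removes the only force holding the model near its learned format.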

uv run train.py grpo --config configs/grpo.yaml
uv run train.py grpo --config configs/grpo.yaml --dry-run  # test reward function

Evaluation

eval.py generates expansions from a model and scores them against test queries:

# Evaluate an SFT model
uv run eval.py --model tobil/qmd-query-expansion-1.7B-sft

# Evaluate a GRPO model (needs SFT adapter merged first)
uv run eval.py --model tobil/qmd-query-expansion-1.7B-grpo \
               --sft-model tobil/qmd-query-expansion-1.7B-sft

# Verbose output with deduction details
uv run eval.py --model tobil/qmd-query-expansion-1.7B-sft -v

# Save detailed scores to JSON
uv run eval.py --model tobil/qmd-query-expansion-1.7B-sft -o scores.json

# Score an existing JSONL file (backwards compat with old run.py output)
uv run eval.py --score-only evals/results_old.jsonl

Reward Function

reward.py is the single source of truth for scoring. It is used both as the GRPO reward signal during training and for evaluation.

Five scoring dimensions plus a think bonus (max 120 without hyde, 140 with):

| Dimension | Points | What It Measures |
|---|---|---|
| Format | 0-30 | Has lex/vec lines, no invalid lines |
| Diversity | 0-30 | Multiple expansion types, diverse content, no query echoes |
| HyDE | 0-20 | Present, 50-200 chars, single line, not repetitive |
| Quality | 0-20 | Lex shorter than vec, natural language, preserves key terms |
| Entity | -45 to +20 | Named entities preserved in lex and vec lines |
| Think bonus | 0-20 | Reward for NOT using <think> mode |
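One plausible way to turn the point totals above into a 0-1 reward is to divide by the applicable maximum. The sketch below is an assumption about how that normalization could look, not a copy of reward.py: the dictionary keys are hypothetical, and clamping negative totals to 0 is a guess motivated by the Entity row allowing negative points.

```python
def normalized_reward(points: dict[str, float], has_hyde: bool) -> float:
    """Map summed dimension points to [0, 1] (hypothetical normalization)."""
    max_possible = 140 if has_hyde else 120
    total = sum(points.values())
    # Assumption: totals are clamped so entity penalties cannot go below 0.
    return max(0.0, min(1.0, total / max_possible))


perfect_with_hyde = {
    "format": 30, "diversity": 30, "hyde": 20,
    "quality": 20, "entity": 20, "think_bonus": 20,
}
perfect_without_hyde = {k: v for k, v in perfect_with_hyde.items() if k != "hyde"}

print(normalized_reward(perfect_with_hyde, has_hyde=True))
print(normalized_reward(perfect_without_hyde, has_hyde=False))
print(normalized_reward({"format": 10, "entity": -45}, has_hyde=False))
```

Using a different maximum with and without a hyde line keeps an output that omits the optional hyde line from being penalized for it.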

Hard failures (instant 0.0):

  • Chat template leakage (<|im_start|>, <|im_end|>, etc.)
  • Any line without a valid lex:, vec:, or hyde: prefix

# Self-test the reward function
uv run reward.py
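The hard-failure rules listed above amount to a gate that runs before any points are awarded. A minimal sketch of such a gate follows; the marker list is an assumption (the README only says "etc."), as is treating an empty completion as a failure, and the real reward.py may strip some tokens before checking.

```python
# Assumption: a representative subset of chat template tokens.
LEAK_MARKERS = ("<|im_start|>", "<|im_end|>", "<|endoftext|>")
VALID_PREFIXES = ("lex:", "vec:", "hyde:")


def is_hard_failure(completion: str) -> bool:
    """Return True if the completion should score an instant 0.0."""
    if any(marker in completion for marker in LEAK_MARKERS):
        return True
    lines = [ln.strip() for ln in completion.splitlines() if ln.strip()]
    if not lines:
        return True  # assumption: empty output counts as a failure
    # Every non-empty line must start with one of the valid prefixes.
    return any(not ln.startswith(VALID_PREFIXES) for ln in lines)


print(is_hard_failure("lex: auth config\nvec: how to configure auth"))
print(is_hard_failure("lex: auth config\nHere is an explanation of the query."))
```

Because a single stray line of prose zeroes the reward, the model gets no partial credit for output that mixes expansions with explanations.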

GGUF Conversion

Merges base + SFT + GRPO adapters into a single model and produces quantized GGUF files for deployment:

# Use preset for 1.7B
uv run convert_gguf.py --size 1.7B

# Use preset for 4B
uv run convert_gguf.py --size 4B

# Custom models
uv run convert_gguf.py --base Qwen/Qwen3-1.7B \
                       --sft tobil/qmd-query-expansion-1.7B-sft \
                       --grpo tobil/qmd-query-expansion-1.7B-grpo \
                       --output tobil/qmd-query-expansion-1.7B-gguf
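The merge order (base, then SFT, then GRPO) matters because each LoRA adapter is a low-rank delta added to whatever weights it was trained on. The toy example below shows the standard LoRA merge, W + (alpha / rank) · B·A, on 2×2 matrices in pure Python. All values are made up for illustration and this is not the code in convert_gguf.py.

```python
Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product for small toy matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight: Matrix, lora_b: Matrix, lora_a: Matrix,
               alpha: float, rank: int) -> Matrix:
    """Fold one adapter into the weights: W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


base = [[1.0, 0.0], [0.0, 1.0]]
# Toy rank-1 adapters; alpha / rank = 2, the same ratio as 32/16 and 8/4.
sft_b, sft_a = [[1.0], [0.0]], [[0.5, 0.0]]
grpo_b, grpo_a = [[0.0], [1.0]], [[0.0, 0.25]]

after_sft = merge_lora(base, sft_b, sft_a, alpha=2, rank=1)
deployed = merge_lora(after_sft, grpo_b, grpo_a, alpha=2, rank=1)
grpo_on_base = merge_lora(base, grpo_b, grpo_a, alpha=2, rank=1)

print(deployed)      # base + SFT delta + GRPO delta
print(grpo_on_base)  # missing the SFT delta entirely
```

Applying the GRPO adapter straight to the base model skips the SFT delta, so the resulting weights differ from the ones the adapter was trained against.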

Using with Ollama

huggingface-cli download tobil/qmd-query-expansion-1.7B-gguf \
    qmd-query-expansion-1.7B-q4_k_m.gguf --local-dir .

echo 'FROM ./qmd-query-expansion-1.7B-q4_k_m.gguf' > Modelfile
ollama create qmd-expand -f Modelfile
ollama run qmd-expand

Data Pipeline

The training data (1,000 examples in data/qmd_expansion_v2.jsonl) was generated from two sources and cleaned for quality. To regenerate:

# Generate from existing HuggingFace dataset (bulk, no API needed)
uv run dataset/generate_data_offline.py

# Generate via Claude API (higher quality, needs ANTHROPIC_API_KEY)
uv run dataset/generate_data.py --count 100

# Detect and fix technical term misinterpretations
uv run dataset/clean_data.py

# Format for Qwen3 chat template, add short-query augmentation, split train/val
uv run dataset/prepare_data.py
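The final preparation step can be pictured roughly as follows: wrap each query and expansion pair in chat messages, then hold out a validation slice. This is a sketch under assumptions, not the contents of prepare_data.py; the messages layout, the 10% validation fraction, and the seed are all hypothetical, and short-query augmentation is omitted.

```python
import random


def to_chat_example(query: str, expansion: str) -> dict:
    """Wrap one pair as chat messages (hypothetical record layout)."""
    return {
        "messages": [
            {"role": "user", "content": f"/no_think Expand this search query: {query}"},
            {"role": "assistant", "content": expansion},
        ]
    }


def split_train_val(examples: list[dict], val_fraction: float = 0.1,
                    seed: int = 42) -> tuple[list[dict], list[dict]]:
    """Shuffle deterministically and hold out a validation slice."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


pairs = [(f"query {i}", f"lex: term {i}\nvec: a phrase about term {i}") for i in range(10)]
examples = [to_chat_example(q, e) for q, e in pairs]
train, val = split_train_val(examples)
print(len(train), len(val))
```

A fixed seed keeps the split reproducible, so evaluation loss stays comparable across training runs.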

Architecture Notes

The two-stage training approach (SFT → GRPO) follows a common pattern for structured-output models:

  1. SFT establishes format compliance and basic query understanding. It uses a large LoRA (rank 16, all projection layers) because it needs to learn a new output format from scratch.

  2. GRPO refines quality within the learned format. It uses a small LoRA (rank 4, q/v only) and KL regularization to make incremental improvements without losing what SFT taught.
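To make "large" versus "small" LoRA concrete, the sketch below counts trainable adapter parameters per transformer layer for the two configurations. It rests on a toy assumption that every target module is a square 2048×2048 projection, which is not the real shape of every Qwen3-1.7B module, so the numbers only indicate relative scale. The module names are the conventional ones and are assumed here.

```python
def lora_params(rank: int, d_in: int, d_out: int) -> int:
    """A LoRA adapter adds A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


HIDDEN = 2048  # hypothetical dimension, used for every module in this toy count

sft_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
grpo_modules = ["q_proj", "v_proj"]

sft_per_layer = sum(lora_params(16, HIDDEN, HIDDEN) for _ in sft_modules)
grpo_per_layer = sum(lora_params(4, HIDDEN, HIDDEN) for _ in grpo_modules)

print(sft_per_layer, grpo_per_layer, sft_per_layer / grpo_per_layer)
```

Under this simplification the SFT adapter has about 14 times as many trainable parameters per layer as the GRPO adapter, which matches the intent of letting SFT learn the format and limiting GRPO to small adjustments.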

The reward function is entirely rule-based (no LLM judge), which makes it fast, deterministic, and suitable as an RL signal. See SCORING.md for the full rubric.

Training Results (Qwen3-1.7B, v2)

SFT

| Metric | Value |
|---|---|
| Final train loss | 0.472 |
| Final eval loss | 0.304 |
| Token accuracy (train) | 97.4% |
| Token accuracy (eval) | 93.8% |
| Epochs | 5 |
| Hardware | A10G (24 GB VRAM) |

GRPO

| Metric | Value |
|---|---|
| Mean reward | 0.757 |
| Final loss | 0.0005 |
| KL divergence | 0.00048 |
| Mean completion length | ~58 tokens |
| Training time | ~19 min (200 steps) |
| Hardware | A10G (24 GB VRAM) |

Evaluation Scores

| Model | Average Score | Excellent (of 30) |
|---|---|---|
| SFT | 92.0% | 30/30 |
| GRPO | 91.7% | 30/30 |