Add named entity extraction to GRPO reward function

Key changes:
- Extract named entities (acronyms, proper nouns, technical terms)
- Heavy penalty (-30) when lex queries miss named entities
- Penalty (-15) for generic filler phrases like "find information about"
- Compound entity detection (TDS motorsports -> both words)
- Update GRPO config with KL regularization (beta=0.04)
- Lower learning rate (5e-7) and add max_steps (200)

Test results:
- "who is TDS motorsports" good: 1.00, bad: 0.30 (was 0.75)
- "how to use React hooks" good: 0.87, bad: 0.45 (was 0.75)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-25 00:05:40 -05:00

5.2 KiB


QMD Query Expansion Model Finetuning

Finetune small Qwen models for QMD's query expansion task.

Goal

Train models that convert user queries into retrieval-optimized outputs:

Input: "how to configure authentication"

Output:
lex: authentication setup
lex: auth configuration
vec: how to set up user authentication in the application
hyde: To configure authentication, set the AUTH_SECRET environment variable and enable the auth middleware in your application config.

Output Format

Type	Purpose	Count
`lex:`	BM25 keyword variations (short, keyword-focused)	1-3
`vec:`	Semantic reformulations (natural language)	1-3
`hyde:`	Hypothetical document passage (50-150 chars)	0-1
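The table above can be turned into a small validator. The following is a minimal sketch, not the project's actual parser: the function name `parse_expansion` and the error messages are hypothetical, and it only checks prefixes and line counts (the 50-150 character hyde length is not enforced here).

```python
import re
from collections import defaultdict

# Allowed line counts per prefix, taken from the Output Format table.
LIMITS = {"lex": (1, 3), "vec": (1, 3), "hyde": (0, 1)}
LINE_RE = re.compile(r"^(lex|vec|hyde):\s*(\S.*)$")


def parse_expansion(text):
    """Group expansion lines by prefix; raise ValueError on malformed output."""
    parsed = defaultdict(list)
    for raw in text.strip().splitlines():
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between entries
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"unexpected line: {line!r}")
        parsed[match.group(1)].append(match.group(2).strip())
    for kind, (low, high) in LIMITS.items():
        count = len(parsed[kind])
        if not low <= count <= high:
            raise ValueError(f"{kind}: expected {low}-{high} lines, got {count}")
    return dict(parsed)
```

A check of this kind is useful both when consuming model output and when filtering training data, because any free-form prose fails immediately.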

Trained Models

Model	HuggingFace	Score	Status
Qwen3-0.6B v4 (SFT)	tobil/qmd-query-expansion-0.6B-v4	98.8%	Recommended
Qwen3-0.6B v4 (GRPO)	tobil/qmd-query-expansion-0.6B-v4-grpo	0%	Failed - catastrophic drift

Prompt Format

The models use the Qwen3 chat template, with /no_think in the user message to disable thinking mode.

Inference (Python)

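from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
# Assumption: the finetuned checkpoint loads directly as full model weights.
model = AutoModelForCausalLM.from_pretrained("tobil/qmd-query-expansion-0.6B-v4")

query = "how to configure authentication"

# CRITICAL: Use /no_think to disable Qwen3's thinking mode
messages = [{"role": "user", "content": f"/no_think Expand this search query: {query}"}]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Generate and decode (max_new_tokens is an illustrative limit)
inputs = tokenizer(prompt, return_tensors="pt")
tokens = model.generate(**inputs, max_new_tokens=256)[0]
output = tokenizer.decode(tokens, skip_special_tokens=True)

# Extract assistant response (skip_special_tokens converts to "user\n...\nassistant\n...")
# Fall back to the full decoded text if the marker is missing.
if "\nassistant\n" in output:
    expansion = output.split("\nassistant\n")[-1].strip()
else:
    expansion = output.strip()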

Raw Format

<|im_start|>user
/no_think Expand this search query: auth<|im_end|>
<|im_start|>assistant
lex: authentication configuration
lex: auth settings
vec: how to configure authentication
vec: authentication setup guide
hyde: To configure authentication, set AUTH_SECRET in your environment.<|im_end|>

See PROMPT_FORMAT.md for complete specification.
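For runtimes where the tokenizer's chat template is not available, the raw format above can be assembled by hand. This is a minimal sketch based only on the example shown here; `build_raw_prompt` is a hypothetical helper, and the tokenizer's own template (or PROMPT_FORMAT.md) should be treated as the source of truth if the two ever differ.

```python
def build_raw_prompt(query: str) -> str:
    """Return the prompt text up to the start of the assistant turn."""
    return (
        "<|im_start|>user\n"
        f"/no_think Expand this search query: {query}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


print(build_raw_prompt("auth"))
```

Generation should then stop at the first `<|im_end|>` token, which closes the assistant turn.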

Directory Structure

finetune/
├── train.py              # SFT training (uses YAML config)
├── rl.py                 # GRPO/RL training (uses YAML config)
├── evaluate_model.py     # Evaluate finetuned models
├── tui.py                # Interactive testing interface
├── configs/
│   ├── sft_v4.yaml       # SFT training config
│   └── grpo_v4.yaml      # GRPO training config
├── dataset/
│   ├── prepare_data.py   # Prepare training data
│   ├── clean_data.py     # Data quality improvements
│   └── generate_data*.py # Generate from source datasets
├── PROMPT_FORMAT.md      # Prompt format specification
├── SCORING.md            # Scoring criteria
└── data/
    └── train/            # Prepared training data

Quick Start

1. Prepare Training Data

cd dataset
uv run prepare_data.py --add-short 5

2. Train with YAML Config

# Local training
uv run train.py --config configs/sft_v4.yaml

# Or on HuggingFace Jobs
hf jobs uv run --flavor a10g-large --timeout 2h --secrets HF_TOKEN \
  "https://huggingface.co/datasets/tobil/qmd-query-expansion-train-v2/resolve/main/train_sft_v4.py"

3. Evaluate

uv run evaluate_model.py --model tobil/qmd-query-expansion-0.6B-v4

4. Interactive Testing

uv run tui.py

Training Configuration

Default SFT config (configs/sft_v4.yaml):

Parameter	Value
Method	LoRA (rank 16, alpha 32)
Learning Rate	2e-4
Epochs	3
Batch Size	4 (with 4x gradient accumulation)
Max Seq Length	512
Target Modules	q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
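Two derived quantities follow from the table above and are worth keeping in mind when changing it. The snippet below is only an illustration: the dictionary keys are made up for this example and are not necessarily the key names used in configs/sft_v4.yaml, and a single training device is assumed.

```python
# Values copied from the SFT table; key names are illustrative only.
sft_config = {
    "lora_rank": 16,
    "lora_alpha": 32,
    "learning_rate": 2e-4,
    "epochs": 3,
    "batch_size": 4,
    "gradient_accumulation_steps": 4,
    "max_seq_length": 512,
}

# Examples contributing to each optimizer step (single device assumed).
effective_batch_size = (
    sft_config["batch_size"] * sft_config["gradient_accumulation_steps"]
)

# Standard LoRA scales the low-rank update by alpha / rank.
lora_scaling = sft_config["lora_alpha"] / sft_config["lora_rank"]

print(effective_batch_size, lora_scaling)  # 16 2.0
```

If the rank is changed, scaling alpha by the same factor keeps the update magnitude comparable.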

Training Dataset

Dataset: tobil/qmd-query-expansion-train-v2
Size: 6,180 examples (26.5% short queries)
Format: Qwen3 chat messages with /no_think directive

Key improvements in v2:

- Short query examples with proper expansions
- Hyde passages truncated to 150 chars
- Key term preservation in lex lines
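To make the record shape concrete, here is a sketch of how one training example could be assembled. The `messages` field layout is an assumption (a common chat-dataset convention), not something confirmed by the dataset card, and `make_example` is a hypothetical helper. The hard cut at 150 characters mirrors the truncation mentioned above and may split a word.

```python
import json

HYDE_MAX_CHARS = 150  # limit stated in the v2 improvements


def make_example(query, lex, vec, hyde=None):
    """Build one chat-format training record (field names are assumed)."""
    lines = [f"lex: {t}" for t in lex] + [f"vec: {t}" for t in vec]
    if hyde:
        lines.append(f"hyde: {hyde[:HYDE_MAX_CHARS].rstrip()}")
    return {
        "messages": [
            {
                "role": "user",
                "content": f"/no_think Expand this search query: {query}",
            },
            {"role": "assistant", "content": "\n".join(lines)},
        ]
    }


record = make_example(
    "auth",
    lex=["authentication configuration", "auth settings"],
    vec=["how to configure authentication"],
    hyde="To configure authentication, set AUTH_SECRET in your environment.",
)
print(json.dumps(record, indent=2))
```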

Evaluation Results

SFT v4 (98.8% average score)

All 21 test queries were rated "Excellent". Sample results:

Query	Score	Rating
`how to configure authentication`	99%	Excellent
`auth`	95%	Excellent
`git rebase vs merge`	100%	Excellent
`react useEffect cleanup`	100%	Excellent

GRPO v4 (0% - Failed)

GRPO training caused catastrophic drift. The model now generates verbose explanations instead of the structured lex:/vec:/hyde: format.

Root cause: the reward function did not enforce the output format strictly enough, so the model learned that verbose explanations could score higher than concise structured output.
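One way to close this loophole is to gate the reward on format: a completion earns a content score only if every line is well-formed, so prose can never outscore structured output. The sketch below illustrates the idea and is not the code in rl.py; `format_gated_reward` and the `quality_score` callable are hypothetical names, with `quality_score` standing in for whatever content-based scoring already exists.

```python
import re

LINE_RE = re.compile(r"^(lex|vec|hyde):\s*\S")


def format_gated_reward(completion, quality_score):
    """Return quality_score(completion) for well-formed output, else 0.0."""
    lines = [ln.strip() for ln in completion.strip().splitlines() if ln.strip()]
    if not lines or any(LINE_RE.match(ln) is None for ln in lines):
        return 0.0  # any free-form prose forfeits the whole reward
    kinds = [ln.split(":", 1)[0] for ln in lines]
    counts_ok = (
        1 <= kinds.count("lex") <= 3
        and 1 <= kinds.count("vec") <= 3
        and kinds.count("hyde") <= 1
    )
    if not counts_ok:
        return 0.0
    return quality_score(completion)
```

Because the gate is multiplicative rather than additive, no amount of content quality can compensate for a broken format, which is the property the failed run lacked.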

Known Issues

- GRPO drift: RL training causes the model to lose its SFT-learned formatting. The reward function needs stricter format enforcement.
- Key term preservation: Some lex lines are still too generic and omit key terms from the query.
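The commit message at the top of this page describes a named-entity check aimed at the key term issue. A rough sketch of such a check follows. It is a simplified heuristic written for illustration, not the actual reward code: the function names and the regex tokenization are assumptions, the penalty values are the ones quoted in the commit message (their scale is not stated there), and the only filler phrase listed is the one example given.

```python
import re

ENTITY_PENALTY = -30  # lex lines miss a named entity (per the commit message)
FILLER_PENALTY = -15  # generic filler phrase (per the commit message)
FILLER_PHRASES = ("find information about",)  # only example given; extend as needed


def extract_entities(query):
    """Heuristic: acronyms and capitalized non-initial words are entities."""
    tokens = re.findall(r"[A-Za-z0-9]+", query)
    entities = set()
    for i, tok in enumerate(tokens):
        is_acronym = len(tok) >= 2 and tok.isupper()
        is_capitalized = i > 0 and tok[0].isupper()
        if is_acronym or is_capitalized:
            entities.add(tok.lower())
            # Compound entity: keep the word after an acronym ("TDS motorsports").
            if is_acronym and i + 1 < len(tokens):
                entities.add(tokens[i + 1].lower())
    return entities


def lex_penalty(query, lex_lines):
    """Sum the penalties that apply to the lex lines for this query."""
    lex_text = " ".join(lex_lines).lower()
    lex_words = set(re.findall(r"[a-z0-9]+", lex_text))
    penalty = 0
    if extract_entities(query) - lex_words:
        penalty += ENTITY_PENALTY
    if any(phrase in lex_text for phrase in FILLER_PHRASES):
        penalty += FILLER_PENALTY
    return penalty
```

The compound rule is deliberately crude: it also captures function words that happen to follow an acronym, so a production version would likely need a stopword list.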
