I tested local models on 100+ real RAG tasks. Here are the best 1B model picks
Published on October 13, 2025

I ran a head-to-head comparison of lightweight, on-device models over 100+ real retrieval-augmented generation (RAG) tasks to see what actually performs best in practical workflows.
This post summarizes key takeaways and links to the full write-up with benchmarks, observations, and recommendations.
TL;DR: Best model by task
(Tested on a 16 GB MacBook Air M2)
- A: Find facts + cite sources → Qwen3-1.7B-MLX-8bit
- B: Compare evidence across files → LFM2-1.2B-MLX
- C: Build timelines → LFM2-1.2B-MLX
- D: Summarize documents → Qwen3-1.7B-MLX-8bit & LFM2-1.2B-MLX
- E: Organize themed collections → requires models larger than 1B
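The TL;DR above amounts to a simple task→model routing table. A minimal sketch of that lookup in Python (the task keys are my own illustrative names; the model identifiers are the ones benchmarked here):

```python
# Task -> recommended model, distilled from the TL;DR above.
# A list means either model performed comparably; None means no
# sub-1B model handled the task well (larger models are needed).
BEST_MODEL_BY_TASK = {
    "find_facts_cite_sources": "Qwen3-1.7B-MLX-8bit",
    "compare_evidence_across_files": "LFM2-1.2B-MLX",
    "build_timelines": "LFM2-1.2B-MLX",
    "summarize_documents": ["Qwen3-1.7B-MLX-8bit", "LFM2-1.2B-MLX"],
    "organize_themed_collections": None,
}


def pick_model(task):
    """Return the recommended model name for a task.

    Returns None when the task is unknown or when no ~1B model
    was good enough in the tests.
    """
    choice = BEST_MODEL_BY_TASK.get(task)
    if isinstance(choice, list):
        # Either model works; default to the first listed.
        return choice[0]
    return choice
```

In a local RAG pipeline this kind of router lets you keep one small default model loaded and swap in the better-suited one per task, e.g. `pick_model("build_timelines")` returns `"LFM2-1.2B-MLX"`.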
Read the full write-up with benchmarks and methodology on Medium: Read the full article →
Thanks for reading! If you have thoughts or want to compare notes, feel free to reach out.