Project Overview
NoteRecall is a lightweight, privacy-preserving Retrieval-Augmented Generation (RAG) system that enables secure, on-device question answering over personal documents such as medical notes or offline records. Built with quantized embedding and reader models, the system performs end-to-end inference on devices like M2 MacBooks while ensuring user data never leaves the device.
✨ Developed a privacy-preserving RAG pipeline for on-device question answering over medical documents
✨ Applied quantization, pruning, and model distillation to optimize inference efficiency on consumer hardware
✨ Achieved a BERTScore F1 of ~0.56 while preserving retrieval and generation quality
✨ Enabled 2.4× more inference cycles under a 10Wh energy budget compared to baseline models
Description
The NoteRecall pipeline embeds user documents and retrieves relevant chunks to answer questions locally using a distilled and quantized reader model.
1. Motivation & Privacy-First Setup
- Eliminates the need for cloud-based LLMs and reduces the risk of data leakage
- Designed for privacy-critical use cases like medical records and personal notes
- Empowers users with offline, secure information retrieval
- Ensures all computation (embedding, search, generation) remains local
2. Task Setup & Data
- Input: User documents (context) and natural language queries
- Output: Retrieved passages and generated answers
- Dataset: BioASQ Task B medical QA corpus (3,680 docs, 300 QA pairs)
- Metrics: BERTScore F1, end-to-end latency, energy efficiency (inferences per 10Wh)
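A minimal sketch of the BERTScore F1 metric using the open-source `bert-score` package; the candidate/reference pair below is illustrative and not taken from BioASQ:

```python
# Minimal BERTScore F1 computation with the bert-score package; the candidate and
# reference strings are illustrative, not taken from BioASQ.
from bert_score import score

candidates = ["Metformin is a first-line treatment for type 2 diabetes."]
references = ["Metformin is the recommended first-line therapy for type 2 diabetes."]

# score() returns per-pair precision, recall, and F1 tensors
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.4f}")
```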
3. Model Architecture & Pipeline
- Dense retriever: GTE-Qwen-2 Instruct (1.5B)
- Reader model: LLaMA3.2-1B or distilled Flan-T5
- Vector store: FAISS with HNSW indexing
- Chunking strategy: 512-token overlapping spans
- Top-k retrieved passages (k=3) used as reader context (see the retrieval sketch after this list)
- End-to-end inference optimized with MLX and MPS
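A minimal sketch of this retrieve-then-read flow, assuming `sentence-transformers` for the GTE-Qwen2 embedder and FAISS for HNSW search. The model id, chunk sizes, and prompt template are illustrative assumptions, not the project's exact configuration:

```python
# Minimal retrieve-then-read sketch. The embedder id, chunk sizes, and prompt
# template are assumptions for illustration, not the project's exact configuration.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

EMBEDDER_ID = "Alibaba-NLP/gte-Qwen2-1.5B-instruct"  # assumed Hugging Face id

def chunk_tokens(tokens, size=512, overlap=64):
    """Split a token-id list into overlapping spans of `size` tokens."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

embedder = SentenceTransformer(EMBEDDER_ID)

# `documents` stands in for the user's local notes.
documents = ["...medical note text...", "...another document..."]
chunks = []
for doc in documents:
    ids = embedder.tokenizer(doc, add_special_tokens=False)["input_ids"]
    chunks += [embedder.tokenizer.decode(span) for span in chunk_tokens(ids)]

# Embed chunks and build an HNSW index (vectors are normalized, so L2 ranks like cosine).
emb = embedder.encode(chunks, normalize_embeddings=True).astype(np.float32)
index = faiss.IndexHNSWFlat(emb.shape[1], 32)  # 32 = HNSW neighbours per node
index.add(emb)

def retrieve(query, k=3):
    q = embedder.encode([query], normalize_embeddings=True).astype(np.float32)
    _, idx = index.search(q, k)
    return [chunks[i] for i in idx[0]]

# Top-3 passages become the reader's context.
question = "What is the first-line treatment for type 2 diabetes?"
context = "\n\n".join(retrieve(question))
prompt = (
    "Answer the question using only the context.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)
```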
4. Efficiency Optimizations
- Quantization (Q3_K_S, Q5_K_M via llama.cpp)
- Structured and unstructured pruning of FFN weights & attention heads (see the pruning sketch after this list)
- Model distillation from LLaMA3 to Flan-T5 for latency and energy gains (see the distillation sketch after this list)
- Evaluated MoE-style readers for tradeoff exploration
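A minimal sketch of the unstructured L1 pruning step using `torch.nn.utils.prune`, assuming a Hugging Face LLaMA-style reader; the checkpoint id, 30% sparsity, and choice of FFN linear layers are illustrative assumptions:

```python
# Unstructured L1 magnitude pruning of FFN (MLP) projections with torch.nn.utils.prune.
# The checkpoint id, 30% sparsity, and module filter are illustrative assumptions.
import torch.nn as nn
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

for name, module in model.named_modules():
    # Target FFN linear layers; attention heads would instead be removed with a
    # structured, per-head L1 criterion.
    if isinstance(module, nn.Linear) and "mlp" in name:
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the zeros into the weight tensor
```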
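And a sketch of the sequence-level distillation step, in which teacher-generated answers to synthetic questions supervise the Flan-T5 student; the sample pair, checkpoint id, and hyperparameters are assumptions for illustration:

```python
# Sequence-level distillation: teacher-generated answers to synthetic questions
# become supervision targets for the Flan-T5 student. The sample pair and
# hyperparameters below are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

student_id = "google/flan-t5-base"
tok = AutoTokenizer.from_pretrained(student_id)
student = AutoModelForSeq2SeqLM.from_pretrained(student_id)
optimizer = torch.optim.AdamW(student.parameters(), lr=3e-5)

# (prompt with retrieved context, answer generated by the teacher model)
synthetic_pairs = [
    ("Context: ...retrieved passage... Question: What does metformin treat?",
     "Type 2 diabetes."),
]

student.train()
for prompt, teacher_answer in synthetic_pairs:
    inputs = tok(prompt, return_tensors="pt", truncation=True, max_length=512)
    labels = tok(teacher_answer, return_tensors="pt", truncation=True).input_ids
    loss = student(**inputs, labels=labels).loss  # standard seq2seq cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```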
5. Results Summary
- BERTScore F1: 0.5693 (full-precision), 0.5597 (distilled Flan-T5)
- Latency: Reduced from 8.5s to 1.45s (quantized models)
- Energy: ~2.4× more inferences under 10Wh using Flan-T5, tracked with CodeCarbon (see the sketch after this list)
- MoE models yielded higher accuracy but were memory inefficient on M2 hardware
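A rough sketch of how the energy figure can be derived from CodeCarbon tracking around a batch of end-to-end inferences; the query set, placeholder pipeline, and the attribute used to read total energy are assumptions (CodeCarbon also writes `energy_consumed` in kWh to its `emissions.csv` output):

```python
# Rough energy accounting with CodeCarbon around a batch of end-to-end inferences.
# The query set and answer() pipeline are placeholders.
from codecarbon import EmissionsTracker

N_QUERIES = 100
queries = ["What is the first-line treatment for type 2 diabetes?"] * N_QUERIES

def answer(query: str) -> str:
    return "..."  # placeholder for the real retrieve + generate pipeline

tracker = EmissionsTracker()
tracker.start()
for q in queries:
    answer(q)
tracker.stop()

# Total energy in kWh; attribute name assumed here, and the same figure is written
# to CodeCarbon's emissions.csv output.
energy_wh = tracker.final_emissions_data.energy_consumed * 1000
print(f"Inferences per 10 Wh: {N_QUERIES / energy_wh * 10:.1f}")
```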
Tools & Frameworks
| Area | Tools / Stack Used |
|---|---|
| Retriever | GTE-Qwen-2 Instruct, FAISS, HNSW |
| Reader Models | LLaMA3.2-1B-Instruct, Flan-T5-Base, Qwen1.5-MoE-A2.7B |
| Quantization & Pruning | llama.cpp, GGUF, L1-based head pruning, Optimum, torch.nn.utils.prune |
| Distillation | Synthetic QA pairs, LLaMA 7B (teacher) → Flan-T5 (student) |
| On-Device Inference | MLX, MPS (Apple Silicon), GGML |
| Evaluation | BioASQ, BERTScore, Latency profiler, CodeCarbon (for energy tracking) |