Project Overview
NoteRecall is a lightweight, privacy-preserving Retrieval-Augmented Generation (RAG) system that enables secure, on-device question answering over personal documents such as medical notes or offline records. Built with quantized embedding and reader models, the system performs end-to-end inference on devices like M2 MacBooks while ensuring user data never leaves the device.
✨ Developed a privacy-preserving RAG pipeline for on-device question answering over medical documents
✨ Applied quantization, pruning, and model distillation to optimize inference efficiency on consumer hardware
✨ Achieved a BERTScore F1 of ~0.56 while preserving retrieval and generation quality
✨ Enabled 2.4× more inference cycles under a 10Wh energy budget compared to baseline models
Description
The NoteRecall pipeline embeds user documents and retrieves relevant chunks to answer questions locally using a distilled and quantized reader model.
1. Motivation & Privacy-First Setup
- Eliminates the need for cloud-based LLMs and reduces the risk of data leakage
- Designed for privacy-critical use cases like medical records and personal notes
- Empowers users with offline, secure information retrieval
- Ensures all computation (embedding, search, generation) remains local
2. Task Setup & Data
- Input: User documents (context) and natural language queries
- Output: Retrieved passages and generated answers
- Dataset: BioASQ Task B medical QA corpus (3,680 docs, 300 QA pairs)
- Metrics: BERTScore F1, end-to-end latency, energy efficiency (inferences per 10Wh)
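A minimal sketch of the BERTScore F1 metric using the open-source `bert-score` package; the candidate/reference pair below is illustrative and not taken from BioASQ:

```python
# Minimal BERTScore F1 computation with the bert-score package; the candidate and
# reference strings are illustrative, not taken from BioASQ.
from bert_score import score

candidates = ["Metformin is a first-line treatment for type 2 diabetes."]
references = ["Metformin is the recommended first-line therapy for type 2 diabetes."]

# score() returns per-pair precision, recall, and F1 tensors
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.4f}")
```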
3. Model Architecture & Pipeline
- Dense retriever: GTE-Qwen-2 Instruct (1.5B)
- Reader model: LLaMA3.2-1B or distilled Flan-T5
- Vector store: FAISS with HNSW indexing
- Chunking strategy: 512-token overlapping spans
- Top-k retrieved passages (k=3) used as reader context (see the retrieval sketch after this list)
- End-to-end inference optimized with MLX and MPS
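A minimal sketch of this retrieve-then-read flow, assuming `sentence-transformers` for the GTE-Qwen2 embedder and FAISS for HNSW search. The model id, chunk sizes, and prompt template are illustrative assumptions, not the project's exact configuration:

```python
# Minimal retrieve-then-read sketch. The embedder id, chunk sizes, and prompt
# template are assumptions for illustration, not the project's exact configuration.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

EMBEDDER_ID = "Alibaba-NLP/gte-Qwen2-1.5B-instruct"  # assumed Hugging Face id

def chunk_tokens(tokens, size=512, overlap=64):
    """Split a token-id list into overlapping spans of `size` tokens."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

embedder = SentenceTransformer(EMBEDDER_ID)

# `documents` stands in for the user's local notes.
documents = ["...medical note text...", "...another document..."]
chunks = []
for doc in documents:
    ids = embedder.tokenizer(doc, add_special_tokens=False)["input_ids"]
    chunks += [embedder.tokenizer.decode(span) for span in chunk_tokens(ids)]

# Embed chunks and build an HNSW index (vectors are normalized, so L2 ranks like cosine).
emb = embedder.encode(chunks, normalize_embeddings=True).astype(np.float32)
index = faiss.IndexHNSWFlat(emb.shape[1], 32)  # 32 = HNSW neighbours per node
index.add(emb)

def retrieve(query, k=3):
    q = embedder.encode([query], normalize_embeddings=True).astype(np.float32)
    _, idx = index.search(q, k)
    return [chunks[i] for i in idx[0]]

# Top-3 passages become the reader's context.
question = "What is the first-line treatment for type 2 diabetes?"
context = "\n\n".join(retrieve(question))
prompt = (
    "Answer the question using only the context.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)
```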
4. Efficiency Optimizations
- Quantization (Q3_K_S, Q5_K_M via llama.cpp)
- Structured and unstructured pruning of FFN weights & attention heads (see the pruning sketch after this list)
- Model distillation from LLaMA3 to Flan-T5 for latency and energy gains (see the distillation sketch after this list)
- Evaluated MoE-style readers for tradeoff exploration
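A minimal sketch of the unstructured L1 pruning step using `torch.nn.utils.prune`, assuming a Hugging Face LLaMA-style reader; the checkpoint id, 30% sparsity, and choice of FFN linear layers are illustrative assumptions:

```python
# Unstructured L1 magnitude pruning of FFN (MLP) projections with torch.nn.utils.prune.
# The checkpoint id, 30% sparsity, and module filter are illustrative assumptions.
import torch.nn as nn
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

for name, module in model.named_modules():
    # Target FFN linear layers; attention heads would instead be removed with a
    # structured, per-head L1 criterion.
    if isinstance(module, nn.Linear) and "mlp" in name:
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the zeros into the weight tensor
```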
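And a sketch of the sequence-level distillation step, in which teacher-generated answers to synthetic questions supervise the Flan-T5 student; the sample pair, checkpoint id, and hyperparameters are assumptions for illustration:

```python
# Sequence-level distillation: teacher-generated answers to synthetic questions
# become supervision targets for the Flan-T5 student. The sample pair and
# hyperparameters below are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

student_id = "google/flan-t5-base"
tok = AutoTokenizer.from_pretrained(student_id)
student = AutoModelForSeq2SeqLM.from_pretrained(student_id)
optimizer = torch.optim.AdamW(student.parameters(), lr=3e-5)

# (prompt with retrieved context, answer generated by the teacher model)
synthetic_pairs = [
    ("Context: ...retrieved passage... Question: What does metformin treat?",
     "Type 2 diabetes."),
]

student.train()
for prompt, teacher_answer in synthetic_pairs:
    inputs = tok(prompt, return_tensors="pt", truncation=True, max_length=512)
    labels = tok(teacher_answer, return_tensors="pt", truncation=True).input_ids
    loss = student(**inputs, labels=labels).loss  # standard seq2seq cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```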
5. Results Summary
- BERTScore F1: 0.5693 (full-precision), 0.5597 (distilled Flan-T5)
- Latency: Reduced from 8.5s to 1.45s (quantized models)
- Energy: ~2.4× more inferences under 10Wh using Flan-T5, tracked with CodeCarbon (see the sketch after this list)
- MoE models yielded higher accuracy but were memory inefficient on M2 hardware
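A rough sketch of how the energy figure can be derived from CodeCarbon tracking around a batch of end-to-end inferences; the query set, placeholder pipeline, and the attribute used to read total energy are assumptions (CodeCarbon also writes `energy_consumed` in kWh to its `emissions.csv` output):

```python
# Rough energy accounting with CodeCarbon around a batch of end-to-end inferences.
# The query set and answer() pipeline are placeholders.
from codecarbon import EmissionsTracker

N_QUERIES = 100
queries = ["What is the first-line treatment for type 2 diabetes?"] * N_QUERIES

def answer(query: str) -> str:
    return "..."  # placeholder for the real retrieve + generate pipeline

tracker = EmissionsTracker()
tracker.start()
for q in queries:
    answer(q)
tracker.stop()

# Total energy in kWh; attribute name assumed here, and the same figure is written
# to CodeCarbon's emissions.csv output.
energy_wh = tracker.final_emissions_data.energy_consumed * 1000
print(f"Inferences per 10 Wh: {N_QUERIES / energy_wh * 10:.1f}")
```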
Tools & Frameworks
| Area | Tools / Stack Used |
|---|---|
| Retriever | GTE-Qwen-2 Instruct, FAISS, HNSW |
| Reader Models | LLaMA3.2-1B-Instruct, Flan-T5-Base, Qwen1.5-MoE-A2.7B |
| Quantization & Pruning | llama.cpp, GGUF, L1-based head pruning, Optimum, torch.nn.utils.prune |
| Distillation | Synthetic QA pairs, LLaMA 7B (teacher) → Flan-T5 (student) |
| On-Device Inference | MLX, MPS (Apple Silicon), GGML |
| Evaluation | BioASQ, BERTScore, Latency profiler, CodeCarbon (for energy tracking) |