Hands-On: Build a Local Content Recommender on a Pi with the AI HAT+ 2
Step-by-step guide to deploy a privacy-first on-device recommender on Raspberry Pi 5 with AI HAT+ 2 for demos and prototypes.
If you are a developer or publisher tired of cloud costs, privacy audits, and the friction of demoing recommender features, this hands-on guide shows how to build a micro on-device recommender that runs on a Raspberry Pi 5 equipped with the AI HAT+ 2. You'll get a working prototype for demos or privacy-first products: fast, local, and repeatable.
Why this matters in 2026
Local AI grew from a niche experiment into a practical architecture across late 2024–2025, and in 2026 vendors and developers are shipping privacy-first experiences that keep user data on-device. The Raspberry Pi 5 plus the AI HAT+ 2, announced in late 2025 and matured through early 2026, make it realistic to run efficient embedding models and micro-recommendation pipelines directly on edge hardware without sending user profiles to the cloud. For publishers, that means demoable prototypes, privacy-safe personalization, and a lower barrier to experimentation.
What you'll build
This tutorial walks you through a minimal, production-minded pipeline that runs entirely on-device:
- Content ingestion and preprocessing (HTML, RSS, or Markdown)
- On-device embeddings using a compact transformer model optimized for the AI HAT+ 2
- A lightweight ANN (approximate nearest neighbor) index using Annoy
- A small web API (FastAPI) exposing personalized recommendations
- Privacy controls and an example UI for demos
Who this is for
This guide targets developers, content engineers, and product teams building publisher demos or privacy-first personalization for users who prefer local-only experiences. You should be comfortable with Linux command line, Python, and basic machine learning concepts. If you want examples of small, non-developer builds that succeeded at solving ops problems, check the micro apps case studies.
What you need
- Hardware: Raspberry Pi 5, AI HAT+ 2, USB-C power supply, fast microSD or NVMe SSD via USB 3.0 enclosure (recommended for performance)
- Software: Raspberry Pi OS 64-bit or Ubuntu 24.04 arm64 (choose one you maintain), Python 3.10+, Docker optional
- Libraries: ONNX Runtime (or the AI HAT SDK), sentence-transformers or an ONNX-exported embedding model, Annoy, FastAPI, Uvicorn
- Dataset: Your site’s articles or a small synthetic feed (we'll show how to prepare 500–5k items)
Design choices and recommendations
Before we jump into commands, here are decisions that shape this project:
- Use embeddings + ANN, not full LLM inference. Embedding-based retrieval is lightweight, explainable, and fast on-device. It fits demos and first-phase personalization.
- Choose a compact embedding model. Models like all-MiniLM or similar compact sentence-transformers give solid recall at small sizes. Export to ONNX and run with hardware acceleration provided by the AI HAT+ 2.
- Index with Annoy for simplicity. Annoy is lightweight, cross-platform, and easy to persist; it avoids heavy dependencies like FAISS on constrained devices.
- Keep the server minimal. A FastAPI app that returns top-k cosine similarity results is easy to expose via a local web UI for demos.
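To make the first design choice concrete, here is a minimal brute-force version of embedding retrieval: exact top-k cosine similarity over a toy catalog. This is the computation the ANN index approximates; the vectors and dimensions here are illustrative only.

```python
import numpy as np

def topk_cosine(query_vec, doc_matrix, k=5):
    """Exact top-k retrieval by cosine similarity (what an ANN index approximates)."""
    # Normalize rows so plain dot products equal cosine similarities
    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    scores = docs @ q
    top = np.argsort(-scores)[:k]
    return list(zip(top.tolist(), scores[top].tolist()))

# Toy catalog: three 4-d "embeddings"
docs = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0, 0.0]])
print(topk_cosine(np.array([1.0, 0.0, 0.0, 0.0]), docs, k=2))
```

Exact search like this is fine up to a few thousand items; Annoy earns its keep as the catalog grows.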
Step 1 — Prepare your Pi and AI HAT+ 2
Start with an updated OS, then install the vendor runtime for the AI HAT+ 2. If there is an official SDK or ONNX runtime build that leverages the HAT's NPU, install it now. The vendor docs published in late 2025 include optimized runtimes for Raspberry Pi 5 with the AI HAT+ 2; check their repo for the latest packages.
Quick setup commands
Run these commands on your Pi (use Ubuntu or Raspberry Pi OS as you prefer):
sudo apt update && sudo apt upgrade -y
sudo apt install -y python3-pip git build-essential
python3 -m pip install --upgrade pip
Install the AI HAT+ 2 runtime (example placeholder — follow vendor steps):
# vendor-supplied install script (replace with vendor instructions)
# sudo bash install-ai-hat-runtime.sh
Install ONNX Runtime with appropriate acceleration (example):
python3 -m pip install onnxruntime==1.15.1 onnx
If the vendor provides a tuned onnxruntime build for the HAT, use that package instead for better performance. For an architectural view on running workloads at the edge and integrating NPUs into product stacks, see edge‑first patterns for 2026.
Step 2 — Prep your content and metadata
Gather source content for the recommender. For publishers this is your recent articles; for demos you can use a curated set. Clean HTML to plaintext and store metadata: title, url, publish_date, author, tags.
Example Python script to extract and clean
# scripts/prepare_content.py
from bs4 import BeautifulSoup
import json

def html_to_text(html):
    soup = BeautifulSoup(html, 'html.parser')
    return soup.get_text(separator=' ', strip=True)

# iterate over files or RSS and produce content.jsonl
# each line: { 'id': ..., 'title': ..., 'url': ..., 'text': ... }
Keep content length modest for this micro recommender; truncate long articles to the first 512 tokens or extract the key paragraph using heuristics. If you want to automate richer metadata extraction workflows, consider resources on automating metadata extraction.
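A minimal truncation helper, using a word count as a rough stand-in for a 512-token limit (a real tokenizer counts subword tokens, so leave headroom):

```python
def truncate_words(text, max_words=400):
    """Keep roughly the first max_words whitespace-separated words of an article."""
    words = text.split()
    if len(words) <= max_words:
        return text
    return ' '.join(words[:max_words])

article = 'word ' * 1000
print(len(truncate_words(article, max_words=400).split()))  # 400
```

Run this during the prepare step so every line of content.jsonl is already bounded.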
Step 3 — Choose and prepare an embedding model
Use a compact model that can be exported to ONNX. In 2026, several community models and vendor-optimized small embedding models are available for edge devices. For example, an all-MiniLM-style model exported to ONNX usually fits and runs efficiently on Pi 5 with the AI HAT+ 2.
Export to ONNX and optimize
On a workstation or cloud build machine, export the PyTorch model to ONNX and apply quantization or dynamic axes to reduce size. Then copy the ONNX model to the Pi. Example steps (high-level):
- Load the sentence-transformers model.
- Trace and export to ONNX with dynamic axes for variable-length text.
- Run an ONNX optimizer and optionally quantize to int8.
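The export steps above can be sketched as a command recipe. This assumes the Hugging Face Optimum exporter and ONNX Runtime's quantization API; the model name, paths, and host are placeholders, and if the HAT vendor ships its own export tooling, prefer that instead:

```shell
# On a workstation, not the Pi. Assumes: pip install "optimum[exporters]" onnxruntime
optimum-cli export onnx --model sentence-transformers/all-MiniLM-L6-v2 onnx_out/

# Dynamic int8 quantization via onnxruntime's Python API
python3 -c "from onnxruntime.quantization import quantize_dynamic, QuantType; \
quantize_dynamic('onnx_out/model.onnx', 'onnx_out/model.int8.onnx', weight_type=QuantType.QInt8)"

# Copy the quantized model and tokenizer files to the Pi (placeholder host/path)
scp onnx_out/model.int8.onnx onnx_out/tokenizer.json pi@raspberrypi:~/recommender/
```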
Run embeddings on the Pi
On the Pi, compute embeddings with the ONNX runtime. A simple embedding script:
import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession('mini_embedding.onnx')

def compute_embedding(text_tokens):
    # tokenization routine depends on how you exported the model
    inputs = {...}  # token ids, attention mask
    out = sess.run(None, inputs)
    vec = out[0].mean(axis=1)  # or use pooled output
    return vec / np.linalg.norm(vec)
Note: vendor SDKs for the AI HAT+ 2 may provide a simpler infer API that automatically uses the NPU. Use that if available for better latency. If you need guidance on balancing edge compute, network sync, and intermittent updates, see hybrid edge workflows.
Step 4 — Build the ANN index with Annoy
Annoy is ideal for small to medium catalogs and works well on ARM devices. Index your embeddings and persist the index to disk so your demo loads instantly.
python3 -m pip install annoy
from annoy import AnnoyIndex
import json

dim = 384  # match your embedding dim
index = AnnoyIndex(dim, 'angular')
meta = {}
with open('content.jsonl') as f:
    for i, line in enumerate(f):
        item = json.loads(line)
        emb = compute_embedding(item['text'])  # from previous step
        index.add_item(i, emb.tolist())
        meta[i] = { 'id': item['id'], 'title': item['title'], 'url': item['url'] }
index.build(10)
index.save('content.ann')
with open('meta.json', 'w') as out:
    json.dump(meta, out)
Step 5 — Serve recommendations with FastAPI
Expose a tiny API to fetch recommendations. The server loads the Annoy index and returns the top-k nearest items for a query or a user profile vector.
python3 -m pip install fastapi uvicorn
# app.py
from fastapi import FastAPI
from annoy import AnnoyIndex
import numpy as np
import json

dim = 384  # must match the dimension used to build the index
app = FastAPI()
index = AnnoyIndex(dim, 'angular')
index.load('content.ann')
with open('meta.json') as f:
    meta = json.load(f)

@app.get('/recommend')
def recommend(q: str, k: int = 5):
    q_emb = compute_embedding(q)  # embedding helper from Step 3
    ids = index.get_nns_by_vector(q_emb.tolist(), k, include_distances=False)
    return [meta[str(i)] for i in ids]
Run the server:
uvicorn app:app --host 0.0.0.0 --port 8080 --workers 1
Step 6 — Add local personalization
Make recommendations contextual and private by computing a user profile vector on-device. For example, compute an exponential moving average of embeddings from articles the user reads or explicitly likes. Use that vector to blend with the query embedding.
# simple profile update
alpha = 0.2
profile = np.zeros(dim)

def update_profile(read_text):
    global profile
    emb = compute_embedding(read_text)
    profile = (1 - alpha) * profile + alpha * emb

def recommend_for_user(query, k=5):
    q_emb = compute_embedding(query)
    combined = 0.6 * q_emb + 0.4 * profile
    combined /= np.linalg.norm(combined)
    ids = index.get_nns_by_vector(combined.tolist(), k)
    return [meta[str(i)] for i in ids]
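As a sanity check on the blending constants, this toy run shows how the EMA profile converges toward a repeatedly read topic; the 3-d "embedding" is illustrative only.

```python
import numpy as np

alpha = 0.2
profile = np.zeros(3)
reading = np.array([1.0, 0.0, 0.0])  # stand-in embedding for one topic

# After n identical reads the first component equals 1 - (1 - alpha)**n
for _ in range(20):
    profile = (1 - alpha) * profile + alpha * reading
print(round(float(profile[0]), 3))  # 0.988
```

A larger alpha adapts faster but makes recommendations twitchy; 0.1–0.3 is a reasonable range to A/B in demos.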
Performance and optimization tips for Pi + AI HAT+ 2
- Quantize the ONNX model to reduce memory and accelerate inference; int8 quantization often yields large speedups with minimal accuracy loss. See hybrid edge workflows for patterns on quantization and deployment.
- Batch embeddings when building the index to maximize throughput.
- Use SSD storage if your catalog exceeds a few thousand items; microSD can be slow for frequent writes — for a discussion of storage tradeoffs and emerging flash economics, read A CTO’s guide to storage costs.
- Leverage the vendor runtime on AI HAT+ 2 for NPU acceleration; that materially lowers latency compared with raw CPU ONNX runtime.
- Keep index size in check. For catalogs up to 10k items, Annoy with 384-d embeddings is trivial on Pi; for larger corpora, consider pruning or sharding the index.
Monitoring and validation
Because this runs on-device, monitoring is lightweight but essential for demos. Log recommendation latencies, top-k stability (how often results change), and simple click-throughs saved locally. Use these signals to tune alpha blending and to test UX assumptions without sending user data off-device. If you need guidance on low-latency edge metrics and location-aware playback, check notes on low-latency location audio for related measurement techniques.
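A minimal local metrics logger might look like this; the file name and field names are illustrative, and every record stays on the device:

```python
import json
import time

def log_event(path, event, **fields):
    """Append a timestamped metric record to a local JSONL file."""
    record = {'ts': time.time(), 'event': event, **fields}
    with open(path, 'a') as f:
        f.write(json.dumps(record) + '\n')

# Example signals: recommendation latency and a local click-through
log_event('metrics.jsonl', 'recommend', latency_ms=42.0, k=5)
log_event('metrics.jsonl', 'click', item_id='a1', rank=2)
```

Periodically summarize the JSONL file on-device rather than shipping raw events anywhere.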
Privacy, compliance, and UX best practices
- Explicit consent and explainability: For privacy-first products, show users what data is used for personalization and give a toggle to opt-out or clear local profile state. On-device data patterns are covered in the on-device AI playbook.
- Local persistence: Store profile vectors encrypted at rest on the device if required by policy.
- Data export: Offer an export option so users can move their profile off-device if they want cloud sync later.
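The opt-out and export controls above can be sketched as a few small helpers; profile.json is a hypothetical storage location, and a policy-bound deployment would encrypt this file at rest rather than storing it in plaintext:

```python
import json
from pathlib import Path
import numpy as np

PROFILE_PATH = Path('profile.json')  # hypothetical on-device location

def save_profile(vec):
    PROFILE_PATH.write_text(json.dumps(np.asarray(vec).tolist()))

def export_profile():
    # "Take my data with me": return the raw vector, or None if no profile exists
    return json.loads(PROFILE_PATH.read_text()) if PROFILE_PATH.exists() else None

def clear_profile():
    # "Forget me": delete all local personalization state
    if PROFILE_PATH.exists():
        PROFILE_PATH.unlink()
```

Wire clear_profile to a visible toggle in the demo UI so the privacy story is something users can act on, not just read about.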
Testing the prototype
Run microbenchmarks and small A/B tests locally:
- Measure cold start time for the index and model load.
- Measure average query latency for embeddings and ANN retrieval.
- Collect qualitative feedback in demos: relevance of top-5 results and perceived privacy reassurance.
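A simple harness for the latency measurements, using time.perf_counter; swap the placeholder lambda for your real embedding or ANN call:

```python
import statistics
import time

def benchmark(fn, runs=20):
    """Time fn over several runs; return (median_ms, p95_ms)."""
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        times.append((time.perf_counter() - t0) * 1000)
    times.sort()
    return statistics.median(times), times[int(0.95 * (len(times) - 1))]

# Placeholder workload; replace with lambda: compute_embedding(sample_text)
median_ms, p95_ms = benchmark(lambda: sum(range(10000)))
print(f'median={median_ms:.3f}ms p95={p95_ms:.3f}ms')
```

Report p95 rather than the mean for demos: a single thermal-throttle spike can dominate an average on a passively cooled Pi.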
Advanced extensions
Once the base system works, here are scalable next steps:
- Session-based re-ranking: Use a small re-ranker model (distilled transformer) to rerank top-50 ANN results locally for higher precision.
- Hybrid retrieval: Combine content-based embeddings with local collaborative signals (anonymized usage vectors) for better personalization without cloud leaks.
- Federated updates: Push occasional model updates or new content indexes from a central server as signed artifacts, preserving on-device privacy. For architecture patterns that combine offline devices and occasional signed updates, see edge-first patterns.
- Embeddings on-device with tiny LLMs: If you need generative explanations alongside recommendations, run a small LLM like a distilled decoder-only model on the AI HAT+ 2 for on-device generation, but be mindful of compute and thermal budgets.
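For the session-based re-ranking extension, the skeleton below re-scores ANN candidates exactly; it substitutes a plain cosine score where a distilled re-ranker model would plug in, and the tiny 2-d embeddings are illustrative:

```python
import numpy as np

def rerank(query_vec, candidate_ids, embeddings, k=5):
    """Re-score ANN candidates with exact cosine similarity; keep the best k.
    A learned re-ranker would replace the dot product with a model score."""
    q = query_vec / np.linalg.norm(query_vec)
    scored = []
    for cid in candidate_ids:
        v = embeddings[cid]
        scored.append((cid, float(q @ (v / np.linalg.norm(v)))))
    scored.sort(key=lambda pair: -pair[1])
    return [cid for cid, _ in scored[:k]]

emb = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0]), 2: np.array([0.7, 0.7])}
print(rerank(np.array([1.0, 0.0]), [0, 1, 2], emb, k=2))  # [0, 2]
```

Fetch the top 50 from Annoy, then re-rank down to the top 5: the two-stage shape stays the same when you drop in a real model.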
Common pitfalls and how to avoid them
- Overloading the Pi: Don’t run heavy training or large models; keep inference compact.
- Slow storage: Use an SSD for catalogs that grow beyond a few thousand items. See storage cost considerations in A CTO’s guide to storage costs.
- Tokenization mismatches: Ensure tokenization at export and runtime match exactly; off-by-one token shapes break ONNX inference.
- Ignoring UX: Even a perfect local model fails if users don’t understand why suggestions appear. Provide clear labels like 'Recommended for you — private on this device'.
Tip: In late 2025 and early 2026, many browser and mobile vendors shipped local-AI features that validated strong user demand for privacy-first personalization. Consumers increasingly prefer on-device models for sensitive content — publishers can use that to differentiate demo experiences and premium offerings.
Example costs and sizing (practical guide)
For a small demo catalog (1k articles):
- Embedding model: 50–200 MB ONNX (quantized)
- Annoy index: ~1k * 384 * 4 bytes ≈ 1.5 MB plus overhead
- Memory: 1–2 GB comfortable if using quantized models and streaming indexing
- Latency: optimized ONNX + HAT NPU often yields sub-second per-embedding inference; ANN lookups are milliseconds
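The index arithmetic above, as a tiny calculator; this counts raw float32 vector storage only, and Annoy's tree structures add overhead on top:

```python
def annoy_vector_bytes(n_items, dim, dtype_bytes=4):
    """Raw vector storage: n_items * dim * bytes per float32 component."""
    return n_items * dim * dtype_bytes

size = annoy_vector_bytes(1000, 384)
print(size)  # 1536000 bytes, roughly 1.5 MB before tree overhead
```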
Case study: Publisher demo in 2 hours
Real-world playbook used by a mid-size publisher in late 2025:
- Exported 1,200 article summaries and meta to JSONL (30 minutes)
- Exported a compact embedding model to ONNX and quantized it on a build VM (45 minutes)
- Deployed the ONNX model to Pi + AI HAT+ 2, built the Annoy index, and stood up a FastAPI server (30 minutes)
- Built a tiny demo UI that called /recommend and showed local-only badges (15 minutes)
Result: a privacy-first demo that executives could hand to partners without cloud access.
Wrapping up: When to use on-device recommenders
On-device recommenders are ideal for:
- Privacy-focused products that must keep user data local
- Trade-show demos or offline scenarios
- Prototyping personalization concepts before investing in cloud infra
They are less appropriate when you need heavy cross-user collaborative signals at scale or real-time global analytics; in those cases a hybrid approach is often best. For more on hybrid patterns that mix local inference and occasional cloud coordination, see hybrid edge workflows.
Actionable checklist
- Set up Raspberry Pi 5 + AI HAT+ 2 and install vendor runtime
- Collect and clean 500–2,000 article summaries
- Export and quantize a compact embedding model to ONNX
- Compute embeddings and build Annoy index on-device
- Deploy a FastAPI server and simple demo UI
- Add local profiling and clear privacy controls
Further reading and tools (2026)
- Vendor SDK docs for AI HAT+ 2 and on-device AI playbook
- Automating metadata extraction with modern tools
- Edge‑first patterns for 2026 cloud architectures
- Hybrid edge workflows for productivity tools
Final takeaways
Running a micro recommender on a Raspberry Pi 5 with the AI HAT+ 2 is practical and valuable in 2026. The approach balances relevance, latency, and privacy. For publishers and developers, it's a low-cost way to prototype personalization, deliver demos without cloud dependencies, and build privacy-first product features.
Ready to build? Clone a starter repo, export a compact embedding, and follow the checklist above. Start with a 500-item catalog and iterate — you can have a demo running in a few hours. If you want, adapt the pipeline later for hybrid cloud sync or federated updates.
Call to action
Try this on a Pi today: build the micro recommender, test it in a live demo, and share your lessons learned with the community. If you want a starter repo or an optimized ONNX export script tuned for AI HAT+ 2, subscribe to our tutorial series or request the sample code — we'll send a compact starter kit and checklist to help you ship fast.