Hands-On: Build a Local Content Recommender on a Pi with the AI HAT+ 2

2026-02-13
12 min read

Step-by-step guide to deploy a privacy-first on-device recommender on Raspberry Pi 5 with AI HAT+ 2 for demos and prototypes.

If you're a developer or publisher tired of cloud costs, privacy audits, and the friction of demoing recommender features, this hands-on guide shows how to build a micro on-device recommender that runs on a Raspberry Pi 5 equipped with the AI HAT+ 2. You'll get a working prototype for demos or privacy-first products — fast, local, and repeatable.

Why this matters in 2026

Local AI grew from a niche experiment into a practical architecture over late 2024–2025, and in 2026 vendors and developers are shipping privacy-first experiences that keep user data on-device. The Raspberry Pi 5 plus the AI HAT+ 2 — announced in late 2025 and matured through early 2026 — make it realistic to run efficient embedding models and micro-recommendation pipelines directly on edge hardware without sending user profiles to the cloud. For publishers, that means demoable prototypes, privacy-safe personalization, and a lower barrier to experimentation.

What you'll build

This tutorial walks you through a minimal, production-minded pipeline that runs entirely on-device:

  • Content ingestion and preprocessing (HTML, RSS, or Markdown)
  • On-device embeddings using a compact transformer model optimized for the AI HAT+ 2
  • A lightweight ANN (approximate nearest neighbor) index using Annoy
  • A small web API (FastAPI) exposing personalized recommendations
  • Privacy controls and an example UI for demos

Who this is for

This guide targets developers, content engineers, and product teams building publisher demos or privacy-first personalization for users who prefer local-only experiences. You should be comfortable with Linux command line, Python, and basic machine learning concepts. If you want examples of small, non-developer builds that succeeded at solving ops problems, check the micro apps case studies.

What you need

  • Hardware: Raspberry Pi 5, AI HAT+ 2, USB-C power supply, fast microSD or NVMe SSD via USB 3.0 enclosure (recommended for performance)
  • Software: Raspberry Pi OS 64-bit or Ubuntu 24.04 arm64 (choose one you maintain), Python 3.10+, Docker optional
  • Libraries: ONNX Runtime (or the AI HAT SDK), sentence-transformers or an ONNX-exported embedding model, Annoy, FastAPI, Uvicorn
  • Dataset: Your site’s articles or a small synthetic feed (we'll show how to prepare 500–5k items)

Design choices and recommendations

Before we jump into commands, here are decisions that shape this project:

  1. Use embeddings + ANN, not full LLM inference. Embedding-based retrieval is lightweight, explainable, and fast on-device. It fits demos and first-phase personalization.
  2. Choose a compact embedding model. Models like all-MiniLM or similar compact sentence-transformers give solid recall at small sizes. Export to ONNX and run with hardware acceleration provided by the AI HAT+ 2.
  3. Index with Annoy for simplicity. Annoy is lightweight, cross-platform, and easy to persist; it avoids heavy dependencies like FAISS on constrained devices.
  4. Keep the server minimal. A FastAPI app that returns top-k cosine similarity results is easy to expose via a local web UI for demos.
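
A note on the metric choice in point 3: on unit-normalized vectors, Annoy's 'angular' distance is a monotone transform of cosine similarity (sqrt(2·(1 − cos))), so ranking by angular distance is equivalent to ranking by cosine similarity. A quick pure-Python sketch to convince yourself (no Annoy required):

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def angular(a, b):
    # Annoy's 'angular' distance: sqrt(2 * (1 - cosine)).
    return math.sqrt(2 * (1 - cosine(a, b)))

a = [1.0, 0.0]
b = [1.0, 1.0]
print(round(cosine(a, b), 4))   # 0.7071
print(round(angular(a, b), 4))  # 0.7654
```

Because the transform is monotone (decreasing in cosine), the top-k neighbors by angular distance are exactly the top-k by cosine similarity.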

Step 1 — Prepare your Pi and AI HAT+ 2

Start with an updated OS, then install the vendor runtime for the AI HAT+ 2. If there is an official SDK or ONNX runtime build that leverages the HAT's NPU, install it now. The vendor docs published in late 2025 include optimized runtimes for Raspberry Pi 5 with the AI HAT+ 2; check their repo for the latest packages.

Quick setup commands

Run these commands on your Pi (use Ubuntu or Raspberry Pi OS as you prefer):

sudo apt update && sudo apt upgrade -y
sudo apt install -y python3-pip git build-essential
python3 -m pip install --upgrade pip

Install the AI HAT+ 2 runtime (example placeholder — follow vendor steps):

# vendor-supplied install script (replace with vendor instructions)
# sudo bash install-ai-hat-runtime.sh

Install ONNX Runtime with appropriate acceleration (example):

python3 -m pip install onnxruntime==1.15.1 onnx

If the vendor provides a tuned onnxruntime build for the HAT, use that package instead for better performance. For an architectural view on running workloads at the edge and integrating NPUs into product stacks, see edge‑first patterns for 2026.

Step 2 — Prep your content and metadata

Gather source content for the recommender. For publishers this is your recent articles; for demos you can use a curated set. Clean HTML to plaintext and store metadata: title, url, publish_date, author, tags.

Example Python script to extract and clean

# scripts/prepare_content.py
from pathlib import Path
from bs4 import BeautifulSoup
import json

def html_to_text(html):
    s = BeautifulSoup(html, 'html.parser')
    return s.get_text(separator=' ', strip=True)

# Iterate over local HTML files (or an RSS dump) and write content.jsonl,
# one JSON object per line: {"id": ..., "title": ..., "url": ..., "text": ...}
if __name__ == '__main__':
    with open('content.jsonl', 'w') as out:
        for i, path in enumerate(sorted(Path('articles').glob('*.html'))):
            item = {
                'id': str(i),
                'title': path.stem,  # pull real titles/urls from your CMS or feed metadata
                'url': '',
                'text': html_to_text(path.read_text(encoding='utf-8')),
            }
            out.write(json.dumps(item) + '\n')

Keep content length modest for this micro recommender; truncate long articles to the first 512 tokens or extract the key paragraph using heuristics. If you want to automate richer metadata extraction workflows, consider resources on automating metadata extraction.
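
A whitespace split is a rough but serviceable proxy for the truncation heuristic above; real subword tokenizers produce more tokens than words, so leave headroom below your model's limit. A minimal sketch:

```python
def truncate_text(text, max_tokens=512):
    # Whitespace tokens are a rough proxy for model tokens; subword
    # tokenizers emit more tokens per word, so budget conservatively.
    words = text.split()
    return ' '.join(words[:max_tokens])

sample = 'one two three four five'
print(truncate_text(sample, max_tokens=3))  # one two three
```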

Step 3 — Choose and prepare an embedding model

Use a compact model that can be exported to ONNX. In 2026, several community models and vendor-optimized small embedding models are available for edge devices. For example, an all-MiniLM-style model exported to ONNX usually fits and runs efficiently on Pi 5 with the AI HAT+ 2.

Export to ONNX and optimize

On a workstation or cloud build machine, export the PyTorch model to ONNX and apply quantization or dynamic axes to reduce size. Then copy the ONNX model to the Pi. Example steps (high-level):

  1. Load the sentence-transformers model.
  2. Trace and export to ONNX with dynamic axes for variable-length text.
  3. Run an ONNX optimizer and optionally quantize to int8.

Run embeddings on the Pi

On the Pi, compute embeddings with the ONNX runtime. A simple embedding script:

import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession('mini_embedding.onnx')

def compute_embedding(text):
    # tokenize() is your tokenizer wrapper (not shown); it must match the
    # tokenization used at export time. Typical ONNX inputs are int64
    # 'input_ids' and 'attention_mask' arrays.
    inputs = tokenize(text)
    out = sess.run(None, inputs)
    vec = out[0].mean(axis=1)[0]  # mean-pool token embeddings (or use the pooled output)
    return vec / np.linalg.norm(vec)

Note: vendor SDKs for the AI HAT+ 2 may provide a simpler infer API that automatically uses the NPU. Use that if available for better latency. If you need guidance on balancing edge compute, network sync, and intermittent updates, see hybrid edge workflows.
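
One subtlety with mean pooling: padding tokens should not contribute to the average. A small numpy sketch of masked mean pooling over a (batch, seq, dim) token-embedding array, as you would apply it to the ONNX model's output (shapes and values are illustrative):

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    # token_embeddings: (batch, seq, dim); attention_mask: (batch, seq) of 0/1.
    mask = attention_mask[:, :, None].astype(np.float32)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid divide-by-zero
    vec = summed / counts
    return vec / np.linalg.norm(vec, axis=1, keepdims=True)

emb = np.array([[[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]]])  # last token is padding
mask = np.array([[1, 1, 0]])
pooled = masked_mean_pool(emb, mask)
print(pooled)  # unit-normalized mean of the first two tokens only
```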

Step 4 — Build the ANN index with Annoy

Annoy is ideal for small to medium catalogs and works well on ARM devices. Index your embeddings and persist the index to disk so your demo loads instantly.

python3 -m pip install annoy

from annoy import AnnoyIndex
import json

dim = 384  # match your embedding dim
index = AnnoyIndex(dim, 'angular')
meta = {}

with open('content.jsonl') as f:
    for i, line in enumerate(f):
        item = json.loads(line)
        emb = compute_embedding(item['text'])  # from previous step
        index.add_item(i, emb.tolist())
        meta[i] = { 'id': item['id'], 'title': item['title'], 'url': item['url'] }

index.build(10)  # 10 trees; more trees improve recall at the cost of build time and size
index.save('content.ann')
with open('meta.json', 'w') as out:
    json.dump(meta, out)

Step 5 — Serve recommendations with FastAPI

Expose a tiny API to fetch recommendations. The server loads the Annoy index and returns the top-k nearest items for a query or a user profile vector.

python3 -m pip install fastapi uvicorn

# app.py
from fastapi import FastAPI
from annoy import AnnoyIndex
import json

from embeddings import compute_embedding  # your embedding helper from Step 3

app = FastAPI()

dim = 384  # must match the embedding dimension used to build the index
index = AnnoyIndex(dim, 'angular')
index.load('content.ann')
with open('meta.json') as f:
    meta = json.load(f)  # JSON object keys are strings, hence meta[str(i)] below

@app.get('/recommend')
def recommend(q: str, k: int = 5):
    q_emb = compute_embedding(q)
    ids = index.get_nns_by_vector(q_emb.tolist(), k, include_distances=False)
    return [meta[str(i)] for i in ids]

Run the server:

uvicorn app:app --host 0.0.0.0 --port 8080 --workers 1

Step 6 — Add local personalization

Make recommendations contextual and private by computing a user profile vector on-device. For example, compute an exponential moving average of embeddings from articles the user reads or explicitly likes. Use that vector to blend with the query embedding.

# simple profile update
alpha = 0.2
profile = np.zeros(dim)

def update_profile(read_text):
    emb = compute_embedding(read_text)
    global profile
    profile = (1 - alpha) * profile + alpha * emb

def recommend_for_user(query, k=5):
    q_emb = compute_embedding(query)
    combined = 0.6 * q_emb + 0.4 * profile
    combined /= np.linalg.norm(combined)
    ids = index.get_nns_by_vector(combined.tolist(), k)
    return [meta[str(i)] for i in ids]

Performance and optimization tips for Pi + AI HAT+ 2

  • Quantize the ONNX model to reduce memory and accelerate inference; int8 quantization often yields large speedups with minimal accuracy loss. See hybrid edge workflows for patterns on quantization and deployment.
  • Batch embeddings when building the index to maximize throughput.
  • Use SSD storage if your catalog exceeds a few thousand items; microSD can be slow for frequent writes — for a discussion of storage tradeoffs and emerging flash economics, read A CTO’s guide to storage costs.
  • Leverage the vendor runtime on AI HAT+ 2 for NPU acceleration; that materially lowers latency compared with raw CPU ONNX runtime.
  • Keep index size in check. For catalogs up to 10k items, Annoy with 384-d embeddings is trivial on Pi; for larger corpora, consider pruning or sharding the index.
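
Batching during index builds (second tip above) is just chunking the corpus and calling the model once per chunk; a small helper, assuming your embedding function accepts a list of texts:

```python
def batched(items, batch_size):
    # Yield successive fixed-size chunks; the last chunk may be smaller.
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

texts = [f'article {i}' for i in range(10)]
print([len(b) for b in batched(texts, 4)])  # [4, 4, 2]
```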

Monitoring and validation

Because this runs on-device, monitoring is lightweight but essential for demos. Log recommendation latencies, top-k stability (how often results change), and simple click-throughs saved locally. Use these signals to tune alpha blending and to test UX assumptions without sending user data off-device. If you need guidance on low-latency edge metrics and location-aware playback, check notes on low-latency location audio for related measurement techniques.
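
One simple way to quantify top-k stability is the Jaccard overlap between successive result sets; a sketch:

```python
def topk_stability(prev_ids, curr_ids):
    # Jaccard overlap of two top-k result lists: 1.0 means identical sets.
    a, b = set(prev_ids), set(curr_ids)
    return len(a & b) / len(a | b) if (a or b) else 1.0

print(round(topk_stability([1, 2, 3, 4, 5], [1, 2, 3, 6, 7]), 3))  # 0.429
```

Log this value per query; a sudden drop after a model or index update is a useful regression signal that never has to leave the device.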

Privacy, compliance, and UX best practices

  • Explicit consent and explainability: For privacy-first products, show users what data is used for personalization and give a toggle to opt-out or clear local profile state. On-device data patterns are covered in the on-device AI playbook.
  • Local persistence: Store profile vectors encrypted at rest on the device if required by policy.
  • Data export: Offer an export option so users can move their profile off-device if they want cloud sync later.
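
A minimal sketch of local profile persistence with an explicit clear operation (plain JSON here; add encryption at rest if policy requires it; the file path is illustrative):

```python
import json
import os

PROFILE_PATH = 'profile.json'  # illustrative location

def save_profile(vec):
    with open(PROFILE_PATH, 'w') as f:
        json.dump({'profile': list(vec)}, f)

def load_profile(dim):
    if not os.path.exists(PROFILE_PATH):
        return [0.0] * dim  # fresh, empty profile
    with open(PROFILE_PATH) as f:
        return json.load(f)['profile']

def clear_profile():
    # The opt-out / 'forget me' path: delete local state entirely.
    if os.path.exists(PROFILE_PATH):
        os.remove(PROFILE_PATH)

save_profile([0.1, 0.2])
print(load_profile(2))  # [0.1, 0.2]
clear_profile()
print(load_profile(2))  # [0.0, 0.0]
```

The same save/load pair doubles as the data-export path: the JSON file is the user's portable profile.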

Testing the prototype

Run microbenchmarks and small A/B tests locally:

  1. Measure cold start time for the index and model load.
  2. Measure average query latency for embeddings and ANN retrieval.
  3. Collect qualitative feedback in demos: relevance of top-5 results and perceived privacy reassurance.
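
The latency measurements in steps 1–2 can be sketched with time.perf_counter; swap the dummy function below for your real embedding or ANN-retrieval call:

```python
import time
import statistics

def benchmark(fn, runs=50):
    # Return (median, approx p95) latency in milliseconds over `runs` calls.
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000)
    samples.sort()
    return statistics.median(samples), samples[int(0.95 * len(samples)) - 1]

def dummy_query():
    sum(i * i for i in range(10_000))  # stand-in for embed + ANN lookup

median_ms, p95_ms = benchmark(dummy_query)
print(f'median={median_ms:.2f}ms p95={p95_ms:.2f}ms')
```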

Advanced extensions

Once the base system works, here are scalable next steps:

  • Session-based re-ranking: Use a small re-ranker model (distilled transformer) to rerank top-50 ANN results locally for higher precision.
  • Hybrid retrieval: Combine content-based embeddings with local collaborative signals (anonymized usage vectors) for better personalization without cloud leaks.
  • Federated updates: Push occasional model updates or new content indexes from a central server as signed artifacts, preserving on-device privacy. For architecture patterns that combine offline devices and occasional signed updates, see edge-first patterns.
  • Embeddings on-device with tiny LLMs: If you need generative explanations alongside recommendations, run a small LLM like a distilled decoder-only model on the AI HAT+ 2 for on-device generation, but be mindful of compute and thermal budgets.
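
Session-based re-ranking (first bullet) can start far simpler than a distilled transformer: blend the ANN similarity with a secondary signal before sorting. A toy sketch with a recency boost (field names and weights are illustrative):

```python
def rerank(candidates, w_sim=0.8, w_recency=0.2):
    # candidates: dicts with 'id', 'similarity' (0-1), and 'recency' (0-1).
    def score(c):
        return w_sim * c['similarity'] + w_recency * c['recency']
    return sorted(candidates, key=score, reverse=True)

cands = [
    {'id': 'a', 'similarity': 0.90, 'recency': 0.1},
    {'id': 'b', 'similarity': 0.85, 'recency': 0.9},
]
print([c['id'] for c in rerank(cands)])  # ['b', 'a']
```

Once this linear blend plateaus, the same interface (take top-50 candidates, return a reordered list) is where a small local re-ranker model would slot in.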

Common pitfalls and how to avoid them

  • Overloading the Pi: Don’t run heavy training or large models; keep inference compact.
  • Slow storage: Use an SSD for catalogs that grow beyond a few thousand items. See storage cost considerations in A CTO’s guide to storage costs.
  • Tokenization mismatches: Ensure tokenization at export and runtime match exactly; off-by-one token shapes break ONNX inference.
  • Ignoring UX: Even a perfect local model fails if users don’t understand why suggestions appear. Provide clear labels like 'Recommended for you — private on this device'.

Tip: In late 2025 and early 2026, many browser and mobile vendors shipped local-AI features that validated strong user demand for privacy-first personalization. Consumers increasingly prefer on-device models for sensitive content — publishers can use that to differentiate demo experiences and premium offerings.

Example costs and sizing (practical guide)

For a small demo catalog (1k articles):

  • Embedding model: 50–200 MB ONNX (quantized)
  • Annoy index: ~1k * 384 * 4 bytes ≈ 1.5 MB plus overhead
  • Memory: 1–2 GB comfortable if using quantized models and streaming indexing
  • Latency: optimized ONNX + HAT NPU often yields sub-second per-embedding inference; ANN lookups are milliseconds
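
The index-size figure above follows from simple arithmetic (float32 vectors, ignoring Annoy's tree overhead):

```python
def annoy_vector_bytes(n_items, dim, bytes_per_float=4):
    # Raw vector storage only; Annoy's trees add overhead on top of this.
    return n_items * dim * bytes_per_float

size = annoy_vector_bytes(1_000, 384)
print(size, 'bytes ~=', round(size / 1_048_576, 2), 'MiB')  # 1536000 bytes ~= 1.46 MiB
```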

Case study: Publisher demo in 2 hours

Real-world playbook used by a mid-size publisher in late 2025:

  1. Exported 1,200 article summaries and meta to JSONL (30 minutes)
  2. Exported a compact embedding model to ONNX and quantized it on a build VM (45 minutes)
  3. Deployed the ONNX model to Pi + AI HAT+ 2, built the Annoy index, and stood up a FastAPI server (30 minutes)
  4. Built a tiny demo UI that called /recommend and showed local-only badges (15 minutes)

Result: a privacy-first demo that executives could hand to partners without cloud access.

Wrapping up: When to use on-device recommenders

On-device recommenders are ideal for:

  • Privacy-focused products that must keep user data local
  • Trade-show demos or offline scenarios
  • Prototyping personalization concepts before investing in cloud infra

They are less appropriate when you need heavy cross-user collaborative signals at scale or real-time global analytics; in those cases a hybrid approach is often best. For more on hybrid patterns that mix local inference and occasional cloud coordination, see hybrid edge workflows.

Actionable checklist

  1. Set up Raspberry Pi 5 + AI HAT+ 2 and install vendor runtime
  2. Collect and clean 500–2,000 article summaries
  3. Export and quantize a compact embedding model to ONNX
  4. Compute embeddings and build Annoy index on-device
  5. Deploy a FastAPI server and simple demo UI
  6. Add local profiling and clear privacy controls

Final takeaways

Running a micro recommender on a Raspberry Pi 5 with the AI HAT+ 2 is practical and valuable in 2026. The approach balances relevance, latency, and privacy. For publishers and developers, it's a low-cost way to prototype personalization, deliver demos without cloud dependencies, and build privacy-first product features.

Ready to build? Clone a starter repo, export a compact embedding, and follow the checklist above. Start with a 500-item catalog and iterate — you can have a demo running in a few hours. If you want, adapt the pipeline later for hybrid cloud sync or federated updates.

Call to action

Try this on a Pi today: build the micro recommender, test it in a live demo, and share your lessons learned with the community. If you want a starter repo or an optimized ONNX export script tuned for AI HAT+ 2, subscribe to our tutorial series or request the sample code — we'll send a compact starter kit and checklist to help you ship fast.
