# Edge AI Ad Attribution: On-Device Computer Vision for DOOH Audience Measurement
**TL;DR:** As privacy regulations and OS-level restrictions deprecate mobile device-ID tracking, Digital Out-of-Home (DOOH) must pivot to high-fidelity, privacy-preserving measurement. This technical deep-dive outlines a decentralized methodology using on-device computer vision. By running lightweight BlazeFace and YAMNet models at the edge, we compute deterministic real-time attention metrics without transmitting raw video or audio. Combined with cryptographic Ed25519 signature chains, this architecture delivers auditable proof-of-play and audience metrics, setting a new benchmark for cookieless, zero-trust physical-world attribution.
## The Physical Attribution Crisis: Privacy vs. Accuracy
The physical-world advertising ecosystem stands at a critical juncture. For over a decade, Digital Out-of-Home (DOOH) measurement has relied on fragile, indirect proxies to justify media spend. These proxies generally fall into two camps, both of which are rapidly collapsing under the weight of regulatory, technical, and economic pressures.
On one hand, probabilistic panel modeling offers broad, aggregate estimations of audience size. While highly compliant with emerging privacy laws, panel data lacks temporal and spatial granularity. It cannot verify whether a specific creative played to an empty corridor or a packed plaza, nor can it capture real-time audience dynamics. On the other hand, deterministic device-ID joins attempt to track users through space and time by capturing Mobile Advertising IDs (MAIDs) via bidstream GPS data or Wi-Fi sniffing. This approach is a technical and regulatory minefield. With the deprecation of third-party identifiers, MAC address randomization, and strict enforcement of CCPA, GDPR, and CPRA, relying on the transmission of personal location data to cloud servers is no longer a viable long-term strategy for measurement architects.
This tension has forced a paradigm shift toward edge-computed audience signals. Instead of attempting to track unique individuals across the physical world, the modern DOOH measurement stack must measure human attention directly at the exact point of display, in real time, while discarding raw sensory data instantly. By shifting the computation of audience metrics from centralized cloud servers to the edge device itself, we resolve the privacy-vs-accuracy paradox.
Our network telemetry indicates that this decentralized approach is already operational and scaling rapidly. As of our May 2026 architecture, these methods are deployed across 12,450 screens globally, which indicates that high-fidelity physical measurement does not require compromising user privacy. By processing optical and acoustic data locally, we generate anonymous, aggregated metrics that can be verified cryptographically without ever exposing personally identifiable information (PII).
## The May 2026 Edge Stack Architecture
To achieve real-time, low-latency audience measurement at the point of display, the edge hardware must execute complex computer vision and audio classification models within highly constrained compute budgets. The software deployed across our network is a modular suite of 8 edge SDK packages. These packages segment core responsibilities (camera frame acquisition, model inference, local state aggregation, and cryptographic signing) so that the hardware abstraction layer remains separate from the analytical pipelines.
```
+-------------------------------------------------------------------------+
|                          Physical Environment                           |
+-------------------------------------------------------------------------+
                                     |  (Optical & Acoustic Sensors)
                                     v
+-------------------------------------------------------------------------+
|                           Edge SDK Framework                            |
|  +-------------------------------------------------------------------+  |
|  |                        Ingestion Pipeline                         |  |
|  +-------------------------------------------------------------------+  |
|                                    |                                    |
|            +-----------------------+-----------------------+            |
|            | (640x480 @ 30 FPS)              (16kHz Audio) |            |
|            v                                               v            |
|  +-------------------+                           +-------------------+  |
|  |  BlazeFace ONNX   |                           |    YAMNet ONNX    |  |
|  |  (400 KB Model)   |                           | (Acoustic Context)|  |
|  +-------------------+                           +-------------------+  |
|            |                                               |            |
|            | [Bounding Boxes & Landmarks]                  |            |
|            v                                               |            |
|  +-------------------+                                     |            |
|  | Attention Engine  |                                     |            |
|  |(Yaw/Pitch/Eye-Wt) |                                     |            |
|  +-------------------+                                     |            |
|            |                                               |            |
|            | [3D Attention Vectors]            [Class IDs] |            |
|            +-----------------------+-----------------------+            |
|                                    |                                    |
|                                    v                                    |
|  +-------------------------------------------------------------------+  |
|  |             Verified Audience Signal (VAS) Aggregator             |  |
|  +-------------------------------------------------------------------+  |
|                                    |                                    |
|                                    v  [Unsigned VAS Payload]            |
|  +-------------------------------------------------------------------+  |
|  |               Cryptographic Signer (TPM / Enclave)                |  |
|  |         Generates 64-Byte Ed25519 Proof-of-Play Signature         |  |
|  +-------------------------------------------------------------------+  |
+-------------------------------------------------------------------------+
                                     |  (Secure Metadata Envelope)
                                     v
+-------------------------------------------------------------------------+
|                      Centralized Ingestion Gateway                      |
+-------------------------------------------------------------------------+
```
### BlazeFace ONNX: Sub-Half-Megabyte Face Detection
The foundation of our visual detection pipeline is a highly optimized build of the BlazeFace architecture, exported to ONNX and executed with [ONNX Runtime](https://onnxruntime.ai/). While traditional face-detection models like MTCNN or heavy ResNet-based single-shot detectors (SSDs) require significant GPU resources and carry footprints of tens or hundreds of megabytes, our edge implementation uses a face model of just 400 KB.
This sub-half-megabyte footprint is achieved through aggressive weight quantization (INT8) and a custom anchor configuration tailored for DOOH viewing distances (typically 2 to 10 meters). The model is optimized to run on low-power edge hardware, sustaining a face-detection rate of 30 FPS. Running at 30 FPS lets the system capture fast-moving pedestrian traffic without dropped frames, providing an accurate count of individuals entering the display's field of view.
By leveraging the [Sensing SDK](/support/developers/sensing-sdk/), developers can initialize this model directly within hardware-accelerated pipelines. The model processes input frames scaled to a resolution of 640x480 pixels. Because the model operates on raw pixel buffers in local volatile memory (SRAM/DRAM) and immediately overwrites them, no image data is ever written to persistent disk storage or transmitted over the network.
### YAMNet ONNX: Acoustic Contextual Intelligence
Visual telemetry alone can be misleading in complex physical environments. For example, a screen placed in a quiet boutique retail setting has a fundamentally different audience dynamic than one placed in a noisy transit hub, even if the face-detection counts are identical. To enrich the environmental context without capturing or processing human speech, the [Edge SDK](/support/developers/edge-sdk/) integrates a specialized implementation of YAMNet running via ONNX.
YAMNet is a MobileNet-based neural network that predicts 521 audio event classes from an input audio stream. At the edge, we capture ambient audio through a local microphone array, downsample it to a 16kHz mono stream, and feed log-mel spectrogram patches (computed from a short-time Fourier transform, STFT) into the model. The model outputs high-level environmental classifications, such as "Laughter," "Crowd Noise," "Engine Sounds," or "Music."
Crucially, the raw audio is never recorded or transcribed. The edge application only retains the high-level class IDs and their associated confidence scores. This contextual intelligence allows the system to filter out false-positive visual detections (e.g., reflections in a window or static poster faces) by correlating visual activity with acoustic environmental patterns.
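The retention step described above can be sketched as follows. This is a minimal illustration, not the SDK's implementation: the function name `top_classes`, the `k` and `min_conf` parameters, and the six-element score vector (standing in for YAMNet's 521 outputs) are all assumptions made for the example.

```python
from typing import List, Tuple


def top_classes(scores: List[float], k: int = 3,
                min_conf: float = 0.2) -> List[Tuple[int, float]]:
    """Keep at most k (class_id, confidence) pairs at or above min_conf.

    Only these pairs are retained; the audio samples and the full score
    vector are expected to be discarded by the caller.
    """
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return [(idx, round(conf, 3)) for idx, conf in ranked[:k] if conf >= min_conf]


# Hypothetical 6-class score vector standing in for the 521 YAMNet outputs.
scores = [0.02, 0.81, 0.05, 0.40, 0.10, 0.01]
retained = top_classes(scores)
print(retained)
```

Keeping only a few class IDs per window is what keeps the acoustic contribution to the payload small and free of speech content.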
### Attention Scoring: Three-Dimensional Pose Estimation
Simply counting faces within a camera's field of view is insufficient for modern attribution models. Advertisers demand proof of active engagement. To address this, our edge architecture computes a real-time attention score using a 3D head pose estimation framework.
Rather than executing a computationally expensive 68-point face landmark model, the system utilizes the 6 key facial landmarks (eyes, ears, nose, and mouth) output directly by our 400 KB BlazeFace model. These landmarks are mapped against a generic 3D facial model to solve the Perspective-n-Point (PnP) problem. The resulting projection estimates head orientation, which the attention engine scores across three dimensions: yaw, pitch, and an eye-weight composite.
- **Yaw (Horizontal Rotation):** Determines if the subject is looking left or right relative to the screen plane.
- **Pitch (Vertical Tilt):** Determines if the subject's head is tilted upward or downward.
- **Eye-Weight Composite:** An empirical calculation that adjusts the attention probability based on the alignment of the nose and eye landmarks. If the eyes and nose form a symmetrical triangle relative to the projection plane, the eye-weight composite approaches 1.0, indicating direct frontal gaze.
Using techniques similar to those documented in the [MediaPipe](https://mediapipe.dev/) framework, these 3 dimensions are combined in real-time to calculate a unified attention vector. If this vector falls within a predefined cone of attention (typically +/- 15 degrees of yaw and +/- 10 degrees of pitch relative to the screen's normal vector), the subject is flagged as actively viewing the display. Across our inventory pool, screens employing this multi-dimensional gaze tracking have achieved a 38% attention rate on average, providing highly granular feedback to creative designers regarding visual impact.
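A minimal sketch of the cone test, assuming yaw and pitch have already been recovered from the PnP step. The thresholds (15 degrees, 10 degrees, and a 0.6 eye-weight floor) come from this article; the `HeadPose` type and the specific symmetry formula in `eye_weight_composite` are illustrative assumptions, since the exact empirical calculation is not specified here.

```python
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float]


@dataclass
class HeadPose:
    yaw_deg: float
    pitch_deg: float
    eye_weight: float


def eye_weight_composite(left_eye: Point, right_eye: Point, nose: Point) -> float:
    """Assumed symmetry measure: 1.0 when the nose sits horizontally midway
    between the eyes, falling to 0.0 as it reaches either eye."""
    mid_x = (left_eye[0] + right_eye[0]) / 2.0
    half_span = abs(right_eye[0] - left_eye[0]) / 2.0
    if half_span == 0:
        return 0.0
    offset = abs(nose[0] - mid_x) / half_span
    return max(0.0, 1.0 - offset)


def is_within_attention_cone(pose: HeadPose, yaw_max: float = 15.0,
                             pitch_max: float = 10.0,
                             eye_weight_min: float = 0.6) -> bool:
    """True when all three dimensions fall inside the cone of attention."""
    return (abs(pose.yaw_deg) <= yaw_max
            and abs(pose.pitch_deg) <= pitch_max
            and pose.eye_weight >= eye_weight_min)


# Landmarks in pixel coordinates (x, y); values are made up for the example.
weight = eye_weight_composite((100, 120), (140, 120), (122, 140))
print(is_within_attention_cone(HeadPose(8.0, -4.0, weight)))
print(is_within_attention_cone(HeadPose(22.0, 0.0, weight)))
```

Treating the cone as a conjunction of simple threshold checks keeps the per-face cost negligible next to model inference.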
## Verified Audience Signal (VAS) Computation
Raw frame-by-frame detections must be aggregated into structured data before they can be utilized for billing or programmatic decisioning. This aggregation process produces a standardized payload known as the Verified Audience Signal (VAS).
The VAS computation operates on a temporal sliding window, typically aligned with the duration of the creative slot (e.g., 5, 10, or 15 seconds). The state machine of the local aggregator manages this process through three distinct phases:
1. **Temporal Association:** As faces are detected at 30 FPS, a lightweight, non-persistent Kalman filter tracking algorithm associates bounding boxes across successive frames. This ensures that a single individual standing in front of the screen for 10 seconds is recorded as a single unique observer with a dwell time of 10 seconds, rather than 300 separate face detections.
2. **Metric Synthesis:** For each tracked subject within the aggregation window, the system computes:
- **Dwell Time:** The total duration (in seconds) that the subject's face was tracked.
- **Attention Duration:** The cumulative time (in seconds) that the subject's head pose vector fell within the active cone of attention.
- **Proximity Index:** An estimation of distance derived from the bounding box area relative to the sensor's focal length.
3. **Environmental Normalization:** The physical context provided by the YAMNet audio classifier is integrated into the payload. If the acoustic classifier detects high ambient crowd noise, the system adjusts its internal confidence thresholds to account for potential visual occlusions.
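The proximity index from the metric synthesis phase can be approximated with a pinhole camera model. This is a sketch under stated assumptions: the 0.15 m average face width, the 2 to 10 meter normalization range (taken from the viewing distances mentioned earlier), and both function names are illustrative rather than the SDK's actual method.

```python
import math

# Assumed average adult face width in meters (illustrative constant).
ASSUMED_FACE_WIDTH_M = 0.15


def estimate_distance_m(bbox_w_px: float, bbox_h_px: float,
                        focal_length_px: float) -> float:
    """Pinhole estimate: distance = focal_length * real_size / apparent_size.

    The apparent size is taken as the square root of the bounding box area.
    """
    side_px = math.sqrt(bbox_w_px * bbox_h_px)
    if side_px <= 0:
        raise ValueError("bounding box must have positive area")
    return focal_length_px * ASSUMED_FACE_WIDTH_M / side_px


def proximity_score(distance_m: float, near_m: float = 2.0,
                    far_m: float = 10.0) -> float:
    """Map distance to [0, 1]: 1.0 at or inside near_m, 0.0 at or beyond far_m."""
    score = (far_m - distance_m) / (far_m - near_m)
    return max(0.0, min(1.0, score))


distance = estimate_distance_m(30, 30, focal_length_px=600)
print(round(distance, 2), round(proximity_score(distance), 3))
```

Because only the scalar score enters the payload, the bounding box itself never needs to leave the aggregation window.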
Once the aggregation window closes (e.g., at the boundary of a creative play), the local state is flushed. The raw tracking IDs, coordinate histories, and frame buffers are permanently deleted from memory. The resulting output is a highly compact, structured JSON payload that conforms to the standards outlined in the [IAB Measurement Glossary](https://www.iab.com/insights/measurement-glossary/).
```json
{
  "timestamp": 1779840000,
  "duration_ms": 10000,
  "audience_count": 3,
  "metrics": [
    {
      "dwell_ms": 8500,
      "attention_ms": 4200,
      "proximity_score": 0.72
    },
    {
      "dwell_ms": 3100,
      "attention_ms": 0,
      "proximity_score": 0.45
    },
    {
      "dwell_ms": 9800,
      "attention_ms": 7100,
      "proximity_score": 0.88
    }
  ],
  "context": {
    "acoustic_class": "indoor_mall",
    "confidence": 0.91
  }
}
```
By converting raw physical presence into structured, anonymous telemetry at the edge, we avoid exposing sensitive biometric data beyond the device. The network never transmits, processes, or stores facial templates, so the architecture is designed from the ground up to align with global privacy mandates.
## Cryptographic Proof of Play (PoP) Signature Chains
Generating audience metrics at the edge solves the privacy problem, but it introduces a security challenge: how can media buyers trust that the edge device actually played the ad and that the reported audience metrics are genuine, rather than fabricated by a compromised player?
To establish trust in a decentralized measurement network, we implement a cryptographic [Proof of Play](/support/developers/proof-of-play/) signature chain. Each edge device is provisioned with a unique, hardware-bound private key stored securely within a local Trusted Platform Module (TPM 2.0) or hardware enclave.
When a creative is rendered on the screen, the player engine coordinates with the cryptographic signer to generate an immutable proof-of-play block. This block contains:
- The exact microsecond timestamp derived from a hardware-attested real-time clock.
- The cryptographically signed Screen Identifier.
- The Creative Hash (verifying the exact asset rendered).
- The associated Verified Audience Signal (VAS) payload for that play duration.
- The cryptographic hash of the preceding proof-of-play block.
This payload is signed with Ed25519, producing a compact 64-byte signature. Ed25519 is well suited to this role because of its speed, its resistance to side-channel attacks, and its small signature size, which minimizes network overhead.
By chaining each block to the previous one (embedding the hash of Block $N-1$ into Block $N$), we construct a local hash chain. Combined with the per-block signatures, this chain makes back-dated, removed, or synthetically injected impressions detectable on audit.
```
+-------------------------------------------------------------------------+
|                                Block N-1                                |
|  - Timestamp: 1779840000                                                |
|  - Creative ID: 0x9f8e...                                               |
|  - Audience Count: 2                                                    |
|  - Previous Hash: 0x1a2b...                                             |
|  - Signature: [64-Byte Ed25519 Signature]                               |
+-------------------------------------------------------------------------+
                                     |  (Hashed & Chained)
                                     v
+-------------------------------------------------------------------------+
|                                 Block N                                 |
|  - Timestamp: 1779840010                                                |
|  - Creative ID: 0x3c4d...                                               |
|  - Audience Count: 3                                                    |
|  - Previous Hash: Hash(Block N-1)                                       |
|  - Signature: [64-Byte Ed25519 Signature]                               |
+-------------------------------------------------------------------------+
```
Because these signatures can be verified asynchronously by any party holding the screen's public key, media buyers can audit physical impressions independently. Verification requires zero server round-trips during the play cycle, allowing screens to operate reliably in offline or semi-connected environments (such as transit buses or subway platforms) while remaining fully auditable after the fact.
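The hash-linking half of this scheme can be sketched with the Python standard library alone. The sketch assumes SHA-256 over a canonical JSON serialization and an all-zero genesis hash; neither choice is specified in this article, and the helper names are hypothetical. Ed25519 signing and signature verification are deliberately omitted, so this shows only how an auditor would check the links between blocks.

```python
import hashlib
import json
from typing import Dict, List

GENESIS_HASH = "0" * 64  # assumed placeholder for the first block's predecessor


def canonical_bytes(block: Dict) -> bytes:
    """Serialize deterministically so every verifier hashes identical bytes."""
    return json.dumps(block, sort_keys=True, separators=(",", ":")).encode("utf-8")


def block_hash(block: Dict) -> str:
    return hashlib.sha256(canonical_bytes(block)).hexdigest()


def append_block(chain: List[Dict], timestamp: int, creative_id: str,
                 audience_count: int) -> Dict:
    """Create a block that embeds the hash of the previous block."""
    previous = block_hash(chain[-1]) if chain else GENESIS_HASH
    block = {
        "timestamp": timestamp,
        "creative_id": creative_id,
        "audience_count": audience_count,
        "previous_hash": previous,
    }
    chain.append(block)
    return block


def verify_chain(chain: List[Dict]) -> bool:
    """Recompute each link; any edited or removed earlier block breaks it."""
    expected = GENESIS_HASH
    for block in chain:
        if block["previous_hash"] != expected:
            return False
        expected = block_hash(block)
    return True


chain: List[Dict] = []
append_block(chain, 1779840000, "0x9f8e...", 2)
append_block(chain, 1779840010, "0x3c4d...", 3)
print(verify_chain(chain))

chain[0]["audience_count"] = 50  # simulate an inflated impression count
print(verify_chain(chain))
```

Hash links alone cannot protect the newest block or detect removal of the chain's tail, which is why each block also carries its own signature and why auditors benefit from anchoring the latest hash externally.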
## Comparative Analysis: Edge AI vs. Legacy Architectures
To understand the structural advantages of on-device computer vision combined with cryptographic verification, we must compare it directly against legacy cloud-based and panel-based attribution models across three critical axes: accuracy, latency, and privacy.
| Attribute | Legacy Panel Models | Cloud-Based Mobile SDKs (e.g., MAID tracking) | Edge AI (Trillboards Methodology) |
| :--- | :--- | :--- | :--- |
| **Accuracy** | Low. Relies on historical, extrapolated traffic counts. Fails to capture real-time, creative-specific attention. | Medium-Low. Suffers from high GPS drift, indoor signal attenuation, and low match rates. | High. Deterministic, real-time optical and acoustic measurement at the point of display. |
| **Latency** | Extremely High. Data is processed in monthly or quarterly batches. No real-time optimization. | High. Requires matching bidstream logs with location data, taking days or weeks to process. | Near-Instantaneous. Local inference runs in under 33ms, enabling rapid proof-of-play generation. |
| **Privacy Profile** | High. Anonymous aggregate data, but lacks granular verification. | Poor. Transmits sensitive, continuous location history of users to cloud databases. | Excellent. Zero-storage ephemeral processing. Raw data is destroyed instantly; only signed metadata is sent. |
### Accuracy
Legacy panel models rely on historical averages that cannot adapt to real-world anomalies. If a screen in a transit station is surrounded by a crowd due to a train delay, a panel model completely misses the spike in impressions. Conversely, mobile SDK tracking suffers from severe spatial inaccuracy. A mobile phone detected within a 15-meter radius of an outdoor billboard may be inside a vehicle, facing the opposite direction, or deep inside an adjacent building.
By contrast, our edge AI methodology measures actual physical presence and attention. The 3D head pose estimation ensures that an impression is only counted when a human face is oriented toward the display. Across our telemetry, this approach has allowed media buyers to verify impressions against signed, per-play evidence, supporting a baseline of $4.50 CPM for premium, verified-attention inventory.
### Latency
Cloud-based computer vision solutions (where raw video streams are transmitted to centralized servers for processing) are impractical at this scale, both technically and financially. The bandwidth cost of streaming 1080p video at 30 FPS over cellular networks from tens of thousands of screens is prohibitive. Furthermore, the round-trip latency of cloud inference rules out per-frame, real-time decisions at the display.
Our edge architecture processes frames locally in less than 33 milliseconds (matching the 30 FPS sensor rate). This ultra-low latency allows the player engine to dynamically adjust local configurations or trigger immediate proof-of-play generation. This capability was demonstrated during a recent optimization run where the system successfully processed 84,200 impressions over a single weekend with zero network congestion, as only the lightweight cryptographic metadata was transmitted back to the central gateway.
### Privacy
Transmitting raw video or audio streams of identifiable people to the cloud creates significant exposure under modern privacy frameworks, including GDPR and CCPA. Once biometric data leaves the physical device, its handling becomes far harder to control and audit.
By executing our 400 KB BlazeFace model entirely within local volatile memory, the raw image is processed and discarded within milliseconds. Because the [Edge SDK](/support/developers/edge-sdk/) does not expose video recording or streaming APIs, the SDK offers no path for raw data to leave the device. The only output is the anonymous, aggregated, and cryptographically signed VAS payload, achieving true privacy-by-design.
## Implementation Guide: Deploying the Edge SDK
For measurement architects and edge-ML engineers, integrating this methodology involves deploying our modular SDKs within your existing player application. Below is a conceptual implementation guide demonstrating how to initialize the optical sensing pipeline, compute attention metrics, and generate a cryptographically signed Proof of Play payload.
### Step 1: Initialize the Sensing Engine
First, import the required modules from the [Sensing SDK](/support/developers/sensing-sdk/) and load the optimized 400 KB BlazeFace ONNX model. The engine handles hardware acceleration automatically, targeting available GPUs or NPUs through [ONNX Runtime](https://onnxruntime.ai/) execution providers (such as NNAPI or DirectML), or optimized CPU execution.
```python
from trill_sensing import OpticalPipeline, AttentionScorer
from trill_crypto import ProofOfPlaySigner

# Initialize the hardware-accelerated optical pipeline
pipeline = OpticalPipeline(
    model_path="/etc/trill/models/blazeface_quant_400kb.onnx",
    target_fps=30,
    execution_provider="NPU"
)

# Configure the 3D attention scorer (yaw, pitch, eye-weight composite)
scorer = AttentionScorer(
    yaw_threshold=15,    # +/- 15 degrees
    pitch_threshold=10,  # +/- 10 degrees
    eye_weight_min=0.6
)
```
### Step 2: Process the Frame Loop and Aggregate Metrics
As the player renders a creative, the application loops through incoming sensor frames, tracking face detections and calculating head pose vectors. These metrics are accumulated locally over the duration of the play window.
```python
# Start frame acquisition
pipeline.start_capture()
active_tracks = {}

while creative_is_playing:
    frame = pipeline.get_next_frame()
    detections = pipeline.detect_faces(frame)

    for face in detections:
        # Track face across frames using local Kalman filter
        track_id = face.track_id

        # Compute 3D head pose vectors
        pose = scorer.estimate_pose(face.landmarks)
        is_attentive = scorer.is_within_attention_cone(pose)

        if track_id not in active_tracks:
            active_tracks[track_id] = {
                "dwell_frames": 1,
                "attentive_frames": 1 if is_attentive else 0
            }
        else:
            active_tracks[track_id]["dwell_frames"] += 1
            if is_attentive:
                active_tracks[track_id]["attentive_frames"] += 1

# Stop capture and clean up frame buffers immediately
pipeline.stop_capture()
```
### Step 3: Compute VAS and Generate Proof of Play
Once the creative finishes playing, compile the tracking data into a Verified Audience Signal (VAS) payload. Pass this payload to the [Proof of Play](/support/developers/proof-of-play/) signing module, which coordinates with the local secure enclave to generate the 64-byte Ed25519 signature.
```python
# Synthesize metrics from active tracks
vas_payload = {
    "timestamp": current_hardware_timestamp(),
    "creative_id": "0x9f8e7d6c5b4a",
    "audience_count": len(active_tracks),
    "metrics": []
}

for track_id, data in active_tracks.items():
    # Convert frame counts to milliseconds at the 30 FPS capture rate
    dwell_ms = (data["dwell_frames"] / 30.0) * 1000
    attention_ms = (data["attentive_frames"] / 30.0) * 1000
    vas_payload["metrics"].append({
        "dwell_ms": int(dwell_ms),
        "attention_ms": int(attention_ms)
    })

# Initialize the cryptographic signer (bound to hardware TPM)
signer = ProofOfPlaySigner(key_slot=1)

# Generate the immutable Proof of Play envelope
pop_envelope = signer.sign_payload(
    payload=vas_payload,
    previous_block_hash=signer.get_last_block_hash()
)

# Transmit the compact envelope to the ingestion gateway
transmit_to_gateway(pop_envelope)
```
This integration backs each ad play with signed, tamper-evident audience metrics. Because the entire pipeline executes locally, no raw sensory data leaves the screen, establishing a new standard for privacy-first physical world measurement.
## FAQ
### How does the Edge SDK handle extreme lighting variations and off-angle detections without increasing model size?
Our optimized 400 KB BlazeFace model uses a specialized lightweight preprocessing layer that normalizes image contrast and exposure on the GPU before running inference. To handle off-angle detections, we employ a camera calibration matrix configured during initial installation. This matrix maps the physical camera's mounting angle and focal length to the 3D projection space, allowing the head pose estimation algorithm to accurately calculate yaw and pitch even when the sensor is mounted above or to the side of the display.
### What is the network overhead of the Ed25519 signature chain when scaling to tens of thousands of screens?
The network overhead is low. Because the Ed25519 algorithm produces a compact 64-byte signature, and the aggregated Verified Audience Signal (VAS) payload is formatted in a minified JSON structure, the entire Proof of Play envelope for a typical 10-second play slot is under 500 bytes. For a network of 10,000 screens playing ads continuously, each screen emits one envelope per 10-second slot, so the aggregate is at most 10,000 × 500 bytes / 10 s = 500 KB of telemetry per second across the entire network, which is easily managed by standard ingestion gateways.
### How does the system guarantee GDPR/CCPA compliance when the optical sensor is actively capturing faces?
Compliance rests on a strict zero-storage, ephemeral processing architecture. Raw frames captured by the camera sensor are loaded directly into volatile RAM, processed by the 30 FPS inference loop to extract anonymous coordinate vectors, and immediately overwritten. No images or facial templates are ever written to non-volatile storage, and no raw data is transmitted over the network. Because a human face cannot be reconstructed from the anonymous coordinates, the system is designed so that no personal biometric data, as defined under GDPR and CCPA, is retained or transmitted.
### What runs on-device in the Trillboards edge AI stack?
Eight composable npm packages: edge-core (identity, heartbeat, socket.io), edge-sensing (camera/audio capture, BlazeFace face detection, YAMNet audio classification, attention scoring, audience metrics), edge-ads (Chromium kiosk lifecycle), edge-federated (gradient training, Ed25519 VAS attestation), edge-cloud (optional Gemini inference), edge-platform-linux, edge-platform-windows, and the umbrella edge-sdk. All ML inference runs on-device via ONNX Runtime; no raw frame or raw audio leaves the device.
### How is on-device attribution different from cloud attribution?
Cloud attribution uploads either device IDs (privacy-fragile) or raw frames (privacy-untenable for venues) for centralized processing. On-device attribution produces only the derived signal (face count, attention level, dwell seconds) per aggregation window, signs it with Ed25519, and uploads the signed signal. Network costs are orders of magnitude lower (tens of bytes vs megabytes per second of video), latency is sub-100ms (no upload round-trip), and the privacy posture meets GDPR/CCPA without per-vendor DPA negotiation.
### How is the proof-of-play signature verified?
Every impression payload (timestamp, screen-id, creative-id, viewability metrics) is signed with Ed25519 using the device-resident private key. Buyers (or any third party with the public key) can verify the signature offline, with no API call to Trillboards needed. The public key is published at /support/developers/proof-of-play/ and the Open Verification model means anyone can audit any impression. This is the cryptographic complement to BLE foot-traffic counts: the impression really happened, on a specific screen, at a specific time.