Making Mobile WebXR Bounding-Box Detection Practical
Async AR frames, RGB segmentation, and XR-first depth fusion on a phone
How we evolved a depth-only WebXR prototype into a robust, low-latency 3D measurement pipeline with stable 1-2 FPS auto-detection updates.
1. Background
Our goal was simple to explain and hard to execute: estimate object dimensions from a mobile AR scene directly in the browser, while keeping interaction smooth enough for real users.
The baseline implementation used WebXR Depth Sensing and local point-cloud fitting around the reticle. It worked in controlled cases, but noisy background points, unstable scale from monocular depth alone, and mobile compute limits made the first version too fragile.
2. What Made This Hard on Mobile
Mobile AR workloads are pipelines: camera frame capture, optional segmentation, depth estimation, 3D unprojection, filtering, and geometric fitting. Even when each stage is small, synchronous execution creates frame drops and latency spikes.
Memory pressure is another risk. Repeated full-frame copies between graphics and CPU memory quickly increase peak usage. On phones, that can trigger thermal throttling or process termination under load.
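One way to bound that peak usage is to reuse a small pool of readback buffers instead of allocating a fresh full-frame copy per capture. The sketch below is illustrative only (the class and method names are assumptions, not our production API); the key property is that when the pool is exhausted, the frame is dropped rather than a new buffer allocated.

```javascript
// Hypothetical sketch: a fixed pool of reusable readback buffers.
// Exhaustion means "drop this frame", which caps peak memory.
class FramePool {
  constructor(count, byteLength) {
    this.free = Array.from({ length: count }, () => new Uint8Array(byteLength));
  }
  // Returns a reusable buffer, or null when the pool is exhausted.
  acquire() {
    return this.free.pop() ?? null;
  }
  // Returns a buffer to the pool once its contents are consumed.
  release(buf) {
    this.free.push(buf);
  }
}
```

Dropping frames under pressure is acceptable here because detection only needs a low-rate capture cadence anyway.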
3. Reducing Render-Loop Pressure
We split the pipeline into two lanes. The render lane keeps XR interaction responsive: reticle, handles, visualization, and minimal depth sampling. The async lane processes queued snapshots at a throttled cadence.
Only compact data is copied out of XRFrame scope: timestamp, camera pose, sampled UV/world points, and a low-rate RGB capture. Heavy operations run in idle-time batches, not on the rendering path.
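The two-lane split can be sketched as a tiny snapshot queue: the render lane does a cheap enqueue of the compact snapshot, and the async lane drains at a throttled cadence, always keeping only the newest unprocessed snapshot. Names and the interval value below are assumptions for illustration.

```javascript
// Illustrative sketch of the render-lane / async-lane handoff.
class SnapshotQueue {
  constructor(minIntervalMs = 500) { // ~1-2 Hz processing cadence
    this.minIntervalMs = minIntervalMs;
    this.pending = null;            // only the newest snapshot survives
    this.lastProcessed = -Infinity;
  }
  // Render lane: cheap copy of compact data, no heavy work here.
  enqueue(snapshot) {
    this.pending = snapshot;        // older unprocessed snapshots are dropped
  }
  // Async lane: call from idle time (e.g. requestIdleCallback).
  // Returns true if a snapshot was processed.
  drain(nowMs, process) {
    if (!this.pending || nowMs - this.lastProcessed < this.minIntervalMs) {
      return false;
    }
    const snap = this.pending;
    this.pending = null;
    this.lastProcessed = nowMs;
    process(snap);
    return true;
  }
}
```

Keeping only the latest snapshot means the async lane always works on fresh data and backlog can never grow.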
4. RGB Segmentation + Dual Depth Strategy
Object segmentation is computed from RGB frames (not from depth maps). The resulting mask filters sampled rays before geometric fitting, which removes many background outliers.
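The mask-based ray filtering amounts to a lookup per sampled point: a point is kept only if its image-space UV lands on an "object" pixel in the segmentation mask. A minimal sketch, assuming a row-major binary mask and normalized UV coordinates:

```javascript
// Reject sampled points whose UV falls outside the RGB segmentation mask.
// `mask` is a row-major Uint8Array where 1 = object, 0 = background.
function filterByMask(points, mask, width, height) {
  return points.filter(({ u, v }) => {
    // Clamp to the mask bounds to tolerate edge samples.
    const x = Math.min(width - 1, Math.max(0, Math.round(u * (width - 1))));
    const y = Math.min(height - 1, Math.max(0, Math.round(v * (height - 1))));
    return mask[y * width + x] === 1;
  });
}
```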
Depth is fused from two sources: (a) preferred XR depth from WebXR Depth Sensing and (b) DepthAnythingV3 relative depth. We keep XR depth dominant, then adapt blend weights using per-candidate confidence and backend quality signals.
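Per-sample, the XR-dominant fusion can be expressed as a confidence-weighted blend with a fixed bias toward the XR source. The bias value and weighting scheme below are illustrative assumptions, not the exact production policy:

```javascript
// Hedged sketch of per-ray depth fusion. XR depth dominates when present;
// weight shifts toward scaled monocular depth as XR confidence drops.
// `xrBias` is an assumed tuning constant, not a measured value.
function fuseDepth(xrDepth, xrConf, monoDepth, monoConf, xrBias = 0.8) {
  if (xrDepth == null) return monoDepth; // no XR sample on this ray
  if (monoDepth == null) return xrDepth; // monocular backend unavailable
  const wXr = xrBias * xrConf;
  const wMono = (1 - xrBias) * monoConf;
  return (wXr * xrDepth + wMono * monoDepth) / (wXr + wMono);
}
```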
For DepthAnythingV3 scaling, we fit both a direct mapping (metric depth as a linear function of relative depth) and an inverse mapping (metric depth as a linear function of inverse relative depth) and select the lower-error fit; the metric anchor can come from raw depth samples or from the detected floor plane. This mirrors our native experimentation flow and improves consistency across scenes.
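The direct-vs-inverse selection can be sketched with an ordinary least-squares fit on each candidate mapping, keeping whichever has the lower residual. This is an illustrative solver, not the exact production one:

```javascript
// Fit metric = a*x + b by least squares and report the RMS residual.
function fitLinear(xs, ys) {
  const n = xs.length;
  const mx = xs.reduce((s, x) => s + x, 0) / n;
  const my = ys.reduce((s, y) => s + y, 0) / n;
  let num = 0, den = 0;
  for (let i = 0; i < n; i++) {
    num += (xs[i] - mx) * (ys[i] - my);
    den += (xs[i] - mx) ** 2;
  }
  const a = num / den;
  const b = my - a * mx;
  const err = Math.sqrt(
    xs.reduce((s, x, i) => s + (a * x + b - ys[i]) ** 2, 0) / n
  );
  return { a, b, err };
}

// Evaluate both mappings of relative depth and keep the lower-error fit.
function selectScaleMapping(relDepth, metricDepth) {
  const direct = fitLinear(relDepth, metricDepth);
  const inverse = fitLinear(relDepth.map((d) => 1 / d), metricDepth);
  return direct.err <= inverse.err
    ? { mode: "direct", ...direct }
    : { mode: "inverse", ...inverse };
}
```

Inverse mappings matter because many monocular models regress disparity-like quantities, so metric depth is closer to linear in 1/d than in d.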
5. Backend-Aware Inference Contracts
To align browser integration with our Android R&D stack, the web bridge accepts DepthPointCloudAR-like source semantics: DEPTH_ANYTHING_V3_NCNN and DEPTH_ANYTHING_V3_TFLITE, with delegate preferences such as gpu or nnapi.
Backend labels are normalized across providers (for example ncnn_vulkan, tflite_gpu, tflite_nnapi), so monitoring and fusion policies stay consistent even when execution paths differ.
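Normalization itself is just a lookup from (source, delegate) pairs to the canonical labels. The mapping table below covers only the combinations named above and is an assumption about how the bridge routes them:

```javascript
// Illustrative normalization of source/delegate pairs into canonical
// backend labels consumed by monitoring and fusion policy.
const BACKEND_LABELS = {
  "DEPTH_ANYTHING_V3_NCNN:vulkan": "ncnn_vulkan",
  "DEPTH_ANYTHING_V3_TFLITE:gpu": "tflite_gpu",
  "DEPTH_ANYTHING_V3_TFLITE:nnapi": "tflite_nnapi",
};

function normalizeBackend(source, delegate) {
  return BACKEND_LABELS[`${source}:${delegate}`] ?? "unknown";
}
```

Routing every provider through one table keeps dashboards and fusion-weight policies stable when a new execution path is added.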
6. Result
The updated architecture reaches a practical operating mode for mobile demos: stable auto-detection updates around 1-2 FPS, smoother UX, and fewer catastrophic outliers in object AABB estimation.
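For reference, the final AABB is just the per-axis min/max of the fused, mask-filtered world points. A minimal sketch (a robust version would also trim percentile outliers before the min/max):

```javascript
// Axis-aligned bounding box from an array of [x, y, z] world points.
function computeAABB(points) {
  const min = [Infinity, Infinity, Infinity];
  const max = [-Infinity, -Infinity, -Infinity];
  for (const p of points) {
    for (let i = 0; i < 3; i++) {
      min[i] = Math.min(min[i], p[i]);
      max[i] = Math.max(max[i], p[i]);
    }
  }
  return { min, max, size: min.map((m, i) => max[i] - m) };
}
```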
Most importantly, the system is now structured for iterative research: segmentation and depth backends can be swapped independently, while the fusion layer and UI behavior remain stable.