RnGCam: High-speed video from rolling & global shutter measurements

University of California, San Diego
ICCV 2025
* Joint first authors

Abstract

Compressive video capture encodes a short high-speed video into a single measurement using a low-speed sensor, then computationally reconstructs the original video. Prior implementations rely on expensive hardware and are restricted to imaging sparse scenes with empty backgrounds. We propose RnGCam, a system that fuses measurements from low-speed consumer-grade rolling-shutter (RS) and global-shutter (GS) sensors into video at kHz frame rates. The RS sensor is combined with a pseudorandom optic, called a diffuser, which spatially multiplexes scene information. The GS sensor is coupled with a conventional lens. The RS-diffuser provides low spatial detail and high temporal detail, complementing the GS-lens system's high spatial detail and low temporal detail. We propose a reconstruction method using implicit neural representations (INR) to fuse the measurements into a high-speed video. Our INR method separately models the static and dynamic scene components, while explicitly regularizing dynamics. In simulation, we show that our approach significantly outperforms previous RS compressive video methods, as well as state-of-the-art frame interpolators. We validate our approach in a dual-camera hardware setup, which generates 230 frames of video at 4,800 frames per second for dense scenes, using hardware that costs 10× less than previous compressive video systems.

RnGCam teaser image

Pipeline for fusing multiple global-shutter measurements with an RS diffuser-coded long-exposure measurement. Both sensors are triggered at the same time, and between the start and end of the RS coded long exposure, two more images are captured as key frames using the global shutter. The RS sensor and diffuser encode high-speed dynamics into a single measurement, while the GS measurements serve as key frames for the reconstruction. The sum of a time-varying and a static neural scene representation fuses both measurements into a high-speed reconstruction with a dense background.
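The capture timing can be pictured with a short sketch. The concrete numbers below (a 4,800 fps target, 230 reconstructed frames, a hypothetical row count, uniform row readout) are illustrative assumptions, not the exact hardware parameters.

# Sketch of the capture timing described above: one RS coded long exposure
# with row-staggered readout, plus three GS key frames at start, middle, end.
import numpy as np

fps_target = 4800                      # desired reconstruction frame rate (Hz)
n_frames   = 230                       # frames recovered from one RS exposure
n_rows     = 460                       # RS sensor rows (hypothetical)
t_total    = n_frames / fps_target     # RS coded long exposure (~48 ms)
row_offset = t_total / n_rows          # assumed uniform row readout delay

# Each RS row r begins integrating at its own readout offset.
rs_row_start = np.arange(n_rows) * row_offset

# Three GS key frames spanning the RS exposure: start, middle, end.
gs_times = np.array([0.0, t_total / 2, t_total])

print(f"RS exposure: {t_total * 1e3:.1f} ms, "
      f"GS key frames at {np.round(gs_times * 1e3, 1)} ms")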

Forward model

forward model figure

Lensless camera forward model. (a) The intensity arriving at a lensless camera sensor is the 2D convolution of the diffuser PSF h with the scene v. Panels (b)-(d) illustrate the capture process for a dynamic scene; all y-t images are slices aligned with the ball's x-coordinate. (b) The scene is a red ball moving sinusoidally in the y-direction. (c) The RS-diffuser camera measurement bR records the dynamic intensity h * v(t), distributed across the sensor by convolution with the large diffuser PSF; this encodes rich spatio-temporal scene information into bR. (d) The GS-lens camera acquires three 2D images, trading temporal information for improved spatial detail compared to the RS-diffuser. This motivates our system design, in which we fuse the two measurement types to reconstruct a high-speed video.
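The forward model in panel (a) and the RS-diffuser measurement in panel (c) can be sketched as follows. Only the convolution-with-PSF structure and the three GS key frames come from the caption; the binary shutter code S and the per-row integration are illustrative assumptions.

# Simulating both measurement types from a high-speed scene v(t).
import numpy as np
from scipy.signal import fftconvolve

def diffused(frame, psf):
    """Intensity at the lensless sensor: 2D convolution of PSF h and scene v."""
    return fftconvolve(frame, psf, mode="same")

def rs_diffuser_measurement(video, psf, shutter):
    """Simulate the RS-diffuser capture bR.

    video  : (T, H, W) high-speed scene v(t)
    psf    : (H, W)    diffuser point spread function h
    shutter: (H, T)    assumed per-row exposure code S[r, t] in {0, 1}
    """
    T, H, W = video.shape
    b_R = np.zeros((H, W))
    for t in range(T):
        sensor_t = diffused(video[t], psf)        # h * v(t)
        b_R += shutter[:, t][:, None] * sensor_t  # row r integrates only while exposed
    return b_R

def gs_lens_measurements(video):
    """GS-lens camera: three conventional images at start, middle, end."""
    T = video.shape[0]
    return video[[0, T // 2, T - 1]]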

Space-Time Fusion Model for compressive video

inr figure

Both rolling- and global-shutter measurements simultaneously update the static and dynamic networks: the estimated scene vθ is queried, passed through the optical forward model A to produce estimated measurements, and the loss is computed against the captured measurements. The dynamic network FDθ takes a grid of spatiotemporal coordinates, while the static network FSθ takes only a grid of spatial coordinates. The two outputs are summed after the alpha map is applied.
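A minimal PyTorch sketch of this fusion step is below. The network sizes, the alpha-map parameterization (here predicted by the dynamic network and gating its output), and the equal loss weighting are assumptions for illustration; A_rs and A_gs are placeholders for the RS-diffuser and GS-lens forward models.

import torch
import torch.nn as nn

def mlp(in_dim, out_dim, width=128, depth=4):
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers += [nn.Linear(d, out_dim)]
    return nn.Sequential(*layers)

FD = mlp(3, 2)   # dynamic INR: (x, y, t) -> (dynamic intensity, alpha logit)
FS = mlp(2, 1)   # static INR:  (x, y)    -> static intensity

def query_scene(coords_xyt):
    """Estimated scene v_theta at spatiotemporal coordinates of shape (N, 3)."""
    dyn_out = FD(coords_xyt)
    dyn, alpha = dyn_out[:, :1], torch.sigmoid(dyn_out[:, 1:])
    static = FS(coords_xyt[:, :2])
    return alpha * dyn + static      # sum static and dynamic parts after alpha gating

def training_step(coords_xyt, A_rs, A_gs, b_R, b_GS, optimizer):
    """One joint update of both networks from both measurement types."""
    v = query_scene(coords_xyt)
    loss = ((A_rs(v) - b_R) ** 2).mean() + ((A_gs(v) - b_GS) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In this sketch a single optimizer would cover both networks, e.g. torch.optim.Adam(list(FD.parameters()) + list(FS.parameters()), lr=1e-3).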

Results

Results figure

Comparing our reconstruction with competing methods on a complex scene. a) A simulated scene of a smoke plume emerging from a gun barrel (credit: The Slow Mo Guys). We compare our method with video interpolators and 3DTV-based methods by computing PSNR over the entire video. The video interpolators (red inset), b) EMA-VFI [1] and c) Super SloMo [2], fail to recover information present early in the video in the intermediate frames, as they rely only on the GS key frames and are thus prone to hallucination. d,e) 3DTV-based methods resolve these details thanks to the coded RS measurements but have poor reconstruction quality. f) Our method resolves intermediate details with significantly higher fidelity and achieves the highest PSNR computed over the full video.

psnr plot figure

Comparing per-frame PSNR for all methods. Video interpolators achieve high PSNR near the three input GS frames (start, mid, end), indicated by the black vertical lines, but degrade on intermediate frames. Our method is more temporally stable, achieving the highest PSNR on intermediate frames (excluding the GS frames). Results are shown for the scene above, with stars marking the estimated frames.
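For reference, per-frame PSNR as plotted above can be computed with the standard definition, assuming intensities normalized to [0, 1]; this snippet is illustrative, not code from the paper.

import numpy as np

def per_frame_psnr(recon, gt, peak=1.0):
    """PSNR of each frame of a (T, H, W) reconstruction against ground truth."""
    mse = ((recon - gt) ** 2).reshape(recon.shape[0], -1).mean(axis=1)
    return 10.0 * np.log10(peak ** 2 / np.maximum(mse, 1e-12))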

BibTeX

@inproceedings{rngcam,
  title={RnGCam: High-speed video from rolling \& global shutter measurements},
  author={Kevin Tandi and Xiang Dai and Chinmay Talegaonkar and Gal Mishne and Nick Antipa},
  booktitle={International Conference on Computer Vision (ICCV)},
  year={2025},
  website={https://kevintandi.github.io/rngcam}
}