Abstract
Efficiently transferring image-based object detectors to the domain of
video remains challenging under resource constraints. Previous efforts
used feature propagation to avoid recomputing unchanged features.
However, feature propagation itself incurs significant overhead when
scenes change very slowly, as in surveillance applications. In this paper, we
propose temporal early exits to reduce the computational complexity of
video object detection. Multiple temporal early exit modules with low
computational overhead are inserted at early layers of the backbone
network to identify the semantic differences between consecutive frames.
Full computation is triggered only when a frame is identified as
semantically different from previous frames; otherwise, detection results from
previous frames are reused. Experiments on ImageNet VID and TVnet show
that the approach accelerates video object detection by 1.7x compared
to the state of the art, with a reduction of less than 1% in mAP.
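The gating logic described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the names `shallow_features`, `full_detect`, and `frame_distance`, and the threshold value, are all hypothetical stand-ins for the early-exit modules and the semantic-difference criterion.

```python
def frame_distance(f1, f2):
    """Mean absolute difference between two shallow feature vectors
    (a hypothetical stand-in for the paper's semantic-difference measure)."""
    return sum(abs(a - b) for a, b in zip(f1, f2)) / len(f1)

def detect_video(frames, shallow_features, full_detect, threshold=0.1):
    """Run the full detector only on frames whose cheap, early-layer
    features differ enough from the last fully processed frame;
    otherwise reuse the cached detections."""
    results, cached_feat, cached_det, full_runs = [], None, None, 0
    for frame in frames:
        feat = shallow_features(frame)  # cheap early-layer computation
        if cached_feat is None or frame_distance(feat, cached_feat) > threshold:
            cached_det = full_detect(frame)  # semantic change: full pass
            cached_feat = feat
            full_runs += 1
        results.append(cached_det)  # static scene: reuse previous result
    return results, full_runs
```

With a slowly changing input, most frames take the early exit; e.g., five frames of which only two differ substantially would require only two full detector passes.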