Multimodal RGB-Thermal Object Detection

Equal-contribution author — architecture & experiments (Computer Vision & DL course → IEEE ICUAS 2026 paper, under review)

Detecting people from a drone at medium-high altitude is hard: targets span only a handful of pixels, and RGB-based detection degrades sharply at dawn, dusk and night. This project fuses visible (RGB) and thermal infrared (TIR) imagery to stay robust across lighting conditions, and was built for UAV-assisted monitoring of recreational fishing along protected coastline. The work grew from a Computer Vision & Deep Learning course project into a paper currently under review at IEEE ICUAS 2026.

Mid-level RGB-TIR fusion on a dual-backbone architecture built on DEYOLO (Dual-Feature-Enhancement YOLO): each modality gets its own backbone, so its cues stay distinct until the features are fused.
Small-object refinements: added SPDConv (lossless space-to-depth downsampling) and a redesigned SPANet neck that propagates the high-resolution P2 level to the heads, preserving the fine detail small targets need.
78% mAP50 on the curated VTUAV-det-tiny benchmark, beating RGB-only (35.5%) and thermal-only (73.8%) baselines; ablations confirm SPDConv and SPANet are complementary.
Evaluated qualitatively on a custom dual-sensor dataset captured with a DJI Matrice 30T; Class Activation Maps show tighter, better-centred activations for the enhanced model.
Training carbon footprint tracked with CodeCarbon on an NVIDIA RTX A6000.
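
To illustrate why space-to-depth downsampling is described as lossless, here is a minimal sketch in plain Python. It is not the project's code: the function names `space_to_depth` and `depth_to_space`, the nested-list image layout (height x width x channels) and the block size of 2 are assumptions made for the example. In SPD-Conv as usually described, this rearrangement is followed by a non-strided convolution, which the sketch leaves out, and a real model would do the same thing with tensor operations.

```python
def space_to_depth(image, block=2):
    """Fold each block x block spatial patch into the channel axis.

    `image` is a nested list shaped [H][W][C]; the result is shaped
    [H/block][W/block][C*block*block]. No value is discarded.
    """
    height, width = len(image), len(image[0])
    if height % block or width % block:
        raise ValueError("height and width must be divisible by block")
    packed = []
    for i in range(0, height, block):
        row = []
        for j in range(0, width, block):
            cell = []
            # Visit the patch in row-major order and concatenate channels.
            for di in range(block):
                for dj in range(block):
                    cell.extend(image[i + di][j + dj])
            row.append(cell)
        packed.append(row)
    return packed


def depth_to_space(packed, block=2):
    """Inverse of space_to_depth, used here only to show nothing was lost."""
    height, width = len(packed), len(packed[0])
    channels = len(packed[0][0]) // (block * block)
    image = [[None] * (width * block) for _ in range(height * block)]
    for i in range(height):
        for j in range(width):
            cell = packed[i][j]
            for k in range(block * block):
                di, dj = divmod(k, block)
                pixel = cell[k * channels:(k + 1) * channels]
                image[i * block + di][j * block + dj] = pixel
    return image


# Toy 4x4 single-channel "image" whose pixel values are 0..15.
image = [[[r * 4 + c] for c in range(4)] for r in range(4)]
packed = space_to_depth(image)       # shape becomes 2 x 2 x 4
restored = depth_to_space(packed)    # identical to the original image
```

In contrast to a stride-2 convolution or pooling layer, which reduces resolution by skipping or summarising pixels, every input value is still present after the rearrangement, which matters when a person covers only a few pixels.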