Ketul Shah

Portrait


Hello!

I'm a final-year PhD candidate at Johns Hopkins University in the Department of Electrical and Computer Engineering, advised by Prof. Rama Chellappa. My broad research interests lie at the intersection of machine learning, computer vision, and computer graphics. My current work focuses on robust action recognition, leveraging videos from multiple viewpoints and synthetic data.

Previously, I obtained an MS in ECE from the University of Maryland, College Park. In a previous life, I received a Dual Degree (B.Tech + M.Tech) in Electrical Engineering from the Indian Institute of Technology Madras, where I worked with Prof. Kaushik Mitra at the Computational Imaging Lab.

News

Research Work

Multimodal video retrieval using a self-refining agent.
VRAgent: Self-Refining Agent for Zero-Shot Multimodal Video Retrieval
Under submission

An agentic framework for multimodal video retrieval that decomposes the user query into a tool-instruction set and iteratively self-refines it. Introduces two multimodal video retrieval benchmarks.



Self-supervised video pre-training with motion-aware multi-view MAEs.
MV2MAE: Self-Supervised Video Pre-Training with Motion-Aware Multi-View Masked Autoencoders
Under submission

Video SSL on multi-view video data via cross-view reconstruction and motion-aware masking.



UDA by translating source images to the target domain using controlled diffusion.
Diffuse2Adapt: Controlled Diffusion for Synthetic-to-Real Domain Adaptation
ICIP 2025 (Oral Presentation)

UDA leveraging controlled diffusion models to translate source images to the target domain while incorporating the context and style of the target domain.



Generalization to aerial viewpoints via 3D estimation, rendering, and adaptation.
AeroGen: Ground-to-Air Generalization for Action Recognition
FG 2025

Synthesizes diverse aerial and ground data using 3D human mesh extraction and rendering. A dual domain adaptation loss is proposed to align the synthetic-real and ground-air domains.



Video UDA with masked pre-training and collaborative self-training.
Unsupervised Video Domain Adaptation with Masked Pre-Training and Collaborative Self-Training
CVPR 2024

Video unsupervised domain adaptation (UDA) by leveraging CLIP for masked distillation and self-training on target domain data.



Improving image generation in diffusion models using the kurtosis concentration property.
DiffNat: Exploiting the Kurtosis Concentration Property for Image Quality Improvement
TMLR 2025

Proposed a general "naturalness"-preserving loss based on the projected kurtosis concentration property of natural images.



Dataset and baselines for synthetic-to-real action recognition.
Synthetic-to-Real Domain Adaptation for Action Recognition: A Dataset and Baseline Performances
ICRA 2023

Released the RoCoG-v2 dataset for synthetic-to-real and ground-to-air action recognition, along with baselines on these domain shifts.



Skeleton Self Supervised Learning by hallucinating latent positives.
HaLP: Hallucinating Latent Positives for Skeleton-based Self-Supervised Learning of Actions
CVPR 2023

A new contrastive learning method that generates positives in the latent space for self-supervised skeleton-based action recognition.



Caption-guided image-to-image data augmentation.
Cap2Aug: Caption guided Image to Image data Augmentation
arXiv 2023

Generates diverse augmentations using image-to-image diffusion models guided by image captions.



Multi-view action recognition using contrastive learning.
Multi-View Action Recognition using Contrastive Learning
WACV 2023

Improved hardness-aware supervised contrastive learning objective for multi-view action recognition.



Few-shot learning using hardness-aware mixup.
FeLMi: Few shot Learning with hard Mixup
NeurIPS 2022

Generates samples using manifold mixup and selects hard samples based on uncertainty.



Using multi-view depth maps for modeling 3D shapes.
Improved modeling of 3D shapes with multi-view depth maps
3DV 2020 (Oral Presentation)

A novel encoder-decoder generative model for 3D shapes using multi-view depth maps; state-of-the-art results on single-view reconstruction and generation.



Photorealistic image reconstruction using event cameras.
Photorealistic Image Reconstruction from Hybrid Intensity and Event based Sensor
arXiv 2018

A novel method for generating high-frame-rate video from a conventional camera and an event sensor, warping the intensity frames by first estimating scene depth and ego-motion.