Research

I develop learning algorithms that enable AI agents to acquire robust decision-making capabilities from fixed data and rich feedback. My research spans reinforcement learning, imitation learning, foundation models, and AI agents, with a focus on building competent, trustworthy, and generalizable policies without additional environment interaction. By combining theoretical foundations with scalable algorithms, I aim to create autonomous systems that can learn safely and efficiently from demonstrations, preferences, natural language feedback, and large-scale datasets. Below are my publications, patents, and selected projects.

Reinforcement Learning Imitation Learning Behavior Foundation Models Agentic AI Large Language Models Post-training

Publications 10

DistIL: distributional DAgger with future-aware credit assignment
NEW RLxF Workshop · ICML 2026

Reinforcement Learning from Rich Feedback with Distributional DAgger

Rishabh Agrawal, Jacob Fein-Ashley, Paria Rashidinejad

Abstract

Reasoning models have advanced quickly, but the dominant RL-from-verifiable-rewards (RLVR) recipe stays narrow: sample many responses and reward each with a single correctness bit. Many settings, though, provide far richer feedback — execution traces, tool outputs, expert corrections, and self-evaluations. We use such feedback through DistIL, a distributional variant of the classic DAgger algorithm in which the learner has local access to an expert distribution over the states its current policy visits. This yields a simple forward cross-entropy objective that admits a black-box expert and whose sequence-level gradient propagates future expert–student disagreement back to earlier decisions. We show that prior self-distillation objectives based on reverse-KL or Jensen–Shannon do not guarantee monotonic policy improvement, whereas forward cross-entropy does — and additionally enjoys regret guarantees and optimizes a lower bound on teacher-weighted likelihood of success, improving Pass@N. Empirically, DistIL improves over RLVR and self-distillation baselines across scientific reasoning, coding, and hard mathematics.

RBFM: robust task inference under dynamics shift
NEW arXiv · 2026

When Dynamics Shift, Robust Task Inference Wins: Offline Imitation Learning with Behavior Foundation Models Revisited

Rishabh Agrawal, Rahul Jain, Ashutosh Nayyar

Abstract

Behavior Foundation Models (BFMs) enable scalable imitation learning by pretraining task-agnostic representations that adapt rapidly to new tasks — but existing BFMs assume fixed dynamics, limiting robustness under real-world shifts such as changes in friction, actuation, or sensor noise. We formulate BFM task inference as a robust minimax optimization over an uncertainty set of dynamics, enabling adaptation to worst-case perturbations without touching pretraining. To our knowledge this is the first BFM framework robust to dynamics shifts while using only offline data from a single nominal environment. Two variants, RBFM-Light and RBFM-Heavy, trade computation against robustness while preserving fast task inference, and both significantly outperform standard BFM and robust offline IL baselines under dynamics shifts.

BE-DROIL overview
ORAL NeurIPS 2025 E-SARS · L4DC 2026

Balance Equation-based Distributionally Robust Offline Imitation Learning

Rishabh Agrawal, Yusuf Alvi, Rahul Jain, Ashutosh Nayyar

Abstract

Standard imitation learning implicitly assumes environment dynamics remain fixed between training and deployment — an assumption that rarely holds, since modeling inaccuracies, parameter variation, and adversarial perturbations all shift transition dynamics and degrade performance. We learn robust policies from expert demonstrations collected under nominal dynamics alone, formulating the problem as a distributionally robust optimization over an uncertainty set of transition models. Crucially, we show this worst-case objective can be reformulated entirely in terms of the nominal data distribution, enabling tractable offline learning. On continuous-control benchmarks, the approach achieves superior robustness and generalization over state-of-the-art offline IL baselines, especially under perturbed or shifted environments.

Thesis overview: offline imitation learning
AAAI 2026 · Doctoral Consortium

Towards Offline Imitation Learning: Strictly Batch Settings, Generalization, and Variability in Expertise

Rishabh Agrawal

Abstract

My doctoral research develops a unified framework for offline imitation learning that tackles three challenges: sample efficiency in strictly batch settings, robustness and generalization under dynamics shifts, and learning from demonstrations of varying quality. At its core is a paradigm for strictly offline IL based on enforcing the Markov Balance Equation, realized through conditional density estimation in the CKIL and MBIL algorithms. Building on this, I developed the first distributionally robust offline IL framework under a stationarity constraint, and now extend it through Robust Behavior Foundation Models that generalize across dynamics shifts. Finally, I propose a variational approach for learning from crowdsourced demonstrations by inferring demonstrator expertise — together broadening the applicability of IL to robotics, healthcare, and autonomous systems.

AAAI 2025

Markov Balance Satisfaction Improves Performance in Strictly Batch Offline Imitation Learning

Rishabh Agrawal, Nathan Dahlin, Rahul Jain, Ashutosh Nayyar

Abstract

We address imitation where the learner relies solely on observed behavior, with no environment interaction, no supplementary datasets beyond the expert's, and no knowledge of transition dynamics — a more constrained and realistic setting than most SOTA IL methods assume. Our method uses the Markov balance equation and introduces a conditional density estimation framework, employing conditional normalizing flows for transition dynamics estimation to satisfy a balance equation for the environment. Across Classic Control and MuJoCo, it consistently outperforms many state-of-the-art IL algorithms.

L4DC 2025

Conditional Kernel Imitation Learning for Continuous State Environments

Rishabh Agrawal, Nathan Dahlin, Rahul Jain, Ashutosh Nayyar

Abstract

Classical IL methods such as behavioral cloning and inverse RL are highly sensitive to estimation errors, an acute problem in continuous state spaces, while distribution-matching approaches often require additional online interaction. We consider imitation in continuous state environments based solely on observed behavior — without transition dynamics, reward structure, or any further interaction. Our approach builds on the Markov balance equation with a conditional kernel density estimation framework, estimating dynamics via conditional kernel density estimators that satisfy the environment's probabilistic balance equations. We establish asymptotic consistency and show consistently superior empirical performance over many state-of-the-art IL algorithms.

OPT for ML · NeurIPS 2024

Policy Optimization for Strictly Batch Imitation Learning

Rishabh Agrawal, Nathan Dahlin, Rahul Jain, Ashutosh Nayyar

Abstract

We address imitation based solely on observed behavior — without transition dynamics, reward structure, or any additional environment interaction. The approach leverages conditional kernel density estimation and performs policy optimization to satisfy the Markov balance equation associated with the environment, working effectively in both discrete and continuous state settings under strictly offline optimization. We establish basic asymptotic consistency and show consistently superior empirical performance over many state-of-the-art IL algorithms.

RL scheduler
IEEE Globecom 2020

A Reinforcement Learning Framework for QoS-Driven Radio Resource Scheduler

Jitender Singh Shekhawat, Rishabh Agrawal, K. Gautam Shenoy, Rajath Shashidhara

Abstract

In cellular systems the MAC scheduler allocates radio resources to meet each flow's quality-of-service requirements while maximizing throughput and fairness — a delicate balance traditionally struck with hand-crafted, carefully tuned metrics that 5G's diverse QoS needs further complicate. We propose a reinforcement learning scheduler that learns an allocation policy to jointly optimize multiple objectives, letting operators express priorities per QoS class. A flexible neural architecture adapts to varying numbers of flows, simplifying training and making it viable for constrained systems. In simulation it outperforms conventional heuristics such as M-LWDF, EXP-RULE, and LOG-RULE and stays robust to changes in radio environment and traffic.

CoPASample
LOD 2019

CoPASample: A Heuristics Based Covariance Preserving Data Augmentation

Rishabh Agrawal, Paridhi Kothari

Abstract

Effective data augmentation generates samples that improve a model's accuracy and robustness. We propose CoPASample, an algorithm that generates samples reflecting the first- and second-order statistics of the dataset, augmenting it while preserving total covariance. To avoid the exponential cost of generating augmentation points, we formulate an optimization problem motivated by the ν-SVR approach to iteratively compute a heuristic-based optimal set of points in polynomial time. Experiments across several datasets, with comparisons to other augmentation methods, validate the approach.

General Type-2 Fuzzy Sets
FUZZ-IEEE 2018

Determining the Optimal Fuzzifier Range for Alpha-Planes of General Type-2 Fuzzy Sets

Shreyas Kulkarni, Rishabh Agrawal, Frank Chung-Hoon Rhee

Abstract

The fuzzifier parameter strongly shapes the final cluster partitions in fuzzy c-means and its interval and general type-2 variants, yet is usually chosen by experience. We adaptively compute suitable fuzzifier ranges for each α-plane of a general type-2 fuzzy set for a given dataset, deriving the footprint of uncertainty per α-plane via histogram-based membership generation and iterating to converged fuzzifier values. Experiments on several datasets validate the method's effectiveness.

Patents 1

Patent figure
US Patent · 2022

Method and System for Radio-Resource Scheduling in Telecommunication Network

Jitender Singh Shekhawat, Rishabh Agrawal, Anshuman Nigam, Konchady Gautam Shenoy, Yash Jain

Abstract

A method for radio-resource scheduling in a telecommunication network that selects an objective from a set of objectives, prioritizes flows accordingly, and feeds state parameters of active bearers into a reinforcement learning network configured with a reward aligned to the selected objective, which then returns a per-bearer radio resource allocation for the current transmission time interval.

Projects 5

Diffusion modality restoration
Multimodal · Diffusion

Restoring Multimodal Missing Modalities with Diffusion Models

Diffusion-based modality restoration conditioned on available modalities for personality-trait prediction.

Abstract

Real-world multimodal datasets are often incomplete, degrading downstream prediction. We condition a trained diffusion model on the available modalities to reconstruct the missing one, extracting emotion-rich features from text, video, and audio and exploring early, late, and transformer-based fusion to integrate reconstructed modalities for personality-trait inference. We evaluate on the ChaLearn First Impressions V2 dataset.

RLHF recommendation
RLHF · LLM

RLHF-based Optimization of Recommendations using a Large Language Model

A recommendation system that aligns an LLM's reward model with user preferences via RLHF and DPO.

Abstract

We study optimizing LLM-based recommenders with two methodologies — Reinforcement Learning from Human Feedback and Direct Preference Optimization — developing separate pipelines to fine-tune LLMs for inferred user preferences. On MovieLens, supervised fine-tuning and preference tuning both improve recommendation quality, highlighting the potential and limits of LLM-based systems for recommendation.

Offline RL · VQ-VAE

Action-Quantized Offline Reinforcement Learning

State-conditioned action quantization with VQ-VAE, jointly trained with offline RL for ~20% gains.

Abstract

Continuous action settings often force approximations that hurt offline RL, while discrete settings allow more precise constraints and regularizers. Using a VQ-VAE, we learn state-conditioned action quantization (SAQ) to avoid the exponential complexity of naive discretization, strengthening methods such as IQL and CQL. Jointly training the VQ-VAE with the offline RL objective yields further gains — roughly 20% improvements on locomotion, adroit, and kitchen tasks.

Earthquake damage prediction
Gradient Boosting · Ensemble

Earthquake Damage Prediction

Feature engineering and ensembling (LightGBM, CatBoost, XGBoost); F1 = 0.7541, 2nd of 50 teams.

Abstract

Using building location and construction data from the 2015 Gorkha earthquake in Nepal, we predict per-building damage levels with feature engineering, gradient-boosting algorithms, and ensemble models, achieving an F1 score of 0.7541 on the test set.

Data augmentation with r-cyclic matrices
Optimization · Augmentation

Application of r-Cyclic Matrices in Data Augmentation

A covariance-preserving augmentation scheme with a polynomial-time heuristic, motivated by ν-SVR.

Abstract

To counter data insufficiency while preserving covariance after augmentation, we design an algorithm that turns the otherwise exponential computation into polynomial time via a heuristic, formulating an optimization problem motivated by ν-SVR. Experiments on several UCI datasets across multiple classifiers demonstrate the effectiveness of the approach.