Rishabh Agrawal

Research Summary

I am passionate about addressing problems that have a direct and meaningful impact on our community. My work has been dedicated to understanding the nature of these challenges and exploring solutions rooted in machine learning, probability, optimization, and simulation.

Interests: Reinforcement Learning, Imitation Learning, Generative AI, Behavior Foundation Models, Large Language Models.

Publications

Markov Balance Satisfaction Improves Performance in Strictly Batch Offline Imitation Learning

Rishabh Agrawal, Nathan Dahlin, Rahul Jain, Ashutosh Nayyar.

AAAI 2025 [link]

Abstract: Imitation learning (IL) is notably effective for robotic tasks where directly programming behaviors or defining optimal control costs is challenging. In this work, we address a scenario where the imitator relies solely on observed behavior and cannot make environmental interactions during learning. It does not have additional supplementary datasets beyond the expert’s dataset nor any information about the transition dynamics. Unlike state-of-the-art (SOTA) IL methods, this approach tackles the limitations of conventional IL by operating in a more constrained and realistic setting. Our method uses the Markov balance equation and introduces a novel conditional density estimation-based imitation learning framework. It employs conditional normalizing flows for transition dynamics estimation and aims at satisfying a balance equation for the environment. Through a series of numerical experiments on Classic Control and MuJoCo environments, we demonstrate consistently superior empirical performance compared to many SOTA IL algorithms.

Conditional Kernel Imitation Learning for Continuous State Environments

Rishabh Agrawal, Nathan Dahlin, Rahul Jain, Ashutosh Nayyar.

L4DC 2025 [link]

Abstract: Imitation Learning (IL) is an important paradigm within the broader reinforcement learning (RL) methodology. Unlike most of RL, it does not assume availability of reward feedback. Reward inference and shaping are known to be difficult and error-prone methods, particularly when the demonstration data comes from human experts. Classical methods such as behavioral cloning and inverse reinforcement learning are highly sensitive to estimation errors, a problem that is particularly acute in continuous state space problems. Meanwhile, state-of-the-art IL algorithms convert behavioral policy learning problems into distribution-matching problems which often require additional online interaction data to be effective. In this paper, we consider the problem of imitation learning in continuous state space environments based solely on observed behavior, without access to transition dynamics information, reward structure, or, most importantly, any additional interactions with the environment. Our approach is based on the Markov balance equation and introduces a novel conditional kernel density estimation-based imitation learning framework. It involves estimating the environment’s transition dynamics using conditional kernel density estimators and seeks to satisfy the probabilistic balance equations for the environment. We establish that our estimators satisfy basic asymptotic consistency requirements. Through a series of numerical experiments on continuous state benchmark environments, we show consistently superior empirical performance over many state-of-the-art IL algorithms.
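For readers unfamiliar with conditional kernel density estimation, the following is a minimal generic sketch, not the estimator from the paper. It assumes Gaussian product kernels and fixed, hand-picked bandwidths; the names `gaussian_kernel`, `conditional_kde`, `h_x`, and `h_y` are illustrative. In the imitation learning setting, `x` would play the role of a state-action pair and `y` the next state.

```python
import numpy as np

def gaussian_kernel(u):
    """Product Gaussian kernel evaluated row-wise on an (n, d) array."""
    d = u.shape[1]
    return np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2.0 * np.pi) ** (d / 2.0)

def conditional_kde(y_query, x_query, y_data, x_data, h_y=0.3, h_x=0.3):
    """Estimate p(y | x) as a kernel-weighted average of kernels in y.

    Bandwidths h_y and h_x are assumed values, not tuned ones.
    """
    # Weight of each sample according to how close its x is to the query x.
    wx = gaussian_kernel((x_data - x_query) / h_x)
    # Kernel density contribution of each sample at the query y.
    ky = gaussian_kernel((y_data - y_query) / h_y) / h_y ** y_data.shape[1]
    return float(np.sum(wx * ky) / np.sum(wx))

# Synthetic data: y is roughly 2 * x plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(500, 1))
y = 2.0 * x + 0.1 * rng.standard_normal((500, 1))

x0 = np.array([0.5])
p_near = conditional_kde(np.array([1.0]), x0, y, x)  # y close to 2 * x0
p_far = conditional_kde(np.array([3.0]), x0, y, x)   # y far from 2 * x0
print(p_near > p_far)
```

The estimate is a weighted average of proper densities in `y`, so it integrates to one in `y` for any query `x` with nonzero total weight.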

Policy Optimization for Strictly Batch Imitation Learning

Rishabh Agrawal, Nathan Dahlin, Rahul Jain, Ashutosh Nayyar.

OPT for ML, NeurIPS 2024 [link]

Abstract: Imitation Learning (IL) offers a compelling framework within the broader context of Reinforcement Learning (RL) by eliminating the need for explicit reward feedback, a common requirement in RL. In this work, we address IL based solely on observed behavior without access to transition dynamics information, reward structure, or, most importantly, any additional interactions with the environment. Our approach leverages conditional kernel density estimation and performs policy optimization to ensure the satisfaction of the Markov balance equation associated with the environment. This method performs effectively in discrete and continuous state environments, providing a novel solution to IL problems under strictly offline optimization settings. We establish that our estimators satisfy basic asymptotic consistency requirements. Through a series of numerical experiments on continuous state benchmark environments, we show consistently superior empirical performance over many state-of-the-art IL algorithms.

A Reinforcement Learning Framework for QoS-Driven Radio Resource Scheduler

Jitender Singh Shekhawat, Rishabh Agrawal, K Gautam Shenoy, Rajath Shashidhara.

Globecom 2020 [link]

Abstract: In cellular communication systems, radio resources are allocated to users by the MAC scheduler, which typically runs at the base station (BS). The task of the scheduler is to meet the quality of service (QoS) requirements of each data flow while maximizing the system throughput and achieving a desired level of fairness amongst users. Traditional schedulers use handcrafted metrics and are meticulously tuned to achieve a delicate balance between multiple, often conflicting objectives. Diverse QoS requirements of 5G networks further complicate traditional schedulers. In this paper, we propose a novel reinforcement learning based scheduler that learns an allocation policy to simultaneously optimize multiple objectives. Our approach allows network operators to customize their requirements by assigning priority values to QoS classes. In addition, we adopt a flexible neural-network architecture that can easily adapt to a varying number of flows, drastically simplifying training and thus rendering it viable for practical implementation in constrained systems. We demonstrate, via simulations, that our algorithm outperforms conventional heuristics such as M-LWDF, EXP-RULE and LOG-RULE and is robust to changes in radio environment and traffic patterns.

CoPASample: A Heuristics Based Covariance Preserving Data Augmentation

Rishabh Agrawal, Paridhi Kothari.

LOD 2019 [link]

Abstract: An efficient data augmentation algorithm generates samples that improve accuracy and robustness of training models. Augmentation with informative samples imparts meaning to the augmented data set. In this paper, we propose CoPASample (Covariance Preserving Algorithm for generating Samples), a data augmentation algorithm that generates samples which reflect the first and second order statistical information of the data set, thereby augmenting the data set in a manner that preserves the total covariance of the data. To address the issue of exponential computations in the generation of points for augmentation, we formulate an optimisation problem motivated by the approach used in ν-SVR to iteratively compute a heuristics-based optimal set of points for augmentation in polynomial time. Experimental results for several data sets and comparisons with other data augmentation algorithms validate the potential of our proposed algorithm.

Determining the Optimal Fuzzifier Range for Alpha-Planes of General Type-2 Fuzzy Sets

Shreyas Kulkarni, Rishabh Agrawal, Frank Chung-Hoon Rhee.

FUZZ-IEEE 2018 [link]

Abstract: Type-2 fuzzy sets (T2 FSs) are capable of handling uncertainty more efficiently than type-1 fuzzy sets (T1 FSs). The fuzzifier parameter plays an important role in the final cluster partitions in fuzzy c-means (FCM), interval type-2 (IT2) FCM, general type-2 (GT2) FCM, and other fuzzy clustering algorithms. In general, fuzzifiers are chosen for a given dataset based on experience. In this paper, we adaptively compute suitable values for the range of the fuzzifier parameter for each α-plane of GT2 FSs for a given data set. The footprint of uncertainty (FOU) for each α-plane is obtained from the given data set using histogram based membership generation. This is iteratively processed to give the converged values of fuzzifier parameters for each α-plane of GT2 FSs. Experimental results for several data sets are given to validate the effectiveness of our proposed method.
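To illustrate why the fuzzifier matters, here is a minimal sketch of the standard type-1 FCM membership update only; it is not the adaptive α-plane method from the paper. The function name `fcm_memberships` and the toy points and centers are illustrative assumptions.

```python
import numpy as np

def fcm_memberships(points, centers, m=2.0):
    """Standard FCM membership update for a fuzzifier m > 1."""
    # dist[i, j] = Euclidean distance from point i to center j
    dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    dist = np.maximum(dist, 1e-12)  # avoid division by zero at a center
    # ratio[i, j, k] = (dist[i, j] / dist[i, k]) ** (2 / (m - 1))
    ratio = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1.0))
    # u[i, j] = 1 / sum_k ratio[i, j, k]; each row sums to 1
    return 1.0 / ratio.sum(axis=2)

# One point at distance 1 from the first center and 3 from the second.
points = np.array([[1.0, 0.0]])
centers = np.array([[0.0, 0.0], [4.0, 0.0]])
for m in (1.5, 2.0, 3.0):
    print(m, fcm_memberships(points, centers, m=m)[0])
```

As the fuzzifier grows, the memberships of the same point move toward an even split across clusters, which is why its choice affects the final partition.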

Patents

Method and system for radio-resource scheduling in telecommunication-network

Jitender Singh Shekhawat, Rishabh Agrawal, Anshuman Nigam, Konchady Gautam Shenoy, Yash Jain

US Patent 2022 [link]

Abstract: The present disclosure provides a method for radio-resource scheduling in a telecommunication network. The method comprises selecting at least one objective associated with a radio-resource scheduling from a plurality of objectives; prioritizing at least one flow from a plurality of flows for the selected at least one objective; identifying at least one state parameter from a plurality of state parameters associated with at least one of an active bearers from a plurality of active bearers; inputting at least one of the plurality of state parameters for the at least one of the active bearers to be scheduled during a current transmission time interval (TTI) to a reinforcement machine learning (ML) network, the reinforcement ML network being configured for a reward in accordance with the selected at least one objective; and receiving, from the reinforcement ML network, a radio resource allocation for each of the active bearers for the current TTI.

Projects

Restoring Multimodal Missing Modalities with Diffusion Models

Proposed a methodology to utilize diffusion models for missing modality generation conditioned on available modalities. Specifically, conducted studies to answer these questions: a) How well do diffusion models restore missing modalities? b) Does reconstructed data improve personality trait predictions? c) How do different missing modality scenarios (e.g., missing speech or vision) affect prediction performance? [code]

Abstract: Multimodal learning has gained significant attention for its applications in critical areas such as computer vision, robotics, healthcare diagnostics, and human-computer interaction, where the ability to synthesize multiple modalities can significantly improve predictive accuracy and system robustness. However, real-world multimodal datasets are often incomplete, leading to degraded performance in predictive tasks. In this work, we propose a diffusion-based modality restoration framework that conditions a trained diffusion model on available modalities to reconstruct the missing one. We extract emotion-rich features from text, video and audio and explore various fusion strategies (early, late, and model-level transformer based fusion) to integrate reconstructed modalities for improved downstream personality trait inference. We evaluate our approach on the ChaLearn First Impressions V2 dataset.

Reinforcement Learning from Human Feedback (RLHF)-based Optimization of Recommendations Using a Large Language Model (LLM)

Proposed a recommendation system that integrates RLHF with an LLM to fine-tune recommendations and align the reward model with user preferences. This approach aims to outperform both standalone LLM systems and traditional algorithms like Collaborative Filtering by ensuring better alignment with user feedback. [code]

Abstract: This study explores the optimization of recommendation systems with Large Language Models (LLMs) using two distinct methodologies: Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). Separate pipelines were developed to fine-tune LLMs for inferred user preferences. Using MovieLens datasets, we show that supervised fine-tuning (SFT) and preference tuning improve LLM-based recommendations. These results highlight the potential and limitations of LLM-based systems for recommendation tasks.

Action-Quantized Offline Reinforcement Learning

Leveraged VQ-VAE (Vector Quantised-Variational AutoEncoder) for state-conditioned action quantization (SAQ), addressing the challenges of approximation in continuous action settings, which typically lead to performance degradation. Extended this approach to enable joint end-to-end learning of quantization and policy, resulting in approximately 20% performance improvements on locomotion, adroit, and kitchen tasks. [code]

Abstract: The field of offline reinforcement learning (RL) offers a versatile framework for transforming fixed behavior datasets into policies that have the potential to surpass the performance of the original data-collecting policy. Despite significant advancements like adding conservatism and policy constraints to address distributional shifts, continuous action settings often pose challenges that necessitate approximations. In contrast, discrete action settings offer more precise or even exact computations for offline RL constraints and regularizers, presenting fewer hurdles. Our project begins with an exploration of an adaptive method for action quantization. Utilizing a VQ-VAE, we learn a state-conditioned action quantization to avoid the exponential complexity inherent in naive action space discretization. Through experimentation, we reproduce the finding that integrating this discretization technique strengthens well-known offline RL approaches such as IQL and CQL on standardized tasks. We subsequently refine this methodology through joint training of the VQ-VAE and offline RL methods, resulting in further performance improvements over the previous methodology.

Earthquake Damage Prediction

Predicted building damage levels using data from the 2015 Gorkha earthquake in Nepal. Employed feature engineering and ensemble modeling with LightGBM, CatBoost, and XGBoost to develop a machine learning model, achieving an F1 score of 0.7541 on test data and securing 2nd place out of 50 teams. [code]

Abstract: Predicting the damage caused by an earthquake is a challenging task. Given an extensive dataset comprised of aspects of building location and construction, we predict the damage level caused to buildings. The data is from the 2015 Gorkha earthquake in Nepal. We make use of feature engineering techniques, gradient boosting algorithms and ensemble models to develop our machine learning model, which achieved an F1 score of 0.7541 on test data.

Application of r-cyclic Matrices in Data Augmentation

Proposed a data augmentation algorithm to counter the problem of data insufficiency with the objective of preserving covariance after augmentation. Provided a heuristic that reduces the exponential time complexity involved to polynomial time. Formulated an optimization problem motivated by the approach used in ν-SVR. Experimented on various UCI data sets using multiple classification algorithms to outline the effectiveness of the proposed work.

Abstract: An efficient data augmentation algorithm generates samples that improve accuracy and robustness of training models. Augmentation with informative samples imparts meaning to the augmented data set. In this paper, we propose CoPASample (Covariance Preserving Algorithm for generating Samples), a data augmentation algorithm that generates samples which reflect the first and second order statistical information of the data set, thereby augmenting the data set in a manner that preserves the total covariance of the data. To address the issue of exponential computations in the generation of points for augmentation, we formulate an optimisation problem motivated by the approach used in ν-SVR to iteratively compute a heuristics-based optimal set of points for augmentation in polynomial time. Experimental results for several data sets and comparisons with other data augmentation algorithms validate the potential of our proposed algorithm.