CS 224R Final Projects

Spring 2026 · 245 student projects

Final project reports from CS 224R Deep Reinforcement Learning.

Type Title Authors Mentor TA
Custom Outstanding Project A Semi-Decentralized Approach to Scalable Multiagent Control Avi Singh, Mahdi Al-Husseini Rahul Chand
Custom Outstanding Project EXPO-FT: Sample-Efficient Reinforcement Learning Finetuning for Vision-Language-Action Models Kuo-Han Hung Perry Dong
Custom Outstanding Project Hybrid Reinforcement Learning for Chip Macro Placement Moritz Schreyoegg, Jack Herrmann, Daniel Hagenlocker Anushree Aggarwal
Default Outstanding Project SFT Augmentation and Replay-Based RL for Countdown Reasoning Aizada Nurdinova, Ellie Sampson, Adhi Daiv Anikait Singh
Custom Honorable Mention Beyond Test Scores: Reward Design for Offline RL in Personal Finance Tutoring Stella Wu, Daniel Mark Argento Alex Nam
Custom Honorable Mention CARVE: Concept Avoidance via Reward-shaped Visual Erasure Febie Jane Lin, Fabio Ibanez, Chris Stanulet Rahul Chand
Custom Honorable Mention π-Drive: Reinforcement Post-Training Turns a Manipulation VLA into a Real-Time Driving Policy Felipe Laufer Barbosa, Mark Music, Alex Jihun Kim Tian Gao
Default Honorable Mention Frontier Curriculum and Adaptive Test-Time Compute for Efficient RLOO Andy Sung Jae Kim, Marco Antonio Vizcarra Tovar Anikait Singh
Custom Honorable Mention MARC: Multi-Agent Role Coordination Olufeolu Oluwapelumi Kolawole, Karn Kaura, Nihar Mudigonda Marcel Torne
Custom Honorable Mention Point and Pick: Bounding-Box Conditioned Diffusion Policies and Offline RL for Target-Specific Robot Manipulation Raul Garreta Tompson, Joshua Alexander Bowden, Swaroop Pal Marcel Torne
Default Honorable Mention Reading vs. Writing a Near-Oracle Internal Verifier: How RL Design Determines Whether a Correctness Probe Is Safe Abraham Yeung, Anagha Ramaswamy Anubha Mahajan
Custom Honorable Mention REFINE: Reinforcement-based EHR Feature Induction and Editing Ayeeshi Lakshmi Poosarla, Ryan Arya Nayebi Anushree Aggarwal
Custom Honorable Mention SciencePRM: Process-Reward RL (GRPO) for the Scientific Validity of Intermediate Reasoning Steps Zijian (Carl) Ma Joy He-Yueya
Custom A Contextual Bandit for Cheap-vs-Expensive Code Generation Mauricio Berlanga Carrillo Rahul Chand
Default A Self-Play Algorithm for Countdown Nikhil Raman, Nadim Isaac Max Du
Custom Action-Token Pruning for Sample-Efficient RL Fine-Tuning of Robot Policies Elvin Fu, Cole Donald Van Hersett Tyler Lum
Default Active RLOO: Online Filtering with Adaptive K Zhengmao Liu Max Du
Default Adaptive Competence-Boundary Curricula for Countdown with RLOO Daniel Steven Schreck, Sheng-Yong Niu, Karishma Aggarwal Max Du
Default Adaptive Curriculum Learning for Reinforcement-Trained Reasoning Agam Iheanyi-Igwe, Olatayo Anthony Sobomehin Riya Karumanchi
Default Adaptive Curriculum Learning for RL-Based Arithmetic Reasoning Yu Chi Hsu, Ryan He Shengqu Cai
Default Adaptive Curriculum Learning for RLOO on Countdown Jerry Jiayu Li Yash Kankariya
Default Adaptive Curriculum Learning with Difficulty-Conditioned Entropy Regularization for RL Fine-Tuning of Large Language Models Mahmoud Elgenedy Shengqu Cai
Default Adaptive Curriculum RL via LLM-Guided Complexity Scoring Louis Weisdorf, Laszlo Bollyky Riya Karumanchi
Default Adaptive Curriculum RLOO for Verifiable Arithmetic Reasoning in Language Models Kelechi Onuoha Abhijnya Bhat
Default Adaptive Difficulty Scheduling: A Competency-Gated Curriculum for Small-LLM Reasoning Donna Choi Ke Wang
Default Adaptive Difficulty-Aware Curriculum Learning for RLOO Darren Chan, Jayna Grace Huang, Sophie Zhang Anikait Singh
Custom Adaptive Multi-Turn Red-Teaming for Mental Health Adjacent Language Model Safety Anya Han Zhang Anushree Aggarwal
Default Adaptive Test-Time Sampling for RL-Fine-Tuned Reasoning Policies Andrea Ji Woo Nam Song Anikait Singh
Default Advantage-Driven Synthetic Curriculum for Reinforcement Learning based Fine-Tuning of Large Language Models Nate Demchak, Pravin Ravishanker, Haosheng Li Anikait Singh
Default Adversarial Co-evolution of a Difficulty Judge and an LLM Generator with an RLOO Policy for Cold-Start Countdown Reasoning Viveak Ravichandiran Riya Karumanchi
Custom Algorithm Sequencing and Curriculum Learning in Deep Reinforcement Learning Ryan D'Cunha, Ethan Hersch, Abhinav Chinta Ifdita Hasan Orney
Custom Alpha-Informed Optimal Trading Execution: Reinforcement Learning with Domain Informed Priors Ryan Nathan Padnis Ethan Hsu
Custom An Actor-Critic Neural Reachability Solver for High-Dimensional Zero-Sum Games Zeyuan Feng Rahul Chand
Custom An RL Framework for Persistent Memory Attacks on LLM Agents Mihir S Menon, Zihan Wang, Aarav Arora Joonwon Kang
Custom Architecture-Dependent Effects of Data Curation in Robot Manipulation Imitation Learning Kailana Baker-Matsuoka Jonathan Yang
Custom Augmenting Cybersecurity Capabilities with Adversarial RL and Autoregressive Action Parameterization in Small LLMs Anna Wu, Ethan Jun-shen Ho Ethan Hsu
Custom Automated Red Teaming using Reinforcement Learning Nubia Elena Correa, Abi Lopez, Annie Gisselle Villalta Joonwon Kang
Custom Bandits for Cold-Start: Exploration vs. Exploitation in Recommendation Ela Naz Sigin, Anastasiya Masalava, Eva Casto Anushree Aggarwal
Custom Begging for DEI: Increasing LLM Math Performance by Increasing Diversity Tyler Kinh Ho Joy He-Yueya
Custom BLIND: Bipedal Locomotion with Intermittent Navigation Data for Environmental Hazards Iris Zixiao Xu, Jamin Jia-Ming Xie, Eric Liang Tian Gao
Custom Body Schema Pretraining for Sample-Efficient Reinforcement Learning in Robotic Control Rishabh Malviya Tian Gao
Default Bounded Proximity Rewards: Accelerating Post-Training Reinforcement Learning for Arithmetic Reasoning without Policy Divergence Ian Yue-Ran Chen Yash Kankariya
Default Bridging the Gap: Optimal Transport-Guided Curriculum Learning Cole A Citrenbaum, Noah Cowan, Aymen Echarghaoui Yash Kankariya
Default Budget-Conditioned Tool Use for Countdown Reasoning Luca De Donno Yash Kankariya
Custom Calibrated Reinforcement Learning for LLM-Guided Perturbation Screen Design Anton Thieme Megha Srivastava
Default Can Teacher-Generated Curriculum Strategies Improve LLM Reasoning? Amrita Malhotra, Giselle Rivera, Zofia Dudek Anikait Singh
Custom City Builder: Does Induced Demand Make Transit Network Design a Reinforcement Learning Problem? Daniel Marcelo Mottesi Joy He-Yueya
Custom CKR: A Novel Baselining Improvement for T2I-Model Reinforcement Learning Using Noisy Rewards Ethan Zhang, Tatiana Zhang, Jenny Wei Tyler Lum
Custom Collapse-Resistant Constitutional AI for Small Language Models through Synthetic Revision Filtering Xue Zhang Alex Nam
Custom Compressed Memory via LoRA Fine-Tuning for Vision-Language-Action Models Nathaniel D Laurent, Matthew Michael Musson Tian Gao
Custom Controlling a Fusion Reactor with Reinforcement Learning Deniz Zeren Yilmaz, Sameer Agrawal, Siddharth M. Bhatia Marcel Torne
Custom Cooperative Fine-Tuning of Pretrained Vision-Language-Action Policies: Centralization, Communication, and Inference Recipes on TwoArmTransport Kyler Shu, Purushotham Mani, John Tucker Joonwon Kang
Default Cost-Aware Tool-Integrated Reasoning for Countdown Arithmetic Patrick Wang Abhijnya Bhat
Default Countdown as an Agentic Optimization Problem Tao Sun Yash Kankariya
Custom Critic-Targeted Exploration: Learned Per-Rollout Targeting of Entropy Bonuses in GRPO Hlumelo Notshe Joy He-Yueya
Custom Curiosity-Driven Memory-Augmented Reinforcement Learning for Adaptive Robot Tasks Jevon Mao, Yifan Geng, Alexia Huang Marcel Torne
Default Curriculum Learning for Countdown Reasoning in RL Fine-Tuning: Static Schedules Help, Adaptive Frontiers Forget Fengzhou Li Yash Kankariya
Default Curriculum Learning with Success-Rate-Driven Adaptive Gaussian Scheduler Andrew Tin-Lok Lee Anubha Mahajan
Default Curriculum Sampling for RLOO Fine-Tuning on Countdown Catherine M Zhang, Nora Menon Anubha Mahajan
Custom Curriculum Strategies for Bimanual Dexterous Piano Playing in RoboPianist Gabrielle Marie Walrath, Irene Ai Lin, Ethan Tamer Farah Megha Srivastava
Custom Decentralized Q-Learning of the Social Optimum in Strategic Experimentation Farzad Pourbabaee Joy He-Yueya
Custom Deep Reinforcement Learning for Campus-Scale Vehicle-to-Grid Fleet Scheduling Lisa Li, Yaqi Fan Joy He-Yueya
Custom Deep Reinforcement Learning for User Welfare Optimization on Recommendation Systems with Competing Content Creators Julia T Isaac, Tanush Talati Alex Nam
Custom Dense Language Rewards For Reasoning Diya Janvi Kejriwal, Andrew Su, Gurmeher Kaur Marcel Torne
Default Dense Rewards for RLVR on Countdown via ASTs Ahmed Sherif Ahmed Elbakry Mohamed Ke Wang
Custom Dense Step-Level Rewards via Process Reward Models for Mathematical Reasoning Dhruv Arcot Joy He-Yueya
Custom Deployment-Time Risk Scoring for a Safe RL Agent under Distribution Shift Luis Marc Botin-Sanz de Sautuola Anushree Aggarwal
Custom DiSCo: Distilled Steering via Consolidation for Robot Diffusion Policies William Z Liu, Sakthivel Sivaraman, Jerry Gu Max Du
Default Diversity-Aware RLOO for Pass@k Reasoning in Countdown Jane Yang, Will Nathaniel Hansen Anubha Mahajan
Default Diversity-Preserving On-Policy Distillation and Test-Time Verification for Countdown Reasoning Juntao Cheng, Qi Wu, Zhuoang Tao Riya Karumanchi
Custom Do generated futures help robot policies through representation alignment or explicit conditioning? Minyeong Kim Jonathan Yang
Custom Do Persona Vectors Track Sycophancy Under RL Fine-Tuning? Artur Barbosa Carneiro Abhijnya Bhat
Custom Does LoRA Rank Requirement Scale with Reward Density? An Empirical Study of Policy-Gradient Post-Training Mark William Gernitis Tyler Lum
Custom Effect of Stigmergic Potential Fields on Drone Swarm Coordination Kevin Michael Porter Tyler Lum
Custom Emergent Cooperation in Multi-Agent Pandemic Resource Allocation Riya Karthik Narain, Brooke Hunter Ballhaus Alex Nam
Custom Encouraging Taxonomy-Based Diversity within RL Automated Multi-Turn Red-Teaming Melvin Liam Poon Keat, Kenna Zeng Alex Nam
Default Enhancing RLOO with Dense Symbolic Rewards and Bandit-Driven Curricula Irawadee Thawornbut, Rinnara Sangpisit Abhijnya Bhat
Custom Enhancing SINGER: Onboard Visual Drone Navigation through Iterative Imitation Learning and Reinforcement Fine-Tuning Erwin Giovanbattista Marcel R. Poussi Rahul Chand
Default Evaluating RL efficiency improvement methods including Synthetic Data Augmentation and Any-Generation Reward Optimization for Mathematical Reasoning on Countdown Tasks Lin Yuan, Chloe Yuri Jeon, Kelvin Kahiu Waititu Anubha Mahajan
Custom Exploratory Coding with Poly-EPO Eli Thomas Wandless Ifdita Hasan Orney
Custom Failing to Grasp the Point: Hierarchical Reinforcement Learning for Grasping Tasks Ashwin Mahendran, Anjali Sreenivas, Arianna Liang Cao Tian Gao
Custom Fault-Adaptive Locomotion via Implicit Damage Detection in MuJoCo Ant Sidhanth Mishra Tian Gao
Default Feature-Space Curriculum Learning for Countdown Reasoning Vishaal Samir Saraiya Yash Kankariya
Custom FlowMPC: Improving Flow Matching policies with World Models Chandon Robert Hamel Joonwon Kang
Custom Forced Grounding: Diagnosing and Repairing Language Neglect in Imitation-Learned Robot Policies via RL Elana N Chen, Hayden Kwan, William Charles Rose Tyler Lum
Custom Foveated Vision via Prediction Error–Augmented Reinforcement Learning Brion Qi Ye Joonwon Kang
Custom Fragment Assembly as Goal-Reaching: An RL Approach to Targeted Molecular Design Asmani Yamin, Megan Santhumayor, Katie Liu Perry Dong
Custom Frequency Reparameterization and Anchored Residuals for Flow Matching Robot Policies Alan Zhao, Juhyun Jung Joonwon Kang
Custom From Contact to Return: Curriculum and Predictive Shaping for Humanoid Table Tennis Kyle Ian Schmoyer, Hannah Gabriella Clay, Shane Robinson Mion Jonathan Yang
Default From Structured Backtracking to Targeted Failure Correction for Robust Mathematical Reasoning Christine Li, Eric Xia Riya Karumanchi
Custom From Text to Torque: Improving RL Tracking of Text-Generated Humanoid Motions Kuzey Kantarcioglu, Benji Warburton Tyler Lum
Default Frontier-Weighted SEC: A Case Study in Curriculum Learning for RL Fine-Tuning of Language Models Garrett Alarcon Riya Karumanchi
Custom Generative and Hierarchical Imitation Learning for Marine Trajectory Control in Stochastic Ocean Currents Omar Eduardo Jimenez Lopez Jonathan Yang
Custom Goal Conditioned Behavior Cloning for Robot Social Navigation Mete Gumusayak Tyler Lum
Custom GoodLiars: A Multi-Turn Extension of Reinforcement Learning-Based Belief Disruption Emma Marie Beharry, Abel Philip John, Elizabeth Michelle Gallagher Alex Nam
Default Granularity-Aware Off-Policy Fine-Tuning of LLMs via Expert Demonstrations Kevin Song, Sukeerth Ramkumar, Yina Jian Shengqu Cai
Custom Graphs and Meta Reinforcement Learning for Portfolio Management Dhruv Manani, Churan He Joy He-Yueya
Custom GRPO Didn’t Pass the Bar (But the Harness Did): Harness-Guided Post-Training for Legal Agents Leon Reilly, Duy Nguyen Joy He-Yueya
Custom Hierarchical RL for Cost Aware Protein Engineering Campaigns in Cloud Labs Emi Maria Mathew Rahul Chand
Custom How Does Scaffolding Affect Cross-Environment Generalization in LLM RL Fine-Tuning? Alfred Sven Bertil Sjoeqivst, Hana Mengyao Liu Alex Nam
Default How Much Stale Rollout Reuse Can Verifier-Based RLOO Tolerate? A Semi-Off-Policy Replay Study on Countdown Reasoning Amulya Parthasarathy Max Du
Custom Hybrid Advantage Shaping with Goal-Aware Attention for Per-Turn Credit Assignment in LLM-Agent Reinforcement Learning Max Luis Rodriguez, Samantha Malowane Leventis, Joseph Li Perry Dong
Custom Improving Fine-Grained Manipulation by World Action Models via Online World Model-Based Feedback Planning Albert Kui Lin Tyler Lum
Custom Improving Lean Premise Retrieval via RL and Distillation Alex Lopez, Kevin Rizk, Fred Rajasekaran Tyler Lum
Default Improving Mathematical Reasoning in Small Language Models via Curriculum Learning and Iterative Execution Feedback Xinyi Ai, Jiayu Sui Yash Kankariya
Default Improving Tool Internalization for Small Models: Annealed Tool Access for RLOO on Countdown Maty Bohacek, Jason Boxi Zhang Ke Wang
Custom Intelligence per Joule as a First-Class Post-Training Objective Cynthia Wang, Ravenor Carroll Davion Rahul Chand
Custom Interoception: Teaching LLMs to Reason on a Wallclock Budget Harshvardhan Singh Joy He-Yueya
Default IPO + RLOO Alignment Report Donnie Brooks Raymond, Matthew Darshan Torre Anubha Mahajan
Custom Is Exploration Helpful? Evaluating Transfer of Open-Ended Traces to Assignment-Based Code Edits Karthik Vinay Seetharaman, Tushar Dalmia, Aaryan Shah Megha Srivastava
Default Joint Optimization of Task Difficulty and Diversity for Fine-Tuning LLMs under Sparse Rewards Jiaming Shen, Jiaxin Fang, Jenny Jin Abhijnya Bhat
Default Learnable Curricula via Self-Play for Verifiable Reasoning Tasks Kai Wen Abhijnya Bhat
Custom Learned Forgetting: Task-Conditioned Visual Memory Selection via Reinforcement Learning Han Shaun Lee Marcel Torne
Custom Learning Adaptive Tutor Policies for Conversational Language Learning via Offline Reinforcement Learning Aditya Bora Anubha Mahajan
Custom Learning from Failure: Natural Language Feedback for Reusing Failed GRPO Trajectories Chenyue Li Ethan Hsu
Custom Learning from Heterogeneous Data Sydney Yan, Tracy Y Wei, Evy Zhu Shen Tyler Lum
Custom Learning Interpretable Code Explanations of LLM Behavior Joseph Tey, Nick Jiang Abhijnya Bhat
Custom Learning Latent Action World Models for Robot Control from Unlabeled Video Seraph Kai Yang Anubha Mahajan
Custom Learning Priority Functions for Graph-Based Exploration in ARC-AGI-3 Kyle Avery Feinstein, Steve Roy Mendeleev, Ryan Bookman Ethan Hsu
Custom Learning Structured Trust Policies from Uncertainty, Advisor Signals, and Agreement Jaden Chen, Gia Grace Ancone Ifdita Hasan Orney
Custom Learning to Explore Through Information-Directed Bayesian Optimal Experimental Design Lucia Zheng Ethan Hsu
Default Learning to Teach for Test-Time Reasoning Selim Emir Can, Mete Erdogan Ke Wang
Default Learning to Teach, Teaching to Learn Isaac Wooman Park, Sam Lustgarten Anikait Singh
Default Learning to Use Tools: Reinforcement Learning for Tool-Integrated Mathematical Reasoning Zi Wang, Minghui Xu Ke Wang
Custom Learning Transfer in Multitask Agents Malti Mohan John Joonwon Kang
Custom Learning When to Use Tools: Cost-Aware RL for Agentic Reasoning Vikram Garrett Srinivasan Anushree Aggarwal
Default Learning with a Curriculum: Enhancing LLM Math Reasoning via Hint-Based RL Fine-Tuning Luan Lam Anubha Mahajan
Custom Librarian Models: Anticipatory Filesystem Construction via Reinforcement Learning Teresa Zhang Ifdita Hasan Orney
Default Making RLOO Learn From Better Signals Curriculum Scheduling and Verification Commit Contrast for Countdown Reasoning Zayn Malhotra, Ziyi Ding Anikait Singh
Default MaxRL with Imperfect Reward Signals on Countdown Petru Cristian Budianu, Nicolas Bejar Arambula Abhijnya Bhat
Custom Meaningless Trivia, Meaningful Compression: Near-Free Token Efficiency in RL for Reasoning Chung Fat Wong, Anna Grebenchtchikova Joonwon Kang
Default Memory as an Action Space: Adaptive Retrieval in Small Language Models for Medical Reasoning Renn Su, Summer Olivia Royal Riya Karumanchi
Custom Memory Pretraining for Vision-Language-Action Model Pengyu Mo, Haowen Wang, Zhen Jia Marcel Torne
Custom Memory-Augmented VLM Planners for Long-Horizon VLA Control via RL Krish Sharma, Lucas Burgett Marcel Torne
Custom MiniHedgemony: Asymmetric Reward Structures in Self-Play Wargames Alex Lin Wang, Dario Gaitzi Soatto Tian Gao
Custom MIRAGE: Model Imagined Reachability for Augmented Graph Expansion Abhinav Sattiraju, Samrat Sahoo Ifdita Hasan Orney
Custom Model-Based Reinforcement Learning for Particle Accelerator Control Ryan Wu Anushree Aggarwal
Custom Multi-Move Refinement Reinforcement Learning with D4 Spatial Equivariance for 3D Chip Placement Yize Liu Perry Dong
Default Multi-Stage Curriculum GRPO for Countdown Timothy Yu, Yash Ranjith Ke Wang
Custom Multi-Timescale Language Memory for a Frozen VLA Controller Po-Yun Cheng, Wayne Chu Marcel Torne
Default Multimodal LLM Self-Play Deepika Dandeboyina Shengqu Cai
Custom Multimodal RLPD for Industrial Robotic Cable Insertion Jehan Shah Rahul Chand
Custom Off Policy or On Policy? Multi-Agent Reinforcement Learning for Drone Swarm Coordination Jett Crist Carruth, Alex Tadken Shaffer Anushree Aggarwal
Default Off-Policy Sampling for RLOO: When Does Reusing Rollouts Help? Kennaissa Kebeto Nabi, Henok Mikael Tewolde Ke Wang
Custom Offline Model-Based Reinforcement Learning for Energy-Efficient GPU Data-Center Cooling Naomie Sandra Chien Perry Dong
Custom Offline RL for Adaptive Vocabulary Selection in Conversational Language Tutoring Onyinyechi Nichole Okoye Ifdita Hasan Orney
Custom Parallel Deep-RL Agents for Roblox Obstacle-Course Navigation: From Single-Course Memorization to Generalizing Across Procedurally-Composed Courses Alex Li, Cheney Sang, Aidan Whitedeer Joonwon Kang
Custom Personalizing Slide Layouts: A Case Study in RL Reward and Context Bottlenecks Elijah Song, Ryan Minh-Tri Le Anushree Aggarwal
Custom PFP: A Perception-Factored Policy for Robust and Efficient SO-101 Manipulation Shobhit Agarwal, Amirreza Zeinali Rahul Chand
Custom Physics-Blind Reward Hacking: Exposing and Mitigating Safety Failures in LLM-Generated Reward Functions for Robotic Manipulation Zichen Yuan, Sophia Huang Tian Gao
Custom PipelineRL: Limits of Asynchronous Reinforcement Learning for Long-Horizon Trajectories Henry Bosch, Shurui Liu Jonathan Yang
Custom PlayGrader: Coaching the Coaches with Deep RL JP Paul McAnally Jonathan Yang
Custom Pluralistic Alignment via Self-Distillation from Synthetic User Feedback Minsik Oh Rahul Chand
Custom Pose Under Pressure: Robustness of Pose-Derived Dense Rewards in Demonstration-Guided Reinforcement Learning Joseph Dehoney Ethan Hsu
Custom Preclinical HIV Drug Candidate Discovery with Reinforcement Learning Arda Dastan, Kevin Chen, Elijah Alexander Schacter Jonathan Yang
Default Preference Optimization and Curriculum RLOO for Countdown Reasoning Akshar Sarvesh Yash Kankariya
Custom Prisoner’s Lemma: Exploitability-Aware Reinforcement Learning for Online Strategic Adaptation Maanit Goel Marcel Torne
Default Process Rewards and Tool Use: Two Extensions to RLOO Fine-Tuning for Math Reasoning Michael Yang, Du Li Max Du
Default Process-Level Alignment and Value-Guided Stepwise Planning in Countdown Math Reasoning Dongyu Jia Ke Wang
Default Programmatic Reasoning for Countdown: Learning to Generate Executable Python-Style Verifications Henry Jingsong Zhou, Oleh Ivankiv Anubha Mahajan
Default Progress-Aware Prompt Sampling for Verifier-Based RL Fine-Tuning Jason Yan Yash Kankariya
Default Progressive Rationality: Enhancing LLM Mathematical Reasoning via Numerical-Target-Curriculum-SFT in the Countdown Task Meng-Chin Wang Anubha Mahajan
Default Prompt Distribution Design for RLOO Countdown Reasoning Yikai Cao, Zhibo Dai Anubha Mahajan
Custom RadOncReason: Reinforcement Learning with Verifiable Guideline Rewards for Clinical Reasoning in Radiation Oncology Hailemariam Teshome Jonathan Yang
Custom RECAP-Ψ: Advantage-Conditioned Fine-Tuning for Open-Source Humanoid VLAs Aaditya Shah, Karthik Pythireddi, Jonathan Manfu Lu Joonwon Kang
Custom Reducing Citation Hallucinations in Large Language Models Rushank Goyal Ifdita Hasan Orney
Custom Reducing Scalar Rewards to Binary Success: General Off-Policy Learning with Success Functions Armaan Alan Abraham Jonathan Yang
Custom Refine and Compose: Mahalanobis Action Barriers over Demo-free Contrastive RL Primitives Dylan Zhou, Anuva Banwasi, Jiaye Zou Tyler Lum
Custom Reinforcement Learning for Clinical Site-of-Care Triage in a Sepsis Simulator Yun Dong, Saimai Lau, Liane Ozoemelam Anushree Aggarwal
Custom Reinforcement Learning for Compiled-CNOT-Efficient VQE Circuits John William Carlson, Josh Joseph Joonwon Kang
Custom Reinforcement Learning for Dynamic Beam Steering in Plasma Metamaterials Susan Zhang Megha Srivastava
Custom Reinforcement Learning for Figgie: Learning Negotiation as a First-Class Skill Daniel Li Yang Ethan Hsu
Custom Reinforcement Learning for Fog of War Chess with Action Space Pruning Sandeep Sethuraman, Kuba Hashemian, Leon Junliang Liu Anushree Aggarwal
Custom Reinforcement Learning for Geothermal Drilling Optimization Devan Shaan Agrawal Perry Dong
Custom Reinforcement Learning for Mental Health Interventions Using Unlabeled Smartphone Data Elisabeth A Holm, Juan Pablo Gonzalez Pacheco, Alfred Yu Alex Nam
Custom Reinforcement Learning for Noise-Aware Quantum Circuit Compilation Vinav Shah, Abhishek A Shah Rahul Chand
Default Reinforcement Learning for Self-Guided Context Compression in Mathematical Reasoning Jerry Wang, Ryan Wang Ke Wang
Custom Reinforcement Learning for Terminal-Area Air Traffic Control Jerry Yin Anushree Aggarwal
Default Replay-Augmented RLOO: Restoring Within-Group Reward Variance in Sparse-Reward Policy Gradients Hao Xu Shengqu Cai
Custom Replay-Aware Curriculum Learning for RoboPianist Shekhar Sharma Megha Srivastava
Custom Reproducible Top-K PPO for S&P 500 Portfolio Selection: Risk-Adjusted Gains, Seed Variance, and the Limits of Return-Maximizing RL Andres Felipe Restrepo Rahul Chand
Custom Residual Reinforcement Learning for Robotic Manipulation of Wire-like Objects Bautista Guerra, Andrew Yuxuan Liang, Alexander Tarvo Joonwon Kang
Custom Reversing Emergent Misalignment Using Simple Self-Distillation Justin Yizhou Huang, Adam Joseph Banks Rahul Chand
Custom Reward Density and Process-Reward Hacking in a Code-Repair MDP: A Controlled Study of PPO Meta-Control over a Frozen Code LLM Jack Frederick Lofwall, Luca Thomas Wheeler Yash Kankariya
Custom Reward Design for Reinforcement-Learning Fine-Tuning of Navigation Policies Yunshan Wang Ethan Hsu
Custom Reward-Model-Calibrated Reinforcement Learning from Verifiable Rewards for Machine Learning Engineering Siddharth Sachdeva Joy He-Yueya
Custom RL & Domain Randomization for Volt-VAR Control in Electricity Distribution Grids Anish Chaudhuri, Aniket Mahajan Alex Nam
Default RL Fine-Tuning for Countdown Reasoning with Test-Time Verification and Curriculum Learning Howard Xiao, Weiwei Wu Abhijnya Bhat
Custom RL for Adaptive Tutoring: When Should a Tutor Intervene? Hoang D Nguyen, Peter Martin Alisky, Zhenghui Chen Anushree Aggarwal
Custom RL-Powered Hint Generation for Adaptive Math Tutoring: A Simulated Student Evaluation of RLOO and DPO Policies Prabu Ganesh Ravindren Rahul Chand
Default Sample Wide, Pick Smart: For a Fixed 0.5B Countdown Reasoner, Test-Time Selection Beats More Training Manat Kaur, Felipe Leite Teixeira Riya Karumanchi
Custom Sample-Efficient Atari RL with Self-Supervised Pretrained Visual Encoders Mark Mutugi Athiri Joonwon Kang
Default Sample-Efficient RLOO via Off-Policy Rollout Reuse Zhaoyang Li Abhijnya Bhat
Custom Scale-Based Curriculum Pretraining for Robotic Piano Performance Anna Luna Fisher Lopez, Justin Choo, Eric Martz Megha Srivastava
Default Scaling Test-Time Computation for Countdown Reasoning through Verifier-Guided Resampling Jiaxuan Sun, Angel Zhang Ke Wang
Default Scaling Test-Time Compute via Generative Verification in Constrained Parameter Regimes Mohammad Rehan Ghori Anubha Mahajan
Custom SEACTS: Sequential Evidence Acquisition for Cancer Target Selection Nathan Zhou, Ria Garg Megha Srivastava
Custom Seeking Disagreement: Online Credit Assignment with Delayed and Pseudo-Aggregated Rewards Haozhan Gao Perry Dong
Default Selective Entropy Shaping For RLOO: When Importance Weight Gating Hurts Exploration Syed Ashal Ali Shengqu Cai
Custom Self-improving Vision-Language Models: Reinforcement Learning over Visual Abstractions Khai Loong Aw, Zhang Bai-han Tyler Lum
Custom Self-Supervised Data Quality Scoring for Offline RL in Driving Roman Gasiorowski Ke Wang
Custom Sequential Injection Control for Optimal Stimulated Geologic Hydrogen Production through Deep Reinforcement Learning Spencer Zhang Riya Karumanchi
Custom Sequential Outfit Curation with Multi-Dimensional Aesthetic Rewards Nicole Cortes, Esidore Fajardo Eneinyang, Chloe Di Murdoch Anushree Aggarwal
Default SFT-Estimated Curriculum Learning for Rule-Based RLOO Fine-Tuning Vanessa Felix Anikait Singh
Custom SHIELD: Failure-Aware Policy Shielding for Frozen Vision-Language-Action Policies Tianhui Huang, Jacob Lokheen Lee, Daniel Contreras-Esquivel Ke Wang
Custom SimToolReal-RGB: Visuomotor Diffusion Policies for Dexterous Manipulation Cayden Gu, Karen Vo, Christine Zhang Tyler Lum
Custom SketchRL: Finetuning Generative Sketch Models with Visual Rewards Mallika Parulekar, Tia S Geri, Hannah Rachel Levin Ifdita Hasan Orney
Default Small Generative Verifiers for Inference-Time Scaling Across RL Training Frameworks Joshua Delgadillo, Jui Khankari Abhijnya Bhat
Custom SODA: Supervised Option Discovery for Dynamic Action Chunking Umar Padela, Neetish Sharma Tyler Lum
Custom SpecGen: RL-Driven Compiler Verification Monami Dutta Gupta Jonathan Yang
Default Static vs. Adaptive Curriculum Learning for RLOO Fine-Tuning of Language Models Norah Asemota Ke Wang
Custom Stay In Your Lane! Hierarchical RL for a Modified Asteroids Game Samantha Estrada Rahul Chand
Default Strategy-Diverse Synthetic Warm-Starts for RL Fine-Tuning on Countdown Shreyas C S, Anushka Rawat Max Du
Custom Strong Sub-Agents in a Monitored “Private” Channel Under a Weak RL Supervisor Andrew Samuel Park Perry Dong
Custom Structural Diversity Rewards for Verifiable Graph-Structured Reasoning Rutanshu Jhaveri Ifdita Hasan Orney
Structured Action-Effect Observables for Residual RL under Hidden Actuator Drift Sarvesh R. Babu Abhijnya Bhat
Default Structured Q&A Reasoning for Language Models Anuj Jamwal, Srinidhi Bhat Ke Wang
Default Structured Search Traces for Process-Aware Countdown Training Parth Sheth, Yucheng Huang Max Du
Custom SUBTITLE-DPO: Verifiable-Reward Preference Optimization to Suppress Spurious Burned-in Subtitles in Audio-Visual Video Diffusion Yubo Ruan Perry Dong
Custom Survival Instinct: One-Staged RL for Quadruped Parkour Self-Play Edward Neo Lee, Shatong Zhu, Haoyue Xiao Tian Gao
Custom System-Aware Reward Shaping for the Pythia RL Prefetcher Esmee Cowing, Milly Wong, Tesvara Suliani Jiang Perry Dong
Default Systems-Aware Off-Policy RLOO: Amortizing Sampling Cost via K-Reuse Abhishek Bharani Abhijnya Bhat
Default Targeted Counterfactual Branching for Tool-Invocation Decisions in RL Fine-Tuning Andres Ernesto Garcia, Prakash Koukuntla, Joshua Hsieh Riya Karumanchi
Custom Teacher-Contrastive On-Policy Distillation Juntong Shi Perry Dong
Default Teaching Small Models to Teach Rosemary Mingrui Jiang Ke Wang
Default Test-Time Best-of-K Selection with Generative Verification for Countdown Reasoning Jingshu Liu Ke Wang
Default Test-Time Inference Scaling for RL Fine-Tuned Language Models Ava Kouhana, Julian Rodriguez Cardenas, Leah Balakrishnan Anubha Mahajan
Default Test-Time Selection and Curriculum Learning for RL Fine-Tuned Language Models on Countdown Reasoning Rydham Goyal, Rakshit Kaushik Anikait Singh
Custom The Affect of Opponent Pool Size on the Policy Stability of Compute Constrained Self-play Tasks Jonathan Andrew Lutch Tian Gao
Custom The Long Game: A Long-Horizon RL Study of Fairness in Financial Lending Brydie Sigg, Naomi Y Boneh, Christelle Chantal Millos-Lopez Riya Karumanchi
Default Tool-Integrated Reasoning for Countdown Mahmood Ishaq Alhusseini, Frank D'Agostino, Sebastian Beckett Fisher Anikait Singh
Default Tool-Integrated RLOO and Pass@K: Testing the Invisible Leash Isaiah Flores, Mia Xiao, Katherine Wang Xu Anubha Mahajan
Custom Topology-Aware Reinforcement Learning for Text-to-CAD Code Generation Gaurav Tyagi Ke Wang
Default Towards Advantage Shaping for Multi-tool Reasoning: A Preliminary Empirical Study Ricardo Alberto Carrillo Romero Anubha Mahajan
Custom Training Robot Policies with a Foundation Model Teacher Steven Feng, April Yang, Daniel Zou Tyler Lum
Default Using, Learning, and Removing Tools: A Study of Tool-Integrated Reasoning for Countdown Diego Sierra, Brianna Xie Max Du
Default Verifier-Based Reinforcement Learning for Countdown with Curriculum Training Yuyan Wu Riya Karumanchi
Default Verifier-Guided Recombination Search for Token-Efficient Test-Time Compute in Countdown Shyam Sai Bethina, Sahil Koita, Ananya Ganapathi Anubha Mahajan
Default When Can a Model Write Its Own Curriculum? A Diagnostic Study of Joint Conjecturer/Prover RLOO for Countdown Finn Staeblein, Nicholas Simon Allen Riya Karumanchi
Default When Does Curriculum Learning Help? A Knowledge-Aware Learning-Progress Curriculum for RLVR Louis de Germay de Cirfontaine, Arthur Gontier Riya Karumanchi
Custom When Does Execution Feedback Transfer? Minimal-Sufficient Feedback for Internalized RLVR Yucheng Yao Alex Nam
Custom When Does Privileged Self-Distillation Help GUI Grounding? A Teacher-Gap Analysis of Visual SDPO Ethan Charles Morgan Ethan Hsu
Custom When Does Specialist RL Help a Small-Model Multi-Agent Pipeline? A Cross-Benchmark Study on Long-Document QA Mert Karabiyik Alex Nam
Default Where RLOO Stops on Countdown: A Capability Ceiling at the Additive ↔ Multiplicative Boundary Aarohi Gupta Abhijnya Bhat
Default Zero Sum Game Framework for Group Advantage Simplification Joshua A Slagle Yash Kankariya


    © Chelsea Finn 2026