Tag: policy

Decisioning at the Edge: Policy Matching at Scale

Policy-to-Agency Optimization with PuLP. The post first appeared on Towards Data Science. Erika Gomes-Gonçalves.

February 25, 2026
Conformal Prediction Beyond the Horizon: Distribution-Free Inference for Policy Evaluation

arXiv:2510.26026v1. Abstract: Reliable uncertainty quantification is crucial for reinforcement learning (RL) in high-stakes settings. We propose a unified conformal prediction framework for infinite-horizon policy evaluation that constructs distribution-free prediction intervals for returns in both on-policy and off-policy settings. Our method integrates distributional…

October 31, 2025
Convergence of off-policy TD(0) with linear function approximation for reversible Markov chains

arXiv:2510.25514v1. Abstract: We study the convergence of off-policy TD(0) with linear function approximation when used to approximate the expected discounted reward in a Markov chain. It is well known that the combination of off-policy learning and function approximation can lead…

October 30, 2025
Beating the Winner’s Curse via Inference-Aware Policy Optimization

arXiv:2510.18161v1. Abstract: There has been a surge of recent interest in automatically learning policies to target treatment decisions based on rich individual covariates. A common approach is to train a machine learning model to predict counterfactual outcomes, and then select the policy that optimizes…

October 22, 2025
Stochastic Path Planning in Correlated Obstacle Fields

arXiv:2509.19559v1. Abstract: We introduce the Stochastic Correlated Obstacle Scene (SCOS) problem, a navigation setting with spatially correlated obstacles of uncertain blockage status, realistically constrained sensors that provide noisy readings, and costly disambiguation. Modeling the spatial correlation with a Gaussian Random Field (GRF), we develop Bayesian belief…

September 25, 2025
PAC Off-Policy Prediction of Contextual Bandits

arXiv:2507.16236v1. Abstract: This paper investigates off-policy evaluation in contextual bandits, aiming to quantify the performance of a target policy using data collected under a different and potentially unknown behavior policy. Recently, methods based on conformal prediction have been developed to construct reliable prediction intervals that guarantee…

July 23, 2025
Step-DAD: Semi-Amortized Policy-Based Bayesian Experimental Design

arXiv:2507.14057v1. Abstract: We develop a semi-amortized, policy-based approach to Bayesian experimental design (BED) called Stepwise Deep Adaptive Design (Step-DAD). Like existing fully amortized, policy-based BED approaches, Step-DAD trains a design policy upfront before the experiment. However, rather than keeping this policy fixed, Step-DAD periodically updates it…

July 21, 2025
Best-of-N through the Smoothing Lens: KL Divergence and Regret Analysis

arXiv:2507.05913v1. Abstract: A simple yet effective method for inference-time alignment of generative models is Best-of-$N$ (BoN), where $N$ outcomes are sampled from a reference policy, evaluated using a proxy reward model, and the highest-scoring one is selected. While prior work argues that…

July 9, 2025
POLAR: A Pessimistic Model-based Policy Learning Algorithm for Dynamic Treatment Regimes

arXiv:2506.20406v1. Abstract: Dynamic treatment regimes (DTRs) provide a principled framework for optimizing sequential decision-making in domains where decisions must adapt over time in response to individual trajectories, such as healthcare, education, and digital interventions. However, existing statistical methods often rely on…

June 26, 2025
DOLCE: Decomposing Off-Policy Evaluation/Learning into Lagged and Current Effects

arXiv:2505.00961v1. Abstract: Off-policy evaluation (OPE) and off-policy learning (OPL) for contextual bandit policies leverage historical data to evaluate and optimize a target policy. Most existing OPE/OPL methods, based on importance weighting or imputation, assume common support between the target and logging policies. When this assumption…

May 5, 2025
Reinforcement Learning with Continuous Actions Under Unmeasured Confounding

arXiv:2505.00304v1. Abstract: This paper addresses the challenge of offline policy learning in reinforcement learning with continuous action spaces when unmeasured confounders are present. While most existing research focuses on policy evaluation within partially observable Markov decision processes (POMDPs) and assumes discrete action spaces, we…

May 2, 2025
SNPL: Simultaneous Policy Learning and Evaluation for Safe Multi-Objective Policy Improvement

arXiv:2503.12760v1. Abstract: To design effective digital interventions, experimenters face the challenge of learning decision policies that balance multiple objectives using offline data. Often, they aim to develop policies that maximize goal outcomes, while ensuring there are no undesirable changes in guardrail…

March 18, 2025
Off-Policy Evaluation for Recommendations with Missing-Not-At-Random Rewards

arXiv:2502.08993v1. Abstract: Unbiased recommender learning (URL) and off-policy evaluation/learning (OPE/L) techniques are effective in addressing the data bias caused by display position and logging policies, thereby consistently improving the performance of recommendations. However, when both biases exist in the logged data, these estimators may suffer…

February 14, 2025
Navigating Soft Actor-Critic Reinforcement Learning

Understanding the theory and implementation of SAC RL in the context of Bioengineering. The research domain of Reinforcement Learning (RL) has evolved greatly over the past years. The use of deep reinforcement learning methods such as Proximal Policy Optimisation (PPO) (Schulman, 2017)…

December 19, 2024