No Regret Reinforcement Learning Algorithms for Online Scheduling with Multi-Stage Tasks
Yongxin Xu, Hengquan Guo, Ziyu Shao, Xin Liu

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence

We study online task scheduling problems where tasks arrive sequentially and are processed by the platform or server. The service process for each task is multi-stage and is modeled as an episodic Markov Decision Process (MDP). While processing a task, the system acquires rewards by consuming resources. The goal of the platform is to maximize the reward-to-cost ratio over a sequence of K tasks. Online scheduling with multi-stage tasks faces two major challenges: intra-dependence among the different stages within a task and inter-dependence among different tasks. These challenges are further exacerbated by the unknown rewards, costs, and task arrival distribution. To address these challenges, we propose the Robbins-Monro-based Value Iteration for Ratio Maximization (RM^2VI) algorithm. Specifically, RM^2VI addresses ``intra-dependence'' through optimistic value iteration and handles ``inter-dependence'' using the Robbins-Monro method. The algorithm has a greedy structure and achieves a sublinear regret of O(K^(3/4)), establishing the per-task no-regret property. We test RM^2VI in two synthetic experiments: sales promotion in e-commerce and machine learning job training in cloud computing. The results show that RM^2VI achieves the best reward-to-cost ratio among all baselines.
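To illustrate the Robbins-Monro component mentioned above, here is a minimal sketch (not the paper's full RM^2VI algorithm) of a stochastic-approximation update for the reward-to-cost ratio: the estimate rho is driven toward the root of E[R - rho * C] = 0, i.e., rho = E[R]/E[C]. The step-size schedule and sample distributions are illustrative assumptions.

```python
import random

def robbins_monro_ratio(samples, step=lambda k: 1.0 / (k + 1)):
    """Robbins-Monro update rho_{k+1} = rho_k + eta_k * (R_k - rho_k * C_k).

    Converges to the ratio E[R]/E[C], the root of E[R - rho * C] = 0.
    `samples` is an iterable of (reward, cost) pairs, one per task.
    """
    rho = 0.0
    for k, (reward, cost) in enumerate(samples):
        rho += step(k) * (reward - rho * cost)
    return rho

if __name__ == "__main__":
    random.seed(0)
    # Hypothetical task outcomes: rewards ~ U[0, 2] and costs ~ U[0.5, 1.5],
    # so the true reward-to-cost ratio E[R]/E[C] is 1.0.
    tasks = [(random.uniform(0, 2), random.uniform(0.5, 1.5))
             for _ in range(20000)]
    print(round(robbins_monro_ratio(tasks), 2))  # close to 1.0
```

In the paper's setting, each (reward, cost) pair would come from running a full episode of the task MDP under the current (optimistically estimated) policy rather than from a fixed distribution as in this toy example.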
Keywords:
Machine Learning: ML: Online learning
Machine Learning: ML: Cost-sensitive learning
Planning and Scheduling: PS: Learning in planning and scheduling
Planning and Scheduling: PS: Markov decision processes