No Regret Reinforcement Learning Algorithms for Online Scheduling with Multi-Stage Tasks
Yongxin Xu, Hengquan Guo, Ziyu Shao, Xin Liu

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence

We study online task scheduling problems where tasks arrive sequentially and are processed by the platform or server. The service process for each task is multi-stage and is modeled as an episodic Markov Decision Process (MDP). While processing a task, the system acquires rewards by consuming resources. The goal of the platform is to maximize the reward-to-cost ratio over a sequence of K tasks. Online scheduling with multi-stage tasks faces two major challenges: intra-dependence among the different stages within a task and inter-dependence among different tasks. These challenges are further exacerbated by the unknown rewards, costs, and task arrival distribution. To address these challenges, we propose the Robbins-Monro-based Value Iteration for Ratio Maximization (RM^2VI) algorithm. Specifically, RM^2VI addresses ``intra-dependence'' through optimistic value iteration and handles ``inter-dependence'' using the Robbins-Monro method. The algorithm has a greedy structure and achieves a sublinear regret of O(K^(3/4)), establishing the per-task no-regret property. We test RM^2VI in two synthetic experiments: sales promotion in e-commerce and machine learning job training in cloud computing. The results show that RM^2VI achieves the best reward-to-cost ratio among all baselines.
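To illustrate the Robbins-Monro component mentioned above, here is a minimal sketch (not the paper's full RM^2VI algorithm) of a stochastic-approximation update for the reward-to-cost ratio: the estimate rho is driven toward the root of E[R - rho * C] = 0, i.e., rho = E[R]/E[C]. The step-size schedule and sample distributions are illustrative assumptions.

```python
import random

def robbins_monro_ratio(samples, step=lambda k: 1.0 / (k + 1)):
    """Robbins-Monro update rho_{k+1} = rho_k + eta_k * (R_k - rho_k * C_k).

    Converges to the ratio E[R]/E[C], the root of E[R - rho * C] = 0.
    `samples` is an iterable of (reward, cost) pairs, one per task.
    """
    rho = 0.0
    for k, (reward, cost) in enumerate(samples):
        rho += step(k) * (reward - rho * cost)
    return rho

if __name__ == "__main__":
    random.seed(0)
    # Hypothetical task outcomes: rewards ~ U[0, 2] and costs ~ U[0.5, 1.5],
    # so the true reward-to-cost ratio E[R]/E[C] is 1.0.
    tasks = [(random.uniform(0, 2), random.uniform(0.5, 1.5))
             for _ in range(20000)]
    print(round(robbins_monro_ratio(tasks), 2))  # close to 1.0
```

In the paper's setting, each (reward, cost) pair would come from running a full episode of the task MDP under the current (optimistically estimated) policy rather than from a fixed distribution as in this toy example.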
Keywords:
Machine Learning: ML: Online learning
Machine Learning: ML: Cost-sensitive learning
Planning and Scheduling: PS: Learning in planning and scheduling
Planning and Scheduling: PS: Markov decision processes