Discriminatory Order Assignment and Payment-Setting on Food-Delivery Platforms: A Multi-Action and Multi-Agent Reinforcement Learning Framework

January, 2026

Abstract

This paper studies discriminatory order assignment and payment-setting strategies for on-demand food-delivery platforms. We consider an on-demand food-delivery platform that coordinates customers, couriers, and restaurants to maximize profit. The platform determines in real time how to bundle orders, assign orders to couriers, and set payments to couriers. These decisions are personalized: they depend on the historical data collected from each courier, such as order acceptance and rejection rates under distinct scenarios of order assignment and payment values. A Markov Decision Process is formulated for the courier, capturing the decisions of the platform (including differentiated order assignment/bundling strategies and discriminatory payment-setting decisions) and their dependence on the personalized work-related data of each individual courier. To derive the optimal policies, we propose a novel multi-action and multi-agent deep reinforcement learning framework, in which a double Deep Q-Network is employed to develop discrete order assignment strategies and double Proximal Policy Optimization is used to determine continuous payment decisions. Within this learning framework, we introduce a novel neural network architecture that leverages the Query-Key attention mechanism to reduce the time complexity of order assignment from multiplicative to additive, and we adopt a variable-length Bi-LSTM module that compresses variable-length order sequences into a fixed-dimensional feature space to enhance scalability. The proposed neural network and algorithmic framework were validated in a case study using real-world food-delivery data from Hong Kong. 
Compared with a vanilla MLP-based neural network architecture, the proposed architecture significantly enhances platform performance: it increases the number of orders served by 5.25%, reduces platform expenses by 10%, and improves the overall reward of the platform by over 50%. Additionally, our results reveal that couriers with higher order rejection rates receive more orders during peak hours but earn lower wages. This counterintuitive finding is attributed to the way the platform strategically differentiates order allocation: instead of simply allocating fewer orders to couriers with higher rejection rates, the platform preferentially assigns longer-distance trips to couriers with a higher likelihood of order acceptance. These findings expose the implicit biases in the discriminatory algorithms used by the profit-maximizing platform and highlight potential areas for governmental regulatory intervention. The code for our model is publicly available at https://github.com/RS2002/Discriminatory-Food-Delivery .
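
As a rough illustration of how a Query-Key mechanism can turn multiplicative cost into additive cost, the sketch below encodes each courier once into a query vector and each order once into a key vector, then scores every courier-order pair with a scaled dot product. It is a minimal sketch under stated assumptions, not the architecture from the paper: the function names, feature sizes, random weights, and single bias-free linear encoders are all hypothetical. With C couriers and O orders it needs C + O encoder passes, whereas running a network on every concatenated pair would need C × O; the remaining pairwise work is only cheap dot products.

```python
import random


def linear(weights, x):
    """Apply one bias-free dense layer; counts as a single encoder pass."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def make_weights(rows, cols, rng):
    """Random weights standing in for trained parameters (assumption)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def pairwise_scores(courier_feats, order_feats, w_query, w_key):
    """Score every (courier, order) pair with scaled dot products."""
    queries = [linear(w_query, c) for c in courier_feats]  # C encoder passes
    keys = [linear(w_key, o) for o in order_feats]         # O encoder passes
    scale = len(w_query) ** 0.5                            # sqrt of embedding size
    scores = [
        [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        for q in queries
    ]
    encoder_passes = len(queries) + len(keys)
    return scores, encoder_passes


rng = random.Random(0)
couriers = [[rng.random() for _ in range(4)] for _ in range(3)]  # 3 couriers, 4 features
orders = [[rng.random() for _ in range(5)] for _ in range(6)]    # 6 orders, 5 features
w_q = make_weights(8, 4, rng)  # courier features -> 8-dim query
w_k = make_weights(8, 5, rng)  # order features -> 8-dim key

scores, passes = pairwise_scores(couriers, orders, w_q, w_k)
print(len(scores), len(scores[0]), passes)
```

Here `scores[i][j]` is the hypothetical affinity between courier `i` and order `j`; in a Q-learning setting such a matrix could play the role of per-pair action values.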

Type

Journal article

Publication

In Transportation Research Part E – Logistics and Transportation Review