Tsinghua University’s Landmark Survey: How Deep Reinforcement Learning Is Reshaping Intelligent Dispatch in Delivery, Ride-Hailing, and Warehousing

Every day, Meituan’s platform processes over 30 million food delivery orders, DiDi dispatches millions of rides, and JD Logistics handles tens of millions of parcels. Behind these seemingly simple “order-deliver” transactions lies one of the most complex optimization problems in computer science: How do you make optimal dispatch and routing decisions for thousands of service workers (riders, drivers, AGV robots) in a constantly changing supply-demand environment?

A research team from Tsinghua University’s Department of Electronic Engineering (BNRist Lab)—Zefang Zong, Jingwei Wang, and Professor Yong Li—in collaboration with Tao Feng (UIUC) and Tong Xia (Cambridge), has published a comprehensive 41-page survey in ACM Computing Surveys, one of the most prestigious review journals in computer science: “Deep Reinforcement Learning for Demand Driven Services in Logistics and Transportation Systems: A Survey.” This paper systematically catalogs the entire frontier of deep reinforcement learning (DRL) applications in “Demand-Driven Services” (DDS) across logistics and transportation, covering everything from food delivery to ride-hailing, express shipping to warehouse AGV operations.

The Unified Framework: “DDS Service Loop” — Uncovering the Structural Commonality from Delivery to Warehousing

The paper’s most important theoretical contribution is the “Demand-Driven Service Loop” (DDS Loop) concept. The authors discovered that regardless of whether the application is food delivery, ride-hailing, express shipping, or warehousing, the underlying structure can be abstracted into a cyclic interaction among three roles: Service Provider → Service Target → Service Worker.

This abstraction carries profound methodological significance. In traditional research, food delivery dispatch, ride-hailing scheduling, and warehouse robot path planning were treated as separate problems tackled by different research communities. The DDS Loop framework reveals their structural commonality:

Food Delivery: Restaurant (provider) → Consumer (target) → Courier (worker)
Ride-hailing: Passenger origin (provider) → Destination (target) → Driver (worker)
Express Pickup: Consignor (provider) → Depot (target) → Courier (worker)
Express Delivery: Depot (provider) → Consignee (target) → Courier (worker)
Warehousing: Shelf/Entry/Station (provider) → Shelf/Entry/Station (target) → AGV (worker)

This means DRL algorithms validated in one scenario can theoretically transfer to another. For logistics companies operating multiple business lines (like Meituan running delivery, flash sales, and ride-hailing simultaneously), the underlying algorithmic framework can be reused with scenario-specific fine-tuning.
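As a rough illustration of that reuse, the five scenarios above can be written as instances of one shared record type. This is only a sketch: the class name `DDSLoop` and its field names are made up for this example and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DDSLoop:
    """One demand-driven service scenario, reduced to its three roles."""
    scenario: str
    provider: str  # where the service originates
    target: str    # where the service must end up
    worker: str    # who (or what) carries it out


# Role assignments copied from the scenario list above.
SCENARIOS = [
    DDSLoop("food delivery", "restaurant", "consumer", "courier"),
    DDSLoop("ride-hailing", "passenger origin", "destination", "driver"),
    DDSLoop("express pickup", "consignor", "depot", "courier"),
    DDSLoop("express delivery", "depot", "consignee", "courier"),
    DDSLoop("warehousing", "shelf/entry/station", "shelf/entry/station", "AGV"),
]


def describe(loop: DDSLoop) -> str:
    """Render a scenario in provider -> target -> worker form."""
    return f"{loop.scenario}: {loop.provider} -> {loop.target} -> {loop.worker}"


for loop in SCENARIOS:
    print(describe(loop))
```

Any dispatching or routing code written against `DDSLoop` rather than against "restaurant" or "driver" directly would, in principle, apply to all five scenarios.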

Two Decision Stages: Dispatching and Routing — DRL’s Core Battlegrounds

The paper categorizes all DDS decision problems into two stages, which are DRL’s primary arenas:

Stage 1: Dispatching — matching demands with service workers. This is the “who does what” problem. In food delivery, it means deciding which rider picks up which order; in ride-hailing, which driver picks up which passenger. The core challenge: both supply and demand change in real-time, and current dispatch decisions affect future resource distribution. Traditional greedy algorithms (always pick the current best) ignore these long-term effects.

DRL’s advantage in dispatching lies in learning “delayed gratification” — through sequential decision learning, it can understand that “sending a rider slightly further now may position them advantageously, enabling more deliveries over the next 30 minutes.” The paper details applications of DQN, Actor-Critic, and PPO algorithms in dispatching. Notably, Meituan’s own research team contributed key works, including using Capsule Networks to capture spatial-temporal distributions of riders and orders.
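The difference between greedy and long-horizon dispatch can be shown with a toy example. One common way to account for future effects is to score each rider-order match by its immediate reward plus the discounted value of the zone where the rider ends up, minus the value of the zone the rider leaves. The sketch below assumes that scoring rule; the zone values, rewards, and discount factor are invented numbers standing in for a learned value function, and this is not the method of any specific paper in the survey.

```python
from itertools import permutations

GAMMA = 0.9  # discount factor (assumed)

# Hypothetical learned value of having a rider in each zone.
ZONE_VALUE = {"downtown": 10.0, "suburb": 2.0, "campus": 6.0}

# rider -> current zone
RIDERS = {"r1": "downtown", "r2": "suburb"}

# order -> (immediate reward per rider, drop-off zone)
ORDERS = {
    "o1": ({"r1": 5.0, "r2": 4.0}, "suburb"),
    "o2": ({"r1": 4.5, "r2": 3.5}, "downtown"),
    "o3": ({"r1": 4.0, "r2": 4.5}, "campus"),
}


def greedy_score(rider, order):
    """Immediate reward only: the 'always pick the current best' view."""
    return ORDERS[order][0][rider]


def value_aware_score(rider, order):
    """Immediate reward plus the discounted change in future value."""
    rewards, dropoff = ORDERS[order]
    return rewards[rider] + GAMMA * ZONE_VALUE[dropoff] - ZONE_VALUE[RIDERS[rider]]


def best_assignment(score):
    """Brute-force the rider -> order matching with the highest total score."""
    riders = list(RIDERS)
    best = max(
        permutations(ORDERS, len(riders)),
        key=lambda perm: sum(score(r, o) for r, o in zip(riders, perm)),
    )
    return dict(zip(riders, best))


print("greedy:     ", best_assignment(greedy_score))
print("value-aware:", best_assignment(value_aware_score))
```

With these numbers the greedy matcher gives r1 the highest-paying order (o1, ending in the low-value suburb), while the value-aware matcher gives r1 the slightly lower-paying o2 because it ends downtown. Brute-force enumeration is used only to keep the example short; a production matcher would use an assignment solver.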

Stage 2: Routing — determining specific travel routes for service workers. This is the “how to get there” problem. Mathematically, it reduces to various forms of the Vehicle Routing Problem (VRP) — a classic NP-hard problem. Traditional methods (exact solvers, heuristics) work well at small scales but become computationally prohibitive in large-scale real-time scenarios.

DRL’s breakthrough in routing is replacing hand-crafted heuristic rules with neural networks. Through attention mechanisms (particularly Transformer architectures), DRL models can construct high-quality routes directly from the spatial distribution of nodes without enumerating all possible combinations. The paper notes that the Pointer Network (introduced by Vinyals et al. at Google Brain) and the subsequent Attention Model (AM, by Kool et al.) are milestones in this direction, now approaching or exceeding traditional heuristic performance on 100-node TSP/VRP problems while being hundreds of times faster at inference.
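The decoding loop behind these models can be sketched in a few lines: at each step, score every node against the current position, mask the nodes already visited, turn the scores into a probability distribution with a softmax, and pick the next stop. In the sketch below the score is simply negative Euclidean distance, which is an assumption made to keep the example runnable without training; a real Pointer Network or Attention Model computes this score from learned embeddings. With this stand-in score, greedy decoding reduces to the nearest-neighbor heuristic.

```python
import math


def softmax(scores):
    """Numerically stable softmax; masked (-inf) entries get probability 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode_step(current, nodes, visited):
    """One pointer-style step: score all nodes, mask visited ones, normalize."""
    scores = []
    for i, node in enumerate(nodes):
        if i in visited:
            scores.append(-math.inf)  # mask: this node can no longer be chosen
        else:
            # Stand-in for a learned compatibility score (assumption).
            scores.append(-math.dist(current, node))
    return softmax(scores)


def greedy_route(nodes, start=0):
    """Build a route by taking the most probable node at every step."""
    route, visited = [start], {start}
    while len(visited) < len(nodes):
        probs = decode_step(nodes[route[-1]], nodes, visited)
        nxt = max(range(len(nodes)), key=probs.__getitem__)
        route.append(nxt)
        visited.add(nxt)
    return route


NODES = [(0.0, 0.0), (2.0, 0.0), (0.5, 0.5), (2.0, 1.5), (0.0, 2.5)]
print(greedy_route(NODES))
```

Each step costs time linear in the number of nodes, so a full route takes quadratic time overall; this is why inference is fast compared with searching over all orderings.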

Five Technical Challenges: The Gap Between Lab and Real World

Chapter 7 candidly identifies five technical challenges DRL faces in DDS applications, making it an invaluable reference for logistics companies evaluating AI investments:

1. Coupled Spatial-Temporal Representations. Demand and supply in logistics change simultaneously across space and time, deeply intertwined. Current solutions (capsule networks, multi-head attention) are still exploratory. Designing neural architectures that effectively capture these coupled relationships remains an open core problem.

2. System Safety. DRL models may produce uncontrollable behaviors during inference, and constraint violations in logistics carry high costs (late deliveries, illegal routes). Existing constraint-handling methods (reward penalties, Lagrangian relaxation) show limited effectiveness. In production, DRL systems must be paired with human oversight and fallback rules — they cannot run fully autonomously.

3. Large-Scale Deployment. This is the largest gap between paper and product. Academic VRP experiments typically cap at 100 nodes, but real city-level delivery networks may have tens of thousands of nodes. The authors acknowledge current solutions are “far from enough to solve the large-scale problem.”

4. Dynamic Real-time Scheduling. In the real world, new orders arrive continuously, existing ones get cancelled or modified, and traffic conditions change constantly. Standard dynamic TSP complexity reaches O(n³). Only a handful of DRL solutions currently handle truly dynamic routing scenarios.

5. AGV and UAV Challenges. When service workers shift from humans to machines, additional constraints emerge: charging requirements, micro-obstacle avoidance, and flight regulations. These are difficult to fully simulate during training, creating a significant sim-to-real performance gap.
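The two constraint-handling methods named under System Safety, reward penalties and Lagrangian relaxation, share one basic form: subtract a weighted constraint cost from the reward. A fixed weight gives a plain reward penalty; Lagrangian relaxation additionally adjusts the weight according to how far the observed cost is from an allowed budget. The sketch below assumes a single scalar cost (for example, minutes of lateness); the function names, learning rate, and numbers are illustrative.

```python
def penalized_reward(reward, cost, lam):
    """Reward minus a weighted constraint cost (e.g. minutes late)."""
    return reward - lam * cost


def update_multiplier(lam, avg_cost, budget, lr=0.5):
    """Dual ascent: raise lam when cost exceeds the budget, lower it otherwise.

    The multiplier is clipped at zero so a satisfied constraint is never rewarded.
    """
    return max(0.0, lam + lr * (avg_cost - budget))


# Two hypothetical dispatch behaviors: (reward, constraint cost).
CANDIDATES = {"fast": (10.0, 4.0), "safe": (8.0, 1.0)}


def preferred(lam):
    """Which behavior scores best under the current multiplier."""
    return max(CANDIDATES, key=lambda k: penalized_reward(*CANDIDATES[k], lam))


lam = 0.0
print(preferred(lam))  # with no penalty, the higher-reward behavior wins
lam = update_multiplier(lam, avg_cost=4.0, budget=2.0)
print(lam, preferred(lam))  # cost exceeded the budget, so the penalty grows
```

Note that the penalty only discourages violations on average; it does not rule them out on any single decision, which is consistent with the limited effectiveness described above.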

Five Open Problems: Golden Tracks for Future Research

Chapter 8 identifies five research directions with enormous commercial potential:

1. Advanced DRL Methods. Offline RL is considered among the most promising directions — learning from historical data without real-world interaction, solving the high-risk, high-cost problem of online learning. For logistics companies with accumulated dispatch data (Meituan, DiDi, JD), offline RL may be the most practical AI upgrade path.

2. Joint Optimization of Both Stages. Most current systems solve dispatching and routing independently, but they’re deeply interconnected in reality. Joint optimization faces state-space explosion challenges but promises significant overall performance improvements.

3. Fairness Consideration. Perhaps the most socially significant direction. Current DDS systems almost universally maximize platform profit, potentially creating extreme income disparities among workers. The paper calls for incorporating fairness into reward function design — an issue that became a social flashpoint in China’s “algorithms trapping delivery riders” debate.

4. Partial Compliance. Current algorithms assume workers execute 100% of platform instructions, but riders frequently reject certain orders (e.g., long-distance deliveries on rainy days). Modeling this “human uncertainty” and designing incentive mechanisms is a cross-disciplinary challenge bridging behavioral economics and reinforcement learning.

5. Large-Scale Online Scheduling Systems. The paper identifies this as the “ultimate benchmark” — building systems that handle real-world DDS tasks at scale. This requires solving all challenges simultaneously: spatial-temporal coupling, dynamics, fleet heterogeneity, scalability, and practical constraints.
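For the fairness direction, one simple way to fold income distribution into a reward function is to subtract an inequality measure from platform profit. The sketch below uses the Gini coefficient as that measure; the choice of Gini, the weight `beta`, and the two candidate plans are assumptions made for illustration, not a design taken from the paper.

```python
def gini(incomes):
    """Gini coefficient: 0 means all workers earn the same; higher is less equal."""
    n = len(incomes)
    total = sum(incomes)
    if n == 0 or total == 0:
        return 0.0
    diff_sum = sum(abs(a - b) for a in incomes for b in incomes)
    return diff_sum / (2 * n * total)


def fair_reward(profit, incomes, beta):
    """Platform profit minus a weighted inequality penalty."""
    return profit - beta * gini(incomes)


# Two hypothetical dispatch plans for the same three workers.
PLANS = {
    "A": {"profit": 100.0, "incomes": [10.0, 10.0, 80.0]},
    "B": {"profit": 95.0, "incomes": [30.0, 30.0, 35.0]},
}


def choose_plan(beta):
    """Pick the plan with the highest fairness-adjusted reward."""
    return max(
        PLANS,
        key=lambda name: fair_reward(PLANS[name]["profit"], PLANS[name]["incomes"], beta),
    )


print(choose_plan(beta=0.0))   # profit only
print(choose_plan(beta=20.0))  # profit with inequality penalty
```

With `beta = 0` the more profitable but lopsided plan A is chosen; with `beta = 20` the slightly less profitable plan B wins because its incomes are nearly equal. Choosing `beta` is itself a policy decision rather than a purely technical one.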

Strategic Implications for Supply Chain Practitioners

First, DRL is becoming the “new infrastructure” of logistics dispatch. From Meituan’s order allocation to Amazon’s warehouse robot scheduling, DRL has evolved from academic concept to industrial practice. For enterprises with daily order volumes exceeding 10,000, DRL-driven efficiency gains (typically 5-15%) can more than cover technology investment costs.

Second, data is DRL’s fuel, and logistics companies naturally own it. Every dispatch record, GPS trajectory, and order flow that logistics companies accumulate daily is ideal training data. The rise of offline RL further lowers the barrier: instead of building complex simulation environments, a company can learn directly from historical data. Start systematically storing and labeling dispatch data now.

Third, pursue “AI-assisted + human fallback,” not end-to-end automation. DRL cannot yet achieve 100% reliability in logistics. The most pragmatic model: DRL generates dispatch recommendations, human dispatchers review critical decisions, and hard-constraint fallback rules remain in place. Gradually expand automation as models mature in specific scenarios.

Fourth, fairness isn’t just ethics; it’s a business necessity. Over-optimizing platform profit while ignoring worker income distribution ultimately leads to worker attrition, public backlash, and regulatory intervention. Incorporating fairness as a constraint in dispatch algorithms is an essential investment for long-term sustainable operations.
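The “AI-assisted + human fallback” pattern from the third point can be sketched as a thin wrapper around the model. Everything here is hypothetical: the `Assignment` type, the two ETA thresholds, and the policy callables are placeholders for whatever a real dispatch system exposes.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Assignment:
    rider: str
    order: str
    eta_minutes: float


MAX_ETA_MINUTES = 45.0     # assumed hard constraint: never exceed this
REVIEW_ETA_MINUTES = 35.0  # assumed threshold above which a human reviews


def dispatch_with_fallback(
    order: str,
    model_policy: Callable[[str], Optional[Assignment]],
    rule_policy: Callable[[str], Assignment],
    human_review: Optional[Callable[[Assignment], bool]] = None,
) -> Tuple[Assignment, str]:
    """Return the final assignment and a label saying which path produced it."""
    proposal = model_policy(order)
    # Hard constraint: a missing or violating proposal never reaches a rider.
    if proposal is None or proposal.eta_minutes > MAX_ETA_MINUTES:
        return rule_policy(order), "fallback_rule"
    # Borderline proposals go to a human dispatcher when one is available.
    if proposal.eta_minutes > REVIEW_ETA_MINUTES and human_review is not None:
        if human_review(proposal):
            return proposal, "human_approved"
        return rule_policy(order), "human_rejected"
    return proposal, "model"


def rule_policy(order: str) -> Assignment:
    """Stand-in rule-based dispatcher."""
    return Assignment("nearest_rider", order, 30.0)


def model_policy(order: str) -> Assignment:
    """Stand-in for a DRL model's recommendation."""
    etas = {"o1": 20.0, "o2": 40.0, "o3": 60.0}
    return Assignment("model_rider", order, etas[order])


for o in ("o1", "o2", "o3"):
    print(o, dispatch_with_fallback(o, model_policy, rule_policy, lambda a: True)[1])
```

Logging the returned label for every decision gives a direct measure of how often the model is trusted, which is one way to decide when automation can be expanded.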

Source: Zong, Z., Wang, J., Feng, T., Xia, T., & Li, Y. (2024). “Deep Reinforcement Learning for Demand Driven Services in Logistics and Transportation Systems: A Survey.” ACM Computing Surveys. arXiv:2108.04462v3 | Tsinghua University, Department of Electronic Engineering, BNRist Lab