Transportation Science Paper: How Reinforcement Learning + Hyper-Heuristics Cut Meituan’s Meal Delivery Costs by 12%
Meal delivery looks simple (accept order, pick up, deliver), but at Meituan’s scale of 30+ million daily orders, it’s one of the most complex real-time combinatorial optimization problems on Earth. With up to 260 new orders per minute and tens of thousands of couriers moving through cities, each dispatch decision alters the entire system’s future state. Traditional greedy dispatching (always assign the nearest courier) seems reasonable but ignores a critical fact: today’s “optimal” can cause tomorrow’s disaster.
A research team from Amazon, Universidad Católica del Norte, Universidad Adolfo Ibáñez, and the University of Sydney (Ramón Auad, Felipe Lagos, and Tomás Lagos) submitted a paper to Transportation Science, one of the top journals in transportation research, proposing a hybrid framework that combines reinforcement learning with hyper-heuristic optimization, validated using Meituan’s real operational data. Result: a 12% cost reduction in simulation through “strategic order postponement,” with the largest improvements during peak hours with courier shortages.
Core Problem: Why “Nearest Courier Takes Nearest Order” Is a Terrible Strategy
Meal delivery platforms face two core decisions: dispatching (which courier takes which order) and routing (in what sequence the courier picks up and delivers). The two are deeply coupled, change dynamically, and are fraught with uncertainty, and the combined problem is NP-hard.
Traditional methods treat each time window independently, minimizing current-period costs. The problem: this “myopic” strategy ignores the long-term effects of sequential decisions. Sending a courier to a distant delivery removes them from covering urgent new orders in their zone over the next 5 minutes. Waiting might bring a closer courier into range, or allow bundling of directionally similar orders.
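The trade-off between dispatching now and waiting can be made concrete with a toy calculation. The sketch below is only an illustration, not the paper’s cost model: the function dispatch_cost, the per-kilometre and per-minute cost weights, and both scenarios are assumptions invented for the example.

```python
# Toy comparison of "dispatch now" vs. "wait for a closer courier".
# The cost model and every number here are hypothetical.

def dispatch_cost(distance_km, wait_min, cost_per_km=1.0, cost_per_wait_min=0.5):
    """Travel cost plus a penalty for every minute the order is held back."""
    return distance_km * cost_per_km + wait_min * cost_per_wait_min

# Option A: assign the only free courier right now; they are 5 km away.
cost_now = dispatch_cost(distance_km=5.0, wait_min=0.0)

# Option B: hold the order 2 minutes until a courier 1 km away becomes free.
cost_wait = dispatch_cost(distance_km=1.0, wait_min=2.0)

best = "wait" if cost_wait < cost_now else "dispatch now"
print(f"now={cost_now:.1f}, wait={cost_wait:.1f}, choose: {best}")
```

With these made-up weights, waiting costs 2.0 against 5.0 for immediate dispatch. Raising the waiting penalty to 3.0 per minute flips the answer (7.0 against 5.0), which is why the decision has to be evaluated rather than assumed.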
The paper formalizes this as a Sequential Decision Process, explicitly modeling dynamic system state evolution. Each dispatch decision has both immediate cost and downstream effects on courier distribution and order wait times. This modeling makes “don’t dispatch now — wait for a better match” a legitimate, evaluable strategy.
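One way to write down the difference between the myopic rule and the sequential view is sketched below. The notation is illustrative and not taken from the paper: the symbols for state, action, cost, discount factor, and cost-to-go are assumptions made for this summary.

```latex
% Illustrative notation (assumed, not the paper's):
% s_t: system state at decision epoch t (open orders, courier positions, ...)
% a_t: action at epoch t (assignments and routes; may leave orders unassigned)
% c(s_t, a_t): immediate cost;  \gamma \in (0, 1]: discount factor
% V^{\pi}(s): expected cost-to-go from state s under policy \pi
\begin{align*}
\text{Myopic rule:} \quad
  & a_t = \arg\min_{a \in \mathcal{A}(s_t)} c(s_t, a) \\
\text{Sequential objective:} \quad
  & \min_{\pi} \; \mathbb{E}\!\left[ \sum_{t=0}^{T} \gamma^{t}\, c\bigl(s_t, \pi(s_t)\bigr) \right] \\
\text{Look-ahead rule:} \quad
  & a_t = \arg\min_{a \in \mathcal{A}(s_t)}
    \Bigl\{ c(s_t, a) + \gamma\, \mathbb{E}\bigl[ V^{\pi}(s_{t+1}) \mid s_t, a \bigr] \Bigr\}
\end{align*}
```

Under the look-ahead rule, leaving an order unassigned is simply one more action in the feasible set, chosen whenever its immediate cost plus expected cost-to-go is lowest.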
Technical Approach: n-step SARSA + Multi-Armed Bandit Hyper-Heuristic
The framework consists of two layers, elegantly solving the “action space explosion” problem that plagues RL in combinatorial optimization:
Upper layer: n-step SARSA reinforcement learning. Unlike the better-known Q-learning, SARSA is on-policy: it learns the value function of the policy it is actually following rather than of the optimal policy, which suits meal delivery, where conservative, stable policies are required. The n-step extension lets each update use rewards observed over several future steps rather than just one. The researchers use linear value function approximation for scalability: neural networks might be more precise, but at 260 orders per minute, inference speed must be prioritized.
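As a rough illustration of the upper layer, the sketch below performs a single semi-gradient n-step SARSA update with a linear value function. It is the generic textbook form of the update, not the authors’ implementation: the feature vector, the rewards (negative costs), the step size, and the discount factor are all hypothetical.

```python
# Minimal sketch of one semi-gradient n-step SARSA update with a linear
# action-value function q(s, a) = w . phi(s, a). Generic textbook form;
# features, rewards, and hyperparameters are hypothetical.

def dot(w, x):
    """Inner product of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))

def n_step_return(rewards, gamma, bootstrap_value):
    """G = r_1 + gamma*r_2 + ... + gamma^(n-1)*r_n + gamma^n * bootstrap_value."""
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def sarsa_update(w, phi_sa, rewards, phi_next_sa, gamma=0.9, alpha=0.1):
    """Return updated weights after moving q(s, a) toward the n-step target.

    phi_sa:      features of the state-action pair being updated
    rewards:     the n rewards observed after taking that action
    phi_next_sa: features of the pair actually chosen n steps later
                 (on-policy bootstrap), or None if the episode ended
    """
    bootstrap = 0.0 if phi_next_sa is None else dot(w, phi_next_sa)
    target = n_step_return(rewards, gamma, bootstrap)
    td_error = target - dot(w, phi_sa)
    return [wi + alpha * td_error * xi for wi, xi in zip(w, phi_sa)]

# Hypothetical features: [bias, open orders per courier, 1 if action postpones]
w = [0.0, 0.0, 0.0]
phi_sa = [1.0, 0.0, 1.0]
rewards = [-1.0, -2.0]          # negative costs observed over n = 2 steps
phi_next_sa = [0.0, 1.0, 0.0]

w_new = sarsa_update(w, phi_sa, rewards, phi_next_sa)
print(w_new)
```

The bootstrap term uses the action the policy actually took n steps later, which is what makes the update on-policy. Because the value function is linear, evaluating a candidate action is a single dot product, which fits the inference-speed argument above.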
Lower layer: Multi-Armed Bandit (MAB) hyper-heuristic. This is the paper’s most original design. At each decision point, the system faces not a simple “A or B” choice but must find good solutions among tens of thousands of possible courier-order matching combinations. The authors designed 7 specialized low-level heuristics (nearest-match, load-balancing, delay-tolerant, etc.), then used a MAB algorithm to dynamically select the most appropriate heuristic for the current system state. This strategy of “choosing which heuristic to use,” called a hyper-heuristic, avoids the computational disaster of searching directly in the enormous action space.
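This summary does not say which bandit algorithm the paper uses, so the sketch below assumes UCB1 purely for illustration. The three heuristic names come from the text above; the fixed rewards standing in for “how good was the assignment this heuristic produced” are invented, and the dependence on system state is left out to keep the example short.

```python
import math

class UCB1Selector:
    """UCB1 bandit over a fixed set of low-level heuristics (an assumed choice)."""

    def __init__(self, names):
        self.names = list(names)
        self.counts = [0] * len(self.names)   # times each heuristic was used
        self.means = [0.0] * len(self.names)  # running mean reward in [0, 1]

    def select(self):
        # Try every heuristic once before using confidence bounds.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        total = sum(self.counts)
        scores = [m + math.sqrt(2.0 * math.log(total) / n)
                  for m, n in zip(self.means, self.counts)]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        # Incremental update of the running mean reward for this heuristic.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

# Toy stand-in for "run the heuristic and score the resulting assignment".
# Rewards are made up and already scaled to [0, 1] (higher = lower cost).
TOY_REWARD = {"nearest_match": 0.2, "load_balancing": 0.5, "delay_tolerant": 0.8}

selector = UCB1Selector(TOY_REWARD)
for _ in range(1000):
    arm = selector.select()
    selector.update(arm, TOY_REWARD[selector.names[arm]])

usage = dict(zip(selector.names, selector.counts))
print(usage)
```

The bandit only ever chooses among a handful of arms, so its cost per decision is independent of how many courier-order combinations exist. In this toy run the heuristic with the highest reward ends up used most often, while the others still receive occasional exploratory pulls.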
Simulation Environment: Rebuilding the Delivery World with Meituan’s Real Data
Another major contribution is the high-fidelity simulation environment built from Meituan’s actual operational data, capturing multiple critical real-world features:
- Order dynamics: Arrival patterns follow real temporal patterns — lunch peak vs. dinner peak, weekday vs. weekend distributions
- Courier behavior: Couriers aren’t robots. They reject certain orders (especially long-distance ones or those offered in bad weather), have zone preferences, and vary in online/offline timing. ML models predict order acceptance probability
- Stochastic service times: Restaurant preparation time and delivery time (traffic, building floors) are random variables, modeled with gradient boosting trees
- Time window constraints: Each order has a promised delivery time; violations mean compensation and rating penalties
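A heavily simplified sketch of two of these ingredients, time-varying order arrivals and time-window penalties, is given below. It is not the paper’s simulator: the Poisson-process assumption, the off-peak rate, and the penalty weight are placeholders, and courier behavior and the learned service-time models are omitted. Only the peak windows and the peak figure of 260 orders per minute are taken from this article.

```python
import random

# Lunch and dinner peaks as (start, end) in minutes since midnight.
PEAK_WINDOWS = [(11 * 60 + 30, 13 * 60), (17 * 60 + 30, 20 * 60)]

def arrival_rate(minute_of_day, peak_rate=260.0, off_peak_rate=60.0):
    """Orders per minute; off_peak_rate is a made-up placeholder."""
    in_peak = any(start <= minute_of_day < end for start, end in PEAK_WINDOWS)
    return peak_rate if in_peak else off_peak_rate

def sample_arrivals(rate_per_min, rng):
    """Arrival offsets (in minutes) within one minute, assuming a Poisson process."""
    offsets = []
    t = rng.expovariate(rate_per_min)
    while t < 1.0:
        offsets.append(t)
        t += rng.expovariate(rate_per_min)
    return offsets

def lateness_penalty(promised_min, delivered_min, penalty_per_min=0.5):
    """Cost charged for each minute a delivery exceeds its promised time."""
    return max(0.0, delivered_min - promised_min) * penalty_per_min

rng = random.Random(0)                      # fixed seed for reproducibility
noon_orders = sample_arrivals(arrival_rate(12 * 60), rng)
night_orders = sample_arrivals(arrival_rate(3 * 60), rng)
print(len(noon_orders), len(night_orders))
print(lateness_penalty(promised_min=30.0, delivered_min=36.0))
```

A real simulator would replace the constant rates with rates fitted to historical data and would draw preparation and travel times from the learned models described above.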
Notably, the researchers honestly acknowledged a limitation: due to computational constraints, experiments ran on scaled-down instances rather than Meituan’s full-scale operations. Extending the framework to the full scale of 260 orders per minute remains a future research direction. Such candor is rare and valuable in industry-partnered papers.
Key Findings: More Than an Algorithmic Victory
1. “Strategic order postponement” delivers 12% cost reduction. The most counterintuitive finding: sometimes holding an order back is more efficient than dispatching it immediately. The algorithm learned to deliberately wait in certain situations: for new couriers entering a zone, for directionally similar new orders that enable bundled delivery, or for pressure in overloaded areas to ease naturally.
2. Greatest improvements during peak + courier shortage. When couriers are abundant, any algorithm performs well — surplus supply means ample choices. The real differentiation occurs in extreme scenarios: lunch peak 11:30-13:00, dinner peak 17:30-20:00, bad weather causing courier dropoff. In these scenarios, myopic strategies create “cascading failures” (send courier far → zone under-served → more timeouts → forced expediting → costs spike) that RL effectively prevents.
3. Adding 10% more couriers beats algorithmic improvements. Perhaps the paper’s most practically valuable finding. A 10% increase in courier availability yields greater cost reduction than upgrading from baseline algorithms to the RL framework. For delivery platforms, fleet supply management (recruitment, incentives, retention) may deliver higher ROI than dispatch algorithm optimization. Algorithms and fleet capacity are complementary, not substitutes — the optimal strategy invests in both.
Strategic Implications for the Logistics Industry
1. “Delayed decisions” are an undervalued optimization lever. Under instant-delivery pressure, operations teams default to “dispatch as fast as possible.” This paper shows that under the right conditions (bundling opportunities, new resources arriving soon), disciplined waiting outperforms hasty action. The principle plausibly extends beyond food delivery to parcel sorting, ride-hailing dispatch, and warehouse task assignment.
2. Extreme-scenario performance is the real competitive differentiator. All competitors provide adequate service during normal periods. What determines user retention and brand reputation is service quality during peaks, bad weather, and incidents. Concentrating algorithmic resources on extreme scenarios may deliver more business value than pursuing across-the-board average improvement.
3. Capacity is the primary productive force. Algorithms cannot conjure couriers from thin air. No matter how sophisticated the algorithm, severe courier shortage leaves limited optimization room. For instant-delivery platforms, courier recruitment and retention strategies should share equal priority with technology investment. The smartest approach: use algorithmic optimization to improve courier experience (better routes, less dead mileage), which in turn improves retention.
4. Simulation is the bridge to production. The high-fidelity simulation environment is itself a major asset. Validating RL algorithms in simulation before deployment avoids the risk of “experimenting” on real orders. For any company considering AI in logistics operations, the first investment should be building the most realistic simulation environment possible, not deploying models directly.
Source: Auad, R., Lagos, F., & Lagos, T. “Data-Driven Optimization for Meal Delivery: A Reinforcement Learning Approach for Order-Courier Assignment and Routing at Meituan.” Submitted to Transportation Science. | Amazon / Universidad Católica del Norte / Universidad Adolfo Ibáñez / University of Sydney | First INFORMS TSL Data-Driven Research Challenge








