Abstract
LIBERO has emerged as a widely adopted benchmark for evaluating Vision-Language-Action (VLA) models; however, its current training and evaluation settings are problematic, often inflating performance estimates and preventing fair model comparison. To address these issues, we introduce LIBERO-PRO, an extended LIBERO benchmark that systematically evaluates model performance under reasonable perturbations along four dimensions: manipulated objects, initial states, task instructions, and environments. Experimental results reveal that although existing models achieve over 90% accuracy under the standard LIBERO evaluation, their performance collapses to 0.0% under our generalized setting. Crucially, this discrepancy exposes the models’ reliance on rote memorization of action sequences and environment layouts from the training set, rather than genuine task understanding or environmental perception. For instance, models persist in executing grasping actions when the target object is replaced with irrelevant items, and their outputs remain unchanged even when the instruction is corrupted or replaced with garbled tokens. These findings expose severe flaws in current evaluation practices, and we call on the community to abandon misleading methodologies in favor of robust assessments of model generalization and comprehension.
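The four perturbation dimensions can be illustrated with a minimal sketch. The `TaskSpec` container, perturbation function, and object/scene names below are hypothetical illustrations of the idea, not the LIBERO-PRO API:

```python
import dataclasses
import random

@dataclasses.dataclass
class TaskSpec:
    """Hypothetical container for one LIBERO-style task configuration."""
    target_object: str
    object_position: tuple   # (x, y) on the tabletop
    instruction: str
    environment: str

def perturb(task: TaskSpec, dimension: str, rng: random.Random) -> TaskSpec:
    """Apply one LIBERO-PRO-style perturbation along a single dimension."""
    t = dataclasses.replace(task)
    if dimension == "object":        # swap the target for a different object
        t.target_object = rng.choice(["alphabet_soup", "ketchup", "butter"])
    elif dimension == "position":    # jitter the initial placement
        dx, dy = rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1)
        t.object_position = (t.object_position[0] + dx, t.object_position[1] + dy)
    elif dimension == "semantic":    # rephrase the instruction
        t.instruction = "please " + t.instruction
    elif dimension == "environment": # change the surrounding scene
        t.environment = rng.choice(["kitchen_dark", "kitchen_cluttered"])
    return t

base = TaskSpec("salad_dressing", (0.3, 0.5),
                "put the salad dressing in the basket", "kitchen")
rng = random.Random(0)
variants = {d: perturb(base, d, rng)
            for d in ["object", "position", "semantic", "environment"]}
```

A robust model should succeed on every variant; a model that memorized the training trajectory fails as soon as any one dimension changes.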
VLAs’ High Scores Reflect Rote Memorization, Not an Effective Policy
Finding 1: Does the Model Generalize to New Objects? The VLA has no notion of which object it is supposed to grasp
When the salad dressing on the workspace was replaced with alphabet soup, the VLA still grasped it as if it were salad dressing, showing that it replays a fixed action trajectory rather than perceiving the object.
Finding 2: Does the Model Generalize to Varied Instructions? The VLA is essentially blind to the content of the instruction
Finding 3: How Sensitive is the Model to Object Placement? The VLA can hardly cope with changes in object positions
Finding 4: Does the Model Truly Understand the Task?
Quantitative Experimental Analysis
Task-wise LIBERO and LIBERO-PRO Performance on libero-goal Benchmark
Success rate on the test set (Ori: original, Obj: object perturbation, Pos: position/spatial perturbation, Sem: semantic instruction perturbation, Task: task-level perturbation, Env: environment perturbation).
Action notation: Open(x,y) = open target y of container x;
Put(obj,loc) = place object obj onto/into location loc;
Push(obj,loc) = push object obj toward location loc;
TurnOn(obj) = activate object obj.
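The symbolic task forms used in the tables below can be parsed mechanically into an action name and its arguments. A minimal sketch (the `parse_task` helper is illustrative, not part of LIBERO-PRO; nested forms such as `between(plate, ramekin)` are kept as raw strings):

```python
import re

def parse_task(form: str):
    """Parse a symbolic task form like 'Open(cabinet, drawer_mid)' into
    (action, [arguments]), splitting only on top-level commas."""
    m = re.fullmatch(r"(\w+)\((.*)\)", form.strip())
    if m is None:
        raise ValueError(f"not a symbolic task form: {form!r}")
    action, args = m.group(1), m.group(2)
    parts, depth, cur = [], 0, ""
    for ch in args:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:   # argument boundary
            parts.append(cur.strip())
            cur = ""
        else:
            cur += ch
    if cur.strip():
        parts.append(cur.strip())
    return action, parts

parse_task("Put(bowl, drawer_top)")  # ('Put', ['bowl', 'drawer_top'])
```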
| Task (Symbolic Form) | OpenVLA Ori | OpenVLA Obj | OpenVLA Pos | OpenVLA Sem | OpenVLA Task | OpenVLA Env | Pi0 Ori | Pi0 Obj | Pi0 Pos | Pi0 Sem | Pi0 Task | Pi0 Env | Pi0.5 Ori | Pi0.5 Obj | Pi0.5 Pos | Pi0.5 Sem | Pi0.5 Task | Pi0.5 Env |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Open(cabinet, drawer_mid) | 0.98 | 0.96 | 0.00 | 1.00 | 0.00 | 0.96 | 0.92 | 0.94 | 0.00 | 0.94 | 0.00 | 0.00 | 0.96 | 0.96 | 0.00 | 0.94 | 0.04 | 0.92 |
| Put(bowl, drawer_top) | 0.88 | 0.88 | 0.00 | 0.96 | 0.00 | 0.82 | 0.76 | 0.86 | 0.00 | 0.94 | 0.00 | 0.10 | 0.98 | 0.98 | 0.94 | 1.00 | 0.02 | 1.00 |
| Push(plate, stove_front) | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 1.00 | 0.94 | 0.93 | 0.00 | 0.94 | 0.00 | 0.54 | 0.96 | 0.98 | 0.00 | 0.98 | 0.00 | 0.28 |
| Put(bowl, plate) | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 1.00 | 1.00 | 0.99 | 0.00 | 0.96 | 0.00 | 0.96 | 0.90 | 0.90 | 0.00 | 0.92 | 0.02 | 0.04 |
| Put(bowl, stove) | 0.94 | 1.00 | 0.00 | 1.00 | 0.00 | 1.00 | 0.98 | 0.97 | 0.00 | 0.98 | 0.00 | 0.98 | 0.98 | 0.96 | 0.00 | 0.98 | 0.04 | 0.78 |
| Put(bowl, cabinet_top) | 1.00 | 0.98 | 0.00 | 1.00 | 0.00 | 1.00 | 0.98 | 0.94 | 0.00 | 0.92 | 0.00 | 0.94 | 0.96 | 0.96 | 0.00 | 0.96 | 0.02 | 0.00 |
| Put(cream_cheese, bowl) | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 1.00 | 0.96 | 0.96 | 0.00 | 0.96 | 0.00 | 0.46 | 0.98 | 1.00 | 0.98 | 0.98 | 0.02 | 0.94 |
| Put(wine_bottle, rack) | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 1.00 | 0.68 | 0.84 | 0.00 | 0.64 | 0.00 | 0.00 | 1.00 | 0.98 | 0.88 | 0.96 | 0.02 | 0.12 |
| Put(wine_bottle, cabinet_top) | 0.96 | 0.83 | 0.00 | 0.88 | 0.00 | 0.88 | 0.94 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 1.00 | 1.00 | 0.98 | 0.98 | 0.02 | 0.48 |
| TurnOn(stove) | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 1.00 | 1.00 | 0.00 | 0.96 | 0.00 | 0.00 |
| Average | 0.98 | 0.96 | 0.00 | 0.98 | 0.00 | 0.98 | 0.92 | 0.94 | 0.00 | 0.93 | 0.00 | 0.39 | 0.97 | 0.97 | 0.38 | 0.97 | 0.00 | 0.46 |
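The Average row is the unweighted mean over the ten tasks, rounded to two decimals. As a sanity check for the OpenVLA "Ori" column of the libero-goal table (per-task rates copied from the rows above):

```python
# Per-task OpenVLA success rates in the "Ori" column of the libero-goal table.
openvla_ori = [0.98, 0.88, 1.00, 1.00, 0.94, 1.00, 1.00, 1.00, 0.96, 1.00]

def column_average(rates):
    """Unweighted mean over tasks, rounded to two decimals as in the tables."""
    return round(sum(rates) / len(rates), 2)

print(column_average(openvla_ori))  # 0.98
```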
Task-wise LIBERO and LIBERO-PRO Performance on libero-spatial Benchmark
Action notation: Pick(src, dst) = pick up the black bowl from location src and place it onto/into target dst.
| Task (Symbolic Form) | OpenVLA Ori | OpenVLA Obj | OpenVLA Pos | OpenVLA Sem | OpenVLA Task | OpenVLA Env | Pi0 Ori | Pi0 Obj | Pi0 Pos | Pi0 Sem | Pi0 Task | Pi0 Env | Pi0.5 Ori | Pi0.5 Obj | Pi0.5 Pos | Pi0.5 Sem | Pi0.5 Task | Pi0.5 Env |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pick(between(plate, ramekin), plate) | 1.00 | 0.95 | 0.00 | 0.88 | 0.00 | 0.02 | 1.00 | 0.97 | 0.00 | 0.94 | 0.00 | 1.00 | 1.00 | 0.96 | 0.02 | 0.98 | 0.00 | 0.66 |
| Pick(table_center, plate) | 1.00 | 0.98 | 0.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 0.00 | 0.98 | 0.00 | 0.98 | 1.00 | 1.00 | 0.00 | 0.98 | 0.02 | 0.36 |
| Pick(drawer_top(cabinet_wood), plate) | 0.96 | 0.96 | 0.00 | 0.94 | 0.00 | 0.94 | 0.98 | 0.99 | 0.00 | 0.94 | 0.00 | 0.54 | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.64 |
| Pick(next_to(cookie_box), plate) | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 1.00 | 1.00 | 0.94 | 0.00 | 1.00 | 0.00 | 1.00 | 1.00 | 0.98 | 0.00 | 1.00 | 0.02 | 0.86 |
| Pick(next_to(plate), plate) | 1.00 | 0.95 | 0.00 | 0.94 | 0.00 | 0.92 | 0.96 | 1.00 | 0.00 | 0.94 | 0.00 | 0.84 | 0.90 | 0.92 | 0.00 | 0.90 | 0.00 | 0.22 |
| Pick(next_to(ramekin), plate) | 1.00 | 0.99 | 0.00 | 1.00 | 0.00 | 0.98 | 1.00 | 0.80 | 0.00 | 1.00 | 0.00 | 0.48 | 1.00 | 0.96 | 0.12 | 1.00 | 0.02 | 0.00 |
| Pick(on(cookie_box), plate) | 1.00 | 0.99 | 0.00 | 0.98 | 0.00 | 1.00 | 0.98 | 0.92 | 0.00 | 0.98 | 0.00 | 0.96 | 1.00 | 1.00 | 0.00 | 0.98 | 0.00 | 1.00 |
| Pick(on(ramekin), plate) | 0.92 | 0.95 | 0.00 | 0.96 | 0.00 | 1.00 | 0.90 | 1.00 | 0.00 | 0.96 | 0.00 | 0.10 | 1.00 | 1.00 | 0.98 | 1.00 | 0.02 | 0.00 |
| Pick(on(stove), plate) | 1.00 | 0.99 | 0.00 | 1.00 | 0.00 | 1.00 | 0.96 | 0.99 | 0.00 | 0.96 | 0.00 | 0.04 | 0.96 | 0.96 | 0.02 | 0.94 | 0.00 | 0.82 |
| Pick(on(cabinet_wood), plate) | 0.94 | 0.99 | 0.00 | 1.00 | 0.00 | 1.00 | 0.94 | 0.86 | 0.00 | 1.00 | 0.00 | 0.80 | 0.90 | 0.88 | 0.90 | 0.90 | 0.00 | 0.02 |
| Average | 0.98 | 0.97 | 0.00 | 0.97 | 0.00 | 0.89 | 0.97 | 0.95 | 0.00 | 0.97 | 0.00 | 0.60 | 0.98 | 0.97 | 0.20 | 0.97 | 0.01 | 0.46 |
Task-wise LIBERO and LIBERO-PRO Performance on libero-10 Benchmark
Action notation: TurnOn(obj) = activate object obj; Put(obj, loc) = place object obj onto/into location loc; Place(obj, loc) = pick up object obj and place it into/onto target loc; a trailing .close = close the container after the action completes.
| Task (Symbolic Form) | OpenVLA Ori | OpenVLA Obj | OpenVLA Pos | OpenVLA Sem | OpenVLA Task | OpenVLA Env | Pi0 Ori | Pi0 Obj | Pi0 Pos | Pi0 Sem | Pi0 Task | Pi0 Env | Pi0.5 Ori | Pi0.5 Obj | Pi0.5 Pos | Pi0.5 Sem | Pi0.5 Task | Pi0.5 Env |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TurnOn(stove) ∧ Put(moka_pot, stove) | 1.00 | 0.95 | 0.00 | 1.00 | 0.00 | 0.00 | 0.74 | 0.67 | 0.00 | 0.74 | 0.00 | 0.00 | 0.92 | 0.90 | 0.18 | 0.90 | 0.00 | 0.94 |
| Put(bowl_black, drawer_bottom(cabinet)).close | 0.78 | 0.98 | 0.00 | 1.00 | 0.00 | 0.98 | 0.94 | 0.93 | 0.00 | 0.94 | 0.00 | 0.78 | 0.98 | 0.96 | 0.58 | 0.98 | 0.02 | 0.86 |
| Put(mug_yellow_white, microwave).close | 1.00 | 0.98 | 0.00 | 0.94 | 0.00 | 0.96 | 0.88 | 0.74 | 0.00 | 0.88 | 0.00 | 0.00 | 0.98 | 0.98 | 0.00 | 1.00 | 0.00 | 0.46 |
| Put({moka_pot_1, moka_pot_2}, stove) | 0.92 | 0.64 | 0.00 | 0.92 | 0.00 | 0.78 | 0.22 | 0.28 | 0.00 | 0.22 | 0.00 | 0.00 | 0.98 | 0.98 | 0.00 | 0.98 | 0.02 | 0.26 |
| Put({alphabet_soup, cream_cheese}, basket) | 0.96 | 0.98 | 0.00 | 1.00 | 0.00 | 1.00 | 1.00 | 0.96 | 0.00 | 1.00 | 0.00 | 0.16 | 1.00 | 0.98 | 0.00 | 0.98 | 0.00 | 0.52 |
| Put({alphabet_soup, tomato_sauce}, basket) | 0.98 | 0.95 | 0.00 | 0.98 | 0.00 | 0.96 | 0.88 | 0.79 | 0.00 | 0.88 | 0.00 | 0.06 | 0.94 | 0.94 | 0.02 | 0.94 | 0.02 | 0.70 |
| Put({cream_cheese, butter}, basket) | 0.96 | 0.96 | 0.00 | 0.98 | 0.00 | 0.98 | 0.98 | 0.98 | 0.00 | 0.98 | 0.00 | 0.96 | 0.96 | 0.94 | 0.00 | 0.98 | 0.00 | 0.00 |
| Put((mug_white, plate_left) \| (mug_yellow_white, plate_right)) | 0.84 | 0.85 | 0.00 | 0.92 | 0.00 | 0.98 | 0.86 | 0.72 | 0.00 | 0.86 | 0.00 | 0.00 | 1.00 | 0.98 | 0.00 | 1.00 | 0.02 | 0.22 |
| Put((mug_white, plate) \| (pudding_choco, right_of(plate))) | 0.94 | 0.77 | 0.00 | 0.90 | 0.00 | 0.86 | 0.76 | 0.82 | 0.00 | 0.74 | 0.00 | 0.00 | 0.62 | 0.58 | 0.00 | 0.60 | 0.00 | 0.04 |
| Place(book, compartment_back(caddy)) | 1.00 | 1.00 | 0.00 | 0.98 | 0.00 | 1.00 | 0.94 | 0.98 | 0.00 | 0.94 | 0.00 | 0.76 | 0.94 | 0.94 | 0.00 | 0.94 | 0.00 | 0.56 |
| Average | 0.93 | 0.81 | 0.00 | 0.96 | 0.00 | 0.85 | 0.82 | 0.79 | 0.00 | 0.82 | 0.00 | 0.27 | 0.93 | 0.92 | 0.08 | 0.93 | 0.01 | 0.46 |
Task-wise LIBERO and LIBERO-PRO Performance on libero-object Benchmark
Action notation: Place(obj, loc) = pick up object obj and place it into/onto target loc.
| Task (Symbolic Form) | OpenVLA Ori | OpenVLA Obj | OpenVLA Pos | OpenVLA Sem | OpenVLA Task | OpenVLA Env | Pi0 Ori | Pi0 Obj | Pi0 Pos | Pi0 Sem | Pi0 Task | Pi0 Env | Pi0.5 Ori | Pi0.5 Obj | Pi0.5 Pos | Pi0.5 Sem | Pi0.5 Task | Pi0.5 Env |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Place(alphabet_soup, basket) | 1.00 | 0.97 | 0.00 | 0.96 | 0.00 | 0.00 | 0.98 | 0.98 | 0.00 | 0.88 | 0.00 | 0.06 | 0.96 | 0.96 | 0.00 | 0.94 | 0.00 | 0.68 |
| Place(bbq_sauce, basket) | 0.98 | 0.90 | 0.00 | 0.94 | 0.00 | 0.00 | 0.98 | 1.00 | 0.00 | 1.00 | 0.00 | 0.46 | 1.00 | 0.98 | 1.00 | 1.00 | 0.02 | 0.96 |
| Place(butter, basket) | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 1.00 | 0.98 | 0.00 | 1.00 | 0.00 | 0.00 | 0.96 | 0.98 | 0.54 | 0.98 | 0.00 | 0.00 |
| Place(chocolate_pudding, basket) | 0.98 | 1.00 | 0.00 | 0.98 | 0.00 | 0.00 | 0.98 | 0.96 | 0.00 | 1.00 | 0.00 | 0.18 | 1.00 | 0.96 | 0.00 | 0.94 | 0.02 | 0.82 |
| Place(cream_cheese, basket) | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 1.00 | 1.00 | 0.10 | 1.00 | 0.00 | 0.76 | 0.98 | 1.00 | 0.00 | 0.94 | 0.00 | 0.88 |
| Place(ketchup, basket) | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.96 | 0.84 | 0.00 | 0.92 | 0.00 | 0.24 | 0.98 | 0.96 | 0.20 | 0.96 | 0.02 | 0.92 |
| Place(milk, basket) | 1.00 | 0.98 | 0.00 | 0.96 | 0.00 | 0.00 | 1.00 | 0.84 | 0.00 | 1.00 | 0.00 | 0.16 | 1.00 | 1.00 | 0.00 | 0.98 | 0.00 | 0.80 |
| Place(orange_juice, basket) | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.92 | 1.00 | 0.00 | 1.00 | 0.00 | 0.62 | 0.98 | 1.00 | 0.00 | 0.96 | 0.02 | 0.28 |
| Place(salad_dressing, basket) | 0.94 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.98 | 0.98 | 0.10 | 1.00 | 0.00 | 0.00 | 1.00 | 0.98 | 0.00 | 0.98 | 0.00 | 1.00 |
| Place(tomato_sauce, basket) | 0.96 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.98 | 0.82 | 0.00 | 0.92 | 0.00 | 0.48 | 0.98 | 0.98 | 0.00 | 0.96 | 0.00 | 0.92 |
| Average | 0.99 | 0.98 | 0.00 | 0.98 | 0.00 | 0.00 | 0.98 | 0.94 | 0.00 | 0.90 | 0.00 | 0.29 | 0.98 | 0.98 | 0.17 | 0.96 | 0.01 | 0.73 |
Case Study
OpenVLA in LIBERO-GOAL
Pi0.5 in LIBERO-SPATIAL
Pi0 in LIBERO-10
OpenVLA in LIBERO-OBJECT
BibTeX
@article{Liberopro2025,
title={LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization},
author={Xueyang Zhou and Yangming Xu and Guiyao Tie and Yongchao Chen and Guowen Zhang and Duanfeng Chu and Pan Zhou and Lichao Sun},
journal={arXiv preprint arXiv:2510.03827},
year={2025},
url={https://arxiv.org/abs/2510.03827}
}