Abstract
LIBERO has emerged as a widely adopted benchmark for evaluating Vision-Language-Action (VLA) models; however, its current training and evaluation settings are problematic, often inflating performance estimates and preventing fair model comparison. To address these issues, we introduce LIBERO-PRO, an extended LIBERO benchmark that systematically evaluates model performance under reasonable perturbations along four dimensions: manipulated objects, initial states, task instructions, and environments. Experimental results reveal that although existing models achieve over 90% accuracy under the standard LIBERO evaluation, their performance collapses to 0.0% under our generalized setting. Crucially, this discrepancy exposes the models’ reliance on rote memorization of action sequences and environment layouts from the training set, rather than genuine task understanding or environmental perception. For instance, models persist in executing grasping actions when the target object is replaced with irrelevant items, and their outputs remain unchanged even when the instruction is corrupted or replaced with garbled tokens. These findings expose severe flaws in current evaluation practices, and we call on the community to abandon misleading methodologies in favor of robust assessments of model generalization and comprehension.
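The four perturbation dimensions can be illustrated with a minimal sketch. The `TaskSpec` container, perturbation function, and object/scene names below are hypothetical illustrations of the idea, not the LIBERO-PRO API:

```python
import dataclasses
import random

@dataclasses.dataclass
class TaskSpec:
    """Hypothetical container for one LIBERO-style task configuration."""
    target_object: str
    object_position: tuple   # (x, y) on the tabletop
    instruction: str
    environment: str

def perturb(task: TaskSpec, dimension: str, rng: random.Random) -> TaskSpec:
    """Apply one LIBERO-PRO-style perturbation along a single dimension."""
    t = dataclasses.replace(task)
    if dimension == "object":        # swap the target for a different object
        t.target_object = rng.choice(["alphabet_soup", "ketchup", "butter"])
    elif dimension == "position":    # jitter the initial placement
        dx, dy = rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1)
        t.object_position = (t.object_position[0] + dx, t.object_position[1] + dy)
    elif dimension == "semantic":    # rephrase the instruction
        t.instruction = "please " + t.instruction
    elif dimension == "environment": # change the surrounding scene
        t.environment = rng.choice(["kitchen_dark", "kitchen_cluttered"])
    return t

base = TaskSpec("salad_dressing", (0.3, 0.5),
                "put the salad dressing in the basket", "kitchen")
rng = random.Random(0)
variants = {d: perturb(base, d, rng)
            for d in ["object", "position", "semantic", "environment"]}
```

A robust model should succeed on every variant; a model that memorized the training trajectory fails as soon as any one dimension changes.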
VLAs’ High Scores Reflect Rote Memorization, Not an Effective Policy
Finding 1: Does the Model Generalize to New Objects? The VLA has no notion of which object it is supposed to grasp
When the salad dressing on the workspace was replaced with alphabet soup, the VLA still grasped it as if it were salad dressing, showing that it replays a fixed action trajectory rather than perceiving the object.
Finding 2: Does the Model Generalize to Varied Instructions? The VLA is essentially blind to the content of the instruction
Finding 3: How Sensitive is the Model to Object Placement? The VLA can hardly cope with changes in object positions
Finding 4: Does the Model Truly Understand the Task?
Quantitative Experimental Analysis
Task-wise LIBERO and LIBERO-PRO Performance on libero-goal Benchmark
Success rate on the test set (Ori: original, Obj: object perturbation, Pos: position/spatial perturbation, Sem: semantic instruction perturbation, Task: task-level perturbation, Env: environment perturbation).
Action notation: Open(x,y) = open target y of container x;
Put(obj,loc) = place object obj onto/into location loc;
Push(obj,loc) = push object obj toward location loc;
TurnOn(obj) = activate object obj.
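The symbolic task forms used in the tables below can be parsed mechanically into an action name and its arguments. A minimal sketch (the `parse_task` helper is illustrative, not part of LIBERO-PRO; nested forms such as `between(plate, ramekin)` are kept as raw strings):

```python
import re

def parse_task(form: str):
    """Parse a symbolic task form like 'Open(cabinet, drawer_mid)' into
    (action, [arguments]), splitting only on top-level commas."""
    m = re.fullmatch(r"(\w+)\((.*)\)", form.strip())
    if m is None:
        raise ValueError(f"not a symbolic task form: {form!r}")
    action, args = m.group(1), m.group(2)
    parts, depth, cur = [], 0, ""
    for ch in args:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:   # argument boundary
            parts.append(cur.strip())
            cur = ""
        else:
            cur += ch
    if cur.strip():
        parts.append(cur.strip())
    return action, parts

parse_task("Put(bowl, drawer_top)")  # ('Put', ['bowl', 'drawer_top'])
```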
| Task (Symbolic Form) | OpenVLA Ori | OpenVLA Obj | OpenVLA Pos | OpenVLA Sem | OpenVLA Task | OpenVLA Env | Pi0 Ori | Pi0 Obj | Pi0 Pos | Pi0 Sem | Pi0 Task | Pi0 Env | Pi0.5 Ori | Pi0.5 Obj | Pi0.5 Pos | Pi0.5 Sem | Pi0.5 Task | Pi0.5 Env |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Open(cabinet, drawer_mid) | 0.98 | 0.96 | 0.00 | 1.00 | 0.00 | 0.96 | 0.92 | 0.94 | 0.00 | 0.94 | 0.00 | 0.00 | 0.96 | 0.96 | 0.00 | 0.94 | 0.04 | 0.92 |
| Put(bowl, drawer_top) | 0.88 | 0.88 | 0.00 | 0.96 | 0.00 | 0.82 | 0.76 | 0.86 | 0.00 | 0.94 | 0.00 | 0.10 | 0.98 | 0.98 | 0.94 | 1.00 | 0.02 | 1.00 |
| Push(plate, stove_front) | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 1.00 | 0.94 | 0.93 | 0.00 | 0.94 | 0.00 | 0.54 | 0.96 | 0.98 | 0.00 | 0.98 | 0.00 | 0.28 |
| Put(bowl, plate) | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 1.00 | 1.00 | 0.99 | 0.00 | 0.96 | 0.00 | 0.96 | 0.90 | 0.90 | 0.00 | 0.92 | 0.02 | 0.04 |
| Put(bowl, stove) | 0.94 | 1.00 | 0.00 | 1.00 | 0.00 | 1.00 | 0.98 | 0.97 | 0.00 | 0.98 | 0.00 | 0.98 | 0.98 | 0.96 | 0.00 | 0.98 | 0.04 | 0.78 |
| Put(bowl, cabinet_top) | 1.00 | 0.98 | 0.00 | 1.00 | 0.00 | 1.00 | 0.98 | 0.94 | 0.00 | 0.92 | 0.00 | 0.94 | 0.96 | 0.96 | 0.00 | 0.96 | 0.02 | 0.00 |
| Put(cream_cheese, bowl) | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 1.00 | 0.96 | 0.96 | 0.00 | 0.96 | 0.00 | 0.46 | 0.98 | 1.00 | 0.98 | 0.98 | 0.02 | 0.94 |
| Put(wine_bottle, rack) | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 1.00 | 0.68 | 0.84 | 0.00 | 0.64 | 0.00 | 0.00 | 1.00 | 0.98 | 0.88 | 0.96 | 0.02 | 0.12 |
| Put(wine_bottle, cabinet_top) | 0.96 | 0.83 | 0.00 | 0.88 | 0.00 | 0.88 | 0.94 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 1.00 | 1.00 | 0.98 | 0.98 | 0.02 | 0.48 |
| TurnOn(stove) | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 1.00 | 1.00 | 0.00 | 0.96 | 0.00 | 0.00 |
| Average | 0.98 | 0.96 | 0.00 | 0.98 | 0.00 | 0.98 | 0.92 | 0.94 | 0.00 | 0.93 | 0.00 | 0.39 | 0.97 | 0.97 | 0.38 | 0.97 | 0.00 | 0.46 |
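The Average row is the unweighted mean over the ten tasks, rounded to two decimals. As a sanity check for the OpenVLA "Ori" column of the libero-goal table (per-task rates copied from the rows above):

```python
# Per-task OpenVLA success rates in the "Ori" column of the libero-goal table.
openvla_ori = [0.98, 0.88, 1.00, 1.00, 0.94, 1.00, 1.00, 1.00, 0.96, 1.00]

def column_average(rates):
    """Unweighted mean over tasks, rounded to two decimals as in the tables."""
    return round(sum(rates) / len(rates), 2)

print(column_average(openvla_ori))  # 0.98
```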
Task-wise LIBERO and LIBERO-PRO Performance on libero-spatial Benchmark
Action notation: Pick(src, dst) = pick up the black bowl from location src and place it onto/into target dst.
| Task (Symbolic Form) | OpenVLA Ori | OpenVLA Obj | OpenVLA Pos | OpenVLA Sem | OpenVLA Task | OpenVLA Env | Pi0 Ori | Pi0 Obj | Pi0 Pos | Pi0 Sem | Pi0 Task | Pi0 Env | Pi0.5 Ori | Pi0.5 Obj | Pi0.5 Pos | Pi0.5 Sem | Pi0.5 Task | Pi0.5 Env |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pick(between(plate, ramekin), plate) | 1.00 | 0.95 | 0.00 | 0.88 | 0.00 | 0.02 | 1.00 | 0.97 | 0.00 | 0.94 | 0.00 | 1.00 | 1.00 | 0.96 | 0.02 | 0.98 | 0.00 | 0.66 |
| Pick(table_center, plate) | 1.00 | 0.98 | 0.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 0.00 | 0.98 | 0.00 | 0.98 | 1.00 | 1.00 | 0.00 | 0.98 | 0.02 | 0.36 |
| Pick(drawer_top(cabinet_wood), plate) | 0.96 | 0.96 | 0.00 | 0.94 | 0.00 | 0.94 | 0.98 | 0.99 | 0.00 | 0.94 | 0.00 | 0.54 | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.64 |
| Pick(next_to(cookie_box), plate) | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 1.00 | 1.00 | 0.94 | 0.00 | 1.00 | 0.00 | 1.00 | 1.00 | 0.98 | 0.00 | 1.00 | 0.02 | 0.86 |
| Pick(next_to(plate), plate) | 1.00 | 0.95 | 0.00 | 0.94 | 0.00 | 0.92 | 0.96 | 1.00 | 0.00 | 0.94 | 0.00 | 0.84 | 0.90 | 0.92 | 0.00 | 0.90 | 0.00 | 0.22 |
| Pick(next_to(ramekin), plate) | 1.00 | 0.99 | 0.00 | 1.00 | 0.00 | 0.98 | 1.00 | 0.80 | 0.00 | 1.00 | 0.00 | 0.48 | 1.00 | 0.96 | 0.12 | 1.00 | 0.02 | 0.00 |
| Pick(on(cookie_box), plate) | 1.00 | 0.99 | 0.00 | 0.98 | 0.00 | 1.00 | 0.98 | 0.92 | 0.00 | 0.98 | 0.00 | 0.96 | 1.00 | 1.00 | 0.00 | 0.98 | 0.00 | 1.00 |
| Pick(on(ramekin), plate) | 0.92 | 0.95 | 0.00 | 0.96 | 0.00 | 1.00 | 0.90 | 1.00 | 0.00 | 0.96 | 0.00 | 0.10 | 1.00 | 1.00 | 0.98 | 1.00 | 0.02 | 0.00 |
| Pick(on(stove), plate) | 1.00 | 0.99 | 0.00 | 1.00 | 0.00 | 1.00 | 0.96 | 0.99 | 0.00 | 0.96 | 0.00 | 0.04 | 0.96 | 0.96 | 0.02 | 0.94 | 0.00 | 0.82 |
| Pick(on(cabinet_wood), plate) | 0.94 | 0.99 | 0.00 | 1.00 | 0.00 | 1.00 | 0.94 | 0.86 | 0.00 | 1.00 | 0.00 | 0.80 | 0.90 | 0.88 | 0.90 | 0.90 | 0.00 | 0.02 |
| Average | 0.98 | 0.97 | 0.00 | 0.97 | 0.00 | 0.89 | 0.97 | 0.95 | 0.00 | 0.97 | 0.00 | 0.60 | 0.98 | 0.97 | 0.20 | 0.97 | 0.01 | 0.46 |
Task-wise LIBERO and LIBERO-PRO Performance on libero-10 Benchmark
Action notation: TurnOn(obj) = activate object obj; Put(obj, loc) = place object obj onto/into location loc; Place(obj, loc) = pick up object obj and place it into/onto target loc; a trailing .close = close the container after the action completes.
| Task (Symbolic Form) | OpenVLA Ori | OpenVLA Obj | OpenVLA Pos | OpenVLA Sem | OpenVLA Task | OpenVLA Env | Pi0 Ori | Pi0 Obj | Pi0 Pos | Pi0 Sem | Pi0 Task | Pi0 Env | Pi0.5 Ori | Pi0.5 Obj | Pi0.5 Pos | Pi0.5 Sem | Pi0.5 Task | Pi0.5 Env |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TurnOn(stove) ∧ Put(moka_pot, stove) | 1.00 | 0.95 | 0.00 | 1.00 | 0.00 | 0.00 | 0.74 | 0.67 | 0.00 | 0.74 | 0.00 | 0.00 | 0.92 | 0.90 | 0.18 | 0.90 | 0.00 | 0.94 |
| Put(bowl_black, drawer_bottom(cabinet)).close | 0.78 | 0.98 | 0.00 | 1.00 | 0.00 | 0.98 | 0.94 | 0.93 | 0.00 | 0.94 | 0.00 | 0.78 | 0.98 | 0.96 | 0.58 | 0.98 | 0.02 | 0.86 |
| Put(mug_yellow_white, microwave).close | 1.00 | 0.98 | 0.00 | 0.94 | 0.00 | 0.96 | 0.88 | 0.74 | 0.00 | 0.88 | 0.00 | 0.00 | 0.98 | 0.98 | 0.00 | 1.00 | 0.00 | 0.46 |
| Put({moka_pot_1, moka_pot_2}, stove) | 0.92 | 0.64 | 0.00 | 0.92 | 0.00 | 0.78 | 0.22 | 0.28 | 0.00 | 0.22 | 0.00 | 0.00 | 0.98 | 0.98 | 0.00 | 0.98 | 0.02 | 0.26 |
| Put({alphabet_soup, cream_cheese}, basket) | 0.96 | 0.98 | 0.00 | 1.00 | 0.00 | 1.00 | 1.00 | 0.96 | 0.00 | 1.00 | 0.00 | 0.16 | 1.00 | 0.98 | 0.00 | 0.98 | 0.00 | 0.52 |
| Put({alphabet_soup, tomato_sauce}, basket) | 0.98 | 0.95 | 0.00 | 0.98 | 0.00 | 0.96 | 0.88 | 0.79 | 0.00 | 0.88 | 0.00 | 0.06 | 0.94 | 0.94 | 0.02 | 0.94 | 0.02 | 0.70 |
| Put({cream_cheese, butter}, basket) | 0.96 | 0.96 | 0.00 | 0.98 | 0.00 | 0.98 | 0.98 | 0.98 | 0.00 | 0.98 | 0.00 | 0.96 | 0.96 | 0.94 | 0.00 | 0.98 | 0.00 | 0.00 |
| Put((mug_white, plate_left) \| (mug_yellow_white, plate_right)) | 0.84 | 0.85 | 0.00 | 0.92 | 0.00 | 0.98 | 0.86 | 0.72 | 0.00 | 0.86 | 0.00 | 0.00 | 1.00 | 0.98 | 0.00 | 1.00 | 0.02 | 0.22 |
| Put((mug_white, plate) \| (pudding_choco, right_of(plate))) | 0.94 | 0.77 | 0.00 | 0.90 | 0.00 | 0.86 | 0.76 | 0.82 | 0.00 | 0.74 | 0.00 | 0.00 | 0.62 | 0.58 | 0.00 | 0.60 | 0.00 | 0.04 |
| Place(book, compartment_back(caddy)) | 1.00 | 1.00 | 0.00 | 0.98 | 0.00 | 1.00 | 0.94 | 0.98 | 0.00 | 0.94 | 0.00 | 0.76 | 0.94 | 0.94 | 0.00 | 0.94 | 0.00 | 0.56 |
| Average | 0.93 | 0.81 | 0.00 | 0.96 | 0.00 | 0.85 | 0.82 | 0.79 | 0.00 | 0.82 | 0.00 | 0.27 | 0.93 | 0.92 | 0.08 | 0.93 | 0.01 | 0.46 |
Task-wise LIBERO and LIBERO-PRO Performance on libero-object Benchmark
Action notation: Place(obj, loc) = pick up object obj and place it into/onto target loc.
| Task (Symbolic Form) | OpenVLA Ori | OpenVLA Obj | OpenVLA Pos | OpenVLA Sem | OpenVLA Task | OpenVLA Env | Pi0 Ori | Pi0 Obj | Pi0 Pos | Pi0 Sem | Pi0 Task | Pi0 Env | Pi0.5 Ori | Pi0.5 Obj | Pi0.5 Pos | Pi0.5 Sem | Pi0.5 Task | Pi0.5 Env |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Place(alphabet_soup, basket) | 1.00 | 0.97 | 0.00 | 0.96 | 0.00 | 0.00 | 0.98 | 0.98 | 0.00 | 0.88 | 0.00 | 0.06 | 0.96 | 0.96 | 0.00 | 0.94 | 0.00 | 0.68 |
| Place(bbq_sauce, basket) | 0.98 | 0.90 | 0.00 | 0.94 | 0.00 | 0.00 | 0.98 | 1.00 | 0.00 | 1.00 | 0.00 | 0.46 | 1.00 | 0.98 | 1.00 | 1.00 | 0.02 | 0.96 |
| Place(butter, basket) | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 1.00 | 0.98 | 0.00 | 1.00 | 0.00 | 0.00 | 0.96 | 0.98 | 0.54 | 0.98 | 0.00 | 0.00 |
| Place(chocolate_pudding, basket) | 0.98 | 1.00 | 0.00 | 0.98 | 0.00 | 0.00 | 0.98 | 0.96 | 0.00 | 1.00 | 0.00 | 0.18 | 1.00 | 0.96 | 0.00 | 0.94 | 0.02 | 0.82 |
| Place(cream_cheese, basket) | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 1.00 | 1.00 | 0.10 | 1.00 | 0.00 | 0.76 | 0.98 | 1.00 | 0.00 | 0.94 | 0.00 | 0.88 |
| Place(ketchup, basket) | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.96 | 0.84 | 0.00 | 0.92 | 0.00 | 0.24 | 0.98 | 0.96 | 0.20 | 0.96 | 0.02 | 0.92 |
| Place(milk, basket) | 1.00 | 0.98 | 0.00 | 0.96 | 0.00 | 0.00 | 1.00 | 0.84 | 0.00 | 1.00 | 0.00 | 0.16 | 1.00 | 1.00 | 0.00 | 0.98 | 0.00 | 0.80 |
| Place(orange_juice, basket) | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.92 | 1.00 | 0.00 | 1.00 | 0.00 | 0.62 | 0.98 | 1.00 | 0.00 | 0.96 | 0.02 | 0.28 |
| Place(salad_dressing, basket) | 0.94 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.98 | 0.98 | 0.10 | 1.00 | 0.00 | 0.00 | 1.00 | 0.98 | 0.00 | 0.98 | 0.00 | 1.00 |
| Place(tomato_sauce, basket) | 0.96 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.98 | 0.82 | 0.00 | 0.92 | 0.00 | 0.48 | 0.98 | 0.98 | 0.00 | 0.96 | 0.00 | 0.92 |
| Average | 0.99 | 0.98 | 0.00 | 0.98 | 0.00 | 0.00 | 0.98 | 0.94 | 0.00 | 0.90 | 0.00 | 0.29 | 0.98 | 0.98 | 0.17 | 0.96 | 0.01 | 0.73 |
Case Study
OpenVLA in LIBERO-GOAL
Pi0.5 in LIBERO-SPATIAL
Pi0 in LIBERO-10
OpenVLA in LIBERO-OBJECT
BibTeX
@article{Liberopro2025,
title={LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization},
author={Xueyang Zhou and Yangming Xu and Guiyao Tie and Yongchao Chen and Guowen Zhang and Duanfeng Chu and Pan Zhou and Lichao Sun},
journal={arXiv preprint arXiv:2510.03827},
year={2025},
url={https://arxiv.org/abs/2510.03827}
}