GameVerse : Can Vision-Language Models Learn from Video-based Reflection?

Kuan Zhang*, Dongchen Liu*, Qiyue Zhao, Jinkun Hou, Xinran Zhang, Qinlei Xie, Miao Liu, Yiming Li
*Equal Contribution    Corresponding Author
College of AI, Tsinghua University, China

Abstract

Human gameplay is a visually grounded interaction loop in which players act, reflect on failures, and watch tutorials to refine strategies. Can Vision-Language Models (VLMs) also learn from video-based reflection? We present GameVerse, a comprehensive video game benchmark that enables a reflective visual interaction loop. Moving beyond traditional fire-and-forget evaluations, it uses a novel reflect-and-retry paradigm to assess how VLMs internalize visual experience and improve policies. To facilitate systematic and scalable evaluation, we also introduce a cognitive hierarchical taxonomy spanning 15 globally popular games, a dual action space for both semantic and GUI control, and milestone evaluation using advanced VLMs to quantify progress. Our experiments show that VLMs succeed in simple tasks but struggle to generalize in complex games. Meanwhile, they benefit from video-based reflection in varied settings, and perform best by combining failure trajectories and expert tutorials—a training-free analogue to reinforcement learning (RL) plus supervised fine-tuning (SFT).

GameVerse Framework

Figure 1. Overview of GameVerse, which effectively probes the capability boundaries of VLMs in video game worlds. GameVerse supports a dual action space, enables human-like reflection by integrating failure and tutorial videos, and delivers process-oriented scores.

Method

Game Taxonomy

We construct a cognitive hierarchical taxonomy based on three cognitive axes: Image Structure (Grid/2D/3D), Temporal Dynamics (Real-time/Non-Real-time) and Causal Linearity (Linear/Non-linear), resulting in five distinct categories. GameVerse selects 15 globally popular video games distributed across these categories, with easy, medium, and hard difficulty tiers per category.

Markov Grid: Turn-based discrete state transitions with full observability.
Non-Real-time Linear: Turn-based progression with a fixed narrative path.
Non-Real-time Non-Linear: Turn-based mechanics with open-ended goals.
Real-time Linear: Continuous time constraints with singular objectives.
Real-time Non-Linear: Continuous dynamics with complex, branching objectives.
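The taxonomy above can be captured as a small lookup. This is a minimal sketch, not the benchmark's actual data schema; category names follow the results table, and `catalog_size` is a hypothetical helper.

```python
# The five categories and their defining traits, taken from the taxonomy above.
# Game-to-category assignments would be attached per game; this lookup is
# illustrative only.
GAME_CATEGORIES = {
    "Markov Grid":          "Turn-based discrete state transitions with full observability.",
    "Non-RT Linear":        "Turn-based progression with a fixed narrative path.",
    "Non-RT Non-Linear":    "Turn-based mechanics with open-ended goals.",
    "Real-time Linear":     "Continuous time constraints with singular objectives.",
    "Real-time Non-Linear": "Continuous dynamics with complex, branching objectives.",
}

DIFFICULTY_TIERS = ("easy", "medium", "hard")

def catalog_size(categories=GAME_CATEGORIES, tiers=DIFFICULTY_TIERS):
    """One game per (category, difficulty) cell yields the 15-game benchmark."""
    return len(categories) * len(tiers)
```

With one game per cell, the 5 categories and 3 difficulty tiers account for exactly the 15 games in GameVerse.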

Dual Action Space

To assess both high-level reasoning and low-level control, we define two action modes. Semantic Actions are high-level commands (e.g., "Position(1,3)") that test the agent's perception and reasoning. GUI Actions are low-level operations (e.g., KeyPress(A)) that test end-to-end precise visual control.
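The two modes can be modeled as distinct action types behind a single parser. This is a sketch under assumed conventions: `SemanticAction`, `GUIAction`, and `parse_action` are hypothetical names, and the real GameVerse interface may differ.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class SemanticAction:
    """High-level command, e.g. place a mark at a board position."""
    name: str
    args: tuple

@dataclass
class GUIAction:
    """Low-level GUI operation, e.g. a key press or mouse click."""
    kind: str       # e.g. "KeyPress", "Click", "Drag"
    payload: tuple

Action = Union[SemanticAction, GUIAction]

def parse_action(text: str) -> Action:
    """Parse a model output like 'Position(1,3)' or 'KeyPress(A)'."""
    head, _, rest = text.partition("(")
    args = tuple(a.strip() for a in rest.rstrip(")").split(",")) if rest else ()
    if head in {"KeyPress", "Click", "Drag"}:
        return GUIAction(kind=head, payload=args)
    return SemanticAction(name=head, args=args)
```

Keeping both modes behind one parser lets the same agent loop run in either setting, with only the prompt's action vocabulary changing.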

Video-Based Reflection and Learning

The paradigm consists of four steps: (1) Trial and Failure. The agent first attempts the game task. If it fails, the system records the sequence of visual observations leading to the negative outcome. (2) Expert Demonstration Retrieval. The system retrieves an expert walkthrough from online gameplay videos. (3) Visual Reflection. The VLM acts as a reflector. It takes both its failure trajectory and the expert demonstration as visual input. The model is prompted to contrast the two, analyzing the divergence in strategy and execution, and generating condensed empirical reflections. (4) Policy Update. These reflections are injected into the agent's system prompt, enabling it to re-attempt the task with the new context.
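The four steps above can be sketched as a single loop. All names here are illustrative stand-ins, not the paper's actual API: `play` wraps an environment rollout, `retrieve_tutorial` the walkthrough lookup, and `reflect` a VLM call that contrasts the two videos.

```python
def reflect_and_retry(play, retrieve_tutorial, reflect, task, max_retries=1):
    """Training-free reflect-and-retry: accumulate reflections across attempts."""
    notes = []                                                 # condensed reflections
    for _ in range(max_retries + 1):
        success, failure_video = play(task, context=notes)     # (1) trial and failure
        if success:
            return True, notes
        tutorial_video = retrieve_tutorial(task)               # (2) expert demonstration
        # (3) visual reflection: contrast failure trajectory vs. expert demo
        notes.append(reflect(failure_video, tutorial_video))
        # (4) policy update: notes are injected into the system prompt on the
        # next attempt via `context=notes`; no weights are changed.
    return False, notes
```

Because the policy update is purely textual, the loop applies unchanged to closed-source models accessed through an API.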

Scalable Evaluation Protocol

We introduce a milestone scoring pipeline that leverages advanced VLMs to quantify agent progress purely from pixels. Phase 1: Offline Milestone Detection. We employ an advanced VLM to watch expert walkthrough videos and extract structured milestone references. Phase 2: Online Scoring. Following gameplay, an advanced VLM analyzes the playthrough video, comparing the agent's trajectory against the reference milestones to calculate a process-oriented score. This method scales easily to closed-source commercial games where internal state information is inaccessible.
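Assuming the Phase 1 judge emits a list of reference milestones and the Phase 2 judge returns which of them the agent's playthrough reached, the process-oriented score reduces to a simple ratio. This sketch abstracts both VLM calls away; the function name is hypothetical.

```python
def milestone_score(reference_milestones, achieved_milestones):
    """Process-oriented score: percentage of reference milestones the agent hit.

    `reference_milestones` comes from Phase 1 (offline extraction from an
    expert walkthrough video); `achieved_milestones` from Phase 2 (a judge VLM
    analyzing the agent's playthrough video).
    """
    if not reference_milestones:
        return 0.0
    achieved = set(achieved_milestones)
    hit = sum(1 for m in reference_milestones if m in achieved)
    return 100.0 * hit / len(reference_milestones)
```

Scoring against milestones rather than final outcomes gives partial credit for progress, which is what makes the protocol viable for commercial games with no accessible internal state.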

GameVerse Architecture

Nine milestones are extracted by Gemini-3-pro from the walkthrough video of Ace Attorney.

Evaluation

We demonstrate VLMs' performance on GameVerse across 15 globally popular video games. Each video shows a gameplay trajectory captured in real time. Videos are arranged in 5 rows (cognitive tiers) and 3 columns (difficulty tiers), and are accelerated for display.

Models vs. Human Comparison

Comparing 2 VLM models and 1 human player across 5 games: Baba Is You, Angry Birds, Scene Investigators, Plants vs. Zombies, Red Dead Redemption 2.

Players: Qwen3-VL-32B-instruct, Gemini-2.5-Pro, Human

Video-based Reflection Comparison

Comparison of gameplay before and after Video-based Reflection (VR) implementation. The video-based reflection paradigm enables agents to refine gameplay by observing failures and expert tutorials. The improvement demonstrates how VLMs can learn from visual experience and improve policies through the reflect-and-retry paradigm. Representative comparison games: Tic-Tac-Toe, Angry Birds, Plants vs. Zombies.

Tic-Tac-Toe
Before VR
After VR

Case Study: Defensive Line Blocking (Qwen3-VL-32B)

Strategy Evolution: Qwen3-VL-32B transitioned from a purely offensive playstyle to a threat-aware defensive strategy. When facing a column-1 threat (X at A1 + C1), it shifted from placing an aggressive C3 to strategically blocking with B1, successfully forcing a draw against a Minimax AI opponent.

Key Insight (Prior Knowledge): "Aim for a draw by mirroring their moves or blocking their potential lines. Avoid leaving open lines without countermeasures."

Analysis: By internalizing the concepts of "blocking potential lines" and "countering open threats," Qwen3-VL-32B successfully identified the opponent's vertical winning condition. This shift prioritized structural defense over expansion, effectively neutralizing a high-risk threat into a stable draw.

Angry Birds
Before VR
After VR

Case Study: Strategic Trajectory Planning (GPT-4o-mini)

Strategy Evolution: GPT-4o-mini transitioned from inefficient direct fire to a sophisticated parabolic trajectory strategy. Instead of repeatedly striking the frontal obstacle, it now aims for the wooden support structure behind the mound, leveraging gravity to trigger a structural collapse.

Key Insight (Prior Knowledge): "Use a slightly upward angle to allow the bird to arc over obstacles and strike the structure at its weakest point."

Analysis: By internalizing the concept of "arcing over obstacles," GPT-4o-mini successfully bypassed terrain constraints and identified the structural vulnerability, significantly improving overall task efficiency.

Plants vs. Zombies
Before VR
After VR

Case Study: Plant Placement Correction (GPT-4o-mini)

Strategy Evolution: GPT-4o-mini transitioned from repeatedly dragging plants to invalid positions to correctly identifying plantable tiles. Before reflection, the model kept attempting to place a second Peashooter on non-plantable spots, wasting all available steps without ever starting the actual battle. After reflection, it successfully placed the plant on a valid tile, allowing the game to proceed normally.

Key Insight (Prior Knowledge): "Space plants strategically: Place Peashooters with at least one tile gap between them to allow room for additional defensive plants like Wall-nuts or more Peashooters later."

Analysis: By internalizing the concept of "strategic tile selection," GPT-4o-mini shifted its attention from blindly dragging plants to consciously identifying valid, unoccupied tiles on the lawn, resolving the repeated placement failure and enabling successful game progression.

Leaderboard and Results

We evaluate the performance of 7 VLMs (Qwen3-VL-8B/32B, GPT-4o/4o-mini, Gemini-2.5-Pro/Flash, Seed-1.8) across 15 GameVerse games, comparing them against Random, Human Rookie, and Human Expert baselines.

Table 1. Performance of 7 VLMs on 15 GameVerse games in GUI mode, with and without Video-based Reflection (VR). Green/red cells indicate performance gain/drop after applying VR (darker = larger change). Bold = top across models and human baselines.
Games are grouped by cognitive category: Markov Grid (TicTacToe, Baba, 2048); Non-RT Linear (Maze, AngryBird, Slay); Non-RT Non-Linear (Attorney, Civilization, Scene); Real-time Linear (Snake, PvZ, Horizon); Real-time Non-Linear (Metro, Genshin, RDR2).

Three Baselines

| Model | VR | TicTacToe | Baba | 2048 | Maze | AngryBird | Slay | Attorney | Civilization | Scene | Snake | PvZ | Horizon | Metro | Genshin | RDR2 | Avg. Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random | – | 26.3 | 10.8 | 1.2 | 21.3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 16.5 |
| Human Expert | – | 98.9 | 100 | 77.1 | 100 | 85.3 | 99.4 | 100 | 100 | 94.2 | 100 | 100 | 98.1 | 100 | 100 | 100 | 1.6 |
| Human Rookie | No | 85.1 | 83.4 | 15.9 | 99.3 | 46.4 | 34.2 | 77.4 | 54.1 | 70.4 | 93.2 | 89.3 | 73.9 | 91.2 | 63.3 | 58.4 | 4.1 |
| Human Rookie | Yes | 99.4 | 100 | 47.3 | 100 | 74.2 | 81.3 | 100 | 96.4 | 90.1 | 100 | 97.9 | 85.3 | 98.4 | 88.2 | 83.1 | 2.2 |

Seven Vision-Language Models

| Model | VR | TicTacToe | Baba | 2048 | Maze | AngryBird | Slay | Attorney | Civilization | Scene | Snake | PvZ | Horizon | Metro | Genshin | RDR2 | Avg. Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-8B | No | 53.3 | 60.0 | 4.4 | 80.3 | 19.6 | 11.7 | 7.1 | 0.0 | 6.3 | 2.2 | 33.4 | 3.2 | 7.8 | 14.3 | 7.2 | 12.4 |
| Qwen3-VL-8B | Yes | 51.8 | 70.4 | 5.1 | 89.4 | 31.2 | 7.4 | 26.1 | 0.0 | 6.3 | 0.0 | 51.3 | 3.2 | 8.3 | 14.3 | 7.2 | 11.6 |
| Qwen3-VL-32B | No | 70.6 | 73.3 | 6.4 | 100 | 41.6 | 8.2 | 17.3 | 8.3 | 8.3 | 42.6 | 41.2 | 3.2 | 5.4 | 14.3 | 7.2 | 9.5 |
| Qwen3-VL-32B | Yes | 63.1 | 80.0 | 6.1 | 100 | 59.4 | 11.3 | 37.2 | 8.3 | 8.3 | 33.4 | 48.3 | 3.2 | 11.2 | 14.3 | 10.0 | 7.8 |
| GPT-4o-mini | No | 34.4 | 60.0 | 1.3 | 77.1 | 16.7 | 2.0 | 0.0 | 0.0 | 3.2 | 0.0 | 27.6 | 3.2 | 4.2 | 14.3 | 10.0 | 14.1 |
| GPT-4o-mini | Yes | 43.2 | 68.3 | 1.4 | 91.3 | 13.3 | 2.0 | 0.0 | 0.0 | 3.2 | 0.0 | 22.4 | 3.2 | 7.1 | 14.3 | 10.0 | 13.6 |
| GPT-4o | No | 58.6 | 60.9 | 3.8 | 80.2 | 14.3 | 2.0 | 18.2 | 0.0 | 0.0 | 0.0 | 32.2 | 2.4 | 11.2 | 14.3 | 10.0 | 13.3 |
| GPT-4o | Yes | 64.2 | 60.0 | 3.4 | 85.1 | 15.4 | 2.0 | 33.3 | 0.0 | 0.0 | 0.0 | 36.4 | 3.2 | 14.3 | 14.3 | 10.0 | 12.3 |
| Seed-1.8 | No | 92.3 | 80.0 | 9.1 | 87.8 | 29.4 | 39.6 | 33.3 | 22.2 | 3.2 | 7.2 | 26.1 | 5.4 | 1.3 | 14.3 | 10.0 | 9.3 |
| Seed-1.8 | Yes | 100 | 80.0 | 18.2 | 100 | 57.2 | 28.4 | 59.3 | 19.5 | 8.3 | 10.4 | 22.1 | 2.4 | 1.4 | 14.3 | 7.2 | 8.4 |
| Gemini-2.5-Flash | No | 88.2 | 80.0 | 7.8 | 100 | 9.4 | 2.0 | 33.3 | 0.0 | 3.2 | 20.4 | 51.1 | 5.4 | 14.6 | 14.3 | 10.0 | 9.3 |
| Gemini-2.5-Flash | Yes | 95.1 | 80.0 | 11.4 | 100 | 10.2 | 2.0 | 29.3 | 0.0 | 8.3 | 10.4 | 55.3 | 5.4 | 19.1 | 14.3 | 13.4 | 8.1 |
| Gemini-2.5-Pro | No | 90.0 | 80.0 | 13.2 | 100 | 32.7 | 4.4 | 37.2 | 0.0 | 0.0 | 24.2 | 33.4 | 3.2 | 11.3 | 14.3 | 10.0 | 9.2 |
| Gemini-2.5-Pro | Yes | 100 | 80.0 | 26.4 | 100 | 42.2 | 7.9 | 48.3 | 0.0 | 0.0 | 68.4 | 30.3 | 5.4 | 18.2 | 14.3 | 10.0 | 7.5 |

Key Findings

Performance Hierarchy: Gemini-2.5-Pro achieves the highest average rank among all VLMs (7.5 with VR / 9.2 without), followed by Gemini-2.5-Flash (8.1/9.3), Seed-1.8 (8.4/9.3), and Qwen3-VL-32B (7.8/9.5).

The Rich-Get-Richer Effect: Reflection generally yields improvements, but its efficacy is non-uniform. Gains scale positively with model capability (Gemini-2.5-Pro improves 6.47% vs. GPT-4o-mini with 1.60%) and negatively with cognitive complexity (dropping from 4.4% in Non-Real-time to 1.7% in Real-time).

The Generalization Gap: Human players demonstrate remarkable generalization across all game types, while VLM agents exhibit severe degradation as complexity increases. In Easy games like Tic-Tac-Toe, Gemini-2.5-Pro achieves perfect scores (100), matching the human expert baseline (98.9). In Hard games like Scene Investigators or RDR2, model performance collapses to 0, falling well short of even the human rookie (58.4–70.4).

The Knowing-Doing Gap: Models average 50.5 in semantic mode, consistently outperforming the GUI-mode average of 33.5. While current VLMs possess strong reasoning capabilities for high-level planning, they still struggle with the visual grounding required to translate these plans into precise pixel coordinates.

Video-Based Reflection Benefits: VLMs benefit from video-based reflection in varied settings, and perform best by combining failure trajectories and expert tutorials—a training-free analogue to reinforcement learning (RL) plus supervised fine-tuning (SFT).

BibTeX

@article{gameverse2026,
      title={GameVerse: Can Vision-Language Models Learn from Video-based Reflection?}, 
      author={Zhang, Kuan and Liu, Dongchen and Zhao, Qiyue and Hou, Jinkun and Zhang, Xinran and Xie, Qinlei and Liu, Miao and Li, Yiming},
      journal={arXiv},
      year={2026},
      url={https://arxiv.org/abs/2603.06656}
}

Background picture in hero section generated by Nano Banana