DynaSolidGeo: A Dynamic Benchmark for Genuine Spatial Mathematical Reasoning of VLMs in Solid Geometry

Changti Wu*,1,2, Shijie Lian*,3,2, Zihao Liu4,2, Lei Zhang✉️,1, Laurence Tianruo Yang5,3, Kai Chen✉️,6,2
1East China Normal University, 2Zhongguancun Academy,
3Huazhong University of Science and Technology,
4Peking University, 5Zhengzhou University,
6Zhongguancun Institute of Artificial Intelligence

*These authors contributed equally

✉️Corresponding authors

Abstract

Solid geometry problem solving demands spatial mathematical reasoning that integrates spatial intelligence and symbolic reasoning. However, most existing multimodal mathematical reasoning benchmarks focus primarily on 2D plane geometry, rely on static datasets prone to data contamination and memorization, and evaluate models solely by final answers, overlooking the reasoning process. To address these limitations, we introduce DynaSolidGeo, the first dynamic benchmark for evaluating genuine spatial reasoning in Vision-Language Models (VLMs). Constructed through a semi-automatic annotation pipeline, DynaSolidGeo contains 503 expert-curated seed questions that can, in principle, dynamically generate an unbounded number of diverse multimodal text-visual instances. Beyond answer accuracy, we incorporate process evaluation based on expert-annotated reasoning chains to measure logical validity and causal coherence. Experiments across representative open-source and closed-source VLMs reveal large performance gaps, severe degradation in dynamic settings, and poor performance on tasks requiring high-level spatial intelligence, such as mental rotation and visualization.


Overview of the data annotation pipeline and the application of seed questions. Annotation: 1) Expert-Guided Parametrization and Visualization: Each source question is first parameterized into a JSON annotation and paired with a MATLAB visualization program. 2) Automated Python Program Synthesis: The pipeline then synthesizes parameterized Python programs that generate textual descriptions and MATLAB invocation commands. 3) Automated MATLAB Program Synthesis: Correspondingly, the pipeline synthesizes parameterized versions of the MATLAB programs for figure and video rendering. 4) Expert Verification: Final human checks ensure the correctness and usability of the seed questions. Application: Given a random seed, each seed question is instantiated into a concrete question instance.
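
To make the instantiation step concrete, below is a minimal Python sketch of how a parameterized seed question could be turned into a question instance from a random seed. The JSON schema, field names, and the instantiate helper are illustrative assumptions, not the paper's actual implementation; the MATLAB rendering call is only indicated by a comment.

import json
import random

# Hypothetical seed-question annotation; the field names and schema are
# assumptions for illustration, not DynaSolidGeo's actual JSON format.
SEED_QUESTION = json.loads("""
{
  "template": "A cube has edge length {a}. Find the distance from a vertex to the center of the opposite face.",
  "params": {"a": {"min": 2, "max": 9}},
  "answer_expr": "(a**2 + (a/2)**2 + (a/2)**2) ** 0.5"
}
""")

def instantiate(seed_question: dict, seed: int) -> dict:
    """Sample parameter values from the annotated ranges, fill the question
    template, and evaluate the expert-written ground-truth expression."""
    rng = random.Random(seed)
    values = {name: rng.randint(spec["min"], spec["max"])
              for name, spec in seed_question["params"].items()}
    question = seed_question["template"].format(**values)
    # The expression comes from a trusted expert annotation, so eval is acceptable here.
    answer = eval(seed_question["answer_expr"], {}, dict(values))
    # A full pipeline would also emit the MATLAB invocation command here to
    # render the figure/video for exactly these parameter values.
    return {"question": question, "answer": round(answer, 4), "params": values}

print(instantiate(SEED_QUESTION, seed=42))

Because each seed yields fresh parameter values (and a matching figure), memorized answers to the static seed question do not transfer to its instances.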


Comparison of model performance on the Answer Accuracy (AA), Process Score (PS), and Process-Qualified Accuracy (PA) metrics. For the GPT-5 family, the LLaVA-OneVision-1.5 family, and GLM-4.5V, the PS and PA metrics are not reported, as these models either do not expose their reasoning traces via API or do not produce explicit reasoning processes at all.
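
For readers parsing the three metrics, the sketch below shows one plausible relationship among them; the exact definitions, and in particular the pass criterion linking PS to PA, are assumptions rather than the paper's stated formulas.

from dataclasses import dataclass

@dataclass
class Result:
    answer_correct: bool  # final answer matches the ground truth
    process_score: float  # reasoning-chain score in [0, 1]

def metrics(results: list[Result], pass_threshold: float = 1.0) -> dict:
    """Assumed definitions: AA is plain answer accuracy, PS is the mean
    process score, and PA credits an instance only when the final answer
    is correct AND its reasoning process passes the threshold."""
    n = len(results)
    aa = sum(r.answer_correct for r in results) / n
    ps = sum(r.process_score for r in results) / n
    pa = sum(r.answer_correct and r.process_score >= pass_threshold
             for r in results) / n
    return {"AA": aa, "PS": ps, "PA": pa}

# A correct answer reached through flawed reasoning raises AA but not PA.
print(metrics([Result(True, 1.0), Result(True, 0.4), Result(False, 0.7)]))

Under these assumed definitions, PA is always at most AA, which is why a gap between the two signals answers that are right for the wrong reasons.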

Paper

BibTeX

@misc{wu2025dynasolidgeodynamicbenchmarkgenuine,
  title={DynaSolidGeo: A Dynamic Benchmark for Genuine Spatial Mathematical Reasoning of VLMs in Solid Geometry},
  author={Changti Wu and Shijie Lian and Zihao Liu and Lei Zhang and Laurence Tianruo Yang and Kai Chen},
  year={2025},
  eprint={2510.22340},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2510.22340},
}