Abstract
Solid geometry problem solving demands spatial mathematical reasoning that integrates spatial intelligence and symbolic reasoning. However, most existing multimodal mathematical reasoning benchmarks focus primarily on 2D plane geometry, rely on static datasets prone to data contamination and memorization, and evaluate models solely by final answers, overlooking the reasoning process. To address these limitations, we introduce DynaSolidGeo, the first dynamic benchmark for evaluating genuine spatial reasoning in Vision-Language Models (VLMs). Constructed through a semi-automatic annotation pipeline, DynaSolidGeo contains 503 expert-curated seed questions that can, in principle, dynamically generate an unbounded number of diverse multimodal text-visual instances. Beyond answer accuracy, we incorporate process evaluation based on expert-annotated reasoning chains to measure logical validity and causal coherence. Experiments across representative open-source and closed-source VLMs reveal large performance gaps, severe degradation in dynamic settings, and poor performance on tasks requiring high-level spatial intelligence, such as mental rotation and visualization.
Overview of the data annotation pipeline and the application of seed questions. Annotation: 1) Expert-Guided Parametrization and Visualization: Each source question is first parameterized into a JSON annotation and paired with a MATLAB visualization program. 2) Automated Python Program Synthesis: The pipeline then synthesizes parameterized Python programs that generate textual descriptions and MATLAB invocation commands. 3) Automated MATLAB Program Synthesis: Correspondingly, the pipeline synthesizes the parameterized versions of the MATLAB programs for figure and video rendering. 4) Expert Verification: Final human checks ensure the correctness and usability of the seed questions. Application: Given a random seed, each seed question is instantiated into a concrete question instance.
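To make the "random seed → question instance" step concrete, below is a minimal Python sketch of how a parameterized seed question could be instantiated. All file names, JSON fields, and helper names here are hypothetical illustrations of the idea, not the benchmark's actual interfaces.

# Minimal sketch (hypothetical schema and names) of instantiating one seed question from a random seed.
import json
import random

def instantiate(seed_question_path: str, seed: int) -> dict:
    """Sample concrete parameter values and fill the question template."""
    with open(seed_question_path, encoding="utf-8") as f:
        spec = json.load(f)  # expert-annotated parametrization of one source question

    rng = random.Random(seed)
    # Draw each free parameter from its annotated range, e.g. an edge length in [2, 6].
    params = {name: rng.randint(lo, hi) for name, (lo, hi) in spec["param_ranges"].items()}

    question_text = spec["question_template"].format(**params)  # textual description
    matlab_cmd = spec["matlab_template"].format(**params)       # command for figure/video rendering
    answer = eval(spec["answer_expression"], {}, params)        # ground-truth answer from the parameters

    return {"question": question_text, "matlab_command": matlab_cmd, "answer": answer}

# Two different seeds yield two distinct instances of the same seed question:
# instantiate("seed_q017.json", seed=1); instantiate("seed_q017.json", seed=2)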
Comparison of model performance on the Answer Accuracy (AA), Process Score (PS), and Process-Qualified Accuracy (PA) metrics. For the GPT-5 family, the LLaVA-OneVision-1.5 family, and GLM-4.5V, the PS and PA metrics are not reported, as these models either do not expose their reasoning traces via the API or do not produce explicit reasoning processes.
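As a rough illustration of how the three metrics relate, the sketch below aggregates them over a set of evaluated instances. It assumes PA counts an instance only when the final answer is correct and the expert-graded process score meets a threshold; the paper's exact definitions may differ, so the threshold and aggregation are illustrative assumptions only.

# Minimal sketch (assumed definitions) of aggregating AA, PS, and PA over evaluated instances.
def aggregate(results, process_threshold=1.0):
    n = len(results)
    aa = sum(r["answer_correct"] for r in results) / n   # Answer Accuracy
    ps = sum(r["process_score"] for r in results) / n    # mean Process Score
    pa = sum(r["answer_correct"] and r["process_score"] >= process_threshold
             for r in results) / n                       # Process-Qualified Accuracy (assumed criterion)
    return {"AA": aa, "PS": ps, "PA": pa}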
Comparison of model performance on Answer Accuracy (AA) between DynaSolidGeo and source questions.
Comparison of average output tokens for correct, incorrect, and all responses, alongside the corresponding performance.
Paper
BibTeX
@misc{wu2025dynasolidgeodynamicbenchmarkgenuine,
title={DynaSolidGeo: A Dynamic Benchmark for Genuine Spatial Mathematical Reasoning of VLMs in Solid Geometry},
author={Changti Wu and Shijie Lian and Zihao Liu and Lei Zhang and Laurence Tianruo Yang and Kai Chen},
year={2025},
eprint={2510.22340},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2510.22340},
}