CONSTRUCTURE: Benchmarking CONcept STRUCTUre REasoning for Multimodal Large Language Models
Name of the Dataset: CONSTRUCTURE
Dataset Introduction: CONSTRUCTURE is a concept-level benchmark for evaluating the hierarchical concept understanding and reasoning abilities of Multimodal Large Language Models (MLLMs). It assesses MLLMs across four key aspects: atomic concept understanding, upward abstraction reasoning, downward concretization reasoning, and multi-hop reasoning. Evaluation results show that even state-of-the-art models struggle with these tasks (e.g., GPT-4o scores 62.1%). The benchmark contains 10,329 samples, split into training (7,234), validation (1,031), and test (2,064) sets in a roughly 7:1:2 ratio.
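As a rough illustration of how such a split might be consumed, the Python sketch below loads a test-set file and reports overall and per-aspect accuracy for an arbitrary predictor. Note that the file name (constructure_test.json) and the field names (answer, aspect) are assumptions for illustration only; the benchmark's actual release format may differ.

```python
# Minimal evaluation sketch for CONSTRUCTURE-style samples.
# ASSUMPTIONS: "constructure_test.json" holds a JSON list of sample dicts
# with an "answer" field (gold option label) and an "aspect" field naming
# one of the four reasoning aspects; these names are hypothetical.
import json
from collections import defaultdict

def evaluate(predict, path="constructure_test.json"):
    """Report overall and per-aspect accuracy.

    `predict` is any callable mapping a sample dict to a predicted
    option label, e.g. a wrapper around an MLLM API call.
    """
    with open(path, encoding="utf-8") as f:
        samples = json.load(f)

    correct = defaultdict(int)
    total = defaultdict(int)
    for s in samples:
        aspect = s.get("aspect", "unknown")  # e.g. atomic / abstraction / ...
        total[aspect] += 1
        if predict(s) == s["answer"]:
            correct[aspect] += 1

    for aspect in sorted(total):
        print(f"{aspect}: {correct[aspect] / total[aspect]:.1%} "
              f"({correct[aspect]}/{total[aspect]})")
    overall = sum(correct.values()) / sum(total.values())
    print(f"overall: {overall:.1%}")
```

Reporting per-aspect accuracy alongside the overall score mirrors the benchmark's four-way evaluation design, making it easy to see which reasoning aspect a model fails on.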
Download Link: https://aclanthology.org/2024.findings-emnlp.285/
Relevant Paper: https://aclanthology.org/2024.findings-emnlp.285.pdf