A Survey of Document Understanding and Question Answering

Documents (paper media, images, or electronic files containing textual and graphical information) are ubiquitous in daily office work, online dissemination, and government and enterprise workflows. Automatically parsing them, retrieving from them, and supporting decisions based on their content is the core demand of Document Intelligence. In real-world settings, a large portion of documents are first converted into images by scanning or photographing, so document processing typically starts from visual inputs. On the one hand, the system must accurately localize layout elements such as text blocks, tables, figures, and headings, and perform text detection and recognition. On the other hand, it must conduct higher-level semantic understanding and reasoning on top of the resulting structured inputs to answer queries, extract key information, and generate verifiable results. With advances in deep learning and large language models, the research focus has gradually expanded from "character-level recognition" to "document-level understanding and question answering" across regions, pages, and modalities, and industry has correspondingly developed evaluation demands that emphasize end-to-end capability and robustness. Following this evolution, this paper organizes the survey in the order "perception first, then reasoning, and finally end-to-end unified modeling", and aligns evaluation suites with methodological lineages.
Part I (Document Layout Analysis + OCR) focuses on layout parsing starting from page geometry. It covers detection- and segmentation-based layout element recognition, layout-aware models that incorporate layout information into pretrained representation learning, and robustness and cross-domain generalization under real-world document distributions. It further reviews the key technical points of OCR, from traditional pipelines to deep learning and lightweight deployment, emphasizing error propagation and engineering constraints when OCR serves as the "entry point" for downstream understanding tasks. Part II (Document Understanding and Question Answering) systematically summarizes three mainstream scenarios built on structured inputs: for tables, structure-aware representations, modular/executable reasoning, and verifiable question answering; for long texts, retrieval-augmented generation (RAG) and reasoning-driven retrieval; and for visually rich documents, multimodal pretraining and instruction alignment, the OCR-free evolution, and multi-page long-context modeling. Finally, we summarize the datasets and benchmarks associated with these two stages, covering task settings from single-page to multi-page documents and from closed sets to real-world enterprise documents, and providing reproducible comparative baselines and a systematic research roadmap for future studies.

Keywords

document understanding, document question answering, vision-language models
