Tables to LaTeX: structure and content extraction from scientific tables

Publication: Article, Preprint, 27 Oct 2022
Embargo end date: 01 Jan 2022
Language: English
Publisher: Springer Science and Business Media LLC
Journal: International Journal on Document Analysis and Recognition (IJDAR), volume 26, pages 121-130 (ISSN: 1433-2833, eISSN: 1433-2825)

Authors: Pratik Kayal; Mrinal Anand; Harsh Desai; Mayank Singh

doi: 10.1007/s10032-022-00420-9 , 10.48550/arxiv.2210.17246

arXiv: 2210.17246

Abstract

Scientific documents contain tables that list important information in a concise fashion. Extracting structure and content from tables embedded in PDF research documents is a very challenging task because of visual features such as spanning cells and content features such as mathematical symbols and equations. Most existing table structure identification methods tend to ignore these academic writing features. In this paper, we adapt the transformer-based language modeling paradigm for scientific table structure and content extraction. Specifically, the proposed model converts a tabular image to its corresponding LaTeX source code. Overall, we outperform the current state-of-the-art baselines and achieve an exact match accuracy of 70.35% and 49.69% on table structure and content extraction, respectively. Further analysis demonstrates that the proposed models efficiently identify the number of rows and columns, the alphanumeric characters, the LaTeX tokens, and symbols.
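The exact match accuracy reported in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `tokenize_latex` and `exact_match_accuracy`, the whitespace tokenization, and the toy tables are all assumptions made for illustration. A prediction counts as correct only if its token sequence equals the reference token sequence in full.

```python
from typing import List, Sequence


def tokenize_latex(source: str) -> List[str]:
    """Split LaTeX source on whitespace.

    Assumption: a simple stand-in for a real LaTeX tokenizer, so that
    differences in spacing alone do not count as mismatches.
    """
    return source.split()


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Return the percentage of predictions whose tokens equal the reference tokens."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(
        tokenize_latex(pred) == tokenize_latex(ref)
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * matches / len(references)


# Hypothetical toy data: two reference tables and two model outputs.
references = [
    r"\begin{tabular}{cc} a & b \\ c & d \\ \end{tabular}",
    r"\begin{tabular}{c} x \\ \end{tabular}",
]
predictions = [
    r"\begin{tabular}{cc}  a & b \\ c & d \\ \end{tabular}",  # extra space only
    r"\begin{tabular}{c} y \\ \end{tabular}",                 # wrong cell content
]

score = exact_match_accuracy(predictions, references)
print(f"Exact match accuracy: {score:.2f}%")
```

In this toy example the first prediction differs from its reference only in spacing and is counted as a match, while the second has a wrong cell and is not, so one of two tables matches. Because a single wrong token makes the whole table count as incorrect, exact match is a strict metric.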

10 pages, published in IJDAR'22. arXiv admin note: text overlap with arXiv:2105.14426

Related Organizations

Indian Institute of Technology Gandhinagar
India

Keywords

FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Information Retrieval (cs.IR), Computer Science - Information Retrieval

wand software on GitHub
IsRelatedTo

Impact by BIP!

- Selected citations: 7. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.