Publication · Part of book or chapter of book, Article, Preprint · 01 Jan 2021 · Embargo end date: 01 Jan 2021 · English · Publisher: Springer International Publishing

Authors: Desai, Harsh; Kayal, Pratik; Singh, Mayank;

doi: 10.1007/978-3-030-86331-9_36, 10.48550/arxiv.2105.06400

arXiv: 2105.06400

TabLeX: A Benchmark Dataset for Structure and Content Information Extraction from Scientific Tables

Abstract

Information Extraction (IE) from the tables present in scientific articles is challenging due to complicated tabular representations and complex embedded text. This paper presents TabLeX, a large-scale benchmark dataset comprising table images generated from scientific articles. TabLeX consists of two subsets, one for table structure extraction and the other for table content extraction. Each table image is accompanied by its corresponding LaTeX source code. To facilitate the development of robust table IE tools, TabLeX contains images in different aspect ratios and in a variety of fonts. Our analysis sheds light on the shortcomings of current state-of-the-art table extraction models and shows that they fail on even simple table images. Finally, we experiment with an existing transformer-based baseline and report its performance scores. In contrast to static benchmarks, we plan to augment this dataset with more complex and diverse tables at regular intervals.
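
To make the distinction between the structure and content subsets concrete, the following is a minimal Python sketch that splits the body of a LaTeX `tabular` into structure tokens and cell texts. It is an illustration under stated assumptions, not the authors' preprocessing: the `CELL` placeholder, the `RULES` list, and the function name `split_structure_and_content` are all hypothetical, and the sketch ignores constructs such as `\multicolumn` and nested environments.

```python
import re

# Hypothetical placeholder standing in for cell text; not taken from the paper.
PLACEHOLDER = "CELL"
# Rule commands treated as structure tokens (an assumed, incomplete list).
RULES = ("\\hline", "\\toprule", "\\midrule", "\\bottomrule")


def split_structure_and_content(tabular_body: str):
    """Split a LaTeX tabular body into (structure tokens, cell texts)."""
    structure, contents = [], []
    # Rows are terminated by a double backslash.
    for row in tabular_body.split("\\\\"):
        row = row.strip()
        # Peel off any leading rule commands such as \hline.
        while True:
            for rule in RULES:
                if row.startswith(rule):
                    structure.append(rule)
                    row = row[len(rule):].strip()
                    break
            else:
                break  # no rule matched at the start of the row
        if not row:
            continue
        # Split on column separators, skipping escaped ampersands (\&).
        cells = [c.strip() for c in re.split(r"(?<!\\)&", row)]
        for i, cell in enumerate(cells):
            if i > 0:
                structure.append("&")
            structure.append(PLACEHOLDER)
            contents.append(cell)
        structure.append("\\\\")
    return structure, contents


body = r"\hline Model & Acc \\ \hline A & 0.9 \\ \hline"
structure_tokens, cell_texts = split_structure_and_content(body)
```

Here `structure_tokens` keeps only the layout (rules, column separators, row ends, and placeholders), while `cell_texts` holds the text a content extraction model would need to recover.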

Related Organizations

Indian Institute of Technology Gandhinagar
India
Indian Institute of Technology Dharwad
India
Indian Institutes of Technology
India

Keywords

FOS: Computer and information sciences, Computer Science - Machine Learning, Information Retrieval (cs.IR), Computer Science - Information Retrieval, Machine Learning (cs.LG)

5 Research products

pdfplumber (software on GitHub): IsRelatedTo
tabula-py (software on GitHub): IsRelatedTo
wand (software on GitHub): IsRelatedTo
CAMELoT (software on GitHub): IsRelatedTo
jiwer (software on GitHub): IsRelatedTo

Impact by BIP!

selected citations: 12. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.