<script type="text/javascript">
<!--
document.write('<div id="oa_widget"></div>');
document.write('<script type="text/javascript" src="https://www.openaire.eu/index.php?option=com_openaire&view=widget&format=raw&projectId=undefined&type=result"></script>');
-->
</script>

COPY SCRIPT

For further information contact us at helpdesk@openaire.eu

The IBEM Dataset: a large printed scientific image dataset for indexing and searching mathematical expressions

Name: The IBEM Dataset: a large printed scientific image dataset for indexing and searching mathematical expressions
Keywords: Mathematical Expression Recognition, Mathematical Expression Retrieval, Mathematical Expression Detection, Mathematical Expression Dataset

Research datakeyboard_double_arrow_right Dataset 23 May 2023 English Publisher:Zenodo

Authors: Anitei, Dan; Sánchez, Joan Andreu; Benedí, José Miguel;

doi: 10.5281/zenodo.11635134 , 10.5281/zenodo.7963703 , 10.5281/zenodo.7963702

The IBEM Dataset: a large printed scientific image dataset for indexing and searching mathematical expressions

- Summary
- Subjects
- Metrics

Abstract

The IBEM dataset consists of 600 documents with a total number of 8272 pages, containing 29603 isolated and 137089 embedded Mathematical Expressions (MEs). The objective of the IBEM dataset is to facilitate the indexing and searching of MEs in massive collections of STEM documents. The dataset was built by parsing the LaTeX source files of documents from the KDD Cup Collection. Several experiments can be carried out with the IBEM dataset ground-truth (GT): ME detection and extraction, ME recognition, etc. The dataset consists of the following files: “IBEM.json”: file containing the IBEM GT information. The data is firstly organized by pages, then by the type of expression (“embedded” or “displayed”), and lastly by the GT of each individual ME. For each ME we provide: xy page-level coordinates, reported as relative (%) to the width/height of the page image. “split” attribute indicating the number of fragments in which the ME has been split. MEs can be split over various lines, columns or pages. The LaTeX transcript of split MEs have been exactly replicated (entire LaTeX definition) for each fragment. “latex” original transcript as extracted from the LaTeX source files of the documents. This definition can contain user-defined macros. In order to be able to compile these expressions, each page includes the preamble of the source files containing the defined macros and the packages used by the authors of the documents. “latex_expand” transcript reconstructed from the output stream of the LuaLaTeX engine in which user-defined macros have been expanded. The transcript has the same visual representation as the original transcript, with the addition that the LaTeX definitions are tokenized, the order of sub/super script elements have been fixed, and matrices have been transformed to arrays. “latex_norm” transcript resulting from applying an extra normalization process to the “latex_expand” expression. This normalization process includes removing font information such as slant, style, and weight. “partitions/*.lst”: files containing list of pages forming the partition sets. “pages/*.jpg”: individual pages extracted from the documents. The dataset is partitioned into various sets as provided for the ICDAR 2021 Competition on Mathematical Formula Detection. The ground-truth related to this competition, which is included in this dataset version, can also be found here. More information about the competition can be found in the following paper: D. Anitei, J.A. Sánchez, J.M. Fuentes, R. Paredes, and J.M. Benedí. ICDAR 2021 Competition on Mathematical Formula Detection. In ICDAR, pages 783–795, 2021. For ME recognition tasks, we recommend rendering the “latex_expand” version of the formulae in order to create standalone expressions that have the same visual representation as MEs found in the original documents (see attached python script “extract_GT.py”). Extracting MEs from the documents based on coordinates is more complex, as special care is needed to concatenate the fragments of split expressions. Baseline results for ME recognition tasks will soon be made available.

This work has been partially supported by MCIN/AEI/10.13039/501100011033 under the grant PID2020-116813RB-I00; the Generalitat Valenciana under the FPI grant CIACIF/2021/313; and by the support of the Valencian Graduate School and Research Network of Artificial Intelligence.

Keywords

Mathematical Expression Recognition, Mathematical Expression Retrieval, Mathematical Expression Detection, Mathematical Expression Dataset

Impact byBIP!

	citations This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	0
	popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.	Average
	influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	Average
	impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.	Average

Usage byUsageCounts

visibility	views	25
download	downloads	10

25
views
10
downloads
Powered by

Found an issue? Give us feedback

visibility

download

Average