PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents

Keywords: FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Vision and Pattern Recognition (cs.CV), Computation and Language (cs.CL), Multimedia (cs.MM)

Wang, Junjie; Zhang, Yuxiang; Liu, Minghao; Zhang, Yin; Ji, Yatai; Xuan, Weihao; Lin, Nie; Zhu, Kang; Lin, Zhiqiang; Ren, Yiming; Jiang, Chunyang; Yu, Yiyao; Wang, Zekun; Wang, Tiezhen; Huang, Wenhao; Fu, Jie; Lin, Qunshu; Yang, Yujiu; Zhang, Ge; Yuan, Ruibin; Chen, Bei; Chen, Wenhu


arXiv.org e-Print Archive

Preprint . 2024

Data sources: arXiv.org e-Print Archive

https://dx.doi.org/10.48550/ar...

Article . 2024

License: arXiv Non-Exclusive Distribution

Data sources: Datacite

DBLP

Article

Data sources: DBLP


Publication: Article, Preprint; 01 Jan 2024; Embargo end date: 01 Jan 2024; Publisher: arXiv; Journal: CoRR, volume abs/2406.13923


doi: 10.48550/arxiv.2406.13923

arXiv: 2406.13923


Abstract

Recent advancements in large multimodal models (LMMs) have leveraged extensive multimodal datasets to enhance capabilities in complex knowledge-driven tasks. However, persistent challenges in perceptual and reasoning errors limit their efficacy, particularly in interpreting intricate visual data and deducing multimodal relationships. To address these issues, we introduce PIN (Paired and INterleaved multimodal documents), a novel data format designed to foster a deeper integration of visual and textual knowledge. The PIN format uniquely combines semantically rich Markdown files, which preserve fine-grained textual structures, with holistic overall images that capture the complete document layout. Following this format, we construct and release two large-scale, open-source datasets: PIN-200M (~200 million documents) and PIN-14M (~14 million), compiled from diverse web and scientific sources in both English and Chinese. To maximize usability, we provide detailed statistical analyses and equip the datasets with quality signals, enabling researchers to easily filter and select data for specific tasks. Our work provides the community with a versatile data format and substantial resources, offering a foundation for new research in pre-training strategies and the development of more powerful knowledge-intensive LMMs.
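The pairing described in the abstract (a Markdown file plus one overall image per document, accompanied by quality signals for filtering) can be illustrated with a minimal Python sketch. The field names, the signal names `text_length` and `image_ratio`, and the helper `filter_by_quality` are all hypothetical and chosen only for illustration; the released datasets may use a different schema.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class PinDocument:
    """One paired sample: Markdown text plus a rendering of the whole document.

    All field names here are hypothetical, not the released schema.
    """
    markdown: str          # text that keeps fine-grained structure (headings, lists)
    overall_image: str     # path to the image capturing the full document layout
    language: str          # e.g. "en" or "zh"
    quality_signals: Dict[str, float] = field(default_factory=dict)


def filter_by_quality(
    docs: Iterable[PinDocument],
    min_signals: Dict[str, float],
) -> List[PinDocument]:
    """Keep documents whose every requested signal meets its minimum.

    A document that lacks a requested signal is dropped.
    """
    kept = []
    for doc in docs:
        if all(
            name in doc.quality_signals and doc.quality_signals[name] >= threshold
            for name, threshold in min_signals.items()
        ):
            kept.append(doc)
    return kept


# Hypothetical records; the signal names are assumptions for illustration.
sample = [
    PinDocument("# Title\n\nBody text.", "doc_0.jpg", "en",
                {"text_length": 120.0, "image_ratio": 0.4}),
    PinDocument("# 标题", "doc_1.jpg", "zh",
                {"text_length": 8.0, "image_ratio": 0.9}),
    PinDocument("Plain text", "doc_2.jpg", "en", {}),
]
selected = filter_by_quality(sample, {"text_length": 50.0})
print([d.overall_image for d in selected])  # ['doc_0.jpg']
```

Keeping the signals in a plain dictionary means new signals can be attached without changing the record type, at the cost of having to handle missing keys explicitly, as the filter does here.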

Technical report v1.0

Related Organizations

Tsinghua University
China (People's Republic of)
Independent Researcher
United Kingdom



Related research products (4)

- pdf2image, software on GitHub (IsRelatedTo)
- s2orc-doc2json, software on GitHub (IsRelatedTo)
- engrafo, software on GitHub (IsRelatedTo)
- coyo-dataset, software on GitHub (IsRelatedTo)

Impact by BIP!

- selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.


Related to Research communities

Knowmad Institut
