MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Keywords: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition

Authors: Yuan Yao; Tianyu Yu; Ao Zhang; Chongyi Wang; Junbo Cui; Hongji Zhu; Tianchi Cai; Haoyu Li; Weilin Zhao; Zhihui He; Qianyu Chen; Huarong Zhou; Zhensheng Zou; Haoye Zhang; Shengding Hu; Zhi Zheng; Jie Zhou; Jie Cai; Xu Han; Guoyang Zeng; Dahai Li; Zhiyuan Liu; Maosong Sun


arXiv.org e-Print Archive

Preprint . 2024

Data sources: arXiv.org e-Print Archive

https://dx.doi.org/10.48550/ar...

Article . 2024

License: CC BY

Data sources: Datacite

DBLP

Article

Data sources: DBLP


Publication type: Article, Preprint
Date: 01 Jan 2024
Embargo end date: 01 Jan 2024
Publisher: arXiv
Journal: CoRR, volume abs/2408.01800


doi: 10.48550/arxiv.2408.01800

arXiv: 2408.01800


Abstract

The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally reshaped the landscape of AI research and industry, shedding light on a promising path toward the next AI milestone. However, significant challenges still prevent MLLMs from being practical in real-world applications. The most notable challenge is the huge cost of running an MLLM with a massive number of parameters and extensive computation. As a result, most MLLMs need to be deployed on high-performing cloud servers, which greatly limits their use in mobile, offline, energy-sensitive, and privacy-protective scenarios. In this work, we present MiniCPM-V, a series of efficient MLLMs deployable on end-side devices. By integrating the latest MLLM techniques in architecture, pretraining and alignment, the latest MiniCPM-Llama3-V 2.5 has several notable features: (1) strong performance, outperforming GPT-4V-1106, Gemini Pro and Claude 3 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks, (2) strong OCR capability and 1.8M pixel high-resolution image perception at any aspect ratio, (3) trustworthy behavior with low hallucination rates, (4) multilingual support for 30+ languages, and (5) efficient deployment on mobile phones. More importantly, MiniCPM-V can be viewed as a representative example of a promising trend: the model sizes needed to achieve usable (e.g., GPT-4V) level performance are rapidly decreasing, while end-side computation capacity is growing fast. Together, these trends show that GPT-4V level MLLMs deployed on end devices are becoming increasingly possible, unlocking a wider spectrum of real-world AI applications in the near future.


Related research (5 research products, relation: IsRelatedTo)

- opencompass (software on GitHub)
- llama.cpp (software on GitHub)
- MiniCPM-o (software on GitHub)
- coyo-dataset (software on GitHub)
- ggml (software on GitHub)

Impact by BIP!

- Selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

