Pchatbot: A Large-Scale Dataset for Personalized Chatbot

Publication: Article, Preprint, 11 Jul 2021

Embargo end date: 01 Jan 2020

Publisher: ACM

Journal: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval

Authors: Qian, Hongjin; Li, Xiaohe; Zhong, Hanxun; Guo, Yu; Ma, Yueyuan; Zhu, Yutao; Liu, Zhanliang; +2 Authors

doi: 10.1145/3404835.3463239, 10.48550/arxiv.2009.13284

arXiv: 2009.13284

Abstract

Natural language dialogue systems have attracted great attention recently. As many dialogue models are data-driven, high-quality datasets are essential to these systems. In this paper, we introduce Pchatbot, a large-scale dialogue dataset that contains two subsets collected from Weibo and judicial forums, respectively. To adapt the raw data to dialogue systems, we carefully normalize it through anonymization, deduplication, segmentation, and filtering. Pchatbot is significantly larger than existing Chinese dialogue datasets, which might benefit data-driven models. Moreover, current dialogue datasets for personalized chatbots usually contain several persona sentences or attributes. In contrast, Pchatbot provides anonymized user IDs and timestamps for both posts and responses. This enables the development of personalized dialogue models that learn implicit user personality directly from the user's dialogue history. Our preliminary experimental study benchmarks several state-of-the-art dialogue models to provide a comparison for future work. The dataset is publicly available on GitHub.

Camera-ready version, SIGIR 2021 (Resource Track). The dataset and code are available at https://github.com/qhjqhj00/Pchatbot
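
The abstract notes that anonymized user IDs and timestamps make it possible to learn from a user's dialogue history. As a rough illustration only, the sketch below groups post-response records by responder and orders them in time, so that the history available before a given response can be looked up. The `Response` record, its field names, and the integer timestamp are hypothetical and are not taken from the Pchatbot release; the actual file format should be checked in the repository.

```python
from collections import defaultdict
from typing import NamedTuple


class Response(NamedTuple):
    # All field names are hypothetical, chosen for illustration only.
    user_id: str     # anonymized ID of the responding user
    timestamp: int   # time of the response, assumed here to be an integer
    post: str        # the post being replied to
    response: str    # the user's reply


def build_user_histories(records):
    """Group records by user ID and sort each group chronologically."""
    histories = defaultdict(list)
    for rec in records:
        histories[rec.user_id].append(rec)
    for recs in histories.values():
        recs.sort(key=lambda r: r.timestamp)
    return dict(histories)


def history_before(histories, user_id, timestamp):
    """Return the user's records strictly earlier than `timestamp`."""
    return [r for r in histories.get(user_id, []) if r.timestamp < timestamp]
```

Restricting the history to records earlier than the current response is one way to avoid using future replies as context; whether the benchmarked models do this is not stated here.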

Related Organizations

Renmin University of China
China (People's Republic of)
University of Montreal
Canada

Keywords

FOS: Computer and information sciences, Computer Science - Computation and Language, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computation and Language (cs.CL)

Impact by BIP!

- Selected citations: 21. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.