Exploring the Impact of Negative Sampling on Patent Citation Recommendation

pcr_patents.csv is the main dataset. It was generated by randomly sampling patents from Google Patents using a Python library. It comprises around 250,000 US patents together with their titles, abstracts, and citations. Each patent has roughly 27 citations on average.

The zip file contains 3 different datasets for training and testing patent citation recommendation systems. These datasets were derived from the main dataset. They consist of around 1 million instances, comprising both positive and negative samples. pcr_cpc_negative_sample_data.csv contains negative samples generated based on CPC subclass codes. pcr_random_negative_sample_data.csv contains negative samples generated randomly. pcr_sem_sim_negative_sample_data_2.csv contains negative samples generated based on nearest-neighbor relations.
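To make the difference between the random and CPC-based strategies concrete, here is a minimal sketch on toy data. It does not reproduce the procedure used to build the files: the function name sample_negatives, its parameters, the toy patent IDs, and the reading of "based on CPC subclass codes" as "draw negatives from patents in the same CPC subclass" are all assumptions made for illustration.

```python
import random


def sample_negatives(citations, cpc_subclass, k, same_subclass_only=False, seed=0):
    """Draw up to k non-cited patents per citing patent (hypothetical helper).

    citations:     dict mapping a patent ID to the set of patent IDs it cites
    cpc_subclass:  dict mapping every patent ID to a CPC subclass code
    same_subclass_only: if True, restrict candidates to the citing patent's
                   subclass (an assumed reading of "CPC-based" sampling)
    """
    rng = random.Random(seed)
    all_ids = sorted(cpc_subclass)  # sorted for reproducible sampling
    negatives = {}
    for pid in sorted(citations):
        cited = citations[pid]
        candidates = [
            c for c in all_ids
            if c != pid
            and c not in cited
            and (not same_subclass_only or cpc_subclass[c] == cpc_subclass[pid])
        ]
        negatives[pid] = rng.sample(candidates, min(k, len(candidates)))
    return negatives


# Toy data, invented for this example only.
citations = {"A": {"B"}, "B": set(), "C": {"A"}}
cpc_subclass = {"A": "G06F", "B": "G06F", "C": "G06F", "D": "H04L"}

random_neg = sample_negatives(citations, cpc_subclass, k=1)
cpc_neg = sample_negatives(citations, cpc_subclass, k=1, same_subclass_only=True)
```

Under this assumption, the CPC-based negatives come from the same technical area as the citing patent and are therefore likely harder to distinguish from true citations than purely random ones. A nearest-neighbor variant would replace the subclass filter with a similarity ranking over text representations, which is not sketched here.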

Related Organizations

Leibniz Association
Germany
FIZ Karlsruhe – Leibniz Institute for Information Infrastructure
Germany

Keywords

patent citation, citation recommendation
