<script type="text/javascript">
<!--
document.write('<div id="oa_widget"></div>');
document.write('<script type="text/javascript" src="https://www.openaire.eu/index.php?option=com_openaire&view=widget&format=raw&projectId=undefined&type=result"></script>');
-->
</script>
Our datasets, both released and evaluation set, are derived from the YFCC100M Dataset. Each dataset comprises vectors encoded from images using the CLIP model, which are then reduced to 100 dimensions using Principal Component Analysis (PCA). Additionally, categorical and timestamp attributes are selected from the metadata of the images. The categorical attribute is discretized into integers starting from 0, and the timestamp attribute is normalized into floats between 0 and 1. For each query, a query type is randomly selected from four possible types, denoted by the numbers 0 to 3. Then, we randomly choose two data points from dataset D, utilizing their categorical attribute (C) timestamp attribute (T), and vectors, to determine the values of the query. Specifically: Randomly sample two data points from D. Use the categorical value of the first data point as v for the equality predicate over the categorical attribute C. Use the timestamp attribute values of the two sampled data points for the range predicate. Designate l as the smaller timestamp value and r as the larger. The range predicate is thus defined as l≤T≤r. Use the vector of the first data point as the query vector. If the query type does not involve v, l, or r, their values are set to -1. We assure that at least 100 data points in D meet the query limit. Dataset Structure Dataset D is in a binary format, beginning with a 4-byte integer num_vectors (uint32_t) indicating the number of vectors. This is followed by data for each vector, stored consecutively, with each vector occupying 102 (2 + vector_num_dimension) x sizeof(float32) bytes, summing up to num_vectors x 102 (2 + vector_num_dimension) x sizeof(float32) bytes in total. Specifically, for the 102 dimensions of each vector: the first dimension denotes the discretized categorical attribute C and the second dimension denotes the normalized timestamp attribute T. The rest 100 dimensions are the vector. Query Set Structure Query set Q is in a binary format, beginning with a 4-byte integer num_queries (uint32_t) indicating the number of queries. This is followed by data for each query, stored consecutively, with each query occupying 104 (4 + vector_num_dimension) x sizeof(float32) bytes, summing up to num_queries x 104 (4 + vector_num_dimension) x sizeof(float32) bytes in total. The 104-dimensional representation for a query is organized as follows: The first dimension denotes query_type (takes values from 0, 1, 2, 3). The second dimension denotes the specific query value v for the categorical attribute (if not queried, takes -1). The third dimension denotes the specific query value l for the timestamp attribute (if not queried, takes -1). The fourth dimension denotes the specific query value r for the timestamp attribute (if not queried, takes -1). The rest 100 dimensions are the query vector. There are four types of queries, i.e., the query_type takes values from 0, 1, 2 and 3. The 4 types of queries correspond to: If query_type=0: Vector-only query, i.e., the conventional approximate nearest neighbor (ANN) search query. If query_type=1: Vector query with categorical attribute constraint, i.e., ANN search for data points satisfying C=v. If query_type=2: Vector query with timestamp attribute constraint, i.e., ANN search for data points satisfying l≤T≤r. If query_type=3: Vector query with both categorical and timestamp attribute constraints, i.e. ANN search for data points satisfying C=v and l≤T≤r. The predicate for the categorical attribute is an equality predicate, i.e., C=v. And the predicate for the timestamp attribute is a range predicate, i.e., l≤T≤r. Originally provided on https://dbgroup.cs.tsinghua.edu.cn/sigmod2024/task.shtml?content=datasets .
citations This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 0 | |
popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Average | |
influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average | |
impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |