
DSEBench

DSEBench is a test collection designed to support the evaluation of Dataset Search with Examples (DSE), a task that generalizes two established paradigms: keyword-based dataset search and similarity-based dataset discovery. Given a textual query q and a set of target datasets Dt known to be relevant, the goal of DSE is to retrieve a ranked list Dc of candidate datasets that are both relevant to q and similar to the datasets in Dt. As an extension, Explainable DSE further requires identifying, for each result dataset d∈Dc, a subset of metadata or content fields that explain its relevance to q and similarity to Dt.

This repository contains the datasets, queries, and relevance judgments. For source code, baseline implementations, and experimental setups, please visit our GitHub Repository. For further details, please refer to the accompanying paper.

Datasets

We reused the 46,615 datasets collected from NTCIR. The "datasets.json" file provides the id, title, description, tags, author, and summary of each dataset in JSON format.

    {
      "id": "0000de36-24e5-42c1-959d-2772a3c747e7",
      "title": "Montezuma National Wildlife Refuge: January - April, 1943",
      "description": "This narrative report for Montezuma National Wildlife Refuge outlines Refuge accomplishments from January through April of 1943. ...",
      "tags": ["annual-narrative", "behavior", "populations"],
      "author": "Fish and Wildlife Service",
      "summary": "Almost continuous rains during April brought flood conditions to the Clyde River as well as to the refuge storage pool. Cayuga Lake is at its highest level in about ton years. ..."
    }

Below is an example of how to load and use the datasets.json file:

    import json

    # Load the dataset file
    with open('datasets.json', 'r', encoding='utf-8') as f:
        datasets_data = json.load(f)

    # Iterate through each dataset
    for dataset in datasets_data:
        dataset_id = dataset['id']  # Get the dataset ID
        title = dataset['title']  # Get the title
        # Other code to process the dataset...
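Cases and judgments refer to datasets by ID, so a lookup table keyed on id is often convenient. The following is a minimal sketch, assuming datasets.json loads as a list of objects with the six keys listed above. The helper names index_by_id and dataset_text are illustrative and not part of DSEBench.

```python
# Field order follows the one used elsewhere in this README.
TEXT_FIELDS = ["title", "description", "tags", "author", "summary"]


def index_by_id(datasets):
    """Map each dataset ID to its full record (hypothetical helper)."""
    return {d["id"]: d for d in datasets}


def dataset_text(dataset, fields=TEXT_FIELDS):
    """Join the chosen fields into one string, e.g. as retriever input."""
    parts = []
    for field in fields:
        value = dataset.get(field)
        if not value:
            continue  # skip missing or empty fields
        # "tags" is a list of strings; the other fields are plain strings
        parts.append(" ".join(value) if isinstance(value, list) else str(value))
    return " ".join(parts)


# Tiny in-memory stand-in for the contents of datasets.json
sample = [
    {"id": "a", "title": "Refuge report", "description": "",
     "tags": ["behavior", "populations"], "author": "FWS", "summary": "Floods."},
]
by_id = index_by_id(sample)
print(dataset_text(by_id["a"]))
```

With the real file, index_by_id(datasets_data) would be called once after loading, and records then fetched by target_dataset_id or candidate_dataset_id.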
Queries

The "queries.tsv" file provides 3,979 keyword queries. Each row represents a query with two "\t"-separated columns: query_id and query_text. The queries fall into two categories: generated queries, created from the metadata of datasets, and NTCIR queries, imported from the English part of the NTCIR dataset. Queries with IDs starting with "GEN_" are generated queries, while those starting with "NTCIR" are NTCIR queries.

Below is an example of how to load and use the queries.tsv file:

    # Load the queries file
    with open('queries.tsv', 'r', encoding='utf-8') as f:
        # Iterate through each line
        for line in f:
            # Strip the trailing newline, then get the query ID and the query text
            query_id, query_text = line.rstrip('\n').split('\t')
            # Other code to process the data...

Test and Training Cases

In DSEBench, each input is a case, which consists of a query and a set of target datasets that are known to be relevant to the query. The "cases.tsv" file provides 141 test cases and 5,699 training cases. Each row represents a case with three "\t"-separated columns: case_id, query_id, and target_dataset_id.

Test cases are identified by a purely numeric case_id. These test cases are adapted from highly relevant query-dataset pairs from the NTCIR dataset. The remaining cases are training cases. Among these, those with a case_id starting with l1_ are adapted from partially relevant query-dataset pairs from NTCIR, while those starting with gen_ are synthetic training cases whose queries are generated queries.

Below is an example of how to load and use the cases.tsv file:

    # Load the cases file
    with open('cases.tsv', 'r', encoding='utf-8') as f:
        # Iterate through each line
        for line in f:
            # Strip the trailing newline, then get the case ID, the query ID, and the target dataset ID
            case_id, query_id, target_dataset_id = line.rstrip('\n').split('\t')
            # Other code to process the data...
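Since cases point to queries by ID, the two TSV files are usually joined before use. The sketch below is one way to do that, assuming neither file has a header row and that a case_id may appear on one or more rows (one per target dataset). The function names load_queries, load_cases, and is_test_case are illustrative and not part of DSEBench.

```python
import csv


def load_queries(path):
    """Return {query_id: query_text} from a two-column TSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return {row[0]: row[1] for row in reader if len(row) >= 2}


def load_cases(path):
    """Return {case_id: {"query_id": ..., "targets": [...]}}."""
    cases = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) < 3:
                continue  # skip blank or malformed lines
            case_id, query_id, target_id = row[:3]
            case = cases.setdefault(case_id, {"query_id": query_id, "targets": []})
            case["targets"].append(target_id)
    return cases


def is_test_case(case_id):
    """Test cases have a purely numeric case_id; all others are training cases."""
    return case_id.isdigit()
```

A typical use would be queries = load_queries('queries.tsv') and cases = load_cases('cases.tsv'), followed by queries[case['query_id']] to recover the query text of a case.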
Relevance Judgments

The "human_annotated_judgments.json" file contains 7,415 human-annotated judgments, and the "llm_annotated_judgments.json" file contains 122,585 judgments generated by a large language model (LLM).

Each JSON object has eight keys: query_id, target_dataset_id, candidate_dataset_id, case_id (the ID of the input), query_rel (relevance of the candidate dataset to the query, 0: irrelevant; 1: partially relevant; 2: highly relevant), field_query_rel, target_sim (similarity of the candidate dataset to the target datasets, 0: dissimilar; 1: partially similar; 2: highly similar), and field_target_sim. The field_query_rel and field_target_sim are both lists of length 5 consisting of 0 and 1, corresponding to the fields [title, description, tags, author, summary].

    {
      "query_id": "NTCIR_200000",
      "target_dataset_id": "002ece58-9603-43f1-8e2e-54e3d9649e84",
      "candidate_dataset_id": "99e3b6a2-d097-463f-b6e1-3caceff300c9",
      "case_id": "1",
      "query_rel": 1,
      "field_query_rel": [1, 1, 1, 0, 0],
      "target_sim": 2,
      "field_target_sim": [1, 1, 1, 1, 1]
    }

Below is an example of how to load and use the human_annotated_judgments.json file:

    import json

    # Load the judgments file
    with open('Data/human_annotated_judgments.json', 'r') as f:
        judgments_data = json.load(f)

    # Iterate through each judgment
    for judgment in judgments_data:
        case_id = judgment['case_id']  # Get the case ID
        candidate_dataset_id = judgment['candidate_dataset_id']  # Get the candidate dataset ID
        query_rel = judgment['query_rel']  # Get the query relevance score
        field_query_rel = judgment['field_query_rel']  # Get the field-level query relevance scores (title, description, tags, author, summary)
        # Other code to process the judgment data...

Splits for Training, Validation, and Test Sets

To ensure that evaluation results are comparable, we provide predefined train-validation-test splits.
The "Splits/5-Fold_split" folder contains five sub-folders, each providing three qrel files for training, validation, and test sets. The "Splits/Annotators_split" folder contains three qrel files for the training, validation, and test sets as well. These files are used in the same way as the relevance judgments files.

Evaluation Scripts

We provide Python scripts to facilitate standard evaluation.

1. DSE Evaluation (Retrieval/Reranking)

Use evaluate_dse.py to calculate metrics (MAP, NDCG, Recall).

Input Format (JSON):

    {
      "case_id_1": {"dataset_id_A": 0.95, "dataset_id_B": 0.82},
      "case_id_2": {"...": ...}
    }

Usage:

    python evaluate_dse.py --qrels Data/human_annotated_judgments.json --run path/to/your_results.json

Note: This script requires the pytrec_eval library.

2. Explainable DSE Evaluation

Use evaluate_explanation.py to calculate F1-scores for field-level explanations.

Input Format (JSON): The query and dataset lists correspond to the binary relevance of ['title', 'description', 'tags', 'author', 'summary'].

    {
      "case_id": {
        "dataset_id": {
          "query": [1, 1, 1, 1, 0],
          "dataset": [1, 1, 1, 1, 0]
        }
      }
    }

Usage:

    python evaluate_explanation.py --qrels Data/human_annotated_judgments.json --run path/to/your_explanations.json

Codes and Baselines

To access the source code for retrieval, reranking, and explanation models, as well as the implementation details and full baseline results, please refer to our GitHub Repository.
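As a companion to the two input formats described under Evaluation Scripts, the following sketch shows one way to assemble run files from a system's output. It assumes only the JSON layouts shown above; the helpers build_run, field_mask, and save_json, and the example IDs and scores, are illustrative and not part of DSEBench or its scripts.

```python
import json

# Field order used by the field-level lists in this README.
FIELDS = ["title", "description", "tags", "author", "summary"]


def build_run(scored):
    """Turn (case_id, dataset_id, score) triples into the nested DSE run format."""
    run = {}
    for case_id, dataset_id, score in scored:
        run.setdefault(str(case_id), {})[dataset_id] = float(score)
    return run


def field_mask(selected):
    """Encode a collection of field names as a 0/1 list in FIELDS order."""
    return [1 if name in selected else 0 for name in FIELDS]


def save_json(obj, path):
    """Write a run or explanation dictionary to disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f, indent=2)


# Hypothetical system output for one case
scored = [("1", "dataset_id_A", 0.95), ("1", "dataset_id_B", 0.82)]
run = build_run(scored)

# Hypothetical explanation: which fields support query relevance / target similarity
explanations = {
    "1": {
        "dataset_id_A": {
            "query": field_mask({"title", "tags"}),
            "dataset": field_mask({"title", "description", "summary"}),
        }
    }
}
print(json.dumps(run))
```

Calling save_json(run, 'path/to/your_results.json') and save_json(explanations, 'path/to/your_explanations.json') would then produce files to pass via --run to the respective scripts.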
