We provide large-scale multi-domain benchmark datasets for Personalized Search.
The datasets can be found here.
Models' source code can be found here.
Pre-computed baseline runs are available on ranxhub.
Please cite the following paper if you use the data or code in this repo.
@inproceedings{bassani2022multi,
title={A Multi-Domain Benchmark for Personalized Search Evaluation},
author={Bassani, Elias and Kasela, Pranav and Raganato, Alessandro and Pasi, Gabriella},
booktitle={Proceedings of the 31st ACM International Conference on Information \& Knowledge Management},
pages={3822--3827},
year={2022}
}
- train:
  - queries.jsonl
  - query_ids.txt
- val:
  - bm25_run.json
  - qrels.json
  - queries.jsonl
  - query_ids.txt
- test:
  - bm25_run.json
  - qrels.json
  - queries.jsonl
  - query_ids.txt
- collection.jsonl
- fos_hierarachies.jsonl
- in_refs.jsonl
- out_refs.jsonl
- has_authors.jsonl
- authors.jsonl
- affiliations.jsonl
- conference_instances.jsonl
- conference_series.jsonl
- journals.jsonl
- bm25_config.json
Each JSON line of a `queries.jsonl` file is as follows:
{
"id": ...
"text": ...
"rel_doc_ids": ... # IDs of the relevant documents
"user_id": ... # Same as `author_id` in other files
"user_doc_ids": ... # IDs of the associated user documents
"bm25_doc_ids": ... # IDs of the documents retrieved by BM25
"bm25_doc_scores": ... # Scores assigned by BM25 to the retrieved documents
"timestamp": ...
}
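A minimal sketch of how one might stream query records with the fields shown above and score the BM25 candidates they carry. It assumes that `rel_doc_ids` and `bm25_doc_ids` are JSON lists of document IDs and that `bm25_doc_ids` is ordered by decreasing BM25 score. The helper names `read_jsonl`, `reciprocal_rank`, and `mean_reciprocal_rank` are illustrative and not part of the released code.

```python
import json
from typing import Dict, Iterator


def read_jsonl(path: str) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def reciprocal_rank(query: Dict) -> float:
    """Reciprocal rank of the first relevant document in the BM25 list.

    Assumption: `bm25_doc_ids` is sorted by decreasing BM25 score.
    """
    relevant = set(query["rel_doc_ids"])
    for rank, doc_id in enumerate(query["bm25_doc_ids"], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(path: str) -> float:
    """Average the reciprocal rank over every query in a JSONL file."""
    scores = [reciprocal_rank(q) for q in read_jsonl(path)]
    return sum(scores) / len(scores) if scores else 0.0
```

Reading line by line keeps memory use flat regardless of file size. A call such as `mean_reciprocal_rank("val/queries.jsonl")` would work if the data were unpacked with that relative path, which is itself an assumption about the local setup.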
Each JSON line of `collection.jsonl` is as follows:
{
"id": ...
"title": ...
"text": ...
"keywords": ...
"fields_of_study": ...
"publication_date": ...
"timestamp": ...
"conference_instance_id": ...
"conference_series_id": ...
"journal_id": ...
"issue_id": ...
"volume": ...
"publisher": ...
"doi": ...
}
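Since the collection is large, it may be preferable to load only the documents a given experiment needs. The sketch below streams document records once and keeps those whose `id` appears in a requested set. It assumes document IDs are unique and have the same type as the IDs referenced elsewhere (for example in `bm25_doc_ids`). The function names `load_docs` and `doc_to_text` are hypothetical.

```python
import json
from typing import Dict, Iterable


def load_docs(collection_path: str, wanted_ids: Iterable[str]) -> Dict[str, Dict]:
    """Stream the collection once and keep only the requested documents."""
    wanted = set(wanted_ids)
    found: Dict[str, Dict] = {}
    with open(collection_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            doc = json.loads(line)
            if doc["id"] in wanted:
                found[doc["id"]] = doc
                if len(found) == len(wanted):
                    break  # every requested document has been seen
    return found


def doc_to_text(doc: Dict) -> str:
    """Join title and text into one string; missing or null fields are skipped."""
    parts = [doc.get("title"), doc.get("text")]
    return " ".join(p for p in parts if p)
```

Requested IDs that are absent from the file are simply missing from the returned dictionary, so callers may want to check for them explicitly.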
Each JSON line of `authors.jsonl` is as follows:
{
"id": ...
"name": ...
"affiliation_id": ...
"docs": [{"doc_id": "...", "timestamp": ...}, ...]
}
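Because each entry in `docs` carries a timestamp, an author record can be cut at a point in time, for instance to consider only what a user had written before issuing a query. The sketch below is one way to do that. It assumes the timestamps are numeric and on the same scale as the `timestamp` field of the queries, which this README does not state explicitly. `docs_before` is a hypothetical helper.

```python
from typing import Dict, List


def docs_before(author: Dict, cutoff: float) -> List[str]:
    """IDs of the author's documents dated strictly before `cutoff`, oldest first."""
    earlier = [d for d in author["docs"] if d["timestamp"] < cutoff]
    earlier.sort(key=lambda d: d["timestamp"])
    return [d["doc_id"] for d in earlier]
```

The strict comparison excludes documents that share the cutoff timestamp; whether that is the right boundary depends on how the benchmark defines a user's history.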
Each JSON line of `has_authors.jsonl` is as follows:
{
"doc_id": ...
"timestamp": ...
"author_ids": ["123678452", ...]
}
Each JSON line of `in_refs.jsonl` is as follows:
{
"doc_id": ...
"in_refs": [{"doc_id": "...", "timestamp": ...}, ...]
}
Each JSON line of `out_refs.jsonl` is as follows:
{
"doc_id": ...
"timestamp": ...
"out_refs": ["2048600620", ...]
}
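As an illustration of working with the outgoing-reference records, the sketch below inverts them into a "cited by" mapping. It assumes `out_refs` is a list of document IDs, as in the example above. `in_refs.jsonl` already provides incoming references together with timestamps, and whether the two files are exact inverses of each other is not stated here, so this is only a sketch of the data structure. `invert_out_refs` is a hypothetical name.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def invert_out_refs(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each cited document ID to the IDs of the documents citing it."""
    cited_by: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        for target in record["out_refs"]:
            cited_by[target].append(record["doc_id"])
    return dict(cited_by)
```

The function accepts any iterable of JSON lines, so an open file handle can be passed directly, for example `with open("out_refs.jsonl", encoding="utf-8") as f: cited_by = invert_out_refs(f)`.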
Each JSON line of `affiliations.jsonl` is as follows:
{
"id": ...
"name": ... # Name of the institution
}
Each JSON line of `conference_instances.jsonl` is as follows:
{
"id": ...
"name": ...
"conference_series_id": ...
}
Each JSON line of `conference_series.jsonl` is as follows:
{
"id": ...
"name": ...
}
Each JSON line of `journals.jsonl` is as follows:
{
"id": ...
"name": ...
}
The fields of study associated with the documents are organized in a hierarchical tree structure.
Each JSON line of `fos_hierarachies.jsonl` is as follows:
{
"id": ...
"hierarchy": ...
}