A Multi-domain Benchmark for Personalized Search Evaluation

We provide large-scale multi-domain benchmark datasets for Personalized Search.

The datasets can be found here.
Models' source code can be found here.
Pre-computed baseline runs are available on ranxhub.

Citation

Please cite the following paper if you use the data or code in this repo.

@inproceedings{bassani2022multi,
  title={A Multi-Domain Benchmark for Personalized Search Evaluation},
  author={Bassani, Elias and Kasela, Pranav and Raganato, Alessandro and Pasi, Gabriella},
  booktitle={Proceedings of the 31st ACM International Conference on Information \& Knowledge Management},
  pages={3822--3827},
  year={2022}
}

Folder structure of each dataset

- train:
  - queries.jsonl
  - query_ids.txt
- val:
  - bm25_run.json
  - qrels.json
  - queries.jsonl
  - query_ids.txt
- test:
  - bm25_run.json
  - qrels.json
  - queries.jsonl
  - query_ids.txt
- collection.jsonl
- fos_hierarachies.jsonl
- in_refs.jsonl
- out_refs.jsonl
- has_authors.jsonl
- authors.jsonl
- affiliations.jsonl
- conference_instances.jsonl
- conference_series.jsonl
- journals.jsonl
- bm25_config.json

File descriptions

`queries.jsonl`

Each JSON line is as follows:

{
  "id": ...
  "text": ...
  "rel_doc_ids": ...      # IDs of the relevant documents
  "user_id": ...          # Same as `author_id` in other files
  "user_doc_ids": ...     # IDs of the associated user documents
  "bm25_doc_ids": ...     # IDs of the documents retrieved by BM25
  "bm25_doc_scores": ...  # Scores assigned by BM25 to the retrieved documents
  "timestamp": ...
}

`collection.jsonl`

Each JSON line is as follows:

{
  "id": ...
  "title": ...
  "text": ...
  "keywords": ...
  "fields_of_study": ...
  "publication_date": ...
  "timestamp": ...
  "conference_instance_id": ...
  "conference_series_id": ...
  "journal_id": ...
  "issue_id": ...
  "volume": ...
  "publisher": ...
  "doi": ...
}

`authors.jsonl`

Each JSON line is as follows:

{
  "id": ...
  "name": ...
  "affiliation_id": ...
  "docs": [{"doc_id": "...", "timestamp": ...}, ...]
}

`has_authors.jsonl`

Each JSON line is as follows:

{
  "doc_id": ...
  "timestamp": ...
  "author_ids": ["123678452", ...]
}

`in_refs.jsonl` (incoming reference)

Each JSON line is as follows:

{
  "doc_id": ...
  "in_refs": [{"doc_id": "...", "timestamp": ...}, ...]
}

`out_refs.jsonl` (outgoing reference)

Each JSON line is as follows:

{
  "doc_id": ...
  "timestamp": ...
  "out_refs": ["2048600620", ...]
}

`affiliations.jsonl`

Each JSON line is as follows:

{
  "id": ...
  "name": ...   # Name of the institution
}

`conference_instances.jsonl`

Each JSON line is as follows:

{
  "id": ...
  "name": ...
  "conference_series_id": ...
}

`conference_series.jsonl`

Each JSON line is as follows:

{
  "id": ...
  "name": ...
}

`journals.jsonl`

Each JSON line is as follows:

{
  "id": ...
  "name": ...
}

`fields_of_study_hierarchies.jsonl`

Fields of studies associated with the documents have a hierarchical tree structure.
Each JSON line is as follows:

{
  "id": ...
  "hierarchy": ...
}

AmenRa/a-multi-domain-benchmark-for-personalized-search-evaluation

A Multi-domain Benchmark for Personalized Search Evaluation

Citation

Folder structure of each dataset

File descriptions

queries.jsonl

collection.jsonl

authors.jsonl

has_authors.jsonl

in_refs.jsonl (incoming reference)

out_refs.jsonl (outgoing reference)

affiliations.jsonl

conference_instances.jsonl

conference_series.jsonl

journals.jsonl

fields_of_study_hierarchies.jsonl

`queries.jsonl`

`collection.jsonl`

`authors.jsonl`

`has_authors.jsonl`

`in_refs.jsonl` (incoming reference)

`out_refs.jsonl` (outgoing reference)

`affiliations.jsonl`

`conference_instances.jsonl`

`conference_series.jsonl`

`journals.jsonl`

`fields_of_study_hierarchies.jsonl`