TOMT

Corpus construction

Website topics: I removed questions that point to websites that no longer exist or that are not in ClueWeb22.

Requirements

Please install Docker, git, python3, and tira.

Please follow the official documentation/tutorials to install Docker, git, and python3 on your machine.
Please run pip3 install tira to install the TIRA client library on your machine.

Build the Docker images

Please run:

make build-docker

Methods

Baseline: ChatNoir using the original title

Please run:

make run-title-baseline

Oracle Baseline: ChatNoir using reciprocal rank fusion over

Please run:

make run-oracle-baseline

The output (e.g., head -3 output-oracle-baseline/run.txt) should look like:

20 0 clueweb22-en0024-09-06042 1 0.04762704813108039 chatnoir-oracle-baseline
20 0 clueweb22-en0041-37-00460 2 0.046871392288155164 chatnoir-oracle-baseline
20 0 clueweb22-en0023-36-06642 3 0.04569460390355913 chatnoir-oracle-baseline
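
For orientation, reciprocal rank fusion can be sketched in a few lines of Python. This is a minimal illustration and not the code behind make run-oracle-baseline: the constant k=60 is the commonly used default and an assumption here, the function names are hypothetical, and the input rankings stand in for whatever result lists the baseline actually fuses. The output lines follow the column layout of the run file shown above.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in.
    k=60 is the conventional default (an assumption, not taken from this repo).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def to_run_lines(query_id, fused, tag):
    """Format fused results as: query_id 0 doc_id rank score tag."""
    return [
        f"{query_id} 0 {doc_id} {rank} {score} {tag}"
        for rank, (doc_id, score) in enumerate(fused, start=1)
    ]


if __name__ == "__main__":
    # Hypothetical result lists for one topic, e.g. from three query variants.
    rankings = [["a", "b", "c"], ["b", "a"], ["a", "c"]]
    fused = reciprocal_rank_fusion(rankings)
    print("\n".join(to_run_lines("20", fused, "rrf-sketch")))
```

A document ranked near the top of several lists accumulates the highest fused score, which is why the scores in the run file are small sums of reciprocals rather than retrieval scores.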

Idea 1:

Let ChatGPT/Alpaca generate keywords for the website I am looking for.

Prompt: "I want to build a website for X. Please write keywords that I should include so that the webpage can be easily found."

https://chat.web.webis.de/chat-ui

Example: I want to build a website selling t-shirts bags posters with text of whole book. Please write keywords that I should include so that the webpage can be easily found.
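
The prompt can be filled in programmatically for each topic. The sketch below only builds the prompt string and splits a model answer into keywords; the function names are hypothetical, and the assumption that the model answers with a comma-separated or line-separated (possibly bulleted or numbered) list is mine, not something the chat UI guarantees.

```python
import re

# Template taken from the prompt above; {description} replaces X.
PROMPT_TEMPLATE = (
    "I want to build a website for {description}. "
    "Please write keywords that I should include so that the webpage can be easily found."
)


def build_keyword_prompt(description):
    """Insert a website description into the keyword prompt."""
    return PROMPT_TEMPLATE.format(description=description.strip().rstrip("."))


def parse_keywords(response):
    """Split a model answer into keywords.

    Assumes keywords are separated by commas or newlines and may carry
    bullet or number prefixes such as "- " or "1. ".
    """
    parts = re.split(r"[,\n]+", response)
    cleaned = [re.sub(r"^\s*(?:[-*•]|\d+[.)])\s*", "", part).strip() for part in parts]
    return [keyword for keyword in cleaned if keyword]


if __name__ == "__main__":
    print(build_keyword_prompt("selling t-shirts bags posters with text of whole book"))
    print(parse_keywords("1. t-shirts\n2. book posters, literary gifts"))
```

The parsed keywords could then be joined into a query string for ChatNoir; how well that works is exactly what this idea is meant to test.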

Idea 2:

The Tip-of-the-Tongue track now has a training and validation dataset.

Use HDCT/DeepCT to remove unimportant terms from the documents and queries.

Extract Reddit questions with links to web pages
Download the linked pages with https://github.com/hartator/wayback-machine-downloader
Train a BERT Model (we already have the corresponding training scripts) to remove terms that do not occur in the query /
- Dedicated models for (1) the query, and (2) the document
- Multiple dedicated models for different fields: title, url, full text of the document

In Python, call subprocess.check_output with a try/except around it: docker run -v ${PWD}/websites:/websites hartator/wayback-machine-downloader --exact-url --from 20110225005609 --to 20120225005609 --maximum-snapshot 2 http://www.louissachar.com/Wayside.htm
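
A minimal sketch of that wrapper could look as follows. It reuses the Docker image and flags from the command above; the function names, the default timestamps (copied from the example), the timeout, and the choice to return None on failure are assumptions made for illustration.

```python
import os
import subprocess


def build_wayback_command(url, target_dir, from_ts, to_ts, max_snapshots=2):
    """Assemble the docker command from the note above as an argument list."""
    return [
        "docker", "run",
        "-v", f"{os.path.abspath(target_dir)}:/websites",
        "hartator/wayback-machine-downloader",
        "--exact-url",
        "--from", str(from_ts),
        "--to", str(to_ts),
        "--maximum-snapshot", str(max_snapshots),
        url,
    ]


def download_snapshot(url, target_dir="websites",
                      from_ts="20110225005609", to_ts="20120225005609",
                      timeout=600):
    """Run the downloader and return its output, or None if it fails."""
    command = build_wayback_command(url, target_dir, from_ts, to_ts)
    try:
        return subprocess.check_output(
            command, stderr=subprocess.STDOUT, text=True, timeout=timeout
        )
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError) as error:
        # A failed download should not abort the crawl over all linked pages.
        print(f"Download failed for {url}: {error}")
        return None


if __name__ == "__main__":
    download_snapshot("http://www.louissachar.com/Wayside.htm")
```

Passing the command as a list avoids shell quoting problems with URLs, and catching OSError also covers the case where Docker is not installed.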

Evaluation results for the Title Baseline

tira-run \
    --input-directory ${PWD}/output-title-baseline \
    --image tomt-ir-dataset \
    --command 'ir_measures tomt $inputDataset/run.txt nDCG@10 MRR P@3 Recall@3'

This should output the following:

nDCG@10	0.0000
RR	0.0000
P@3	0.0000
R@3	0.0000

Evaluation results for the Oracle Baseline

tira-run \
    --input-directory ${PWD}/output-oracle-baseline \
    --image tomt-ir-dataset \
    --command 'ir_measures tomt $inputDataset/run.txt nDCG@10 MRR P@3 Recall@3'

This should output the following:

nDCG@10	0.8474
RR	1.0000
P@3	0.8889
R@3	0.3452
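
To make the reported numbers easier to interpret, here is a simplified sketch of three of the measures for a single query with binary relevance. It is an illustration only: the evaluation above is computed by ir_measures, which averages over all topics, and the function names and example data below are hypothetical. The MRR requested in the command shows up as RR in the output.

```python
def reciprocal_rank(ranked_doc_ids, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked_doc_ids, relevant, k):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(ranked_doc_ids, relevant, k):
    """Fraction of all relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


if __name__ == "__main__":
    # Hypothetical ranking and relevance judgments for one query.
    ranking = ["d1", "d2", "d3", "d4"]
    relevant = {"d2", "d4", "d9"}
    print("RR ", reciprocal_rank(ranking, relevant))
    print("P@3", precision_at_k(ranking, relevant, 3))
    print("R@3", recall_at_k(ranking, relevant, 3))
```

An RR of 1.0000 as reported for the oracle baseline means that the first result is relevant for every evaluated topic.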
