R. Cantini, F. Marozzo, G. Bruno, P. Trunfio, "Learning sentence-to-hashtags semantic mapping for hashtag recommendation on microblogs". ACM Transactions on Knowledge Discovery from Data, vol. 16, n. 2, pp. 1-26, 2022.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
- Python 3.7
- Install requirements
pip install -r requirements.txt
python -m spacy download en_core_web_lg
- Run the HASHET model
python run.py
The dataset available in the input/ folder is a sample of 100 tweets whose sole purpose is to show how the methodology works.
Each tweet is a JSON-formatted string.
The real datasets on which HASHET has been validated are in the used_dataset folder.
In accordance with the Twitter API Terms, only Tweet IDs are provided as part of these datasets.
To re-collect the tweets from the list of Tweet IDs contained in these datasets, you will need to use a tweet 'rehydration' program.
The resulting JSON line for each tweet after rehydration must have this format:
{
    "id":"id",
    "text":"tweet text",
    "date":"date",
    "user":{
        "id":"user_id",
        "name":"",
        "screenName":"",
        "location":"",
        "lang":"en",
        "description":""
    },
    "location":{
        "latitude":0.0,
        "longitude":0.0
    },
    "isRetweet":false,
    "retweets":0,
    "favoutites":0,
    "inReplyToStatusId":-1,
    "inReplyToUserId":-1,
    "hashtags":[
        "hashtag"
    ],
    "lang":"lang",
    "place":{}
}
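Before running the model on rehydrated data, it can help to check that every line matches the format above. The following is a minimal sketch, not part of the HASHET code: the function names check_tweet_line and check_file are hypothetical, and the key list is simply copied from the example above (including its spelling). It only checks that the top-level keys are present and that hashtags is a list.

```python
import json

# Top-level keys copied from the format shown in this README (spelling as given there).
REQUIRED_KEYS = {
    "id", "text", "date", "user", "location", "isRetweet", "retweets",
    "favoutites", "inReplyToStatusId", "inReplyToUserId", "hashtags",
    "lang", "place",
}


def check_tweet_line(line):
    """Return a list of problems found in one JSON line (empty if none)."""
    try:
        tweet = json.loads(line)
    except json.JSONDecodeError as exc:
        return ["invalid JSON: " + exc.msg]
    if not isinstance(tweet, dict):
        return ["top-level value is not a JSON object"]
    # Report missing keys in a stable (sorted) order.
    problems = ["missing key: " + key for key in sorted(REQUIRED_KEYS - set(tweet))]
    if not isinstance(tweet.get("hashtags", []), list):
        problems.append("hashtags is not a list")
    return problems


def check_file(path):
    """Yield (line_number, problems) for every non-conforming line of a file."""
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # skip blank lines
            problems = check_tweet_line(line)
            if problems:
                yield number, problems
```

Iterating over check_file on a rehydrated file and printing each (line_number, problems) pair gives a quick list of lines to fix before copying the file into the input/ folder.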
constants.py contains all the parameters used in the methodology. Changing them will influence the obtained results.
It is recommended to change the W2V_MINCOUNT and MINCOUNT values for larger datasets.