This repo is the PyTorch implementation of our CIKM 2023 paper "Neural Personalized Topic Modeling for Mining User Preferences on Social Media".
Python environment:
Anaconda3-2022.10-Linux-x86_64
torch-1.13.1+cu117-cp39-cp39-linux_x86_64
pytorch-lightning==1.9.3
transformers==4.36.2
palmettopy==3.3
prefetch-generator==1.0.3
wordcloud==1.8.2.2
tqdm
argparse
matplotlib
OS and driver environment:
Ubuntu 20.04
NVIDIA-Linux-x86_64-515.105.01
cuda_11.7.0_515.43.04_linux
Hardware environment:
Chassis: Inspur NF5468M6
CPU: Intel(R) Xeon(R) Platinum 8362 CPU x2
RAM: DDR4 RECC 3200 512GB
GPU: NVIDIA A100 80G
Please prepare the following data and models in the right directories:
- "embedding/word2vec_glove.6B.100d.txt.bin" for the dictionary object of the GloVe word embedding file. Use the provided glove_vec_2_dict.py in the embedding directory to convert the original GloVe word embeddings into this format.
- "data/corpus/twitter-2016" for the preprocessed Twitter Political Archive data and vocabulary. Please unzip the twitter2016.zip file into this directory.
- "data/corpus/authorblog" for preprocessed authorblog data and vocabulary.
- "data/huggingface_model/bert-base-uncased/" for the BERT-BASE-UNCASED model files to be loaded.
- "data/huggingface_model/sentence-transformers/all-mpnet-base-v2/" for the Sentence-Transformer model files to be loaded.
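Before a long run, it can help to confirm that everything in the list above is in place. The following is a minimal sketch, not part of the repository: the helper name find_missing is hypothetical, the paths follow the list above, and the Sentence-Transformer directory is spelled as in the D_PLM dictionary from run.py.

```python
from pathlib import Path

# Paths taken from the list above; adjust them if your layout differs.
# The Sentence-Transformer path uses the spelling from D_PLM in run.py.
REQUIRED_PATHS = [
    "embedding/word2vec_glove.6B.100d.txt.bin",
    "data/corpus/twitter-2016",
    "data/corpus/authorblog",
    "data/huggingface_model/bert-base-uncased",
    "data/huggingface_model/sentence-transformers/all-mpnet-base-v2",
]


def find_missing(root=".", required=REQUIRED_PATHS):
    """Return the required paths that do not exist under `root`."""
    base = Path(root)
    return [p for p in required if not (base / p).exists()]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing files or directories:")
        for path in missing:
            print("  " + path)
    else:
        print("All required files and directories are present.")
```

Run it from the repository root. It only checks that the paths exist, not that their contents are valid.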
To load different pretrained language models, we define a dictionary variable in run.py (line 79) that maps each model name to its directory and hidden dimension. Please download your selected pretrained language model files and place them in the corresponding directory:
D_PLM={"bert-base": ("data/huggingface_model/bert-base-uncased", 768),
"sbert-all-mpnet": ("data/huggingface_model/sentence-transformers/all-mpnet-base-v2/", 768)}
Please place the corresponding pretrained language model files in their directories before running.
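As an illustration of how such a dictionary can be consulted, here is a small sketch. The dictionary is copied from the snippet above, while resolve_plm is a hypothetical helper and not a function from run.py. It assumes each key maps to a (directory, hidden dimension) pair.

```python
from pathlib import Path

# Copied from the snippet above; add your own entries in the same form.
D_PLM = {
    "bert-base": ("data/huggingface_model/bert-base-uncased", 768),
    "sbert-all-mpnet": ("data/huggingface_model/sentence-transformers/all-mpnet-base-v2/", 768),
}


def resolve_plm(name, registry=D_PLM, check_exists=True):
    """Hypothetical helper: return (model_dir, hidden_dim) for a registered name."""
    if name not in registry:
        raise KeyError(f"Unknown model name {name!r}; available: {sorted(registry)}")
    model_dir, hidden_dim = registry[name]
    # Fail early if the model files have not been downloaded yet.
    if check_exists and not Path(model_dir).is_dir():
        raise FileNotFoundError(f"Model files not found in {model_dir!r}")
    return model_dir, hidden_dim
```

For example, resolve_plm("bert-base") raises FileNotFoundError until the BERT files are placed in their directory, which surfaces a missing download before training starts.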
We provide the zipped Twitter Political Archive data as "twitter2016.zip" in "data/corpus/twitter-2016/". Please unzip this file into this directory before running. The raw data of the Twitter Political Archive is available at https://github.com/bpb27/political_twitter_archivee.
The first run rebuilds the corpus object and serializes it into "corpus_obj.bin" to speed up later runs on the same data. When the "disablerapidload" argument is not set, the corpus is rebuilt from the dataset selected by the "data-path" argument.
The processed version of the Authorblog data can be found at "https://www.dropbox.com/scl/fi/k9718nkqqi6goxoqz3r2k/authorblog.zip?rlkey=zococh5utggiaz7mk41j12t1f&st=uc7fg8oe&dl=0". You can acquire the raw Authorblog data at "https://www.kaggle.com/datasets/rtatman/blog-authorship-corpus". Set the argument data-path="corpus/authorblog" to select Authorblog as the training data.
You can use Python's pickle module to load the given binary files and check their format.
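A minimal sketch of inspecting one of these binary files with pickle follows. The helper describe_pickle is hypothetical and makes no assumption about the stored type. If a file contains instances of classes defined in this repository, run the script from the repository root so that pickle can import them, and only unpickle files you trust.

```python
import pickle


def describe_pickle(path, max_items=3):
    """Load a pickle file and summarize its top-level object."""
    with open(path, "rb") as f:
        obj = pickle.load(f)
    info = {"type": type(obj).__name__}
    if hasattr(obj, "__len__"):
        info["length"] = len(obj)
    if isinstance(obj, dict):
        # Show a few keys, e.g. words in an embedding dictionary.
        info["sample_keys"] = list(obj)[:max_items]
    elif isinstance(obj, (list, tuple)):
        # Show the types of the first few elements.
        info["sample_items"] = [type(x).__name__ for x in obj[:max_items]]
    return info
```

Calling describe_pickle("embedding/word2vec_glove.6B.100d.txt.bin") after the conversion step should report the container type and its size without printing the whole object.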
We use Palmetto (https://github.com/dice-group/Palmetto) in this code for automatic topic coherence calculation. Please prepare a Palmetto server endpoint and the corresponding index files, then modify line 149 of "run.py" to specify your own Palmetto endpoint URL:
tc = Palmetto("http://127.0.0.1:7777/service/", timeout=60)
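Since coherence scoring depends on a running server, a quick reachability check before training may save time. The sketch below uses only the Python standard library. The helper endpoint_reachable is hypothetical, and it only tests whether some HTTP server answers at the URL, not whether the Palmetto index files are loaded correctly.

```python
import urllib.error
import urllib.request


def endpoint_reachable(url, timeout=5):
    """Return True if an HTTP server answers at `url`, even with an error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded (for example with 404 on the bare service path).
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, name resolution failure, or timeout.
        return False


if __name__ == "__main__":
    # Same URL as in the line above; replace it with your own endpoint.
    print(endpoint_reachable("http://127.0.0.1:7777/service/"))
```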
Key arguments for running:
python run.py --n-hidden <number of hidden units in the inference network's MLP> \
--user-embed-size <dimension of the user embedding> \
--dropout <dropout rate> \
--lr <learning rate> \
--topics <number of topics> \
--batch-size <batch size> \
--topicembedsize <dimension of the topic embedding> \
--plm-select <which pretrained language model to use> \
--topk <number of top-k words extracted when computing topic coherence> \
--max-epoch <maximum number of training epochs> \
--subepoch <number of alternative epochs> \
--patience <patience parameter for early stopping> \
--seed <random seed> \
--measure <topic coherence metric used in palmettopy> \
--ulr <learning rate for the user network> \
--savckpt <flag to enable checkpoint saving> \
--disablerapidload <flag to disable corpus processing and use the previously serialized corpus object> \
--disabledisplay <flag to disable pytorch_lightning output>
You can use the following command for a trial run on the Twitter Political Archive data:
python run.py --topics=50 --batchsize=128 --dropout=0.1 --lr=1e-4 --data-path="twitter-2016"
or on the Authorblog data:
python run.py --topics=100 --batchsize=256 --dropout=0.1 --lr=1e-4 --data-path="authorblog"
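To launch several configurations from a script, the commands above can be assembled programmatically. This is a sketch under assumptions: build_command and run_training are hypothetical helpers, option names are passed through as written (with underscores turned into hyphens), and the flag spellings must match those that run.py actually accepts.

```python
import subprocess
import sys


def build_command(script="run.py", **options):
    """Turn keyword options into a `python run.py --key=value ...` argument list."""
    cmd = [sys.executable, script]
    for key, value in options.items():
        flag = "--" + key.replace("_", "-")
        if value is True:
            cmd.append(flag)  # boolean switch without a value
        elif value is False or value is None:
            continue  # switch left off
        else:
            cmd.append(f"{flag}={value}")
    return cmd


def run_training(**options):
    """Launch one training run and raise if it exits with a non-zero status."""
    subprocess.run(build_command(**options), check=True)


if __name__ == "__main__":
    # Mirrors the Twitter example above; prints the command instead of running it.
    print(" ".join(build_command(topics=50, batchsize=128, dropout=0.1,
                                 lr="1e-4", data_path="twitter-2016")))
```

Passing the arguments as a list avoids shell quoting issues. Call run_training from the repository root to start an actual run.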
When training finishes, a text file containing the top-k words of each topic is generated in the "topic_file/" directory.
If you find this code useful, please cite our paper:
@inproceedings{DBLP:conf/cikm/LiuLTZLWZ23,
author = {Luyang Liu and
Qunyang Lin and
Haonan Tong and
Hongyin Zhu and
Ke Liu and
Min Wang and
Chuang Zhang},
editor = {Ingo Frommholz and
Frank Hopfgartner and
Mark Lee and
Michael Oakes and
Mounia Lalmas and
Min Zhang and
Rodrygo L. T. Santos},
title = {Neural Personalized Topic Modeling for Mining User Preferences on
Social Media},
booktitle = {Proceedings of the 32nd {ACM} International Conference on Information
and Knowledge Management, {CIKM} 2023, Birmingham, United Kingdom,
October 21-25, 2023},
pages = {1545--1555},
publisher = {{ACM}},
year = {2023},
url = {https://doi.org/10.1145/3583780.3614987},
doi = {10.1145/3583780.3614987},
timestamp = {Fri, 27 Oct 2023 20:40:46 +0200},
biburl = {https://dblp.org/rec/conf/cikm/LiuLTZLWZ23.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Please be advised that a different environment may lead to undesired results or other issues. This code therefore comes WITHOUT SUPPORT.