Neural Personalized Topic Model

This repository is the PyTorch implementation of our CIKM 2023 paper "Neural Personalized Topic Modeling for Mining User Preferences on Social Media".


Python environment:











OS and driver environment:

Ubuntu 20.04



Hardware environment:

Chassis: Inspur NF5468M6
CPU:     Intel(R) Xeon(R) Platinum 8362 CPU x2
RAM:     DDR4 RECC 3200 512GB
GPU:     NVIDIA A100 80G


Data and model preparation:

Please prepare the following data and models in the correct directories:

  • "embedding/word2vec_glove.6B.100d.txt.bin" for the dictionary object built from the GloVe word embedding file. Use the provided glove_vec_2_dict.py in the embedding directory to convert the original GloVe word embeddings into this format.
  • "data/corpus/twitter-2016" for the preprocessed Twitter Political Archive data and vocabulary. Unzip the twitter2016.zip file into this directory.
  • "data/corpus/authorblog" for the preprocessed Authorblog data and vocabulary.
  • "data/huggingface_model/bert-base-uncased/" for the BERT-BASE-UNCASED model files to be loaded.
  • "data/huggingface_model/sentence-transformer/all-mpnet-base-v2/" for the Sentence-Transformer model files to be loaded.
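
The GloVe conversion mentioned in the first item can be pictured roughly as follows. This is only a sketch: the provided glove_vec_2_dict.py may store the vectors differently (for example as NumPy arrays), and the function name glove_text_to_dict and the use of pickle with plain float lists are assumptions made for illustration.

```python
import pickle


def glove_text_to_dict(txt_path, bin_path):
    """Read a GloVe text file and pickle it as a {word: vector} dictionary.

    Hypothetical helper; the repository's own script may use another layout.
    """
    embeddings = {}
    with open(txt_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            # First field is the word, the remaining fields are the vector.
            embeddings[parts[0]] = [float(x) for x in parts[1:]]
    with open(bin_path, "wb") as f:
        pickle.dump(embeddings, f)
    return embeddings
```

If the training code rejects a file produced this way, check the format written by the provided script and adapt the sketch accordingly.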

To load different pretrained language models, run.py defines a dictionary variable (line 79) that maps each model name to its directory and hidden dimension. Please download the files of your selected pretrained language model and place them in the corresponding directory:

D_PLM={"bert-base": ("data/huggingface_model/bert-base-uncased", 768),
       "sbert-all-mpnet": ("data/huggingface_model/sentence-transformers/all-mpnet-base-v2/", 768)}

The pretrained language model files must be in these directories before run.py is started.
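
As a quick check before training, the lookup can be reproduced outside run.py. The sketch below copies D_PLM from above and adds a hypothetical helper, resolve_plm, that is not part of the repository; it assumes the dictionary keys are the values expected by --plm-select.

```python
import os

# Copied from this README; the authoritative definition lives in run.py.
D_PLM = {
    "bert-base": ("data/huggingface_model/bert-base-uncased", 768),
    "sbert-all-mpnet": ("data/huggingface_model/sentence-transformers/all-mpnet-base-v2/", 768),
}


def resolve_plm(name, registry=D_PLM, root="."):
    """Return (model_dir, hidden_dim) for a key, failing early if files are missing."""
    if name not in registry:
        raise KeyError(f"unknown model key {name!r}; choose from {sorted(registry)}")
    rel_dir, hidden_dim = registry[name]
    model_dir = os.path.join(root, rel_dir)
    if not os.path.isdir(model_dir):
        raise FileNotFoundError(f"model files not found in {model_dir}")
    return model_dir, hidden_dim
```

Calling resolve_plm("bert-base") from the repository root raises FileNotFoundError when the model directory has not been populated yet, which is easier to diagnose than a failure deep inside model loading.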

We provide the zipped Twitter Political Archive data as "twitter2016.zip" in "data/corpus/twitter2016/". Please unzip this file into that directory before running. The raw data of the Twitter Political Archive is available at https://github.com/bpb27/political_twitter_archivee.

The first run builds the corpus object and serializes it to "corpus_obj.bin" to speed up later runs on the same data. When the "disablerapidload" argument is not set, the corpus is rebuilt from the dataset selected by the "data-path" argument.

The processed version of the Authorblog data can be found at "https://www.dropbox.com/scl/fi/k9718nkqqi6goxoqz3r2k/authorblog.zip?rlkey=zococh5utggiaz7mk41j12t1f&st=uc7fg8oe&dl=0". The raw Authorblog data is available at "https://www.kaggle.com/datasets/rtatman/blog-authorship-corpus". Set the argument data-path="corpus/authorblog" to select Authorblog as the training data.

You can use Python's pickle module to load the provided binary files and check their format.
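
For inspecting a binary file with pickle, something like the following may help. It is a minimal sketch: describe_pickle is a hypothetical helper, and the structure of the repository's files is not assumed, so the function only reports whatever it finds. Only unpickle files you trust, and note that loading "corpus_obj.bin" will likely require running from the repository root so that the corpus class can be imported.

```python
import pickle


def describe_pickle(path, max_items=3):
    """Load a pickled object and return a short summary of its type and contents."""
    with open(path, "rb") as f:
        obj = pickle.load(f)
    summary = {"type": type(obj).__name__}
    if isinstance(obj, dict):
        summary["size"] = len(obj)
        summary["sample_keys"] = list(obj)[:max_items]
    elif isinstance(obj, (list, tuple)):
        summary["size"] = len(obj)
        summary["sample_items"] = list(obj[:max_items])
    elif hasattr(obj, "__dict__"):
        # Custom objects: list their attribute names.
        summary["attributes"] = sorted(vars(obj))
    return summary
```

For example, print(describe_pickle("embedding/word2vec_glove.6B.100d.txt.bin")) shows the container type and a few sample keys without dumping the whole file.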

Evaluation setting

We integrate Palmetto (https://github.com/dice-group/Palmetto) into this code for automatic topic coherence calculation. Please prepare a Palmetto server endpoint and the corresponding index files, then modify line 149 in "run.py" to specify your own Palmetto endpoint URL:

tc = Palmetto("", timeout=60)
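
To verify that an endpoint responds before a long training run, the coherence of a few hand-picked topics can be averaged with a small helper. This is a sketch, not the evaluation code in run.py: mean_topic_coherence is a hypothetical function, and it assumes the client exposes get_coherence(words, coherence_type=...) as the palmettopy client is understood to do.

```python
def mean_topic_coherence(client, topics, measure="cv"):
    """Average the coherence of several topics, each given as a list of top words.

    `client` is any object with a get_coherence(words, coherence_type=...) method
    (assumed here to match the palmettopy Palmetto client).
    """
    if not topics:
        raise ValueError("at least one topic is required")
    scores = [client.get_coherence(words, coherence_type=measure) for words in topics]
    return sum(scores) / len(scores)


# Usage against a real server (replace the placeholder with your own endpoint):
# from palmettopy.palmetto import Palmetto
# tc = Palmetto("<your-endpoint-url>", timeout=60)
# print(mean_topic_coherence(tc, [["election", "vote", "senate"]]))
```

Passing the client in as an argument keeps the helper testable without a running Palmetto server.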


Key arguments for running:

python run.py --n-hidden <the number of hidden units in the inference network's MLP> \
              --user-embed-size <the dimension of the user embedding> \
              --dropout <dropout rate> \
              --lr <learning rate> \
              --topics <the number of topics> \
              --batch-size <batch size> \
              --topicembedsize <the dimension of the topic embedding> \
              --plm-select <which pretrained language model to use> \
              --topk <the number of top words extracted per topic when computing topic coherence> \
              --max-epoch <maximum number of training epochs> \
              --subepoch <the number of alternating epochs> \
              --patience <patience parameter for early stopping> \
              --seed <random seed> \
              --measure <topic coherence metric used in palmettopy> \
              --ulr <learning rate for the user network> \
              --savckpt <flag to enable saving checkpoints> \
              --disablerapidload <flag to disable corpus processing and use the previously serialized corpus object> \
              --disabledisplay <flag to disable pytorch_lightning output>

You can use the following command to try a run on the Twitter Political Archive data:

python run.py --topics=50 --batchsize=128 --dropout=0.1 --lr=1e-4 --data-path="twitter-2016"

or on the Authorblog data:

python run.py --topics=100 --batchsize=256 --dropout=0.1 --lr=1e-4 --data-path="authorblog"


When training finishes, a text file containing the top-k words of each topic is generated in the "topic_file/" directory.


If you find this code useful, please kindly cite our paper:

Please be advised that different environments may lead to undesired results or potential issues. Thus, this code comes WITHOUT SUPPORT.