Socialized word embeddings (SWE) have been proposed to deal with two phenomena of language use:
- every user has his/her own personal characteristics of language use, and
- socially connected users are likely to use language in similar ways.
We observe that the spread of language use is transitive: one user can affect his/her friends, and those friends can in turn affect their friends. However, SWE only models this transitivity implicitly. Its social regularization applies only to one-hop friends, so users outside the one-hop social circle are not affected directly.
In this work, we adopt random walk methods to generate paths on the social graph and model the transitivity explicitly. Each user on a path is affected by his/her adjacent user(s) on the path. Moreover, under the update mechanism of SWE, the fewer friends a user has, the fewer update opportunities he/she gets. Hence, we propose a biased random walk method that provides these users with more update opportunities.
Experiments show that our random walk based social regularizations perform better on the sentiment classification task.
You need to download the dataset:

- Download the Yelp dataset.
- Convert the following datasets from JSON format to CSV format using `json_to_csv_converter.py` in `preprocess`. Get the two `.json` files, `yelp_academic_dataset_review.json` and `yelp_academic_dataset_user.json`, and put them in the `data` directory.
- Download LIBLINEAR or Multi-core LIBLINEAR and put the `liblinear` directory under the root directory of the repo. You can install LIBLINEAR according to its installation instructions (a typical build sequence is sketched below).
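On most *nix systems (and WSL), LIBLINEAR builds with plain `make`; a typical sequence, assuming its sources were unpacked into `liblinear/`, is:

```
# Build LIBLINEAR in place (see its own README for options such as the
# multi-core OpenMP variant).
cd liblinear
make
cd ..
```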
```
cd preprocess
```

Modify `run.py` by specifying `--input` (the path to the Yelp dataset), then run:

```
python run.py
```
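For example (the dataset path below is illustrative; if `run.py` does not parse command-line flags, edit the corresponding default inside the file instead):

```
# Hypothetical path; point --input at your local copy of the Yelp dataset.
python run.py --input /path/to/yelp_dataset
```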
```
cd train
```

You may modify the following arguments in `run.py`:

- `--yelp_round`: the round number of the Yelp data, e.g. {9, 10}
- `--para_lambda`: the trade-off parameter between the log-likelihood and the regularization term
- `--para_r`: the constraint on the L2-norm
- `--para_path`: the number of random walk paths for every review
- `--para_path_length`: the length of the random walk paths for every review
- `--path_p`: the return parameter for the second-order random walk
- `--path_q`: the in-out parameter for the second-order random walk
- `--para_alpha`: the restart parameter for the biased random walk
- `--para_bias_weight`: the bias parameter for the biased random walk

Because there are a number of ways to train models, you also need to specify `model_types` in `training.py`.

Then begin to train models (an example invocation is sketched below):

```
python run.py
```
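For reference, a hypothetical invocation with every argument spelled out. The values are placeholders for illustration, not recommended settings; if `run.py` reads these arguments from defaults inside the file, edit them there instead:

```
# Placeholder values for illustration only, not tuned settings.
python run.py --yelp_round 9 \
              --para_lambda 1.0 \
              --para_r 1.0 \
              --para_path 10 \
              --para_path_length 40 \
              --path_p 1.0 \
              --path_q 1.0 \
              --para_alpha 0.1 \
              --para_bias_weight 1.0
```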
```
cd sentiment
```

You may modify the following arguments in `run.py`; the arguments are the same as above.

It will run two `.py` files: `sentiment.py` performs sentiment classification on all users, while `head_tail.py` performs sentiment classification on head users and tail users. You can choose one or both.

Because there are a number of ways to train models, you also need to specify `model_types` in `sentiment.py` and `head_tail.py`.

Then begin the classification:

```
python run.py
```
We thank Tao Lei, as our code is developed based on his code.
```
cd attention
```

You can reproduce our results under different settings by modifying `run.sh`:

- Add user and word embeddings by specifying `--user_embs` and `--embedding`.
- Add train/dev/test files by specifying `--train`, `--dev`, and `--test` respectively.
- Choose the type of layers by specifying `--layer`, cnn or lstm.
- The three settings of our experiments can be achieved by specifying `--user_atten` and `--user_atten_base`:
  - Set `--user_atten 0` for "without attention".
  - Set `--user_atten 1 --user_atten_base 1` for "trained attention".
  - Set `--user_atten 1 --user_atten_base 0` for "fixed user vector as attention".

Then begin to train attention models (a sketch of a possible `run.sh` is given below):

```
bash run.sh
```
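For concreteness, here is a minimal sketch of what `run.sh` might contain for the trained-attention LSTM setting. It assumes `dc.py` is the training entry point in this directory; the embedding files are the samples shipped in `embs/`, and the train/dev/test paths are illustrative:

```
#!/bin/bash
# Illustrative script; adjust the paths to your own data splits.
python dc.py --user_embs ../embs/users_sample.txt \
             --embedding ../embs/words_sample.txt \
             --train train.txt \
             --dev dev.txt \
             --test test.txt \
             --layer lstm \
             --user_atten 1 --user_atten_base 1
```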
You can download pretrained embeddings and models in the release (the copy commands below show one way to place them):

- Put `embs/*` in the `embs/` directory.
- Put `sentiment_models/*` in the `sentiment/models/` directory.
- Put `attention_models/*` in the `attention/models/` directory.
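Assuming the release archive unpacks into a directory named `release/` (the name is illustrative), the placement amounts to:

```
# Copy the released artifacts into the locations the scripts expect.
cp -r release/embs/* embs/
cp -r release/sentiment_models/* sentiment/models/
cp -r release/attention_models/* attention/models/
```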
```
.
├── attention
│   ├── models
│   ├── nn
│   │   ├── __init__.py
│   │   ├── advanced.py
│   │   ├── basic.py
│   │   ├── evaluation.py
│   │   ├── initialization.py
│   │   └── optimization.py
│   ├── utils
│   │   └── __init__.py
│   ├── dc.py
│   └── run.sh
├── data
├── embs
│   ├── users_sample.txt
│   └── words_sample.txt
├── liblinear
├── preprocess
│   ├── english.pickle
│   ├── english_stop.txt
│   ├── json_to_csv_converter.py
│   ├── preprocess.py
│   └── run.py
├── sentiment
│   ├── format_data
│   ├── models
│   ├── results
│   ├── get_SVM_format_swe.c
│   ├── get_SVM_format_w2v.c
│   ├── head_tail.py
│   ├── run.py
│   └── sentiment.py
├── train
│   ├── run.py
│   ├── swe.c
│   ├── swe_with_2nd_randomwalk.c
│   ├── swe_with_bias_randomwalk.c
│   ├── swe_with_deepwalk.c
│   ├── swe_with_node2vec.c
│   ├── swe_with_randomwalk.c
│   ├── training.py
│   └── w2v.c
├── LICENSE
└── README.md
```
- *nix operating systems or WSL
- Python 2.7
- gcc
- Theano (>= 0.7 but < 1.0)
- NumPy
- gensim
- PrettyTable
- Pandas
- ujson
- NLTK
If you use this code, please cite our IJCAI 2018 paper:
```
@inproceedings{ijcai2018-634,
  title     = {Biased Random Walk based Social Regularization for Word Embeddings},
  author    = {Ziqian Zeng and Xin Liu and Yangqiu Song},
  booktitle = {Proceedings of the Twenty-Seventh International Joint Conference on
               Artificial Intelligence, {IJCAI-18}},
  publisher = {International Joint Conferences on Artificial Intelligence Organization},
  pages     = {4560--4566},
  year      = {2018},
  month     = {7},
  doi       = {10.24963/ijcai.2018/634},
  url       = {https://doi.org/10.24963/ijcai.2018/634},
}
```
Copyright (c) 2018 HKUST-KnowComp. All rights reserved.
Licensed under the MIT License.