This repository contains the code for our paper on expert recommendation systems based on syntax and semantic analysis. To utilize this code for your dataset, please follow the instructions outlined below.
The code is organized sequentially, with folders numbered from 1 to 4. Each folder represents a step in the process, starting with data preparation and ending with a multi-layer perceptron (MLP) model. If file order matters within a folder, it follows a similar numerical pattern. Detailed instructions for each section are provided below.
We utilized the Stack Exchange dataset, which can be downloaded from archive.org. Extract each dataset to the corresponding folder in the `./data` directory. For example, the biology dataset should be extracted to `./data/biology`.
In this repository, the terms 'task' and 'question' are used interchangeably: our paper refers to tasks, while in the Stack Exchange dataset each task corresponds to a question, so either term may appear depending on the context.
To convert XML files to CSV, execute all Python scripts in the `1_preparation` folder. This includes the `Comments`, `Posts`, `Tags`, `User`, and `Worker` directories. Paths are relative, and all output files are stored in the `data` directory. Modify the directory names in the Python scripts as needed for other datasets.
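The Stack Exchange dumps store each record as a `<row>` element whose fields are XML attributes. Below is a minimal sketch of that kind of conversion; the file path and column list are illustrative, not the exact ones used by the scripts.

```python
# Minimal sketch: convert a Stack Exchange XML dump (records stored as
# attributes of <row> elements) into a CSV file.
import csv
import xml.etree.ElementTree as ET

def xml_to_csv(xml_path, csv_path, columns):
    root = ET.parse(xml_path).getroot()
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        for row in root:  # each record is a <row .../> element
            writer.writerow({c: row.attrib.get(c, "") for c in columns})

# Illustrative path and columns; adjust to the dataset you extracted.
xml_to_csv("./data/biology/Posts.xml",
           "./data/biology/Posts.csv",
           ["Id", "PostTypeId", "OwnerUserId", "Score", "Tags", "Body"])
```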
This folder is dedicated to computing syntax and semantic similarities between texts.
In the `W2V` folder, we first compute permutations of tags to create all possible sequences of each question's tags (`1_separate_tags`). Then, we train word2vec vectors on these sequences (`2_create_word2vec`). The W2V model is used for computing semantic similarity.
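A minimal sketch of these two steps, assuming gensim; the tag lists and hyperparameters are illustrative:

```python
from itertools import permutations
from gensim.models import Word2Vec

# 1_separate_tags: turn each question's tag set into all orderings,
# so word2vec sees every tag in every context position.
question_tags = [["genetics", "dna", "evolution"], ["cell-biology", "dna"]]
sentences = [list(p) for tags in question_tags for p in permutations(tags)]

# 2_create_word2vec: train a word2vec model on the tag sequences.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

# Semantic similarity between two tags:
print(model.wv.similarity("dna", "genetics"))
```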
In the `2_tree` folder, we sort question tags based on their counts and compute syntax similarity using a tree structure. Tags are considered children, siblings, or parents based on their relationships to the input question, as detailed in our paper.
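A very rough sketch of the idea (not the paper's exact algorithm): sorting each question's tags by corpus frequency puts the most general tag at the top of a chain, and chains from different questions can then be compared to read off parent/sibling/child relations. The tag lists here are illustrative.

```python
from collections import Counter

corpus_tags = [["biology", "genetics", "dna"],
               ["biology", "genetics", "rna"],
               ["biology", "cell-biology"]]
freq = Counter(t for tags in corpus_tags for t in tags)

def tag_chain(tags):
    # Most frequent tag first: it is treated as the most general node.
    return sorted(tags, key=lambda t: freq[t], reverse=True)

input_chain = tag_chain(["dna", "genetics", "biology"])
candidate_chain = tag_chain(["rna", "genetics", "biology"])
print(input_chain)       # ['biology', 'genetics', 'dna']
print(candidate_chain)   # ['biology', 'genetics', 'rna']
# Shared prefix ['biology', 'genetics'] -> 'dna' and 'rna' act as siblings.
```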
First, we select one tag from the input question and compute the five most similar tags using the W2V model. Questions with at least one of these tags are selected. This process is recursively repeated for all tags of the new question until all tags are considered or the number of candidate questions is sufficiently reduced. The Apriori algorithm is employed to select new candidates at each step.
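A sketch of this candidate-narrowing loop, assuming a trained gensim word2vec model (from the W2V step) and a mapping from question id to its tag set; the Apriori-based selection of new candidates is simplified here to a plain shared-tag filter, and the recursion over newly selected questions is collapsed into a loop over the input tags.

```python
def narrow_candidates(input_tags, questions, model, max_candidates=50):
    """questions: dict mapping question_id -> set of tags."""
    candidates = set(questions)
    for tag in input_tags:
        if len(candidates) <= max_candidates:
            break  # candidate set is already small enough
        # Five most similar tags to the current tag, per the W2V model.
        similar = {t for t, _ in model.wv.most_similar(tag, topn=5)}
        similar.add(tag)
        # Keep questions that share at least one of these tags.
        candidates = {q for q in candidates if questions[q] & similar}
    return candidates

# Example (assuming `model` and `questions` exist):
# shortlist = narrow_candidates({"dna", "genetics"}, questions, model)
```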
In the `4_bert_score` folder, we use BERTScore to calculate the similarity between question bodies. Because this process is time-consuming, we first generate a shortlist of candidates by averaging the `tree` and `graph` candidates; BERTScore is then calculated only for candidates with higher scores.
We compute the average similarity scores of all candidates from the previous three phases (tree, graph, and BERT). This is done using two methods: intersection and ordering. In the first method, we find the intersection of the three candidate lists and select the questions recommended by all three similarity methods first; the remaining questions are then selected based on their scores. In the second method, all questions are ordered and selected based on their average similarity scores.
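A sketch of the two merging strategies, assuming each phase produced a dictionary mapping question id to a similarity score; names are illustrative.

```python
def average_scores(tree, graph, bert):
    ids = set(tree) | set(graph) | set(bert)
    return {q: (tree.get(q, 0) + graph.get(q, 0) + bert.get(q, 0)) / 3
            for q in ids}

def merge_by_intersection(tree, graph, bert, k):
    avg = average_scores(tree, graph, bert)
    common = set(tree) & set(graph) & set(bert)
    # Questions recommended by all three methods come first, then the rest.
    first = sorted(common, key=avg.get, reverse=True)
    rest = sorted(set(avg) - common, key=avg.get, reverse=True)
    return (first + rest)[:k]

def merge_by_ordering(tree, graph, bert, k):
    avg = average_scores(tree, graph, bert)
    return sorted(avg, key=avg.get, reverse=True)[:k]
```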
Here, we compute the difficulty of tasks based on the reputation of users who have completed similar tasks.
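A loose sketch of the idea: estimate a task's difficulty from the reputation of the users who completed similar tasks. The data frame and column names here are assumptions, not the script's actual schema.

```python
import pandas as pd

# Illustrative data: reputation of users who answered each question.
answers = pd.DataFrame({
    "question_id": [1, 1, 2],
    "answerer_reputation": [12000, 300, 150],
})

def difficulty(similar_question_ids):
    # Average reputation of users who completed the similar tasks.
    rows = answers[answers["question_id"].isin(similar_question_ids)]
    return rows["answerer_reputation"].mean()

print(difficulty([1, 2]))
```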
In this directory, we split the dataset into training, testing, and validation sets. We also create profiles for both requesters and workers, as well as situation vectors, which are used as inputs for training, testing, and validating the model.
In the `1_split_data` folder, the Python code divides the dataset into training, testing, and validation sets (80%, 10%, and 10%, respectively).
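A sketch of an 80/10/10 split, done here with scikit-learn and assuming 80% is the training portion; the path and implementation are illustrative and may differ from the script.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("./data/biology/dataset.csv")  # illustrative path
train, rest = train_test_split(data, test_size=0.2, random_state=42)
test, val = train_test_split(rest, test_size=0.5, random_state=42)
print(len(train), len(test), len(val))
```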
The `2_req_profile` folder contains two subfolders that generate the requesters' profiles, including fairness, expertise, and reputation.
In the `fairness/abandoned_tasks_rate` directory, we first calculate the average number of upvotes for accepted answers to a requester's questions. Then, we determine the number of questions that have answers with more upvotes than this average but were not chosen as the accepted answer. This can be an indication of unfair behavior.
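A sketch of this signal, assuming a data frame of one requester's questions with the accepted answer's upvotes and the maximum upvotes among the other answers; the column names are illustrative.

```python
import pandas as pd

questions = pd.DataFrame({
    "accepted_upvotes": [10, 2, 5],
    "max_other_upvotes": [8, 9, 5],
})

avg_accepted = questions["accepted_upvotes"].mean()
# Questions where a non-accepted answer beats the requester's average
# accepted-answer upvotes: a possible sign of unfair accepting behavior.
suspicious = questions[questions["max_other_upvotes"] > avg_accepted]
abandoned_rate = len(suspicious) / len(questions)
print(abandoned_rate)
```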
In the `fairness/deviation_from_community` directory, we identify answers that were marked as accepted but received fewer upvotes than other answers to the same question.
In the `exp_rep` directory, the reputation and expertise of requesters are computed. Expertise is measured as the number of accomplished tasks each requester has in a specific domain.
In this directory, the workers' profiles, including domain expertise and reputation, are generated.
Here, the input data for training, validating, and testing the ML model are generated. Each row includes worker, requester, and task information.
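A sketch of assembling one model input row from the three profile pieces; the features and their order are illustrative, not the repository's actual schema.

```python
def build_row(worker, requester, task):
    return [
        worker["expertise"], worker["reputation"],          # worker profile
        requester["fairness"], requester["expertise"],
        requester["reputation"],                            # requester profile
        task["difficulty"],                                 # task information
    ]

row = build_row({"expertise": 0.7, "reputation": 0.4},
                {"fairness": 0.9, "expertise": 0.5, "reputation": 0.6},
                {"difficulty": 0.3})
print(row)
```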
In this directory, we use a multi-layer perceptron (MLP) machine learning model. After the ML model produces the candidates, we use the BERTScore model implemented in `b_s.py` to rank them. First, update the dataset name in the `b_s.py` file, then run `1_ML_model.py`.
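A sketch of this final step, using scikit-learn for the MLP; the real `1_ML_model.py` may use a different framework and feature set. `X_train` stands for the worker/requester/task rows produced in the previous step, and `y_train` marks whether the worker actually answered the task.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

X_train = np.random.rand(100, 6)           # illustrative feature rows
y_train = np.random.randint(0, 2, 100)     # illustrative labels

mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
mlp.fit(X_train, y_train)

X_candidates = np.random.rand(10, 6)       # candidate worker/task rows
scores = mlp.predict_proba(X_candidates)[:, 1]
top = scores.argsort()[::-1][:5]           # shortlist to re-rank
# b_s.py then re-ranks this shortlist with BERTScore (see the similarity step).
print(top)
```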
For any questions, please feel free to open an issue or email javad.b.razavi@gmail.com.