If you would like to test out our model, feel free to check out the "Running our model" section. If you're more interested in how we gathered our features and tuned our model, check out the "Preparing data & model tuning" section.
To test our model, follow these steps:
- Create a conda virtual environment from our "our_env.yml" file. This will install all the necessary packages for the full project. Run `conda env create -f our_env.yml`.
- Download the prepared training and test dataframes, as well as the submission template, from the zip folder we provided in our project submission. Link: https://we.tl/t-aAiXrtjWks
- Choose which sklearn regressor you'd like to use. We've already imported LogisticRegression, Lasso, Ridge, SVR, MLPRegressor and RandomForestRegressor, but feel free to import one of your own!
For our Kaggle results we used MLPRegressor with the best parameters found by our Bayesian hyperparameter tuning script:
{'activation': 'logistic', 'alpha': 0.001, 'early_stopping': True, 'hidden_layer_sizes': 500, 'learning_rate': 'invscaling', 'learning_rate_init': 0.0001, 'verbose': True}
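For reference, this configuration can be passed straight into scikit-learn's MLPRegressor. A minimal sketch (we only construct the model here, since fitting requires the prepared dataframes):

```python
from sklearn.neural_network import MLPRegressor

# Best parameters from our Bayesian hyperparameter tuning run
best_params = {
    "activation": "logistic",
    "alpha": 0.001,
    "early_stopping": True,
    "hidden_layer_sizes": 500,
    "learning_rate": "invscaling",
    "learning_rate_init": 0.0001,
    "verbose": True,
}

model = MLPRegressor(**best_params)
# model.fit(X_train, y_train) once the prepared dataframes are loaded
```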
- Run the command `python3 pipeline.py`.
Prepping data
- function `get_auth_pid_tbl` from helpers/df_helpers.py: converted author_papers.txt into a human-friendly dataframe (doesn't take too long)
- function `get_abstract_tbl` from helpers/df_helpers.py: converted abstracts.txt into a human-friendly dataframe (doesn't take too long)
- function `abs_to_str` from helpers/inv_ind_to_txt.py: converted inverted indices to strings (this can take several hours...)
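For context, an abstract stored as an inverted index maps each word to its positions in the text, and `abs_to_str` reverses that mapping. A minimal sketch of the idea (`inv_index_to_text` is a hypothetical stand-in, not the actual helper):

```python
def inv_index_to_text(inverted_index):
    """Rebuild an abstract string from {word: [positions]} (hypothetical
    stand-in for the logic in helpers/inv_ind_to_txt.py)."""
    positions = []
    for word, idxs in inverted_index.items():
        for i in idxs:
            positions.append((i, word))
    # Sort by position and join the words back into a single string
    return " ".join(word for _, word in sorted(positions))

example = {"graphs": [0, 3], "are": [1], "everywhere": [2]}
print(inv_index_to_text(example))  # -> "graphs are everywhere graphs"
```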
Prepping features
All features were pre-computed and saved locally to avoid constant re-computation. For the features that required splitting up the data for faster computation, the not-so-pretty scripts for piecing them back together can be found in the exploration folder.
- Graph Features (all scripts can be found at features/graph_features.py)
  - Degree: the number of connections of each node.
  - Core number: the core number of a node is the largest k such that the node belongs to a k-core, i.e. a maximal subgraph in which every node has degree k or more.
  - Eigenvector centrality: emphasizes connections to important nodes (which themselves have many connections) over connections to isolated (or less connected) nodes.
  - Degree centrality: the number of connections of each node relative to the total number of nodes.
  - Clustering: the fraction of possible triangles through a node that actually exist, given the node's degree.
  - Triangles: counts the number of triangles each node participates in. A triangle is a set of three nodes where each node has a relationship to the other two.
  - function `node2vec_emb`: node embedding of the graph into a 5-dimensional space using the Node2Vec framework.
  - function `prone_emb`: node embedding of the graph into a 32-dimensional space using the ProNE framework.
  - function `graph_clustering`: uses DBSCAN, a non-linear clustering method, to find 2 clusters in the Node2Vec embedding.
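The scalar graph features above map directly onto networkx one-liners. A minimal sketch on a toy graph (we assume features/graph_features.py is built on networkx; the toy graph is illustrative only):

```python
import networkx as nx

G = nx.complete_graph(4)  # toy graph: 4 nodes, every pair connected

degree = dict(G.degree())                        # connections per node
core_number = nx.core_number(G)                  # largest k-core each node belongs to
eigen_centrality = nx.eigenvector_centrality(G)  # weights connections to important nodes
degree_centrality = nx.degree_centrality(G)      # degree normalized by n - 1
clustering = nx.clustering(G)                    # fraction of possible triangles present
triangles = nx.triangles(G)                      # triangle count per node

print(degree[0], core_number[0], triangles[0])  # -> 3 3 3
```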
- Text Features (all scripts can be found at features/text_features.py)
  - function `number_papers`: calculates the number of major papers an author has published (doesn't take too long)
  - function `total_sum_abs_words`: calculates the total number of words in an author's major papers' abstracts (takes a while... a couple of hours)
  - function `get_scibert_vectors`: concatenates the string versions of every abstract an author has and feeds that into the pre-trained SciBERT model to get a vector of length 768 for every author (can take a very, very long time... we recommend splitting the authors into several groups). We downloaded SciBERT from https://github.com/allenai/scibert, although you don't really need to if you use the HuggingFace PyTorch models.
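The first two text features can be sketched with pandas on toy dataframes (the column names here are illustrative, not the ones in our actual dataframes):

```python
import pandas as pd

# Toy stand-ins for the author-paper table and the abstract table
auth_pid = pd.DataFrame({"author": ["a1", "a1", "a2"],
                         "paper":  ["p1", "p2", "p3"]})
abstracts = pd.DataFrame({"paper": ["p1", "p2", "p3"],
                          "abstract": ["graph neural nets",
                                       "deep learning",
                                       "hello world"]})

# number_papers: papers published per author
number_papers = auth_pid.groupby("author")["paper"].nunique()

# total_sum_abs_words: total word count over each author's abstracts
merged = auth_pid.merge(abstracts, on="paper")
merged["n_words"] = merged["abstract"].str.split().str.len()
total_sum_abs_words = merged.groupby("author")["n_words"].sum()

print(number_papers["a1"], total_sum_abs_words["a1"])  # -> 2 5
```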
Hyperparameter Tuning:
To find out which hyperparameters work best for our problem and a given regression model, we ran the `bayesian_hopt.py` script found in the regressors directory. Choose the regressor you're interested in from LogisticRegression, Lasso, Ridge, RandomForestRegressor, KNeighborsRegressor, SVR and MLPRegressor. To run the script, you can use the example template given at the bottom of the script. Make sure to have a params and a params/trials folder so that you can save your results!
K-Fold Cross-Validation:
To prevent overfitting and make sure our model generalizes, we used 10-fold cross-validation, which can be found in regressors/cross_validation.py.
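The 10-fold setup can be sketched with scikit-learn's cross-validation utilities (synthetic data here; regressors/cross_validation.py runs on our prepared feature dataframes):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for our feature dataframe
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 2.0, 3.0, 4.0, 5.0]) + rng.normal(scale=0.1, size=200)

# 10 folds, shuffled for a fair split; one score per held-out fold
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv,
                         scoring="neg_mean_squared_error")
print(len(scores))  # -> 10
```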