BERT-based NN TopCoder Pricing Model
This is the repo for Benjamin's research thesis
The previous repo containing other models are too messy to manage - That's what happened when you have no idea about a data analysis research
So instead of making the original repo messier, I created a new repo to singly focus on builing the pricing model using BERT and Neural Network. There are several parts to builde the pricing model:
- Fine-tune bert with my requirement text
- Customize network and append meta data features in the BERT last layer states
- Train the model for a classification task
I will have some code/data copy-paste from the previous repo to save my time.
Insturctions of code struture
This setion is for Mingyang to locate the code that needs his help to review
Thanks for your help in advance Mingyang :D
-
get_data.py
: The data of TopCoder challenges is fetched and stored in a MySQL database on the server of Stevens, this script is for fetching the original data. -
tc_data.py
: This file define a classTopCoder
that read the data intopandas.DataFrame
with some simple preprocessing, including manuanly filtering out the data used for training of the neural network. It will return the text and metadata for 4,800-ish challenge data. -
model_tcpm_distilbert.py
: [HELP NEEDED HERE!] This file implement the neural network model I want to train in two ways:- Using functional api of
Tensorflow
. The model returned should be trainable as standardtf.keras.Model
usingmodel.fit
method. - Implement
TCPMDistilBertClassification
, a subclass ofTFDistilBertPreTrained
, which essentially is a subclass oftf.keras.Model
. And it can be trained usingTFTrainer
from thetransformers
.
- Using functional api of
-
fine_tune_bert.py
: [HELP NEEDED HERE AS WELL] This file implement the functionbuild_dataset
that convert the data frompandas.DataFrame
totf.data.Dataset
. And afine_with_tftrainer
which insantiates theTCPMDistilBertClassification
and trains the model. -
preprocessing_util.py
: Some text preproccessing utility funtion.