LSS-Similarity

Dataset Preparation

Prepare SQL/NL Data

Use the Web-Crawler-for-NL2SQL to extract the web-crawled SQL and natural language data. The data will be saved to the PostgreSQL dataset

Prepare SQL/NL Alignments

We use english word aligner to align words in SQL and NL and extract features from the alignments.

To align words in SQL and NL, first change directory to monolingual-word-aligner

cd monolingual-word-aligner

Then, run the following codes.

python run_all_mining.py
python3 InputWords.py
python run_all_align.py

Following codes align words in SQL and NL, and save the alignment information into the database alignments.dls_cu.

Prepare Answers

In order to train the XGBoost Model, manually marked answers are needed. The answers should be stored in PostgreSQL database as follows.

Schema: "answers"
Table: "sql_0_0_nl_0"
Headers: ["sql_id", "nl_id"]

Each row in answers.sql_0_0_nl_0 table stores answer as pair of ids ("sql_id", "nl_id").

Word Embedding

You can download the word embedding by following codes:

wget http://clic.cimec.unitn.it/composes/materials/EN-wform.w.5.cbow.neg10.400.subsmpl.txt.gz
gunzip EN-wform.w.5.cbow.neg10.400.subsmpl.txt.gz

Move EN-wform.w.5.cbow.neg10.400.subsmpl.txt to data directory.

Extract Features

Dataset Configuration

Before running the feature extraction code, you should config the dataset connection.

  Check out the ```config/config.py``` file:
  ```
  _user = 'USER'                # TODO: Fill in the user name of your PostgreSQL DB account
  _password = 'PASSWORD'        # TODO: Fill in the password of your PostgreSQL DB account
  _dbname = 'DATABASE_NAME'     # TODO: Fill in the database name where datasets are saved in your PostgreSQL DB 
  _host_product = 'localhost'
  _port='5432'
  conn_string = "dbname={dbname} user={user} host={host} password={password} port={port}"\
              .format(dbname=_dbname, user=_user, host=_host_product, password=_password, port=_port)
  ```

Feature Extraction

You can extract features from SQL, NL pair by running the following code.

python feature_extract.py

Features are stored in file data/features.txt in form (<sql_id> <nl_id> <feat0> <feat1> <feat2>)

Run the following code to re-format the file for XGBoost Model.

python modify_feature.py

All features and answers for each feature are stored in file data/feature_answer_all.txt in form (<sql_id> <nl_id> <feat0> <feat1> <feat2> <answer>) where <answer> is 1 or 0

Run XGBoost/Neural Network Model

Configuration

To train/test the XGBoost/Neural Network Model, firstly, configuration file should be set.

For XGBoost model, set up the config/xgb_params.json file:

{
 "test_ratio": 0.2,                    # Ratio of Test data in whole dataset
 "model_type": "xgb",                  # Type of model, (xgb/neural)
 "pretrained_model": "",               # To use the pretrained model, insert path to the pretrained model
 "save_model": false,                  # Save the final model for later use. (at data/saved_model)
 "model":{                             # Model configs. Search xgboost.XGBoostClassifier for more information
     "objective": "binary:logistic",   
     "eval_metric" : "auc",            
     "learning_rate" : 0.1,
     "max_depth": 1,
     "min_child_weight": 1,
     "subsample": 1,
     "colsample_bytree": 1,
     "gamma" : 0,
     "reg_alpha": 0,
     "reg_lambda": 1
     }
}

For Neural network model, set up the config/neural_params.json file:

{
 "test_ratio": 0.2,                    # Ratio of Test data in whole dataset
 "model_type": "neural",               # Type of model, (xgb/neural)
 "pretrained_model": "",               # To use the pretrained model, insert path to the pretrained model
 "save_model": false,                  # Save the final model for later use. (at data/saved_model)
 "model":{                             # Model configs.
     "objective": "regression",        # Type of neural network model, (regression/classification)
     "num_epoch": 500,                 # Number of epoch to train
     "lr": 1e-3,                       # Learning rate
     "hidden_dim": 10,                 # Dimension of hidden vector
     "validation": true                # Validate the model and find best model
     }
}

Train

Set up all the parameters in config/xgb_params.json file or config/neural_params.json file.

You can train the model and test the code by running the code below:

python nl_sql_dist.py --parameters xgb_params.json

With provided sample features, you would get results as follows:

Accuracy: 92.78%
True Positive: 42
False Positive: 4
False Negative: 3
True Negative: 48
-----------------------------
Precision: 0.91
Recall: 0.93
=============================

The ROC curve would be shown like this:

To train the neural network model, use the following code:

python nl_sql_dist.py --parameters neural_params.json

With proided sample features, you would get results as follows:

Accuracy: 94.85%
True Positive: 43
False Positive: 3
False Negative: 2
True Negative: 49
-----------------------------
Precision: 0.93
Recall: 0.96
=============================

The ROC curve would be shown like this:

Test

Firstly, you should update the config/xgb_params.json file or config/neural_params.json to direct pretrained model file.

You can load the existing model by running the following code:

python nl_sql_dist.py --parameters <PARAMETER_FILE_PATH>

We provide the pre-trained model for test, named data/saved_model/XGB_TEST1.dat.

If you use the pre-trained model, the program will use all the data in data/feature_answer_all.txt as test data.

You can get the test result with SQL/NL sentences by running the code below:

python extract_result.py

postech-db-lab-starlab/LSS-Similarity