LING131-Final-Project

This is the repository for the LING131 final project.

"presentation_slides.pptx" contains the slides we used for our course presentation.

"LING131_Project_Report.pdf" is our final report for this course project.

This repository contains three directories:

To Run

This directory holds the Jupyter Notebook file "LING131 Final Project(Team rm -rf).ipynb", which contains all the code used in this project.

Because the actual datasets are available only on the Kaggle server and can be accessed only through the competition kernel, we agreed with Professor Verhagen to provide all of our code together with screenshots showing that it ran successfully on the Kaggle server (the run took roughly 6 hours).

However, we strongly recommend that anyone reading this join the competition as well and test our code in a private kernel.

"coderun1.png" and "coderun2.png" show that our code ran successfully on the Kaggle server.

"marketdata_sample.csv" and "news_sample.csv" are sample files provided by Kaggle. They give an overview of what the market and news data frames look like, to help people get started with the competition.
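To get a quick look at the two sample files without running the full notebook, something like the following minimal sketch may help. It is not part of the project code: the helper name `preview_csv` is hypothetical, and the script assumes it is run from the directory containing the sample files (adjust the paths otherwise). It uses only the Python standard library.

```python
import csv
from itertools import islice


def preview_csv(path, n_rows=5):
    """Return (header, rows) where rows holds at most n_rows data rows.

    Hypothetical helper for illustration; not taken from the project notebook.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])           # first line is treated as the header
        rows = list(islice(reader, n_rows)) # read only the first few data rows
    return header, rows


if __name__ == "__main__":
    # File names come from this README; their location is an assumption.
    for name in ("marketdata_sample.csv", "news_sample.csv"):
        try:
            header, rows = preview_csv(name)
        except FileNotFoundError:
            print(f"{name}: not found (adjust the path to where the file lives)")
            continue
        print(f"{name}: {len(header)} columns")
        print(header)
        for row in rows:
            print(row)
```

The same helper can be pointed at the submission file to see its layout, provided that file is stored as CSV.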

"actual_submission" is the real submission file that our code generated on the Kaggle server; it scored in the top 10% of the competition. This file shows what a submission file looks like.

what didn't work

This directory holds all our attempts at feature engineering, feature selection, and boosting algorithms. These notebooks can also only be run in the Kaggle competition kernel.

"with_only_market_data.ipynb" is the version we tried using only market data and no news data. Surprisingly, this version's result is slightly worse than that of our submitted version, which suggests that the news data did help with the prediction.

"xgboost_version.ipynb" is the version in which we used XGBoost instead of LightGBM. The reason we discarded it is explained in the report.