Stock Price Prediction

File Structure

stock_price_prediction
│   .gitignore      <- File instructing git to ignore specified files
│   README.md       <- Description of the project, developers, file structure, workflow, etc.
├───data
│   ├───external    <- Third Party Data
│   ├───interim     <- Transformed intermediate data, not ready for modelling
│   ├───processed   <- Prepared data, ready for modelling
│   └───raw         <- Immutable original data
├───models          <- Serialized models
├───notebooks       <- Jupyter notebooks for exploration, communication and prototyping
└───src             <- Folder containing project source code
    ├───data        <- Folder containing scripts to download or generate data
    ├───features    <- Folder containing scripts to transform data for modelling
    └───models      <- Folder containing scripts to train models and make predictions
  • data
    • external : This is data extracted from third-party sources (immutable data). If no third-party data is extracted, this folder can be omitted.
    • interim : If external data is available, this is the data we load for feature engineering, using a script in the src/data directory. This dataset is generated by performing various joins and/or merges to combine the external and raw data (see the merge sketch after this list).
    • processed : This is the data that is ready for modelling. The scripts in the src/features folder (covered below) perform the transformations that get it into this state. It is a good idea to persist the processed data in order to shorten the training time of our model (see the feature-building sketch after this list).
    • raw : Keeping a local copy (or subset) of the data ensures that you have a static dataset to work on, and it avoids workflow breakdowns caused by network latency. This data should be considered immutable. If there is no external data, this is the data downloaded by the script in src/data (see the download sketch after this list).
  • models : We use a script in src/models to train our machine learning model. We may later need to reload the trained model, for example to combine it with other models in an ensemble, to compare candidates, or to deploy the one we choose. To make this possible, we save the trained model to a file (usually in pickle format) in this directory (see the training sketch after this list).
  • notebooks : Jupyter notebooks are excellent for prototyping, exploring and communicating findings, but they are poorly suited to long-term growth and can be less effective for reproducibility. Notebooks can be further divided into sub-folders such as notebooks/explorations and notebooks/PoC. A good naming convention helps to distinguish what populates each notebook. A useful template is <step>-<user>-<description>.ipynb (e.g. 01-kpy-eda.ipynb), where <step> serves as an ordering mechanism, <user> is the creator's first-name initial followed by the first two letters of their surname, and <description> summarizes what the notebook contains.
  • src :
    • data : This directory holds the scripts that ingest the data from wherever it is generated and transform it into a state where further feature engineering can take place.
    • features : In this directory, we have a script that manipulates the data and puts it in a format that can be consumed by our machine learning model.
    • models : Contains scripts that are used to build and train our model.
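
The sketches below illustrate each stage of the pipeline. First, a minimal sketch of the raw-data download script that would live in src/data; it assumes the yfinance package, and the ticker, date range and file names are hypothetical placeholders rather than the project's actual choices.

```python
# src/data/download.py -- sketch of the raw-data download step.
# Assumes yfinance; the ticker, dates and paths are hypothetical.
from pathlib import Path

import yfinance as yf

RAW_DIR = Path("data/raw")


def download_raw(ticker: str = "AAPL") -> Path:
    """Fetch daily prices and persist them immutably under data/raw."""
    RAW_DIR.mkdir(parents=True, exist_ok=True)
    prices = yf.download(ticker, start="2015-01-01", end="2024-12-31")
    out_path = RAW_DIR / f"{ticker}.csv"
    prices.to_csv(out_path)
    return out_path


if __name__ == "__main__":
    print(f"Saved raw data to {download_raw()}")
```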
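
Next, a sketch of the interim step that combines the raw and external data; the file names and the shared Date join key are assumptions for illustration.

```python
# src/data/make_interim.py -- sketch of combining raw and external data.
# File names and the "Date" join key are hypothetical.
from pathlib import Path

import pandas as pd


def build_interim() -> pd.DataFrame:
    raw = pd.read_csv("data/raw/AAPL.csv", parse_dates=["Date"])
    external = pd.read_csv("data/external/macro_indicators.csv", parse_dates=["Date"])
    # A left join keeps every trading day from the raw prices and attaches
    # whatever external columns exist for that date.
    interim = raw.merge(external, on="Date", how="left")
    Path("data/interim").mkdir(parents=True, exist_ok=True)
    interim.to_csv("data/interim/merged.csv", index=False)
    return interim
```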
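
A sketch of the feature-building step in src/features, which persists its output under data/processed so that training can skip this work; the feature columns and window sizes are illustrative only.

```python
# src/features/build_features.py -- sketch of the feature-engineering step.
# Column names and window sizes are hypothetical.
from pathlib import Path

import pandas as pd


def build_features() -> pd.DataFrame:
    df = pd.read_csv("data/interim/merged.csv", parse_dates=["Date"])
    df["return_1d"] = df["Close"].pct_change()    # daily return
    df["ma_5"] = df["Close"].rolling(5).mean()    # short moving average
    df["ma_20"] = df["Close"].rolling(20).mean()  # long moving average
    df["target"] = df["Close"].shift(-1)          # next-day close as the label
    df = df.dropna()
    Path("data/processed").mkdir(parents=True, exist_ok=True)
    df.to_csv("data/processed/features.csv", index=False)
    return df
```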
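
Finally, a sketch of the training script in src/models that fits a model and pickles it into the models directory; the choice of scikit-learn's LinearRegression and the feature list are assumptions, not the project's actual model.

```python
# src/models/train.py -- sketch of training a model and serializing it.
# The model class and feature columns are hypothetical.
import pickle
from pathlib import Path

import pandas as pd
from sklearn.linear_model import LinearRegression


def train() -> Path:
    df = pd.read_csv("data/processed/features.csv")
    features = ["return_1d", "ma_5", "ma_20"]
    model = LinearRegression()
    model.fit(df[features], df["target"])
    Path("models").mkdir(exist_ok=True)
    model_path = Path("models/linear_regression.pkl")
    with model_path.open("wb") as f:
        pickle.dump(model, f)  # the serialized model lives in models/
    return model_path
```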

Reference: Structuring Machine Learning projects

Developers