/Crude_Oil_Price_Projection

Group Project: we intend to develop a robust machine learning model that will deliver a one-year projection for the price of West Texas Intermediate (WTI) Crude Oil.

Primary LanguageJupyter Notebook

Crude Oil Price Projection

GWU Data Analytics Boot Camp

plot

Our Team:

Siyada Harker,

Stephan Weir,

Tsaschiker Ulziibadrakh

Roles:

  • Data Source Research, Data Cleaning, and Database Storage

    • Siyada (Cleaning & Storage)
    • Stephan (Data Source Research)
  • Machine Learning

  • Dashboard

    • Tesa (Tableau)
    • Siyada (HTML/CSS, GitHub Pages)
  • GitHub Repository & Slides

    • Tesa
    • Siyada
    • Stephan

Software:

  • Data Cleaning

    • Jupyter Notebook 6.4.6
      • Python (ETL process)
        • Pandas library
    • Quick DBD (Create ERDs)
    • PostgreSQL (Join/Merge datasets)
  • Database Storage

  • Machine Learning

    • Jupyter Notebook or Google Colaboratory (Google Colab Notebook)
      • Python
        • Prophet
        • ARIMA
        • LSTM (Keras/TensorFlow)
        • Random Forest Regressor
  • Dashboard

    • Tableau Public 2021.3.3
    • Visual Studio Code 1.63.2
      • HTML/CSS
    • GitHub Pages
  • Slides

    • Google Slides

Project Proposal

For this project, we intend to develop a robust machine learning model that will deliver a one-year projection for the price of West Texas Intermediate (WTI) Crude Oil, one of the most well-known and widely produced blends of crude in the United States. To do so, we will be examining historical and current data from the Energy Information Agency (EIA), a federal agency that tracks production, sales, and spot & futures prices of WTI Crude. The price projection will be a function of oil production, refinery utilization and capacity, sales, and a variety of other indicators provided by the Energy Information Agency (EIA). We will test four different machine learning methods in order to determine the best fit for this task, shown below: -Random Forest Regression -ARIMA -Prophet LSTM RNN (Recurrent Neural Network)

Business Case

Many industries, including shipping companies, oil producers, commodities traders, and banks rely heavily upon projections of oil prices to support their decision-making in today’s economy. However, with oil price volatility reaching all-time highs in the past ten years, and with the waning influence of OPEC on the world’s oil production, the task of generating consistent, accurate projections has become increasingly difficult. This has created a considerable opportunity for data scientists and machine learning engineers to apply their skillsets, as well as a demand for complex decision-support systems that firms can employ to navigate the unprecedented dynamics in today’s oil market.

Questions the team hopes to answer

  • Where will the data be sourced from?

  • How will the data be data be transformed and stored?

  • Which machine learning models and libraries should be used?

  • How will the results be visualized?

  • Which model produces the most accurate projections for WTI?

Resources:

  • Gathering data from U.S. Energy Information Administration (EIA)
    • Crude Oil Production dataset

      The volume of crude oil produced from oil reservoirs during given periods of time. The amount of such production for a given period is measured as volumes delivered from lease storage tanks (i.e., the point of custody transfer) to pipelines, trucks, or other media for transport to refineries or terminals with adjustments for (1) net
      differences between opening and closing lease inventories, and (2) basic sediment and water (BS&W).

    • Product Supplied dataset

      Approximately represents consumption of petroleum products because it measures the disappearance of these products from primary sources, i.e., refineries, natural gas-processing plants, blending plants, pipelines, and bulk terminals. In general, Product Supplied of each product in any given period is computed as follows: Product supplied = field production + refinery production + imports + unaccounted for crude oil (+ net receipts when calculated on a PAD District basis) - stock change - crude oil losses - refinery inputs - exports

    • Refinery Utilization and Capacity dataset

      Ratio of the total amount of crude oil, unfinished oils, and natural gas plant liquids run through crude oil distillation units to the operable capacity of these units.

    • NYMEX Futures Prices dataset

      New York Mercantile Exchange, The price quoted for delivering a specified quantity of a commodity at a specified time and place in the future.

    • WTI Crude SPOT Prices Historical dataset

      West Texas Intermediate, The price for a one-time open market transaction for immediate delivery of a specific quantity of product at a specific location where the commodity is purchased "on the spot" at current market rates.

The ETL Process (Extract, Transform, Load)

1. Extract :

  • There are 5 datasets stored as CSV files
    • Crude Oil Production
      • Area: U.S.
      • Period-Unit: Monthly-Thousand Barrels
      • Date: Jan-1920 to Oct-2021
    • Product Supplied
      • Area: U.S.
      • Period-Unit: Monthly-Thousand Barrels
      • Date: Jan-1936 to Oct-2021
    • Refinery Utilization and Capacity
      • Area: U.S.
      • Period: Monthly
      • Date: Jan-1985 to Oct-2021
    • NYMEX Futures Prices
      • Area: U.S.
      • Period: Daily
      • Date: Mar 30, 1983 to Jan 11, 2022
    • WTI Crude SPOT Prices Historical
      • Area: U.S.
      • Period: Daily
      • Date: 1/10/2022 to 1/2/1986

2. Transform :

  • Create a DataFrame with the columns that we want to keep or Drop unnecessary columns
  • Rename the columns by assigning a list of new column names
  • Drop the null values using dropna() method
  • Convert Data types on "Date" column using to_datetime method
  • Sort ascending order on "Date" column in the WTI Crude SPOT Prices Historical DataFrame using sort_values method
  • Filter on "Date" column of all DataFrames between (1986-01-01 & 2021-10-31)
  • Calculate the average for each month and year in the NYMEX Futures Prices and the WTI Crude SPOT Prices Historical DataFrames
  • Round to specific decimals places
  • Convert the "Date" column to datetime (in abbreviated Month-Year or mmm-yyyy format) using strftime("%b-%Y") method
  • Join/Merge datasets using PostgreSQL plot

3. Load :

  • Connect Pandas and SQL: using Pandas to_sql method paired with sqlAlchemy plot

SQL Database

Database Schema ERD

plot

Data Dictionary

plot

Definitions and Explanatory Notes

plot

Database Storage

  • In PostgreSQL storage_sql

  • In Google Cloud Platform (GCP)

GCP

Machine Learning

*** Random Forest Regressor ***

  • Rank the importance of features

rdf_rank

  • Accuracy

random_forest_regressor_accuracy

  • Random Forest Visualization

random_forest_regressor


*** ARIMA ***

  • AutoRegressive Integrated Moving Average
    • (AR) Regresses target variable on its own prior values.
      • lag parameter = p
    • (I) Uses differences between observations and their prior values to attempt to make data stationary
      • differencing parameter = d
    • (MA) Accounts for a moving average of lagged observations
      • Moving average window size = q
    • Advantages:
      • Easy to use when analyzing and creating projections for time-series data
    • Drawbacks:
      • Potentially ineffective in analyzing data with a non-stationary mean (unit root)
      • Extremely sensitive to large shocks

ARIMA Model Structure

ARIMA Model Output

ArimaOutput

metrics


*** Long Short-Term Memory (LSTM) ***

  • Recurrent Neural Network (RNN)
    • Addresses the issue of ‘vanishing gradients’ in neural networks with its gate structure
      • Multi-layer networks using certain activation functions experience gradient decay in the loss function, making them harder to train
  • Drawbacks:
    • Longer model runtime
    • Prone to overfitting, especially as depth is added to model

LSTM Cell Structure

  • Forget Gate: Decides which prior inputs are pertinent to the model, and which should be ignored.
  • Input Gate: Determines which relevant information can be added from the current step
  • Cell State: The path along which information is transported between gates.
  • Output Gate: Determines the value of the next hidden state, based on information from input gate and the hidden state.
  • Activation Functions: Regulate data in the network, aid in determining if data should be updated or forgotten. lstm structure

Multi-Layer LSTM Model

ModelStructure1 ModelStructure2

LSTM Output

Results


*** Prophet ***

  • Developed by Facebook/Meta
    • Additive Regression Model
      • Includes seasonality component using a Fourier Series: expansion of a periodic function
        • Allows the model to more easily account for recurring trends in time series datasets
  • Drawbacks:
    • Relatively ‘simplistic’ architecture, features limited when compared to other multivariate models model build Results_logtransformed Results_originalValues accuracy measure

Dashboard

Our Tableau Dashboard has four pages where each dispays detailed visualization of WTI Crude Oil Field Production, Product Supplied of Crude Oil, Petrolium Products, and Future Price analysis.

Here is the sneak peak of the Dashboard:

home

price copy

Link to Tableau Dashboards

GitHub Pages

GitHubPages_Crude_oil

Link to GitHub Pages

Google Slides

Slides

Link to Google Slides

Results

Results

Summary

  • In conclusion, our results indicate that Prophet yields the highest accuracy score for forecasting WTI Crude Oil futures, based on the MAPE scores.
    • While further improvements can certainly be made to all of these models, Prophet offers an intuitive, user-friendly forecasting platform that boasts impressive accuracy, given its simplicity.
  • Random Forest Regressors, particularly when tuned to account for only the most important variables, also serve as a powerful forecasting tool when applied to WTI pricing.
  • ARIMA struggles to efficiently cope with the volatility of recent Crude pricing data, as seen in the forecast graph. While the ARIMA calculates a relatively acceptable MAPE, the trend line does not cope well during periods of high price volatility.
  • LSTM models have high potential for accurate modeling of financial data, but it appears that the hyperparameters of the model would require extensive tuning in order to be improved.
    • Might also benefit from a larger dataset

# Recommendation for future analysis

  • Use of weekly or daily data
    • Possible increase in model accuracy, at the cost of:
      • Increased runtime
      • Increased potential for model overfitting
  • Inclusion of pricing data for historically-correlated asset classes (USD, Gold, etc)
    • Caution: in recent years, some of these correlations have shifted
  • Inclusion of global oil demand figures, not just product from domestic suppliers

# Anything the team would have done differently

For data:

  • There are several influences on oil prices
    • We could have explored additional data types, such as
      • Weather patterns
      • Natural disasters
      • Political Instability
      • Military Conflicts
      • Correlated financial instruments

For Machine Learning:

  • Further exploration of different model types
    • VARMA (Vector AutoRegressive Moving Average with differencing)
      • Multivariate iteration of ARIMA
    • LSTM encoder-decoder Model (interprets data sequence-by-sequence)
  • Extrapolation of projections beyond the constraints of the time series
    • Considerable scarcity of publicly-available examples/recommendations of how this is accomplished, especially in the case of multivariate, multi-step models
  • Deeper understanding of hyperparameter tuning for more complex models
    • Would assist in the improvement of model accuracy

For Dashboard:

  • Explore more datasets from various sources and include more categorical data for calculation and visualization.