/regression-oil-production-prediction

Production prediction is one of the company's core problems. The provided dataset covers a set of nearby wells in the United States and their 12-month cumulative production. The company's data scientist needs to build a model from scratch to predict production.

Primary language: Jupyter Notebook

Oil Production Prediction

This is a fictional project for studying purposes. The business context and the insights are not real.

1. Description of the Business Problem

Production prediction is one of the company's core problems. The provided dataset covers a set of nearby wells in the United States and their 12-month cumulative production. The company needs a production prediction model as one of the tools that support its decisions, so its data scientist must build a model from scratch and show the manager that it performs well on unseen data.

The tools that were created:

Machine Learning Regression Model: Using the dataset provided by the company, a machine learning regression model was created for future predictions.

The notebook used to create the model is available here.

Streamlit App for Production Prediction: The model is deployed on Streamlit Cloud and can be used through the Streamlit app. The app is available here.

2. Dataset Attributes

| Attribute | Description |
| --- | --- |
| treatment company | The company that provides the treatment service. |
| azimuth | Well drilling direction. |
| md (ft) | Measured depth. |
| tvd (ft) | True vertical depth. |
| date on production | First production date. |
| operator | The well operator that performs the drilling service. |
| footage lateral length | Length of the horizontal well section. |
| well spacing | Distance to the closest nearby well. |
| porpoise deviation | Maximum deviation of the well from its horizontal (ft). |
| porpoise count | How many times the deviations (porpoises) occurred. |
| shale footage | Amount of shale encountered in the horizontal well (ft). |
| acoustic impedance | The impedance of the reservoir rock (ft/s * g/cc). |
| log permeability | A rock property that indicates how easily fluids (gas or liquid) flow through the rock. |
| porosity | The percentage of void space in the rock. |
| poisson ratio | The ratio of lateral strain to axial strain in the linearly elastic region. |
| water saturation | The ratio of water volume to pore volume. |
| toc | Total Organic Carbon, which indicates the organic richness (hydrocarbon generative potential) of the reservoir rock. |
| vcl | The amount of clay minerals in the reservoir rock. |
| p-velocity | The velocity of P-waves (compressional waves) through the reservoir rock (ft/s). |
| s-velocity | The velocity of S-waves (shear waves) through the reservoir rock (ft/s). |
| youngs modulus | The ratio of the applied stress to the fractional extension (or shortening) of the reservoir rock parallel to the tension (or compression) (gigapascals). |
| isip | Instantaneous shut-in pressure (ISIP): the pressure that remains when the pumps are quickly stopped, the fluids stop moving, and the friction pressures disappear. |
| breakdown pressure | The pressure at which a hydraulic fracture is initiated. |
| pump rate | The volume of liquid that travels through the pump in a given time. |
| total number of stages | Total stages used to fracture the horizontal section of the well. |
| proppant volume | The amount of proppant used in the completion of the well (lbs). |
| proppant fluid ratio | The ratio of proppant volume to fluid volume (lbs/gallon). |
| production | The 12-month cumulative gas production (mmcf). |

3. Solution Strategy

  1. Understand the business problem.
  2. Clean the dataset by removing outliers, NA values, and unnecessary features.
  3. Explore the data to form hypotheses, propose a few insights, and validate them.
  4. Prepare the data for the modeling algorithms by encoding variables, splitting it into train and test sets, and applying other necessary operations.
  5. Create the models using machine learning algorithms.
  6. Evaluate the models to find the one that best fits the problem.
  7. Tune the model to achieve better performance.
  8. Deploy the model to production so that it is available to other people.
  9. Identify possible improvements to explore in the future.
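
Steps 2 and 4 can be sketched roughly as follows. This is a minimal illustration on a tiny made-up frame, not the notebook's actual code: the column subset, the 1.5 * IQR outlier rule, the 80/20 split, and the helper names `clean` and `prepare` are all assumptions.

```python
import numpy as np
import pandas as pd


def clean(df: pd.DataFrame, target: str = "production") -> pd.DataFrame:
    """Drop rows without a target and rows whose target is an IQR outlier."""
    df = df.dropna(subset=[target])
    q1, q3 = df[target].quantile([0.25, 0.75])
    iqr = q3 - q1
    keep = df[target].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df[keep].reset_index(drop=True)


def prepare(df: pd.DataFrame, target: str = "production",
            test_frac: float = 0.2, seed: int = 42):
    """One-hot encode categorical columns, then split into train and test."""
    encoded = pd.get_dummies(df, dtype=float)
    test = encoded.sample(frac=test_frac, random_state=seed)
    train = encoded.drop(index=test.index)
    return (train.drop(columns=target), train[target],
            test.drop(columns=target), test[target])


# Made-up rows for illustration only (one missing target, one extreme value).
raw = pd.DataFrame({
    "operator": ["A", "B"] * 6,
    "proppant volume": [1.0e6, 2.0e6, 1.5e6, 2.5e6, 1.2e6, 1.8e6,
                        2.2e6, 1.1e6, 1.9e6, 2.1e6, 1.4e6, 1.6e6],
    "production": [900, 1500, 1100, 1700, 1000, 1300,
                   1600, np.nan, 1400, 1550, 1050, 50000],
})

cleaned = clean(raw)
X_train, y_train, X_test, y_test = prepare(cleaned)
print(cleaned.shape, X_train.shape, X_test.shape)
```

Splitting after encoding keeps the train and test frames on the same set of columns, which is the main reason for that ordering in this sketch.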

4. The Insights

I1: Wells with a greater number of stages produce more.

True: This relationship doesn't apply for all values of total number of stages, but it tends to be true.

I2: Wells that started producing longer ago produce less.

True: Newer wells produce more.

I3: Wells that are farther from the others produce more.

False: Production doesn't increase with the distance from other wells.

I4: Wells in which more proppant was used produce more.

True: More proppant is associated with greater production.

I5: Wells in which the rocks have higher values of porosity produce more.

False: Higher porosity does not mean more production.
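
One simple way to test hypotheses like I4 or I5 is a rank correlation between the feature and production. The sketch below runs on synthetic data generated only for illustration, so it says nothing about the real wells. The helper name `spearman` and the generated values are assumptions, and the notebook may have validated the insights differently (for example, with plots).

```python
import numpy as np
import pandas as pd


def spearman(x: pd.Series, y: pd.Series) -> float:
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    return float(x.rank().corr(y.rank()))


# Synthetic stand-in data: by construction, production rises with proppant
# volume and is unrelated to porosity.
rng = np.random.default_rng(0)
n = 200
wells = pd.DataFrame({
    "proppant volume": rng.uniform(1e6, 3e6, n),
    "porosity": rng.uniform(0.03, 0.12, n),
})
wells["production"] = 5e-4 * wells["proppant volume"] + rng.normal(0, 150, n)

for feature in ["proppant volume", "porosity"]:
    rho = spearman(wells[feature], wells["production"])
    print(f"{feature}: rho = {rho:.2f}")
```

A rank correlation only captures monotonic trends, so a relationship that holds over part of the range (as noted for I1) is better inspected by binning the feature or plotting it.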

5. Machine Learning Modeling

The final result of this project is a regression model. Seven models were trained: Linear Regression, Lasso, SVM, Random Forest, XGBoost, LightGBM, and CatBoost.

Boruta (a feature selection algorithm) was used to select features, and 11 features were selected for the final model. The models were evaluated on three metrics: Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE). The initial model performances are in the table below.

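| Model Name | MAE | MAPE | RMSE |
| --- | --- | --- | --- |
| CatBoost | 502.93 | 0.2817 | 781.34 |
| LightGBM | 522.03 | 0.2936 | 806.55 |
| XGBoost | 535.10 | 0.3094 | 813.48 |
| Random Forest | 564.38 | 0.3281 | 852.23 |
| SVM | 648.01 | 0.4468 | 931.77 |
| Linear Regression | 679.33 | | 1012.51 |
| Lasso | 1018.08 | 0.4259 | 1396.98 |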
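
For reference, the three metrics can be computed as in the sketch below. It uses the standard definitions with plain NumPy; the function names are illustrative, and the notebook may rely on library implementations instead. MAPE is returned as a fraction (0.28 means 28%), which appears to be the scale used in the table.

```python
import numpy as np


def mae(y_true, y_pred) -> float:
    """Mean Absolute Error: average size of the errors, in target units."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))


def mape(y_true, y_pred) -> float:
    """Mean Absolute Percentage Error as a fraction; y_true must be non-zero."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))


def rmse(y_true, y_pred) -> float:
    """Root Mean Squared Error: penalizes large errors more than MAE does."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up values for illustration only.
y_true = [1000.0, 2000.0, 4000.0]
y_pred = [1100.0, 1800.0, 4000.0]
print(mae(y_true, y_pred), mape(y_true, y_pred), rmse(y_true, y_pred))
```

RMSE is never smaller than MAE on the same predictions, so a large gap between the two (as in the table) suggests that a few wells have much larger errors than the rest.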

6. Final Model

To choose the final model, cross-validation was carried out to evaluate the algorithms more robustly. The resulting metrics are in the table below.

| Model Name | MAE | MAPE | RMSE |
| --- | --- | --- | --- |
| Linear Regression | 687.8 +/- 49.40 | 0.49 +/- 0.04 | 974.12 +/- 90.88 |
| Lasso | 1023.65 +/- 61.45 | 0.89 +/- 0.06 | 1348.19 +/- 96.97 |
| SVM | 651.62 +/- 28.27 | 0.51 +/- 0.06 | 897.34 +/- 60.87 |
| Random Forest | 521.82 +/- 26.99 | 0.36 +/- 0.02 | 768.7 +/- 74.63 |
| XGBoost | 526.78 +/- 14.36 | 0.35 +/- 0.02 | 773.11 +/- 52.73 |
| LightGBM | 525.71 +/- 31.97 | 0.34 +/- 0.02 | 767.4 +/- 58.25 |
| CatBoost | 490.18 +/- 16.5 | 0.32 +/- 0.02 | 724.79 +/- 54.17 |
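
Entries of the form "mean +/- spread" like those above can be produced with K-fold cross-validation, as in the sketch below. Several things here are assumptions: that the spread is the standard deviation across folds (the text does not say), the choice of 5 shuffled folds, the helper name `cv_summary`, the synthetic data, and the use of two scikit-learn models as stand-ins for the seven models compared.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

SCORING = {
    "MAE": "neg_mean_absolute_error",
    "MAPE": "neg_mean_absolute_percentage_error",
    "RMSE": "neg_root_mean_squared_error",
}


def cv_summary(model, X, y, n_splits=5, seed=42):
    """Return {metric: (mean, std)} over the folds of a shuffled K-fold split."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_validate(model, X, y, cv=folds, scoring=SCORING)
    # scikit-learn reports errors as negative scores, so flip the sign.
    return {
        name: (float(np.mean(-scores[f"test_{name}"])),
               float(np.std(-scores[f"test_{name}"])))
        for name in SCORING
    }


# Synthetic data; the offset keeps targets away from zero so MAPE is stable.
X, y = make_regression(n_samples=200, n_features=11, noise=20.0, random_state=0)
y = y + 2000.0

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
summaries = {name: cv_summary(model, X, y) for name, model in models.items()}
for name, summary in summaries.items():
    row = ", ".join(f"{m} {mean:.2f} +/- {std:.2f}"
                    for m, (mean, std) in summary.items())
    print(f"{name}: {row}")
```

Using the same fold split for every model keeps the comparison fair, since each model is then scored on identical held-out wells.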

As the cross-validation table shows, CatBoost performed best and was chosen for deployment. A random search over hyperparameters was then used to improve its performance. The final model's evaluation metrics are in the table below.

| Model Name | MAE | MAPE | RMSE |
| --- | --- | --- | --- |
| CatBoost Tuned | 485.66 +/- 23.01 | 0.32 +/- 0.02 | 714.4 +/- 64.6 |
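
A random search of this kind can be sketched with scikit-learn's `RandomizedSearchCV`. This is not the notebook's actual setup: `GradientBoostingRegressor` is swapped in for CatBoost so the example depends only on scikit-learn, and the search space, the number of iterations, the 3-fold split, and the synthetic data are all assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data with 11 features.
X, y = make_regression(n_samples=200, n_features=11, noise=20.0, random_state=0)

# Hypothetical search space; each candidate draws one value per parameter.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.03, 0.1, 0.3],
    "max_depth": [2, 3, 4],
    "subsample": [0.7, 1.0],
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),  # stand-in for CatBoost
    param_distributions=param_distributions,
    n_iter=5,                           # number of random candidates to try
    scoring="neg_mean_absolute_error",  # higher is better, so MAE is negated
    cv=3,
    random_state=0,
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated MAE:", -search.best_score_)
```

Random search samples a fixed number of candidates instead of trying every combination, which keeps the cost predictable when the search space is large. The modest gain in the table (MAE 490.18 to 485.66) is within the reported spread, so the tuned and untuned models may not differ meaningfully.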

7. Conclusion

Although the dataset has many features, it is small and has a significant number of missing values. The model's error was larger than expected, a problem that more data could mitigate. With the app, other people can easily make predictions by setting the input values and pressing the prediction button.

8. Future Work

  • Find a better way to replace missing values.
  • Find the best way of dealing with the outliers.
  • Search for models that could perform better with this small dataset.
  • Try a dimensionality reduction algorithm to improve the model's predictive capability.
  • Improve the Streamlit app by adding more functions.