
This repository accompanies the Bione et al. (2024) scientific paper published in Marine and Petroleum Geology. It was created so that the study can be reproduced, reused, and modified.


Estimating total organic carbon of potential source rocks in the Espírito Santo Basin, SE Brazil, using XGBoost

Author: frbione


Reference: BIONE F.R.A. et al. Estimating total organic carbon of potential source rocks in the Espírito Santo Basin, SE Brazil, using XGBoost. Marine and Petroleum Geology, v. 162, 106765, 2024. https://doi.org/10.1016/j.marpetgeo.2024.106765


⚙️ Configuring the environment

Creating the venv and installing the dependencies

In terminal, run:

python -m venv venv
venv\Scripts\activate.bat
pip install -r requirements.txt


📖 Create the train-test dataframe

To create the train-test dataframe from the originally compiled full dataframe:

  • Open the Data_preparation notebook and run all four steps to generate the train-test dataframe
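The exact split logic lives in the Data_preparation notebook; as a rough sketch of the idea (the feature names, the 80/20 ratio, and the plain-list representation below are illustrative assumptions, not the repository's actual pipeline), a reproducible train-test split might look like:

```python
import random

def train_test_split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle rows deterministically, then split into train and test lists."""
    rng = random.Random(seed)
    shuffled = rows[:]  # copy so the input list is left untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Dummy well-log rows (the curve names "GR" and "TOC" are hypothetical)
rows = [{"GR": float(i), "TOC": i * 0.1} for i in range(100)]
train, test = train_test_split_rows(rows)
```

Fixing the seed makes the split reproducible across runs, which is what allows the paper's train-test dataframe to be regenerated consistently.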

📈 Reproducing and visualizing the results

To reproduce the results with the already-tuned model parameters and visualize them:
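The general pattern here is to load the stored hyperparameters and pass them to the model; as a hedged sketch (the JSON format, the parameter values, and the idea of a separate parameter file are assumptions, not the repository's actual storage scheme), it might look like:

```python
import json

# Hypothetical tuned-parameter payload; the repository's actual file
# name and format may differ.
tuned_params_json = """
{
    "n_estimators": 500,
    "max_depth": 6,
    "learning_rate": 0.05
}
"""

params = json.loads(tuned_params_json)
# The loaded dict would then be unpacked into the regressor, e.g.:
# model = xgboost.XGBRegressor(**params)
```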



▶️ Tune your own models

If you want to run your own models using this approach, bear in mind that you must provide a compatible dataframe. Some code adaptations will therefore likely be required, such as renaming features/targets, adjusting the data imputation parameters, or adding any other feature engineering steps you wish to include.
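To illustrate the kind of adaptation involved (the column names "GR_raw", "GR", and "RHOB" and the median-imputation choice below are hypothetical examples, not the paper's pipeline), renaming features and filling missing values on a plain list of records might look like:

```python
import statistics

def rename_and_impute(rows, rename_map, impute_cols):
    """Rename feature keys, then fill missing values with the column median."""
    renamed = [{rename_map.get(k, k): v for k, v in row.items()} for row in rows]
    for col in impute_cols:
        observed = [r[col] for r in renamed if r.get(col) is not None]
        median = statistics.median(observed)
        for r in renamed:
            if r.get(col) is None:
                r[col] = median
    return renamed

# Hypothetical log curves: rename "GR_raw" -> "GR", impute a missing "RHOB"
rows = [
    {"GR_raw": 40.0, "RHOB": 2.4},
    {"GR_raw": 55.0, "RHOB": None},
    {"GR_raw": 70.0, "RHOB": 2.6},
]
clean = rename_and_impute(rows, {"GR_raw": "GR"}, ["RHOB"])
```

Whatever the real curve names are in your dataframe, the point is the same: align your column names and missing-value handling with what the training code expects before passing it on.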

Installing PySpark (Windows)

Optional: only needed if you want to run parameter tuning for your own models.

  • Download JDK from this link, and install it;

  • Download Spark from this link, then extract the tar file to a directory (e.g., C:\spark);

  • Download Hadoop from this link, then add the winutils file to a directory (e.g., C:\hadoop\bin);

  • Configure the Environment Variables by adding the following:

    • JAVA_HOME - (e.g., C:\java\jdk)
    • HADOOP_HOME - (e.g., C:\hadoop)
    • SPARK_HOME - (e.g., C:\spark\spark-3.3.2-bin-hadoop2)
    • PYSPARK_HOME - (e.g., ..\venv\lib\site-packages\pyspark)
  • Finally, add the following to Path:

    • %JAVA_HOME%\bin
    • %HADOOP_HOME%\bin
    • %SPARK_HOME%\bin
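Before launching the tuning script, it can help to sanity-check that the variables above are actually visible from Python. This small check is just a convenience, not part of the repository (PYSPARK_HOME is omitted because only the three variables below are typically required for Spark to start):

```python
import os

REQUIRED_VARS = ["JAVA_HOME", "HADOOP_HOME", "SPARK_HOME"]

def missing_spark_vars(environ=os.environ):
    """Return the names of required Spark-related variables that are unset."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]

# Example with a fake environment mapping (paths are illustrative)
fake_env = {"JAVA_HOME": r"C:\java\jdk", "SPARK_HOME": r"C:\spark"}
print(missing_spark_vars(fake_env))  # → ['HADOOP_HOME']
```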

After installing PySpark, you can run the model_run.py script, passing in your own dataframe.