/GDELT-ML-project

How low code machine learning library can speed up machine learning project

Primary LanguageJupyter Notebook

GDelt Project

App Screenshot

How low code can speed up your Machine Learning project

We use GDELT dataset with Pycaret to demonstrate how low code machine learning library to quickly prototype machine learning models.

Introduction

The GDELT Project is an initiative to construct a catalog of human societal-scale behavior and beliefs across all countries of the world, connecting every person, organization, location, count, theme, news source, and event across the planet into a single massive network that captures what's happening around the world, what its context is and who's involved, and how the world is feeling about it, every single day.

Data

The data is available in the following link: https://www.gdeltproject.org/data.html#rawdatafiles. And the data dictionary is available in the following link: http://data.gdeltproject.org/documentation/GDELT-Event_Codebook-V2.0.pdf.

Exploratory Data Analysis

Automated EDA

from typing import List  
import pandas as pd

from utils import transform_data, int_to_datetime, get_null_values

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Load and Transform data

data = pd.read_csv("230722.csv")

column_names = [
    'GlobalEventID',
            'Day',
            'MonthYear',
            'Year',
            'FractionDate',
            'Actor1Code',
            'Actor1Name',
            'Actor1CountryCode',
            'Actor1KnownGroupCode',
            'Actor1EthnicCode',
            'Actor1Religion1Code',
            'Actor1Religion2Code',
            'Actor1Type1Code',
            'Actor1Type2Code',
            'Actor1Type3Code',
            'Actor2Code',
            'Actor2Name',
            'Actor2CountryCode',
            'Actor2KnownGroupCode',
            'Actor2EthnicCode',
            'Actor2Religion1Code',
            'Actor2Religion2Code',
            'Actor2Type1Code',
            'Actor2Type2Code',
            'Actor2Type3Code',
            'IsRootEvent',
            'EventCode',
            'EventBaseCode',
            'EventRootCode',
            'QuadClass',
            'GoldsteinScale',
            'NumMentions',
            'NumSources',
            'NumArticles',
            'AvgTone',
            'Actor1Geo_Type',
            'Actor1Geo_FullName',
            'Actor1Geo_CountryCode',
            'Actor1Geo_ADM1Code',
            'Actor1Geo_Lat',
            'Actor1Geo_Long',
            'Actor1Geo_FeatureID',
            'Actor2Geo_Type',
            'Actor2Geo_FullName',
            'Actor2Geo_CountryCode',
            'Actor2Geo_ADM1Code',
            'Actor2Geo_Lat',
            'Actor2Geo_Long',
            'Actor2Geo_FeatureID',
            'DateAdded',
            'SourceURL'
        ]


data = transform_data(data, column_names)

data = int_to_datetime(data, "day")
data = int_to_datetime(data, "dateadded")

# Let use small segment of the dataset
data_sm = data[data['year']==2022]

data_lm = data[data['year']!=2022]

# columns = data_sm.columns.tolist()
# columns.remove('quadclass')
# columns.append('quadclass')

# data_sm = data_sm[columns]

data_lm.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
globaleventid day monthyear year fractiondate actor1code actor1name actor1countrycode actor1knowngroupcode actor1ethniccode actor1religion1code actor1religion2code actor1type1code actor1type2code actor1type3code actor2code actor2name actor2countrycode actor2knowngroupcode actor2ethniccode actor2religion1code actor2religion2code actor2type1code actor2type2code actor2type3code isrootevent eventcode eventbasecode eventrootcode quadclass goldsteinscale nummentions numsources numarticles avgtone actor1geo_type actor1geo_fullname actor1geo_countrycode actor1geo_adm1code actor1geo_lat actor1geo_long actor1geo_featureid actor2geo_type actor2geo_fullname actor2geo_countrycode actor2geo_adm1code actor2geo_lat actor2geo_long actor2geo_featureid dateadded sourceurl
13 1116435795 2023-06-22 202306 2023 2023.4712 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN BUS COMPANY NaN NaN NaN NaN NaN BUS NaN NaN 0 40 40 4 Verbal Cooperation 1.0 5 1 5 1.118963 0 NaN NaN NaN NaN NaN NaN 3 Fort Smith, Arkansas, United States US USAR 35.3859 -94.3985 76952 2023-07-22 https://www.kuaf.com/show/ozarks-at-large/2023...
14 1116435796 2023-06-22 202306 2023 2023.4712 AFR AFRICA AFR NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1 43 43 4 Verbal Cooperation 2.8 4 1 4 0.790514 1 Russia RS RS 60.0000 100.0000 RS 1 Russia RS RS 60.0000 100.0000 RS 2023-07-22 https://www.beijingbulletin.com/news/273906652...
15 1116435797 2023-06-22 202306 2023 2023.4712 AFR AFRICA AFR NaN NaN NaN NaN NaN NaN NaN RUS RUSSIAN RUS NaN NaN NaN NaN NaN NaN NaN 1 43 43 4 Verbal Cooperation 2.8 10 3 10 -0.491013 4 Pretoria, Gauteng, South Africa SF SF06 -25.7069 28.2294 -1273769 4 Pretoria, Gauteng, South Africa SF SF06 -25.7069 28.2294 -1273769 2023-07-22 https://www.beijingbulletin.com/news/273906652...
16 1116435798 2023-06-22 202306 2023 2023.4712 AFR AFRICA AFR NaN NaN NaN NaN NaN NaN NaN RUS RUSSIAN RUS NaN NaN NaN NaN NaN NaN NaN 0 43 43 4 Verbal Cooperation 2.8 2 1 2 0.790514 1 Russia RS RS 60.0000 100.0000 RS 1 Russia RS RS 60.0000 100.0000 RS 2023-07-22 https://www.beijingbulletin.com/news/273906652...
17 1116435799 2023-06-22 202306 2023 2023.4712 AFR AFRICA AFR NaN NaN NaN NaN NaN NaN NaN RUSGOV RUSSIAN RUS NaN NaN NaN NaN GOV NaN NaN 1 43 43 4 Verbal Cooperation 2.8 18 3 18 -0.491013 4 Pretoria, Gauteng, South Africa SF SF06 -25.7069 28.2294 -1273769 4 Pretoria, Gauteng, South Africa SF SF06 -25.7069 28.2294 -1273769 2023-07-22 https://www.beijingbulletin.com/news/273906652...

Identify Total number of NULL of each feature

get_null_values(data_lm)
actor2type3code          87517
actor1type3code          87517
actor2religion2code      87351
actor1religion2code      87289
actor2ethniccode         87168
actor1ethniccode         87011
actor2knowngroupcode     86724
actor1knowngroupcode     86627
actor2religion1code      86621
actor1religion1code      86523
actor2type2code          86025
actor1type2code          85240
actor2type1code          58516
actor1type1code          50638
actor2countrycode        48626
actor1countrycode        37956
actor2code               25351
actor2name               25351
actor1geo_fullname       11051
actor1geo_long           11051
actor1geo_lat            11051
actor1geo_featureid      11044
actor1geo_adm1code       11044
actor1geo_countrycode    11044
actor1name                8630
actor1code                8630
actor2geo_lat             2723
actor2geo_fullname        2723
actor2geo_long            2723
actor2geo_adm1code        2718
actor2geo_countrycode     2718
actor2geo_featureid       2718
actor2geo_type               0
dateadded                    0
globaleventid                0
isrootevent                  0
actor1geo_type               0
avgtone                      0
numarticles                  0
numsources                   0
nummentions                  0
goldsteinscale               0
quadclass                    0
eventrootcode                0
eventbasecode                0
eventcode                    0
day                          0
fractiondate                 0
year                         0
monthyear                    0
sourceurl                    0
dtype: int64
new_data = data_lm[
[      'fractiondate',
       'year',
       'monthyear',
       'actor1code', 
        'actor1name', 
        'actor1geo_type',
        'actor1geo_long',
        'actor1geo_lat',
        'actor2code', 
        'actor2name', 
        'actor2geo_type',
        'actor2geo_long',
        'actor2geo_lat',
        'isrootevent',
        'eventcode',
        'eventrootcode',
        'goldsteinscale',
        'nummentions',
        'numsources',
        'avgtone',
        'quadclass'
  ]
]
new_data.head() 
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
fractiondate year monthyear actor1code actor1name actor1geo_type actor1geo_long actor1geo_lat actor2code actor2name actor2geo_type actor2geo_long actor2geo_lat isrootevent eventcode eventrootcode goldsteinscale nummentions numsources avgtone quadclass
13 2023.4712 2023 202306 NaN NaN 0 NaN NaN BUS COMPANY 3 -94.3985 35.3859 0 40 4 1.0 5 1 1.118963 Verbal Cooperation
14 2023.4712 2023 202306 AFR AFRICA 1 100.0000 60.0000 NaN NaN 1 100.0000 60.0000 1 43 4 2.8 4 1 0.790514 Verbal Cooperation
15 2023.4712 2023 202306 AFR AFRICA 4 28.2294 -25.7069 RUS RUSSIAN 4 28.2294 -25.7069 1 43 4 2.8 10 3 -0.491013 Verbal Cooperation
16 2023.4712 2023 202306 AFR AFRICA 1 100.0000 60.0000 RUS RUSSIAN 1 100.0000 60.0000 0 43 4 2.8 2 1 0.790514 Verbal Cooperation
17 2023.4712 2023 202306 AFR AFRICA 4 28.2294 -25.7069 RUSGOV RUSSIAN 4 28.2294 -25.7069 1 43 4 2.8 18 3 -0.491013 Verbal Cooperation

Initialise Startup

from pycaret.classification import ClassificationExperiment

exp = ClassificationExperiment()

exp.setup(data=new_data,
          target='quadclass', 
           session_id=123
         )
<style type="text/css"> #T_20d8e_row12_col1 { background-color: lightgreen; } </style>
  Description Value
0 Session id 123
1 Target quadclass
2 Target type Multiclass
3 Target mapping Material Conflict: 0, Material Cooperation: 1, Verbal Conflict: 2, Verbal Cooperation: 3
4 Original data shape (87552, 21)
5 Transformed data shape (87552, 31)
6 Transformed train set shape (61286, 31)
7 Transformed test set shape (26266, 31)
8 Ordinal features 1
9 Numeric features 13
10 Categorical features 7
11 Rows with missing values 40.5%
12 Preprocess True
13 Imputation type simple
14 Numeric imputation mean
15 Categorical imputation mode
16 Maximum one-hot encoding 25
17 Encoding method None
18 Fold Generator StratifiedKFold
19 Fold Number 10
20 CPU Jobs -1
21 Use GPU False
22 Log Experiment False
23 Experiment Name clf-default-name
24 USI b912
<pycaret.classification.oop.ClassificationExperiment at 0x7fffbbcbba60>

Automated Exploratory Data Analysis

exp.eda()
Imported v0.1.58. After importing, execute '%matplotlib inline' to display charts in Jupyter.
    AV = AutoViz_Class()
    dfte = AV.AutoViz(filename, sep=',', depVar='', dfte=None, header=0, verbose=1, lowess=False,
               chart_format='svg',max_rows_analyzed=150000,max_cols_analyzed=30, save_plot_dir=None)
Update: verbose=0 displays charts in your local Jupyter notebook.
        verbose=1 additionally provides EDA data cleaning suggestions. It also displays charts.
        verbose=2 does not display charts but saves them in AutoViz_Plots folder in local machine.
        chart_format='bokeh' displays charts in your local Jupyter notebook.
        chart_format='server' displays charts in your browser: one tab for each chart type
        chart_format='html' silently saves interactive HTML files in your local machine
Shape of your Data Set loaded: (87552, 31)
#######################################################################################
######################## C L A S S I F Y I N G  V A R I A B L E S  ####################
#######################################################################################
Classifying variables in data set...
Data cleaning improvement suggestions. Complete them before proceeding to ML modeling.
<style type="text/css"> #T_3faa3_row0_col0, #T_3faa3_row0_col4 { background-color: #67000d; color: #f1f1f1; font-family: Segoe UI; } #T_3faa3_row0_col1, #T_3faa3_row0_col6, #T_3faa3_row1_col1, #T_3faa3_row1_col6, #T_3faa3_row2_col1, #T_3faa3_row2_col6, #T_3faa3_row3_col1, #T_3faa3_row3_col6, #T_3faa3_row4_col1, #T_3faa3_row4_col6, #T_3faa3_row5_col1, #T_3faa3_row5_col6, #T_3faa3_row6_col1, #T_3faa3_row6_col6, #T_3faa3_row7_col1, #T_3faa3_row7_col6, #T_3faa3_row8_col1, #T_3faa3_row8_col6, #T_3faa3_row9_col1, #T_3faa3_row9_col6, #T_3faa3_row10_col1, #T_3faa3_row10_col6, #T_3faa3_row11_col1, #T_3faa3_row11_col6, #T_3faa3_row12_col1, #T_3faa3_row12_col6, #T_3faa3_row13_col1, #T_3faa3_row13_col6, #T_3faa3_row14_col1, #T_3faa3_row14_col6, #T_3faa3_row15_col1, #T_3faa3_row15_col6, #T_3faa3_row16_col1, #T_3faa3_row16_col6, #T_3faa3_row17_col1, #T_3faa3_row17_col6, #T_3faa3_row18_col1, #T_3faa3_row18_col6, #T_3faa3_row19_col1, #T_3faa3_row19_col6, #T_3faa3_row20_col1, #T_3faa3_row20_col6, #T_3faa3_row21_col1, #T_3faa3_row21_col6, #T_3faa3_row22_col1, #T_3faa3_row22_col6, #T_3faa3_row23_col1, #T_3faa3_row23_col6, #T_3faa3_row24_col1, #T_3faa3_row24_col6, #T_3faa3_row25_col1, #T_3faa3_row25_col6, #T_3faa3_row26_col1, #T_3faa3_row26_col6, #T_3faa3_row27_col1, #T_3faa3_row27_col6, #T_3faa3_row28_col1, #T_3faa3_row28_col6, #T_3faa3_row29_col1, #T_3faa3_row29_col6 { font-family: Segoe UI; } #T_3faa3_row0_col2, #T_3faa3_row0_col3, #T_3faa3_row0_col5, #T_3faa3_row1_col2, #T_3faa3_row1_col3, #T_3faa3_row1_col5, #T_3faa3_row2_col2, #T_3faa3_row2_col3, #T_3faa3_row2_col5, #T_3faa3_row3_col2, #T_3faa3_row3_col3, #T_3faa3_row3_col5, #T_3faa3_row4_col2, #T_3faa3_row4_col3, #T_3faa3_row4_col5, #T_3faa3_row5_col2, #T_3faa3_row5_col3, #T_3faa3_row5_col5, #T_3faa3_row6_col2, #T_3faa3_row6_col3, #T_3faa3_row6_col5, #T_3faa3_row7_col2, #T_3faa3_row7_col3, #T_3faa3_row7_col5, #T_3faa3_row8_col2, #T_3faa3_row8_col3, #T_3faa3_row8_col5, #T_3faa3_row9_col2, #T_3faa3_row9_col3, #T_3faa3_row9_col5, #T_3faa3_row10_col2, #T_3faa3_row10_col3, #T_3faa3_row10_col5, #T_3faa3_row11_col2, #T_3faa3_row11_col3, #T_3faa3_row11_col5, #T_3faa3_row12_col0, #T_3faa3_row12_col2, #T_3faa3_row12_col3, #T_3faa3_row12_col4, #T_3faa3_row12_col5, #T_3faa3_row13_col0, #T_3faa3_row13_col2, #T_3faa3_row13_col3, #T_3faa3_row13_col4, #T_3faa3_row13_col5, #T_3faa3_row14_col0, #T_3faa3_row14_col2, #T_3faa3_row14_col3, #T_3faa3_row14_col4, #T_3faa3_row14_col5, #T_3faa3_row15_col0, #T_3faa3_row15_col2, #T_3faa3_row15_col3, #T_3faa3_row15_col4, #T_3faa3_row15_col5, #T_3faa3_row16_col0, #T_3faa3_row16_col2, #T_3faa3_row16_col3, #T_3faa3_row16_col4, #T_3faa3_row16_col5, #T_3faa3_row17_col0, #T_3faa3_row17_col2, #T_3faa3_row17_col3, #T_3faa3_row17_col4, #T_3faa3_row17_col5, #T_3faa3_row18_col0, #T_3faa3_row18_col2, #T_3faa3_row18_col3, #T_3faa3_row18_col4, #T_3faa3_row18_col5, #T_3faa3_row19_col0, #T_3faa3_row19_col2, #T_3faa3_row19_col3, #T_3faa3_row19_col4, #T_3faa3_row19_col5, #T_3faa3_row20_col0, #T_3faa3_row20_col2, #T_3faa3_row20_col3, #T_3faa3_row20_col4, #T_3faa3_row20_col5, #T_3faa3_row21_col0, #T_3faa3_row21_col2, #T_3faa3_row21_col3, #T_3faa3_row21_col4, #T_3faa3_row21_col5, #T_3faa3_row22_col0, #T_3faa3_row22_col2, #T_3faa3_row22_col3, #T_3faa3_row22_col4, #T_3faa3_row22_col5, #T_3faa3_row23_col0, #T_3faa3_row23_col2, #T_3faa3_row23_col3, #T_3faa3_row23_col4, #T_3faa3_row23_col5, #T_3faa3_row24_col0, #T_3faa3_row24_col2, #T_3faa3_row24_col3, #T_3faa3_row24_col4, #T_3faa3_row24_col5, #T_3faa3_row25_col0, #T_3faa3_row25_col2, #T_3faa3_row25_col3, #T_3faa3_row25_col4, #T_3faa3_row25_col5, #T_3faa3_row26_col0, #T_3faa3_row26_col2, #T_3faa3_row26_col3, #T_3faa3_row26_col4, #T_3faa3_row26_col5, #T_3faa3_row27_col0, #T_3faa3_row27_col2, #T_3faa3_row27_col3, #T_3faa3_row27_col4, #T_3faa3_row27_col5, #T_3faa3_row28_col0, #T_3faa3_row28_col2, #T_3faa3_row28_col3, #T_3faa3_row28_col4, #T_3faa3_row28_col5, #T_3faa3_row29_col0, #T_3faa3_row29_col2, #T_3faa3_row29_col3, #T_3faa3_row29_col4, #T_3faa3_row29_col5 { background-color: #fff5f0; color: #000000; font-family: Segoe UI; } #T_3faa3_row1_col0, #T_3faa3_row1_col4 { background-color: #fcb79c; color: #000000; font-family: Segoe UI; } #T_3faa3_row2_col0, #T_3faa3_row2_col4 { background-color: #fcb99f; color: #000000; font-family: Segoe UI; } #T_3faa3_row3_col0, #T_3faa3_row3_col4 { background-color: #fcbba1; color: #000000; font-family: Segoe UI; } #T_3faa3_row4_col0, #T_3faa3_row4_col4 { background-color: #fcbda4; color: #000000; font-family: Segoe UI; } #T_3faa3_row5_col0, #T_3faa3_row5_col4 { background-color: #feeae1; color: #000000; font-family: Segoe UI; } #T_3faa3_row6_col0, #T_3faa3_row6_col4 { background-color: #ffece3; color: #000000; font-family: Segoe UI; } #T_3faa3_row7_col0, #T_3faa3_row7_col4 { background-color: #ffeee7; color: #000000; font-family: Segoe UI; } #T_3faa3_row8_col0, #T_3faa3_row8_col4 { background-color: #ffefe8; color: #000000; font-family: Segoe UI; } #T_3faa3_row9_col0, #T_3faa3_row9_col4 { background-color: #fff1ea; color: #000000; font-family: Segoe UI; } #T_3faa3_row10_col0, #T_3faa3_row10_col4, #T_3faa3_row11_col0, #T_3faa3_row11_col4 { background-color: #fff4ee; color: #000000; font-family: Segoe UI; } </style>
  Nuniques dtype Nulls Nullpercent NuniquePercent Value counts Min Data cleaning improvement suggestions
avgtone 19394 float64 0 0.000000 22.151407 0
actor2geo_long 5121 float64 0 0.000000 5.849095 0
actor1geo_long 4968 float64 0 0.000000 5.674342 0
actor2geo_lat 4904 float64 0 0.000000 5.601243 0 skewed: cap or drop outliers
actor1geo_lat 4759 float64 0 0.000000 5.435627 0 skewed: cap or drop outliers
actor1name 1242 float64 0 0.000000 1.418586 0 skewed: cap or drop outliers
actor2name 1077 float64 0 0.000000 1.230126 0 skewed: cap or drop outliers
actor1code 804 float64 0 0.000000 0.918311 0 skewed: cap or drop outliers
actor2code 715 float64 0 0.000000 0.816658 0 skewed: cap or drop outliers
nummentions 507 float64 0 0.000000 0.579084 0 highly skewed: drop outliers or do box-cox transform
eventcode 194 float64 0 0.000000 0.221583 0 highly skewed: drop outliers or do box-cox transform
numsources 172 float64 0 0.000000 0.196455 0 highly skewed: drop outliers or do box-cox transform
goldsteinscale 42 float64 0 0.000000 0.047971 0
eventrootcode 20 float64 0 0.000000 0.022844 0
fractiondate 5 float64 0 0.000000 0.005711 0 highly skewed: drop outliers or do box-cox transform
monthyear 3 float64 0 0.000000 0.003427 0 highly skewed: drop outliers or do box-cox transform
year 2 float64 0 0.000000 0.002284 0 highly skewed: drop outliers or do box-cox transform
actor2geo_type_4.0 2 float64 0 0.000000 0.002284 0
actor2geo_type_3.0 2 float64 0 0.000000 0.002284 0 skewed: cap or drop outliers
actor2geo_type_2.0 2 float64 0 0.000000 0.002284 0 skewed: cap or drop outliers
actor2geo_type_0.0 2 float64 0 0.000000 0.002284 0 highly skewed: drop outliers or do box-cox transform
actor2geo_type_5.0 2 float64 0 0.000000 0.002284 0 highly skewed: drop outliers or do box-cox transform
actor1geo_type_5.0 2 float64 0 0.000000 0.002284 0 highly skewed: drop outliers or do box-cox transform
actor1geo_type_2.0 2 float64 0 0.000000 0.002284 0 skewed: cap or drop outliers
isrootevent 2 float64 0 0.000000 0.002284 0
actor1geo_type_4.0 2 float64 0 0.000000 0.002284 0
actor1geo_type_0.0 2 float64 0 0.000000 0.002284 0 skewed: cap or drop outliers
actor1geo_type_3.0 2 float64 0 0.000000 0.002284 0 skewed: cap or drop outliers
actor1geo_type_1.0 2 float64 0 0.000000 0.002284 0 skewed: cap or drop outliers
actor2geo_type_1.0 2 float64 0 0.000000 0.002284 0 skewed: cap or drop outliers
    30 Predictors classified...
        No variables removed since no ID or low-information variables found in data set

################ Multi_Classification problem #####################
Number of variables = 30 exceeds limit, finding top 30 variables through XGBoost
    No categorical feature reduction done. All 14 Categorical vars selected 
    Removing correlated variables from 16 numerics using SULO method

After removing highly correlated variables, following 11 numeric vars selected: ['actor1code', 'actor1name', 'actor2code', 'actor2name', 'eventcode', 'avgtone', 'eventrootcode', 'actor2geo_long', 'actor2geo_lat', 'fractiondate', 'numsources']
    Adding 14 categorical variables to reduced numeric variables  of 11
############## F E A T U R E   S E L E C T I O N  ####################
Current number of predictors = 25 
    Finding Important Features using Boosted Trees algorithm...
        using 25 variables...
        using 20 variables...
        using 15 variables...
        using 10 variables...
        using 5 variables...
Found 22 important features
#######################################################################################
######################## C L A S S I F Y I N G  V A R I A B L E S  ####################
#######################################################################################
Classifying variables in data set...
Data cleaning improvement suggestions. Complete them before proceeding to ML modeling.
<style type="text/css"> #T_3c68c_row0_col0, #T_3c68c_row0_col4 { background-color: #67000d; color: #f1f1f1; font-family: Segoe UI; } #T_3c68c_row0_col1, #T_3c68c_row0_col6, #T_3c68c_row1_col1, #T_3c68c_row1_col6, #T_3c68c_row2_col1, #T_3c68c_row2_col6, #T_3c68c_row3_col1, #T_3c68c_row3_col6, #T_3c68c_row4_col1, #T_3c68c_row4_col6, #T_3c68c_row5_col1, #T_3c68c_row5_col6, #T_3c68c_row6_col1, #T_3c68c_row6_col6, #T_3c68c_row7_col1, #T_3c68c_row7_col6, #T_3c68c_row8_col1, #T_3c68c_row8_col6, #T_3c68c_row9_col1, #T_3c68c_row9_col6, #T_3c68c_row10_col1, #T_3c68c_row10_col6, #T_3c68c_row11_col1, #T_3c68c_row11_col6, #T_3c68c_row12_col1, #T_3c68c_row12_col6, #T_3c68c_row13_col1, #T_3c68c_row13_col6, #T_3c68c_row14_col1, #T_3c68c_row14_col6, #T_3c68c_row15_col1, #T_3c68c_row15_col6, #T_3c68c_row16_col1, #T_3c68c_row16_col6, #T_3c68c_row17_col1, #T_3c68c_row17_col6, #T_3c68c_row18_col1, #T_3c68c_row18_col6, #T_3c68c_row19_col1, #T_3c68c_row19_col6, #T_3c68c_row20_col1, #T_3c68c_row20_col6, #T_3c68c_row21_col1, #T_3c68c_row21_col6 { font-family: Segoe UI; } #T_3c68c_row0_col2, #T_3c68c_row0_col3, #T_3c68c_row0_col5, #T_3c68c_row1_col2, #T_3c68c_row1_col3, #T_3c68c_row1_col5, #T_3c68c_row2_col2, #T_3c68c_row2_col3, #T_3c68c_row2_col5, #T_3c68c_row3_col2, #T_3c68c_row3_col3, #T_3c68c_row3_col5, #T_3c68c_row4_col2, #T_3c68c_row4_col3, #T_3c68c_row4_col5, #T_3c68c_row5_col2, #T_3c68c_row5_col3, #T_3c68c_row5_col5, #T_3c68c_row6_col2, #T_3c68c_row6_col3, #T_3c68c_row6_col5, #T_3c68c_row7_col2, #T_3c68c_row7_col3, #T_3c68c_row7_col5, #T_3c68c_row8_col0, #T_3c68c_row8_col2, #T_3c68c_row8_col3, #T_3c68c_row8_col4, #T_3c68c_row8_col5, #T_3c68c_row9_col0, #T_3c68c_row9_col2, #T_3c68c_row9_col3, #T_3c68c_row9_col4, #T_3c68c_row9_col5, #T_3c68c_row10_col0, #T_3c68c_row10_col2, #T_3c68c_row10_col3, #T_3c68c_row10_col4, #T_3c68c_row10_col5, #T_3c68c_row11_col0, #T_3c68c_row11_col2, #T_3c68c_row11_col3, #T_3c68c_row11_col4, #T_3c68c_row11_col5, #T_3c68c_row12_col0, #T_3c68c_row12_col2, #T_3c68c_row12_col3, #T_3c68c_row12_col4, #T_3c68c_row12_col5, #T_3c68c_row13_col0, #T_3c68c_row13_col2, #T_3c68c_row13_col3, #T_3c68c_row13_col4, #T_3c68c_row13_col5, #T_3c68c_row14_col0, #T_3c68c_row14_col2, #T_3c68c_row14_col3, #T_3c68c_row14_col4, #T_3c68c_row14_col5, #T_3c68c_row15_col0, #T_3c68c_row15_col2, #T_3c68c_row15_col3, #T_3c68c_row15_col4, #T_3c68c_row15_col5, #T_3c68c_row16_col0, #T_3c68c_row16_col2, #T_3c68c_row16_col3, #T_3c68c_row16_col4, #T_3c68c_row16_col5, #T_3c68c_row17_col0, #T_3c68c_row17_col2, #T_3c68c_row17_col3, #T_3c68c_row17_col4, #T_3c68c_row17_col5, #T_3c68c_row18_col0, #T_3c68c_row18_col2, #T_3c68c_row18_col3, #T_3c68c_row18_col4, #T_3c68c_row18_col5, #T_3c68c_row19_col0, #T_3c68c_row19_col2, #T_3c68c_row19_col3, #T_3c68c_row19_col4, #T_3c68c_row19_col5, #T_3c68c_row20_col0, #T_3c68c_row20_col2, #T_3c68c_row20_col3, #T_3c68c_row20_col4, #T_3c68c_row20_col5, #T_3c68c_row21_col0, #T_3c68c_row21_col2, #T_3c68c_row21_col3, #T_3c68c_row21_col4, #T_3c68c_row21_col5 { background-color: #fff5f0; color: #000000; font-family: Segoe UI; } #T_3c68c_row1_col0, #T_3c68c_row1_col4 { background-color: #fcb79c; color: #000000; font-family: Segoe UI; } #T_3c68c_row2_col0, #T_3c68c_row2_col4 { background-color: #fcbba1; color: #000000; font-family: Segoe UI; } #T_3c68c_row3_col0, #T_3c68c_row3_col4 { background-color: #feeae1; color: #000000; font-family: Segoe UI; } #T_3c68c_row4_col0, #T_3c68c_row4_col4 { background-color: #ffece3; color: #000000; font-family: Segoe UI; } #T_3c68c_row5_col0, #T_3c68c_row5_col4 { background-color: #ffeee7; color: #000000; font-family: Segoe UI; } #T_3c68c_row6_col0, #T_3c68c_row6_col4 { background-color: #ffefe8; color: #000000; font-family: Segoe UI; } #T_3c68c_row7_col0, #T_3c68c_row7_col4 { background-color: #fff4ee; color: #000000; font-family: Segoe UI; } </style>
  Nuniques dtype Nulls Nullpercent NuniquePercent Value counts Min Data cleaning improvement suggestions
avgtone 19394 float64 0 0.000000 22.151407 0
actor2geo_long 5121 float64 0 0.000000 5.849095 0
actor2geo_lat 4904 float64 0 0.000000 5.601243 0 skewed: cap or drop outliers
actor1name 1242 float64 0 0.000000 1.418586 0 skewed: cap or drop outliers
actor2name 1077 float64 0 0.000000 1.230126 0 skewed: cap or drop outliers
actor1code 804 float64 0 0.000000 0.918311 0 skewed: cap or drop outliers
actor2code 715 float64 0 0.000000 0.816658 0 skewed: cap or drop outliers
eventcode 194 float64 0 0.000000 0.221583 0 highly skewed: drop outliers or do box-cox transform
eventrootcode 20 float64 0 0.000000 0.022844 0
actor2geo_type_3.0 2 float64 0 0.000000 0.002284 0 skewed: cap or drop outliers
actor1geo_type_2.0 2 float64 0 0.000000 0.002284 0 skewed: cap or drop outliers
actor2geo_type_2.0 2 float64 0 0.000000 0.002284 0 skewed: cap or drop outliers
actor2geo_type_1.0 2 float64 0 0.000000 0.002284 0 skewed: cap or drop outliers
actor1geo_type_4.0 2 float64 0 0.000000 0.002284 0
actor2geo_type_4.0 2 float64 0 0.000000 0.002284 0
actor2geo_type_0.0 2 float64 0 0.000000 0.002284 0 highly skewed: drop outliers or do box-cox transform
actor1geo_type_3.0 2 float64 0 0.000000 0.002284 0 skewed: cap or drop outliers
isrootevent 2 float64 0 0.000000 0.002284 0
actor2geo_type_5.0 2 float64 0 0.000000 0.002284 0 highly skewed: drop outliers or do box-cox transform
actor1geo_type_1.0 2 float64 0 0.000000 0.002284 0 skewed: cap or drop outliers
actor1geo_type_0.0 2 float64 0 0.000000 0.002284 0 skewed: cap or drop outliers
actor1geo_type_5.0 2 float64 0 0.000000 0.002284 0 highly skewed: drop outliers or do box-cox transform

Machine Learning Applications

Multi-Classification

Use classification models to predict the type of event that is going to happen.

1. Load data and transform

data = pd.read_csv("230722.csv")
column_names = [
  'GlobalEventID',
  'Day',
  'MonthYear',
  'Year',
  'FractionDate',
  'Actor1Code',
  'Actor1Name',
  'Actor1CountryCode',
  'Actor1KnownGroupCode',
  'Actor1EthnicCode',
  'Actor1Religion1Code',
  'Actor1Religion2Code',
  'Actor1Type1Code',
  'Actor1Type2Code',
  'Actor1Type3Code',
  'Actor2Code',
  'Actor2Name',
  'Actor2CountryCode',
  'Actor2KnownGroupCode',
  'Actor2EthnicCode',
  'Actor2Religion1Code',
  'Actor2Religion2Code',
  'Actor2Type1Code',
  'Actor2Type2Code',
  'Actor2Type3Code',
  'IsRootEvent',
  'EventCode',
  'EventBaseCode',
  'EventRootCode',
  'QuadClass',
  'GoldsteinScale',
  'NumMentions',
  'NumSources',
  'NumArticles',
  'AvgTone',
  'Actor1Geo_Type',
  'Actor1Geo_FullName',
  'Actor1Geo_CountryCode',
  'Actor1Geo_ADM1Code',
  'Actor1Geo_Lat',
  'Actor1Geo_Long',
  'Actor1Geo_FeatureID',
  'Actor2Geo_Type',
  'Actor2Geo_FullName',
  'Actor2Geo_CountryCode',
  'Actor2Geo_ADM1Code',
  'Actor2Geo_Lat',
  'Actor2Geo_Long',
  'Actor2Geo_FeatureID',
  'DateAdded',
  'SourceURL'
      ]

data = transform_data(data, column_names)

data = int_to_datetime(data, "day")
data = int_to_datetime(data, "dateadded")

# Let use small segment of the dataset
data_sm = data[data['year']==2022]

data_lm = data[data['year']!=2022]

# columns = data_sm.columns.tolist()
# columns.remove('quadclass')
# columns.append('quadclass')

# data_sm = data_sm[columns]

data_lm.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
globaleventid day monthyear year fractiondate actor1code actor1name actor1countrycode actor1knowngroupcode actor1ethniccode actor1religion1code actor1religion2code actor1type1code actor1type2code actor1type3code actor2code actor2name actor2countrycode actor2knowngroupcode actor2ethniccode actor2religion1code actor2religion2code actor2type1code actor2type2code actor2type3code isrootevent eventcode eventbasecode eventrootcode quadclass goldsteinscale nummentions numsources numarticles avgtone actor1geo_type actor1geo_fullname actor1geo_countrycode actor1geo_adm1code actor1geo_lat actor1geo_long actor1geo_featureid actor2geo_type actor2geo_fullname actor2geo_countrycode actor2geo_adm1code actor2geo_lat actor2geo_long actor2geo_featureid dateadded sourceurl
13 1116435795 2023-06-22 202306 2023 2023.4712 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN BUS COMPANY NaN NaN NaN NaN NaN BUS NaN NaN 0 40 40 4 Verbal Cooperation 1.0 5 1 5 1.118963 0 NaN NaN NaN NaN NaN NaN 3 Fort Smith, Arkansas, United States US USAR 35.3859 -94.3985 76952 2023-07-22 https://www.kuaf.com/show/ozarks-at-large/2023...
14 1116435796 2023-06-22 202306 2023 2023.4712 AFR AFRICA AFR NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1 43 43 4 Verbal Cooperation 2.8 4 1 4 0.790514 1 Russia RS RS 60.0000 100.0000 RS 1 Russia RS RS 60.0000 100.0000 RS 2023-07-22 https://www.beijingbulletin.com/news/273906652...
15 1116435797 2023-06-22 202306 2023 2023.4712 AFR AFRICA AFR NaN NaN NaN NaN NaN NaN NaN RUS RUSSIAN RUS NaN NaN NaN NaN NaN NaN NaN 1 43 43 4 Verbal Cooperation 2.8 10 3 10 -0.491013 4 Pretoria, Gauteng, South Africa SF SF06 -25.7069 28.2294 -1273769 4 Pretoria, Gauteng, South Africa SF SF06 -25.7069 28.2294 -1273769 2023-07-22 https://www.beijingbulletin.com/news/273906652...
16 1116435798 2023-06-22 202306 2023 2023.4712 AFR AFRICA AFR NaN NaN NaN NaN NaN NaN NaN RUS RUSSIAN RUS NaN NaN NaN NaN NaN NaN NaN 0 43 43 4 Verbal Cooperation 2.8 2 1 2 0.790514 1 Russia RS RS 60.0000 100.0000 RS 1 Russia RS RS 60.0000 100.0000 RS 2023-07-22 https://www.beijingbulletin.com/news/273906652...
17 1116435799 2023-06-22 202306 2023 2023.4712 AFR AFRICA AFR NaN NaN NaN NaN NaN NaN NaN RUSGOV RUSSIAN RUS NaN NaN NaN NaN GOV NaN NaN 1 43 43 4 Verbal Cooperation 2.8 18 3 18 -0.491013 4 Pretoria, Gauteng, South Africa SF SF06 -25.7069 28.2294 -1273769 4 Pretoria, Gauteng, South Africa SF SF06 -25.7069 28.2294 -1273769 2023-07-22 https://www.beijingbulletin.com/news/273906652...

2. Identify Total number of NULL of each feature

get_null_values(data_lm)
actor2type3code          87517
actor1type3code          87517
actor2religion2code      87351
actor1religion2code      87289
actor2ethniccode         87168
actor1ethniccode         87011
actor2knowngroupcode     86724
actor1knowngroupcode     86627
actor2religion1code      86621
actor1religion1code      86523
actor2type2code          86025
actor1type2code          85240
actor2type1code          58516
actor1type1code          50638
actor2countrycode        48626
actor1countrycode        37956
actor2code               25351
actor2name               25351
actor1geo_fullname       11051
actor1geo_long           11051
actor1geo_lat            11051
actor1geo_featureid      11044
actor1geo_adm1code       11044
actor1geo_countrycode    11044
actor1name                8630
actor1code                8630
actor2geo_lat             2723
actor2geo_fullname        2723
actor2geo_long            2723
actor2geo_adm1code        2718
actor2geo_countrycode     2718
actor2geo_featureid       2718
actor2geo_type               0
dateadded                    0
globaleventid                0
isrootevent                  0
actor1geo_type               0
avgtone                      0
numarticles                  0
numsources                   0
nummentions                  0
goldsteinscale               0
quadclass                    0
eventrootcode                0
eventbasecode                0
eventcode                    0
day                          0
fractiondate                 0
year                         0
monthyear                    0
sourceurl                    0
dtype: int64

3. Selection of features that seem importanace

quadclass is the target and rest of the attributes are the feature. And I am trying to predict type of events.

new_data = data_lm[
[      'fractiondate',
       'year',
       'monthyear',
       'actor1code', 
        'actor1name', 
        'actor1geo_type',
        'actor1geo_long',
        'actor1geo_lat',
        'actor2code', 
        'actor2name', 
        'actor2geo_type',
        'actor2geo_long',
        'actor2geo_lat',
        'isrootevent',
        'eventcode',
        'eventrootcode',
        'goldsteinscale',
        'nummentions',
        'numsources',
        'avgtone',
        'quadclass'
  ]
]


new_data.head() 
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
fractiondate year monthyear actor1code actor1name actor1geo_type actor1geo_long actor1geo_lat actor2code actor2name actor2geo_type actor2geo_long actor2geo_lat isrootevent eventcode eventrootcode goldsteinscale nummentions numsources avgtone quadclass
13 2023.4712 2023 202306 NaN NaN 0 NaN NaN BUS COMPANY 3 -94.3985 35.3859 0 40 4 1.0 5 1 1.118963 Verbal Cooperation
14 2023.4712 2023 202306 AFR AFRICA 1 100.0000 60.0000 NaN NaN 1 100.0000 60.0000 1 43 4 2.8 4 1 0.790514 Verbal Cooperation
15 2023.4712 2023 202306 AFR AFRICA 4 28.2294 -25.7069 RUS RUSSIAN 4 28.2294 -25.7069 1 43 4 2.8 10 3 -0.491013 Verbal Cooperation
16 2023.4712 2023 202306 AFR AFRICA 1 100.0000 60.0000 RUS RUSSIAN 1 100.0000 60.0000 0 43 4 2.8 2 1 0.790514 Verbal Cooperation
17 2023.4712 2023 202306 AFR AFRICA 4 28.2294 -25.7069 RUSGOV RUSSIAN 4 28.2294 -25.7069 1 43 4 2.8 18 3 -0.491013 Verbal Cooperation

4. Set Classifcation Experiment (MultiClass)

This function initializes the training environment and creates the transformation pipeline.

from pycaret.classification import ClassificationExperiment

cls_exp = ClassificationExperiment()

cls_exp.setup(data=new_data,
      target='quadclass', 
      session_id=123,
      log_experiment='mlflow', experiment_name='gdelt_experiment')
<style type="text/css"> #T_cb6cb_row12_col1 { background-color: lightgreen; } </style>
  Description Value
0 Session id 123
1 Target quadclass
2 Target type Multiclass
3 Target mapping Material Conflict: 0, Material Cooperation: 1, Verbal Conflict: 2, Verbal Cooperation: 3
4 Original data shape (87552, 21)
5 Transformed data shape (87552, 31)
6 Transformed train set shape (61286, 31)
7 Transformed test set shape (26266, 31)
8 Ordinal features 1
9 Numeric features 13
10 Categorical features 7
11 Rows with missing values 40.5%
12 Preprocess True
13 Imputation type simple
14 Numeric imputation mean
15 Categorical imputation mode
16 Maximum one-hot encoding 25
17 Encoding method None
18 Fold Generator StratifiedKFold
19 Fold Number 10
20 CPU Jobs -1
21 Use GPU False
22 Log Experiment MlflowLogger
23 Experiment Name gdelt_experiment
24 USI 631c

5. Compare Models

This function trains and evaluates the performance of all the estimators available in the model library using cross-validation.

best = cls_exp.compare_models()
<style type="text/css"> #T_625eb th { text-align: left; } #T_625eb_row0_col0, #T_625eb_row1_col0, #T_625eb_row2_col0, #T_625eb_row3_col0, #T_625eb_row4_col0, #T_625eb_row5_col0, #T_625eb_row6_col0, #T_625eb_row7_col0, #T_625eb_row7_col1, #T_625eb_row7_col2, #T_625eb_row7_col3, #T_625eb_row7_col4, #T_625eb_row7_col5, #T_625eb_row7_col6, #T_625eb_row7_col7, #T_625eb_row8_col0, #T_625eb_row8_col1, #T_625eb_row8_col2, #T_625eb_row8_col3, #T_625eb_row8_col4, #T_625eb_row8_col5, #T_625eb_row8_col6, #T_625eb_row8_col7, #T_625eb_row9_col0, #T_625eb_row9_col1, #T_625eb_row9_col2, #T_625eb_row9_col3, #T_625eb_row9_col4, #T_625eb_row9_col5, #T_625eb_row9_col6, #T_625eb_row9_col7, #T_625eb_row10_col0, #T_625eb_row10_col1, #T_625eb_row10_col2, #T_625eb_row10_col3, #T_625eb_row10_col4, #T_625eb_row10_col5, #T_625eb_row10_col6, #T_625eb_row10_col7, #T_625eb_row11_col0, #T_625eb_row11_col1, #T_625eb_row11_col2, #T_625eb_row11_col3, #T_625eb_row11_col4, #T_625eb_row11_col5, #T_625eb_row11_col6, #T_625eb_row11_col7, #T_625eb_row12_col0, #T_625eb_row12_col1, #T_625eb_row12_col2, #T_625eb_row12_col3, #T_625eb_row12_col4, #T_625eb_row12_col5, #T_625eb_row12_col6, #T_625eb_row12_col7, #T_625eb_row13_col0, #T_625eb_row13_col1, #T_625eb_row13_col2, #T_625eb_row13_col3, #T_625eb_row13_col4, #T_625eb_row13_col5, #T_625eb_row13_col6, #T_625eb_row13_col7, #T_625eb_row14_col0, #T_625eb_row14_col1, #T_625eb_row14_col2, #T_625eb_row14_col3, #T_625eb_row14_col4, #T_625eb_row14_col5, #T_625eb_row14_col6, #T_625eb_row14_col7, #T_625eb_row15_col0, #T_625eb_row15_col1, #T_625eb_row15_col2, #T_625eb_row15_col3, #T_625eb_row15_col4, #T_625eb_row15_col5, #T_625eb_row15_col6, #T_625eb_row15_col7 { text-align: left; } #T_625eb_row0_col1, #T_625eb_row0_col2, #T_625eb_row0_col3, #T_625eb_row0_col4, #T_625eb_row0_col5, #T_625eb_row0_col6, #T_625eb_row0_col7, #T_625eb_row1_col1, #T_625eb_row1_col2, #T_625eb_row1_col3, #T_625eb_row1_col4, #T_625eb_row1_col5, #T_625eb_row1_col6, #T_625eb_row1_col7, #T_625eb_row2_col1, #T_625eb_row2_col2, #T_625eb_row2_col3, #T_625eb_row2_col4, #T_625eb_row2_col5, #T_625eb_row2_col6, #T_625eb_row2_col7, #T_625eb_row3_col1, #T_625eb_row3_col2, #T_625eb_row3_col3, #T_625eb_row3_col4, #T_625eb_row3_col5, #T_625eb_row3_col6, #T_625eb_row3_col7, #T_625eb_row4_col1, #T_625eb_row4_col2, #T_625eb_row4_col3, #T_625eb_row4_col4, #T_625eb_row4_col5, #T_625eb_row4_col6, #T_625eb_row4_col7, #T_625eb_row5_col1, #T_625eb_row5_col2, #T_625eb_row5_col3, #T_625eb_row5_col4, #T_625eb_row5_col5, #T_625eb_row5_col6, #T_625eb_row5_col7, #T_625eb_row6_col1, #T_625eb_row6_col2, #T_625eb_row6_col3, #T_625eb_row6_col4, #T_625eb_row6_col5, #T_625eb_row6_col6, #T_625eb_row6_col7 { text-align: left; background-color: yellow; } #T_625eb_row0_col8, #T_625eb_row1_col8, #T_625eb_row2_col8, #T_625eb_row3_col8, #T_625eb_row4_col8, #T_625eb_row5_col8, #T_625eb_row6_col8, #T_625eb_row7_col8, #T_625eb_row8_col8, #T_625eb_row9_col8, #T_625eb_row10_col8, #T_625eb_row11_col8, #T_625eb_row12_col8, #T_625eb_row14_col8, #T_625eb_row15_col8 { text-align: left; background-color: lightgrey; } #T_625eb_row13_col8 { text-align: left; background-color: yellow; background-color: lightgrey; } </style>
  Model Accuracy AUC Recall Prec. F1 Kappa MCC TT (Sec)
dt Decision Tree Classifier 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.2530
rf Random Forest Classifier 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.3150
gbc Gradient Boosting Classifier 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.3320
et Extra Trees Classifier 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.3340
xgboost Extreme Gradient Boosting 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.2430
lightgbm Light Gradient Boosting Machine 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.3330
catboost CatBoost Classifier 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.2730
knn K Neighbors Classifier 0.9886 0.9985 0.9886 0.9885 0.9885 0.9803 0.9803 1.5500
lda Linear Discriminant Analysis 0.9741 0.9979 0.9741 0.9766 0.9734 0.9554 0.9562 0.2390
nb Naive Bayes 0.9392 0.9888 0.9392 0.9397 0.9339 0.8936 0.8960 0.2320
lr Logistic Regression 0.9098 0.9792 0.9098 0.8989 0.9021 0.8414 0.8440 0.4480
ada Ada Boost Classifier 0.8881 0.9827 0.8881 0.8268 0.8485 0.8071 0.8252 0.2730
ridge Ridge Classifier 0.8188 0.0000 0.8188 0.8539 0.7633 0.6703 0.6973 0.2050
svm SVM - Linear Kernel 0.7028 0.0000 0.7028 0.6337 0.6357 0.4433 0.4887 0.1870
qda Quadratic Discriminant Analysis 0.6721 0.9795 0.6721 0.8973 0.6889 0.5181 0.5880 0.2360
dummy Dummy Classifier 0.6036 0.5000 0.6036 0.3644 0.4544 0.0000 0.0000 0.2030

6. Analyze Model

cls_exp.evaluate_model(best)

png

cls_exp.plot_model(best, plot = 'auc')

png

cls_exp.plot_model(best, plot = 'confusion_matrix')

png

cls_exp.plot_model(best, plot = 'confusion_matrix')

png

Click here for more.

Regression

Use regression models to predict the number of mentions.

import pandas as pd 
from utils import transform_data, int_to_datetime, get_null_values

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
data = pd.read_csv("230722.csv")

column_names = [
    'GlobalEventID',
    'Day',
    'MonthYear',
    'Year',
    'FractionDate',
    'Actor1Code',
    'Actor1Name',
    'Actor1CountryCode',
    'Actor1KnownGroupCode',
    'Actor1EthnicCode',
    'Actor1Religion1Code',
    'Actor1Religion2Code',
    'Actor1Type1Code',
    'Actor1Type2Code',
    'Actor1Type3Code',
    'Actor2Code',
    'Actor2Name',
    'Actor2CountryCode',
    'Actor2KnownGroupCode',
    'Actor2EthnicCode',
    'Actor2Religion1Code',
    'Actor2Religion2Code',
    'Actor2Type1Code',
    'Actor2Type2Code',
    'Actor2Type3Code',
    'IsRootEvent',
    'EventCode',
    'EventBaseCode',
    'EventRootCode',
    'QuadClass',
    'GoldsteinScale',
    'NumMentions',
    'NumSources',
    'NumArticles',
    'AvgTone',
    'Actor1Geo_Type',
    'Actor1Geo_FullName',
    'Actor1Geo_CountryCode',
    'Actor1Geo_ADM1Code',
    'Actor1Geo_Lat',
    'Actor1Geo_Long',
    'Actor1Geo_FeatureID',
    'Actor2Geo_Type',
    'Actor2Geo_FullName',
    'Actor2Geo_CountryCode',
    'Actor2Geo_ADM1Code',
    'Actor2Geo_Lat',
    'Actor2Geo_Long',
    'Actor2Geo_FeatureID',
    'DateAdded',
    'SourceURL'
        ]

data = transform_data(data, column_names)

data = int_to_datetime(data, "day")
data = int_to_datetime(data, "dateadded")

# Let use small segment of the dataset
data_sm = data[data['year']==2022]

data_lm = data[data['year']!=2022]

# columns = data_sm.columns.tolist()
# columns.remove('quadclass')
# columns.append('quadclass')

# data_sm = data_sm[columns]

data_lm.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
globaleventid day monthyear year fractiondate actor1code actor1name actor1countrycode actor1knowngroupcode actor1ethniccode actor1religion1code actor1religion2code actor1type1code actor1type2code actor1type3code actor2code actor2name actor2countrycode actor2knowngroupcode actor2ethniccode actor2religion1code actor2religion2code actor2type1code actor2type2code actor2type3code isrootevent eventcode eventbasecode eventrootcode quadclass goldsteinscale nummentions numsources numarticles avgtone actor1geo_type actor1geo_fullname actor1geo_countrycode actor1geo_adm1code actor1geo_lat actor1geo_long actor1geo_featureid actor2geo_type actor2geo_fullname actor2geo_countrycode actor2geo_adm1code actor2geo_lat actor2geo_long actor2geo_featureid dateadded sourceurl
13 1116435795 2023-06-22 202306 2023 2023.4712 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN BUS COMPANY NaN NaN NaN NaN NaN BUS NaN NaN 0 40 40 4 Verbal Cooperation 1.0 5 1 5 1.118963 0 NaN NaN NaN NaN NaN NaN 3 Fort Smith, Arkansas, United States US USAR 35.3859 -94.3985 76952 2023-07-22 https://www.kuaf.com/show/ozarks-at-large/2023...
14 1116435796 2023-06-22 202306 2023 2023.4712 AFR AFRICA AFR NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1 43 43 4 Verbal Cooperation 2.8 4 1 4 0.790514 1 Russia RS RS 60.0000 100.0000 RS 1 Russia RS RS 60.0000 100.0000 RS 2023-07-22 https://www.beijingbulletin.com/news/273906652...
15 1116435797 2023-06-22 202306 2023 2023.4712 AFR AFRICA AFR NaN NaN NaN NaN NaN NaN NaN RUS RUSSIAN RUS NaN NaN NaN NaN NaN NaN NaN 1 43 43 4 Verbal Cooperation 2.8 10 3 10 -0.491013 4 Pretoria, Gauteng, South Africa SF SF06 -25.7069 28.2294 -1273769 4 Pretoria, Gauteng, South Africa SF SF06 -25.7069 28.2294 -1273769 2023-07-22 https://www.beijingbulletin.com/news/273906652...
16 1116435798 2023-06-22 202306 2023 2023.4712 AFR AFRICA AFR NaN NaN NaN NaN NaN NaN NaN RUS RUSSIAN RUS NaN NaN NaN NaN NaN NaN NaN 0 43 43 4 Verbal Cooperation 2.8 2 1 2 0.790514 1 Russia RS RS 60.0000 100.0000 RS 1 Russia RS RS 60.0000 100.0000 RS 2023-07-22 https://www.beijingbulletin.com/news/273906652...
17 1116435799 2023-06-22 202306 2023 2023.4712 AFR AFRICA AFR NaN NaN NaN NaN NaN NaN NaN RUSGOV RUSSIAN RUS NaN NaN NaN NaN GOV NaN NaN 1 43 43 4 Verbal Cooperation 2.8 18 3 18 -0.491013 4 Pretoria, Gauteng, South Africa SF SF06 -25.7069 28.2294 -1273769 4 Pretoria, Gauteng, South Africa SF SF06 -25.7069 28.2294 -1273769 2023-07-22 https://www.beijingbulletin.com/news/273906652...
new_data = data_lm[
[      'fractiondate',
       'year',
       'monthyear',
       'actor1code', 
        'actor1name', 
        'actor1geo_type',
        'actor1geo_long',
        'actor1geo_lat',
        'actor2code', 
        'actor2name', 
        'actor2geo_type',
        'actor2geo_long',
        'actor2geo_lat',
        'isrootevent',
        'eventcode',
        'eventrootcode',
        'goldsteinscale',
        'nummentions',
        'numsources',
        'avgtone',
        'quadclass'
  ]
]


new_data.head() 
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
fractiondate year monthyear actor1code actor1name actor1geo_type actor1geo_long actor1geo_lat actor2code actor2name actor2geo_type actor2geo_long actor2geo_lat isrootevent eventcode eventrootcode goldsteinscale nummentions numsources avgtone quadclass
13 2023.4712 2023 202306 NaN NaN 0 NaN NaN BUS COMPANY 3 -94.3985 35.3859 0 40 4 1.0 5 1 1.118963 Verbal Cooperation
14 2023.4712 2023 202306 AFR AFRICA 1 100.0000 60.0000 NaN NaN 1 100.0000 60.0000 1 43 4 2.8 4 1 0.790514 Verbal Cooperation
15 2023.4712 2023 202306 AFR AFRICA 4 28.2294 -25.7069 RUS RUSSIAN 4 28.2294 -25.7069 1 43 4 2.8 10 3 -0.491013 Verbal Cooperation
16 2023.4712 2023 202306 AFR AFRICA 1 100.0000 60.0000 RUS RUSSIAN 1 100.0000 60.0000 0 43 4 2.8 2 1 0.790514 Verbal Cooperation
17 2023.4712 2023 202306 AFR AFRICA 4 28.2294 -25.7069 RUSGOV RUSSIAN 4 28.2294 -25.7069 1 43 4 2.8 18 3 -0.491013 Verbal Cooperation
## 1. Import RegressionExperiment and init the class
from pycaret.regression import RegressionExperiment
reg_exp = RegressionExperiment()
reg_exp.setup(new_data, target= 'nummentions', session_id = 223)
<style type="text/css"> #T_dbec4_row11_col1 { background-color: lightgreen; } </style>
  Description Value
0 Session id 223
1 Target nummentions
2 Target type Regression
3 Original data shape (87552, 21)
4 Transformed data shape (87552, 34)
5 Transformed train set shape (61286, 34)
6 Transformed test set shape (26266, 34)
7 Ordinal features 1
8 Numeric features 12
9 Categorical features 8
10 Rows with missing values 40.5%
11 Preprocess True
12 Imputation type simple
13 Numeric imputation mean
14 Categorical imputation mode
15 Maximum one-hot encoding 25
16 Encoding method None
17 Fold Generator KFold
18 Fold Number 10
19 CPU Jobs -1
20 Use GPU False
21 Log Experiment False
22 Experiment Name reg-default-name
23 USI c2ca
### 2. Compare baseline models 
best = reg_exp.compare_models()
<style type="text/css"> #T_e0820 th { text-align: left; } #T_e0820_row0_col0, #T_e0820_row0_col5, #T_e0820_row0_col6, #T_e0820_row1_col0, #T_e0820_row1_col1, #T_e0820_row1_col2, #T_e0820_row1_col3, #T_e0820_row1_col4, #T_e0820_row1_col5, #T_e0820_row1_col6, #T_e0820_row2_col0, #T_e0820_row2_col1, #T_e0820_row2_col2, #T_e0820_row2_col3, #T_e0820_row2_col4, #T_e0820_row2_col5, #T_e0820_row3_col0, #T_e0820_row3_col1, #T_e0820_row3_col2, #T_e0820_row3_col3, #T_e0820_row3_col4, #T_e0820_row3_col5, #T_e0820_row3_col6, #T_e0820_row4_col0, #T_e0820_row4_col1, #T_e0820_row4_col2, #T_e0820_row4_col3, #T_e0820_row4_col4, #T_e0820_row4_col5, #T_e0820_row4_col6, #T_e0820_row5_col0, #T_e0820_row5_col1, #T_e0820_row5_col2, #T_e0820_row5_col3, #T_e0820_row5_col4, #T_e0820_row5_col5, #T_e0820_row5_col6, #T_e0820_row6_col0, #T_e0820_row6_col1, #T_e0820_row6_col2, #T_e0820_row6_col3, #T_e0820_row6_col4, #T_e0820_row6_col5, #T_e0820_row6_col6, #T_e0820_row7_col0, #T_e0820_row7_col1, #T_e0820_row7_col2, #T_e0820_row7_col3, #T_e0820_row7_col4, #T_e0820_row7_col6, #T_e0820_row8_col0, #T_e0820_row8_col1, #T_e0820_row8_col2, #T_e0820_row8_col3, #T_e0820_row8_col4, #T_e0820_row8_col5, #T_e0820_row8_col6, #T_e0820_row9_col0, #T_e0820_row9_col1, #T_e0820_row9_col2, #T_e0820_row9_col3, #T_e0820_row9_col4, #T_e0820_row9_col5, #T_e0820_row9_col6, #T_e0820_row10_col0, #T_e0820_row10_col1, #T_e0820_row10_col2, #T_e0820_row10_col3, #T_e0820_row10_col4, #T_e0820_row10_col5, #T_e0820_row10_col6, #T_e0820_row11_col0, #T_e0820_row11_col1, #T_e0820_row11_col2, #T_e0820_row11_col3, #T_e0820_row11_col4, #T_e0820_row11_col5, #T_e0820_row11_col6, #T_e0820_row12_col0, #T_e0820_row12_col1, #T_e0820_row12_col2, #T_e0820_row12_col3, #T_e0820_row12_col4, #T_e0820_row12_col5, #T_e0820_row12_col6, #T_e0820_row13_col0, #T_e0820_row13_col1, #T_e0820_row13_col2, #T_e0820_row13_col3, #T_e0820_row13_col4, #T_e0820_row13_col5, #T_e0820_row13_col6, #T_e0820_row14_col0, #T_e0820_row14_col1, #T_e0820_row14_col2, #T_e0820_row14_col3, #T_e0820_row14_col4, #T_e0820_row14_col5, #T_e0820_row14_col6, #T_e0820_row15_col0, #T_e0820_row15_col1, #T_e0820_row15_col2, #T_e0820_row15_col3, #T_e0820_row15_col4, #T_e0820_row15_col5, #T_e0820_row15_col6, #T_e0820_row16_col0, #T_e0820_row16_col1, #T_e0820_row16_col2, #T_e0820_row16_col3, #T_e0820_row16_col4, #T_e0820_row16_col5, #T_e0820_row16_col6, #T_e0820_row17_col0, #T_e0820_row17_col1, #T_e0820_row17_col2, #T_e0820_row17_col3, #T_e0820_row17_col4, #T_e0820_row17_col5, #T_e0820_row17_col6, #T_e0820_row18_col0, #T_e0820_row18_col1, #T_e0820_row18_col2, #T_e0820_row18_col3, #T_e0820_row18_col4, #T_e0820_row18_col5, #T_e0820_row18_col6, #T_e0820_row19_col0, #T_e0820_row19_col1, #T_e0820_row19_col2, #T_e0820_row19_col3, #T_e0820_row19_col4, #T_e0820_row19_col5, #T_e0820_row19_col6 { text-align: left; } #T_e0820_row0_col1, #T_e0820_row0_col2, #T_e0820_row0_col3, #T_e0820_row0_col4, #T_e0820_row2_col6, #T_e0820_row7_col5 { text-align: left; background-color: yellow; } #T_e0820_row0_col7, #T_e0820_row1_col7, #T_e0820_row2_col7, #T_e0820_row3_col7, #T_e0820_row4_col7, #T_e0820_row5_col7, #T_e0820_row6_col7, #T_e0820_row7_col7, #T_e0820_row8_col7, #T_e0820_row9_col7, #T_e0820_row10_col7, #T_e0820_row11_col7, #T_e0820_row12_col7, #T_e0820_row13_col7, #T_e0820_row14_col7, #T_e0820_row15_col7, #T_e0820_row17_col7, #T_e0820_row18_col7, #T_e0820_row19_col7 { text-align: left; background-color: lightgrey; } #T_e0820_row16_col7 { text-align: left; background-color: yellow; background-color: lightgrey; } </style>
  Model MAE MSE RMSE R2 RMSLE MAPE TT (Sec)
et Extra Trees Regressor 7.1072 971.7544 30.7220 0.6642 0.7049 1.3150 7.8420
lr Linear Regression 7.4204 1125.7411 33.0324 0.6264 0.7079 1.0732 0.7970
ridge Ridge Regression 7.4197 1125.7176 33.0320 0.6264 0.7076 1.0730 0.2770
br Bayesian Ridge 7.4039 1125.7355 33.0322 0.6264 0.7066 1.0764 0.3320
en Elastic Net 7.3649 1133.0863 33.1390 0.6242 0.6787 1.1509 0.3000
lasso Lasso Regression 7.3717 1133.9014 33.1559 0.6235 0.6786 1.1485 0.2990
llar Lasso Least Angle Regression 7.3717 1133.9015 33.1559 0.6235 0.6786 1.1485 0.2800
omp Orthogonal Matching Pursuit 7.3207 1133.4790 33.1510 0.6235 0.6704 1.1371 0.2590
gbr Gradient Boosting Regressor 7.4531 1182.7314 33.7733 0.6089 0.7222 1.3741 3.9150
knn K Neighbors Regressor 8.2201 1251.2758 34.8129 0.5846 0.7963 1.3226 0.9470
lightgbm Light Gradient Boosting Machine 8.4165 1328.7091 35.7868 0.5611 0.8174 1.7037 0.5730
catboost CatBoost Regressor 9.2361 1324.4911 35.8178 0.5594 0.8975 2.0437 3.6830
rf Random Forest Regressor 11.6646 1506.5636 38.2956 0.4919 1.1009 3.0682 10.9790
xgboost Extreme Gradient Boosting 10.2014 1576.1854 39.2998 0.4577 0.9663 2.3305 2.8680
dt Decision Tree Regressor 12.7570 2345.6060 47.8260 0.1788 1.1485 3.2058 0.4750
huber Huber Regressor 9.5112 2676.3266 50.8367 0.1279 0.8288 1.1368 0.4730
dummy Dummy Regressor 13.2721 3071.2452 54.4972 -0.0001 1.1804 3.1352 0.2260
par Passive Aggressive Regressor 10.4196 3129.2235 55.0295 -0.0206 1.0203 1.1399 0.3020
ada AdaBoost Regressor 84.4408 10654.0191 100.9776 -2.9177 2.6662 27.1457 1.6290
lar Least Angle Regression 34.3947 109968.5421 134.1696 -43.2102 0.9558 11.8419 0.2800

Analyse Model

The plot_model function is used to analyze the performance of a trained model on the test set. It may require re-training the model in certain cases.

#### plot residuals 
reg_exp.plot_model(best, plot='residuals')

png

#### Plot error 
reg_exp.plot_model(best, plot = 'error')

png

#### Plot feature importance
reg_exp.plot_model(best, plot='feature')

png

# check docstring to see available plots 
# help(plot_model)

An alternate to plot_model function is evaluate_model

reg_exp.evaluate_model(best)

png

Prediction

The predict_model function returns prediction_label as new column to the input dataframe. When data is None (default), it uses the test set (created during the setup function) for scoring.

#### predict on the test set 
holdout_pred = reg_exp.predict_model(best)
<style type="text/css"> </style>
  Model MAE MSE RMSE R2 RMSLE MAPE
0 Extra Trees Regressor 5.2705 1002.9253 31.6690 0.6425 0.4999 0.6888
holdout_pred.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
fractiondate year monthyear actor1code actor1name actor1geo_type actor1geo_long actor1geo_lat actor2code actor2name ... actor2geo_lat isrootevent eventcode eventrootcode goldsteinscale numsources avgtone quadclass nummentions prediction_label
56143 2023.553345 2023 202307 GBR UNITED KINGDOM 4 -1.783330 51.700001 GOV PRINCE ... 51.700001 1 36 3 4.0 1 1.671733 Verbal Cooperation 2 2.58
62156 2023.553345 2023 202307 USA UNITED STATES 3 -118.327003 34.098301 NaN NaN ... -21.100000 1 10 1 0.0 1 -0.250627 Verbal Cooperation 1 2.63
81585 2023.553345 2023 202307 USA TEXAS 2 -97.647499 31.106001 NaN NaN ... 25.683300 0 84 8 7.0 1 2.936857 Material Cooperation 13 5.28
61106 2023.553345 2023 202307 MED PUBLISHER 0 NaN NaN NaN NaN ... NaN 1 190 19 -10.0 9 0.366028 Material Conflict 65 80.50
8558 2023.553345 2023 202307 USA PENNSYLVANIA 2 -77.264000 40.577301 EDU SCHOOL ... 40.577301 0 125 12 -5.0 1 -1.845820 Verbal Conflict 10 9.07

5 rows × 22 columns

Click here for more.

Clustering

Use clustering models to cluster the events.

import pandas as pd 
from utils import transform_data, int_to_datetime, get_null_values

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
data = pd.read_csv("230722.csv")

column_names = [
    'GlobalEventID',
    'Day',
    'MonthYear',
    'Year',
    'FractionDate',
    'Actor1Code',
    'Actor1Name',
    'Actor1CountryCode',
    'Actor1KnownGroupCode',
    'Actor1EthnicCode',
    'Actor1Religion1Code',
    'Actor1Religion2Code',
    'Actor1Type1Code',
    'Actor1Type2Code',
    'Actor1Type3Code',
    'Actor2Code',
    'Actor2Name',
    'Actor2CountryCode',
    'Actor2KnownGroupCode',
    'Actor2EthnicCode',
    'Actor2Religion1Code',
    'Actor2Religion2Code',
    'Actor2Type1Code',
    'Actor2Type2Code',
    'Actor2Type3Code',
    'IsRootEvent',
    'EventCode',
    'EventBaseCode',
    'EventRootCode',
    'QuadClass',
    'GoldsteinScale',
    'NumMentions',
    'NumSources',
    'NumArticles',
    'AvgTone',
    'Actor1Geo_Type',
    'Actor1Geo_FullName',
    'Actor1Geo_CountryCode',
    'Actor1Geo_ADM1Code',
    'Actor1Geo_Lat',
    'Actor1Geo_Long',
    'Actor1Geo_FeatureID',
    'Actor2Geo_Type',
    'Actor2Geo_FullName',
    'Actor2Geo_CountryCode',
    'Actor2Geo_ADM1Code',
    'Actor2Geo_Lat',
    'Actor2Geo_Long',
    'Actor2Geo_FeatureID',
    'DateAdded',
    'SourceURL'
        ]

data = transform_data(data, column_names)

data = int_to_datetime(data, "day")
data = int_to_datetime(data, "dateadded")

# Let use small segment of the dataset
data_sm = data[data['year']==2022]

data_lm = data[data['year']!=2022]

# columns = data_sm.columns.tolist()
# columns.remove('quadclass')
# columns.append('quadclass')

# data_sm = data_sm[columns]

data_lm.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
globaleventid day monthyear year fractiondate actor1code actor1name actor1countrycode actor1knowngroupcode actor1ethniccode actor1religion1code actor1religion2code actor1type1code actor1type2code actor1type3code actor2code actor2name actor2countrycode actor2knowngroupcode actor2ethniccode actor2religion1code actor2religion2code actor2type1code actor2type2code actor2type3code isrootevent eventcode eventbasecode eventrootcode quadclass goldsteinscale nummentions numsources numarticles avgtone actor1geo_type actor1geo_fullname actor1geo_countrycode actor1geo_adm1code actor1geo_lat actor1geo_long actor1geo_featureid actor2geo_type actor2geo_fullname actor2geo_countrycode actor2geo_adm1code actor2geo_lat actor2geo_long actor2geo_featureid dateadded sourceurl
13 1116435795 2023-06-22 202306 2023 2023.4712 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN BUS COMPANY NaN NaN NaN NaN NaN BUS NaN NaN 0 40 40 4 Verbal Cooperation 1.0 5 1 5 1.118963 0 NaN NaN NaN NaN NaN NaN 3 Fort Smith, Arkansas, United States US USAR 35.3859 -94.3985 76952 2023-07-22 https://www.kuaf.com/show/ozarks-at-large/2023...
14 1116435796 2023-06-22 202306 2023 2023.4712 AFR AFRICA AFR NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1 43 43 4 Verbal Cooperation 2.8 4 1 4 0.790514 1 Russia RS RS 60.0000 100.0000 RS 1 Russia RS RS 60.0000 100.0000 RS 2023-07-22 https://www.beijingbulletin.com/news/273906652...
15 1116435797 2023-06-22 202306 2023 2023.4712 AFR AFRICA AFR NaN NaN NaN NaN NaN NaN NaN RUS RUSSIAN RUS NaN NaN NaN NaN NaN NaN NaN 1 43 43 4 Verbal Cooperation 2.8 10 3 10 -0.491013 4 Pretoria, Gauteng, South Africa SF SF06 -25.7069 28.2294 -1273769 4 Pretoria, Gauteng, South Africa SF SF06 -25.7069 28.2294 -1273769 2023-07-22 https://www.beijingbulletin.com/news/273906652...
16 1116435798 2023-06-22 202306 2023 2023.4712 AFR AFRICA AFR NaN NaN NaN NaN NaN NaN NaN RUS RUSSIAN RUS NaN NaN NaN NaN NaN NaN NaN 0 43 43 4 Verbal Cooperation 2.8 2 1 2 0.790514 1 Russia RS RS 60.0000 100.0000 RS 1 Russia RS RS 60.0000 100.0000 RS 2023-07-22 https://www.beijingbulletin.com/news/273906652...
17 1116435799 2023-06-22 202306 2023 2023.4712 AFR AFRICA AFR NaN NaN NaN NaN NaN NaN NaN RUSGOV RUSSIAN RUS NaN NaN NaN NaN GOV NaN NaN 1 43 43 4 Verbal Cooperation 2.8 18 3 18 -0.491013 4 Pretoria, Gauteng, South Africa SF SF06 -25.7069 28.2294 -1273769 4 Pretoria, Gauteng, South Africa SF SF06 -25.7069 28.2294 -1273769 2023-07-22 https://www.beijingbulletin.com/news/273906652...
new_data = data_sm[
[      'fractiondate',
       'year',
       'monthyear',
       'actor1code', 
        'actor1name', 
        'actor1geo_type',
        'actor1geo_long',
        'actor1geo_lat',
        'actor2code', 
        'actor2name', 
        'actor2geo_type',
        'actor2geo_long',
        'actor2geo_lat',
        'isrootevent',
        'eventcode',
        'eventrootcode',
        'goldsteinscale',
        'nummentions',
        'numsources',
        'avgtone',
        'quadclass'
  ]
]


new_data.head() 
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
fractiondate year monthyear actor1code actor1name actor1geo_type actor1geo_long actor1geo_lat actor2code actor2name actor2geo_type actor2geo_long actor2geo_lat isrootevent eventcode eventrootcode goldsteinscale nummentions numsources avgtone quadclass
0 2022.5534 2022 202207 CAN CANADA 4 -137.8170 66.8500 CHRCTH CATHOLIC 4 -137.8170 66.8500 1 20 2 3.0 13 2 -3.829401 Verbal Cooperation
1 2022.5534 2022 202207 CAN CANADA 4 -137.8170 66.8500 CHRCTH CATHOLIC 4 -137.8170 66.8500 1 90 9 -2.0 13 2 -3.829401 Material Cooperation
2 2022.5534 2022 202207 CVL COMMUNITY 4 -137.8170 66.8500 CHRCTH CATHOLIC 4 -137.8170 66.8500 1 20 2 3.0 7 2 -3.829401 Verbal Cooperation
3 2022.5534 2022 202207 CVL COMMUNITY 4 -137.8170 66.8500 CHRCTH CATHOLIC 4 -137.8170 66.8500 1 90 9 -2.0 7 2 -3.829401 Material Cooperation
4 2022.5534 2022 202207 EDU STUDENT 3 -77.6497 39.0834 NaN NaN 3 -77.6497 39.0834 0 110 11 -2.0 10 1 0.393701 Verbal Conflict

import ClusteringExperiment and init the class

from pycaret.clustering import ClusteringExperiment
clu_exp = ClusteringExperiment()
Init setup
clu_exp.setup(data=new_data, session_id=234)
<style type="text/css"> #T_4ce77_row7_col1 { background-color: lightgreen; } </style>
  Description Value
0 Session id 234
1 Original data shape (832, 21)
2 Transformed data shape (832, 662)
3 Ordinal features 1
4 Numeric features 13
5 Categorical features 8
6 Rows with missing values 39.5%
7 Preprocess True
8 Imputation type simple
9 Numeric imputation mean
10 Categorical imputation mode
11 Maximum one-hot encoding -1
12 Encoding method None
13 CPU Jobs -1
14 Use GPU False
15 Log Experiment False
16 Experiment Name cluster-default-name
17 USI 0098

Create Model

This function trains and evaluates the performance of a given model. Metrics evaluated can be accessed using the get_metrics function. Custom metrics can be added or removed using the add_metric and remove_metric function.

clu_exp.models()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
Name Reference
ID
kmeans K-Means Clustering sklearn.cluster._kmeans.KMeans
ap Affinity Propagation sklearn.cluster._affinity_propagation.Affinity...
meanshift Mean Shift Clustering sklearn.cluster._mean_shift.MeanShift
sc Spectral Clustering sklearn.cluster._spectral.SpectralClustering
hclust Agglomerative Clustering sklearn.cluster._agglomerative.AgglomerativeCl...
dbscan Density-Based Spatial Clustering sklearn.cluster._dbscan.DBSCAN
optics OPTICS Clustering sklearn.cluster._optics.OPTICS
birch Birch Clustering sklearn.cluster._birch.Birch
kmodes K-Modes Clustering kmodes.kmodes.KModes

Train kmeans model

kmeans = clu_exp.create_model('kmeans')
<style type="text/css"> </style>
  Silhouette Calinski-Harabasz Davies-Bouldin Homogeneity Rand Index Completeness
0 0.4875 2091.9609 0.5016 0 0 0
# train meanshift model
meanshift = clu_exp.create_model('meanshift')
<style type="text/css"> </style>
  Silhouette Calinski-Harabasz Davies-Bouldin Homogeneity Rand Index Completeness
0 0.2849 649.3671 0.4694 0 0 0

Assign Model

This function assigns cluster labels to the training data, given a trained model.

kmeans_cluster = clu_exp.assign_model(kmeans)
kmeans_cluster
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
fractiondate year monthyear actor1code actor1name actor1geo_type actor1geo_long actor1geo_lat actor2code actor2name actor2geo_type actor2geo_long actor2geo_lat isrootevent eventcode eventrootcode goldsteinscale nummentions numsources avgtone quadclass Cluster
0 2022.553345 2022 202207 CAN CANADA 4 -137.817001 66.849998 CHRCTH CATHOLIC 4 -137.817001 66.849998 1 20 2 3.0 13 2 -3.829401 Verbal Cooperation Cluster 2
1 2022.553345 2022 202207 CAN CANADA 4 -137.817001 66.849998 CHRCTH CATHOLIC 4 -137.817001 66.849998 1 90 9 -2.0 13 2 -3.829401 Material Cooperation Cluster 2
2 2022.553345 2022 202207 CVL COMMUNITY 4 -137.817001 66.849998 CHRCTH CATHOLIC 4 -137.817001 66.849998 1 20 2 3.0 7 2 -3.829401 Verbal Cooperation Cluster 2
3 2022.553345 2022 202207 CVL COMMUNITY 4 -137.817001 66.849998 CHRCTH CATHOLIC 4 -137.817001 66.849998 1 90 9 -2.0 7 2 -3.829401 Material Cooperation Cluster 2
4 2022.553345 2022 202207 EDU STUDENT 3 -77.649696 39.083401 NaN NaN 3 -77.649696 39.083401 0 110 11 -2.0 10 1 0.393701 Verbal Conflict Cluster 2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
87979 2022.553345 2022 202207 USAGOV HOUSTON 3 -95.363297 29.763300 USA TEXAS 3 -95.363297 29.763300 1 112 11 -2.0 3 1 -10.795455 Verbal Conflict Cluster 2
87980 2022.553345 2022 202207 USAGOV UNITED STATES 3 -95.363297 29.763300 USA UNITED STATES 2 -97.647499 31.106001 1 112 11 -2.0 1 1 -10.795455 Verbal Conflict Cluster 2
87981 2022.553345 2022 202207 USAGOV HOUSTON 3 -95.363297 29.763300 USA TEXAS 2 -97.647499 31.106001 1 112 11 -2.0 6 1 -10.795455 Verbal Conflict Cluster 2
87982 2022.553345 2022 202207 USAGOV UNITED STATES 3 -95.363297 29.763300 USA HOUSTON 2 -97.647499 31.106001 1 180 18 -9.0 6 1 -10.795455 Material Conflict Cluster 2
87983 2022.553345 2022 202207 USAGOV HOUSTON 3 -95.363297 29.763300 USA TEXAS 2 -97.647499 31.106001 1 180 18 -9.0 4 1 -10.795455 Material Conflict Cluster 2

832 rows × 22 columns

Click here for more.

Demo

To deploy this project run

    docker run -p 8888:8888 pycaret/full start.sh jupyter lab 

Hi, I'm Ade! 👋

🔗 Links

portfolio linkedin twitter

Acknowledgements