/ETS-unit-matching

Fuzzy matching Czech ETS units

Primary LanguagePython

ETS unit matching

CZP Charles University, February-March 2022

Goal: fuzzy match ETS unit names across two spreadsheets.

This folder contains two spreadsheets with ETS (European Union emission trading system) units (facility and company) in the Czech Republic. ETS_id_list_2015.xlsx is the original file which contains 2015 data (including ETS codes), ETS_art10a_national_alocation_2021-2026_Czechia.xlsx is the new file which contains 2021-2026 data. They both use slightly different names for individual units. My goal is to match the specific facilities across the two files in order to get the ETS codes from the 2015 file for each unit in the 2021-2026 file. Both files were converted into a new/original ETS units list .csv.

Data Scope Excel file CSV file
Original 2015 ETS_id_list_2015 original ets unit
New 2021-2026 ETS_art10a_national_alocation_2021-2026_Czechia new ets unit

There are two versions of the code. fuzzy matching - original.py matches the facility and company names independently. This results in a high error rate. The new version of the code, fuzzy matching - current.py combines the facility and company names before matching, resulting in more correct matches. However, the matched data still needs extensive editing and QA. The matched ets units.csv is the output of the fuzzy matching, correction, and QA process. The efficacy of the original and new scripts is listed in the table below:

Result Old method New method
Match 183 (79%) 214-216 (93%)
Partial match 37 (16%) n/a
No match 11 (5%) 17-15 (7%)

The fuzzy matching - check matched pairs.py was used for quality control with the fuzzy matching - original.py script and is not longer useful.