ETS unit matching

CZP Charles University, February-March 2022

Goal: fuzzy match ETS unit names across two spreadsheets.

This folder contains two spreadsheets with ETS (European Union emission trading system) units (facility and company) in the Czech Republic. ETS_id_list_2015.xlsx is the original file which contains 2015 data (including ETS codes), ETS_art10a_national_alocation_2021-2026_Czechia.xlsx is the new file which contains 2021-2026 data. They both use slightly different names for individual units. My goal is to match the specific facilities across the two files in order to get the ETS codes from the 2015 file for each unit in the 2021-2026 file. Both files were converted into a new/original ETS units list .csv.

Data	Scope	Excel file	CSV file
Original	2015	`ETS_id_list_2015`	`original ets unit`
New	2021-2026	`ETS_art10a_national_alocation_2021-2026_Czechia`	`new ets unit`

There are two versions of the code. fuzzy matching - original.py matches the facility and company names independently. This results in a high error rate. The new version of the code, fuzzy matching - current.py combines the facility and company names before matching, resulting in more correct matches. However, the matched data still needs extensive editing and QA. The matched ets units.csv is the output of the fuzzy matching, correction, and QA process. The efficacy of the original and new scripts is listed in the table below:

Result	Old method	New method
Match	183 (79%)	214-216 (93%)
Partial match	37 (16%)	n/a
No match	11 (5%)	17-15 (7%)

The fuzzy matching - check matched pairs.py was used for quality control with the fuzzy matching - original.py script and is not longer useful.

merlijnkersten/ETS-unit-matching

ETS unit matching