Come up with correct normalizing code for fields
Opened this issue · 4 comments
We need to come up with the right rules for normalizing the fields in the master table. For instance, deciding to have all ship names in capital letters, or getting rid of bad characters from callsigns.
The file where we will keep track of these normalizing rules is here: https://github.com/GlobalFishingWatch/identity-matching/blob/master/table_building/fieldnormalizing.md
I'm assigning this to @Bjorn-skytruth, but I am also equally assigned to it.
Here are examples of callsigns from type5/24 messages and registries and how they should be normalized.
callsign | source | normalized callsign |
---|---|---|
HO_4007 | type 5 | HO4007 |
BL-2099 | type 5 | BL2099 |
6 N K J | type 5 | 6NKJ |
A | type 5 | |
BBIM2 | IOTC | BBIM2 |
412677390 | IOTC | |
YEA. 4420 | IOTC | YEA4420 |
Unknown | IOTC | |
BJ4089Â | WCPFC | BJ4089 |
cb3290 | SPRFMO | CB3290 |
(blank) | ICCAT | |
326DAEJIN | CLAV | |
306GANGWON | CLAV |
Normalization Rules for callsigns
- Remove all characters that are not numbers or standard letters
- Capitalize letters
- Remove all spaces
- Remove any callsign that does not have from 3-7 characters
- Remove the callsigns 'UNKNOWN', 'Unknown', and '(blank)'
Normalizing names will be considerably more complex. Examples of name and callsigns can be found here. Issues to resolve, roughly in order of importance:
- Capitalize name
- Remove the text 'No.' (Nº sometimes found in registries)
- Move the number from before to after name
- Remove characters -#'
- Replace roman numerals with numbers
- Remove F/V and F/B from the name
- Fix foreign accented letters that come up scrambled in the registries, there is a pattern but it may be complex to work out. Turkish names are almost indecipherable.
- Replace number words with numbers
@Bjorn-skytruth this is very helpful -- I think we can do most of it in BigQuery, but it will take some work to get the regex expressions correct.