GlobalFishingWatch/identity-matching

Come up with correct normalizing code for fields

Opened this issue · 4 comments

We need to come up with the right rules for normalizing the fields in the master table. For instance, deciding to have all ship names in capital letters, or getting rid of bad characters from callsigns.

The file where we will keep track of these normalizing rules is here: https://github.com/GlobalFishingWatch/identity-matching/blob/master/table_building/fieldnormalizing.md

I'm assigning this to @Bjorn-skytruth, but I am also equally assigned to it.

Here are examples of callsigns from type5/24 messages and registries and how they should be normalized.

callsign source normalized callsign
HO_4007 type 5 HO4007
BL-2099 type 5 BL2099
6 N K J type 5 6NKJ
A type 5
BBIM2 IOTC BBIM2
412677390 IOTC
YEA. 4420 IOTC YEA4420
Unknown IOTC
BJ4089  WCPFC BJ4089
cb3290 SPRFMO CB3290
(blank) ICCAT
326DAEJIN CLAV
306GANGWON CLAV

Normalization Rules for callsigns

  1. Remove all characters that are not numbers or standard letters
  2. Capitalize letters
  3. Remove all spaces
  4. Remove any callsign that does not have from 3-7 characters
  5. Remove the callsigns 'UNKNOWN', 'Unknown', and '(blank)'

Normalizing names will be considerably more complex. Examples of name and callsigns can be found here. Issues to resolve, roughly in order of importance:

  1. Capitalize name
  2. Remove the text 'No.' (Nº sometimes found in registries)
  3. Move the number from before to after name
  4. Remove characters -#'
  5. Replace roman numerals with numbers
  6. Remove F/V and F/B from the name
  7. Fix foreign accented letters that come up scrambled in the registries, there is a pattern but it may be complex to work out. Turkish names are almost indecipherable.
  8. Replace number words with numbers

@Bjorn-skytruth this is very helpful -- I think we can do most of it in BigQuery, but it will take some work to get the regex expressions correct.