inkjet/pypostalcode

Update Database

Closed this issue · 8 comments

Is there an easy way to update this database to use the 2016 data?
https://www12.statcan.gc.ca/census-recensement/2011/geo/bound-limit/bound-limit-2016-eng.cfm

I also would like to request this update (or contribute, whichever makes the most sense) for the 2016 Census postal codes. I'm finding postal codes (FSA: "V3Z", for example) existing and I can search for them, but the database has no knowledge of these codes.

+1 on wanting to contribute to the database. I have a lot of verified zip codes that don't appear in the DB.

It might be possible to convert the 2016 Canadian Census shapefiles available at the link above to an SQLite-compatible *.db file using shapefile-to-sqlite. If this is done then how do we know if the *.db file uses the correct schema for this application, or do we need to somehow infer this from the existing postalcodes.db?

I looked into this and I still don't know where this data actually comes from.

In this project, postalcodes.db contains data in this format:

c.execute("CREATE TABLE PostalCodes(fsa VARCHAR(3), city TEXT, province TEXT, longitude DOUBLE, latitude DOUBLE, timezone INT, dst INT);")

So for each 3 character FSA code A1A, it has a human-readable name ("city"), a province, a longitude and latitude coordinate and a timezone/dst status.

Going to the 2016 Census data page (linked above) and downloading the "English" language, "ArcGIS ® (.shp)" format, "Cartographic Boundary File" for "Forward Sortation Areas ©" you get a zip file with some stuff, most important being a .shp file. Opening the .shp file in QGIS, you can see that the data is a map of Canada split into shapes (the FSA codes)

Screenshot from 2021-02-25 16-51-35

and right clicking on the layer and clicking "Open Attribute Table", you see there's also an associated table that looks like this:

Screenshot from 2021-02-25 16-36-30

So it's just associating the first 3 digits of postal codes (FSA code) with a 2D shape and a province name. The latter is useless because the first letter of a postal code already tells you the province. Furthermore, the data you get is a shapefile but this project has a single longitude/latitude point for each FSA code, so if you were to update it with the 2016 data, you would somehow have to replicate the conversion of 2D shapes into a single coordinate, which should be straight forward for most FSAs but there's also huge regions in the north which would be weird to choose a single point for.

Also, the data contains a PDF file with information about the data, and the following text:

Below is a list of the twenty-one forward sortation areas which are not included in the boundary file because they
were not the dominant FSA in a dissemination area

E2R G1A H4T H4Y H4Z H5A H5B K1A L5S L5T L9E M5K M5L M5W M5X M7A M7Y S4M T6N V7X V7Y

I went looking for where the more descriptive subdivision names like "Calgary (Discovery Ridge / Signal Hill / Aspen Woods / Patterson / Cougar Ridge)" come from, and the closest thing I found is the "Federal electoral districts (2013 Representation Order)" file, which has shapes with the following attributes:

Screenshot from 2021-02-25 17-18-51

But those aren't electoral districts (well Calgary-Signal Hill is, but Discovery Ridge isn't). The 2011 "Forward Sortation Areas" data also doesn't have these names, just names of provinces. I thought that this data might come from Wikipedia:

https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_T

The article for T postal codes, there's even a table with the same data (city, subdivision, long/lat) (possibly added from this package?) but the other letters don't have this table and the other tables are not quite the same: for example, on Wikipedia TOA is given as just "Eastern Alberta" but in postalcodes.db it's given as "Eastern Alberta (St. Paul)", presumably St. Paul is the most populous town in that region, I don't know where that's coming from. I also don't know where the Wikipedia article is populated from. It might still make sense to scrape those 18 Wikipedia articles for comparing with the existing data.

I converted postalcodes.db to a csv and saved the 2016 FSA .shp file to a CSV, and compared them using comm and xsv

comm -3 <(xsv select fsa 2011_fsa.csv | sort) <(xsv select CFSAUID 2016_fsa.csv | sort)

and I got this, the 2016 data has these FSAs that postalcodes.db doesn't have:

B6L E3G G3B G3C G3N G6Y J1C J1R J2M J5C J7W L7K L9J P7L R1C R4G S7W T1Z T3S T4M T8T V3Z V9Z

And also, postalcodes.db contains these FSAs that are missing from the newer 2016 data:

E2R G0B G1A H0H H4T H4Y H4Z H5A H5B K1A L5P L5S L5T M5K M5L M5W M5X M7A M7Y N8V T6N V7X V7Y

if you remove the 21 FSA codes that are mentioned in the PDF (see above), you get

E2R G1A H4T H4Y H4Z H5A H5B K1A L5S L5T L9E M5K M5L M5W M5X M7A M7Y S4M T6N V7X V7Y
E2R G1A H4T H4Y H4Z H5A H5B K1A L5S L5T     M5K M5L M5W M5X M7A M7Y     T6N V7X V7Y
G0B H0H L5P N8V 

According to Wikipedia, G0B and H0H are not used (HOH is for Santa) but L5P and N8V are, so I don't know what to make of that. L9E and S4M (mentioned in the PDF) are missing from both data sets still.

I found the data! it's https://download.geonames.org/export/zip/

except it doesn't have timezones

Hey, you guys found it! I'm proud of everyone, and apologize for letting this languish for so long. I made @verhovsky a maintainer, so he can roll in some of the great updates he's found.

We just released 0.3.6 with 25 new FSA codes. A CSV file containing the data in a human-readable form is now (once again) part of the git repository, so please submit edits to ca_postalcodes.csv if you see something wrong.

0.4.1 added time zones to the FSA codes added earlier and updates the time zones of some other FSA codes. The data should be up to date now, so I am closing this issue.

If you notice a mistake in the data, don't hesitate to create a pull requests that edits ca_postalcodes.csv.