Index page has repeating entries
byco opened this issue · 1 comments
Hey! This is an interesting and useful project. I noticed there are some potential bugs with the index page. There are repeating entries for some place entities. Puerto Rico and British Columbia both show up twice, for example.
I'm guessing these entries show up twice either because the grouping is case sensitive (e.g. BRITISH COLUMBIA vs. British Columbia) or because abbreviations are used in some filings but not in others (e.g. Puerto Rico vs. PR).
There are also some odd entries that are obviously not U.S. states, e.g. Foreign, Beijing, England, Ontario, etc. I would suggest listing the U.S. states first, then the odd entries, since I think most users would look for U.S. states first.
Thanks for filing this and completely agree this page is long overdue for some love!
The page was initially created solely for SEO purposes and gets virtually no traffic thanks to the search feature. Plus, the location of the foundation's HQ isn't all that relevant to fundraisers...it's the location of a foundation's grantees that is most relevant. Hence, this page is fairly low priority.
That said, your ideas of including US states first and everything else second make sense. Anything to reduce the size of the page would be useful, both for SEO purposes and for the odd visitor that stumbles upon it. PRs are always welcome!
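For anyone picking this up, here is a minimal Python sketch of the "US states first, everything else second" ordering. It is not the project's actual code: `US_STATES` is a hypothetical, abbreviated stand-in for the real state/territory list, and `order_index_entries` is an illustrative name.

```python
# Hypothetical, abbreviated list; a real one would cover every state and territory.
US_STATES = {"California", "New York", "Puerto Rico", "Texas"}


def order_index_entries(entries):
    """Return US states/territories first, then everything else, each group A-Z."""
    # Tuples sort element by element, and False sorts before True,
    # so names found in US_STATES come ahead of all other names.
    return sorted(entries, key=lambda name: (name not in US_STATES, name.casefold()))


entries = ["Ontario", "Texas", "Beijing", "California", "Foreign", "Puerto Rico"]
ordered = order_index_entries(entries)
print(ordered)
```

Sorting on a single composite key keeps the two groups in one pass and leaves the input list untouched.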
The duplicates most likely come from one of three places:
- Upstream: The data is taken directly from the public IRS dataset. The IRS does not enforce any schema, so cities and states are often entered incorrectly, especially when it comes to foreign addresses.
- Conversion: I do some very minor cleaning of the data (one of our stated goals is to present the tax filings as is). The city/state conversion parts can be found here.
- Loop: We then loop through the entire dataset to group all the filings by state, referencing a US state list that includes US territories. Relevant code is here.
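If the duplicates do come from case differences or abbreviations, one possible fix is to normalize the state value only when building the grouping key, so the filings themselves stay as is. The Python sketch below is an illustration under that assumption, not the project's code: `ABBREVIATIONS`, `normalize_state`, `group_by_state`, and the sample filings are all hypothetical.

```python
from collections import defaultdict

# Hypothetical, abbreviated lookup from abbreviation to canonical name.
ABBREVIATIONS = {"PR": "Puerto Rico", "BC": "British Columbia", "CA": "California"}
# Case-insensitive lookup of the canonical names themselves.
CANONICAL = {name.casefold(): name for name in ABBREVIATIONS.values()}


def normalize_state(raw):
    """Map a raw state string to one canonical spelling where it is known."""
    cleaned = " ".join(raw.split())  # trim and collapse stray whitespace
    if cleaned.upper() in ABBREVIATIONS:
        return ABBREVIATIONS[cleaned.upper()]
    # Unknown values (e.g. "Beijing") pass through unchanged.
    return CANONICAL.get(cleaned.casefold(), cleaned)


def group_by_state(filings):
    """Group filings under the normalized key without modifying the filings."""
    groups = defaultdict(list)
    for filing in filings:
        groups[normalize_state(filing["state"])].append(filing)
    return dict(groups)


# Made-up sample records for illustration only.
filings = [
    {"name": "Foundation A", "state": "BRITISH COLUMBIA"},
    {"name": "Foundation B", "state": "British Columbia"},
    {"name": "Foundation C", "state": "PR"},
    {"name": "Foundation D", "state": "Puerto Rico "},
]
groups = group_by_state(filings)
print(sorted(groups))
```

Because only the key is normalized, each filing keeps its original `state` value, which fits the goal of presenting the tax filings as filed.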