Not every starter file has a valid link to a csv file.
Continuation of rnckp/starter-code_opendataswiss#2
At one point the starter script creates a line of code like this:
df = get_dataset('https://www.stadt-zuerich.ch/geodaten/download/Baumkataster?format=10008')
but the provided link does not point to a CSV file, so this call fails later in the generated starter script.
I think this could be solved by changing the following function in the generator script:
def has_csv_distribution(dists):
    """Iterate over package resources and keep only CSV entries in list"""
    # <<<<<<<
    # Don't look, whether the dataset claims to be a CSV file, but check,
    # whether the download url (which will be inserted in get_dataset() later)
    # ends with csv:
    #csv_dists = [x for x in dists if x.get("format", "") == "CSV"]
    csv_dists = [x for x in dists if x.get('download_url', '').lower()[-3:] == 'csv']
    # >>>>>>>
    if csv_dists != []:
        return csv_dists
    else:
        return np.nan
However, with this change the generator produces only ~2000 starter files instead of 2700, so a more thorough investigation is necessary.
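One way to start that investigation could be to count how often the declared format and a URL-based check disagree. The sketch below is only an illustration: `url_looks_like_csv` and `compare_csv_checks` are hypothetical helper names that are not part of the generator script, and it assumes each distribution is a dict with `format` and `download_url` keys as in the function above. Unlike the suffix check on the raw string, it parses the URL first, so a query string after the file name does not hide a `.csv` path.

```python
from urllib.parse import urlparse


def url_looks_like_csv(url):
    """Return True if the path part of the URL ends in '.csv'.

    Hypothetical helper: the query string and fragment are ignored,
    and a missing or empty URL counts as "not CSV".
    """
    if not url:
        return False
    return urlparse(url).path.lower().endswith(".csv")


def compare_csv_checks(dists):
    """Count how the declared format and the URL-based check agree.

    Assumes each entry in `dists` is a dict with optional 'format'
    and 'download_url' keys.
    """
    counts = {"both": 0, "format_only": 0, "url_only": 0, "neither": 0}
    for dist in dists:
        by_format = dist.get("format", "") == "CSV"
        by_url = url_looks_like_csv(dist.get("download_url", ""))
        if by_format and by_url:
            counts["both"] += 1
        elif by_format:
            counts["format_only"] += 1
        elif by_url:
            counts["url_only"] += 1
        else:
            counts["neither"] += 1
    return counts
```

Running `compare_csv_checks` over all distributions would show how many entries fall into `format_only`, which is the group that the stricter check drops. Some of those may still deliver CSV data from a URL without a `.csv` suffix, so the count is an upper bound on broken links, not an exact number.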
I can't provide you with a pull request yet, because I had to make some changes locally to get the script to work. One of them was adding encoding='utf-8' to every open() call; otherwise the script crashed on my machine.
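For reference, a minimal sketch of that kind of change. Without an explicit `encoding` argument, `open()` in text mode falls back to the platform's locale encoding, which is often not UTF-8 on Windows, so text containing characters such as umlauts can fail to read or write. The helper names `write_text_utf8` and `read_text_utf8` are hypothetical and not taken from the generator script.

```python
def write_text_utf8(path, text):
    """Write text to a file, always encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def read_text_utf8(path):
    """Read a text file, always decoding it as UTF-8."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()
```

Passing the encoding explicitly makes the script behave the same regardless of the machine's locale settings.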
Thanks for the suggestion.
From my anecdotal usage of the starter code notebooks with a couple of hundred datasets, I assume that the number of resources without a direct link to a CSV resource is very low. In addition, even if I want to work with one of these datasets (like the Baumkataster one that you found), a prepared notebook would still be useful, even if the initial download fails. I would simply follow the link, get the data, adjust the read statement and start working.
Therefore I won't follow your suggestion to change the code. We'd simply lose way too many datasets in the collection just to avoid a minor and easily fixable error with a few datasets.