rnckp/starter-code-opendataswiss-gh

Not every starter file has a valid link to a CSV file.

Closed this issue · 1 comment

Continuation of rnckp/starter-code_opendataswiss#2

At one point the generated starter script contains a line of code like this:
df = get_dataset('https://www.stadt-zuerich.ch/geodaten/download/Baumkataster?format=10008')
but the provided link does not point to a CSV file, so this line later fails when the generated starter script is run.
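
For context, the failure looks roughly like this (a minimal sketch; I'm assuming get_dataset() essentially wraps pandas.read_csv):

import pandas as pd

def get_dataset(url):
    # Hypothetical stand-in for the helper in the generated notebooks,
    # assumed here to simply delegate to pandas:
    return pd.read_csv(url)

# The Baumkataster link serves a geodata bundle, not CSV text, so parsing
# it as CSV fails (e.g. with a ParserError or UnicodeDecodeError):
df = get_dataset('https://www.stadt-zuerich.ch/geodaten/download/Baumkataster?format=10008')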

I think this could be solved by changing the following function in the generator script:

import numpy as np

def has_csv_distribution(dists):
    """Iterate over package resources and keep only CSV entries in a list."""

    # <<<<<<<
    # Don't check whether the dataset claims to be a CSV file; instead check
    # whether the download URL (which will be inserted into get_dataset()
    # later) actually ends with 'csv':

    #csv_dists = [x for x in dists if x.get("format", "") == "CSV"]
    csv_dists = [x for x in dists if x.get('download_url', '').lower().endswith('csv')]
    # >>>>>>>

    if csv_dists:
        return csv_dists
    return np.nan
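
For illustration, this is how the two checks differ on the Baumkataster example (the metadata fields are my assumption, reconstructed from the URL above):

dists = [{
    'format': 'CSV',
    'download_url': 'https://www.stadt-zuerich.ch/geodaten/download/Baumkataster?format=10008',
}]

# The format-based check keeps the entry even though the link is not CSV ...
print([x for x in dists if x.get('format', '') == 'CSV'])    # 1 match
# ... while the URL-based check drops it:
print([x for x in dists if x.get('download_url', '').lower().endswith('csv')])    # 0 matches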

However, this check produces only ~2000 instead of ~2700 starter files, so a thorough investigation would be necessary.
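
One possible direction for that investigation (a sketch, not verified against the catalogue): some download URLs may carry a query string even though their path ends in .csv, and the raw suffix check above discards those. Parsing the URL first would keep them:

from urllib.parse import urlparse

def url_points_to_csv(url):
    """Check the URL path (query string stripped) for a .csv suffix."""
    return urlparse(url).path.lower().endswith('.csv')

print(url_points_to_csv('https://example.org/data/file.csv?download=true'))    # True
print(url_points_to_csv('https://www.stadt-zuerich.ch/geodaten/download/Baumkataster?format=10008'))    # False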

I can't provide you with a pull request yet, because I had to make some changes locally to get the script to work at all. One of them was adding encoding='utf-8' to every open() call; without it, the script crashed on my machine.
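
For illustration, the kind of change meant here (the path is hypothetical):

template_path = 'templates/starter.ipynb'    # hypothetical path

# open() without an explicit encoding uses the platform default (e.g.
# cp1252 on Windows), which crashes on non-ASCII characters in the files;
# forcing UTF-8 avoids that:
with open(template_path, encoding='utf-8') as f:
    template = f.read()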

rnckp commented

Thanks for the suggestion.

From my anecdotal usage of the starter code notebooks with a couple of hundred datasets, I assume that the number of resources without a direct link to a CSV file is very low. In addition, even if I want to work with one of these datasets (like the Baumkataster one that you found), a prepared notebook would still be useful, even if the initial download fails. I would simply follow the link, get the data, adjust the read statement, and start working.
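
For example (a sketch, assuming the notebook reads the data with pandas, and a hypothetical local file name), the adjusted read statement could be as simple as:

import pandas as pd

# After downloading and extracting the data manually from the linked page:
df = pd.read_csv('baumkataster.csv')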

Therefore I won't follow your suggestion to change the code here. We'd simply lose far too many datasets in the collection just to avoid a minor and easily fixable error in a few datasets.