fatiando/pooch

sha256 inconsistency between local hash and github actions

trhallam opened this issue · 6 comments

The sha256 hash computed on GitHub Actions did not match my local hash:

Local: Ubuntu 20.04 on WSL, hash created using openssl
GitHub Actions: Python 3.9, ubuntu-latest

Data is being sourced from GitHub raw (raw.githubusercontent.com). The GitHub Actions runner appears to compute a different sha256 hash from the one I get locally. Switching to md5 fixed this.

import os
import pooch
from . import __version__

GOODBOY = pooch.create(
    path=os.curdir,
    base_url="https://raw.githubusercontent.com/trhallam/digirock/main/tests/test_data/",
    version=__version__,
    # If this is a development version, get the data from the main branch
    version_dev="main",
    # The registry specifies the files that can be fetched from the local storage
    registry={
        "COMPLEX_PVT.inc": "3018c7ec33dded551e0bcd44103a1abd27ff4895268c712197616e396532da25",
        "PVT_BO.inc": "053669c122948b690b03bcd2e5d11bdbc377bf84cddcd0d614ee19ec22ca36b6",
        "PVT_RS.inc": "ff869731b2ece69fa0686b6a0204f113a0106e359413ddf1547841cbdf3d219d",
    },
)

Data is located here: https://github.com/trhallam/digirock/tree/main/tests/test_data

Creates error:
SHA256 hash of downloaded file (COMPLEX_PVT.inc) does not match the known hash: expected 3018c7ec33dded551e0bcd44103a1abd27ff4895268c712197616e396532da25 but got 2bb908ad754ac1939a6d3cc34e3997c0c120424f27a028dcdbce288e338fe00e. Deleted download for safety. The downloaded file may have been corrupted or the known hash may be outdated.

@trhallam do you have a link to the Actions job that fails? That could help us figure out what's going on.

I ran openssl here and get the same SHA256 as you. Strange that MD5 works for this when SHA doesn't.
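For anyone wanting to repeat this check without openssl, here is a minimal sketch using only Python's standard `hashlib`. The helper name `file_digest` and the local path are made up for illustration; Pooch also provides `pooch.file_hash` for the same purpose.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks.

    Hypothetical helper for illustration; not part of Pooch.
    """
    hasher = hashlib.new(algorithm)
    # Open in binary mode so no newline translation happens on read.
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


if __name__ == "__main__":
    # Assumed location of a local copy; adjust to wherever the file lives.
    local_copy = Path("tests/test_data/COMPLEX_PVT.inc")
    if local_copy.exists():
        print("sha256:", file_digest(local_copy, "sha256"))
        print("md5:   ", file_digest(local_copy, "md5"))
```

Running this both locally and in the CI job, on the file as it was actually downloaded, should show whether the two environments really see the same bytes.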

One quick comment on the code above: You probably want to use {version} in the URL instead of main so that releases get data from the tag instead of the main branch. Otherwise, changes to the data on main can break previous releases.

Hi @leouieda, I think this was the workflow -> https://github.com/trhallam/digirock/runs/5357449619?check_suite_focus=true

It is all green because the error is handled inside a notebook example during the build.

I don't have a release yet, hence things are not tied to a version but thanks for the tip, I'll try to update that when I create the first release.

Thanks, I'll have a look at the log to see if I can spot anything.

> I don't have a release yet, hence things are not tied to a version but thanks for the tip, I'll try to update that when I create the first release.

You can do that now: pre-release version numbers from setuptools-scm are development releases, so {version} will default to "main" (the version_dev value) for them.
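To make the {version} suggestion concrete, here is a rough standard-library sketch of how such a URL template could resolve. It is a simplified stand-in, not Pooch's actual code. The helper names, the example version strings, and the rule "a `+` in the version marks a development version" are all assumptions made for illustration. The template joins the `raw/{version}/` URL form with the `tests/test_data/` path from the original snippet.

```python
def resolve_version(version, version_dev="main"):
    """Fall back to version_dev for development versions.

    Assumption: a development version carries a local segment, i.e. a "+"
    in the string, as in the hypothetical "0.1.dev3+g1a2b3c4".
    """
    return version_dev if "+" in version else version


def build_base_url(template, version, version_dev="main"):
    """Fill the {version} placeholder in a base URL template."""
    return template.format(version=resolve_version(version, version_dev))


# Assumed template for illustration only.
TEMPLATE = "https://github.com/trhallam/digirock/raw/{version}/tests/test_data/"

# A development version resolves to the main branch...
print(build_base_url(TEMPLATE, "0.1.dev3+g1a2b3c4"))
# ...while a (hypothetical) tagged release resolves to its own tag.
print(build_base_url(TEMPLATE, "v0.1.0"))
```

With this kind of scheme, a tagged release keeps pointing at the data as it existed at that tag, even if the files on main change later.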

@trhallam I had a look at the log and tried to see if I could somehow reproduce this behaviour but so far I got nothing. I thought it could be GitHub caching an old version of the data for some reason but then the MD5 would fail as well.

Have you tried using the https://github.com/trhallam/digirock/raw/main/ URL instead?

Looks like switching to this URL: https://github.com/trhallam/digirock/raw/{version}/ and adjusting my versioning scheme worked.

Perhaps the raw.githubusercontent.com URL makes some small changes to the file's bytes.
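One way to test that guess is to download the same file from both URLs and compare the bytes directly. The sketch below covers only the comparison step and uses a made-up helper name, `compare_payloads`. Fetching the two payloads is left out. The LF versus CRLF inputs in the usage lines are purely illustrative and are not a diagnosis of what happened here.

```python
import hashlib


def compare_payloads(first, second):
    """Summarise how two byte strings differ (hypothetical helper)."""
    limit = min(len(first), len(second))
    # Offset of the first differing byte within the shared length, if any.
    offset = next((i for i in range(limit) if first[i] != second[i]), None)
    if offset is None and len(first) != len(second):
        offset = limit  # one payload is a strict prefix of the other
    return {
        "same": first == second,
        "sizes": (len(first), len(second)),
        "first_difference": offset,
        "sha256": (
            hashlib.sha256(first).hexdigest(),
            hashlib.sha256(second).hexdigest(),
        ),
    }


# Illustrative inputs only: identical text with different line endings.
report = compare_payloads(b"a\nb\n", b"a\r\nb\r\n")
print(report["same"], report["sizes"], report["first_difference"])
```

If the two downloads differ, the sizes and the first differing offset usually hint at the kind of change involved, such as line endings, a trailing newline, or stale content.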

Thanks for your help.

Glad it worked! I have noticed that the raw URL does some caching at times when serving content, so changing things in the repo doesn't immediately update what it serves.