timrdf/csv2rdf4lod-automation

Rewrite punzip.sh


punzip.sh was written as a proof of concept and should be replaced by a Python implementation that uses a proper zip API to

  • determine the files within the zip,
  • perform their extraction, and
  • encode the provenance of their extraction.
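
As a rough illustration of those three steps, here is a minimal sketch built on Python's standard `zipfile` module. The function name `extract_with_provenance` and the fields of each record are hypothetical, chosen only for this example; they are not the PML 2 or PROV-O encoding, which still has to be designed as part of the task.

```python
import hashlib
import zipfile
from datetime import datetime, timezone
from pathlib import Path


def extract_with_provenance(zip_path, dest_dir):
    """Extract every file in zip_path into dest_dir and return one record per file.

    The record fields are illustrative assumptions, not an existing format.
    """
    zip_path, dest_dir = Path(zip_path), Path(dest_dir)
    records = []
    with zipfile.ZipFile(zip_path) as archive:
        # Step 1: determine the files within the zip.
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no content of their own
            # Step 2: perform the extraction (returns the path that was written).
            extracted = Path(archive.extract(info, path=dest_dir))
            # Step 3: record what was extracted, from where, and when.
            records.append({
                "source_zip": str(zip_path),
                "member": info.filename,
                "extracted_to": str(extracted),
                "extracted_at": datetime.now(timezone.utc).isoformat(),
                "sha256": hashlib.sha256(extracted.read_bytes()).hexdigest(),
            })
    return records
```

A serializer for PML 2 or PROV-O could then consume these records, which keeps the extraction logic independent of the output vocabulary.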

The task includes:

  • Fork this repo to your GitHub account and clone your fork to your machine.
  • Review the current behavior of punzip.sh, running it on a few zip files to see what it does.
  • [Re]write any documentation on punzip.sh that is currently inadequate. Include how it relates to pcurl.sh during retrieval and to the invocation of the converter during conversion. List desired features that it currently does not provide. Use a public wiki on GitHub.
  • Query the existing PML produced by deployed punzip.sh (LOGD, healthdata, orgpedia) to find test cases and to verify that the next implementation works on those as well.
  • Write a Python version of the current punzip.sh, including proper --help usage and input error checking.
    • include flags --as-pml and --as-prov-o to output PML 2 and PROV-O, respectively. (This will require designing the PROV-O model that is output - and should be inspired by the current PML 2 encoding).
  • Commit to your clone and push to your GitHub repo at regular milestones in development (e.g. "stubbed in", "responding to --help", "added PML 2", "added PROV-O", "fixed bug X", "added comments", etc.).
    • Find an appropriate Python zip library and use it as part of the implementation. Document the dependency.
  • Verify that your implementation is useful by having a peer use it without your guidance (other than pointers to its documentation).
  • Submit a pull request back to this repository when punzip.py is complete.
  • Ask questions as needed. If something is holding you back, say so.
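
For the --help usage and input error checking mentioned above, one possible starting point is the standard `argparse` module. This is only a sketch: the flags --as-pml and --as-prov-o come from this issue, but the positional `ZIP` argument, the help strings, and the function names `build_parser` and `main` are assumptions made for illustration, and the current punzip.sh interface may differ.

```python
import argparse
import zipfile


def build_parser():
    """Build the command-line parser (argument layout is an assumption)."""
    parser = argparse.ArgumentParser(
        prog="punzip.py",
        description="Extract zip files and record the provenance of the extraction.",
    )
    parser.add_argument("zips", nargs="+", metavar="ZIP",
                        help="zip file(s) to extract")
    parser.add_argument("--as-pml", action="store_true",
                        help="output provenance as PML 2")
    parser.add_argument("--as-prov-o", action="store_true",
                        help="output provenance as PROV-O")
    return parser


def main(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Input error checking: is_zipfile returns False for paths that are
    # missing, unreadable, or not zip archives.
    bad = [path for path in args.zips if not zipfile.is_zipfile(path)]
    if bad:
        parser.error("not a readable zip file: " + ", ".join(bad))
    # Extraction and PML 2 / PROV-O serialization would go here.
    return 0
```

`parser.error` prints the usage message and exits with status 2, so a bad argument fails before any extraction starts. A real script would call `main()` under the usual `if __name__ == "__main__":` guard.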