Redeploy loci-cache to include gnaf1605 and updated geofabric rr>cc relations
The team has been developing a scripted way to harvest the gnaf1605 and geofabric datasets and to deploy the data to a new cache. The main task, however, is to redeploy loci-cache to include gnaf1605 and the updated geofabric relationships.
Some pre-requisite steps before redeploying cache:
- Review this PR: CSIRO-enviro-informatics/geofabric-dataset#22
- Review changes in this branch: https://github.com/CSIRO-enviro-informatics/loci-cache-scripts/tree/shane/feature/gnaf_harvest (which includes the gnaf1605 and geofabric harvest)
Run cache load scripts in the loci-cache-scripts repo.
Shane hit an issue with the loci-cache-scripts pip install while running a script in Docker from that repo (not the geofabric-dataset Docker process); see the comments below. Suggest adding a pre-built lxml package to the Dockerfile at docker build time in https://github.com/CSIRO-enviro-informatics/loci-cache-scripts/blob/shane/feature/gnaf_harvest/docker/harvest/geofabric/Dockerfile
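One way to act on the pre-built lxml suggestion is to tell pip to refuse source builds, so the install either picks up a binary wheel or fails fast instead of hanging on a compile. This is a minimal sketch, not the repo's actual Dockerfile content: the function name `install_lxml_prebuilt` and the `DRY_RUN` switch are made up for illustration, and it assumes a binary lxml wheel exists for the image's platform.

```shell
# Hypothetical helper: install lxml from a pre-built wheel only.
# --only-binary=:all: makes pip error out rather than compile from source.
# Set DRY_RUN=1 to print the command instead of running it.
install_lxml_prebuilt() {
  ${DRY_RUN:+echo} pip install --only-binary=:all: lxml
}
```

In a Dockerfile the same command would sit on a `RUN` line ahead of the main requirements install, so the layer is cached between builds.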
Shane was working on this but is hitting some blockers.
Excerpt from the discussion:
"After producing the dataset, it needed to be uploaded to s3://loci-assets/auto-generated/linksets/gnaf_1605.trig.gz. The name matters for the cache builder.
Then the geofabric will have to be harvested, and it needs to go to `s3://loci-assets/auto-generated/linksets/geofabric.trig.gz`; again, the name matters."
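Since the cache builder depends on the exact object names, it may help to build the destination from a fixed prefix rather than typing it each time. A rough sketch, assuming the AWS CLI is installed and configured; the function names, the `DRY_RUN` switch, and the local file path are hypothetical, while the S3 prefix and file names come from the excerpt above.

```shell
# Prefix quoted in the discussion; the cache builder expects exact names under it.
LINKSET_PREFIX="s3://loci-assets/auto-generated/linksets"

# Print the full S3 URI for a given object name.
linkset_uri() {
  printf '%s/%s\n' "$LINKSET_PREFIX" "$1"
}

# $1 = local file, $2 = required remote name (e.g. gnaf_1605.trig.gz).
# Set DRY_RUN=1 to print the aws command instead of running it.
upload_linkset() {
  ${DRY_RUN:+echo} aws s3 cp "$1" "$(linkset_uri "$2")"
}
```

For example, `DRY_RUN=1 upload_linkset ./out.trig.gz geofabric.trig.gz` prints the command that would be run, which is a cheap way to double-check the name before uploading.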
"After this, log in to the EC2 machine named loci-cache-latest
...
Then from the directory ~loci-cache-scripts/docker/cache/
run ./startup.sh
Which should build the cache."
"Branch that was being developed: shane/feature/split_cache_build_and_run
... The command to restart is now ./startup.sh --rebuild, so please use that instead. Otherwise ./startup.sh will simply start the graphDB."
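The two modes described in these excerpts (plain start versus rebuild) could be wrapped as below. This is only a sketch: `run_cache` and `CACHE_DIR` are illustrative names, and the default directory assumes the checkout lives under the home directory on the EC2 machine, which the thread does not confirm.

```shell
# Assumed location of the cache scripts; override CACHE_DIR if the checkout is elsewhere.
CACHE_DIR="${CACHE_DIR:-$HOME/loci-cache-scripts/docker/cache}"

# run_cache start    -> ./startup.sh            (just starts the graph database)
# run_cache rebuild  -> ./startup.sh --rebuild  (rebuilds the cache)
run_cache() {
  mode="${1:-start}"
  case "$mode" in
    rebuild) (cd "$CACHE_DIR" && ./startup.sh --rebuild) ;;
    start)   (cd "$CACHE_DIR" && ./startup.sh) ;;
    *)       echo "usage: run_cache [start|rebuild]" >&2; return 2 ;;
  esac
}
```

Running the script inside a subshell keeps the caller's working directory unchanged, and an unknown mode returns a non-zero status rather than silently starting the database.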
"Looks like I was able to get the Geofabric Harvest almost working; it is hanging on the wheel build of lxml for now. No idea why. The code for harvesting is in the loci-cache-scripts repo, currently in the shane/feature/gnaf_harvest branch. I didn't get the gnaf working due to an inaccessible database, but I threw in the geofabric stuff and it works. So there will be a pull request for that branch ready when I get back. Needless to say, if you want to run the geofabric harvest you will need to switch to the appropriate branch. Maybe you can work out the wheel issue; other than that I had it working.
In order to get the automated harvester for geofab to work I had to make minor changes to the geofabric-dataset repo. I have made a pull request to get them into master, so someone will have to review and merge it for the harvester to run (unless you specify the branch like I did, it will want to run from master). If you do run it, building the lxml library takes an eternity (10+ mins). Need to fix that.
I made similar changes to gnaf-dataset but given things aren't working yet, no pull-request there."
I think I have a working Geofabric Harvest via these 2 PRs:
CSIRO-enviro-informatics/loci-cache-scripts#30
CSIRO-enviro-informatics/geofabric-dataset#23
It even output a file to https://s3-ap-southeast-2.amazonaws.com/loci-assets/auto-generated/datasets/geofabric2.trig.gz
The issue is the geofabricld.net geoserver keeps falling over after harvest, so we need to do a power cycle on that machine each time we want to reharvest.
Those PRs are good to review I think.
We don't have a GNAF harvest script working yet... that's the next thing to work on for this issue.
The latest cache was deployed, with gnaf1605 and the updated geofabric datasets included. Tests are passing on the majority of the items being tested. There was a discrepancy of 1 feature in asgs, which we'll investigate.
Closing this issue as the rr>cc relations are now updated as expected and gnaf1605 is included.