Convert DP+ to a CKAN extension
jqnatividad opened this issue · 21 comments
DP+ was forked from DP which uses ckanserviceprovider.
ckanserviceprovider was meant to be a general library for CKAN for making web services, and was originally created in 2013.
Right now, DP and DP+ are really the only users of the library, and in the 10 years since, a lot has changed.
DP+ 0.x will continue to use ckanserviceprovider so older CKAN installations using DataPusher can still use it.
However, especially now that CKAN v2.9 uses Python 3.8+ (which DP+ requires in order to call qsv, using subprocess options only available in Python 3.7+), v1.x of DP+ will work as a CKAN extension.
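As context for the Python 3.7+ requirement, here is a minimal sketch of shelling out to qsv from Python. This is a hypothetical wrapper, not the actual DP+ job code; the point is that `capture_output=` and `text=` were only added to `subprocess.run` in Python 3.7:

```python
# Hypothetical sketch of how a job might shell out to qsv; the real
# DP+ code differs. capture_output= and text= require Python 3.7+.
import shutil
import subprocess

def run_qsv(args):
    """Run qsv with the given args; return stdout, or None if qsv is absent."""
    qsv = shutil.which("qsv")
    if qsv is None:
        return None  # qsv binary not on PATH
    result = subprocess.run(
        [qsv, *args],
        capture_output=True,  # Python 3.7+: capture stdout and stderr
        text=True,            # Python 3.7+: decode output as str
        check=True,           # raise CalledProcessError if qsv exits non-zero
    )
    return result.stdout
```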
This has the following benefits:
- easier to maintain as it just works as a regular extension
- no need to maintain a separate web service running in its own virtualenv
- we can drop the ckanserviceprovider dependency
- we can use the ckan_default database to store DP+ job history instead of creating our own DP+ database
- we can add additional UI elements beyond the existing Datastore tab. Here are some examples of what a DP+ extension can do:
  - have the ability to set resource-specific DP+ job settings in the UI
  - have the ability to see how the resource is stored in the Datastore (with the datastore_info API)
  - have a more advanced Data Dictionary interface that goes beyond setting label/description and overriding types
This is already WIP in the https://github.com/dathere/datapusher-plus/tree/dev-v1.0 branch and we hope to make the first release in summer 2023.
DP+ 1.x should also be more interactive - i.e. users have the ability to configure the DP+ job on a per resource basis.
99% of the time, the default configuration should work fine. But when the user wants to tweak the DP+ configuration for a particular job (e.g., ask for PII screening using a particular PII screening configuration, add summary stats, check for dupes, etc.), there should be some DP+ knobs exposed in the Datastore tab to change DP+ settings just for that resource.
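To illustrate the per-resource knobs idea, here is a sketch of merging per-resource overrides over site-wide defaults. All setting names below are invented for illustration; DP+ may expose different knobs:

```python
# Invented setting names for illustration only.
DEFAULT_JOB_SETTINGS = {
    "pii_screening": False,   # opt-in PII screening
    "summary_stats": True,    # compute summary statistics
    "dedup_check": False,     # check for duplicate rows
}

def effective_job_settings(resource_overrides):
    """Per-resource overrides win over the site-wide defaults."""
    settings = dict(DEFAULT_JOB_SETTINGS)
    settings.update(resource_overrides)
    return settings

# e.g. a resource that opts in to PII screening just for this job
print(effective_job_settings({"pii_screening": True}))
```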
Hi @jqnatividad, any idea when it will be ready for release?
What is the current state of this migration? I am considering setting up datapusher-plus, but would like to avoid setting up an external service if future updates will require it to be a ckanext instead. Could I already use the dev-1.0 branch for this, and if so, how?
@jhbruhn dev-1.0 can be used to set up DP+ as an extension, and you are more than welcome to give it a try.
To use it, just check out that branch and follow the same installation steps as for any other extension.
Thanks. I installed the extension in my ckan 2.10 test instance and included it as the "datapusher" plugin (according to line 85 in 1cfba1c). I kept `CKAN_DATAPUSHER_URL` set to the value I was using with the original datapusher, as ckan would not start if it was not defined, but I stopped that service.
But when I upload a file and try to push it to the datastore (it also happened automatically), I get this error in the web interface: "Could not connect to DataPusher.", without a more specific error in the ckan logs.
You should start a job worker:

```
ckan jobs worker
```
I did (now), but the result is still the same. Could the original, ckan-internal datapusher be interfering with datapusher-plus here? Should I open another issue?
If I provide my original datapusher configuration and service next to the activated datapusher-plus configuration, it uses the classic datapusher to push data to my resources. While a task in the worker queue for datapusher-plus seems to be created, it is not used, and nothing happens in my `ckan jobs worker` worker.
@jhbruhn in the plugins it should be added as `ckan.plugins = stats ... datapusher_plus ...`
That leads to this exception at startup:
```
ckan | Traceback (most recent call last):
ckan |   File "/srv/app/wsgi.py", line 20, in <module>
ckan |     application = make_app(config)
ckan |   File "/srv/app/src/ckan/ckan/config/middleware/__init__.py", line 27, in make_app
ckan |     load_environment(conf)
ckan |   File "/srv/app/src/ckan/ckan/config/environment.py", line 69, in load_environment
ckan |     p.load_all()
ckan |   File "/srv/app/src/ckan/ckan/plugins/core.py", line 222, in load_all
ckan |     load(*plugins)
ckan |   File "/srv/app/src/ckan/ckan/plugins/core.py", line 238, in load
ckan |     service = _get_service(plugin)
ckan |   File "/srv/app/src/ckan/ckan/plugins/core.py", line 346, in _get_service
ckan |     raise PluginNotFoundException(plugin_name)
ckan | ckan.plugins.core.PluginNotFoundException: datapusher_plus
```
Perhaps due to this commit? 1cfba1c
The plugin is definitely installed, as some of it is run when including the `datapusher` plugin.
@jhbruhn my bad, I forgot that I had changed that.
Are you using a dockerized environment for your setup?
Yes, this is the ckan Dockerfile I am using, based off of the official image:
```dockerfile
FROM ckan/ckan-base:2.10

### DCAT ###
ARG CKANEXT_DCAT_VERSION="1.5.1"
RUN pip3 install -e git+https://github.com/ckan/ckanext-dcat.git@v${CKANEXT_DCAT_VERSION}#egg=ckanext-dcat && \
    pip3 install -r https://raw.githubusercontent.com/ckan/ckanext-dcat/v${CKANEXT_DCAT_VERSION}/requirements.txt

### LDAP with ignore referrals patch ###
ARG CKANEXT_LDAP_VERSION="add_ignore_referrals"
RUN apk add --no-cache \
    openldap-dev
RUN pip3 install -e git+https://github.com/jhbruhn/ckanext-ldap.git@${CKANEXT_LDAP_VERSION}#egg=ckanext-ldap

### PDFView
ARG CKANEXT_PDFVIEW_VERSION="0.0.8"
RUN pip3 install -e git+https://github.com/ckan/ckanext-pdfview.git@${CKANEXT_PDFVIEW_VERSION}#egg=ckanext-pdfview

### Geoview
ARG CKANEXT_GEOVIEW_VERSION="0.1.0"
RUN pip3 install ckanext-geoview==${CKANEXT_GEOVIEW_VERSION}

### Spatial
ARG CKANEXT_SPATIAL_VERSION="2.1.1"
RUN apk add --no-cache \
    proj-util \
    proj-dev \
    geos-dev
RUN pip3 install -e git+https://github.com/ckan/ckanext-spatial.git@v${CKANEXT_SPATIAL_VERSION}#egg=ckanext-spatial && \
    pip3 install -r https://raw.githubusercontent.com/ckan/ckanext-spatial/v${CKANEXT_SPATIAL_VERSION}/requirements.txt

### Hierarchy
ARG CKANEXT_HIERARCHY_VERSION="1.2.1"
RUN pip3 install -e git+https://github.com/ckan/ckanext-hierarchy.git@v${CKANEXT_HIERARCHY_VERSION}#egg=ckanext-hierarchy && \
    pip3 install -r https://raw.githubusercontent.com/ckan/ckanext-hierarchy/v${CKANEXT_HIERARCHY_VERSION}/requirements.txt

### Pages
ARG CKANEXT_PAGES_VERSION="0.5.2"
RUN pip3 install -e git+https://github.com/ckan/ckanext-pages.git@v${CKANEXT_PAGES_VERSION}#egg=ckanext-pages && \
    pip3 install -r https://raw.githubusercontent.com/ckan/ckanext-pages/v${CKANEXT_PAGES_VERSION}/requirements.txt

### Datapusher Plus
ARG CKANEXT_DATAPUSHER_PLUS_VERSION="1cfba1c1373b6890f7338df1fa0f46a28297f0ef"
RUN pip3 install -e git+https://github.com/dathere/datapusher-plus.git@${CKANEXT_DATAPUSHER_PLUS_VERSION}#egg=datapusher-plus && \
    pip3 install -r https://raw.githubusercontent.com/dathere/datapusher-plus/${CKANEXT_DATAPUSHER_PLUS_VERSION}/requirements.txt

# Activate plugins
ENV CKAN__PLUGINS "stats text_view image_view datatables_view video_view audio_view ldap pdf_view dcat dcat_json_interface structured_data envvars datapusher_plus datastore resource_proxy geo_view geojson_view shp_view wmts_view spatial_metadata spatial_query hierarchy_display hierarchy_form hierarchy_group_form pages"
```
qsv is obviously still missing from that Dockerfile, but it doesn't even get far enough to call datapusher_plus; it is only calling the classic datapusher functions, as far as I can see.
Thanks for the input, I will try to locate what the issue is with the dockerized setup. Starting DP+ from a source install was working, and it is being tested on one of our development environments. I hope I can fix this soon and start using DP+ as an extension more often.
I've just looked into this again, and it seems that I am getting further than before, yay!
I now get this error while executing a datapusher-plus job:
```
[SQL: INSERT INTO logs (job_id, timestamp, message, level, module, "funcName", lineno) VALUES (%(job_id)s, %(timestamp)s, %(message)s, %(level)s, %(module)s, %(funcName)s, %(lineno)s) RETURNING logs.id]
[parameters: {'job_id': '9de313e1-1969-45af-a07d-67031ff65fd0', 'timestamp': datetime.datetime(2024, 3, 28, 10, 51, 29, 238720), 'message': 'Fetching from: http://localhost:5000/dataset/a68c293b-d3ea-4318-9bac-c122ff4552b9/resource/1900fa9d-baf6-442a-9e58-d86e4e7d873c/download/kanton_zuerich_266.csv...', 'level': 'INFO', 'module': 'jobs', 'funcName': '_push_to_datastore', 'lineno': 441}]
(Background on this error at: https://sqlalche.me/e/14/f405) (Background on this error at: https://sqlalche.me/e/14/7s2a)
Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1900, in _execute_context
    self.dialect.do_execute(
  File "/usr/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.UndefinedColumn: column logs.id does not exist
LINE 1: ...'INFO', 'jobs', '_push_to_datastore', 441) RETURNING logs.id
                                                                ^
```
I have run the `ckan datapusher init-db` command, so the migrations should have been run. This is running on a (testing) original datapusher/datastore database. Should I move this to a new issue, or is it expected behaviour for DP+ dev-1.0 right now?
@jhbruhn thanks for testing DP+.
Yes, there is an issue with the migration script; I think I have a fix for it. Currently it is on the `fixing-issues` branch but not ready for merging into `dev-v1.0` yet.
Thanks for your reply; that branch is indeed working quite well in my limited tests.
@jqnatividad is there any roadmap for v1.0 of datapusher-plus? I'm looking forward to starting to use it, but currently the fact that it is based on ckan-serviceprovider is a limitation, even more so knowing that it will soon be a CKAN extension.
I have been testing the `dev-v1.0` (and `fixing-issues`) branches and for normal workflows I didn't find any major issues.
Please let me know if there is anything else we can help with!
@pdelboca , it's really on me. DP+ 0.x stalled a bit because the main implementation that was informing its build was paused for a few months.
We started up again a month ago and we're currently writing specs for that implementation (we actually just submitted the "DP+ - Data Resource Upload First (DRUF) edition" specs for customer review today).
The good thing is that @tino097 has been working diligently on DP+ 1.0 regardless. Given that ckanserviceprovider has been deprecated, I don't intend to keep working on 0.x.
I'll generalize the spec and publish it here as it will functionally serve as the DP+ 1.0 roadmap.
We are breaking down DP+ DRUF development into three phases.
Phase 1 will be primarily adapted to the customer's requirements and is targeted for a 2024 summer release.
Phase 2 is still being scoped out, but its focus will be: applying DRUF metadata inferencing to spatial data formats; specifying formulas for computed metadata fields in the scheming YAML file; and using the IDataDictionaryForm interface to implement an expanded data dictionary that stores summary statistics and frequency data for each field.
Phase 3 is even more aspirational, but it's looking to do "smarter analysis" using machine learning techniques (perhaps leveraging @amercader's work on related datasets, and maybe using qsv `describegpt` to do tag suggestions).
I'm also thinking of using qsv's `schema` and `validate` commands.
`schema` automatically creates a JSON Schema validation file that can be used to enforce field data types and domain/range values when updating a resource. For each column in a CSV, it will enforce the following:
- data type
- enum: the range of valid literal values accepted for the field (if the cardinality of the column is less than the default threshold of 50 unique values, it will create an enumeration for that field)
- pattern: for selected fields, using the `qsv schema --pattern-columns` option, it can infer a regex pattern constraint based on the field's frequency table
- minLength
- maxLength
- min
- max
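As a hand-written illustration of the constraints listed above (the actual output of `qsv schema` will differ), here is a per-column schema fragment plus a minimal check, expressed in Python:

```python
# Hand-written sketch of the kind of per-column constraints described
# above; this is NOT actual `qsv schema` output.
state_schema = {
    "type": "string",
    "enum": ["NY", "NJ", "CT"],  # low-cardinality column -> enum constraint
    "minLength": 2,
    "maxLength": 2,
}

def check_value(value, schema):
    """Minimal check of the type/enum/minLength/maxLength constraints."""
    if not isinstance(value, str):
        return False
    if "enum" in schema and value not in schema["enum"]:
        return False
    if len(value) < schema.get("minLength", 0):
        return False
    if len(value) > schema.get("maxLength", len(value)):
        return False
    return True

print(check_value("NY", state_schema))  # True: a valid enum member
print(check_value("XX", state_schema))  # False: not in the enum
```

In practice a full JSON Schema validator would do this, but the sketch shows what each keyword constrains.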
The generated JSON Schema file can be further fine-tuned by the Data Steward to apply fairly complex validation rules if required, as the crate I'm using fully implements JSON Schema validation and is widely deployed and used in production systems in the Rust ecosystem.
DP+ already uses the qsv `validate` command to ensure that CSVs are well-formed, conform to the RFC 4180 standard, and are UTF-8 encoded. In Phase 3, I want to use `validate`'s ability to take a JSON Schema generated by `qsv schema` and check that the CONTENTS of the CSV conform to the schema definition.
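The schema-then-validate flow could look roughly like this. This is a sketch under assumptions: the schema output file name and the exact qsv invocation are guesses, so check `qsv schema --help` before relying on it:

```python
# Sketch of the schema-then-validate pipeline; the schema output file
# name below is an ASSUMPTION, not documented behavior.
import shutil
import subprocess

def schema_then_validate(csv_path):
    """Generate a JSON Schema with `qsv schema`, then enforce it with
    `qsv validate`. Returns True/False, or None if qsv is not installed."""
    if shutil.which("qsv") is None:
        return None  # qsv not on PATH; nothing to demonstrate
    gen = subprocess.run(["qsv", "schema", csv_path],
                         capture_output=True, text=True)
    if gen.returncode != 0:
        return False
    # Assumption: `qsv schema` writes the schema next to the input file.
    schema_path = csv_path + ".schema.json"
    chk = subprocess.run(["qsv", "validate", csv_path, schema_path],
                         capture_output=True, text=True)
    return chk.returncode == 0
```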
And it's quite performant: in benchmarks using a 1-million-row sample of NYC's 311 data with a non-trivial JSON Schema, it validates the CSV in 1.3 seconds!
And yeah, we would love to take your help! Perhaps, we can carve out some time at CSVConf next month?
That's an exciting roadmap @jqnatividad ! Thanks for sharing and let's definitely catchup at the CSVConf :)
As a suggestion/thought: in terms of releases, it would be nice to split architectural changes and features into separate releases. It will be easier to develop, easier to deploy, easier to maintain, and will give the community time to adapt. Mixing behavioral changes with architectural ones can be messy because new errors can come from two fronts :). In that sense, an initial release of DP+ v1.0 as a CKAN extension would be great.
Since DP+ has become a CKAN extension and we no longer need the built-in `datapusher` plugin, I think we should include the assets from the `datapusher` plugin to prevent the following import errors:
```
2024-07-02 08:49:09,763 ERROR [ckan.lib.webassets_tools] Trying to include unknown asset:
<ckanext-datapusher/datapusher-css>
2024-07-02 08:49:09,823 ERROR [ckan.lib.webassets_tools] Trying to include unknown asset:
<ckanext-datapusher/datapusher>
```