dataportals-registry

A registry of data portals, data catalogs, data repositories, and similar resources.

This is a transitional repository for building a registry of all existing open data portals and repositories.

This is the first pillar of the open search engine project. The pillars are:

  • registry of all catalogs (this one)
  • datasets raw metadata database
  • unified dataset search index and search engine
  • datasets backup and file cache

Please take a look at the project mindmap to see its goals and structure.

What kinds of data catalogs are collected?

This registry includes descriptions of the following types of data catalogs:

  • Open data portals
  • Geoportals
  • Scientific data repositories
  • Indicators catalogs
  • Microdata catalogs
  • Machine learning catalogs
  • Data search engines
  • API Catalogs
  • Data marketplaces
  • Other

Inspiration

This project is inspired by the re3data and FAIRsharing projects. The key difference is the focus on open data as a broad topic, not just open research data.

The final version of this repository will be reorganized as a database with a publicly available open API and bulk data dumps.

How this repository is organized

Warning: this description is temporary and subject to change.

Entities

Data catalog descriptions are YAML files in the data/entities folder. Files are grouped into country/territory folders, and each country folder contains subfolders by catalog type, such as scientific, opendata, microdata, geo, search, marketplace, and other.
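As a rough illustration of this layout, the sketch below groups YAML file paths by country folder and catalog-type folder using only the standard library. The function name `group_entity_files` is hypothetical, and the `.yaml` extension and the exact nesting depth are assumptions rather than guarantees about the repository.

```python
from collections import defaultdict
from pathlib import Path


def group_entity_files(root):
    """Group YAML file paths by (country folder, category folder).

    Assumes the layout <root>/<country>/<category>/.../<file>.yaml;
    files that sit at a shallower depth are skipped.
    """
    root = Path(root)
    groups = defaultdict(list)
    for path in sorted(root.rglob("*.yaml")):
        parts = path.relative_to(root).parts
        if len(parts) < 3:
            continue  # not inside a country/category pair
        country, category = parts[0], parts[1]
        groups[(country, category)].append(path)
    return dict(groups)


if __name__ == "__main__":
    # Path assumed from the description above; run from the repository root.
    for (country, category), files in group_entity_files("data/entities").items():
        print(country, category, len(files))
```

Keying on the first two path components keeps the sketch independent of how deep the files sit below each catalog-type folder.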

Example

Data.gov YAML file

access_mode:
- open
api: true
api_status: active
catalog_type: Open data portal
content_types:
- dataset
coverage:
- location:
    country:
      id: US
      name: United States
    level: 1
endpoints:
- type: ckanapi
  url: https://catalog.data.gov/api/3
export_standard: CKAN API
id: catalogdatagov
identifiers:
- id: wikidata
  url: https://www.wikidata.org/wiki/Q5227102
  value: Q5227102
- id: re3data
  url: https://www.re3data.org/repository/r3d100010078
  value: r3d100010078
- id: fairsharing
  url: https://fairsharing.org/FAIRsharing.6069e1
  value: FAIRsharing.6069e1
langs:
- EN
link: https://catalog.data.gov
name: Data.gov
owner:
  location:
    country:
      id: US
      name: United States
    level: 1
  name: U.S. General Services Administration
  type: Central government
software: CKAN
status: active
tags:
- government
- has_api
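
Because these records are maintained by hand, a small check on the identifiers entries can catch a misspelled key early. The sketch below assumes a record has already been loaded into a Python dict (for example with a YAML parser) and that every identifier entry carries exactly the keys id, url, and value, as in the example above. The function name `find_identifier_problems` and the key set are illustrative assumptions, not part of the repository's scripts.

```python
# Assumed from the example record, not from a published schema.
REQUIRED_IDENTIFIER_KEYS = {"id", "url", "value"}


def find_identifier_problems(record):
    """Return human-readable messages for malformed `identifiers` entries."""
    problems = []
    for index, entry in enumerate(record.get("identifiers", [])):
        keys = set(entry)
        missing = REQUIRED_IDENTIFIER_KEYS - keys
        unexpected = keys - REQUIRED_IDENTIFIER_KEYS
        if missing:
            problems.append(f"identifiers[{index}]: missing {sorted(missing)}")
        if unexpected:
            problems.append(f"identifiers[{index}]: unexpected {sorted(unexpected)}")
    return problems
```

An empty list means every identifier entry has the expected keys. A mistyped key shows up twice, once as a missing key and once as an unexpected one.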

Datasets and code

Datasets are kept in the data/datasets folder. Right now it holds a single file, catalogs.jsonl, generated by the builder.py script in the scripts folder.

Run python builder.py build in the scripts folder to regenerate the catalogs.jsonl file from the YAML files.
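
Once catalogs.jsonl exists, it can be inspected with the standard library alone. The sketch below counts records per catalog_type. It assumes each line of the file is one JSON object carrying the same fields as the YAML records, which is not confirmed here, and the function name `count_catalog_types` is hypothetical.

```python
import json
from collections import Counter


def count_catalog_types(lines):
    """Count records per `catalog_type` from an iterable of JSON Lines strings."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Records without the field are counted under "unknown".
        counts[record.get("catalog_type", "unknown")] += 1
    return counts
```

To use it, open data/datasets/catalogs.jsonl with encoding="utf-8" and pass the file handle directly to count_catalog_types, since a text file iterates line by line.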

How to contribute?

If you find a mistake or have an additional data catalog to add, please open a pull request or file an issue.

Data sources

The following data sources are used:

License

Source code is licensed under the MIT license. Data is licensed under the CC-BY 4.0 license.