Ruby library and tools for working with datapackages

datapackage-rb

A Ruby library for working with Data Packages.

The library is intended to support:

  • Parsing and using data package metadata and data
  • Validating data packages to ensure they conform with the Data Package specification

Installation

Add the gem to your Gemfile:

gem 'datapackage'

Or:

gem install datapackage

Reading a Data Package

Require the gem, if you need to:

require 'datapackage'

Parsing a data package descriptor from a remote location:

package = DataPackage::Package.new( "http://example.org/datasets/a/datapackage.json" )

This assumes that http://example.org/datasets/a/datapackage.json exists. Similarly, you can load a package descriptor from a local JSON file:

package = DataPackage::Package.new( "/my/data/package/datapackage.json" )

The data package descriptor, i.e. the datapackage.json file, is expected to be at the root directory of the data package, and the path attribute of the package's resources will be resolved relative to it.
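The resolution rule above can be sketched in plain Ruby. This is only an illustration of the idea, not the gem's internal code: `resolve_resource_path` is a hypothetical helper, and `data/test.csv` is a made-up resource path.

```ruby
require 'uri'

# Sketch only: resolves a resource path against the descriptor's location.
# Hypothetical helper, not part of the datapackage gem.
def resolve_resource_path(descriptor_location, resource_path)
  # A path that is already an absolute URL needs no resolution.
  return resource_path if resource_path.match?(%r{\Ahttps?://})

  if descriptor_location.match?(%r{\Ahttps?://})
    # Remote descriptor: resolve against the descriptor's URL.
    URI.join(descriptor_location, resource_path).to_s
  else
    # Local descriptor: resolve against the directory that contains it.
    File.join(File.dirname(descriptor_location), resource_path)
  end
end

puts resolve_resource_path('/my/data/package/datapackage.json', 'data/test.csv')
# prints /my/data/package/data/test.csv
puts resolve_resource_path('http://example.org/datasets/a/datapackage.json', 'data/test.csv')
# prints http://example.org/datasets/a/data/test.csv
```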

You can also load a data package descriptor directly from a Hash:

descriptor = {
  'resources'=> [
    {
      'name'=> 'example',
      'profile'=> 'tabular-data-resource',
      'data'=> [
        ['height', 'age', 'name'],
        ['180', '18', 'Tony'],
        ['192', '32', 'Jacob'],
      ],
      'schema'=>  {
        'fields'=> [
          {'name'=> 'height', 'type'=> 'integer'},
          {'name'=> 'age', 'type'=> 'integer'},
          {'name'=> 'name', 'type'=> 'string'},
        ],
      }
    }
  ]
}

package = DataPackage::Package.new(descriptor)

There is a set of helper methods for accessing data from the package, e.g.:

package.name
package.title
package.description
package.homepage
package.license

Reading Data Resources

A data package must contain an array of Data Resources. You can access the resources in your Data Package either by their name or by their index in the resources array:

first_resource = package.resources[0]
first_resource = package.get_resource('example')

# Get info about the data source of this resource
first_resource.inline?
first_resource.local?
first_resource.remote?
first_resource.multipart?
first_resource.tabular?
first_resource.source

You can then read the source depending on its type. For example, if the resource is local and not multipart, it can be opened as a file: File.open(resource.source).
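One way to branch on the source type is sketched below. `read_raw_source` is a hypothetical helper, not part of the gem. It relies only on the predicates and the `source` method listed above, and it assumes that `source` holds the inline data, a file path, or a URL respectively. Multipart sources are deliberately left out.

```ruby
require 'open-uri'

# Hypothetical helper: returns the raw content behind a resource.
# `resource` is any object that answers inline?, multipart?, local?,
# remote? and source.
def read_raw_source(resource)
  if resource.inline?
    # Assumption: inline data is already available in memory.
    resource.source
  elsif resource.multipart?
    raise NotImplementedError, 'multipart sources are out of scope for this sketch'
  elsif resource.local?
    File.read(resource.source)
  elsif resource.remote?
    URI.open(resource.source, &:read)
  end
end
```

With a package loaded as shown earlier, `read_raw_source(package.resources[0])` would return the inline array for the 'example' resource, assuming the predicates behave as their names suggest.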

If a resource complies with the Tabular Data Resource spec or uses the tabular-data-resource profile you can read resource rows:

resource = package.resources[0]
resource.tabular?
resource.headers
resource.schema

# Read all rows at once
data = resource.read
data = resource.read(keyed: true)

# Or iterate through it
data = resource.iter { |row| print row }

See the TableSchema documentation for other things you can do with a tabular resource.
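To illustrate what a keyed row looks like, here is a plain-Ruby sketch that pairs each cell with its header, using the sample data from the Hash descriptor above. It assumes that keyed rows are hashes keyed by header name. Casting values against the schema is left out, so every value stays a string here.

```ruby
headers = ['height', 'age', 'name']
rows = [
  ['180', '18', 'Tony'],
  ['192', '32', 'Jacob'],
]

# Pair each cell with its header to get one hash per row.
keyed_rows = rows.map { |row| headers.zip(row).to_h }

keyed_rows.each { |row| puts "#{row['name']} is #{row['age']}" }
```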

Creating a Data Package

package = DataPackage::Package.new

# Add package properties
package.name = 'my_sleep_duration'

# Add a resource
package.add_resource(
  {
    'name'=> 'sleep_durations_this_week',
    'data'=> [7, 8, 5, 6, 9, 7, 8],
  }
)

If the resource is valid, it will be added to the resources array of the Data Package. If it's invalid, it will not be added; in that case, create the resource on its own and validate it (see Validating a Resource) to find out why it fails.

# Update a resource
my_resource = package.get_resource('sleep_durations_this_week')
my_resource['schema'] = {
  'fields'=> [
    {'name'=> 'number_hours', 'type'=> 'integer'},
  ]
}

# Save the Data Package descriptor to the target file
package.save('datapackage.json')

# Remove a resource
package.remove_resource('sleep_durations_this_week')

Profiles

Data Package and Data Resource descriptors can be validated against JSON schemas that we call profiles.

By default, this gem uses the standard Data Package profile and Data Resource profile but alternative profiles are available for both.

According to the specs, the value of the profile property can be either a URL or an identifier from the registry.

Profiles in the local cache

The profiles from the registry come bundled with the gem. You can reference them in your Data Package descriptor by their identifier in the registry:

{
  "profile": "tabular-data-package"
}

Profiles from elsewhere

If you have a custom profile schema you can reference it by its URL:

{
  "profile": "https://specs.frictionlessdata.io/schemas/tabular-data-package.json"
}
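The two kinds of profile value can be told apart with the standard library. The sketch below is only an illustration: `profile_kind` is a hypothetical helper, not part of the gem's API, and the gem may make this distinction differently.

```ruby
require 'uri'

# Hypothetical helper: classifies a profile value as an HTTP(S) URL
# or as an identifier to be looked up in the registry.
def profile_kind(profile)
  uri = URI.parse(profile)
  uri.is_a?(URI::HTTP) && uri.host ? :url : :registry_identifier
rescue URI::InvalidURIError
  :registry_identifier
end

p profile_kind('tabular-data-package')
p profile_kind('https://specs.frictionlessdata.io/schemas/tabular-data-package.json')
```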

Validation

Data Resources and Data Packages are validated against their profiles to ensure they respect the expected structure.

Validating a Resource

descriptor = {
  'name'=> 'incorrect name',
  'path'=> 'https://cdn.rawgit.com/frictionlessdata/datapackage-rb/master/spec/fixtures/test-pkg/test.csv',
}
# base_path is passed as the second positional argument
resource = DataPackage::Resource.new(descriptor, '')

# Returns true if resource is valid, false otherwise
resource.valid?

# Returns true or raises DataPackage::ValidationError
resource.validate

# Iterate through validation errors
resource.iter_errors { |err| p err }
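The descriptor above is invalid because of its name. The Data Resource spec requires names to consist only of lowercase alphanumeric characters plus ".", "-" and "_". A rough standalone check could look like the sketch below. `NAME_PATTERN` and `plausible_resource_name?` are hypothetical names, and the profile's JSON schema remains the authoritative rule.

```ruby
# Approximation of the naming rule from the spec; the real check is
# done by validating the descriptor against its profile.
NAME_PATTERN = /\A[a-z0-9._-]+\z/

def plausible_resource_name?(name)
  name.is_a?(String) && NAME_PATTERN.match?(name)
end

p plausible_resource_name?('incorrect name')             # false (contains a space)
p plausible_resource_name?('sleep_durations_this_week')  # true
```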

Validating a Package

The same methods used to check the validity of a Resource (valid?, validate and iter_errors) are also available for a Package. The difference is that after a Package descriptor is validated against its profile, each of its resources is also validated against its own profile.

In order for a Package to be valid all its Resources have to be valid.
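The rule can be pictured as collecting errors from both levels. The sketch below uses stand-in data rather than the gem: the error lists and messages are made up for illustration, and an empty list means that level passed validation.

```ruby
# Stand-in validation results (hypothetical, not produced by the gem).
package_errors = []
resource_errors = {
  'example'        => [],
  'incorrect name' => ['name does not match the allowed pattern'],
}

# The package is valid only if neither level reported any error.
all_errors = package_errors + resource_errors.values.flatten
package_valid = all_errors.empty?

puts "valid: #{package_valid}"
all_errors.each { |err| puts err }
```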

Developer notes

These notes are intended to help people who want to contribute to this package itself. If you just want to use it, you can safely ignore them.

After checking out the repo, run bundle to install dependencies. Then, run rake spec to run the tests.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.

Updating the local schemas cache

We keep a local cache of the schemas listed in https://specs.frictionlessdata.io/schemas/registry.json. The local schemas should be kept up to date with the remote ones using:

rake update_profiles