piprate/json-gold

[PR Proposal] Exposing the normalized slice of Quads in a NormalisationAlgorithm

joeltg opened this issue · 1 comments

In addition to the JSON-LD features, we have been using this library's exported Quad struct as an internal RDF representation (as I imagine others have as well).

One of the things we want to do is 1) normalize an RDF dataset and then 2) access a []*Quad representation of that normalized dataset, without serializing it to an application/n-quads string or converting it back to JSON-LD. In particular we want a []*Quad, not an *RDFDataset, since the slice of quads preserves the canonical ordering, which *RDFDataset does not.

Currently there are a few API choices that prevent this kind of usage:

  1. The .quads field of the NormalisationAlgorithm struct that holds the slice of quads in the dataset is unexported, and there is no exported Quads() []*Quad accessor.
  2. When the dataset is sorted after normalization, only the normalized []string slice gets sorted, not the quads []*Quad slice (not sorting it is correct for the current API, since the quads slice is neither exported nor used afterwards).
  3. The normalized []string slice is always concatenated into a giant application/n-quads string, which is then either returned or re-parsed into an *RDFDataset, depending on opts.Format. This step would be unnecessary for users who only want the []*Quad slice, and it would be a shame to have to run the concatenated string back through an n-quads parser to recover it.

If this is something you'd be open to, I'd be happy to put together a small pull request that refactors the normalization API slightly to support this kind of usage. If not, I totally understand!

The changes I imagine would be:

  1. Add a Quads() []*Quad method to the NormalisationAlgorithm struct that just returns its .quads field.
  2. Sort the normalized []string and quads []*Quad slices simultaneously. To do this I imagine moving the normalized variable to a field of the NormalisationAlgorithm struct (and maybe renaming it .terms instead?), and implementing the sort interface on NormalisationAlgorithm directly, which would let us sort.Sort(na) with basically identical performance to sort.Strings(normalized):
func (na *NormalisationAlgorithm) Len() int           { return len(na.normalized) }
func (na *NormalisationAlgorithm) Less(i, j int) bool { return na.normalized[i] < na.normalized[j] }
func (na *NormalisationAlgorithm) Swap(i, j int) {
	na.normalized[i], na.normalized[j] = na.normalized[j], na.normalized[i]
	na.quads[i], na.quads[j] = na.quads[j], na.quads[i]
}
  3. Separate step 8 of na.Main(dataset *RDFDataset, opts *JsonLdOptions) (interface{}, error) from the rest of the normalization algorithm. To maintain backward compatibility, this would mean renaming the rest of the method (steps 1 through 7.2) to something like na.Normalize(dataset *RDFDataset), so that Main would look like:
func (na *NormalisationAlgorithm) Main(dataset *RDFDataset, opts *JsonLdOptions) (interface{}, error) {
	// Steps 1 through 7.2, and sorting, happen here
	na.Normalize(dataset)

	// 8) Return the normalized dataset.
	// handle output format
	if opts.Format != "" {
		if opts.Format == "application/n-quads" || opts.Format == "application/nquads" {
			rval := ""
			for _, n := range na.normalized {
				rval += n
			}
			return rval, nil
		}
		return nil, NewJsonLdError(UnknownFormat, opts.Format)
	}
	// no format requested: re-parse the concatenated n-quads back into
	// an *RDFDataset, as the current implementation does
	return ParseNQuads(strings.Join(na.normalized, ""))
}

All together, the new normalization API would be perfectly backwards-compatible, and would support a new usage pattern like:

na := ld.NewNormalisationAlgorithm("URDNA2015")
na.Normalize(dataset.(*ld.RDFDataset))
for _, quad := range na.Quads() {
  // ...
}

Sorry for the lengthy issue; let me know if this is something you're open to and I'll open a PR! And of course feel free to suggest a different way of approaching it if there's anything I'm not seeing.

Hi @joeltg,

I understand what's missing and agree something needs to be done about it. Your proposal makes sense in principle, but I need to remind myself how the RDF code works and think about Step 2.

On a side note, the current interface of the whole library is, in my opinion, quite poorly designed. This is a consequence of it being a direct port of the underlying algorithms, which, at the time the first version was written, was the most straightforward way to follow the (sometimes very complicated) logic. My plan, once we have full support for the JSON-LD 1.1 spec, is to create a new major version of the library and do a proper overhaul of the interface, with breaking changes. So, from your perspective: once I review your proposal, I'd be happy to accept a PR. Or we can wait for the next version and not worry about backward compatibility. Either way, I'd appreciate your feedback on the new interface once it's ready.

Of course, we can do both: make a change now and then review the new interface. I just wanted to share the plan so that your time and effort don't go to waste.

Stan