jekyll/jekyll-seo-tag

Make JSON-LD output deterministic

mison6 opened this issue ยท 28 comments

Having the same issues reported in #362, where every rendering is appearing as a modified file and triggering unnecessary updates. Can the output of template.html be made deterministic to support reproducible builds?

If making the output of jsonify deterministic is too difficult, perhaps #135 is a good solution to give users more control where reproducible builds are critical.

A possible solution would be similar to https://github.com/periodo/jsonld-stable-stringify

I'm also hitting this, would be great to see the output sorted so it's generated the same every time.

This issue has been automatically marked as stale because it has not been commented on for at least two months.

The resources of the Jekyll team are limited, and so we are asking for your help.

If this is a bug and you can still reproduce this error on the master branch, please reply with all of the information you have about it in order to keep the issue open.

If this is a feature request, please consider whether it can be accomplished in another way. If it cannot, please elaborate on why it is core to this project and why you feel more than 80% of users would find this beneficial.

This issue will automatically be closed in two months if no further activity occurs. Thank you for all your contributions.

I'm hitting this issue, it is not stale.

See e.g., this commit.

It actually isn't clear to me why the JSON changing from build to build. As far as I can tell, Ruby hashes have been insertion-ordered in any semi-modern Ruby (that is, for years), so where does the non-determinism come fork? It seems to be that the plugin itself will always use the same insertion order.

In any case, I've created a patched gem against 2.7.1 which sorts the hash (and adds debugging output if you build with --verbose). I use it like this:

gem "jekyll-seo-tag", "= 2.7.1p", github: 'travisdowns/jekyll-seo-tag', branch: 'v2.7.1-patched'

in my Gemfile.

Maybe give it a shot and see if it fixes your problem: so far it has fixed it for me.

I'm happy to send a PR if the maintainers would be interested in merging it? cc @ashmaroli

When making changes to plugins, layouts, and configuration settings, I routinely run a diff between the Jekyll _site directory before and after the change to ensure no unwanted changes were introduced. The non-determinism in this plugin makes that a pain---I end up having to temporary delete it from my head.html template in order to make the diff.

I'd love to see @travisdowns's solution or another solution for this implement.

Thanks for the plugin!

@harding - if you want, you can use my patched gem in the meantime: it's identical to v2.7.1 of this plugin, but with a small change to sort the JSON-LD keys prior to update. It has been solving my diffing problem (I have the same use-case as you) since I switched.

@travisdowns I took a look at your patch.
As it stands, my individual opinion is that it is not good enough for a generalized audience for performance reasons.

From past experience, computing a Drop's to_h (inherited from Jekyll) is already slightly expensive.
Chaining that with multiple iterations (first a call to :reject and then a call to `:sort) is only going to increase the expense..

Imagine having this routine repeated for every page / document in a site. Now imagine the impact on a large site with thousands of posts and pages..

@ashmaroli - thanks for taking a look.

To be clear, my patch would be different than the one in my fork. The fork has a bunch of debugging output and adds several lines of code, but the suggested patch I could submit as a PR would be a 1-line change, from to the current:

to_h.reject { |_k, v| v.nil? }.to_json

to

to_h.reject { |_k, v| v.nil? }.sort.to_h.to_json

Do you comments still apply in that case?

I didn't follow what you mean by "Chaining that with multiple iteration", but I'm not ruby expert. As I understand it though, the reject, sort, to_h calls are all just happening one at a time, passing their result to the next call.

So what we're talking about is just adding a sort (and one more to_h on the result of the sort), on an hash of less than a dozen elements. How long does Ruby take to sort a dozen string keys? I'd guess double-digit microseconds at worst: based on a superficial search, Ruby sort is about 50 nanoseconds per element for integer keys at around 200 elements, so if we handwave say string keys are 10 times worse, 12 elements would only be 6 microseconds. Even if were 100x worse than that, we're still comfortably under a millisecond.

Better than abstract arguments, however, we can just measure this. Here's the generation time for my site (100 builds, disk cache and incremental disabled) with and without the patch:

image

Hard to see daylight between the two! Let's zoom in:

image

We can see some run to run variation of about 2%, but there no clear pattern of one being faster than the other.

The averages of both runs were identical to 4 significant figures: 2.545. So I think we can say that the cost of sorting a dozen elements per page with the Ruby native sort method is in the noise, consistent with my earlier back-of-the-envelope arguments.

Would you accept a PR in light of this new info?

BTW, for completeness, here's the spreadsheet with the data, and I collected the results in bash like:

for i in {1..100}; do bundle exec jekyll build --profile --disable-disk-cache | grep done | egrep -o '[0-9][0-9.]*[0-9]'; done

The patched version was even the slower version you see in my fork with extra debug output and iteration, not the one-liner I suggest as the PR above.

@travisdowns Thank you for putting forward a strong argument. I appreciate the effort involved.
Regarding the patch, I was indeed referring to the distilled one-liner you suggested above. (The debug output just being noise).

In any Ruby dialog, to iterate primarily refers to evaluate every item in a given list, serially, one-at-a-time until signaled to skip or stop; and to chain simply means the same as you understand it..
Therefore, by adding a sort and one more to_h means Ruby is essentially evaluating the same items again and again.
That to me is wasteful and optimizable.

Say, the first to_h returns a Ruby Hash object with 10 key-value pairs.
Calling reject on this Hash will first allocate a new empty Array object into memory, then allocate the first pair of key-value (as another temporary Array of 2 items) into memory, check if second element of the pair is null and based on the result, either discards the temporary array or pushes it into the empty array. This continues for the remaining nine pairs. At the end of which, the array is converted into a new Hash object, allocated into memory and returned as result.
When sort is called on this Hash, another new array is allocated and again each of the (10 or lesser) key-value pairs are evaluated (as arrays) to be sorted (by their key) alphabetically. At the end of which the new array of sub-arrays is returned.
When to_h is called on this array, yet another Hash object is allocated to memory and sub-arrays as its key and value pairs.

During all those iterations (evaluating each key-value pair serially), the contents do not change but are nevertheless called upon, duplicated and discarded as garbage almost immediately.

In theory, a better patch would be one that combines these extra evaluations into one pass-through. That would mean opting for a slightly complicated alternative. In the end, the question would be is the extra complication worthwhile? That's where benchmarking (like what you demonstrated above) comes into action.

When you have results that substantiate the implementation, the maintainers would not have grounds to reject your patch.


[optional]

On a side note, since you have the technical know-how, may I ask what is the performance difference between computing the JSON-LD json and not computing the json entirely (when the associated Liquid block is deleted from the template) for your sample site?

Update: I did some testing with a sorted json output.
That still doesn't resolve the primary issue reported by the OP: every rendered file is shown as a modified file..

The dateModified attribute traces back to Jekyll::Document#date which in turn falls back to Jekyll::Site#time which is essentially Time.now. Therefore, as long as a user has posts or other collection documents in their site, a diff of the destination directory _site is going to show those files as recently modified files.

@ashmaroli dateModified on my posts appears to be set to the publication date, so it's constant between diffs. On collection documents without a date specified, I don't get a dateModified field in the SEO plugin output.

Also, if stuff was set to Time.now, it would probably be possible to avoid problems by running with faketime, e.g. faketime 00:00 bundle exec jekyll build.

That still doesn't resolve the primary issue reported by the OP: every rendered file is shown as a modified file..

Hmm, that's not how I read it. The OP isn't too specific, but they refer to earlier issue #362 which is clear that the issue is inconsistent order of JSON-LD elements: in that example, the order changes but the date modified is consistent. In my own experience, the non-determinism in JSON-LD arises solely from they key ordering.

I agree though that there is another issue where the modified date is updated every time a build happens: but this doesn't seem to appear in JSON-LD or in pages by default. In my case, it does appear on a debug page and so that page appears to change every time I rebuild my site, but that along with feed.xml are the only two files that have that problem.

In any case, even if there is an issue with the modified time, I think there is also an issue with key ordering and indeed that seems to be the one most people are running into. So if this issue is tracking only the modified time issue, maybe we can create another issue for the key ordering one?

[optional]

On a side note, since you have the technical know-how, may I ask what is the performance difference between computing the JSON-LD json and not computing the json entirely (when the associated Liquid block is deleted from the template) for your sample site?

Here's that test, comparing the existing 2.7.1 tag (not my patch) with no tag at all (the {% seo %} liquid command deleted from the template):

chart

Overall there is about a ~190ms difference: 2.59s with the tag and 2.40 without. My site has 110 pages with tags, so that's somwhat less than 2 milliseconds per page with a SEO tag on average.

@travisdowns Please go ahead and submit a pull request (with tests) to sort the JSON keys.
I'll review it this weekend.

Thanks @ashmaroli - I haven't had time to write any tests yet but I'll get to it at some point hopefully.

This issue has been automatically marked as stale because it has not been commented on for at least two months.

The resources of the Jekyll team are limited, and so we are asking for your help.

If this is a bug and you can still reproduce this error on the master branch, please reply with all of the information you have about it in order to keep the issue open.

If this is a feature request, please consider whether it can be accomplished in another way. If it cannot, please elaborate on why it is core to this project and why you feel more than 80% of users would find this beneficial.

This issue will automatically be closed in two months if no further activity occurs. Thank you for all your contributions.

parkr commented

I'd be delighted to review a PR for this.

@travisdowns Okay. Will you be able to submit a pull request without tests?
If yes, please rebase your branch onto latest master and submit a proposal.
If not, do let us know.

Any updates on this?

This issue has been automatically marked as stale because it has not been commented on for at least two months.

The resources of the Jekyll team are limited, and so we are asking for your help.

If this is a bug and you can still reproduce this error on the master branch, please reply with all of the information you have about it in order to keep the issue open.

If this is a feature request, please consider whether it can be accomplished in another way. If it cannot, please elaborate on why it is core to this project and why you feel more than 80% of users would find this beneficial.

This issue will automatically be closed in two months if no further activity occurs. Thank you for all your contributions.

Still needs a fix

parkr commented

Quick PR here: #458

Thanks @parkr and sorry I never got around to this. Real life intervened.

parkr commented

No worries at all, @travisdowns! I happened to have some free time and took a swing at it. We're all working with the energy and time that is left over after all the other day's responsibilities are done. Often there's nothing left to devote.