samdark/sitemap

Multiple writes to the same compressed XML file break it in browsers

Closed this issue · 9 comments

Here is the difference between the files; there are two gzip start sequences in multi-time-write.gz:

(Screenshots: the two start sequences highlighted; Chrome v60 response; Firefox v54 response.)

The file decompresses fine with 7-Zip, but browsers can't handle it. So I'm worried about whether all search engines would handle it correctly, now and for future compatibility.

Here is a sample repository to reproduce the issue:
https://github.com/terales/multi-writes-to-gzipped-file

I've prepared a workaround for my project by disabling gzip for the Sitemap instances and compressing the already generated files:

<?php
// $path points to the uncompressed sitemap produced with gzip disabled.
// Buffer its contents in a temporary stream, then write them back through
// the zlib wrapper in a single pass, producing a one-member gzip file
// at the same path.
$tempfile = fopen('php://temp', 'r+');

$sitemapRead = fopen($path, 'r');
stream_copy_to_stream($sitemapRead, $tempfile);
fclose($sitemapRead);

rewind($tempfile);

$sitemapWrite = fopen('compress.zlib://' . $path, 'w');
stream_copy_to_stream($tempfile, $sitemapWrite);
fclose($sitemapWrite);

fclose($tempfile);

I can't figure out any way to fix this issue inside the library. Can you share your thoughts, so I can prepare a PR with a fix?

According to RFC 1952, a gzip file may consist of multiple members. These members are data chunks, and the byte sequences you've referred to are the headers that mark the start of each chunk.

Unfortunately, browsers seem to fail at decoding gzip files with multiple members. The common behavior is to decode only the first member. So the issue is that content gzipped this way isn't read properly by browsers.
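For illustration, here is a minimal sketch of how such a file comes about (the file name and XML chunks are made up): appending independently gzip-encoded strings concatenates gzip members into one file.

<?php
// Hypothetical demo: two separately encoded chunks appended to the same
// file form a two-member gzip. Valid per RFC 1952, yet browsers usually
// decode only the first member.
$file = 'multi-member.xml.gz';

file_put_contents($file, gzencode('<urlset>first chunk</urlset>'));
file_put_contents($file, gzencode('<urlset>second chunk</urlset>'), FILE_APPEND);

// Rough check: every member starts with the gzip magic bytes 1f 8b.
// (Compressed payload bytes could match by coincidence, so this is only
// an approximation.)
echo substr_count(file_get_contents($file), "\x1f\x8b"), " member header(s)\n";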

The question to ask is whether sitemap crawlers are happy with multi-member gzips or not. Browsers don't really matter. So I've checked some.

So, in order to know how to deal with the issue, we need to know whether it is valid for crawlers. Most of them have validation tools, such as the ones Yandex and Google provide. It would be cool if you could check these.

Thanks for the overview. I'll check the issue with Google, Yandex and Bing by Tuesday.

Research notes:

  • I've created a one-member reference file with the workaround I mentioned in the first comment,
  • used the test suite with Apache's mod_deflate disabled,
  • PHP v5.6.30.

Here are the results of my research, for both HTTP and HTTPS:

search engine | multi-member file | one-member file
Google        |                   | ✔️
Bing (note)*  |                   | ✔️
Yandex        | ✔️                | ✔️

(note) As Bing was showing "Pending" for the whole day, I checked the results from the "Fetch as Bingbot" page.


Here are screenshots for the search engines (Google, Bing).

Thanks for the research. The problem is valid.

We have to gzip everything in a single pass in order to solve it.

In memory

That is, accumulate all the content in a variable and then gzip it once.

It would work, but it would also render the library's memory efficiency useless, making it very memory-costly to generate sufficiently large sitemaps.

Gzip ready file

Write the file to a temp dir as it's done now, then gzip it to the final destination in one pass. I'm not sure if it could be done efficiently.

Any other ways?

I'm not sure these are the only possible options...

I've managed to get a valid compressed file for browsers, Google and Yandex (I had trouble logging into Bing, so I couldn't check it) with PHP 7's deflate_add function.

It works in a memory-efficient manner (for a 6 MB file, memory usage was stable: 2 MB real usage, ~670 KB for the script).
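For context, those two figures correspond to what PHP's built-in memory functions report; here is a hypothetical way to reproduce the measurement (not necessarily how the experiment did it):

<?php
// memory_get_usage(false) reports memory used by the script itself,
// memory_get_usage(true) reports the real memory allocated from the system.
// memory_get_peak_usage() gives the corresponding peak values.
printf("script: %d bytes\n", memory_get_usage(false));
printf("real:   %d bytes\n", memory_get_usage(true));
printf("peak:   %d bytes\n", memory_get_peak_usage(true));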

Source code of the experiment.


Maybe we can use deflate_add, with a v5.6 fallback that would write the uncompressed XML to a temporary file and then write it to the required destination with stream_copy_to_stream, like in my workaround from the first comment?
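A minimal sketch of the deflate_add approach, assuming PHP >= 7.0 ($chunks and $path are illustrative placeholders, not the library's actual API):

<?php
// Stream every XML chunk through a single gzip context so the output file
// contains exactly one gzip member.
$context = deflate_init(ZLIB_ENCODING_GZIP);
$out = fopen($path, 'wb');

foreach ($chunks as $chunk) {
    // ZLIB_NO_FLUSH lets zlib buffer data between calls; all chunks still
    // end up inside the same gzip member.
    fwrite($out, deflate_add($context, $chunk, ZLIB_NO_FLUSH));
}

// ZLIB_FINISH flushes the remaining data and writes the gzip trailer once.
fwrite($out, deflate_add($context, '', ZLIB_FINISH));
fclose($out);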

If that sounds good to you, I can prepare a PR.

Yes. Sounds good.

Here is a PR: #45