WashingtonPost Articles Not Saving Correctly
sj365 opened this issue · 6 comments
WSB has not been working correctly for the WashingtonPost. Pages are not being saved completely. Specifically, the comments are not being saved completely, nor are the images for other articles. I made a video for you to review to understand the issue. If you think that some settings need to be changed to fix this please advise.
Thanks,
SJ
https://drive.google.com/drive/folders/1en544cc5lTajJnZHSvsas0GhnW3FeINu?usp=sharing
https://www.washingtonpost.com/technology/2022/07/16/racist-robots-ai/
WebScrapBook 1.4.3
Chrome is up to date
Version 103.0.5060.114 (Official Build) (64-bit)
Processor 11th Gen Intel(R) Core(TM) i9-11980HK @ 2.60GHz 3.30 GHz
Installed RAM 32.0 GB (31.7 GB usable)
System type 64-bit operating system, x64-based processor
Pen and touch Touch support with 10 touch points
Edition Windows 11 Home
Version 21H2
Installed on 1/31/2022
OS build 22000.795
Experience Windows Feature Experience Pack 1000.22000.795.0
Pages on this website are controlled by scripts that automatically unload images once they scroll off screen. Unfortunately, we can do nothing in such a case. Site-specific scripts and/or techniques are required to capture everything wanted.
You can confirm this by disabling scripts and seeing that the page does NOT display normally. Maybe you can contact the webmaster and request that they make the site display normally without scripts (with a justification that you are disabling scripts for security purposes etc.), so that you can capture the web page with scripts disabled in the future.
Thanks for replying. I will seek to contact the WashingtonPost as you advised.
However, there is still the matter of page errors unrelated to the missing images. As you can see in the attached screenshot (file named texterror WSB SPW.png), WSB did not display the text correctly. SavePageWeb makes exactly the same error. SingleFile does not make this particular error, even if it doesn't get all the page images.
Also, I did further testing of the best page capture extensions (WSB, SingleFile, and SavePageWeb), and they all have issues with the image-loading method the WashingtonPost has started using. Although SingleFile appeared to do best, it is inconsistent with WashingtonPost page saves. I might have to save the same page 5-6 times before it gets it right, with all images and comments appearing, but it can be done. You can view it for yourself in the 2ndVideos Google folder link; the file is named SingleFile perfect.
https://www.washingtonpost.com/world/2022/07/18/heat-wave-uk-temperatures-40c-record/
To remove speculation about whether scripts on my system are involved in the issue, I made videos with scripts active and inactive. The results are essentially the same. WSB and SPW still make the text error and do not show all images. SingleFile is hit and miss until it gets it right (no settings changes between saves).
I also included a video of the WashingtonPost page being saved correctly as a .png file using FaststoneCapture 9.7 scrolling page saver. Although it doesn't save as an .html file, it does save the WPost page exactly with no text errors and all lazy loaded images completely. That .png file is named Fastone Scroll complete and the video is named FastoneScroll.
Just to be clear, I'm not providing this information to attempt to be annoying or show you up in any way. I have a practical reason to want this application to work correctly; I'm a grad student who does a lot of web research and I use WSB, SingleFile, and SavePageWeb extensively to save my web research pages. By submitting bug reports, I do so to improve the product because basically I need them to work. My major is not programming or I would do it myself.
https://drive.google.com/drive/folders/1Vmi1RelwNSh3r7SX8OJ3jPMrFb3LTRrF?usp=sharing
The problem with the comment image is due to bad HTML.
The source code of the page is something like:
<button><button><svg>...</svg></button>...</button>
This is bad HTML, and browsers will interpret it as follows when the page is loaded:
<button></button><button><svg>...</svg></button>...
(I have so far failed to find a solid HTML spec saying that <button> in <button> is not allowed, or that <button><button> should be interpreted as <button></button><button>, as happens with <p> or <li>. It is possible that browsers don't follow the spec.)
This can be confirmed by disabling page scripts and seeing that the image is misplaced. When scripts are enabled, on the other hand, some page script appears to regenerate the elements in this nested form. Browsers allow nested buttons created through script, but once WSB or another tool saves the page as static HTML and it is reloaded, the markup is parsed again and re-interpreted as shown above.
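The re-interpretation described above can be imitated with a small Python sketch. It is a toy model, not a real HTML parser: it uses the standard library's html.parser only to tokenize, and it applies just one recovery rule, namely that a <button> start tag closes any <button> that is still open, while an end tag with no matching open element is dropped. The names ButtonNestingSimulator and reserialize are hypothetical, chosen for illustration; real browsers apply many more rules than this.

```python
from html.parser import HTMLParser

# Elements that never have an end tag; they are not tracked as "open".
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class ButtonNestingSimulator(HTMLParser):
    """Toy re-serializer that applies only the nested-<button> recovery rule."""

    def __init__(self):
        super().__init__()
        self.out = []    # pieces of the re-serialized markup
        self.stack = []  # names of the elements that are currently open

    def _close_through(self, name):
        # Close open elements, innermost first, up to and including `name`.
        while self.stack:
            closed = self.stack.pop()
            self.out.append(f"</{closed}>")
            if closed == name:
                break

    def handle_starttag(self, tag, attrs):
        if tag == "button" and "button" in self.stack:
            # Assumed rule: a new <button> implicitly ends the open one.
            self._close_through("button")
        if tag not in VOID:
            self.stack.append(tag)
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.stack:
            self._close_through(tag)
        # Otherwise the end tag is stray (its element is already closed): drop it.

    def handle_data(self, data):
        self.out.append(data)


def reserialize(html):
    """Return the markup as the toy model would re-interpret it."""
    parser = ButtonNestingSimulator()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(reserialize("<button><button><svg>...</svg></button>...</button>"))
# prints: <button></button><button><svg>...</svg></button>...
```

The printed result matches the interpreted form shown earlier: the outer button ends up empty, and the trailing content falls outside any button.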
Hi Danny,
Your explanation sounds very technical. However, it does not explain why WSB and SPW show this HTML as an incorrect page view when saved, while SingleFile, interpreting the same HTML from the browser and saving the page as .html, does not. This is demonstrated by the Singlefile Perfect .html saved page inside the 2ndVideos Google folder link.
Is it possible for you to make the adjustment necessary to prevent the bad HTML you described from affecting the page saved by WSB?
Thanks
SingleFile has some hard-coded logic that translates <button><button><svg>...</svg></button>...</button> into <button><span><svg>...</svg></span>...</button>. This is a very subjective translation, as no one actually knows how bad HTML should be translated, and it won't work for all cases, although it happens to work as expected for this one.
This can also be done with WSB by setting up a capture helper:
{
  "description": "Fix button button in WashingtonPost",
  "pattern": "/^https://www\\.washingtonpost\\.com//",
  "commands": [
    ["html", {"css": "button"}, ["replace", ["get_html", null], "/(<\/?)button/g", "$1span"]]
  ]
}
Or, another way that may be more appropriate:
{
  "description": "Fix button button in WashingtonPost",
  "pattern": "/^https://www\\.washingtonpost\\.com//",
  "commands": [
    ["unwrap", {"css": "button button"}]
  ]
}
Either way, some research and "programming" are required.
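For readers who want to see what the two helpers are meant to do to the markup, here is a rough Python sketch that works on plain strings. It does not run inside WSB and does not use its helper engine; the function names buttons_to_spans and unwrap_buttons are hypothetical, and the regular expressions only approximate the "replace" and "unwrap" commands for the simple example discussed above.

```python
import re


def buttons_to_spans(inner_html):
    """Rename every <button>/</button> tag in the fragment to <span>/</span>.

    Uses the same pattern as the first helper: "(</?)button" -> "$1span".
    """
    return re.sub(r"(</?)button", r"\1span", inner_html)


def unwrap_buttons(inner_html):
    """Drop <button> start and end tags in the fragment but keep their contents.

    Approximates unwrapping each button nested inside another button.
    """
    return re.sub(r"</?button\b[^>]*>", "", inner_html)


# Inner HTML of the outer button, as the page scripts generate it.
inner = "<button><svg>...</svg></button>..."

print("<button>" + buttons_to_spans(inner) + "</button>")
# prints: <button><span><svg>...</svg></span>...</button>

print("<button>" + unwrap_buttons(inner) + "</button>")
# prints: <button><svg>...</svg>...</button>
```

In both results no button is nested inside another button, so a browser that reloads the saved file has nothing to re-interpret. The first form keeps a wrapper element around the icon, while the second removes it, which may affect page styles that target the inner element.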
Close stale issue.