Possible issue around caching on apnews.com
soupglasses opened this issue · 8 comments
When attempting to use the new configuration from the following PR: html2rss/html2rss-configs#176, I seem to have hit an issue with the caching of dynamic config files. I have not yet found a reliable set of steps to reproduce it, nor the root cause, but this is what I did:
- Load section=trending-news
- Shortly afterwards, load another section, for example section=ukraine
The second feed then either mixes the stories from trending-news and ukraine, or shows only the earlier trending-news stories under ukraine.
My feeling is that it might be related to these pages taking a while to load, and there seem to be errors around timeouts, but this might be the wrong path to go down. The logs are below:
[1] Puma starting in cluster mode...
[1] * Puma version: 5.6.2 (ruby 3.1.1-p18) ("Birdie's Version")
[1] * Min threads: 5
[1] * Max threads: 5
[1] * Environment: production
[1] * Master PID: 1
[1] * Workers: 2
[1] * Restarts: (✔) hot (✖) phased
[1] * Preloading application
[1] * Listening on http://0.0.0.0:3000
[1] Use Ctrl-C to stop
[1] - Worker 0 (PID: 3) booted in 0.0s, phase: 0
[1] - Worker 1 (PID: 4) booted in 0.0s, phase: 0
source=rack-timeout id=0894f721-5348-4ba2-9fac-0fffcabb1078 timeout=15000ms state=ready at=info
/usr/local/bundle/bundler/gems/html2rss-a898923f92b2/lib/html2rss/item_extractors/href.rb:26: warning: redefining constant Struct::Options
/usr/local/bundle/bundler/gems/html2rss-a898923f92b2/lib/html2rss/item_extractors/html.rb:25: warning: redefining constant Struct::Options
/usr/local/bundle/bundler/gems/html2rss-a898923f92b2/lib/html2rss/item_extractors/static.rb:16: warning: redefining constant Struct::Options
/usr/local/bundle/bundler/gems/html2rss-a898923f92b2/lib/html2rss/item_extractors/text.rb:23: warning: redefining constant Struct::Options
source=rack-timeout id=0894f721-5348-4ba2-9fac-0fffcabb1078 timeout=15000ms service=1462ms state=completed at=info
source=rack-timeout id=45b4009a-7b73-4652-b7d2-fc2e84178fe1 timeout=15000ms state=ready at=info
source=rack-timeout id=45b4009a-7b73-4652-b7d2-fc2e84178fe1 timeout=15000ms service=1ms state=completed at=info
source=rack-timeout id=a2bd4d83-a7b9-420e-b131-668b5394abc3 timeout=15000ms state=ready at=info
source=rack-timeout id=a2bd4d83-a7b9-420e-b131-668b5394abc3 timeout=15000ms service=1ms state=completed at=info
source=rack-timeout id=1e97365c-5228-4d84-a3b2-c3b401cd06d5 timeout=15000ms state=ready at=info
source=rack-timeout id=1e97365c-5228-4d84-a3b2-c3b401cd06d5 timeout=15000ms service=4ms state=completed at=info
source=rack-timeout id=a2720804-beca-41de-9003-194396ef1a69 timeout=15000ms state=ready at=info
source=rack-timeout id=9f3a171e-3956-47fc-8eed-cd3fe10a51a0 timeout=15000ms state=ready at=info
source=rack-timeout id=a2720804-beca-41de-9003-194396ef1a69 timeout=15000ms service=2ms state=completed at=info
source=rack-timeout id=9f3a171e-3956-47fc-8eed-cd3fe10a51a0 timeout=15000ms service=1ms state=completed at=info
source=rack-timeout id=ad960dbb-d315-4752-8229-fc42159466d8 timeout=15000ms state=ready at=info
source=rack-timeout id=ad960dbb-d315-4752-8229-fc42159466d8 timeout=15000ms service=1ms state=completed at=info
source=rack-timeout id=806abc6b-43c8-4479-ba37-0ad93475f3a3 timeout=15000ms state=ready at=info
/usr/local/bundle/bundler/gems/html2rss-a898923f92b2/lib/html2rss/item_extractors/href.rb:26: warning: redefining constant Struct::Options
/usr/local/bundle/bundler/gems/html2rss-a898923f92b2/lib/html2rss/item_extractors/html.rb:25: warning: redefining constant Struct::Options
/usr/local/bundle/bundler/gems/html2rss-a898923f92b2/lib/html2rss/item_extractors/static.rb:16: warning: redefining constant Struct::Options
/usr/local/bundle/bundler/gems/html2rss-a898923f92b2/lib/html2rss/item_extractors/text.rb:23: warning: redefining constant Struct::Options
source=rack-timeout id=806abc6b-43c8-4479-ba37-0ad93475f3a3 timeout=15000ms service=4094ms state=completed at=info
source=rack-timeout id=2c4a2607-64e9-44ca-9732-bbaace89398b timeout=15000ms state=ready at=info
source=rack-timeout id=2c4a2607-64e9-44ca-9732-bbaace89398b timeout=15000ms service=634ms state=completed at=info
source=rack-timeout id=de17384a-22f5-4aac-8295-f79e2c6c5204 timeout=15000ms state=ready at=info
source=rack-timeout id=de17384a-22f5-4aac-8295-f79e2c6c5204 timeout=15000ms service=364ms state=completed at=info
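To check the timeout suspicion against logs like the ones above, here is a minimal Ruby sketch that pulls the service time out of each completed rack-timeout line and flags requests that came close to their limit. The helper names and the 0.8 threshold are illustrative assumptions, not part of rack-timeout or html2rss-web.

```ruby
# Sketch: extract service times from rack-timeout log lines.
# The limit is read from each line's own timeout= field.

COMPLETED = /id=(?<id>\S+) timeout=(?<timeout>\d+)ms service=(?<service>\d+)ms state=completed/

# Returns one hash per completed request found in the log text.
def completed_requests(log_text)
  log_text.each_line.filter_map do |line|
    m = COMPLETED.match(line)
    next unless m

    { id: m[:id], timeout_ms: m[:timeout].to_i, service_ms: m[:service].to_i }
  end
end

# Requests that used at least `ratio` of their timeout budget
# (the 0.8 default is an arbitrary choice for illustration).
def near_timeout(requests, ratio: 0.8)
  requests.select { |r| r[:service_ms] >= r[:timeout_ms] * ratio }
end

# Three lines copied from the logs above.
sample = <<~LOG
  source=rack-timeout id=0894f721-5348-4ba2-9fac-0fffcabb1078 timeout=15000ms state=ready at=info
  source=rack-timeout id=0894f721-5348-4ba2-9fac-0fffcabb1078 timeout=15000ms service=1462ms state=completed at=info
  source=rack-timeout id=806abc6b-43c8-4479-ba37-0ad93475f3a3 timeout=15000ms service=4094ms state=completed at=info
LOG

requests = completed_requests(sample)
slowest = requests.max_by { |r| r[:service_ms] }
puts "slowest: #{slowest[:service_ms]}ms of #{slowest[:timeout_ms]}ms"
puts "near timeout: #{near_timeout(requests).size}"
```

In the logs posted here, the slowest completed request took 4094ms against a 15000ms limit, so none of the listed requests came close to timing out.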
Finally I found some time… took me a while.
Unfortunately, I can't reproduce the described problem.
I've checked what's generated by html2rss and saved the titles in a file for each request:
curl http://127.0.0.1:5000/apnews.com/hub.rss\?section\=ukraine | pup title | grep --invert-match title | sort > ukraine.txt
curl http://127.0.0.1:5000/apnews.com/hub.rss\?section\=trending-news | pup title | grep --invert-match title | sort > trending.txt
With those two files, I checked whether any titles are present in both by reading each file, splitting it by line into an array, and taking the intersection of the two arrays:
ruby -e 'puts IO.read($*[0]).to_s.split("\n") & IO.read($*[1]).to_s.split("\n")' ukraine.txt trending.txt
The result was empty.
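The same check can be sketched as a single Ruby script, for anyone without pup installed. The helper names and the sample feeds are made up for illustration, and a regex stands in for a real XML parser to keep the example dependency-free.

```ruby
# Sketch: list titles that appear in both feeds.
# A regex keeps this dependency-free; a real XML parser would be more robust.

# Returns the text of every <title> element, with CDATA wrappers removed.
def titles(xml)
  xml.scan(%r{<title>(.*?)</title>}m).flatten.map do |title|
    title.sub(/\A\s*<!\[CDATA\[/, '').sub(/\]\]>\s*\z/, '').strip
  end
end

# Titles present in both documents (Array#& is the set intersection).
def shared_titles(xml_a, xml_b)
  titles(xml_a) & titles(xml_b)
end

# Made-up sample feeds, for illustration only.
ukraine = <<~XML
  <rss><channel><title>Hub Ukraine</title>
    <item><title>Story A</title></item>
    <item><title>Story B</title></item>
  </channel></rss>
XML

trending = <<~XML
  <rss><channel><title>Hub Trending News</title>
    <item><title>Story B</title></item>
    <item><title>Story C</title></item>
  </channel></rss>
XML

puts shared_titles(ukraine, trending)
```

With real feeds, the two strings could come from Net::HTTP.get(URI(url)) for each section URL. Because the channel title is included in the comparison, two responses with the same channel title would also show up in the result.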
If there had been an intersection, the next step would have been to check whether html2rss is wrong or whether apnews links to the same article on both crawled pages.
Of course, I also played around a bit without the intersection approach, but everything looked good.
To reproduce the error, I'd need to investigate deeper.
- How did you encounter the error?
- Did you use curl, or did you compare the two generated RSS feeds in your browser? If the latter, can you test again and, when you see duplicate items, do a hard reload (without cache) and check whether the problem remains?
Thanks!
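One way to narrow down the browser-cache question is to look at the caching-related headers the server sends with each feed. The sketch below filters them out of a header hash; the helper names and the sample header values are assumptions for illustration, not output captured from html2rss-web.

```ruby
# Sketch: pick the caching-related headers out of an HTTP response.
CACHE_HEADERS = %w[cache-control etag last-modified expires vary age].freeze

# `headers` is any hash of header name => value; names are compared
# case-insensitively. Returns only the caching-related entries.
def cache_headers(headers)
  headers.each_with_object({}) do |(name, value), found|
    key = name.to_s.downcase
    found[key] = value if CACHE_HEADERS.include?(key)
  end
end

# Extracts the max-age value in seconds from a Cache-Control string,
# or nil when the directive is absent.
def max_age(cache_control)
  cache_control.to_s[/max-age=(\d+)/, 1]&.to_i
end

# Made-up sample headers, for illustration only.
sample = {
  'Content-Type' => 'text/xml',
  'Cache-Control' => 'public, max-age=7200',
  'ETag' => '"abc"'
}

cache_headers(sample).each { |name, value| puts "#{name}: #{value}" }
puts "max-age: #{max_age(sample['Cache-Control'])}"
```

For a live check, Net::HTTP.get_response(URI(url)).to_hash returns the response headers as a hash (with array values) that can be passed to cache_headers for each section URL.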
Okay, so after some digging into what I was doing a month ago, I figured it out. The currently shipped apnews config works flawlessly. What I failed to mention is that I was extending apnews using my own custom feeds.yml file with my own apnews.com scraper, which I am running through:
podman run -d --name html2rss-web -v ./config:/app/config:z -p 3000:3000 gilcreator/html2rss-web
# config/feeds.yml
stylesheets:
  - href: '/rss.xsl'
    media: 'all'
    type: 'text/xsl'
headers:
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; rv:68.0) Gecko/20100101 Firefox/68.0"
feeds:
  apnewshub:
    channel:
      url: https://apnews.com/hub/%<section>s
      language: en
      ttl: 120
      time_zone: UTC
    selectors:
      items:
        selector: ".FeedCard"
      title:
        selector: h2
      link:
        selector: a:first
        extractor: href
      description:
        selector: p
        post_process:
          name: gsub
          pattern: '^COPENHAGEN, Denmark \(AP\) \— '
          replacement: ''
      updated:
        selector: ".Timestamp"
        extractor: attribute
        attribute: data-source
        post_process:
          name: "parse_time"
      author:
        selector: 'span[class^="Component-bylines"]'
        post_process:
          - name: gsub
            pattern: '^By '
            replacement: ''
          - name: gsub
            pattern: '^$'
            replacement: 'APNEWS'
This configuration does fail if you go to http://localhost:3000/apnewshub.rss?section=denmark and then quickly afterwards to http://localhost:3000/apnewshub.rss?section=ukraine. However, it works properly when using curl and the Ruby comparison you provided, but it loads wrongly under Chromium; see the attached pictures.
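For reference, the two requests above should each expand to a different upstream URL. The sketch below shows how a %<section>s placeholder is filled in by Ruby's Kernel#format; it illustrates the placeholder syntax only and is not html2rss's actual code. The channel_url helper is a hypothetical name.

```ruby
# Sketch: expanding a %<section>s placeholder with Kernel#format.
# This is an illustration, not html2rss's implementation.
require 'uri'

URL_TEMPLATE = 'https://apnews.com/hub/%<section>s'

# Builds the channel URL from the query string of a feed request.
def channel_url(query_string)
  params = URI.decode_www_form(query_string).to_h.transform_keys(&:to_sym)
  format(URL_TEMPLATE, params)
end

puts channel_url('section=denmark') # prints https://apnews.com/hub/denmark
puts channel_url('section=ukraine') # prints https://apnews.com/hub/ukraine
```

Since the expanded URL differs per section, any cache sitting in front of the feed would need to key on the full request URL, query string included, to keep the sections apart.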
Okay, I found something even odder. I stopped and deleted the container, edited some settings in feeds.yml, and started a brand-new container with the changed settings. I then went to http://localhost:3000/apnewshub.rss?section=norway under Chromium, a section I had never loaded before, and somehow got back what the previous http://localhost:3000/apnewshub.rss?section=ukraine should have looked like.
Testing again with curl gives me the correct, new ukraine section and the correct, new norway section. But Chromium is somehow serving the ukraine response from the previous generation of the config file for the never-before-fetched norway section.
I have a strong feeling this is some really absurd caching issue in Chromium rather than an issue with html2rss-web, but I have no clue how this is even possible.
Forced refreshes (SHIFT+F5) do not seem to help either. They sometimes give me the correct result where a previous one was wrong, but I have also seen section=denmark go from Hub Denmark to Hub Norway when I ran a SHIFT+F5 force refresh.
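A mismatch like Hub Norway under section=denmark can be detected mechanically. The sketch below assumes the channel title is the first <title> in the feed and contains the section name, as in the titles reported above; the helper names are hypothetical, and multi-word sections such as trending-news may need extra normalizing.

```ruby
# Sketch: flag a feed whose channel title does not mention the requested section.
# Assumes the channel title is the first <title> element in the document.

# Returns the first <title> text, or an empty string if there is none.
def channel_title(xml)
  xml[%r{<title>(.*?)</title>}m, 1].to_s.strip
end

# True when the channel title contains the section name (case-insensitive).
def matches_section?(section, xml)
  channel_title(xml).downcase.include?(section.downcase)
end

# Made-up sample response, for illustration only.
feed = '<rss><channel><title>Hub Norway</title></channel></rss>'

puts matches_section?('norway', feed)  # prints true
puts matches_section?('denmark', feed) # prints false
```

Running this against the body saved from the browser and against the body fetched with curl would show which of the two responses carries the wrong section.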
Wow, thank you for the detailed response. Love it! That's something to work with :-)
Sadly, I can't free up the required time currently to properly address this issue.
If anyone wants to take over, feel free and maybe give me a little ping. Otherwise, I'll gladly take on this issue as soon as I find the time.
I no longer use html2rss, so I am afraid I will be unable to test this. But if it does work for others, I would say this issue could be closed. Thank you for your work @mabeett and @gildesmarais! 😄