html2rss/html2rss-web

Possible issue around caching on apnews.com

soupglasses opened this issue · 8 comments

When attempting to use the new configuration from the following PR: html2rss/html2rss-configs#176, I seem to have hit an issue around the caching of dynamic config files. I have yet to find a reliable set of steps to reproduce it, or the root cause of where exactly it goes wrong. What I have done so far is this:

Load section=trending-news, then shortly after load another section, for example section=ukraine. The second feed will either mix the stories from trending-news and ukraine, or show only the earlier trending-news stories under ukraine.

My feeling is that it might have to do with these pages taking a while to load, and there seem to be errors around timeouts. But this might be the wrong path to go down. You can see the logs below:

[1] Puma starting in cluster mode...
[1] * Puma version: 5.6.2 (ruby 3.1.1-p18) ("Birdie's Version")
[1] *  Min threads: 5
[1] *  Max threads: 5
[1] *  Environment: production
[1] *   Master PID: 1
[1] *      Workers: 2
[1] *     Restarts: (✔) hot (✖) phased
[1] * Preloading application
[1] * Listening on http://0.0.0.0:3000
[1] Use Ctrl-C to stop
[1] - Worker 0 (PID: 3) booted in 0.0s, phase: 0
[1] - Worker 1 (PID: 4) booted in 0.0s, phase: 0
source=rack-timeout id=0894f721-5348-4ba2-9fac-0fffcabb1078 timeout=15000ms state=ready at=info
/usr/local/bundle/bundler/gems/html2rss-a898923f92b2/lib/html2rss/item_extractors/href.rb:26: warning: redefining constant Struct::Options
/usr/local/bundle/bundler/gems/html2rss-a898923f92b2/lib/html2rss/item_extractors/html.rb:25: warning: redefining constant Struct::Options
/usr/local/bundle/bundler/gems/html2rss-a898923f92b2/lib/html2rss/item_extractors/static.rb:16: warning: redefining constant Struct::Options
/usr/local/bundle/bundler/gems/html2rss-a898923f92b2/lib/html2rss/item_extractors/text.rb:23: warning: redefining constant Struct::Options
source=rack-timeout id=0894f721-5348-4ba2-9fac-0fffcabb1078 timeout=15000ms service=1462ms state=completed at=info
source=rack-timeout id=45b4009a-7b73-4652-b7d2-fc2e84178fe1 timeout=15000ms state=ready at=info
source=rack-timeout id=45b4009a-7b73-4652-b7d2-fc2e84178fe1 timeout=15000ms service=1ms state=completed at=info
source=rack-timeout id=a2bd4d83-a7b9-420e-b131-668b5394abc3 timeout=15000ms state=ready at=info
source=rack-timeout id=a2bd4d83-a7b9-420e-b131-668b5394abc3 timeout=15000ms service=1ms state=completed at=info
source=rack-timeout id=1e97365c-5228-4d84-a3b2-c3b401cd06d5 timeout=15000ms state=ready at=info
source=rack-timeout id=1e97365c-5228-4d84-a3b2-c3b401cd06d5 timeout=15000ms service=4ms state=completed at=info
source=rack-timeout id=a2720804-beca-41de-9003-194396ef1a69 timeout=15000ms state=ready at=info
source=rack-timeout id=9f3a171e-3956-47fc-8eed-cd3fe10a51a0 timeout=15000ms state=ready at=info
source=rack-timeout id=a2720804-beca-41de-9003-194396ef1a69 timeout=15000ms service=2ms state=completed at=info
source=rack-timeout id=9f3a171e-3956-47fc-8eed-cd3fe10a51a0 timeout=15000ms service=1ms state=completed at=info
source=rack-timeout id=ad960dbb-d315-4752-8229-fc42159466d8 timeout=15000ms state=ready at=info
source=rack-timeout id=ad960dbb-d315-4752-8229-fc42159466d8 timeout=15000ms service=1ms state=completed at=info
source=rack-timeout id=806abc6b-43c8-4479-ba37-0ad93475f3a3 timeout=15000ms state=ready at=info
/usr/local/bundle/bundler/gems/html2rss-a898923f92b2/lib/html2rss/item_extractors/href.rb:26: warning: redefining constant Struct::Options
/usr/local/bundle/bundler/gems/html2rss-a898923f92b2/lib/html2rss/item_extractors/html.rb:25: warning: redefining constant Struct::Options
/usr/local/bundle/bundler/gems/html2rss-a898923f92b2/lib/html2rss/item_extractors/static.rb:16: warning: redefining constant Struct::Options
/usr/local/bundle/bundler/gems/html2rss-a898923f92b2/lib/html2rss/item_extractors/text.rb:23: warning: redefining constant Struct::Options
source=rack-timeout id=806abc6b-43c8-4479-ba37-0ad93475f3a3 timeout=15000ms service=4094ms state=completed at=info
source=rack-timeout id=2c4a2607-64e9-44ca-9732-bbaace89398b timeout=15000ms state=ready at=info
source=rack-timeout id=2c4a2607-64e9-44ca-9732-bbaace89398b timeout=15000ms service=634ms state=completed at=info
source=rack-timeout id=de17384a-22f5-4aac-8295-f79e2c6c5204 timeout=15000ms state=ready at=info
source=rack-timeout id=de17384a-22f5-4aac-8295-f79e2c6c5204 timeout=15000ms service=364ms state=completed at=info

Finally I found some time… took me a while.

Unfortunately, I can't reproduce the described problem.

I've checked what's generated by html2rss and saved the titles in a file for each request:

curl http://127.0.0.1:5000/apnews.com/hub.rss\?section\=ukraine | pup title | grep --invert-match  title | sort > ukraine.txt

curl http://127.0.0.1:5000/apnews.com/hub.rss\?section\=trending-news | pup title | grep --invert-match  title | sort > trending.txt

With those two files, I checked whether any titles are present in both: read both files, split each by line into an array, and take the intersection of the two arrays:

ruby -e 'puts IO.read($*[0]).to_s.split("\n") & IO.read($*[1]).to_s.split("\n")' ukraine.txt trending.txt

The result was empty.
Had there been an intersection, the next step would have been to check whether html2rss is wrong or apnews links to the same article on both crawled pages.
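The same comparison can be sketched as a small Ruby script, which may be easier to extend than the one-liner. This is only an illustration: the helper names `item_titles` and `shared_titles` and the two sample feeds are made up for the example, and unlike the `pup title` pipeline it only looks at `<item>` titles, not the channel title.

```ruby
require "rexml/document"

# Collect the <title> text of every <item> in an RSS document.
def item_titles(rss_xml)
  doc = REXML::Document.new(rss_xml)
  REXML::XPath.match(doc, "//item/title").map { |node| node.text.to_s.strip }
end

# Titles that appear in both feeds; an empty array means no overlap.
def shared_titles(rss_a, rss_b)
  item_titles(rss_a) & item_titles(rss_b)
end

# Hypothetical sample feeds standing in for the two curl responses.
ukraine = <<~XML
  <rss version="2.0"><channel><title>Hub Ukraine</title>
    <item><title>Story A</title></item>
    <item><title>Story B</title></item>
  </channel></rss>
XML

trending = <<~XML
  <rss version="2.0"><channel><title>Hub Trending News</title>
    <item><title>Story B</title></item>
    <item><title>Story C</title></item>
  </channel></rss>
XML

p shared_titles(ukraine, trending)
```

With real data, the two strings would be the bodies returned by the curl requests above.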

Of course I played around a bit without the intersection approach, but all looked good.

To reproduce the error, I'd need to investigate deeper.

  1. How did you encounter the error?
  2. Did you use curl, or did you compare the two generated RSS feeds in your browser? If the latter, can you test again and, when you get duplicate items, do a hard reload (bypassing the cache) and check whether the problem remains?

Thanks!

Okay, so after some digging into what I was doing a month ago, I figured it out. The currently shipped apnews config works flawlessly. However, I failed to mention that I was extending apnews using my own custom feeds.yml file with my own apnews.com scraper, which I am running through podman run -d --name html2rss-web -v ./config:/app/config:z -p 3000:3000 gilcreator/html2rss-web.

# config/feeds.yml
stylesheets:
  - href: '/rss.xsl'
    media: 'all'
    type: 'text/xsl'
headers:
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; rv:68.0) Gecko/20100101 Firefox/68.0"
feeds:
  apnewshub:
    channel:
      url: https://apnews.com/hub/%<section>s
      language: en
      ttl: 120
      time_zone: UTC
    selectors:
      items:
        selector: ".FeedCard"
      title:
        selector: h2
      link:
        selector: a:first
        extractor: href
      description:
        selector: p
        post_process:
          name: gsub
          pattern: '^COPENHAGEN, Denmark \(AP\) \— '
          replacement: ''
      updated:
        selector: ".Timestamp"
        extractor: attribute
        attribute: data-source
        post_process:
          name: "parse_time"
      author:
        selector: 'span[class^="Component-bylines"]'
        post_process:
          - name: gsub
            pattern: '^By '
            replacement: ''
          - name: gsub
            pattern: '^$'
            replacement: 'APNEWS'

This configuration fails if you go to http://localhost:3000/apnewshub.rss?section=denmark and then quickly thereafter to http://localhost:3000/apnewshub.rss?section=ukraine. It works properly with curl and the Ruby comparison you provided, but it loads the wrong content under Chromium; see the attached pictures below.

[Screenshot: 09 Thu 22:43:24]
[Screenshot: 09 Thu 22:43:11]
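For readers unfamiliar with the `%<section>s` placeholder in the channel URL of the feeds.yml above: it has the shape of a Ruby named format reference. The sketch below shows how such a template expands for two sections. It assumes the substitution behaves like `Kernel#format`; `channel_url` is a hypothetical helper for illustration, not part of the html2rss API.

```ruby
# URL template copied from the channel section of the feeds.yml above.
URL_TEMPLATE = "https://apnews.com/hub/%<section>s"

# Hypothetical helper: expand the template with request parameters.
# Keys are symbolized because format looks up named references by symbol.
def channel_url(params)
  format(URL_TEMPLATE, params.transform_keys(&:to_sym))
end

puts channel_url(section: "denmark")
puts channel_url("section" => "ukraine")
```

Each section value produces a different upstream URL, so two requests that differ only in `section` should never yield the same items unless something between the scraper and the reader reuses a stored response.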

Okay, I found something even odder. I stopped and deleted the container, edited some settings in feeds.yml, and started a brand-new container with the changed settings. Then I went to http://localhost:3000/apnewshub.rss?section=norway under Chromium, a section I had never loaded before, and somehow got back what the previous http://localhost:3000/apnewshub.rss?section=ukraine response should have looked like.

Again, testing with curl gives me the correct, new ukraine section and the correct, new norway section. But Chromium is somehow serving the ukraine response generated from the previous version of the config file for the norway section, which had never been fetched before.

I have a strong feeling this is some really absurd caching issue with Chromium rather than an issue with html2rss-web, but I have no clue how this is even possible.

Forced refreshes (SHIFT+F5) do not seem to help either. They sometimes give me the correct result where a previous one was wrong, but I have also seen section=denmark go from Hub Denmark to Hub Norway when I ran a SHIFT+F5 force refresh.
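One way to gather more evidence for or against the browser-cache theory would be to look at the caching-related response headers for each section URL. The script below is a rough sketch, not a diagnosis: the header list is my own selection, and `cache_related` and `fetch_cache_headers` are hypothetical helper names. It only makes a request when URLs are passed on the command line.

```ruby
require "net/http"
require "uri"

# Response headers that influence how a client stores and reuses a response.
CACHE_HEADERS = %w[cache-control etag last-modified expires vary age].freeze

# Pick the caching-related headers out of a header hash (names are case-insensitive).
def cache_related(headers)
  headers.each_with_object({}) do |(name, value), out|
    key = name.to_s.downcase
    out[key] = Array(value).join(", ") if CACHE_HEADERS.include?(key)
  end
end

# Fetch a URL and return its caching-related headers. Needs a running instance.
def fetch_cache_headers(url)
  response = Net::HTTP.get_response(URI(url))
  cache_related(response.to_hash)
end

# Only touch the network when URLs are given as arguments.
if $PROGRAM_NAME == __FILE__ && ARGV.any?
  ARGV.each { |url| puts "#{url}: #{fetch_cache_headers(url)}" }
end
```

Saved under a name of your choice (for example cache_headers.rb), it could be run with the denmark and ukraine URLs from above as arguments and the output compared with what the Chromium network tab reports for the same requests.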

Wow, thank you for the detailed response. Love it! That's something to work with :-)
Sadly, I can't free up the required time currently to properly address this issue.

If anyone wants to take over, feel free and maybe give me a little ping. Otherwise, I'll gladly take on this issue as soon as I find the time.

Hello @imsofi,
Can you check whether the error still appears after applying the patch I propose in #589?
Maybe #587 is a duplicate and/or a particular case of this issue.

regards,

With #587 being closed and a bugfix merged, I have the feeling this issue likely is also fixed by that. Can you please give it a try with the updated docker image, @imsofi, and report back? Thank you! :)

I no longer use html2rss, so I am afraid I will be unable to test this. But if it does work for others, I would say this issue could be closed. Thank you for your work @mabeett and @gildesmarais! 😄