home-assistant/core

Scrape integration fails to verify SSL certificate

JPorter-02 opened this issue · 5 comments

The problem

When using the Scrape integration with the resource https://archiveofourown.org and "Verify SSL certificate" set to true, I receive the following error:

Logger: homeassistant.components.rest.data
Source: components/rest/data.py:128
integration: rest (documentation, issues)
First occurred: 10:28:29 PM (6 occurrences)
Last logged: 10:55:37 PM

Error connecting to https://archiveofourown.org/works/51298045 failed with [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000)

Setting "Verify SSL certificate" to false or off also breaks the resource and its associated sensors.
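The failing verification can be reproduced outside Home Assistant. This is a minimal sketch (not the integration's actual code path, which goes through the `rest` helper) that performs the same kind of verifying TLS fetch against the reported URL using only the standard library:

```python
import ssl
import urllib.request

URL = "https://archiveofourown.org/works/51298045"  # resource from the report

# A default context verifies the server chain against the local trust store,
# which is roughly what "Verify SSL certificate: on" amounts to.
ctx = ssl.create_default_context()
assert ctx.verify_mode is ssl.CERT_REQUIRED and ctx.check_hostname

try:
    with urllib.request.urlopen(URL, context=ctx, timeout=10) as resp:
        print("handshake OK, HTTP status:", resp.status)
except ssl.SSLCertVerificationError as err:
    # Mirrors the CERTIFICATE_VERIFY_FAILED error in the HA log
    print("verification failed:", err.verify_message)
except OSError as err:
    print("connection problem (not a certificate issue):", err)
```

If this script fails with `unable to get local issuer certificate` on the same machine, the problem is in the Python trust store rather than in the integration itself.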

What version of Home Assistant Core has the issue?

core-2024.10.2

What was the last working version of Home Assistant Core?

core-2024.10.2

What type of installation are you running?

Home Assistant OS

Integration causing the issue

Scrape

Link to integration documentation on our website

https://www.home-assistant.io/integrations/scrape

Diagnostics information

home-assistant_scrape_2024-10-18T04-03-13.736Z.log

Example YAML snippet

No response

Anything in the logs that might be useful for us?

Logger: homeassistant.components.rest.data
Source: components/rest/data.py:128
integration: rest (documentation, issues)
First occurred: 10:28:29 PM (8 occurrences)
Last logged: 11:06:44 PM

Error connecting to https://archiveofourown.org/works/51298045 failed with [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000)

Additional information

The debug file contains a lot of messages from other integrations. I am happy to provide any other information needed to get this resolved. I believe the system root SSL certificates may just need to be updated, but I was unable to do that via the CLI. Based on the Home Assistant history, all sensors attached to this resource stopped reporting on 10/16/24 at or around 5:45 AM.
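Before updating the OS root certificates, it is worth checking which trust store the interpreter actually consults, since containerized installs often ship their own CA bundle (commonly via `certifi`) rather than using the host's. A quick stdlib-only check, run from the same Python environment as Home Assistant (an assumption; this is not an HA command):

```python
import ssl

# Where this Python interpreter looks for CA certificates by default.
# If HA's HTTP stack uses a bundled certifi file instead, updating the
# OS store reported here would not change the integration's behavior.
paths = ssl.get_default_verify_paths()
print("cafile:", paths.cafile)
print("capath:", paths.capath)
print("env overrides:", paths.openssl_cafile_env, paths.openssl_capath_env)
```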

Hey there @fabaff, @gjohansson-ST, mind taking a look at this issue as it has been labeled with an integration (scrape) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of scrape can trigger bot actions by commenting:

  • @home-assistant close Closes the issue.
  • @home-assistant rename Awesome new title Renames the issue.
  • @home-assistant reopen Reopen the issue.
  • @home-assistant unassign scrape Removes the current integration label and assignees on the issue, add the integration domain after the command.
  • @home-assistant add-label needs-more-information Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue.
  • @home-assistant remove-label needs-more-information Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


scrape documentation
scrape source
(message by IssueLinks)

You could always try to connect to the page via the CLI, so we could start there to find out whether it's an OS certificate problem or something in the libraries used by this integration.

Sorry for the delay in getting back to this. I successfully scraped the URL using curl, and everything works as it should. Here is the output of `curl -v` from Windows PowerShell in case it helps. One thing to note: the "You are being redirected." at the end is expected.

```
curl -v https://archiveofourown.org/works/51298045

* Host archiveofourown.org:443 was resolved.
* IPv6: (none)
* IPv4: 104.20.28.24, 104.20.29.24
*   Trying 104.20.28.24:443...
* Connected to archiveofourown.org (104.20.28.24) port 443
* schannel: disabled automatic use of client certificate
* ALPN: curl offers http/1.1
* ALPN: server accepted http/1.1
* using HTTP/1.x
> GET /works/51298045 HTTP/1.1
> Host: archiveofourown.org
> User-Agent: curl/8.9.1
> Accept: */*
>
* Request completely sent off
* schannel: remote party requests renegotiation
* schannel: renegotiating SSL/TLS connection
* schannel: SSL/TLS connection renegotiated
< HTTP/1.1 302 Found
< Date: Mon, 21 Oct 2024 16:42:21 GMT
< Content-Type: text/html; charset=utf-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< Location: /works/51298045/chapters/129614635
< CF-Ray: 8d62b34cb9dcbf9d-ATL
< CF-Cache-Status: DYNAMIC
< Cache-Control: no-cache
< Set-Cookie: view_adult=true; path=/; SameSite=Lax
< content-security-policy: frame-ancestors 'self'
< potential_upstream: unicorn_bots
< referrer-policy: strict-origin-when-cross-origin
< x-ao3-priority: 0
< x-aooo-debug1: Archive Unicorn
< x-clacks-overhead: GNU Terry Pratchett
< x-content-type-options: nosniff
< x-download-options: noopen
< x-frame-options: SAMEORIGIN
< x-hostname: ao3-front10
< x-permitted-cross-domain-policies: none
< x-request-id: c9a89c0a-fd93-4b49-864d-651ebb84e568
< x-runtime: 0.025693
< x-sentry-rate: 0.01
< x-xss-protection: 1; mode=block
< Set-Cookie: _otwarchive_session=eyJfcmFpbHMiOnsibWVzc2FnZSI6ImV5SnpaWE56YVc5dVgybGtJam9pTnpoalpqTmpaREE1WldVMVpUTmlNakE0TW1Ga1lXWTJOVGt6WVdVME1tRWlMQ0p5WlhSMWNtNWZkRzhpT2lJdmQyOXlhM012TlRFeU9UZ3dORFUvZG1sbGQxOWhaSFZzZEQxMGNuVmxJbjA9IiwiZXhwIjoiMjAyNC0xMS0wNFQxNjo0MjoyMS42MTFaIiwicHVyIjoiY29va2llLl9vdHdhcmNoaXZlX3Nlc3Npb24ifX0%3D--beed74d07fc1dd6a7ab2fd507894080f87a8fd42; path=/; expires=Mon, 04 Nov 2024 16:42:21 GMT; HttpOnly; SameSite=Lax
< Set-Cookie: __cf_bm=BoK5IowXr8FptqQC6mSjiPwjKBNypObnj.Qa9DjQzDM-1729528941-1.0.1.1-t2U.HywKpsJjWLam_G1aXx0qu.uNt80oOFsKd6bmh47khXwsekQ_SHAOnmhLvFL0Cc8TNQl5ggrkcgzrZZNxUg; path=/; expires=Mon, 21-Oct-24 17:12:21 GMT; domain=.archiveofourown.org; HttpOnly; Secure; SameSite=None
< Set-Cookie: _cfuvid=j9Et.2xCXLYkM8RSHU6OlfFvHxy8c49hogDLwf1kGv4-1729528941632-0.0.1.1-604800000; path=/; domain=.archiveofourown.org; HttpOnly; Secure; SameSite=None
< Server: cloudflare
< alt-svc: h3=":443"; ma=86400
<
You are being redirected.
* Connection #0 to host archiveofourown.org left intact
```

I also tried the curl command in the HA Terminal add-on, successfully connected to the site, and was able to retrieve the stats I am attempting to scrape.
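One caveat about these curl tests: they do not necessarily exercise the same trust store as Home Assistant's Python process. The PowerShell output above shows `schannel`, meaning curl on Windows verified against the Windows certificate store, and curl in the Terminal add-on uses that container's own CA bundle. A sketch of an equivalent check through Python's trust store (run inside the HA Python environment; the host and port are from this report):

```python
import socket
import ssl

HOST = "archiveofourown.org"

# Perform a verifying TLS handshake using Python's default trust store.
# A curl success via schannel does not prove this chain can be built here.
ctx = ssl.create_default_context()
try:
    with socket.create_connection((HOST, 443), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
            # getpeercert() returns RDN tuples; flatten to a dict for display
            issuer = dict(rdn[0] for rdn in tls.getpeercert()["issuer"])
            print("chain verified, issuer:", issuer.get("organizationName"))
except ssl.SSLCertVerificationError as err:
    print("verification failed:", err.verify_message)
except OSError as err:
    print("connection problem:", err)
```

If this fails where curl succeeds, the missing issuer certificate is in Python's bundle rather than in the OS stores that curl consulted.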