Before you continue to YouTube - Cookie consent

Question

milosb793 opened this issue 4 years ago · 9 comments

Hello there,

I'm facing an issue with Youtube consent, getting the message:

The page did not load elements! If you've scraped many channels within a short period of time, please try rerunning the program after waiting to make sure YouTube isn't throttling your IP address! For further debugging, this was the exact error message (might also be blank):
Message: 

Traceback (most recent call last):
  File "venv/lib/python3.8/site-packages/yt_videos_list/execute.py", line 142, in logic
    wait.until(EC.element_to_be_clickable((By.XPATH, '//yt-formatted-string[@class="style-scope ytd-channel-name"]')))
  File "venv/lib/python3.8/site-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:

I'm running this piece of code, which works great on my local machine:

from yt_videos_list import ListCreator

my_driver = 'firefox'

lc = ListCreator(csv=True,
                 md=False,
                 txt=False,
                 headless=True,
                 driver=my_driver,
                 scroll_pause_time=1,
                 reverse_chronological=True)

print(lc.create_list_for("https://www.youtube.com/channel/<channel id>", True))

but on the server, it fails. After a lot of debugging, I found that it gets redirected to the "Before you continue to YouTube" page. I reproduced this with the following sample, which simulates the code from the create_list_for function:

import selenium
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support   import expected_conditions as EC
from selenium.webdriver.common.by import By

url = "https://www.youtube.com/channel/<channel id>/videos"

options = Options()
options.headless = True

driver = webdriver.Firefox(options=options)

driver.get(url)
driver.set_window_size(780, 800)
driver.set_window_position(0, 0)
wait = selenium.webdriver.support.ui.WebDriverWait(driver, 9)

print(driver.title)

wait.until(EC.element_to_be_clickable((By.XPATH, '//yt-formatted-string[@class="style-scope ytd-channel-name"]')))

print("Done")

and the output is "Before you continue to YouTube", followed by the same error as above.

Is there any way to handle this case, or am I doing something wrong?

Answer 1 · 2021-05-05T09:28:46.000Z

Hey milosb793, thanks for filing this issue!

Don't worry, you aren't doing anything wrong. 🙂 This is a new problem associated with YouTube's privacy compliant tracking rollout that requires users to indicate how they want to be tracked, and I'll provide some workarounds below on how to get the program running again. Also note, I'll make a future release (that'll probably incorporate the changes I suggest below) to enable the yt_videos_list program to handle the consent form automatically (or avoid it altogether if run with the user profile), so I'll add those changes when I get the chance to test everything properly.

A simple workaround would be to include a check like the following to see if YouTube is asking for cookie consent and accept the form if it does:

if 'consent.youtube.com' in driver.current_url:
    # find_element(By.XPATH, ...) works in both selenium 3 and selenium 4
    driver.find_element(By.XPATH, '//button[@aria-label="Agree to the use of cookies and other data for the purposes described"]').click()

before this wait.until line in dev/logic.py (and your sample code):

wait.until(EC.element_to_be_clickable((By.XPATH, '//yt-formatted-string[@class="style-scope ytd-channel-name"]')))
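
A substring check on the full URL also matches when consent.youtube.com only appears inside a query parameter, so a slightly stricter variant compares the hostname instead. A minimal sketch, assuming a hypothetical helper named is_consent_page (it is not part of yt_videos_list) that only uses the standard library:

```python
from urllib.parse import urlsplit

def is_consent_page(url):
    """Return True when the URL's host is consent.youtube.com."""
    # urlsplit().hostname is lowercased and excludes any port or credentials;
    # it is None for URLs without a network location (e.g. "about:blank")
    return urlsplit(url).hostname == "consent.youtube.com"
```

In the check above, this would be used as `if is_consent_page(driver.current_url):` in place of the substring test.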

To see if these changes work, do the following:

git clone git@github.com:slow-but-steady/yt-videos-list.git

cd yt-videos-list
cd python

# make the changes you want in dev/logic.py (more details below)
# make sure you're still in the yt-videos-list/python/ path, then

# run minifier.py to bundle code from yt-videos-list/python/dev/
# into the yt-videos-list/python/yt_videos_list directory
python3 minifier.py   # macOS/Linux
python  minifier.py   # Windows

# install the changes you made locally with
pip3 install .   # macOS/Linux
pip  install .   # Windows
# NOTE the dot after "install" is required!

When making changes, you might also need to add some sleep timers to wait for the page to load before/after agreeing to the cookies, so the code in logic.py might look something like the following after you make changes:

            driver.get(url)
            driver.set_window_size(780, 800)
            driver.set_window_position(0, 0)
            wait = selenium.webdriver.support.ui.WebDriverWait(driver, 9)
            try:
                # might need a sleep timer here to wait for the consent page to load
                # time.sleep(3)
                if 'consent.youtube.com' in driver.current_url: # THIS IS THE CHECK
                    driver.find_element(By.XPATH, '//button[@aria-label="Agree to the use of cookies and other data for the purposes described"]').click() # THIS ACCEPTS THE COOKIE CONSENT FORM
                wait.until(EC.element_to_be_clickable((By.XPATH, '//yt-formatted-string[@class="style-scope ytd-channel-name"]')))
            except selenium.common.exceptions.TimeoutException as error_message:
...
... # rest of code probably unchanged
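
If a fixed time.sleep(3) feels wasteful, the same check can poll for the redirect and stop as soon as the consent page shows up. A rough sketch, assuming a hypothetical helper named accept_consent_if_shown (not part of yt_videos_list). It relies only on driver.current_url and driver.find_element, and passes the locator strategy as the string "xpath", which is the value of selenium's By.XPATH:

```python
import time

CONSENT_HOST = "consent.youtube.com"
AGREE_XPATH = ('//button[@aria-label="Agree to the use of cookies '
               'and other data for the purposes described"]')

def accept_consent_if_shown(driver, timeout=3.0, poll_interval=0.25):
    """Click the agree button if the consent page shows up within timeout.

    Returns True if the button was clicked, False if the consent page
    never appeared.
    """
    deadline = time.monotonic() + timeout
    while True:
        if CONSENT_HOST in driver.current_url:
            # "xpath" is the string value of selenium's By.XPATH
            driver.find_element("xpath", AGREE_XPATH).click()
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

Calling accept_consent_if_shown(driver) right before the wait.until line would replace both the sleep timer and the if check. When no consent page appears, the helper still waits out the full timeout, so it is worth keeping that value small.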

After you make the changes you want in /dev/logic.py, make sure to run minifier.py with python3 minifier.py (python minifier.py on Windows) and install the changes with pip3 install . (pip install . on Windows), then run yt_videos_list on a YouTube channel you want to scrape to see if the changes worked.

Another workaround you can use involves setting your user profile for the driver (firefox, opera, chrome, etc.) as mentioned in discussion #14 Problem with cookies.

Note that you shouldn't face the problem I described in this comment following commit d90c29f, so doing what sirodus describes in this comment (I explained what the code there is doing in this comment from the thread above) should be as simple as going to the configure_{SPECIFIC}driver() function for the specific driver you're working with (firefox, opera, chrome, etc.) in dev/logic.py and adding your personal user profile for the browser you're using.

Using the sample code you provided as an example, this would look something like:

import selenium
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support   import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile   # NOTE this new import!

url = "https://www.youtube.com/channel/<channel id>/videos"

options = Options()
options.headless = True

# NOTE: keep ONLY the profile line for your operating system uncommented,
# since FirefoxProfile() fails if the directory it is given does not exist

# setting FirefoxProfile on Windows:
profile = FirefoxProfile('C:\\Users\\USERNAME\\AppData\\Roaming\\Mozilla\\Firefox\\Profiles\\CHARACTERS.EXTENSION')

# setting FirefoxProfile on macOS:
# profile = FirefoxProfile('/Users/USERNAME/Library/Application Support/Firefox/Profiles/CHARACTERS.EXTENSION')

# setting FirefoxProfile on Linux:
# profile = FirefoxProfile('/home/USERNAME/.mozilla/firefox/CHARACTERS.EXTENSION/')

### NOTE: you might have multiple profiles, so you'll need to check them ###
### individually to figure out which directory actually corresponds to the ###
### actual user profile for your browser - the most recently modified ###
### directory is probably the one, but this isn't guaranteed ###

# NOTE: the following does NOT launch selenium in headless mode,
# since checking whether the FirefoxProfile you set in profile is the
# right one is difficult to do when the browser is invisible :)
### also NOTE: launching the selenium driver with the user profile is kind of slow ###
driver = webdriver.Firefox(firefox_profile=profile)

# to launch in headless mode once you figure out the FirefoxProfile
# path, comment the line above and uncomment the line below:
# driver = webdriver.Firefox(firefox_profile=profile, options=options)

driver.get(url)
driver.set_window_size(780, 800)
driver.set_window_position(0, 0)
wait = selenium.webdriver.support.ui.WebDriverWait(driver, 9)

print(driver.title)

wait.until(EC.element_to_be_clickable((By.XPATH, '//yt-formatted-string[@class="style-scope ytd-channel-name"]')))

print("Done")

Once you verify the user profile works using some test code as above, you can try adding these changes to the configure_{SPECIFIC}driver() function in /dev/logic.py, run python3 minifier.py (python minifier.py on Windows) again, install the local changes with pip3 install . (pip install . on Windows), then run yt_videos_list on a channel you want to scrape. If you're using firefox, you would add these changes under the configure_firefoxdriver() function.
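
The note in the sample above about the most recently modified profile directory can be automated as a first guess. A minimal sketch, assuming a hypothetical helper named newest_profile_dir (not part of yt_videos_list or selenium); as that note says, the newest directory is probably the right one, but this isn't guaranteed:

```python
from pathlib import Path

def newest_profile_dir(profiles_root):
    """Return the most recently modified subdirectory of profiles_root, or None."""
    # only directories are candidates; files such as profiles.ini are skipped
    candidates = [p for p in Path(profiles_root).iterdir() if p.is_dir()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

On macOS, for example, newest_profile_dir('/Users/USERNAME/Library/Application Support/Firefox/Profiles') returns a Path that can be converted with str() and passed to FirefoxProfile().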

Also keep in mind, the exact changes you need to make selenium use your personal browser settings (via the user profile) instead of the empty profile selenium uses by default vary based on which driver/browser (firefox, opera, chrome, etc.) you use, so here are some references:

https://chromium.googlesource.com/chromium/src/+/master/docs/user_data_dir.md
https://www.guru99.com/firefox-profile-selenium-webdriver.html
https://stackoverflow.com/questions/45521012/how-to-start-firefox-with-with-specific-profile-selenium-python-geckodriver
https://stackoverflow.com/questions/50321278/how-to-load-firefox-profile-with-python-selenium
https://stackoverflow.com/questions/55130791/how-to-enable-built-in-vpn-in-operadriver (shows how to use the Opera user profile with webdriver.ChromeOptions() for webdriver.Opera())

If you have any questions or something doesn't work properly, please add to this thread below! 🙂 Also if you have any suggestions for any other additions/modifications, feel free to include that as well. One thing I can think of that sounds like a good idea would be to opt out of all cookies if the consent.youtube.com page comes up, but this might cause problems since agreeing to the cookies is easy, but opting out takes you to a different page where you need to click more options and then submit the form.

The issue probably wouldn't be with clicking the boxes, but rather with the timing (if the page to opt out of cookies takes a long time to load, or if redirecting to the channel after opting out takes a long time). Do you think it might be useful to add this (opt out of cookies) option to yt_videos_list as well?

Answer 2 · 2021-05-08T19:24:04.000Z

Man, THANK YOU SO MUCH for all this effort! Really appreciate it!

I still haven't had enough time to test the given solutions, but I'll definitely post the feedback once I test it.

Answer 3 · 2021-05-17T01:12:53.000Z

Added support for the program to block cookies/accept cookies in Release v0.5.7. You should be able to download these changes and run the program with the updated code using

pip3 install -U yt-videos-list   # macOS/Linux
pip  install -U yt-videos-list   # Windows

# run your yt-videos-list code as you normally do
# ListCreator is instantiated with cookie_consent=False by default (blocks cookies)
# so you shouldn't need to modify anything to get this functionality,
# but if you want to accept cookies, you'd need to add the cookie_consent argument to the instantiation:
# lc = ListCreator(cookie_consent=True)

Let me know if this works, or if you have any problems with anything.

I'll work on adding support to enable the program to use your user profile to allow selenium to run with your personal browser settings instead of the empty profile selenium uses by default next!

Answer 4 · 2021-05-17T20:18:38.000Z

Hi slow-but-steady,
I tested the feature with cookie_consent=True (False was tested too), but the consent page is still shown, and you must click the "I Agree" button.
lc = ListCreator(cookie_consent=True, driver=firefox, scroll_pause_time=0.8, headless=False, csv=False, md=False)

Thanks in advance

This is the error after the timeout (if the button is clicked all works fine):

===>ERROR!<===
The page did not load elements! If you've scraped many channels within a short period of time, please try rerunning the program after waiting to make sure YouTube isn't throttling your IP address! For further debugging, this was the exact error message (might also be blank):
Message:

Traceback (most recent call last):
  File "/home/pi/.local/lib/python3.7/site-packages/yt_videos_list/logic.py", line 126, in run_scraper
    wait.until(EC.element_to_be_clickable((By.XPATH, '//yt-formatted-string[@class="style-scope ytd-channel-name"]')))
  File "/usr/local/lib/python3.7/dist-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:

Answer 5 · 2021-05-17T20:47:50.000Z

Hi slow-but-steady,
I added the if statement that you showed previously after line 126 of the logic.py file:
...snip
try:
    if 'consent.youtube.com' in driver.current_url: # THIS IS THE CHECK
        driver.find_element_by_xpath('//button[@aria-label="Agree to the use of cookies and other data for the purposes described"]').click() # THIS ACCEPTS THE COOKIE CONSENT FORM
    wait.until(EC.element_to_be_clickable((By.XPATH, '//yt-formatted-string[@class="style-scope ytd-channel-name"]')))

except selenium.common.exceptions.TimeoutException as error_message:
...snip

I tested it with headless true and false.
It seems to work OK.
Thank you very much

Answer 6 · 2021-05-25T03:44:13.000Z

Hi asiergda,

Thanks so much for writing up the error and the workaround! I looked into the problem using the information you provided, and (hopefully) fixed the issues with Release 0.5.8. The linked release page references specific commits with a more comprehensive explanation of the problem and the fix, but here's a short summary of the relevant problems:

- the create_list_for() method for ListCreator passed in cookie_consent as the last argument to logic.execute() in release 0.5.7, but the execute() function expected the last argument to be _execution_type
  - since the argument order passed into execute() was incorrect, the program did not correctly block or accept cookies using the cookie_consent boolean attribute as intended (see commit cd65c5c for more details)
- commit 0d4d218 incorrectly provided the error message as a string return value instead of printing it (and also started printing a log message instead of logging it); see commit b410814 for the fix. As a result, the error output you saw was not as descriptive as it should have been; i.e. you saw

===>ERROR!<===
The page did not load elements! If you've scraped many channels within a short period of time, please try rerunning the program after waiting to make sure YouTube isn't throttling your IP address! For further debugging, this was the exact error message (might also be blank):
Message:

Traceback (most recent call last):
  File "/home/pi/.local/lib/python3.7/site-packages/yt_videos_list/logic.py", line 126, in run_scraper
    wait.until(EC.element_to_be_clickable((By.XPATH, '//yt-formatted-string[@class="style-scope ytd-channel-name"]')))
  File "/usr/local/lib/python3.7/dist-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:

instead of

### start of missing message ###
YouTube is redirecting to consent.youtube.com, but you entered an invalid argument for the cookie_consent instance attribute!
Please use cookie_consent=True or cookie_consent=False.
Your current setting is: cookie_consent={cookie_consent}     # this line also would have helped debug the cookie_consent/_execution_type argument mix up
### end of missing message ###

===>ERROR!<===
The page did not load elements! If you've scraped many channels within a short period of time, please try rerunning the program after waiting to make sure YouTube isn't throttling your IP address! For further debugging, this was the exact error message (might also be blank):
Message:

Traceback (most recent call last):
  File "/home/pi/.local/lib/python3.7/site-packages/yt_videos_list/logic.py", line 126, in run_scraper
    wait.until(EC.element_to_be_clickable((By.XPATH, '//yt-formatted-string[@class="style-scope ytd-channel-name"]')))
  File "/usr/local/lib/python3.7/dist-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:

Hopefully this fixed the problem, but let me know if it didn't!
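
As an aside, the argument-order mix-up described above (cookie_consent passed where _execution_type was expected) is the kind of bug that keyword-only parameters can rule out. A minimal sketch with a hypothetical, simplified execute() signature; this is NOT the real signature of logic.execute() in yt_videos_list, just an illustration of the idea:

```python
def execute(url, *, cookie_consent, execution_type):
    """Hypothetical, simplified stand-in for logic.execute()."""
    # everything after the bare * must be passed by name, so a call that
    # passes these values positionally raises TypeError instead of silently
    # binding them to the wrong parameters
    if not isinstance(cookie_consent, bool):
        raise ValueError(f"cookie_consent must be True or False, got {cookie_consent!r}")
    return {"url": url, "cookie_consent": cookie_consent, "execution_type": execution_type}
```

With this shape, execute(url, cookie_consent=False, execution_type="module") works regardless of the order the keywords are written in, while execute(url, "module", False) fails immediately with a TypeError.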

Also, as mentioned earlier in the thread above, I'll eventually add an argument to allow the program to use your personal browser settings via the user profile instead of the empty profile selenium uses by default, so that should be available after I test and (properly 😅) verify everything works!

Answer 7 · 2021-05-28T14:22:01.000Z

Hi slow-but-steady,
All the kudos for your work, thank you very much.

Answer 8 · 2023-05-30T01:34:46.000Z

I think I planned to leave this open for a while after release v0.5.8 to make sure there were no further bugs, but forgot to come back and close this. 😂

Summary: the logic to handle the cookie consent was added in release v0.5.7, and release v0.5.8 fixed a bug in the cookie consent handling logic from release v0.5.7.

There has not been any activity on this issue for more than 2 years now, so I'll close this issue a week after posting this comment if there are no objections/further comments.

Answer 9 · 2023-06-08T06:47:43.000Z

Closing this issue since it was addressed with fixes 2+ years ago, and it looks like no further related problems have come up since then.

Please reopen if something causes problems for this (or something related to this issue) again!