Persine is an automated tool to study and reverse-engineer algorithmic recommendation systems. It has a simple interface and encourages reproducible results. You tell Persine to drive around YouTube and it gives back a spreadsheet of what else YouTube suggests you watch!
Persine => Pers[ona Eng]ine
People have suggested that if you watch a few lightly political videos, YouTube starts suggesting more and more extreme content – but does it really?
The theory is difficult to test since it involves a lot of boring clicking and YouTube already knows what you usually watch. Persine to the rescue!
- Persine starts a new fresh-as-snow Chrome
- You provide a list of videos to watch and buttons to click (like, dislike, "next up" etc)
- As it watches and clicks more and more, YouTube customizes and customizes
- When you're all done, Persine will save your winding path and the video/playlist/channel recommendations to nice neat CSV files.
Beyond analysis, these files can be used to repeat the experiment again later, seeing if recommendations change by time, location, user history, etc.
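One way to compare two saved runs is to measure how much their recommendations overlap. The sketch below uses only the Python standard library. The column name `url` and the file names in the usage comment are assumptions, so check the header of your own CSV and adjust.

```python
import csv


def recommended_urls(csv_path, column="url"):
    """Collect the distinct values of one column from a saved recommendations CSV."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return {row[column] for row in csv.DictReader(f) if row.get(column)}


def overlap(first, second):
    """Jaccard similarity of two sets: 1.0 means identical, 0.0 means nothing shared."""
    if not first and not second:
        return 1.0
    return len(first & second) / len(first | second)


# Usage (file names are placeholders):
# before = recommended_urls("recs_first_run.csv")
# after = recommended_urls("recs_second_run.csv")
# print(f"overlap: {overlap(before, after):.2f}")
# print("new in second run:", sorted(after - before))
```

A low overlap between two otherwise identical runs suggests the recommendations shifted with time, location, or history.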
If you didn't quite get enough data, don't worry – you can resume your exploration later, picking up right where you left off. Since each "persona" is based on Chrome profiles, all your cookies and history will be safely stored until your next run.
See Persine in action on Google Colab. Includes a few examples for analysis, too.
pip install persine
Persine will automatically install Selenium and BeautifulSoup for browsing/scraping, pandas for data analysis, and pillow for processing screenshots.
You will need to install chromedriver to allow Selenium to control Chrome. Persine won't work without it!
- Installing chromedriver on OS X: I hear you can install it using homebrew, but I've never done it! You can also follow the link above and click the "latest stable release" link, then download `chromedriver_mac64.zip`. Unzip it, then move the `chromedriver` file into your `PATH`. I typically put it in `/usr/local/bin`.
- Installing chromedriver on Windows: Follow the link above, click the "latest stable release" link. Download `chromedriver_win32.zip`, unzip it, and move `chromedriver.exe` into your `PATH` (in the spirit of anarchy I just put it in `C:\Windows`).
- Installing chromedriver on Debian/Ubuntu: Just run `apt install chromium-chromedriver` and it'll work.
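Before the first run, it can help to confirm that chromedriver is actually discoverable. A minimal sketch using only the Python standard library; the helper name `find_chromedriver` is made up for this example and is not part of Persine.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path to the chromedriver executable, or None if it is not on PATH."""
    return shutil.which(name)


path = find_chromedriver()
if path is None:
    print("chromedriver was not found on your PATH")
else:
    print(f"chromedriver found at {path}")
```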
In this example, we start a new session by visiting a YouTube video and clicking the "next up" video three times to see where it leads us. We then save the results for later analysis.
from persine import PersonaEngine
engine = PersonaEngine(headless=False)
with engine.persona() as persona:
    persona.run("https://www.youtube.com/watch?v=hZw23sWlyG0")
    persona.run("youtube:next_up#3")
    persona.history.to_csv("history.csv")
    persona.recommendations.to_csv("recs.csv")
We turn off headless mode because it's fun to watch!
Persine is built around an engine that stores all of your global settings, and personas that represent the individual users who browse the web.
Personas are always generated by an engine.
from persine import PersonaEngine
engine = PersonaEngine()
persona = engine.persona()
By default, personas are single-use and their browsing history will be discarded after your script is run. If you give them a name, though, they'll save their browsing/recommendation history so you can resume them later.
persona = engine.persona('Mulberry')
This is useful in conjunction with signing in to YouTube (see below), allowing you to imitate a real user watching videos over multiple sessions.
You can use `with` to automatically start/stop Chrome. Makes life easy.
with engine.persona() as persona:
    persona.run("https://www.youtube.com/watch?v=hZw23sWlyG0")
    persona.run("youtube:next_up#3")
If you prefer more control or to visit sites one-by-one, you can manually call `.quit()` when you're done.
persona.run("https://www.youtube.com/watch?v=hZw23sWlyG0")
persona.run("youtube:next_up#3")
# Quit Chrome
persona.quit()
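When you manage the browser yourself, a failed command can leave Chrome running. One way to guard against that is a `try`/`finally` wrapper. This is a sketch, not part of Persine: `run_commands` is a hypothetical helper, and it relies only on the `.run()` and `.quit()` methods used in the examples here.

```python
def run_commands(persona, commands):
    """Run each command in order and always quit the browser afterwards."""
    completed = []
    try:
        for command in commands:
            persona.run(command)
            completed.append(command)
    finally:
        # Runs whether or not a command raised, so Chrome is not left open.
        persona.quit()
    return completed


# Usage, assuming `persona` was created from an engine:
# run_commands(persona, [
#     "https://www.youtube.com/watch?v=hZw23sWlyG0",
#     "youtube:next_up#3",
# ])
```

This gives roughly the same cleanup guarantee as the `with` form while still letting you decide when the persona is created.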
We can turn headless mode off or on depending on whether we want to actually watch what Chrome is up to. When running in non-headless mode, Persine automatically installs uBlock Origin so you don't have to deal with ads.
engine = PersonaEngine(headless=False)
Headless mode doesn't support extensions, so by default our invisible Chrome is unfortunately watching ads. We should probably switch to Firefox but it has its own problems.
History is all of your commands you've run and the pages you've visited, while recommendations are what you've been recommended. Recommendations include video sidebars, homepage listings, and search results.
Right now recommendations also include ads and unrelated promoted content. I'm on the fence about whether they should stay or go.
For convenience, you can use `.to_df()` to see history and recommendations as pandas DataFrames.
persona.recommendations.to_df()
persona.history.to_df()
If you'd prefer to do your analysis elsewhere, you can save them to CSV files.
persona.recommendations.to_csv('recs.csv')
persona.history.to_csv('hist.csv')
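Once saved, the CSV can be summarized without pandas as well. A minimal sketch that counts which items were recommended most often, using only the standard library; the column name `url` is an assumption, so adjust it to match your file's header.

```python
import csv
from collections import Counter


def top_recommendations(csv_path, column="url", n=10):
    """Return the n most frequent values of `column`, each paired with its count."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        counts = Counter(row[column] for row in csv.DictReader(f) if row.get(column))
    return counts.most_common(n)


# Usage:
# for value, count in top_recommendations("recs.csv"):
#     print(count, value)
```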
Bridges are site-specific scrapers that tell Persine what to click, what to scrape, and other site-specific commands. Right now the only completed bridge we have is for YouTube, while an Amazon one is in the works.
The YouTube bridge supports the following custom commands:
command | action |
---|---|
`youtube:homepage` | Visits youtube.com |
`youtube:search?SEARCHTERM` | Searches YouTube for the specified term |
`youtube:next_up` | When on a video page, clicks the "next up" video |
`youtube:like` | Clicks the like button |
`youtube:dislike` | Clicks the dislike button |
`youtube:subscribe` | Clicks the subscribe button |
`youtube:unsubscribe` | Clicks the unsubscribe button |
`youtube:sign_in` | Begins the sign-in process. You'll need to complete the process manually, but Persine will resume as soon as it notices you're logged in. |
If you'd like to repeat a command multiple times, you can append `#[NUMBER]` to it. For example, `youtube:next_up#50` will watch the next fifty "next up" videos.
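Since commands are plain strings, a run can be scripted by assembling them in a list. The helpers below are a sketch and not part of Persine: `search` and `repeat` are hypothetical names that only build the string formats described above.

```python
def search(term):
    """Build a YouTube search command for the given term."""
    return f"youtube:search?{term}"


def repeat(command, times):
    """Append the #[NUMBER] suffix so the command is repeated `times` times."""
    if times < 1:
        raise ValueError("times must be at least 1")
    return f"{command}#{times}"


script = [
    "youtube:homepage",
    search("cats"),
    repeat("youtube:next_up", 50),
    "youtube:like",
]

for command in script:
    # With a live persona, this line would be persona.run(command).
    print(command)
```

Keeping the script as data makes it easy to save alongside your CSV files and replay the same steps with a different persona.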