- sqlite3
- ruby (3.2.2, see .ruby-version)
- bundler
./run
Install and run: ./run
If you've already installed dependencies, run ./bin/scraper directly.
After checking out the repo, run bin/setup to install dependencies. Then, run
rake spec to run the tests. You can also run bin/console for an interactive
prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install.
I searched GitHub for web scraping or headless browser libraries for Ruby, and
ultimately went with a library called "Ferrum" because it is not a wrapper
around Selenium and uses CDP (the Chrome DevTools Protocol) directly. That lets
us run arbitrary JS, which should bypass fingerprint detection and allows
reading any network request and any element on the page. Additionally, it can
run Chrome in "headed" (non-headless) mode, which should help avoid bot
detection because an actual browser has an actual fingerprint: there are no
stubbed JavaScript values or headers. For example, navigator.webdriver will
always return false, and there will be an actual screen resolution and a real
user agent. (We may not want to overwrite the user agent header with a headed
browser, since that may produce an abnormal fingerprint that doesn't match the
browser.) Headers should all match those of a real user, and if specific URLs
are to be visited directly, we must set the "Referer" header to the previous
page so the request mimics user behavior. If we really wanted to have a
variable fingerprint, we could rotate the version of Chrome that Ferrum uses.
Even better, we could try rotating browsers.
Once we have a legitimate fingerprint, we should focus on JavaScript-based bot
detection. To defeat this detection we could emulate real user mouse/keyboard
behavior, send random inputs, and use different browser window sizes.
Bot/scraper detectors try to determine if user behavior is "human" or not: for
example, a scraper may lack the "random" mouse movements that we sometimes make
when reading a website, or it may interact with elements on a page too quickly
and too rigidly, or it may scroll down too fast. Essentially we would want the
scraper to send different data to the bot detection code each time.
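As a minimal sketch of "sending different data each time", the snippet below picks a window size and a set of pauses per session. The helper name, the list of window sizes, and the pause bounds are all assumptions made for illustration; they are not taken from the scraper's code.

```ruby
# Hypothetical values: a few common desktop resolutions, not measured data.
COMMON_WINDOW_SIZES = [
  [1366, 768],
  [1440, 900],
  [1536, 864],
  [1920, 1080]
].freeze

# Picks a window size and a list of pauses (in seconds) for one session.
# Passing a seeded Random makes a session reproducible for debugging.
def session_profile(rng: Random.new, actions: 5, min_pause: 0.4, max_pause: 2.5)
  width, height = COMMON_WINDOW_SIZES.sample(random: rng)
  pauses = Array.new(actions) { rng.rand(min_pause..max_pause).round(2) }
  { window_size: [width, height], pauses: pauses }
end

profile = session_profile
puts "window: #{profile[:window_size].join('x')}"
profile[:pauses].each_with_index do |pause, i|
  puts "action #{i + 1}: wait #{pause}s first"
end
```

With Ferrum, the chosen size could probably be passed as the window_size option to Ferrum::Browser.new alongside headless: false, and each pause applied with sleep before the next interaction.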
Bot detectors also rely on browser fingerprinting: this usually involves checking the browser type, browser version, IP address, geolocation data, WebGL information, screen resolution, DOM dimensions (can be determined using an iframe), and available fonts, among many other things. On mobile devices, other information like gyroscope and sensor data may be available.
On sephora.com, heavily obfuscated JavaScript files are loaded, including
this one, which contains a variable named "bmak". In the JavaScript console,
you can evaluate bmak, and it matches the "bmak" object that this example
Akamai bypass stubs. This indicates that sephora.com uses Akamai's Bot Manager.
Looking at the bypass code, you can see that Akamai heavily obfuscates its
code, which collects a very thorough browser fingerprint.
However, because this scraper uses a real browser, we do not need to manually overwrite any javascript variables loaded by the page. Bypasses can easily become outdated as the source changes, and they rely on de-obfuscation, which may not be possible in some cases.
Sephora also uses Akamai Image Manager to prevent direct access to its product images. We can see this in the response headers of a product image request:
{
"date": "Tue, 04 Jul 2023 13:55:14 GMT",
"strict-transport-security": "max-age=31536000",
"last-modified": "Wed, 14 Jun 2023 02:22:45 GMT",
"server": "Akamai Image Manager",
"content-type": "image/webp",
"cache-control": "no-transform, max-age=21600",
"server-timing": "cdn-cache; desc=HIT, edge; dur=1, ak_p; desc=\"469021_388971212_846115625_7249_22446_42_0_-\";dur=1",
"content-length": "5160",
"expires": "Tue, 04 Jul 2023 19:55:14 GMT"
}
If you try accessing an image URL directly via curl, you are met with an "Access
Denied" message, similar to when you access https://www.sephora.com via curl.
If you are using a web browser and have already visited sephora.com, you should
have the right cookies to access the image URL directly.
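The header check described above can be sketched in a few lines. The method name akamai_image_manager? is hypothetical, and the sample is a subset of the response headers shown earlier; how the headers are captured (for example from a network response in the browser) is left out.

```ruby
require "json"

# Returns true when the response headers name Akamai Image Manager as the server.
# Header names are compared case-insensitively, since HTTP header names are
# not case-sensitive.
def akamai_image_manager?(headers)
  server = headers.find { |name, _| name.to_s.casecmp?("server") }&.last
  server.to_s.casecmp?("Akamai Image Manager")
end

# A subset of the product image response headers shown above.
sample = JSON.parse(<<~HEADERS)
  {
    "server": "Akamai Image Manager",
    "content-type": "image/webp",
    "cache-control": "no-transform, max-age=21600"
  }
HEADERS

puts akamai_image_manager?(sample)               # prints "true"
puts akamai_image_manager?("Server" => "nginx")  # prints "false"
```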
- We could run this "non-headless" scraper on several VPS instances (Ubuntu desktops in AWS), connected to the internet via residential proxies, so they look like real user sessions. Windows instances would probably look more like real users because more users use Windows than Linux. On a schedule, these instances would rotate proxies.
- Port the script to one that can run on Android devices (we could have 100 Android devices hooked up to SIM cards), which are behind https://proxidize.com/ proxies. Playwright can be used to automate Android browsers.
- Set window.localStorage so that isFirstTimeChatMarketingMsg is false, which might disable the chat popup. Try to disable login/signup modals by setting cookies or values in local storage.
- Use residential proxies to appear more like real users. For example, it's possible EC2 server IP addresses are recognizable.
- Expand random user inputs to include random scrolling and smooth/rounded mouse moves using animation formulas rather than just moving the mouse up, down, left, and right.
- Use random server locations or randomize and mock locations
- Test if Playwright would make the code any simpler (it also supports CDP)
- Consider headless-chrome, now that it is apparently undetectable
- To scrape more difficult URLs, I read that TLS and HTTP/2 fingerprints could be mocked using low-level APIs (source)
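The "smooth/rounded mouse moves" idea from the list above could look roughly like the sketch below, which combines a quadratic Bezier curve with an ease-in-out function. This is one possible animation formula, not the scraper's current behavior; the method names, step count, and bend amount are assumptions.

```ruby
# Point on a quadratic Bezier curve at parameter t (0.0..1.0).
def bezier_point(from, control, to, t)
  from.zip(control, to).map do |a, c, b|
    ((1 - t)**2 * a) + (2 * (1 - t) * t * c) + (t**2 * b)
  end
end

# Ease-in-out (smoothstep): slow start, faster middle, slow finish.
def ease_in_out(t)
  t * t * (3 - 2 * t)
end

# Builds a curved path of [x, y] points from `from` to `to`.
# The control point is offset randomly so paths differ between runs.
def mouse_path(from, to, steps: 20, bend: 80, rng: Random.new)
  mid = from.zip(to).map { |a, b| (a + b) / 2.0 }
  control = mid.map { |v| v + rng.rand(-bend.to_f..bend.to_f) }
  (0..steps).map do |i|
    bezier_point(from, control, to, ease_in_out(i.to_f / steps)).map(&:round)
  end
end

path = mouse_path([100, 200], [640, 420])
puts path.first.inspect  # prints "[100, 200]"
puts path.last.inspect   # prints "[640, 420]"
```

Each point could then be sent to the browser in turn, for example through Ferrum's mouse move call with a short sleep between points. The exact call and timing would need to be checked against the Ferrum version in use.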