Chrome Crawler
This service uses Puppeteer (to be deprecated), Playwright and axios to crawl pages.
It can scroll pages, take screenshots and return images encoded as Base64.
Out of scope: it can also fetch app data from the Play Store. That feature will be migrated to a new service in the future.
How to run?
Dev environment
yarn start
Docker
A Dockerfile is provided, which installs Chrome inside the container.
Environment variables
WEB_PORT = 3000
WEB_TIMEOUT = 150 # seconds
JWT_SECRET = my-secret-hash
JWT_ALG = HS256
PLAYWRIGHT_CHROMIUM_EXECUTABLE_PATH=/usr/bin/chromium-browser
HEADLESS = true
REDIS = redis://127.0.0.1:6379 # with localhost it tries ipv6
If JWT_ALG is "ES512", then JWT_SECRET must contain the absolute or relative path to the public key:
JWT_ALG = "ES512"
JWT_SECRET = ".secrets/public.key"
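The README does not say how clients present a token, nor which claims the service checks. As a minimal sketch, assuming the service verifies a standard compact JWT signed with JWT_SECRET when JWT_ALG is HS256, a test token could be produced with the Python standard library alone. The helper names and the claims (`sub`, `exp`) are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as the JWT compact format requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_hs256(claims: dict, secret: str) -> str:
    """Build a compact JWT signed with HMAC-SHA256 (JWT_ALG = HS256)."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(
        secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256
    ).digest()
    return f"{signing_input}.{b64url(signature)}"


# Hypothetical claims: the service may expect different ones.
claims = {"sub": "crawler-client", "exp": int(time.time()) + 300}
token = sign_hs256(claims, "my-secret-hash")
```

With ES512 the service only holds the public key, so tokens would have to be signed elsewhere with the matching private key. That case is not covered by this sketch.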
PLAYWRIGHT_CHROMIUM_EXECUTABLE_PATH is set by default and points to where Alpine installs the Chromium browser.
This variable is not part of Playwright. It is used because Playwright ships its own browser binaries, which are not compatible with Alpine.
⚠️ In the future a Dockerfile.debian version could be provided. It requires a change in how the dockerfile is structured.
API
⚠️ V3 and V4 deprecated
⚠️ Starting with the V4 endpoints, the service uses Playwright instead of Puppeteer
⚠️ V3 will be deprecated soon
⚠️ V1 and V2 of the chrome and axios APIs are disabled in the code
⚠️ The Playstore API can be very unstable; for more information refer to https://github.com/facundoolano/google-play-scraper
General endpoints
- GET /
  - 200 Root status
- GET /metrics
  - 200 Prometheus stats
Chrome and axios endpoints
- GET /v5/image
  - Uses axios to get an image encoded as Base64
  - Query Params: url
- POST /v5/chrome
  - 200 if everything ok, 500 if something went wrong
  - body
    - url [string]: URL to crawl
    - ts [number]: timeout (in secs)
    - waitElement [string | null]: Visible text of an element to wait for
    - screenshot [bool]: Take a screenshot of the full page
    - useCookies [bool]: Store and load cookies, default false
    - cookieId [string]: cookie id
    - headers: not used
    - browser [object]:
      - proxy [object]: Configure a proxy to be used
        - server [string]: required, without protocol
        - username [string]: optional
        - password [string]: optional
      - emulation [object]:
        - locale [string]: default "en-US"
        - timezoneId [string]: default "America/New_York"
        - isMobile [bool]: default false
        - viewport [object]:
          - width [number]: default 1280
          - height [number]: default 720
        - geoEnabled [bool]: default true, adds geolocation to permissions
        - geolocation [object]: (default New York)
          - longitude [number]: default 40.6976312
          - latitude [number]: default -74.1444858
  - response 200:
    - fullurl [string]: Full URL
    - content [string]: Raw HTML of the response
    - headers [object]: not used
    - status [number]: status code, 200 or 500
    - fullLoaded [bool]: whether the page was loaded completely
    - screenshot [string]: Base64-encoded image
    - error [string]: error message, if any
    - cookieId [string]: generated
  - response 500:
    - error [string]: error message
- POST /v5/axios
  - 200 if everything ok, 500 if something went wrong
  - body
    - url [string]: URL to fetch
    - ts [number]: timeout (in secs)
    - waitElement [string | null]: Not used
    - screenshot [bool]: Not used
    - autoscroll [bool]: Not used
    - headers [any]: to be used with axios
    - proxy [object]: Not implemented right now
      - server [string]: required
      - username [string]: optional
      - password [string]: optional
- POST /v5/duckduckgo
  - 200 if everything ok, 500 if something went wrong
  - body
    - text [string]: a query to search in DuckDuckGo
    - ts [number]: timeout (in secs)
    - moreResults [number]: How many times the "More results" button will be clicked
    - region [string]: "ar-es" by default. See Regions codes
    - timeFilter [string | null]: "Any Time", "Past day", "Past week", "Past month", "Past year". Null by default
    - screenshot [bool]: Take a screenshot of the full rendered page
    - useCookies [bool]: Store and load cookies
    - cookieId [string]: cookie id
    - browser [object]: same as the chrome endpoint
  - response 200:
    - query [string]: Parsed query
    - fullurl [string]: Full URL
    - content [string]: Raw HTML of the response
    - headers [object]: Empty
    - status [number]: status code, 200 or 500
    - links [List[{href:text}]]: URI of the next page
    - fullLoaded [bool]: whether the page was loaded completely
    - screenshot [string]: Base64-encoded image
    - error [string]: error message, if any
    - cookieId [string]: generated
  - response 500:
    - error [string]: error message
- POST /v5/google
  - 200 if everything ok, 500 if something went wrong
  - body
    - text [string]: a query to search in Google
    - ts [number]: timeout (in secs)
    - moreResults [number]: Performs a "PgDown" action moreResults times
    - region [string]: "ar-es" by default. See Regions codes
    - timeFilter [string | null]: "Any Time", "Past hour", "Past 24 hours", "Past week", "Past month", "Past year". Null by default
    - screenshot [bool]: Take a screenshot
    - useCookies [bool]: default true
    - cookieId [string]: cookie id
    - browser [object]: same as the chrome endpoint
  - response 200:
    - query [string]: Parsed query
    - fullurl [string]: Full URL
    - content [string]: Raw HTML of the response
    - headers [object]: Empty
    - status [number]: status code, 200 or 500
    - links [List[{href:text}]]: URI of the next page
    - fullLoaded [bool]: whether the page was loaded completely
    - screenshot [string]: Base64-encoded image
    - error [string]: error message, if any
    - cookieId [string]: generated
  - response 500:
    - error [string]: error message
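The request and response fields listed for the chrome, duckduckgo and google endpoints can be exercised from a small client. The following is a sketch rather than an official client: the helper names, the default `ts` of 60 seconds and the assumption that a 500 response carries a JSON body are not taken from the service code. Only the field names come from the lists above.

```python
import base64
import json
import urllib.error
import urllib.request
from typing import Any, Dict, Optional


def build_chrome_body(url: str, ts: int = 60, screenshot: bool = False,
                      wait_element: Optional[str] = None,
                      proxy_server: Optional[str] = None) -> Dict[str, Any]:
    """Build a JSON body for POST /v5/chrome using the documented field names."""
    body: Dict[str, Any] = {
        "url": url,
        "ts": ts,  # assumed default, not documented
        "waitElement": wait_element,
        "screenshot": screenshot,
    }
    if proxy_server is not None:
        # The proxy server is documented as "required, without protocol".
        body["browser"] = {"proxy": {"server": proxy_server}}
    return body


def build_search_body(text: str, ts: int = 60, more_results: int = 0,
                      region: Optional[str] = None) -> Dict[str, Any]:
    """Build a body for POST /v5/duckduckgo or POST /v5/google."""
    body: Dict[str, Any] = {"text": text, "ts": ts, "moreResults": more_results}
    if region is not None:
        # Omitting the key leaves the choice of region to the service default.
        body["region"] = region
    return body


def decode_screenshot(response: Dict[str, Any]) -> Optional[bytes]:
    """Return the decoded screenshot bytes, or None when no screenshot is present."""
    encoded = response.get("screenshot")
    return base64.b64decode(encoded) if encoded else None


def post_json(base_url: str, path: str, body: Dict[str, Any],
              timeout: float = 150.0) -> Dict[str, Any]:
    """POST a JSON body and return the parsed JSON response."""
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.load(response)
    except urllib.error.HTTPError as exc:
        # Assumes the documented 500 response ({"error": ...}) is sent as JSON.
        return json.loads(exc.read().decode("utf-8"))
```

A call such as `post_json("http://localhost:3000", "/v5/chrome", build_chrome_body("https://example.com", screenshot=True))` would then return the response object, and `decode_screenshot` turns its `screenshot` field into bytes that can be written to a file. If the service enforces JWT authentication, an extra header is likely needed; the README does not specify its form.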
Playstore endpoints
- GET /v1/playstore/:appid
  - Get app detail based on the appid
- POST /v1/playstore/list
- POST /v1/playstore/search
  - Perform a search in the Google Play Store
  - body
    - term: the term to search by.
    - num (optional, defaults to 20, max is 250): the amount of apps to retrieve.
    - lang (optional, defaults to 'en'): the two letter language code used to retrieve the applications.
    - country (optional, defaults to 'us'): the two letter country code used to retrieve the applications.
    - fullDetail (optional, defaults to false): if true, an extra request will be made for every resulting app to fetch its full detail.
    - price (optional, defaults to all): controls whether the resulting apps are free, paid or both.
      - all: Free and paid
      - free: Free apps only
      - paid: Paid apps only
- POST /v1/playstore/similar
  - Returns a list of apps similar to the one specified
  - body:
    - appId: the Google Play id of the application to get similar apps for.
    - lang (optional, defaults to 'en'): the two letter language code used to retrieve the applications.
    - country (optional, defaults to 'us'): the two letter country code used to retrieve the applications.
    - fullDetail (optional, defaults to false): if true, an extra request will be made for every resulting app to fetch its full detail.
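The documented constraints on the search body (num capped at 250, price limited to three values) can be checked on the client before a request is sent. This is a sketch under the assumption that the service does not clamp or correct these values itself; the function name and the lower bound of 1 for `num` are hypothetical.

```python
from typing import Any, Dict

MAX_NUM = 250  # documented upper bound for "num"
PRICE_OPTIONS = ("all", "free", "paid")


def build_playstore_search(term: str, num: int = 20, lang: str = "en",
                           country: str = "us", full_detail: bool = False,
                           price: str = "all") -> Dict[str, Any]:
    """Build a body for POST /v1/playstore/search with the documented defaults."""
    if not 1 <= num <= MAX_NUM:
        raise ValueError(f"num must be between 1 and {MAX_NUM}")
    if price not in PRICE_OPTIONS:
        raise ValueError(f"price must be one of {PRICE_OPTIONS}")
    return {
        "term": term,
        "num": num,
        "lang": lang,
        "country": country,
        "fullDetail": full_detail,
        "price": price,
    }
```

Setting `full_detail=True` triggers one extra request per resulting app on the service side, so pairing it with a small `num` keeps response times reasonable.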
Example:
curl http://localhost:3000/v1/chrome?url=https://www.google.com/doodles/
url must include the protocol. screen is an optional param; any value is taken as true.
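Since the target address travels inside a query string, it is safer to percent-encode it than to paste it in raw. The sketch below builds a URL for GET /v5/image (the v1 endpoints are listed as disabled) and rejects targets without a protocol. The function name is hypothetical, and it assumes the service decodes standard percent-encoded query parameters.

```python
from urllib.parse import urlencode, urlsplit


def build_image_url(base_url: str, target: str) -> str:
    """Return the GET /v5/image URL for target, percent-encoding the query."""
    if urlsplit(target).scheme not in ("http", "https"):
        raise ValueError("url must include the protocol, e.g. https://")
    return f"{base_url.rstrip('/')}/v5/image?{urlencode({'url': target})}"


print(build_image_url("http://localhost:3000", "https://www.google.com/doodles/"))
# prints http://localhost:3000/v5/image?url=https%3A%2F%2Fwww.google.com%2Fdoodles%2F
```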
Changelog
- screen param added to the chrome endpoint in version 2 of the API. If it is true, a screenshot is taken and encoded in Base64. It is returned under the screenshot key and can then be decoded as a PNG file.
- /image endpoint added to download images as Base64.
- image.py, a script to test the image endpoint.
Regions codes
duckduckgo: Check https://duckduckgo.com/settings and copy the value of the option:
- For "All regions" the value is wt-wt
- For "Argentina" the value is ar-es
google: Check https://www.google.com/preferences and copy the text shown in the Region settings part:
- For "Brazil", the value is Brazil
- For "Argentina", the value is Argentina
Resources