apify/crawlee
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
TypeScript · Apache-2.0
Issues
- 2
Race conditions in CI/CD
#2417 opened by barjin - 2
Issue with decoding quotation mark
#2401 opened by HonzaKirchner - 0
Add `waitForAllRequestsToBeAdded` option to `enqueueLinks`
#2318 opened by barjin - 2
The request queue scans all 450k (99.999% of which are done towards the end) requests for each iteration
#2406 opened by zopieux - 0
Support for crawling from secondary IP address
#2409 opened by teammakdi - 1
Statistics does not use crawler log
#2412 opened by rougsig - 0
Incorrect Request Timeout in Error Message
#2403 opened by teammakdi - 0
type error `puppeteerUtils.gotoExtended` ?
#2398 opened by KOCEAN33 - 0
Make RequestQueueV2 default
#2388 opened by drobnikj - 0
Huge sitemap takes forever to load
#2384 opened by teammakdi - 0
Pass options to browser context
#2383 opened by cybairfly - 1
Errors in node_modules/@crawlee/http/internals/http-crawler.d.ts 140 errors in the same file
#2377 opened by jawspeak - 0
Http crawler does not return response in gzip format
#2379 opened by teammakdi - 0
Rename `Snapshotter` to something more accurate
#2378 opened by janbuchar - 0
Can not run crawleee puppeteer unit test with Jest
#2374 opened by duylddev - 0
Adopt a code formatter and enforce it with CI
#2366 opened by janbuchar - 0
Tor as proxy
#2365 opened by atefBB - 1
ProxyUrl not accepted: "(array `proxyUrls`) Expected property string values to be a URL, got "
#2362 opened by itinance - 1
Crawlee docs - the default values are wrongly displayed
#2266 opened by katacek - 0
Adaptive crawling
#2351 opened by B4nan - 0
Write an e2e test of adaptive playwright crawler
#2350 opened by janbuchar - 0
Initial PoC version of adaptive crawler
#2352 opened by B4nan - 0
page.waitForTimeout is removed
#2335 opened by KOCEAN33 - 1
`useIncognitoPages` doesn't rotate fingerprints
#2310 opened by mnmkng - 0
HttpCrawler - determining character encoding
#2317 opened by barjin - 0
Image not available(build status) in readme
#2331 opened by souravjain540 - 2
Could not kill browser: Cannot read private member #process from an object whose class did not declare it
#2327 opened by marcplouhinec - 2
Save screenshot/HTML on first occurrence of error in error statistics
#2280 opened by metalwarrior665 - 4
XPATH selectors support
#2320 opened by Ehsan-U - 2
`page.evaluate` results error
#2314 opened by foxt451 - 0
Implement Automatic Memory Management in Playwright for Enhanced Stability in Web Crawling Operations
#2303 opened by wojtekKrol - 0
add "exclude" property to enqueueLinksByClickingElements like "enqueueLinks"
#2298 opened by AraCoders - 0
dataset as requestsFromUrl
#2297 opened by apify-alexey - 1
Double clicking title selects also prefix pill – makes it harder to copypaste
#2282 opened by webrdaniel - 8
Issue Downgrading from Crawlee 3.7.2 to 3.4.0 - Persistent Version and TypeScript Errors
#2279 opened by wojtekKrol - 3
No links are being enqueued on some pages
#2273 opened by batu-archive - 5
Typescript issue with 3.7
#2264 opened by timsu - 0
Show line numbers in code blocks on Crawlee docs
#2272 opened by vladfrangu - 0
scrape page count is exceed maxRequestsPerCrawl too much
#2268 opened by zshnb