scrape_email: A JavaScript repository from mfbx9da4

Scrape emails

Provide either a json file with list of names
Scrapes google first page for ${name} contact @
Then scrapes first few links until it finds @ on page

Method

part 1 - save html files:

create html files for all google pages and top results
in parallel
would be faster to store in db

part 2 - read each html file:

scrape for email regex, accumalate all email

Naming

Always prefix functions and files with either fetchBar or extractFoo

Getting Started

requries docker and docker compose

docker-compose up
yarn

Use for debugging values

npm install -g redis-commander
redis-commander

Deploying

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/docker-basics.html#docker-basics-create-image

╰─$ aws ecr get-login --no-include-email --region eu-west-2
## paste output to login
╰─$ docker build -t scrape_email .
╰─$ docker tag scrape_email:latest 016582366134.dkr.ecr.eu-west-2.amazonaws.com/scrape_email:latest
╰─$ docker push 016582366134.dkr.ecr.eu-west-2.amazonaws.com/scrape_email:latest

Run Once

See package.json for scripts

node index.js

Testing docker locally

docker build -t scrape_email .
docker run -p 49160:3001 3000:3000 -d scrape_email

Streamlining: When to update db?

Individual agents should be updated independent of other agents therefore:
Wait for all subpages to be scraped before updating an agent record? YES
Wait for all properties to be scraped before updating an agent record? YES
Wait for all agents to be scraped before updating an agent record? NO
Wait for all regions to be scraped before updating an agent record? NO
=> Emit an event everytime you extract some data completely for an agent
Emit an event when:
- Found agent name, address, etc
- Extracted agent property stats
- Extracted agent email

API

Get email for query
Get email for list of queries

TODO

Support upload csv native upload
Support download

mfbx9da4/scrape_email