This is a simple example of web scraping with Node.js. The script makes use of cheerio and puppeteer.
It reads a JSON file in src/sites that describes each site to scrape, including its URL and the selectors to use.
- clone the repo
git clone https://github.com/nurbxfit/JobScrapper-Nodejs-example.git
- cd into the directory and install dependencies
cd JobScrapper-Nodejs-example && npm install
- because this script is written in TypeScript, we need to build it first
npm run start:build
- then we can start it
npm start
In this example, I use the script to scrape a job board website, mystarjob.com.
- name: we use this attribute to create a folder for our data.
- headless: if set to false, the puppeteer browser window opens when we run the script.
- search: used to construct the URL of the starting page where we want to crawl.
- search.pagination: contains the pagination information we need to go to the next page.
- crawl: contains the selectors we use to crawl for content URLs in the list page.
- scrape: contains the selectors we use to scrape the detail content page.
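As a rough illustration of how search and search.pagination could combine into page URLs, here is a minimal sketch. The function name buildPageUrls and the types are hypothetical and not taken from the repo. It also assumes that limit means the number of pages to visit; the actual script may interpret these fields differently.

```typescript
// Hypothetical types mirroring the search block of a site config.
interface Pagination {
  type: string;
  attrs: { param: string; initial: number; incremental: number; limit: number };
}

interface Search {
  url: string;
  pagination: Pagination;
}

// Build the list-page URLs by appending the pagination parameter to the
// search URL. Assumption: `limit` is the number of pages to generate.
function buildPageUrls(baseUrl: string, search: Search): string[] {
  const { param, initial, incremental, limit } = search.pagination.attrs;
  const urls: string[] = [];
  for (let i = 0; i < limit; i++) {
    const page = initial + i * incremental;
    urls.push(`${baseUrl}${search.url}${param}${page}`);
  }
  return urls;
}

// Example with a shortened, made-up search URL.
const pageUrls = buildPageUrls("http://example.com", {
  url: "/search/default.aspx?a=",
  pagination: {
    type: "query",
    attrs: { param: "&p=", initial: 1, incremental: 1, limit: 3 },
  },
});
// pageUrls[0] is "http://example.com/search/default.aspx?a=&p=1"
```

Generating the URLs up front keeps the crawl loop simple, at the cost of not stopping early when a site has fewer pages than the configured limit.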
Each selector contains:
- field_name: used as the key name in the output; its value comes from the query.
- query: a querySelector-style selector used to select an HTML element with cheerio.
- method: a cheerio method used to extract the element's text content.
- regex (optional): a regular expression used to refine the extracted text.
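To illustrate the optional regex field, the sketch below applies a selector's regex to text that has already been extracted. The refine helper, the fallback behavior, and the sample input string are assumptions made for illustration; how the script itself evaluates method and regex is not shown here.

```typescript
// Hypothetical shape of one selector entry, following the fields listed above.
interface Selector {
  field_name: string;
  query: string;
  method: string;
  regex?: string;
}

// Return the first regex match in `text`, or the text unchanged when the
// selector has no regex or nothing matches (an assumed fallback).
function refine(text: string, selector: Selector): string {
  if (!selector.regex) return text;
  const match = text.match(new RegExp(selector.regex));
  return match ? match[0] : text;
}

const datePosted: Selector = {
  field_name: "date_posted",
  query: 'p[class="date"]',
  method: ".text()",
  regex: "(?<=^Posted\\son\\s)\\d.+$",
};

// Sample input is made up; it mimics a "Posted on ..." label.
const refinedDate = refine("Posted on 9 Jun 2022", datePosted);
// refinedDate is "9 Jun 2022"
```

The lookbehind in the pattern drops the "Posted on " prefix without needing a capture group, so the whole match can be used as the field value.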
{
  "name": "my_starjob",
  "headless": true,
  "base_url": "http://mystarjob.com",
  "search": {
    "url": "/search/default.aspx?a=&i=-1&sb=-1&stb=1&std=2&f=-1&c=-1&s=-1&jt=-1&jl=-1&fw=&jf=&jb=&rs=-1",
    "pagination": {
      "type": "query",
      "attrs": {
        "param": "&p=",
        "initial": 1,
        "incremental": 1,
        "limit": 10
      }
    }
  },
  "crawl": {
    "selectors": [
      {
        "field_name": "result",
        "query": "div[class=\"resultDisplay\"] > p",
        "method": ".text()",
        "regex": "(?<=\\bof )\\d+"
      },
      {
        "field_name": "content_url",
        "query": "h2[class=\"titleL\"] > a",
        "method": ".map((i,e)=> $(e).attr(\"href\")).get()"
      }
    ]
  },
  "scrape": {
    "selectors": [
      {
        "field_name": "job_title",
        "query": "h1[class=\"jobsTitle\"]",
        "method": ".text()"
      },
      {
        "field_name": "company",
        "query": "h2[class=\"company\"]",
        "method": ".text().trim()"
      },
      {
        "field_name": "date_posted",
        "query": "p[class=\"date\"]",
        "method": ".text()",
        "regex": "(?<=^Posted\\son\\s)\\d.+$"
      },
      {
        "field_name": "job_description",
        "query": "div[class=\"jobsDesc\"]",
        "method": ".toString()"
      }
    ]
  }
}
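A config like the one above could be loaded and checked before scraping starts. The following is a minimal sketch, not the repo's loader: parseSiteConfig and the SiteConfig type are hypothetical names, and only the top-level keys are checked.

```typescript
// Hypothetical top-level shape of a site config; nested blocks are left loose.
interface SiteConfig {
  name: string;
  headless: boolean;
  base_url: string;
  search: object;
  crawl: object;
  scrape: object;
}

// Parse a JSON string and verify that the expected top-level keys exist
// with the expected types. Throws on the first problem found.
function parseSiteConfig(json: string): SiteConfig {
  const data: unknown = JSON.parse(json);
  if (typeof data !== "object" || data === null) {
    throw new Error("config must be a JSON object");
  }
  const cfg = data as Record<string, unknown>;
  const expected: Array<[string, string]> = [
    ["name", "string"],
    ["headless", "boolean"],
    ["base_url", "string"],
    ["search", "object"],
    ["crawl", "object"],
    ["scrape", "object"],
  ];
  for (const [key, type] of expected) {
    const value = cfg[key];
    if (typeof value !== type || value === null) {
      throw new Error(`"${key}" must be of type ${type}`);
    }
  }
  return data as SiteConfig;
}

// Trimmed-down config with empty nested blocks, for illustration only.
const sampleConfig = parseSiteConfig(
  '{"name":"my_starjob","headless":true,"base_url":"http://mystarjob.com",' +
    '"search":{},"crawl":{},"scrape":{}}'
);
```

Failing early on a malformed config gives a clearer error than a selector lookup that breaks partway through a crawl.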
Using the JSON file above, the script produces data like the following:
{
"site_name": "my_starjob",
"site_url": "http://mystarjob.com",
"content_url": "http://mystarjob.com/../job/default.aspx?pid=103047",
"content": {
"job_title": "Bus Captain",
"company": "GMP Recruitment Services (S) Pte Ltd",
"date_posted": "9 Jun 2022",
"job_description": "<div class=\"jobsDesc\">\n <h1 class=\"titleM\">Job Description</h1>\n\t\t\t\t <a href=\"mailto:Jessica.Pan@gmprecruit.com\">Jessica.Pan@gmprecruit.com</a>Make a Difference Everyday<br>\nJoin us as a Bus Captain/ Technician/Assistant Station Manager/ Customer Service Officers<br>\nBus Captain<br>\n<br>\nResponsibilities:\n<ul>\n\t<li>Provides a safe and pleasant journey to the passengers</li>\n\t<li>Maintain the bus well</li>\n</ul>\nRequirements:\n\n<ul>\n\t<li>Minimum Secondary 2 education/ WPL Level 3 or equivalent.</li>\n\t<li>Valid Class 3/4 driving licence with a minimum of one year driving experience.</li>\n</ul>\n<br>\nWalk in interview:<br>\n11 June 2022<br>\n9.30am to 5.00pm<br>\nHatten Hotel Melaka<br>\nMarco Polo 1, Level 22<br>\nJalan Merdeka, Bandar Hilir,75000 Melaka, Malaysia<br>\n<br>\nSTEADY INCOME<br>\nCONTINUAL TRAINING<br>\nEXTENSIVE MEDICAL BENEFITS<br>\n <br>\nPlease WhatsApp +6584503157 (Winn) / +60123857848 (Jess)<br>\nEmail: Jessica.Pan@gmprecruit.com<br>\n………………………………………………………………………………………………………….<br>\nCompany Name: GMP Recruitment Services (S) Pte Ltd<br>\nAddress: 1 Finlayson Green, #10-00 One Finlayson Green, Singapore 049246<br>\n \n </div>"
}
},
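The record above pairs site metadata with one content object keyed by each selector's field_name. A minimal sketch of that assembly follows, assuming a hypothetical buildRecord helper and that the scraped values are already available as strings; it is not the repo's own code.

```typescript
// Hypothetical output shape, modeled on the sample record above.
interface ScrapedRecord {
  site_name: string;
  site_url: string;
  content_url: string;
  content: Record<string, string>;
}

// Combine site metadata with scraped [field_name, value] pairs.
function buildRecord(
  site: { name: string; base_url: string },
  contentUrl: string,
  fields: Array<[string, string]>
): ScrapedRecord {
  const content: Record<string, string> = {};
  for (const [fieldName, value] of fields) {
    content[fieldName] = value;
  }
  return {
    site_name: site.name,
    site_url: site.base_url,
    content_url: contentUrl,
    content,
  };
}

// Values copied from the sample record; only two fields are shown.
const record = buildRecord(
  { name: "my_starjob", base_url: "http://mystarjob.com" },
  "http://mystarjob.com/../job/default.aspx?pid=103047",
  [
    ["job_title", "Bus Captain"],
    ["company", "GMP Recruitment Services (S) Pte Ltd"],
  ]
);
```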