Scrape Alibaba Supplier contact
- Request the Alibaba category pages and add each category to Redis.
- Read each category from Redis, request its company list, and insert brief info about each company into Postgres.
- Read the brief info from Postgres and add each company URL to Redis.
- Request each company's contact page and write the data to Postgres.
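The four steps hand work to each other through Redis and the Postgres table. Below is a minimal in-memory sketch of that hand-off, not the code in this repo: a `Set` stands in for each Redis set, an array stands in for the `alibaba_company` table, and `fetchCategories`, `fetchCompanyList`, and `fetchContact` are hypothetical stubs replacing the HTTP requests. Filtering on `status === 'brief'` in step 3 is also an assumption.

```javascript
// In-memory stand-ins (assumptions): Sets replace the Redis sets,
// an array replaces the alibaba_company table.
const categoryKeys = new Set();
const companyKeys = new Set();
const companyTable = [];

// Hypothetical stubs replacing the real HTTP requests.
const fetchCategories = () => ['cat-1', 'cat-2'];
const fetchCompanyList = (category) => [
  { name: `Company in ${category}`, sid: 1, url: `http://example.com/${category}` },
];
const fetchContact = (url) => ({ person: 'Jane Doe', Telephone: '000' });

// Step 1: collect the categories.
function getCategory() {
  for (const category of fetchCategories()) categoryKeys.add(category);
}

// Step 2: for each category, store a brief row per company.
function getCompanyBrief() {
  for (const category of categoryKeys) {
    for (const brief of fetchCompanyList(category)) {
      companyTable.push({ ...brief, status: 'brief', contact: null });
    }
  }
}

// Step 3: queue the URL of every row that is still brief (assumed filter).
function prepareCompanyDetail() {
  for (const row of companyTable) {
    if (row.status === 'brief') companyKeys.add(row.url);
  }
}

// Step 4: fetch the contact for each queued URL and update its row.
function getCompanyDetail() {
  for (const url of companyKeys) {
    const row = companyTable.find((r) => r.url === url);
    try {
      row.contact = fetchContact(url);
      row.status = 'detail';
    } catch (err) {
      row.status = 'detailErr';
    }
    row.update_date = new Date();
  }
}

getCategory();
getCompanyBrief();
prepareCompanyDetail();
getCompanyDetail();
console.log(companyTable.map((r) => `${r.status} ${r.url}`));
```

The status values `brief`, `detail`, and `detailErr` are the ones listed in the table definition below.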
docker run -p 5432:5432 --restart=always --name alibaba -e POSTGRES_PASSWORD=123456 -d postgres:9.3
docker exec -i -t alibaba bash
psql -U postgres
create database alibaba;
\c alibaba
create table alibaba_company
(
id BIGSERIAL PRIMARY KEY,
name character varying(255),
sid bigint,
url character varying(255),
status character varying(255), /* brief, detail, detailErr */
gold_supplier bigint,
assurance boolean,
contact json,
update_date timestamp
);
docker run -p 6379:6379 --restart=always --name redis -d redis
node getCategory.js
node getCompanyBrief.js
node prepareCOmpanyDetailEnv.js
node getCompanyDetail.js
select * from alibaba_company;
select count(*) from alibaba_company;
select * from alibaba_company order by id desc limit 50 ;
SELECT sid, count(*) as count FROM alibaba_company group by sid order by count desc;
SELECT count(*),name FROM alibaba_company group by name order by count(*) desc;
SELECT * FROM alibaba_company WHERE name LIKE '%Chengli Special%'; /* same company name with different sid, url*/
SCARD alibaba_category_key
SCARD alibaba_company_key
SRANDMEMBER alibaba_category_key 5
SRANDMEMBER alibaba_company_key 5
select sid, name, url,
  contact->>'person' as Person,
  contact->>'Department' as Department,
  contact->>'Job Title' as Job_Title,
  contact->>'Telephone' as Telephone,
  contact->>'Mobile Phone' as Mobile_Phone,
  contact->>'Fax' as Fax,
  contact->>'Address' as Address,
  contact->>'Country/Region' as Country_Region,
  contact->>'Province/State' as Province_State,
  contact->>'City' as City,
  contact->>'Zip' as Zip
from alibaba_company
where status = 'detail' and contact->>'person' <> ''
order by name;
- Read a CSV file and save it to Redis:
node loadLastCSV <file_path>
- Delete the rows with the same name from the database:
node removeLastData
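A rough sketch of the idea behind these two commands, not the scripts themselves: collect the company names from a CSV file, then drop the rows whose name is already in that collection. A `Set` stands in for Redis and an array for the table. The helper names, the header row, and the name sitting in the second column are all assumptions. Quoted fields are handled because company names often contain commas.

```javascript
// Split one CSV line into fields, honouring double-quoted fields
// and "" as an escaped quote inside them.
function parseCsvLine(line) {
  const fields = [];
  let field = '';
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        field += '"';
        i++; // skip the second quote of the escaped pair
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        field += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ',') {
      fields.push(field);
      field = '';
    } else {
      field += ch;
    }
  }
  fields.push(field);
  return fields;
}

// Collect names from the CSV text. Assumptions: the first line is a header
// and the name is in column index 1.
function loadNames(csvText, nameColumn = 1) {
  const names = new Set();
  const lines = csvText.split(/\r?\n/).filter((l) => l.trim() !== '');
  for (const line of lines.slice(1)) {
    const name = parseCsvLine(line)[nameColumn];
    if (name) names.add(name);
  }
  return names;
}

// Keep only the rows whose name was not seen in the CSV.
function removeKnownNames(rows, names) {
  return rows.filter((row) => !names.has(row.name));
}

const csv = 'sid,name,url\n1,"Acme Co., Ltd.",http://a.example\n';
const rows = [{ name: 'Acme Co., Ltd.' }, { name: 'Other Trading' }];
console.log(removeKnownNames(rows, loadNames(csv)));
```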
Requesting a company URL such as http://www.alibaba.com/member/yuen_/contactinfo.html redirects to http://yuen_.fm.alibaba.com/contactinfo.html. Because the hostname contains '_', node.js can't access the page. (http://frank_gu.fm.alibaba.com/contactinfo.html can be requested, though.)
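One way to spot such URLs before requesting them is to inspect the hostname labels. The sketch below uses the global `URL` class available in current Node.js versions. The function names are hypothetical, and the second check is only a guess drawn from the two example hostnames (the failing one has a label ending in '_', the working one has '_' in the middle), not a confirmed cause.

```javascript
// Return the hostname labels that contain an underscore.
function underscoreLabels(rawUrl) {
  const { hostname } = new URL(rawUrl);
  return hostname.split('.').filter((label) => label.includes('_'));
}

// Assumption: a label that starts or ends with '_' is the problematic case.
function hasEdgeUnderscore(rawUrl) {
  return underscoreLabels(rawUrl).some(
    (label) => label.startsWith('_') || label.endsWith('_')
  );
}

const urls = [
  'http://yuen_.fm.alibaba.com/contactinfo.html',
  'http://frank_gu.fm.alibaba.com/contactinfo.html',
];
for (const u of urls) {
  console.log(u, underscoreLabels(u), hasEdgeUnderscore(u));
}
```

Flagged URLs could then be skipped or retried separately instead of failing in the middle of a run.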