This a basic web crawler developed in JAVA to get info from Amazon to generate ads records.
We start by making our own rawQuery to get adds.
- Still need improvement on generate N-gram query and add Hashset to prevent duplications.
- 互联网广告形式
- 广告数据结构
- 搜索广告基本流程
- 爬虫
- 搜索日志数据
- search ads 搜索广告
logic: match keywords to user's query
format: text, image
position: main line, side bar, etc. - native ads 原生广告
logic: match keywords to web page or APP context format: text, image, style should also match context of web page
position: embedded in original content - display ads 显示广告
logic: match user's demographic, interests to ad's category interests collected from user behavior:click engage time
format: image, animation, video, audio
position: sidebar, top, bottom of page, etc..
Common:
logic: match ads to something
something: users's query, interest, demographic, context of web page
click rate:
native > search > display
Campaign
● A campaign focuses on a theme or a group of products
● set a budget
● choose your audience
● write your ad including keywords, ad content
Quiz:
click rate = click / impression
potential clicks related to pricing algorithm
ad
- AdID
- CampaignID
- Keywords
- Bid
- Description
- Landing Page
compaign
- CampaignID
- Budget (reduce as ad got click)
- query understand
- select ads
- filter ads
- rank ads
- select top K ads
- pricing
- allocate ads
cleaning
● remove stop words: a, an, the ...
● remove ending s, ‘s
query intent prediction
● predict user’s intent
● buy harry potter dvd online -> harry potter dvd
● need query history log
query expansion
● nike running shoe -> nike running sneakers
● software developer -> software engineer
result: a list of query
● send query understand result to index and select as much candidates as possible
● apply information retrieval algorithm on index server
● how to make it scalable ?
● what if QPS > 10 K ? ( distributed ads index server )
● given ad candidates, how to calculate rank score
● a good rank score should reflect
★ relevance between query and ad
★ click probability
★ bid price
- how to charge advertiser
- is it fair to charge them by bid price?
Quiz:
cost per click increase is good not bad?
Web crawling is about making requests and extracting data from the response.
- start with feeds file
- a list of website
- request url with different parameters
- sample feeds file
Query | Bid | Campaign ID |
---|---|---|
sofa | 12.5 | 8022 |
corner sofa | 13.6 | 8023 |
leather sofa | 13.8 | 8024 |
- need multithread
- What if website like amazon block your IP?
- http 503 service unavailable
- Spoof http header: User Agent, Accept, Accept-Encoding
- Proxy Service: Rotate IPs using a list of over 100+ proxy servers
● What if crawler crash due to uncaught exception?
● How to avoid crawl same url ?
● How to handle url failed to crawl ?
To tackle these problems:
● log crawled url
● log failed url
● log exception
- Device IP (random generated)
- Device ID (random generated)
- Session ID
- Query
- Ad ID
- Clicked (0/1)
- Ad category
- Query category
Goal:
Generate click log for query intent prediction and click prediction
method
reverse engineering
- IP
- device_id (some IP like click ads )
- Ad_id
- queryCategory_AdsCategory match
- query_campaign_ID match
- query_Ad_id match
- mismatched query_category and ads_category
- mismatched query_campaignID
- mismatched query_adid
- lowest campaignID weight, lowest adid weight per query group
Step 1: Segment users to 4 level
● 10000 IP, 5 device id per IP, user = ip + device_id
● level 0: 5% of user who click for each query
● level 1: 5% of IP whose 1st 2 device id click for each query,rest 3 device_id click on 70% of query, rest of 30% query no click
● level 2: 40% of IP from where we random select 1 device_id click for 50% of query, rest of 50% query no click, rest of 4 device_id no click
● level 3: 50% never click
Step 2: assign campaign ID, Ad Id, Click(0/1),Ad category_Query category(0/1) to user
● assign weight to campaign ID, Ad IDs
● group ad data by Query group Id
How to calculate weight for each Ads, campaign?
● For each ad per campaign , calculate relevance between ad and query, ad_weight_i = relevance_i / sum(relevance_i) , i is index of ad for current campaign
● For each campaign per query group, campaign_weight_j = sum of ad relevance in current campaign /total_relevance_cross_campaign
How to assign weighed campaignID, AdId to user?
Weighted Random sampling
Given n item, each one has weight w[i], these weights form the unnormalized probability distribution we want to sample from, each item should have probability w[i]/sum(w) of being chosen.