/WebScraping

📑 Web Scraping Applications

Primary language: Python

📑 Web Scraping

No. | Content                | Remark
1   | Html Code Change       | Change a YouTube thumbnail
2   | GGG_world              | URL check
3   | Wikipedia : ISO 3166-1 | List and search countries / provide URLs

What is web scraping?

Reading HTML from the web and collecting specific data from it.


Web scraping process

  1. Define the data object and select the web page
  2. Analyze the data to extract from the web page
  3. Write the web scraping code
  4. Save the extracted data
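
The four steps above can be sketched end to end. This is a minimal illustration, not the repository's code: the `Post` dataclass, the `LinkCollector` class, and the inline `SAMPLE_HTML` string are hypothetical, and the standard library's `html.parser` stands in for the third-party libraries so the sketch runs without a network connection.

```python
import csv
import io
from dataclasses import dataclass
from html.parser import HTMLParser

# Step 1: define the data object (hypothetical) and pick the page.
# An inline string stands in for a downloaded page.
@dataclass
class Post:
    title: str
    url: str

SAMPLE_HTML = """
<ul>
  <li><a href="/posts/1">First post</a></li>
  <li><a href="/posts/2">Second post</a></li>
</ul>
"""

# Steps 2-3: the data to extract is the href and text of every <a> tag.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.posts = []
        self._href = None   # href of the <a> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href is not None and data.strip():
            self.posts.append(Post(title=data.strip(), url=self._href))

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

collector = LinkCollector()
collector.feed(SAMPLE_HTML)

# Step 4: save the extracted data (here to an in-memory CSV).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["title", "url"])
for post in collector.posts:
    writer.writerow([post.title, post.url])

print(buffer.getvalue())
```

Replacing `io.StringIO()` with `open("posts.csv", "w", newline="")` would write the same rows to a file.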

Installing external libraries

📌 Requests

Requests is a library for sending HTTP requests; it supports HTTP GET, POST, PUT, DELETE, and other methods.
It also automatically handles the request encoding needed for data supplied as a dictionary.

import requests

url = "https://github.com/"
response = requests.get(url)                   # GET
print(response.text)                           # print the returned HTML

url = "https://github.com/post"                # illustrative endpoint; substitute a URL that accepts POST
dic = {'kind': 'zest', 'title': 'Truffle', 'age': 3}
response = requests.post(url, data=dic)        # POST; dic is sent as form-encoded data
print(response.text)

https://github.com/psf/requests
https://requests.readthedocs.io/projects/3/
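
To make the automatic encoding concrete: a dictionary passed as `data=` is sent as a form-encoded request body. The sketch below reproduces that body format with the standard library's `urllib.parse.urlencode`, so it runs without sending a request. It illustrates the format only and is not Requests' own code path.

```python
from urllib.parse import urlencode, parse_qs

dic = {'kind': 'zest', 'title': 'Truffle', 'age': 3}

# Form encoding: key=value pairs joined by "&"; non-string values are stringified.
body = urlencode(dic)
print(body)             # kind=zest&title=Truffle&age=3

# Decoding shows the round trip; every value comes back as a list of strings.
print(parse_qs(body))   # {'kind': ['zest'], 'title': ['Truffle'], 'age': ['3']}
```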


📌 BeautifulSoup4

BeautifulSoup4 is a library for scraping information from web pages; it parses HTML and XML documents.

import requests
from bs4 import BeautifulSoup

url = "https://github.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')   # parse the HTML with Python's built-in parser

https://www.crummy.com/software/BeautifulSoup/bs4/doc/


📌 Selenium

BeautifulSoup alone can extract information from many sites, but it cannot retrieve content that JavaScript generates dynamically.

If a scraping attempt returns no data at all, the usual cause is that the page builds its HTML with JavaScript.


Selenium is therefore used for the following reasons.

  1. To crawl data that JavaScript generates dynamically
  2. To send events such as clicks and keyboard input to the HTML elements of a site
  3. To automate repetitive work on the web (e.g. automatic login, automatically liking and commenting on new posts from blog neighbors)
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

url = "http://google.com"

# When the driver is in the same location as the script file
driver_same = webdriver.Chrome()
driver_same.get(url)

# When the driver is in a different location
# (Selenium 4 style; older versions took the path as the first argument)
driver_diff = webdriver.Chrome(service=Service("path to driver_diff"))
driver_diff.get(url)

ChromeDriver installation : https://chromedriver.chromium.org/downloads



Simple web scraping

Scraping the URLs of the posts shown on the first page of the GitHub blog

  1. Open the website selected for URL extraction and press F12 to open the HTML code pane (developer tools).
  2. Press Ctrl + Shift + C, then click the part of the page you want; the HTML code that matches that position is shown.
  3. In the code pane, right-click that HTML code and choose Copy > Copy selector to copy the HTML location path (CSS selector) of the desired information.
  4. Use functions such as find, find_all, or select together with the HTML location path to extract the data you need.
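
As a sketch of step 4, the three functions can be compared on a small inline document. The `SAMPLE_HTML` fragment and its selector are hypothetical stand-ins for a real page and for a selector copied from the browser.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment; a real page would come from requests.get(url).text
SAMPLE_HTML = """
<div id="main">
  <section>
    <div class="post"><a href="/posts/1">First post</a></div>
    <div class="post"><a href="/posts/2">Second post</a></div>
  </section>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find: the first matching tag (or None when nothing matches)
first_link = soup.find("a")
print(first_link["href"])                       # /posts/1

# find_all: every matching tag, optionally filtered by attributes
posts = soup.find_all("div", class_="post")
print(len(posts))                               # 2

# select: tags matching a CSS selector, such as one copied from the browser
links = [a["href"] for a in soup.select("#main > section > div > a")]
print(links)                                    # ['/posts/1', '/posts/2']
```

A copied selector is tied to the page layout at the time it was copied, so it may need updating when the site changes.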

Example of the HTML code pane

(image: gitblog)


Code

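import requests
from bs4 import BeautifulSoup

url = "https://github.blog/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
# Selector copied from the browser's code pane (Copy > Copy selector)
results = soup.select("#main > section > div > div > div")
count = 1

for result in results:
    anchor = result.find("a")                  # first link inside each matched block
    link = anchor.get("href") if anchor else None
    if link is None:
        continue                               # skip blocks that contain no link
    print(f"# {count} : ", link)
    count += 1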

Result

# 1 :  https://github.blog/2021-08-11-githubs-engineering-team-moved-codespaces/
# 2 :  https://github.blog/2021-08-11-githubs-engineering-team-moved-codespaces/
# 3 :  https://github.blog/2021-08-16-securing-your-github-account-two-factor-authentication/
# 4 :  https://github.blog/2021-08-12-teaching-learning-github-classroom-visual-studio-code/
# 5 :  https://github.blog/2021-08-12-whats-new-from-github-changelog-july-2021-recap/
# 6 :  https://github.blog/category/community/
# 7 :  https://github.blog/category/education/
# 8 :  https://github.blog/category/engineering/
# 9 :  https://github.blog/category/enterprise/
# 10 :  https://github.blog/category/open-source/
# 11 :  https://github.blog/category/policy/
# 12 :  https://github.blog/category/product/
# 13 :  https://github.blog/category/security/
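
The output above lists one post twice and also includes category pages. A possible post-processing step, sketched below under the assumption that post URLs carry a `YYYY-MM-DD-` prefix as they do in this output, removes the duplicates and keeps only posts. The `POST_PATTERN` name and the `unique_posts` helper are hypothetical.

```python
import re

# A subset of the links printed above.
links = [
    "https://github.blog/2021-08-11-githubs-engineering-team-moved-codespaces/",
    "https://github.blog/2021-08-11-githubs-engineering-team-moved-codespaces/",
    "https://github.blog/2021-08-16-securing-your-github-account-two-factor-authentication/",
    "https://github.blog/category/community/",
    "https://github.blog/category/education/",
]

# Assumption: a post URL starts its path with a date such as 2021-08-11-.
POST_PATTERN = re.compile(r"^https://github\.blog/\d{4}-\d{2}-\d{2}-")

def unique_posts(urls):
    """Drop repeats (keeping first-seen order) and non-post links."""
    deduplicated = dict.fromkeys(urls)          # dicts preserve insertion order
    return [u for u in deduplicated if POST_PATTERN.match(u)]

for count, link in enumerate(unique_posts(links), start=1):
    print(f"# {count} : ", link)
```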