Web scraping in Python

Using Scrapy

Install Scrapy by entering pip install scrapy in your Terminal.
Navigate to the directory where you would like to create your Scrapy project.

Enter scrapy startproject myproject. This will create a project with the following directory structure:

myproject/
    scrapy.cfg
    myproject/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

Navigate to the directory that contains the project's spiders by entering cd myproject/myproject/spiders.
Create a new spider by entering touch myspider.py, then open it in your default Python code editor by entering open myspider.py. (The open command is specific to macOS; on other systems, open the file from your editor instead.)

Enter the following code:

import scrapy
from myproject.items import MyItem


class MySpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['bwog.com']
    start_urls = ['http://bwog.com']

    def parse(self, response):
        # Follow the first link in each blog section to the full entry.
        for section in response.xpath('//div[@class="blog-section"]'):
            link = section.xpath('.//a/@href').extract_first()
            if link:
                # response.follow also resolves relative URLs.
                yield response.follow(link, callback=self.parse_entry)
        # Queue the next page, if the current page links to one.
        next_page = response.xpath('//div[@class="comnt-btn"]//@href').extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_entry(self, response):
        # Yield one item per comment on the entry page.
        for comment in response.xpath('//div[contains(@class, " comment-body")]'):
            item = MyItem()

            item['author'] = comment.xpath('./div[@class="comment-author vcard"]/cite/text()').extract_first()

            metadata = comment.xpath('./div[@class="comment-meta datetime"]')
            # Default to '0' so that int() does not fail when a vote count is missing.
            item['up'] = int(metadata.xpath('./span[@data-voting-direction="up"]/span/text()').extract_first(default='0'))
            item['down'] = int(metadata.xpath('./span[@data-voting-direction="down"]/span/text()').extract_first(default='0'))
            item['datetime'] = metadata.xpath('./a/text()').extract_first(default='').strip()

            paragraphs = comment.xpath('./div[contains(@class, "reg-comment-body")]/p/text()').extract()
            item['content'] = '\n'.join(paragraphs)

            yield item

Edit the project items file in your default Python code editor by entering open ../items.py.

Enter the following code:

import scrapy

class MyItem(scrapy.Item):
    author = scrapy.Field()
    up = scrapy.Field()
    down = scrapy.Field()
    datetime = scrapy.Field()
    content = scrapy.Field()

Enter cd ../.. to return to the top directory of your project, the one that contains scrapy.cfg.
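Run the spider you created and store its output in a comments.json file by entering scrapy crawl myspider -o comments.json. View the stored comments by entering open comments.json. Note that -o appends to an existing file, which produces invalid JSON on a second run, so delete comments.json before crawling again.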
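
Once the crawl has finished, the output can also be read back in Python. The following is a minimal sketch, assuming comments.json holds a JSON array of objects with the fields defined in MyItem. The helper names load_comments and top_comments and the sample records are hypothetical.

```python
import json


def load_comments(path='comments.json'):
    """Read the JSON array written by the crawl."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)


def top_comments(comments, n=3):
    """Return the n comments with the highest net score (up minus down)."""
    return sorted(comments, key=lambda c: c['up'] - c['down'], reverse=True)[:n]


# Made-up records standing in for real crawl output.
sample = [
    {'author': 'A', 'up': 5, 'down': 1, 'datetime': '', 'content': 'first'},
    {'author': 'B', 'up': 2, 'down': 6, 'datetime': '', 'content': 'second'},
    {'author': 'C', 'up': 9, 'down': 2, 'datetime': '', 'content': 'third'},
]

for comment in top_comments(sample, n=2):
    print(comment['up'] - comment['down'], comment['author'])
```

With real data, replace sample with load_comments() after the crawl has written comments.json.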
