/web_crawler

Fast and simple web crawler

Read this in other languages: Russian, हिन्दी, 中國人

Fast and simple crawler

How does it work?

It's very simple: your bot mass-follows other accounts on behalf of your account, and in response people follow you.

Preparing and running the bot

  1. Clone the repository or download the archive from GitHub, for example with the following commands on the command line:

    $ cmd
    $ git clone https://github.com/BEPb/github_bot
    $ cd github_bot
    
  2. Create a Python virtual environment.

  3. Install all necessary packages for our code to work using the following command:

    pip install -r requirements.txt
    
  4. Create a Scrapy project called nameproject:

    scrapy startproject nameproject
  5. This creates a folder named after the project, containing the minimum necessary files:
    scrapy.cfg            # deploy configuration file
    nameproject/          # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
  6. Go to the project folder:
    cd nameproject
  7. Create a quotes_spider.py file in the spiders/ folder and define in it which pages to crawl and how to process them.
  8. Launch the crawler:
    scrapy crawl quotes
  9. As a result, two new files are created, quotes-1.html and quotes-2.html, containing the content of the corresponding URLs, as our parse method specifies.
  10. Try out selectors in the Scrapy shell:
    scrapy shell 'https://quotes.toscrape.com/page/1/'
  11. View all 'title' elements using CSS. The result of response.css('title') is a list-like object called SelectorList: a list of Selector objects that wrap XML/HTML elements and let you run further queries to refine the selection or extract data.
    response.css('title')
  12. To extract the matched text as a list, use the getall() method:
    response.css('title::text').getall()
  13. The same can be done with XPath (get() returns only the first match):
    response.xpath('//title/text()').get()
  14. Now select the div tags with the class quote:
    response.css("div.quote")
  15. Take only the first element of the list and store it in a variable:
    quote = response.css("div.quote")[0]
  16. To extract the quote text and the author from that element, use the following commands:
    quote.css("span.text::text").get()
    quote.css("small.author::text").get()
  17. This is how to list the text of every tag link (a.tag inside div.tags) across all quote divs:
    response.css("div.quote").css("div.tags a.tag::text").getall()
  18. This is how to save the result in JSON format; the -O command-line switch overwrites any existing file:
    scrapy crawl quotes -O quotes.json
  19. This is how to save the result in CSV format:
    scrapy crawl quotes -O quotes.csv
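  20. The following command writes the results line by line in the JSON Lines (.jl) format; the lowercase -o switch appends to an existing file instead of overwriting it:
    scrapy crawl quotes -o quotes.jl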
