
PyScraper is a Python class, powered by well-known libraries, that makes it easier to scrape websites!




How to use?

First of all, you need to import the class:

from PyScraper import PyScraper

Done? Tested? No errors?

Create A PyScraper Object!

You can define the target URL when creating the object or set it later; either way works ^^

# Syntax
# PyScraper([url=None, handleRobotsTXT=False])
# Both arguments are optional.

scraper = PyScraper.PyScraper()
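
The `handleRobotsTXT` flag is not explained any further in this README. Assuming it makes the scraper respect a site's robots.txt, here is a minimal sketch of what such a check can look like using only Python's standard library. The `is_allowed` helper and the sample `ROBOTS_TXT` content are made up for illustration and are not part of PyScraper.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a network fetch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "http://www.example.com/index.html"))
print(is_allowed(ROBOTS_TXT, "http://www.example.com/private/data.html"))
```

The first URL is allowed and the second is not, because only paths under `/private/` are disallowed in the sample rules.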

Now that the object is created, let's try it out:

Normal Scraping:

# if you haven't set the URL while creating the object, use either of these
# scraper.setURL('http://www.example.com')
# or
# scraper.url = 'http://www.example.com'
scraper.url = 'http://www.example.com'
scrapedText = scraper.scrape("div.myDiv p")
# >>> [...] a list of <p> elements inside <div class="myDiv"></div>
# Selection is done using CSS selectors, as you can see above

Don't know how to use CSS selectors? No worries, check this.
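
As a rough illustration of what a selector such as `div.myDiv p` matches, the sketch below walks a small HTML snippet with the standard library's `html.parser` and collects the text of every `<p>` nested anywhere inside a `<div class="myDiv">`. It is not how PyScraper works internally. The `MyDivParagraphs` class and the sample `HTML` are made up for this example, and it returns plain text rather than element objects.

```python
from html.parser import HTMLParser

# Sample markup (an assumption for this example, not taken from a real site).
HTML = """
<div class="myDiv"><p>first</p><section><p>second</p></section></div>
<div class="other"><p>ignored</p></div>
<p>also ignored</p>
"""


class MyDivParagraphs(HTMLParser):
    """Collect the text of every <p> nested inside <div class="myDiv">."""

    def __init__(self):
        super().__init__()
        self.div_stack = []   # one flag per open <div>: True if it is div.myDiv
        self.in_p = False     # True while inside a matching <p>
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            self.div_stack.append("myDiv" in classes)
        elif tag == "p" and any(self.div_stack):
            # A <p> with a div.myDiv ancestor: this is what "div.myDiv p" means.
            self.in_p = True
            self.results.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.results[-1] += data


parser = MyDivParagraphs()
parser.feed(HTML)
print(parser.results)  # ['first', 'second']
```

The space in `div.myDiv p` is the descendant combinator, which is why the `<p>` wrapped in `<section>` still matches while the paragraphs outside `div.myDiv` do not.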

Does the website require you to log in before you can access the info you need to scrape?

- Check the example below. The HTML of the website's login page:

<form id="loginForm" action="/passinfo" method="post">
    <input type="text" name="username" />
    <input type="password" name="pwd" />
</form>

# let's define the URL while creating the object this time
scraper = PyScraper.PyScraper('http://www.example.com')
# you can also set the login URL, as below
# scraper.loginURL = 'http://www.example.com/login/'
########METHOD 1########
# select the right login form
scraper.setFormSelectors({"id":"loginForm", "method":"post"})
# or select the first form using scraper.selectFormByOrder(0)
scraper.UserNameFormName = 'username' # username input field name
scraper.PasswordFormName = 'pwd' # password input field name
scraper.login("MyUserName", "MyPassword")
# There are other ways to set these values; see the All Methods list
# at the bottom of this README.md file.

########METHOD 2########
scraper.setFormSelectors({"id":"loginForm", "method":"post"})
# or select the first form using scraper.selectFormByOrder(0)
scraper.otherLoginFormInfo = [("username", "MyUserName"), ("pwd", "MyPassword")]
# That's how you can add (name, value) pairs that are submitted with the form
scraper.login() # both arguments are optional, see All Methods below
# Now you should be logged in.
scrapedText = scraper.scrape("div.myDiv p")
# >>> [...] a list of <p> elements inside <div class="myDiv"></div>
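
To make the login step less of a black box, here is a sketch of the kind of request a form login boils down to: the form's `action` is resolved against the page URL and the field names and values are URL-encoded into a POST body. This is a standard-library illustration only and does not claim to be how PyScraper does it. The `build_login_request` helper is hypothetical, and the request is built but never sent.

```python
from urllib.parse import urlencode, urljoin
from urllib.request import Request


def build_login_request(page_url, form_action, fields):
    """Build (but do not send) the POST request a login form would submit."""
    # Resolve the form's action (e.g. "/passinfo") against the login page URL.
    target = urljoin(page_url, form_action)
    # Encode the (name, value) pairs the same way a browser encodes form fields.
    body = urlencode(fields).encode("utf-8")
    return Request(target, data=body, method="POST")


request = build_login_request(
    "http://www.example.com/login/",
    "/passinfo",
    [("username", "MyUserName"), ("pwd", "MyPassword")],
)
print(request.full_url)      # http://www.example.com/passinfo
print(request.get_method())  # POST
print(request.data)          # b'username=MyUserName&pwd=MyPassword'
```

To actually stay logged in, the request would have to be sent through an opener that keeps cookies, for example one built with `urllib.request.build_opener(urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar()))`, so that the session cookie is reused on later requests.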

All Methods

scraper.addHeader(k, [v=None]) # v=None removes the header
scraper.getHeader(k, [d=None]) # gets the value of a header
# k -> key/name, v -> value, d -> default
scraper.getStrippedDomain() # returns the stripped domain of the target URL
# e.g. if the URL is http://example.com/test/
# it returns example.com
scraper.url = url
# sets the target URL
scraper.loginURL = loginUrl
# sets login url
scraper.UserNameFormName = unameattr
# sets the name of the username input field
scraper.PasswordFormName = pwdattr
# sets the name of the password input field
scraper.addOtherLoginFormNames(name, value)
# adds another field name and its value to submit with the form
scraper.otherLoginFormInfo = [("name1", "value1"), ("name2", "value2")]
# same thing with a different syntax; you can add several names and values
# at once this way
scraper.formOrder = order
# selects the login form by its order; 0 is the first form
scraper.setFormSelectors(selectors)
# selects the login form by its attributes
# e.g. scraper.setFormSelectors(selectors={"id":"loginform", "name":"form"})
scraper.login([username=None, password=None]) # logs in using the info given above
scraper.addCookie(name, value[, domain='', path="/"])
scraper.getCookie(name, [directValue=False, domain='', path="/"])
# adding/getting cookies
# leave domain unset (or set it to '') to use the default scraping
# URL that was set earlier
# if directValue is set to True, the cookie value is returned directly
# as a string; otherwise a dictionary is returned that holds the value
# as returnedDict["value"]
scraper.scrape([selector=None])
# scraping function
# selector is a CSS selector given as a string
# whenever you use this, the scraped object becomes the "main" object that later calls scrape inside
# to avoid that you can use this below
# it scrapes from the doc directly no matter how many times you have used scraper.scrape([selector=None])
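
For `getStrippedDomain()` in the list above, the README only says that `http://example.com/test/` becomes `example.com`. A minimal sketch that gives the same result with the standard library could look like this. The `stripped_domain` function is hypothetical, and whether PyScraper also drops a port or a `www.` prefix is not stated here.

```python
from urllib.parse import urlparse


def stripped_domain(url: str) -> str:
    """Return only the host part of url, without scheme, port, or path."""
    return urlparse(url).hostname or ""


print(stripped_domain("http://example.com/test/"))  # example.com
```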

Thanks! That's all