G-Scraper

A GUI web scraper, written completely in Python

Primary language: Python · License: GNU General Public License v3.0 (GPL-3.0)


Please read How to Use and watch the Video Demo before messaging me with any concerns/issues.

DISCLAIMER: Some highly dynamic websites, like YouTube, use a lot of JavaScript to render their content. Since G-Scraper is built on Requests and BeautifulSoup4, which cannot execute JavaScript, these tools aren't suitable for scraping sites like that. As such, you will encounter problems scraping those sites.
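To see why JavaScript-rendered content is out of reach: Requests only downloads the raw HTML, and BeautifulSoup4 only parses it; neither executes scripts. A minimal illustration (the HTML snippet is invented):

```python
from bs4 import BeautifulSoup

# A page whose visible content is rendered by JavaScript: the raw HTML
# that Requests downloads contains only an empty container and a script.
raw_html = """
<html><body>
  <div id="videos"></div>
  <script>/* a browser's JS engine would inject the video list here */</script>
</body></html>
"""

soup = BeautifulSoup(raw_html, "html.parser")
container = soup.find("div", attrs={"id": "videos"})
print(repr(container.get_text(strip=True)))  # '' - the script never ran
```

A real browser (or a tool like Selenium/Playwright) would run the script and fill the container; the Requests + BeautifulSoup4 stack sees only the empty div above.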

What❓:

A GUI-based web scraper written in Python. Useful for data collectors who want a nice UI for scraping data from many sites.


Why❓:

I was browsing Reddit for fun project ideas and came across a thread where someone was complaining that there was no GUI web scraper. Thus I started working on G-Scraper.


Screenshots📷

URL adding menu

Elements adding menu

Web parameters adding menu

Presetting menu

Final menu


Features ✨:

(✅ means that it is implemented. ❌ means that I am working on it.)
  1. ✅ Supports 2 request types: GET & POST (at the moment)
  2. ✅ Shows all your added info in a list
  3. ✅ Can scrape multiple URLs
  4. ✅ Can scrape multiple elements from the same URL (webpage)
  5. ✅ Putting the two together, it can scrape multiple elements from multiple URLs, ensuring that each element is scraped from the URL it was assigned to
  6. ✅ Can pass request parameters with the request, EXCEPT files (for now)
  7. ✅ Since parameters can be passed, it can also handle logins/signups
  8. ✅ Saves the scraped data in a separate 'data/scraped-data' folder
  9. Has a logging function that logs 3 types of output:
    • ✅ Elemental (for elements)
    • ✅ Pagical (for webpages)
    • ✅ Error (for errors)
  10. ✅ Handles all types of errors
  11. ✅ The request function runs in a separate thread from the GUI, so you can do other things while your request runs
  12. ✅ Functionality to edit the variables once they have been added
  13. ✅ All errors are handled and logged
  14. ✅ Can delete an unwanted item from the list of added variables
  15. ✅ Can reset the entire app to start brand new after a scrape/set of scrapes
  16. ❌ Provides verbose output to user in the GUI
  17. ✅ User can set 'presets': if a user runs the same scrape repeatedly, they can save it as a preset, then load and run it without having to define the variables each time
  18. ✅ Can scrape links
  19. ✅ Generates a unique filename for each log AND saved-data file so that no mix-ups happen
DISCLAIMER 2: G-Scraper CAN ONLY SCRAPE TEXTUAL DATA (text, links, etc.), NOT things like images or videos.
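Feature 11 (running the request in a separate thread so the GUI stays responsive) can be sketched with the standard threading module; the function and variable names here are illustrative, not G-Scraper's actual code:

```python
import threading
import time

results = []

def run_scrape():
    """Stand-in for the request/scrape routine; in G-Scraper this runs
    on its own thread so the PyQt5 event loop stays responsive."""
    time.sleep(0.1)          # simulate network latency
    results.append("scraped data")

worker = threading.Thread(target=run_scrape)
worker.start()
# ...the GUI would keep processing events here while the worker runs...
worker.join()
print(results)  # ['scraped data']
```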

Libraries used to create this:

Main:
  • PyQt5 (for the GUI) 💻
  • Requests (for the web requests) 📶
  • BeautifulSoup4 (for scraping and parsing the HTML) 🍲
  • threading (for the separate threads) 🧵
Add ons:
  • datetime (used in logging and saved data file creation) 📅⌚
  • random (used in file creation) ❔
  • os (used to get current working directory) ⚡
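The datetime and random add-ons above hint at how unique filenames for logs and saved data (feature 19) can be generated. This is a sketch of the idea, not G-Scraper's exact naming scheme:

```python
import os
import random
from datetime import datetime

def unique_filename(prefix: str, ext: str = "txt") -> str:
    """Build a collision-resistant filename from a timestamp plus a
    random suffix, so concurrent scrapes never overwrite each other."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    suffix = random.randint(1000, 9999)
    return f"{prefix}-{stamp}-{suffix}.{ext}"

# e.g. data/scraped-data/scrape-20240713-154210-4821.txt
path = os.path.join("data", "scraped-data", unique_filename("scrape"))
print(path)
```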

Video Demo:

Here

How to use:

STEP 0: Install The App
-Clone this repository on your machine

git clone https://github.com/muaaz-ur-habibi/G-Scraper.git

-Move into the directory G-Scraper
-Run the command

pip install -r requirements.txt

to install the libraries
-Run the command

  python gui.py

inside your terminal to launch the app

STEP 1: Adding URLs
-Add sites to scrape.
-To do this, select the "Set the Site to scrape" button and enter the URL of a website you wish to scrape (you can add any number, one at a time), along with its request method (THIS IS COMPULSORY).
-Then just click on the "+" button and it is added.
-Note: the URL should have a format like 'https://someurl.com'; simply click the URL bar at the top of the webpage, press Ctrl+C, then Ctrl+V in the textbox.
-Note 2: add one URL at a time. Don't paste an entire list into the text box.
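Under the hood, the request method chosen here presumably maps to the corresponding Requests call. A minimal sketch of that dispatch (the function name and structure are my assumption, not G-Scraper's actual code):

```python
import requests

def fetch(url: str, method: str) -> requests.Response:
    """Dispatch to the matching Requests call for the request type
    chosen in the URL-adding menu (only GET and POST are supported)."""
    method = method.upper()
    if method == "GET":
        return requests.get(url, timeout=10)
    if method == "POST":
        return requests.post(url, timeout=10)
    raise ValueError(f"Unsupported request type: {method}")

# fetch("https://someurl.com", "GET") would download that page's HTML.
```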

STEP 2: Adding Elements (OPTIONAL)
-Add elements of that site to scrape.
-This is optional in the sense that if you don't specify any elements the app will scrape the entire webpage.
-To specify, click the "Set the elements to scrape" button.
-In here you are presented with 3 text boxes: one for the element name, one for the attribute to specify (OPTIONAL) and one for the attribute value (OPTIONAL).
-So if you want to scrape a div with a class of text-box, in the HTML of the webpage it would look like: div class="text-box". Here, "div" is the element name, "class" is the element attribute, and "text-box" is the attribute value.
-Once you have entered the element, you must then select the URL/site this element belongs to from the URLs you added in the previous step.
-Finally, click on the "+" button and it's added. Note: if there are multiple elements with the same properties you specified, the script will scrape all of their data.
-Note 2: it is possible to specify only the element name and nothing else; this will scrape all elements of that tag.
-Note 3: To obtain the necessary info about an element, you will have to inspect it. Just right-click on the element and select 'Inspect', and you will be presented with the element's HTML. Use the info in that HTML to scrape it.
-Note 4: If you have specified an a tag (a.k.a. a link tag) to be scraped, it won't scrape the tag's text but rather its link/href value. You can override this by going into 'requestExecutor.py', finding the part where it says 'if x['name'] == 'a'' and commenting out the else part; the a tag's text will then be scraped.
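The element name / attribute / attribute value triple described above maps directly onto BeautifulSoup4's find_all. A small self-contained sketch (the HTML is invented; this mirrors the behaviour described in the notes, not G-Scraper's exact code):

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <div class="text-box">First snippet</div>
  <div class="text-box">Second snippet</div>
  <a href="https://example.com/page">Read more</a>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Element name "div", attribute "class", attribute value "text-box":
# every matching element is collected, as described in the first note.
divs = soup.find_all("div", attrs={"class": "text-box"})
texts = [d.get_text(strip=True) for d in divs]
print(texts)  # ['First snippet', 'Second snippet']

# For <a> tags, G-Scraper returns the href value rather than the text:
links = [a["href"] for a in soup.find_all("a")]
print(links)  # ['https://example.com/page']
```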

STEP 3: Specifying Request Parameters
-Add the web request parameters/payloads to send with your request.
-Click on "Set Payloads or Headers for scrape".
-First, select the site you want to associate these parameters with.
-Then select the type. Currently only FILE is not fully implemented, so it will probably throw an unexpected error.
-The rest work fine. (NOTE: IF YOU DON'T WANT TO SEND ANY PARAMETERS, YOU MUST SPECIFY SO BY SELECTING THE SITE YOU DON'T WANT ANY PARAMETERS FOR AND SELECTING THE "NO PARAMETER" VALUE. LEAVE THE REST EMPTY AND ADD.)
-After you have selected your parameter type, specify its contents, then click "ADD (+)".
-Note: If you want to obtain the payload, headers, or any web parameter data, you can do so in the Networking tab of Dev Tools.
-Note 2: For sending files, more specifically images (currently only images have been tested for files), just type the payload name, then specify the complete path to the image file.
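The payloads and headers entered here correspond to the data and headers arguments of Requests. The sketch below prepares a POST (without sending it) to show what would be transmitted; the URL, credentials and header values are all made up:

```python
import requests

# Hypothetical parameter sets, as they might be entered in the menu:
headers = {"User-Agent": "G-Scraper-demo"}
payload = {"username": "demo", "password": "secret"}

# Preparing without sending shows what Requests would transmit; a real
# scrape would call requests.post(url, data=payload, headers=headers).
# Files would go in a third argument: files={"image": open(path, "rb")}.
req = requests.Request(
    "POST", "https://example.com/login", data=payload, headers=headers
)
prepared = req.prepare()
print(prepared.method)                 # POST
print(prepared.body)                   # username=demo&password=secret
print(prepared.headers["User-Agent"])  # G-Scraper-demo
```

Because login forms are just POST payloads like this, specifying the right field names and values is what lets G-Scraper handle logins/signups (feature 7).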

STEP 4: Starting Scrape
-Once you have everything set, you can start the scrape by clicking on "Start Scraping".
-Then once you have reviewed all the details, you can select "Yes".
-Note: If you haven't specified any elements to scrape, the app will give you a warning. If you forgot to add them, you can go back and specify them; otherwise you can just click on "Yes".

STEP 5: Setting Presets (OPTIONAL):
-You can also set presets; they are just what they sound like: you save some values, then in the future you can load those values without having to specify them explicitly.
-Currently, you can only set a preset for one URL at a time, but that URL can have as many elements and web parameters as you like.
-To set a preset, type in the values as described above, but instead of starting the scrape, click on the 'Set/Run Presets' button in the menu bar.
-Here you will be presented with an option to 'create a preset'.
-Then, to load that preset in the future:

  1. First load them from the database using the 'Load presets from database' button
  2. Next select the preset you would like to run
-The data will be loaded, although if you try to view it in the lists, it won't show up.
-Note: If you load a preset while some data is already in the app, the function will erase everything that was there and add only the preset data.
-Note 2: To run the preset, since all the values are loaded, just run the scrape as you usually would.
-Note 3: Preset names are case-sensitive, so muaazkhan, muaazKhan and Muaazkhan are all different

As of now, there really isn't a way to give verbose output to the user. So once you start the scrape, just wait a few seconds and check the scraped-data folder inside the data folder. Alternatively, if you find nothing there, check the logs folder to see if any error occurred.

Updates:

July 3, 2024

  • URL editing is implemented, but request-type editing is not.
  • Images are supported in files payload, since only they have been tested so far

July 4, 2024

  • Added functionality to scrape the links of a tags

July 5, 2024

  • Fixed some code mess
  • Started working on preset adding function
  • Finished the presetting GUI elements

July 7, 2024

  • Completed the basic presetting functionality, i.e. being able to take, clean and process all the necessary data
  • Also added some ifs and elses so that presetting now also can support webpage scraping

10 July, 2024

  • Completed the presetting functionality, with the exception of deleting a preset
  • Added a pop-up when scrape started to let user know

July 13, 2024

  • Added the functionality to delete a preset