Image Scraper for Google Drive, Imgur, AsiaChan, and more.

Primary language: Python. License: MIT.

ImageURLScraper

This project is no longer maintained. This was one of my first projects and may still work, but it's inefficient.

ImageURLScraper is a multi-site image scraper. It automatically detects which site a link points to and scrapes it. Only relevant images are scraped, and shortened links are automatically unshortened. If you have many links to process, you can attach an ID to each link so that the returned image links can be told apart.

Currently Supported Sites:
Asiachan - Checks all previous and next pages from its current location.
Google Drive - Checks all folders and grabs the first 1000 images in each folder.
Imgur - Grabs all images in a gallery.
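
The automatic site detection described above can be pictured as a hostname lookup. The following is a minimal sketch of that idea, not the library's actual implementation: `detect_site` and `SUPPORTED_HOSTS` are hypothetical names, and the host suffixes are taken from the example links in this README.

```python
from urllib.parse import urlparse

# Hypothetical mapping from host suffix to site name (assumed, not from the library).
SUPPORTED_HOSTS = {
    "asiachan.com": "asiachan",
    "drive.google.com": "googledrive",
    "imgur.com": "imgur",
}


def detect_site(url):
    """Return the supported site name for a URL, or None if unrecognized."""
    host = (urlparse(url).hostname or "").lower()
    for suffix, site in SUPPORTED_HOSTS.items():
        # Match the host itself or any subdomain of it (e.g. kpop.asiachan.com).
        if host == suffix or host.endswith("." + suffix):
            return site
    return None
```

In this sketch a shortened link such as a bit.ly URL returns None, so it would have to be unshortened before the site can be identified.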

Installation

In a terminal, type pip install imageurlscraper.

To get images from Google Drive, Google Drive API credentials are needed (the scraping process I had in place was too inefficient).
Steps to add Google Drive credentials:

Go to https://console.developers.google.com/apis/dashboard and at the top click + ENABLE APIS AND SERVICES.
Next, search for Google Drive API, click it, and then click Enable.
Select a project and then you'll be on a page with your project.
You will see a notice: "To use this API, you may need credentials. Click 'Create credentials' to get started."
Go ahead and click Create Credentials.
Once you finish the process, you should be able to download a JSON file.
Get your credentials and rename the JSON file to credentials.json.
Go to your project source and put credentials.json in the same folder as the file you are running.
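
Before running the scraper against Google Drive, it can help to confirm that the file from the last step is where it is expected. The following is a minimal sketch under that assumption; `load_credentials` is a hypothetical helper and is not part of imageurlscraper.

```python
import json
from pathlib import Path


def load_credentials(folder):
    """Load credentials.json from `folder`, raising a clear error if missing."""
    path = Path(folder) / "credentials.json"
    if not path.is_file():
        raise FileNotFoundError(f"Expected Google Drive credentials at {path}")
    with path.open(encoding="utf-8") as handle:
        # json.load raises json.JSONDecodeError if the file is not valid JSON.
        return json.load(handle)
```

From a script, this could be called as `load_credentials(Path(__file__).parent)`. On case-sensitive file systems the file name has to match exactly, so `credentials.JSON` would not be found.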

Sample Code

"""
This sample code calls the main function, which automatically processes the links
and returns a dict mapping IDs to their image links. The original link is not shown
in the result, which is why IDs are useful.
IDs are REQUIRED alongside their links, although they are only used for classifying links.
Links can have several IDs if necessary to group them together.
"""
import imageurlscraper
import pprint
pp = pprint.PrettyPrinter(indent=4)

list_of_links = [
    # The list must contain an ID along with each link.
    # The ID helps distinguish certain objects or people
    # when the dict is returned.
    [0, "https://kpop.asiachan.com/222040"],
    [1, 'https://imgur.com/a/mEUURoG'],
    [2, 'https://bit.ly/xxxxxxx'],
    [3, 'http://imgur.com/a/jRcrF'],
    # [999, 'https://drive.google.com/drive/folders/1uWIObdgq65-TmBcA8oJIWOnbuuR_H5PB']
    # This Google Drive folder has a lot of media and is skipped for testing purposes, but
    # Google Drive links like this one are supported, and every folder inside it is searched.
]

scraper = imageurlscraper.main.Scraper()
all_images = scraper.run(list_of_links)  # a dict with all the links of the images.
pp.pprint(all_images)  

# Want to send in a dict instead of a list?
# Dict Format is expected to be:
dict_links = {
    0: ["link1.com",
        "link2.com",
        "link3.com"
        ],
    1: [
       "link1.com",
       "link2.com",
       "link3.com"
    ],
    2: [
       "link1.com",
       "link2.com",
       "link3.com"
    ]
}
all_images = scraper.run(dict_links)
pp.pprint(all_images)
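
The list and dict input formats carry the same information. As a minimal sketch of how they relate, the hypothetical helper `pairs_to_dict` (not part of imageurlscraper) groups `[ID, link]` pairs into the dict format, so that links sharing an ID end up in one list.

```python
def pairs_to_dict(pairs):
    """Group [ID, link] pairs into {ID: [link, ...]}, keeping input order."""
    grouped = {}
    for link_id, link in pairs:
        # Create the list on first sight of an ID, then append to it.
        grouped.setdefault(link_id, []).append(link)
    return grouped


example = pairs_to_dict([
    [0, "https://kpop.asiachan.com/222040"],
    [1, "https://imgur.com/a/mEUURoG"],
    [1, "http://imgur.com/a/jRcrF"],
])
print(example)
```

Here both Imgur links are grouped under ID 1, while the Asiachan link stays alone under ID 0.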

Sample Output (dict)

{   1: [   'https://i.imgur.com/RUb6Xwl.jpg',
           ...],
    3: [   'https://i.imgur.com/ILixI73.jpg',
           ...],
    4: [   'https://i.imgur.com/X8jZOc7.jpg',
           ...],
    5: [   'https://i.imgur.com/L4SFme0.jpg',
           ...],
    6: [   'https://i.imgur.com/G2ltCDf.jpg',
           ...],
    204: [   'https://static.asiachan.com/Lee.Jueun.full.222040.jpg',
             ...]
}

More Samples

import imageurlscraper
scraper = imageurlscraper.main.Scraper()

shortened_link = "https://bit.ly/311n6vP"
unshortened_link = scraper.get_main_link(shortened_link)  # Expected Output -> http://google.com/


# Want to process links one by one or do not want to use IDs?
link = "https://imgur.com/a/mEUURoG"
image_links = scraper.process_source(link)  # Expected Output -> A LIST of image links.


# Want to run from the sources directly?
# Set `link` to a URL from the matching site first; each call is expected to return a LIST of image links.
images = imageurlscraper.asiachan.AsiaChan().get_all_image_links(link)  # Asiachan
images = imageurlscraper.googledrive.DriveScraper().get_links(link)  # Google Drive
images = imageurlscraper.imgur.MediaScraper().start(link)  # Imgur
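
Unshortening a link, as `get_main_link` does above, generally means following HTTP redirects until a URL no longer redirects. The sketch below illustrates only that general idea and is an assumption, not the library's code. `resolve_redirects` and its `next_location` callback are hypothetical; the callback stands in for an HTTP request so that the example runs offline.

```python
def resolve_redirects(url, next_location, max_hops=10):
    """Follow redirects from `url` until none remain or `max_hops` is hit.

    `next_location(url)` returns the redirect target, or None if the URL
    does not redirect. In real use it would issue an HTTP request.
    """
    seen = {url}
    for _ in range(max_hops):
        target = next_location(url)
        if target is None:
            return url
        if target in seen:
            raise ValueError(f"Redirect loop detected at {target}")
        seen.add(target)
        url = target
    raise ValueError(f"Gave up after {max_hops} redirects")


# Offline stand-in for a URL shortener: a fixed redirect table.
fake_redirects = {"https://short.example/abc": "https://example.com/gallery"}
final_url = resolve_redirects("https://short.example/abc", fake_redirects.get)
print(final_url)  # https://example.com/gallery
```

The hop limit and the loop check keep a misbehaving shortener from stalling the scraper.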