
urlFormatter automates the unshortening and cleaning of bulk URLs

Primary language: Python. License: MIT.

urlFormatter

urlFormatter converts bulk collections of raw URLs into an analytically useful format. Notably, it automates the otherwise labor-intensive processing of social media links: for example, it converts the URLs of social media posts to the URLs of the accounts/channels that posted them. When necessary, it extracts the required information from the source code of the page at the post URL. This makes URL-based quantitative analysis of online communities feasible.

The research project that it was developed for can be found at: https://irregularhorizons.substack.com/p/the-dugin-international.

As examples, it will:

  1. Convert social media URLs to the URL for the originating account/channel, ensuring that account/channel URLs are standardized within each platform.
  2. Unshorten short URLs.
  3. Remove everything from non-social media URLs other than the domain.
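As a rough illustration of the first and third transformations, the sketch below uses only Python's standard library. The helper names strip_to_domain and twitter_account_url are hypothetical and are not part of urlFormatter. The tool's own logic covers more platforms and, where needed, reads the page source, so treat this as a simplified picture rather than the actual implementation.

```python
from urllib.parse import urlparse


def strip_to_domain(url):
    """Reduce a URL to its bare domain (hypothetical helper)."""
    # urlparse only recognizes the host when a scheme or '//' prefix is present.
    parsed = urlparse(url if "//" in url else "//" + url)
    host = (parsed.hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def twitter_account_url(url):
    """Map a tweet URL to its account URL, or None if no handle is found.

    Naive assumption: the first path segment is the account handle.
    """
    segments = [s for s in urlparse(url).path.split("/") if s]
    return f"https://twitter.com/{segments[0].lower()}" if segments else None


print(strip_to_domain("https://www.example.com/news/story?id=7"))
# example.com
print(twitter_account_url("https://twitter.com/SomeUser/status/123"))
# https://twitter.com/someuser
```

Lowercasing the handle is one possible way to standardize account URLs within a platform; whether urlFormatter standardizes in exactly this way is an assumption.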

Here is an example of an input file:

[Image: example raw input file]

...and here is an example of an output file:

[Image: example cleaned output file]

Note that because this tool was created to clean links scraped from Telegram chats, its non-URL filter only targets the kinds of non-URLs that are typical of Telegram link scrapes; other non-URLs will not be discarded. To adapt the filter to your needs, modify the regular expression on line 151 of cleaner.py, as seen below:

[Screenshot: the regular expression in cleaner.py]
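If you want to prototype a replacement filter before editing cleaner.py, a sketch along the following lines may help. The pattern URL_LIKE and the function looks_like_url are hypothetical; they are not the expression used by urlFormatter, which is not reproduced here.

```python
import re

# Hypothetical pattern: optional http(s) scheme, one or more dot-separated
# labels, an alphabetic top-level label, then an optional port/path/query.
URL_LIKE = re.compile(
    r"^(?:https?://)?(?:[a-z0-9-]+\.)+[a-z]{2,}(?:[/?#:].*)?$",
    re.IGNORECASE,
)


def looks_like_url(line):
    """Return True if the stripped line resembles a URL under URL_LIKE."""
    return bool(URL_LIKE.match(line.strip()))


for candidate in ["https://t.me/channel/12", "example.com", "hello world", "mailto:a@b.com"]:
    print(candidate, looks_like_url(candidate))
```

Testing a candidate pattern against a sample of your own scrape is a quick way to see which lines it would keep and which it would discard.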

Installation

To install urlFormatter directly into the site-packages directory of your virtual environment, run:

pip install git+https://github.com/agile-enigma/urlFormatter.git

urlFormatter can now be accessed as a module from anywhere in your directory structure via import urlFormatter or accessed as a command-line script.

Usage

urlFormatter can either be run from the command line as a script, or imported as a module.

Module

urlFormatter comprises two classes: unshortener and cleaner. unshortener contains the unshorten method, which detects and unshortens short links. To create an unshortener object, type unshortener_obj = urlFormatter.unshortener(raw_links), where raw_links is a list containing the raw URLs that you would like to process. Execute the unshorten method by typing unshortener_obj.unshorten(). Upon completion, unshorten returns a list containing every raw_links item that is not a short URL, together with all of the unshortened URLs. It also prints the number of short URLs that were detected and the number that were successfully unshortened.
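The detection step can be pictured as a lookup of each URL's host against a list of known shortener domains. The sketch below is an assumption about the general approach, not the library's code: SAMPLE_SHORTENERS, is_short_url, and split_short_urls are hypothetical names, and the network step of resolving each short URL is omitted.

```python
from urllib.parse import urlparse

# Hypothetical sample; urlFormatter keeps its own list in known_shorteners.
SAMPLE_SHORTENERS = frozenset({"bit.ly", "t.co", "tinyurl.com"})


def is_short_url(url, shorteners=SAMPLE_SHORTENERS):
    """Return True if the URL's host is one of the shortener domains."""
    # Prefix '//' when no scheme is present so urlparse can find the host.
    host = urlparse(url if "//" in url else "//" + url).hostname or ""
    return host.lower() in shorteners


def split_short_urls(urls):
    """Partition URLs into (short, other), preserving input order."""
    short = [u for u in urls if is_short_url(u)]
    other = [u for u in urls if not is_short_url(u)]
    return short, other


short, other = split_short_urls(
    ["https://bit.ly/abc123", "https://example.com/page", "t.co/xyz"]
)
print(len(short), "short URLs detected")
```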

cleaner objects contain the clean method, which strips non-social media URLs down to their domains and converts social media URLs to the URLs of the originating accounts/channels. To create a cleaner object, type cleaner_obj = urlFormatter.cleaner(raw_with_expansion), where raw_with_expansion is a list that contains no short URLs (as might be obtained by running unshorten). When you execute the clean method by typing cleaner_obj.clean(), urlFormatter checks whether raw_with_expansion contains any short URLs. If it does, clean prompts the user to unshorten them and exits. Otherwise, upon completion, clean returns a list of fully formatted URLs and prints metrics on the number of items that were discarded owing to either an error or an inability to convert them.

A typical workflow performed on a list containing short URLs would look like this:

unshortener_obj = urlFormatter.unshortener(raw_links)
raw_with_expansion = unshortener_obj.unshorten()
cleaner_obj = urlFormatter.cleaner(raw_with_expansion)
formatted_urls = cleaner_obj.clean()

Please refer to the docstrings for information on unshortener and cleaner attributes.

Command Line

To run urlFormatter from the command line, type python3 -m urlFormatter [OPTIONS]. When prompted, provide the full path to the .txt file containing newline-separated URLs, as well as the identifier that you would like to use for output file-naming purposes. The options are as follows:

[Screenshot: command-line options]

The -u/--unshorten option runs the unshorten method. The -c/--clean option will run the program's clean method. Both command line options will output their results to a text file located in the directory that the script is executed from. These options can be run together, in which case the script will first run the unshorten method on the URLs found in the input file and then the clean method on the output of unshorten.
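If your URLs currently live in a Python list, one way to prepare the newline-separated .txt input file is sketched below. The function write_url_file is a hypothetical helper, not part of urlFormatter; it returns the absolute path so that you can paste it at the prompt.

```python
from pathlib import Path


def write_url_file(urls, path):
    """Write non-empty, stripped URLs one per line; return the absolute path."""
    lines = [u.strip() for u in urls if u.strip()]
    target = Path(path).resolve()
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return target
```

For example, write_url_file(raw_links, "telegram_links.txt") would create the file in the current directory, after which python3 -m urlFormatter -u -c can be pointed at the returned path.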

Built-in Integrity Checking and Troubleshooting Features

urlFormatter contains features that enable users to assess the integrity of results and troubleshoot any issues that might arise.

Attributes that may be of interest to the user include:

  • known_shorteners: list of shorteners used to detect shortened URLs.
  • clean_errors_df: pandas DataFrame containing error messages and associated URLs produced while executing the clean method. This is a cleaner object attribute.
  • unshorten_errors_df: pandas DataFrame containing error messages and associated URLs produced while executing the unshorten method. This is an unshortener object attribute.
  • garbage_df: pandas DataFrame containing counts of the lines that were discarded owing to an error, a failure of the program to extract target information from source code, or their not being URLs. This is a cleaner object attribute.

The platform-specific garbage bins that garbage_df comprises are as follows:

  • self.bitchute_garbage: list containing BitChute URLs that could not be cleaned.
  • self.facebook_garbage: list containing Facebook URLs that could not be cleaned.
  • self.gettr_garbage: list containing Gettr URLs that could not be cleaned.
  • self.ig_garbage: list containing Instagram URLs that could not be cleaned.
  • self.mail_garbage: list containing strings beginning with 'mailto:' followed by an e-mail address.
  • self.non_url_garbage: list containing non-URLs that could not be cleaned.
  • self.odysee_garbage: list containing Odysee URLs that could not be cleaned.
  • self.rumble_garbage: list containing Rumble URLs that could not be cleaned.
  • self.tiktok_garbage: list containing TikTok URLs that could not be cleaned.
  • self.twitter_garbage: list containing Twitter URLs that could not be cleaned.
  • self.youtube_garbage: list containing YouTube URLs that could not be cleaned.
  • self.vk_garbage: list containing VKontakte URLs that could not be cleaned.
  • self.fb_watch_garbage: list of URLs containing facebook.com/watch or fb.watch that could not be cleaned.
  • self.yt_watch_garbage: list of URLs containing youtube.com/watch or youtube.com/live that could not be cleaned.
  • self.final_overall_garbage: list containing all discarded items.
  • self.final_sm_garbage: list containing all discarded social media URLs.

In addition, the clean method will print metrics for URL processing, which include:

  • The total number of URLs that were successfully cleaned and its percentage of all valid URLs.
  • The total number of discarded social media URLs and its percentage of all social media-related URLs (both discarded and not discarded).
  • The total number of items that could either not be converted into a useful format or were not URLs and were consequently discarded.
  • The number of errors produced in the cleaning process.
  • The number of discarded items that are not accounted for by the combination of all of the garbage bins.
  • The number of items included in each garbage bin.

This last item is a cleaner attribute that can also be accessed via cleaner_obj.garbage_df. Additionally, a pandas DataFrame containing error data can be accessed via the cleaner clean_errors_df attribute.
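To make the arithmetic behind these metrics concrete, the sketch below counts the items in a set of garbage bins and expresses a count as a percentage of a total. The names bin_counts and share, and the sample data, are hypothetical; the exact definitions urlFormatter uses (for instance, what counts as a "valid" URL) are not specified here and may differ.

```python
def bin_counts(bins):
    """Map each bin name to its item count, largest bins first."""
    pairs = ((name, len(items)) for name, items in bins.items())
    return dict(sorted(pairs, key=lambda pair: pair[1], reverse=True))


def share(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1) if total else 0.0


# Made-up sample bins standing in for the garbage lists of a cleaner object.
sample_bins = {
    "twitter_garbage": ["https://twitter.com/", "https://twitter.com/?x"],
    "youtube_garbage": ["https://youtube.com/watch"],
    "mail_garbage": [],
}
counts = bin_counts(sample_bins)
discarded = sum(counts.values())
print(counts)
print(share(discarded, 12), "percent of 12 items discarded")
```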

The unshorten method will indicate how many short URLs were detected, and how many were successfully unshortened:

[Screenshot: unshorten output summary]

Error data can be accessed via the unshortener unshorten_errors_df attribute.