meanii/Search4

Suggestion: Use multi-threading & multi-processing to make search4 faster & more flexible

Opened this issue · 8 comments

Issue

Search4 currently uses a single thread to fetch & view results, which makes it too slow to use. And, as noted in the todo, new sites are also going to be added.
We face a Python limitation called the Global Interpreter Lock (GIL). The GIL means that at any given time only one thread may execute Python bytecode, regardless of how many cores are available. It was designed to keep threads from competing over shared state: the executing thread gains access to the whole environment. This significantly simplifies Python's thread implementation, but it limits CPU-bound parallelism.
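That said, the GIL is released while a thread blocks on I/O, so threads still speed up network-bound work like search4's requests. A minimal sketch (using `time.sleep` as a stand-in for a blocking HTTP call, not search4's real fetch code):

```python
# Sketch: four 0.2s "requests" run in four threads and overlap,
# finishing in roughly 0.2s total instead of 0.8s sequentially.
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url):
    time.sleep(0.2)  # stands in for a blocking HTTP request
    return url

urls = ['site1', 'site2', 'site3', 'site4']

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_fetch, urls))
elapsed = time.perf_counter() - start
```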

Suggestion


Example

import requests
import multiprocessing

BASE_URI = 'http://placewherestuff.is/?q='

def internet_resource_getter(thing):
    # pool.map calls this once per item in the input list
    session = requests.Session()
    response = session.get(BASE_URI + thing)
    return response.json()

stuff_that_needs_getting = ['a', 'b', 'c']

pool = multiprocessing.Pool(processes=3)
pool_outputs = pool.map(internet_resource_getter,
                        stuff_that_needs_getting)
pool.close()
pool.join()
print(pool_outputs)

Source

I can thread it this weekend (possibly tomorrow night) if no one else has done so yet. I would pick threading over multiprocessing for this use case. Also would you consider using a .yaml or .json file to hold all of your links/urls? Maybe read them into a list, and replace each of those result() calls with just one result() call inside of a loop? The username can still be added via string formatting for each iteration.

The main reason I would do that (besides shortening your code) is so that I can know the total number of URLs that you have. THEN I can break/chunk them into groups and give each its own thread. That way you can dynamically add more URLs (1000s if you want) as desired to the yaml file and the program will run just the same.

Or you can leave it the way it is, and I can just give each GET request its own thread. You may run into problems with this though if you add hundreds of URLs or more.
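The chunking idea could look roughly like this. `fetch_all` is a hypothetical stand-in for search4's per-site request logic, so the example stays self-contained:

```python
# Split N urls into groups and give each group its own thread.
from concurrent.futures import ThreadPoolExecutor

def chunk(items, n):
    """Split items into n roughly equal consecutive groups."""
    k, m = divmod(len(items), n)
    return [items[i * k + min(i, m):(i + 1) * k + min(i + 1, m)]
            for i in range(n)]

def fetch_all(group):
    # In search4 this would issue one GET per url in the group.
    return [url.upper() for url in group]  # placeholder work

urls = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fetch_all, chunk(urls, 3)))
flat = [item for group in results for item in group]
```

New URLs added to the yaml file just grow the list; the number of threads stays fixed.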

Most likely no one has done this yet.

The first option seems more applicable!

As search4's main objective is to search for a particular username on almost every social platform, we don't want to lose any site. But since search4 currently relies on web responses to fetch details, many sites don't return an error status code when a username is not found; instead they return a normal page containing an error message. So, to extract those error messages we will need to use bs4.

So would it be best to first add the bs4 snippet to search4 and then add multi-threading? Otherwise the multi-threading part would have to be re-coded.

I don't know if I am right or wrong

Sure... we might even just be able to extract error messages from the requests response via r.text... I'll see what I can come up with and then you can decide if you want to use it or not.
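A minimal sketch of the `r.text` idea: instead of trusting the HTTP status, look for the site's own error message in the response body. The phrases below are illustrative assumptions, not taken from any real site:

```python
# Hypothetical error phrases a site might show on a missing profile.
NOT_FOUND_PHRASES = ("page not found", "user not found", "doesn't exist")

def username_exists(html):
    """Return True unless the page body contains a known error phrase."""
    body = html.lower()
    return not any(phrase in body for phrase in NOT_FOUND_PHRASES)

# In search4 this would be called as username_exists(r.text)
# after r = session.get(url).
```

For sites where the message sits inside specific markup, bs4 could narrow the search to that element first.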

nm17 commented

Also would you consider using a .yaml or .json file to hold all of your links/urls? Maybe read them into a list, and replace each of those result() calls with just one result() call inside of a loop? The username can still be added via string formatting for each iteration.

I think we should load python modules from a folder and run some kind of function that checks if username exists. We can do that by using importlib.import_module. That way, it can check it without relying on one function (for example, as @7rillionaire said, not all sites return 404 when the user is not found). Although I think that if we could provide the same flexibility using yaml, it would be even better.
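The importlib idea could be sketched like this: each site ships as a module exposing a `check(username)` function, and the runner discovers them at load time. The plugin file and its `check` signature here are assumptions for illustration:

```python
# Discover per-site plugin modules from a folder with importlib.
import importlib
import pathlib
import sys
import tempfile

plugin_dir = pathlib.Path(tempfile.mkdtemp())
# A hypothetical example plugin; a real one would issue a request.
(plugin_dir / "examplesite.py").write_text(
    "def check(username):\n"
    "    return username == 'alice'\n"
)

sys.path.insert(0, str(plugin_dir))
checks = {}
for path in plugin_dir.glob("*.py"):
    module = importlib.import_module(path.stem)
    checks[path.stem] = module.check  # every plugin must define check()

found = {site: fn("alice") for site, fn in checks.items()}
```

Each site can then implement whatever detection logic it needs (status code, error text, redirects) behind the same `check` interface.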

nm17 commented

Some ideas for yaml files:

Facebook example

version: '1.0'
site:
  url:
    method: get_req
    url: 'https://facebook.com/{username}'
  check:
    method: status_eq
    status: '404'

Post request

version: '1.0'
site:
  url:
    method: post_req
    url: 'https://example.com/api/check_username'
    form:
      type: urlencoded
      data:
        username: "{username}"
  check:
    method: status_eq
    status: '404'

Check if text contains some string

version: '1.0'
site:
  url:
    method: post_req
    url: 'https://example.com/api/check_username'
    form:
      type: urlencoded
      data:
        username: "{username}"
  check:
    method: text_contains
    data: 'Not Found'
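A sketch of how those yaml rules could be interpreted. The dict below is what `yaml.safe_load` would produce for the `text_contains` example; the method names (`status_eq`, `text_contains`) follow the proposal above, and the dispatch function is a hypothetical design, not existing search4 code:

```python
# Parsed form of the text_contains rule from the yaml above.
rule = {
    "version": "1.0",
    "site": {
        "url": {"method": "post_req",
                "url": "https://example.com/api/check_username"},
        "check": {"method": "text_contains", "data": "Not Found"},
    },
}

def username_missing(check, status, text):
    """Apply one `check` block to a response's status code and body."""
    if check["method"] == "status_eq":
        return status == int(check["status"])
    if check["method"] == "text_contains":
        return check["data"] in text
    raise ValueError("unknown check method: " + check["method"])

check = rule["site"]["check"]
# A 200 page whose body says "Not Found" still counts as missing:
missing = username_missing(check, 200, "Error: Not Found")
```

New check methods (regex match, redirect detection, etc.) would just become new branches in the dispatcher, keeping the yaml files declarative.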

almost done... damn I forgot I had a Steam account hahaha

Runtime is much faster now. Pull Request #9 opened!

Thanks @rootVIII !

issue #8 solved in pull #9