Suggestion: Use multi-threading & multiprocessing to make search4 faster & more flexible
Opened this issue · 8 comments
Issue
Search4 currently uses a single thread to fetch & view results, which makes it too slow to use. And, as per the todo, new sites are also going to be added.
We face a Python limitation called the Global Interpreter Lock (GIL). The GIL means that at any given moment only one thread may execute Python bytecode, no matter how many cores are available. It was designed to keep threads from competing over shared state: the executing thread gains access to the whole environment. This feature of CPython's thread implementation significantly simplifies working with threads. Note, though, that the GIL is released while a thread waits on I/O, which is why threading can still speed up network-bound work like ours.
Suggestion
Example
import requests
import multiprocessing

BASE_URI = 'http://placewherestuff.is/?q='

def internet_resource_getter(stuff_to_get):
    # Each worker process gets its own Session
    session = requests.Session()
    stuff_got = []
    for thing in stuff_to_get:
        response = session.get(BASE_URI + thing)
        stuff_got.append(response.json())
    return stuff_got

stuff_that_needs_getting = ['a', 'b', 'c']
pool = multiprocessing.Pool(processes=3)
pool_outputs = pool.map(internet_resource_getter,
                        stuff_that_needs_getting)
pool.close()
pool.join()
print(pool_outputs)
Source
I can thread it this weekend (possibly tomorrow night) if no one else has done so yet. I would pick threading over multiprocessing for this use case. Also would you consider using a .yaml or .json file to hold all of your links/urls? Maybe read them into a list, and replace each of those result() calls with just one result() call inside of a loop? The username can still be added via string formatting for each iteration.
The main reason I would do that (besides shortening your code) is so that I can know the total number of URLs that you have. THEN I can break/chunk them into groups and give each its own thread. That way you can dynamically add more URLs (1000s if you want) as desired to the yaml file and the program will run just the same.
Or you can leave it the way it is, and I can just give each GET request its own thread. You may run into problems with this, though, if you add hundreds of URLs or more.
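A minimal sketch of the first option, under some assumptions: the URL templates live in a JSON file (hypothetical name sites.json) as a plain list of strings, and a thread pool runs one fetch per URL. ThreadPoolExecutor caps the number of live threads, so hundreds of URLs can be added without spawning hundreds of threads.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def load_urls(path):
    # Read the list of URL templates,
    # e.g. ["https://facebook.com/{username}", ...]
    with open(path) as fh:
        return json.load(fh)


def check_all(url_templates, username, fetch, max_workers=8):
    # Format the username into every template, then fetch concurrently.
    # `fetch` is any callable taking a URL, e.g. one that wraps requests.get.
    urls = [template.format(username=username) for template in url_templates]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with the templates.
        return list(pool.map(fetch, urls))
```

With a stub fetcher, `check_all(["https://x/{username}"], "bob", fetch=lambda u: u)` returns `["https://x/bob"]`; in search4 the fetcher would wrap requests.get instead.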
Most likely no one has done this yet.
The first option seems more applicable!
As search4's main objective is to search for a particular username on almost every social platform, we don't want to lose any site. But since search4 currently relies on web responses (status codes) to fetch details, there is a catch: many sites don't give a bad response
when a username is not found; instead they serve a normal page with an error message. So, to get at those error messages we will need to use bs4.
So would it be best to first add the bs4 snippet to search4 and then add multi-threading? Otherwise the multi-threading part will have to be re-coded.
I don't know if I am right or wrong
Sure... we might even just be able to extract the error messages from your requests’ response via r.text... I’ll see what I can come up with and then you can decide if you want to use it or not.
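A minimal sketch of that r.text idea: instead of trusting the HTTP status code, look for a known "no such user" phrase in the response body. The phrases below are hypothetical examples, not search4's actual list; per-site phrases could later come from the config file discussed above.

```python
# Hypothetical per-site error phrases; the real list would be configured
# per platform, since every site words its "not found" page differently.
ERROR_PHRASES = [
    "Sorry, this page isn't available",
    "User not found",
    "Page Not Found",
]


def username_exists(page_text, error_phrases=ERROR_PHRASES):
    # Return False if the page body contains a known "no such user" message.
    lowered = page_text.lower()
    return not any(phrase.lower() in lowered for phrase in error_phrases)
```

Usage would be `username_exists(requests.get(url).text)`; bs4 only becomes necessary if a site buries the message in markup that needs real parsing.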
Also would you consider using a .yaml or .json file to hold all of your links/urls? Maybe read them into a list, and replace each of those result() calls with just one result() call inside of a loop? The username can still be added via string formatting for each iteration.
I think we should load Python modules from a folder and run some kind of function that checks if the username exists. We can do that by using importlib.import_module. That way, it can check without relying on one function (for example, as @7rillionaire said, not all sites return 404 when the user is not found). Although I think that if we could provide the same flexibility using yaml, it would be even better.
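A sketch of the importlib idea, under an assumed layout: each site lives in its own module inside a hypothetical sites/ package, and every module exposes a check(username) function. The loader imports modules by name at runtime, so adding a site means dropping in a new file.

```python
import importlib


def load_site_checker(module_name, package="sites"):
    # Import e.g. sites.facebook and return its check() callable.
    # Assumes every site module defines check(username).
    module = importlib.import_module(f"{package}.{module_name}")
    return module.check


def run_checks(site_names, username, loader=load_site_checker):
    # Run every site module's check() against the username.
    # `loader` is injectable so the mechanism can be tested without
    # a real sites/ package on disk.
    return {name: loader(name)(username) for name in site_names}
```

The design choice here is dependency injection on `loader`: the dynamic-import plumbing stays in one place, and each site module only has to agree on the check(username) contract.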
Some ideas for yaml files:
Facebook example
version: '1.0'
site:
  url:
    method: get_req
    url: 'https://facebook.com/{username}'
  check:
    method: status_eq
    status: '404'
Post request
version: '1.0'
site:
  url:
    method: post_req
    url: 'https://example.com/api/check_username'
    form:
      type: urlencoded
      data:
        username: "{username}"
  check:
    method: status_eq
    status: '404'
Check if text contains some string
version: '1.0'
site:
  url:
    method: post_req
    url: 'https://example.com/api/check_username'
    form:
      type: urlencoded
      data:
        username: "{username}"
  check:
    method: text_contains
    data: 'Not Found'
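A sketch of evaluating the check block from configs like the ones above, assuming the YAML has already been parsed into a dict (e.g. with PyYAML's yaml.safe_load). The method names status_eq and text_contains come from the examples; everything else is an assumption.

```python
def passes_check(check, status_code, body_text):
    # Return True when the response matches the config's "check" rule,
    # i.e. when the site is signalling "username not found".
    method = check["method"]
    if method == "status_eq":
        # Configs store the status as a string ('404'), so compare as strings.
        return str(status_code) == str(check["status"])
    if method == "text_contains":
        # Body-based check for sites that return 200 with an error message.
        return check["data"] in body_text
    raise ValueError(f"unknown check method: {method}")
```

For the Facebook example above, `passes_check({"method": "status_eq", "status": "404"}, response.status_code, response.text)` would report whether the username is missing; new check methods would just become new branches (or entries in a dispatch dict).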
almost done... damn I forgot I had a Steam account hahaha