Person insights

This module is a Python search engine that retrieves public information about a person from a number of sources.

To use it, you need a Twitter API key, a LinkedIn account and a New York Times API key (see the code for more information).

Steps:

  • query Forbes and crawl the site with Selenium to get info
  • query the Wikipedia API, flag whether a page exists, and scrape info and the summary if it does
  • query the LinkedIn API to get profession and past experience
  • query Twitter for the number of followers and whether the account is verified
  • crawl Google search, Google news and a number of news sites (Financial Times, The Economist, bilan.ch, challenges.fr)
  • query the New York Times API
  • build a model to predict whether the person is famous or politically exposed, mainly from online presence
  • apply model and record probability

Further improvements:

  • look into other APIs
  • create a dictionary to convert a country name into its 3-letter country code
  • create a model to estimate wealth
  • create an option to return wealth in different units
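
The country-code and wealth-unit items above could start from something like the sketch below. It is not part of the module: the names `COUNTRY_TO_ALPHA3`, `country_to_alpha3`, `WEALTH_UNITS` and `convert_wealth` are hypothetical, and the table holds only a handful of ISO 3166-1 alpha-3 entries for illustration.

```python
# Hypothetical lookup table (not in the module); values are ISO 3166-1 alpha-3 codes.
# Only a few entries are shown; a real table would cover every country.
COUNTRY_TO_ALPHA3 = {
    "france": "FRA",
    "switzerland": "CHE",
    "united states": "USA",
    "united kingdom": "GBR",
}


def country_to_alpha3(name):
    """Return the 3-letter code for a country name, or None if it is unknown."""
    return COUNTRY_TO_ALPHA3.get(name.strip().lower())


# Hypothetical scale factors, expressed relative to a single currency unit.
WEALTH_UNITS = {"units": 1, "thousands": 1e3, "millions": 1e6, "billions": 1e9}


def convert_wealth(amount, from_unit, to_unit):
    """Rescale an amount between units, e.g. from billions to millions."""
    return amount * WEALTH_UNITS[from_unit] / WEALTH_UNITS[to_unit]
```

Normalising the name with `strip().lower()` keeps the lookup tolerant of the inconsistent casing and spacing that scraped pages tend to produce.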

How to use it

See notebook/search engine.ipynb for more details. The current workflow is the following:

  • launch the web driver with a visible window so you can watch its behavior:
driver = data_acquisition.launch_browser_driver(headless=False)
  • create a person object (info will contain only firstname and lastname):
person = data_acquisition.Person('Jeff', 'Bezos', middlename='Preston', driver=driver)
person.print_info()
  • get info sequentially
person.get_info_from_Forbes()
person.get_info_from_Wikipedia()
person.get_info_from_LinkedIn()
person.get_info_from_Twitter()
person.get_info_from_Google()
person.get_info_from_nytimes()
  • print results
person.print_info()
  • run famous people model
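from importlib import reload  # Python 3 only; reload is a builtin on Python 2
reload(data_modeling)  # pick up any edits made to data_modeling since it was imported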
data_modeling.predict_PEP(person)
  • print final information
person.print_info()
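
Because each source depends on an external site or API, any single call can fail. The sequential step above can be wrapped as in the sketch below, so that one failing source does not stop the others. The method names come from the workflow above; the helper `collect_all` and the list `SOURCE_METHODS` are hypothetical and not part of the module.

```python
import logging

# Method names as they appear in the workflow above.
SOURCE_METHODS = [
    "get_info_from_Forbes",
    "get_info_from_Wikipedia",
    "get_info_from_LinkedIn",
    "get_info_from_Twitter",
    "get_info_from_Google",
    "get_info_from_nytimes",
]


def collect_all(person, method_names=SOURCE_METHODS):
    """Call each source method in order and return the names of those that failed.

    Hypothetical helper: it assumes only that `person` exposes the listed
    methods and that each can be called without arguments.
    """
    failed = []
    for name in method_names:
        try:
            getattr(person, name)()
        except Exception:
            # Log the traceback and move on to the next source.
            logging.exception("Source %s failed", name)
            failed.append(name)
    return failed
```

Calling `collect_all(person)` followed by `person.print_info()` would then show whatever the working sources returned, while the returned list names the sources to retry.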

Data sources

Wealth

Below is a list of additional sources of information:

  • Forbes (use Selenium and headless Chrome to crawl the Forbes website).
  • another source of wealth information
  • build a correlation between company wealth and CEO wealth (get CEO info from LinkedIn)
  • YouTube/Facebook/Instagram stars
  • Swiss public employees
  • Glassdoor salary from LinkedIn profession
  • Panama papers
  • Politicians: public tax declarations in France, Switzerland, US
  • actors: IMDb artist fees

Public exposure

Features:

  • wikipedia_presence
  • Google_search_nresults
  • Google_news_nresults
  • Financial_news_nresults
  • nytimes_nresults
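
As an illustration of how the features above could feed a probability, here is a minimal sketch using a logistic function over log-scaled result counts. This is not the model in `data_modeling`: the weights, the bias and the function names are hypothetical and chosen only to show the shape of the computation.

```python
import math

COUNT_FEATURES = [
    "Google_search_nresults",
    "Google_news_nresults",
    "Financial_news_nresults",
    "nytimes_nresults",
]

# Hypothetical weights and bias, for illustration only; a real model would fit them.
WEIGHTS = {
    "wikipedia_presence": 2.0,
    "Google_search_nresults": 0.3,
    "Google_news_nresults": 0.5,
    "Financial_news_nresults": 0.6,
    "nytimes_nresults": 0.6,
}
BIAS = -6.0


def feature_vector(features):
    """Turn raw features into numbers: a 0/1 flag plus log10-scaled counts."""
    vector = {"wikipedia_presence": 1.0 if features.get("wikipedia_presence") else 0.0}
    for key in COUNT_FEATURES:
        # log10(1 + n) keeps huge result counts from dominating the score.
        vector[key] = math.log10(1 + features.get(key, 0))
    return vector


def exposure_probability(features):
    """Map the weighted feature sum to a value in (0, 1) with a logistic function."""
    vector = feature_vector(features)
    score = BIAS + sum(WEIGHTS[key] * vector[key] for key in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-score))
```

Result counts span many orders of magnitude, which is why the sketch log-scales them before weighting.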

Sources:

  • Wikipedia: use the Wikipedia API through a Python package.
  • Twitter: number of followers
  • Startpage: number of results
  • Financial news: site:bilan.ch OR site:challenges.fr OR site:forbes.com OR site:ft.com OR site:economist.com
  • news websites (BBC, New York Times)
  • blogs
  • CIA World Factbook
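
The financial-news filter listed above can be assembled from a list of domains, as in the sketch below. The function names are hypothetical, and quoting the full name as an exact phrase is an assumption rather than something the module is known to do.

```python
# Domains taken from the financial-news filter listed above.
FINANCIAL_NEWS_SITES = [
    "bilan.ch",
    "challenges.fr",
    "forbes.com",
    "ft.com",
    "economist.com",
]


def site_filter(sites):
    """Join domains into a 'site:a OR site:b' search filter."""
    return " OR ".join("site:" + site for site in sites)


def financial_news_query(firstname, lastname, sites=FINANCIAL_NEWS_SITES):
    """Build a search string: the quoted full name followed by the site filter.

    Hypothetical helper; the exact-phrase quoting is an assumption.
    """
    return '"{} {}" {}'.format(firstname, lastname, site_filter(sites))
```

Keeping the domains in a list makes it easy to add or drop a news site without editing the query string by hand.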

Other info

  • LinkedIn (via Google?): profession, experience, number of followers
  • Google, Wikipedia
  • white pages API