Insight Data Science Fellow Background Investigation, including data cleaning, wrangling, and analysis
Python and IPython Notebook are used.
The following Python libraries are used in this investigation:
1. regular expression (re)
2. requests
3. pandas
4. BeautifulSoup
5. folium (and its plugins)
6. seaborn
1. Web Scraping for data collection
2. Data Wrangling and Retrieving the Geo-Locations of Fellows
3. API and Mapping (using a free API)
4. Background Investigation
I navigate the Insight Data Science Fellows website to retrieve the background information of Insight's fellows. Insight's website is built in a structured way, which makes scraping the structured data convenient.
The basic workflow in this section is the following (a minimal sketch appears after the list):
- Request the HTML data page by page => utilize Requests
- Scrape the HTML data from the pages in a structured way => BeautifulSoup is recommended for HTML data
- Manipulate strings with regular expressions
- Store the scraped data in DataFrame format for further wrangling => use Pandas
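Below is a minimal sketch of that workflow. The URL pattern, the `div.fellow` markup, and the bio format are hypothetical placeholders; the real selectors depend on how Insight's pages are laid out.

```python
import re

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical URL pattern -- the real fellows pages may be organized differently.
BASE_URL = "https://www.insightdatascience.com/fellows?page={}"

records = []
for page in range(1, 4):  # request the HTML data page by page
    resp = requests.get(BASE_URL.format(page))
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    # Hypothetical markup: one <div class="fellow"> card per fellow.
    for card in soup.find_all("div", class_="fellow"):
        name = card.find("h3").get_text(strip=True)
        bio = card.find("p").get_text(strip=True)
        # A regular expression pulls the major out of a bio such as
        # "PhD in Physics, Stanford" (this bio format is an assumption).
        match = re.search(r"PhD in ([\w\s]+),", bio)
        major = match.group(1).strip() if match else None
        records.append({"name": name, "major": major, "bio": bio})

# Store the scraped data in a DataFrame for further wrangling.
df = pd.DataFrame(records)
df.to_csv("fellows_raw.csv", index=False)
```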
Even with the web data successfully loaded into a Pandas DataFrame, the scraped data are sometimes quite raw. The rawness in this web-scraping step comes from two sources:
- missing data, null data, or a non-PhD background
- misplacement within the HTML
The raw data therefore require further cleaning, and can sometimes be reasonably repaired.
The basic cleaning workflow is the following (see the sketch after the list):
- Identify the null data in the DataFrame
- Check whether the misplaced data can be fixed or repaired
- Save the cleaned data to a new CSV
The workflow can be implemented with Pandas.
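A minimal sketch of this cleaning pass, continuing from the file written above; the column names and the recovery regex are hypothetical:

```python
import pandas as pd

df = pd.read_csv("fellows_raw.csv")  # raw scraped data from the previous step

# Identify the null data in the DataFrame.
print(df.isnull().sum())

# Example repair of misplaced data: if a major was parsed into the bio
# column instead, try to recover it (column names are assumptions).
missing = df["major"].isnull()
df.loc[missing, "major"] = df.loc[missing, "bio"].str.extract(
    r"PhD in ([\w\s]+),", expand=False
)

# Drop the rows that still cannot be fixed, then save to a new CSV.
df_clean = df.dropna(subset=["major"])
df_clean.to_csv("fellows_clean.csv", index=False)
```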
In this section, we utilize a Folium map to provide interactive markers. In addition, the MarkerCluster plugin shows the aggregated number of fellows per regional cluster on the map.
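A sketch of the mapping step, assuming the cleaned data already carry `lat`/`lon` columns from the geocoding step (the column names are placeholders):

```python
import folium
import pandas as pd
from folium.plugins import MarkerCluster

df = pd.read_csv("fellows_clean.csv")  # assumed to include lat/lon columns

# Center the map on the continental US.
m = folium.Map(location=[39.8, -98.6], zoom_start=4)

# MarkerCluster aggregates nearby markers into a regional count
# that splits apart as you zoom in.
cluster = MarkerCluster().add_to(m)
for _, row in df.iterrows():
    folium.Marker(
        location=[row["lat"], row["lon"]],
        popup=row["name"],
    ).add_to(cluster)

m.save("fellows_map.html")
```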
In this section, we use the following tools for the background check of the fellows (a sketch appears after the list):
- Pandas `apply` => this plays the same role as SQL's `CASE WHEN`, producing the categorized data
- Visualize the distribution of majors => use Seaborn
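A sketch of the categorization and the plot; the category buckets and column names are illustrative assumptions:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("fellows_clean.csv")

# Pandas apply plays the role of SQL's CASE WHEN: map each raw major
# into a coarser category (the buckets below are illustrative).
def categorize(major):
    if major in ("Physics", "Astronomy", "Astrophysics"):
        return "Physical Sciences"
    if major in ("Biology", "Neuroscience", "Biochemistry"):
        return "Life Sciences"
    return "Other"

df["major_category"] = df["major"].apply(categorize)

# Visualize the distribution of majors with Seaborn.
sns.countplot(y="major_category", data=df)
plt.tight_layout()
plt.show()
```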