#startingDataEngineeringFromScratch

Project Topic: Scrape Company Information Listed on Y Combinator and Analyze It



Project Objective - (Last updated: August 2022)

The project aims to demonstrate end-to-end data engineering skills by performing ETL tasks and analyses on the companies listed on Y Combinator (https://ycombinator.com/companies). Its core purpose is to help beginners optimize data pipelines. In the project doc, three different approaches were used:

  • An approach that made the extraction process run for 3 hours
  • An approach that ran for 12 mins
  • An approach that ran for 1.06 mins
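To put the three runtimes in perspective, the short sketch below computes how much faster each approach is relative to the first one. Only the three durations come from the project; the labels `approach_1` to `approach_3` and the `speedup` helper are hypothetical names used for illustration.

```python
# Runtimes reported above, converted to minutes.
# The keys are hypothetical labels; only the durations come from the project.
runtimes_min = {
    "approach_1": 3 * 60,  # 3 hours
    "approach_2": 12.0,    # 12 mins
    "approach_3": 1.06,    # 1.06 mins
}


def speedup(slow_min: float, fast_min: float) -> float:
    """Return how many times faster the fast run is than the slow run."""
    return slow_min / fast_min


baseline = runtimes_min["approach_1"]
speedups = {
    name: round(speedup(baseline, minutes), 1)
    for name, minutes in runtimes_min.items()
}

for name, factor in speedups.items():
    print(f"{name}: {factor}x faster than approach_1")
```

Under these figures the second approach is roughly 15 times faster than the first, and the third is roughly 170 times faster.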

Pipeline Architecture

This is the architecture diagram for the simple end-to-end ETL pipeline. Data-model-ETL-YC Scraper


Analysis Result

Click here for full analysis details

From the analyzed data, here are the insights:

  • The top 5 countries under Y Combinator are:
    • USA
    • India
    • Canada
    • UK
    • Nigeria
  • Companies from the USA account for 65.3% of Y Combinator start-ups
  • Nigeria is the only African country with more than 10 start-ups under Y Combinator
  • Companies with a single founder make up the largest share, 42.4%, while companies with 2 and 3 founders account for 38.6% and 13.9% respectively. The remaining share is split between companies with 4 and 5 founders.
  • Airbnb is the largest company under YC in terms of employees
  • The total number of people YC has empowered is 90373
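The founder and team-size figures above can be reproduced with a few aggregations. The sketch below is a minimal illustration using only the standard library; the rows are hypothetical placeholders, not real scraped data, and it assumes that the "total empowered" figure is the sum of `team_size` across companies, which the project does not state explicitly.

```python
from collections import Counter

# Hypothetical rows for illustration; only the field names come from the project.
companies = [
    {"company_name": "A", "team_size": 10, "active_founders": ["x"]},
    {"company_name": "B", "team_size": 25, "active_founders": ["x", "y"]},
    {"company_name": "C", "team_size": 5, "active_founders": ["x"]},
    {"company_name": "D", "team_size": 60, "active_founders": ["x", "y", "z"]},
]


def founder_count_share(rows):
    """Percentage of companies for each number of active founders."""
    counts = Counter(len(row["active_founders"]) for row in rows)
    total = sum(counts.values())
    return {n: round(100 * c / total, 1) for n, c in sorted(counts.items())}


def total_team_size(rows):
    """Sum of team sizes (assumed here to be the 'total empowered' metric)."""
    return sum(row["team_size"] for row in rows)


def largest_company(rows):
    """Name of the company with the largest team."""
    return max(rows, key=lambda row: row["team_size"])["company_name"]


print(founder_count_share(companies))
print(total_team_size(companies))
print(largest_company(companies))
```

With Pandas, the same results would typically come from `value_counts(normalize=True)`, `sum()` and `idxmax()` on the corresponding columns.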

Information to scrape

The image below indicates the information to be scraped for analysis.

  1. company_name (company's summary and tags)
  2. short_description (company's summary and tags)
  3. tags (company's summary and tags)
  4. link (company's link)
  5. company_socials (company's info)
  6. founded (company's info)
  7. team_size (company's info)
  8. location (company's info)
  9. active_founders (Founder's description)
  10. about_founders (Founder's info)
  11. description (Company's description)

Screenshot 2022-04-03 at 7 29 58 PM
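One way to keep the eleven fields consistent across the pipeline is to describe a scraped record with a small typed structure. The sketch below is an assumption about how such a record could be modeled, not the project's actual code: the `CompanyRecord` class name, the types and the defaults are hypothetical, while the field names and their order follow the header of the output data sample.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class CompanyRecord:
    """One scraped company; field order mirrors the output sample header."""

    company_name: str
    link: str
    short_description: str = ""
    tags: List[str] = field(default_factory=list)
    company_socials: List[str] = field(default_factory=list)
    founded: Optional[int] = None      # year, when listed on the page
    team_size: Optional[int] = None    # number of employees, when listed
    location: str = ""
    active_founders: List[str] = field(default_factory=list)
    about_founders: List[Dict[str, object]] = field(default_factory=list)
    description: str = ""


record = CompanyRecord(
    company_name="Airbnb",
    link="http://airbnb.com",
    tags=["W09", "Public", "Marketplace", "Travel"],
    founded=2008,
    team_size=5000,
    location="San Francisco",
)

# asdict() gives a plain dict that is ready to be written out as one row.
row = asdict(record)
print(list(row.keys()))
```

Optional fields default to empty values so that a company page with missing sections still produces a complete row.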


Important Notice

If the code breaks, the most likely cause is a change in the site's markup. First check whether the HTML tags referenced in the code are still valid. If they are not, update them to match the current page.
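A quick way to run that check is to confirm that the tag and class pairs the scraper relies on still appear in the page source. The sketch below uses only the standard library's `html.parser`; the selectors in `EXPECTED` and the sample HTML are hypothetical and would need to be replaced with the ones actually used in the code.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every (tag, class) pair that appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.seen = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                for cls in value.split():
                    self.seen.add((tag, cls))


def missing_selectors(html, expected):
    """Return the expected (tag, class) pairs that are absent from the page."""
    collector = ClassCollector()
    collector.feed(html)
    return [selector for selector in expected if selector not in collector.seen]


# Hypothetical selectors; replace them with the ones the scraper depends on.
EXPECTED = [("h1", "company-name"), ("div", "team-size")]

sample_html = (
    '<h1 class="company-name big">Airbnb</h1>'
    '<span class="team-size">5000</span>'
)

# Any pair listed here no longer matches the page and needs updating.
print(missing_selectors(sample_html, EXPECTED))
```

In this sample the class `team-size` has moved from a `div` to a `span`, so that pair is reported as missing.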


Output data sample

  • company_name: Airbnb
  • link: http://airbnb.com
  • short_description: Book accommodations around the world.
  • tags: ['W09', 'Public', 'Marketplace', 'Travel']
  • company_socials: ['https://www.linkedin.com/company/airbnb/', 'https://twitter.com/Airbnb', 'https://www.facebook.com/airbnb/', 'https://www.crunchbase.com/organization/airbnb']
  • founded: 2008
  • team_size: 5000
  • location: San Francisco
  • active_founders: ['Nathan Blecharczyk', 'Brian Chesky', 'Joe Gebbia']
  • about_founders: [{'name': 'Joe Gebbia, CPO', 'role': 'CPO', 'social_media_links': ['https://twitter.com/jgebbia']}, {'name': 'Joe Gebbia, CPO', 'role': 'CPO', 'social_media_links': ['https://twitter.com/jgebbia']}, {'name': 'Joe Gebbia, CPO', 'role': 'CPO', 'social_media_links': ['https://twitter.com/jgebbia']}]
  • description: Founded in August of 2008 and based in San Francisco, California, Airbnb is a .. \n
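Several columns in the sample hold Python-style lists written out as text. If the output is stored as CSV (an assumption, since the file format is not stated here), those columns come back as plain strings when the file is read and need to be parsed before analysis. The sketch below shows one possible way to do that with `ast.literal_eval`; the `parse_row` helper and the shortened sample are hypothetical.

```python
import ast
import csv
import io

# Columns that hold list literals in the output sample.
LIST_COLUMNS = ("tags", "company_socials", "active_founders", "about_founders")
INT_COLUMNS = ("founded", "team_size")


def parse_row(row):
    """Convert list-like and numeric strings in one CSV row to Python values."""
    parsed = dict(row)
    for col in LIST_COLUMNS:
        if parsed.get(col):
            # literal_eval only accepts Python literals, so it is safer than eval.
            parsed[col] = ast.literal_eval(parsed[col])
    for col in INT_COLUMNS:
        if parsed.get(col):
            parsed[col] = int(parsed[col])
    return parsed


# Shortened, hypothetical CSV built from values in the sample record.
sample_csv = io.StringIO('''company_name,tags,founded,team_size,active_founders
Airbnb,"['W09', 'Public', 'Marketplace', 'Travel']",2008,5000,"['Nathan Blecharczyk', 'Brian Chesky', 'Joe Gebbia']"
''')

rows = [parse_row(r) for r in csv.DictReader(sample_csv)]
print(rows[0]["tags"])
print(len(rows[0]["active_founders"]))
```

Columns that are missing or empty are left untouched, so the helper also works on partial exports.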

Tools

  • Selenium - (for dynamic scraping)
  • BeautifulSoup - (for static scraping)
  • Pandas - (for data cleaning)
  • Matplotlib
  • Seaborn
  • S3

Data Analysis

Some charts were created to make sense of the data and communicate the insights from it.

Start-up by country

Distribution of start-ups by country (image)

Start-up by year

Distribution of start-ups by year (image)

Start-up by founder

image

Team size per company

image

Total Empowered by Y-Combinator

90373 people