The project aims to demonstrate end-to-end data engineering skills by performing ETL tasks and analyses on the companies listed by Y Combinator (https://ycombinator.com/companies). Its core concept is to help beginners optimize data pipelines. The project doc covers three different approaches:
- An approach that made the extraction process run for 3 hours
- An approach that ran for 12 mins
- An approach that ran for 1.06 mins
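To put the three runtimes in perspective, the short sketch below converts them to minutes and computes how much faster each approach is than the slowest one. The labels `approach 1` to `approach 3` are placeholders introduced here for illustration; the runtimes are the ones listed above.

```python
# Extraction runtimes reported above, converted to minutes.
# The "approach N" labels are placeholders, not names from the project doc.
runtimes_min = {
    "approach 1": 3 * 60,  # 3 hours
    "approach 2": 12.0,    # 12 mins
    "approach 3": 1.06,    # 1.06 mins
}

baseline = runtimes_min["approach 1"]


def speedup(slow: float, fast: float) -> float:
    """Return how many times faster `fast` is than `slow`."""
    return slow / fast


speedups = {
    name: round(speedup(baseline, minutes), 1)
    for name, minutes in runtimes_min.items()
}

for name, factor in speedups.items():
    print(f"{name}: {runtimes_min[name]:g} min, {factor:g}x vs approach 1")
```

Going from 3 hours to 12 mins is a 15x improvement, and going from 3 hours to 1.06 mins is roughly 170x.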
This is the architecture diagram for the simple end-to-end ETL pipeline.
click here for full analysis details
From the analyzed data, here are the insights:
- The top 5 countries under Y Combinator are:
  - USA
  - India
  - Canada
  - UK
  - Nigeria
- Companies from the USA make up 65.3% of Y Combinator start-ups
- Nigeria is the only African country that has more than 10 start-ups under Y Combinator
- Companies with a single founder make up the highest percentage under YC, 42.4%, while 38.6% and 13.9% are for 2 and 3 founders respectively. The remaining percentage is shared among companies with 4 and 5 founders.
- Airbnb is the largest company under YC in terms of employees
- The total number of people YC has empowered is 90373
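As a rough illustration of how the founder-count percentages can be derived, the sketch below counts companies by number of founders and converts the counts to percentages. The function name `founder_count_shares` and the toy data are hypothetical and are not taken from the project code; the last lines only use the percentages reported above to work out the share left for 4 and 5 founders.

```python
from collections import Counter


def founder_count_shares(founder_lists):
    """Return {number_of_founders: percentage of companies}, rounded to 1 decimal."""
    counts = Counter(len(founders) for founders in founder_lists)
    total = sum(counts.values())
    return {n: round(100 * c / total, 1) for n, c in sorted(counts.items())}


# Hypothetical toy data, NOT the real dataset: one list of founder names per company.
sample = [["A"], ["B"], ["C", "D"], ["E", "F"], ["G", "H", "I"]]
shares = founder_count_shares(sample)
print(shares)  # {1: 40.0, 2: 40.0, 3: 20.0}

# Share left for companies with 4 and 5 founders, from the figures reported above.
reported = {1: 42.4, 2: 38.6, 3: 13.9}
remainder = round(100 - sum(reported.values()), 1)
print(remainder)  # 5.1
```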
The image below indicates the information to be scraped for analysis.
- company_name (company's summary and tags)
- short_description (company's summary and tags)
- tags (company's summary and tags)
- link (company's link)
- company_socials (company's info)
- founded (company's info)
- team_size (company's info)
- location (company's info)
- active_founders (Founder's description)
- about_founder (Founder's info)
- description (Company's description)
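The fields above can be modelled as a single record type. The sketch below is one possible shape, not the project's actual code: the class name `CompanyRecord`, the types, and the defaults are assumptions inferred from the Airbnb sample row in this document.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CompanyRecord:
    """One scraped company. Field names follow the list above; types are assumed."""

    company_name: str
    short_description: str = ""
    tags: List[str] = field(default_factory=list)
    link: str = ""
    company_socials: List[str] = field(default_factory=list)
    founded: Optional[int] = None
    team_size: Optional[int] = None
    location: str = ""
    active_founders: List[str] = field(default_factory=list)
    about_founder: List[Dict[str, object]] = field(default_factory=list)
    description: str = ""


# Values taken from the Airbnb sample row; omitted fields fall back to defaults.
record = CompanyRecord(
    company_name="Airbnb",
    short_description="Book accommodations around the world.",
    tags=["W09", "Public", "Marketplace", "Travel"],
    link="http://airbnb.com",
    founded=2008,
    team_size=5000,
    location="San Francisco",
    active_founders=["Nathan Blecharczyk", "Brian Chesky", "Joe Gebbia"],
)
print(record.company_name, record.team_size, len(record.active_founders))
```

Giving every field except `company_name` a default lets a scraper build a record even when part of a page is missing.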
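If the code breaks, the most likely cause is a change in the page's markup. First verify that the HTML tags referenced in the code still exist on the page. If they do not, update the tags in the code to match the current page.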
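One way to run that check before a full scrape is to test whether the tags the scraper relies on still appear in the page. The sketch below uses only the standard library's `html.parser` rather than BeautifulSoup, so it is a stand-in for the project's own code; the markup, the tag and class names, and the helper `missing_selectors` are all hypothetical.

```python
from html.parser import HTMLParser


class TagClassCollector(HTMLParser):
    """Collect every (tag, css_class) pair that appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.seen = set()

    def handle_starttag(self, tag, attrs):
        # A tag without a class attribute is recorded as (tag, None).
        classes = (dict(attrs).get("class") or "").split()
        self.seen.add((tag, None))
        for css_class in classes:
            self.seen.add((tag, css_class))


def missing_selectors(html, expected):
    """Return the expected (tag, css_class) pairs that do not occur in `html`."""
    collector = TagClassCollector()
    collector.feed(html)
    return [pair for pair in expected if pair not in collector.seen]


# Hypothetical markup and selectors, for illustration only.
page = '<div class="company-card"><h1 class="title">Airbnb</h1></div>'
expected = [("h1", "title"), ("span", "team-size")]
print(missing_selectors(page, expected))  # [('span', 'team-size')]
```

Any pair returned by `missing_selectors` points at a tag in the scraper that needs updating.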
company_name | link | short_description | tags | company_socials | founded | team_size | location | active_founders | about_founders | description |
---|---|---|---|---|---|---|---|---|---|---|
Airbnb | http://airbnb.com | Book accommodations around the world. | ['W09', 'Public', 'Marketplace', 'Travel'] | ['https://www.linkedin.com/company/airbnb/', 'https://twitter.com/Airbnb', 'https://www.facebook.com/airbnb/', 'https://www.crunchbase.com/organization/airbnb'] | 2008 | 5000 | San Francisco | ['Nathan Blecharczyk', 'Brian Chesky', 'Joe Gebbia'] | [{'name': 'Joe Gebbia, CPO', 'role': 'CPO', 'social_media_links': ['https://twitter.com/jgebbia']}, {'name': 'Joe Gebbia, CPO', 'role': 'CPO', 'social_media_links': ['https://twitter.com/jgebbia']}, {'name': 'Joe Gebbia, CPO', 'role': 'CPO', 'social_media_links': ['https://twitter.com/jgebbia']}] | Founded in August of 2008 and based in San Francisco, California, Airbnb is a .. \n |
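In the sample row above, columns such as tags, company_socials, and active_founders hold Python-style list literals. If the data is saved to a flat file and read back, those cells arrive as plain strings (an assumption about how the output is stored). A minimal sketch for turning them back into lists, using a hypothetical helper `parse_list_cell`:

```python
import ast


def parse_list_cell(cell):
    """Turn a cell such as "['W09', 'Public']" back into a Python list.

    Returns an empty list when the cell is empty or is not a list literal.
    """
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []
    return value if isinstance(value, list) else []


tags = parse_list_cell("['W09', 'Public', 'Marketplace', 'Travel']")
founders = parse_list_cell("['Nathan Blecharczyk', 'Brian Chesky', 'Joe Gebbia']")
print(tags[0], len(founders))  # W09 3
print(parse_list_cell(""))     # []
```

`ast.literal_eval` only evaluates literals, so it is safer than `eval` for scraped text.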
- Selenium - (for dynamic scraping)
- BeautifulSoup - (for static scraping)
- Pandas - (for data cleaning)
- Matplotlib
- Seaborn
- S3
Some charts were created to make sense of the data and communicate the insights from it.
Distribution of start-ups by country
Distribution of start-ups by year
Total number of people YC has empowered: 90373 people