utmgdsc/ActNow

For scalability: Event scraping method to implement

Closed this issue · 2 comments

Feature purpose
This issue concerns how the Eventbrite scraper should scrape events and display them based on individual users' locations, with an eye toward future scalability.

Solution 1
The Eventbrite scraper scrapes events in major cities and stores them all in Firestore. Based on the user's location, all events in the user's city are extracted from the respective Firestore collection and displayed on the application map.

Pros and Cons:
+ Efficient total event extraction from Firestore, since all user-created events and scraped events would be stored in the same collection in the database.
+ The number of API calls per period of time would be constant, as it is not reliant on the number of active users (e.g., calling the API to scrape events in Toronto, Mississauga, and Scarborough twice every day, at 7 AM and 7 PM).
- Would require manually adding the names of cities/regions to the list of locations to scrape on Eventbrite. Hence, more manual coding would be needed to add support for additional regions.
- More Firebase storage would be required, as all scraped events need to be stored in the database. Storage would grow roughly linearly with the number of locations to extract events from. For example, scraping events from all cities in Ontario alone would require roughly 17.33x the storage compared to only scraping events in Toronto, Mississauga, and Scarborough (Ontario has 52 cities in total). This would balloon further if we want to scale the application to cover all of Canada or the whole world.
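The storage estimate above is simple proportional scaling; a minimal sketch of the arithmetic (the per-city storage figure is a placeholder, not a measured value):

```python
# Back-of-the-envelope storage scaling for Solution 1.
# Assumption: Firestore usage grows roughly linearly with the number of
# cities scraped, and each city contributes a similar amount of data.

BASE_CITIES = 3        # Toronto, Mississauga, Scarborough
ONTARIO_CITIES = 52    # total cities in Ontario

storage_multiplier = ONTARIO_CITIES / BASE_CITIES
print(f"{storage_multiplier:.2f}x")  # ≈ 17.33x the baseline storage
```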

Solution 2
The scraper scrapes events from Eventbrite based on the user's location. In this case, the scraper API would be called every time the user launches the application.

Pros and Cons:
+ Scraped events do not need to be stored in Firestore
+ The application can be used by users all over the world from the get-go
- Displaying Eventbrite events near the user would take significantly longer, since the events first need to be scraped from Eventbrite based on the user's current location
- The number of API calls per period of time would increase linearly with the number of active users, which would require better API call support so that the application does not break due to excess API calls.
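The constant-vs-linear API-call trade-off between the two solutions can be sketched with hypothetical numbers (city count, scrape frequency, and user counts here are illustrative, not measurements):

```python
# Hypothetical daily API-call counts for each solution.

def solution1_calls_per_day(cities: int, scrapes_per_day: int = 2) -> int:
    # Constant in the number of users: one call per city per scheduled scrape.
    return cities * scrapes_per_day

def solution2_calls_per_day(daily_active_users: int, launches_per_user: int = 1) -> int:
    # Grows linearly with active users: one call per app launch.
    return daily_active_users * launches_per_user

print(solution1_calls_per_day(3))        # 6 calls/day regardless of user count
print(solution2_calls_per_day(10_000))   # 10000 calls/day at 10k daily users
```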

Additional context
The solution we choose to implement would impact what areas we would have to invest in when focusing on scalability.

In order to increase the scale of the application we would have to invest in both increased Firebase storage and better API call support. However, each solution would require a different proportion of investment in both fields:

  • For Solution 1 we would need to invest more in increasing Firebase storage.
  • For Solution 2 we would need to invest more in better API call support.

These are all great points. For the final solution, we will implement something in the middle.

Initially, we will scrape some events from major cities (Toronto, Mississauga, Hamilton, etc.) and store them in Firestore. For each city's collection, we will also log the scraped time; this value will be used later to determine whether the events in a city need to be re-scraped.
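A minimal sketch of this initialization step, using an in-memory dict in place of Firestore; the field names (`events`, `scraped_time`) and the `scrape_city` helper are hypothetical, not the actual schema or scraper:

```python
from datetime import datetime, timezone

MAJOR_CITIES = ["Toronto", "Mississauga", "Hamilton"]

def scrape_city(city: str) -> list[dict]:
    # Placeholder for the Eventbrite scraper; returns a list of event dicts.
    return [{"name": f"Sample event in {city}", "city": city}]

def initialize_store(store: dict) -> None:
    # Each city record keeps its events alongside the time it was scraped,
    # which is later used to decide when a re-scrape is due.
    for city in MAJOR_CITIES:
        store[city] = {
            "events": scrape_city(city),
            "scraped_time": datetime.now(timezone.utc),
        }

store: dict = {}
initialize_store(store)
print(sorted(store))  # ['Hamilton', 'Mississauga', 'Toronto']
```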

After the initialization, there will be two ways the scraper can be triggered:

  1. Our Google Cloud Scheduler job will periodically call the scraper (say, once a day) to trigger a Cloud Function that re-scrapes the top 10 cities (this number can be tuned through experimentation) with the oldest scraped time.
  2. In the case where a user is in a city that has not been scraped yet, the frontend will call the scraper with the city as a parameter and scrape the events for that city. The frontend can then fetch those events from Firestore.
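The "oldest scraped time" selection in trigger 1 can be sketched as follows, again with an in-memory list standing in for Firestore records and an illustrative `scraped_time` field name:

```python
from datetime import datetime, timedelta, timezone

def pick_stale_cities(city_records: list[dict], top_n: int = 10) -> list[str]:
    # Sort ascending by scraped_time so the longest-unscraped cities come first.
    ordered = sorted(city_records, key=lambda rec: rec["scraped_time"])
    return [rec["city"] for rec in ordered[:top_n]]

now = datetime.now(timezone.utc)
records = [
    {"city": "Toronto", "scraped_time": now - timedelta(hours=3)},
    {"city": "Hamilton", "scraped_time": now - timedelta(days=2)},
    {"city": "Mississauga", "scraped_time": now - timedelta(hours=12)},
]
print(pick_stale_cities(records, top_n=2))  # ['Hamilton', 'Mississauga']
```

In the real Cloud Function this selection would likely map onto a Firestore query ordered by the scraped-time field with a limit, rather than an in-memory sort.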

Moved to #136