Stairway to Travel offers personalized travel recommendations that help you shape unique itineraries.
Stairway to Travel has long been my dream project, with the aspiration of becoming a profitable business. Now I am donating my code to the community from which I have benefited so much while creating this website. I hope you will learn or benefit from what I did. Please feel free to reach out in case of questions or remarks!
Read the full story about why I am open sourcing everything on my blog.
This is the code repository containing backend-related services for Stairway to Travel. The repository's goal is to version control code related to:
- The backend web-service API hosted on Google Cloud Platform (GCP).
- Analysis and preparation of various data sources.
- One-off analyses, for example for marketing purposes.
The frontend code for Stairway to Travel's user interface can be found in the related repository named `stairwaytotravel-frontend`.
```
.
├── api                 # Flask API files for Google App Engine
├── data                # Helper folder for local file storage
├── documentation       # Diagrams and detailed documentation
├── notebooks           # Jupyter notebooks for demos and experimentation
├── scripts             # Scripts for ingesting large amounts of data
├── src                 # Core functionality for preparing data
├── tests               # Tests on core functions (but limited - WIP)
├── environment.yml     # Requirements for Conda environment
└── README.md
```
First, prepare the Conda virtual environment:

```bash
conda env create --file environment.yml
conda activate stairwaytotravel-backend
```
Then install the `stairway` package in the virtual environment in editable mode, so that any changes in your package are directly available in your notebook when using `%autoreload`:

```bash
pip install --editable .
```
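For example, in a Jupyter notebook you would enable autoreload like this (a minimal sketch; it assumes the package is imported as `stairway`, so adjust to the actual module name):

```python
# Enable autoreload so edits to the editable-installed package are
# picked up without restarting the kernel.
%load_ext autoreload
%autoreload 2

import stairway  # changes to the package's source are reflected on re-run
```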
Several scripts and functions require credentials to access third-party APIs like Flickr, Mailchimp or Google App Engine. These keys have not been uploaded to Git, so you will have to obtain your own keys for access to these tools.
Retrieval of the keys happens in one of two ways:

- either through a `.env` file using the `python-dotenv` package; or
- through saving the keys in a gitignored `credentials/` folder and reading them from the local file.
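As an illustration of the first approach, here is a minimal sketch using `python-dotenv`; the key name `FLICKR_API_KEY` is only an example, the actual variable names may differ:

```python
import os

from dotenv import load_dotenv

# Read key=value pairs from a .env file in the project root into the
# process environment.
load_dotenv()

flickr_key = os.getenv("FLICKR_API_KEY")  # example key name
if flickr_key is None:
    raise RuntimeError("FLICKR_API_KEY not set; add it to your .env file")
```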
The code for the web app can be found in the `api/` folder, which includes the `api/README.md` file with detailed instructions on how to run the API server.
Please find below a high-level architecture of the application. Read more about each component in the remainder of this section.
At the time of creation, I chose Python Flask as the web service framework. Nowadays, it might be wise to consider FastAPI as an alternative if you want to continue my work or start your own web service.
The app is fully deployed and run on Free Tier products of Google Cloud. This means that with limited usage, hosting the website and web service costs nothing.
App Engine is used for deploying the Flask app. App Engine is an easy-to-use serverless platform, meaning that the app scales automatically to meet traffic demand. I also considered Cloud Functions, but I found that App Engine is a bit more flexible in terms of customizing the application infrastructure.
The downside of a serverless offering is some request and response latency while your app's code is being loaded onto a newly created instance (a "cold start"). Although this can be avoided entirely by always keeping a machine up and running, that would incur costs. To stay within the Free Tier, I use Cloud Functions and Cloud Scheduler to regularly ping my App Engine instance so that the machine is kept alive. Pinging every 10 minutes turned out to be an ideal balance between possible latency and costs.
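Such a keep-alive function can be very small. The sketch below is an HTTP-triggered Cloud Function that Cloud Scheduler could invoke every 10 minutes; the URL and function name are placeholders, not the actual implementation:

```python
import requests

APP_URL = "https://YOUR_PROJECT_ID.appspot.com/"  # placeholder URL

def keep_alive(request):
    """HTTP-triggered Cloud Function that pings the App Engine app.

    Calling this every 10 minutes keeps at least one instance warm,
    avoiding cold-start latency for real visitors.
    """
    response = requests.get(APP_URL, timeout=10)
    return f"Pinged {APP_URL}: {response.status_code}"
```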
I initially used Google's NoSQL cloud database Cloud Firestore (also a Free Tier product) to fetch place information. However, as my dataset turned out to be limited in size, it is simply faster to upload the data to App Engine and serve recommendations directly from there, without connecting to a separate database. The Cloud Firestore component in the architecture diagram above is therefore no longer in use.
The `notebooks/api/google-firestore/` folder still contains several notebooks with examples of how to load and retrieve data with Cloud Firestore. I assess whether Firestore is fit for purpose in `querying-firestore.ipynb` and conclude that it is not suited for my use case.
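For reference, fetching a place document from Cloud Firestore looks roughly like the sketch below; the collection and document names are assumptions, see the notebooks for the actual code:

```python
from google.cloud import firestore

# Requires Google Cloud credentials, e.g. via GOOGLE_APPLICATION_CREDENTIALS.
db = firestore.Client()

# Assumed collection/document names for illustration only.
doc = db.collection("places").document("some-place-id").get()
if doc.exists:
    print(doc.to_dict())
```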
Mailchimp is used for email automation whenever users sign up for newsletters or when they check out with their bucket list of places they want to visit.
Details on the designed architecture and workflow automation can be found in the `documentation/mailchimp/` folder.
The second function of this repository is to prepare data for use in the recommendation service. Over time, I have investigated many different datasets, often in Jupyter notebooks. Hence, the code for data preparation is a bit messier than that of the API and is spread over the `scripts/`, `notebooks/` and `src/` folders.
A diagram with a detailed approach on how to clean and combine data can be found in the `documentation/data-processing/` folder. In general, data prep follows the phases below, with intermediate data stored locally in the `data/` folder (see the path-helper sketch after this list):
- Raw: a copy from the source as-is in its original format
- Clean: transformed data in an easy-to-handle CSV format
- Processed: feature extraction on the cleaned data
- Enriched: cleaned datasets are combined into their final shape
- API Data: data for the API is copied into the `api/data/` folder so that it is uploaded to Google App Engine when deploying the Flask app.
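To make the phase convention concrete, a hypothetical helper that maps a phase and dataset name to a location under `data/` could look as follows; the folder names simply mirror the list above and are not taken from the actual code:

```python
from pathlib import Path

DATA_DIR = Path("data")
PHASES = ("raw", "clean", "processed", "enriched")

def phase_path(phase: str, dataset: str) -> Path:
    """Return the local path of a dataset in a given preparation phase.

    Example: phase_path("clean", "wikivoyage.csv") -> data/clean/wikivoyage.csv
    """
    if phase not in PHASES:
        raise ValueError(f"Unknown phase: {phase!r}")
    return DATA_DIR / phase / dataset
```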
To get your own copy of raw data, follow the instructions below:
| Source | Data type | How to get it |
|---|---|---|
| Wikivoyage | Place info | Download the latest `.xml.bz2` files here |
| Wikivoyage | Page info | Public API, run script `wikivoyage_page_info.py` |
| Wikivoyage | Place activities | Feature extraction with BM25, see `features-bm25.ipynb` |
| University of Delaware | Weather | Download `.nc` files here |
| Visual Crossing | Weather | Paid API, run script `visualcrossing_monthly_weather_threaded.py` |
| Flickr | Place images | Private API, run script `flickr_image_list` |
| Flickr | People info | Private API, run script `flickr_people_list` |
| Geonames | Place info | Download `.zip` files here |
With the above instructions you should be able to replicate all data sets. The final data used in the API service is the only data that I checked in; see the `api/data/` folder.
On occasion I did a one-off analysis that isn't quite related to data prep or the backend API service. For example, for marketing purposes, I retrieved the top 5 Flickr images per place and automatically formatted them into the standard square Instagram format with text and a logo. See the result on Stairway to Travel's Instagram profile. I also made a rotating globe depicting which places I had already covered on Instagram.
Code for these analyses can be found partly in `notebooks/one-off-analyses/`.
When I was still actively working on this project, I kept a huge list of tasks for new and improved functionality in Trello. In case you are curious or are considering continuing this project, feel free to have a look at the frontend repository's issues. I labelled tasks that require backend work with a 'backend' label.