Kuwala is a tool to build rich features for analytics based on clean data. It uses PySpark in combination with Parquet for data processing. Different data sources are connected in a Neo4j graph database allowing for fast and flexible feature generation.
There are basically 3 ways you can work with the clean data at the moment:
- Preprocessed Parquet files
- Neo4j graph database queries
- Jupyter notebooks with convenience functions
Points of interest are places that are physically accessible. This includes, for example, businesses, restaurants,
schools, tourist attractions and parks. A complete list of categories and further information can be found in our
OSM documentation. We take the daily
updated .pbf
files with the entire data on OSM from Geofabrik. We filter objects based on
tags that are irrelevant for POIs. We then further aggregate the tags to high-level categories allowing for easy query
building and extract other metadata such as the address, contact details, or the building footprint. By extracting and
cleaning the data from OpenStreetMap, Kuwala has one of the largest POI databases scalable to any place in the world.
Kuwala offers a scraper that retrieves all available metadata for POIs as seen on Google Search. You can verify the POIs from OSM and enrich them further with an hourly, standardized score for visitation frequency throughout the week. This helps to understand the flow of people throughout a city. We do not use the Google Maps API so there is no need for registration. Instead, the results are generated based on search strings which can be based on OpenStreetMap (OSM) data. For more information, please go to the complete documentation.
The high-resolution demographic data comes from Facebook's Data for Good initiative. It provides population estimates for the whole world at a granularity of roughly 30 x 30 meters for different demographic groups such as total, female, male or youth. It is a statistical model based on official census data combined with Facebook's data and satellite images. The demographic data represents the highest granularity and most up-to-date data of population estimates that is available.
H3 is a hierarchically ordered indexing method for geo-spatial data which represents the world in unique hexagons of different sizes (bins). H3 makes it possible to aggregate data fast on different levels and different forms. It is computationally efficient for databases and has applications in weighting data. One example might be weighting less granular data like income data with the high-resolution demographic data provided through Kuwala. H3 was developed by Uber. For the complete documentation please go to the H3 Repo
Installed version of Python3, Docker and docker-compose (Go here for instructions)
Note: We recommend giving Docker at least 8 GB of RAM (On Docker Desktop you can go under settings -> resources)
We have a notebook with which you can correlate any value associated with a geo-reference with the Google popularity score. In the demo we have a preprocessed graph and a test dataset with Uber rides in Lisbon, Portugal.
Launch Docker in the background and from inside the root directory run:
Linux/Mac:
cd kuwala/scripts && sh initialize_core_components.sh && sh run_cli.sh
and for Windows (Please use PowerShell or any Docker integrated terminal):
cd kuwala/scripts && sh initialize_windows.sh && cd windows && sh initialize_core_components.sh && sh run_cli.sh
To run the pipelines yourself, build the components first from inside the kuwala/scripts
directory (or if the computer uses Windows, go to kuwala/scripts/windows
) by executing the
initialize_all_components.sh
script and the starting the CLI by running the run_cli.sh
script. .
Apart from using the CLI, you can also run the pipelines individually without Docker. For more detailed instructions
please take a look at the ./kuwala/README.md
.
We currently have the following pipelines published:
osm-poi
: Global collection of point of interests (POIs)population-density
: Detailed population and demographic datagoogle-poi
: Scraping API to retrieve POI information from Google (incl. popularity score)
The best first step to get involved is to join the Kuwala Community on Slack. There we discuss everything related to data integration and new pipelines. Every pipeline will be open-source. We entirely decide, based on you, our community, which sources to integrate. You can reach out to us on Slack or email to request a new pipeline or contribute yourself.
If you want to contribute yourself, you can use your choice's programming language and database technology. We have the only requirement that it is possible to run the pipeline locally and use Uber's H3 functionality to handle geographical transformations. We will then take the responsibility to maintain your pipeline.
Note: To submit a pull request, please fork the project and then submit a PR to the base repo.
By working together as a community of data enthusiasts, we can create a network of seamlessly integratable pipelines. It is now causing headaches to integrate third-party data into applications. But together, we will make it straightforward to combine, merge and enrich data sources for powerful models.
Based on the use-cases we have discussed in the community and potential users, we have identified a variety of data sources to connect with next:
Already structured data but not adapted to the Kuwala framework:
- Google Trends - https://github.com/GeneralMills/pytrends
- Instascraper - https://github.com/chris-greening/instascrape
- GDELT - https://www.gdeltproject.org/
- Worldwide Administrative boundaries - https://index.okfn.org/dataset/boundaries/
- Worldwide scaled calendar events (e.g. bank holidays, school holidays) - https://github.com/commenthol/date-holidays
Unstructured data becomes structured data:
- Building Footprints from satellite images
Data we would like to integrate, but a scalable approach is still missing:
- Small scale events (e.g., a festival, movie premiere, nightclub events)