

wwxr

wwxr was an experiment in crawling the Web for WebXR and XR content. It used crawl data from Common Crawl, processed with Elastic MapReduce (EMR) via cc-mrjob, to scrape the web for <a-frame> and <model-viewer> scenes. Crawled data was ingested into a MongoDB instance and made available via a simple Node.js search-and-browse interface.
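
For illustration, here is a rough sketch (in Python, using warcio and BeautifulSoup) of the kind of extraction that crawl step performs: scanning a WARC file for pages that embed A-Frame or <model-viewer> markup. This is not the repo's actual cc-mrjob code, and the specific tags matched (<a-scene>, <model-viewer>) and output fields are assumptions; in the real pipeline the equivalent check runs distributed on EMR.

```python
# Sketch of the extraction step: scan a Common Crawl WARC file for pages
# that embed A-Frame (<a-scene>) or <model-viewer> scenes. Illustrative only.
# Requires: pip install warcio beautifulsoup4
import json
import sys

from bs4 import BeautifulSoup
from warcio.archiveiterator import ArchiveIterator

XR_TAGS = ("a-scene", "model-viewer")  # assumed markers of XR content


def extract_xr_pages(warc_path):
    """Yield (url, title, tags_found) for each HTML response containing XR markup."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read()
            # Cheap byte-level pre-filter before paying for a full HTML parse.
            if not any(f"<{t}".encode() in body for t in XR_TAGS):
                continue
            soup = BeautifulSoup(body, "html.parser")
            found = [t for t in XR_TAGS if soup.find(t)]
            if found:
                title = soup.title.string.strip() if soup.title and soup.title.string else ""
                yield url, title, found


if __name__ == "__main__":
    for url, title, tags in extract_xr_pages(sys.argv[1]):
        print(json.dumps({"url": url, "title": title, "tags": tags}))
```
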

wwxr was useful as an experiment, and showed that central access to XR content from across the Web, with a search index of keywords, is valuable. However, the Common Crawl data source was too limited to be generally useful, since Common Crawl only captures a random sample of the Web with each crawl. Future projects ought to consider a live, ongoing crawl of the Web using something like Apache Nutch.

Ideally, WebXR content could also be published with metadata for easier crawling. See the discussion in immersive-web/proposals#73.

The community also suggested crawling for 3D models via the structured data (SERP markup) used by Google, the :xr-overlay CSS pseudo-class used by the WebXR spec, and JanusVR tags.
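
As a sketch of how those extra signals might be detected, the helper below checks a fetched page for schema.org "3DModel" structured data, the :xr-overlay pseudo-class in inline styles, and the FireBoxRoom element JanusVR pages typically embed. All three markers are assumptions rather than something wwxr implemented.

```python
# Sketch of detecting the community-suggested XR signals in a fetched page.
# The markers (schema.org "3DModel", ":xr-overlay", JanusVR's FireBoxRoom)
# are assumptions, not behaviour shipped by wwxr.
import json
import re

from bs4 import BeautifulSoup


def detect_xr_signals(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    signals = {"structured_3d_model": False, "xr_overlay_css": False, "janusvr": False}

    # 1. schema.org structured data declaring a 3D model (used for Google SERP features).
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue
        items = data if isinstance(data, list) else [data]
        if any(item.get("@type") == "3DModel" for item in items if isinstance(item, dict)):
            signals["structured_3d_model"] = True

    # 2. The :xr-overlay pseudo-class in inline stylesheets hints at WebXR DOM overlays.
    css = " ".join(style.get_text() for style in soup.find_all("style"))
    if re.search(r":xr-overlay\b", css):
        signals["xr_overlay_css"] = True

    # 3. JanusVR rooms are typically described by a FireBoxRoom element.
    if soup.find("fireboxroom") is not None:
        signals["janusvr"] = True

    return signals
```
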

See this post for more info: About wwxr

The seed/ directory in this repo contains a Docker container and scripts for downloading and crawling Common Crawl data.
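
For context, the snippet below is a rough sketch of the kind of download step such scripts perform: fetch the WARC path listing for a crawl from Common Crawl and pull down one segment. The crawl ID and file names are placeholders, not necessarily what seed/ actually uses.

```python
# Rough sketch of a Common Crawl download step: list WARC paths for a crawl
# and stream one segment to disk. Crawl ID below is a placeholder.
import gzip
import io

import requests

CRAWL_ID = "CC-MAIN-2023-50"  # placeholder crawl
BASE = "https://data.commoncrawl.org"


def list_warc_paths(crawl_id: str, limit: int = 5) -> list[str]:
    """Return the first few WARC file paths for a crawl."""
    url = f"{BASE}/crawl-data/{crawl_id}/warc.paths.gz"
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    with gzip.open(io.BytesIO(resp.content), "rt") as fh:
        return [line.strip() for _, line in zip(range(limit), fh)]


def download_segment(path: str, dest: str) -> None:
    """Stream a single WARC segment (roughly a gigabyte) to disk."""
    with requests.get(f"{BASE}/{path}", stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as out:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                out.write(chunk)


if __name__ == "__main__":
    paths = list_warc_paths(CRAWL_ID)
    download_segment(paths[0], "segment-0.warc.gz")
```
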

The repo also contains Terraform modules for provisioning a base AWS instance and ops scripts for spinning up the Node.js/Mongo site, though there's nothing particularly special about this part of it.
