
Repository for the Mozilla Overscripted Data Mining Challenge


Overscripted Web: Data Analysis in the Open

The Systems Research Group (SRG) at Mozilla has created and open sourced a data set of publicly available information that was collected by a November 2017 web crawl. We want to empower the community to explore the unseen or otherwise non-obvious series of JavaScript execution events that are triggered once a user visits a webpage, and all the first- and third-party events that are set in motion when people retrieve content. Some preliminary insights already uncovered from this data are illustrated in this blog post. Ongoing analyses can be tracked here.

The crawl data hosted here was collected using OpenWPM, which is developed and maintained by the Mozilla Security Engineering team.

Submitting an analysis:

  • Analyses should be performed in Python using the Jupyter notebook format and must execute in this environment.
  • Analyses can be submitted by filing a Pull Request against this repository with the analysis formatted as an *.ipynb file or folder in the /analyses/ folder.
  • Set-up instructions are provided here: https://github.com/mozilla/overscripted/blob/master/analyses/README.md
  • Notebooks must be well documented and run on the environment described. If additional installations are needed, these should be documented.
  • Files and folders should have the format yyyy_mm_username__short-title - the analyses directory contains examples already if this is not clear.
  • PRs altering or updating an existing analysis will not be accepted unless they fix formatting or small errors needed for that notebook to run. If you wish to continue or build on someone else's existing analysis, start your own analysis folder / file, cite their work, and then proceed with your extension.

Accessing the Data

Each of the links below points to a bzip2-compressed portion of the total dataset.

A small sample of the data is available in safe_dataset.sample.tar.bz2 to get a feel for the content without committing to the full download.
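If you are unsure how to unpack the archives, the sample can be extracted with Python's standard library. This is a minimal sketch that assumes the archive has already been downloaded into your working directory with its original name:

```python
# Minimal sketch: extract the sample archive (assumes it was downloaded
# to the current working directory with its original name).
import tarfile

with tarfile.open("safe_dataset.sample.tar.bz2", "r:bz2") as archive:
    archive.extractall("safe_dataset_sample")
```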

Because the full dataset is very large, three samples that are large enough for meaningful analysis are also available. More details about the samples are available in data_prep/Sample Review.ipynb

The full dataset. Unzipped, the full Parquet data is approximately 70 GB. Each (compressed) chunk is around 9 GB. SHA256SUMS contains the checksums for all datasets, including the sample.
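To confirm a download is intact, you can compare its digest to the corresponding entry in SHA256SUMS. A minimal sketch follows; the file name is only an example, substitute whichever chunk you downloaded:

```python
# Minimal sketch: compute a file's SHA-256 digest to compare against SHA256SUMS.
import hashlib

def sha256_of(path, block_size=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

print(sha256_of("safe_dataset.sample.tar.bz2"))  # compare with SHA256SUMS
```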

Refer to hello_world.ipynb for examples of loading and taking a quick look at the data with pandas, Dask, and Spark.
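As a rough illustration of what that notebook covers, the unpacked Parquet data can be opened with Dask roughly like this (the directory name is an assumption; point it at wherever you extracted the files):

```python
# Minimal sketch: open the unpacked parquet data with Dask and peek at it
# with pandas. The directory name is an assumption.
import dask.dataframe as dd

df = dd.read_parquet("safe_dataset_sample/")
print(df.columns)        # inspect the available columns

sample = df.head(100)    # .head() returns an ordinary pandas DataFrame
print(sample.shape)
```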

New Contributor Tips

  • Make contributions based on anything you learn, whether from reading related research papers or from interacting with the community on gitter, by submitting a Pull Request (PR) to the repository. You can submit the PR against the README on the main page or the analyses folder README.

  • This is not a one-issue-per-person repo. All the questions are very open ended, and different people may find very different and complementary things when looking at a question.

  • Use a reaction emoji to acknowledge a comment rather than replying with something like "sure". This keeps threads clean while still letting the commenter know their comment was seen.

  • You can ask for help and discuss your ideas on gitter. Click here to join!

  • When you open an issue and work on a Pull Request relating to it, add "WIP" (work in progress) to the title of the PR. When your PR is ready for review, remove the WIP tag. You can also request feedback on specific things while it is still a WIP.

  • Please reference your issues on a PR so that they link and autoclose. Refer to this.

  • If your OS is Ubuntu and you have trouble installing Spark with conda, refer to this link.

  • The dataset is very large. Even the subsets of the dataset are unlikely to fit into memory. Working with this dataset will typically require using Dask (http://dask.pydata.org/), Spark (http://spark.apache.org/) or similar tools to enable parallelized / out-of-core / distributed processing.
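For example, a simple aggregation can be expressed lazily with Dask so that only one partition needs to be in memory at a time. The column name "symbol" below is an assumption about the crawl schema; verify it against df.columns on the data you actually load:

```python
# Minimal sketch: a lazy, out-of-core aggregation with Dask. The "symbol"
# column name is an assumption - check df.columns on the real data.
import dask.dataframe as dd

df = dd.read_parquet("safe_dataset_sample/")
top_symbols = df["symbol"].value_counts().nlargest(20)

# Nothing has run yet: Dask builds a task graph and only executes it,
# partition by partition, when .compute() is called.
print(top_symbols.compute())
```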

Glossary

  • Fingerprinting is the process of creating a unique identifier based on characteristics of your hardware, operating system, and browser.
  • TLD means Top-level Domain. You can read up more about it here.
  • User Agent (UA) is a string that helps identify which browser is being used, its version, and the operating system it is running on.
  • A Web Crawler is a program or automated script that browses the World Wide Web in a methodical, automated manner.

Resources

  • Please refer to the reading list for additional references and information.

  • This is a great tutorial to learn Pandas.

  • Tutorial on Jupyter Notebook.

  • We have used dask in some of our Jupyter notebooks. Dask gives you a pandas-like API but lets you work on data that is too big to fit in memory. Dask can be used on a single machine or a cluster. Most analyses done for this project were done on a single machine. Please start by reviewing the docs to learn more about it.

  • This will help you get started with Git. For visual thinkers, this tutorial can be a good start.

  • Apache Spark is an open source parallel processing framework for running large-scale data analytics applications across clustered computers. It can handle both batch and real-time analytics and data processing workloads. We use findspark to set up Spark. You can learn more about it here.
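A minimal sketch of that setup from inside a notebook, assuming Spark is installed locally and the sample data has been unpacked to safe_dataset_sample/ (both the path and the app name are assumptions):

```python
# Minimal sketch: bootstrap Spark in a notebook with findspark and read the
# parquet data. Paths and app name are assumptions.
import findspark
findspark.init()  # locates the local Spark installation

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("overscripted-example").getOrCreate()
df = spark.read.parquet("safe_dataset_sample/")
df.printSchema()
print(df.count())
```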