Support running OpenWPM crawls on Windows
motin opened this issue · 2 comments
motin commented
This is, I hope, a path to supporting Windows. There may be some limitations, but it is a first step.
ToDo:
- Once #648 is merged, there will be three dependencies that don't work on Windows:
  - leveldb
    - I have looked at the conda-forge recipe. It appears it should now be straightforward to add Windows support to it (alternatively, Windows users have to install it manually). Solved: conda-forge/leveldb-feedstock#12
  - plyvel
    - I believe we can switch out plyvel for python-leveldb with almost no fuss.
  - python-virtualdriver
    - This is for running xvfb, which won't work on Windows anyway, so we just need to figure out a package management solution / environment.yaml that accommodates both (most likely by making installation of python-xvfb a manual step, since installing xvfb is manual anyway; maybe moving to pip will work around this).
- Make some tweaks in deploy_firefox so we're not manually making paths by concatenating strings
- I also suggest tweaking deploy_firefox so that we let geckodriver set the profile path and then read it back. This will help with the goal of restoring stateful crawls and will make the Windows work easier.
- Find a replacement for the log interceptor, which uses mkfifo and is therefore Unix-only. This Stack Overflow thread has something that maybe we can drop in as a replacement. Alternatively, I used a different approach in faust-selenium and created something to constantly "tail" geckodriver.log (https://github.com/birdsarah/faust-selenium/blob/master/crawler/geckodriver_log_reader.py). Alternatively again, we just save geckodriver.log at the end and don't weave it into our logging. @englehardt - what is the motivation for interleaving the geckodriver logs?
  - A first step could be to skip geckodriver logs on the Windows platform. They're not crawl-essential as best as I can tell.
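To make the last few items concrete, here is a minimal sketch of the "tail the log file" idea, assuming geckodriver writes to a plain file that we can poll. It is not the faust-selenium reader or existing OpenWPM code: the names `geckodriver_log_path` and `LogTailer` are hypothetical, and so is the assumption that the log lives directly inside the profile directory. The path is built with `pathlib` rather than string concatenation, and no mkfifo is involved, so the same code should run on Windows, OSX, and Linux.

```python
import threading
import time
from pathlib import Path


def geckodriver_log_path(profile_dir):
    """Build the log path with pathlib instead of concatenating strings.

    Hypothetical helper: placing geckodriver.log inside the profile
    directory is an assumption made for this sketch.
    """
    return Path(profile_dir) / "geckodriver.log"


class LogTailer(threading.Thread):
    """Poll a log file and pass each complete new line to a callback."""

    def __init__(self, path, on_line, poll_interval=0.1):
        super().__init__(daemon=True)
        self.path = Path(path)
        self.on_line = on_line
        self.poll_interval = poll_interval
        self._stop_event = threading.Event()

    def stop(self):
        """Ask the thread to drain what is left in the file and exit."""
        self._stop_event.set()

    def run(self):
        # The file may not exist yet when the browser is still starting.
        while not self.path.exists():
            if self._stop_event.wait(self.poll_interval):
                return
        with open(self.path, "r", encoding="utf-8", errors="replace") as f:
            pending = ""  # holds a partial line until its newline arrives
            while True:
                chunk = f.readline()
                if chunk:
                    pending += chunk
                    if pending.endswith("\n"):
                        self.on_line(pending.rstrip("\r\n"))
                        pending = ""
                    continue
                # Nothing new to read: exit if asked to, otherwise wait.
                if self._stop_event.is_set():
                    break
                time.sleep(self.poll_interval)
            if pending:
                self.on_line(pending)
```

In use, the callback would forward each line to the crawl's own logger, for example `LogTailer(geckodriver_log_path(profile_dir), logger.info).start()`, with `stop()` and `join()` called at browser shutdown. The simpler first step of skipping the logs on Windows would amount to a `sys.platform == "win32"` check before starting the thread.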
Future (open issues):
- Add CircleCI tests and test on Win, OSX, and Linux (at least once per PR - or once a week).
birdsarah commented
An alternate version of OpenWPM was created as a proof of concept and has been used to run crawls on Windows. It uses basically the same OpenWPM instrumentation extension, but replaces the socket with a websocket and uses Kafka for orchestrating the crawl: https://github.com/birdsarah/faust-selenium
birdsarah commented
Moved to issue ToDos.