Crawl existing web history?
jarmitage opened this issue · 2 comments
This might be a can of worms, but at the least I'd be interested to know your [the author's] thoughts on whether it's feasible to scrape a user's web history and integrate that content into the search capabilities of this tool, and how one might approach it.
Thanks for making this plugin!
I've come up with a somewhat hackish and slightly technical way of doing this. Chrome/Opera only stores the past 3 months' worth of history, which is annoying, but that's what we have to work with. For me that's still a helluva lot of urls, so I had to come up with various ways of filtering it down to something more manageable. I don't really want to load every random website I visited in any case. So here's what I did. These instructions are for Linux, but I'm sure they would be similar on Mac too:
- Change Chrome's settings to not load any images to save bandwidth and memory. Also close/save any tabs you care about because we're going to load a lot of new tabs at once and you won't be able to rescue old ones.
- Close all windows of Chrome/Opera - you can't open the History file while the browser is running (it keeps the database locked).
- Install Sqliteman or a similar SQLite database viewer, plus sqlite3-pcre (a regex extension for SQLite).
- Open the History database which is located at ~/.config/google-chrome/Default/History (or something similar if you have several profiles) or ~/.config/opera/History
- Load the regex plugin into Sqliteman with SELECT load_extension('/usr/lib/sqlite3/pcre.so');
- Run a query like the following to build the list of websites you want:
```sql
SELECT urls.url
FROM urls
INNER JOIN visits ON urls.id = visits.url
WHERE urls.url NOT LIKE '%google.%'
  AND urls.url NOT LIKE '%facebook.com%'
  AND urls.url NOT LIKE '%youtube.com%'
  AND urls.url NOT LIKE '%localhost%'
  AND urls.url NOT LIKE '%127.0%'
  AND urls.url NOT LIKE '%192.168%'
  AND urls.url NOT LIKE '%zero%'
  AND urls.url NOT LIKE '%out.reddit.com%'
  AND urls.url NOT REGEXP '^https?:\/\/[\w\.]+[a-z\/]?$'
  AND (urls.title LIKE '%income%' OR urls.title LIKE '%climate%')
GROUP BY urls.url
ORDER BY SUM(visits.visit_duration) DESC;
```
This is just an example, but you can change it to suit your needs. I filtered out facebook, youtube, localhost, etc. because they wouldn't be interesting. Then I filtered out all urls that go to the homepage of a site, and finally I searched for the words "income" or "climate" in the page titles because I'm interested in basic income and climate change. (The parentheses around the title conditions matter: AND binds more tightly than OR in SQL, so without them every page with "climate" in its title would bypass the other filters.) Without those final filters I would get thousands of urls; with them, I only get about 200. Anyway, play with the filters a bit in Sqliteman to get a list of urls you want to archive, but make sure it isn't too long. Save the SQL code you used, including the load_extension line, to a file called interesting_sites.sql, then close Sqliteman.
- Open a terminal and run something like this:
```bash
cat interesting_sites.sql | sqlite3 ~/.config/opera-developer/History | while read -r line; do opera-developer --new-page "$line" & done
```
Replace opera-developer with google-chrome, etc. as appropriate (keep the quotes around "$line" - urls often contain characters like & and ? that the shell would otherwise interpret). This command gets the list of urls from sqlite, then loads each url in Chrome/Opera, and hopefully falcon will automatically index every site. It worked pretty well for me and only took a few seconds to load about 150 sites.
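As a variation, if you don't want to close the browser or use Sqliteman at all, the same query can be run from a small script against a copy of the History file (the live one is locked while Chrome/Opera runs; a copy is usually fine for read-only queries, though it isn't guaranteed to be perfectly consistent). Here's a minimal sketch in Node/TypeScript, assuming the better-sqlite3 package and the default Chrome profile path - a shortened version of the filters above, not a drop-in replacement:

```typescript
// Sketch: query a *copy* of Chrome's History database so the browser
// can stay open. Assumes: npm install better-sqlite3, default profile path.
import { copyFileSync } from "fs";
import Database from "better-sqlite3";

const src = `${process.env.HOME}/.config/google-chrome/Default/History`;
copyFileSync(src, "/tmp/History"); // the live file is locked while Chrome runs

const db = new Database("/tmp/History", { readonly: true });
const rows = db
  .prepare(
    `SELECT urls.url FROM urls
     INNER JOIN visits ON urls.id = visits.url
     WHERE urls.url NOT LIKE '%facebook.com%'
       AND (urls.title LIKE '%income%' OR urls.title LIKE '%climate%')
     GROUP BY urls.url
     ORDER BY SUM(visits.visit_duration) DESC`
  )
  .all() as { url: string }[];

for (const { url } of rows) console.log(url);
```

Its output is one url per line, so you can pipe it straight into the same while read loop as above.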
Hope that helps. I'll try to find a way to do better filtering of history but this is what I have so far!
Cheers,
Durand
hey @dldx @jarmitage
We forked the Falcon tool a while back and added import of the existing history and bookmarks, using the chrome.history and chrome.bookmarks APIs.
You can check it out here: https://github.com/WorldBrain/Research-Engine
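For a sense of what that kind of import involves, here is a minimal sketch using the chrome.history and chrome.bookmarks extension APIs. It assumes the "history" and "bookmarks" permissions in manifest.json (and @types/chrome for the TypeScript types); indexPage is a hypothetical stand-in for the extension's indexing function, not the actual Research-Engine code:

```typescript
// Hypothetical indexer hook - replace with the extension's real one.
declare function indexPage(url: string, title: string): void;

// Pull existing browsing history. An empty search text matches everything;
// startTime: 0 reaches back as far as the browser has kept history.
function importHistory(): void {
  chrome.history.search(
    { text: "", startTime: 0, maxResults: 100000 },
    (items) => {
      for (const item of items) {
        if (item.url) indexPage(item.url, item.title ?? "");
      }
    }
  );
}

// Pull all bookmarks by walking the bookmark tree; folders have no url,
// so only leaf nodes with a url get indexed.
function importBookmarks(): void {
  chrome.bookmarks.getTree((roots) => {
    const walk = (node: chrome.bookmarks.BookmarkTreeNode): void => {
      if (node.url) indexPage(node.url, node.title);
      node.children?.forEach(walk);
    };
    roots.forEach(walk);
  });
}
```

Unlike the SQLite approach above, this runs entirely inside the extension, needs no closed browser or external tools, and works the same on every OS.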
We are more than happy to collaborate on this in the future!
Best,
Oliver