Crawl existing web history?
jarmitage opened this issue · 2 comments
This might be a can of worms, but at the least I'd be interested to know your [the author's] thoughts on whether it's feasible to scrape a user's web history and integrate that content into the search capabilities of this tool, and how one might approach it.
Thanks for making this plugin!
I've come up with a somewhat hackish and slightly technical way of doing this. Chrome/Opera only stores the past 3 months' worth of history, which is annoying, but that's what we have to work with. For me that's still a helluva lot of urls, so I had to come up with various ways of filtering it down to something more manageable. I don't really want to load every random website I visited in any case. So here's what I did. These instructions are for Linux, but I'm sure they would be similar on Mac too:
- Change Chrome's settings to not load any images to save bandwidth and memory. Also close/save any tabs you care about because we're going to load a lot of new tabs at once and you won't be able to rescue old ones.
- Close all windows of Chrome/Opera - you can't open the History file while the browser is running (it keeps the database locked).
- Install Sqliteman or a similar SQLite database viewer, plus sqlite3-pcre (a regex extension for SQLite).
- Open the History database which is located at ~/.config/google-chrome/Default/History (or something similar if you have several profiles) or ~/.config/opera/History
- Load the regex plugin into Sqliteman with SELECT load_extension('/usr/lib/sqlite3/pcre.so');
- Run a query like the following to build the list of websites you want:
```sql
SELECT urls.url
FROM urls
INNER JOIN visits ON urls.id = visits.url
WHERE urls.url NOT LIKE '%google.%'
  AND urls.url NOT LIKE '%facebook.com%'
  AND urls.url NOT LIKE '%youtube.com%'
  AND urls.url NOT LIKE '%localhost%'
  AND urls.url NOT LIKE '%127.0%'
  AND urls.url NOT LIKE '%192.168%'
  AND urls.url NOT LIKE '%zero%'
  AND urls.url NOT LIKE '%out.reddit.com%'
  AND urls.url NOT REGEXP '^https?:\/\/[\w\.]+[a-z\/]?$'
  AND (urls.title LIKE '%income%' OR urls.title LIKE '%climate%')
GROUP BY urls.url
ORDER BY SUM(visits.visit_duration) DESC;
```
This is just an example, but you can change it to suit your needs. I filtered out facebook, youtube, localhost, etc. because they wouldn't be interesting. Then I filtered out all urls that go to the homepage of a site, and finally I searched for the words "income" or "climate" in the page titles because I'm interested in basic income and climate change. (The parentheses around the title conditions matter: AND binds more tightly than OR in SQL, so without them every page with "climate" in its title would bypass the other filters.) Without those final filters I would get thousands of urls; with them, I only get about 200. Anyway, play with the filters a bit in Sqliteman to get a list of urls you want to archive, but make sure it isn't too long. Save the SQL code you used, including the load_extension line, to a file called interesting_sites.sql, then close Sqliteman.
- Open a terminal and run something like this:
```bash
cat interesting_sites.sql | sqlite3 ~/.config/opera-developer/History | while read -r line; do opera-developer --new-page "$line" & done
```
Replace opera-developer with google-chrome, etc. as appropriate (keep the quotes around "$line" - urls often contain characters like & and ? that the shell would otherwise interpret). This command gets the list of urls from sqlite, then loads each url in Chrome/Opera, and hopefully falcon will automatically index every site. It worked pretty well for me and only took a few seconds to load about 150 sites.
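As a variation, if you don't want to close the browser or use Sqliteman at all, the same query can be run from a small script against a copy of the History file (the live one is locked while Chrome/Opera runs; a copy is usually fine for read-only queries, though it isn't guaranteed to be perfectly consistent). Here's a minimal sketch in Node/TypeScript, assuming the better-sqlite3 package and the default Chrome profile path - a shortened version of the filters above, not a drop-in replacement:

```typescript
// Sketch: query a *copy* of Chrome's History database so the browser
// can stay open. Assumes: npm install better-sqlite3, default profile path.
import { copyFileSync } from "fs";
import Database from "better-sqlite3";

const src = `${process.env.HOME}/.config/google-chrome/Default/History`;
copyFileSync(src, "/tmp/History"); // the live file is locked while Chrome runs

const db = new Database("/tmp/History", { readonly: true });
const rows = db
  .prepare(
    `SELECT urls.url FROM urls
     INNER JOIN visits ON urls.id = visits.url
     WHERE urls.url NOT LIKE '%facebook.com%'
       AND (urls.title LIKE '%income%' OR urls.title LIKE '%climate%')
     GROUP BY urls.url
     ORDER BY SUM(visits.visit_duration) DESC`
  )
  .all() as { url: string }[];

for (const { url } of rows) console.log(url);
```

Its output is one url per line, so you can pipe it straight into the same while read loop as above.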
Hope that helps. I'll try to find a way to do better filtering of history but this is what I have so far!
Cheers,
Durand
hey @dldx @jarmitage
We forked the Falcon tool a while back and added import of the existing history and bookmarks, using the chrome.history and chrome.bookmarks APIs.
You can check it out here: https://github.com/WorldBrain/Research-Engine
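For a sense of what that kind of import involves, here is a minimal sketch using the chrome.history and chrome.bookmarks extension APIs. It assumes the "history" and "bookmarks" permissions in manifest.json (and @types/chrome for the TypeScript types); indexPage is a hypothetical stand-in for the extension's indexing function, not the actual Research-Engine code:

```typescript
// Hypothetical indexer hook - replace with the extension's real one.
declare function indexPage(url: string, title: string): void;

// Pull existing browsing history. An empty search text matches everything;
// startTime: 0 reaches back as far as the browser has kept history.
function importHistory(): void {
  chrome.history.search(
    { text: "", startTime: 0, maxResults: 100000 },
    (items) => {
      for (const item of items) {
        if (item.url) indexPage(item.url, item.title ?? "");
      }
    }
  );
}

// Pull all bookmarks by walking the bookmark tree; folders have no url,
// so only leaf nodes with a url get indexed.
function importBookmarks(): void {
  chrome.bookmarks.getTree((roots) => {
    const walk = (node: chrome.bookmarks.BookmarkTreeNode): void => {
      if (node.url) indexPage(node.url, node.title);
      node.children?.forEach(walk);
    };
    roots.forEach(walk);
  });
}
```

Unlike the SQLite approach above, this runs entirely inside the extension, needs no closed browser or external tools, and works the same on every OS.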
We are more than happy to collaborate on this in the future!
Best,
Oliver