pse: A Python repository from wuben3125

Personal Search Engine (Python 3 port)

Combined Bookmarks and external search

What is this?

Aren't you frustrated by having a boatload of quality bookmarks but never using them, because it is faster to just fire up a browser and do a Google search instead? Yeah, me too!

You no longer need to do that. Enter the Personal search engine (PSE), which you can use to index your bookmarks and search like you do with Google.

But wait, there is more: when you issue a search query, PSE also runs a Google search in the background (or another search, if you implement it :)) and displays both sets of results.

The code works but is still at the alpha stage. When it reaches beta, I will write an article on http://www.igrok.site about how it works. Below is a quick recipe for installing and using it.

INSTALLATION AND RUNNING

Clone the PSE repository

> git clone https://github.com/vsraptor/pse.git
> cd pse

Dependencies

You probably have these installed already, but I list them here for completeness. In most cases you can skip this section.

Dependencies :

> apt-get install build-essential
> apt-get install python-dev python-numpy python-scipy libatlas-dev libatlas3-base

Installation

You need to install scikit-learn (for TF-IDF support) and Flask (for the web app), along with a few supporting packages:

> pip install lxml
> pip install numpy
> pip install requests
> pip install stop_words
> pip install scikit-learn
> pip install flask
> pip install flask-script
> pip install flask-bootstrap

Create the url.lst file

Next, either create the url.lst file manually in the data directory or generate one using bin/bm2urlst.py. url.lst is simply a list of URLs. (This repo contains one just for tests, but it is better to generate your own once you have the app running. You can also leave empty lines, or comment out URLs with a hash so they are not included in the index.)
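To make the url.lst format concrete, here is a minimal sketch of how such a file could be read. It is not the code PSE itself uses: the function name read_url_list is hypothetical, and it assumes a comment is any line whose first non-blank character is a hash (a hash later in the line is treated as part of the URL).

```python
def read_url_list(lines):
    """Return the URLs from an iterable of lines, skipping blanks and '#' comments."""
    urls = []
    for raw in lines:
        line = raw.strip()
        # Assumption: only a leading '#' marks a comment, so URL fragments survive.
        if not line or line.startswith("#"):
            continue
        urls.append(line)
    return urls


# Hypothetical sample content standing in for data/url.lst.
sample = [
    "https://example.org/kept#section\n",
    "\n",
    "# https://example.com/skipped\n",
    "https://example.net/also-kept\n",
]
print(read_url_list(sample))
```

With a real file you would pass the open file object instead of the sample list, for example read_url_list(open("../data/url.lst")).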

Create the index

Now run the indexer to create the TF-IDF index matrices. It goes through the list of URLs, fetches the pages, and builds the index that you will later use for searches.

> cd bin
> python idx.py
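For readers unfamiliar with TF-IDF, the sketch below shows the general idea behind such an index using only the standard library. It is an illustration, not what idx.py does: PSE relies on scikit-learn, and the function names (build_index, weigh, search), the whitespace tokenizer, and the sample pages are all assumptions made for this example. The IDF formula is the smoothed form that scikit-learn uses by default.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain lowercase whitespace splitting, no stop-word removal.
    return text.lower().split()


def weigh(tokens, idf):
    """TF-IDF weights for one token list, L2-normalised; unknown terms are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_index(docs):
    """Return (idf, vectors): an IDF value per term and one TF-IDF dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return idf, [weigh(tokens, idf) for tokens in tokenized]


def search(query, idf, vectors):
    """Return document indices ordered by cosine similarity to the query, best first."""
    q = weigh(tokenize(query), idf)
    scores = [sum(w * vec.get(t, 0.0) for t, w in q.items()) for vec in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)


# Hypothetical page texts standing in for the fetched bookmark pages.
pages = [
    "history of molecular biology",
    "flask web application tutorial",
    "ancient history of rome",
]
idf, vectors = build_index(pages)
ranking = search("history biology", idf, vectors)
print(ranking)  # [0, 2, 1]
```

The first page matches both query terms, the third matches only "history", and the second matches neither, which is why they rank in that order.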

(4/14/19 note): I forgot to perform this step and was left scratching my head over a "vocabulary.csv does not exist" (paraphrased) error.

Run the cmd-line app

The command-line app is mainly for testing purposes. You can run it like this (-b for bookmark search, -g for Google search):

> cd bin
> python query.py -b -q 'history biology'
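As a rough illustration of how the three flags fit together, here is a minimal argparse sketch. It is an assumption about the interface, not a copy of query.py: it treats -b and -g as simple on/off switches and -q as a required string, and the build_parser name is hypothetical.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the flags shown above; query.py may differ.
    parser = argparse.ArgumentParser(description="PSE query sketch")
    parser.add_argument("-b", action="store_true", help="search the bookmark index")
    parser.add_argument("-g", action="store_true", help="run a Google search")
    parser.add_argument("-q", required=True, help="the search query")
    return parser


# Equivalent to: python query.py -b -q 'history biology'
args = build_parser().parse_args(["-b", "-q", "history biology"])
print(args.b, args.g, args.q)  # True False history biology
```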

Run the web app

Or, better, run the web app:

> cd site
> python manage.py runserver

Then go to the following web address:

http://localhost:5000

Converting Firefox bookmarks to url.lst

> cd bin
> python bm2urlst.py /path/to/bookmarks.html | grep -v 'png$\|gif$\|jpg$' > ../data/url.lst
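The pipeline above extracts the links from an exported bookmarks file and drops any URL ending in png, gif, or jpg. The same idea can be sketched in pure Python with the standard html.parser module. This is an illustration under assumptions, not the contents of bm2urlst.py: the LinkCollector class, the bookmark_urls function, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser

# Same suffixes the grep filter removes (case-sensitive, like grep).
IMAGE_SUFFIXES = ("png", "gif", "jpg")


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag; HTMLParser lowercases tag and attribute names."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.urls.append(href)


def bookmark_urls(html_text):
    """Return bookmark URLs in document order, minus those that point at images."""
    collector = LinkCollector()
    collector.feed(html_text)
    return [u for u in collector.urls if not u.endswith(IMAGE_SUFFIXES)]


# Hypothetical excerpt in the style of an exported bookmarks.html file.
sample = """
<DL><p>
<DT><A HREF="https://example.org/article" ADD_DATE="1">Article</A>
<DT><A HREF="https://example.org/logo.png">Logo</A>
<DT><A HREF="https://example.org/notes.html">Notes</A>
</DL><p>
"""
print("\n".join(bookmark_urls(sample)))
```

Printing one URL per line gives output in the shape url.lst expects.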
