Discover workshop

This workshop material uses Discover, a code discovery service built on a TF-IDF (term frequency - inverse document frequency) estimator.

We aim to index code (content) from *.yaml and *.py files. User queries are then compared against the indexed code.

Content - ipython notebooks

This notebook explains

- how to index content from *.py files, specifically all functions in each Python file
- how to use the index to search for keywords and show the results (functions)

This notebook loads existing code indexes saved on disk and runs searches against them.

YAML
- We tokenize YAML values under url section (using gramex.yaml as an example, this can be replaced with your specific format).
- Each YAML file forms a document (row) in the matrix.
Python
- We tokenize all functions. For each function, we identify its name, its docstring, and the function and method calls it makes.
- Each Python file forms a document (row) in the matrix.
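
The Python tokenization described above can be sketched with the standard library's `ast` module. This is a minimal illustration, not Discover's actual implementation: the helper name `function_tokens`, the `SAMPLE` source, and the choice to split docstrings on whitespace are all assumptions.

```python
import ast


def function_tokens(source):
    """Return {function name: [tokens]} for every function in `source`.

    Hypothetical helper: tokens are the function name, the words of its
    docstring, and the names of the functions and methods it calls.
    """
    tokens = {}
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            words = [node.name]
            docstring = ast.get_docstring(node)
            if docstring:
                # Assumption: naive whitespace split, punctuation is kept.
                words.extend(docstring.lower().split())
            for child in ast.walk(node):
                if isinstance(child, ast.Call):
                    func = child.func
                    if isinstance(func, ast.Name):         # plain call: foo()
                        words.append(func.id)
                    elif isinstance(func, ast.Attribute):  # method call: obj.foo()
                        words.append(func.attr)
            tokens[node.name] = words
    return tokens


# Made-up source text standing in for the contents of one *.py file.
SAMPLE = '''
def load(path):
    """Read a config file."""
    with open(path) as handle:
        return handle.read().strip()
'''

result = function_tokens(SAMPLE)
print(result)
```

The tokens of all functions in a file can then be concatenated to form that file's document (row).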

These matrices are then stored on disk for lookups.
For a given user query, we first create a query vector, then compute the cosine similarity between it and each row of the document matrix.
Only the columns (words) relevant to the query are highlighted. The matching files are then identified using the row keys from the cosine similarity result and a key-to-file mapping. This is not the final result yet.
We want to identify the relevant code snippet for the query, so we repeat the cosine similarity step with the same query vector against a new document matrix built only from the file identified above.
The best match in that matrix is the relevant code snippet.
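
The two-stage lookup above (find the file first, then the function within it) can be sketched in plain Python. Everything named here is hypothetical: the `INDEX` contents, the helpers `tfidf_matrix`, `rank` and `search`, the whitespace tokenization of the query, and the smoothed IDF variant are assumptions for illustration, and the matrices are rebuilt in memory rather than loaded from disk.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Build sparse TF-IDF rows (dicts) from a {key: [tokens]} mapping."""
    n = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    # Assumption: smoothed IDF, ln((1 + N) / (1 + df)) + 1.
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    rows = {key: {t: c * idf[t] for t, c in Counter(tokens).items()}
            for key, tokens in docs.items()}
    return rows, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Return (key, score) pairs, highest cosine similarity first."""
    rows, idf = tfidf_matrix(docs)
    # Query words that are not in the vocabulary are ignored.
    counts = Counter(t for t in query.lower().split() if t in idf)
    query_vector = {t: c * idf[t] for t, c in counts.items()}
    scores = [(key, cosine(query_vector, row)) for key, row in rows.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical index: file -> function -> tokens.
INDEX = {
    "io_utils.py": {
        "read_csv": ["read_csv", "read", "csv", "file", "open"],
        "write_json": ["write_json", "write", "json", "file", "dump"],
    },
    "plots.py": {
        "draw_bar": ["draw_bar", "draw", "bar", "chart", "plot"],
        "draw_line": ["draw_line", "draw", "line", "chart", "plot"],
    },
}


def search(query):
    # Stage 1: one document per file, made of all its functions' tokens.
    file_docs = {name: [t for tokens in funcs.values() for t in tokens]
                 for name, funcs in INDEX.items()}
    filename = rank(query, file_docs)[0][0]
    # Stage 2: a new matrix with one document per function in that file.
    function = rank(query, INDEX[filename])[0][0]
    return filename, function


result = search("write json file")
print(result)
```

In this sketch a query with no known words scores zero against every document, so a real service would need to treat that case as "no result" instead of returning the first row.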

TF-IDF => term frequency - inverse document frequency
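
For reference, one common smoothed form of the weighting is written out below. The source does not say which variant Discover uses, so this is an assumption: tf(t, d) is the number of times term t occurs in document d, N is the number of documents, and df(t) is the number of documents that contain t.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1
```

With N = 3, a term that appears in all three documents gets idf = ln(4/4) + 1 = 1, while a term that appears in only one gets idf = ln(4/2) + 1 ≈ 1.69, so rarer terms weigh more.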

Cosine similarity is computed between the input vector (user query) and each row of the document matrix.
The results are sorted in descending order (highest cosine similarity first) to suggest code snippets.