edx-search
This is a django application to provide access to search services from within edx-platform applications.
Searching is accomplished by creating an index of documents, and then searching within that index for matching information. This application provides a way to add documents to the index, and then search for them.
SearchEngine
The SearchEngine is an abstract object which may have multiple implementations (at the time of writing there are 2 in existence - Elasticsearch and MockSearchEngine, which is primarily used for testing)
To operate with the currently-specified SearchEngine from the django settings, one can invoke a static method on SearchEngine as follows:
search_engine = SearchEngine.get_search_engine(index="index_name")
Indexing and searching is always performed within the context of a specific index, and the index name is provided within the call to get_search_engine
The core operations available within the SearchEngine interface are as follows:
search_engine.index(doc_type, body, **kwargs)
search_engine.remove(doc_type, doc_id, **kwargs)
search_engine.search(
query_string=None,
field_dictionary=None,
filter_dictionary=None,
**kwargs
)
where
-
index
- the operation to add a document to the index -
remove
- the operation to remove a document from the index -
search
- the operation to find matching documents within the index.doc_type
is supported as an optional keyword parameter to return results only with a certain doc_type
Index documents
Index documents are passed to the search application as python dictionaries, along with a doc_type
document type, which is also optionally supported as a way to return only certain document types from a search.
Example indexing operation
search_engine = SearchEngine.get_search_engine(index="test_index")
test_document = {
"course_id": "COURSE_ID",
"id": "object_id",
"some_attribute": "some_value",
"nested_attributes": {
"another_attribute": "another_value",
},
"content": {
"text": "This text will be searched completely"
},
"start_date": "2030-01-01T00:00:00+00:00",
}
search_engine.index("test_doc_type", test_document)
Searching for documents
Once documents have been added to the index, there are many ways in which one may want to search for content
- Search by matching field
- Search if matching numeric / date field is within a certain range
- Google-like search matching some text somewhere within the document contents
- Search by matching field, allowing unspecified results to be included
- Searches using a combination of these criteria
Search by matching field
This is accomplished by passing a field_dictionary
object with the desired values therein:
search_engine = SearchEngine.get_search_engine(index="test_index")
match_field_dict = {
"some_attribute": "some_value"
}
search_result = search_engine.search(field_dictionary=match_field_dict)
More than one value can be used in the matching criteria:
search_engine = SearchEngine.get_search_engine(index="test_index")
match_field_dict = {
"some_attribute": "some_value",
"nested_attributes.another_attribute": "another_value"
}
search_result = search_engine.search(field_dictionary=match_field_dict)
Notice the . notation for querying fields that are nested within the indexed object
Important notice: searching in multivalue fields (i.e. a lists) have a special semantics - if search term is a scalar value (i.e. string, number, etc.), search uses "contains in" predicate, so all documents containing specified search value as one of the elements of multivalue field are included in result set. If search term is vector value (i.e. list, tuple, dictionary, etc.), search will result in undefined behavior, specific to underlying search engine; thus using iterable as filter field value is discouraged.
Search results
The search_result
object returned from a call to search
is a python dict object that contains the following fields:
took
total
max_score
results
where:
took
is how many milliseconds the search engine took to process the querytotal
is the count of how many matches were foundmax_score
is the highest "score" match of the results - one cannot assume that there is an absolute scale of scores, but that a result with a higher score represents a thought to be better match than another result with a lower scoreresults
is an array of result objects
Each result object is a python dict object that contains the following fields:
score
data
where:
score
is the relative score of this record compared to othersdata
is a copy of the dict object stored within the index
Search if matching numeric / date field is within a certain range
This is a very similar situation as the matching field, in this case however the user wishes to find results witin a certain range.
For this purpose, this search app provides 2 classes within its utils
namespace - ValueRange
and DateRage
from search.utils import ValueRange, DateRange
search_engine = SearchEngine.get_search_engine(index="test_index")
range_specification = {
"age": ValueRange(18, 39)
}
# Check for people that can vote, but cannot become president yet
search_result = search_engine.search(field_dictionary=range_specification)
already_started = {
"start_date": DateRange(None, datetime.utcnow())
}
# Check for results that have started
search_result = search_engine.search(field_dictionary=already_started)
Note that passing None in either range parameter indicates no lower/upper bound is necessary
Google-like search matching some text somewhere within the document contents
This is a very popular way to search textual content, looking for matches for a specific word or phrase. Objects will often want to be indexed with field values that may want to be matched, but not matched textually. As a result, this type of search query takes the search term and restricts matches to text found within the content
subdictionary of any item.
search_engine = SearchEngine.get_search_engine(index="test_index")
search_result = search_engine.search(query_string="chocolate")
other_search_result = search_engine.search(query_string="chocolate heart")
Search by matching field, allowing unspecified results to be included
Consider a search for items that are allowed to be viewed up until this date. Many items may include a start_date
and one would want to look for items that have a start_date
that is past now
. However, content that does not specify a start_date
may be desireable to include.
Another example could be where content specifies a group of users that have access (e.g. cohorting) - content that does not specify a cohort should be visible to users from all cohorts.
In order to accomplish this operation, simply pass the criteria to the filter_dictionary
parameter instead of field_dictionary
search_engine = SearchEngine.get_search_engine(index="test_index")
cohort_filter = {
"cohort": "CohortA"
}
# Check for all objects that members of CohortA can access (with cohort specification for CohortA or no cohort specification)
search_result = search_engine.search(filter_dictionary=cohort_filter)
from search.utils import DateRange
already_started = {
"start_date": DateRange(None, datetime.utcnow())
}
# Check for results that have started, or don't specifiy a start_date
search_result = search_engine.search(filter_dictionary=already_started)
Important notice: same concerns about searching in multivalue fields apply here.
Searches using a combination of these criteria
All of these criteria can be combined to present results that are desired. Consider a search that wants to return objects that:
- Have a
doc_type
ofcourseware_content
- Have a textual match for "chocolate"
- Are within the course with id = "edX/DemoX/Demo_Course"
- Have already been released -
start_date
is not in the future (if it is specified)
A search like this is accomplished in one call to the search engine:
from search.utils import DateRange
search_engine = SearchEngine.get_search_engine(index="test_index")
filter_fields = {
"start_date": DateRange(None, datetime.utcnow())
}
match_fields = {
"course": "edX/DemoX/Demo_Course"
}
search_result = search_engine.search(
query_string="chocolate",
field_dictionary=match_fields,
filter_dictionary=filter_fields,
doc_type="courseware_content"
)
This example is so popular that the search app provides additional facility for performing this search
Paging search results
The call to search_engine.search
also accepts parameters to allow client applications to fetch results from muiltiple "pages" of results. The parameters to use are:
size
- the number of results to return in the pagefrom_
- the zero-based index of the first result for the page
Higher Level Operations
The app also provides a higher level operation in order to more easily integrate courseware search
api.perform_search
perform_search(
search_terms,
user=None,
size=10,
from_=0,
course_id=None
)
perform_search takes the complete search string and optionally a course_id, user, and paging arguments
SearchFilterGenerator
SearchFilterGenerator is a class that provides default filter_dictionary and field_dictionary arguments to be used in a search that is used for courseware.
By default, it implements a filter_dictionary with start_date
filtering, and if a course_id is provided will implement a field_dictionary that matches upon course id.
Users of the api.perform_search
or search app views.do_search
view can choose to override the SearchFilterGenerator, providing the override object in django setting named SEARCH_FILTER_GENERATOR
views.do_search
Currently the only http interface to the search app, django applications can use this by including search.urls in their urls.py:
url(r'^search/', include('search.urls'))
This http endpoint calls api.perform_search
to interface with the SearchEngine, and allows for clients to add additional fields with values inferred from the results received.
SearchResultProcessor
SearchResultProcessor is a class that provides default properties for a result, in particular a text excerpt showing matches from within the text.
Users of the views.do_search
view can choose to override the SearchResultProcessor, providing the override object in django setting named SEARCH_RESULT_PROCESSOR
.
In particular, the base SearchResultProcessor adds in each python property alongside the properties within the results. So, the override class simply defines properties that it desires to be exposed alongside the results - for example, the LMS includes an LmsSearchResultProcessor
that defines properties url
and location
which are used to infer the url / location for each search result. In this fashion, the search client can blend in information that it infers from the result fields, but it knows how to format / calculate.
Also, SearchResultProcessor overriders can override the member should_remove
which allows the client app to determine if access should be excluded to the search result - for example, the LMS includes an implementation of this member that calls the LMS has_access
as a last safety resort in case the end user does not have access to the result returned.
Testing
Tests use an Elasticsearch Docker container. To run tests locally use command:
make test_with_es
To simply run the container without starting the tests, run:
make test.start_elasticsearch
To stop an Elasticsearch Docker container, run:
make test.stop_elasticsearch