This repository accompanies the talk of the same name presented at the Linux Foundation Open Source Summit 2023. It shows how to use Apache OpenNLP to generate sentence vectors and then use those vectors for k-NN search in OpenSearch.
For the presentation see Using Apache OpenNLP with OpenSearch k-NN Vector Search.
If you have any questions please reach out to me through LinkedIn.
You need a few things to run the commands listed in this file:
- Docker and docker-compose
- Java 11 and Maven
- Python 3
First, raise the vm.max_map_count kernel setting, which OpenSearch requires to be at least 262144:
sudo sysctl -w vm.max_map_count=262144
Now start OpenSearch:
docker-compose up
Verify OpenSearch is running:
curl -k -u admin:admin https://localhost:9200
Verify the opensearch-knn plugin is installed:
curl -k -u admin:admin https://localhost:9200/_cat/plugins
Create an index named vectors with k-NN enabled and a 384-dimension knn_vector field called my_vector:
curl -k -u admin:admin -X PUT -H "Content-type: application/json" https://localhost:9200/vectors -d '
{
"settings": {
"index.knn": true
},
"mappings": {
"properties": {
"my_vector": {
"type": "knn_vector",
"dimension": 384
}
}
}
}'
The following commands convert the model and write it to a directory called onnx:
python3 -m pip install -r requirements.txt
python3 convert-model.py
Build the Java app:
cd opennlp-knn
mvn clean install
Run the Java app, passing in the path to the onnx directory that was created above:
java -jar ./target/opennnlp-knn-jar-with-dependencies.jar /path/to/onnx/
The Java app generates vectors for three sentences:
sentences.add("george washington was president");
sentences.add("abraham lincoln was president");
sentences.add("john likes ice cream");
The output (the vector for each sentence) is written to a file named out.txt, ready to be indexed into OpenSearch. Index the vectors with the bulk API:
curl -s -k -u admin:admin -X POST -H "Content-type: application/x-ndjson" https://localhost:9200/vectors/_bulk --data-binary @out.txt
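For reference, the _bulk endpoint expects newline-delimited JSON: an action line followed by a document line for each document, with a newline at the very end. The sketch below shows how a file shaped like out.txt could be produced in Python. The exact contents the Java app writes are not shown in this README, so the id field and the helper name to_bulk_ndjson are assumptions for illustration; only the index name vectors and the field name my_vector come from the mapping above.

```python
import json


def to_bulk_ndjson(vectors, index="vectors", field="my_vector"):
    """Build a _bulk request body: one action line plus one document line per vector."""
    lines = []
    for doc_id, vector in enumerate(vectors, start=1):
        # Action line: tells OpenSearch which index and _id the next line belongs to.
        lines.append(json.dumps({"index": {"_index": index, "_id": str(doc_id)}}))
        # Document line: the "id" field is an assumption, not taken from the Java app.
        lines.append(json.dumps({"id": str(doc_id), field: list(vector)}))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Tiny 3-dimensional vectors for illustration; the index above expects 384.
    print(to_bulk_ndjson([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]), end="")
```

Each vector must have exactly as many values as the dimension declared in the mapping, otherwise OpenSearch rejects the document.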
Now with the vectors indexed, you can search by sending a vector to OpenSearch to find similar documents:
curl -s -k -u admin:admin -X GET -H "Content-type: application/json" https://localhost:9200/vectors/_search -d '
{
"size": 10,
"query": {
"knn": {
"my_vector": {
"vector": [0.23664545, 0.16271955, 0.2174448, 0.19018926, 0.14418952, 0.13174078, 0.14475523, 0.15135369, 0.13017027, 0.18495294, 0.15273653, 0.21680894, 0.15522662, 0.13694441, 0.11260824, 0.12069248, 0.124871716, 0.21574062, 0.12304607, 0.26746073, 0.22132963, 0.17709397, 0.13960555, 0.060655076, 0.114867084, 0.19016309, 0.15640156, 0.13960022, 0.16447519, 0.10776763, 0.13393763, 0.15837277, 0.19648154, 0.25433046, 0.09048271, 0.15899889, 0.27460718, 0.23531353, 0.26636258, 0.17056502, 0.15411225, 0.18631229, 0.18292066, 0.15764469, 0.11144164, 0.15515296, 0.14647679, 0.12992007, 0.19755481, 0.21127276, 0.16773675, 0.17822684, 0.081488326, 0.19486889, 0.11746454, 0.18362841, 0.10810352, 0.095823295, 0.18721107, 0.16446202, 0.09478745, 0.17543244, 0.09723724, 0.17882656, 0.14108664, 0.16814047, 0.09164065, 0.16521196, 0.19185877, 0.12102438, 0.20289262, 0.17702778, 0.1477192, 0.18535486, 0.14254645, 0.13670816, 0.27466482, 0.21628429, 0.23626985, 0.20824929, 0.14723091, 0.29158756, 0.16650334, 0.170777, 0.17382859, 0.16168734, 0.14707841, 0.15071529, 0.16275497, 0.19760016, 0.119973764, 0.16246775, 0.22451362, 0.17063412, 0.12662533, 0.14431766, 0.1835509, 0.23468848, 0.18764499, 1.0, 0.13367075, 0.17335148, 0.23693828, 0.20538032, 0.17373805, 0.16211696, 0.0998079, 0.116707265, 0.1830955, 0.14858359, 0.15820478, 0.15011069, 0.20348215, 0.18964784, 0.18103087, 0.15561956, 0.095463276, 0.16301574, 0.09802429, 0.09372587, 0.1933215, 0.15122011, 0.16783695, 0.13272944, 0.18347937, 0.0, 0.1815874, 0.17167109, 0.09428583, 0.1925427, 0.24836546, 0.18353534, 0.121468276, 0.3457675, 0.1355196, 0.12590978, 0.21900332, 0.18979128, 0.15065387, 0.21686985, 0.18482178, 0.23940022, 0.18947776, 0.2031004, 0.15762848, 0.16114101, 0.22075693, 0.23564969, 0.173029, 0.13671051, 0.29958567, 0.15742525, 0.25908074, 0.17523195, 0.15779102, 0.14940053, 0.19008367, 0.10765594, 0.10944032, 0.11613366, 0.105877146, 0.14264658, 0.18766277, 0.19525541, 0.23629734, 0.04603964, 0.19965075, 
0.11592721, 0.23894139, 0.16100037, 0.1681287, 0.18925342, 0.12981479, 0.14560045, 0.20460646, 0.20139179, 0.20177117, 0.19033647, 0.17518646, 0.19974054, 0.1689669, 0.13102426, 0.0840263, 0.22153068, 0.22257482, 0.16642016, 0.1255874, 0.2541051, 0.1869613, 0.16180694, 0.18619464, 0.18035275, 0.13024777, 0.19522472, 0.02552168, 0.22151403, 0.17530297, 0.20385198, 0.17834094, 0.07808495, 0.16007194, 0.12479354, 0.14123559, 0.20516622, 0.16084681, 0.117723696, 0.14159155, 0.16063575, 0.14099841, 0.1765709, 0.29642156, 0.11697753, 0.15479986, 0.18462579, 0.18700477, 0.21281801, 0.19642152, 0.12790817, 0.12610824, 0.18212147, 0.13186763, 0.119399115, 0.20349103, 0.17167109, 0.15752763, 0.11593006, 0.16146657, 0.19028728, 0.17608745, 0.21866994, 0.18162717, 0.15089077, 0.12592393, 0.1736157, 0.24570778, 0.19349174, 0.13993415, 0.17995381, 0.19037879, 0.19429448, 0.15939124, 0.14427215, 0.13817333, 0.10171517, 0.115659416, 0.22828655, 0.16872443, 0.16765508, 0.18964003, 0.18936592, 0.17748052, 0.13721555, 0.19756551, 0.209285, 0.16828321, 0.2243201, 0.19638564, 0.20979631, 0.18657446, 0.21446039, 0.16728161, 0.08388079, 0.24585138, 0.22565176, 0.12493765, 0.16055486, 0.2030657, 0.14127095, 0.14577648, 0.16496988, 0.19037668, 0.21545793, 0.12634592, 0.07807021, 0.15814641, 0.18368497, 0.1840515, 0.11190097, 0.19126022, 0.19897985, 0.06268184, 0.14517978, 0.16868734, 0.15939514, 0.107347146, 0.0878329, 0.15592113, 0.20570728, 0.15630648, 0.12607224, 0.13068745, 0.14428177, 0.08001451, 0.1419112, 0.1917735, 0.14215901, 0.2179921, 0.19925006, 0.14066926, 0.12932129, 0.12169988, 0.11029747, 0.17215972, 0.119957775, 0.16751705, 0.15364987, 0.16617599, 0.1051671, 0.117208436, 0.2093214, 0.18148111, 0.15815775, 0.17999752, 0.14196743, 0.14687419, 0.2184067, 0.23346452, 0.18894196, 0.057921283, 0.17167108, 0.1822196, 0.16115609, 0.26758492, 0.2018112, 0.1500529, 0.18790597, 0.16545667, 0.12878121, 0.19523199, 0.13644966, 0.1815596, 0.11932636, 0.18732114, 0.19135337, 0.17326991, 
0.13787106, 0.12483077, 0.2034319, 0.2388653, 0.2278496, 0.14538608, 0.20477888, 0.088797055, 0.23211145, 0.23524137, 0.19275272, 0.2570222, 0.2044691, 0.18843903, 0.16659254, 0.19449286, 0.14957592, 0.15855056, 0.16775526, 0.14744045, 0.20881936, 0.2503084, 0.17591618, 0.1580938, 0.21269342, 0.16027167, 0.22504497, 0.059246995, 0.19432402, 0.1312063, 0.23117721, 0.13519742, 0.17674147, 0.23158675, 0.22360769, 0.14149134, 0.28885874, 0.17583111, 0.07876832, 0.14243649, 0.1814256, 0.1749298, 0.193659, 0.15650244, 0.11954998, 0.17497128, 0.20174696, 0.12325848, 0.21539776],
"k": 10
}
}
},
"stored_fields": [
"id"
]
}
'
In this example we just searched for a vector that we already had, but we could have re-run the Java app to generate a vector for a different sentence and used that vector to search.
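Pasting 384 numbers into a curl command by hand is error-prone, so when searching with a freshly generated vector it may be easier to build the request body programmatically. The following is a minimal sketch, not part of this repository: the helper name build_knn_query is hypothetical, and the placeholder vector stands in for whatever the Java app produces for your sentence.

```python
import json


def build_knn_query(vector, k=10, field="my_vector"):
    """Return a k-NN search body with the same shape as the request above."""
    return {
        "size": k,
        "query": {"knn": {field: {"vector": list(vector), "k": k}}},
        "stored_fields": ["id"],
    }


if __name__ == "__main__":
    # Placeholder vector; a real query needs the 384 values produced by the Java app.
    query_vector = [0.0] * 384
    print(json.dumps(build_knn_query(query_vector, k=3)))
```

The printed JSON can be saved to a file and sent with curl's -d @file option instead of the inline body.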
The response to the search request contains the indexed documents, along with a score for each document:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "vectors",
"_id": "1",
"_score": 1
},
{
"_index": "vectors",
"_id": "2",
"_score": 0.75490916
},
{
"_index": "vectors",
"_id": "3",
"_score": 0.4180176
}
]
}
}
The first document is an exact match with a score of 1, followed by documents 2 and 3. This makes sense: the first two sentences are similar to each other (both are about presidents), while the third sentence is about something else entirely.
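To make the scores more concrete: the mapping above does not set a space type, so this sketch assumes the index uses the k-NN plugin's default, l2. For that space type the OpenSearch documentation gives the score as 1 / (1 + d), where d is the squared Euclidean distance between the query vector and the document vector. Under that assumption a score of 1 means a distance of 0, which is what we expect when searching with a vector that is already indexed. The function names below are illustrative only.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def l2_score(a, b):
    """Score for the l2 space type, assuming score = 1 / (1 + squared distance)."""
    return 1.0 / (1.0 + squared_l2(a, b))


def distance_from_score(score):
    """Invert the formula to recover the squared distance from a score."""
    return 1.0 / score - 1.0


if __name__ == "__main__":
    v = [0.2, 0.1, 0.4]
    print(l2_score(v, v))                   # identical vectors
    print(distance_from_score(0.75490916))  # document 2 in the response above
    print(distance_from_score(0.4180176))   # document 3 in the response above
```

If the l2 assumption holds, the scores for documents 2 and 3 correspond to squared distances of roughly 0.32 and 1.39 from the query vector.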