Xapian binding for Erlang
License: MIT, GPL2 or higher (Xapian is still under GPL only.)
Author: Uvarov Michael (arcusfelis@gmail.com)
Xapian is an Open Source Search Engine Library, written in C++. Xapian is a highly adaptable toolkit which allows developers to easily add advanced indexing and search facilities to their own applications.
Xapian library
Install the Xapian library itself. In Gentoo Linux:
emerge dev-libs/xapian
Installation
I use rebar for building.
Try as a stand-alone Erlang application:
git clone git://github.com/arcusfelis/xapian-erlang-bindings.git xapian
cd xapian
./rebar get-deps compile
./start-dev.sh
Add as a dependency to rebar.config:
{deps, [
    {xapian, ".*",
     {git, "git://github.com/arcusfelis/xapian-erlang-bindings.git", "master"}}
]}.
Google hash map (optional)
You can use Google sparse hash for storing resource ids. In the Debian and Ubuntu repositories, it is packaged as libsparsehash-dev. The C++ preprocessor macro GOOGLE_HASH_MAP enables the Google hash map as the hash map implementation. In Gentoo Linux, install it with:
emerge dev-cpp/sparsehash
Using
This application uses records defined in the file include/xapian.hrl. To include it, use:
-include_lib("xapian/include/xapian.hrl").
Tests
The following command runs the tests:
$ ./rebar eunit skip_deps=true
A pool of readers
Path = filename:join([code:priv_dir(xapian), test_db, simple]).
{ok, Pid} = xapian_pool:open([{name, simple}], Path, []).
result = xapian_pool:checkout([simple],
fun([Server]) -> io:write(Server), result end).
Readers use the Poolboy application. There is only one writer for each database, so there is no writer pool. You can use a named process and a supervisor instead:
{ok, Pid} = xapian_server:open(Path, [{name, simple_writer}, write]).
xapian_server:add_document(simple_writer, [#x_text{value = "Paragraph 1"}]).
If you run this code from the Erlang shell, the following command will be useful:
rr(code:lib_dir(xapian, include) ++ "/xapian.hrl").
It loads the record definitions into the shell.
A pool is supervised by xapian_sup. That is why calling the xapian_pool:open function does not link the parent process with the new process.
As with xapian_drv:transaction, you can check out several pools at once:
xapian_pool:checkout([pool1, pool2],
    fun([Server1, Server2]) -> actions_here end).
If an error occurs, an exception is thrown and the workers are returned to the pool.
catch xapian_pool:checkout([simple], fun([S]) -> 5 = 2 + 2 end).
{'EXIT',{{badmatch,4},[{erl_eval,expr,3,[]}]}}
Multi-database support
You can use this code for opening two databases from the directories "DB1" and "DB2".
{ok, Server} = xapian_driver:open([#x_database{path="DB1"},
#x_database{path="DB2"}], []).
Only read-only databases can be used.
There are two fields that hold a document's id: docid and multi_docid. They are equal if only one database is used. Otherwise, docid contains the document id within its own database (it can repeat across databases) and multi_docid is a unique identifier, which is calculated from docid and db_number.
db_number is the number of the document's database, counting from 1.
The db_name field contains the pseudonyms of the databases. The name field of the #x_database{} record is used for this. This field is undefined by default.
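The text above does not spell out how multi_docid is derived. A minimal sketch, assuming Xapian's usual interleaving of document ids across databases (an assumption, not something this README states), and assuming the total number of databases is also known. The module and function names are hypothetical and are not part of the binding.

```erlang
-module(multi_docid_sketch).
-export([multi_docid/3, split/2]).

%% Assumed interleaving scheme: with DbCount databases, database 1 owns
%% the ids 1, DbCount+1, 2*DbCount+1, ... and database 2 owns
%% 2, DbCount+2, ... DbNumber counts from 1.
multi_docid(DocId, DbNumber, DbCount)
  when DocId >= 1, DbNumber >= 1, DbNumber =< DbCount ->
    (DocId - 1) * DbCount + DbNumber.

%% Inverse mapping: recover {DocId, DbNumber} from a multi_docid.
split(MultiDocId, DbCount) when MultiDocId >= 1, DbCount >= 1 ->
    DocId = (MultiDocId - 1) div DbCount + 1,
    DbNumber = (MultiDocId - 1) rem DbCount + 1,
    {DocId, DbNumber}.
```

With a single database (DbCount = 1) the function returns DocId unchanged, which agrees with the statement that both fields are equal in that case.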
Here is a full multi-database example:
%% qlc.hrl is required for qlc:q/1; xapian.hrl defines #x_database{}.
-include_lib("stdlib/include/qlc.hrl").
-include_lib("xapian/include/xapian.hrl").
-record(document, {docid, db_name, multi_docid, db_number}).

example() ->
    DB1 = #x_database{name=db1, path="DB1"},
    DB2 = #x_database{name=db2, path="DB2"},
    {ok, Server} = xapian_driver:open([DB1, DB2], []),
    EnquireResourceId = xapian_driver:enquire(Server, "query string"),
    MSetResourceId = xapian_driver:match_set(Server, EnquireResourceId),
    %% Use a record_info call for retrieving a list of field names
    Meta = xapian_record:record(document, record_info(fields, document)),
    Table = xapian_mset_qlc:table(Server, MSetResourceId, Meta),
    %% Collect the unique multi_docid of every matched document
    qlc:e(qlc:q([DocId || #document{multi_docid=DocId} <- Table])).
Resources
A resource is a C++ object, which can be passed and stored inside an Erlang VM. Each server can have its own set of resources. Resources from other servers cannot be used or controlled. Resources are not automatically garbage-collected, but if a control process (server) dies, all its resources are released.
Use the release_resource(Server, Resource) function call to free a resource which is no longer needed. A second call of this function with the same arguments will cause an error:
1> Path = filename:join([code:priv_dir(xapian), test_db, simple]).
"/home/user/erlang/xapian/priv/test_db/simple"
2> {ok, Server} = xapian_server:open(Path, []).
{ok,<0.57.0>}
3> ResourceId = xapian_server:enquire(Server, "query").
#Ref<0.0.0.69>
4> xapian_server:release_resource(Server, ResourceId).
ok
5> xapian_server:release_resource(Server, ResourceId).
** exception error: elem_not_found
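Because resources are not garbage-collected and a second release raises an error, it can help to release each resource exactly once, even when the code using it fails. A minimal sketch of such a wrapper built on try ... after; the module resource_sketch and the function with_resource/3 are hypothetical helpers and are not part of the binding.

```erlang
-module(resource_sketch).
-export([with_resource/3]).

%% Acquire :: fun(() -> Resource)
%% Release :: fun((Resource) -> any())
%% Use     :: fun((Resource) -> Result)
%%
%% Hypothetical usage with the calls shown in the session above:
%%   with_resource(
%%     fun() -> xapian_server:enquire(Server, "query") end,
%%     fun(R) -> xapian_server:release_resource(Server, R) end,
%%     fun(R) -> io:write(R) end).
with_resource(Acquire, Release, Use) ->
    Resource = Acquire(),
    try
        Use(Resource)
    after
        %% Runs exactly once, whether Use/1 returns or raises.
        Release(Resource)
    end.
```

The wrapper returns the result of Use/1 and lets any exception propagate after the release has run.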
Using a port
A port program runs in a separate OS process, so a crash in it cannot bring down the Erlang VM. The port program is compiled by rebar.
For running a single server in port mode use:
{ok, Server} = xapian_driver:open(Path, [port|Params]).
For running all servers in port mode use:
application:set_env(xapian, default_open_parameters, [port]).
Testing a port
$ erl -pa ./.eunit/ ./../xapian/ebin ./deps/?*/ebin
application:set_env(xapian, default_open_parameters, [port]).
eunit:test({application, xapian}, [verbose]).
Document forms
- Document Constructor (CD)
- Extracted Document (ED)
- Document Id (ID)
- Document Resource (RD)
Conversions:
- ID to RD: xapian_server:document(S, ID) -> RD
- CD to RD: xapian_server:document(S, CD) -> RD
- CD to ED: xapian_server:document_info(S, CD, Meta) -> ED
- ID to ED: xapian_server:read_document(S, ID, Meta) -> ED
Helpers
Stand-alone Stemmer
1> {ok, S} = xapian_server:open([],[]).
{ok,<0.79.0>}
2> xapian_helper:stem(S, <<"english">>, "octopus cat").
[#x_term{value = <<"Zcat">>,position = [],frequency = 1},
#x_term{value = <<"Zoctopus">>,position = [],frequency = 1},
#x_term{value = <<"cat">>, position = [2], frequency = 1},
#x_term{value = <<"octopus">>, position = [1], frequency = 1}]
3> xapian_helper:stem(S, <<"english">>, "octopus cats").
[#x_term{value = <<"Zcat">>,position = [],frequency = 1},
#x_term{value = <<"Zoctopus">>,position = [],frequency = 1},
#x_term{value = <<"cats">>, position = [2], frequency = 1},
#x_term{value = <<"octopus">>, position = [1], frequency = 1}]
4> xapian_helper:stem(S, none, "octopus cats").
[#x_term{value = <<"cats">>, position = [2], frequency = 1},
#x_term{value = <<"octopus">>, position = [1], frequency = 1}]
5> xapian_helper:stem(S, "english", "Zcat").
[#x_term{value = <<"Zzcat">>,position = [], frequency = 1},
#x_term{value = <<"zcat">>, position = [1], frequency = 1}]
6> xapian_helper:stem(S, "english", "cat octo-cat").
[#x_term{value = <<"Zcat">>,position = [],frequency = 2},
#x_term{value = <<"Zocto">>,position = [],frequency = 1},
#x_term{value = <<"cat">>, position = [1,3], frequency = 2},
#x_term{value = <<"octo">>, position = [2], frequency = 1}]
"Z" is a prefix. It marks a term as stemmed.
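The prefix makes it possible to separate stemmed terms from unstemmed ones by pattern matching on the term value. A minimal sketch that works on plain binaries such as the value field of the terms above; it assumes that unstemmed terms never start with a capital "Z" (the session above shows the input "Zcat" indexed as lower-case "zcat", which is consistent with this but is not a guarantee). The module and function names are hypothetical.

```erlang
-module(stem_prefix_sketch).
-export([is_stemmed/1, partition/1]).

%% A term value counts as stemmed when it starts with the "Z" prefix.
is_stemmed(<<"Z", _/binary>>) -> true;
is_stemmed(Term) when is_binary(Term) -> false.

%% Split a list of term values into {Stemmed, Unstemmed},
%% keeping the original order inside each list.
partition(Terms) ->
    lists:partition(fun is_stemmed/1, Terms).
```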