License: MIT, GPL2 or higher (Xapian is still under GPL only.)
Author: Uvarov Michael (arcusfelis@gmail.com)
Xapian is an Open Source Search Engine Library, written in C++. Xapian is a highly adaptable toolkit which allows developers to easily add advanced indexing and search facilities to their own applications.
Install the Xapian library itself first. On Gentoo Linux:
emerge dev-libs/xapian
I use rebar for building.
Try as a stand-alone Erlang application:
git clone git://github.com/arcusfelis/xapian-erlang-bindings.git xapian
cd xapian
./rebar get-deps compile
./start-dev.sh
Add it as a dependency to rebar.config:
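A minimal sketch of such an entry, in rebar 2 syntax. The repository URL is the one from the clone command above; the application name `xapian`, the version pattern `".*"` and the `"master"` branch are assumptions, so adjust them to your setup.

```erlang
%% rebar.config (rebar 2 syntax).
%% Assumptions: application name `xapian`, any version, branch "master".
{deps, [
    {xapian, ".*",
     {git, "git://github.com/arcusfelis/xapian-erlang-bindings.git", "master"}}
]}.
```

After editing the file, `./rebar get-deps compile` fetches and builds the dependency.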
You can use Google sparsehash for storing resource ids.
In the Debian and Ubuntu repositories, it is packaged as libsparsehash-dev.
The C++ preprocessor macro GOOGLE_HASH_MAP enables the Google hash map as the hash map implementation.
On Gentoo Linux, install the library with:
emerge dev-cpp/sparsehash
This application uses records defined in the file include/xapian.hrl. To include it, use:
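A hedged example of the include directive, assuming the application is available on the code path under the name `xapian`:

```erlang
%% Put this near the top of your module, after the -module attribute.
%% Assumes the application directory is named `xapian`.
-include_lib("xapian/include/xapian.hrl").
```

`-include_lib` resolves the first path component as an application name, so the directive keeps working when the library is installed as a rebar dependency.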
The following command runs the tests:
$ ./rebar eunit skip_deps=true
Readers use the Poolboy application. There is only one writer for each database, so there is no writer pool. You can use a named process and a supervisor instead:
If you run this code from the Erlang shell, the following command is useful:
It loads the record definitions into the shell.
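One way to load the record definitions is the shell's built-in `rr/1`. This is a sketch that assumes the `xapian` application is on the code path, so that `code:lib_dir(xapian)` returns its directory:

```erlang
%% In the Erlang shell: read all record definitions from xapian.hrl.
%% Assumes `xapian` is on the code path (for example, started with -pa).
rr(filename:join([code:lib_dir(xapian), "include", "xapian.hrl"])).
```

After this call, the shell prints values such as `#x_term{}` with field names instead of plain tuples.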
A pool is supervised by xapian_sup. That is why calling the xapian_pool:open function does not link the parent process with the new process.
As with xapian_drv:transaction, you can check out several pools at once.
If an error occurs, an exception will be thrown and workers will be returned into the pool.
You can use this code for opening two databases from the directories "DB1" and "DB2".
Only read-only databases can be used.
Two fields hold a document's id: docid and multi_docid. They are equal if only one database is used.
Otherwise, docid contains the document id within its own database (so it can repeat across databases), and multi_docid is a unique identifier, which is calculated from docid and db_number.
db_number is the number of the document's database, counting from 1.
The db_name field contains the pseudonym of the database. Information from the name field of the #x_database{} record is used for this. This field is undefined by default.
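The relation between docid, db_number and multi_docid can be sketched as follows. This assumes the bindings expose Xapian's usual interleaving scheme for multi-database ids; the module and function names are hypothetical and are not part of this library.

```erlang
-module(multi_docid_sketch).
-export([multi_docid/3, split/2]).

%% Combine a per-database document id with the database number.
%% DocId and DbNumber count from 1; DbCount is the number of databases.
%% Assumption: ids are interleaved across databases, as Xapian does.
multi_docid(DocId, DbNumber, DbCount)
  when DocId >= 1, DbNumber >= 1, DbNumber =< DbCount ->
    (DocId - 1) * DbCount + DbNumber.

%% Inverse operation: recover {DocId, DbNumber} from a multi_docid.
split(MultiDocId, DbCount) when MultiDocId >= 1, DbCount >= 1 ->
    DocId = (MultiDocId - 1) div DbCount + 1,
    DbNumber = (MultiDocId - 1) rem DbCount + 1,
    {DocId, DbNumber}.
```

With two databases, document 1 of database 2 gets multi_docid 2, and document 2 of database 1 gets multi_docid 3. With a single database the formula reduces to `multi_docid = docid`, which matches the statement above.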
Here is a full multi-database example:
-include_lib("stdlib/include/qlc.hrl").    %% required by qlc:q/1
-include_lib("xapian/include/xapian.hrl"). %% defines #x_database{} (assumed path)

-record(document, {docid, db_name, multi_docid, db_number}).

%% Returns the multi_docid of every document matching the query.
example() ->
    %% Each database gets its own pseudonym, reported in db_name.
    DB1 = #x_database{name=db1, path="DB1"},
    DB2 = #x_database{name=db2, path="DB2"},
    {ok, Server} = xapian_driver:open([DB1, DB2], []),
    EnquireResourceId = xapian_driver:enquire(Server, "query string"),
    MSetResourceId = xapian_driver:match_set(Server, EnquireResourceId),
    %% Use a record_info call for retrieving a list of field names
    Meta = xapian_record:record(document, record_info(fields, document)),
    Table = xapian_mset_qlc:table(Server, MSetResourceId, Meta),
    qlc:e(qlc:q([DocId || #document{multi_docid=DocId} <- Table])).
A resource is a C++ object that can be passed around and stored inside the Erlang VM. Each server has its own set of resources; resources from other servers cannot be used or controlled. Resources are not automatically garbage-collected, but if the controlling process (server) dies, all its resources are released.
Use the release_resource(Server, Resource) function call to free a resource which is no longer needed.
A second call of this function with the same arguments will cause an error:
1> Path = filename:join([code:priv_dir(xapian), test_db, simple]).
"/home/user/erlang/xapian/priv/test_db/simple"
2> {ok, Server} = xapian_server:open(Path, []).
{ok,<0.57.0>}
3> ResourceId = xapian_server:enquire(Server, "query").
#Ref<0.0.0.69>
4> xapian_server:release_resource(Server, ResourceId).
ok
5> xapian_server:release_resource(Server, ResourceId).
** exception error: elem_not_found
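Since resources are not garbage-collected and a second release raises an error, it can help to release each resource exactly once in a `try ... after` block. The helper below is a generic sketch, not part of this library; the module name and the three-fun interface are hypothetical.

```erlang
-module(resource_sketch).
-export([with_resource/3]).

%% Acquire :: fun(() -> Resource)
%% Release :: fun((Resource) -> any())
%% Use     :: fun((Resource) -> Result)
%%
%% Runs Use(Resource) and calls Release(Resource) exactly once afterwards,
%% even when Use throws an exception.
with_resource(Acquire, Release, Use) ->
    Resource = Acquire(),
    try
        Use(Resource)
    after
        Release(Resource)
    end.

%% Possible usage with the calls shown in the session above:
%%   resource_sketch:with_resource(
%%       fun() -> xapian_server:enquire(Server, "query") end,
%%       fun(R) -> xapian_server:release_resource(Server, R) end,
%%       fun(EnquireId) -> EnquireId end).
```

The caller never holds a released reference, which avoids the `elem_not_found` error from a double release.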
In port mode the driver runs as a separate OS process, so a crash in it cannot bring down the Erlang VM. The port program is compiled by rebar.
For running a single server in port mode use:
For running all servers in port mode use:
$ erl -pa ./.eunit/ ./../xapian/ebin ./deps/?*/ebin
- Document Constructor (CD)
- Extracted Document (ED)
- Document Id (ID)
- Document Resource (RD)
Conversions:
- ID to RD: xapian_server:document(S, ID) -> RD
- CD to RD: xapian_server:document(S, CD) -> RD
- CD to ED: xapian_server:document_info(S, CD, Meta) -> ED
- ID to ED: xapian_server:read_document(S, ID, Meta) -> ED
1> {ok, S} = xapian_server:open([],[]).
{ok,<0.79.0>}
2> xapian_helper:stem(S, <<"english">>, "octopus cat").
[#x_term{value = <<"Zcat">>,position = [],frequency = 1},
#x_term{value = <<"Zoctopus">>,position = [],frequency = 1},
#x_term{value = <<"cat">>, position = [2], frequency = 1},
#x_term{value = <<"octopus">>, position = [1], frequency = 1}]
3> xapian_helper:stem(S, <<"english">>, "octopus cats").
[#x_term{value = <<"Zcat">>,position = [],frequency = 1},
#x_term{value = <<"Zoctopus">>,position = [],frequency = 1},
#x_term{value = <<"cats">>, position = [2], frequency = 1},
#x_term{value = <<"octopus">>, position = [1], frequency = 1}]
4> xapian_helper:stem(S, none, "octopus cats").
[#x_term{value = <<"cats">>, position = [2], frequency = 1},
#x_term{value = <<"octopus">>, position = [1], frequency = 1}]
5> xapian_helper:stem(S, "english", "Zcat").
[#x_term{value = <<"Zzcat">>,position = [], frequency = 1},
#x_term{value = <<"zcat">>, position = [1], frequency = 1}]
6> xapian_helper:stem(S, "english", "cat octo-cat").
[#x_term{value = <<"Zcat">>,position = [],frequency = 2},
#x_term{value = <<"Zocto">>,position = [],frequency = 1},
#x_term{value = <<"cat">>, position = [1,3], frequency = 2},
#x_term{value = <<"octo">>, position = [2], frequency = 1}]
"Z"
is a prefix. It means that this term is stemmed.
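To separate stemmed terms from raw ones, matching on the "Z" prefix is enough for the outputs shown above, where raw terms come back lowercased. The sketch below works on plain term binaries (the `value` field of `#x_term{}`); the module and function names are hypothetical.

```erlang
-module(stem_prefix_sketch).
-export([is_stemmed/1, stemmed_only/1]).

%% A term value is treated as stemmed when it starts with the "Z" prefix.
is_stemmed(<<"Z", _/binary>>) -> true;
is_stemmed(Term) when is_binary(Term) -> false.

%% Keep only the stemmed term values from a list of binaries.
stemmed_only(Terms) ->
    [T || T <- Terms, is_stemmed(T)].
```

For the first session above, applying `stemmed_only/1` to the four term values keeps `<<"Zcat">>` and `<<"Zoctopus">>`.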