Automatically exported from code.google.com/p/fangorn

Primary language: Java

**********************************
Overview
**********************************

Fangorn runs within a Jetty web server located in the "server" directory. The
"index" directory contains the search indexes, and "db" contains an embedded
Apache Derby database, which is used only to manage corpora state. We use the
term "application" synonymously with "server", because starting the server
starts the application.


**********************************
Installation instructions
**********************************

Read the INSTALL file in the root directory.
All commands listed below must be run from the install directory, not from the
root directory that contains this file.


**********************************
Data
**********************************

The data must be in Penn Treebank bracketed format. An example annotation looks
like this:

((S(NP-SBJ(EX There))(VP(VBZ is)(NP-PRD(DT no)(NN asbestos))(PP-LOC(IN in)(NP(PRP$ our)(NNS products)))(ADVP-TMP(RB now)))(. .)('' '')))

A file can contain one or more sentences and may sit at any depth in the
directory hierarchy under the main corpus directory. A corpus file is
identified by its extension: .mrg, or .gz if gzipped.

A line starting with a '*' is treated as a comment and is ignored.

The application is distributed with sample data from the Penn Treebank Corpus
available in NLTK's data distribution.
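
As a rough pre-flight check before indexing, the file rules above can be
sketched in shell. The function names are illustrative and are not part of
Fangorn; they only mirror the rules stated in this section (files ending in
.mrg or .gz at any depth, lines starting with '*' ignored).

```shell
# Hypothetical helpers, not shipped with Fangorn.

# list_corpus_files DIR
# Print every file ending in .mrg or .gz found at any depth below DIR.
list_corpus_files() {
    find "$1" -type f \( -name '*.mrg' -o -name '*.gz' \) | sort
}

# strip_comment_lines FILE
# Print a plain-text corpus file without its comment lines,
# i.e. without the lines whose first character is '*'.
strip_comment_lines() {
    grep -v '^\*' "$1"
}
```

For example, "list_corpus_files /home/test/Corpora/Penn/WSJ | wc -l" counts the
corpus files under that directory.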


**********************************
Running the application
**********************************

Run the script start-app.sh

NOTE:
A. Software requirements: Java SE 5 or above, either the JRE (runtime
environment) or the JDK (development kit). This distribution has been compiled
without optimization so that it runs on several distributions of Java. For
faster querying, email sghodke@csse.unimelb.edu.au and ask for the Java 5/6/7
optimised version.
B. The application is distributed with an example set of 343 sentences from
the Penn Treebank corpus.
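
To confirm that the installed Java meets the requirement in note A, something
like the sketch below could be used. The helper name is hypothetical and not
part of Fangorn. It assumes the usual "java -version" banner, in which the
version string appears in double quotes (for example "1.5.0_22" or "11.0.2").

```shell
# Hypothetical helper, not shipped with Fangorn.

# java_major_version
# Read the output of `java -version` on stdin and print the major version:
# 5 for the old "1.5.0_22" scheme, 11 for the newer "11.0.2" scheme.
java_major_version() {
    sed -n 's/.* version "\([^"]*\)".*/\1/p' | head -n 1 |
        awk -F. '{ major = ($1 == 1) ? $2 : $1; print major + 0 }'
}
```

Usage: java -version 2>&1 | java_major_version
The redirection is needed because java prints its version banner on standard
error. A result of 5 or more satisfies the requirement.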


**********************************
Stopping the application
**********************************

Press Ctrl+C in the terminal where the application is running.


**********************************
Accessing the application
**********************************

* From the same machine as the server: start a browser and enter
  http://localhost:9090 or http://localhost:9090/index in the address bar.

* From another machine on the network: use http://<machine-name>:9090 or
  http://<machine-name>:9090/index
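
The same addresses can be checked from a terminal. The sketch below is not
part of Fangorn: the function names are made up for illustration, and it
assumes that curl is installed and that the server listens on port 9090 as
described above.

```shell
# Hypothetical helpers, not shipped with Fangorn.

# fangorn_url [HOST]
# Print the index URL for HOST, defaulting to localhost.
fangorn_url() {
    printf 'http://%s:9090/index\n' "${1:-localhost}"
}

# fangorn_is_up [HOST]
# Succeed if the server answers with an HTTP success status within 5 seconds.
fangorn_is_up() {
    curl -s -f -o /dev/null --max-time 5 "$(fangorn_url "${1:-}")"
}
```

Usage: fangorn_is_up && echo "Fangorn is reachable"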


**********************************
Adding a Corpus to Fangorn
**********************************

1. Run the script create-index-GZ.sh if the corpus files are gzipped and have
the extension .gz, or create-index-MRG.sh if they are plain text files with the
extension .mrg.
Both scripts take 3 compulsory parameters and 1 optional parameter, which must
be specified in the order given below:
	(1) The complete path to the corpus directory.
	(2) The directory where the index is stored (must be different from the
	    ones already present).
	(3) The name of the corpus as it should be displayed in the UI.
	(4) [Optional] Total number of sentences to be indexed from the corpus. All
	    sentences in the directory are indexed if this parameter is not
	    specified.

2. Example usage:

sh create-index-MRG.sh /home/test/Corpora/Penn/WSJ pwsj Penn_Treebank_WSJ

Indexes all sentences in the directory /home/test/Corpora/Penn/WSJ, stores the
index in pwsj, and displays the corpus as "Penn Treebank WSJ" in the user
interface.

sh create-index-MRG.sh /home/test/Corpora/Penn/Br pbrn PT_Brown_100 100

Indexes the first 100 sentences in the directory /home/test/Corpora/Penn/Br,
stores the index in pbrn, and displays the corpus as "PT Brown 100" in the user
interface.

NOTE:
A. The create-index-GZ.sh script expects each corpus file to be gzip compressed
and to have the extension .gz.
B. A corpus added while the application is running becomes visible only after
the application is restarted.
C. If the index directory supplied is not unique, the script reports this and
terminates; previous data is not overwritten. To avoid this, check the current
listing of all corpora first (see the Listing all Corpora section below).
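
The choice between the two scripts can be sketched as a small wrapper. This is
an illustration only: pick_index_script and add_corpus are hypothetical names,
not part of Fangorn. Refusing a directory that mixes .gz and .mrg files is an
assumption of this sketch, since the instructions above do not cover that case.

```shell
# Hypothetical helpers, not shipped with Fangorn.

# pick_index_script CORPUS_DIR
# Print the indexing script that matches the corpus files below CORPUS_DIR.
# Fails if the directory holds no corpus files, or a mix of .gz and .mrg files
# (the mixed case is rejected here as an assumption).
pick_index_script() {
    gz=$(find "$1" -type f -name '*.gz' | head -n 1)
    mrg=$(find "$1" -type f -name '*.mrg' | head -n 1)
    if [ -n "$gz" ] && [ -z "$mrg" ]; then
        echo create-index-GZ.sh
    elif [ -n "$mrg" ] && [ -z "$gz" ]; then
        echo create-index-MRG.sh
    else
        return 1
    fi
}

# add_corpus CORPUS_DIR INDEX_DIR NAME [SENTENCE_COUNT]
# Run the matching script with the parameters in the documented order.
# Must be called from the install directory, where the scripts live.
add_corpus() {
    script=$(pick_index_script "$1") || {
        echo "cannot choose an indexing script for $1" >&2
        return 1
    }
    # ${4:+"$4"} passes the sentence count only when one was given.
    sh "$script" "$1" "$2" "$3" ${4:+"$4"}
}
```

Usage: add_corpus /home/test/Corpora/Penn/Br pbrn PT_Brown_100 100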


**********************************
Deleting a Corpus from Fangorn
**********************************

1. This should be performed when the application is not running.

2. Run the script delete-index.sh with the name of the directory where the index
is stored. This must be the same name that was given as parameter (2) in Step 1
of the Adding a Corpus to Fangorn instructions.

NOTE:
A. If unsure about the directory name, check the listing of all corpora (see the
Listing all Corpora section below).


**********************************
Listing all Corpora
**********************************

1. Run the script list-db.sh.

NOTE: This only lists the contents of the database.