ML-Data-Crawler

Open source web crawler implementation to ingest data into MarkLogic

================== ML Crawler ==================
This crawler wraps the functionality provided by an open-source crawler.
It takes crawling instructions in XML format and returns the
crawling results in XML format for MarkLogic ingestion.
		*** For internal use only ***

Steps:
1. Unzip the folder.
2. Folder Structure:
 	a) ML-Crawler.jar: the jar file that wraps the functionality provided by the open-source crawler.
 	b) sample:
 		sample configuration file
 	c) setup:
 	   	setup files; these should not be touched:
 	     i)  functions.xml
 	     ii) xquery.xml
 	     iii) crawler.xml
 	d) configurations:
 		This folder is the placeholder for the crawling instructions (e.g. sample-setup.xml).
 		The crawler requires the following items to be configured before it is run. To identify the values for these items, do a 'view-source' on the HTML page you want to crawl.
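 		For example, suppose 'view-source' on the target page shows markup like this hypothetical snippet (your page will differ):

 		<div class="result">
 		    <span class="twitter-post-text translatable language-en">Feeling better already</span>
 		    <span class="author-name">someuser</span>
 		</div>
 		<a href="/s?q=Nyquil&amp;page=2">next</a>

 		Here the itemXPath //span[@class="twitter-post-text translatable language-en"] selects each post's text, and the nextXPath //a[starts-with(., 'next')]/@href picks up the pagination link. The sample configuration below wires these patterns into the crawling instructions.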
 		
 		<?xml version="1.0" encoding="UTF-8"?>
 		<!-- ======== Defining the crawling premise =========== -->
		<config charset="ISO-8859-1">
		    <include path="../setup/functions.xml"/>
    	 	<!-- Crawling definition begins here -->
    		<var-def name="products">    
        		<call name="download-multipage-list">
            		<call-param name="pageUrl">http://topsy.com/s?q=Nyquil</call-param> <!-- Define the URL to be crawled  -->
            		<call-param name="nextXPath">//a[starts-with(., 'next')]/@href</call-param> <!-- Define the HTML pattern to determine the next page URL for paginated content  -->
            		<call-param name="itemXPath">//span[@class="twitter-post-text translatable language-en"]</call-param> <!-- Define the HTML pattern to capture desired content  -->
            		<call-param name="maxloops">10</call-param><!-- Looping variable to define the number of iterations crawler should perform  -->
        		</call>
    		</var-def>
 		<!-- ========== Building the result files =============== -->
 		<!-- iterates over all crawled items and extract desired data -->
    	<file action="write" path="nyquil.xml" charset="UTF-8"> <!-- Set the output filename  -->
       
       	<!-- build the basic doc structure -->
       	<![CDATA[ <reviews> ]]> 
        <loop item="item" index="i">
            <list><var name="products"/></list>
            <body>
                <xquery>
                    <xq-param name="item" type="node()"><var name="item"/></xq-param>
                    <xq-expression><![CDATA[
                            declare variable $item as node() external;
                            (: Specify the HTML pattern to capture the text :)
                            let $desc := data($item//*[@class='twitter-post-text translatable language-en'])
                            let $user := data($item//*[@class='author-name'])
                                return
                                    <review>
                                        <comment>{normalize-space($desc)}</comment>
                                        <reviewer>{normalize-space($user)}</reviewer>
                                    </review>
                    ]]></xq-expression>
                </xquery>
            </body>
        </loop>
        <![CDATA[ </reviews> ]]>
    	</file>
		</config>
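 		Note: the <![CDATA[ ... ]]> wrappers cause the literal text <reviews> and </reviews> to be written to the output file; without them, the parser would try to interpret <reviews> as a configuration element rather than as output text.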
 3. Once the configuration file is placed in the configurations folder, open a terminal/command prompt and run the following command:
    	java -jar ML-Crawler.jar
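    	A typical run might look like this (folder and file names are illustrative; adjust to your unzip location):

    	cd ML-Crawler
    	cp sample/sample-setup.xml configurations/
    	java -jar ML-Crawler.jar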
    
 4. After execution, the results are stored in the output file that you defined in the XML above (nyquil.xml in the sample).
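    With the sample configuration, the output file would contain one <review> element per crawled item, roughly like this sketch (contents illustrative):

    	<reviews>
    	    <review>
    	        <comment>Feeling better already, thanks Nyquil</comment>
    	        <reviewer>someuser</reviewer>
    	    </review>
    	    ...
    	</reviews>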