1. Installation of PHP7
-----------------------

We need a working installation of PHP7 (under development) with the
php-ast extension by nikic. This goes something like this (to install
into $HOME/php7):

  $ mkdir ~/php7
  $ cd ~/php7
  $ git clone https://git.php.net/repository/php-src.git
  $ cd php-src/ext
  $ git clone https://github.com/nikic/php-ast.git ast
  $ cd ..
  $ ./buildconf
  $ ./configure --prefix=$HOME/php7/usr --with-config-file-path=$HOME/php7/usr/etc --enable-ast
  $ make
  $ make install

Lastly, copy the file conf/php.ini bundled with php-joern into the
folder you specified in the --with-config-file-path command line option.


2. Using the parser
-------------------

The parser is implemented in PHP in the file src/Parser.php. It takes
as argument either a PHP file or a directory. If it is a directory, the
parser will search for all PHP files in the given directory and
generate an AST for each of them.

For convenience (i.e., command-line laziness), there is a bash script
called 'parser' in this project's root directory that will execute the
PHP interpreter on the file src/Parser.php and pass along any
arguments. The variable PHP7 needs to be set in the script to point to
the location of the php executable from PHP7.

Example usage:

  $ ./parser test-own/42.php
  $ ./parser test-repos/agavi

This creates two files nodes.csv and rels.csv for use with the
neo4j-import tool. It is also possible to generate these files for use
with the batch-import tool, which uses a slightly different CSV file
format, using the '-f jexp' switch:

  $ ./parser -f jexp test-own/42.php
  $ ./parser -f jexp test-repos/agavi

For more information on the parser, see

  $ ./parser --help

For more information on the neo4j-import and batch-import tools, see
section 5.


3. Obtaining test repositories
------------------------------

The script ./get_test_repos.sh will obtain various well-known and/or
GitHub-trending PHP projects via git, and put them in a newly created
directory test-repos/

Simply call it like so:

  $ ./get_test_repos.sh


4. Installing the Neo4J graph database
--------------------------------------

4a. Installing the Neo4J server
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We're going to import the CSV files created in section 2 into a Neo4J
graph database. We're currently working with Neo4J Community 2.2.3,
available from http://neo4j.com/download/other-releases/. Download and
unpack it somewhere:

  $ curl -O 'http://neo4j.com/artifact.php?name=neo4j-community-2.2.3-unix.tar.gz'
  $ tar xvfz artifact.php\?name=neo4j-community-2.2.3-unix.tar.gz

In the following, let $NEO4J_HOME be the directory where we unpacked
the Neo4J tarball.

TODO actually we'd *like* to work with Neo4J 2.2 (particularly because
of the availability of neo4j-import), however the Gremlin plugin does
not work with Neo4J 2.2 yet, and we have to use Neo4J 2.1 for now.

4b. Installing the Gremlin plugin
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Additionally, we need the Gremlin plugin for Neo4J (the Gremlin plugin
is discussed in more depth in section 7b). This plugin is available
here:

  https://github.com/neo4j-contrib/gremlin-plugin

The Gremlin plugin is no longer bundled with Neo4J by default as of
Neo4J 2.x, but can still be downloaded and added manually.

TODO Unfortunately, as of now, Neo4J 2.2 is not supported by the
Gremlin plugin. So we have to use Neo4J 2.1...

To install the Gremlin plugin for Neo4J 2.1, proceed as follows:

  $ git clone https://github.com/neo4j-contrib/gremlin-plugin
  $ cd gremlin-plugin
  $ mvn clean package
  $ unzip target/neo4j-gremlin-plugin-2.1-SNAPSHOT-server-plugin.zip -d $NEO4J_HOME/plugins/gremlin-plugin


5. Importing ASTs into a Neo4J graph database
---------------------------------------------

Once we have the files nodes.csv and rels.csv for some PHP project, we
want to import them into a Neo4J database. Two tools are available for
this purpose.

5a. Using neo4j-import
~~~~~~~~~~~~~~~~~~~~~~

Since version 2.2, Neo4J comes bundled with its own massively parallel
and scalable CSV importer. It is invoked like so:

  $ $NEO4J_HOME/bin/neo4j-import --into graph.db --nodes nodes.csv --relationships rels.csv

This creates a new directory graph.db/ populated with a new database to
be loaded by the Neo4J server. The files nodes.csv and rels.csv as
generated by the parser in section 2 conform to the format expected by
this tool.

For more information, see:

* http://neo4j.com/developer/guide-import-csv/#_super_fast_batch_importer_for_huge_datasets
* http://neo4j.com/docs/stable/import-tool.html

Note: You can configure higher Java heap sizes by choosing appropriate
values for wrapper.java.initmemory and wrapper.java.maxmemory in
$NEO4J_HOME/conf/neo4j-wrapper.conf

5b. Using batch-import (legacy support)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is also possible to use the batch-importer tool available at
https://github.com/jexp/batch-import/

This may be useful, e.g., if there is a problem with neo4j-import or if
an older version of Neo4J (prior to 2.2) is to be used for some reason.

To install it, use something like this:

  $ mkdir batch-import
  # or if you also want the sources:
  $ git clone https://github.com/jexp/batch-import.git
  $ cd batch-import
  $ curl -O https://dl.dropboxusercontent.com/u/14493611/batch_importer_22.zip
  $ unzip batch_importer_22.zip

Make sure that the version of batch-importer matches the version of the
Neo4J database (2.2 in the above example).

Note: for 2.1:

  $ curl -O https://dl.dropboxusercontent.com/u/14493611/batch_importer_21.zip

In the following, let $JEXP_HOME be the newly created directory
batch-import/.
Extracting the ZIP file creates a directory $JEXP_HOME/lib/ with the
necessary JAR files.

Next, use the following command to create a Neo4J database directory
graph.db/ from the two CSV files:

  $ HEAP=6G; java -classpath "$JEXP_HOME/lib/*" -Xmx$HEAP -Xms$HEAP -Dfile.encoding=UTF-8 org.neo4j.batchimport.Importer conf/batch.properties graph.db nodes.csv rels.csv

Note that the format of the nodes.csv and rels.csv files expected by
batch-import is slightly different from that expected by neo4j-import.
The parser will generate the format expected by batch-import if invoked
with the '-f jexp' flag (see section 2).

The heap size may be adapted as needed; the batch.properties file
should be configured accordingly. See

  http://joern.readthedocs.org/en/latest/performance.html#optimizing-code-importing

The file batch.properties provided with php-joern is for heap sizes of
6GB or more.


6. Starting the Neo4J database server
-------------------------------------

Once we have created a database directory graph.db/ as described in
section 5, we can point the Neo4J server to the location of that
directory in the configuration file
$NEO4J_HOME/conf/neo4j-server.properties by changing the variable
'org.neo4j.server.database.location' accordingly.

Then, we can start the server:

  $ $NEO4J_HOME/bin/neo4j console

The server is then accessible at http://localhost:7474/. It offers an
HTTP-based RESTful API which can be used to query the database.

The graph previously created by the parser is a weakly connected,
directed rooted tree. The graph's root node is a node of type either
'Directory' or 'File', depending on whether a whole folder of some PHP
project or a single file was parsed. In the former case, the root node
represents the root directory of the PHP project; in the latter, it
represents the parsed file. The parser always assigns node index 0 to
the root node, which can be accessed in the browser at
http://localhost:7474/db/data/node/0.


7. Querying the database
------------------------

Two languages are available for querying the database: Cypher and
Gremlin. The two are quite different. Cypher is a declarative language
wherein you specify *what* to find. Gremlin is an imperative language
that allows you to specify *how* to find something. See for instance
http://www.quora.com/Is-Neo4j-using-Gremlin-as-its-core for a short
discussion on the subject.

7a. Cypher
~~~~~~~~~~

Cypher queries can be issued via Neo4j Browser, a command-driven client
which works like a web-based shell environment. This is nice for
running ad-hoc graph queries.

Cypher uses SQL-like clauses. For instance, to search for all nodes of
type "File" (representing the individual PHP files in a previously
parsed folder of a PHP project), visit http://localhost:7474/ and issue

  MATCH (node) WHERE node.type = "File" RETURN node;

Relationships are of course essential for meaningful queries. For
instance, to find the AST root node of a file named 42.php, use

  MATCH (filenode)-[:FILE_OF]-(astroot) WHERE filenode.name = "42.php" RETURN astroot;

This pattern can be chained across several relationships. Say we want
to find all the AST nodes that correspond to the functions declared in
a file Parser.php; we would query

  MATCH (filenode)-[:FILE_OF]-()-[:PARENT_OF]-(astnode) WHERE filenode.name = "Parser.php" AND astnode.type = "AST_FUNC_DECL" RETURN astnode;

More information on Cypher can be found here:

  http://neo4j.com/docs/2.2.3/cypher-query-lang.html

7b. Gremlin
~~~~~~~~~~~

Gremlin is a general-purpose graph traversal language and our preferred
choice, as it offers more fine-grained control over the exact traversal
pattern to use (whereas the Cypher engine tries to find the best
pattern itself).

It is possible to extend the Neo4J RESTful API with support for Gremlin
queries via the Gremlin plugin. See section 4b for instructions on how
to install the plugin.
To verify that the Gremlin REST endpoint is available, issue the
following command:

  $ curl localhost:7474/db/data/
  {
    "extensions" : {
      "GremlinPlugin" : {
        "execute_script" : "http://localhost:7474/db/data/ext/GremlinPlugin/graphdb/execute_script"
      }
    },
    ...
  }

Once a Neo4J server with the Gremlin plugin is set up (section 4), a
graph database is imported into Neo4J (section 5), and the server is
started (section 6), we can issue queries by sending appropriate POST
requests to the Gremlin REST endpoint, e.g.,

  $ curl -v --data-urlencode 'script="Hello World!"' http://localhost:7474/db/data/ext/GremlinPlugin/graphdb/execute_script

Taking up the examples from the previous section, we can find all nodes
with type File using the query

  $ curl -v --data-urlencode 'script=g.V("type","File").map()' http://localhost:7474/db/data/ext/GremlinPlugin/graphdb/execute_script

To find the AST root node of a file named 42.php, use

  $ curl -v --data-urlencode 'script=g.V("type","File").has("name","42.php").out("FILE_OF").map()' http://localhost:7474/db/data/ext/GremlinPlugin/graphdb/execute_script

Following paths is particularly concise in Gremlin. Finding all the AST
nodes that correspond to the functions declared in a file Parser.php is
as simple as

  $ curl -v --data-urlencode 'script=g.V("type","File").has("name", "Parser.php").out("FILE_OF").out("PARENT_OF").has("type","AST_FUNC_DECL").map()' http://localhost:7474/db/data/ext/GremlinPlugin/graphdb/execute_script

For more information on Gremlin, see

* http://gremlin.tinkerpop.com
* http://gremlindocs.com
* http://tinkerpop.incubator.apache.org/docs/
* http://sql2gremlin.com


8. Scripting queries
--------------------

Using curl as above quickly gets unwieldy for larger queries. It is
more convenient to script such queries from within a scripting language
that provides methods to perform Cypher or Gremlin requests to the
Neo4J server's REST API as above.
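
To illustrate what such scripting involves, the curl requests from
section 7b can be reproduced with Python's standard library alone. The
following is only a minimal sketch, assuming the server runs on
localhost:7474 without authentication and the Gremlin plugin is
installed. The names build_gremlin_request, run_gremlin and
functions_declared_in are made up for this example and are not part of
php-joern or python-joern.

```python
import urllib.parse
import urllib.request

# Endpoint URL as reported by `curl localhost:7474/db/data/` in section 7b.
GREMLIN_ENDPOINT = (
    "http://localhost:7474/db/data/ext/GremlinPlugin/graphdb/execute_script"
)


def build_gremlin_request(script, endpoint=GREMLIN_ENDPOINT):
    """Build the same POST request as `curl --data-urlencode 'script=...'`."""
    body = urllib.parse.urlencode({"script": script}).encode("ascii")
    # Supplying a body makes urllib send a form-encoded POST request.
    return urllib.request.Request(endpoint, data=body)


def run_gremlin(script, endpoint=GREMLIN_ENDPOINT):
    """Send a Gremlin script to the server and return the raw reply text."""
    request = build_gremlin_request(script, endpoint)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def functions_declared_in(filename):
    """Gremlin script from section 7b, parameterised by file name.

    Hypothetical helper: the file name is inserted verbatim, so it must
    not contain double quotes.
    """
    return (
        'g.V("type","File").has("name","%s")'
        '.out("FILE_OF").out("PARENT_OF")'
        '.has("type","AST_FUNC_DECL").map()' % filename
    )
```

With the server from section 6 running, calling
run_gremlin(functions_declared_in("Parser.php")) should send the same
query as the last curl command in section 7b.
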

For this purpose, we use the tool python-joern:

  http://joern.readthedocs.org/en/latest/access.html

We're currently using a port of python-joern for PHPJoern. Get it like
so:

  $ git clone ssh://git@service.cispa.uni-saarland.de:2222/python-joern.git

Now switch to the branch portPHPJoern:

  $ cd python-joern
  $ git checkout portPHPJoern

TODO:

* work on the PHPJoern port for python-joern, and continue here