phpjoern_mirror: A PHP repository from ngocdang499

1. Installation of PHP7
-----------------------

We need a working installation of PHP7 (under development) with the
php-ast extension by nikic.

The steps look something like this (installing into $HOME/php7):

$ mkdir ~/php7
$ cd ~/php7
$ git clone https://git.php.net/repository/php-src.git
$ cd php-src/ext
$ git clone https://github.com/nikic/php-ast.git ast
$ cd ..
$ ./buildconf
$ ./configure --prefix=$HOME/php7/usr --with-config-file-path=$HOME/php7/usr/etc --enable-ast
$ make
$ make install

Lastly, copy the file conf/php.ini bundled with php-joern into the
folder you specified in the --with-config-file-path command line
option.
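
To confirm that the freshly built interpreter actually loads the ast
extension, a quick check along the following lines may help. This is
only a sketch: the helper name has_ast is made up for illustration,
and the default path assumes the --prefix used above.

```shell
# Assumed location of the php binary, derived from --prefix=$HOME/php7/usr.
PHP7="${PHP7:-$HOME/php7/usr/bin/php}"

# has_ast PHP_BINARY  (hypothetical helper, not part of php-joern)
# Succeeds if `PHP_BINARY -m` lists a module named exactly "ast".
has_ast() {
    "$1" -m 2>/dev/null | grep -qx 'ast'
}

# Usage:
#   has_ast "$PHP7" && echo "php-ast is loaded"
```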


2. Using the parser
-------------------

The parser is implemented in PHP in the file src/Parser.php. It takes
as argument either a PHP file or a directory. If it is a directory,
the parser will search for all PHP files in the given directory and
generate an AST for each of them.

For convenience (i.e., command-line laziness), there is a bash script
called 'parser' in this project's root directory that will execute the
PHP interpreter on the file src/Parser.php and pass along any
arguments. The variable PHP7 needs to be set in the script to point to
the location of the php executable from PHP7.
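
As a rough illustration of what such a wrapper does, the following
sketch forwards all arguments to src/Parser.php. It is not the bundled
script itself: the function name run_parser and the default value of
PHP7 are assumptions made here.

```shell
# Assumed location of the PHP7 binary; adjust to your installation.
PHP7="${PHP7:-$HOME/php7/usr/bin/php}"

# run_parser ARGS...  (hypothetical stand-in for the bundled ./parser script)
# Runs src/Parser.php with the PHP7 interpreter, forwarding all arguments.
run_parser() {
    "$PHP7" src/Parser.php "$@"
}

# Usage, from the project's root directory:
#   run_parser -f jexp test-own/42.php
```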

Example usage:

$ ./parser test-own/42.php
$ ./parser test-repos/agavi

This creates two files nodes.csv and rels.csv for use with the
neo4j-import tool. It is also possible to generate these files for use
with the batch-import tool, which uses a slightly different CSV file
format, using the '-f jexp' switch:

$ ./parser -f jexp test-own/42.php
$ ./parser -f jexp test-repos/agavi

For more information on the parser, see

$ ./parser --help

For more information on the neo4j-import and batch-import tools, see
section 5.


3. Obtaining test repositories
------------------------------

The script ./get_test_repos.sh will obtain various well-known and/or
GitHub-trending PHP projects via git, and put them in a newly created
directory test-repos/.

Simply call it like so:

$ ./get_test_repos.sh


4. Installing the Neo4J graph database
--------------------------------------

4a. Installing the Neo4J server
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We're going to import the CSV files created in section 2 into a Neo4J
graph database. We're currently working with Neo4J Community 2.2.3,
available from http://neo4j.com/download/other-releases/. Download and
unpack it somewhere:

$ curl -O 'http://neo4j.com/artifact.php?name=neo4j-community-2.2.3-unix.tar.gz'
$ tar xvfz artifact.php\?name=neo4j-community-2.2.3-unix.tar.gz

In the following, let $NEO4J_HOME be the directory where we unpacked
the Neo4J tarball.

TODO actually we'd *like* to work with Neo4J 2.2 (particularly because
of the availability of neo4j-import), however the Gremlin plugin does
not work with Neo4J 2.2 yet, and we have to use Neo4J 2.1 for now.

4b. Installing the Gremlin plugin
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Additionally, we shall need the Gremlin plugin for Neo4J (the Gremlin
plugin is discussed in more depth in section 7b). This plugin is
available here: https://github.com/neo4j-contrib/gremlin-plugin

The Gremlin plugin is no longer bundled with Neo4J by default as of
Neo4J 2.x, but can still be downloaded and added manually.

TODO Unfortunately, as of now, Neo4J 2.2 is not supported by the
Gremlin plugin. So we have to use Neo4J 2.1...

To install the Gremlin plugin for Neo4J 2.1, proceed as follows:

$ git clone https://github.com/neo4j-contrib/gremlin-plugin
$ cd gremlin-plugin
$ mvn clean package
$ unzip target/neo4j-gremlin-plugin-2.1-SNAPSHOT-server-plugin.zip -d $NEO4J_HOME/plugins/gremlin-plugin


5. Importing ASTs into a Neo4J graph database
---------------------------------------------

Once we have the files nodes.csv and rels.csv for some PHP project, we
want to import them into a Neo4J database. Two tools are available for
this purpose.


5a. Using neo4j-import
~~~~~~~~~~~~~~~~~~~~~~

Since version 2.2, Neo4J comes bundled with its own massively parallel
and scalable CSV importer. It is invoked like so:

$ $NEO4J_HOME/bin/neo4j-import --into graph.db --nodes nodes.csv --relationships rels.csv

This creates a new directory graph.db/ populated with a new database
to be loaded by the Neo4J server. The files nodes.csv and rels.csv as
generated by the parser in section 2 conform to the format expected by
this tool.
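
Since the import is started from the directory that holds the two CSV
files, it can be handy to check for them first. The helper below is a
minimal sketch; the name import_csv is made up here, and NEO4J_HOME is
assumed to be set as described in section 4a.

```shell
# import_csv  (hypothetical helper; assumes NEO4J_HOME is set)
# Refuses to start the import unless both CSV files exist in the
# current directory and are non-empty.
import_csv() {
    for csv in nodes.csv rels.csv; do
        if [ ! -s "$csv" ]; then
            echo "missing or empty: $csv" >&2
            return 1
        fi
    done
    "$NEO4J_HOME/bin/neo4j-import" --into graph.db \
        --nodes nodes.csv --relationships rels.csv
}
```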

For more information, see:
* http://neo4j.com/developer/guide-import-csv/#_super_fast_batch_importer_for_huge_datasets
* http://neo4j.com/docs/stable/import-tool.html

Note: You can configure higher Java heap sizes by choosing appropriate
values for wrapper.java.initmemory and wrapper.java.maxmemory in
$NEO4J_HOME/conf/neo4j-wrapper.conf
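
The two settings can of course be edited by hand. For scripted setups,
something like the following sketch could be used instead. The helper
name set_neo4j_heap is hypothetical, and the values are assumed to be
given in megabytes, as in the stock Neo4J 2.x configuration file.

```shell
# set_neo4j_heap CONF_FILE INIT_MB MAX_MB  (hypothetical helper)
# Replaces any existing (possibly commented-out) heap settings in
# CONF_FILE with the given values.
set_neo4j_heap() {
    heap_conf=$1
    [ -f "$heap_conf" ] || return 1
    # Keep every line except old initmemory/maxmemory settings.
    grep -v -E '^#? *wrapper\.java\.(initmemory|maxmemory)=' \
        "$heap_conf" > "$heap_conf.tmp" || true
    printf 'wrapper.java.initmemory=%s\nwrapper.java.maxmemory=%s\n' \
        "$2" "$3" >> "$heap_conf.tmp"
    mv "$heap_conf.tmp" "$heap_conf"
}

# Usage:
#   set_neo4j_heap "$NEO4J_HOME/conf/neo4j-wrapper.conf" 2048 4096
```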


5b. Using batch-import (legacy support)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is also possible to use the batch-importer tool available at
https://github.com/jexp/batch-import/

This may be useful, e.g., if there is a problem with neo4j-import or
if an older version of Neo4J (prior to 2.2) is to be used for some
reason.

To install it, use something like this:

$ mkdir batch-import # or if you also want the sources: $ git clone https://github.com/jexp/batch-import.git
$ cd batch-import
$ curl -O https://dl.dropboxusercontent.com/u/14493611/batch_importer_22.zip 
$ unzip batch_importer_22.zip

Make sure that the version of batch-importer matches the version of
the Neo4J database (2.2 in the above example).

Note: for 2.1:
$ curl -O https://dl.dropboxusercontent.com/u/14493611/batch_importer_21.zip

In the following, let $JEXP_HOME be the newly created directory batch-import/.

Extracting the ZIP file creates a directory $JEXP_HOME/lib/ with the necessary JAR files.

Next, use the following command to create a Neo4J database directory
graph.db/ from the two CSV files:

$ HEAP=6G; java -classpath "$JEXP_HOME/lib/*" -Xmx$HEAP -Xms$HEAP -Dfile.encoding=UTF-8 org.neo4j.batchimport.Importer conf/batch.properties graph.db nodes.csv rels.csv

Note that the format of the nodes.csv and rels.csv files expected by
batch-import is slightly different from that expected by
neo4j-import. The parser will generate the format expected by
batch-import if invoked with the '-f jexp' flag (see section 2).

The heap size may be adapted as needed; the batch.properties file
should be configured accordingly. See
http://joern.readthedocs.org/en/latest/performance.html#optimizing-code-importing

The file batch.properties provided with php-joern is for heap sizes of 6GB or more.


6. Starting the Neo4J database server
-------------------------------------

Once we have created a database directory graph.db/ as described in section
5, we can point the Neo4J server to the location of that directory in
the configuration file $NEO4J_HOME/conf/neo4j-server.properties by
changing the variable 'org.neo4j.server.database.location'
accordingly.
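
Before starting the server, it may be worth double-checking which
directory the configuration actually points to. A minimal sketch (the
helper name db_location is made up here):

```shell
# db_location PROPERTIES_FILE  (hypothetical helper)
# Prints the value of org.neo4j.server.database.location, if it is set
# on an uncommented line.
db_location() {
    sed -n 's/^org\.neo4j\.server\.database\.location=//p' "$1"
}

# Usage:
#   db_location "$NEO4J_HOME/conf/neo4j-server.properties"
```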

Then, we can start the server:

$ $NEO4J_HOME/bin/neo4j console

The server is then accessible at http://localhost:7474/. It offers an
HTTP based RESTful API which can be used to query the database.

The graph previously created by the parser is a weakly connected,
directed rooted tree.

The graph's root node is a node of type either 'Directory' or 'File',
depending on whether a whole folder of some PHP project or a single
file was parsed. In the former case, the root node represents the root
directory of the PHP project; in the latter, it represents the parsed
file. The parser always assigns node index 0 to the root node, which
can be accessed in the browser at
http://localhost:7474/db/data/node/0.


7. Querying the database
------------------------

Two languages are available for querying the database: Cypher and
Gremlin. The two are quite different. Cypher is a declarative language
wherein you specify *what* to find. Gremlin is an imperative language
that allows you to specify *how* to find something. See for instance
http://www.quora.com/Is-Neo4j-using-Gremlin-as-its-core for a short
discussion on the subject.


7a. Cypher
~~~~~~~~~~

Cypher queries can be issued via Neo4j Browser, a command driven
client which works like a web-based shell environment. This is nice
for running some ad-hoc graph queries. Cypher uses SQL-like
clauses. For instance, to search for all nodes of type "File"
(representing the individual PHP files in a previously parsed folder
of a PHP project), visit http://localhost:7474/ and issue

MATCH (node)
WHERE node.type = "File"
RETURN node;

Relationships are of course essential for meaningful queries. For
instance, to find the AST root node of a file named 42.php, use

MATCH (filenode)-[:FILE_OF]-(astroot)
WHERE filenode.name = "42.php"
RETURN astroot;

This pattern can be used transitively. Say we want to find all the AST
nodes that correspond to the functions declared in a file Parser.php,
we would query

MATCH (filenode)-[:FILE_OF]-()-[:PARENT_OF]-(astnode)
WHERE filenode.name = "Parser.php"
  AND astnode.type = "AST_FUNC_DECL"
RETURN astnode;
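
Cypher queries can also be sent from the command line through the
transactional endpoint of the Neo4J 2.x REST API. The following is a
sketch under some assumptions: the helper name cypher is made up, the
query must fit on a single line, and a server with authentication
enabled may additionally require curl's -u option.

```shell
# Transactional Cypher endpoint of the Neo4J 2.x REST API.
CYPHER_URL="${CYPHER_URL:-http://localhost:7474/db/data/transaction/commit}"

# cypher QUERY  (hypothetical helper; QUERY must be a single line)
# Escapes backslashes and double quotes, wraps the query in the JSON
# payload expected by the endpoint, and POSTs it with curl.
cypher() {
    cypher_stmt=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')
    curl -s -H 'Content-Type: application/json' -H 'Accept: application/json' \
        -d "{\"statements\":[{\"statement\":\"$cypher_stmt\"}]}" "$CYPHER_URL"
}

# Usage:
#   cypher 'MATCH (node) WHERE node.type = "File" RETURN node'
```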

More information on Cypher can be found here:
http://neo4j.com/docs/2.2.3/cypher-query-lang.html


7b. Gremlin
~~~~~~~~~~~

Gremlin is a general-purpose graph traversal language and our
preferred choice, as it offers a more fine-grained control of the
exact traversal pattern to use (whereas the Cypher engine tries to
find the best pattern itself). It is possible to extend the Neo4J
RESTful API with support for Gremlin queries via the Gremlin
plugin.

See section 4b for instructions on how to install the plugin.

To verify that the Gremlin REST endpoint is available, issue the
following command:

$ curl localhost:7474/db/data/
{
  "extensions" : {
    "GremlinPlugin" : {
      "execute_script" : "http://localhost:7474/db/data/ext/GremlinPlugin/graphdb/execute_script"
    }
  },
...
}
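
In a script, the same check can be reduced to an exit status. The
sketch below uses a plain substring match rather than a JSON parser;
the helper name has_gremlin_plugin is hypothetical.

```shell
# has_gremlin_plugin  (hypothetical helper)
# Reads the service root JSON on stdin and succeeds if it mentions the
# GremlinPlugin extension.
has_gremlin_plugin() {
    grep -q '"GremlinPlugin"'
}

# Usage:
#   curl -s localhost:7474/db/data/ | has_gremlin_plugin && echo "plugin found"
```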

Once a Neo4J server with the Gremlin plugin is set up (section 4), a
graph database has been imported into Neo4J (section 5), and the
server has been started (section 6), we can issue queries by sending
appropriate POST requests to the Gremlin REST endpoint, e.g.,

$ curl -v --data-urlencode 'script="Hello World!"' http://localhost:7474/db/data/ext/GremlinPlugin/graphdb/execute_script

Taking up the examples from section 7a, we can find all nodes with
type File using the query

$ curl -v --data-urlencode 'script=g.V("type","File").map()' http://localhost:7474/db/data/ext/GremlinPlugin/graphdb/execute_script

To find the AST root node of a file named 42.php, use

$ curl -v --data-urlencode 'script=g.V("type","File").has("name","42.php").out("FILE_OF").map()' http://localhost:7474/db/data/ext/GremlinPlugin/graphdb/execute_script

Following paths is particularly beautiful in Gremlin. Finding all the
AST nodes that correspond to the functions declared in a file
Parser.php is as simple as

$ curl -v --data-urlencode 'script=g.V("type","File").has("name", "Parser.php").out("FILE_OF").out("PARENT_OF").has("type","AST_FUNC_DECL").map()' http://localhost:7474/db/data/ext/GremlinPlugin/graphdb/execute_script
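
Since the endpoint URL is the same in every call, the curl invocations
above can be wrapped in small shell functions. This is a sketch only:
the names gremlin and gremlin_file are made up here. The second helper
relies on curl's name@file form of --data-urlencode to read the script
from a file.

```shell
# Gremlin REST endpoint as reported by the server (see above).
GREMLIN_URL="${GREMLIN_URL:-http://localhost:7474/db/data/ext/GremlinPlugin/graphdb/execute_script}"

# gremlin SCRIPT  (hypothetical helper): send an inline Gremlin script.
gremlin() {
    curl -s --data-urlencode "script=$1" "$GREMLIN_URL"
}

# gremlin_file PATH  (hypothetical helper): send a script stored in a file.
gremlin_file() {
    curl -s --data-urlencode "script@$1" "$GREMLIN_URL"
}

# Usage:
#   gremlin 'g.V("type","File").map()'
#   gremlin_file queries/functions.groovy   # file name is only an example
```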

For more information on Gremlin, see
* http://gremlin.tinkerpop.com
* http://gremlindocs.com
* http://tinkerpop.incubator.apache.org/docs/
* http://sql2gremlin.com


8. Scripting queries
--------------------

Using curl as above quickly gets unwieldy for larger queries. It is
more convenient to be able to script such queries from within a
scripting language that provides methods to perform Cypher or Gremlin
requests to the Neo4J server's REST API as above. For this purpose, we
use the tool python-joern:
http://joern.readthedocs.org/en/latest/access.html

We're currently using a port of python-joern for PHPJoern. Get it like so:

$ git clone ssh://git@service.cispa.uni-saarland.de:2222/python-joern.git

Now switch to the branch portPHPJoern:

$ cd python-joern
$ git checkout portPHPJoern

TODO:
* work on the PHPJoern port for python-joern, and continue here