BioJava-Spark

Algorithms built around BioJava that run on Apache Spark.

Starting up

Some initial instructions can be found in the mmtf-spark project.

First, download and untar a Hadoop sequence file of the PDB (~7 GB download):

wget http://mmtf.rcsb.org/v0.2/hadoopfiles/full.tar
tar -xvf full.tar

Alternatively, download a reduced version with only C-alpha, phosphate, and ligand atoms (~800 MB download):

wget http://mmtf.rcsb.org/v0.2/hadoopfiles/reduced.tar
tar -xvf reduced.tar

Second, add the biojava-spark dependency to your pom.xml:

		<dependency>
			<groupId>org.biojava</groupId>
			<artifactId>biojava-spark</artifactId>
			<version>0.1.1</version>
		</dependency>

Extra BioJava examples

Do some simple quality filtering

	float maxResolution = 3.0f;
	float maxRfree = 0.3f;
	StructureDataRDD structureData = new StructureDataRDD("/path/to/file")
				.filterResolution(maxResolution)
				.filterRfree(maxRfree);
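
To make the filtering idea concrete without a Spark cluster, here is a minimal plain-Java sketch. It does not use biojava-spark at all: the `Entry` class, its values, and the assumption that entries are kept when both values are strictly below the thresholds are illustrative only, and the library's exact comparison may differ.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class QualityFilterSketch {

    // Hypothetical stand-in for one PDB entry; not a biojava-spark type.
    static final class Entry {
        final String id;
        final float resolution; // in Angstroms, lower is better
        final float rFree;      // lower is better

        Entry(String id, float resolution, float rFree) {
            this.id = id;
            this.resolution = resolution;
            this.rFree = rFree;
        }
    }

    // Keep the ids of entries whose resolution and R-free are both
    // below the given thresholds (assumed semantics).
    static List<String> filter(List<Entry> entries, float maxResolution, float maxRfree) {
        return entries.stream()
                .filter(e -> e.resolution < maxResolution)
                .filter(e -> e.rFree < maxRfree)
                .map(e -> e.id)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
                new Entry("A", 1.8f, 0.22f),  // passes both filters
                new Entry("B", 3.5f, 0.25f),  // resolution above threshold
                new Entry("C", 2.0f, 0.35f)); // R-free above threshold
        System.out.println(filter(entries, 3.0f, 0.3f));
    }
}
```

Chaining the two filters mirrors the chained calls in the example above: each step narrows the data set before the next one runs.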

Summarising the elements in the PDB

	Map<String, Long> elementCountMap = BiojavaSparkUtils.findAtoms(structureData).countByElement();
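
The shape of the result, a map from element symbol to occurrence count, can be illustrated with plain Java collections. This is only a sketch of the counting idea on a small in-memory list. The class name and input data are made up, and it says nothing about how biojava-spark computes the counts on a cluster.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class ElementCountSketch {

    // Count how often each element symbol occurs.
    // A TreeMap keeps the keys in sorted order for stable printing.
    static Map<String, Long> countByElement(List<String> elements) {
        return elements.stream()
                .collect(Collectors.groupingBy(e -> e, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Made-up element symbols, one per atom.
        List<String> elements = Arrays.asList("C", "N", "C", "O", "C", "S");
        System.out.println(countByElement(elements));
    }
}
```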

Finding inter-atomic contacts from the PDB

	// cutoff is the maximum inter-atomic distance that counts as a contact;
	// it must be defined by the caller before this call.
	Double mean = BiojavaSparkUtils.findContacts(structureData,
			new AtomSelectObject()
					.groupNameList(new String[] {"PRO","LYS"})
					.elementNameList(new String[] {"C"})
					.atomNameList(new String[] {"CA"}),
			cutoff)
			.getDistanceDistOfAtomInts("CA", "CA")
			.mean();
	System.out.println("\nMean PRO-LYS CA-CA distance: " + mean);
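
As a rough picture of what a mean contact distance means, the following plain-Java sketch averages the distances of all atom pairs that lie closer than a cutoff. It is a brute-force illustration under assumed semantics (strictly-less-than cutoff, one atom from each list), with made-up coordinates and names. It is not the biojava-spark implementation, which may select and pair atoms differently.

```java
import java.util.Arrays;
import java.util.List;

public class ContactDistanceSketch {

    // Euclidean distance between two points given as {x, y, z}.
    static double distance(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        double dz = a[2] - b[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    // Mean distance over all pairs (one atom from each list) that are
    // closer than the cutoff. Returns NaN when no pair qualifies.
    static double meanContactDistance(List<double[]> first, List<double[]> second, double cutoff) {
        double sum = 0.0;
        int count = 0;
        for (double[] a : first) {
            for (double[] b : second) {
                double d = distance(a, b);
                if (d < cutoff) {
                    sum += d;
                    count++;
                }
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }

    public static void main(String[] args) {
        // Made-up CA coordinates for one group of each type.
        List<double[]> pro = Arrays.asList(new double[] {0, 0, 0});
        List<double[]> lys = Arrays.asList(
                new double[] {3, 4, 0},   // distance 5, within cutoff
                new double[] {0, 0, 4},   // distance 4, within cutoff
                new double[] {10, 0, 0}); // distance 10, outside cutoff
        System.out.println(meanContactDistance(pro, lys, 6.0));
    }
}
```

The nested loop is quadratic in the number of atoms, which is fine for a toy example but is exactly the kind of work that benefits from being distributed across structures.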
