ONSdigital/address-index-data

Issues running the address-index-data project -


Steps followed:
Cloned master branch -> git clone --branch master https://github.com/ONSdigital/address-index-data
Ran 'sbt clean assembly' in the address-index-data folder (the assembly succeeded, but with WARNs while merging)

Created an application.conf with the following:
addressindex.elasticsearch.nodes="IP of my Elasticsearch"
addressindex.elasticsearch.pass="password"
addressindex.elasticsearch.user="elastic"
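
For reference, these keys are read with Typesafe Config (the "com.typesafe" % "config" dependency); a minimal sketch of the lookup, assuming only the key names above:

import com.typesafe.config.ConfigFactory

// -Dconfig.file=application.conf points Typesafe Config at this file
val config  = ConfigFactory.load()
val esNodes = config.getString("addressindex.elasticsearch.nodes")
val esUser  = config.getString("addressindex.elasticsearch.user")
val esPass  = config.getString("addressindex.elasticsearch.pass")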

In batch/build.sbt,
I set val localTarget: Boolean = true, since I needed a single jar, and changed the Elasticsearch dependency to
"org.elasticsearch" %% "elasticsearch-spark-20" % "8.7.1"
to match my Elasticsearch version.

Running the following throws exceptions:
java -Dconfig.file=application.conf -jar batch/target/scala-2.11/ons-ai-batch-assembly-0.0.1.jar

I am pasting parts of the dump below:
23/06/06 11:51:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/06/06 11:51:24 WARN Persistence: Error creating validator of type org.datanucleus.properties.CorePropertyValidator
ClassLoaderResolver for class "" gave error on creation : {1}
org.datanucleus.exceptions.NucleusUserException: ClassLoaderResolver for class "" gave error on creation : {1}
at org.datanucleus.NucleusContext.getClassLoaderResolver(NucleusContext.java:1087)
at org.datanucleus.PersistenceConfiguration.validatePropertyValue(PersistenceConfiguration.java:797)
at org.datanucleus.PersistenceConfiguration.setProperty(PersistenceConfiguration.java:714)
at org.datanucleus.PersistenceConfiguration.setPersistenceProperties(PersistenceConfiguration.java:693)
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:273)
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:247)
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:225)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.<init>(JDOPersistenceManagerFactory.java:416)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:301)

Caused by: java.lang.NullPointerException
at org.datanucleus.NucleusContext.getClassLoaderResolver(NucleusContext.java:1079)
... 135 more
Nested Throwables StackTrace:
java.lang.NullPointerException
at org.datanucleus.NucleusContext.getClassLoaderResolver(NucleusContext.java:1079)
at org.datanucleus.PersistenceConfiguration.validatePropertyValue(PersistenceConfiguration.java:797)
at org.datanucleus.PersistenceConfiguration.setProperty(PersistenceConfiguration.java:714)
at org.datanucleus.PersistenceConfiguration.setPersistenceProperties(PersistenceConfiguration.java:693)
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:273)
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:247)
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:225)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.<init>(JDOPersistenceManagerFactory.java:416)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:301)
.........
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
at java.security.AccessController.doPrivileged(Native Method)
at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
.......
23/06/06 11:51:24 WARN HiveMetaStore: Retrying creating default database after error: Unexpected exception caught.
javax.jdo.JDOFatalInternalException: Unexpected exception caught.
at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1193)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
......
23/06/06 11:51:24 WARN Hive: Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:166)
.......
Caused by: org.datanucleus.exceptions.NucleusUserException: Persistence process has been specified to use a ClassLoaderResolver of name "datanucleus" yet this has not been found by the DataNucleus plugin mechanism. Please check your CLASSPATH and plugin specification.
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:283)
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:247)
at org.datanucleus.NucleusContext.<init>(NucleusContext.java:225)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.<init>(JDOPersistenceManagerFactory.java:416)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:301)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
... 123 more

  1. I tried switching from Java 8 to 11 and back; at least it builds with Java 8.
  2. I tried using "bintray-spark-packages" at "https://repos.spark-packages.org" instead of "bintray-spark-packages" at "https://dl.bintray.com/spark-packages/maven/".
  3. I tried to add "org.apache.hive" %% "hive-common" % "2.3.3" as a dependency, as some blog posts suggested, but didn't know which repository to find it in (see the sketch after this list).
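
For what it's worth, hive-common is a plain Java artifact on Maven Central (sbt's default resolver), so it would be declared with a single % rather than %%; a sketch, not verified against this build:

// hive-common carries no Scala-version suffix, so use % (not %%);
// it resolves from Maven Central without adding any extra resolver
libraryDependencies += "org.apache.hive" % "hive-common" % "2.3.3"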

Scala and Java versions are as follows:
Scala code runner version 2.12.4 -- Copyright 2002-2017, LAMP/EPFL and Lightbend, Inc.
openjdk version "1.8.0_362"

The 'develop' branch seems to track later versions of all the software, and using it is the recommendation in #197. Is the 'develop' branch the correct and stable one? It appears to be way ahead of master.

The project ran successfully with

build.sbt:
lazy val commonSettings = Seq(
  version := "0.0.1",
  organization := "uk.gov.ons",
  scalaVersion := "2.12.14",
  assembly / test := {}
)

lazy val buildSettings = Seq(
  assembly / mainClass := Some("uk.gov.ons.addressindex.Main"),
  name := "ons-ai-batch",
  assembly / assemblyMergeStrategy := {
    // Typesafe Config reference files must be concatenated, not discarded
    case "reference.conf" => MergeStrategy.concat
    // keep every Hadoop FileSystem service registration
    case PathList("META-INF", "services", "org.apache.hadoop.fs.FileSystem") => MergeStrategy.filterDistinctLines
    case PathList("META-INF", _*) => MergeStrategy.discard
    case _ => MergeStrategy.first
  }
)

lazy val addressIndexBatch = project.in(file("batch")).settings(commonSettings ++ buildSettings: _*)

batch/build.sbt:
val localDeps = Seq(
  "org.apache.spark" %% "spark-core" % "3.2.2",
  "org.apache.spark" %% "spark-sql" % "3.2.2",
  "org.apache.spark" %% "spark-hive" % "3.2.2"
)

val clouderaDeps = Seq(
  "org.apache.spark" %% "spark-core" % "3.2.2" % "provided",
  "org.apache.spark" %% "spark-sql" % "3.2.2" % "provided",
  "org.apache.spark" %% "spark-hive" % "3.2.2" % "provided",
  "org.apache.httpcomponents" % "httpclient" % "4.5.13"
)

val otherDeps = Seq(
  "com.typesafe" % "config" % "1.4.1",
  "org.elasticsearch" %% "elasticsearch-spark-30" % "8.7.1" excludeAll ExclusionRule(organization = "javax.servlet"),
  "org.rogach" %% "scallop" % "4.0.3",
  "org.scalaj" %% "scalaj-http" % "2.4.2",
  "com.crealytics" %% "spark-excel" % "3.3.1_0.18.7",
  "org.scalatest" %% "scalatest" % "3.2.9" % Test
)
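
With these settings the assembly is built against Scala 2.12, so the jar lands under scala-2.12 rather than scala-2.11, and the build and run steps from above presumably become:

sbt clean assembly
java -Dconfig.file=application.conf -jar batch/target/scala-2.12/ons-ai-batch-assembly-0.0.1.jar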

The indexes were loaded, but there is a minor issue: it seems the CSV headers need to match the camelCase schema names (e.g. primaryUprn rather than PRIMARY_UPRN). Here are some warnings:

CSV file: file:///home/dt224374/address-index-data/batch/src/test/resources/csv/lpi/ABP_E811a_v111017.csv
23/06/07 08:49:38 WARN CSVHeaderChecker: CSV header does not conform to the schema.
Header: UPRN, PRIMARY_UPRN, THIS_LAYER, PARENT_UPRN
Schema: uprn, primaryUprn, thisLayer, parentUprn
Expected: primaryUprn but found: PRIMARY_UPRN
CSV file: file:///home/dt224374/address-index-data/batch/src/test/resources/csv/hierarchy/ABP_E811a_v111017.csv
23/06/07 08:49:38 WARN CSVHeaderChecker: CSV header does not conform to the schema.
Header: UPRN, CLASSIFICATION_CODE, CLASS_SCHEME
Schema: uprn, classificationCode, classScheme
Expected: classificationCode but found: CLASSIFICATION_CODE
CSV file: file:///home/dt224374/address-index-data/batch/src/test/resources/csv/classification/ABP_E811a_v111017.csv
23/06/07 08:49:38 WARN CSVHeaderChecker: CSV header does not conform to the schema.
Header: UPRN, LOGICAL_STATUS, PARENT_UPRN, X_COORDINATE, Y_COORDINATE, LATITUDE, LONGITUDE, RPC, LOCAL_CUSTODIAN_CODE, COUNTRY, ADDRESSBASE_POSTAL, POSTCODE_LOCATOR, MULTI_OCC_COUNT
Schema: uprn, logicalStatus, parentUprn, xCoordinate, yCoordinate, latitude, longitude, rpc, localCustodianCode, country, addressbasePostal, postcodeLocator, multiOccCount
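
These warnings come from Spark's CSV header validation: when a file is read with header=true and an explicit schema, Spark keeps the schema's column names and, with the default enforceSchema=true, CSVHeaderChecker only logs a warning when the file's header differs. A minimal sketch of the pattern, using a path and the column names from the warnings above:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("csv-header-check")
  .master("local[*]")
  .getOrCreate()

// camelCase names, as on the "Schema:" lines in the warnings
val schema = StructType(Seq(
  StructField("uprn", LongType),
  StructField("classificationCode", StringType),
  StructField("classScheme", StringType)
))

// header=true plus an explicit schema: the header row is skipped,
// the schema names win, and a mismatch is only a WARN, not an error
val df = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("batch/src/test/resources/csv/hierarchy/ABP_E811a_v111017.csv")

Setting .option("enforceSchema", "false") would instead turn the mismatch into a hard failure, which is one way to surface these as errors rather than warnings.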