samelamin/spark-bigquery

writing to bigquery from dataproc

Closed this issue · 1 comments

when trying to write to bigquery like so from pyspark in dataproc:
`bigquery = spark._sc._jvm.com.samelamin.spark.bigquery

Prepare the bigquery context

bq = bigquery.BigQuerySQLContext(spark._wrapped._jsqlContext)
bq.setBigQueryProjectId(BQ_PROJECT_ID)
bq.setGSProjectId(BQ_PROJECT_ID)
bq.setBigQueryGcsBucket(STAGING_BUCKET)
bq.setBigQueryDatasetLocation(DATASET_LOCATION)

Extract and Transform a dataframe

df = session.read.csv(...)

Load into a table or table partition

bqDF = bigquery.BigQueryDataFrame(df_master._jdf)
bqDF.saveAsBigQueryTable(
"{0}:{1}.{2}".format(BQ_PROJECT_ID, DATASET_ID, TABLE_NAME),
False, # Day paritioned when created
0, # Partition expired when created
bigquery.getattr("package$WriteDisposition$").getattr("MODULE$").WRITE_EMPTY(),
bigquery.getattr("package$CreateDisposition$").getattr("MODULE$").CREATE_IF_NEEDED(),
)`
I am getting the following exception:

`Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.github.samelamin#spark-bigquery_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-c234c14e-60d0-482e-a149-6a46d2a4c43f;1.0
confs: [default]
found com.github.samelamin#spark-bigquery_2.11;0.2.6 in central
found com.databricks#spark-avro_2.11;4.0.0 in central
found org.slf4j#slf4j-api;1.7.5 in central
found org.apache.avro#avro;1.7.6 in central
found org.codehaus.jackson#jackson-core-asl;1.9.13 in central
found org.codehaus.jackson#jackson-mapper-asl;1.9.13 in central
found com.thoughtworks.paranamer#paranamer;2.3 in central
found org.xerial.snappy#snappy-java;1.0.5 in central
found org.apache.commons#commons-compress;1.4.1 in central
found org.tukaani#xz;1.0 in central
found com.google.cloud.bigdataoss#bigquery-connector;0.13.4-hadoop2 in central
found com.google.api-client#google-api-client-java6;1.24.1 in central
found com.google.api-client#google-api-client;1.24.1 in central
found com.google.oauth-client#google-oauth-client;1.24.1 in central
found com.google.http-client#google-http-client;1.24.1 in central
found com.google.code.findbugs#jsr305;3.0.2 in central
found org.apache.httpcomponents#httpclient;4.0.1 in central
found org.apache.httpcomponents#httpcore;4.0.1 in central
found commons-logging#commons-logging;1.1.1 in central
found commons-codec#commons-codec;1.6 in central
found com.google.http-client#google-http-client-jackson2;1.24.1 in central
found com.fasterxml.jackson.core#jackson-core;2.9.2 in central
found com.google.guava#guava;26.0-jre in central
found org.checkerframework#checker-qual;2.5.2 in central
found com.google.errorprone#error_prone_annotations;2.1.3 in central
found com.google.j2objc#j2objc-annotations;1.1 in central
found org.codehaus.mojo#animal-sniffer-annotations;1.14 in central
found com.google.oauth-client#google-oauth-client-java6;1.24.1 in central
found com.google.api-client#google-api-client-jackson2;1.24.1 in central
found com.google.apis#google-api-services-storage;v1-rev135-1.24.1 in central
found com.google.apis#google-api-services-bigquery;v2-rev398-1.24.1 in central
found com.google.cloud.bigdataoss#util;1.9.4 in central
found com.google.auto.value#auto-value-annotations;1.6.2 in central
found com.google.cloud.bigdataoss#util-hadoop;1.9.4-hadoop2 in central
found com.google.cloud.bigdataoss#gcs-connector;1.9.4-hadoop2 in central
found com.google.cloud.bigdataoss#gcsio;1.9.4 in central
found joda-time#joda-time;2.9.3 in central
:: resolution report :: resolve 1060ms :: artifacts dl 27ms
:: modules in use:
com.databricks#spark-avro_2.11;4.0.0 from central in [default]
com.fasterxml.jackson.core#jackson-core;2.9.2 from central in [default]
com.github.samelamin#spark-bigquery_2.11;0.2.6 from central in [default]
com.google.api-client#google-api-client;1.24.1 from central in [default]
com.google.api-client#google-api-client-jackson2;1.24.1 from central in [default]
com.google.api-client#google-api-client-java6;1.24.1 from central in [default]
com.google.apis#google-api-services-bigquery;v2-rev398-1.24.1 from central in [default]
com.google.apis#google-api-services-storage;v1-rev135-1.24.1 from central in [default]
com.google.auto.value#auto-value-annotations;1.6.2 from central in [default]
com.google.cloud.bigdataoss#bigquery-connector;0.13.4-hadoop2 from central in [default]
com.google.cloud.bigdataoss#gcs-connector;1.9.4-hadoop2 from central in [default]
com.google.cloud.bigdataoss#gcsio;1.9.4 from central in [default]
com.google.cloud.bigdataoss#util;1.9.4 from central in [default]
com.google.cloud.bigdataoss#util-hadoop;1.9.4-hadoop2 from central in [default]
com.google.code.findbugs#jsr305;3.0.2 from central in [default]
com.google.errorprone#error_prone_annotations;2.1.3 from central in [default]
com.google.guava#guava;26.0-jre from central in [default]
com.google.http-client#google-http-client;1.24.1 from central in [default]
com.google.http-client#google-http-client-jackson2;1.24.1 from central in [default]
com.google.j2objc#j2objc-annotations;1.1 from central in [default]
com.google.oauth-client#google-oauth-client;1.24.1 from central in [default]
com.google.oauth-client#google-oauth-client-java6;1.24.1 from central in [default]
com.thoughtworks.paranamer#paranamer;2.3 from central in [default]
commons-codec#commons-codec;1.6 from central in [default]
commons-logging#commons-logging;1.1.1 from central in [default]
joda-time#joda-time;2.9.3 from central in [default]
org.apache.avro#avro;1.7.6 from central in [default]
org.apache.commons#commons-compress;1.4.1 from central in [default]
org.apache.httpcomponents#httpclient;4.0.1 from central in [default]
org.apache.httpcomponents#httpcore;4.0.1 from central in [default]
org.checkerframework#checker-qual;2.5.2 from central in [default]
org.codehaus.jackson#jackson-core-asl;1.9.13 from central in [default]
org.codehaus.jackson#jackson-mapper-asl;1.9.13 from central in [default]
org.codehaus.mojo#animal-sniffer-annotations;1.14 from central in [default]
org.slf4j#slf4j-api;1.7.5 from central in [default]
org.tukaani#xz;1.0 from central in [default]
org.xerial.snappy#snappy-java;1.0.5 from central in [default]
:: evicted modules:
org.slf4j#slf4j-api;1.6.4 by [org.slf4j#slf4j-api;1.7.5] in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 38 | 0 | 0 | 1 || 37 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-c234c14e-60d0-482e-a149-6a46d2a4c43f
confs: [default]
0 artifacts copied, 37 already retrieved (0kB/19ms)
19/09/17 16:14:30 INFO org.spark_project.jetty.util.log: Logging initialized @4647ms
19/09/17 16:14:30 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
19/09/17 16:14:31 INFO org.spark_project.jetty.server.Server: Started @4753ms
19/09/17 16:14:31 WARN org.apache.spark.util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
19/09/17 16:14:31 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@2242b679{HTTP/1.1,[http/1.1]}{0.0.0.0:4041}
19/09/17 16:14:31 WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
19/09/17 16:14:31 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at sidm-cluster-02-m/10.128.0.2:8032
19/09/17 16:14:31 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at sidm-cluster-02-m/10.128.0.2:10200
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.github.samelamin_spark-bigquery_2.11-0.2.6.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.databricks_spark-avro_2.11-4.0.0.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.cloud.bigdataoss_bigquery-connector-0.13.4-hadoop2.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/joda-time_joda-time-2.9.3.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.slf4j_slf4j-api-1.7.5.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.apache.avro_avro-1.7.6.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.codehaus.jackson_jackson-core-asl-1.9.13.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.thoughtworks.paranamer_paranamer-2.3.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.xerial.snappy_snappy-java-1.0.5.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.apache.commons_commons-compress-1.4.1.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.tukaani_xz-1.0.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.api-client_google-api-client-java6-1.24.1.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.api-client_google-api-client-jackson2-1.24.1.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.apis_google-api-services-storage-v1-rev135-1.24.1.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.apis_google-api-services-bigquery-v2-rev398-1.24.1.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.code.findbugs_jsr305-3.0.2.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.guava_guava-26.0-jre.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.oauth-client_google-oauth-client-1.24.1.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.oauth-client_google-oauth-client-java6-1.24.1.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.cloud.bigdataoss_util-1.9.4.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.cloud.bigdataoss_util-hadoop-1.9.4-hadoop2.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.cloud.bigdataoss_gcs-connector-1.9.4-hadoop2.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.api-client_google-api-client-1.24.1.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.http-client_google-http-client-jackson2-1.24.1.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.http-client_google-http-client-1.24.1.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.apache.httpcomponents_httpclient-4.0.1.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.apache.httpcomponents_httpcore-4.0.1.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/commons-logging_commons-logging-1.1.1.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/commons-codec_commons-codec-1.6.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.fasterxml.jackson.core_jackson-core-2.9.2.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.checkerframework_checker-qual-2.5.2.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.errorprone_error_prone_annotations-2.1.3.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.j2objc_j2objc-annotations-1.1.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.codehaus.mojo_animal-sniffer-annotations-1.14.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.auto.value_auto-value-annotations-1.6.2.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.cloud.bigdataoss_gcsio-1.9.4.jar added multiple times to distributed cache.
19/09/17 16:14:35 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1568730559207_0014
19/09/17 16:16:10 WARN org.apache.spark.util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
Traceback (most recent call last):
File "/tmp/5573a94d3db144558111c14b4dba7f38/ingest_bq.py", line 49, in
bigquery.getattr("package$CreateDisposition$").getattr("MODULE$").CREATE_IF_NEEDED(),
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in call
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o68.saveAsBigQueryTable.
: java.lang.NoSuchMethodError: com.google.cloud.hadoop.io.bigquery.BigQueryStrings.parseTableReference(Ljava/lang/String;)Lcom/google/api/services/bigquery/model/TableReference;
at com.samelamin.spark.bigquery.BigQueryDataFrame.saveAsBigQueryTable(BigQueryDataFrame.scala:40)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

19/09/17 16:16:11 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark@2242b679{HTTP/1.1,[http/1.1]}{0.0.0.0:4041}
Job output is complete`

Sounds like you are missing some libraries, try ensuring the bigdata-interop libraries are added to the class path

Using scala libs from python is always a pain im afraid