mrchristine/db-migration

Add support for spark 3.0

Closed this issue · 2 comments

When running an export against a specific cluster with the --cluster-name argument, and that cluster runs Spark 3.0, the error shown below is generated. This was tested on Azure Databricks. Cluster runtime used: 7.0 (includes Apache Spark 3.0.0, Scala 2.12).

python3 ./export_db.py --azure --metastore --cluster-name export --profile ws-databricks

ERROR:
AttributeError: databaseName
{"resultType": "error", "summary": "<span class="ansi-red-fg">AttributeError: databaseName", "cause": "---------------------------------------------------------------------------\nValueError Traceback (most recent call last)\n/databricks/spark/python/pyspark/sql/types.py in getattr(self, item)\n 1594 # but this will not be used in normal cases\n-> 1595 idx = self.fields.index(item)\n 1596 return self[idx]\n\nValueError: 'databaseName' is not in list\n\nDuring handling of the above exception, another exception occurred:\n\nAttributeError Traceback (most recent call last)\n in \n----> 1 all_dbs = [x.databaseName for x in spark.sql("show databases").collect()]; print(len(all_dbs))\n\n in (.0)\n----> 1 all_dbs = [x.databaseName for x in spark.sql("show databases").collect()]; print(len(all_dbs))\n\n/databricks/spark/python/pyspark/sql/types.py in getattr(self, item)\n 1598 raise AttributeError(item)\n 1599 except ValueError:\n-> 1600 raise AttributeError(item)\n 1601 \n 1602 def setattr(self, key, value):\n\nAttributeError: databaseName"}

Traceback (most recent call last):
  File "./export_db.py", line 151, in
    main()
  File "./export_db.py", line 137, in main
    hive_c.export_hive_metastore(cluster_name=args.cluster_name)
  File "/Users/saldroubi/Dropbox/git/db-migration/dbclient/HiveClient.py", line 200, in export_hive_metastore
    all_dbs = self.log_all_databases(cid, ec_id, metastore_dir)
  File "/Users/saldroubi/Dropbox/git/db-migration/dbclient/HiveClient.py", line 21, in log_all_databases
    raise ValueError("Cannot identify number of databases due to the above error")
ValueError: Cannot identify number of databases due to the above error

This would be a feature request that I can add.
For the time being, can you use Spark 2 to export?

Support added for DBR 7 release.
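
For anyone hitting the same AttributeError on an older checkout: the traceback shows that rows returned by "show databases" on this runtime have no databaseName field. As far as I know, Spark 3.0 names that column namespace instead. The actual fix in this repo is not shown in this thread, so the following is only a minimal sketch of one version-agnostic approach, which reads the first field of each row by position instead of by name. The helper name database_names and the namedtuple stand-ins for pyspark Row objects are hypothetical and only there so the snippet runs without a cluster.

```python
from collections import namedtuple


def database_names(rows):
    """Return the first field of every row, whatever the column is called.

    Rows are assumed to be tuple-like (as pyspark.sql.Row is), so positional
    indexing works for both the 'databaseName' and 'namespace' schemas.
    """
    return [row[0] for row in rows]


# Hypothetical stand-ins for the rows returned by
# spark.sql("show databases").collect() on each Spark major version.
Spark2Row = namedtuple("Spark2Row", ["databaseName"])
Spark3Row = namedtuple("Spark3Row", ["namespace"])

spark2_rows = [Spark2Row("default"), Spark2Row("sales")]
spark3_rows = [Spark3Row("default"), Spark3Row("sales")]

print(len(database_names(spark2_rows)))
print(len(database_names(spark3_rows)))
```

On a cluster, the failing line from the traceback would then become something like all_dbs = database_names(spark.sql("show databases").collect()), assuming the database name stays in the first column of the result.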