mrchristine/db-migration

add documentation on data migration options

The metastore can be imported into the new workspace, but all of the data will have to be reachable through external mounts or external storage paths.

a) For external tables, the mounts will have to be re-mounted in the new workspace with the exact same directory structure. Run %fs mounts in the old workspace and recreate the exact mount structure in the destination workspace. The %fs mounts output on the source and destination workspaces should match wherever a mount path is referenced by any table in the metastore. A sketch of listing and recreating the mounts follows.
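As a rough sketch only (the storage account, container, secret scope/key, and mount point below are placeholders, and it assumes an ADLS Gen2 mount secured with a service principal), the source mounts can be listed and then recreated one by one:

# In the source workspace: capture every mount point and its backing source.
for m in dbutils.fs.mounts():
    print(m.mountPoint, m.source)

# In the destination workspace: recreate each mount at the exact same mount point.
configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": "<application-id>",
  "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>"),
  "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token",
}
dbutils.fs.mount(
  source = "abfss://<container>@<storage-account>.dfs.core.windows.net/",
  mount_point = "/mnt/<same-mount-point-as-in-source>",
  extra_configs = configs)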

b) For tables that use abfss:// or adl:// paths directly, declare the storage credentials in the Spark config of a cluster and use that cluster for the migration.

Example, in the destination workspace:

python import_db.py --azure --profile secondary --metastore --cluster-name CLUSTER_WITH_SPARKCREDS_FOR_STORAGE

adl://

https://docs.databricks.com/data/data-sources/azure/azure-datalake.html

spark.sparkContext.hadoopConfiguration.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
spark.sparkContext.hadoopConfiguration.set("fs.adl.oauth2.client.id", "<application-id>")
spark.sparkContext.hadoopConfiguration.set("fs.adl.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>"))
spark.sparkContext.hadoopConfiguration.set("fs.adl.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
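If the same credentials are declared at the cluster level instead (so the migration cluster named with --cluster-name picks them up), the equivalent cluster Spark config should look roughly like this (the account, directory, scope, and key names are placeholders):

spark.hadoop.fs.adl.oauth2.access.token.provider.type ClientCredential
spark.hadoop.fs.adl.oauth2.client.id <application-id>
spark.hadoop.fs.adl.oauth2.credential {{secrets/<scope-name>/<key-name>}}
spark.hadoop.fs.adl.oauth2.refresh.url https://login.microsoftonline.com/<directory-id>/oauth2/token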

abfss://

https://docs.databricks.com/data/data-sources/azure/azure-datalake-gen2.html

spark.hadoop.fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net <application-id>
spark.hadoop.fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net {{secrets/<scope-name>/<key-name>}}
spark.hadoop.fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net https://login.microsoftonline.com/<directory-id>/oauth2/token
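With those cluster settings in place, a quick sanity check from a notebook (container, account, and path are placeholders) is to list a directory on the account:

# Verify the OAuth credentials by listing a directory on the ADLS Gen2 account.
display(dbutils.fs.ls("abfss://<container>@<storage-account>.dfs.core.windows.net/<path>"))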

wasbs://

https://docs.databricks.com/data/data-sources/azure/azure-storage.html

spark.hadoop.fs.azure.account.key.<storage-account-name>.blob.core.windows.net <storage-account-access-key>
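The account key can also be set per session from a notebook instead of the cluster config; a minimal sketch, with the account, scope, and key names as placeholders:

# Set the Blob storage account key for this Spark session from a secret scope.
spark.conf.set(
  "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
  dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>"))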

s3://

https://docs.databricks.com/data/data-sources/aws/amazon-s3.html#access-s3-buckets-directly
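Assuming the migration cluster already has an instance profile or AWS keys attached, direct access can be verified from a notebook (bucket and path are placeholders):

# List the bucket directly; this only works if the cluster has IAM or key-based access to it.
display(dbutils.fs.ls("s3a://<bucket-name>/<path>"))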

c) For managed tables, the contents of dbfs:/user/hive/warehouse/ will have to be copied to a dbfs:/tempMount location in the source workspace, and then copied from dbfs:/tempMount to dbfs:/user/hive/warehouse/ in the new Azure workspace. The temp location is needed because the root DBFS is locked down in Azure. Customers with thousands of TBs of data can reach out to their Azure SAs to work out a plan that avoids the overhead of the extra copy to the temp location.

A recursive dbutils.fs.cp can be used to copy between the DBFS root and the temp mount:

At the source workspace: dbutils.fs.cp("dbfs:/user/hive/warehouse/", "dbfs:/tempMount/", True)
At the destination workspace: dbutils.fs.cp("dbfs:/tempMount/", "dbfs:/user/hive/warehouse/", True)

dbfs:/tempMount can be backed by ADLS, Blob storage, or even S3 (see the mount sketch after the links below):

https://docs.databricks.com/data/data-sources/azure/azure-datalake-gen2.html
https://docs.databricks.com/data/data-sources/azure/azure-datalake.html
https://docs.databricks.com/data/data-sources/azure/azure-storage.html
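For example, a Blob-backed temp mount could be created like this (storage account, container, and secret scope/key are placeholders):

# Mount an Azure Blob container as dbfs:/tempMount using an account key stored in a secret scope.
dbutils.fs.mount(
  source = "wasbs://<container>@<storage-account>.blob.core.windows.net/",
  mount_point = "/tempMount",
  extra_configs = {
    "fs.azure.account.key.<storage-account>.blob.core.windows.net":
      dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")})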

AWS is similar, but I believe there are tools to copy the data directly from the source root bucket to the destination root bucket, which would avoid the /temp hop; a rough sketch with the AWS CLI follows.
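For instance (bucket names and prefix are placeholders, and this assumes credentials that can read the source root bucket and write to the destination one), an out-of-band copy could be done with the AWS CLI:

aws s3 sync s3://<source-root-bucket>/<prefix> s3://<destination-root-bucket>/<prefix>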