add documentation on data migration options
The metastore can be imported into the new workspace, but all of the data will have to live on external mounts.
a) For external tables, the mounts will have to be remounted in the new workspace with the exact same directory structure. Run %fs mounts
in the old workspace and recreate the exact same mount structure in the destination workspace. The output of %fs mounts
on the source and destination workspaces should match if any tables in the metastore point at those mounts.
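A minimal sketch of listing and recreating the mounts, assuming an ADLS Gen2 container mounted with a service principal as in the Databricks docs; the storage account, container, service principal, and secret names below are placeholders:

# Source workspace: list every mount so its structure can be recreated.
for m in dbutils.fs.mounts():
    print(m.mountPoint, "->", m.source)

# Destination workspace: recreate each mount at the SAME mount point as the source.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<key-name>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token",
}
dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="<same-mount-point-as-source>",
    extra_configs=configs)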
b) For tables that use abfss:// or adl:// paths directly, declare the storage credentials in the cluster Spark config and use that cluster for the migration.
Example, run in the destination workspace:
python import_db.py --azure --profile secondary --metastore --cluster-name CLUSTER_WITH_SPARKCREDS_FOR_STORAGE
adl://
https://docs.databricks.com/data/data-sources/azure/azure-datalake.html
spark.sparkContext.hadoopConfiguration.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
spark.sparkContext.hadoopConfiguration.set("fs.adl.oauth2.client.id", "<application-id>")
spark.sparkContext.hadoopConfiguration.set("fs.adl.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>"))
spark.sparkContext.hadoopConfiguration.set("fs.adl.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
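The same ADLS Gen1 credentials can also be set at the session level from a Python notebook (the spark.conf.set form shown in the linked doc); the application id, directory id, and secret scope/key names are placeholders:

# Session-level OAuth config for adl:// (ADLS Gen1) with a service principal.
spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("fs.adl.oauth2.client.id", "<application-id>")
spark.conf.set("fs.adl.oauth2.credential", dbutils.secrets.get(scope="<scope-name>", key="<key-name>"))
spark.conf.set("fs.adl.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id>/oauth2/token")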
abfss://
https://docs.databricks.com/data/data-sources/azure/azure-datalake-gen2.html
spark.hadoop.fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net <application-id>
spark.hadoop.fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net <service-credential>
spark.hadoop.fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net https://login.microsoftonline.com/<directory-id>/oauth2/token
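If setting these per notebook session is easier than editing the cluster Spark config, the equivalent Python is below; the storage account, application, directory, and secret names are placeholders:

# Session-level OAuth config for abfss:// (ADLS Gen2) with a service principal.
account = "<storage-account-name>"
spark.conf.set(f"fs.azure.account.auth.type.{account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}.dfs.core.windows.net", "<application-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}.dfs.core.windows.net",
               dbutils.secrets.get(scope="<scope-name>", key="<key-name>"))
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<directory-id>/oauth2/token")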
wasbs://
https://docs.databricks.com/data/data-sources/azure/azure-storage.html
spark.hadoop.fs.azure.account.key.<storage-account-name>.blob.core.windows.net <storage-account-access-key>
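The wasbs:// account key can likewise be set per notebook session, ideally pulled from a secret scope; all names below are placeholders:

# Session-level account-key config for wasbs:// (Blob storage).
spark.conf.set(
    "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
    dbutils.secrets.get(scope="<scope-name>", key="<storage-account-key-name>"))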
s3://
https://docs.databricks.com/data/data-sources/aws/amazon-s3.html#access-s3-buckets-directly
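On AWS the cleanest option is an instance profile on the migration cluster; if access keys have to be used instead, one way is to set them on the Hadoop configuration as sketched below (the secret scope and key names are placeholders):

# Access keys for direct s3a:// access when an instance profile is not available.
access_key = dbutils.secrets.get(scope="<scope-name>", key="<aws-access-key>")
secret_key = dbutils.secrets.get(scope="<scope-name>", key="<aws-secret-key>")
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)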
c) For managed tables, we will have to copy dbfs:/user/hive/warehouse/ to a dbfs:/tempMount location, and then copy from dbfs:/tempMount to dbfs:/user/hive/warehouse/ in the new workspace on Azure. The temp hop is needed because root DBFS is locked down in Azure. Users with 1000s of TBs of data can reach out to their Azure SAs to work out a plan that avoids the overhead of the copy to the temp location.
dbutils.fs.cp can be used to copy between the temp location and DBFS.
At the source workspace: dbutils.fs.cp("dbfs:/user/hive/warehouse/", "dbfs:/tempMount/", True)
At the destination workspace: dbutils.fs.cp("dbfs:/tempMount/", "dbfs:/user/hive/warehouse/", True)
dbfs:/tempMount can be ADLS Gen2, ADLS Gen1, Blob storage, or even S3 (a sketch of mounting a temp container follows the links):
https://docs.databricks.com/data/data-sources/azure/azure-datalake-gen2.html or
https://docs.databricks.com/data/data-sources/azure/azure-datalake.html or
https://docs.databricks.com/data/data-sources/azure/azure-storage.html
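A sketch of standing up the temp location as a Blob storage mount; the container, storage account, and secret names are placeholders, the mount is created in both workspaces, and the cp paths should be adjusted to whatever mount point is chosen:

# Mount the temporary container (run in BOTH workspaces).
dbutils.fs.mount(
    source="wasbs://<temp-container>@<storage-account>.blob.core.windows.net",
    mount_point="/mnt/tempMount",
    extra_configs={
        "fs.azure.account.key.<storage-account>.blob.core.windows.net":
            dbutils.secrets.get(scope="<scope-name>", key="<storage-account-key-name>")})

# Source workspace: stage the managed-table data in the temp mount.
dbutils.fs.cp("dbfs:/user/hive/warehouse/", "dbfs:/mnt/tempMount/", True)
# Destination workspace: restore it into the root DBFS warehouse path.
dbutils.fs.cp("dbfs:/mnt/tempMount/", "dbfs:/user/hive/warehouse/", True)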
AWS is similar, but I believe there is tooling to copy data directly from the source root bucket to the destination root bucket, which would avoid the /temp hop.