mrchristine/db-migration

DDL extraction issue with control characters

Closed this issue · 2 comments

Hello

Just found some inconsitencey while extracting a DDL that make use of control characters.

Below was the DDL that as used to create the table

	CREATE EXTERNAL TABLE db.tab
	(
	h_ls_hash array<string>,
	ls_id STRING,
	bin array<STRING>,
	Class1  array<int>,
	Class1_valueString  array<string>,
	Class1_valueFrom  array<float>,
	Class1_valueTo  array<float>,
	Class2  array<int>,
	Class2_valueString  array<string>,
	Class2_valueFrom  array<float>,
	Class2_valueTo  array<float>,
	Class3  array<int>,
	Class3_valueString  array<string>,
	Class3_valueFrom  array<float>,
	Class3_valueTo  array<float>,
	load_ts timestamp COMMENT 'EN: load timestamp | DE: Zeitstempel für das Laden des Datensatzes',
	record_source string COMMENT 'EN: Source Name | DE: Quellenname'
	)
	ROW FORMAT DELIMITED
	FIELDS TERMINATED BY '\u0001'
	STORED AS TEXTFILE
	location 'dbfs:/loc';

but when the DDL's were extracted using the migration tool it resulted in the below.

<div class="ansiout">CREATE EXTERNAL TABLE `db`.`tab`(`h_ls_hash` ARRAY&lt;STRING&gt;, `ls_id` STRING, `bin` ARRAY&lt;STRING&gt;, `Class1` ARRAY&lt;INT&gt;, `Class1_valueString` ARRAY&lt;STRING&gt;, `Class1_valueFrom` ARRAY&lt;FLOAT&gt;, `Class1_valueTo` ARRAY&lt;FLOAT&gt;, `Class2` ARRAY&lt;INT&gt;, `Class2_valueString` ARRAY&lt;STRING&gt;, `Class2_valueFrom` ARRAY&lt;FLOAT&gt;, `Class2_valueTo` ARRAY&lt;FLOAT&gt;, `Class3` ARRAY&lt;INT&gt;, `Class3_valueString` ARRAY&lt;STRING&gt;, `Class3_valueFrom` ARRAY&lt;FLOAT&gt;, `Class3_valueTo` ARRAY&lt;FLOAT&gt;, `load_ts` TIMESTAMP COMMENT &#39;EN: load timestamp | DE: Zeitstempel für das Laden des Datensatzes&#39;, `record_source` STRING COMMENT &#39;EN: Source Name | DE: Quellenname&#39;)
ROW FORMAT SERDE &#39;org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe&#39;
WITH SERDEPROPERTIES (
  &#39;field.delim&#39; = &#39;�&#39;,
  &#39;serialization.format&#39; = &#39;�&#39;
)
STORED AS
  INPUTFORMAT &#39;org.apache.hadoop.mapred.TextInputFormat&#39;
  OUTPUTFORMAT &#39;org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat&#39;
LOCATION &#39;dbfs:/loc&#39;
TBLPROPERTIES (
  &#39;transient_lastDdlTime&#39; = &#39;1606221712&#39;
)

</div>

Can someone help me on this ?

@arjun-hareendran can you try with the new option here, --metastore-unicode to verify your issue is resolved?

@arjun-hareendran were you able to get the new flag a try?