[Bug] DBT Seed in dbt-athena-community is not working for .csv with encoding utf-8
Is this a new bug in dbt-core?
- I believe this is a new bug in dbt-core
- I have searched the existing issues, and I could not find an existing issue for this bug
Current Behavior
I would like to start by apologizing if there is already a bug report on this subject, but I couldn't find one.
I have a .csv file (sds_teste.csv) that I am using as a seed.
I execute the command dbt seed -s sds_teste and dbt executes successfully.
But when I run a select to inspect the table created by the dbt seed command, I can see that the special characters (accented letters) were not read correctly.
I already tried some things I found around the internet, like passing encoding: utf-8, but nothing worked.
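To illustrate what I suspect is happening (an assumption on my part: that the seed file is opened without an explicit encoding, so Python falls back to the Windows locale default), here is a minimal Python sketch; the file name and values are just illustrative:
import locale

# On Windows without PYTHONUTF8=1, the locale-preferred encoding is usually
# cp1252, and open() falls back to it when encoding= is omitted:
print(locale.getpreferredencoding(False))  # e.g. 'cp1252' on Windows

# Round-tripping UTF-8 bytes through that default reproduces the symptom:
with open("sds_teste.csv", "w", encoding="utf-8") as f:
    f.write("id,some_text\n1,Ã Ç\n")

with open("sds_teste.csv") as f:  # no encoding= -> cp1252 on Windows
    print(f.read())  # mojibake like 'Ã Ã‡' (or a UnicodeDecodeError for some bytes)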
Expected Behavior
The expected behavior is that dbt seed can read a .csv file encoded as UTF-8.
Should be: A text with special characters, like Ã, Á, Í, or Ç
Instead of: A text with special characters, like �, �, �, or �
Steps To Reproduce
1 - Install Python 3.11.9 on a Windows computer
2 - Create a Python environment with python venv
3 - Install dbt-core==1.8.7 and dbt-athena-community==1.8.4 via pip install
4 - Create a dbt project
5 - Create a .csv file in the seeds/ folder and write some example rows with special characters
6 - Configure profiles.yml to connect to AWS Athena (storage: AWS S3)
7 - Run the dbt seed command
Relevant log output
============================== 13:38:20.928977 | c0ac6503-3146-44d1-84fa-552929776d79 ==============================
13:38:20.928977 [info ] [MainThread]: Running with dbt=1.8.7
13:38:20.929979 [debug] [MainThread]: running dbt with arguments {'printer_width': '80', 'indirect_selection': 'eager', 'write_json': 'True', 'log_cache_events': 'False', 'partial_parse': 'True', 'cache_selected_only': 'False', 'warn_error': 'None', 'fail_fast': 'False', 'profiles_dir': 'C:\\Users\\julio.padoveze\\.dbt', 'log_path': 'C:\\Users\\julio.padoveze\\github\\prd-Copastur-dbt-source-qv\\logs', 'version_check': 'True', 'debug': 'False', 'use_colors': 'True', 'use_experimental_parser': 'False', 'no_print': 'None', 'quiet': 'False', 'empty': 'None', 'log_format': 'default', 'introspect': 'True', 'invocation_command': 'dbt seed -s sds_teste', 'static_parser': 'True', 'target_path': 'None', 'warn_error_options': 'WarnErrorOptions(include=[], exclude=[])', 'send_anonymous_usage_stats': 'True'}
13:38:21.527840 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'project_id', 'label': 'c0ac6503-3146-44d1-84fa-552929776d79', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x0000020521A27B10>]}
13:38:21.561425 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'adapter_info', 'label': 'c0ac6503-3146-44d1-84fa-552929776d79', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x000002051CDCDE10>]}
13:38:21.563666 [info ] [MainThread]: Registered adapter: athena=1.8.4
13:38:21.575796 [debug] [MainThread]: checksum: 4af21dafb485259c48497ac86b711ddb1982f3d0f1c0ca4e09356de488b753c0, vars: {}, profile: , target: , version: 1.8.7
13:38:21.689602 [info ] [MainThread]: Unable to do partial parsing because of a version mismatch
13:38:21.690599 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'partial_parser', 'label': 'c0ac6503-3146-44d1-84fa-552929776d79', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x0000020521C79010>]}
13:38:22.937805 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'load_project', 'label': 'c0ac6503-3146-44d1-84fa-552929776d79', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x0000020522118250>]}
13:38:23.152658 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'resource_counts', 'label': 'c0ac6503-3146-44d1-84fa-552929776d79', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x0000020522129ED0>]}
13:38:23.152658 [info ] [MainThread]: Found 17 models, 9 seeds, 96 sources, 568 macros
13:38:23.153659 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'runnable_timing', 'label': 'c0ac6503-3146-44d1-84fa-552929776d79', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x0000020521F006D0>]}
13:38:23.155683 [info ] [MainThread]:
13:38:23.157687 [debug] [MainThread]: Acquiring new athena connection 'master'
13:38:23.160683 [debug] [ThreadPool]: Acquiring new athena connection 'list_awsdatacatalog'
13:38:23.161689 [debug] [ThreadPool]: Opening a new connection, currently in state init
13:38:24.834482 [debug] [ThreadPool]: On list_awsdatacatalog: Close
13:38:24.840486 [debug] [ThreadPool]: Acquiring new athena connection 'list_awsdatacatalog_db_dbt_validation'
13:38:24.840486 [debug] [ThreadPool]: Opening a new connection, currently in state init
13:38:29.476313 [debug] [ThreadPool]: On list_awsdatacatalog_db_dbt_validation: Close
13:38:29.495682 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'runnable_timing', 'label': 'c0ac6503-3146-44d1-84fa-552929776d79', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x0000020521B74F10>]}
13:38:29.496682 [info ] [MainThread]: Concurrency: 10 threads (target='dev')
13:38:29.498679 [info ] [MainThread]:
13:38:29.512664 [debug] [Thread-1 (]: Began running node seed.qv.sds_teste
13:38:29.514172 [info ] [Thread-1 (]: 1 of 1 START seed file db_dbt_validation.db_qvextracao__sds_teste .............. [RUN]
13:38:29.516618 [debug] [Thread-1 (]: Acquiring new athena connection 'seed.qv.sds_teste'
13:38:29.518105 [debug] [Thread-1 (]: Began compiling node seed.qv.sds_teste
13:38:29.519110 [debug] [Thread-1 (]: Began executing node seed.qv.sds_teste
13:38:29.553617 [debug] [Thread-1 (]: seed by upload...
13:38:29.566419 [debug] [Thread-1 (]: Opening a new connection, currently in state init
13:38:32.830681 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: Table "awsdatacatalog"."db_dbt_validation"."db_qvextracao__sds_teste__dbt_tmp" does not exists - Ignoring
13:38:32.832678 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: Table "awsdatacatalog"."db_dbt_validation"."db_qvextracao__sds_teste__dbt_tmp" does not exist - Ignoring
13:38:32.834710 [debug] [Thread-1 (]: Dropping relation via Glue and S3 APIs
13:38:34.321827 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: Table "awsdatacatalog"."db_dbt_validation"."db_qvextracao__sds_teste__dbt_tmp" does not exists - Ignoring
13:38:34.323331 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: Table "awsdatacatalog"."db_dbt_validation"."db_qvextracao__sds_teste__dbt_tmp" does not exist - Ignoring
13:38:35.891369 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: Table "awsdatacatalog"."db_dbt_validation"."db_qvextracao__sds_teste__dbt_tmp" does not exist and will not be deleted, ignoring
13:38:35.893875 [debug] [Thread-1 (]: Using athena connection "seed.qv.sds_teste"
13:38:35.893875 [debug] [Thread-1 (]: On seed.qv.sds_teste: -- /* {"app": "dbt", "dbt_version": "1.8.7", "profile_name": "qv", "target_name": "dev", "node_id": "seed.qv.sds_teste"} */
create external table `db_dbt_validation`.`db_qvextracao__sds_teste__dbt_tmp` (`id` string, `some_text` string)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
location 's3://prd-lakehouse-copastur-analytics/analytics/env=prod/top=copastur/domain=analytics/dbt/db_dbt_validation/db_qvextracao__sds_teste__dbt_tmp'
tblproperties (
'skip.header.line.count'='1'
)
13:38:36.637497 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: Athena query ID 87dea510-7789-40ce-849c-4d2a4d6428bf
13:38:38.371586 [debug] [Thread-1 (]: SQL status: OK -1 in 2.477 seconds
13:38:38.386232 [debug] [Thread-1 (]: Skip partitioning: False
13:38:38.387230 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: get_work_group for %s
13:38:39.758786 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: S3 path does not exist
13:38:39.759788 [debug] [Thread-1 (]: Using athena connection "seed.qv.sds_teste"
13:38:39.759788 [debug] [Thread-1 (]: On seed.qv.sds_teste: -- /* {"app": "dbt", "dbt_version": "1.8.7", "profile_name": "qv", "target_name": "dev", "node_id": "seed.qv.sds_teste"} */
create table "awsdatacatalog"."db_dbt_validation"."db_qvextracao__sds_teste"
with (
table_type='hive',
is_external=true,
format='parquet'
)
as
select
cast(nullif("id", '') as integer) as "id"
,
cast(nullif("some_text", '') as varchar) as "some_text"
from
"awsdatacatalog"."db_dbt_validation"."db_qvextracao__sds_teste__dbt_tmp"
13:38:39.966615 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: Athena query ID 3ff995b0-dbb9-42b2-ab05-90844aeb9925
13:38:42.972477 [debug] [Thread-1 (]: SQL status: OK 2 in 3.212 seconds
13:38:44.460039 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: table_name : 490445445844.db_dbt_validation.db_qvextracao__sds_teste__dbt_tmp
13:38:44.462029 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: table type : TableType.TABLE
13:38:44.462029 [debug] [Thread-1 (]: Dropping relation via Glue and S3 APIs
13:38:45.992330 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: table_name : 490445445844.db_dbt_validation.db_qvextracao__sds_teste__dbt_tmp
13:38:45.993334 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: table type : TableType.TABLE
13:38:45.994409 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: "awsdatacatalog"."db_dbt_validation"."db_qvextracao__sds_teste__dbt_tmp" is stored in s3://prd-lakehouse-copastur-analytics/analytics/env=prod/top=copastur/domain=analytics/dbt/db_dbt_validation/db_qvextracao__sds_teste__dbt_tmp
13:38:46.756532 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: Deleting table data: path='s3://prd-lakehouse-copastur-analytics/analytics/env=prod/top=copastur/domain=analytics/dbt/db_dbt_validation/db_qvextracao__sds_teste__dbt_tmp', bucket='prd-lakehouse-copastur-analytics', prefix='analytics/env=prod/top=copastur/domain=analytics/dbt/db_dbt_validation/db_qvextracao__sds_teste__dbt_tmp/'
13:38:49.383198 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: Deleted table from glue catalog: "awsdatacatalog"."db_dbt_validation"."db_qvextracao__sds_teste__dbt_tmp"
Environment
- OS: Windows 11
- Python: 3.11.9
- dbt-core: 1.8.7
- dbt-athena-community: 1.8.4
Which database adapter are you using with dbt?
other (mention it in "Additional Context")
Additional Context
No response
Thanks for reaching out @juliodias20 !
Special characters worked for me when I tried with dbt-duckdb, so this might be specific to the dbt-athena-community adapter rather than dbt-core. I'm going to transfer this issue to that repository instead.
Example
See below for my output when using dbt-duckdb.
Create this file:
seeds/my_seed.csv
id,some_text
1,ABC
2,Ã Á Í Ç
Run these commands:
dbt seed
dbt show --inline 'select * from {{ ref("my_seed") }}'
See this output:
| id | some_text |
| -- | --------- |
| 1 | ABC |
| 2 | Ã Á Í Ç |
Thanks for the collaboration @dbeatty10 !
Now that you've said this, I tested the same case with the Databricks adapter and it works correctly! It really sounds like a problem with the athena adapter.
Hello,
It works on my side with dbt-athena on Windows.
Could you please try adding this setting to .vscode/settings.json and opening a new terminal in VS Code?
{
  "terminal.integrated.env.windows": {
    "PYTHONUTF8": "1"
  }
}
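A quick way to verify the variable was picked up in the new terminal (just a sanity check, not part of the fix itself):
import sys

# In UTF-8 mode (PYTHONUTF8=1), open() defaults to UTF-8 regardless of the
# Windows locale:
print(sys.flags.utf8_mode)  # 1 when the variable was picked up, else 0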
Hello @e-quili , thank you so much for the collaboration! I tested this solution and it works!!
I will use this in my local environment for development, but I still think there is a bug, since other adapters can identify the encoding of the .csv file. What do you think?
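For illustration, a hypothetical sketch of the kind of fix I mean (load_seed is a made-up helper, not the real dbt or dbt-athena code path): open the seed file with an explicit encoding instead of relying on the locale default.
import agate

def load_seed(path: str) -> agate.Table:
    # Passing an explicit encoding sidesteps the cp1252 locale default on Windows.
    with open(path, encoding="utf-8") as f:
        return agate.Table.from_csv(f)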
I have a similar problem: I cannot even upload to S3 a csv file that contains these words:
Sedlišćo pódzajtšo, Dolna Łužyca, etc.
This will not work with the athena dbt adapter even if the letters are UTF-8.
I checked all possibilities of misconfigured encodings with:
os.environ["PYTHONIOENCODING"]
sys.getfilesystemencoding()
sys.getdefaultencoding()
And they are all set to utf8.
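For anyone checking the same thing, here are those three checks as a small script, plus the locale-preferred encoding, which (as far as I understand) is what open() actually falls back to when encoding= is omitted:
import locale
import os
import sys

print(os.environ.get("PYTHONIOENCODING"))
print(sys.getfilesystemencoding())
print(sys.getdefaultencoding())
print(locale.getpreferredencoding(False))  # cp1252 on many Windows setups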
The issue is within agate's csv_py3.py: the function writerow() calls self.writer.writerow(row), and if I add a try / except there I can collect all the rows that cannot be processed.
So here I am actually stuck - these are the rows that cannot be processed, because somehow the write always uses cp1252, which does not contain these special characters.
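Roughly, the try / except looked like this (a sketch only; write_rows_reporting_failures is a made-up name, and in reality the wrapping happens inside agate's writerow()):
import csv

def write_rows_reporting_failures(rows, path, encoding=None):
    # encoding=None reproduces the problem: open() falls back to the
    # locale-preferred codec (cp1252 on many Windows setups).
    bad = []
    with open(path, "w", newline="", encoding=encoding) as f:
        writer = csv.writer(f)
        for row in rows:
            try:
                writer.writerow(row)
            except UnicodeEncodeError:
                bad.append(row)  # row contains characters the codec cannot encode
    return bad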
Here is some test data:
"Sedlitz Ost (Sedlišćo pódzajtšo)",
"Senftenberg (Zły Komorow)",
"Cottbus Hbf Calau (NL) Chóśebuz gł.dw",
"Gollmitz (NL) Chańc (Dolna Łužyca)",
"Calau (NL) Kalawa (Dolna Łužyca",
"Kolkwitz Süd Gołkojce pódpołdnjo",
"Cottbus-Sandow Chóśebuz-Žandow",
"Cottbus-Merzdorf Chóśebuz-Žylowk",
"Cottbus-Willmersdorf Nord Chóśebuz-Rogoznow pódpołnoc",
One additional note:
If I just read my csv and write it again (as dbt does) it just works:
import csv
import agate

my_seed = r"path/my_csv.csv"
with open(my_seed, encoding="utf-8") as f:
    data = list(csv.reader(f))  # plain csv round-trip reads fine
table = agate.Table.from_csv(my_seed)
table.to_csv("asd.csv")  # writes the special characters back correctly
Maybe someone can point me in the right direction to where the local csv file is actually read in dbt - I cannot find where the agate table is created.
UPDATE:
Okay, it is actually PowerShell causing the issues. If I use Git Bash it works fine.
Windows...
Okay, for everyone having issues with seeds and dbt-athena on Windows, set this: $Env:PYTHONUTF8=1