bluelabsio/records-mover

Support loading Redshift url2table without an intermediate copy step


Right now, if you try to load a compatible file directly from an S3 URL, records-mover still seems to want to copy it to a scratch location first. This shouldn't be necessary, as we could pass the URL directly to Redshift's COPY command. Example:

$ mvrec url2table   s3://mybucket/xyz/sample_file.txt redshift myschema mytable --target.existing_table drop_and_recreate --source.variant bluelabs --source.encoding UTF8 --source.header_row --source.field_delimiter $'\t' --source.no_compression
09:22:55 - Using session_type=lpass from config file
09:22:55 - Starting...
09:22:58 - Mover: copying from DataUrlRecordsSource(DelimitedRecordsFormat(bluelabs - {'header-row': True, 'field-delimiter': '\t', 'compression': None})) to TableRecordsTarget(redshift) by first writing DataUrlRecordsSource(DelimitedRecordsFormat(bluelabs - {'header-row': True, 'field-delimiter': '\t', 'compression': None})) to DelimitedRecordsFormat(csv - {'dateformat': 'YYYY-MM-DD', 'datetimeformat': 'YYYY-MM-DD HH:MI:SS', 'datetimeformattz': 'YYYY-MM-DD HH:MI:SSOF'}) records format (if easy to rewrite)...
09:23:30 - Mover: copying from FileobjsSource(DelimitedRecordsFormat(bluelabs - {'header-row': True, 'field-delimiter': '\t', 'compression': None})) to TableRecordsTarget(redshift) by filling in a temporary location...
09:23:30 - Uploading s3://myscratchbucket/rBuwYAGTxvY/sample_file.txt
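
What I was hoping for is a single COPY statement along these lines (a rough sketch, not what records-mover generates today; the IAM role ARN is a placeholder):

COPY myschema.mytable
FROM 's3://mybucket/xyz/sample_file.txt'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
DELIMITER '\t'
IGNOREHEADER 1
DATEFORMAT 'YYYY-MM-DD'
TIMEFORMAT 'YYYY-MM-DD HH:MI:SS';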

Well, you have to pass a manifest URL in, so you'll still need to write the manifest to the scratch bucket. But I'd imagine the manifest URL could point back to a different bucket where the actual data would live.
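
For context, a Redshift manifest is just a small JSON file listing the exact object URLs to load, so the manifest itself can sit in the scratch bucket while pointing at data in a different bucket, something like this (bucket names illustrative):

{
  "entries": [
    {"url": "s3://mybucket/xyz/sample_file.txt", "mandatory": true}
  ]
}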

I was thinking since we have just a single file in this case, we could pass the URL of the S3 file directly to the COPY command and skip the manifest entirely. I may be forgetting about something that only works if we have a manifest file, though!
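
One difference I can think of, for what it's worth: without the MANIFEST keyword, Redshift treats the FROM URL as a key prefix and loads every object that matches it, whereas a manifest names exact files (and can mark each one mandatory). For a single file with no similarly-named neighbors that probably doesn't matter, but it is a semantic difference. For comparison, the manifest form would look roughly like this (the manifest key name is made up):

COPY myschema.mytable
FROM 's3://myscratchbucket/rBuwYAGTxvY/manifest.json'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
MANIFEST
DELIMITER '\t'
IGNOREHEADER 1;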