uswitch/blueshift

Security issues - does the JDBC password need to be in the S3 Manifest file(s)?

aaelony opened this issue · 19 comments

Great project. One concern about passwords. According to the design.md documentation, the manifest.edn file containing the JDBC connection username and password sits in S3.

Shouldn't this info be better placed in the config.edn file?

-A

Paul Ingles commented

Hi,

We did think about putting passwords either into configuration on the Blueshift instance or into environment variables. In the end we decided against it, given that all our data on S3 is readable only by our user, and that this makes it easier to have a single Blueshift instance interoperating with many different Redshift clusters.

Do you have a different use case that would make this more valuable/necessary :)?

Hi, it may just be our use case, but I'm told it is not uncommon for another team to accidentally flip the switch that makes all S3 buckets globally readable (perhaps temporarily), and any credentials embedded in bucket folders would exacerbate the situation. The alternate line of thought says it's all Amazon anyway (S3, Redshift, EC2, ...), but I tend to prefer the more modular, siloed approach, with layers that would limit the scope of any such breach, just in case.

Was I persuasive?


tgk commented

I completely get what you mean. I agree, S3's console makes it very easy to do fatal things like that. When we built Blueshift, we didn't expect any other teams to tinker with the bucket we use, so we considered ourselves safe :)

Would it make sense to add support for environment variables in the manifest files (such as $PRODUCTION_REDSHIFT_1_JDBC_URL), or were you thinking that having a :default-jdbc-url in the manifest file would be a better solution for your team?

I'm open to pull-requests. Paul might have another opinion :)
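
Sketched in Clojure, the environment-variable idea might look something like this (the function name and behaviour are illustrative only, not Blueshift's actual API):

```clojure
(require '[clojure.string :as str])

;; Hypothetical sketch: expand $NAME references in a manifest string
;; value from a map of variables (e.g. (into {} (System/getenv))).
(defn expand-env
  "Replace $NAME tokens in s with values from env (a map of string
  name -> string value); unknown names become empty strings."
  [s env]
  (str/replace s #"\$([A-Z0-9_]+)"
               (fn [[_ var-name]] (get env var-name ""))))

(expand-env "$PRODUCTION_REDSHIFT_1_JDBC_URL"
            {"PRODUCTION_REDSHIFT_1_JDBC_URL" "jdbc:postgresql://dw1/db"})
;; => "jdbc:postgresql://dw1/db"
```

This keeps secrets out of the manifest entirely; they are supplied to the Blueshift process instead.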

I'm not sure what the best approach would be, but I'm currently partial to a :redshift-db-name "my_db" entry in the manifest, with the config.edn file containing the conceptual link, perhaps something like :jdbc-connections [{:redshift-db-name "my_db" :jdbc-url "urlgoeshere"}] ...

Then the redshift password would be as secure as (and in the same place as) the credentials mechanism.

Any drawbacks to this suggestion?


Paul Ingles commented

Interesting! How about using {:redshift-connection :some-identifier ...} inside the manifest, then inside config.edn:

{:redshift-connections {:some-identifier {:jdbc-url "" ...}}}

Blueshift can then use the specified connection, or one configured inside the manifest file?

To be honest, this would probably also make it a little easier to clean up some of the repetition in manifests. I'd definitely welcome a pull request if I/we don't get around to implementing this soon.
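
A minimal sketch of that lookup, assuming hypothetical key and function names (not Blueshift's actual implementation):

```clojure
;; Hypothetical sketch of resolving a manifest's connection alias
;; against config.edn. Key names are illustrative.
(defn resolve-jdbc-url
  "Prefer a :jdbc-url given directly in the manifest; otherwise look
  up the alias under :redshift-connections in the config."
  [config manifest]
  (or (:jdbc-url manifest)
      (get-in config [:redshift-connections
                      (:redshift-connection manifest)
                      :jdbc-url])))

(resolve-jdbc-url
 {:redshift-connections {:dw1 {:jdbc-url "jdbc:postgresql://dw1.example.com/db"}}}
 {:table "events" :redshift-connection :dw1})
;; => "jdbc:postgresql://dw1.example.com/db"
```

The `or` keeps the existing behaviour working: manifests that embed a :jdbc-url directly still take precedence.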

Very elegant!:) This sounds great.

Just so I understand correctly: you mean (keys (read-string (slurp config))) would return (:s3 :redshift-connections :telemetry), right? And keys on :redshift-connections would return all keywords referring to existing Redshift database JDBC URLs, e.g. (:redshift-db-1 :redshift-db-2), etc. Then manifests in each S3 bucket containing something like {:table "table1" :database :redshift-db-1} would be eligible for loading.

I've actually never done a pull request (I'm embarrassed to say). I hope to put something together, and you can accept/reject/improve on the proposed changes.

Cheers,
Avram


tgk commented

You've got it :) We're eagerly awaiting your pull request ;)

Oh so sorry, I was still wrapping my mind around the existing code base when the holidays hit.

I'm still with limited computer access for a bit, until past the new year, but there are a couple of things I hadn't fully grasped and tinkered with when I left off:

  1. use of core.async, having never studied or actually used it,
  2. confirming what happens when two files with the exact same data but different filenames occur (hopefully upsert behaviour means no insert or update changes), and
  3. confirming what happens when a new file is intended to clobber existing data.

Long story short it will take me some time to get back to it but hope to within the next 2-3 weeks.

sent from my iphone


A quick update... I'm back at this now and have Blueshift running, with a few changes and thoughts to present. Again, this is a great project and I'm learning quite a bit by digging into the code. I had not used core.async, schema, or component until now.

One big reason things didn't work initially was that I needed to add {:server-side-encryption "AES256"} to the put-object call in the put-manifest function. Perhaps that should be an option in the future? That, and my misconceptions about what exactly was meant by :key-pattern and :data-pattern, took me a while to figure out (for an S3 path like s3://my-bucket-name/key/part/goes/here/file.txt, the key is the key/part/goes/here/file.txt portion). I used to think of an S3 bucket as the full S3 directory path (not anymore).

Now that I've resolved the issues with SSE and my understanding above, I'm about to go ahead with the jdbc-url changes discussed previously.

Component took me a lot of getting used to, especially when trying to understand why things didn't initially work as expected. Not passing one big map around in function arguments will likely be a bit annoying in the upcoming changes, when particular args aren't available locally, but that's OK.

I like your aws-censor and wrote a similar jdbc-url-censor function.
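
A censor along those lines might look like this (a sketch only; the actual jdbc-url-censor in the branch may be implemented differently):

```clojure
(require '[clojure.string :as str])

;; Sketch: mask the password in a JDBC URL before it reaches logs.
(defn jdbc-url-censor
  "Replace the value of any password=... query parameter with ***."
  [url]
  (str/replace url #"(password=)[^&]*" "$1***"))

(jdbc-url-censor
 "jdbc:postgresql://host:5439/db?user=loader&password=s3cret&ssl=true")
;; => "jdbc:postgresql://host:5439/db?user=loader&password=***&ssl=true"
```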

Also, another issue I'm wrangling with is that files in S3 that are successfully imported into Redshift are currently deleted by Blueshift. We use S3 buckets for storage, so we would want to disable that default deletion behaviour. What are your thoughts on this?
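
One way to make the deletion opt-out could be a config flag, sketched here with a hypothetical :keep-s3-files-on-import key and a stand-in delete function:

```clojure
;; Sketch: skip the post-import delete when the (hypothetical)
;; :keep-s3-files-on-import flag is set. delete-object! stands in
;; for the real S3 delete call.
(defn cleanup-imported-files!
  "Delete the imported files unless the config asks to keep them."
  [config delete-object! files]
  (when-not (:keep-s3-files-on-import config)
    (doseq [f files]
      (delete-object! f))))

;; Usage: with the flag set, nothing is deleted.
(let [deleted (atom [])]
  (cleanup-imported-files! {:keep-s3-files-on-import true}
                           #(swap! deleted conj %)
                           ["s3://bucket/a.gz" "s3://bucket/b.gz"])
  @deleted)
;; => []
```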

cheers, -A

Hi, I have a number of cool changes ready for review.

Here is a glimpse... by the way I have tested this on my setup and everything works (!!):

Config file

{:s3 {:credentials   {:access-key "***"
                      :secret-key "***"}
      :bucket        "your-bucket"
      :key-pattern   ".*"
      :poll-interval {:seconds 30}
      :server-side-encryption "AES256"
      :keep-s3-files-on-import true}
 :telemetry {:reporters [uswitch.blueshift.telemetry/log-metrics-reporter]}
 :redshift-connections [{:dw1 {:jdbc-url "jdbc:postgresql://foobar...."}}
                        {:dw2 {:jdbc-url "jdbc:postgresql://blahblah.fake.com..."}}]}

Manifest file

{:table        "mydata_fact"
 :pk-columns   ["id" "blah"]
 :columns      ["id" "blah" "timestamp"]
 :database     :dw1 ;; must match the config to resolve the jdbc-url
 :options      ["DELIMITER '\\t'" "IGNOREHEADER 1" "GZIP" "TRIMBLANKS" "TRUNCATECOLUMNS"]
 :data-pattern ".*blah.*.gz$"}
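
Given :redshift-connections as a vector of single-entry maps, resolving a manifest's :database alias might be sketched like this (hypothetical helper name; the branch's actual code may differ):

```clojure
;; Sketch: find the :jdbc-url for a manifest's :database alias in a
;; :redshift-connections vector of single-entry maps, as configured
;; above. Helper name is illustrative.
(defn jdbc-url-for
  [config manifest]
  (some (fn [conn] (get-in conn [(:database manifest) :jdbc-url]))
        (:redshift-connections config)))

(jdbc-url-for
 {:redshift-connections [{:dw1 {:jdbc-url "jdbc:postgresql://foobar"}}
                         {:dw2 {:jdbc-url "jdbc:postgresql://blah"}}]}
 {:table "mydata_fact" :database :dw2})
;; => "jdbc:postgresql://blah"
```

`some` returns the first non-nil lookup, so only the matching entry's URL comes back.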

What is the best way to accomplish the pull request? A new branch? I'm new to pull requests, so feel free to send git commands that I can execute on my end if need be.

cheers,
Avram

I ended up forking to https://github.com/aaelony/blueshift and creating a branch called "aaa". Please review and let me know... :)

Awesome, thanks for all your hard work! We're going to try to get some time today/tomorrow to take a look and get it merged. We'll keep you posted :)

Hi - just wanted to let you know that I'm still cleaning up my code, but I've started thinking about other "modes" of use. I will open a new ticket to discuss those so as not to derail this thread.

Hi there, just letting you know that the "aaa" branch (aaelony repo) now contains a number of enhancements, allowing more control over whether the data files and/or the manifest in the bucket are kept or deleted.

This is now in the bucket manifest so it can differ for each bucket! :)

We've also enjoyed a nicety from the recent change of putting an alias to the jdbc-url in the bucket manifest, rather than hardcoding the URL there: when we upgraded our Redshift cluster, the only thing we needed to edit was the jdbc-url in the config file :)

tgk commented

Sounds good! Sorry for not having gone through it yet - things have been a bit busy at work, I'm afraid :-) I'm preparing to talk about Blueshift at the first Clojure experience report meetup in London on Tuesday, but after that I hope to have some time to go through what you have done.

Would it be possible to structure your changes as smaller individual improvements? It makes for a cleaner integration. I appreciate all the hard work you've put in, so I totally get it if you'd rather not ;-)

Cool that you'll be talking about Blueshift! It's been busy on my end as well. At some point I would like a workflow that makes it easier to submit smaller individual improvements; for now, though, that's not the case, and I apologize for this. The main differences are that the bucket manifest needs two keys specifying whether or not to keep (1) the data files that match the data pattern and (2) the manifest in the bucket itself (which signals whether to keep watching the bucket or stop). I was also made aware that begin/end transaction wrappers, with autocommit set to on, would improve Redshift load performance. (I ran into an issue where, on a smallish cluster, Blueshift was spamming Redshift with loads faster than it could handle, heheh.) I also made it easier to track down Redshift manifests (by naming them less randomly), and I plan to name the staging tables with a UUID (not implemented yet). I want to get to a lot of things, but there are many other projects which rely on the data once it's in Redshift, which Blueshift is enabling. Very cool stuff, this. :)
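
The UUID staging-table idea could be sketched like this (hypothetical helper; not yet implemented in the branch):

```clojure
(require '[clojure.string :as str])

;; Sketch: derive a unique staging-table name by appending a random
;; UUID (dashes stripped), so concurrent loads never collide.
(defn staging-table-name
  [table]
  (str table "_staging_"
       (str/replace (str (java.util.UUID/randomUUID)) "-" "")))

(staging-table-name "mydata_fact")
;; e.g. "mydata_fact_staging_5dd1b08c0e6f4f0f9a1b2c3d4e5f6a7b"
```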

tgk commented

Wow! You seem to get even more use out of it than we do ;-)

maybe so... :) Clojure is awesome.
