/s3sync

s3sync compatible with ruby 1.9.2 & 2.2

Primary LanguageRuby

CHANGES from version 1.3.8 for Ruby 2

  • Use String#encode instead of Iconv#iconv

  • Use Digest instead of Digest::Digest

CHANGES from original to be compatible with 1.9.2

  • require ‘md5’

Instead require “digest/md5”

  • Thread.critical

Thread.critical is not used since 1.9

  • Dir#collect

In 1.9.2 Dir#collect is not Array but Enumerator

  • Array#to_s

The result of [1,2].to_s is different from 1.8. Instead of to_s, used join

  • use Enumerator instead of thread_generator

DESCRIPTION:

Welcome to s3sync.rb


Home page, wiki, forum, bug reports, etc: s3sync.net

This is a ruby program that easily transfers directories between a local directory and an S3 bucket:prefix. It behaves somewhat, but not precisely, like the rsync program. In particular, it shares rsync’s peculiar behavior that trailing slashes on the source side are meaningful. See examples below.

One benefit over some other comparable tools is that s3sync goes out of its way to mirror the directory structure on S3. Meaning you don’t need to use s3sync later in order to view your files on S3. You can just as easily use an S3 shell, a web browser (if you used the –public-read option), etc. Note that s3sync is NOT necessarily going to be able to read files you uploaded via some other tool. This includes things uploaded with the old perl version! For best results, start fresh!

s3sync runs happily on linux, probably other *ix, and also Windows (except that symlinks and permissions management features don’t do anything on Windows). If you get it running somewhere interesting let me know (see below)

s3sync is free, and license terms are included in all the source files. If you decide to make it better, or find bugs, please let me know.

The original inspiration for this tool is the perl script by the same name which was made by Thorsten von Eicken (and later updated by me). This ruby program does not share any components or logic from that utility; the only relation is that it performs a similar task.

Management tasks


For low-level S3 operations not encapsulated by the sync paradigm, try the companion utility s3cmd.rb. See README_s3cmd.txt.

About single files


s3sync lacks the special case code that would be needed in order to handle a source/dest that’s a single file. This isn’t one of the supported use cases so don’t expect it to work. You can use the companion utility s3cmd.rb for single get/puts.

About Directories, the bane of any S3 sync-er


In S3 there’s no actual concept of folders, just keys and nodes. So, every tool uses its own proprietary way of storing dir info (my scheme being the best naturally) and in general the methods are not compatible.

If you populate S3 by some means *other than* s3sync and then try to use s3sync to “get” the S3 stuff to a local filesystem, you will want to use the –make-dirs option. This causes the local dirs to be created even if there is no s3sync-compatible directory node info stored on the S3 side. In other words, local folders are conjured into existence whenever they are needed to make the “get” succeed.

About MD5 hashes


s3sync’s normal operation is to compare the file size and MD5 hash of each item to decide whether it needs syncing. On the S3 side, these hashes are stored and returned to us as the “ETag” of each item when the bucket is listed, so it’s very easy. On the local side, the MD5 must be calculated by pushing every byte in the file through the MD5 algorithm. This is CPU and IO intensive!

Thus you can specify the option –no-md5. This will compare the upload time on S3 to the “last modified” time on the local item, and not do md5 calculations locally at all. This might cause more transfers than are absolutely necessary. For example if the file is “touched” to a newer modified date, but its contents didn’t change. Conversely if a file’s contents are modified but the date is not updated, then the sync will pass over it. Lastly, if your clock is very different from the one on the S3 servers, then you may see unanticipated behavior.

A word on SSL_CERT_DIR:


On my debian install I didn’t find any root authority public keys. I installed some by running this shell archive: mirbsd.mirsolutions.de/cvs.cgi/src/etc/ssl.certs.shar (You have to click download, and then run it wherever you want the certs to be placed). I do not in any way assert that these certificates are good, comprehensive, moral, noble, or otherwise correct. But I am using them.

If you don’t set up a cert dir, and try to use ssl, then you’ll 1) get an ugly warning message slapped down by ruby, and 2) not have any protection AT ALL from malicious servers posing as s3.amazonaws.com. Seriously… you want to get this right if you’re going to have any sensitive data being tossed around. – There is a debian package ca-certificates; this is what I’m using now. apt-get install ca-certificates and then use: SSL_CERT_DIR=/etc/ssl/certs

You used to be able to use just one certificate, but recently AWS has started using more than one CA.

Getting started:


Invoke by typing s3sync.rb and you should get a nice usage screen. Options can be specified in short or long form (except –delete, which has no short form)

ALWAYS TEST NEW COMMANDS using –dryrun(-n) if you want to see what will be affected before actually doing it. ESPECIALLY if you use –delete. Otherwise, do not be surprised if you misplace a ‘/’ or two and end up deleting all your precious, precious files.

If you use the –public-read(-p) option, items sent to S3 will be ACL’d so that anonymous web users can download them, given the correct URL. This could be useful if you intend to publish directories of information for others to see. For example, I use s3sync to publish itself to its home on S3 via the following command: s3sync.rb -v -p publish/ ServEdge_pub:s3sync Where the files live in a local folder called “publish” and I wish them to be copied to the URL: s3.amazonaws.com/ServEdge_pub/s3sync/… If you use –ssl(-s) then your connections with S3 will be encrypted. Otherwise your data will be sent in clear form, i.e. easy to intercept by malicious parties.

If you want to prune items from the destination side which are not found on the source side, you can use –delete. Always test this with -n first to make sure the command line you specify is not going to do something terrible to your cherished and irreplaceable data.

Updates and other discussion:


The latest version of s3sync should normally be at: s3.amazonaws.com/ServEdge_pub/s3sync/s3sync.tar.gz and the Amazon S3 forums probably have a few threads going on it at any given time. I may not always see things posted to the threads, so if you want you can contact me at gbs-s3@10forward.com too.

FEATURES/PROBLEMS:

  • FIX (list of features or problems)

SYNOPSIS:

Examples:


(using S3 bucket ‘mybucket’ and prefix ‘pre’)

Put the local etc directory itself into S3
      s3sync.rb  -r  /etc  mybucket:pre
      (This will yield S3 keys named  pre/etc/...)
Put the contents of the local /etc dir into S3, rename dir:
      s3sync.rb  -r  /etc/  mybucket:pre/etcbackup
      (This will yield S3 keys named  pre/etcbackup/...)
Put contents of S3 "directory" etc into local dir
      s3sync.rb  -r  mybucket:pre/etc/  /root/etcrestore
      (This will yield local files at  /root/etcrestore/...)
Put the contents of S3 "directory" etc into a local dir named etc
      s3sync.rb  -r  mybucket:pre/etc  /root
      (This will yield local files at  /root/etc/...)
Put S3 nodes under the key pre/etc/ to the local dir etcrestore
**and create local dirs even if S3 side lacks dir nodes**
      s3sync.rb  -r  --make-dirs  mybucket:pre/etc/  /root/etcrestore
      (This will yield local files at  /root/etcrestore/...)

List all the buckets your account owns: s3cmd.rb listbuckets

Create a new bucket: s3cmd.rb createbucket BucketName

Create a new bucket in the EU: s3cmd.rb createbucket BucketName EU

Find out the location constraint of a bucket:

s3cmd.rb location BucketName

Delete an old bucket you don’t want any more: s3cmd.rb deletebucket BucketName

Find out what’s in a bucket, 10 lines at a time: s3cmd.rb list BucketName 10

Only look in a particular prefix: s3cmd.rb list BucketName:startsWithThis

Look in the virtual “directory” named foo; lists sub-“directories” and keys that are at this level. Note that if you specify a delimiter you must specify a max before it. (until I make the options parsing smarter) s3cmd.rb list BucketName:foo/ 10 /

Delete a key: s3cmd.rb delete BucketName:AKey

Delete all keys that match (like a combo between list and delete): s3cmd.rb deleteall BucketName:SomePrefix

Only pretend you’re going to delete all keys that match, but list them: s3cmd.rb –dryrun deleteall BucketName:SomePrefix

Delete all keys in a bucket (leaving the bucket): s3cmd.rb deleteall BucketName

Get a file from S3 and store it to a local file s3cmd.rb get BucketName:TheFileOnS3.txt ALocalFile.txt

Put a local file up to S3 Note we don’t automatically set mime type, etc. NOTE that the order of the options doesn’t change. S3 stays first! s3cmd.rb put BucketName:TheFileOnS3.txt ALocalFile.txt

A note about [headers]


For some S3 operations, such as “put”, you might want to specify certain headers to the request such as Cache-Control, Expires, x-amz-acl, etc. Rather than supporting a load of separate command-line options for these, I just allow header specification. So to upload a file with public-read access you could say: s3cmd.rb put MyBucket:TheFile.txt x-amz-acl:public-read

If you don’t need to add any particular headers then you can just ignore this whole [headers] thing and pretend it’s not there. This is somewhat of an advanced option.

REQUIREMENTS:

  • FIX (list of requirements)

INSTALL:

sudo gem install frahugo-s3sync

Or if you use bundler, you can point to the source repo:

gem ‘frahugo-s3sync’, :git => ‘git://github.com/frahugo/s3sync.git’

Your environment:


s3sync needs to know several interesting values to work right. It looks for them in the following environment variables -or- a s3config.yml file. In the yml case, the names need to be lowercase (see example file). Furthermore, the yml is searched for in the following locations, in order:

$S3CONF/s3config.yml
$HOME/.s3conf/s3config.yml
/etc/s3conf/s3config.yml

Required: AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY

If you don’t know what these are, then s3sync is probably not the right tool for you to be starting out with. Optional: AWS_S3_HOST - I don’t see why the default would ever be wrong

HTTP_PROXY_HOST,HTTP_PROXY_PORT,HTTP_PROXY_USER,HTTP_PROXY_PASSWORD - proxy

SSL_CERT_DIR - Where your Cert Authority keys live; for verification SSL_CERT_FILE - If you have just one PEM file for CA verification S3SYNC_RETRIES - How many HTTP errors to tolerate before exiting S3SYNC_WAITONERROR - How many seconds to wait after an http error S3SYNC_MIME_TYPES_FILE - Where is your mime.types file S3SYNC_NATIVE_CHARSET - For example Windows-1252. Defaults to ISO-8859-1.

AWS_CALLING_FORMAT - Defaults to REGULAR
    REGULAR   # http://s3.amazonaws.com/bucket/key
    SUBDOMAIN # http://bucket.s3.amazonaws.com/key
    VANITY    # http://<vanity_domain>/key

Important: For EU-located buckets you should set the calling format to SUBDOMAIN Important: For US buckets with CAPS or other weird traits set the calling format to REGULAR

I use “envdir” from the daemontools package to set up my env variables easily: cr.yp.to/daemontools/envdir.html For example: envdir /root/s3sync/env /root/s3sync/s3sync.rb -etc etc etc I know there are other similar tools out there as well.

You can also just call it in a shell script where you have exported the vars first such as: #!/bin/bash export AWS_ACCESS_KEY_ID=valueGoesHere … s3sync.rb -etc etc etc

But by far the easiest (and newest) way to set this up is to put the name:value pairs in a file named s3config.yml and let the yaml parser pick them up. There is an .example file shipped with the tar.gz to show what a yaml file looks like. Thanks to Alastair Brunton for this addition.

You can also use some combination of .yaml and environment variables, if you want. Go nuts.

LICENSE:

(The MIT License)

Copyright © 2009 FIXME full name

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the ‘Software’), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED ‘AS IS’, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.