fileconveyor: A Python repository from pallinger

Description
-----------
File Conveyor is designed to discover new, changed and deleted files via the
operating system's built-in file system monitor. After discovering the files,
they can be optionally be processed by a chain of processors – you can easily
write new ones yourself. After files have been processed, they can also
optionally be transported to a server.

Discovery happens through inotify on Linux (with kernel >= 2.6.13), through
FSEvents on Mac OS X (>= 10.5) and through polling on other operating systems.

Processors are simple Python scripts that can change the file's base name (it
is impossible to change the path) and apply any sort of processing to the
file's contents. Examples are image optimization and video transcoding.

Transporters are simple threaded abstractions around Django storage systems.

For a detailed description of the innards of file conveyor, see my bachelor
thesis text (find it via http://wimleers.com/tags/bachelor-thesis).

This application was written as part of the bachelor thesis [1] of Wim Leers
at Hasselt University [2].


[1] http://wimleers.com/tags/bachelor-thesis
[2] http://uhasselt.be/


<BLINK>IMPORTANT WARNING</BLINK>
--------------------------------
I've attempted to provide a solid enough README to get you started, but I'm
well aware that it isn't superb. But as this is just a bachelor thesis, time
was fairly limited. I've opted to create a solid basis instead of an extremely
rigourously documented piece of software. If you cannot find the answer in the
README.txt, nor the INSTALL.txt, nor the API.txt files, then please look at
my bachelor thesis text instead. If neither of that is sufficient, then please
contact me.


Upgrading
---------
If you're upgrading from a previous version of File Conveyor, please run
upgrade.py.



==============================================================================
| The basics                                                                 |
==============================================================================

Configuring File Conveyor
-------------------------
The sample configuration file (config.sample.xml) should be self explanatory.
Copy this file to config.xml, which is the file File Conveyor will look for,
and edit it to suit your needs.
For a detailed description, see my bachelor thesis text (look for the
"Configuration file design" section).

Each rule consists of 3 components:
- filter
- processorChain
- destinations

A rule can also be configured to delete source files after they have been
synced to the destination(s).

The filter and processorChain components are optional. You must have at least
one destination.
If you want to use File Conveyor to process files locally, i.e. without
transporting them to a server, then use the Symlink or Copy transporter (see
below).


Starting File Conveyor
----------------------
File Conveyor must be started by starting its arbitrator (which links
everything together; it controls the file system monitor, the processor
chains, the transporters and so on). You can start the arbitrator like this:
  python /path/to/fileconveyor/arbitrator.py


Stopping File Conveyor
----------------------
File Conveyor listens to standard signals to know when it should end, like the
Apache HTTP server does too. Send the TERMinate signal to terminate it:
  kill -TERM `cat ~/.fileconveyor.pid`

You can configure File Conveyor to store the PID file in the more typical
/var/run location on *nix:
* You can change the PID_FILE setting in settings.py to 
/var/run/fileconveyor.pid. However, this requires File Conveyor to be run with
root permissions (/var/run requires root permissions).
* Alternatively, you can create a new directory in /var/run which then no
longer requires root permissions. This can be achieved through these commands:
 1. sudo mkdir /var/run/fileconveyor`
 2. sudo chown fileconveyor-user /var/run/fileconveyor
 3. sudo chown 700 /var/run/fileconveyor
Then, you can change the PID_FILE setting in settings.py to
/var/run/fileconveyor/fileconveyor.pid, and you won't need to run File 
Conveyor with root permissions anymore.


File Conveyor's behavior
------------------------
Upon startup, File Conveyor starts the file system monitor and then performs a
"manual" scan to detect changes since the last time it ran. If you've got a
lot of files, this may take a while.

Just for fun, type the following while File Conveyor is syncing:
  killall -9 python
Now File Conveyor is dead. Upon starting it again, you should see something like:
  2009-05-17 03:52:13,454 - Arbitrator                - WARNING  - Setup: initialized 'pipeline' persistent queue, contains 2259 items.
  2009-05-17 03:52:13,455 - Arbitrator                - WARNING  - Setup: initialized 'files_in_pipeline' persistent list, contains 47 items.
  2009-05-17 03:52:13,455 - Arbitrator                - WARNING  - Setup: initialized 'failed_files' persistent list, contains 0 items.
  2009-05-17 03:52:13,671 - Arbitrator                - WARNING  - Setup: moved 47 items from the 'files_in_pipeline' persistent list into the 'pipeline' persistent queue.
  2009-05-17 03:52:13,672 - Arbitrator                - WARNING  - Setup: moved 0 items from the 'failed_files' persistent list into the 'pipeline' persistent queue.
As you can see, 47 items were still in the pipeline when File Conveyor was
killed. They're now simply added to the pipeline queue again and they will be
processed once again.


The initial sync
----------------
To get a feeling of File Conveyor's speed, you may want to run it in the console
and look at its output.


Verifying the synced files
--------------------------
Running the verify.py script will open the synced files database and verify
that each synced file actually exists.




==============================================================================
| Processors                                                                 |
==============================================================================

Addressing processors
---------------------

You can address a specific processor by first specifying its processor module
and then the exact processor name (which is its class name):
- unique_filename.MD5
- image_optimizer.KeepMetadata
- yui_compressor.YUICompressor
- link_updater.CSSURLUpdater

But, it works with third-party processors too! Just make sure the third-party
package is in the Python path and then you can just use this in config.xml:
- MyProcessorPackage.SomeProcessorClass


Processor module: filename
--------------------------
Available processors:
1) SpacesToUnderscores
   Changes a filename; replaces spaces by underscores. E.g.:
     this is a test.txt --> this_is_a_test.txt
2) SpacesToDashes
Changes a filename; replaces spaces by dashes. E.g.:
  this is a test.txt --> this-is-a-test.txt


Processor module: unique_filename
---------------------------------
Available processors:
1) Mtime
   Changes a filename based on the file's mtime. E.g.:
     logo.gif --> logo_1240668971.gif
2) MD5
   Changes a filename based on the file's MD5 hash. E.g.:
     logo.gif --> logo_2f0342a2b9aaf48f9e75aa7ed1d58c48.gif


Processor module: image_optimizer
---------------------------------
It's important to note that all metadata is stripped from JPEG images, as that
is the most effective way to reduce the image size. However, this might also
strip copyright information, i.e. this can also have legal consequences.
Choose one of the "keep metadata" classes if you want to avoid this.
When optimizing GIF images, they are converted to the PNG format, which also
changes their filename.

Available processors:
1) Max
   optimizes image files losslessly (GIF, PNG, JPEG, animated GIF)
2) KeepMetadata
   same as Max, but keeps JPEG metadata
3) KeepFilename
   same as Max, but keeps the original filename (no GIF optimization)
4) KeepMetadataAndFilename
   same as Max, but keeps JPEG metadata and the original filename (no GIF
   optimization)


Processor module: yui_compressor
--------------------------------
Warning: this processor is CPU-intensive! Since you typically don't get new
CSS and JS files all the time, it's still fine to use this. But the initial
sync may cause a lot of CSS and JS files to be processed and thereby cause a
lot of load!

Available processors:
1) YUICompressor
   Compresses .css and .js files with the YUI Compressor


Processor module: google_closure_compiler
-----------------------------------------
Warning: this processor is CPU-intensive! Since you typically don't get new
JS files all the time, it's still fine to use this. But the initial sync may
cause a lot of JS files to be processed and thereby cause a lot of load!

Available processors:
1) GoogleClosureCompiler
   Compresses .js files with the Google Closure Compiler


Processor module: link_updater
------------------------------
Warning: this processor is CPU-intensive! Since you typically don't get new
CSS files all the time, it's still fine to use this. But the initial sync may
cause a lot of CSS files to be processed and thereby cause a lot of load! Note
that this processor will skip processing a CSS file if not all files that are
referenced from it, have been synced to the CDN yet. Which means the CSS files
may need to parsed over and over again until the images have been synced.

It seems this processor is suited for optimization. It uses the cssutils
Python module, which validates every CSS property. This is an enormous slow-
down: on a 2.66 GHz Core 2 Duo, it causes 100% CPU usage every time it runs.
This module also seems to suffer from rather massive memory leaks. Memory
usage can easily top 30 MB on Mac OS X where it would never go over 17 MB
without this processor!

This processor will replace all URLs in CSS files with references to their
counterparts on the CDN. There are a couple of important gotchas to use this
processor module:
 - absolute URLs (http://, https://) are ignored, only relative URLs are
   processed
 - if a referenced file doesn't exist, its URL will remain unchanged
 - if one of the referenced images or fonts is changed and therefor resynced,
   and if it is configured to have a unique filename, the CDN URL referenced
   from the updated CSS file will no longer be valid. Therefor, when you
   update an image file or font file that is referenced by CSS files, you
   should modify the CSS files as well. Just modifying the mtime (by using the
   touch command) is sufficient.
 - it requires the referenced files to be synced to the same server the CSS
   file is being synced to. This implies that all the references files must
   also be synced to the same server, or the file will never get synced!

Available processors:
1) CSSURLUpdater
   Replaces URLs in .css files with their counterparts on the CDN




==============================================================================
| Transporters                                                               |
==============================================================================

Addressing transporters
-----------------------

You can address a specific transporter by only specifying its module:
- cf
- ftp
- mosso
- s3
- sftp
- symlink_or_copy

But, it works with third-party transporters too! Just make sure the
third-party package is in the Python path and then you can just use this in
config.xml:
- MyTransporterPackage


Transporter: FTP (ftp)
----------------------
Value to enter: "ftp".

Available settings:
- host
- username
- password
- url
- port
- path
- key


Transporter: SFTP (sftp)
------------------------
Value to enter: "sftp".

Available settings:
- host
- username
- password
- url
- port
- path


Transporter: Amazon S3
----------------------
Value to enter: "s3".

Available settings:
- access_key_id
- secret_access_key
- bucket_name
- bucket_prefix

More than 4 concurrent connections doesn't show a significant speedup.


Transporter: Amazon CloudFront
------------------------------
Value to enter: "cf".

Available settings:
- access_key_id
- secret_access_key
- bucket_name
- bucket_prefix
- distro_domain_name



Transporter: Mosso CloudFiles
---------------------------------
Value to enter: "mosso".

Available settings:
- username
- api_key
- container


Transporter: Symlink or Copy
----------------------------
Value to enter: "symlink_or_copy".

Available settings:
- location
- url


Transporter: Amazon CloudFront - Creating a CloudFront distribution
-------------------------------------------------------------------
You can either use the S3Fox Firefox add-on to create a distribution or use
the included Python function to do so. In the latter case, do the following:

>>> import sys
>>> sys.path.append('/path/to/fileconveyor/transporters')
>>> sys.path.append('/path/to/fileconveyor/dependencies')
>>> from transporter_cf import create_distribution
>>> create_distribution("access_key_id", "secret_access_key", "bucketname.s3.amazonaws.com")
Created distribution
    - domain name: dqz4yxndo4z5z.cloudfront.net
    - origin: bucketname.s3.amazonaws.com
    - status: InProgress
    - comment: 
    - id: E3FERS845MCNLE

    Over the next few minutes, the distribution will become active. This
    function will keep running until that happens.
    ............................
    The distribution has been deployed!




==============================================================================
| The advanced stuff                                                         |
==============================================================================

Constants in Arbitrator.py
--------------------------
The following constants can be tweaked to change where File Conveyor stores
its files, or to change its behavior.

RESTART_AFTER_UNHANDLED_EXCEPTION = True
  Whether File Conveyor should restart itself after it encountered an
  unhandled exception (i.e., a bug).
RESTART_INTERVAL = 10
  After how much time File Conveyor should restart itself, after it has
  encountered an unhandled exception. Thus, this setting only has an effect
  when RESTART_AFTER_UNHANDLED_EXCEPTION == True.
LOG_FILE = './fileconveyor.log'
  The log file.
PERSISTENT_DATA_DB = './persistent_data.db'
  Where to store persistent data (pipeline queue, 'files in pipeline' list and
  'failed files' list).
SYNCED_FILES_DB = './synced_files.db'
  Where to store the input_file, transported_file_basename, url and server for
  each synced file.
WORKING_DIR = '/tmp/fileconveyor'
  The working directory.
MAX_FILES_IN_PIPELINE = 50
  The maximum number of files in the pipeline. Should be high enough in order
  to prevent transporters from idling too long.
MAX_SIMULTANEOUS_PROCESSORCHAINS = 1
  The maximum number of processor chains that may be executed simultaneously.
  If you've got CPU intensive processors and if you're running File Conveyor
  on the web server, you'll want to keep this very low, probably at 1.
MAX_SIMULTANEOUS_TRANSPORTERS = 10
  The maximum number of transporters that may be running simultaneously. This
  effectively caps the number of simultaneous connections. It can also be used
  to have some -- although limited -- control on the throughput consumed by
  the transporters.
MAX_TRANSPORTER_QUEUE_SIZE = 1
  The maximum of files queued for each transporters. It's recommended to keep
  this low enough to ensure files are not unnecessarily waiting. If you set
  this too high, no new transporters will be spawned, because all files will
  be queued on the existing transporters. Setting this to 0 can only be
  recommended in environments with a continuous stream of files that need
  syncing. The default of 1 is to ensure each transporter is idling as little
  as possible.
QUEUE_PROCESS_BATCH_SIZE = 20
  The number of files that will be processed when processing one of the many
  queues. Setting this too low will cause overhead. Setting this too high will
  cause delays for files that are ready to be processed or transported. See
  the "Pipeline design pattern" section in my bachelor thesis text.
CALLBACKS_CONSOLE_OUTPUT = False
  Controls whether output will be generated for each callback. (There are
  callbacks for the file system monitor, processor chains and transporters.)
CONSOLE_LOGGER_LEVEL = logging.WARNING
  Controls the output level of the logging to the console. For a full list of
  possibilities, see http://docs.python.org/release/2.6/library/logging.html#logging-levels.
FILE_LOGGER_LEVEL = logging.DEBUG
  Controls the output level of the logging to the console. For a full list of
  possibilities, see http://docs.python.org/release/2.6/library/logging.html#logging-levels.
RETRY_INTERVAL = 30
  Sets the interval in which the 'failed files' list is appended to the
  pipeline queue, to retry to sync these failed files.


Understanding persistent_data.db
--------------------------------
We'll go through this by using a sample database I created. You should be able
to reproduce similar output on your persistent_data.db file using the exact
same commands.
Access the database, by using the SQLite console application.
  $ sqlite3 persistent_data.db
  SQLite version 3.6.11
  Enter ".help" for instructions
  Enter SQL statements terminated with a ";"
  sqlite>

As you can see, there are three tables in the database, one for every
persistent data structure:
  sqlite> .table
  failed_files_list  pipeline_list      pipeline_queue

Simple count queries show how many items there are in each persistent data
structure. In this case for example, there are 2560 files waiting to enter the
pipeline, 50 were in the pipeline at the time of stopping File Conveyor (these
will be added to the queue again once we restart File Conveyor) and 0 files
are in the list of failed files. Files end up in there when their processor
chain or (one of) their transporters fails.
  sqlite> SELECT COUNT(*) FROM pipeline_queue;
  2560
  sqlite> SELECT COUNT(*) FROM pipeline_list;
  50
  sqlite> SELECT COUNT(*) FROM failed_files_list;
  0

You can also look at the database schemas of these tables:
  sqlite> .schema pipeline_queue
  CREATE TABLE pipeline_queue(id INTEGER PRIMARY KEY AUTOINCREMENT, item pickle);
  sqlite> .schema pipeline_list
  CREATE TABLE pipeline_list(id INTEGER PRIMARY KEY AUTOINCREMENT, item pickle);
  sqlite> .schema failed_files_list
  CREATE TABLE failed_files_list(id INTEGER PRIMARY KEY AUTOINCREMENT, item pickle);

As you can see, the three tables have identical schemas. the type for the
stored item is 'pickle', which means that you can store any Python object in
there as long as it can be "pickled", which means as much as "convertable to
a string representation". "Serialization" is the term PHP developers have
given to this, although pickling is much more advanced.
The Python object stored in there is the same for all three tables: a tuple of
the filename (as a string) and the event (as an integer). The event is one of
FSMonitor.CREATED, FSMonitor.MODIFIED, FSMonitor.DELETED.

This file is what tracks the curent state of File Conveyor. Thanks to this file,
it is possible for File Conveyor to crash and not lose any data.
Deleting this file would cause File Conveyor to lose all of its current work.
Only new (as in: after the file was deleted) changes in the file system would
be picked up. Changes that still had to be synced, would be forgotten.


Understanding fsmonitor.db
--------------------------
This database has a single table: pathscanner (which is inherited from the
pathscanner module around which the fsmonitor module is built). Its schema is:

  sqlite> .schema pathscanner
  CREATE TABLE pathscanner(path text, filename text, mtime integer);

This file is what tracks the current state of the directory tree associated
with each source. When an operating system's file system monitor is used, this
database will be updated through its callbacks. When no such file system
monitor is available, it will be updated through polling.
Deleting this file would cause File Conveyor to have to sync all files again.


Understanding synced_files.db
-----------------------------
We'll go through this by using a sample database I created. You should be able
to reproduce similar output on your synced_files.db file using the exact
same commands.
Access the database, by using the SQLite console application.
  $ sqlite3 synced_files.db 
  SQLite version 3.6.11
  Enter ".help" for instructions
  Enter SQL statements terminated with a ";"
  sqlite>
  
As you can see, there's only one table: synced_files.
  sqlite> .table
  synced_files

Let's look at the schema. There are 4 fields: input_file,
transported_file_basename, url and server. input_file is the full path.
transported_file_basename is the base name of the file that was transported to
the server. This is stored because the filename might have been altered by the
processors that have been applied to it, but the path cannot change. I use
this to delete the previous version of a file if a file has been modified. The
url field is of course the URL to retrieve the file from the server. Finally,
the server field contains the name you've assigned to the server in the
configuration file. Each file may be synced to multiple servers and this
allows you to check if a file has been synchronized to a specific server.
  sqlite> .schema synced_files
  CREATE TABLE synced_files(input_file text, transported_file_basename text, url text, server text);

We can again use simple count queries to learn more about the synced files. As
you can see, 845 files have been synced, of which 602 have been synced to a
the server that was named "origin pull cdn" and 243 to the server that was
named "ftp push cdn".
  sqlite> SELECT COUNT(*) FROM synced_files;
  845
  sqlite> SELECT COUNT(*) FROM synced_files WHERE server="origin pull cdn";
  602
  sqlite> SELECT COUNT(*) FROM synced_files WHERE server="ftp push cdn";
  243


License
-------
This application is dual-licensed under the GPL and the UNLICENSE.
  
Due to the dependencies that were initially included within File Conveyor,
which were all subject to GPL-compatible licenses, it made sense to initially
release the source code under the GPL.
Then, it was decided the UNLICENSE was a better fit.


Author
------
Wim Leers ~ http://wimleers.com/

This application was written as part of the bachelor thesis of Wim Leers at
Hasselt University.
pallinger/fileconveyor