filehub: A Java repository from paulophgf

FileHub is a service that standardizes file management independently of the storage platform used. It also simplifies file persistence across multiple storage locations by acting as a request gateway that is safe and easy to use.

Sections

Configuration
Concepts
- Storage
- Schema
- Trigger
Operations
API Documentation
Docker Configuration

Configuration

FileHub uses an XML configuration file that defines its properties and some rules. The file can be stored locally, on the machine where the service runs, or remotely in a Git repository. The following table shows the environment variables that define where the configuration file is located:

Variable name	Description
CONFIG_TYPE *	Defines whether the configuration file is local or remote. Default value: LOCAL_FILE. Possible values: LOCAL_FILE, GIT_FILE
LOCAL_FILE_PATH	Used when the configuration file is local. Path of the configuration file in the operating system. Example: C:/filehub/example.xml
CONFIG_GIT_FILE_PATH	Address (file URL) of the configuration file in the Git repository. Use the raw file URL (plain text)
CONFIG_GIT_FILE_TOKEN	Git repository authentication token
MAX_FILE_SIZE	Maximum file size allowed. Default value: 7000000000
MAX_REQUEST_SIZE	Maximum request size allowed. Default value: 7000000000

* Required
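
As a rough illustration of how these variables combine, the sketch below picks the configuration source from a map of environment values. The class ConfigLocation and its methods are hypothetical and are not part of FileHub. Treating a missing path as an error, and the token as optional, are assumptions made for the example.

```java
import java.util.Map;

// Hypothetical helper, not part of FileHub: picks the configuration source
// from the environment variables listed in the table above.
final class ConfigLocation {

    static String describe(Map<String, String> env) {
        // CONFIG_TYPE falls back to LOCAL_FILE, the documented default value.
        String type = env.getOrDefault("CONFIG_TYPE", "LOCAL_FILE");
        switch (type) {
            case "LOCAL_FILE":
                return "local file: " + require(env, "LOCAL_FILE_PATH");
            case "GIT_FILE":
                // Assumption: the token is optional here (e.g. a public repository).
                return "git file: " + require(env, "CONFIG_GIT_FILE_PATH");
            default:
                throw new IllegalArgumentException("Unknown CONFIG_TYPE: " + type);
        }
    }

    // Assumption: a missing path is reported as an error instead of being ignored.
    private static String require(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.isBlank()) {
            throw new IllegalStateException(name + " must be set");
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(describe(Map.of("LOCAL_FILE_PATH", "C:/filehub/example.xml")));
    }
}
```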

Concepts

Before running the service, you must define which storage platforms will be used and configure the access parameters of each one independently. To do that, FileHub uses an XML file that is read when the service starts. The file contains the elements used to process requests. Each element is explained below:

Storage

A storage represents a storage platform. It has an ID that identifies it inside the service and a type. Each type corresponds to a service or storage platform, for example an FTP server, a cloud service such as AWS S3, or a local directory on the machine where FileHub is running. Each type has its own access properties and specifications.

In the configuration file, storages are defined inside the storages tag, as shown in the example below:

<filehub>
   <storages>
       <storage id="S3-Test" type="AWS_S3">
           <region>us-east-2</region>
           <secretKeyId>G5HG4G66RDYIYE1</secretKeyId>
           <secretKey>6F51E6f1e6F7A2E4F761F61fd51s1F</secretKey>
           <bucket>test</bucket>
       </storage>
       <storage id="FileSystem-Test" type="FILE_SYSTEM">
           <baseDir>C:\Users\user\filehub</baseDir>
       </storage>
   </storages>
</filehub>

_{Storage declaration example}

All storage elements have an ID and a type. The ID will identify the storage and the type will define which configuration properties the storage has. The storage types are listed next:

Local File System	Uses a directory on the server where FileHub is running as a storage. Type: FILE_SYSTEM
Properties:
- baseDir: root directory

Amazon S3	Uses an S3 bucket as a storage. Type: AWS_S3
Properties:
- region: AWS region (e.g. sa-east-1)
- secretKeyId: IAM user ID
- secretKey: IAM user secret
- bucket: S3 bucket name
- baseDir: root directory

Google Cloud Storage	Links to a Google Cloud Storage bucket. Type: GOOGLE_CLOUD
Properties:
- jsonCredentials: JSON object generated for a service account (APIs & Services > Credentials > Service Accounts > Keys). Download the key, copy the JSON object and paste it inside the jsonCredentials tag
- bucket: storage bucket name
- baseDir: root directory

Dropbox	Links to a Dropbox account. Type: DROPBOX
Limitations:
- Access token: the refresh token operation is not implemented. It is necessary to generate a new token when you use this kind of storage
- File size: the maximum file size is 150 MB. Operations with larger files will not work
Properties:
- accessToken: access token
- baseDir: root directory

Schema

A schema represents a set of storages. Every operation performed on FileHub, whether upload or download, must specify which schema to use. FileHub does not perform operations directly on a storage element. It uses a schema that represents one or more storages.

Schemas are declared inside the schemas tag, which can contain more than one schema. Every schema has a name that identifies it in requests to FileHub. Storages are linked to a schema with the storage-id tag. The example below shows a schema configuration with two linked storages.

<filehub>
    <storages>
        <storage id="S3-Test" type="AWS_S3">
            <region>us-east-2</region>
            <secretKeyId>G5HG4G66RDYIYE1</secretKeyId>
            <secretKey>6F51E6f1e6F7A2E4F761F61fd51s1F</secretKey>
            <bucket>test</bucket>
        </storage>
        <storage id="FileSystem-Test" type="FILE_SYSTEM">
            <baseDir>C:\Users\user\filehub</baseDir>
        </storage>
    </storages>
    <schemas>
        <schema name="MySchema">
            <storage-id>FileSystem-Test</storage-id>
            <storage-id>S3-Test</storage-id>
        </schema>
    </schemas>
</filehub>

_{Schema declaration example}

Auto Schemas

You do not need to declare a schema for each storage in order to operate on storages individually. FileHub can create a schema for a storage while it reads the configuration file. To do that, set the generate-schema attribute on the storage, using the name of the schema to be created as its value. See the example below:

<filehub>
    <storages>
        <storage id="S3-Test" type="AWS_S3" generate-schema="s3test">
            <region>us-east-2</region>
            <secretKeyId>G5HG4G66RDYIYE1</secretKeyId>
            <secretKey>6F51E6f1e6F7A2E4F761F61fd51s1F</secretKey>
            <bucket>test</bucket>
        </storage>
        <storage id="FileSystem-Test" type="FILE_SYSTEM">
            <baseDir>C:\Users\user\filehub</baseDir>
        </storage>
    </storages>
</filehub>

_{Example of schema creation directly on storage}

The generate-schema attribute can also be used on the storages element to create a single schema containing all existing storages. See the example below:

<filehub>
    <storages generate-schema="all">
        <storage id="S3-Test" type="AWS_S3">
            <region>us-east-2</region>
            <secretKeyId>G5HG4G66RDYIYE1</secretKeyId>
            <secretKey>6F51E6f1e6F7A2E4F761F61fd51s1F</secretKey>
            <bucket>test</bucket>
        </storage>
        <storage id="FileSystem-Test" type="FILE_SYSTEM">
            <baseDir>C:\Users\user\filehub</baseDir>
        </storage>
    </storages>
</filehub>

_{Example of schema creation with all existing storages}

Warning If an auto schema is created and no default trigger is configured, the schema will not have any kind of security.

Trigger

Triggers are used to guarantee security on operations. They work as webhooks that validate whether an operation is authorized by another service or application.

The trigger element has an ID that identifies it and an action attribute that accepts two values:

ALL: the trigger applies to every kind of operation, whether writing (creation/update/deletion) or reading (download);
UPDATE: the trigger applies only to writing operations (creation/update/deletion).

Warning The term default is a special value and cannot be used as the ID of a trigger.

A trigger is configured with three properties:

header: the name of the header that is sent to the authorization service.
url: the service endpoint that validates the request. The goal of the call is to check whether the header value is valid. If the response code is not 200 (OK), the operation is canceled.
http-method (optional): the HTTP method used on the request (GET, HEAD, POST, PUT, PATCH, DELETE, OPTIONS). The default value is GET.

In the XML configuration file, triggers are defined inside the triggers tag. A trigger should be linked to a schema. That link is created through the trigger attribute of the schema tag. All storages inside the schema apply the trigger during their operations.

For clarification, see the following configuration example:

<filehub>
    <storages>
        <storage id="example" type="FILE_SYSTEM">
            <baseDir>C:\Users\user\filehub</baseDir>
        </storage>
    </storages>
    <trigger id="user-auth" action="ALL">
        <url>http://10.0.0.10:8080/auth</url>
        <header>myheader</header>
        <http-method>GET</http-method>
    </trigger>
    <schemas>
        <schema name="test" trigger="user-auth">
            <storage-id>example</storage-id>
        </schema>
    </schemas>
</filehub>

_{Trigger declaration example}

The trigger user-auth is created and the schema test uses it. In other words, every operation on the storage example calls the trigger to check authorization.

The flowchart below shows the process for an upload operation with the previous configuration.

_{Flowchart of file uploading with trigger}

The application that uses the FileHub service should send the header configured in the trigger, with a value. When FileHub receives the request, it calls the endpoint configured in the trigger and forwards the header to the authorization service for validation. A JWT is a good example of a value to send in that header.

When a trigger sends a request to the configured URL, it includes the following request body:

schema: the selected schema name
operation: the operation type executed (CREATE_DIRECTORY, RENAME_DIRECTORY, DELETE_DIRECTORY, LIST_FILES, EXIST_DIRECTORY, UPLOAD_MULTIPART_FILE, UPLOAD_BASE64_FILE, DOWNLOAD_FILE, DELETE_FILE, EXIST_FILE, GET_FILE_DETAILS)
path: the path provided in the request
filenames: a list with file names used on the request (usually for upload operations)

The following JSON shows a request body example:

{
  "schema": "test",
  "operation": "UPLOAD_MULTIPART_FILE",
  "path": "/accounts/users/avatar/",
  "filenames": [ "MyAvatar.jpeg" ]
}
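
To make the contract concrete, here is a minimal sketch of an authorization service that a trigger could call, written with the JDK's built-in HTTP server. The class TriggerAuthService, the token value demo-token and the 401 status for rejected requests are assumptions for illustration. The only behavior taken from this document is that a 200 (OK) response authorizes the operation and any other code cancels it. The main method runs one self-check request against the service on a free port and then stops it.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical authorization service for the "user-auth" trigger; not part of FileHub.
final class TriggerAuthService {

    // Assumption: a fixed demo value. A real service would verify something like a JWT.
    static final String VALID_TOKEN = "demo-token";

    // 200 authorizes the operation; any other status makes FileHub cancel it.
    static int statusFor(String headerValue) {
        return VALID_TOKEN.equals(headerValue) ? 200 : 401;
    }

    static HttpServer start(int port) throws java.io.IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/auth", exchange -> {
            // "myheader" is the header name configured in the trigger element.
            String value = exchange.getRequestHeaders().getFirst("myheader");
            // The JSON body (schema, operation, path, filenames) could be read
            // from exchange.getRequestBody() to apply finer-grained rules.
            exchange.sendResponseHeaders(statusFor(value), -1); // -1 means no response body
            exchange.close();
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = start(0); // port 0 lets the OS pick a free port for this demo
        URI uri = URI.create("http://localhost:" + server.getAddress().getPort() + "/auth");
        HttpRequest request = HttpRequest.newBuilder(uri)
                .header("myheader", VALID_TOKEN)
                .GET()
                .build();
        int status = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.discarding())
                .statusCode();
        System.out.println("status = " + status);
        server.stop(0);
    }
}
```

In a real deployment the service would listen on the address given in the trigger's url element (http://10.0.0.10:8080/auth in the example configuration).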

Another purpose of triggers is to allow customized paths for the files. To explain that, imagine a system where each user has a directory to store their images. The URLs would look like the following:

/schema/example/user/paul/photo01
/schema/example/user/paul/photo02
/schema/example/user/john/photo01
/schema/example/user/john/photo02
/schema/example/user/john/photo03

To perform an upload or download operation, the consumer application must include the identifier of the logged-in user in the FileHub URL. However, if the consumer application is a web interface, the user can change that identifier, which compromises the security of the files managed by FileHub. To deal with this problem, the trigger endpoint can return a list of parameters that FileHub uses to replace parts of the URL before completing an operation. The following sequence diagram shows that process:

_{Sequence diagram of trigger communication}

Each parameter returned in the Authorization Service response should have the same name as the parameter used in the operation URL ($user = user).
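
The replacement step can be sketched as follows. This is only an illustration of the idea, not FileHub's implementation: the class PathParameters is hypothetical, the placeholder syntax is assumed to be a dollar sign followed by the parameter name, and failing on a missing parameter is an assumption.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of FileHub: replaces $name placeholders in a path
// with the values returned by the authorization service.
final class PathParameters {

    // Assumption: a placeholder is "$" followed by an identifier, e.g. $user.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$([A-Za-z_][A-Za-z0-9_]*)");

    static String resolve(String path, Map<String, String> params) {
        Matcher matcher = PLACEHOLDER.matcher(path);
        StringBuilder resolved = new StringBuilder();
        while (matcher.find()) {
            String name = matcher.group(1);
            String value = params.get(name);
            if (value == null) {
                // Assumption: an unresolved placeholder is treated as an error.
                throw new IllegalArgumentException("No value returned for parameter: " + name);
            }
            // quoteReplacement keeps "$" and "\" in the value from being interpreted.
            matcher.appendReplacement(resolved, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(resolved);
        return resolved.toString();
    }

    public static void main(String[] args) {
        // prints /schema/example/user/paul/photo01
        System.out.println(resolve("/schema/example/user/$user/photo01", Map.of("user", "paul")));
    }
}
```

Because the value comes from the authorization service and not from the request, a client cannot reach another user's directory by editing the URL.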

Note The file name can also be modified by the Authorization Service response. Use the filename parameter to do that.

Warning If a trigger has the action attribute set to UPDATE and the authorization header is present in the request, the trigger will still call the configured endpoint, even for reading operations.

Default Trigger

You can create a trigger that is called for every schema that has no trigger explicitly set. To do that, use the default attribute on the trigger, as shown in the example below:

<filehub>
    <storages>
        <storage id="example" type="FILE_SYSTEM">
            <baseDir>C:\Users\user\filehub</baseDir>
        </storage>
    </storages>
    <trigger id="user-auth" action="ALL" default="true">
        <url>http://10.0.0.10:8080/auth</url>
        <header>myheader</header>
        <http-method>GET</http-method>
    </trigger>
</filehub>

_{Default trigger example}

Operations

Now that the main FileHub concepts are clear, the next step is to see which operations the service can execute.

Directories

Directories are used to group and organize files. Most storages treat a directory as a special type of file, but some, such as AWS S3, treat it as a prefix. In that case, the prefix and the filename together form the file's identification key inside a bucket. FileHub provides directory management with the following operations:

Create a new directory
Rename a directory
Delete a directory
List the existing files inside the directory, including other directories
Check if the directory exists

Disable directory operations

To disable directory operations, use the no-dir attribute on a trigger, as shown in the example below:

<trigger id="user-auth" action="ALL" no-dir="true">
    <url>http://10.0.0.10:8080/auth</url>
    <header>myheader</header>
    <http-method>GET</http-method>
</trigger>

_{Example of trigger with disabled directories}

Upload

An upload operation sends files that are saved in all storages linked to a schema. When FileHub receives the upload request and the file transfer begins, it can send the file to the storages in two ways:

Sequential transfer: the default transfer type. FileHub transfers the files to each storage one after another, following the order in which the storages are declared in the schema.
Parallel transfer: FileHub transfers the files to the storages at the same time. In this case, there is no specific transfer order. To use parallel transfer, set the parallel-upload attribute to true in the schema tag.

<schemas>
    <schema name="test-parallel" parallel-upload="true">
        <storage-id>example</storage-id>
    </schema>
</schemas>

_{Parallel transfer configuration example}

Regardless of the transfer type, the upload request only returns a response after the file has been transferred to all storages in the schema.
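
The difference between the two modes can be sketched with in-memory stand-ins for the storages. The class UploadSketch and its nested Storage type are hypothetical and only illustrate the ordering and waiting behavior described above. They do not reflect how FileHub is implemented.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not FileHub code: sequential versus parallel transfer.
final class UploadSketch {

    // In-memory stand-in for a storage platform.
    static final class Storage {
        final String id;
        final Map<String, byte[]> files = new ConcurrentHashMap<>();

        Storage(String id) {
            this.id = id;
        }

        void write(String path, byte[] content) {
            files.put(path, content);
        }
    }

    // Returns only after every storage in the schema has received the file.
    static void upload(List<Storage> storages, String path, byte[] content, boolean parallel) {
        if (!parallel) {
            // Sequential: one storage after another, in declaration order.
            for (Storage storage : storages) {
                storage.write(path, content);
            }
            return;
        }
        // Parallel: start all transfers, then wait for all of them to finish.
        CompletableFuture<?>[] transfers = storages.stream()
                .map(storage -> CompletableFuture.runAsync(() -> storage.write(path, content)))
                .toArray(CompletableFuture[]::new);
        CompletableFuture.allOf(transfers).join();
    }

    public static void main(String[] args) {
        List<Storage> schema = List.of(new Storage("FileSystem-Test"), new Storage("S3-Test"));
        upload(schema, "/accounts/users/avatar/MyAvatar.jpeg", new byte[] {1, 2, 3}, true);
        schema.forEach(s -> System.out.println(s.id + " has " + s.files.size() + " file(s)"));
    }
}
```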

Middle-Storage

When a schema has only one storage and the files are small, the transfer completes quickly. However, when larger files must be transferred to more than one storage, the request can take a significant amount of time. One way to mitigate that problem is the middle-storage concept.

The middle-storage defines which storage of a schema acts as the intermediary between the consumer application and the rest of the storages. See the following example:

<filehub>
    <storages>
        <storage id="S3-Test" type="AWS_S3">
            <region>us-east-2</region>
            <secretKeyId>G5HG4G66RDYIYE1</secretKeyId>
            <secretKey>6F51E6f1e6F7A2E4F761F61fd51s1F</secretKey>
            <bucket>test</bucket>
        </storage>
        <storage id="FileSystem-Test" type="FILE_SYSTEM">
            <baseDir>C:\Users\user\filehub</baseDir>
        </storage>
    </storages>
    <schemas>
        <schema name="myschema" middle="FileSystem-Test">
            <storage-id>FileSystem-Test</storage-id>
            <storage-id>S3-Test</storage-id>
        </schema>
    </schemas>
</filehub>

_{Middle-storage example}

In the example above, during an upload operation, the FileSystem-Test storage receives the file, the response is returned to the consumer application, and the file is then transferred to the S3-Test storage.
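
A minimal sketch of that flow, using maps as stand-ins for the storages: the file is written to the middle-storage synchronously, and the transfer to the remaining storages continues in the background. The class MiddleStorageSketch and its method are hypothetical. Returning a CompletableFuture is just a convenient way to observe the background step in this example.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not FileHub code: upload through a middle-storage.
final class MiddleStorageSketch {

    // Writes to the middle-storage first; the returned future completes when
    // the background transfer to the other storages has finished.
    static CompletableFuture<Void> upload(Map<String, byte[]> middle,
                                          List<Map<String, byte[]>> others,
                                          String path,
                                          byte[] content) {
        middle.put(path, content);
        // At this point the consumer application could already receive its response.
        return CompletableFuture.runAsync(() -> {
            for (Map<String, byte[]> storage : others) {
                storage.put(path, content);
            }
        });
    }

    public static void main(String[] args) {
        Map<String, byte[]> fileSystem = new ConcurrentHashMap<>(); // stands in for FileSystem-Test
        Map<String, byte[]> s3 = new ConcurrentHashMap<>();         // stands in for S3-Test
        CompletableFuture<Void> background = upload(fileSystem, List.of(s3), "/report.pdf", new byte[] {1});
        System.out.println("middle has file: " + fileSystem.containsKey("/report.pdf"));
        background.join(); // wait only so the demo can show the final state
        System.out.println("S3 has file: " + s3.containsKey("/report.pdf"));
    }
}
```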

Temporary Middle-Storage

A storage defined as the middle-storage but not listed among the schema's storages is a temporary storage. It works like a middle-storage, but its files are deleted after the upload operation.

<schemas>
    <schema name="myschema" middle="FileSystem-Test">
        <storage-id>S3-Test</storage-id>
    </schema>
</schemas>

_{Temporary middle-storage example}

As shown in the example above, the FileSystem-Test storage is not declared in any storage-id element of the schema. In other words, it is a temporary middle-storage.

Download

Unlike the upload operation, which communicates with all the storages in the schema, the download operation uses the first storage of the schema to execute the transfer.

Cache-Storage

The cache attribute affects the download operation. If the file is not in the first storage, FileHub checks whether it exists in the next storage. If the file is there, FileHub downloads it and also saves a copy in the first storage.

<schemas>
    <schema name="myschema" middle="FileSystem-Test" cache="true">
        <storage-id>FileSystem-Test</storage-id>
        <storage-id>S3-Test</storage-id>
    </schema>
</schemas>

_{Cache-storage example}

In the previous example, if a file is missing from the FileSystem-Test storage, FileHub checks whether S3-Test has the file. If it does, the download operation is executed and the file is also transferred to FileSystem-Test. Otherwise, FileHub returns a not found error.
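
The lookup order can be sketched as a simple read-through cache over two maps that stand in for the storages. The class CacheDownloadSketch is hypothetical, and an empty Optional is used here in place of the not found error.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not FileHub code: download with a cache-storage.
final class CacheDownloadSketch {

    static Optional<byte[]> download(Map<String, byte[]> cache, Map<String, byte[]> next, String path) {
        byte[] content = cache.get(path);
        if (content != null) {
            return Optional.of(content); // served directly from the cache-storage
        }
        content = next.get(path);
        if (content == null) {
            return Optional.empty(); // FileHub would answer with a not found error here
        }
        cache.put(path, content); // keep a copy for the following downloads
        return Optional.of(content);
    }

    public static void main(String[] args) {
        Map<String, byte[]> fileSystem = new ConcurrentHashMap<>(); // stands in for FileSystem-Test
        Map<String, byte[]> s3 = new ConcurrentHashMap<>();         // stands in for S3-Test
        s3.put("/photo01", new byte[] {1, 2, 3});

        System.out.println("found: " + download(fileSystem, s3, "/photo01").isPresent());
        System.out.println("cached: " + fileSystem.containsKey("/photo01"));
        System.out.println("missing found: " + download(fileSystem, s3, "/photo99").isPresent());
    }
}
```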

Warning If a middle-storage is linked to the schema, that storage is used for the cache operation. Otherwise, the first storage of the schema is used.

Warning A cache-storage and a temporary middle-storage cannot be configured at the same time.

API Documentation

Run the service and access: http://localhost:8088/swagger-ui/index.html
Apiary Docs: https://filehub.docs.apiary.io

Docker Configuration

DockerHub Link: https://hub.docker.com/repository/docker/paulophgf/filehub

Docker Run Command

docker run -d --name filehub -v {LOCAL_DIR}:/filehub paulophgf/filehub:{FILEHUB_VERSION}

Example:

docker run -d --name filehub -v //c/Users/user/filehub:/filehub paulophgf/filehub:1.0.0

Compose

version: '3.1'

services:

  filehub:
    image: paulophgf/filehub:1.0.0
    hostname: filehub
    container_name: filehub
    restart: always
    networks:
      - filehub-default
    ports:
      - "8088:8088"
    volumes:
      - /etc/hosts:/etc/hosts:ro
      - {LOCAL_DIR}:/filehub # Replace the variable {LOCAL_DIR} | e.g. Win: C:\Users\%user%\filehub or Linux: /filehub
    environment:
      CONFIG_TYPE: "LOCAL_FILE" # Choose one option LOCAL_FILE or GIT_FILE
      LOCAL_FILE_PATH: "filehub/fh-config.xml"
      CONFIG_GIT_FILE_PATH: "" # Fill the variable if you choose GIT_FILE as CONFIG_TYPE
      CONFIG_GIT_FILE_TOKEN: "" # Fill the variable if you choose GIT_FILE as CONFIG_TYPE
      JAVA_OPTS: "-Xms512m -Xmx1024m"

networks:
  filehub-default:
    name: filehub-default