FileHub

FileHub is a service that standardizes file management regardless of the storage platform used. It also simplifies file persistence across multiple storage locations by acting as a request gateway that is safe and easy to use.

Primary language: Java. License: Apache License 2.0.


Read this in other languages: English, Portuguese



Configuration

FileHub uses an XML configuration file that defines its properties and rules. The file can be stored locally, on the machine where the service runs, or remotely, in a Git repository. The following table lists the environment variables that tell FileHub where the configuration file is:

* Required

| Variable name         | Description |
|-----------------------|-------------|
| CONFIG_TYPE *         | Defines whether the configuration file is local or remote. Default value: LOCAL_FILE. Possible values: LOCAL_FILE, GIT_FILE. |
| LOCAL_FILE_PATH       | Used when the configuration file is local. Path of the configuration file in the operating system. Example: C:/filehub/example.xml |
| CONFIG_GIT_FILE_PATH  | Address (file URL) of the configuration file in the Git repository. Use the raw file URL (plain text). |
| CONFIG_GIT_FILE_TOKEN | Git repository authentication token. |
| MAX_FILE_SIZE         | Maximum file size allowed. Default value: 7000000000 |
| MAX_REQUEST_SIZE      | Maximum request size allowed. Default value: 7000000000 |
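
The way these variables combine can be sketched in a few lines of Python. This is only an illustration of the rules in the table, not FileHub's actual startup code; the function name resolve_config_source and the shape of the returned dictionary are hypothetical.

```python
import os


def resolve_config_source(env=None):
    """Decide where the configuration file comes from (illustrative only)."""
    env = os.environ if env is None else env
    # LOCAL_FILE is the documented default for CONFIG_TYPE.
    config_type = env.get("CONFIG_TYPE", "LOCAL_FILE")
    if config_type == "LOCAL_FILE":
        return {"type": "LOCAL_FILE", "path": env.get("LOCAL_FILE_PATH")}
    if config_type == "GIT_FILE":
        return {
            "type": "GIT_FILE",
            "url": env.get("CONFIG_GIT_FILE_PATH"),
            "token": env.get("CONFIG_GIT_FILE_TOKEN"),
        }
    raise ValueError(f"Unsupported CONFIG_TYPE: {config_type}")


print(resolve_config_source({"LOCAL_FILE_PATH": "C:/filehub/example.xml"}))
```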

Concepts

Before running the service, you must define which storage platforms will be used and configure the access parameters of each one independently. To do that, FileHub reads an XML file when the service starts. The file contains the elements used to process requests. Each element is explained below.

Storage

A storage represents a storage platform. It has an ID that identifies it inside the service and a type. Each type corresponds to a service or storage platform, for example, an FTP server, a cloud service such as AWS S3, or a local directory on the server where FileHub is running. Each type has its own access properties and specifications.

In the configuration file, storages are defined inside the storages tag, as shown in the example below:

    <filehub>
       <storages>
           <storage id="S3-Test" type="AWS_S3">
               <region>us-east-2</region>
               <secretKeyId>G5HG4G66RDYIYE1</secretKeyId>
               <secretKey>6F51E6f1e6F7A2E4F761F61fd51s1F</secretKey>
               <bucket>test</bucket>
           </storage>
           <storage id="FileSystem-Test" type="FILE_SYSTEM">
               <baseDir>C:\Users\user\filehub</baseDir>
           </storage>
       </storages>
    </filehub>

Storage declaration example

Every storage element has an ID and a type. The ID identifies the storage, and the type determines which configuration properties the storage has. The storage types are listed below:

Local File System
Defines a directory on the server where FileHub is running as a storage.
Type: FILE_SYSTEM
Properties:
  • baseDir: root directory

Amazon S3
Defines an S3 bucket as a storage.
Type: AWS_S3
Properties:
  • region: AWS region (e.g., sa-east-1)
  • secretKeyId: access key ID of the IAM user
  • secretKey: secret access key of the IAM user
  • bucket: S3 bucket name
  • baseDir: root directory

Google Cloud Storage
Links FileHub to a Google Cloud Storage bucket.
Type: GOOGLE_CLOUD
Properties:
  • jsonCredentials: JSON object generated for the service account (APIs & Services > Credentials > Service Accounts > Keys). Download the key, copy the JSON object, and paste it inside the jsonCredentials tag.
  • bucket: storage bucket name
  • baseDir: root directory

Dropbox
Links FileHub to a Dropbox account.
Type: DROPBOX
Limitations:
  • Access Token: The token refresh operation is not implemented, so a new token must be generated when you use this kind of storage.
  • File Size: The maximum file size is 150 MB. Operations with larger files will not work.
Properties:
  • accessToken: access token
  • baseDir: root directory

Schema

A schema represents a set of storages. Every operation performed on FileHub, whether upload or download, must state which schema it applies to. FileHub does not operate directly on a storage element. It always goes through a schema that represents one or more storages.

Schemas are declared inside the schemas tag, where more than one schema can be declared. Every schema has a name, which is the identifier used in requests to FileHub. Storages are linked to a schema with the storage-id tag. The example below shows a schema configuration with two linked storages.

    <filehub>
        <storages>
            <storage id="S3-Test" type="AWS_S3">
                <region>us-east-2</region>
                <secretKeyId>G5HG4G66RDYIYE1</secretKeyId>
                <secretKey>6F51E6f1e6F7A2E4F761F61fd51s1F</secretKey>
                <bucket>test</bucket>
            </storage>
            <storage id="FileSystem-Test" type="FILE_SYSTEM">
                <baseDir>C:\Users\user\filehub</baseDir>
            </storage>
        </storages>
        <schemas>
            <schema name="MySchema">
                <storage-id>FileSystem-Test</storage-id>
                <storage-id>S3-Test</storage-id>
            </schema>
        </schemas>
    </filehub>

Schema declaration example
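
To make the relationship between schemas and storages concrete, here is a minimal Python sketch that reads a configuration like the one above and maps each schema name to its storage IDs. It is not FileHub's parser. The load_schemas function is hypothetical, the credentials are omitted for brevity, and rejecting unknown storage IDs is an assumption about sensible validation rather than documented behavior.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the schema declaration example (credentials omitted).
CONFIG = r"""
<filehub>
    <storages>
        <storage id="S3-Test" type="AWS_S3">
            <bucket>test</bucket>
        </storage>
        <storage id="FileSystem-Test" type="FILE_SYSTEM">
            <baseDir>C:\Users\user\filehub</baseDir>
        </storage>
    </storages>
    <schemas>
        <schema name="MySchema">
            <storage-id>FileSystem-Test</storage-id>
            <storage-id>S3-Test</storage-id>
        </schema>
    </schemas>
</filehub>
"""


def load_schemas(xml_text):
    """Return {schema name: [storage IDs in declaration order]}."""
    root = ET.fromstring(xml_text)
    known_ids = {s.get("id") for s in root.iterfind("storages/storage")}
    schemas = {}
    for schema in root.iterfind("schemas/schema"):
        ids = [el.text.strip() for el in schema.iterfind("storage-id")]
        unknown = [i for i in ids if i not in known_ids]
        if unknown:
            # Assumption: a schema may only reference declared storages.
            raise ValueError(f"Unknown storage IDs: {unknown}")
        schemas[schema.get("name")] = ids
    return schemas


print(load_schemas(CONFIG))
```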


Auto Schemas

You do not need to declare a schema for each storage in order to operate on storages individually. FileHub can create the schemas itself when it reads the configuration file. To do that, add the generate-schema attribute to a storage, with the name of the schema to be created as its value. See the example below:

    <filehub>
        <storages>
            <storage id="S3-Test" type="AWS_S3" generate-schema="s3test">
                <region>us-east-2</region>
                <secretKeyId>G5HG4G66RDYIYE1</secretKeyId>
                <secretKey>6F51E6f1e6F7A2E4F761F61fd51s1F</secretKey>
                <bucket>test</bucket>
            </storage>
            <storage id="FileSystem-Test" type="FILE_SYSTEM">
                <baseDir>C:\Users\user\filehub</baseDir>
            </storage>
        </storages>
    </filehub>

Example of schema creation directly on storage

You can also use the generate-schema attribute on the storages element to create a single schema that contains all existing storages. See the example below:

    <filehub>
        <storages generate-schema="all">
            <storage id="S3-Test" type="AWS_S3">
                <region>us-east-2</region>
                <secretKeyId>G5HG4G66RDYIYE1</secretKeyId>
                <secretKey>6F51E6f1e6F7A2E4F761F61fd51s1F</secretKey>
                <bucket>test</bucket>
            </storage>
            <storage id="FileSystem-Test" type="FILE_SYSTEM">
                <baseDir>C:\Users\user\filehub</baseDir>
            </storage>
        </storages>
    </filehub>

Example of schema creation with all existing storages

Warning: If an auto schema is created and no default trigger is configured, the schema will not have any kind of security.
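
The two forms of generate-schema can be summarized with a short Python sketch. It only mirrors the rules described above: the auto_schemas function is hypothetical, and using both forms in the same file is an assumption made to keep the example compact.

```python
import xml.etree.ElementTree as ET

CONFIG = """
<filehub>
    <storages generate-schema="all">
        <storage id="S3-Test" type="AWS_S3" generate-schema="s3test"/>
        <storage id="FileSystem-Test" type="FILE_SYSTEM"/>
    </storages>
</filehub>
"""


def auto_schemas(xml_text):
    """Return the schemas implied by generate-schema attributes."""
    storages_el = ET.fromstring(xml_text).find("storages")
    storage_els = list(storages_el.iterfind("storage"))
    schemas = {}
    # On <storages>: one schema that contains every declared storage.
    name = storages_el.get("generate-schema")
    if name:
        schemas[name] = [s.get("id") for s in storage_els]
    # On a <storage>: one schema that contains only that storage.
    for storage in storage_els:
        name = storage.get("generate-schema")
        if name:
            schemas[name] = [storage.get("id")]
    return schemas


print(auto_schemas(CONFIG))
```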


Trigger

Triggers are used to secure operations. They work as webhooks that ask another service or application whether an operation is authorized.

The trigger element has an ID that identifies it and an action attribute that accepts two values:

1. ALL: the trigger applies to every operation, whether writing (create/update/delete) or reading (download);
2. UPDATE: the trigger applies only to writing operations (create/update/delete).

Warning: The term default is a special value and cannot be used as a trigger ID.

A trigger is configured with three properties:

1. header: the name of the header that is sent to the authorization service.
2. url: the authorization service endpoint that validates the request by checking whether the header value is valid. If the response code is not 200 (OK), the operation is canceled.
3. http-method (optional): the HTTP method used for the request (GET, HEAD, POST, PUT, PATCH, DELETE, OPTIONS). The default value is GET.

In the XML configuration file, triggers are defined inside the triggers tag. A trigger must be linked to a schema. The link is created through the trigger attribute of the schema tag. All storages in the schema apply the trigger during their operations.

For clarification, see the following configuration example:

    <filehub>
        <storages>
            <storage id="example" type="FILE_SYSTEM">
                <baseDir>C:\Users\user\filehub</baseDir>
            </storage>
        </storages>
        <trigger id="user-auth" action="ALL">
            <url>http://10.0.0.10:8080/auth</url>
            <header>myheader</header>
            <http-method>GET</http-method>
        </trigger>
        <schemas>
            <schema name="test" trigger="user-auth">
                <storage-id>example</storage-id>
            </schema>
        </schemas>
    </filehub>

Trigger declaration example

Here, the trigger user-auth is created and the schema test uses it. In other words, every operation on the storage example calls the trigger to check authorization.

The flowchart below shows the upload process for the previous configuration.

Figure: Flowchart of file uploading with trigger

The application that uses the FileHub service must send the header configured in the trigger, with a value. When FileHub receives the request, it calls the endpoint configured in the trigger and forwards the header to the authorization service for validation. A JWT is a good example of a value to send this way.

When a trigger sends a request to the configured URL, the request body contains the following fields:

  • schema: the name of the selected schema
  • operation: the type of operation being executed (CREATE_DIRECTORY, RENAME_DIRECTORY, DELETE_DIRECTORY, LIST_FILES, EXIST_DIRECTORY, UPLOAD_MULTIPART_FILE, UPLOAD_BASE64_FILE, DOWNLOAD_FILE, DELETE_FILE, EXIST_FILE, GET_FILE_DETAILS)
  • path: the path given in the request
  • filenames: a list of the file names used in the request (usually for upload operations)

The following JSON shows an example request body:

    {
      "schema": "test",
      "operation": "UPLOAD_MULTIPART_FILE",
      "path": "/accounts/users/avatar/",
      "filenames": [ "MyAvatar.jpeg" ]
    }
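
For readers who want to test an authorization service, the sketch below builds a request shaped like the one described above, using only the Python standard library. It does not send anything and is not FileHub's implementation. The build_trigger_request and is_authorized functions, the placeholder token value, and the Content-Type header are assumptions.

```python
import json
import urllib.request


def build_trigger_request(url, header_name, header_value, body, http_method="GET"):
    """Build (but do not send) a request like the one a trigger makes."""
    data = json.dumps(body).encode("utf-8")
    request = urllib.request.Request(url, data=data, method=http_method)
    # The header received from the consumer application is forwarded as is.
    request.add_header(header_name, header_value)
    # Assumption: the body is sent as JSON.
    request.add_header("Content-Type", "application/json")
    return request


def is_authorized(status_code):
    """Only a 200 (OK) response lets the operation continue."""
    return status_code == 200


request = build_trigger_request(
    url="http://10.0.0.10:8080/auth",
    header_name="myheader",
    header_value="token-from-consumer",  # hypothetical value
    body={
        "schema": "test",
        "operation": "UPLOAD_MULTIPART_FILE",
        "path": "/accounts/users/avatar/",
        "filenames": ["MyAvatar.jpeg"],
    },
)
print(request.get_method(), request.full_url)
```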

Another purpose of triggers is to allow customized file paths. For example, imagine a system where each user has a directory to store their images. The URLs would look like this:

  • /schema/example/user/paul/photo01
  • /schema/example/user/paul/photo02
  • /schema/example/user/john/photo01
  • /schema/example/user/john/photo02
  • /schema/example/user/john/photo03

To perform an upload or download, the consumer application has to send the identifier of the logged-in user in the URL. If the consumer application is a web interface, however, that identifier can be changed, which compromises the security of the files managed by FileHub. To deal with this problem, the trigger endpoint can return a list of parameters that replace parts of the URL before the operation is completed. The following sequence diagram shows that process:

Figure: Sequence diagram of trigger communication

Each parameter returned in the Authorization Service response must have the same name as the parameter used in the operation URL ($user = user).

Note: The Authorization Service response can also change the file name. Use the filename parameter to do that.
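
The replacement step can be pictured with Python's string.Template, which uses the same $name placeholder style. This is a sketch under assumptions, not FileHub's code: the apply_trigger_parameters function is hypothetical, the response is assumed to be a flat mapping of parameter names to values, and the filename parameter is assumed to apply to a single-file request.

```python
from string import Template


def apply_trigger_parameters(path, filenames, parameters):
    """Replace $name placeholders in the path with values from the response."""
    # safe_substitute leaves unknown placeholders untouched instead of failing.
    new_path = Template(path).safe_substitute(parameters)
    if "filename" in parameters:
        # Assumption: the returned filename replaces the single uploaded file name.
        filenames = [parameters["filename"]]
    return new_path, filenames


path, names = apply_trigger_parameters(
    "/schema/example/user/$user/photo01",
    ["MyAvatar.jpeg"],
    {"user": "paul"},  # hypothetical Authorization Service response
)
print(path, names)
```

Because the user name comes from the Authorization Service and not from the consumer application, changing the URL in a web interface does not change whose directory is used.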

Warning: If a trigger has the action attribute set to UPDATE and the authorization header is present in the request, the trigger still calls the configured endpoint, even for reading operations.
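
The rules for when a trigger calls its endpoint can be condensed into one function. The sketch below is an interpretation of this section, not FileHub's code. The trigger_is_called function is hypothetical, and the split of the operation types into writing and reading is an assumption based on their names.

```python
# Assumption: these operation types count as writing; the rest count as reading.
WRITE_OPERATIONS = {
    "CREATE_DIRECTORY", "RENAME_DIRECTORY", "DELETE_DIRECTORY",
    "UPLOAD_MULTIPART_FILE", "UPLOAD_BASE64_FILE", "DELETE_FILE",
}


def trigger_is_called(action, operation, header_present):
    """Tell whether the trigger endpoint is called for an operation."""
    if action == "ALL":
        return True
    if action == "UPDATE":
        # Writing operations always call the endpoint. Reading operations
        # call it only when the request carries the authorization header.
        return operation in WRITE_OPERATIONS or header_present
    raise ValueError(f"Unknown action: {action}")


print(trigger_is_called("UPDATE", "DOWNLOAD_FILE", header_present=False))
print(trigger_is_called("UPDATE", "DOWNLOAD_FILE", header_present=True))
```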


Default Trigger

You can create a trigger that is applied to every schema that does not declare a trigger explicitly. To do that, set the default attribute on the trigger, as shown in the example below:

    <filehub>
        <storages>
            <storage id="example" type="FILE_SYSTEM">
                <baseDir>C:\Users\user\filehub</baseDir>
            </storage>
        </storages>
        <trigger id="user-auth" action="ALL" default="true">
            <url>http://10.0.0.10:8080/auth</url>
            <header>myheader</header>
            <http-method>GET</http-method>
        </trigger>
    </filehub>

Default trigger example

Operations

Now that the main FileHub concepts are clear, the next step is to know which operations the service offers.

Directories

Directories are used to group and organize files. Most storage platforms treat a directory as a special type of file, but some, such as AWS S3, treat it as a prefix. In that case, the prefix and the file name together form the key that identifies the file inside a bucket. FileHub provides directory management through the following operations:

  • Create a new directory
  • Rename a directory
  • Delete a directory
  • List the files inside a directory, including other directories
  • Check whether a directory exists

Disable directory operations

To disable directory operations, use the no-dir attribute on a trigger, as shown in the example below:

    <trigger id="user-auth" action="ALL" no-dir="true">
        <url>http://10.0.0.10:8080/auth</url>
        <header>myheader</header>
        <http-method>GET</http-method>
    </trigger>

Example of trigger with disabled directories

Upload

An upload operation sends files that are saved in all storages linked to a schema. When FileHub receives the upload request and the file transfer begins, it can send the file to the storages in two ways:

  • Sequential transfer: The default transfer type. FileHub transfers the files to each storage one at a time, following the order in which the storages are declared in the schema.
  • Parallel transfer: FileHub transfers the files to all storages at the same time, so there is no specific transfer order. To use parallel transfer, set the parallel-upload attribute to true in the schema tag.

    <schemas>
        <schema name="test-parallel" parallel-upload="true">
            <storage-id>example</storage-id>
        </schema>
    </schemas>

Parallel transfer configuration example

Regardless of the transfer type, the upload request returns a response only after the file has been transferred to all storages in the schema.
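
The difference between the two transfer types, and the fact that both block until every storage has the file, can be sketched as follows. Plain dictionaries stand in for storages, and the upload function is hypothetical; nothing here reflects how FileHub is implemented internally.

```python
from concurrent.futures import ThreadPoolExecutor


def upload(storages, filename, content, parallel=False):
    """Copy a file to every storage; return only when all copies are done."""
    def send(storage):
        storage[filename] = content

    if parallel:
        # All storages receive the file at the same time, in no fixed order.
        with ThreadPoolExecutor(max_workers=max(1, len(storages))) as pool:
            list(pool.map(send, storages))  # waits for, and checks, every transfer
    else:
        # Storages receive the file one by one, in declaration order.
        for storage in storages:
            send(storage)
    return "uploaded"


file_system, s3 = {}, {}
print(upload([file_system, s3], "MyAvatar.jpeg", b"image bytes", parallel=True))
```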

Middle-Storage

When a schema has only one storage and the files are small, the transfer finishes quickly. When larger files must be transferred to more than one storage, however, the request can take a significant amount of time. One way to reduce that problem is to use a middle-storage.

The middle-storage defines which storage of a schema acts as the intermediary between the consumer application and the rest of the storages. See the following example:

    <filehub>
        <storages>
            <storage id="S3-Test" type="AWS_S3">
                <region>us-east-2</region>
                <secretKeyId>G5HG4G66RDYIYE1</secretKeyId>
                <secretKey>6F51E6f1e6F7A2E4F761F61fd51s1F</secretKey>
                <bucket>test</bucket>
            </storage>
            <storage id="FileSystem-Test" type="FILE_SYSTEM">
                <baseDir>C:\Users\user\filehub</baseDir>
            </storage>
        </storages>
        <schemas>
            <schema name="myschema" middle="FileSystem-Test">
                <storage-id>FileSystem-Test</storage-id>
                <storage-id>S3-Test</storage-id>
            </schema>
        </schemas>
    </filehub>

Middle-storage example

In the example above, during an upload operation the FileSystem-Test storage receives the file, the response is returned to the consumer application, and only then is the file transferred to the S3-Test storage.

Temporary Middle-Storage

A storage that is defined as middle-storage but is not listed among the schema's storages is a temporary storage. It works like a middle-storage, but it deletes the files after the upload operation.

    <schemas>
        <schema name="myschema" middle="FileSystem-Test">
            <storage-id>S3-Test</storage-id>
        </schema>
    </schemas>

Temporary middle-storage example

In the example above, the FileSystem-Test storage is not declared in any storage-id element of the schema. In other words, it is a temporary middle-storage.
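
The following Python sketch illustrates both kinds of middle-storage. Dictionaries stand in for storages and a background thread stands in for the deferred transfer. The upload_with_middle function is hypothetical, and returning the worker thread is only a convenience so that the caller can wait for the background copy in this example.

```python
import threading


def upload_with_middle(storages, schema_ids, middle_id, filename, content):
    """Store the file in the middle-storage, answer, then copy it onward."""
    middle = storages[middle_id]
    middle[filename] = content  # the consumer application only waits for this
    temporary = middle_id not in schema_ids

    def propagate():
        for storage_id in schema_ids:
            if storage_id != middle_id:
                storages[storage_id][filename] = middle[filename]
        if temporary:
            # A temporary middle-storage does not keep the file.
            del middle[filename]

    worker = threading.Thread(target=propagate)
    worker.start()
    return "uploaded", worker


storages = {"FileSystem-Test": {}, "S3-Test": {}}
status, worker = upload_with_middle(
    storages, ["S3-Test"], "FileSystem-Test", "MyAvatar.jpeg", b"image bytes"
)
worker.join()  # wait for the background transfer before inspecting the result
print(status, storages)
```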

Download

Unlike the upload operation, which communicates with all storages in the schema, the download operation uses only the first storage of the schema to transfer the file.

Cache-Storage

The cache attribute changes the download operation. If the file is not in the first storage, FileHub checks whether it exists in the next storage. If it does, FileHub downloads it and also saves a copy in the first storage.

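    <schemas>
        <schema name="myschema" middle="FileSystem-Test" cache="true">
            <!-- Listed as a schema storage so that the middle-storage is not temporary -->
            <storage-id>FileSystem-Test</storage-id>
            <storage-id>S3-Test</storage-id>
        </schema>
    </schemas>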

Cache-storage example

In the previous example, if a file is missing from the FileSystem-Test storage, FileHub checks whether S3-Test has the file. If it does, the download is executed and the file is also copied to FileSystem-Test. Otherwise, FileHub returns a not found error.

Warning: If a middle-storage is linked to the schema, that storage is used as the cache. Otherwise, the first storage of the schema is used.

Warning: A cache-storage and a temporary middle-storage cannot be configured at the same time.
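
The download rules, including the two warnings, can be sketched in Python as follows. Dictionaries stand in for storages, and the download function is hypothetical. Searching all remaining storages in declaration order is an assumption, since the text only mentions "the next storage".

```python
def download(storages, schema_ids, filename, middle_id=None, cache=False):
    """Return the file content, filling the cache storage on a miss."""
    if not cache:
        # Without cache, only the first storage of the schema is consulted.
        source = storages[schema_ids[0]]
        if filename in source:
            return source[filename]
        raise FileNotFoundError(filename)

    if middle_id is not None and middle_id not in schema_ids:
        raise ValueError("cache cannot be combined with a temporary middle-storage")

    # The middle-storage is the cache when there is one, else the first storage.
    cache_id = middle_id if middle_id is not None else schema_ids[0]
    cache_storage = storages[cache_id]
    if filename in cache_storage:
        return cache_storage[filename]
    for storage_id in schema_ids:
        if storage_id != cache_id and filename in storages[storage_id]:
            # Keep a copy in the cache storage for the next download.
            cache_storage[filename] = storages[storage_id][filename]
            return cache_storage[filename]
    raise FileNotFoundError(filename)


storages = {"FileSystem-Test": {}, "S3-Test": {"MyAvatar.jpeg": b"image bytes"}}
content = download(
    storages, ["FileSystem-Test", "S3-Test"], "MyAvatar.jpeg",
    middle_id="FileSystem-Test", cache=True,
)
print(content, "MyAvatar.jpeg" in storages["FileSystem-Test"])
```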



API Documentation


Docker Configuration

DockerHub Link: https://hub.docker.com/repository/docker/paulophgf/filehub

Docker Run Command

    docker run -d --name filehub -v {LOCAL_DIR}:/filehub paulophgf/filehub:{FILEHUB_VERSION}

Example:

    docker run -d --name filehub -v //c/Users/user/filehub:/filehub paulophgf/filehub:1.0.0

Compose

    version: '3.1'
    
    services:
    
      filehub:
        image: paulophgf/filehub:1.0.0
        hostname: filehub
        container_name: filehub
        restart: always
        networks:
          - filehub-default
        ports:
          - "8088:8088"
        volumes:
          - /etc/hosts:/etc/hosts:ro
          - {LOCAL_DIR}:/filehub # Replace the variable {LOCAL_DIR} | e.g. Win: C:\Users\%user%\filehub or Linux: /filehub
        environment:
          CONFIG_TYPE: "LOCAL_FILE" # Choose one option LOCAL_FILE or GIT_FILE
          LOCAL_FILE_PATH: "filehub/fh-config.xml"
          CONFIG_GIT_FILE_PATH: "" # Fill the variable if you choose GIT_FILE as CONFIG_TYPE
          CONFIG_GIT_FILE_TOKEN: "" # Fill the variable if you choose GIT_FILE as CONFIG_TYPE
          JAVA_OPTS: "-Xms512m -Xmx1024m"
    
    networks:
      filehub-default:
        name: filehub-default