kobotoolbox/kobo-docker

NGINX `protected-s3` proxy fails when filenames have spaces

jnm opened this issue · 1 comment

jnm commented
  • @jnm and @noliveleger to discuss whether these changes make sense;
  • Make the change to KoBoCAT;
  • Tag a new release;
  • Make the change here, in kobo-docker, and update the Docker configuration to pull the new KoBoCAT release;
  • Check the plain `protected` location (used with filesystem storage); does it need special treatment, too? If so, open a new issue.

Example

Let's consider an attachment whose filename is `a space-1_55_1.jpg`. I try to access http://kc.kobo.local:9000/attachment/original?media_file=regular/attachments/d75bb4d648df454ba653e692395294da/f2a84d5d-6c70-4e7e-9bd0-478e9df5340f/a%20space-1_55_1.jpg, but I receive an HTTP 505 error, `HttpVersionNotSupported`.

The Internet says that S3 returns this error when it receives a URL containing non-encoded special characters. Looking at the NGINX log, though, shows that NGINX receives the URL properly encoded:

kc.kobo.local:9000 | 192.168.144.1 - - [27/Mar/2019:07:30:40 +0000] "GET /attachment/original?media_file=regular/attachments/d75bb4d648df454ba653e692395294da/f2a84d5d-6c70-4e7e-9bd0-478e9df5340f/a%20space-1_55_1.jpg HTTP/1.1" 505 (316 bytes) "http://kf.kobo.local:9000/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0"
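(The `%20` there is just the percent-encoding of a space; a quick check with Python's standard library confirms the `media_file` parameter matches the original filename:)

from urllib.parse import quote

# A space percent-encodes to '%20', matching the request logged above
# ('/' is left intact by quote() by default)
assert quote("a space-1_55_1.jpg") == "a%20space-1_55_1.jpg"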

However, with a little debug logging:

diff --git a/nginx/nginx_site_default.conf.tmpl b/nginx/nginx_site_default.conf.tmpl
index 8767fe9..d8b0902 100644
--- a/nginx/nginx_site_default.conf.tmpl
+++ b/nginx/nginx_site_default.conf.tmpl
@@ -56,6 +56,7 @@ server {
     }
 
     location ~ ^/protected-s3/(.*?)/(.*) {
+        error_log /tmp/protected-s3.log debug;
         internal;
         set $aws_access_key   'AWSAccessKeyId=$arg_AWSAccessKeyId';
         set $url_expires      'Expires=$arg_Expires';

…it's possible to see the actual GET request that NGINX sends to Amazon S3, and it's not encoded at all:

"GET /regular/attachments/d75bb4d648df454ba653e692395294da/f2a84d5d-6c70-4e7e-9bd0-478e9df5340f/a space-1_55_1.jpg?AWSAccessKeyId=AKIAI5FA4MKYP7R35RHQ&Expires=1553675440&Signature=eq5sQGF4vgeGv2Au6RA1eLB5qeM%3D HTTP/1.1

The literal space presumably breaks the HTTP request line, which would explain why S3 complains about the protocol version rather than the URL itself. The crux of it seems to be that NGINX decodes the capture groups in `location ~ ^/protected-s3/(.*?)/(.*)`; indeed, the debug log includes:

2019/03/27 07:30:40 [debug] 38#38: *54 http script capture: "regular/attachments/d75bb4d648df454ba653e692395294da/f2a84d5d-6c70-4e7e-9bd0-478e9df5340f/a space-1_55_1.jpg"
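Python's standard library can model that behavior, with `unquote` standing in for NGINX's automatic decode (a sketch of the mechanism, not NGINX's actual code):

from urllib.parse import quote, unquote

encoded_once = "a%20space-1_55_1.jpg"

# What happens today: NGINX has already decoded the capture group once, so
# the URL proxied to S3 contains a raw space and S3 rejects the request
assert unquote(encoded_once) == "a space-1_55_1.jpg"

# The idea proposed below: add one more layer of encoding upstream, so that
# NGINX's single decode still leaves a valid, encoded URL
encoded_twice = quote(encoded_once, safe="/")
assert encoded_twice == "a%2520space-1_55_1.jpg"
assert unquote(encoded_twice) == encoded_once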

Others have grappled with this type of thing.

What to do?

Why not take advantage of NGINX's behavior and double-encode the S3 URL in KoBoCAT? We could:

  1. Have KoBoCAT append a complete, encoded S3 URL to `/protected-s3/`, and use that as `X-Accel-Redirect`;
  2. Let the Django machinery or whatever encode the `X-Accel-Redirect` as it already does;
  3. Simplify the NGINX configuration to use `location ~ ^/protected-s3/(.*)$`;
  4. Gleefully let NGINX remove one layer of encoding, leaving the encoded S3 URL in `$1`;
  5. `proxy_pass $1` (see the sketch just after this list).
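Concretely, for the example attachment, the round trip would look something like this. It's a sketch: the URL is stitched together from the logs above, and Django's `urlquote` is assumed to behave like `quote(..., safe='/')` (which it wraps):

from urllib.parse import quote, unquote

# Signed S3 URL as generated by django-storages, pieced together from the
# logs above
media_url = (
    "https://kobo-docker-dev.s3.amazonaws.com/regular/attachments/"
    "d75bb4d648df454ba653e692395294da/f2a84d5d-6c70-4e7e-9bd0-478e9df5340f/"
    "a%20space-1_55_1.jpg"
    "?AWSAccessKeyId=AKIAI5FA4MKYP7R35RHQ"
    "&Expires=1553675440"
    "&Signature=eq5sQGF4vgeGv2Au6RA1eLB5qeM%3D"
)

# Steps 1 and 2: KoBoCAT adds one more layer of encoding ('/' stays literal)
x_accel_redirect = "/protected-s3/" + quote(media_url, safe="/")

# Steps 3-5: NGINX decodes the capture group exactly once, so $1 ends up
# being the original signed URL, still properly encoded for S3
assert unquote(x_accel_redirect[len("/protected-s3/"):]) == media_url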

In KoBoCAT, this might look like:

diff --git a/onadata/apps/viewer/views.py b/onadata/apps/viewer/views.py
index 9c7c8a80..e6b3a727 100644
--- a/onadata/apps/viewer/views.py
+++ b/onadata/apps/viewer/views.py
@@ -21,6 +21,7 @@ from django.http import (
 from django.shortcuts import get_object_or_404
 from django.shortcuts import redirect
 from django.shortcuts import render
+from django.utils.http import urlquote
 from django.utils.translation import ugettext as _
 from django.views.decorators.http import require_POST
 
@@ -750,12 +751,9 @@ def attachment_url(request, size='medium'):
             response = HttpResponse()
             default_storage = get_storage_class()()
             if not isinstance(default_storage, FileSystemStorage):
-                protocol = "https" if settings.AWS_S3_USE_SSL else "http"
-                protected_url = media_url.replace("{}://{}.{}/".format(
-                    protocol,
-                    settings.AWS_STORAGE_BUCKET_NAME,
-                    settings.AWS_S3_HOST,
-                ), "/protected-s3/{}/".format(settings.AWS_STORAGE_BUCKET_NAME))
+                # Double-encode the S3 URL to take advantage of NGINX's
+                # otherwise troublesome automatic decoding
+                protected_url = '/protected-s3/{}'.format(urlquote(media_url))
             else:
                 protected_url = media_url.replace(settings.MEDIA_URL, "/protected/")
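For context, the rest of the view (unchanged, and therefore not shown in the diff) presumably finishes by handing `protected_url` back to NGINX in the header that triggers the internal redirect; roughly:

from django.http import HttpResponse

def finish_attachment_response(protected_url):
    # Hypothetical paraphrase of the existing view's tail: setting
    # X-Accel-Redirect makes NGINX serve the `internal` location itself
    # instead of returning this (empty) response body to the client
    response = HttpResponse()
    response['X-Accel-Redirect'] = protected_url
    return response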

The changes to this kobo-docker repository might look like:

diff --git a/nginx/nginx_site_default.conf.tmpl b/nginx/nginx_site_default.conf.tmpl
index 8767fe9..1747251 100644
--- a/nginx/nginx_site_default.conf.tmpl
+++ b/nginx/nginx_site_default.conf.tmpl
@@ -55,25 +55,36 @@ server {
         alias /media/;
     }
 
-    location ~ ^/protected-s3/(.*?)/(.*) {
+    location ~ ^/protected-s3/(.*)$ {
+        error_log /var/log/nginx/yikes.log debug;
+        # Allow internal requests only, i.e. return a 404 to any client who
+        # tries to access this location directly
         internal;
-        set $aws_access_key   'AWSAccessKeyId=$arg_AWSAccessKeyId';
-        set $url_expires      'Expires=$arg_Expires';
-        set $url_signature    'Signature=$arg_Signature';
-        set $bucket_domain_name '$1.s3.amazonaws.com';
-        set $args_full        'https://$bucket_domain_name/$2?$aws_access_key&$url_expires&$url_signature';
-        proxy_set_header       Host $bucket_domain_name;
-        proxy_http_version     1.1;
-        proxy_set_header       Authorization '';
-        proxy_hide_header      x-amz-id-2;
-        proxy_hide_header      x-amz-request-id;
-        proxy_hide_header      Set-Cookie;
-        proxy_ignore_headers   "Set-Cookie";
-        proxy_buffering        off;
-        proxy_intercept_errors off;
-        resolver               8.8.8.8 valid=300s;
-        resolver_timeout       10s;
-        proxy_pass             $args_full;
+        # Name resolution won't work at all without specifying a resolver here.
+        # Configuring a validity period is useful for overriding Amazon's very
+        # short (5-second?) TTLs.
+        resolver 8.8.8.8 8.8.4.4 valid=300s;
+        resolver_timeout 10s;
+        # Everything that S3 needs is in the URL; don't pass any headers or
+        # body content that the client may have sent
+        proxy_pass_request_body off;
+        proxy_pass_request_headers off;
+
+        # Stream the response to the client instead of trying to read it all at
+        # once, which would potentially use disk space
+        proxy_buffering off;
+
+        # Don't leak S3 headers to the client. List retrieved from:
+        # https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html
+        proxy_hide_header x-amz-delete-marker;
+        proxy_hide_header x-amz-id-2;
+        proxy_hide_header x-amz-request-id;
+        proxy_hide_header x-amz-version-id;
+
+        # S3 will complain if `$1` contains non-encoded special characters.
+        # KoBoCAT must encode twice to make sure `$1` is still encoded after
+        # NGINX's automatic URL decoding.
+        proxy_pass $1;
     }
 }

The above configuration works for me. Some of the lines I removed, though, don't strictly relate to this issue. See the next comment below for details.
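As a quick sanity check, re-trying the originally failing request (a sketch with `requests`; the hostname and path are taken from the example at the top, and any required authentication is omitted):

import requests

url = (
    "http://kc.kobo.local:9000/attachment/original"
    "?media_file=regular/attachments/d75bb4d648df454ba653e692395294da/"
    "f2a84d5d-6c70-4e7e-9bd0-478e9df5340f/a%20space-1_55_1.jpg"
)
response = requests.get(url)
# Previously this returned 505 HttpVersionNotSupported; with the double
# encoding in place it should return the attachment
assert response.status_code == 200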

jnm commented

Justification for removals:

# I `proxy_pass` to the untouched S3 URL (in `$1`), so this parsing isn't needed
-        set $aws_access_key   'AWSAccessKeyId=$arg_AWSAccessKeyId';
-        set $url_expires      'Expires=$arg_Expires';
-        set $url_signature    'Signature=$arg_Signature';
-        set $bucket_domain_name '$1.s3.amazonaws.com';
-        set $args_full        'https://$bucket_domain_name/$2?$aws_access_key&$url_expires&$url_signature';
# Manually setting the `Host` seems unnecessary; `proxy_pass` seems
# smart enough to do this on its own. Without setting `Host`, the debug log still shows:
#    2019/03/27 08:22:07 [debug] 38#38: *9 http script copy: "Host: "
#    2019/03/27 08:22:07 [debug] 38#38: *9 http script var: "kobo-docker-dev.s3.amazonaws.com"
-        proxy_set_header       Host $bucket_domain_name;
# Not sure why this would be needed?
-        proxy_http_version     1.1;
# Was this to remove any `Authorization` sent by the client? If so, it's covered by
# `proxy_pass_request_headers off`
-        proxy_set_header       Authorization '';
# I agree that hiding the S3 response headers is a good idea. I kept these (but
# changed the whitespace) and added a few others.
-        proxy_hide_header      x-amz-id-2;
-        proxy_hide_header      x-amz-request-id;
# Is it necessary to deal with cookies this way?
-        proxy_hide_header      Set-Cookie;
-        proxy_ignore_headers   "Set-Cookie";
# I kept this but changed the whitespace
-        proxy_buffering        off;
# This is off by default
-        proxy_intercept_errors off;
# I just added a second DNS server to the directive
-        resolver               8.8.8.8 valid=300s;
# I kept this (seems better than the 30s default) but changed the whitespace
-        resolver_timeout       10s;
# Replaced by the simpler `proxy_pass $1;`
-        proxy_pass             $args_full;