# NGINX `protected-s3` proxy fails when filenames have spaces
- @jnm and @noliveleger discuss to see if these changes make sense;
- Make the change to KoBoCAT;
- Tag a new release;
- Make the change here, to `kobo-docker`, and update the Docker configuration to pull the new KoBoCAT release;
- Check `protected`; does it need special treatment, too? If so, open a new issue.
## Example
Let's consider an attachment whose filename is `a space-1_55_1.jpg`. I try to access `http://kc.kobo.local:9000/attachment/original?media_file=regular/attachments/d75bb4d648df454ba653e692395294da/f2a84d5d-6c70-4e7e-9bd0-478e9df5340f/a%20space-1_55_1.jpg`, but I receive an HTTP 505 error, `HttpVersionNotSupported`.
The Internet says that S3 returns this error when it receives a non-encoded URL with special characters. Looking at the NGINX log, though, shows that NGINX receives the URL encoded:
```
kc.kobo.local:9000 | 192.168.144.1 - - [27/Mar/2019:07:30:40 +0000] "GET /attachment/original?media_file=regular/attachments/d75bb4d648df454ba653e692395294da/f2a84d5d-6c70-4e7e-9bd0-478e9df5340f/a%20space-1_55_1.jpg HTTP/1.1" 505 (316 bytes) "http://kf.kobo.local:9000/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0"
```
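For reference, the `%20` here is just the percent-encoded space, which is easy to confirm with plain Python 3 (standard `urllib.parse`, nothing KoBo-specific):

```python
from urllib.parse import quote

# A filename containing a literal space, encoded the way the browser does it:
print(quote("a space-1_55_1.jpg"))  # -> a%20space-1_55_1.jpg
```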
However, with a little debug logging:
```diff
diff --git a/nginx/nginx_site_default.conf.tmpl b/nginx/nginx_site_default.conf.tmpl
index 8767fe9..d8b0902 100644
--- a/nginx/nginx_site_default.conf.tmpl
+++ b/nginx/nginx_site_default.conf.tmpl
@@ -56,6 +56,7 @@ server {
     }
 
     location ~ ^/protected-s3/(.*?)/(.*) {
+        error_log /tmp/protected-s3.log debug;
         internal;
         set $aws_access_key 'AWSAccessKeyId=$arg_AWSAccessKeyId';
         set $url_expires 'Expires=$arg_Expires';
```
…it's possible to see the actual `GET` request that NGINX sends to Amazon S3, and it's very non-encoded:
"GET /regular/attachments/d75bb4d648df454ba653e692395294da/f2a84d5d-6c70-4e7e-9bd0-478e9df5340f/a space-1_55_1.jpg?AWSAccessKeyId=AKIAI5FA4MKYP7R35RHQ&Expires=1553675440&Signature=eq5sQGF4vgeGv2Au6RA1eLB5qeM%3D HTTP/1.1
The crux of it seems to be that NGINX decodes the capture groups in `location ~ ^/protected-s3/(.*?)/(.*)`; indeed, the debug log includes:
```
2019/03/27 07:30:40 [debug] 38#38: *54 http script capture: "regular/attachments/d75bb4d648df454ba653e692395294da/f2a84d5d-6c70-4e7e-9bd0-478e9df5340f/a space-1_55_1.jpg"
```
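In other words, NGINX hands regex captures to `proxy_pass` with one layer of percent-decoding already applied, so S3 ends up receiving a literal space. A quick Python 3 illustration of that single decode (path shortened with `...` for readability):

```python
from urllib.parse import unquote

# What the `location` regex matched in the (still-encoded) request URI:
captured = ".../f2a84d5d-6c70-4e7e-9bd0-478e9df5340f/a%20space-1_55_1.jpg"

# What actually lands in the capture variable after NGINX's decoding:
print(unquote(captured))
# -> .../f2a84d5d-6c70-4e7e-9bd0-478e9df5340f/a space-1_55_1.jpg
```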
Others have grappled with this type of thing.
## What to do?
Why not take advantage of NGINX's behavior and double-encode the S3 URL in KoBoCAT? We could:
- Have KoBoCAT append a complete, encoded S3 URL to `/protected-s3/`, and use that as `X-Accel-Redirect`;
- Let the Django machinery or whatever encode the `X-Accel-Redirect` as it already does;
- Simplify the NGINX configuration to use `location ~ ^/protected-s3/(.*)$`;
- Gleefully let NGINX remove one layer of encoding, leaving the encoded S3 URL in `$1`; `proxy_pass $1`. (A sketch of the round trip follows this list.)
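Here's that round trip sanity-checked in Python 3 (hypothetical bucket name and truncated key/signature values; `quote` with `safe='/'` matches the default behavior of Django's `urlquote`):

```python
from urllib.parse import quote, unquote

# The signed URL from storage is already percent-encoded once:
media_url = ("https://mybucket.s3.amazonaws.com/attachments/a%20space-1_55_1.jpg"
             "?AWSAccessKeyId=AKIA...&Expires=1553675440&Signature=eq5s...")

# KoBoCAT adds a second layer of encoding before issuing the redirect:
redirect = "/protected-s3/{}".format(quote(media_url, safe="/"))
# -> /protected-s3/https%3A//mybucket.s3.amazonaws.com/attachments/a%2520space-1_55_1.jpg%3F...

# NGINX strips exactly one layer of encoding from $1, restoring the
# signed URL intact, %20 and all:
assert unquote(redirect[len("/protected-s3/"):]) == media_url
```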
In KoBoCAT, this might look like:
```diff
diff --git a/onadata/apps/viewer/views.py b/onadata/apps/viewer/views.py
index 9c7c8a80..e6b3a727 100644
--- a/onadata/apps/viewer/views.py
+++ b/onadata/apps/viewer/views.py
@@ -21,6 +21,7 @@ from django.http import (
 from django.shortcuts import get_object_or_404
 from django.shortcuts import redirect
 from django.shortcuts import render
+from django.utils.http import urlquote
 from django.utils.translation import ugettext as _
 from django.views.decorators.http import require_POST
@@ -750,12 +751,9 @@ def attachment_url(request, size='medium'):
     response = HttpResponse()
     default_storage = get_storage_class()()
     if not isinstance(default_storage, FileSystemStorage):
-        protocol = "https" if settings.AWS_S3_USE_SSL else "http"
-        protected_url = media_url.replace("{}://{}.{}/".format(
-            protocol,
-            settings.AWS_STORAGE_BUCKET_NAME,
-            settings.AWS_S3_HOST,
-        ), "/protected-s3/{}/".format(settings.AWS_STORAGE_BUCKET_NAME))
+        # Double-encode the S3 URL to take advantage of NGINX's
+        # otherwise troublesome automatic decoding
+        protected_url = '/protected-s3/{}'.format(urlquote(media_url))
     else:
         protected_url = media_url.replace(settings.MEDIA_URL, "/protected/")
```
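For context on how `protected_url` is consumed: the view doesn't serve the file itself; it returns an empty response whose `X-Accel-Redirect` header tells NGINX what to fetch internally. A minimal, hypothetical sketch of that mechanism (not the real `attachment_url` view, which does much more):

```python
from django.http import HttpResponse
from django.utils.http import urlquote

def serve_protected_attachment(media_url):
    # media_url is the signed (and already percent-encoded) S3 URL
    response = HttpResponse()
    # NGINX intercepts this header, matches the internal `protected-s3`
    # location, decodes the capture group once, and proxies the request
    # to S3 on the client's behalf
    response['X-Accel-Redirect'] = '/protected-s3/{}'.format(urlquote(media_url))
    return response
```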
The changes to this `kobo-docker` repository might look like:
```diff
diff --git a/nginx/nginx_site_default.conf.tmpl b/nginx/nginx_site_default.conf.tmpl
index 8767fe9..1747251 100644
--- a/nginx/nginx_site_default.conf.tmpl
+++ b/nginx/nginx_site_default.conf.tmpl
@@ -55,25 +55,36 @@ server {
         alias /media/;
     }
 
-    location ~ ^/protected-s3/(.*?)/(.*) {
+    location ~ ^/protected-s3/(.*)$ {
+        error_log /var/log/nginx/yikes.log debug;
+        # Allow internal requests only, i.e. return a 404 to any client who
+        # tries to access this location directly
         internal;
-        set $aws_access_key 'AWSAccessKeyId=$arg_AWSAccessKeyId';
-        set $url_expires 'Expires=$arg_Expires';
-        set $url_signature 'Signature=$arg_Signature';
-        set $bucket_domain_name '$1.s3.amazonaws.com';
-        set $args_full 'https://$bucket_domain_name/$2?$aws_access_key&$url_expires&$url_signature';
-        proxy_set_header Host $bucket_domain_name;
-        proxy_http_version 1.1;
-        proxy_set_header Authorization '';
-        proxy_hide_header x-amz-id-2;
-        proxy_hide_header x-amz-request-id;
-        proxy_hide_header Set-Cookie;
-        proxy_ignore_headers "Set-Cookie";
-        proxy_buffering off;
-        proxy_intercept_errors off;
-        resolver 8.8.8.8 valid=300s;
-        resolver_timeout 10s;
-        proxy_pass $args_full;
+        # Name resolution won't work at all without specifying a resolver here.
+        # Configuring a validity period is useful for overriding Amazon's very
+        # short (5-second?) TTLs.
+        resolver 8.8.8.8 8.8.4.4 valid=300s;
+        resolver_timeout 10s;
+        # Everything that S3 needs is in the URL; don't pass any headers or
+        # body content that the client may have sent
+        proxy_pass_request_body off;
+        proxy_pass_request_headers off;
+
+        # Stream the response to the client instead of trying to read it all at
+        # once, which would potentially use disk space
+        proxy_buffering off;
+
+        # Don't leak S3 headers to the client. List retrieved from:
+        # https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html
+        proxy_hide_header x-amz-delete-marker;
+        proxy_hide_header x-amz-id-2;
+        proxy_hide_header x-amz-request-id;
+        proxy_hide_header x-amz-version-id;
+
+        # S3 will complain if `$1` contains non-encoded special characters.
+        # KoBoCAT must encode twice to make sure `$1` is still encoded after
+        # NGINX's automatic URL decoding.
+        proxy_pass $1;
     }
 }
```
The above configuration works for me. Some of the lines I removed, though, don't strictly relate to this issue. See the next comment below for details.
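One way to spot-check the whole chain end to end, before and after the change (hypothetical snippet; assumes the `requests` library and a valid Django session cookie exported as `KOBOCAT_SESSIONID`, with the host and path taken from the example above):

```python
import os

import requests

url = ("http://kc.kobo.local:9000/attachment/original"
       "?media_file=regular/attachments/d75bb4d648df454ba653e692395294da/"
       "f2a84d5d-6c70-4e7e-9bd0-478e9df5340f/a%20space-1_55_1.jpg")
r = requests.get(url, cookies={"sessionid": os.environ["KOBOCAT_SESSIONID"]})
print(r.status_code)  # 505 with the old configuration; 200 once both changes land
```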
Justification for removals:
```
# I `proxy_pass` to the untouched S3 URL (in `$1`), so this parsing isn't needed
- set $aws_access_key 'AWSAccessKeyId=$arg_AWSAccessKeyId';
- set $url_expires 'Expires=$arg_Expires';
- set $url_signature 'Signature=$arg_Signature';
- set $bucket_domain_name '$1.s3.amazonaws.com';
- set $args_full 'https://$bucket_domain_name/$2?$aws_access_key&$url_expires&$url_signature';

# Manually setting the `Host` seems unnecessary; `proxy_pass` seems
# smart enough to do this on its own. Without setting `Host`, the debug log still shows:
#   2019/03/27 08:22:07 [debug] 38#38: *9 http script copy: "Host: "
#   2019/03/27 08:22:07 [debug] 38#38: *9 http script var: "kobo-docker-dev.s3.amazonaws.com"
- proxy_set_header Host $bucket_domain_name;

# Not sure why this would be needed?
- proxy_http_version 1.1;

# Was this to remove any `Authorization` sent by the client? If so, it's covered by
# `proxy_pass_request_headers off`
- proxy_set_header Authorization '';

# I agree that hiding the S3 response headers is a good idea. I kept these (but
# changed the whitespace) and added a few others.
- proxy_hide_header x-amz-id-2;
- proxy_hide_header x-amz-request-id;

# Is it necessary to deal with cookies this way?
- proxy_hide_header Set-Cookie;
- proxy_ignore_headers "Set-Cookie";

# I kept this but changed the whitespace
- proxy_buffering off;

# This is off by default
- proxy_intercept_errors off;

# I just added a second DNS server to the directive
- resolver 8.8.8.8 valid=300s;

# I kept this (seems better than the 30s default) but changed the whitespace
- resolver_timeout 10s;

# Replaced by the simpler `proxy_pass $1;`
- proxy_pass $args_full;
```