darkweak/souin

[feature] allow Souin operator to ignore client attempts to control cache usage by ignoring request cache related headers

dkarlovi opened this issue · 27 comments

I have this block in Caddy

    cache {
        allowed_http_verbs GET
        nuts {
            path /tmp
        }
        default_cache_control "public, max-age=86400"
        stale 1h
        ttl 24h
    }

If I do

http https://example.com/thumbnail/image.webp

I get

Age: 103
Cache-Control: public, max-age=86400
Cache-Status: Souin; hit; ttl=86297; key=GET-http-example.com/thumbnail/image.webp

but, if I do

http https://example.com/thumbnail/image.webp pragma:no-cache

I get

Cache-Control: public, max-age=86400
Cache-Status: Souin; fwd=uri-miss; stored; key=GET-http-example.com/thumbnail/image.webp
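
To compare the two responses programmatically, the Cache-Status value can be split into its parameters. The snippet below is a minimal sketch in Python, not part of Souin. `parse_cache_status` and `served_from_cache` are hypothetical helper names, and the naive split on `;` assumes that no parameter value contains a semicolon. That holds for the two headers shown above, but not for every valid Cache-Status value.

```python
def parse_cache_status(value):
    """Split a single-cache Cache-Status value into its name and parameters.

    Assumption: parameter values contain no ';' (true for the examples above).
    """
    parts = [part.strip() for part in value.split(";") if part.strip()]
    parsed = {"cache": parts[0], "params": {}}
    for part in parts[1:]:
        name, has_value, val = part.partition("=")
        # Bare parameters such as "hit" or "stored" are treated as boolean flags.
        parsed["params"][name] = val if has_value else True
    return parsed


def served_from_cache(value):
    """True when the cache reports a hit, False when it forwarded the request."""
    return parse_cache_status(value)["params"].get("hit") is True
```

For the first response above `served_from_cache` returns `True`; for the second one the parsed parameters contain `fwd` set to `uri-miss` together with the `stored` flag.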

This is a bug because the client must not be allowed to control whether my reverse proxy cache is used. This is the same discussion we've had in #277 (comment).

In your request you're saying that the client doesn't accept a stored response. It has to contact the upstream (given from pragma: no-cache directive).

Yes, but I don't want my proxy server to keep pinging my upstream; avoiding that is the purpose of a proxy server. I, the service vendor, decide when the upstream server will be pinged, not the client at their whim.

Did you read the RFC? You have to respect the user choice. If you are sending cached content to your end users that tell you "I don't want a cached item", that's your choice but it doesn't respect the RFC and standards.

> If you are sending cached content to your end users that tell you "I don't want a cached item", that's your choice but it doesn't respect the RFC and standards.

That's fine by me, does Souin allow to do that?

Basically, I don't want to allow clients to DDoS me, just letting them say

No, please blast your upstream with millions of requests even though you have a valid response stored and ready to get served.

I don't want the client to have that lever on my cache system, doing that would be an opt-in for privileged users, maybe.

For example, I'm currently checking Cloudflare in the same situation (requesting with pragma:no-cache) and it still serves from cache; it doesn't care about what I, the client, say or think:

Age: 82
CF-Cache-Status: HIT
CF-RAY: 7c8465a46e6e3720-FRA

RFCs and standards are one thing, but being pragmatic and robust about this isn't really optional.

What Souin is able to do is the following.
Let's define 10 users.
U1: GET /foo.webp
Will proxy to the upstream
Returns Cache-Status: ...; stored; ...

U2: GET /foo.webp Cache-Control: no-cache (or Pragma)
Will proxy to the upstream
Returns Cache-Status: ...; stored; ...

U3, U4, U5..., U10 (at the same time): GET /foo.webp Cache-Control: no-cache (or Pragma)
Will proxy only U3 to the upstream
Returns Cache-Status: ...; stored; ... to all awaiting users.
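
The U3 to U10 case describes request coalescing: concurrent requests for the same key share one upstream fetch. The following is a rough Python sketch of that idea under stated assumptions, not Souin's actual (Go) implementation. `SingleFlight`, `simulate` and the key `GET-/foo.webp` are hypothetical names chosen for illustration.

```python
import threading
import time


class SingleFlight:
    """Collapse concurrent calls for the same key into one upstream fetch."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> shared state of the fetch in progress

    def waiting(self, key):
        """Number of callers currently waiting on the leader for this key."""
        with self._lock:
            call = self._inflight.get(key)
            return call["waiters"] if call else 0

    def do(self, key, fetch):
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = {"done": threading.Event(), "value": None,
                        "error": None, "waiters": 0}
                self._inflight[key] = call
            else:
                call["waiters"] += 1
        if leader:
            try:
                call["value"] = fetch()  # only the leader contacts the upstream
            except Exception as exc:
                call["error"] = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call["done"].set()  # wake every waiting caller
        else:
            call["done"].wait()
        if call["error"] is not None:
            raise call["error"]
        return call["value"]


def simulate(users=8):
    """Run `users` concurrent requests; return (upstream call count, results)."""
    flight = SingleFlight()
    key = "GET-/foo.webp"
    upstream_calls = []
    results = []
    release = threading.Event()

    def fetch():
        upstream_calls.append(1)
        release.wait(timeout=5)  # keep the upstream "slow" until everyone joined
        return "image-bytes"

    threads = [
        threading.Thread(target=lambda: results.append(flight.do(key, fetch)))
        for _ in range(users)
    ]
    for thread in threads:
        thread.start()
    # Release the upstream only once all other users are waiting on the leader.
    deadline = time.monotonic() + 5
    while flight.waiting(key) < users - 1 and time.monotonic() < deadline:
        time.sleep(0.01)
    release.set()
    for thread in threads:
        thread.join()
    return len(upstream_calls), results
```

In this sketch `simulate(8)` mirrors U3 to U10: one caller becomes the leader and fetches, and the other seven receive the leader's result. Requests that arrive after the fetch has finished are not coalesced and would start a new fetch.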

If I understood you correctly, it will use the item (from request 3) from cache on requests 4-10, but the Cache-Status will still say stored, like it did when creating the cache item in request 3? Isn't that a bug? The Cache-Status header should be rewritten by Souin in that case. 🤔

I'm probably misunderstanding you, maybe I should enable logging on my proxy to see when exactly it hits upstream because currently it seems to be missing each time with pragma set, which is not desired.

Edit: I can confirm it does indeed hit the upstream server each time with Pragma set, it will not reuse the response from U3 in U4-U10.

Basically, I need to be able to totally ignore the client requesting to skip the cache; it should be up to me, the Souin operator, to configure that.

@dkarlovi You could use request_header -Cache-Control, and then control it however you want.

Ah, that's a nice idea indeed and could probably work. What are all the request headers Souin will take into account to force passing the request here?

I thought so, but now that you're mentioning Pragma, I'll take a look at my setup one more time.

I'm in the same boat as you, I would love to use Caddy+Souin on a 40-50M monthly hits network of websites, but I still have a lot of quirks that I need to fix.

@frederichoule yes, probably makes sense to start with those, but IMO it feels a bit fragile to keep the list in sync manually, and it will not end up being as flexible / robust as I'd like. A nice addition here might be something like

cache {
    ignore_request_cache_headers
}

(but with better naming 😄) which then strips / ignores the headers for us, whichever ones Souin otherwise checks to force passing here.

I've changed the issue title since @darkweak confirmed the current behaviour is by design so it's not actually a bug.

I've been thinking about that. If we provide a directive for that it could be

cache {
    mode {strict (default), bypass_request, bypass_response, bypass}
}
  • strict follows the RFC
  • bypass ignores the request/response headers
  • bypass_request ignores the request headers only
  • bypass_response ignores the response headers only
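
One possible reading of these four modes, sketched in Python for illustration only. This is not the code from the PR: `Mode`, `may_serve_stored` and `may_store` are hypothetical names, and the header handling is heavily simplified compared to what the RFC requires (for example, Pragma is consulted even when Cache-Control is present, and only `no-cache`, `no-store` and `private` are considered).

```python
from enum import Enum


class Mode(Enum):
    STRICT = "strict"                    # follow the RFC (proposed default)
    BYPASS_REQUEST = "bypass_request"    # ignore request headers only
    BYPASS_RESPONSE = "bypass_response"  # ignore response headers only
    BYPASS = "bypass"                    # ignore both


def _directives(headers, *names):
    """Collect lower-cased directive names from the given header fields."""
    lowered = {name.lower(): value for name, value in headers.items()}
    found = set()
    for name in names:
        for part in lowered.get(name, "").split(","):
            directive = part.split("=", 1)[0].strip().lower()
            if directive:
                found.add(directive)
    return found


def may_serve_stored(mode, request_headers):
    """May a stored response be served without contacting the upstream?"""
    if mode in (Mode.BYPASS_REQUEST, Mode.BYPASS):
        return True  # the client's wishes are ignored
    return "no-cache" not in _directives(request_headers, "cache-control", "pragma")


def may_store(mode, response_headers):
    """May the upstream response be stored by this shared cache?"""
    if mode in (Mode.BYPASS_RESPONSE, Mode.BYPASS):
        return True  # the upstream's wishes are ignored
    return not ({"no-store", "private"} & _directives(response_headers, "cache-control"))
```

Under these assumptions, a request carrying `Pragma: no-cache` is forwarded in `strict` and `bypass_response`, and answered from the cache in `bypass_request` and `bypass`.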

What do you think about that @dkarlovi @frederichoule ?

I think that would be perfect for my use case.

I agree. It sounds just like the thing I'd want to use here and probably in the majority of cases. 👍

The PR should be ready this weekend.

@dkarlovi can you try the linked PR please?
xcaddy build --with github.com/darkweak/souin@3ab7a6d9eb52d09e857e9d52c9cc0d407009a5c4 --with github.com/darkweak/souin/plugins/caddy@3ab7a6d9eb52d09e857e9d52c9cc0d407009a5c4

@darkweak there seems to be a regression, I had this before

xcaddy build \
    --with github.com/caddyserver/cache-handler@v0.7.0

and now I have

xcaddy build \
    --with github.com/darkweak/souin@3ab7a6d9eb52d09e857e9d52c9cc0d407009a5c4 \
    --with github.com/darkweak/souin/plugins/caddy@3ab7a6d9eb52d09e857e9d52c9cc0d407009a5c4

Without any other changes, my cache now reports Souin; fwd=uri-miss; stored; each time. If I go back to the previous version, it works again. Caddy in Docker, 2.6.4.

What is the config?

{
    auto_https disable_redirects
    order cache before rewrite
}

(common) {
    log {
        output stdout
    }
    header /* {
        -Server
    }
}

(cors) {
    @origin{args.0} header Origin {args.0}
    header @origin{args.0} Access-Control-Allow-Origin "{args.0}"
    header @origin{args.0} Access-Control-Allow-Headers "content-type, x-requested-with"
    header @origin{args.0} Vary Origin
}

storage.demo.example.com:80, storage.demo.example.com.local:80 {
    import cors *
    import common
    cache {
        allowed_http_verbs GET
        nuts {
            path /tmp
        }
        default_cache_control "public, max-age=86400"
        stale 1h
        ttl 24h
    }
    reverse_proxy https://stexampledemo.blob.core.windows.net {
        header_up Host {upstream_hostport}
    }
    header /* {
        -X-Ms-Blob-Type
        -X-Ms-Lease-Status
        -X-Ms-Request-Id
        -X-Ms-Version
    }
}

What's the curl request?

BTW this made me realize the cache plugin could provide more details in the debug logs about why it's not using the cache, something like

  1. "no entry found matching the key"
  2. "entry stale"
  3. "bypassing cache due to request header Pragma"

etc.
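
A sketch of what such a lookup could look like if it returned the reason alongside the result. This is illustrative Python, not Souin code: `Entry` and `lookup_with_reason` are hypothetical names, the store is a plain dict, and the reason strings are the ones suggested in the list above.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    body: bytes
    stored_at: float  # seconds, same clock as `now`
    ttl: float        # seconds the entry stays fresh


def lookup_with_reason(store, key, request_headers, now):
    """Return (entry or None, reason) so the reason can go to a debug log."""
    lowered = {name.lower(): value.lower() for name, value in request_headers.items()}
    for name in ("cache-control", "pragma"):
        if "no-cache" in lowered.get(name, ""):
            return None, f"bypassing cache due to request header {name.title()}"
    entry = store.get(key)
    if entry is None:
        return None, "no entry found matching the key"
    if now - entry.stored_at > entry.ttl:
        return None, "entry stale"
    return entry, "hit"
```

The caller would log the second element at debug level whenever the first one is `None`.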

> What's the curl request?

$ http http://storage.demo.example.com.local/thumbnail/_default_upload_bucket/5975/image-thumb__5975__coreshop_category/stefan-stefancik-5p_7M5MP2Iw-unsplash@2x.webp -v
GET /thumbnail/_default_upload_bucket/5975/image-thumb__5975__coreshop_category/stefan-stefancik-5p_7M5MP2Iw-unsplash@2x.webp HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Host: storage.demo.example.com.local
User-Agent: HTTPie/3.2.1



HTTP/1.1 200 OK
Cache-Control: public, max-age=86400
Cache-Status: Souin; fwd=uri-miss; stored; key=GET-http-storage.demo.example.com.local-/thumbnail/_default_upload_bucket/5975/image-thumb__5975__coreshop_category/stefan-stefancik-5p_7M5MP2Iw-unsplash@2x.webp
Content-Length: 101648
Content-Md5: XlXDfwguIHYzFxwvU/X6ow==
Content-Type: image/webp
Date: Mon, 29 May 2023 10:22:15 GMT
Etag: 0x8DA59F5DF4F6A83
Last-Modified: Wed, 29 Jun 2022 17:36:17 GMT

I will add more detail in the key because I don't see any clues about your case.

Yes, same here, IMO adding more debug info would be valuable overall. It's exactly the same upstream server and exactly the same request; it works with 0.7.0, but not with the new patch.

Opened #349, #350.

I've been doing this because the pragma behaviour was a massive issue for me too. I don't understand why any server would allow clients to control cache behaviour. Exactly as you said, it's just asking for a DDoS.

{
    "handler": "headers",
    "request": {
        "delete": [
            "cache-control",
            "pragma"
        ]
    },
    "response": {
        "deferred": true,
        "delete": [
            "cache-control",
            "pragma",
            "server",
            "x-powered-by"
        ]
    }
},
{
    "allowed_http_verbs": [
        "GET"
    ],
    "default_cache_control": "no-store",
    "handler": "cache",
    "log_level": "DEBUG"
},

The important part is deleting the headers before it goes to the cache plugin.
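
Why the order matters can be shown with a toy handler chain. This is a hedged Python sketch, not Caddy or Souin code: `strip_cache_headers`, `make_cache` and `wants_fresh` are hypothetical names, the cache holds a single response, and it is assumed that Cache-Control and Pragma are the only request headers that force a bypass.

```python
CACHE_REQUEST_HEADERS = frozenset({"cache-control", "pragma"})  # assumed list


def wants_fresh(request_headers):
    """True when the client asks to skip stored responses."""
    return any(
        name.lower() in CACHE_REQUEST_HEADERS and "no-cache" in value.lower()
        for name, value in request_headers.items()
    )


def strip_cache_headers(next_handler):
    """Middleware: drop cache-related request headers, then call the next handler."""
    def handler(request_headers):
        cleaned = {
            name: value
            for name, value in request_headers.items()
            if name.lower() not in CACHE_REQUEST_HEADERS
        }
        return next_handler(cleaned)
    return handler


def make_cache(upstream):
    """Toy single-entry cache that honours the client's no-cache request."""
    stored = []

    def handler(request_headers):
        if stored and not wants_fresh(request_headers):
            return "hit", stored[0]
        body = upstream()  # contact the origin
        stored[:] = [body]
        return "stored", body
    return handler
```

With `strip_cache_headers(make_cache(upstream))`, a request carrying `Pragma: no-cache` reaches the cache without that header and is served from the stored response. Calling `make_cache(upstream)` directly sends every such request to the upstream.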