pulp/pulpcore

Content consumption metrics ignore content when cached

Closed this issue · 6 comments

Version
3.61.0

Describe the bug
From what I've been told, the download bytes metrics emitted by the content app ignore fetches of content that happen once the cache is warmed up and already includes that content.

To Reproduce
With caching turned on in the content app, fetch a file 1000 times. You'd expect the content metrics to indicate 1000x the size of the file; instead, they only reflect the first fetch.

Expected behavior
Every fetch of the file is counted in the download metrics regardless of pulp-content caching

Additional context

I don't think I am following what you are saying. Are you saying that once a file is cached, the download bytes metric should no longer increase if that file is requested again? Or are you saying the opposite, that once a file is cached the download bytes metric no longer increases on repeated requests even though it should?

The cache is poorly named, as it isn't a store of recently requested files; it's a lookup table of where the recently requested files are stored.

@gerrod3 I will admit that this is based entirely on what @lubosmj told me. It seems that he may not be sure, and it just needs to be tested.

But from what I was told, if caching is enabled, the metrics only represent the initial fetch of the file, and until that cache entry is evicted, further requests are not reflected in the download bytes metrics.

I can officially confirm that once we "cache" a requested file, we no longer report the content consumption for the file.

Tested locally. I synced a file repository containing 3 files, 1MB each (https://fixtures.pulpproject.org/file/PULP_MANIFEST).

pulp file remote create --name test --url https://fixtures.pulpproject.org/file-many/PULP_MANIFEST --policy immediate
pulp file repository create --name test --remote test
pulp file repository sync --name test
pulp file publication create --repository test
pulp file distribution create --name test --base-path test --repository test

Then, I manually issued GET requests against the distributed source:

http http://localhost:5001/pulp/content/default/test/3.iso & http http://localhost:5001/pulp/content/default/test/3.iso & http http://localhost:5001/pulp/content/default/test/3.iso & 

Instead of showing a growing trend, the curve remains steady once all three files are cached.
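The flat curve can be reproduced with a toy model. The following is a minimal sketch, not pulpcore code: `ContentApp`, `fetch_many`, and the `report_on_cache_hit` switch are hypothetical names, and it assumes the metric is only incremented inside the handler that runs on a cache miss. The thread confirms the symptom, not this exact mechanism.

```python
# Hypothetical model of a content app with a response cache.
# None of these names come from pulpcore; they only illustrate the symptom.
class ContentApp:
    def __init__(self, files, report_on_cache_hit):
        self.files = files                  # path -> size in bytes
        self.cache = {}                     # path -> cached response metadata
        self.report_on_cache_hit = report_on_cache_hit
        self.served_bytes = 0               # stand-in for the download bytes metric

    def _handle(self, path):
        # Cache miss: the handler runs and reports the served size.
        size = self.files[path]
        self.served_bytes += size
        return {"path": path, "content_length": size}

    def get(self, path):
        cached = self.cache.get(path)
        if cached is not None:
            # Cache hit: the handler is skipped, so the metric is only
            # updated if it is reported explicitly here.
            if self.report_on_cache_hit:
                self.served_bytes += cached["content_length"]
            return cached
        response = self._handle(path)
        self.cache[path] = response
        return response


def fetch_many(app, path, times):
    """Request the same path repeatedly and return the reported byte count."""
    for _ in range(times):
        app.get(path)
    return app.served_bytes
```

With `report_on_cache_hit=False`, only the first request is counted; with `True`, every request is counted, which matches the expected behavior described in the report.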


Since we have reopened this topic, I would like to clarify expectations regarding redirect handling. Are we comfortable with a scenario where a user requests a content file but chooses not to follow the redirect? In this case, we would still report the consumption as if the user had followed the redirect and downloaded the content. Is this approach acceptable, @jlsherrill?

self._report_served_artifact_size(content_length)
if domain.storage_class == "pulpcore.app.models.storage.FileSystem":
    path = storage.path(artifact_name)
    if not os.path.exists(path):
        raise Exception(_("Expected path '{}' is not found").format(path))
    return FileResponse(path, headers=headers)
elif not domain.redirect_to_object_storage:
    return ArtifactResponse(content_artifact.artifact, headers=headers)
elif domain.storage_class == "storages.backends.s3boto3.S3Boto3Storage":
    raise HTTPFound(_build_url(http_method=request.method))

Yes, I am comfortable with us reporting that a redirect was followed even though we can't be sure whether it was.

I believe in practice clients will follow the redirects, because it's typically a software client that actually wants the data. Also, it's the best we can do when S3 serves the data directly.
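The agreed trade-off can be illustrated with a small sketch. It assumes, as in the snippet quoted earlier in the thread, that the size is reported before the response type is chosen; `report_and_respond` and the `counter` dict are hypothetical stand-ins, not pulpcore APIs.

```python
def report_and_respond(content_length, redirect_url, counter):
    """Count the bytes first, then decide how to respond (hypothetical helper)."""
    # The size is recorded up front, so a redirect is counted as served
    # even though the server cannot see whether the client follows it.
    counter["bytes"] += content_length
    if redirect_url is not None:
        return 302, {"Location": redirect_url}
    return 200, {"Content-Length": str(content_length)}
```

A client that receives the 302 and never requests the object storage URL still adds `content_length` to the counter, which is the over-reporting accepted above.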