Include `.conda` packages

Question

It would be helpful to include both .conda and .tar.bz2 packages, particularly as more of the former and fewer of the latter are produced. Tracking the two formats separately would also help in following the transition to the newer format.

Answer 1 · 2023-08-07T05:43:44.000Z

cc @beckermr @wolfv

Answer 2 · 2023-09-20T18:13:47.000Z

Looking into this with @cappadona

Answer 3 · 2023-10-02T22:31:03.000Z

@jezdez Did that go anywhere? I was working on collecting some download numbers for my library and right now 2023 shows minimal downloads due to the transition to .conda.

Answer 4 · 2023-10-24T00:03:04.000Z

@jezdez did this issue get solved more broadly?

Saw the python packages were fixed recently: #41

Is there a path for fixing the other packages? Or did this already happen?

Answer 5 · 2023-10-24T14:12:53.000Z

@jakirkham @dopplershift. Apologies for the delay.

We have not yet addressed .conda packages missing from this data set. This work is on our backlog, and we should be able to get this done in November. We will provide updates here, but please don't hesitate to reach out with questions.

Answer 6 · 2023-10-25T17:37:56.000Z

Thanks Nick! 🙏

Answer 7 · 2024-01-04T17:47:07.000Z

Hi @jakirkham @dopplershift. Quick update on the status of this issue.

We're working on finalizing a new pipeline that will source this public data set and include .conda packages moving forward. We expect to have it ready by the end of March 2024 and will post an update here when it is available.

Answer 8 · 2024-01-04T17:54:34.000Z

Hi @cappadona Thanks for the update! Q: Would it be possible to also update the past statistics when the new pipeline is up?

Answer 9 · 2024-01-04T18:00:06.000Z

@leofang At the moment we're not planning to replace any existing files in the bucket; the fix will only be applied to future data.

Answer 10 · 2024-01-23T20:26:48.000Z

cc @aterrel @chenghlee (as we discussed this earlier)

Answer 11 · 2024-03-01T15:37:10.000Z

Hi @cappadona @jezdez Friendly nudge for updates 🙂 This has impacted several statistics tracking tools and caused confusion. I've heard jabbering about "no one is using conda" as they looked at the download counts from, say, condastats, but it is simply not true.

Answer 12 · 2024-03-01T16:00:04.000Z

Hi @leofang. Thanks for checking in. We are on track to include .conda packages in the dataset by the end of the month.

Answer 13 · 2024-03-19T19:25:34.000Z

Just wanted to check in, @cappadona how are things looking here?

Answer 14 · 2024-03-19T19:48:35.000Z

Still looks reaaaally flat: https://prefix.dev/channels/conda-forge/packages/aesara (picked a random package)

Answer 15 · 2024-03-19T20:19:06.000Z

To be fair, Nick said end of the month originally. So end of next week

Though would be good to learn if that is still true or if this is likely to slip

Answer 16 · 2024-04-01T04:29:49.000Z

@cappadona how are things looking?

Answer 17 · 2024-04-01T13:45:34.000Z

@jakirkham Sorry I missed your earlier message. Thanks for checking in. We're looking good and the March 2024 data published to the s3 bucket later this week will include .conda packages.

I will post an update to this thread once the March data is available.

Answer 18 · 2024-04-01T17:46:19.000Z

Thanks Nick! 🙏

Answer 19 · 2024-04-05T21:40:17.000Z

Hi all. Quick update. We're just about there. Finalizing QA with the rest of the team, including a colleague who returns next week. Here are a couple examples for March 2024.

Answer 20 · 2024-04-05T21:50:31.000Z

Thanks Nick! 🙏

With numpy this includes some older versions like 1.9.2, are these coming from defaults? Asking as conda-forge jumped to numpy version 1.9.3 (in the 1.9 series). Or is this an amalgamation of different channel statistics?

aesara is only in conda-forge AFAIK. So am guessing the top sheet is based on conda-forge data. Is that right?

Answer 21 · 2024-04-08T15:01:32.000Z

Hi @jakirkham. The screenshot is an aggregation of multiple channels, which are usually identified in the final dataset via the data_source column. I did confirm that conda-forge is the only data source for aesara.
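
To illustrate how the data_source column separates channels, here is a minimal sketch on a small made-up frame. The column names `pkg_name`, `pkg_version`, and `counts` are assumed to match the published parquet files, and the rows (including the version numbers and counts) are invented purely for the example; `sources_for` is a hypothetical helper.

```python
import pandas as pd

# Toy rows standing in for a monthly parquet file; all values are invented.
df = pd.DataFrame(
    {
        "data_source": ["anaconda", "conda-forge", "conda-forge", "conda-forge"],
        "pkg_name": ["numpy", "numpy", "aesara", "aesara"],
        "pkg_version": ["1.9.2", "1.9.3", "2.9.3", "2.9.3"],
        "counts": [10, 20, 5, 7],
    }
)


def sources_for(frame, pkg):
    """Return total downloads per data_source for one package (hypothetical helper)."""
    rows = frame[frame["pkg_name"] == pkg]
    return rows.groupby("data_source")["counts"].sum()


aesara_sources = sources_for(df, "aesara")
numpy_sources = sources_for(df, "numpy")
print(aesara_sources)
print(numpy_sources)
```

A package that is only published in one channel shows a single entry in the result, while an aggregated package such as numpy shows one entry per channel.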

Answer 22 · 2024-04-17T02:10:07.000Z

How are things looking @cappadona ?

Answer 23 · 2024-04-30T18:22:55.000Z

@cappadona are there any updates here?

Also as a side note, users are also asking about March data in this issue: #51

Answer 24 · 2024-05-04T14:48:41.000Z

Hi @jakirkham. Monthly and hourly data for March and April 2024, which include .conda packages, are now available in the bucket.

Thank you all for your patience.

Answer 25 · 2024-05-06T11:02:57.000Z

@cappadona Do you think we could update the old files as well, since .conda files had been hosted for a while? Should we keep this ticket open until we fix that?

Answer 26 · 2024-05-06T11:13:02.000Z

So just to get it right, the format of the parquet files changed?

Answer 27 · 2024-05-06T11:32:44.000Z

Neither this command:

condastats overall pandas --start_month 2019-01 --end_month 2019-03 --monthly

Nor

condastats overall pandas --start_month 2024-01 --end_month 2024-03 --monthly

seem to work (both fail with FileNotFoundError: anaconda-package-data/conda/monthly/2024/2024-01.parquet). Did something else change? Note that these were taken from the official anaconda blog: https://www.anaconda.com/blog/get-python-package-download-statistics-with-condastats

I also tried to get the parquet file locally:

import pandas as pd

year = 2024
month = 4

s = f's3://anaconda-package-data/conda/monthly/{year}/{month:02}/{year}-{month:02}.parquet'

pd.read_parquet(s)

But it also fails because it can't find the file.

On our server, it seems to have downloaded the file at least at some point, but the download counts were not updated (maybe because there is a new column that we don't take into account).

Answer 28 · 2024-05-06T18:25:57.000Z

Thanks Nick! 🙏

So I tried condastats overall pandas --start_month 2024-03 --end_month 2024-04 --monthly

Though got an error from condastats: conda-incubator/condastats#20

Maybe this is due to the same issue Wolf pointed out above?

Answer 29 · 2024-05-07T08:39:03.000Z

I'll ask Sophia to move the project into the conda-incubator, so we can fix it

Edit: conda-incubator/condastats#21

Answer 30 · 2024-05-07T08:42:18.000Z

Nick is out currently and will pick the topic back up when he's back.

Answer 31 · 2024-05-07T08:54:16.000Z

And should the following work?

aws s3 cp s3://anaconda-package-data/conda/monthly/2024/04/2024-04.parquet ./

?

Answer 32 · 2024-05-15T17:32:05.000Z

I just tried to run aws s3 ls s3://anaconda-package-data/conda/monthly/ and that also doesn't work. Can you help me @cappadona? Am I doing something wrong? Is this supposed to work? Or not supported anymore?

Answer 33 · 2024-05-15T17:37:11.000Z

I double checked with the conda-forge scripts and none of the historic download counts seem to be publicly available:

Error:

ClientConnectorError: Cannot connect to host anaconda-package-data.s3.weur.amazonaws.com:443 ssl:default [nodename nor servname provided, or not known]

Running this Notebook: https://github.com/conda-forge/by-the-numbers/blob/main/total%20downloads.ipynb

Answer 34 · 2024-05-16T13:57:43.000Z

Hi. I'm back and catching up...I think there are a couple different issues at play here.

> I just tried to run aws s3 ls s3://anaconda-package-data/conda/monthly/ and that also doesn't work. Can you help me @cappadona? Am I doing something wrong? Is this supposed to work? Or not supported anymore?

@wolfv Can you give this another try with the --no-sign-request option?

aws s3 ls s3://anaconda-package-data/conda/monthly/ --no-sign-request
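
For anyone reading the files from Python rather than the AWS CLI, a rough equivalent of --no-sign-request is sketched below. The bucket layout is an assumption taken from the path in the FileNotFoundError quoted earlier in this thread (`conda/monthly/<year>/<year>-<month>.parquet`), the helper names are hypothetical, and `read_monthly` needs pandas plus s3fs installed and network access, so it is defined here but not called.

```python
def monthly_parquet_url(year: int, month: int) -> str:
    """Build the S3 URL for one month, assuming the <year>/<year>-<month> layout."""
    return f"s3://anaconda-package-data/conda/monthly/{year}/{year}-{month:02d}.parquet"


def read_monthly(year: int, month: int):
    """Read one monthly file without AWS credentials (needs pandas and s3fs)."""
    import pandas as pd  # imported lazily so the URL helper works without pandas

    # anon=True asks s3fs for unsigned requests, like --no-sign-request on the CLI.
    return pd.read_parquet(
        monthly_parquet_url(year, month), storage_options={"anon": True}
    )


url = monthly_parquet_url(2024, 4)
print(url)  # s3://anaconda-package-data/conda/monthly/2024/2024-04.parquet
```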

> Thanks Nick! 🙏
>
> So I tried condastats overall pandas --start_month 2024-03 --end_month 2024-04 --monthly
>
> Though got an error from condastats: sophiamyang/condastats#20
>
> Maybe this is due to the same issue Wolf pointed out above?

@jakirkham I'm able to reproduce this, and it looks like the new parquet files for March and April 2024 are missing some pandas-specific properties in the file metadata that condastats expects.

I will follow up in conda-incubator/condastats#20

Answer 35 · 2024-05-20T14:24:54.000Z

confirmed the data is available:

Answer 36 · 2024-05-21T00:52:20.000Z

@aterrel that appears to be 2023 data. Were you able to load 2024 data from April or March?

Answer 37 · 2024-05-21T14:34:06.000Z

I do see data for 2024-04

Answer 38 · 2024-05-29T14:42:12.000Z

Hi @wolfv. Any luck on your end?

> I just tried to run aws s3 ls s3://anaconda-package-data/conda/monthly/ and that also doesn't work. Can you help me @cappadona? Am I doing something wrong? Is this supposed to work? Or not supported anymore?

@wolfv Can you give this another try with the --no-sign-request option?
aws s3 ls s3://anaconda-package-data/conda/monthly/ --no-sign-request

Answer 39 · 2024-05-29T17:02:18.000Z

Yep, the data is back:

Answer 40 · 2024-05-29T18:31:47.000Z

Looking at @aterrel's plot above, I am curious why 12.3 doesn't show up. Is this an issue in the data or in the code for the plot?

Answer 41 · 2024-05-30T04:49:09.000Z

Tried generating my own script to parse through the data. Am seeing the following download counts for cudatoolkit (legacy package for CUDA 11 and earlier) and cuda-version (used in CUDA 12 and later)

#!/usr/bin/env python


import sys

from packaging.version import InvalidVersion, Version

import matplotlib.pyplot as plt
import pandas as pd


plt.rcParams["figure.figsize"] = (22, 5)


def main(*argv):
    pkgs = [
        ("cudatoolkit", lambda v: Version("11.2") <= v < Version("12")),
        ("cuda-version", lambda v: Version("12") <= v and str(v) != "12.0.0"),
    ]

    for each_pkg, keep_filter in pkgs:
        year = "2024"
        month = "04"
        df = pd.read_parquet(f"{year}-{month}.parquet")

        df_pkg = df[df["pkg_name"] == each_pkg]

        pkg_vers = []
        for v in df_pkg["pkg_version"].unique():
            try:
                v = Version(v)
            except InvalidVersion:
                # Skip invalid version formats
                continue
            pkg_vers.append(v)
        pkg_vers = sorted(pkg_vers)

        pkg_vers_filt = list(filter(keep_filter, pkg_vers))

        df_pkg_sorted = pd.concat(
            [df_pkg[df_pkg["pkg_version"] == str(v)] for v in pkg_vers_filt]
        )

        # Copy the slice so rescaling the counts does not trigger SettingWithCopyWarning
        df_pkg_plot = df_pkg_sorted[["pkg_version", "counts"]].copy()
        df_pkg_plot["counts"] = df_pkg_plot["counts"] / 1e6

        plt.clf()

        plt.bar(df_pkg_plot["pkg_version"], df_pkg_plot["counts"])
        plt.title(f"{each_pkg} versions vs. Downloads (millions) for {year}-{month}")
        plt.xlabel(f"{each_pkg} versions")
        plt.ylabel("Downloads (millions)")

        plt.savefig(f"{each_pkg}_download_count.svg")

    return 0


if __name__ == "__main__":
    sys.exit(main(*sys.argv))

Here are the results it shows (note values are in millions):

Admittedly this is only one month

Plus some packages built with CUDA support link to the driver (like Arrow); so, may not pull in either of these packages at install time (despite building with CUDA support)

Also it would be better to group the cudatoolkit patch versions together like how cuda-version is handled

Nevertheless this is a good rough test of the data. It does seem to be picking up download counts for these packages that were missed in prior months (which had been off by a couple orders of magnitude in the worst case)

Edit: Fix issue where 12.0 got cut off
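
The grouping of cudatoolkit patch versions suggested above could look roughly like the following. This is only a sketch on invented rows, not part of the script above: it assumes the `pkg_name`, `pkg_version`, and `counts` columns used there, treats the first two dot-separated components as the group key, and sums the counts so that multiple rows for the same version are added together. `downloads_by_minor` is a hypothetical helper.

```python
import pandas as pd

# Invented rows; a real monthly file can have many rows per version.
df = pd.DataFrame(
    {
        "pkg_name": ["cudatoolkit"] * 5,
        "pkg_version": ["11.2.0", "11.2.2", "11.8.0", "11.8.0", "11.2.2"],
        "counts": [100, 250, 400, 50, 25],
    }
)


def downloads_by_minor(frame, pkg):
    """Sum download counts per major.minor version for one package."""
    rows = frame[frame["pkg_name"] == pkg].copy()
    # Keep only the first two dot-separated components: "11.2.2" -> "11.2"
    rows["minor"] = rows["pkg_version"].str.split(".").str[:2].str.join(".")
    return rows.groupby("minor")["counts"].sum()


by_minor = downloads_by_minor(df, "cudatoolkit")
print(by_minor)
```

Note that the result is sorted as strings, so a key like "11.10" would sort before "11.2"; sorting with `packaging.version.Version` as the key would be needed for a correctly ordered plot.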

Answer 42 · 2024-06-04T15:12:51.000Z

Thanks @jakirkham. May 2024 data was made available this past Saturday, June 1st.

As of today, the .conda packages are included in the data for the following months:

2024-03
2024-04
2024-05

Based on the community feedback thus far, we're considering replacing data for additional prior months, and updating them to also include .conda packages. Stay tuned.

cc @jezdez

Answer 43 · 2024-06-04T16:58:28.000Z

> Based on the community feedback thus far, we're considering replacing data for additional prior months, and updating them to also include .conda packages. Stay tuned.

Thanks Nick! 🙏

This would be incredibly helpful 🙂

Answer 44 · 2024-06-05T02:06:06.000Z

It would be amazing to pull these updates back to the introduction of .conda artefacts, both for having a correct history and an accurate total number of downloads. The conda-forge landing page currently prominently displays the latter, and I think we're still not counting over a year of .conda downloads.

If one executes the by-the-numbers notebook linked from the conda-forge landing page (with some minor adaptations to update the loop over the years we're interested in), we get the following for 2021-2023:

While there's undoubtedly some variability in the monthly data, to my understanding that sharp drop-off is related to the introduction of .conda around November 2022.

Answer 45 · 2024-06-05T12:26:40.000Z

I agree with @h-vetinari, let’s make this available for the whole time period, doesn’t make sense otherwise IMO.

Answer 46 · 2024-06-09T14:44:38.000Z

Did something happen with the timestamps? For some reason, we seem to have some new entries at "epoch 0" (i.e. somewhere in 1970).

I'll delete/filter them from our data but just wanted to check if anyone knows what's up?
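
A minimal sketch of such a filter is below, assuming the timestamp column is called `time` and is already a datetime column; both the column name and the cutoff date are assumptions, the rows are invented, and `drop_epoch_rows` is a hypothetical helper.

```python
import pandas as pd

# Invented rows: one spurious entry at the Unix epoch and two plausible months.
df = pd.DataFrame(
    {
        "time": pd.to_datetime(["1970-01-01", "2024-04-01", "2024-05-01"]),
        "pkg_name": ["pandas", "pandas", "pandas"],
        "counts": [3, 100, 120],
    }
)


def drop_epoch_rows(frame, cutoff="2000-01-01"):
    """Keep only rows whose timestamp is on or after the (assumed) cutoff date."""
    return frame[frame["time"] >= pd.Timestamp(cutoff)]


cleaned = drop_epoch_rows(df)
print(cleaned)
```

Any cutoff earlier than the first real month of data would work equally well; the point is only to exclude rows that landed in 1970.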