ContinuumIO/anaconda-package-data

Include `.conda` packages

Opened this issue · 46 comments

It would be helpful to include both .conda & .tar.bz2 packages, particularly as more of the former and fewer of the latter are produced. It may also help to track these separately to follow the transition to the newer format.

Looking into this with @cappadona

@jezdez Did that go anywhere? I was working on collecting some download numbers for my library and right now 2023 shows minimal downloads due to the transition to .conda.

@jezdez did this issue get solved more broadly?

Saw the python packages were fixed recently: #41

Is there a path for fixing the other packages? Or did this already happen?

@jakirkham @dopplershift. Apologies for the delay.

We have not yet addressed .conda packages missing from this data set. This work is on our backlog, and we should be able to get this done in November. We will provide updates here, but please don't hesitate to reach out with questions.

Thanks Nick! 🙏

Hi @jakirkham @dopplershift. Quick update on the status of this issue.

We're working on finalizing a new pipeline that will source this public data set and include .conda packages moving forward. We expect to have it ready by the end of March 2024 and will post an update here when it is available.

Hi @cappadona Thanks for the update! Q: Would it be possible to also update the past statistics when the new pipeline is up?

@leofang At the moment we're not planning to replace any existing files in the bucket and only implement the fix for future data.

cc @aterrel @chenghlee (as we discussed this earlier)

Hi @cappadona @jezdez Friendly nudge for updates 🙂 This has impacted several statistics tracking tools and caused confusion. I've heard jabbering about "no one is using conda" as they looked at the download counts from, say, condastats, but it is simply not true.

Hi @leofang. Thanks for checking in. We are on track to include .conda packages in the dataset by the end of the month.

Just wanted to check in, @cappadona how are things looking here?

Still looks reaaaally flat: https://prefix.dev/channels/conda-forge/packages/aesara (picked a random package)

To be fair, Nick said end of the month originally. So end of next week

Though would be good to learn if that is still true or if this is likely to slip

@cappadona how are things looking?

@jakirkham Sorry I missed your earlier message. Thanks for checking in. We're looking good and the March 2024 data published to the s3 bucket later this week will include .conda packages.

I will post an update to this thread once the March data is available.

Thanks Nick! 🙏

Hi all. Quick update. We're just about there. Finalizing QA with the rest of the team, including a colleague who returns next week. Here are a couple examples for March 2024.

Screenshot 2024-04-05 at 5 17 12 PM Screenshot 2024-04-05 at 5 20 53 PM

Thanks Nick! 🙏

With numpy this includes some older versions like 1.9.2, are these coming from defaults? Asking as conda-forge jumped to numpy version 1.9.3 (in the 1.9 series). Or is this an amalgamation of different channel statistics?

aesara is only in conda-forge AFAIK. So am guessing the top sheet is based on conda-forge data. Is that right?

Hi @jakirkham. The screenshot is an aggregation of multiple channels, which are usually identified in the final dataset via the data_source column. I did confirm that conda-forge is the only data source for aesara.
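
For anyone wanting to see the per-channel split rather than the aggregate, here is a minimal sketch. It assumes the monthly files expose `data_source`, `pkg_name` and `counts` columns (the latter two are the names used by the script later in this thread); the sample frame is made up purely for illustration and `downloads_by_channel` is a hypothetical helper.

```python
import pandas as pd


def downloads_by_channel(df: pd.DataFrame, pkg_name: str) -> pd.Series:
    """Sum download counts per data_source for one package."""
    rows = df[df["pkg_name"] == pkg_name]
    return rows.groupby("data_source")["counts"].sum().sort_values(ascending=False)


# Tiny made-up frame standing in for a monthly parquet file.
sample = pd.DataFrame(
    {
        "data_source": ["conda-forge", "anaconda", "conda-forge", "conda-forge"],
        "pkg_name": ["numpy", "numpy", "aesara", "numpy"],
        "counts": [10, 5, 3, 7],
    }
)

numpy_by_channel = downloads_by_channel(sample, "numpy")
aesara_by_channel = downloads_by_channel(sample, "aesara")
```

A package that only lives in one channel, like aesara in the sample, should come back with a single index entry.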

How are things looking @cappadona ?

@cappadona are there any updates here?

Also as a side note, users are also asking about March data in this issue: #51

Hi @jakirkham. Monthly and hourly data for March and April 2024, which includes .conda packages, are now available in the bucket.

Thank you all for your patience.

@cappadona Do you think we could update the old files as well, since .conda files had been hosted for a while? Should we keep this ticket open until we fix that?

So just to get it right, the format of the parquet files changed?

Neither this command:

condastats overall pandas --start_month 2019-01 --end_month 2019-03 --monthly

Nor

condastats overall pandas --start_month 2024-01 --end_month 2024-03 --monthly

seems to work (both fail with FileNotFoundError: anaconda-package-data/conda/monthly/2024/2024-01.parquet). Did something else change? Note that these were taken from the official anaconda blog: https://www.anaconda.com/blog/get-python-package-download-statistics-with-condastats

I also tried to get the parquet file locally:

import pandas as pd

year = 2024
month = 4

s = f's3://anaconda-package-data/conda/monthly/{year}/{month:02}/{year}-{month:02}.parquet'

pd.read_parquet(s)

But it also fails because it can't find the file.

On our server, it seems to have downloaded the file at least at some point, but the download counts were not updated (maybe because there is a new column that we don't take into account).

Thanks Nick! 🙏

So I tried condastats overall pandas --start_month 2024-03 --end_month 2024-04 --monthly

Though got an error from condastats: conda-incubator/condastats#20

Maybe this is due to the same issue Wolf pointed out above?

I'll ask Sophia to move the project into the conda-incubator, so we can fix it

Edit: conda-incubator/condastats#21

Nick is out currently and will pick the topic back up when he's back.

And should the following work?

aws s3 cp s3://anaconda-package-data/conda/monthly/2024/04/2024-04.parquet ./

?

I just tried to run aws s3 ls s3://anaconda-package-data/conda/monthly/ and that also doesn't work. Can you help me @cappadona? Am I doing something wrong? Is this supposed to work? Or not supported anymore?

I double checked with the conda-forge scripts and none of the historic download counts seem to be publicly available:

Screenshot 2024-05-15 at 19 34 58

Error:

ClientConnectorError: Cannot connect to host anaconda-package-data.s3.weur.amazonaws.com:443 ssl:default [nodename nor servname provided, or not known]

Running this Notebook: https://github.com/conda-forge/by-the-numbers/blob/main/total%20downloads.ipynb

Hi. I'm back and catching up...I think there are a couple different issues at play here.

I just tried to run aws s3 ls s3://anaconda-package-data/conda/monthly/ and that also doesn't work. Can you help me @cappadona? Am I doing something wrong? Is this supposed to work? Or not supported anymore?

@wolfv Can you give this another try with the --no-sign-request option?

aws s3 ls s3://anaconda-package-data/conda/monthly/ --no-sign-request
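
For the pandas route tried earlier in the thread, a rough Python counterpart to --no-sign-request might look like the sketch below. It assumes s3fs is installed, and it assumes the key layout seen in the condastats error message (monthly/{year}/{year}-{month}.parquet, with no separate month directory); neither has been verified against the bucket here. `monthly_parquet_url` and `read_monthly` are hypothetical helper names.

```python
import pandas as pd


def monthly_parquet_url(year: int, month: int) -> str:
    """Build the S3 URL using the layout shown in the condastats error message."""
    # Assumption: keys look like conda/monthly/2024/2024-01.parquet.
    return f"s3://anaconda-package-data/conda/monthly/{year}/{year}-{month:02}.parquet"


def read_monthly(year: int, month: int) -> pd.DataFrame:
    """Read one month of data without AWS credentials (needs s3fs installed)."""
    # storage_options is handed to s3fs; anon=True skips request signing,
    # which is the same idea as --no-sign-request in the AWS CLI.
    return pd.read_parquet(
        monthly_parquet_url(year, month),
        storage_options={"anon": True},
    )
```

Calling `read_monthly(2024, 4)` would then attempt an unsigned download of the April 2024 file.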

Thanks Nick! 🙏

So I tried condastats overall pandas --start_month 2024-03 --end_month 2024-04 --monthly

Though got an error from condastats: sophiamyang/condastats#20

Maybe this is due to the same issue Wolf pointed out above?

@jakirkham I'm able to reproduce this and it looks like the new parquet files for March and April 2024 are missing some pandas specific properties in the file metadata that are expected by condastats.

confirmed the data is available:

Screenshot 2024-05-20 at 10 24 14 AM

@aterrel that appears to be 2023 data. Were you able to load 2024 data from April or March?

I do see data for 2024-04

Screenshot 2024-05-21 at 10 33 04 AM

Hi @wolfv. Any luck on your end?

Yep, the data is back:
Screenshot 2024-05-29 at 19 01 41

Looking at @aterrel's plot above, am curious why 12.3 doesn't show up. Is this an issue in the data or the code for the plot?

Tried generating my own script to parse through the data. Am seeing the following download counts for cudatoolkit (legacy package for CUDA 11 and earlier) and cuda-version (used in CUDA 12 and later)

#!/usr/bin/env python


import sys

from packaging.version import InvalidVersion, Version

import matplotlib.pyplot as plt
import pandas as pd


plt.rcParams["figure.figsize"] = (22, 5)


def main(*argv):
    pkgs = [
        ("cudatoolkit", lambda v: Version("11.2") <= v < Version("12")),
        ("cuda-version", lambda v: Version("12") <= v and str(v) != "12.0.0"),
    ]

    for each_pkg, keep_filter in pkgs:
        year = "2024"
        month = "04"
        df = pd.read_parquet(f"{year}-{month}.parquet")

        df_pkg = df[df["pkg_name"] == each_pkg]

        pkg_vers = []
        for v in df_pkg["pkg_version"].unique():
            try:
                v = Version(v)
            except InvalidVersion:
                # Skip invalid version formats
                continue
            pkg_vers.append(v)
        pkg_vers = sorted(pkg_vers)

        pkg_vers_filt = list(filter(keep_filter, pkg_vers))

        df_pkg_sorted = pd.concat(
            [df_pkg[df_pkg["pkg_version"] == str(v)] for v in pkg_vers_filt]
        )

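        # Copy the slice so the unit conversion below does not trigger
        # pandas' SettingWithCopyWarning.
        df_pkg_plot = df_pkg_sorted[["pkg_version", "counts"]].copy()
        df_pkg_plot["counts"] = df_pkg_plot["counts"] / 1e6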

        plt.clf()

        plt.bar(df_pkg_plot["pkg_version"], df_pkg_plot["counts"])
        plt.title(f"{each_pkg} versions vs. Downloads (millions) for {year}-{month}")
        plt.xlabel(f"{each_pkg} versions")
        plt.ylabel("Downloads (millions)")

        plt.savefig(f"{each_pkg}_download_count.svg")

    return 0


if __name__ == "__main__":
    sys.exit(main(*sys.argv))

Here are the results it shows (note values are in millions):

cudatoolkit_download_count

cuda-version_download_count

Admittedly this is only one month

Plus some packages built with CUDA support link to the driver (like Arrow); so, may not pull in either of these packages at install time (despite building with CUDA support)

Also it would be better to group the cudatoolkit patch versions together like how cuda-version is handled
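
A possible way to do that grouping, sketched under the assumption that the frame has the same `pkg_version` and `counts` columns as in the script above. The sample data is invented, and `major_minor` and `counts_by_major_minor` are hypothetical helpers rather than anything from condastats.

```python
import pandas as pd
from packaging.version import InvalidVersion, Version


def major_minor(version: str):
    """Collapse "11.8.0" to "11.8"; return None for unparseable versions."""
    try:
        v = Version(version)
    except InvalidVersion:
        return None
    return f"{v.major}.{v.minor}"


def counts_by_major_minor(df_pkg: pd.DataFrame) -> pd.Series:
    """Sum counts per major.minor series, ordered by version."""
    series = df_pkg["pkg_version"].map(major_minor)
    # Rows whose version mapped to None are dropped by groupby.
    grouped = df_pkg.groupby(series)["counts"].sum()
    # Sort as versions, not strings, so 11.10 lands after 11.8.
    return grouped.reindex(sorted(grouped.index, key=Version))


sample = pd.DataFrame(
    {
        "pkg_version": ["11.8.0", "11.2.2", "11.2.0", "11.10.1", "not-a-version"],
        "counts": [4, 1, 2, 8, 100],
    }
)
grouped = counts_by_major_minor(sample)
```

Summing before plotting also collapses multiple rows for the same version into a single bar.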

Nevertheless this is a good rough test of the data. It does seem to be picking up download counts for these packages that were missed in prior months (which had been off by a couple orders of magnitude in the worst case)

Edit: Fix issue where 12.0 got cut off

Thanks @jakirkham. May 2024 data was made available this past Saturday, June 1st.

As of today, the .conda packages are included in the data for the following months:

  • 2024-03
  • 2024-04
  • 2024-05

Based on the community feedback thus far, we're considering replacing data for additional prior months, and updating them to also include .conda packages. Stay tuned.

cc @jezdez

Thanks Nick! 🙏

This would be incredibly helpful 🙂

It would be amazing to pull these updates back to the introduction of .conda artefacts, both for having a correct history and an accurate total number of downloads. The conda-forge landing page currently prominently displays the latter, and I think we're still not counting over a year of .conda downloads.

If one goes and executes the by-the-numbers notebook linked from the conda-forge landing page (with some minor adaptations to update the loop over which years we're interested in), we get the following for 2021-2023:

Untitled

While there's undoubtedly some variability in the monthly data, to my understanding that sharp drop-off is related to the introduction of .conda around November 2022.

I agree with @h-vetinari, let's make this available for the whole time period, doesn't make sense otherwise IMO.


Did something happen with the timestamps? For some reason, we seem to have some new entries at "epoch 0" (ie. somewhere in 1970)

Screenshot 2024-06-09 at 09 33 05

I'll delete/filter them from our data but just wanted to check if anyone knows what's up?
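
In case it helps others hitting the same thing, filtering could look roughly like the sketch below. The `time` column name is an assumption about the dataset schema, the sample frame is invented, and `drop_epoch_rows` is a hypothetical helper. It simply drops anything dated 1970 and does not explain where those rows come from.

```python
import pandas as pd


def drop_epoch_rows(df: pd.DataFrame, time_col: str = "time") -> pd.DataFrame:
    """Drop rows whose timestamp falls in 1970, i.e. looks like an unset value."""
    times = pd.to_datetime(df[time_col])
    return df[times.dt.year > 1970]


sample = pd.DataFrame(
    {
        "time": pd.to_datetime(["1970-01-01", "2024-04-01", "2024-05-01"]),
        "counts": [9, 1, 2],
    }
)
cleaned = drop_epoch_rows(sample)
```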