creativecommons/quantifying

[Feature] Automate Data Gathering and Analysis/Rendering

TimidRobot opened this issue · 26 comments

Problem

The focus of this project is on handling data in a way that is reproducible and update-able.

Description

  • How often should the data be gathered and analyzed/rendered?
    • What is the strategy for gathering data over multiple days (due to query limits)?
    • What is the strategy for ensuring automated updates do not result in broken/incomplete state if they don't complete successfully?
      • Should scripts wait until completion to write data to file(s)?
    • Can the various tasks be run using GitHub Actions?
  • How should the gathered data be stored and formatted?
    • What naming conventions and purge/overwrite rules should be used to facilitate multi-day data gathering?
    • Should data be stored in a plaintext (immediately readable) or binary format (compressed, SQLite, etc.)?
  • Are there opportunities for code deduplication?

Alternatives

Do everything manually 😩

Additional context

(Suggestions welcomed! Please comment if you have relevant links to share.)

Hello @TimidRobot ! I'm interested in contributing to this project as part of GSoC 2023. After reading the problem and description, I think I have a good understanding of the goals of the project and the criteria for success. However, I would appreciate more information on some of the specific questions raised in the description.

First, regarding the frequency of data gathering and analysis, are there any constraints or limitations that need to be taken into account? For example, are there certain times of day when data should be collected, or are there restrictions on the number of requests that can be made to an API?

Second, regarding the strategy for handling data over multiple days, can you provide more information on what kind of data we'll be working with and what the expected volume is? This will help determine what kind of storage and naming conventions we should use.

Finally, regarding the format and storage of the data, what are the specific requirements or preferences for how the data should be formatted and stored? Should we prioritize readability or efficiency, or is there some other consideration to take into account?

Thank you for your time and guidance! I'm excited to work on this project and look forward to hearing back from you.

Interested in this project; let us know once we have a mentor.

@samadpls

First, regarding the frequency of data gathering and analysis, are there any constraints or limitations that need to be taken into account? For example, are there certain times of day when data should be collected, or are there restrictions on the number of requests that can be made to an API?

Different APIs have different limits on queries per day. (Adding this information to the README or creating a dedicated sources markdown document would be helpful--see #37).

Second, regarding the strategy for handling data over multiple days, can you provide more information on what kind of data we'll be working with and what the expected volume is? This will help determine what kind of storage and naming conventions we should use.

See the existing CSV files and scripts.

Finally, regarding the format and storage of the data, what are the specific requirements or preferences for how the data should be formatted and stored? Should we prioritize readability or efficiency, or is there some other consideration to take into account?

This is an unanswered question. However, any proposed solutions should be compared against CSVs for readability/interoperability and SQLite for efficiency.

@TimidRobot Greetings, I stumbled upon your GitHub repository, and I'm interested in contributing.

@HoneyTyagii Welcome! Please see Contribution Guidelines — Creative Commons Open Source.

@TimidRobot Thanks for getting back to me! I really appreciate the prompt response and the link to the contribution guidelines. I'll make sure to read through them thoroughly before submitting any contributions. If I have any questions, I'll reach out to you for further assistance. Thanks again!

Hello @TimidRobot ! I am interested in contributing to this project in GSoC 2023. I tried to understand the project, and I request you to correct me wherever I am mistaken. Here is the summary I have written based on my understanding.

  • The Quantifying the Commons project is an initiative by Creative Commons to measure
    the impact of Creative Commons licenses on the sharing and reuse of creative works

  • The main objective of the project is to automate the process of data gathering and
    reporting so that the reports are never more than three months out of date

A general overview of the steps involved in the Code-base (GitHub):

  1. Data collection: The code collects data from various sources, such as the Creative
    Commons search engine, the Flickr API, and Wikimedia Commons.

  2. Data cleaning: The collected data is cleaned and standardized to remove duplicates,
    missing values, and other errors.

  3. Data analysis: The cleaned data is analyzed using statistical methods and machine learning
    algorithms to identify patterns and trends in the data.

  4. Report generation: Based on the analysis, reports are generated using Python libraries such
    as Matplotlib and Pandas. The reports include visualizations and tables that summarize the
    data and provide insights into the impact of Creative Commons licenses.

  5. Automation: To ensure that the reports are never more than three months out of date, the
    code-base uses automation techniques, such as GitHub Actions, to periodically run the data
    collection, cleaning, analysis, and report generation steps.

Any further assistance will be highly appreciated.

Hi @satyampsoni, from my understanding I think you've correctly summarised the project; it's just that the automation hasn't been implemented yet.
I personally found this article series by @Bransthre quite helpful; it explains the whole development process. @TimidRobot already shared a part of it.

Thanks for sharing the blog @Paulooh007 !
I am checking it out and if I need any help I'll reach out to you.

In sources.md only 8 data sources are listed, while the article series covers 9 sources; the DeviantArt data source is not present there.
@TimidRobot @Paulooh007 do you know the reason, or was it left out by mistake?

Both the Google Custom Search and DeviantArt scripts use the same data source: they both use the Custom Search JSON API, which performs a Google search with the arguments specified in its API call.

So for DeviantArt, we're limiting the scope of the search by setting the relatedSite query parameter to deviantart.com. This explains why we have only 8 sources.
See line 65 of deviantart_scratcher.py:

(
    "https://customsearch.googleapis.com/customsearch/v1"
    f"?key={api_key}&cx={PSE_KEY}"
    "&q=_&relatedSite=deviantart.com"
    f'&linkSite=creativecommons.org{license.replace("/", "%2F")}'
)
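
For anyone exploring that endpoint, here is a minimal, self-contained sketch of making the same request (a sketch only, not the project's script; GOOGLE_API_KEY and PSE_KEY are assumed environment variables, and the license path is an illustrative example):

import os

import requests

api_key = os.environ["GOOGLE_API_KEY"]  # assumed environment variable
pse_key = os.environ["PSE_KEY"]  # assumed environment variable
license_path = "/licenses/by/4.0"  # illustrative example path

request_url = (
    "https://customsearch.googleapis.com/customsearch/v1"
    f"?key={api_key}&cx={pse_key}"
    "&q=_&relatedSite=deviantart.com"
    f'&linkSite=creativecommons.org{license_path.replace("/", "%2F")}'
)
response = requests.get(request_url, timeout=30)
response.raise_for_status()
# The count of matching results, as used by the existing scripts
print(response.json()["searchInformation"]["totalResults"])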

Oh! I see.

Hello, I would like to work on this feature. I think this feature is also included in GSoC 2024, so when should I start contributing or analyzing? Can you please elaborate, @TimidRobot?

@Saigenix welcome!

Please see Contribution Guidelines — Creative Commons Open Source for how we manage issues and PRs (we generally don't assign issues prior to resolution).

Also, this issue largely duplicates the GSoC 2024 Automating Quantifying the Commons project. You may find #39 more helpful.

Thank you @TimidRobot

Hi @TimidRobot!

As mentioned in the Slack, I'm interested in working on this project.

What is the strategy for gathering data over multiple days (due to query limits)?

A GitHub action can be scheduled to run at scheduled times each day. By storing data about the last successful run, we can run each task only when it is sufficiently outdated, and with exponential backoff, for instance.

Please start with the assumption that each combination of source and stage will require its own script to be executed 1+ times by GitHub Actions.

That's certainly possible, and is probably the simplest solution to get a minimal working product. However, it might then be more challenging to implement the scheduling logic. I think it would be difficult to do directly in GitHub Actions, so it's probably best to use a helper script, but at that point we may as well convert the scripts into classes with methods and run them as a unified program anyway. All of the scripts seem to simply define a few constants and functions, then run a few functions such as set_up_data_file() and record_all_licenses(), so I don't think it would be complicated to package them into classes. This approach also helps code deduplication; common logic can be implemented in a base class which the others inherit from.
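
As a rough illustration of that idea (a sketch only; the class layout and method names are assumptions loosely based on the functions the existing scripts define):

import csv
from abc import ABC, abstractmethod


class Scratcher(ABC):
    """Shared logic that every data source would inherit."""

    def __init__(self, data_file):
        self.data_file = data_file

    def set_up_data_file(self, header):
        # Common CSV set-up, implemented once instead of per script
        with open(self.data_file, "w", newline="") as file_obj:
            csv.writer(file_obj).writerow(header)

    @abstractmethod
    def record_all_licenses(self):
        """Source-specific querying and recording."""


class VimeoScratcher(Scratcher):
    def record_all_licenses(self):
        ...  # query the Vimeo API and append one row per license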

Storing the data in the repository has issues, but it is also simple and free.

One concern I have about this approach is that if the automation scripts were to run regularly (e.g. daily), it would result in a lot of commits to the repository, which could make the commit history hard to navigate. Though I suppose if you are willing to live with this, then there isn't much of a downside. Another option is to commit the data into another branch, like what GitHub Pages does.

What is the strategy for ensuring automated updates do not result in broken/incomplete state if they don't complete successfully?

I think we should start by splitting each task into many small subtasks, each one being able to run and update data independently. For example, vimeo_scratcher.py queries 8 different licenses, with each query being able to run independently. Then each subtask writes data only if it completes successfully. This would work best with a data format that allows each entry to be updated independently and asynchronously, which is why I think something like an SQL database would be ideal.
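
A rough sketch of that approach (the SQLite schema, table, and column names here are assumptions, not part of the existing code):

import sqlite3


def record_license_count(db_path, source, license_path, count):
    # Each per-license subtask calls this only after its query succeeds,
    # so a failed subtask leaves the previously stored row untouched.
    con = sqlite3.connect(db_path)
    try:
        con.execute(
            "CREATE TABLE IF NOT EXISTS counts ("
            "source TEXT, license TEXT, count INTEGER, "
            "PRIMARY KEY (source, license))"
        )
        con.execute(
            "INSERT INTO counts (source, license, count) VALUES (?, ?, ?) "
            "ON CONFLICT (source, license) DO UPDATE SET count = excluded.count",
            (source, license_path, count),
        )
        con.commit()
    finally:
        con.close()


# e.g. record_license_count("quantifying.db", "vimeo", "licenses/by/4.0", 12345)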

@Darylgolden

Please start with the assumption that each combination of source and stage will require its own script to be executed 1+ times by GitHub Actions.

That's certainly possible, and is probably the simplest solution to get a minimal working product. However, it might then be more challenging to implement the scheduling logic. I think it would be difficult to do directly in GitHub actions, so it's probably best to use a helper script

Remember that the goal is a complete report every quarter--every three months. Handling state will be a primary concern. Each query will need to be scheduled to run multiple times for both large data sets (ex. to work with daily query limits) and for redundancy. I usually prefer shared libraries instead of a single launcher/helper script.

but at that point we may as well convert the scripts into classes with methods and run them as a unified program anyway. All of the scripts seem to simply define a few constants and functions, then run a few functions such as set_up_data_file() and record_all_licenses(), so I don't think it would be complicated to package them into classes. This approach also helps code deduplication; common logic can be implemented in a base class which the others inherit from.

I suspect the data available from the various sources is too different to benefit from unification. I prefer to avoid classes until their complexity and obfuscation are clearly worth it. That may be the case here, but everyone deserves to know my biases.

Storing the data in the repository has issues, but it is also simple and free.

One concern I have about this approach is that if the automation scripts were to run regularly (e.g. daily), it would result in a lot of commits to the repository, which could make the commit history hard to navigate. Though I suppose if you are willing to live with this, then there isn't much of a downside. Another option is to commit the data into another branch, like what GitHub Pages does.

At a quarterly cadence, I don't expect it to be too noisy. I don't like long-lived special-purpose branches; I think they end up hiding information. If it became an issue, a separate repository is also an option.

What is the strategy for ensuring automated updates do not result in broken/incomplete state if they don't complete successfully?

I think we should start by splitting each task into many small subtasks, each one being able to run and update data independently. For example, vimeo_scratcher.py queries 8 different licenses, with each query being able to run independently. Then each subtask writes data only if it completes successfully. This would work best with a data format that allows each entry to be updated independently and asynchronously, which is why I think something like an SQL database would be ideal.

If each query stores its data in a separate file (ex. CSV), then they can be updated independently and asynchronously. I lean towards plaintext because it prioritizes visibility, human interaction, and broad compatibility.
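
A minimal sketch of that pattern (the file layout, naming scheme, and CSV header are assumptions):

import csv
from pathlib import Path


def write_query_result(data_dir, source, license_path, count):
    # One small CSV per (source, license) query, written only on success
    out = Path(data_dir) / source / (license_path.replace("/", "_") + ".csv")
    out.parent.mkdir(parents=True, exist_ok=True)
    with open(out, "w", newline="") as file_obj:
        writer = csv.writer(file_obj)
        writer.writerow(["LICENSE_TYPE", "COUNT"])
        writer.writerow([license_path, count])


def combine(data_dir, combined_csv):
    # Later processing step: merge the per-query files into one CSV
    # (combined_csv is assumed to live outside data_dir)
    rows = []
    for path in sorted(Path(data_dir).rglob("*.csv")):
        with open(path, newline="") as file_obj:
            rows.extend(list(csv.reader(file_obj))[1:])  # skip per-file headers
    with open(combined_csv, "w", newline="") as file_obj:
        writer = csv.writer(file_obj)
        writer.writerow(["LICENSE_TYPE", "COUNT"])
        writer.writerows(rows)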


In general, I encourage everyone to pursue the simplest and most boring technologies for this project. It isn't a technical demo or a technology learning project. The easier it is to engage with and to maintain, the longer it will benefit the community. I still find PEP 20 – The Zen of Python | peps.python.org to be helpful and instructive.

Thank you @TimidRobot for the reply!

In general, I encourage everyone to pursue the simplest and most boring technologies for this project. It isn't a technical demo nor a technology learning project.

I would like to clarify that I did not propose my implementation with the intent of making it a technical demo or technology learning project, but rather because it was what I initially thought was the simplest and most maintainable design for the project. I have worked on projects with convoluted and unmaintainable code, and I have read the Zen of Python, so I definitely understand the importance of simple and boring code. My instinct for clean design clearly differs from yours, and while I'm of course happy to go with whatever design you think suits this project best, I think I would be doing a disservice if I did not at least try to propose alternative designs so we can compare the merits of each. That being said, I do see now the benefits of using a shared library design over helper scripts/OOP and am happy to pursue this design instead.

I think we would need to add three fields to each of the data files: time_of_last_successful_update, time_of_last_failed_update, and exponential_backoff_factor. The exponential_backoff_factor field would start at 0, increasing by 1 with each failure and resetting to 0 with each success. The script would try to update a data file only if the current time is more than $2^{\text{exponential\_backoff\_factor}}$ days since the last update. This logic can then be implemented in a library that is used in each of the scripts.
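
A minimal sketch of that check (the state-file format is an assumption; the field names are as proposed above):

import json
from datetime import datetime, timedelta, timezone


def should_update(state_path):
    # Example state file (format is an assumption):
    # {"time_of_last_successful_update": "2024-03-01T00:00:00+00:00",
    #  "time_of_last_failed_update": null,
    #  "exponential_backoff_factor": 2}
    with open(state_path) as file_obj:
        state = json.load(file_obj)
    stamps = [
        state.get("time_of_last_successful_update"),
        state.get("time_of_last_failed_update"),
    ]
    stamps = [datetime.fromisoformat(s) for s in stamps if s]
    if not stamps:
        return True  # never attempted before
    wait = timedelta(days=2 ** state.get("exponential_backoff_factor", 0))
    return datetime.now(timezone.utc) - max(stamps) > wait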

What do you think of this design? If you're happy with it, should I start drafting a proposal?

I think we would need to add three fields to each of the data files: time_of_last_successful_update, time_of_last_failed_update, and exponential_backoff_factor. The exponential_backoff_factor field would start at 0, increasing by 1 with each failure and resetting to 0 with each success. The script would try to update a data file only if the current time is more than $2^{\text{exponential\_backoff\_factor}}$ days since the last update.

State management depends on the architecture of the entire process. For example, if there are separate phases for querying data and processing data, then there is no need to update queried data. Instead, each query can write to a separate file (all of which would be combined during the processing phase).

For example, potential logic of a query script that is run every day:

  1. Exit if the raw data is complete for this interval
  2. Read state (ex. set a of z) if there are raw data files from a previous run during this interval
    • set size might depend on daily query limits
  3. Query source for current chunk (ex. set b of z) with exponential backoff
  4. Write raw data file on success

For example, potential logic of a processing script that is run every day:

  1. Exit if the processed data is complete for this interval
  2. Exit unless the raw data is complete for this interval
  3. Read data files (ex. a through z) and combine & process data
  4. Write data file on success

This is not how it must be done, merely a way that I can imagine it. There are complexities that are worth it within the context of the total plan.
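
To make the shape of that concrete, here is a rough sketch of both scripts (a sketch only: the per-interval raw-data directory, numbered set files, and caller-supplied query/combine functions are all assumptions):

from pathlib import Path


def run_query_script(raw_dir, total_sets, query_one_set):
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    done = {int(p.stem.split("-")[1]) for p in raw_dir.glob("set-*.csv")}
    if len(done) == total_sets:
        return  # 1. raw data already complete for this interval
    # 2. read state: resume at the next missing set
    current = min(set(range(total_sets)) - done)
    data = query_one_set(current)  # 3. query the source (with backoff)
    # 4. write the raw data file only on success
    (raw_dir / f"set-{current}.csv").write_text(data)


def run_processing_script(raw_dir, total_sets, out_file, combine):
    if Path(out_file).exists():
        return  # 1. processed data already complete for this interval
    files = sorted(Path(raw_dir).glob("set-*.csv"))
    if len(files) < total_sets:
        return  # 2. raw data not yet complete for this interval
    # 3. read and combine the raw files, 4. write on success
    Path(out_file).write_text(combine(files))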

(Suggestions welcomed! Please comment if you have relevant links to share.)

Hi @TimidRobot! What's your opinion on integrating something like OpenTelemetry here?

The JSON file we'll obtain would have standardized data, and we can further work on visualization using in-built tools.

I'm not @TimidRobot, but it seems like OpenTelemetry is mainly used for collecting data from your own applications and not retrieving data from APIs. Do you have an example of it doing the latter?

What is the strategy for gathering data over multiple days (due to query limits)?

Has there been a case where query limits have actually been hit? Because looking at sources.md, the limits seem much more than enough for our purposes. If that's the case, maybe the simplest solution of just running all the scripts on a schedule is the best.

Yes, the Google Custom Search JSON API. See:

  • quantifying/sources.md

    Lines 40 to 52 in 8412423

    ## Google Custom Search JSON API
    **Description:** The Custom Search JSON API allows user-defined detailed query
    and access towards related query data using a programmable search engine.
    **API documentation links:**
    - [Custom Search JSON API Reference | Programmable Search Engine | Google
    Developers][google-json]
    - [Method: cse.list | Custom Search JSON API | Google Developers][cse-list]
    **API information:**
    - API key required
    - Query limit: 100 queries per day for free version
  • Code excerpt showing the API key switch when the query quota is depleted:

            # 429 is Quota Limit Exceeded, which will be handled alternatively
        )
        session = requests.Session()
        session.mount("https://", HTTPAdapter(max_retries=max_retries))
        with session.get(request_url) as response:
            response.raise_for_status()
            search_data = response.json()
        search_data_dict = {
            "totalResults": search_data["searchInformation"]["totalResults"]
        }
        return search_data_dict
    except Exception as e:
        if isinstance(e, requests.exceptions.HTTPError):
            # If quota limit exceeded, switch to the next API key
            global API_KEYS_IND
            API_KEYS_IND += 1
            LOGGER.error("Changing API KEYS due to depletion of quota")
  • DSD Fall 2022: Quantifying the Commons (9/10) | by Bransthre | Medium

What is the strategy for gathering data over multiple days (due to query limits)?

Has there been a case where query limits have actually been hit? Because looking at sources.md, the limits seem much more than enough for our purposes. If that's the case, maybe the simplest solution of just running all the scripts on a schedule is the best.

Hi, have you tried running google_scratcher.py? The script queries for the licenses in legal-tool-paths.txt and also for all languages in google_lang.txt.

This issue became a GSoC 2024 project.