Summer of Code @ Software Heritage

Report for Google Summer of Code '22 Project @ Software Heritage

Project Details
Initial Proposal Mine Information from Archived Content
Repository swh-indexer
Mentors Stefano Zacchirolli, Valentin Lorentz, and Kumar Shivendu
Contributions swh-indexer
Duration 3 months (13-06-2022 to 12-09-2022)

About the SWH Project

Software Heritage is a far-reaching Open Source-Research project that is working to collect and preserve software source code. As a part of this, Software Heritage’s indexer extracts metadata from source code repositories. Metadata ranges from simple information (eg. project name or hosting place) to more substantial information like the entity behind the project, its license, etc. Metadata is the information it collects and extracts that provides additional information on source code.

Contributions

The search feature of Software Heritage's universal archive of software source code offers searching via URL or through package metadata. As part of GSoC'22, I worked on adding mappings to Packagist (composer.json), NuGet (.nuspec), and dart (pubspec.yaml) packages. Additionally, I am currently working on a mapping for Cocoapods (.podspec) packages. Please find all my contributions here. Here is a summary:

Title Diff. Related Task No. of Packages
Indexer for Packagist (composer.json) D8047 T4357 386k
Metadata Indexer for Pub (pubspec.yaml) D8079 T4376 34.6k
Add NuGet Mapping (*.nuspec) D8144 T4392 397k

In total, these span more than 800k packages.

Future Work

Continuing from here, I am very excited to continue contributing to Software Heritage and Open Source. Software Heritage is on an important mission that I'm privileged to be a part of and deeply excited to continue contributing to. Here are some future aspects to this project:

  • Writing a metadata indexer for Cocoapods packages (*.podspec) (Related Task: T4437)
  • Extend the coverage of supported metadata to all Libraries.io-indexed package managers
  • Possibly use Bibliothecary to extract package metadata

Parallel to my coding journey, I have written up 2 blogs to summarize my learning curve and the state of my project at the time. My mentors were kind to review them before they were published.

Overall, it was a wonderful experience working with knowledgeable mentors and learning from them. Looking forward to continue learning with them.