tw-tools-update

Description

Use the GitHub API to update the list of tools and extensions related to TaskWarrior. The list is displayed on the website: http://taskwarrior.org/tools/

This project is linked to the project for the future Tools page: http://brunovernay.github.io/taskwarrior-site-test/

The idea is to use the GitHub API to search for projects related to TaskWarrior, then update the list of tools displayed on the TaskWarrior site from the search results.

The project started in Java, but I created a Python branch, since Python is more idiomatic in the TaskWarrior community. It should be compatible with Python 2 and 3 (http://pythonclock.org/).

I use PyGithub (https://github.com/PyGithub/PyGithub). There are many Python projects that work with GitHub, and even a book, Mining the Social Web.
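The search step could be sketched as follows. This is a minimal sketch, not the project's code: it calls the GitHub REST search endpoint directly with the Python 3 standard library instead of going through PyGithub, and the query string `taskwarrior`, the function names, and the page size are assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(query="taskwarrior", page=1, per_page=100):
    """Return the search URL for one page of results (the query is an assumption)."""
    params = urllib.parse.urlencode({"q": query, "page": page, "per_page": per_page})
    return SEARCH_URL + "?" + params


def search_page(token, query="taskwarrior", page=1):
    """Fetch one page of matching repositories (needs network access and a token)."""
    request = urllib.request.Request(
        build_search_url(query, page),
        headers={
            "Authorization": "token " + token,
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(request) as response:
        # The search response wraps the repositories in an "items" array.
        return json.load(response)["items"]
```

Each item returned by `search_page` is a plain dictionary describing one repository, which is the raw material for the tool list.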

Usage

  • cp Config.py.example Config.py, then edit Config.py to add your GitHub token
  • The old tool list is in data-tools-old.json
  • python3 Main.py > log-$(date -Iminutes).txt (takes about 5 minutes)
  • The new data is in data-tools.json

Status

  • It works
  • We still have to set the category manually
  • There is no API yet to get the license (GitHub is working on it)
  • You have to provide your GitHub token, because of the number of requests required (https://github.com/settings/tokens)
  • It only covers GitHub projects currently (BitBucket maybe one day ...)
  • We might apply a diff after the update, to keep manual changes
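Keeping manual changes across an update could look roughly like the sketch below. It assumes that both data-tools-old.json and data-tools.json contain a list of dictionaries with a `url_src` key, which is not confirmed here; the function name `keep_manual_fields` and the choice of `category` as the only manual field are hypothetical.

```python
def keep_manual_fields(old_tools, new_tools, manual_fields=("category",)):
    """Copy manually maintained fields from the old list into the new list.

    Entries are matched on url_src. New entries without a match are kept as-is.
    """
    old_by_url = {tool["url_src"]: tool for tool in old_tools}
    merged = []
    for tool in new_tools:
        result = dict(tool)  # do not modify the caller's data
        previous = old_by_url.get(tool["url_src"])
        if previous is not None:
            for field in manual_fields:
                # Only overwrite when the old entry actually had a value.
                if previous.get(field):
                    result[field] = previous[field]
        merged.append(result)
    return merged
```

With the assumed file layout, the two lists would come from `json.load` on the old and new files, and the merged list would be written back to data-tools.json.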

Note:

  • The description is plain text, with no HTML.
  • There are duplicated names, so I use url_src as the unique identifier. However, some projects have changed URL: for example, the owner of xtw changed login name, so its URL is different. In that case I output a warning and create a duplicate entry.
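The duplicate check described above can be sketched as follows. This is an illustration under assumptions, not the project's code: each tool is assumed to be a dictionary with `name` and `url_src` keys, and the function name `find_duplicate_names` and the URLs in the usage example are hypothetical.

```python
from collections import defaultdict


def find_duplicate_names(tools):
    """Return {name: [url_src, ...]} for names that appear under several URLs."""
    urls_by_name = defaultdict(set)
    for tool in tools:
        urls_by_name[tool["name"]].add(tool["url_src"])
    return {
        name: sorted(urls)
        for name, urls in urls_by_name.items()
        if len(urls) > 1
    }


# Hypothetical data: the same project seen under an old and a new login.
tools = [
    {"name": "xtw", "url_src": "https://github.com/old-login/xtw"},
    {"name": "xtw", "url_src": "https://github.com/new-login/xtw"},
    {"name": "other", "url_src": "https://github.com/someone/other"},
]
for name, urls in find_duplicate_names(tools).items():
    print("WARNING: duplicate name %s: %s" % (name, ", ".join(urls)))
```

A name reported here is either a genuine name clash between two projects or one project that moved, so the warning still needs a manual check.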

The mapping from tool-list fields to GitHub API fields:

  • category: manual
  • name: name
  • description: description
  • url: homepage
  • url_src: html_url
  • license: ???
  • language: language (gives only the primary language; languages_url has to be requested to know more)
  • author: owner/login (+ collaborators, contributors, teams ...). Multiple requests are needed to get the real name instead of the login.
  • theme: best guess from description
  • verified: today
  • last_update: updated_at (pushed_at would be more conservative, but would miss commits in non-master branches)
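The mapping can be written as a small function over the repository dictionary returned by the GitHub API. This is a sketch under assumptions: the function name `repo_to_tool`, the use of `None` for fields that are set manually or not available, and the ISO date format for `verified` are choices made for illustration. The theme guess is not implemented here.

```python
import datetime


def repo_to_tool(repo, today=None):
    """Map one GitHub repository dictionary to one tool-list entry."""
    today = today or datetime.date.today()
    return {
        "category": None,  # set manually
        "name": repo["name"],
        "description": repo.get("description"),
        "url": repo.get("homepage"),
        "url_src": repo["html_url"],
        "license": None,  # no source for this field yet
        "language": repo.get("language"),  # primary language only
        "author": repo["owner"]["login"],  # login, not the real name
        "theme": None,  # best guess from description, left out of this sketch
        "verified": today.isoformat(),  # date format is an assumption
        "last_update": repo.get("updated_at"),
    }
```

Using `repo.get` for the optional fields keeps the function from failing on repositories that have no description, homepage, or detected language.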

Automatic classification

I fetch all the README files in order to perform some machine learning. The first idea is to classify the tools by category. The usual Python library for this seems to be scikit-learn (SciKit). There is a more active library, NLTK, but since I only need simple text feature extraction and no complex natural language processing, I will stick with scikit-learn. Some references: