/docs-inventory

Tool for taking term inventory on *.md documentation sets.

These scripts use Python to search for terms across multiple documentation repositories. (Repositories are assumed to use the metadata formats for docs.microsoft.com.)

Installing Python

  1. Make sure you have Python 3 installed. Download from https://www.python.org/downloads.

  2. Run pip install -r requirements.txt to install the required libraries. (If you want to use a virtual environment instead of your global environment, run python -m venv .env and then .env\scripts\activate before running pip install. The activate path shown uses Windows syntax.)

Inventory configuration

Inventories are driven by a JSON configuration file. This repo contains a few example configurations in config.json, config_python.json, and config_js.json. You can create additional files as necessary.

  1. Specify the repos you want to search in the content collection of the config file. For each element:

    • repo is a name for the repo (by convention, we use the GitHub org/repo name).
    • path is the location of the cloned repo on your local computer. Leave path blank to skip the repo.
    • url is the base URL for the published articles of the docset. The url is used to auto-generate full URLs in the output files.
    • exclude_folders is a collection of folder names to omit from the inventory, such as includes folders and other folders that aren't actively maintained (such as vs2015 in the Visual Studio repo).
  2. In the inventory section, specify distinct inventories, each of which generates a separate set of inventory files.

    • name is a case-insensitive name for the inventory. NOTE: don't use spaces or hyphens in the name, or any other character that's not allowed in a filename. We recommend using letters and numbers.
    • terms is an array of Python regular expressions to use as search terms.
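
As a rough illustration of the structure described above, the following Python sketch builds a configuration as a dictionary, checks the inventory names and terms, and prints the result as JSON. The key names (content, inventory, repo, path, url, exclude_folders, name, terms) are taken from the descriptions above; the repo name, path, URL, terms, and the check_config helper are hypothetical. Compare the exact layout against the example config files in this repo before relying on it.

```python
import json
import re

# Hypothetical values throughout; key names follow the descriptions above.
config = {
    "content": [
        {
            "repo": "ExampleOrg/example-docs",   # hypothetical GitHub org/repo name
            "path": "C:/repos/example-docs",     # local clone; leave "" to skip the repo
            "url": "https://example.com/docs",   # base URL of the published articles
            "exclude_folders": ["includes", "vs2015"],
        }
    ],
    "inventory": [
        {
            "name": "PythonTerms",               # letters and numbers only
            "terms": [r"\bpython\b", r"pip\s+install"],  # Python regular expressions
        }
    ],
}


def check_config(cfg):
    """Return a list of problems found in the inventory definitions."""
    problems = []
    for inv in cfg["inventory"]:
        # The name becomes part of the output file names, so this sketch
        # accepts only letters and digits (the recommended convention).
        if not re.fullmatch(r"[A-Za-z0-9]+", inv["name"]):
            problems.append(f"bad inventory name: {inv['name']!r}")
        for term in inv["terms"]:
            try:
                re.compile(term)
            except re.error as exc:
                problems.append(f"bad regex {term!r}: {exc}")
    return problems


if __name__ == "__main__":
    print(check_config(config) or "config looks OK")
    print(json.dumps(config, indent=2))
```

Validating each term with re.compile before a long run is a cheap way to catch an unbalanced parenthesis or a stray backslash in the config file.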

Run the scripts

  1. By default, the script saves results in an InventoryData folder. You can customize this folder by setting the INVENTORY_RESULTS_FOLDER environment variable.

  2. At a command prompt, run python take_inventory.py --config <config-file>. Omitting --config <config-file> defaults to config.json.

  3. When the script is complete, you'll see four files in the results folder for each inventory in the config file:

    • <name>_<date>_<sequential_int>.csv contains one line per search term instance.
    • <name>_<date>_<sequential_int>-metadata.csv, generated by extract_metadata.py (run automatically from take_inventory.py), adds various metadata values extracted from the source files to the results.
    • <name>_<date>_<sequential_int>-consolidated.csv, generated by consolidate.py (also run automatically), collapses the output from extract_metadata.py into one line per file, with a count column for each term and a count column for each classification tag (the tag indicates where in the file the term is found).
    • <name>_<date>_<sequential_int>-scored.csv, generated by score.py (also run automatically), applies a scoring algorithm to the output from consolidate.py. See score.py for the details. The script adds a single "score" column to the new output file and automatically omits any file with a score of zero. The result is a file that lists the "articles of interest" for the inventory in question.

    The <sequential_int> value starts at 0001 and is incremented each time you run the script on the same day. This is so subsequent runs on the same day produce distinct output.
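
The folder lookup and file naming described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code in take_inventory.py: the function names are hypothetical, and the date format (YYYY-MM-DD) is an assumption because the text above does not specify it.

```python
import os
from datetime import date
from pathlib import Path


def results_folder():
    """Return INVENTORY_RESULTS_FOLDER if it is set, else the default InventoryData."""
    return Path(os.environ.get("INVENTORY_RESULTS_FOLDER", "InventoryData"))


def next_result_base(name, folder, today=None):
    """Return the first unused '<name>_<date>_<sequential_int>' base name.

    The date format (YYYY-MM-DD) is an assumption made for this sketch.
    """
    day = (today or date.today()).isoformat()
    seq = 1
    # Increment the counter until no results file with that number exists.
    while (folder / f"{name}_{day}_{seq:04d}.csv").exists():
        seq += 1
    return f"{name}_{day}_{seq:04d}"


if __name__ == "__main__":
    folder = results_folder()
    base = next_result_base("PythonTerms", folder)
    # The four files produced for each inventory.
    for suffix in ("", "-metadata", "-consolidated", "-scored"):
        print(folder / f"{base}{suffix}.csv")
```

Probing for the first unused number, rather than storing a counter, keeps the numbering correct even if earlier result files are deleted or the folder is changed between runs.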