Aggressive cleanup of favourite packages field

Question

Closed this issue 4 years ago · 2 comments

Hi!

First of all, thank you very much for conducting the survey and processing the data. A survey like this one was long overdue, and I think all of us gained some interesting insights into the community. I can only imagine how much work it took to collect and process the data.

Some people (myself included) were wondering what's hiding behind the "Other" option in the favourite package table. I decided to parse the responses myself and took a different approach. Instead of treating everything in this column as a multiple-choice answer (free-form responses included), I parsed only the values that were properly formatted as lists: comma-separated, one entry per line, bullet lists, and numbered lists. This approach underestimates package popularity, but it generates less noise. (The code is attached as the first comment.)
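
To illustrate the idea of accepting only list-shaped responses, here is a minimal sketch using the standard `re` module. It is a simplified stand-in, not the code attached in the comment: the names `ITEM`, `LAYOUTS`, and `classify_layout`, and the exact bullet and numbering characters, are assumptions made for this example.

```python
import re

# Hypothetical item pattern: a word character followed by word characters
# or a few punctuation marks that commonly appear in package names.
ITEM = r"\w[\w+\-./*]*"
NUM = r"\d+[.)]\s*"   # assumed numbering style: "1." or "1)"
BUL = r"[-*+]\s*"     # assumed bullet characters

# Layouts are tried in order; the first full match wins.
LAYOUTS = [
    ("single", ITEM),
    ("numbered", rf"{NUM}{ITEM}(?:\s*\n\s*{NUM}{ITEM})*"),
    ("bulleted", rf"{BUL}{ITEM}(?:\s*\n\s*{BUL}{ITEM})*"),
    ("comma", rf"{ITEM}(?:\s*,\s*{ITEM})+"),
    ("newline", rf"{ITEM}(?:\s*\n\s*{ITEM})+"),
]


def classify_layout(response):
    """Return the name of the first layout the whole response matches, or None."""
    text = response.strip()
    for name, pattern in LAYOUTS:
        if re.fullmatch(pattern, text):
            return name
    return None
```

A free-form sentence such as "I mostly use magit these days" matches none of the layouts and is dropped, which is why this kind of filter undercounts but stays clean.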

What I found is that even this overly cautious approach produces an "Other" category that is more prominent than second place (which is still magit). So I guess the lesson here is "don't bother cleaning the data too strictly".

PS: I understand that this would work better as a pull request, but I was not sure how well GitHub handles diffs of Jupyter notebooks.

Answer 1 · 2020-12-13T00:17:35.000Z

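import numpy as np
import pandas as pd
import regex  # third-party module (pip install regex); provides match.captures()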


# Raw strings avoid invalid-escape warnings. The hyphen in the character class
# is escaped: unescaped, "+-." is a range that also matches ",".
RE_PACKAGE_NAME = r"\s*(\w[\w+\-./*]*)+\s*"
RE_COMMA_SEPARATED_LIST = r"(?:" + RE_PACKAGE_NAME + r",)*" + RE_PACKAGE_NAME
RE_BULLET_LIST = r"(?:\W\s*" + RE_PACKAGE_NAME + r"\n)*" + r"\W\s*" + RE_PACKAGE_NAME
RE_NUMBER_LIST = r"(?:\d+\.?\s*" + RE_PACKAGE_NAME + r"\n)*" + r"\d+\.?\s*" + RE_PACKAGE_NAME
RE_NEWLINE_SEPARATED_LIST = r"(?:" + RE_PACKAGE_NAME + r"\n)*" + RE_PACKAGE_NAME


def split_formatted(string):
    """Try to split response according to one of the formats or return a NaN"""
    
    # Lowercase everything
    string = string.lower().strip()
    
    # Drop trailing punctuation
    string = regex.sub(r"[.?!]+$", "", string)

    match = regex.fullmatch(RE_PACKAGE_NAME, string)
    if match:
        return match.captures(1)

    match = regex.fullmatch(RE_NUMBER_LIST, string)
    if match:
        return match.captures(1) + match.captures(2)
    
    match = regex.fullmatch(RE_BULLET_LIST, string)
    if match:
        return match.captures(1) + match.captures(2)
    
    match = regex.fullmatch(RE_COMMA_SEPARATED_LIST, string)
    if match:
        return match.captures(1) + match.captures(2)
    
    match = regex.fullmatch(RE_NEWLINE_SEPARATED_LIST, string)
    if match:
        return match.captures(1) + match.captures(2)
    
    return np.nan


def normalize_package(package_name):
    if package_name.endswith(".") or package_name.endswith(","):
        package_name = package_name[:-1]
    
    if package_name.endswith(".el"):
        package_name = package_name[:-3]
    
    if package_name.endswith("-mode"):
        package_name = package_name[:-5]
    
    return package_name


# `data` is assumed to be the survey responses DataFrame loaded earlier in the notebook.
packages = data["Can you list some of your favorite packages?"].dropna()

# Collect the responses corresponding to formats defined above.
# Surprisingly this covers around 80% of responses.
formatted_responses = packages.map(split_formatted)

# Let's compile the high-scores
favourite_packages = pd.Series(formatted_responses.dropna().sum())
favourite_packages = favourite_packages.map(normalize_package)

favourite_packages = (
    favourite_packages
      .groupby(favourite_packages)
      .size()
      .sort_values(ascending=False))

with pd.option_context("display.max_rows", None):
    from IPython.display import display
    display(favourite_packages.to_frame())
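
The tally step can also be written without pandas. Calling `.sum()` on a Series of lists concatenates them one pair at a time, which may get slow on larger inputs. The sketch below flattens with `itertools.chain` and counts with `collections.Counter` instead. The helper `strip_suffixes` is a simplified stand-in for `normalize_package`, and the `parsed` list is made-up sample data, not survey results.

```python
from collections import Counter
from itertools import chain


def strip_suffixes(name):
    """Simplified stand-in for normalize_package: drop '.el', then '-mode'."""
    for suffix in (".el", "-mode"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name


# Hypothetical parsed responses: one list of package names per respondent.
parsed = [["magit", "org-mode"], ["org", "evil"], ["magit.el"]]

# Flatten all lists lazily, normalize each name, and count occurrences.
counts = Counter(strip_suffixes(name) for name in chain.from_iterable(parsed))

# Print packages from most to least frequently mentioned.
for package, n in counts.most_common():
    print(package, n)
```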

Answer 2 · 2020-12-17T18:05:12.000Z

Thank you for sharing this and for the kind words!
I like your results, so I included your code in the notebook and referred to this issue. I also don't know how diffs of Jupyter notebooks would work, so this is safer.