bblfsh/python-client

`language` argument in `client.parse` changes number of files significantly

EgorBu opened this issue · 8 comments

Hi,
I found strange behavior of the language argument in client.parse.
If you parse files with and without this argument and then select the files with a specific language, you get a different number of files.
Code to reproduce:

import argparse
import glob
import os

import bblfsh
from bblfsh.client import NonUTF8ContentException


def prepare_files(folder, client, language, use_lang=True):
    files = []

    # collect filenames with full path
    filenames = glob.glob(folder, recursive=True)

    for file in filenames:
        if not os.path.isfile(file):
            continue
        try:
            # TODO (Egor): figure out why `language` argument changes number of files significantly
            if use_lang:
                res = client.parse(file, language)
            else:
                res = client.parse(file)
        except NonUTF8ContentException:
            # skip files that can't be parsed because of UTF-8 decoding errors.
            continue
        if res.status == 0 and res.language.lower() == language.lower():
            files.append(file)
    return files


def test_client(args):
    client = bblfsh.BblfshClient(args.bblfsh)
    files = prepare_files(args.input, client, args.language, args.use_lang)
    print("Number of files: %s" % (len(files)))


def create_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--input", required=True, type=str,
                        help="Path to folder with source code - "
                             "should be in a format compatible with glob (ends with **/* "
                             "and surrounded by quotes. Ex: `path/**/*`).")
    parser.add_argument("--bblfsh", default="0.0.0.0:9432",
                        help="Babelfish server's address.")
    # I'm using javascript for experiments
    parser.add_argument("-l", "--language", default="javascript",
                        help="Programming language to use.")
    parser.add_argument("-u", "--use-lang", action="store_true",
                        help="If lang in client.parse should be used.")
    return parser


def main():
    parser = create_parser()
    args = parser.parse_args()
    client = bblfsh.BblfshClient(args.bblfsh)
    files = prepare_files(args.input, client, args.language, args.use_lang)
    print("Number of files %s with args %s" % (len(files), args))


if __name__ == "__main__":
    main()

and results:

egor@egor-sourced:~/workspace/style-analyzer$ python3 lookout/style/format/test_client.py -i '/home/egor/workspace/tmp/freeCodeCamp/**/*'  
Number of files 187 with args Namespace(bblfsh='0.0.0.0:9432', input='/home/egor/workspace/tmp/freeCodeCamp/**/*', language='javascript', use_lang=False)
egor@egor-sourced:~/workspace/style-analyzer$ python3 lookout/style/format/test_client.py -i '/home/egor/workspace/tmp/freeCodeCamp/**/*'  -u
Number of files 258 with args Namespace(bblfsh='0.0.0.0:9432', input='/home/egor/workspace/tmp/freeCodeCamp/**/*', language='javascript', use_lang=True)
bzz commented

@EgorBu thank you for the detailed reproducible example!

First of all, this does not seem to be an issue with any language-specific client, but rather with bblfshd: if one does not pass a language to bblfshd, then, as can be seen here, enry is used to detect it.

So, most probably, this has more to do with Enry. If you and @juanjux do not mind, I would transfer this issue to the enry repo.

Enry does not have Python bindings (yet, see src-d/enry#154), so there is no easy way to plug a call to GetLanguage() into your script and make sure all the files are actually detected as JavaScript.

Could you please attach the output of find /home/egor/workspace/tmp/freeCodeCamp -type f | wc -l, so we know how many files there are? As you pass in **/*, this glob matches all files, not only JavaScript ones.

Your results basically are:

  • out of all N files, autodetect the language: 187 files are detected as JavaScript AND can be parsed (using the JavaScript driver).
  • out of all N files, assume all N are JavaScript: 258 files can be parsed (using the JavaScript driver).

The simplest way I can think of to reproduce the exact results is a small Go program that accepts the same globs, traverses the files recursively, counts them, and calls enry for each file, saving the results in a map "filename" -> "language".

This would allow us to determine why there are 71 files (258 - 187) that are not detected by Enry as JavaScript but at the same time seem to be parsable with Bblfsh.
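Even without enry bindings, the files in that gap can be listed from the Python side by parsing each file twice. A minimal sketch, assuming a hypothetical `parse(path, language=None)` callable that returns an object with `status` and `language` attributes, like the response used in the script above (in practice this would be `client.parse`). `find_gap` is an illustrative name, not part of the client, and handling of `NonUTF8ContentException` is left out for brevity:

```python
def find_gap(filenames, parse, language):
    """Return files accepted when `language` is forced but not when it is
    auto-detected.

    `parse` is a hypothetical callable: parse(path, language=None) returns
    an object with `.status` and `.language` attributes.
    """

    def accepted(res):
        # Same acceptance condition as in the reproduction script.
        return res.status == 0 and res.language.lower() == language.lower()

    gap = []
    for path in filenames:
        # One request with the language forced, one with auto-detection.
        if accepted(parse(path, language)) and not accepted(parse(path)):
            gap.append(path)
    return gap
```

Printing the returned paths (or just their extensions) should show which kinds of files make up the difference.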

Thank you @bzz, here is the result:

egor@egor-sourced:~/workspace/hercules$ find /home/egor/workspace/tmp/freeCodeCamp -type f | wc -l
500

I don't mind moving this issue to enry.

bzz commented

Ok, so it seems:

  • out of all 500 files, with language autodetection: only 187 files are found to be JavaScript AND can be parsed using the JavaScript driver
  • out of all 500 files, assuming all 500 are JavaScript, without language detection: 258 files can actually be parsed using the JavaScript driver

I would be happy to put together a small Go program to dig deeper if:

  • @EgorBu you could upload the result of tar -czf freeCodeCamp.tar.gz -C /home/egor/workspace/tmp/freeCodeCamp . somewhere (gdrive?) and
  • @smola could give a hand transferring this issue to the Enry repo

😀

bzz commented

Apparently, on GitHub one cannot just transfer an issue between orgs 😖

Here are the language detection stats for that repo:

enry --json freeCodeCamp | jq "to_entries|map(\"\(.key)=\(.value|length)\")" | less
[
  "JavaScript=184",
  "Pug=55",
  "JSX=55",
  "Less=15",
  "CSS=6",
  "EJS=5",
  "HTML=1",
  "Markdown=1",
  "SVG=1",
  "Text=1"
]
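The same per-language counts can be computed in Python. A minimal sketch, assuming `enry --json` prints a JSON object that maps each language name to a list of file paths (which is what the jq expression above relies on). `count_by_language` and the sample input are illustrative:

```python
import json


def count_by_language(enry_json):
    """Map each language name to the number of files listed under it."""
    files_by_language = json.loads(enry_json)
    return {lang: len(paths) for lang, paths in files_by_language.items()}


# Made-up sample input, not real enry output.
sample = '{"JavaScript": ["a.js", "b.js"], "JSX": ["c.jsx"]}'
print(count_by_language(sample))
```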

Right now it's a known issue that, although JSX can be parsed by the bblfsh JavaScript driver, it will not be, because Enry does not detect it as JavaScript: bblfsh/javascript-driver#46 (comment)

So out of the 71 files, JSX accounts for 55, which leaves only 16.
Most probably the same holds for EJS, which is also embedded JS templates; that leaves 11.
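The accounting, written out as a quick check. The counts are the ones reported in this thread; that every JSX and EJS file falls into the gap is an assumption, not something verified here:

```python
parsed_with_forced_language = 258  # run with -u
parsed_with_autodetection = 187    # run without -u
enry_counts = {"JSX": 55, "EJS": 5}  # from the enry stats in this thread

gap = parsed_with_forced_language - parsed_with_autodetection
after_jsx = gap - enry_counts["JSX"]        # assumes all JSX files are in the gap
after_ejs = after_jsx - enry_counts["EJS"]  # assumes the same for EJS

print(gap, after_jsx, after_ejs)  # 71 16 11
```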

bzz commented

@EgorBu overall, I would say it behaves as expected and Enry performs well.

The only way to "change" it that I can see is on the bblfsh side, by introducing a notion of "dialects" for the drivers, so that a single "language driver", e.g. JavaScript, could parse multiple "language dialects" (like JSX and JavaScript, or Bash and Zsh, etc.).

This has been proposed and discussed a little under bblfsh/bash-driver#39 (comment), but it needs more consideration to include things like src-d/enry#182.
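To make the idea a bit more concrete, here is a rough sketch of what a dialect table could look like. Everything in it is hypothetical: the `DIALECTS` mapping, the `driver_for` helper, and the choice of which dialects a driver claims are illustrative only and are not part of bblfshd or Enry:

```python
# Hypothetical table: driver name -> detected language names it would accept.
DIALECTS = {
    "javascript": {"JavaScript", "JSX"},
    "bash": {"Bash", "Zsh"},
}


def driver_for(detected_language):
    """Return the driver that claims `detected_language`, or None."""
    wanted = detected_language.lower()
    for driver, dialects in DIALECTS.items():
        if wanted in {name.lower() for name in dialects}:
            return driver
    return None


print(driver_for("JSX"))  # javascript
print(driver_for("Pug"))  # None
```

With a table like this, a file detected as JSX would be routed to the JavaScript driver instead of being rejected because the names differ.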

Does the explanation above answer your question?

Thank you, @bzz, yes, it makes a lot of sense. And it matches my expectation that the js-driver can't distinguish between different dialects of JS. Documentation should probably be added too, so other users will know about this caveat.

bzz commented

Great. And it's easy to verify that CSS can also be parsed by the JavaScript driver:

$ docker run --rm -d -p 9432:9432 bblfsh/javascript-driver:v2.5.0
$ bblfsh-cli -o json public/css/ubuntu.css | jq . | wc -l
46

I'm going to go ahead and close this issue then.

What I will try to do is put together a proposal in a new issue under bblfshd for handling dialects, so that the mismatch between enry and bblfsh language names is covered.