nexB/deltacode

Migrate to using json-to-csv script in lieu of separate formatting option

steven-esser opened this issue · 25 comments

Instead of maintaining two different 'branches' of output, we should default to json only.

We will still want a way to view the results in csv form, but this can be moved to a separate script that only takes a deltacode json output and converts it to csv.

UX should not change: -c output should still output a csv file. It is in the internals that we would just make a call to our json2csv script instead of write_csv

For now, we can keep this script in the etc/scripts/ directory, like scancode does.

@MaJuRG Strange, still not able to invoke help:

(deltacode) Thu Mar 29, 2018 05:52 PM  /c/code/nexb/dev/deltacode JMH (75-use-json-to-csv-script)
$ etc/scripts/json2csv -h
bash: etc/scripts/json2csv: No such file or directory

(deltacode) Thu Mar 29, 2018 05:52 PM  /c/code/nexb/dev/deltacode JMH (75-use-json-to-csv-script)
$ etc/scripts/json2csv.py -h
etc/scripts/json2csv.py: line 25: from: command not found
etc/scripts/json2csv.py: line 26: from: command not found
etc/scripts/json2csv.py: line 27: from: command not found
etc/scripts/json2csv.py: line 29: import: command not found
etc/scripts/json2csv.py: line 30: from: command not found
etc/scripts/json2csv.py: line 31: import: command not found
etc/scripts/json2csv.py: line 32: import: command not found
etc/scripts/json2csv.py: line 34: import: command not found
etc/scripts/json2csv.py: line 37: import: command not found
etc/scripts/json2csv.py: line 40: syntax error near unexpected token `('
etc/scripts/json2csv.py: line 40: `def load_deltas(json_input):'

(deltacode) Thu Mar 29, 2018 05:52 PM  /c/code/nexb/dev/deltacode JMH (75-use-json-to-csv-script)
$

Hmmm. I'll figure it out . . . .

BTW, the syntax error seems erroneous. It purports to cover the first method, which looks OK to me:

from __future__ import print_function
from __future__ import absolute_import
from __future__ import unicode_literals

import codecs
from collections import OrderedDict
import json
import os

import click
click.disable_unicode_literals_warning = True

import unicodecsv


def load_deltas(json_input):
    """
    Return a list of scan results loaded from a json_input, either in
    DeltaCode standard JSON format.
    """
    with codecs.open(json_input, 'rb', encoding='utf-8') as jsonf:
        delta = jsonf.read()

    delta_results = json.loads(delta, object_pairs_hook=OrderedDict)

    delta_results = delta_results['deltas']

    return delta_results

We see the syntax error only when running etc/scripts/json2csv.py -h -- the ScanCode source for our work says the command should be run without the .py extension: etc/scripts/json2csv -h

Is the script able to be run correctly from the cli at all? If it is we shouldn’t worry about getting -h to work right this second if that is the case

No -- I get a similar result without the -h:

(deltacode) Fri Mar 30, 2018 08:48 AM  /c/code/nexb/dev/deltacode JMH (75-use-json-to-csv-script)
$ etc/scripts/json2csv
bash: etc/scripts/json2csv: No such file or directory

(deltacode) Fri Mar 30, 2018 08:49 AM  /c/code/nexb/dev/deltacode JMH (75-use-json-to-csv-script)
$

Using a full path works:

(deltacode) Fri Mar 30, 2018 10:11 AM  /c/code/nexb/dev/deltacode JMH (75-use-json-to-csv-script)
$ python etc/scripts/json2csv.py
Usage: json2csv.py [OPTIONS] JSON_INPUT CSV_OUTPUT

Error: Missing argument "json_input".

(deltacode) Fri Mar 30, 2018 10:11 AM  /c/code/nexb/dev/deltacode JMH (75-use-json-to-csv-script)
$ python etc/scripts/json2csv.py -h
Usage: json2csv.py [OPTIONS] JSON_INPUT CSV_OUTPUT

  Convert a DeltaCode JSON file to a CSV.

Options:
  -h, --help  Show this message and exit.

(deltacode) Fri Mar 30, 2018 10:11 AM  /c/code/nexb/dev/deltacode JMH (75-use-json-to-csv-script)
$ python etc/scripts/json2csv.py  /c/code/nexb/dev/deltacode_tests/scan_1_file_moved_20180328.json /c/code/nexb/dev/deltacode_tests/scan_1_file_moved_20180330.csv

(deltacode) Fri Mar 30, 2018 10:12 AM  /c/code/nexb/dev/deltacode JMH (75-use-json-to-csv-script)
$

Sample output (though with unwanted unicode prefix and quotation marks):

score,factors,new,old
0,[u'moved'],b/a4.py,a/a4.py

@johnmhoran We need to handle factors better. something like ' '.format(factors)

meaning all we need to do is construct a string from the list of factors, where each factor is separated by a space.

@MaJuRG Tried a number of approaches. This works on my initial one-factor test: ('factors', ' '.join(factors)).

        yield OrderedDict([
            ('score', score),
            # ('factors', factors),  # unicode and quotes
            # ('factors', ''.format(factors)),  # empty
            # ('factors', ' '.format(factors)),  # empty
            # ('factors', '{}'.format(factors)),  # unicode and quotes
            # ('factors', u'{}'.format(factors)),  # unicode and quotes
            # ('factors', u'{:s}'.format(factors)),  # unicode and quotes
            # ('factors', u"{}".format(factors)),  # unicode and quotes
            # ('factors', u"{0}".format(factors)),  # unicode and quotes
            ('factors', ' '.join(factors)),
            ('new', new),
            ('old', old),
        ])

that looks good

i always mix up format and join

👍

@MaJuRG Our current CSV output headers are Factors,Score,Path,Name,Type,Size,Old Path while our new json2csv headers are score,factors,new,old. Do we want to use the structure and content of the current CSV headers, switch to the new headers, or adopt some other approach?

Let’s copy them for now; forgot about those additions

Thanks @MaJuRG , will do.

@MaJuRG I've been trying to call our json2csv.py methods from the cli() method inside cli.py. To do so I'm trying to import json2csv (or alternatively one of its methods) from the sibling directory, etc/scripts/ (cli.py is inside src/deltacode/).

deltacode
    etc
        scripts
            json2csv.py
    src
        deltacode
            cli.py

Have tried a number of approaches, but the typical error has been a variation on this message: ImportError: No module named etc.scripts.json2csv. A few: ValueError: Attempted relative import beyond toplevel package. I get the feeling that using from . . . for importing is limited to the src directory.

I thought ScanCode might have had a similar need and could provide an example, but I've not yet found one. What's the proper way to import and call a json2csv method from inside cli.py?

@johnmhoran the json2csv script is not in a proper python module available in the path.
json2csv as it is cannot be imported in the deltacode modules as is. You need to move json2csv in a proper python module or move its core code in such module: for now, json2csv can import code from deltacode, but not the other way around so this needs a tiny bit of module design to decide which feature lives where

Thanks @pombredanne . I'll coordinate with @MaJuRG on the design work. Perhaps we need to move json2csv.py into src/deltacode/, where cli.py and its siblings currently live (and move the companion test, test_json2csv.py, into our tests/ directory).

@johnmhoran I think we should proceed with removing the csv code on the deltacode side (the -c option and relating functions) and focus on just supporting a json output.

For displaying info in a csv (for audit reasons or otherwise), we can just use the json2csv script. We will have more control over what the csv output can be without the overhead of supporting it in deltacode proper.

@MaJuRG Just to be clear, we want to retain the json2csv structure we started with inside etc/scripts/, right?

FYI, after saving that structure in a commit, I moved json2csv.py into src/deltacode/ (and moved the related test file and data). The CSV conversion was available with this command: python src/deltacode/json2csv.py /c/code/nexb/dev/deltacode_tests/scan_sorted01_20180328.json /c/code/nexb/dev/deltacode_tests/scan_sorted01_20180401.csv

I then worked on calling json2csv.json_delta_to_csv() from cli(), trying various approaches using a temporary file to first create JSON output and then pass that output to our new conversion function. The closest I got used the code excerpted below, which throws the error ValueError: Unterminated string starting at: line 142 column 17 (char 4233) when a deltacode -n [new_JSON_path] -o [old_JSON_path] -c [CSV_output_path] command is run.

. . .
from deltacode import json2csv
from tempfile import NamedTemporaryFile

. . .

def cli(new, old, csv_file, json_file, all_delta_types):

. . .

    options = OrderedDict([
        ('new_scan_path', new),
        ('old_scan_path', old),
        ('--all-delta-types', all_delta_types)
    ])

    # do the delta
    deltacode = DeltaCode(new, old, options)

    # output to csv
    if csv_file:
        # write_csv(deltacode, csv_file, all_delta_types)
        temp_file = NamedTemporaryFile(mode='w+t', delete=False)
        write_json(deltacode, temp_file, all_delta_types)
        json2csv.json_delta_to_csv(temp_file.name, csv_file)
    # generate JSON output
    else:
        write_json(deltacode, json_file, all_delta_types)

Have not been able to figure out the source of the error (or, more importantly, the fix). The error seems to be triggered by this line inside json2csv.load_deltas(): delta_results = json.loads(delta, object_pairs_hook=OrderedDict)

@johnmhoran Yes, we want to keep that old structure in etc/scripts/. There should no longer be errors for you once you revert back.

I see what you were trying to do, and you are going about it the right way. It is hard to tell the source of the error without seeing the actual stack trace in full. It could be as simple as a ./configure.bat.

In any case, we no longer want this csv stuff in src/deltacode/cli once the script is moved back and working.

Thanks @MaJuRG . Understood and will do, and will then work on adding tests. (BTW, had already run ./configure.bat, to no avail. ;-[ )

@MaJuRG Do we want the main DeltaCode help to say anything about the new json2csv script? Currently that info is available only by running python etc/scripts/json2csv.py -h.

No, but we will mention it somewhere in the Docs outside the codebase.

👍

comments left on the latest commit. you can open a PR once you make the changes.

Merged #94, closing.