nlpyang/PreSumm

Path handling is incorrect [FileNotFoundError, empty bert.pt files, etc]

xkortex opened this issue · 1 comments

Hi,

I'm currently working with a library which relies on this repo for a lot of BERT file preprocessing. I'm noticing there are a lot of improper/brittle cases of path handling. Since this library is getting quite popular (which I bet took you by surprise!), I thought I'd point some things out.

It seems like a lot of issues are potentially related to the way this library, in particular data_builder.py, deals with paths, e.g. #180 #157 #144 #92 are all likely related. I think the original author has probably moved on, but maybe a future maintainer can fix this, or at least this issue might be enlightening to some.

E.g.:

for json_f in glob.glob(pjoin(args.raw_path, '*' + corpus_type + '.*.json')):
real_name = json_f.split('/')[-1]
a_lst.append((corpus_type, json_f, args, pjoin(args.save_path, real_name.replace('json', 'bert.pt'))))
print(a_lst)

and

real_name = f.split('/')[-1].split('.')[0]

You've got the right idea with pjoin but the '/' is not Windows portable and not guaranteed to handle paths correctly, and [-1] is not guaranteed to give you the raw file name. You probably want os.path.basename here.

>>> r"C:\Users\mike\foo.bar.txt".split('/')[-1].split('.')[0] 
'C:\\Users\\mike\\foo'

I'd also recommend looking at boltons.pathutils,
in particular boltons.pathutils.augpath. E.g. this line incorrectly handles filename splitting:

>>> from boltons import pathutils
>>> pathutils.augpath('/foo/bar/somefile.x.y.z.rsd.txt', ext='.jpg', multidot=True) # replace full extension
'/foo/bar/somefile.jpg'
>>> pathutils.augpath('/foo/bar/somefile.x.y.z.rsd.txt', dpath='', ext='', multidot=True) # get the file "ID" aka strip path and ext
'somefile'

Cheers,

Thanks for your sharing. I am sorry, I am confused that which part I should changed the code. Please give me a hand. Looking forward to your reply, really appreciate it.
image