Path handling is incorrect [FileNotFoundError, empty bert.pt files, etc]
xkortex opened this issue · 1 comments
Hi,
I'm currently working with a library which relies on this repo for a lot of BERT file preprocessing. I'm noticing there are a lot of improper/brittle cases of path handling. Since this library is getting quite popular (which I bet took you by surprise!), I thought I'd point some things out.
It seems like a lot of issues are potentially related to the way this library, in particular data_builder.py, deals with paths, e.g. #180 #157 #144 #92 are all likely related. I think the original author has probably moved on, but maybe a future maintainer can fix this, or at least this issue might be enlightening to some.
E.g.:
PreSumm/src/prepro/data_builder.py
Lines 282 to 285 in 70b810e
and
PreSumm/src/prepro/data_builder.py
Line 340 in 70b810e
You've got the right idea with pjoin
but the '/' is not Windows portable and not guaranteed to handle paths correctly, and [-1]
is not guaranteed to give you the raw file name. You probably want os.path.basename
here.
>>> r"C:\Users\mike\foo.bar.txt".split('/')[-1].split('.')[0]
'C:\\Users\\mike\\foo'
I'd also recommend looking at boltons.pathutils,
in particular boltons.pathutils.augpath. E.g. this line incorrectly handles filename splitting:
>>> from boltons import pathutils
>>> pathutils.augpath('/foo/bar/somefile.x.y.z.rsd.txt', ext='.jpg', multidot=True) # replace full extension
'/foo/bar/somefile.jpg'
>>> pathutils.augpath('/foo/bar/somefile.x.y.z.rsd.txt', dpath='', ext='', multidot=True) # get the file "ID" aka strip path and ext
'somefile'
Cheers,