Bioconductor/BiocFileCache

Problem with remote files with whitespaces in the file name


Hi Lori,

I would like to cache files from a public repository of mzML (raw mass spec data files) using BiocFileCache but it doesn't work because many of these files contain white spaces in their file names. Example:

library(curl)
url <- "ftp://massive.ucsd.edu/MSV000087155/ccms_peak/New_mzMLFinal/20160603151123624-1576262 Batch5_SHP77_2a.mzML"

Unfortunately, the file name contains a white space, so adding the file directly does not work:

library(BiocFileCache)
bfc <- BiocFileCache(tempdir())
path <- bfcrpath(bfc, url)
adding rname 'ftp://massive.ucsd.edu/MSV000087155/ccms_peak/New_mzMLFinal/20160603151123624-1576262 Batch5_SHP77_2a.mzML'
Error in bfcrpath(bfc, url) : not all 'rnames' found or unique.
In addition: Warning messages:
1: download failed
  web resource path: ftp://massive.ucsd.edu/MSV000087155/ccms_peak/New_mzMLFinal/20160603151123624-1576262 Batch5_SHP77_2a.mzML
  local file path: /tmp/Rtmp3Pr4NW/74d1e62e9_20160603151123624-1576262%20Batch5_SHP77_2a.mzML
  reason: URL using bad/illegal format or missing URL
2: bfcadd() failed; resource removed
  rid: BFC1
  fpath: ftp://massive.ucsd.edu/MSV000087155/ccms_peak/New_mzMLFinal/20160603151123624-1576262 Batch5_SHP77_2a.mzML
  reason: download failed
3: In value[[3L]](cond) : 
trying to add rname 'ftp://massive.ucsd.edu/MSV000087155/ccms_peak/New_mzMLFinal/20160603151123624-1576262 Batch5_SHP77_2a.mzML' produced error:
  bfcadd() failed; see warnings()

Replacing the white space with %20, as required for URLs, allows me to add the file to the cache, but this is not ideal because I have to change the original file name (which is usually used to link samples to the data files).

> url <- sub(" ", "%20", url, fixed = TRUE)
> bfc <- BiocFileCache(tempdir())
> path <- bfcrpath(bfc, url)
adding rname 'ftp://massive.ucsd.edu/MSV000087155/ccms_peak/New_mzMLFinal/20160603151123624-1576262%20Batch5_SHP77_2a.mzML'
  |======================================================================| 100%

What also puzzled me is that BiocFileCache further modified the file name by replacing the %20 with %2520 (???).

> path
                                                                          BFC2 
"/tmp/Rtmp3Pr4NW/76a624503_20160603151123624-1576262%2520Batch5_SHP77_2a.mzML" 

Ideally, I could pass the original remote file names (possibly containing white spaces) to BiocFileCache, and the package would fix the URL internally (e.g. replace white spaces with %20) but use the original file name for the local copy. In other words, given the original path and file name as above (20160603151123624-1576262 Batch5_SHP77_2a.mzML), BiocFileCache would download the file using the escaped name 20160603151123624-1576262%20Batch5_SHP77_2a.mzML and store the local copy under the original name 20160603151123624-1576262 Batch5_SHP77_2a.mzML. Would that be possible?
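To make the request concrete, here is a minimal base R sketch of the split I have in mind: keep the original file name for sample mapping and derive a download-safe URL from it. The helper name `split_remote_name` is hypothetical and not part of BiocFileCache; `utils::URLencode()` with its defaults is assumed to be sufficient here because it leaves `:` and `/` untouched and only percent-encodes characters such as the space.

```r
# Hypothetical helper (not part of BiocFileCache): keep the original file
# name and derive a URL that is safe to hand to a downloader.
split_remote_name <- function(url) {
    list(
        # name as it appears in the repository, white space included
        original_name = basename(url),
        # reserved characters (":" and "/") are kept, the space becomes %20
        download_url = utils::URLencode(url)
    )
}

url <- "ftp://massive.ucsd.edu/MSV000087155/ccms_peak/New_mzMLFinal/20160603151123624-1576262 Batch5_SHP77_2a.mzML"
parts <- split_remote_name(url)
parts$original_name
parts$download_url
```

Unlike `sub(" ", "%20", ...)`, this handles every space in the URL, not just the first one.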

lshep commented

bfcrpath is a shortcut to bfcadd -- could you use bfcadd with the valid url and then set the rname to the white space version?

Hm, seems not to work:

url <- "ftp://massive.ucsd.edu/MSV000087155/ccms_peak/New_mzMLFinal/20160603151123624-1576262 Batch5_SHP77_2a.mzML"
bfc <- BiocFileCache(tempdir())
path <- bfcadd(bfc, rname = url, fpath = sub(" ", "%20", url, fixed = TRUE))

still gives me:

> path
                                                                          BFC1 
"/tmp/RtmpzMQhnE/860cef5ab_20160603151123624-1576262%2520Batch5_SHP77_2a.mzML" 

i.e. there is a % in the file name.

Somehow the rname seems not to be considered:

> bfc <- BiocFileCache(tempdir())
> path <- bfcadd(bfc, rname = "AAAAA", fpath = sub(" ", "%20", url, fixed = TRUE),
+ fname = "exact")
  |======================================================================| 100%
> path
                                                                BFC4 
"/tmp/RtmpzMQhnE/20160603151123624-1576262%2520Batch5_SHP77_2a.mzML" 

lshep commented

I meant you could match / query on the rname then
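A rough sketch of what querying on the rname could look like. The helper `lookup_by_rname` is hypothetical, and a local placeholder file stands in for the remote resource so that the example runs offline; with the real data, `fpath` would be the escaped URL and `rname` the original name containing the white space.

```r
library(BiocFileCache)

# Hypothetical helper: return the local path(s) of the resource(s) that were
# registered under the original (white space containing) name.
lookup_by_rname <- function(bfc, original_name) {
    # exact = TRUE matches the string literally instead of as a regex
    hits <- bfcquery(bfc, original_name, field = "rname", exact = TRUE)
    if (nrow(hits) == 0L) return(character(0))
    bfcrpath(bfc, rids = hits$rid)
}

# Offline stand-in for the remote file (assumption for this sketch only)
src <- tempfile(fileext = ".mzML")
writeLines("placeholder", src)

bfc <- BiocFileCache(tempfile("bfc_demo_"), ask = FALSE)
original_name <- "20160603151123624-1576262 Batch5_SHP77_2a.mzML"
bfcadd(bfc, rname = original_name, fpath = src)

local_path <- lookup_by_rname(bfc, original_name)
```

The sample-to-file mapping then goes through the rname, so it does not matter how the cached file itself is named on disk.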

lshep commented

We specifically do a curl_escape to make sure the url can be downloaded; I believe we did this purposefully because different systems would fail when spaces and special characters were present. 47c4b23
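This would also be consistent with the %2520 observed above: `curl::curl_escape()` escapes the `%` character itself, so a name that already contains %20 gets escaped a second time. A small illustration, assuming the escaping is applied to the file name alone (I have not checked exactly where BiocFileCache applies it):

```r
library(curl)

file_name <- "20160603151123624-1576262 Batch5_SHP77_2a.mzML"

# First pass: the space becomes %20
escaped_once <- curl_escape(file_name)

# Second pass: the "%" of "%20" is itself escaped to "%25", giving "%2520"
escaped_twice <- curl_escape(escaped_once)

escaped_once
escaped_twice

# One curl_unescape() undoes exactly one round of escaping
curl_unescape(escaped_twice) == escaped_once
```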

Yes, that makes total sense. And your solution would fix the sample mapping issue, indeed.

Unfortunately I have a second issue ;) - mzR (or more specifically the ProteoWizard C++ libraries that are used by mzR) seems to have problems with % in the file names:

> library(mzR)
Loading required package: Rcpp
> openMSfile(path)
Error: Can not open file /tmp/RtmpzMQhnE/20160603151123624-1576262%2520Batch5_SHP77_2a.mzML! Original error was: Error: [References::resolve()] Failed to resolve reference.
  object type: N4pwiz6msdata23InstrumentConfigurationE
  reference id: IC1
  referent list: 0

Here I'm really unsure if and how that could be fixed ... but that's obviously not your business - I will see if I can fix that over in mzR...

lshep commented

I think you can manipulate file names locally but it might lose the ability to auto check for redownload -- but I'd have to look back into how to do this

No worries, all good. Your solution seems good to me, thanks!

lshep commented

FWIW -- There is a curl::curl_unescape that you might be able to use on the given filepath before using it anywhere else?

> url = "ftp://massive.ucsd.edu/MSV000087155/ccms_peak/New_mzMLFinal/20160603151123624-1576262 Batch5_SHP77_2a.mzML"
> temp = curl::curl_escape(url)
> temp
[1] "ftp%3A%2F%2Fmassive.ucsd.edu%2FMSV000087155%2Fccms_peak%2FNew_mzMLFinal%2F20160603151123624-1576262%20Batch5_SHP77_2a.mzML"
> curl::curl_unescape(temp)
[1] "ftp://massive.ucsd.edu/MSV000087155/ccms_peak/New_mzMLFinal/20160603151123624-1576262 Batch5_SHP77_2a.mzML"
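One possible way to apply this to the cached file without touching the cache itself is to copy it to a scratch directory under its decoded name and open the copy. This is only a sketch: `copy_with_decoded_name` is a hypothetical helper, it assumes that every percent sequence in the name comes from escaping (a name with a literal % would be decoded too far), and whether mzR then reads the copy has not been tested here. An empty placeholder file stands in for the cached mzML.

```r
library(curl)

# Hypothetical helper: copy `path` into `dest_dir` under its unescaped name.
# The cache entry stays as it is, so BiocFileCache can keep tracking it.
copy_with_decoded_name <- function(path, dest_dir = tempfile("decoded_")) {
    dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
    name <- basename(path)
    # Unescape until the name stops changing (%2520 -> %20 -> " ")
    repeat {
        decoded <- curl_unescape(name)
        if (decoded == name) break
        name <- decoded
    }
    target <- file.path(dest_dir, name)
    if (!file.copy(path, target, overwrite = TRUE))
        stop("could not copy ", path)
    target
}

# Placeholder standing in for the cached file with the doubly escaped name
cached <- file.path(tempdir(), "20160603151123624-1576262%2520Batch5_SHP77_2a.mzML")
writeLines("placeholder", cached)

decoded_copy <- copy_with_decoded_name(cached)
basename(decoded_copy)
```

If the cached name carries a random prefix (i.e. `fname = "exact"` was not used), the prefix is kept in the copy as well.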