Problem with remote files with whitespaces in the file name
Hi Lori,
I would like to cache mzML files (raw mass spectrometry data) from a public repository using BiocFileCache,
but this fails because many of these files contain white space in their file names. Example:
library(curl)
url <- "ftp://massive.ucsd.edu/MSV000087155/ccms_peak/New_mzMLFinal/20160603151123624-1576262 Batch5_SHP77_2a.mzML"
Because of the white space in the file name, adding the file directly does not work:
library(BiocFileCache)
bfc <- BiocFileCache(tempdir())
path <- bfcrpath(bfc, url)
adding rname 'ftp://massive.ucsd.edu/MSV000087155/ccms_peak/New_mzMLFinal/20160603151123624-1576262 Batch5_SHP77_2a.mzML'
Error in bfcrpath(bfc, url) : not all 'rnames' found or unique.
In addition: Warning messages:
1: download failed
web resource path: ‘ftp://massive.ucsd.edu/MSV000087155/ccms_peak/New_mzMLFinal/20160603151123624-1576262 Batch5_SHP77_2a.mzML’
local file path: ‘/tmp/Rtmp3Pr4NW/74d1e62e9_20160603151123624-1576262%20Batch5_SHP77_2a.mzML’
reason: URL using bad/illegal format or missing URL
2: bfcadd() failed; resource removed
rid: BFC1
fpath: ‘ftp://massive.ucsd.edu/MSV000087155/ccms_peak/New_mzMLFinal/20160603151123624-1576262 Batch5_SHP77_2a.mzML’
reason: download failed
3: In value[[3L]](cond) :
trying to add rname 'ftp://massive.ucsd.edu/MSV000087155/ccms_peak/New_mzMLFinal/20160603151123624-1576262 Batch5_SHP77_2a.mzML' produced error:
bfcadd() failed; see warnings()
Replacing the white space with %20, as required for URLs, allows me to add the file to the cache, but this is not ideal because I need to change the original file name (which is usually used to link samples to the data files).
> url <- sub(" ", "%20", url, fixed = TRUE)
> bfc <- BiocFileCache(tempdir())
> path <- bfcrpath(bfc, url)
adding rname 'ftp://massive.ucsd.edu/MSV000087155/ccms_peak/New_mzMLFinal/20160603151123624-1576262%20Batch5_SHP77_2a.mzML'
|======================================================================| 100%
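One caveat with the workaround above: sub() replaces only the first match, so a file name with several spaces would still be rejected. A minimal sketch of the same idea that covers every space, assuming base R's utils::URLencode() is an acceptable way to escape these URLs (the name `multi` is a made-up example, not a file from the repository):

```r
url <- "ftp://massive.ucsd.edu/MSV000087155/ccms_peak/New_mzMLFinal/20160603151123624-1576262 Batch5_SHP77_2a.mzML"

# URLencode() leaves reserved characters such as ":" and "/" alone by default
# and percent-encodes the rest, so every space becomes %20.
escaped <- utils::URLencode(url)

# Hypothetical file name with two spaces, to show the difference to sub().
multi <- "sample 1 batch 2.mzML"
first_only <- sub(" ", "%20", multi, fixed = TRUE)  # "sample%201 batch 2.mzML"
all_spaces <- utils::URLencode(multi)               # "sample%201%20batch%202.mzML"
```

gsub(" ", "%20", url, fixed = TRUE) would work as well if spaces are the only problematic characters.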
What also puzzled me is that BiocFileCache further modified the file name by replacing the %20 with %2520 (???):
> path
BFC2
"/tmp/Rtmp3Pr4NW/76a624503_20160603151123624-1576262%2520Batch5_SHP77_2a.mzML"
What would be ideal, however, is if I could provide the original file names (possibly containing white space) for remote sources to BiocFileCache, and the package internally fixed the URLs (e.g. by replacing white space with %20) but then used the original file name for the local copy. In other words, it would be great if I could provide the original path and file name as above (20160603151123624-1576262 Batch5_SHP77_2a.mzML), have BiocFileCache download that file (fixing the file name in the URL to 20160603151123624-1576262%20Batch5_SHP77_2a.mzML) and store the local copy under the original file name 20160603151123624-1576262 Batch5_SHP77_2a.mzML. Would that be possible?
bfcrpath is a shortcut to bfcadd -- could you use bfcadd with the valid URL as fpath, but set the rname to the white space version?
Hm, seems not to work:
url <- "ftp://massive.ucsd.edu/MSV000087155/ccms_peak/New_mzMLFinal/20160603151123624-1576262 Batch5_SHP77_2a.mzML"
bfc <- BiocFileCache(tempdir())
path <- bfcadd(bfc, rname = url, fpath = sub(" ", "%20", url, fixed = TRUE))
still gives me:
> path
BFC1
"/tmp/RtmpzMQhnE/860cef5ab_20160603151123624-1576262%2520Batch5_SHP77_2a.mzML"
i.e. there is a % in the file name.
Somehow the rname does not seem to be considered:
> bfc <- BiocFileCache(tempdir())
> path <- bfcadd(bfc, rname = "AAAAA", fpath = sub(" ", "%20", url, fixed = TRUE),
+ fname = "exact")
|======================================================================| 100%
> path
BFC4
"/tmp/RtmpzMQhnE/20160603151123624-1576262%2520Batch5_SHP77_2a.mzML"
I meant you could match / query on the rname then.
We specifically do a curl_escape to make sure the URL can be downloaded; I believe we did this purposefully because different systems would fail when spaces and special characters were present. 47c4b23
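The %2520 seen earlier in the thread is consistent with an escaping step being applied to a name that already contains %20: the percent sign itself is escaped to %25. A minimal sketch of that effect, using base R's utils::URLencode() as a stand-in for curl_escape (an assumption made for illustration; the package's actual code path is not shown here):

```r
original <- "20160603151123624-1576262 Batch5_SHP77_2a.mzML"

# First pass: the space becomes %20.
once <- utils::URLencode(original, reserved = TRUE)

# Second pass on the already-escaped name: "%" becomes "%25", giving %2520.
# repeated = TRUE is needed because URLencode() otherwise skips strings
# that already contain percent-escapes.
twice <- utils::URLencode(once, reserved = TRUE, repeated = TRUE)

# Decoding undoes one level of escaping per call.
back_once  <- utils::URLdecode(twice)      # same as `once`
back_twice <- utils::URLdecode(back_once)  # same as `original`
```

This matches the two local names reported above: %20 when the URL was passed with a literal space, and %2520 when it was passed already escaped.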
Yes, that makes total sense. And your solution would fix the sample mapping issue, indeed.
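A rough sketch of the rname-based lookup discussed here, assuming the resource was added with the original (white space) name as rname and the escaped URL as fpath. `cached_path_for()` is a hypothetical helper, not a BiocFileCache function; it only wraps bfcquery() and bfcrpath():

```r
# Hypothetical helper: return the local path of the single cache entry whose
# rname is exactly `original_name`.
cached_path_for <- function(bfc, original_name) {
  hit <- BiocFileCache::bfcquery(bfc, original_name, field = "rname", exact = TRUE)
  if (nrow(hit) != 1L) {
    stop("expected exactly one cache entry for '", original_name,
         "', found ", nrow(hit))
  }
  # Resolve the resource id to its path in the cache.
  BiocFileCache::bfcrpath(bfc, rids = hit$rid)
}
```

With the objects from the thread, this would be used as bfcadd(bfc, rname = url, fpath = sub(" ", "%20", url, fixed = TRUE)) followed by cached_path_for(bfc, url). The returned path still carries the escaped file name; only the lookup key is the original one.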
Unfortunately I have a second issue ;) - mzR (or more specifically the ProteoWizard C++ libraries that are used by mzR) seems to have problems with % in the file names:
> library(mzR)
Loading required package: Rcpp
> openMSfile(path)
Error: Can not open file /tmp/RtmpzMQhnE/20160603151123624-1576262%2520Batch5_SHP77_2a.mzML! Original error was: Error: [References::resolve()] Failed to resolve reference.
object type: N4pwiz6msdata23InstrumentConfigurationE
reference id: IC1
referent list: 0
Here I'm really unsure if and how that could be fixed ... but that's obviously not your business - I will see if I can fix that over in mzR ...
I think you can manipulate file names locally, but it might lose the ability to auto check for redownload -- I'd have to look back into how to do this
no worries, all good. your solution seems good to me, thanks!
FWIW -- There is a curl::curl_unescape that you might be able to use on the given filepath before using it anywhere else?
> url = "ftp://massive.ucsd.edu/MSV000087155/ccms_peak/New_mzMLFinal/20160603151123624-1576262 Batch5_SHP77_2a.mzML"
> temp = curl::curl_escape(url)
> temp
[1] "ftp%3A%2F%2Fmassive.ucsd.edu%2FMSV000087155%2Fccms_peak%2FNew_mzMLFinal%2F20160603151123624-1576262%20Batch5_SHP77_2a.mzML"
> curl::curl_unescape(temp)
[1] "ftp://massive.ucsd.edu/MSV000087155/ccms_peak/New_mzMLFinal/20160603151123624-1576262 Batch5_SHP77_2a.mzML"
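Unescaping the cached path only changes the string; no file with that name exists on disk, so a reader such as mzR would still need a real file under the unescaped name. One possible workaround, sketched with base R only, is to copy the cached file to a scratch directory under the original base name. `copy_with_original_name()` is a hypothetical helper, and the file written to `cached` is a stand-in for a real cache entry:

```r
# Hypothetical helper (not part of BiocFileCache): copy a cached file into
# `dest_dir` under the base name of the original, unescaped URL.
copy_with_original_name <- function(cached_path, original_url,
                                    dest_dir = tempfile("mzml_")) {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  target <- file.path(dest_dir, basename(original_url))
  if (!file.copy(cached_path, target, overwrite = TRUE)) {
    stop("could not copy '", cached_path, "' to '", target, "'")
  }
  target
}

# Stand-in for a cached file; in practice this would be the bfcrpath() result.
cached <- file.path(tempdir(),
                    "860cef5ab_20160603151123624-1576262%2520Batch5_SHP77_2a.mzML")
writeLines("placeholder", cached)

url <- "ftp://massive.ucsd.edu/MSV000087155/ccms_peak/New_mzMLFinal/20160603151123624-1576262 Batch5_SHP77_2a.mzML"
local_copy <- copy_with_original_name(cached, url)
basename(local_copy)
# "20160603151123624-1576262 Batch5_SHP77_2a.mzML"
```

The copy is not tracked by the cache, so it would not be refreshed if the remote file changes; the cached original remains the entry that is checked for redownload.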