open-cogsci/QOpenScienceFramework

Download broken due to redirect

smathot opened this issue · 25 comments

You can no longer download files through the OSF explorer. If you do, the downloaded file is a redirect page instead of the actual file (see below). This has broken the OSF integration in OpenSesame.

My best guess is that this is due to a change in behavior of the OSF API. It seems to have happened a few days ago.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>Redirecting...</title>
<h1>Redirecting...</h1>
<p>You should be redirected automatically to target URL: <a href="https://accounts.osf.io/login?service=https://osf.io/download/trjwg/">https://accounts.osf.io/login?service=https://osf.io/download/trjwg/</a>.  If not click the link.



<!DOCTYPE html>












<html lang="en">
    <head>
        <meta charset="UTF-8" />

        <title>Open Science Framework | Sign In </title>

        
        <link rel="stylesheet" href="/css/cas.css" />
        <link rel="icon" href="/favicon.ico" type="image/x-icon" />

        <!--[if lt IE 9]>
            <script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.6.1/html5shiv.js" type="text/javascript"></script>
        <![endif]-->

        <link href='https://fonts.googleapis.com/css?family=Open+Sans:400,600,300,700' rel='stylesheet' type='text/css'>
    </head>

    <body id="cas" onload="selectFocus()">
        <div id="container">
            <br>
            <header>
                <div class="center">
                    
                    <a id="logo" class="center" href="https://osf.io/" title="Open Science Framework Sign In">Open Science Framework | Sign In</a>
                </div>
                <br>
                <div class="center">
                    <span id="title">
                        
                            
                            
                                <span class="title-full">Open Science Framework</span>
                                <span class="title-abbr">OSF</span>
                            
                            
                        
                    </span>
                </div>
                <div class="responsive">
                    <div id="description">
                        <br><br>
                        
                            
                            
                                    Sign in with your OSF Account to continue
                            
                            
                        
                    </div>
                </div>
            </header>
            <br>
            <div id="content">






    


<div class="box" id="login">
    <form id="fm1" action="/login?service=https://osf.io/download/trjwg/" method="post">

        

        <section class="row">
            <label for="username"><span class="accesskey">E</span>mail:</label>
            
                
                
                    
                    <input id="username" name="username" class="required" tabindex="1" accesskey="e" type="text" value="" size="25" autocomplete="off"/>
                
            
        </section>

        <section class="row">
            <label for="password"><span class="accesskey">P</span>assword:</label>
                
            
            <input id="password" name="password" class="required" tabindex="2" accesskey="p" type="password" value="" size="25" autocomplete="off"/>
            <span id="capslock-on" style="display:none;"><p><img src="images/warning.png" valign="top"> CAPSLOCK key is turned on!</p></span>
        </section>

        

        

        <section class="row btn-row">
            <input type="hidden" name="lt" value="LT-151559-rM7yDt4Ah9MpTXmRjRbDf1Sbwf2FJF-accounts.osf.io" />
            <input type="hidden" name="execution" value="f3cba3f8-65f1-4523-8c05-489eb94c019f_AAAAIgAAABDrTCBrwYO+4L6+fq4QJKd3AAAABmFlczEyOIAu5jwuKt0mJZ2SQtbpUS34t4YVTOwTrc5EYkH2khtYdrjZ/eiRb4ppXX/ln6H9rzxhFfruC6+5c44neIFLPAAIekIKe3fxYsMTT/jSpQi0q98aTI6zMHJopR457+eayOK7XhOsAyJ8MYT9Kpifz2hyz/Xclf7AoG1jPdiR0UibEQoKZmXejPUXPRhyHHng/RMZtzO3QffzFipWGA+GXm+jh2Jb1ByHeLNaI0lSWuZEaprC9Eb3sO7cW5v5KHMehCgYyVRkicbxEEWxgYS/HV+cfJn4PDMzJKMoRcmEWFEDWEdhhoVGq1R+h6YIbOHxzJiu74VDnk2Yn8uM0tdj0Z3jiTe5/NZ+GOBQbWsKtjSoontsWLnGdpQhLJ2nXeAvwmnaJtTJSQPJmN96qhjFtx1XcbX9ML8+rXSaTfrZdmr5CWuqQhvaV0UlADjEgQYMFYnQ/LultrQQkmPLF/ShQr/edfDUDzZZPRPV4Lvr+UWKSpz5m69ABBHt7cfRHuawG5FqYgzynMoobqSHhUISQRL2dd02hdpDBBsT4+otVSftlGAnygq5sx4MjqVd1moqAHPe/vh49NNeiq+CvcDid9m3+J/GBox7o5s8+IoG7Nv6mGzG0cEf3J9N0CqTVlWw4rOWUpXKq8dkER2tB4ZKyaJc0Whm5EKD4g7HVu1AQMRN+8tKwmbNezmxeeHzFZ6UVysnNrXLy+Gc8OII6MwjncVG+dVhHuPgkdbWSq8ECz05svkz0Wd0BFbtj02A4WXlPDtCLgPGI3t8KvcInG/TueL+iaqwi4ZvLJOIK2DfX688zjYEIBQH2Qai+eqP0JEwwGJCxE+Ia3rvaSCZrxSeGzE4xyssx8dhjHWjjR9YsljX3noLsoXu0z7gr1qSu3bxRyFtpjTOvFb+F+Wobc3DxiuPFCnRAeEgH1DwMnMSUMrKYx78U+dYDlcZhVJxTCohe2HLDyPBPYpjF/ce/Q4vGzr4/Z7CZbEhvs+llRrZAsGMLdLs+DVuIs7iLGWb9zRHFFSlRn3KP5y74u8Dc+bVVpuY1WmvQ5IGpnsrJlQr5H1/CTtT+A5DNy8YA9EbaXVVISmMtZlHS1XltfOpuIkoyVF+8n12Cih1qRYkfIt1b76PGDAbr1DFtwhxBqfIDa8JclhfL5CxQ2AQGHIjNrxzgN/7p0cRFVqQMp2qutAJk1DGldlU9oZvwvHSR7Q6hVI3ek/4v6Rxq+p8MfgiimgvkWgwYdRhfEfdhh6YUvizs8wLmqhWXLxF8FXraXg+3yggOc5NQm5T6m8b8KWb+xmefxu/6mmpKxrzrFikCUXH/viBDqEFupRffglvhTRw8CMuz47eoJWHPxtLFivqEoEibEGqO6TF6B0qzGXXwd74n/ttG79nLAmFqybxtt5AwYzO4hMG+cxMSWYx/9XxAT3ox5Jncx15zCgogAbJHVYY1s3AyoAe9Orn3ZNAPA57qUdsgESxvDffSxZVvh+tikW/x7xLq3wKVWnHGY5b2wWkURkcnlBtUYkxq4nozmvSbe2hKcBqJenOIn6ZCMCpmNYENPxaDhVBp72AI8FZdV6GmI1r/+PRl4KHOCyfktENxYHZK6S+NEPkoGxxoLOPhVXFd0NtCxH/6QxQ95DGznU++feIE12JVxBqUR/EG4x1u8oShgT6TGVjTNarSg27ZYDNiFCqbCbt7l+88UMQsY/gFL6Q+DSsrE0mNQtfxkL8SR6E3lR6k2sF0Bpss229CtodIWjTjc7Gzm3gIP7F3aQ2dEX84sKOH23WRbjPhZddd8qFKy4Ypu3VnxM/LWXmMbl+5B3kIo7u52/78kMFHJc9xEDe2NKtvr+D+ITo3MsIE2Kt3y69ll
6urDovJxGd5kpC7Y7X4PiI2pXatOFi8SUjhJpBVjjaRgVLKEehVnNpUMgTWnCBms2iJM5Y+omnS2q9fGADy17YeMwOf+yQBkyC69dbbYJdFa2CU8ls4wp4x7YMMJBE41yctvnJ/B4avclOOjihXo/Y/K1SjwS52kg9a11HGaoVgc8USbvX7Q1O9IxLmvOvt4EEK1cTDpIdw/JRTW7HYcvV367Gk4KZiu0+HkM6PD+/ZEHjqm6mMlYvtTInXVPkJRMo/NPxFgbWnMqCYaXoGrBEE+Fp1jviMqxYnb4u0fmfY2F9LrkoRQrenGOKiLEREcn3lnFRQNf/wZifW4W5QZ8ZDaNAU+vrVenF9KU/P8EpUUBxy6rf6c0Mb84JLX3DtUUaCKouK5px56m0uoiZyuoh+NfLYLSxxYPWrDeJpl0kj+X6l1AB+Kv0B+A3Yii2fn6QWqdILuI/QwI5HzEwuSu3OhJSsU3P9StkTSF8d/tK/ckF6eE0yLNir0o+SR1Gi7M/bLhK2LcSG9IVXkua/1RmTVrPXRVcoCFjuUCKDt9ZEMUTFQR815fbFlg7uhEOXOll/G4k6epFRs1fy89eVWg72yLwtgsBwahJzolofQPffnbqkUHEMxWNbvo4oJ7aI+0iKq8gRMl9jTurEt30jw7TvblW7h0hl0zRERhxbGrOgxv1xHd2z7XRPfvBjf43MtK3vHlj6r7TNT00BMrhHwdudU4dQ55nOxT8P8b9ErUkHdlnJYNwoeVINayqtxdbfEugODrAsBJXzdSJcCCjarxRNd2hbJkojLgjeEJCfbUEaFQNAbGvJWDAh0+SeDj3neBdoYear9+ssgFdt674wDanHh8fWCNtCRt5ju0qpF7WTDbyAWwrbXDr5uAtrip4EPMUpnRnEwsYVGhUqChLCj2dHWv88G9ZeXmfD5Eqy86ON3lfLJ1EgfShB9Y3bujJlcG/1JYlYspEn8gdyZZ0vHBNXFGDXeUU9OX6w9ZxLAlXt0F0fdQ7uDEAvSBzPq5CHEFn3MnWH/XHL32+sWsh9mELjXnNjtVy35aZlbea/5I8+HFST696fOqzFWLx/qx/x5cKhmwlVikUMA9FaFax7Bs9o7RmQQzXEPSB1LO2R/rKBSezLO1B5EMsaI32dmfUH7mK1Q8pkWMRl4e9hvugKObg+Yc/zDJkM3lDnwi1" />
            <input type="hidden" name="_eventId" value="submit" />

            <input type="submit" class="btn-submit" name="submit" accesskey="l" value="SIGN IN" tabindex="4"  />
            
        </section>
        <section class="row check">
            
            <input type="checkbox" name="rememberMe" id="rememberMe" value="true" checked tabindex="5" />
            <label for="rememberMe">Stay Signed In</label>
            
            <a id="forgot-password" class='need-help' href="https://osf.io/forgotpassword/" title="Open Science Framework Sign In">Forgot Your Password?</a>
        </section>

        
        
            
                <hr/>
                <section class="row">
                    <a class="btn-oauth" href="https://www.orcid.org/oauth/authorize?client_id=APP-VGLDMYSOHONDWCV8&scope=%2Fauthenticate&response_type=code&redirect_uri=https%3A%2F%2Faccounts.osf.io%2Flogin%3Fclient_name%3DOrcidClient#show_login"><img class="orcid-logo" src="../images/orcid-logo.png">Sign in with ORCID</a>
                </section>
            
        

    <div>
</div></form>
</div>







    
        <div class="row" style="text-align: center;">
            <hr><br>
            
                
                
                
                    
                        
                        
                            
                            
                                
                                    <a id="alternative-institution" href="/login?campaign=institution&service=https%3A%2F%2Fosf.io%2Fdownload%2Ftrjwg%2F">Login through Your Institution</a>&nbsp;&nbsp;&nbsp;&nbsp;
                                
                                
                            
                        
                    
                
            
            
            <a id="back-to-osf" href="https://osf.io/">Back&nbsp;to&nbsp;OSF</a><br>
        </div>
    

</div> <!-- END #content -->


    
        <div class="row" style="text-align: center;">
            <br>
            
            <a id="create-account" href="https://osf.io/register/">Create Account</a>
        </div>
    


<footer>
    
    <div class="copyright">
        <div class="row">
            <p>Copyright &copy; 2011-2018 <a href="https://cos.io">Center for Open Science</a> |
                <a href="https://github.com/CenterForOpenScience/centerforopenscience.org/blob/master/TERMS_OF_USE.md">Terms of Use</a> |
                <a href="https://github.com/CenterForOpenScience/centerforopenscience.org/blob/master/PRIVACY_POLICY.md">Privacy Policy</a>
            </p>
        </div>
    </div>
</footer>

</div> <!-- END #container -->

<script>
    function selectFocus() {
        var username = document.getElementById("username")
        if (username) {
            username.focus();
        }
        var institutionSelect = document.getElementById("institution-form-select")
        if (institutionSelect) {
            institutionSelect.focus();
        }

    }
</script>

<script src="https://cdnjs.cloudflare.com/ajax/libs/headjs/1.0.3/head.min.js"></script>

<script type="text/javascript" src="/js/cas.js"></script>


    <script>
        (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
        (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
        m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
        })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

        ga('create', "${GOOGLE_ANALYTICS_ID:}", 'auto', {'allowLinker': true});
        ga('require', 'linker');
        
        
            ga('linker:autoLink', ['${GOOGLE_ANALYTICS_AUTOLINK:}'] );
        
        ga('send', 'pageview');
    </script>

</body>
</html>

Yes, the system indeed returns a redirect URL now instead of the actual file. I hope this can be fixed by simply following the URL. I'll look into this now and return to you when I have something.

This problem is more difficult than I thought. The download/get function should already handle redirects correctly (https://github.com/dschreij/QOpenScienceFramework/blob/master/QOpenScienceFramework/manager.py#L828-L850), so I think the OSF currently redirects in an unconventional way. Normally the client receives a response with an HTTP status code of 301 or 302 from the server, which also includes the URL to redirect to. QOSF has handled such responses correctly in the past.
Something else is happening with file downloads. I think the OSF wants to explicitly show a page saying "We are redirecting you to your file", and to do so it has to send an HTTP 200 (OK) response, which is the code for a successful response. It then redirects to the file with some JavaScript code on the page (I think).
There is no easy or elegant way to handle this situation in the QOSF code, and I think this might be a somewhat sloppy approach to redirects on WaterButler's part. I am going to post an issue in their repo and see if they can offer a way to solve this.
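
For readers following along, the conventional redirect handling described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the actual QOSF code; the function name `next_url` and the plain-dict headers are assumptions made for the example.

```python
from urllib.parse import urljoin

# Status codes treated as redirects in this thread.
REDIRECT_CODES = {301, 302}


def next_url(status, headers, current_url):
    """Return the URL to follow for a redirect response, or None otherwise.

    `headers` is assumed to be a plain dict of response headers.
    """
    if status in REDIRECT_CODES and "Location" in headers:
        # Location may be relative, so resolve it against the current URL.
        return urljoin(current_url, headers["Location"])
    # Any other status (including a 200 page that redirects via JavaScript)
    # gives a client like this nothing to follow.
    return None
```

A 200 response carrying a "Redirecting..." page falls into the last branch, which is why it would end up being saved as if it were the file.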

Hey all,

This bug has been reported against the OSF, and there's an in-progress PR here: CenterForOpenScience/osf.io#8239. I closed the issue in the WaterButler repo, because the fix doesn't involve WB.

Cheers,
@felliott

@smathot @dschreij Are you using bearer tokens to authenticate with the API? We had a bug where the download routes weren't respecting the auth tokens, which would result in a redirect to the login page (as seen in the OP).

This bug is now fixed. Can you verify that this is the case?

@sloria Thanks for the update. We are indeed using bearer tokens (or at least the usual OAuth2 procedures) for authentication and send these in the request header. I just ran another test and am still experiencing the issue. I printed some output of what's happening and get:

GET PyQt5.QtCore.QUrl(u'https://osf.io/download/572214b59ad5a100464ee572/')
STATUS 302
Redirecting to PyQt5.QtCore.QUrl(u'https://accounts.osf.io/login?service=https://osf.io/download/572214b59ad5a100464ee572/')
GET PyQt5.QtCore.QUrl(u'https://accounts.osf.io/login?service=https://osf.io/download/572214b59ad5a100464ee572/')
STATUS 200

So the download request gets a 302 redirect response to the login page.

All other communication with the API works without problems, so the token appears to be accepted everywhere except for downloads.

> This bug is now fixed. Can you verify that this is the case?

As @dschreij says, I'm afraid the problem still persists.

Thanks for checking that. Looks like our fix only works for personal access tokens but not for tokens issued for an OAuth2 app. We'll get this fixed up.

Apologies for the hiccup. Thanks for your patience.

@dschreij @smathot We've deployed another fix for this. Would you mind checking this again?

@sloria The problem is partly fixed, but not entirely. Most files are now downloaded correctly, including OpenSesame experiments (.osexp) in plain-text format. However, OpenSesame experiments in .tar.gz format (also called .osexp, OpenSesame detects the file type automatically) are corrupted, seemingly having been zipped (i.e. an extra layer of compression).

Hey @smathot,

Are you downloading the .tar.gz files directly or using the download-as-zip functionality? Only in the latter case should they be getting zipped. We added some code to not re-zip .zip files, but we haven't extended that to .tar.gz yet.

Cheers,
@felliott

This goes through the API as a direct download. (Right @dschreij?)

Yes, it is using the url that is provided at the download key of the links object. I haven't heard about the download-as-zip functionality, nor do I know how to use it (must be a newer feature of the OSF API)?

Hi @felliott
You can try https://osf.io/download/mpnza, which is an image in one of my public projects.

It does something differently now, giving the impression it worked (or at least, I don't get redirected to the login page anymore), but the downloaded jpeg is corrupted:

PyQt5.QtCore.QUrl(u'https://osf.io/download/mpnza/')
302
Redirecting to PyQt5.QtCore.QUrl(u'https://files.osf.io/v1/resources/3mp7e/providers/osfstorage/570f74966c613b01dd75b8de')
PyQt5.QtCore.QUrl(u'https://files.osf.io/v1/resources/3mp7e/providers/osfstorage/570f74966c613b01dd75b8de')
302
Redirecting to PyQt5.QtCore.QUrl(u'https://storage101.iad3.clouddrive.com/v1/MossoCloudFS_926294/osf_storage_prod/972ee849624949ed8d1f9a48a482e62cad144c32b7491a47fb1e62005755ab83?temp_url_sig=0a5f2fee6533cc167847434e2ffb1eff77163e3f&temp_url_expires=1521813501&filename=210_900_-7881.jpg')
PyQt5.QtCore.QUrl(u'https://storage101.iad3.clouddrive.com/v1/MossoCloudFS_926294/osf_storage_prod/972ee849624949ed8d1f9a48a482e62cad144c32b7491a47fb1e62005755ab83?temp_url_sig=0a5f2fee6533cc167847434e2ffb1eff77163e3f&temp_url_expires=1521813501&filename=210_900_-7881.jpg')
200

Hey @dschreij,

When I download that file via the browser, I get an uncorrupted image (both with the OSF URL and the WB URL). Its MD5 is 09995d7468a3fa459f0ae9b94df66b8b. Do you get a corrupted image when you download directly, or only through the Qt widgets? I wonder if there's some header we should or shouldn't be setting that's causing Qt to corrupt it. Can you upload the corrupted version here? I'd like to take a look at it with a hex editor.
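
To compare a local download against the checksum quoted above, something like the following sketch could be used. It relies only on `hashlib` from the Python standard library; the helper name `file_md5` and the chunk size are assumptions for illustration.

```python
import hashlib


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in fixed-size chunks so large downloads need not fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the digest of the downloaded file differs from the one reported for the browser download, the two files are not byte-identical.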

Cheers,
@felliott

I can download without problems through the browser, and these errors only occur with the Qt widgets (which have always worked without problems before this week).
I uploaded the file; I hope it helps!
210_900_-7881

It's a superstrange file. The start is an HTML redirect page (plaintext), and then it turns into binary.

The file you uploaded is identical to mine, except that the content of the second redirect has been prepended to it:

[Screenshot: hex view of the uploaded file, with the beginning of the JPEG data highlighted]

The highlighted area is the beginning of the JPEG. Everything before that is the HTML representation of the redirect response from the OSF to WB. Something in our initial redirect response must have changed to make Qt think it's part of the data. Where in your codebase is the download being done? Hopefully we can figure out what the OSF needs to send so that Qt doesn't treat that as part of the data.
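
The same inspection can be done without a hex editor. The sketch below splits a downloaded file at the first JPEG start-of-image marker (the bytes FF D8 FF), so that whatever precedes it can be examined separately. The function name is hypothetical, and searching for the marker is a heuristic that only applies to JPEG files.

```python
# JPEG files begin with the start-of-image marker FF D8, followed by FF.
JPEG_SOI = b"\xff\xd8\xff"


def split_at_jpeg_start(data):
    """Split `data` into (prefix, jpeg_bytes) at the first JPEG marker.

    If no marker is found, everything is returned as the prefix.
    """
    idx = data.find(JPEG_SOI)
    if idx == -1:
        return data, b""
    return data[:idx], data[idx:]
```

For a file like the one discussed here, the prefix would be expected to hold the HTML of the redirect response and the second part the image itself.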

Forgive me if the explanation below is wordy, or more than you actually ask for, but I think it's better to walk you completely through the process.

All the communication with the OSF API is done by the Manager module. When a file needs to be downloaded, the __download() function is called, which opens a file handle to save the file to. The file is then retrieved by the __get() function (which is responsible for all GET requests to the OSF API).

Once a reply to the GET request is completely received, it is handled by __reply_finished(). This function checks the status code, and redirects to a new URL if this is a 301 or 302. If the status is a 200 (OK) response, the __download_finished() function is called, which copies the downloaded file from its temporary location to the destination location that was specified by the user.

While walking through this process myself, I think I have a hunch where things go wrong, and it is in my code:

The __get() function accepts an optional readyRead function, which can process the incoming data of the reply as it comes in. With a download, I use this to stream the incoming data directly to the temporary file. The status code of the reply is only checked after the reply is finished (in __reply_finished()), so any data that is transmitted before that is already saved to the file, including any content that is provided in the body of the 301/302 response. This never caused any problems before, but the OSF API probably only recently started adding content to the body of its redirect responses.

I'm not sure how I need to handle or fix this at the moment, because to my knowledge the above process is the only way to provide a download progress dialog, which is absolutely necessary in my opinion, especially for large files.

In short, the problem is that you only know the status of a response after the reply is finished, but to show the progress of a download, you already need to work with the incoming data while the reply is still in progress. I now create the file handle before the first GET request and stream everything that comes in until the 200 OK response to this file, including contents of 301/302 responses. A solution would be to create a new file handle for every GET request that is performed, stream the data to it, and decide what to do with the results afterwards: throw it away if it was a 301 or 302 REDIRECT response, and keep it if the status was a 200 OK. This requires me to do a lot of replumbing to the QOSF module, so it will take some time to fix it like this, if this is the direction we choose, but what needs to be done, needs to be done ;)
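
The per-request approach proposed above could look roughly like the sketch below. It is a simplified simulation in plain Python rather than PyQt5 code: each response is modelled as a (status, chunks) pair, and the function name, the `on_progress` callback and the error handling are all assumptions made for illustration.

```python
import io


def download(responses, on_progress=None):
    """Return the body of the first 200 response in `responses`.

    `responses` is an iterable of (status, chunks) pairs in the order the
    replies arrive. Every reply is streamed into its own buffer, and the
    buffer is kept or discarded only once the status is known.
    """
    for status, chunks in responses:
        buffer = io.BytesIO()  # fresh buffer for every GET request
        for chunk in chunks:
            buffer.write(chunk)
            if on_progress is not None:
                on_progress(buffer.tell())  # bytes received so far
        if status == 200:
            return buffer.getvalue()  # keep the data of the final reply
        if status not in (301, 302):
            raise RuntimeError("unexpected status %d" % status)
        # 301/302: the buffer (redirect body) is simply dropped here.
    raise RuntimeError("no 200 response received")
```

Because the redirect body never shares a buffer with the final reply, progress can still be reported while data streams in, and nothing from a 301/302 response reaches the saved file.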

@dschreij Thanks for this. Would it be an option to first implement a hacky solution in which we simply check downloaded files whether they start with this HTML redirect page, and if so, strip that from the file?

That seems like an easy way to solve the problem, so that at least the integration in OpenSesame is back up. And then we can work on a more thorough solution later on. What do you think?

That could work, but I don't know if it will always "uncorrupt" binary files, and it is quite a fragile solution. If the OSF changes the HTML contents of the redirect response even slightly, or adds an extra redirect with different contents, it will break again. I think the best way is to solve this in the file download process itself, as I described above.
A simple and quick solution I was thinking of is to always delete the contents of the file the handle points to as soon as a redirect reply is detected. Then every GET request after a redirect will start with an empty file and this problem should no longer occur.

> That could work, but I don't know if it will always "uncorrupt" binary files, and is quite a fragile solution. If the OSF changes only a bit of the HTML contents of the redirect response, or adds an extra redirect with different contents, then it will break already.

I totally agree. I just felt it might be a quick patch for now (because it's quite bothersome to many people that the OSF integration is broken). But if we can fix it in a better way, then that's of course preferred.

> A simple and quick solution I was thinking of, is that maybe I can just preventively always delete the contents of the file where the handle is pointing to after a redirect reply is detected. Then every GET request after a redirect will start with an empty file and this problem should no longer occur.

I see what you mean. Go for it!

@smathot Alright. I'll see if I can get to it this afternoon, and otherwise tomorrow.

I added a line that truncates the target file after a redirect response has been received. For me this fixes the issue. @smathot can you please give this a quick test and see if it works for you too?
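
For reference, the idea of truncating the target file after a redirect can be sketched as follows. This is not the actual change made in QOSF, only a minimal standard-library illustration; the function name `reply_finished` and its return convention are assumptions.

```python
def reply_finished(status, fh):
    """Handle a finished reply whose body was already streamed into `fh`.

    Returns True when the download is complete (status 200). On a 301/302
    the file is emptied so the next request starts writing from scratch.
    """
    if status in (301, 302):
        fh.seek(0)     # move to the start of the file
        fh.truncate()  # drop everything written so far (the redirect body)
        return False
    return status == 200
```

With this in place a single file handle can be reused across the whole redirect chain, since only the body of the final 200 reply is left in the file.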