The .csv.gz files for unpaired and paired sequences on the web portal are not opening

Question

Closed this issue a year ago · 1 comment

Hi! Thanks very much for the great resource!

Unfortunately, the files linked under paired and unpaired sequences on the web portal seem to be corrupted in some way. I am using Chrome on a Mac, and when I download the files and try to open them, I get an error message.

When I download the whole database ("plabdab_data"), I find the same files in the folder, and those can be decompressed and opened without any issues.

Answer 1 · 2023-07-29T07:56:58.000Z

Hi Aleks,

Thank you for bringing this to our attention. We appreciate your interest in our resource and we apologize for any inconvenience caused by the issue.

To clarify, the original files on our server aren't corrupt; it appears they are being corrupted during download. This is often caused by browser settings that automatically try to extract compressed files on download, which can lead to errors like the one you're experiencing.
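One way to check whether a browser has already decompressed a downloaded file is to look at its first two bytes: a real gzip stream always starts with 0x1f 0x8b. The following is only a minimal sketch using the Python standard library. The helper names is_gzip and open_text are hypothetical and not part of the PLAbDab tooling, and the file name in the usage comment is an assumed local download path.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def is_gzip(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def open_text(path, encoding="utf-8"):
    """Open `path` as text, decompressing only if it really is gzip data."""
    if is_gzip(path):
        return gzip.open(path, "rt", encoding=encoding)
    # Hypothetical case: the file kept its .gz name but is already plain CSV.
    return open(path, "r", encoding=encoding)


# Example usage (assumed local file name):
# with open_text("paired_sequences.csv.gz") as fh:
#     print(fh.readline())
```

If is_gzip returns False for a file that still has a .gz extension, the file was probably decompressed on the way down, and it can be read as ordinary CSV.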

We are currently investigating the issue and seeking a comprehensive solution. In the meantime, there are a couple of workarounds you could try:

1. Disable the automatic extraction feature in your browser. The steps to do this vary depending on the browser.
2. Download the data directly via Python using the pandas library. Here's a brief code snippet on how to achieve that:

import pandas as pd

# Read the gzip-compressed CSV straight from the server into a DataFrame.
url = "https://opig.stats.ox.ac.uk/webapps/plabdab/static/downloads/paired_sequences.csv.gz"
df = pd.read_csv(url, compression="gzip")

This should bypass the need for manual extraction and allow you to directly access the data.
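If you would rather keep a local copy of the compressed file than load it straight into pandas, the same idea can be sketched with the standard library alone. The helper name download_file and the destination path are hypothetical. The sketch assumes the server returns the gzip bytes unchanged, which this thread does not confirm.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url, dest):
    """Stream the raw bytes at `url` to `dest` and return the destination path."""
    dest = Path(dest)
    # urllib writes the response body as received, with no automatic extraction.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Example usage (destination file name is an assumption):
# url = "https://opig.stats.ox.ac.uk/webapps/plabdab/static/downloads/paired_sequences.csv.gz"
# path = download_file(url, "paired_sequences.csv.gz")
```

The saved file can then be passed to pd.read_csv as a local path, or opened with Python's gzip module.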

Thank you for your patience and understanding as we work to resolve this issue. If you encounter any other problems, please don't hesitate to let us know.

All the best,

Brennan