PayneLab/cptac

Problem getting data from Hnscc dataset

Closed this issue · 8 comments

I am trying to load the Hnscc and Ccrcc datasets, but I get the error below (see also the attached screenshot). It looks like something goes wrong when reading the Excel file:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\smdb2\.conda\envs\cptac_env\lib\site-packages\cptac\hnscc.py", line 250, in __init__
    df = pd.read_excel(file_path)
  File "C:\Users\smdb2\.conda\envs\cptac_env\lib\site-packages\pandas\util\_decorators.py", line 296, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\smdb2\.conda\envs\cptac_env\lib\site-packages\pandas\io\excel\_base.py", line 304, in read_excel
    io = ExcelFile(io, engine=engine)
  File "C:\Users\smdb2\.conda\envs\cptac_env\lib\site-packages\pandas\io\excel\_base.py", line 867, in __init__
    self._reader = self._engines[engine](self._io)
  File "C:\Users\smdb2\.conda\envs\cptac_env\lib\site-packages\pandas\io\excel\_xlrd.py", line 22, in __init__
    super().__init__(filepath_or_buffer)
  File "C:\Users\smdb2\.conda\envs\cptac_env\lib\site-packages\pandas\io\excel\_base.py", line 353, in __init__
    self.book = self.load_workbook(filepath_or_buffer)
  File "C:\Users\smdb2\.conda\envs\cptac_env\lib\site-packages\pandas\io\excel\_xlrd.py", line 37, in load_workbook
    return open_workbook(filepath_or_buffer)
  File "C:\Users\smdb2\.conda\envs\cptac_env\lib\site-packages\xlrd\__init__.py", line 170, in open_workbook
    raise XLRDError(FILE_FORMAT_DESCRIPTIONS[file_format]+'; not supported')
xlrd.biffh.XLRDError: Excel xlsx file; not supported

In case this is related to pandas, I have version 1.1.5 installed.

I am using Python 3.6.12, as the documentation requests (3.6.x), and it looks like I cannot upgrade pandas with this Python version. Should I upgrade Python?

I think I got it by adjusting the library versions. I finally got it running with:
Python 3.6.12
xlrd 1.2.0
scipy 1.5.4
pandas 1.1.5
openpyxl 2.6.0

Yeah, sorry about that, glad you got it working. The issue is that xlrd stopped supporting .xlsx files as of version 2.0.0, so pandas>=1.2.0 now uses openpyxl for them. But if you have pandas<1.2.0, it will try to use xlrd for .xlsx files, so in that case you need xlrd<2.0.0. Long term we're just going to require pandas>=1.2.0 and openpyxl, but we're waiting to update that dependency until pandas 1.2.0 is available on Google Colab, since a good number of people use that environment. For now we may update the dependencies to require xlrd==1.2.0, in case anyone has pandas<1.2.0.
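
For anyone who wants to check their own environment against the explanation above, here is a minimal sketch of that rule. It is a simplified model of what this thread describes, not pandas' actual engine-selection code; `parse_version` and `default_xlsx_engine` are hypothetical helper names, and the version strings are passed in by hand so nothing needs to be installed to run it.

```python
import re


def parse_version(text):
    """Turn a version string such as '1.1.5' into a tuple like (1, 1, 5)."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)  # ignore suffixes such as 'rc1'
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def default_xlsx_engine(pandas_version, xlrd_version=None, has_openpyxl=False):
    """Hypothetical helper: which engine would end up reading an .xlsx file.

    Simplified model of the rule described in this thread:
    - pandas >= 1.2.0 uses openpyxl when it is installed;
    - otherwise xlrd is used, which only reads .xlsx when xlrd < 2.0.0.
    Returns 'openpyxl', 'xlrd', or None if .xlsx reading would fail.
    """
    if parse_version(pandas_version) >= (1, 2, 0) and has_openpyxl:
        return "openpyxl"
    if xlrd_version is not None and parse_version(xlrd_version) < (2, 0, 0):
        return "xlrd"
    return None


# The working combination reported above: pandas 1.1.5 with xlrd 1.2.0.
print(default_xlsx_engine("1.1.5", xlrd_version="1.2.0", has_openpyxl=True))
```

A result of `None` corresponds to the situation that produces the `XLRDError` in the traceback: an older pandas paired with xlrd 2.0.0 or newer.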

Thank you for your response!
I have pandas<1.2.0 because, as far as I can see, Python 3.6.x only supports pandas up to 1.1.5, but maybe I am wrong.
Just to make sure: I need Python 3.6.x, right? If so, is it possible to use pandas 1.2.0?

Oh good catch. I checked the installation instructions, and it looks like pandas stopped officially supporting Python 3.6 as of pandas 1.2.0, and now only officially supports Python 3.7.1 and greater. So you may have to use 1.1.5 unless you upgrade to Python 3.7.1 or greater.
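
As a quick way to see which side of that cutoff an interpreter falls on, a sketch like the following may help. The 3.7.1 threshold is the one quoted above; `can_install_pandas_1_2` is a hypothetical helper name, not part of cptac or pandas.

```python
import sys

# Minimum Python version that pandas 1.2.0 officially supports,
# as quoted in the comment above.
MIN_PYTHON_FOR_PANDAS_1_2 = (3, 7, 1)


def can_install_pandas_1_2(python_version=None):
    """Hypothetical helper: True if this Python meets the pandas 1.2.0 minimum.

    python_version is a (major, minor, micro) tuple; it defaults to the
    interpreter that is running this code.
    """
    if python_version is None:
        python_version = sys.version_info[:3]
    return tuple(python_version) >= MIN_PYTHON_FOR_PANDAS_1_2


# Python 3.6.12, the version used in this thread, falls below the cutoff.
print(can_install_pandas_1_2((3, 6, 12)))
print(can_install_pandas_1_2())
```

If the check is false, staying on pandas 1.1.5 with xlrd<2.0.0 is the combination reported to work earlier in this thread.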

But is Python 3.7.1 or greater compatible with cptac? The documentation I read mentioned only 3.6.x.

We support Python 3.6 or greater, so 3.7 will be no problem. 3.6 is just the minimum version we require. Sorry if the documentation used to imply that it had to be 3.6 specifically. I just checked to make sure that the installation instructions now state "Python 3.6 or greater". Sorry for the confusion!

No problem. Thank you very much for your quick responses!

Alright, we just released a patch update with new dependency requirements that fixes this issue. We now require xlrd==1.2.0 instead of xlrd>=1.2.0, so it should work whether you have pandas 1.1.5 or 1.2.0.