Chapter 6 - Data Prep Add Customer_Levels - reads 1000 rows instead of 50 (extra sales)

Question

Chapter 6 - Data Prep Add Customer_Levels - reads 1000 rows instead of 50 (extra sales)

Closed this issue 4 years ago · 5 comments

I created a notebook for chapter 6 to follow along with the lesson. When I read in the example file, the total number of rows read was 1000 instead of 50. The read is not stopping at 50 and is reading up to 100o rows even though there is no data in cells 51 - 1000.

I viewed the data file (sample_sales_50_extra.xlsx - copied from clone into my project) and it definitely only has 50 rows.

Reseting the kernal and rerunning the notebook has no effect.

extra_transactions = Path.cwd() / 'data' / 'raw' / 'sample_sales_50_extra.xlsx'
...
extra_sales = pd.read_excel(extra_transactions)
...
extra_sales.shape
(1000, 7)

Why is that? I cannot see why it happening. Is there a way to prevent it?

Thanks
Michael

Answer 1 · 2021-01-01T16:23:13.000Z

Michael -

That's odd.

In trying to figure out the issue, I was able to reproduce it and i think it has to do with pandas' excel engine changes as described here.

If i explicitly tell pandas to use xlrd, it works:

import pandas as pd
file_url = 'https://github.com/talkpython/excel-to-python-course/blob/master/code/ch6-aggregating-grouping-merging/data/raw/sample_sales_50_extra.xlsx?raw=True'
df = pd.read_excel(file_url, engine='xlrd')
df.shape

However if I use openpyxl, I get the behavior you describe:

df = pd.read_excel(file_url, engine='openpyxl')

One potential workaround is to pass in the number of rows to read:

df = pd.read_excel(file_url, engine='openpyxl', nrows=50)

Let me know if this works for you. If it does, I will see that I can do to clean it up.

Answer 2 · 2021-01-02T06:03:18.000Z

Hi Chris,

THE SOLUTION (for me at least):

extra_sales = pd.read_excel(extra_transactions, engine='openpyxl', nrows=50)

Some History

The xlrd package was giving me some grief.

I installed the required packages but when I tried to run the notebook for the chapter, at the following point:

extra_sales = pd.read_excel(extra_transactions)

I got the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-a79ef5fb00f4> in <module>
----> 1 sales = pd.read_excel(input_file)
      2 levels = pd.read_excel(level_file)
      3 extra_sales = pd.read_excel(extra_transactions)

~/.local/share/virtualenvs/work-SEXxSIg4-python/lib/python3.9/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    297                 )
    298                 warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 299             return func(*args, **kwargs)
    300 
    301         return wrapper

~/.local/share/virtualenvs/work-SEXxSIg4-python/lib/python3.9/site-packages/pandas/io/excel/_base.py in read_excel(io, sheet_name, header, names, index_col, usecols, squeeze, dtype, engine, converters, true_values, false_values, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, mangle_dupe_cols, storage_options)
    334     if not isinstance(io, ExcelFile):
    335         should_close = True
--> 336         io = ExcelFile(io, storage_options=storage_options, engine=engine)
    337     elif engine and engine != io.engine:
    338         raise ValueError(

~/.local/share/virtualenvs/work-SEXxSIg4-python/lib/python3.9/site-packages/pandas/io/excel/_base.py in __init__(self, path_or_buffer, engine, storage_options)
   1083         if engine == "xlrd" and ext != "xls" and xlrd_version is not None:
   1084             if xlrd_version >= "2":
-> 1085                 raise ValueError(
   1086                     f"Your version of xlrd is {xlrd_version}. In xlrd >= 2.0, "
   1087                     f"only the xls format is supported. Install openpyxl instead."

ValueError: Your version of xlrd is 2.0.1. In xlrd >= 2.0, only the xls format is supported. Install openpyxl instead.

So I installed openpyxl and downgraded xlrd to 2.0.

But I just realised that I should have downgraded xlrd to a version less than 2.0 (didn't read the error message properly).

So I downgraded to xlrd to version 1.2.0 and uninstalled openpyxl (see Pipfile below):

But I then got a different error:

<ipython-input-3-a79ef5fb00f4>:1: FutureWarning: Your version of xlrd is 1.2.0. In xlrd >= 2.0, only the xls format is supported. As a result, the openpyxl engine will be used if it is installed and the engine argument is not specified. Install openpyxl instead.
  sales = pd.read_excel(input_file)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-a79ef5fb00f4> in <module>
----> 1 sales = pd.read_excel(input_file)
      2 levels = pd.read_excel(level_file)
      3 extra_sales = pd.read_excel(extra_transactions)

~/.local/share/virtualenvs/work-SEXxSIg4-python/lib/python3.9/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    297                 )
    298                 warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 299             return func(*args, **kwargs)
    300 
    301         return wrapper

~/.local/share/virtualenvs/work-SEXxSIg4-python/lib/python3.9/site-packages/pandas/io/excel/_base.py in read_excel(io, sheet_name, header, names, index_col, usecols, squeeze, dtype, engine, converters, true_values, false_values, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, mangle_dupe_cols, storage_options)
    334     if not isinstance(io, ExcelFile):
    335         should_close = True
--> 336         io = ExcelFile(io, storage_options=storage_options, engine=engine)
    337     elif engine and engine != io.engine:
    338         raise ValueError(

~/.local/share/virtualenvs/work-SEXxSIg4-python/lib/python3.9/site-packages/pandas/io/excel/_base.py in __init__(self, path_or_buffer, engine, storage_options)
   1112         self.storage_options = storage_options
   1113 
-> 1114         self._reader = self._engines[engine](self._io, storage_options=storage_options)
   1115 
   1116     def __fspath__(self):

~/.local/share/virtualenvs/work-SEXxSIg4-python/lib/python3.9/site-packages/pandas/io/excel/_xlrd.py in __init__(self, filepath_or_buffer, storage_options)
     23         err_msg = "Install xlrd >= 1.0.0 for Excel support"
     24         import_optional_dependency("xlrd", extra=err_msg)
---> 25         super().__init__(filepath_or_buffer, storage_options=storage_options)
     26 
     27     @property

~/.local/share/virtualenvs/work-SEXxSIg4-python/lib/python3.9/site-packages/pandas/io/excel/_base.py in __init__(self, filepath_or_buffer, storage_options)
    389             # N.B. xlrd.Book has a read attribute too
    390             self.handles.handle.seek(0)
--> 391             self.book = self.load_workbook(self.handles.handle)
    392         elif isinstance(self.handles.handle, bytes):
    393             self.book = self.load_workbook(BytesIO(self.handles.handle))

~/.local/share/virtualenvs/work-SEXxSIg4-python/lib/python3.9/site-packages/pandas/io/excel/_xlrd.py in load_workbook(self, filepath_or_buffer)
     36         if hasattr(filepath_or_buffer, "read"):
     37             data = filepath_or_buffer.read()
---> 38             return open_workbook(file_contents=data)
     39         else:
     40             return open_workbook(filepath_or_buffer)

~/.local/share/virtualenvs/work-SEXxSIg4-python/lib/python3.9/site-packages/xlrd/__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
    128         if 'xl/workbook.xml' in component_names:
    129             from . import xlsx
--> 130             bk = xlsx.open_workbook_2007_xml(
    131                 zf,
    132                 component_names,

~/.local/share/virtualenvs/work-SEXxSIg4-python/lib/python3.9/site-packages/xlrd/xlsx.py in open_workbook_2007_xml(zf, component_names, logfile, verbosity, use_mmap, formatting_info, on_demand, ragged_rows)
    810     del zflo
    811     zflo = zf.open(component_names['xl/workbook.xml'])
--> 812     x12book.process_stream(zflo, 'Workbook')
    813     del zflo
    814     props_name = 'docprops/core.xml'

~/.local/share/virtualenvs/work-SEXxSIg4-python/lib/python3.9/site-packages/xlrd/xlsx.py in process_stream(self, stream, heading)
    264         self.tree = ET.parse(stream)
    265         getmethod = self.tag2meth.get
--> 266         for elem in self.tree.iter() if Element_has_iter else self.tree.getiterator():
    267             if self.verbosity >= 3:
    268                 self.dump_elem(elem)

AttributeError: 'ElementTree' object has no attribute 'getiterator'

So I reverted to openpyxl==3.0.5 and xlrd==2.0 and I ceased getting the runtime errors but the read still read in 1000 rows rather than 50 rows. Hence, the solutions for me was to then specifically set which engine to use and the total number of rows to read, as you stated i.e. extra_sales = pd.read_excel(extra_transactions, engine='openpyxl', nrows=50).

Thanks Chris!
Michael

Answer 3 · 2021-01-02T17:07:58.000Z

As an FYI, I uploaded a new version of the source excel file that should not read in the 1000 rows. Your solution is still valid but wanted to let you know the file has been updated.

Answer 4 · 2021-01-03T17:50:43.000Z

Hi all. Thanks for pointing out these issues and Chris for touching up the Excel file. Looks like Pandas 1.2 is out but conda claims there are no updates for pandas when 1.1 is installed. I'm guessing this is just a matter of time for it to make it down stream. So until then, I added a big message on screen which appears for only 5 seconds in the first encounter of pd.read_excel().

Answer 5 · 2021-01-03T19:33:33.000Z

As an FYI, I uploaded a new version of the source excel file that should not read in the 1000 rows. Your solution is still valid but wanted to let you know the file has been updated.

@chris1610 Thanks Chris! I can confirm that the new version only reads 50 rows!

@mikeckennedy Thanks Michael!