Chapter 6 - Data Prep Add Customer_Levels - reads 1000 rows instead of 50 (extra sales)
Closed this issue · 5 comments
I created a notebook for chapter 6 to follow along with the lesson. When I read in the example file, the total number of rows read was 1000 instead of 50. The read is not stopping at 50 and is reading up to 100o rows even though there is no data in cells 51 - 1000.
I viewed the data file (sample_sales_50_extra.xlsx - copied from clone into my project) and it definitely only has 50 rows.
Reseting the kernal and rerunning the notebook has no effect.
extra_transactions = Path.cwd() / 'data' / 'raw' / 'sample_sales_50_extra.xlsx'
...
extra_sales = pd.read_excel(extra_transactions)
...
extra_sales.shape
(1000, 7)
Why is that? I cannot see why it happening. Is there a way to prevent it?
Thanks
Michael
Michael -
That's odd.
In trying to figure out the issue, I was able to reproduce it and i think it has to do with pandas' excel engine changes as described here.
If i explicitly tell pandas to use xlrd, it works:
import pandas as pd
file_url = 'https://github.com/talkpython/excel-to-python-course/blob/master/code/ch6-aggregating-grouping-merging/data/raw/sample_sales_50_extra.xlsx?raw=True'
df = pd.read_excel(file_url, engine='xlrd')
df.shape
However if I use openpyxl, I get the behavior you describe:
df = pd.read_excel(file_url, engine='openpyxl')
One potential workaround is to pass in the number of rows to read:
df = pd.read_excel(file_url, engine='openpyxl', nrows=50)
Let me know if this works for you. If it does, I will see that I can do to clean it up.
Hi Chris,
THE SOLUTION (for me at least):
extra_sales = pd.read_excel(extra_transactions, engine='openpyxl', nrows=50)
Some History
The xlrd
package was giving me some grief.
I installed the required packages but when I tried to run the notebook for the chapter, at the following point:
extra_sales = pd.read_excel(extra_transactions)
I got the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-a79ef5fb00f4> in <module>
----> 1 sales = pd.read_excel(input_file)
2 levels = pd.read_excel(level_file)
3 extra_sales = pd.read_excel(extra_transactions)
~/.local/share/virtualenvs/work-SEXxSIg4-python/lib/python3.9/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
297 )
298 warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 299 return func(*args, **kwargs)
300
301 return wrapper
~/.local/share/virtualenvs/work-SEXxSIg4-python/lib/python3.9/site-packages/pandas/io/excel/_base.py in read_excel(io, sheet_name, header, names, index_col, usecols, squeeze, dtype, engine, converters, true_values, false_values, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, mangle_dupe_cols, storage_options)
334 if not isinstance(io, ExcelFile):
335 should_close = True
--> 336 io = ExcelFile(io, storage_options=storage_options, engine=engine)
337 elif engine and engine != io.engine:
338 raise ValueError(
~/.local/share/virtualenvs/work-SEXxSIg4-python/lib/python3.9/site-packages/pandas/io/excel/_base.py in __init__(self, path_or_buffer, engine, storage_options)
1083 if engine == "xlrd" and ext != "xls" and xlrd_version is not None:
1084 if xlrd_version >= "2":
-> 1085 raise ValueError(
1086 f"Your version of xlrd is {xlrd_version}. In xlrd >= 2.0, "
1087 f"only the xls format is supported. Install openpyxl instead."
ValueError: Your version of xlrd is 2.0.1. In xlrd >= 2.0, only the xls format is supported. Install openpyxl instead.
So I installed openpyxl
and downgraded xlrd
to 2.0
.
But I just realised that I should have downgraded xlrd
to a version less than 2.0
(didn't read the error message properly).
So I downgraded to xlrd
to version 1.2.0
and uninstalled openpyxl
(see Pipfile below):
But I then got a different error:
<ipython-input-3-a79ef5fb00f4>:1: FutureWarning: Your version of xlrd is 1.2.0. In xlrd >= 2.0, only the xls format is supported. As a result, the openpyxl engine will be used if it is installed and the engine argument is not specified. Install openpyxl instead.
sales = pd.read_excel(input_file)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-3-a79ef5fb00f4> in <module>
----> 1 sales = pd.read_excel(input_file)
2 levels = pd.read_excel(level_file)
3 extra_sales = pd.read_excel(extra_transactions)
~/.local/share/virtualenvs/work-SEXxSIg4-python/lib/python3.9/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
297 )
298 warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 299 return func(*args, **kwargs)
300
301 return wrapper
~/.local/share/virtualenvs/work-SEXxSIg4-python/lib/python3.9/site-packages/pandas/io/excel/_base.py in read_excel(io, sheet_name, header, names, index_col, usecols, squeeze, dtype, engine, converters, true_values, false_values, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, mangle_dupe_cols, storage_options)
334 if not isinstance(io, ExcelFile):
335 should_close = True
--> 336 io = ExcelFile(io, storage_options=storage_options, engine=engine)
337 elif engine and engine != io.engine:
338 raise ValueError(
~/.local/share/virtualenvs/work-SEXxSIg4-python/lib/python3.9/site-packages/pandas/io/excel/_base.py in __init__(self, path_or_buffer, engine, storage_options)
1112 self.storage_options = storage_options
1113
-> 1114 self._reader = self._engines[engine](self._io, storage_options=storage_options)
1115
1116 def __fspath__(self):
~/.local/share/virtualenvs/work-SEXxSIg4-python/lib/python3.9/site-packages/pandas/io/excel/_xlrd.py in __init__(self, filepath_or_buffer, storage_options)
23 err_msg = "Install xlrd >= 1.0.0 for Excel support"
24 import_optional_dependency("xlrd", extra=err_msg)
---> 25 super().__init__(filepath_or_buffer, storage_options=storage_options)
26
27 @property
~/.local/share/virtualenvs/work-SEXxSIg4-python/lib/python3.9/site-packages/pandas/io/excel/_base.py in __init__(self, filepath_or_buffer, storage_options)
389 # N.B. xlrd.Book has a read attribute too
390 self.handles.handle.seek(0)
--> 391 self.book = self.load_workbook(self.handles.handle)
392 elif isinstance(self.handles.handle, bytes):
393 self.book = self.load_workbook(BytesIO(self.handles.handle))
~/.local/share/virtualenvs/work-SEXxSIg4-python/lib/python3.9/site-packages/pandas/io/excel/_xlrd.py in load_workbook(self, filepath_or_buffer)
36 if hasattr(filepath_or_buffer, "read"):
37 data = filepath_or_buffer.read()
---> 38 return open_workbook(file_contents=data)
39 else:
40 return open_workbook(filepath_or_buffer)
~/.local/share/virtualenvs/work-SEXxSIg4-python/lib/python3.9/site-packages/xlrd/__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
128 if 'xl/workbook.xml' in component_names:
129 from . import xlsx
--> 130 bk = xlsx.open_workbook_2007_xml(
131 zf,
132 component_names,
~/.local/share/virtualenvs/work-SEXxSIg4-python/lib/python3.9/site-packages/xlrd/xlsx.py in open_workbook_2007_xml(zf, component_names, logfile, verbosity, use_mmap, formatting_info, on_demand, ragged_rows)
810 del zflo
811 zflo = zf.open(component_names['xl/workbook.xml'])
--> 812 x12book.process_stream(zflo, 'Workbook')
813 del zflo
814 props_name = 'docprops/core.xml'
~/.local/share/virtualenvs/work-SEXxSIg4-python/lib/python3.9/site-packages/xlrd/xlsx.py in process_stream(self, stream, heading)
264 self.tree = ET.parse(stream)
265 getmethod = self.tag2meth.get
--> 266 for elem in self.tree.iter() if Element_has_iter else self.tree.getiterator():
267 if self.verbosity >= 3:
268 self.dump_elem(elem)
AttributeError: 'ElementTree' object has no attribute 'getiterator'
So I reverted to openpyxl==3.0.5
and xlrd==2.0
and I ceased getting the runtime errors but the read still read in 1000 rows rather than 50 rows. Hence, the solutions for me was to then specifically set which engine to use and the total number of rows to read, as you stated i.e. extra_sales = pd.read_excel(extra_transactions, engine='openpyxl', nrows=50)
.
Thanks Chris!
Michael
As an FYI, I uploaded a new version of the source excel file that should not read in the 1000 rows. Your solution is still valid but wanted to let you know the file has been updated.
Hi all. Thanks for pointing out these issues and Chris for touching up the Excel file. Looks like Pandas 1.2 is out but conda claims there are no updates for pandas when 1.1 is installed. I'm guessing this is just a matter of time for it to make it down stream. So until then, I added a big message on screen which appears for only 5 seconds in the first encounter of pd.read_excel()
.
As an FYI, I uploaded a new version of the source excel file that should not read in the 1000 rows. Your solution is still valid but wanted to let you know the file has been updated.
@chris1610 Thanks Chris! I can confirm that the new version only reads 50 rows!
@mikeckennedy Thanks Michael!