pd.read_excel() on the output of infer_schema.xlsx returns AssertionError: #4

sugiantolauw · 2018-02-28T03:29:16Z

Hi There,

This looks like a great package and I was testing your package for my own automation.

So, I have created the schema of my dataframe with the infer_schema() function and it returns the .xlsx file.

When I tried to read the .xlsx schema file using pd.read_excel() function, it returns "AssertionError: "

I am not sure what is happening here.

Here is the complete error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-29-e35830cdc292> in <module>()
----> 1 test_schema = pd.read_excel('output/data_schema_test_schema.xlsx')

~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    116                 else:
    117                     kwargs[new_arg_name] = new_arg_value
--> 118             return func(*args, **kwargs)
    119         return wrapper
    120     return _deprecate_kwarg

~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/pandas/io/excel.py in read_excel(io, sheet_name, header, skiprows, skip_footer, index_col, names, usecols, parse_dates, date_parser, na_values, thousands, convert_float, converters, dtype, true_values, false_values, engine, squeeze, **kwds)
    228 
    229     if not isinstance(io, ExcelFile):
--> 230         io = ExcelFile(io, engine=engine)
    231 
    232     return io._parse_excel(

~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/pandas/io/excel.py in __init__(self, io, **kwds)
    292             self.book = xlrd.open_workbook(file_contents=data)
    293         elif isinstance(self._io, compat.string_types):
--> 294             self.book = xlrd.open_workbook(self._io)
    295         else:
    296             raise ValueError('Must explicitly set engine if not passing in'

~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/xlrd/__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
    141                 formatting_info=formatting_info,
    142                 on_demand=on_demand,
--> 143                 ragged_rows=ragged_rows,
    144                 )
    145             return bk

~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/xlrd/xlsx.py in open_workbook_2007_xml(zf, component_names, logfile, verbosity, use_mmap, formatting_info, on_demand, ragged_rows)
    835         x12sheet = X12Sheet(sheet, logfile, verbosity)
    836         heading = "Sheet %r (sheetx=%d) from %r" % (sheet.name, sheetx, fname)
--> 837         x12sheet.process_stream(zflo, heading)
    838         del zflo
    839 

~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/xlrd/xlsx.py in own_process_stream(self, stream, heading)
    546         for event, elem in ET.iterparse(stream):
    547             if elem.tag == row_tag:
--> 548                 self_do_row(elem)
    549                 elem.clear() # destroy all child elements (cells)
    550             elif elem.tag == U_SSML12 + "dimension":

~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/xlrd/xlsx.py in do_row(self, row_elem)
    633             self.rowx = int(row_number) - 1
    634             explicit_row_number = 1
--> 635         assert 0 <= self.rowx < X12_MAX_ROWS
    636         rowx = self.rowx
    637         colx = -1

AssertionError:

Here is what I have installed on my python:
airflow==1.8.0
alembic==0.8.10
annoy==1.11.1
asn1crypto==0.24.0
attrs==17.4.0
Automat==0.6.0
awscli==1.14.32
basemap==1.0.7
bcolz==1.1.2
beautifulsoup4==4.6.0
bitarray==0.8.1
bleach==1.5.0
bokeh==0.12.14
boto==2.48.0
boto3==1.5.22
botocore==1.8.36
branca==0.2.0
bs4==0.0.1
bz2file==0.98
cachetools==2.0.1
Cartopy==0.15.1
catboost==0.6
certifi==2018.1.18
cffi==1.11.4
chardet==3.0.4
click==6.7
click-plugins==1.0.3
cliff==2.8.0
cligj==0.4.0
cloudpickle==0.5.2
cmd2==0.8.0
colorama==0.3.7
colorcet==1.0.0
configparser==3.5.0
constantly==15.1.0
croniter==0.3.20
cryptography==2.1.4
cssselect==1.0.3
cycler==0.10.0
cymem==1.31.2
Cython==0.27.3
cytoolz==0.8.2
dask==0.17.0
datashader==0.6.5
datashape==0.5.4
deap==1.2.2
decorator==4.2.1
descartes==1.1.0
dill==0.2.7.1
distributed==1.21.0
Django==1.11.8
docutils==0.14
en-core-web-sm==2.0.0
entrypoints==0.2.3
et-xmlfile==1.0.1
fastai==0.6
feather-format==0.4.0
Fiona==1.7.11.post1
Flask==0.11.1
Flask-Admin==1.4.1
Flask-Cache==0.13.1
Flask-Login==0.2.11
flask-swagger==0.2.13
Flask-WTF==0.12
folium==0.5.0
ftfy==4.4.3
funcsigs==1.0.0
future==0.15.2
gensim==3.2.0
geopandas==0.3.0
geoplot==0.1.2
gitdb2==2.0.3
GitPython==2.1.8
graphviz==0.8.2
gunicorn==19.3.0
h5py==2.7.1
heapdict==1.0.0
html5lib==0.9999999
hyperlink==17.3.1
idna==2.6
ijson==2.3
impyla==0.14.0
incremental==17.5.0
ipykernel==4.8.0
ipython==6.2.1
ipython-genutils==0.2.0
ipywidgets==7.1.1
isoweek==1.3.3
itsdangerous==0.24
JayDeBeApi==1.1.1
jdcal==1.3
jedi==0.11.1
Jinja2==2.8.1
jmespath==0.9.3
JPype1==0.6.2
jsonschema==2.6.0
jupyter==1.0.0
jupyter-client==5.2.2
jupyter-console==5.2.0
jupyter-contrib-core==0.3.3
jupyter-contrib-nbextensions==0.4.0
jupyter-core==4.4.0
jupyter-highlight-selected-word==0.1.0
jupyter-latex-envs==1.4.1
jupyter-nbextensions-configurator==0.4.0
kaggle-cli==0.12.13
Keras==2.1.3
ktext==0.31
lightgbm==2.1.0
llvmlite==0.21.0
locket==0.2.0
lockfile==0.12.2
lxml==3.8.0
Mako==1.0.7
Markdown==2.6.9
MarkupSafe==1.0
matplotlib==2.1.2
MechanicalSoup==0.8.0
missingno==0.4.0
mistune==0.8.3
mizani==0.4.4
more-itertools==4.1.0
msgpack==0.5.1
msgpack-numpy==0.4.1
msgpack-python==0.5.1
multipledispatch==0.4.9
multiprocess==0.70.5
munch==2.2.0
murmurhash==0.28.0
nbconvert==5.3.1
nbformat==4.4.0
networkx==2.1
nltk==3.2.5
nose==1.3.7
notebook==5.3.1
numba==0.36.2
numpy==1.14.0
olefile==0.45.1
opencv-python==3.4.0.12
openpyxl==2.5.0
ordereddict==1.1
packaging==16.8
palettable==3.1.0
pandas==0.22.0
pandas-summary==0.0.41
pandocfilters==1.4.2
param==1.5.1
parsel==1.3.1
parso==0.1.1
partd==0.3.8
pathlib==1.0.1
pathos==0.2.1
patsy==0.5.0
pbr==3.1.1
pexpect==4.3.1
pickleshare==0.7.4
Pillow==5.0.0
plac==0.9.6
plotly==2.3.0
plotnine==0.3.0
ply==3.10
pox==0.2.3
ppft==1.6.4.7.1
preshed==1.0.0
prettytable==0.7.2
progressbar2==3.34.3
prompt-toolkit==1.0.15
protobuf==3.4.1
psutil==5.4.3
ptyprocess==0.5.2
pyarrow==0.8.0
pyasn1==0.4.2
pyasn1-modules==0.2.1
pycparser==2.18
PyDispatcher==2.0.5
pydot==1.2.4
pydqc==0.1
pyemd==0.5.1
Pygments==2.2.0
pyodbc==4.0.22
pyOpenSSL==17.5.0
pyparsing==2.2.0
pyperclip==1.6.0
Pyphen==0.9.4
pyproj==1.9.5.1
pyshp==1.2.12
python-daemon==2.1.2
python-dateutil==2.6.1
python-editor==1.0.3
python-Levenshtein==0.12.0
python-nvd3==0.14.2
python-slugify==1.1.4
python-utils==2.2.0
pytz==2017.3
PyYAML==3.12
pyzmq==16.0.3
qtconsole==4.3.1
queuelib==1.4.2
regex==2017.4.5
requests==2.18.4
rsa==3.4.2
s3transfer==0.1.12
scikit-learn==0.19.1
scipy==1.0.0
Scrapy==1.5.0
seaborn==0.8.1
selenium==3.8.1
Send2Trash==1.4.2
service-identity==17.0.0
setproctitle==1.1.10
Shapely==1.6.4.post1
simplegeneric==0.8.1
six==1.11.0
sklearn-pandas==1.6.0
smart-open==1.5.6
smmap2==2.0.3
sortedcontainers==1.5.9
spacy==2.0.6
SQLAlchemy==1.2.2
statsmodels==0.8.0
stevedore==1.28.0
stopit==1.1.1
tabulate==0.7.7
tblib==1.3.2
tensorflow==1.4.1
tensorflow-tensorboard==0.1.5
teradata==15.10.0.21
termcolor==1.1.0
terminado==0.8.1
testpath==0.3.1
textacy==0.5.0
Theano==1.0.1
thinc==6.10.2
thrift==0.9.3
thriftpy==0.3.9
toolz==0.9.0
torchtext==0.2.1
tornado==4.5.3
TPOT==0.9.2
tqdm==4.19.5
traitlets==4.3.2
Twisted==17.9.0
ujson==1.35
Unidecode==1.0.22
update-checker==0.16
urllib3==1.22
w3lib==1.19.0
wcwidth==0.1.7
webencodings==0.5.1
Werkzeug==0.14.1
widgetsnbextension==3.1.0
wrapt==1.10.11
WTForms==2.1
xarray==0.10.0
xgboost==0.7
xlrd==1.1.0
XlsxWriter==1.0.2
zict==0.1.3
zope.deprecation==4.3.0
zope.interface==4.4.3

Let me know your thoughts and if you need more info.

Sugi

The text was updated successfully, but these errors were encountered:

SauceCat · 2018-02-28T09:33:51Z

Hi @sugiantolauw

Seems the error is from xlrd ..
Please Google "xlrd assert 0 <= self.rowx < X12_MAX_ROWS".

rohanneps · 2018-04-16T06:14:31Z

Hi @sugiantolauw,

The errors roots from the fact that the infer_schema function generates data schema as an .xlsx file containing lots of unnecessary columns. Manually selecting and deleting all the remaining unused columns solved the issue for me.

@SauceCat,
Could you look into the issue, as I am not much familiar with openpyxl.
Thanks,

vishaalkk · 2018-05-29T17:17:49Z

@sugiantolauw @rohanneps
I was running into same issue, and for now openpyxl library helped. Since I need just first three columns, I am slicing those only.

from openpyxl import load_workbook

def iter_rows(ws):
    for row in ws.iter_rows():
        yield [cell.value for cell in row[0:3]]

def get_schema(fname):
    wb = load_workbook(fname) #from openpyxl import load_workbook
    ws = wb.active
    schema = pd.DataFrame(iter_rows(ws))
    schema.columns = schema.iloc[0]
    schema = schema.reindex(schema.index.drop(0))
    return schema

chrisgschon · 2018-06-22T12:12:27Z

You can also solve this by saving the edited output workbook as a csv and then use pd.read_csv instead.

bballamudi · 2020-03-21T21:57:17Z

@SauceCat Can you please let me know if you have a fix for this?

JingyuZHANG · 2020-08-08T23:33:22Z

you can change your code from

test_schema = pd.read_excel('output/data_schema_test_schema.xlsx')

to

test_schema = pd.read_excel('output/data_schema_test_schema.xlsx', engine='openpyxl')

SauceCat added the bug label Jun 25, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pd.read_excel() on the output of infer_schema.xlsx returns AssertionError: #4

pd.read_excel() on the output of infer_schema.xlsx returns AssertionError: #4

sugiantolauw commented Feb 28, 2018 •

edited

Loading

SauceCat commented Feb 28, 2018

rohanneps commented Apr 16, 2018 •

edited

Loading

vishaalkk commented May 29, 2018 •

edited

Loading

chrisgschon commented Jun 22, 2018

bballamudi commented Mar 21, 2020

JingyuZHANG commented Aug 8, 2020

pd.read_excel() on the output of infer_schema.xlsx returns AssertionError: #4

pd.read_excel() on the output of infer_schema.xlsx returns AssertionError: #4

Comments

sugiantolauw commented Feb 28, 2018 • edited Loading

SauceCat commented Feb 28, 2018

rohanneps commented Apr 16, 2018 • edited Loading

vishaalkk commented May 29, 2018 • edited Loading

chrisgschon commented Jun 22, 2018

bballamudi commented Mar 21, 2020

JingyuZHANG commented Aug 8, 2020

sugiantolauw commented Feb 28, 2018 •

edited

Loading

rohanneps commented Apr 16, 2018 •

edited

Loading

vishaalkk commented May 29, 2018 •

edited

Loading