Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pd.read_excel() on the output of infer_schema.xlsx returns AssertionError: #4

Open
sugiantolauw opened this issue Feb 28, 2018 · 6 comments
Labels

Comments

@sugiantolauw
Copy link

sugiantolauw commented Feb 28, 2018

Hi There,

This looks like a great package and I was testing your package for my own automation.

So, I have created the schema of my dataframe with the infer_schema() function and it returns the .xlsx file.

When I tried to read the .xlsx schema file using pd.read_excel() function, it returns "AssertionError: "

I am not sure what is happening here.

Here is the complete error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-29-e35830cdc292> in <module>()
----> 1 test_schema = pd.read_excel('output/data_schema_test_schema.xlsx')

~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    116                 else:
    117                     kwargs[new_arg_name] = new_arg_value
--> 118             return func(*args, **kwargs)
    119         return wrapper
    120     return _deprecate_kwarg

~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/pandas/io/excel.py in read_excel(io, sheet_name, header, skiprows, skip_footer, index_col, names, usecols, parse_dates, date_parser, na_values, thousands, convert_float, converters, dtype, true_values, false_values, engine, squeeze, **kwds)
    228 
    229     if not isinstance(io, ExcelFile):
--> 230         io = ExcelFile(io, engine=engine)
    231 
    232     return io._parse_excel(

~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/pandas/io/excel.py in __init__(self, io, **kwds)
    292             self.book = xlrd.open_workbook(file_contents=data)
    293         elif isinstance(self._io, compat.string_types):
--> 294             self.book = xlrd.open_workbook(self._io)
    295         else:
    296             raise ValueError('Must explicitly set engine if not passing in'

~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/xlrd/__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
    141                 formatting_info=formatting_info,
    142                 on_demand=on_demand,
--> 143                 ragged_rows=ragged_rows,
    144                 )
    145             return bk

~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/xlrd/xlsx.py in open_workbook_2007_xml(zf, component_names, logfile, verbosity, use_mmap, formatting_info, on_demand, ragged_rows)
    835         x12sheet = X12Sheet(sheet, logfile, verbosity)
    836         heading = "Sheet %r (sheetx=%d) from %r" % (sheet.name, sheetx, fname)
--> 837         x12sheet.process_stream(zflo, heading)
    838         del zflo
    839 

~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/xlrd/xlsx.py in own_process_stream(self, stream, heading)
    546         for event, elem in ET.iterparse(stream):
    547             if elem.tag == row_tag:
--> 548                 self_do_row(elem)
    549                 elem.clear() # destroy all child elements (cells)
    550             elif elem.tag == U_SSML12 + "dimension":

~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/xlrd/xlsx.py in do_row(self, row_elem)
    633             self.rowx = int(row_number) - 1
    634             explicit_row_number = 1
--> 635         assert 0 <= self.rowx < X12_MAX_ROWS
    636         rowx = self.rowx
    637         colx = -1

AssertionError: 

Here is what I have installed on my python:
airflow==1.8.0
alembic==0.8.10
annoy==1.11.1
asn1crypto==0.24.0
attrs==17.4.0
Automat==0.6.0
awscli==1.14.32
basemap==1.0.7
bcolz==1.1.2
beautifulsoup4==4.6.0
bitarray==0.8.1
bleach==1.5.0
bokeh==0.12.14
boto==2.48.0
boto3==1.5.22
botocore==1.8.36
branca==0.2.0
bs4==0.0.1
bz2file==0.98
cachetools==2.0.1
Cartopy==0.15.1
catboost==0.6
certifi==2018.1.18
cffi==1.11.4
chardet==3.0.4
click==6.7
click-plugins==1.0.3
cliff==2.8.0
cligj==0.4.0
cloudpickle==0.5.2
cmd2==0.8.0
colorama==0.3.7
colorcet==1.0.0
configparser==3.5.0
constantly==15.1.0
croniter==0.3.20
cryptography==2.1.4
cssselect==1.0.3
cycler==0.10.0
cymem==1.31.2
Cython==0.27.3
cytoolz==0.8.2
dask==0.17.0
datashader==0.6.5
datashape==0.5.4
deap==1.2.2
decorator==4.2.1
descartes==1.1.0
dill==0.2.7.1
distributed==1.21.0
Django==1.11.8
docutils==0.14
en-core-web-sm==2.0.0
entrypoints==0.2.3
et-xmlfile==1.0.1
fastai==0.6
feather-format==0.4.0
Fiona==1.7.11.post1
Flask==0.11.1
Flask-Admin==1.4.1
Flask-Cache==0.13.1
Flask-Login==0.2.11
flask-swagger==0.2.13
Flask-WTF==0.12
folium==0.5.0
ftfy==4.4.3
funcsigs==1.0.0
future==0.15.2
gensim==3.2.0
geopandas==0.3.0
geoplot==0.1.2
gitdb2==2.0.3
GitPython==2.1.8
graphviz==0.8.2
gunicorn==19.3.0
h5py==2.7.1
heapdict==1.0.0
html5lib==0.9999999
hyperlink==17.3.1
idna==2.6
ijson==2.3
impyla==0.14.0
incremental==17.5.0
ipykernel==4.8.0
ipython==6.2.1
ipython-genutils==0.2.0
ipywidgets==7.1.1
isoweek==1.3.3
itsdangerous==0.24
JayDeBeApi==1.1.1
jdcal==1.3
jedi==0.11.1
Jinja2==2.8.1
jmespath==0.9.3
JPype1==0.6.2
jsonschema==2.6.0
jupyter==1.0.0
jupyter-client==5.2.2
jupyter-console==5.2.0
jupyter-contrib-core==0.3.3
jupyter-contrib-nbextensions==0.4.0
jupyter-core==4.4.0
jupyter-highlight-selected-word==0.1.0
jupyter-latex-envs==1.4.1
jupyter-nbextensions-configurator==0.4.0
kaggle-cli==0.12.13
Keras==2.1.3
ktext==0.31
lightgbm==2.1.0
llvmlite==0.21.0
locket==0.2.0
lockfile==0.12.2
lxml==3.8.0
Mako==1.0.7
Markdown==2.6.9
MarkupSafe==1.0
matplotlib==2.1.2
MechanicalSoup==0.8.0
missingno==0.4.0
mistune==0.8.3
mizani==0.4.4
more-itertools==4.1.0
msgpack==0.5.1
msgpack-numpy==0.4.1
msgpack-python==0.5.1
multipledispatch==0.4.9
multiprocess==0.70.5
munch==2.2.0
murmurhash==0.28.0
nbconvert==5.3.1
nbformat==4.4.0
networkx==2.1
nltk==3.2.5
nose==1.3.7
notebook==5.3.1
numba==0.36.2
numpy==1.14.0
olefile==0.45.1
opencv-python==3.4.0.12
openpyxl==2.5.0
ordereddict==1.1
packaging==16.8
palettable==3.1.0
pandas==0.22.0
pandas-summary==0.0.41
pandocfilters==1.4.2
param==1.5.1
parsel==1.3.1
parso==0.1.1
partd==0.3.8
pathlib==1.0.1
pathos==0.2.1
patsy==0.5.0
pbr==3.1.1
pexpect==4.3.1
pickleshare==0.7.4
Pillow==5.0.0
plac==0.9.6
plotly==2.3.0
plotnine==0.3.0
ply==3.10
pox==0.2.3
ppft==1.6.4.7.1
preshed==1.0.0
prettytable==0.7.2
progressbar2==3.34.3
prompt-toolkit==1.0.15
protobuf==3.4.1
psutil==5.4.3
ptyprocess==0.5.2
pyarrow==0.8.0
pyasn1==0.4.2
pyasn1-modules==0.2.1
pycparser==2.18
PyDispatcher==2.0.5
pydot==1.2.4
pydqc==0.1
pyemd==0.5.1
Pygments==2.2.0
pyodbc==4.0.22
pyOpenSSL==17.5.0
pyparsing==2.2.0
pyperclip==1.6.0
Pyphen==0.9.4
pyproj==1.9.5.1
pyshp==1.2.12
python-daemon==2.1.2
python-dateutil==2.6.1
python-editor==1.0.3
python-Levenshtein==0.12.0
python-nvd3==0.14.2
python-slugify==1.1.4
python-utils==2.2.0
pytz==2017.3
PyYAML==3.12
pyzmq==16.0.3
qtconsole==4.3.1
queuelib==1.4.2
regex==2017.4.5
requests==2.18.4
rsa==3.4.2
s3transfer==0.1.12
scikit-learn==0.19.1
scipy==1.0.0
Scrapy==1.5.0
seaborn==0.8.1
selenium==3.8.1
Send2Trash==1.4.2
service-identity==17.0.0
setproctitle==1.1.10
Shapely==1.6.4.post1
simplegeneric==0.8.1
six==1.11.0
sklearn-pandas==1.6.0
smart-open==1.5.6
smmap2==2.0.3
sortedcontainers==1.5.9
spacy==2.0.6
SQLAlchemy==1.2.2
statsmodels==0.8.0
stevedore==1.28.0
stopit==1.1.1
tabulate==0.7.7
tblib==1.3.2
tensorflow==1.4.1
tensorflow-tensorboard==0.1.5
teradata==15.10.0.21
termcolor==1.1.0
terminado==0.8.1
testpath==0.3.1
textacy==0.5.0
Theano==1.0.1
thinc==6.10.2
thrift==0.9.3
thriftpy==0.3.9
toolz==0.9.0
torchtext==0.2.1
tornado==4.5.3
TPOT==0.9.2
tqdm==4.19.5
traitlets==4.3.2
Twisted==17.9.0
ujson==1.35
Unidecode==1.0.22
update-checker==0.16
urllib3==1.22
w3lib==1.19.0
wcwidth==0.1.7
webencodings==0.5.1
Werkzeug==0.14.1
widgetsnbextension==3.1.0
wrapt==1.10.11
WTForms==2.1
xarray==0.10.0
xgboost==0.7
xlrd==1.1.0
XlsxWriter==1.0.2
zict==0.1.3
zope.deprecation==4.3.0
zope.interface==4.4.3

Let me know your thoughts and if you need more info.

Sugi

@SauceCat
Copy link
Owner

Hi @sugiantolauw

Seems the error is from xlrd ..
Please Google "xlrd assert 0 <= self.rowx < X12_MAX_ROWS".

@rohanneps
Copy link

rohanneps commented Apr 16, 2018

Hi @sugiantolauw,

The errors roots from the fact that the infer_schema function generates data schema as an .xlsx file containing lots of unnecessary columns. Manually selecting and deleting all the remaining unused columns solved the issue for me.

@SauceCat,
Could you look into the issue, as I am not much familiar with openpyxl.
Thanks,

@vishaalkk
Copy link

vishaalkk commented May 29, 2018

@sugiantolauw @rohanneps
I was running into same issue, and for now openpyxl library helped. Since I need just first three columns, I am slicing those only.

from openpyxl import load_workbook

def iter_rows(ws):
    for row in ws.iter_rows():
        yield [cell.value for cell in row[0:3]]

def get_schema(fname):
    wb = load_workbook(fname) #from openpyxl import load_workbook
    ws = wb.active
    schema = pd.DataFrame(iter_rows(ws))
    schema.columns = schema.iloc[0]
    schema = schema.reindex(schema.index.drop(0))
    return schema

@chrisgschon
Copy link

You can also solve this by saving the edited output workbook as a csv and then use pd.read_csv instead.

@SauceCat SauceCat added the bug label Jun 25, 2018
@bballamudi
Copy link

@SauceCat Can you please let me know if you have a fix for this?

@JingyuZHANG
Copy link

you can change your code from

test_schema = pd.read_excel('output/data_schema_test_schema.xlsx')

to

test_schema = pd.read_excel('output/data_schema_test_schema.xlsx', engine='openpyxl')

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants