"read_file" and "get_df_from_csv" functions load boolean values as string ones #89

Open
lorenz-gorini opened this issue Oct 15, 2020 · 0 comments

This issue is related and similar to issue #85.
When the trousse.dataset.read_file and trousse.dataset.get_df_from_csv functions are used to read a CSV file, they rely on the pandas.read_csv function to parse it.

By design, pandas tries to avoid columns with mixed-type values (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html and https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.errors.DtypeWarning.html). So when a column in the CSV file contains boolean values (i.e. True/False) along with typos (e.g. True% instead of True, or 0 instead of False), the column is loaded into a DataFrame (inside the ._data attribute of Dataset) with dtype='object'.
The issue derives from the pandas behavior that, whenever a column is loaded from a CSV file and its assigned dtype is object, all of its values are cast to strings. This means that if a CSV looks like:

,col0,col1
0,1,True
1,1,False
2,0,True%
3,0,True
4,0,True

(where the boolean column col1 contains the typo True%), the corresponding DataFrame has:

>>> import pandas as pd
>>> df = pd.read_csv(CSV_PATH)
>>> df['col1'].dtype
dtype('O')

And if we select the first element of column col1, its value will be:

>>> df['col1'][0]
'True'

and its type will be:

>>> type(df['col1'][0])
<class 'str'>

So even the pandas function infer_dtype does not recognize that column as a mixed column:

>>> pd.api.types.infer_dtype(df['col1'])
'string'

In conclusion, if a column of a CSV file contains at least one value that cannot be interpreted consistently with the type of all the other values (e.g. a boolean value containing a typo), every value of that column will be interpreted as a string.
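The behavior described above can be reproduced without a file on disk. The snippet below is a minimal sketch: it reads the same CSV content from an in-memory string (the names CSV_TEXT and clean are only illustrative, and index_col=0 is an assumption made to reuse the first unnamed column as the index), then compares it with a copy where the typo has been corrected.

```python
import io

import pandas as pd

# Same content as the CSV shown in this issue, kept in memory for convenience.
CSV_TEXT = """,col0,col1
0,1,True
1,1,False
2,0,True%
3,0,True
4,0,True
"""

# Column col1 contains the typo "True%", so no value is parsed as a boolean.
df = pd.read_csv(io.StringIO(CSV_TEXT), index_col=0)
print([type(value).__name__ for value in df["col1"]])
print(pd.api.types.infer_dtype(df["col1"]))

# Without the typo, the same column is parsed as a proper boolean column.
clean = pd.read_csv(io.StringIO(CSV_TEXT.replace("True%", "True")), index_col=0)
print(clean["col1"].dtype)
```

A single malformed cell is therefore enough to change how every other value in the column is loaded.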

Similarly to issue #85, my proposal is to add a function inside the Dataset.__init__ method that analyzes columns with dtype='object'.
For each such column, this function replaces 'True' with True and 'False' with False. This would change the type of each matching value from string to boolean, while leaving the other values untouched.
As a result, when the Dataset method _columns_type calls the pd.api.types.infer_dtype function, the inferred type would no longer be 'string' but 'mixed' (so that the proper and expected inference is performed).
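A rough sketch of what such a function could look like is given below. The name restore_bool_strings and the mapping _BOOL_STRINGS are hypothetical and not part of trousse. The sketch also assumes that checking infer_dtype for 'string' is an acceptable way to select the affected columns, and it works on a plain DataFrame rather than inside Dataset.__init__.

```python
import io

import pandas as pd

# Hypothetical mapping from the string spellings to real booleans.
_BOOL_STRINGS = {"True": True, "False": False}


def restore_bool_strings(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df where 'True'/'False' strings become booleans.

    Only columns whose values are all inferred as strings are inspected;
    any other value (e.g. 'True%') is left untouched.
    """
    result = df.copy()
    for name in result.columns:
        column = result[name]
        if pd.api.types.infer_dtype(column, skipna=True) != "string":
            continue
        # Skip text columns that contain no boolean-like strings at all.
        if not column.isin(list(_BOOL_STRINGS)).any():
            continue
        # Cast to object first so the column can hold both bool and str.
        result[name] = column.astype(object).map(
            lambda value: _BOOL_STRINGS.get(value, value)
        )
    return result


csv_text = ",col0,col1\n0,1,True\n1,1,False\n2,0,True%\n3,0,True\n4,0,True\n"
df = pd.read_csv(io.StringIO(csv_text), index_col=0)
fixed = restore_bool_strings(df)
print(pd.api.types.infer_dtype(fixed["col1"]))
```

Using a function with map (instead of Series.replace) keeps values that are not in the mapping exactly as they were, which is the "leave the others untouched" part of the proposal.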
