This issue is related and similar to issue #85.
When the trousse.dataset.read_file and trousse.dataset.get_df_from_csv functions are used to read a CSV file, they rely on pandas.read_csv to parse it.
By design, pandas tries to avoid columns with mixed-type values (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html and https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.errors.DtypeWarning.html). So when a column in the CSV file contains boolean values (i.e. True/False) along with invalid entries (e.g. True% instead of True, or 0 instead of False), the column is loaded into the DataFrame (stored in the ._data attribute of Dataset) with dtype='object'.
The issue stems from the fact that, whenever pandas loads a column from a CSV file and assigns it dtype object, all of its values are stored as strings. This means that if a boolean column of the CSV (col1 here) contains a typo such as True%, the corresponding DataFrame column has dtype 'object'.
And if we select the first element of column col1, its value will be:
>>> df['col1'][0]
'True'
and its type will be:
>>> type(df['col1'][0])
<class 'str'>
So even the pandas function infer_dtype does not recognize that column as a mixed column:
>>> pd.api.types.infer_dtype(df['col1'])
'string'
In conclusion, if a column of a CSV file contains at least one value that cannot be interpreted consistently with the others (e.g. a boolean value containing a typo), every value of that column will be interpreted as a string.
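The behavior described above can be reproduced with a small in-memory CSV. This is only a sketch: the CSV content and the column names col0 and col1 are assumptions made for illustration, not the file from the original report.

```python
import io

import pandas as pd

# Hypothetical CSV: col1 is meant to be boolean but contains the typo "True%".
csv_text = "col0,col1\n1,True\n2,False\n3,True%\n"

df = pd.read_csv(io.StringIO(csv_text))

first_value = df["col1"][0]
print(repr(first_value))                      # the string 'True', not the boolean True
print(type(first_value))                      # <class 'str'>
print(pd.api.types.infer_dtype(df["col1"]))   # 'string'
```

Because a single entry (True%) cannot be parsed as a boolean, the valid True and False entries in the same column also come back as strings.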
As in issue #85, my proposal is to add a function, called inside the Dataset.__init__ method, that analyzes columns with dtype='object'.
For each such column, this function replaces the string 'True' with the boolean True and the string 'False' with the boolean False. This changes the type of the matching values from string to boolean, while leaving the other values untouched.
As a result, when the Dataset method _columns_type calls the pd.api.types.infer_dtype function, the inferred type will be 'mixed' rather than 'string' (so that proper and expected inference will be performed).
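A minimal sketch of what such a function could look like is given here. The name restore_bool_tokens and its signature are hypothetical and not part of trousse. As an assumption of this sketch, columns are selected with infer_dtype(...) == 'string' as a stand-in for the dtype='object' check, so that it does not depend on how a given pandas version stores text columns.

```python
import io

import pandas as pd


def restore_bool_tokens(df):
    """Return a copy of df where the strings 'True'/'False' become booleans.

    Hypothetical helper: the name and signature are not part of trousse.
    """
    tokens = {"True": True, "False": False}

    def convert(value):
        # Only exact string matches are converted; typos such as 'True%',
        # missing values and non-string values pass through unchanged.
        if isinstance(value, str):
            return tokens.get(value, value)
        return value

    result = df.copy()
    for name in result.columns:
        column = result[name]
        # Only touch columns whose values are all strings (missing values aside).
        if pd.api.types.infer_dtype(column) == "string":
            result[name] = column.astype(object).map(convert)
    return result


# Hypothetical CSV with a typo ("True%") in the boolean column col1.
raw = pd.read_csv(io.StringIO("col0,col1\n1,True\n2,False\n3,True%\n"))
fixed = restore_bool_tokens(raw)
print(pd.api.types.infer_dtype(fixed["col1"]))  # 'mixed'
```

One consequence of this design is that a genuine text column that happens to contain the exact words True or False would also be reported as 'mixed', so the replacement may need to be restricted further.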