Set UTF-8 as default encoding when reading and writing #124
base: develop
Conversation
Looks good to me!
We do not have proper documentation yet, only the README, so could you add a sentence about the encoding parameter at the end of the README?
This is cool!
As it stands, though, there are a bunch of places where the encoding parameter is only used to read the YAML header, not the CSV data underneath. In an ideal world we'd have a functional test which actually tries to load a CSV file containing some unicode (emojis or whatever) to check it works -- but don't bother doing this unless you're particularly keen!
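A minimal sketch of such a functional test, using only the stdlib `csv` module to show the round-trip idea (a real test would go through pycsvy's own reader and writer; everything here is illustrative):

```python
import csv
import tempfile
from pathlib import Path

def test_roundtrip_unicode():
    # Rows containing non-ASCII data, including an emoji.
    rows = [["name", "mood"], ["Zoë", "🎉"]]
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "unicode.csv"
        # Write with an explicit encoding...
        with open(path, "w", encoding="utf-8", newline="") as f:
            csv.writer(f).writerows(rows)
        # ...then read back with the same encoding and check nothing was mangled.
        with open(path, encoding="utf-8", newline="") as f:
            assert list(csv.reader(f)) == rows
```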
The other thing is that I think people sometimes specify the encoding as a parameter in the metadata (i.e. in the YAML). I think that's out of scope for this PR, but maybe open an issue to handle this case? (I guess the way to implement this is to assume that the header is pure ASCII, then use the encoding in the header to load the rest of the file.)
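A rough sketch of the approach described in that parenthesis, with stdlib only (the marker, key name, and function are assumptions for illustration, not pycsvy's API):

```python
import re

HEADER_MARKER = "---"  # assumed YAML delimiter

def detect_header_encoding(filename, default="utf-8"):
    """Scan the YAML header for an 'encoding:' entry, assuming the header
    itself is pure ASCII, so any ASCII-compatible codec can read it."""
    with open(filename, encoding="ascii", errors="replace") as f:
        if f.readline().strip() != HEADER_MARKER:
            return default  # no YAML header at all
        for line in f:
            if line.strip() == HEADER_MARKER:
                break  # reached the end of the header
            match = re.match(r"\s*encoding\s*:\s*(\S+)", line)
            if match:
                return match.group(1)
    return default
```

The detected name would then be passed as `encoding=` when reopening the file to read the CSV body.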
- header, nlines, comment = read_header(filename, marker=marker, **yaml_options)
+ header, nlines, comment = read_header(
+     filename, marker=marker, encoding=encoding, **yaml_options
+ )
Same with pl.scan_csv
OK so, annoyingly, pl.scan_csv only accepts "utf8" or "utf8-lossy" as encodings. "utf-8" is invalid. Which kinda messes this whole thing up 🫠
Oh, yuck. Well, I guess then for the polars wrapper we just have to insist that the encoding is UTF-8? We should probably document this in the docstring.
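One way the polars wrapper might enforce this is to normalise any spelling of UTF-8 to the name polars expects and reject everything else (the helper name is illustrative, not part of pycsvy):

```python
import codecs

def validate_polars_encoding(encoding):
    """Map any alias of UTF-8 (e.g. "utf-8", "UTF8") to the "utf8" that
    polars' scan_csv accepts, and reject non-UTF-8 encodings outright."""
    if encoding == "utf8-lossy":
        return encoding  # polars' other accepted value, pass through as-is
    # codecs.lookup canonicalises aliases: "UTF-8", "utf_8", "U8" -> "utf-8"
    if codecs.lookup(encoding).name == "utf-8":
        return "utf8"
    raise ValueError(
        f"polars only supports UTF-8 encoded CSV files, got {encoding!r}"
    )
```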
csvy/readers.py
Outdated
options = csv_options.copy() if csv_options is not None else {}
  data = []
- with open(filename, newline="") as csvfile:
+ with open(filename, encoding="utf-8", newline="") as csvfile:
- with open(filename, encoding="utf-8", newline="") as csvfile:
+ with open(filename, encoding=encoding, newline="") as csvfile:
Good spot, thanks
Aside from a minor comment on the README and Alex's comments, all good.
I think encoding is the sort of thing that you need to know upfront to read the file properly. Reading it from the header might be an option, but it requires making assumptions about the header, and I think the premise of PyCSVY is not to make any assumptions. As Alex suggests, if we want to consider it, it will be a separate PR for sure.
reader or writer functions. For example, on Windows, Python will assume Windows-1252 encoding,
which can be specified via `encoding='cp1252'`.
This line here might be a bit confusing for the reader. If Python assumes Windows 1252 encoding, then what's the default, UTF-8 or cp1252? I know what the code does, but it won't be clear for a reader if they need to use the cp1252 encoding or not.
Yeah, sorry, I should have been clearer.
The case in which this is important is if a Windows user has written a CSV file using Python but outside of pycsvy, which by default would be cp1252-encoded. If they haven't paid attention to encodings then it could cause confusion for them.
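A quick stdlib demonstration of that mismatch (working on bytes in memory rather than a file, but the failure mode is the same as reading a cp1252 file with a UTF-8 default):

```python
# Text containing non-ASCII characters, encoded with the historical
# Windows default codec.
data = "café, 20°C"
raw = data.encode("cp1252")

# Decoding those bytes as UTF-8 fails on the non-ASCII characters...
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print("utf-8 decode failed:", err.reason)

# ...while passing the correct encoding recovers the original text.
assert raw.decode("cp1252") == data
```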
I do agree with you, but just to add that people do put the encoding in actual documents sometimes (e.g. you see it in HTML etc.). It's kinda weird, but it works fine if a) the encoding is ASCII-compatible (like
This PR sets UTF-8 as the default encoding when reading from or writing to a file, whilst giving users the option to specify the encoding when calling the top-level functions, e.g.
data, metadata = csvy.read_to_array("important_data.csv", encoding="cp1252")
Closes #86