Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set UTF-8 as default encoding when reading and writing #124

Open
wants to merge 8 commits into
base: develop
Choose a base branch
from

Conversation

dc2917
Copy link
Contributor

@dc2917 dc2917 commented Oct 23, 2024

This PR sets UTF-8 as the default encoding when reading from or writing to file, whilst giving users the option to specify encoding when calling the top-level functions, e.g.
data, metadata = csvy.read_to_array("important_data.csv", encoding="cp1252")

Closes #86

Copy link
Collaborator

@dalonsoa dalonsoa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

We do not have proper documentation - yet -, only the README, so could you add at the end of the README a sentence about the encoding parameter?

Copy link
Contributor

@alexdewar alexdewar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is cool!

As it stands though there are a bunch of places where the encoding parameter is only used to read the YAML header, not the CSV data underneath. I guess in an ideal world we'd have a functional test which actually tries to load a CSV file containing some unicode (emojis or whatever) to check it works -- but don't bother doing this unless you're particularly keen!

The other thing is that I think people sometimes specify the encoding as a parameter in the metadata (i.e. in the YAML). I think that's out of scope for this PR, but maybe open an issue to handle this case? (I guess the way to implement this is to assume that the header is pure ASCII, then use the encoding in the header to load the rest of the file.)

csvy/readers.py Show resolved Hide resolved
csvy/readers.py Show resolved Hide resolved
header, nlines, comment = read_header(filename, marker=marker, **yaml_options)
header, nlines, comment = read_header(
filename, marker=marker, encoding=encoding, **yaml_options
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same with pl.scan_csv

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK so, annoyingly, pl.scan_csv only accepts "utf8" or "utf8-lossy" as encodings. "utf-8" is invalid. Which kinda messes this whole thing up 🫠

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yuck. Well I guess then for the polars wrapper we just have to insist that the encoding is UTF-8? We should probably document this in the docstring.

csvy/readers.py Outdated

options = csv_options.copy() if csv_options is not None else {}

data = []
with open(filename, newline="") as csvfile:
with open(filename, encoding="utf-8", newline="") as csvfile:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
with open(filename, encoding="utf-8", newline="") as csvfile:
with open(filename, encoding=encoding, newline="") as csvfile:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good spot, thanks

Copy link
Collaborator

@dalonsoa dalonsoa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from a minor comment on the README and Alex comments, all good.

I think encoding is the sort of thing that you need to know upfront to read the file properly. Reading it from the header might be an option, but requires assuming things about the header, already and I think the premise of PyCSVY is not to make any assumption. As Alex suggest, if we want to consider it, it will be a separate PR for sure.

Comment on lines +118 to +119
reader or writer functions. For example, on Windows, Python will assume Windows-1252 encoding,
which can be specified via `encoding='cp1252'`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line here might be a bit confusing for the reader. If Python assumes Windows 1252 encoding, then what's the default, UTF-8 or cp1252? I know what the code does, but it won't be clear for a reader if they need to use the cp1252 encoding or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, sorry, I should have been more clear.

I guess the case in which this is important is if a Windows user has written a csv file using Python but outside of pycsvy, which by default would use cp1252-encoded. If they don't pay/haven't paid attention to encodings then it could cause confusion for them.

@alexdewar
Copy link
Contributor

I think encoding is the sort of thing that you need to know upfront to read the file properly. Reading it from the header might be an option, but requires assuming things about the header, already and I think the premise of PyCSVY is not to make any assumption. As Alex suggest, if we want to consider it, it will be a separate PR for sure.

I do agree with you, but just to add that people do put the encoding in actual documents sometimes (e.g. you see it in HTML etc.). It's kinda weird, but it works fine if a) the encoding is ASCII-compatible (like cp1252 or UTF-8) and b) you don't use any non-ASCII chars before the bit where you say what encoding the document uses. Definitely a job for another day though, if indeed we bother at all. (Who wouldn't want to use UTF-8 these days anyway?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pycsvy won't work with Unicode data on Windows
3 participants