Set UTF-8 as default encoding when reading and writing #124

dc2917 · 2024-10-23T14:08:14Z

This PR sets UTF-8 as the default encoding when reading from or writing to file, whilst giving users the option to specify encoding when calling the top-level functions, e.g.
data, metadata = csvy.read_to_array("important_data.csv", encoding="cp1252")
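
For context, a minimal standard-library sketch of why an explicit default matters: when `encoding` is omitted, `open()` normally falls back to the locale's preferred encoding, which varies by platform. The helper `roundtrip` and the file name are hypothetical and are not part of pycsvy.

```python
import locale
import tempfile
from pathlib import Path

# What open() would normally use if no encoding were given on this machine.
platform_default = locale.getpreferredencoding(False)


def roundtrip(text: str, encoding: str = "utf-8") -> str:
    """Write `text` to a temporary file and read it back with one encoding."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "example.csv"  # hypothetical file name
        with open(path, "w", encoding=encoding, newline="") as f:
            f.write(text)
        with open(path, encoding=encoding, newline="") as f:
            return f.read()


sample = "name,value\ncafé,1\n"
result = roundtrip(sample)  # same text back, regardless of platform_default
```

Passing the same explicit encoding on both sides makes the result independent of `platform_default`.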

Closes #86

dalonsoa

Looks good to me!

We do not have proper documentation yet, only the README, so could you add a sentence about the encoding parameter at the end of the README?

alexdewar

This is cool!

As it stands though there are a bunch of places where the encoding parameter is only used to read the YAML header, not the CSV data underneath. I guess in an ideal world we'd have a functional test which actually tries to load a CSV file containing some unicode (emojis or whatever) to check it works -- but don't bother doing this unless you're particularly keen!
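
A rough sketch of what such a functional test could look like, using only the standard library. `read_rows` is a hypothetical stand-in for a reader, and the file layout (a header delimited by two `---` lines, then CSV) is a simplifying assumption, not necessarily what pycsvy does.

```python
import csv
import tempfile
from pathlib import Path


def read_rows(filename, encoding="utf-8", marker="---"):
    """Hypothetical reader: skip the marker-delimited header, return CSV rows."""
    with open(filename, encoding=encoding, newline="") as f:
        if f.readline().strip() != marker:
            raise ValueError("file does not start with a header marker")
        for line in f:
            if line.strip() == marker:
                break  # end of header reached
        # The same file object is reused, so the CSV body is decoded
        # with the encoding the caller asked for.
        return list(csv.reader(f))


def test_unicode_roundtrip():
    content = "---\ntitle: Café ☕\n---\ncity,emoji\nZürich,🫠\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "unicode.csvy"  # hypothetical file name
        path.write_bytes(content.encode("utf-8"))
        rows = read_rows(path, encoding="utf-8")
    assert rows == [["city", "emoji"], ["Zürich", "🫠"]]


test_unicode_roundtrip()
```

The point of the test is that non-ASCII characters appear in both the header and the data, so a reader that only applies `encoding` to the header would fail on platforms with a non-UTF-8 default.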

The other thing is that I think people sometimes specify the encoding as a parameter in the metadata (i.e. in the YAML). I think that's out of scope for this PR, but maybe open an issue to handle this case? (I guess the way to implement this is to assume that the header is pure ASCII, then use the encoding in the header to load the rest of the file.)
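
The idea in the parenthesis could be sketched roughly as below. This is an illustration only: the `encoding` metadata key, the `---` marker layout and both function names are assumptions, and the header is parsed with plain string handling rather than a YAML library.

```python
def detect_declared_encoding(raw: bytes, marker: bytes = b"---",
                             default: str = "utf-8") -> str:
    """Return the `encoding:` value from an ASCII-only header, or `default`."""
    lines = raw.split(b"\n")
    if lines[0].strip() != marker:
        return default  # no header at all
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == marker:
            break  # end of header, nothing declared
        key, sep, value = stripped.partition(b":")
        if sep and key.strip() == b"encoding":
            # The declaration itself must be ASCII for this to work.
            return value.strip().decode("ascii").strip("'\"")
    return default


def read_text(raw: bytes) -> str:
    """Decode the whole file with the encoding its header declares."""
    return raw.decode(detect_declared_encoding(raw))


raw = "---\nencoding: cp1252\n---\nname\ncafé\n".encode("cp1252")
text = read_text(raw)
```

Working on bytes until the declaration is found is what makes the two-pass approach possible; it only holds for ASCII-compatible encodings.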

alexdewar · 2024-10-23T17:34:48Z

csvy/readers.py

-    header, nlines, comment = read_header(filename, marker=marker, **yaml_options)
+    header, nlines, comment = read_header(
+        filename, marker=marker, encoding=encoding, **yaml_options
+    )


Same with pl.scan_csv

OK so, annoyingly, pl.scan_csv only accepts "utf8" or "utf8-lossy" as encodings. "utf-8" is invalid. Which kinda messes this whole thing up 🫠

Oh, yuck. Well I guess then for the polars wrapper we just have to insist that the encoding is UTF-8? We should probably document this in the docstring.
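
One possible way to insist on UTF-8 for the polars wrapper, sketched with the standard library only. The function name `polars_encoding` is hypothetical and this is not necessarily how the PR ended up enforcing it; the two accepted spellings are the ones mentioned above.

```python
import codecs


def polars_encoding(encoding: str = "utf-8") -> str:
    """Map any alias of UTF-8 to the "utf8" spelling; reject anything else."""
    if encoding in ("utf8", "utf8-lossy"):
        return encoding  # already one of the spellings mentioned above
    try:
        canonical = codecs.lookup(encoding).name
    except LookupError:
        raise ValueError(f"Unknown encoding: {encoding!r}") from None
    if canonical != "utf-8":
        raise ValueError(
            f"Only UTF-8 is supported for the polars reader, got {encoding!r}"
        )
    return "utf8"
```

Using `codecs.lookup` means spellings such as `"UTF-8"` or `"utf_8"` are accepted without listing every alias by hand.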

alexdewar · 2024-10-23T17:35:41Z

csvy/readers.py


    options = csv_options.copy() if csv_options is not None else {}

    data = []
-    with open(filename, newline="") as csvfile:
+    with open(filename, encoding="utf-8", newline="") as csvfile:


Suggested change

-    with open(filename, encoding="utf-8", newline="") as csvfile:
+    with open(filename, encoding=encoding, newline="") as csvfile:

Good spot, thanks

dalonsoa

Aside from a minor comment on the README and Alex's comments, all good.

I think encoding is the sort of thing that you need to know upfront to read the file properly. Reading it from the header might be an option, but it already requires assuming things about the header, and I think the premise of PyCSVY is not to make any assumptions. As Alex suggests, if we want to consider it, it will be a separate PR for sure.

dalonsoa · 2024-10-24T08:12:08Z

README.md

+reader or writer functions. For example, on Windows, Python will assume Windows-1252 encoding,
+which can be specified via `encoding='cp1252'`.


This line here might be a bit confusing for the reader. If Python assumes Windows 1252 encoding, then what's the default, UTF-8 or cp1252? I know what the code does, but it won't be clear for a reader if they need to use the cp1252 encoding or not.

Yeah, sorry, I should have been more clear.

I guess the case in which this is important is if a Windows user has written a CSV file using Python but outside of pycsvy, which by default would be cp1252-encoded. If they don't pay/haven't paid attention to encodings then it could cause confusion for them.
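
A small illustration of that scenario, assuming a file that was written as cp1252 and is then read with the new UTF-8 default. The helper `try_read` and the file name are hypothetical.

```python
import tempfile
from pathlib import Path


def try_read(path, encoding):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        with open(path, encoding=encoding, newline="") as f:
            return f.read()
    except UnicodeDecodeError:
        return None


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "legacy.csv"  # hypothetical file written elsewhere
    # Simulate a file saved with the cp1252 encoding.
    path.write_bytes("name,value\ncafé,1\n".encode("cp1252"))
    as_utf8 = try_read(path, "utf-8")      # decoding fails
    as_cp1252 = try_read(path, "cp1252")   # decodes correctly
```

In this case the user would need to pass `encoding="cp1252"` explicitly, which is the situation the README note is trying to describe.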

alexdewar · 2024-10-24T08:28:41Z

> I think encoding is the sort of thing that you need to know upfront to read the file properly. Reading it from the header might be an option, but it already requires assuming things about the header, and I think the premise of PyCSVY is not to make any assumptions. As Alex suggests, if we want to consider it, it will be a separate PR for sure.

I do agree with you, but just to add that people do put the encoding in actual documents sometimes (e.g. you see it in HTML etc.). It's kinda weird, but it works fine if a) the encoding is ASCII-compatible (like cp1252 or UTF-8) and b) you don't use any non-ASCII chars before the bit where you say what encoding the document uses. Definitely a job for another day though, if indeed we bother at all. (Who wouldn't want to use UTF-8 these days anyway?)

dc2917 added 3 commits October 23, 2024 14:32

Pass utf-8 encoding in all file open calls

94fe91a

Pass encoding via top-level functions

ebd7da4

Assert encoding passed to readers/writers

ff18e6b

dc2917 requested review from alexdewar and dalonsoa October 23, 2024 14:08

dalonsoa approved these changes Oct 23, 2024

dalonsoa added the hacktoberfest-accepted label Oct 23, 2024

Added note about character encoding

9c83fbd

alexdewar requested changes Oct 23, 2024

dalonsoa reviewed Oct 24, 2024

dc2917 added 4 commits October 25, 2024 09:08

Pass encoding to numpy, pandas and polars readers

a83f31e

Use passed encoding

d35e716

Fix read_to_polars encoding utf-8 -> utf8

6088d77

Enforce utf-8 encoding for read_to_polars

32f9139
