fix: support opengptx data encoding #24

Closed
wants to merge 6 commits into from

Conversation

lhahn-iis
Contributor

While experimenting with the PackedMemMapDataset implementation on OpenGPT-X data, I ran into decoding errors. It had already been observed in the past that this data cannot be decoded properly using "utf8".
Although this is very likely a problem with the conversion of the respective data and the encoding it produces, this PR provides a potential solution.

Note: This branch is based on #9 in order to use its test infrastructure and should therefore be merged only after #9, not before it.

the OpenGPT-X data seems to come with problematic chars, which cannot get decoded via utf8.
The former fix to use iso-8859-1 fixes this. However, the issue probably actually lies with the dataset conversions
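
To illustrate the symptom described above, here is a minimal sketch of how a byte sequence can fail to decode as utf8 yet decode with iso-8859-1. The helper name `try_decode` and the sample bytes are assumptions made for illustration; they are not taken from the OpenGPT-X data or from this PR's code.

```python
def try_decode(raw: bytes, encoding: str):
    """Hypothetical helper: return the decoded text, or None if decoding fails."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# Sample bytes written with a single-byte encoding (assumed for illustration).
raw = "Größe".encode("iso-8859-1")

# 0xF6 is not a valid UTF-8 start byte, so strict utf8 decoding fails here.
as_utf8 = try_decode(raw, "utf-8")

# iso-8859-1 maps every byte value to a code point, so decoding never raises.
as_latin1 = try_decode(raw, "iso-8859-1")

# The flip side: real UTF-8 bytes read as iso-8859-1 decode without error,
# but the result is silently garbled.
garbled = "ö".encode("utf-8").decode("iso-8859-1")
```

Because iso-8859-1 accepts any byte sequence, switching to it removes the exception but cannot tell whether the data was encoded that way in the first place, which is consistent with the suspicion that the underlying issue sits in the dataset conversion.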
@lhahn-iis lhahn-iis force-pushed the fix/support-opengptx-data-encoding branch from 991ad9d to c9e4e08 Compare January 23, 2024 08:19
@lhahn-iis
Contributor Author

It is worth noting that this problem can also be solved by simplifying the way files are accessed here. Right now the functionality is based on np.memmap. We could, however, also use IO streams with a given offset and a specific read() length. Some small experiments with that approach also indicate much better performance (which makes it interesting for #23 as well). I decided not to add it for now, though, since these changes might collide with the future use of data other than text.
If we want to add this anyway, more time needs to be invested in verifying the stability of the code, since there still seem to be some problems with this implementation.
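
For reference, the stream-based access mentioned above could look roughly like the following sketch. It uses only the standard library; the function names `read_span` and `write_demo_file` and the back-to-back file layout with an (offset, length) index are assumptions for illustration and do not reflect the actual PackedMemMapDataset format.

```python
import os
import tempfile


def read_span(path: str, offset: int, length: int) -> bytes:
    """Read `length` bytes starting at byte `offset` via a plain file stream."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def write_demo_file(chunks):
    """Write chunks back to back; return the path and an (offset, length) index."""
    index = []
    offset = 0
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            index.append((offset, len(chunk)))
            offset += len(chunk)
    return path, index


docs = ["first document", "zweites Dokument: Größe"]
path, index = write_demo_file([d.encode("utf-8") for d in docs])

# Each sample is fetched with one seek() and one read() of a known length.
recovered = [read_span(path, off, n).decode("utf-8") for off, n in index]
os.remove(path)
```

Reopening the file on every call keeps the sketch simple; a real implementation would probably keep the handle open, and whether this actually outperforms np.memmap would need to be measured.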

We can however also leave this to #23 and keep the changes here as a quick solution to support OpenGPT-X Data.

@lhahn-iis lhahn-iis requested a review from le1nux January 23, 2024 08:31
@lhahn-iis
Contributor Author

Closing this, see #40 for more details

@lhahn-iis lhahn-iis closed this Jan 30, 2024
@fromm-m fromm-m deleted the fix/support-opengptx-data-encoding branch June 17, 2024 12:22