Anyone able to load tfrecords into TFRS generated with the spark-to-tf-records connector? #197
Comments
That said, there are the CSV-consuming APIs: https://www.tensorflow.org/guide/data#consuming_csv_data. I wonder if that's just a better, more direct approach anyway. I have Parquet data to consume; in theory it should be loadable as described here: tensorflow/io#1121
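(A minimal sketch of that CSV route from the guide, assuming a ratings.csv with a header row; the file name and column names here are illustrative, not from the thread:)

```python
import tensorflow as tf

# Hypothetical ratings.csv with a header row; column names are illustrative.
dataset = tf.data.experimental.make_csv_dataset(
    "ratings.csv",
    batch_size=32,
    label_name="user_rating",  # assumed label column
    num_epochs=1,
)

# Each element is (dict of feature columns, batch of labels).
for features, label in dataset.take(1):
    print(features["movie_title"], label)
```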
It might help folks answer this if you post the entire stack trace of your error. It's hard to say what's going on based on your summary of the error message.
Hi @maciejkula, there are two slightly different stack traces in the linked tickets I've included. However, here's one of them:
It looks like the error is in the map function: whatever the elements of your dataset are, they don't support indexing by a string key, which is what the map lambda is doing.
That's true; unlike with the CSV-style datasets mentioned above, the elements here are raw serialized bytes. The question is, how does one decode these bytes?

b'\ns\n\x11\n\x08movie_id\x12\x05\x1a\x03\n\x01\x01\n#\n\x0bmovie_title\x12\x14\n\x12\n\x10Toy Story (1995)\n9\n\x06genres\x12/\n-\n+Adventure|Animation|Children|Comedy|Fantasy'

Clearly this has movie_id, movie_title, and genres; similarly for the ratings data. I wonder if there's a utility method in tf somewhere.
I'm pretty sure the problem is on the writing side. What things do you expect Spark to be writing to these files? Is it writing what you think it's writing?
It's writing the right keys (column names) and the right values, but I don't know enough about the internal TFRecord format, so I can't pass judgement on the correctness of this binary blob arrangement or the encoding it's using. It seems there's some kind of disconnect between what's being written and how it's being read into the dataset. Maybe it's a TF 1.x vs. TF 2.x issue, or maybe when reading into a TF dataset, tfrecords need to pass through another conversion layer. All in all, not a critical issue for me, because now I want to load Parquet into TF datasets directly; I'm experimenting with that now. I had thought I'd have to write Spark dataframes into intermediary tfrecord files first and then load those into TF datasets, but there seems to be a more direct way to just go Parquet -> DS. That said, I think loading tfrecord files into a TF DS "should work".
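(For the Parquet-to-dataset route, tensorflow-io has a direct loader; a minimal sketch with an illustrative file name, noting that the exact API surface varies by tensorflow-io version:)

```python
import tensorflow_io as tfio

# Read a Parquet file straight into a tf.data-compatible dataset,
# skipping the tfrecord intermediary entirely.
dataset = tfio.IODataset.from_parquet("movies.parquet")

for record in dataset.take(1):
    print(record)  # a per-row structure keyed by column name
```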
Have you looked at the docs for reading TFRecord files containing tf.train.Example messages? It looks like you're skipping the deserialization step (converting the serialized Example protos back into tensors).
Oh I see, you mean this?
It seems odd that one has to load the tfrecords into a dataset, then convert it while instructing it what's in the files :) I would think that the dataset should be able to do the parsing itself, encapsulating the knowledge of how to parse these files... I'll try this out a little later.
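(For reference, the parse step being discussed looks roughly like this. This is a sketch, not the exact code from the thread: the path is illustrative, and the dtypes — movie_id as int64, the other two as strings — are guesses read off the byte dump above:)

```python
import tensorflow as tf

# Guessed feature spec; the dtypes must match what Spark actually wrote.
feature_spec = {
    "movie_id": tf.io.FixedLenFeature([], tf.int64),
    "movie_title": tf.io.FixedLenFeature([], tf.string),
    "genres": tf.io.FixedLenFeature([], tf.string),
}

def parse(serialized):
    # Turns one serialized tf.train.Example into a dict of tensors.
    return tf.io.parse_single_example(serialized, feature_spec)

raw = tf.data.TFRecordDataset("movies.tfrecord")  # illustrative path
movies = raw.map(parse)  # elements are now dicts, so x["movie_title"] works
```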
We can probably close the ticket; it may be worth mentioning in the docs that the 'parse' type of transformation is required; it's not obvious, especially to a noob :)
@dgoldenberg-audiomack I get the error "InvalidArgumentError: Feature: cold (data type: float) is required but could not be found. [Op:ParseSingleExample]" even though "cold" is clearly in the message byte string.
@Data-Jack In my particular case, the solution was to make sure I use the parse transformation with a feature spec whose types match what was actually written.
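(For what it's worth, here's a self-contained sketch of how a writer/spec dtype mismatch can produce exactly that "required but could not be found" message even when the key is present in the bytes; "cold" here is a made-up feature written as int64 but declared as float:)

```python
import tensorflow as tf

# Write "cold" as an int64 feature.
example = tf.train.Example(features=tf.train.Features(feature={
    "cold": tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
}))
serialized = example.SerializeToString()

# Declaring it as float fails with "Feature: cold (data type: float) is
# required but could not be found", even though the key is in the bytes.
try:
    tf.io.parse_single_example(
        serialized, {"cold": tf.io.FixedLenFeature([], tf.float32)})
except tf.errors.InvalidArgumentError as e:
    print(e.message)

# Declaring the dtype that was actually written parses fine.
parsed = tf.io.parse_single_example(
    serialized, {"cold": tf.io.FixedLenFeature([], tf.int64)})
```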
However, generally I've moved away from using tfrecords; at least, my current thinking is: why bother? :) TF has ways of loading CSV and Parquet data into its datasets. What I do is wrangle all the input data into CSV/Parquet using Spark, then load it into TF datasets. I have filed an issue in TF asking for better interoperability with Spark. There's also the BigDL project, which presumably allows one to distribute TF training using Spark.
@dgoldenberg-audiomack Unfortunately my data contains arrays of array fields, and I haven't found a way to get those through the tfrecord round trip.
Thanks for the tip, I will look into a pipeline that reads from Parquet. I was really hoping to just be able to write all my nested array fields to tfrecord and read them back in. I am going to settle for flattening all the arrays into one array for now, whilst I research other options (see the sketch below).
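(One possible workaround sketch for the nested-array case, under the assumption that flattening plus a lengths feature is acceptable; tf.io.RaggedFeature can then rebuild the nesting at parse time. The feature names are made up, and this needs a TF 2.x version whose parsing ops accept RaggedFeature:)

```python
import tensorflow as tf

# [[1, 2], [3]] flattened into one list, with the inner row lengths [2, 1]
# written alongside as a second feature.
example = tf.train.Example(features=tf.train.Features(feature={
    "vals": tf.train.Feature(int64_list=tf.train.Int64List(value=[1, 2, 3])),
    "lens": tf.train.Feature(int64_list=tf.train.Int64List(value=[2, 1])),
}))

# RaggedFeature re-partitions the flat values using the "lens" feature.
spec = {
    "vals": tf.io.RaggedFeature(
        tf.int64, partitions=[tf.io.RaggedFeature.RowLengths("lens")]),
}
parsed = tf.io.parse_single_example(example.SerializeToString(), spec)
print(parsed["vals"])  # expected: <tf.RaggedTensor [[1, 2], [3]]>
```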
@Data-Jack Have you considered filing an issue with tensorflow-io, or core TF? They might suggest something.
@dgoldenberg-audiomack Yeah, I will. My first guess was that it came down to how the data was being written.
Anyone seen this kind of error when trying to load TFRecords generated from Spark by the spark-tensorflow-connector or LinkedIn's spark-tfrecord library?
Error when deserializing tfrecords in TF 2.x: "Only integers, slices (:), ellipsis (...), tf.newaxis (None) and scalar tf.int32/tf.int64 tensors are valid indices"

Filed tickets there with details:
linkedin/spark-tfrecord#19
ecosystem#178

Really just doing a simple thing, using the small movielens dataset:
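(The snippet itself didn't survive the copy here, but a hedged reconstruction of the pattern that produces this exact error is indexing the raw records before parsing them; the path is illustrative:)

```python
import tensorflow as tf

# Illustrative path; the real snippet used the small movielens dataset.
train = tf.data.TFRecordDataset("movielens-small.tfrecord")

# Each element is a scalar string tensor of serialized bytes, not a dict,
# so string indexing raises the "... are valid indices" error.
ratings = train.map(lambda x: x["movie_title"])
```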