
tensorflow.contrib.data in observations #22

Open
Arvinds-ds opened this issue Sep 23, 2017 · 9 comments

Comments

@Arvinds-ds
Contributor

I am heavily using the tf.contrib.data Dataset API for image-based tasks. Since the observations functions for image datasets (LSUN, CelebA, etc.) are no more than downloaders, would it be worthwhile to return a TensorFlow dataset, something along the lines of:

lsun_bedroom_x_train = lsun('~/data', category='bedroom', set='training',
                            batch_size=32, shuffle=True)
training_data = lsun_bedroom_x_train.make_one_shot_iterator()
x_batch = training_data.get_next()  # build the op once; it yields a new batch each run
.....
for i in range(inference.n_iter):
    inference.update(....{x_ph: x_batch})
@dustinvtran
Member

dustinvtran commented Sep 26, 2017

Observations is agnostic to the user's choice of workflow, which is a deliberate design choice.

That said, it could be useful to see just how much we can push generic data loading functions that adopt a specific framework. For example, the generator functions in the README assume your workflow can store data in memory and feed numpy arrays during training. Maybe we can do the same with a generic tf.contrib.data datasets function, which can help load in some of these large data sets while still leaving some of the data-specific preprocessing to the user?
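One framework-agnostic way to help with large data sets while leaving decoding and preprocessing to the user is to return plain file paths rather than tensors. A minimal sketch of that idea (the `image_paths` helper is hypothetical, not part of observations):

```python
import os

def image_paths(directory, extensions=(".jpg", ".jpeg", ".png")):
    """Hypothetical helper: return a sorted list of image file paths
    under `directory`. The caller decides how to decode and batch them,
    e.g. via tf.contrib.data.Dataset.from_tensor_slices(paths).map(decode_fn),
    a numpy generator, or any other framework's input pipeline.
    """
    return sorted(
        os.path.join(root, name)
        for root, _, names in os.walk(directory)
        for name in names
        if name.lower().endswith(extensions)
    )
```

This keeps the library's contract small: it only promises filenames on disk, and the user's workflow supplies the framework-specific loading step.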

@Arvinds-ds
Contributor Author

Thanks. I understand the need for being agnostic. But is the expectation to be independent of TensorFlow (and consequently Edward)? If so, we should not introduce TF-specific code.

If a TF dependency is fine, then regarding the user's choice of workflow: the beauty of the Dataset API is exactly that. The dataset object can be further customized, transformed, and shuffled later by the user. Edward can handle the dataset-specific tasks of reading image names from a corresponding text file, loading images and labels as batches, etc.
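To illustrate why returning a dataset object stays workflow-agnostic: each transformation returns a new dataset, so the user keeps full control over shuffling, mapping, and batching. A toy pure-Python stand-in for that chaining style (hypothetical; not the real tf.contrib.data API):

```python
import random

class SimpleDataset:
    """Toy illustration of the chainable dataset idea: every
    transformation returns a new SimpleDataset, so whatever a loader
    like lsun(...) returns can still be customized downstream."""

    def __init__(self, items):
        self._items = list(items)

    def map(self, fn):
        # Apply fn to every element, like Dataset.map.
        return SimpleDataset(fn(x) for x in self._items)

    def shuffle(self, seed=None):
        # Return a reshuffled copy, like Dataset.shuffle.
        items = list(self._items)
        random.Random(seed).shuffle(items)
        return SimpleDataset(items)

    def batch(self, size):
        # Group consecutive elements, like Dataset.batch.
        items = self._items
        return SimpleDataset(items[i:i + size]
                             for i in range(0, len(items), size))

    def __iter__(self):
        return iter(self._items)

batches = list(SimpleDataset(range(6)).map(lambda x: x * 2).batch(3))
# -> [[0, 2, 4], [6, 8, 10]]
```

The real API adds prefetching, parallel decoding, and graph integration, but the compositional contract is the point here.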

@dustinvtran
Member

dustinvtran commented Sep 26, 2017

Agnostic to choice of workflow as in including the framework too. I think of Observations as having a longer life span than Edward or TensorFlow in that it's more likely to still be developed 5-10 years from now; it's more uncertain for computational graph and PP frameworks.

But maybe it's not possible to implement a generic tf.contrib.data utility that helps with loading across all large data sets (it likely isn't possible; I haven't thought carefully about it). If so, I can see an argument for having a TF dependency for large data sets. In the same way we use libraries like networkx for loading network data, we can rely on TF to load large data sets and possibly change it in the future.

@Arvinds-ds
Contributor Author

@dustinvtran Let me know once you crystallize your thoughts on a TF dependency for large image datasets; I can contribute code for that. I have written loading code for CelebA and CamVid, and they seem to be basic variations of similar functionality that can be abstracted to return a contrib.data dataset. I am closing this issue.

@dustinvtran
Member

I thought about it and agree with you. I think it makes sense to have celeba/lsun/etc. functions load and return objects for contrib.data.dataset.

@dustinvtran dustinvtran reopened this Sep 27, 2017
@Arvinds-ds
Contributor Author

There are common patterns for loading data, e.g. loading images listed in a text file of image names, loading images from a folder, loading labels from a text file, etc., which I am abstracting into a tf_dataset_utils.py. Where do you suggest the file should reside?
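As a rough sketch of one such pattern, parsing an index file of image names and labels is framework-independent, and only the last step touches TF. The helper name and signature below are hypothetical, not the actual tf_dataset_utils.py:

```python
def read_index_file(path, delimiter=" "):
    """Hypothetical tf_dataset_utils-style helper: parse a text file
    whose lines look like "<image_filename> <integer_label>" into
    parallel lists. The lists can then be handed to
    tf.contrib.data.Dataset.from_tensor_slices((filenames, labels))
    and decoded with a .map(...) step.
    """
    filenames, labels = [], []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            name, label = line.split(delimiter, 1)
            filenames.append(name)
            labels.append(int(label))
    return filenames, labels
```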

@dustinvtran
Member

Either as part of util.py or a new util_tf.py if you think it's substantial and dependency-ridden enough to have separately.

@dustinvtran
Member

A currently private but soon-to-be-open probabilistic programming library built on PyTorch also uses this library. We should make sure to enable other data loaders and not just TF's.

@Arvinds-ds
Contributor Author

Cool. I do love the PyTorch Dataset and DataLoader interface too.
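For context, what makes that interface easy to target is how small the contract is: a PyTorch map-style Dataset only needs `__getitem__` and `__len__`, and `torch.utils.data.DataLoader` handles batching and shuffling on top. A minimal illustration with no torch dependency (`ListDataset` is a made-up name):

```python
class ListDataset:
    """Minimal object satisfying PyTorch's map-style Dataset contract.
    Anything with __len__ and __getitem__ can be wrapped by
    torch.utils.data.DataLoader, so an observations loader could return
    (or be trivially adapted into) such an object."""

    def __init__(self, examples):
        self.examples = examples

    def __len__(self):
        # Number of examples; used by samplers to draw indices.
        return len(self.examples)

    def __getitem__(self, idx):
        # Fetch one example by index; DataLoader collates these into batches.
        return self.examples[idx]
```

Supporting both backends could then come down to wrapping the same underlying (filenames, labels) lists in either a tf dataset or a Dataset-like object.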
