Dataset.shuffle leads to worse training performance due to chunked processing #36626
Comments
I have tried in Colab with TF 2.1.0 and nightly versions. Please find the gist here. Thanks!
I am working on a "TFIndexedDataset" RFC for externally shuffling datasets whose size exceeds available memory. Let me know your preferred use case and I will consider adding it to the RFC. Thanks!
@byronyi Basically a way to use either TFDS or TFRecord files where shuffling considers the whole dataset. To be more specific: I expect a dataset to know its size and provide random access to its elements where this is possible (and I assume for most datasets it is, as they use images, videos, or other files or lines that can be counted and ordered first; I even thought TFRecord files were made for that). With these two properties, a random shuffle operation should exist which produces every element from the dataset exactly once, in a completely random order. So basically a ...
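For illustration, here is a minimal sketch (not an existing tf.data API) of the kind of full shuffle that random access plus a known size would enable; `load_example`, `num_examples`, and the in-memory stand-in array are assumptions for the sketch:

```python
import numpy as np
import tensorflow as tf

num_examples = 1000                              # assumption: dataset size is known up front
data = np.random.rand(num_examples, 28, 28)      # stand-in for a random-access store

def load_example(index):
    # assumption: random access by index; an array lookup stands in for e.g.
    # seeking into an indexed on-disk file
    return data[index.numpy()]

# Shuffle only the (cheap) index stream with a buffer covering every index:
# this yields a true permutation, so each element appears exactly once per epoch.
indices = tf.data.Dataset.range(num_examples).shuffle(
    num_examples, reshuffle_each_iteration=True)
ds = indices.map(
    lambda i: tf.py_function(load_example, [i], tf.float64),
    num_parallel_calls=tf.data.experimental.AUTOTUNE)
```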
Feature request: @Flamefire The recommended approach is to write your dataset out into multiple TFRecord files and then, when loading the data, do it like so:

```python
files = tf.data.Dataset.from_tensor_slices(filenames)
files = files.shuffle(len(filenames))
ds = files.interleave(lambda x: tf.data.TFRecordDataset(x, compression_type='GZIP').prefetch(1),
                      num_parallel_calls=tf.data.experimental.AUTOTUNE,
                      deterministic=False, cycle_length=10)
```

That way you are reading from 10 files at once (still serially within each file), feeding the records into the shuffle buffer at the appropriate point in your pipeline (see the pipeline sketch below). Each epoch you shuffle the files you are reading from, and if you have saved the dataset in more than 10 files, that changes the order in which you get the files (i.e. with 20 files, the expectation is that 5 files will be the same in the first half of epochs 1 and 2).

@jsimsa I really think adding a shuffle capability to the cache of TF Dataset would be a slam-dunk. It would definitely let me remove a lot of code from my projects that only exists to deal with TensorFlow's limited capability in shuffling ❤️
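To make the "appropriate point in your pipeline" concrete, here is a rough continuation of the interleave snippet above; `parse_example`, the feature spec, and the buffer/batch sizes are assumptions, not taken from the original comment:

```python
import tensorflow as tf

def parse_example(serialized):
    # hypothetical parser: adapt the feature spec to the actual TFRecord schema
    features = tf.io.parse_single_example(
        serialized,
        {"image": tf.io.FixedLenFeature([784], tf.float32),
         "label": tf.io.FixedLenFeature([], tf.int64)})
    return features["image"], features["label"]

# `ds` is the interleaved TFRecord dataset from the snippet above.
ds = ds.shuffle(10000)        # record-level shuffle across the interleaved files
ds = ds.map(parse_example, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds = ds.batch(32)
ds = ds.prefetch(tf.data.experimental.AUTOTUNE)
```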
@grofte am I correct to assume that you would want the functionality of ...? I don't think the ...
This is a more general API, which handles the aspect of shuffling mentioned above and could be used with any dataset that supports indexing. The main disadvantage is that it requires the number of elements of the cached dataset to be known.
You are 100% correct that an ... As I understand it, caching currently happens during the first training epoch for efficiency reasons. However, if you expect to train your model for many epochs, then performing the caching beforehand is only a small percentage increase in time and a large potential increase in model quality (and faster convergence, which reduces training time). So you wouldn't have to know the number of elements in advance.
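A rough sketch of that "cache first, then shuffle over everything" idea as it can be approximated today; `make_dataset` and the cache path are assumptions, and the full-size shuffle buffer still has to hold all elements in memory:

```python
import tensorflow as tf

def make_dataset():
    # hypothetical stand-in for an expensive-to-produce input pipeline
    return tf.data.Dataset.range(1000).map(lambda x: x * 2)

ds = make_dataset().cache("/tmp/ds_cache")

# One full pass before training both populates the on-disk cache and counts the
# elements, so the element count does not need to be known in advance.
num_elements = sum(1 for _ in ds)

# With the count known, a buffer spanning the whole dataset gives a full shuffle.
ds = ds.shuffle(num_elements, reshuffle_each_iteration=True)
```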
The proposed IndexedDataset is stored in the very same on-disk format as cached datasets, but it could be kept in remote storage without reading it through first. The ...
@grofte I am a little bit confused by your example. I would assume that if the first three indices were 100, 3, and 2000, then you would need to read 100 elements to get the first result, and then either 0 (if the read elements are cached) or 3, and then either 1900 (if the read elements are cached) or 2000. So you would either need to read 2000 elements if the read elements are cached, or 2103 elements if there is no caching -- neither of which matches your description. Note that I was asking you about what behavior you would expect of ...

@byronyi it makes sense for us to chat over VC about your proposal -- we have an internal WIP proposal for indexed datasets as well, and it would be great if we could align the two; I would be more than happy for you to lead the effort. I will reach out to you via email.
System information
Describe the current behavior
Dataset.shuffle is (essentially) described as buffering N elements and then choosing 1 out of those N to return. Hence the input data is processed in chunks. In the extreme, consider a shuffle buffer of size 2: in the first epoch, the first element returned can only be one of the first 2 input elements. And the following code (looking at the 3rd and 4th elements) produces only values of 0-4:
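The snippet itself is not preserved in this excerpt; a minimal sketch of the kind of code described (buffer size 2, looking at the 3rd and 4th output elements) would be:

```python
import tensorflow as tf

# With buffer_size=2, the k-th output element (1-indexed) can only be one of the
# first k+1 input elements, so the 3rd and 4th outputs are always drawn from 0-4.
ds = tf.data.Dataset.range(100).shuffle(2).skip(2).take(2)
print(list(ds.as_numpy_iterator()))   # e.g. [3, 2] -- never any value above 4
```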
For good training performance (in the sense that accuracy reaches high values quickly), a complete shuffling of the dataset is required. This becomes obvious if one considers (accidentally or purposely) ordered training data (in the MNIST example: first all zeros, then all ones, etc.). There will be many batches consisting mostly or entirely of a single label value, which does not work well with SGD-style optimizers.
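To make the effect concrete, here is a small sketch (synthetic labels standing in for a class-sorted MNIST) showing how a small shuffle buffer produces single-label batches:

```python
import numpy as np
import tensorflow as tf

# Labels sorted by class, as in "first all zeros, then all ones, ...".
labels = np.repeat(np.arange(10), 6000)
ds = tf.data.Dataset.from_tensor_slices(labels).shuffle(100).batch(32)

first_batch = next(iter(ds)).numpy()
print(np.unique(first_batch))   # prints [0]: the buffer has only ever seen zeros
```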
Some statistics on MNIST (validation after 10 epochs by shuffle buffer size):
As you can see, the accuracy increases with the buffer size, with everything else held constant.
Describe the expected behavior
The whole dataset should be shuffled. This requires the concept of random-access datasets. I believe the TFRecord format supports random access(?), so the shuffle operation could then draw random elements from the whole dataset.
Code to reproduce the issue
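The original reproduction script is not preserved in this excerpt; a minimal sketch of an equivalent experiment (the explicit label sorting, model, optimizer, and sizes are assumptions, not the reporter's code) could look like:

```python
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Assumption: sort the training data by label to mimic an (accidentally) ordered dataset.
order = y_train.argsort()
x_train, y_train = x_train[order], y_train[order]

for buffer_size in (2, 100, 1000, 60000):
    ds = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
          .shuffle(buffer_size)
          .batch(32))
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(ds, epochs=10, verbose=0)
    _, acc = model.evaluate(x_test, y_test, verbose=0)
    print(f"buffer_size={buffer_size}: test accuracy={acc:.4f}")
```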