Hi team, thank you for your Kafka support in tensorflow_io! I was doing this tutorial that said:
> Once all the messages are read from kafka and the latest offsets are committed using the streaming.KafkaGroupIODataset, the consumer doesn't restart reading the messages from the beginning. Thus, while training, it is possible only to train for a single epoch with the data continuously flowing in.
This doesn't seem right. Once the data is in the dataset, we should be able to run through multiple epochs. The dataset produced by the `from_kafka` method can train for multiple epochs.
If the topic has way more data than the tensorflow_io app has memory, then perhaps the tfio library could help out in that situation by resetting offsets with the Kafka admin API and re-reading data as necessary? Or I don't know what clustered map-reduce capabilities TensorFlow has. In Spark, you would read from Kafka to populate an RDD (resilient distributed dataset), which is then shared among Spark workers in a map-reduce job. In any case, offset management would be good, so that you can reset a consumer group multiple times as you tweak hyperparameters.
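To illustrate the offset-management idea, here is a minimal sketch. This is not the tfio or Kafka admin API: an in-memory list stands in for the topic and a dict stands in for the group's committed offsets, so the only thing it demonstrates is the rewind-before-each-epoch pattern that an admin-API offset reset would enable.

```python
# Hypothetical sketch: rewinding a consumer group's committed offset before
# each training epoch so the same topic data can be re-read. In real life
# the reset would go through the Kafka admin API; here everything is in-memory.

topic = list(range(10))          # stand-in for the messages in a Kafka topic
committed = {"group-1": 0}       # stand-in for the group's committed offset

def consume_all(group):
    """Read from the committed offset to the end of the log, then commit."""
    start = committed[group]
    records = topic[start:]
    committed[group] = len(topic)  # commit the latest offset
    return records

def reset_offsets(group, offset=0):
    """What an admin-API offset reset would do: rewind the group's offset."""
    committed[group] = offset

epochs = []
for epoch in range(3):
    reset_offsets("group-1")       # rewind before each epoch
    epochs.append(consume_all("group-1"))

# Without the reset, epochs 2 and 3 would come back empty; with it, every
# epoch sees the full topic.
assert all(e == topic for e in epochs)
```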
Also, perhaps the `KafkaGroupIODataset` should have a `max_records` config that specifies that you only want a certain number of records and don't want to wait indefinitely as records continuously flow into the topic (you could set `stream_timeout` to something small to avoid keeping the stream open indefinitely, but that's a bit hacky and imprecise). Or maybe start and end timestamps, if you only want to train on data in a certain time range.
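A rough sketch of what such bounds could look like. `bounded_read`, `max_records`, `start_ts`, and `end_ts` are all hypothetical names, and the input is a plain list of `(timestamp, value)` pairs rather than a real consumer; the early-exit on `end_ts` assumes records arrive in roughly timestamp order, as they do per-partition with log-append-time timestamps.

```python
def bounded_read(records, max_records=None, start_ts=None, end_ts=None):
    """Consume (timestamp, value) records until a count or time bound is hit.

    Hypothetical illustration of a max_records / time-range config:
    - stop after max_records values have been collected;
    - skip records before start_ts, stop at the first record past end_ts
      (assumes timestamps are non-decreasing, as with log-append time).
    """
    out = []
    for ts, value in records:
        if start_ts is not None and ts < start_ts:
            continue                      # before the requested window
        if end_ts is not None and ts > end_ts:
            break                         # past the window; stop reading
        out.append(value)
        if max_records is not None and len(out) >= max_records:
            break                         # count bound reached
    return out

stream = [(100, "a"), (110, "b"), (120, "c"), (130, "d")]
assert bounded_read(stream, max_records=2) == ["a", "b"]
assert bounded_read(stream, start_ts=105, end_ts=125) == ["b", "c"]
```

Either bound gives the dataset a well-defined end, so training doesn't block waiting for records that haven't been produced yet.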
Some feedback on `from_kafka`: I find it strange that it only reads a single partition. A topic may have multiple partitions for various reasons, so it's not clear what a user should do to build a dataset when there are many partitions.
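One conceivable workaround (an assumption on my part, not documented behavior) would be to build one per-partition dataset and merge them, e.g. round-robin. The sketch below shows just the merge step with plain Python lists standing in for per-partition datasets; `interleave_partitions` is a made-up name.

```python
from itertools import zip_longest

def interleave_partitions(partitions):
    """Round-robin merge of per-partition record lists.

    Hypothetical illustration: take one record from each partition in turn,
    skipping partitions that have run out, until all are exhausted. This is
    the same shape of operation tf.data expresses with Dataset.interleave.
    """
    _SKIP = object()  # sentinel for exhausted partitions
    return [record
            for batch in zip_longest(*partitions, fillvalue=_SKIP)
            for record in batch
            if record is not _SKIP]

p0 = ["a0", "a1", "a2"]   # records from partition 0
p1 = ["b0", "b1"]         # records from partition 1
assert interleave_partitions([p0, p1]) == ["a0", "b0", "a1", "b1", "a2"]
```

Whether tfio intends users to do this themselves, or `from_kafka` should accept a list of partitions, is exactly the kind of thing the docs could clarify.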
I'm more of a Kafka expert than an ML expert, so let me know if what I'm saying is completely daft. Thank you again for your work on this library!
chuck-confluent changed the title from "Support Multiple Epochs with KafkaGroupIODataset" to "Feedback on KafkaGroupIODataset and from_kafka" on Jul 15, 2022.