
Feedback on KafkaGroupIODataset and from_kafka #1695

Open
chuck-confluent opened this issue Jul 14, 2022 · 0 comments
chuck-confluent commented Jul 14, 2022

Hi team, thank you for your Kafka support in tensorflow_io! I was working through this tutorial, which says:

Once all the messages are read from kafka and the latest offsets are committed using the streaming.KafkaGroupIODataset, the consumer doesn't restart reading the messages from the beginning. Thus, while training, it is possible only to train for a single epoch with the data continuously flowing in.

This doesn't seem right. Once the data has been read into the dataset, we should be able to run through multiple epochs. The dataset produced by the from_kafka method can already be used to train for multiple epochs.
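As a sketch of the behavior I'd expect (the names here are illustrative, not tfio API): once a one-shot Kafka stream has been drained into a local buffer, that buffer can be replayed for any number of epochs, much like chaining tf.data's .cache() and .repeat():

```python
def cached_epochs(stream, num_epochs):
    """Materialize a one-shot stream once, then replay it for several epochs.

    Sketch of why multiple epochs should be possible even though the Kafka
    consumer can't rewind: buffer the records locally (like .cache()) and
    iterate the buffer once per epoch. `stream` stands in for the dataset.
    """
    buffer = list(stream)        # drain the one-shot stream exactly once
    for _ in range(num_epochs):
        yield from buffer        # replay without re-reading Kafka

one_shot = iter([10, 20, 30])    # a consumer that can't be rewound
print(list(cached_epochs(one_shot, num_epochs=2)))
# [10, 20, 30, 10, 20, 30]
```

This only works when the buffered records fit in memory, which is exactly the limitation the next paragraph is about.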

If the topic has far more data than the tensorflow_io app has memory, then perhaps the tfio library could help out by resetting offsets with the Kafka admin API and re-reading data as necessary. I also don't know what clustered map-reduce capabilities TensorFlow has. In Spark, you would read from Kafka to populate an RDD (resilient distributed dataset), which is then shared amongst Spark workers in a map-reduce job. In any case, offset management would be valuable so that you can reset a consumer group multiple times as you tweak hyperparameters.
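For reference, the offset reset I have in mind can already be done manually with Kafka's stock tooling; something like the following (broker address, group, and topic names are placeholders) rewinds a consumer group so a KafkaGroupIODataset could re-read the topic from the beginning:

```shell
# Rewind a (currently inactive) consumer group to the earliest offsets
# so the topic can be re-read for another training run. The broker
# address and the group/topic names below are placeholders.
kafka-consumer-groups --bootstrap-server localhost:9092 \
  --group tfio-training-group \
  --topic training-topic \
  --reset-offsets --to-earliest --execute
```

Note that Kafka only allows resetting offsets while no consumer in the group is active, so tfio would need to do this between dataset constructions, not mid-stream.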

Also, perhaps KafkaGroupIODataset should have a "max_records" config that specifies that you only want a certain number of records, so you don't wait indefinitely as records continuously flow into the topic. (You could set stream_timeout to something small to avoid keeping the stream open indefinitely, but that's hacky and imprecise.) Alternatively, start and end timestamps would help if you only want to train on data from a certain time range.
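The max_records semantics I have in mind can be sketched in plain Python (the names and the generator standing in for the Kafka consumer are hypothetical, not tfio API): stop after a fixed record count rather than relying solely on a timeout.

```python
import itertools
import time

def bounded_stream(records, max_records, stream_timeout_s=None):
    """Yield at most `max_records` items from an endless record stream.

    Sketch of the proposed `max_records` option for KafkaGroupIODataset:
    terminate deterministically after a record count, with the timeout
    kept only as a fallback. `records` stands in for the Kafka consumer.
    """
    deadline = (None if stream_timeout_s is None
                else time.monotonic() + stream_timeout_s)
    for record in itertools.islice(records, max_records):
        if deadline is not None and time.monotonic() > deadline:
            break                      # fallback: stop on timeout
        yield record                   # primary: stop on record count

endless = itertools.count()            # a topic that never stops producing
print(list(bounded_stream(endless, max_records=5)))
# [0, 1, 2, 3, 4]
```

With a count-based bound, the dataset size is exact and repeatable across runs, which matters when comparing hyperparameter settings.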

Some feedback on from_kafka: I find it strange that it only reads a single partition. A topic may have multiple partitions for various reasons, so it's not clear what a user should do to build a dataset covering a topic with many partitions.
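One possible shape for a whole-topic read, sketched in plain Python (the per-partition lists stand in for one from_kafka-style read per partition; nothing here is real tfio API): build one stream per partition and interleave them round-robin, similar to tf.data's interleave.

```python
def interleave_partitions(partitions):
    """Round-robin merge of per-partition record iterators.

    Sketch of a whole-topic dataset built from single-partition reads:
    pull one record from each live partition in turn, dropping partitions
    as they are exhausted. `partitions` stands in for per-partition
    Kafka reads; record ordering across partitions is not guaranteed
    by Kafka anyway, so round-robin is as good as any.
    """
    iterators = [iter(p) for p in partitions]
    while iterators:
        alive = []
        for it in iterators:
            try:
                yield next(it)
                alive.append(it)       # partition still has records
            except StopIteration:
                pass                   # partition exhausted; drop it
        iterators = alive

partitions = [[0, 3, 6], [1, 4], [2, 5, 7]]
print(list(interleave_partitions(partitions)))
# [0, 1, 2, 3, 4, 5, 6, 7]
```

Kafka only guarantees ordering within a partition, so any cross-partition merge order is acceptable for building a training dataset.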

I'm more of a Kafka expert than an ML expert, so let me know if what I'm saying is completely daft. Thank you again for your work on this library!

@chuck-confluent chuck-confluent changed the title Support Multiple Epochs with KafkaGroupIODataset Feedback on KafkaGroupIODataset and from_kafka Jul 15, 2022