
Feedback on KafkaGroupIODataset and from_kafka #1695

Open
chuck-confluent opened this issue Jul 14, 2022 · 0 comments
chuck-confluent commented Jul 14, 2022

Hi team, thank you for your Kafka support in tensorflow_io! I was working through this tutorial, which says:

Once all the messages are read from kafka and the latest offsets are committed using the streaming.KafkaGroupIODataset, the consumer doesn't restart reading the messages from the beginning. Thus, while training, it is possible only to train for a single epoch with the data continuously flowing in.

This doesn't seem right. Once the data has been read into the dataset, we should be able to run through multiple epochs. The dataset produced by the from_kafka method can already be used to train for multiple epochs.
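As a sketch of the behavior I'd expect (the names here are illustrative, not tfio API): once a one-shot Kafka stream has been drained into a local buffer, that buffer can be replayed for any number of epochs, much like chaining tf.data's .cache() and .repeat():

```python
def cached_epochs(stream, num_epochs):
    """Materialize a one-shot stream once, then replay it for several epochs.

    Sketch of why multiple epochs should be possible even though the Kafka
    consumer can't rewind: buffer the records locally (like .cache()) and
    iterate the buffer once per epoch. `stream` stands in for the dataset.
    """
    buffer = list(stream)        # drain the one-shot stream exactly once
    for _ in range(num_epochs):
        yield from buffer        # replay without re-reading Kafka

one_shot = iter([10, 20, 30])    # a consumer that can't be rewound
print(list(cached_epochs(one_shot, num_epochs=2)))
# [10, 20, 30, 10, 20, 30]
```

This only works when the buffered records fit in memory, which is exactly the limitation the next paragraph is about.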

If the topic has far more data than the tensorflow_io app has memory, then perhaps the tfio library could help out by resetting offsets with the Kafka admin API and re-reading data as necessary. I also don't know what clustered map-reduce capabilities TensorFlow has. In Spark, you would read from Kafka to populate an RDD (resilient distributed dataset), which is then shared amongst Spark workers in a map-reduce job. In any case, offset management would be valuable so that you can reset a consumer group multiple times as you tweak hyperparameters.
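For reference, the offset reset I have in mind can already be done manually with Kafka's stock tooling; something like the following (broker address, group, and topic names are placeholders) rewinds a consumer group so a KafkaGroupIODataset could re-read the topic from the beginning:

```shell
# Rewind a (currently inactive) consumer group to the earliest offsets
# so the topic can be re-read for another training run. The broker
# address and the group/topic names below are placeholders.
kafka-consumer-groups --bootstrap-server localhost:9092 \
  --group tfio-training-group \
  --topic training-topic \
  --reset-offsets --to-earliest --execute
```

Note that Kafka only allows resetting offsets while no consumer in the group is active, so tfio would need to do this between dataset constructions, not mid-stream.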

Also, perhaps KafkaGroupIODataset should have a "max_records" config that specifies that you only want a certain number of records, so you don't wait indefinitely as records continuously flow into the topic. (You could set stream_timeout to something small to avoid keeping the stream open indefinitely, but that's hacky and imprecise.) Alternatively, start and end timestamps would help if you only want to train on data from a certain time range.
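The max_records semantics I have in mind can be sketched in plain Python (the names and the generator standing in for the Kafka consumer are hypothetical, not tfio API): stop after a fixed record count rather than relying solely on a timeout.

```python
import itertools
import time

def bounded_stream(records, max_records, stream_timeout_s=None):
    """Yield at most `max_records` items from an endless record stream.

    Sketch of the proposed `max_records` option for KafkaGroupIODataset:
    terminate deterministically after a record count, with the timeout
    kept only as a fallback. `records` stands in for the Kafka consumer.
    """
    deadline = (None if stream_timeout_s is None
                else time.monotonic() + stream_timeout_s)
    for record in itertools.islice(records, max_records):
        if deadline is not None and time.monotonic() > deadline:
            break                      # fallback: stop on timeout
        yield record                   # primary: stop on record count

endless = itertools.count()            # a topic that never stops producing
print(list(bounded_stream(endless, max_records=5)))
# [0, 1, 2, 3, 4]
```

With a count-based bound, the dataset size is exact and repeatable across runs, which matters when comparing hyperparameter settings.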

Some feedback on from_kafka: I find it strange that it only reads a single partition. A topic may have multiple partitions for various reasons, so it's not clear what a user should do to build a dataset covering a topic with many partitions.
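One possible shape for a whole-topic read, sketched in plain Python (the per-partition lists stand in for one from_kafka-style read per partition; nothing here is real tfio API): build one stream per partition and interleave them round-robin, similar to tf.data's interleave.

```python
def interleave_partitions(partitions):
    """Round-robin merge of per-partition record iterators.

    Sketch of a whole-topic dataset built from single-partition reads:
    pull one record from each live partition in turn, dropping partitions
    as they are exhausted. `partitions` stands in for per-partition
    Kafka reads; record ordering across partitions is not guaranteed
    by Kafka anyway, so round-robin is as good as any.
    """
    iterators = [iter(p) for p in partitions]
    while iterators:
        alive = []
        for it in iterators:
            try:
                yield next(it)
                alive.append(it)       # partition still has records
            except StopIteration:
                pass                   # partition exhausted; drop it
        iterators = alive

partitions = [[0, 3, 6], [1, 4], [2, 5, 7]]
print(list(interleave_partitions(partitions)))
# [0, 1, 2, 3, 4, 5, 6, 7]
```

Kafka only guarantees ordering within a partition, so any cross-partition merge order is acceptable for building a training dataset.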

I'm more of a Kafka expert than an ML expert, so let me know if what I'm saying is completely daft. Thank you again for your work on this library!

@chuck-confluent chuck-confluent changed the title Support Multiple Epochs with KafkaGroupIODataset Feedback on KafkaGroupIODataset and from_kafka Jul 15, 2022