

[Data] Part of Datasources moved to _internal #46774

Closed
to3i opened this issue Jul 24, 2024 · 6 comments
Labels
data (Ray Data-related issues) · enhancement (Request for new feature and/or capability) · question (Just a question :))

Comments

@to3i
to3i commented Jul 24, 2024

What happened

In version 2.32, many datasources were moved to an _internal folder. We use some of these datasources to customize data loading in our pipeline.

What you expected to happen

Since some of these datasources (e.g., ImageDatasource) were marked with DeveloperAPI, we were expecting such changes to be announced more broadly (major release). Additionally, it is not clear what the reasoning behind this change is (the associated PR links to a google doc that is not publicly accessible) and what this implies for the future.

Will there be more breaking changes?
How is a ray user supposed to use the datasources in the future?

Versions / Dependencies

ray[data] 2.32+

Reproduction script

from ray.data.datasource import ImageDatasource

Issue Severity

Low: It annoys or frustrates me.

@to3i added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage (eg: priority, bug/not-bug, and owning component)) labels Jul 24, 2024
@anyscalesam anyscalesam added the data Ray Data-related issues label Jul 29, 2024
@scottjlee
Contributor

Our official policy:

Developer APIs are lower-level methods explicitly exposed to advanced Ray users and library developers. Their interfaces may change across minor Ray releases.

For this particular case with Datasources, we recommend using the high-level Read APIs (e.g. read_images()) when possible. Most Read APIs are PublicAPIs with higher stability. Out of curiosity, what's your use case for using the Datasource directly here?

In terms of the motivation, I don't have full context on this (couldn't find the google docs link). I will leave it to @bveeramani 's comment when he returns from vacation.

@scottjlee added the question (Just a question :)) and enhancement (Request for new feature and/or capability) labels and removed the bug (Something that is supposed to be working; but isn't) and triage (Needs triage (eg: priority, bug/not-bug, and owning component)) labels Aug 21, 2024
@bveeramani
Member

Hey @to3i, sorry about that. Breaking changes are always frustrating.

we were expecting such changes to be announced more broadly (major release).

Maybe this isn't discoverable enough, but in our documentation, we say we make breaking changes to developer APIs across minor releases: "Developer APIs are lower-level methods explicitly exposed to advanced Ray users and library developers. Their interfaces may change across minor Ray releases."

it is not clear what the reasoning behind this change is

We made datasource implementations internal for a few reasons:

  1. For consistency. Previously, datasource implementations could be internal, developer, or public. Now, they're all internal.
  2. To simplify the API surface and make the library easier to understand.
  3. To enable us to improve the library faster.

Will there be more breaking changes?

We don't have any interface changes planned for Datasource implementations as of now, although I can't guarantee anything.

How is a ray user supposed to use the datasources in the future?

If your goal is to customize data loading and you can't achieve your customization with existing APIs (e.g. read_images or read_binary_files + map), then I'd recommend copying the ImageDatasource source code.
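The read_binary_files + map route mentioned above can be sketched as follows. This is a minimal, hypothetical example: the bucket path is made up, and the UDF only derives a filename column from the path; a real pipeline would also decode row["bytes"] into an image here (e.g. with PIL).

```python
# Per-row UDF for Dataset.map(). Rows produced by read_binary_files are
# plain dicts with a "bytes" column and, when include_paths=True, a "path".
def add_filename(row: dict) -> dict:
    # Derive a filename column from the source path (illustrative only;
    # this is also where you would decode row["bytes"] into an image).
    row["filename"] = row["path"].rsplit("/", 1)[-1]
    return row

# Hypothetical usage with Ray Data (not executed here):
# import ray
# ds = ray.data.read_binary_files("s3://example-bucket/images/",
#                                 include_paths=True)
# ds = ds.map(add_filename)
```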

OOC, what were you doing with ImageDatasource that you couldn't do with read_images?

@to3i
Author
to3i commented Aug 28, 2024

Thank you for the clarification @scottjlee, @bveeramani !

We found read_images/ImageDatasource to work more efficiently than using read_binary_files + map. Thus we extended the ImageDatasource with what we needed:

  • store the original image size in a column
  • store the storage account in a column (to identify items even if you stream from multiple storage accounts because path is insufficient)

I would prefer to use the more flexible read_binary_files + map approach. Am I missing something that would improve performance? Should one increase _NUM_THREADS_PER_TASK, similar to what ImageDatasource does?
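The two custom columns described above (original image size, storage account) can be approximated in a plain map UDF instead of a Datasource subclass. A hedged sketch only: it parses PNG headers exclusively (a real UDF would use PIL to cover other formats), and it assumes Azure-style abfs://container@account.dfs.core.windows.net/... paths; both assumptions are illustrative, not how ImageDatasource itself works.

```python
import struct
from urllib.parse import urlparse

def add_metadata(row: dict) -> dict:
    # Original image size: for PNG files, width and height are stored
    # big-endian in the IHDR chunk, at bytes 16..24 of the file.
    data = row["bytes"]
    if data[:8] == b"\x89PNG\r\n\x1a\n":
        row["width"], row["height"] = struct.unpack(">II", data[16:24])
    # Storage account: for an assumed Azure-style URI such as
    # abfs://container@account.dfs.core.windows.net/dir/img.png,
    # the account is the first label of the hostname after the "@".
    host = urlparse(row["path"]).netloc.rsplit("@", 1)[-1]
    row["storage_account"] = host.split(".", 1)[0]
    return row

# Hypothetical usage (not executed here):
# ds = ray.data.read_binary_files(path, include_paths=True).map(add_metadata)
```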

BTW, is there a plan to allow FileBasedDatasource to handle multiple storage accounts? Right now I need to use one ImageDatasource per storage account and later merge them with union. While this works well for single items, loading tuples (like image and label) from multiple storage accounts becomes tricky with Ray Data. It seems that a zip(union([read_images] * num_storage), union([read_binary_files] * num_storage)) cannot be executed efficiently. Not sure if one of your issues is related to this topic: #46049
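To make the intended fan-in concrete, here is a pure-Python model of the desired semantics (no Ray involved, data made up): per-account image and label streams are concatenated (the union step) and then paired positionally (the zip step). The efficiency problem described above is that Ray currently materializes each union before zipping; this model only shows what the pipeline is meant to produce.

```python
from itertools import chain

# Two hypothetical storage accounts, each with its own image/label stream.
images_per_account = [["a1/img0", "a1/img1"], ["a2/img0"]]
labels_per_account = [["a1/lbl0", "a1/lbl1"], ["a2/lbl0"]]

# union ~ concatenation of the per-account streams ...
images = chain.from_iterable(images_per_account)
labels = chain.from_iterable(labels_per_account)

# ... then zip pairs them positionally into (image, label) tuples.
pairs = list(zip(images, labels))
```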

@bveeramani
Member

I would prefer to use the more flexible read_binary_files + map approach. I wonder if I am missing something that would improve performance? Should one increase the _NUM_THREADS_PER_TASK similar to what ImageDatasource is doing?

Hmm...not sure why read_binary_files is performing worse than read_images. I'd expect them to perform similarly.

Increasing the number of threads might be worth a shot.

BTW is there a plan to allow FileBasedDatasource to handle multiple storage accounts?

No plans at the moment. By different storage accounts, do you mean different clouds or different accounts within the same cloud?

@to3i
Author
to3i commented Aug 29, 2024

BTW is there a plan to allow FileBasedDatasource to handle multiple storage accounts?

No plans at the moment. By different storage accounts, do you mean different clouds or different accounts within the same cloud?

Within the same cloud: streaming data from multiple storage accounts in Azure, or from multiple projects in GCS. I guess multiple buckets are no issue for AWS, since no dedicated filesystem is passed. E.g., https://docs.ray.io/en/latest/data/loading-data.html#reading-files-from-cloud-storage

When loading from multiple storage accounts, we can use union to merge the different streams, but if you then want to zip label files to form a tuple, Ray first wants to materialize the union, which causes problems.

Fan-in with zip/union seems to work only for depth=1 without materialization. Do you have any suggestions?

@bveeramani
Member

Ah, gotcha.

I don't have any recommendations off the top of my head, but I'd be happy to discuss this more if you DM me on the Ray Slack.

For the time being, I'm going to close this issue because I think the question about moving the Datasource implementations has been resolved.
