
feat: no longer load full table into ram in write #2265

Closed
aersam wants to merge 24 commits

Conversation

@aersam (Contributor) commented Mar 8, 2024

Description

Well, I suffered quite a bit and am still not finished. But here's what I've learned so far:

  • If you pass the iterator from Python to Rust, it is Send but not Sync, which complicates its use inside a future. To work around this, I had to split the WriterBuilder into two structs, one for the data and one for the config (see the sketch after this list)
  • DataFusion is a pretty cool thing! You can really pick what you want from it and it uses good abstractions. I need to have a deeper look at it :)
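
Roughly what I mean by the split, as a sketch (type and field names here are illustrative, not the actual PR code):

```rust
use arrow::record_batch::RecordBatch;

/// Config half: plain data, Clone + Sync, so it can be shared
/// freely across tasks inside a future.
#[derive(Clone)]
struct WriterConfig {
    target_file_size: usize,
    partition_columns: Vec<String>,
}

/// Data half: owns the Python-backed iterator. It is Send (it can
/// move into a spawned task) but not Sync, so it is kept out of
/// anything that has to be shared by reference.
struct WriterData {
    batches: Box<dyn Iterator<Item = RecordBatch> + Send>,
}
```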

About the implementation:

Instead of doing something proper right away, I wanted to first create this PR, which basically just takes the iterable and breaks it into chunks to process. This is not ideal; it does not parallelize as well as something like channels would, but that would be a bigger change. Still a big win for large tables!

Related Issue(s)

Fixes #2255

Documentation

@aersam changed the title from "feat; no longer load full table into ram in write" to "feat: no longer load full table into ram in write" Mar 8, 2024
@github-actions bot added the binding/python (Issues for the Python package) and binding/rust (Issues for the Rust crate) labels Mar 8, 2024
github-actions bot commented Mar 8, 2024

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@ion-elgreco (Collaborator) commented

@aersam haven't checked yet, but are you streaming data to open file handles?

Or do you close the files after writing chunks?

@aersam (Author) commented Mar 8, 2024

I just create many files 🙂

@aersam (Author) commented Mar 8, 2024

Streaming would be better, but way more complicated to do, mostly because DataFusion's MemoryExec does not take a Stream but a Vec.
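
For context, a sketch of why MemoryExec forces materialization (against the DataFusion API around this time; details may differ by version):

```rust
use arrow::datatypes::SchemaRef;
use arrow::record_batch::RecordBatch;
use datafusion::error::Result;
use datafusion::physical_plan::memory::MemoryExec;

// MemoryExec is built from fully materialized partitions
// (&[Vec<RecordBatch>]), not from a stream, so the entire input
// has to sit in memory before the plan even exists.
fn memory_plan(batches: Vec<RecordBatch>, schema: SchemaRef) -> Result<MemoryExec> {
    MemoryExec::try_new(&[batches], schema, None)
}
```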

@ion-elgreco (Collaborator) commented

@aersam in that case I don't think it's the right way. With small record batches you could end up with many, many small parquet files, which would then require constantly running optimize afterwards to fix that.

@aersam (Author) commented Mar 8, 2024

Ok, true. What if I counted the bytes in a chunk and let it grow to a certain threshold?
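
Something like this sketch (the helper and threshold are made up; arrow's get_array_memory_size measures in-memory size, which only approximates the eventual file size):

```rust
use arrow::record_batch::RecordBatch;

// Accumulate batches until the estimated in-memory size crosses a
// threshold, then hand the chunk to the writer as one unit. This
// avoids one tiny parquet file per incoming batch.
fn chunk_by_bytes(
    batches: impl Iterator<Item = RecordBatch>,
    threshold: usize,
) -> Vec<Vec<RecordBatch>> {
    let mut chunks = Vec::new();
    let mut current = Vec::new();
    let mut current_bytes = 0;
    for batch in batches {
        current_bytes += batch.get_array_memory_size();
        current.push(batch);
        if current_bytes >= threshold {
            chunks.push(std::mem::take(&mut current));
            current_bytes = 0;
        }
    }
    if !current.is_empty() {
        chunks.push(current); // flush the partially filled tail chunk
    }
    chunks
}
```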

@aersam (Author) commented Mar 11, 2024

> @aersam in that case I don't think it's the right way. With small record batches you could end up with many, many small parquet files, which would then require constantly running optimize afterwards to fix that.

This is resolved now; it produces a few big files using streams.

@aersam aersam marked this pull request as ready for review March 11, 2024 15:41
@aersam (Author) commented Mar 11, 2024

I also removed lots of duplicate code; the writer was dividing data into partitions twice. Column mapping in write would also be pretty straightforward now.

@aersam (Author) commented Mar 11, 2024

I could implement From for the WriteData enum to make usage a bit simpler, if you want.
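
For illustration, with a hypothetical shape for the enum (variant names are guesses, not the PR's actual code):

```rust
use arrow::record_batch::RecordBatch;

// Hypothetical shape of the enum; variant names are illustrative.
enum WriteData {
    Batches(Vec<RecordBatch>),
    // Stream(SendableRecordBatchStream), ...
}

// With a From impl, callers can pass plain batches and let `.into()`
// do the wrapping, e.g. something like DeltaOps(table).write(batches.into()).
impl From<Vec<RecordBatch>> for WriteData {
    fn from(batches: Vec<RecordBatch>) -> Self {
        WriteData::Batches(batches)
    }
}
```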

@ion-elgreco (Collaborator) commented

@aersam I'll try to take a look tonight!

@ion-elgreco (Collaborator) commented

> > @aersam in that case I don't think it's the right way. With small record batches you could end up with many, many small parquet files, which would then require constantly running optimize afterwards to fix that.
>
> This is resolved now; it produces a few big files using streams.

So instead of passing plans over, you are now passing RecordBatchStreams?

@aersam (Author) commented Mar 12, 2024

Yep. I tried using a StreamingTable, but that one has to be Sync, which is an issue. The stream worked fine.
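
Roughly the pattern, as a sketch using DataFusion's RecordBatchStreamAdapter (exact module paths vary by version):

```rust
use arrow::datatypes::SchemaRef;
use arrow::record_batch::RecordBatch;
use datafusion::error::DataFusionError;
use datafusion::physical_plan::stream::RecordBatchStreamAdapter;
use datafusion::physical_plan::SendableRecordBatchStream;
use futures::stream;

// Adapt a plain sequence of batches into DataFusion's stream type.
// The boxed stream only has to be Send, not Sync -- unlike
// StreamingTable's partition streams, which is what blocked that route.
fn to_stream(schema: SchemaRef, batches: Vec<RecordBatch>) -> SendableRecordBatchStream {
    let s = stream::iter(batches.into_iter().map(Ok::<_, DataFusionError>));
    Box::pin(RecordBatchStreamAdapter::new(schema, s))
}
```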

@ion-elgreco (Collaborator) commented

> I could implement From for the WriteData enum to make usage a bit simpler, if you want.

Maybe do that, and have DeltaOps().write use an into.

@aersam (Author) commented Mar 12, 2024

> > I could implement From for the WriteData enum to make usage a bit simpler, if you want.
>
> Maybe do that, and have DeltaOps().write use an into.

This is done now.

ion-elgreco previously approved these changes Mar 12, 2024
@ion-elgreco (Collaborator) left a comment


Thanks for the work! @aersam

This should definitely be helpful in situations where you can send a RecordBatchReader directly.

Before merging let's have @rtyler also take a short look if he has time :)

@aersam (Author) commented Mar 13, 2024

One thing came to mind. I'm not 100% sure, but is it a good idea to do py.allow_threads in lib.rs on the Python side? Since we're iterating across language borders, I guess we need to hold the GIL, no? I pushed a commit that should resolve it, but feel free to undo it if you disagree.
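
A sketch of the concern (the wrapper name echoes the GilIterator idea; the types and method here are illustrative pyo3/pyarrow usage, not the PR's actual code):

```rust
use pyo3::prelude::*;

// If the surrounding call released the GIL via py.allow_threads,
// each pull from the Python-side reader must re-acquire it before
// touching any Python object.
struct GilIterator {
    // e.g. a pyarrow RecordBatchReader handed over from Python
    reader: PyObject,
}

impl GilIterator {
    fn next_batch(&self) -> PyResult<PyObject> {
        Python::with_gil(|py| {
            // Safe: the GIL is held for the duration of the call.
            // End of stream surfaces as an Err (pyarrow raises StopIteration).
            self.reader.call_method0(py, "read_next_batch")
        })
    }
}
```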

@ion-elgreco (Collaborator) commented

On this I'm not entirely sure. It is a Python object, and consuming a reader can only be done once, but @emcake's description of his PR suggests it's safe? #2091

@wjones127 @emcake any insights or comments on this?

@aersam (Author) commented Mar 13, 2024

> On this I'm not entirely sure. It is a Python object, and consuming a reader can only be done once, but @emcake's description of his PR suggests it's safe? #2091
>
> @wjones127 @emcake any insights or comments on this?

I think it can only be safe if pyarrow itself takes care of acquiring the GIL. I don't know whether it does.

@ion-elgreco (Collaborator) commented

If that's the case, then we need to apply the GilIterator change to merge_execute as well, since that one also wouldn't be safe anymore with your changes.

@aersam (Author) commented Mar 13, 2024

> If that's the case, then we need to apply the GilIterator change to merge_execute as well, since that one also wouldn't be safe anymore with your changes.

I dug through the pyarrow code to see whether it acquires the GIL, and it looks like it does: https://github.com/apache/arrow/blob/93816475f75d751067d4ff427fb9ae64e85acebe/python/pyarrow/src/arrow/python/ipc.cc#L39

So I'll revert the last two commits

@aersam (Author) commented Mar 13, 2024

The failure is not related to my changes, I guess?

ion-elgreco previously approved these changes Mar 13, 2024
@aersam (Author) commented Mar 21, 2024

Closing this in favor of #2289, which I'll keep up to date with the main branch.

@aersam aersam closed this Mar 21, 2024