Improve query performance by caching Parquet footers and Bloom filters #1597

Open
gaffer01 opened this issue Dec 4, 2023 · 0 comments
Labels: enhancement (New feature or request), epic

gaffer01 (Member) commented Dec 4, 2023

Background

We should investigate ways to improve query performance.

LSM stores commonly keep a Bloom filter alongside each file. When querying for a key, the Bloom filter is checked first; if it reports that the key is not in the file, the cost of opening the file and reading pages in search of the data can be skipped.
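As an illustration of the check described above, here is a minimal Python sketch of a per-file Bloom filter. It is not taken from this project: the class `BloomFilter`, the helper `files_to_open`, the SHA-256 based double hashing and the 1% default false-positive rate are all assumptions chosen for the example.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter over byte-string keys (illustrative only)."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(
            8,
            math.ceil(-expected_items * math.log(false_positive_rate) / math.log(2) ** 2),
        )
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Double hashing: derive all bit positions from two 64-bit values.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means the key is definitely absent; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def files_to_open(key: bytes, filters: dict) -> list:
    """Return only the files whose filter cannot rule the key out."""
    return [name for name, bloom in filters.items() if bloom.might_contain(key)]
```

A filter never gives a false negative, so skipping a file on a `False` answer is safe; a `True` answer may be a false positive, in which case the file is opened and the key is simply not found.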

Reading a Parquet file requires reading the footer first. If the footer could be copied from the file to a higher-performing storage layer, then a file could potentially be opened by reading the footer from that layer and the data from S3.
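To show which bytes would need to be copied, the following Python sketch locates the footer of an unencrypted Parquet file. It relies on the Parquet file layout, in which the file ends with the serialized file metadata, a 4-byte little-endian metadata length, and the magic bytes `PAR1`. The function names `footer_byte_range` and `extract_footer` are hypothetical, and the sketch works on in-memory bytes rather than S3 objects.

```python
import struct
from typing import Tuple

MAGIC = b"PAR1"
TRAILER_LEN = 8  # 4-byte little-endian metadata length + 4-byte magic


def footer_byte_range(file_size: int, last_8_bytes: bytes) -> Tuple[int, int]:
    """Return (start, end) offsets of the footer, including the 8-byte trailer."""
    if len(last_8_bytes) != TRAILER_LEN or last_8_bytes[4:] != MAGIC:
        raise ValueError("not an unencrypted Parquet file trailer")
    (metadata_len,) = struct.unpack("<I", last_8_bytes[:4])
    start = file_size - TRAILER_LEN - metadata_len
    # The file also starts with the 4-byte magic, so the footer cannot begin before it.
    if start < len(MAGIC):
        raise ValueError("metadata length is larger than the file")
    return start, file_size


def extract_footer(file_bytes: bytes) -> bytes:
    """Slice the footer (metadata plus trailer) out of a whole file held in memory."""
    start, end = footer_byte_range(len(file_bytes), file_bytes[-TRAILER_LEN:])
    return file_bytes[start:end]
```

Keeping the 8-byte trailer with the cached copy means the cached bytes can be validated and parsed in the same way as the tail of the original file.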

Description

Storing Bloom filters of the keys will be simple. For them to speed up queries, however, they must be stored somewhere that is fast to read: when the key is in the file, reading the filter must not add significantly to the overall query time, and when the key is not in the file, reading the filter must take much less time than just opening the Parquet file.
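The trade-off above can be written as a simple per-file cost model. The Python sketch below is an assumption made for illustration, not a measurement: `t_filter` is the time to read and check the filter, `t_open` is the time to open the Parquet file and look for the key, `hit_fraction` is the fraction of candidate files that really contain the key, and `false_positive_rate` is the filter's false-positive rate. All times are in the same arbitrary unit.

```python
def expected_cost_per_file(
    t_filter: float,
    t_open: float,
    hit_fraction: float,
    false_positive_rate: float,
) -> float:
    """Expected time spent on one candidate file when a Bloom filter is checked first."""
    # The file is opened on a true hit, or on a false positive for a miss.
    p_open = hit_fraction + (1.0 - hit_fraction) * false_positive_rate
    return t_filter + p_open * t_open


def filter_is_worthwhile(
    t_filter: float,
    t_open: float,
    hit_fraction: float,
    false_positive_rate: float,
) -> bool:
    """True when checking the filter first is cheaper than always opening the file."""
    return expected_cost_per_file(t_filter, t_open, hit_fraction, false_positive_rate) < t_open
```

Rearranging the inequality shows that, under this model, the filter helps only when `t_filter < (1 - hit_fraction) * (1 - false_positive_rate) * t_open`, so the slower the filter storage, the larger the share of misses needed to break even.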

We can investigate whether Parquet file footers can be stored in a higher-performing storage system than S3, reducing query time by reading the footers from there and the pages from the Parquet file in S3. The higher-performing storage system might be a lower-latency S3 storage class (e.g. https://aws.amazon.com/blogs/aws/new-amazon-s3-express-one-zone-high-performance-storage-class/), or a local NVMe drive in the case where queries are run from an EC2 server.
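One possible shape for this is a read-through cache in front of S3. The Python sketch below is only a sketch under stated assumptions: `FooterCache` is a hypothetical name, `fast_store` stands in for whichever faster layer is chosen (modelled here as a plain mapping), `fetch_from_slow_store` stands in for a ranged read of the footer from S3, and files are assumed to be immutable once written so that cached footers never need invalidating.

```python
from typing import Callable, MutableMapping


class FooterCache:
    """Read-through cache: serve footers from a fast store, fall back to the slow one."""

    def __init__(
        self,
        fast_store: MutableMapping[str, bytes],
        fetch_from_slow_store: Callable[[str], bytes],
    ):
        self._fast = fast_store
        self._fetch = fetch_from_slow_store

    def get(self, path: str) -> bytes:
        footer = self._fast.get(path)
        if footer is None:
            # Cache miss: pay the slow read once, then keep a copy in the fast store.
            footer = self._fetch(path)
            self._fast[path] = footer
        return footer
```

Data pages would still be read from the Parquet file in S3; only the footer lookup is redirected.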
