
Symbol like/ilike/regexp filters may be slow when the number of distinct values is high #4825

Closed · 1 task done
puzpuzpuz opened this issue Jul 27, 2024 · 3 comments · Fixed by #4871
Labels: Performance, regression, SQL

puzpuzpuz (Contributor) wrote:

To reproduce

In v8.1 we introduced optimized functions for the like/ilike/~ operators on symbol columns. These operators first filter the symbol table, store a list of matching int codes, and then use that list to filter the rows. It turns out that this may be slower than the old string-based function when the number of distinct symbols is large.

We should make this optimization optional, depending on the known size of the symbol table.
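To illustrate why the optimization can regress, here is a minimal Python sketch of the two strategies. This is a hypothetical model, not QuestDB's actual code: it assumes a symbol column is stored as int codes plus a table mapping each code to its distinct string value. The v8.1-style path pays an up-front cost proportional to the symbol-table size, regardless of how few rows the query actually scans.

```python
import re
import random
import string

def filter_rows_old(row_codes, symbol_table, pattern):
    """v8.0-style: resolve each row's code to a string and match it.
    Cost is O(rows scanned)."""
    rx = re.compile(pattern)
    return [c for c in row_codes if rx.search(symbol_table[c])]

def filter_rows_new(row_codes, symbol_table, pattern):
    """v8.1-style: pre-filter the entire symbol table into a set of
    matching codes, then filter rows by set membership. Cost is
    O(distinct symbols) up front plus O(rows scanned)."""
    rx = re.compile(pattern)
    matching = {code for code, s in enumerate(symbol_table) if rx.search(s)}
    return [c for c in row_codes if c in matching]

# Shapes roughly matching the report below: ~46k distinct symbols,
# ~14k rows in the queried time frame, so the symbol-table scan
# dominates the row scan by more than 3x per pattern.
random.seed(0)
symbol_table = [''.join(random.choices(string.ascii_uppercase, k=6)) + '|XRP'
                for _ in range(46_143)]
row_codes = [random.randrange(len(symbol_table)) for _ in range(14_283)]

old = filter_rows_old(row_codes, symbol_table, r'\|XRP$')
new = filter_rows_new(row_codes, symbol_table, r'\|XRP$')
assert old == new  # same result; only the work profile differs
```

A per-query heuristic along these lines could compare the symbol-table size against the estimated number of scanned rows and fall back to per-row string matching when the table is large, which is what making the optimization conditional amounts to.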

QuestDB version:

8.1

OS, in case of Docker specify Docker and the Host OS:

Linux

File System, in case of Docker specify Host File System:

ext4

Full Name:

Andrei Pechkurov

Affiliation:

QuestDB

Have you followed the Linux/macOS kernel configuration steps to increase the Maximum open files and Maximum virtual memory areas limits?

  • Yes, I have

Additional context

No response

puzpuzpuz added the SQL, Performance, and regression labels on Jul 27, 2024
bluestreak01 (Member) wrote:

What is the threshold at which this regresses? How do we repro?

puzpuzpuz (Contributor, Author) wrote:

Daniel will share additional details on Monday.

nixer89 commented Jul 29, 2024:

Hello, below you will find some details.

The database schema is:

CREATE TABLE 'xrpl_offer_exchanges' (
  pair SYMBOL capacity 131072 CACHE index capacity 131072,
  rate DOUBLE,
  volume_a DOUBLE,
  volume_b DOUBLE,
  buyer VARCHAR,
  seller VARCHAR,
  taker VARCHAR,
  provider VARCHAR,
  isAMM BOOLEAN,
  autobridged VARCHAR,
  tx_hash VARCHAR,
  tx_type VARCHAR,
  ledger_index INT,
  tx_index INT,
  offer_sequence INT,
  ts TIMESTAMP
) timestamp (ts) PARTITION BY MONTH WAL;

The query which is causing trouble is:

WITH
first_selection AS (
  SELECT
    pair,
    first(rate) AS open,
    last(rate) AS close,
    min(rate) AS low,
    max(rate) AS high,
    sum(volume_a) AS base_volume,
    sum(CASE WHEN buyer = taker THEN volume_a ELSE 0 END) AS base_volume_buy,
    sum(CASE WHEN seller = taker THEN volume_a ELSE 0 END) AS base_volume_sell,
    sum(volume_b) AS counter_volume,
    sum(CASE WHEN seller = taker THEN volume_b ELSE 0 END) AS counter_volume_buy,
    sum(CASE WHEN buyer = taker THEN volume_b ELSE 0 END) AS counter_volume_sell,
    count(*) AS exchanges,
    count_distinct(buyer) AS unique_buyers,
    count_distinct(seller) AS unique_sellers,
    last(ts) AS last_trade
  FROM xrpl_offer_exchanges
  WHERE ts >= '2024-03-25T17:53:02.932Z' AND ts <= '2024-03-26T17:53:02.932Z'
    AND ((pair NOT LIKE 'XRP|%' OR (pair LIKE 'XRP|%' AND volume_a >= 0.00001))
    AND  (pair NOT LIKE '%|XRP' OR (pair LIKE '%|XRP' AND volume_b >= 0.00001)))
),
second_selection AS (
  SELECT pair, rate AS prev_rate, ts AS prev_ts
  FROM xrpl_offer_exchanges
  WHERE ts < '2024-03-25T17:53:02.932Z'
    AND pair IN (SELECT pair FROM first_selection)
  LATEST ON ts PARTITION BY pair
)
SELECT
  first_selection.pair,
  first_selection.open,
  first_selection.close,
  first_selection.low,
  first_selection.high,
  first_selection.base_volume,
  first_selection.base_volume_buy,
  first_selection.base_volume_sell,
  first_selection.counter_volume,
  first_selection.counter_volume_buy,
  first_selection.counter_volume_sell,
  first_selection.exchanges,
  first_selection.unique_buyers,
  first_selection.unique_sellers,
  first_selection.last_trade,
  second_selection.prev_rate,
  second_selection.prev_ts
FROM first_selection
LEFT JOIN second_selection ON (pair)
WHERE first_selection.pair LIKE '%|%';

The table currently holds a total of 46143 distinct values for the pair column.
The time frame we are querying contains 507 distinct pairs and 14283 rows.

After applying the filters, 493 distinct pairs remain (and therefore 493 result rows).

The execution times for this query are:

8.0.1: 60 ms
8.1.0: 6000 ms

So performance degraded by a factor of 100.

Please let me know if you need any more data.
