
Symbol like/ilike/regexp filters may be slow when the number of distinct values is high #4825

Closed · 1 task done
puzpuzpuz opened this issue Jul 27, 2024 · 3 comments · Fixed by #4871
Labels: Performance, regression, SQL

puzpuzpuz (Contributor) wrote:

To reproduce

In v8.1 we introduced optimized functions for the like/ilike/~ operators on symbol columns. These operators first filter the symbol table, store a list of matching int codes, and then use that list to filter the rows. It turns out that this may be slower than the old string-based function when the number of distinct symbols is large.

We should make this optimization optional, depending on the known size of the symbol table.
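To illustrate why the optimization can regress, here is a minimal Python sketch of the two strategies. This is a hypothetical model, not QuestDB's actual code: it assumes a symbol column is stored as int codes plus a table mapping each code to its distinct string value. The v8.1-style path pays an up-front cost proportional to the symbol-table size, regardless of how few rows the query actually scans.

```python
import re
import random
import string

def filter_rows_old(row_codes, symbol_table, pattern):
    """v8.0-style: resolve each row's code to a string and match it.
    Cost is O(rows scanned)."""
    rx = re.compile(pattern)
    return [c for c in row_codes if rx.search(symbol_table[c])]

def filter_rows_new(row_codes, symbol_table, pattern):
    """v8.1-style: pre-filter the entire symbol table into a set of
    matching codes, then filter rows by set membership. Cost is
    O(distinct symbols) up front plus O(rows scanned)."""
    rx = re.compile(pattern)
    matching = {code for code, s in enumerate(symbol_table) if rx.search(s)}
    return [c for c in row_codes if c in matching]

# Shapes roughly matching the report below: ~46k distinct symbols,
# ~14k rows in the queried time frame, so the symbol-table scan
# dominates the row scan by more than 3x per pattern.
random.seed(0)
symbol_table = [''.join(random.choices(string.ascii_uppercase, k=6)) + '|XRP'
                for _ in range(46_143)]
row_codes = [random.randrange(len(symbol_table)) for _ in range(14_283)]

old = filter_rows_old(row_codes, symbol_table, r'\|XRP$')
new = filter_rows_new(row_codes, symbol_table, r'\|XRP$')
assert old == new  # same result; only the work profile differs
```

A per-query heuristic along these lines could compare the symbol-table size against the estimated number of scanned rows and fall back to per-row string matching when the table is large, which is what making the optimization conditional amounts to.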

QuestDB version:

8.1

OS, in case of Docker specify Docker and the Host OS:

Linux

File System, in case of Docker specify Host File System:

ext4

Full Name:

Andrei Pechkurov

Affiliation:

QuestDB

Have you followed the Linux/macOS kernel configuration steps to increase the Maximum open files and Maximum virtual memory areas limits?

  • Yes, I have

Additional context

No response

puzpuzpuz added the SQL, Performance, and regression labels on Jul 27, 2024
bluestreak01 (Member) wrote:

What is the threshold at which this regresses? How do we repro?

puzpuzpuz (Contributor, Author) wrote:

Daniel will share additional details on Monday.

nixer89 commented Jul 29, 2024:

Hello, below you will find some details.

The database schema is:

CREATE TABLE 'xrpl_offer_exchanges' (
  pair SYMBOL capacity 131072 CACHE index capacity 131072,
  rate DOUBLE,
  volume_a DOUBLE,
  volume_b DOUBLE,
  buyer VARCHAR,
  seller VARCHAR,
  taker VARCHAR,
  provider VARCHAR,
  isAMM BOOLEAN,
  autobridged VARCHAR,
  tx_hash VARCHAR,
  tx_type VARCHAR,
  ledger_index INT,
  tx_index INT,
  offer_sequence INT,
  ts TIMESTAMP
) timestamp (ts) PARTITION BY MONTH WAL;

The query which is causing trouble is:

WITH
first_selection AS (
  SELECT
    pair,
    first(rate) AS open,
    last(rate) AS close,
    min(rate) AS low,
    max(rate) AS high,
    sum(volume_a) AS base_volume,
    sum(CASE WHEN buyer = taker THEN volume_a ELSE 0 END) AS base_volume_buy,
    sum(CASE WHEN seller = taker THEN volume_a ELSE 0 END) AS base_volume_sell,
    sum(volume_b) AS counter_volume,
    sum(CASE WHEN seller = taker THEN volume_b ELSE 0 END) AS counter_volume_buy,
    sum(CASE WHEN buyer = taker THEN volume_b ELSE 0 END) AS counter_volume_sell,
    count(*) AS exchanges,
    count_distinct(buyer) AS unique_buyers,
    count_distinct(seller) AS unique_sellers,
    last(ts) AS last_trade
  FROM xrpl_offer_exchanges
  WHERE ts >= '2024-03-25T17:53:02.932Z' AND ts <= '2024-03-26T17:53:02.932Z'
    AND ((pair NOT LIKE 'XRP|%' OR (pair LIKE 'XRP|%' AND volume_a >= 0.00001))
    AND  (pair NOT LIKE '%|XRP' OR (pair LIKE '%|XRP' AND volume_b >= 0.00001)))
),
second_selection AS (
  SELECT pair, rate AS prev_rate, ts AS prev_ts
  FROM xrpl_offer_exchanges
  WHERE ts < '2024-03-25T17:53:02.932Z'
    AND pair IN (SELECT pair FROM first_selection)
  LATEST ON ts PARTITION BY pair
)
SELECT
  first_selection.pair,
  first_selection.open,
  first_selection.close,
  first_selection.low,
  first_selection.high,
  first_selection.base_volume,
  first_selection.base_volume_buy,
  first_selection.base_volume_sell,
  first_selection.counter_volume,
  first_selection.counter_volume_buy,
  first_selection.counter_volume_sell,
  first_selection.exchanges,
  first_selection.unique_buyers,
  first_selection.unique_sellers,
  first_selection.last_trade,
  second_selection.prev_rate,
  second_selection.prev_ts
FROM first_selection
LEFT JOIN second_selection ON (pair)
WHERE first_selection.pair LIKE '%|%';

The table currently holds a total of 46143 distinct values for the pair column.
The time frame we are querying contains 507 distinct pairs and 14283 rows.

After applying the filters, 493 distinct pairs remain (and therefore 493 result rows).

The execution times for this query are:

8.0.1: 60 ms
8.1.0: 6000 ms

So performance degraded by a factor of 100.

Please let me know if you need any more data.
