· 3 min read

Happy Pi Day! We thought it would be fun to calculate pi using SQL queries in ClickHouse. Here is what we came up with so far...

  1. This one uses the ClickHouse numbers_mt table function to return 1B rows and computes the result in under half a second:
SELECT 4 * sum(if(number % 2, -1, 1) / ((number * 2) + 1)) AS pi
FROM numbers_mt(1000000000.)

┌────────────────pi─┐
│ 3.141592652589797 │
└───────────────────┘

1 row in set. Elapsed: 0.432 sec. Processed 1.00 billion rows, 8.00 GB (2.32 billion rows/s., 18.53 GB/s.)
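
For reference, this query sums the Leibniz series:

$$\pi = 4 \sum_{n=0}^{\infty} \frac{(-1)^n}{2n+1}$$
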
  2. The following example processes 10B numbers, just not as quickly:
SELECT 3 + (4 * sum(if((number % 2) = 0, if((number % 4) = 0, -1 / ((number * (number + 1)) * (number + 2)), 1 / ((number * (number + 1)) * (number + 2))), 0))) AS pi
FROM numbers_mt(2, 10000000000)

┌─────────────────pi─┐
│ 3.1415926525808087 │
└────────────────────┘

1 row in set. Elapsed: 9.825 sec. Processed 10.00 billion rows, 80.00 GB (1.02 billion rows/s., 8.14 GB/s.)
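
And this one sums the Nilakantha series:

$$\pi = 3 + \frac{4}{2 \cdot 3 \cdot 4} - \frac{4}{4 \cdot 5 \cdot 6} + \frac{4}{6 \cdot 7 \cdot 8} - \cdots$$
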
  3. This one is obviously our favorite in ClickHouse (and the most accurate!):
SELECT pi()

┌──────────────pi()─┐
│ 3.141592653589793 │
└───────────────────┘

1 row in set. Elapsed: 0.008 sec.
  4. Someone knew their trigonometry with this one:
SELECT 2 * asin(1) AS pi

┌────────────────pi─┐
│ 3.141592653589793 │
└───────────────────┘

1 row in set. Elapsed: 0.005 sec.
  5. Here is a handy API that lets you specify the number of digits you want:
SELECT *
FROM url('https://api.pi.delivery/v1/pi?start=0&numberOfDigits=100', 'JSONEachRow')

┌───────────────content─┐
│ 3.1415926535897933e99 │
└───────────────────────┘

1 row in set. Elapsed: 0.556 sec.
  6. This one is clever - it uses ClickHouse distance functions to run a Monte Carlo simulation: the fraction of random points in the unit square that lie within distance 1 of the origin approaches π/4:
WITH random_points AS
    (
        SELECT (rand64(1) / pow(2, 64), rand64(2) / pow(2, 64)) AS point
        FROM numbers(1000000000)
    )
SELECT (4 * countIf(L2Norm(point) < 1)) / count() AS pi
FROM random_points


┌──────────pi─┐
│ 3.141627208 │
└─────────────┘

1 row in set. Elapsed: 4.742 sec. Processed 1.00 billion rows, 8.00 GB (210.88 million rows/s., 1.69 GB/s.)
  7. If you're a physicist, you will be content with this one:
SELECT 22 / 7

┌─────divide(22, 7)─┐
│ 3.142857142857143 │
└───────────────────┘
  8. Another indirect method (this one came from Alexey Milovidov) is accurate to 7 decimal places - and it's quick. It works by approximating the Gaussian integral, since ∫₀^∞ e^(-x²) dx = √π/2:
WITH
    10 AS length,
    (number / 1000000000.) * length AS x
SELECT pow((2 * length) * avg(exp(-(x * x))), 2) AS pi
FROM numbers_mt(1000000000.)


┌─────────────────pi─┐
│ 3.1415926890388595 │
└────────────────────┘

1 row in set. Elapsed: 1.245 sec. Processed 1.00 billion rows, 8.00 GB (803.25 million rows/s., 6.43 GB/s.)
note

If you have any more, we'd love for you to contribute. Thanks!

· 2 min read

Question: When a source table has new rows inserted into it, those new rows are also sent to all of the materialized views of that source table. Are inserts into Materialized Views performed synchronously, meaning that once the insert is acknowledged successfully from the server to the client, it means that all Materialized Views have been fully updated and available for queries?

Answer:

  1. When an INSERT succeeds, the data is inserted both to the table and all materialized views.
  2. The insert is not atomic with respect to materialized views. At the moment of time when the INSERT is in progress, concurrent clients may see the intermediate state, when the data is inserted to the main table, but not to materialized views, or vice versa.
  3. If you are using async inserts, they collect the data and perform a regular insert under the hood, returning the same type of answer to the client as for regular inserts. If the client received success from an async insert with the wait_for_async_insert option enabled (the default), the data is inserted into both the table and all of its materialized views.
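
For illustration, a minimal sketch of an async insert that waits for the data to be flushed (the events table and its columns are hypothetical; async_insert and wait_for_async_insert are standard settings):

INSERT INTO events (id, event_type)
SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES (1, 'pageview');

Once this statement returns success, the row is visible both in events and in all of its materialized views.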

Question: How about chained/cascaded materialized views?

Answer: The same rules apply - an INSERT with a successful response means that the data was inserted into every materialized view in the chain. The insert is non-atomic.

· 4 min read

Normally the max_threads setting controls the number of parallel reading threads and parallel query processing threads:

The data is read 'in order', column after column, from disk.

Asynchronous data reading

The new setting allow_asynchronous_read_from_io_pool_for_merge_tree allows the number of reading threads (streams) to be higher than the number of threads in the rest of the query execution pipeline. This speeds up cold queries on low-CPU ClickHouse Cloud services and increases performance for I/O-bound queries. When the setting is enabled, the number of reading threads is controlled by the max_streams_for_merge_tree_reading setting:

The data is read asynchronously, in parallel from different columns.

Note that there is also the max_streams_to_max_threads_ratio setting for configuring the ratio between the number of reading threads (streams) and the number of threads in the rest of the query execution pipeline. However, in benchmarks it did not help as much as the max_streams_for_merge_tree_reading setting.

What about optimize_read_in_order?

With the optimize_read_in_order optimization, ClickHouse can skip resorting data in memory if the query's sort order reflects the physical order of the data on disk, but that requires reading the data in order (in contrast to asynchronous reading):

optimize_read_in_order takes precedence over asynchronous reading

When ClickHouse sees that the optimize_read_in_order optimization can be applied, the allow_asynchronous_read_from_io_pool_for_merge_tree setting is ignored.

Example demonstrating all of the above

  • Create and load the UK Property Price Paid table

  • Check the value of max_threads (by default, the number of CPU cores that ClickHouse sees on the node executing the query):

SELECT getSetting('max_threads');


┌─getSetting('max_threads')─┐
│                        10 │
└───────────────────────────┘
  • Check the query pipeline with the default number of threads for both reading and processing the data:
EXPLAIN PIPELINE
SELECT *
FROM uk_price_paid;

┌─explain────────────────────┐
│ (Expression)               │
│ ExpressionTransform × 10   │
│ (ReadFromMergeTree)        │
│ MergeTreeThread × 10 0 → 1 │
└────────────────────────────┘
  • Check the query pipeline with 60 async reading threads and the default number of threads for the rest of the query execution pipeline:
EXPLAIN PIPELINE
SELECT *
FROM uk_price_paid
SETTINGS
    allow_asynchronous_read_from_io_pool_for_merge_tree = 1,
    max_streams_for_merge_tree_reading = 60;


┌─explain────────────────────┐
│ (Expression)               │
│ ExpressionTransform × 10   │
│ (ReadFromMergeTree)        │
│ Resize 60 → 10             │
│ MergeTreeThread × 60 0 → 1 │
└────────────────────────────┘
  • Check the query pipeline with 20 threads for both reading and processing the data:
EXPLAIN PIPELINE
SELECT *
FROM uk_price_paid
SETTINGS
    max_threads = 20;


┌─explain────────────────────┐
│ (Expression)               │
│ ExpressionTransform × 20   │
│ (ReadFromMergeTree)        │
│ MergeTreeThread × 20 0 → 1 │
└────────────────────────────┘
  • Check the query pipeline with 60 async reading threads and 20 threads for the rest of the query execution pipeline:
EXPLAIN PIPELINE
SELECT *
FROM uk_price_paid
SETTINGS
    max_threads = 20,
    allow_asynchronous_read_from_io_pool_for_merge_tree = 1,
    max_streams_for_merge_tree_reading = 60;


┌─explain────────────────────┐
│ (Expression)               │
│ ExpressionTransform × 20   │
│ (ReadFromMergeTree)        │
│ Resize 60 → 20             │
│ MergeTreeThread × 60 0 → 1 │
└────────────────────────────┘
  • Check the query pipeline with 60 async reading threads and 20 threads for the rest of the query execution pipeline when the optimize_read_in_order optimization can be applied:
EXPLAIN PIPELINE
SELECT *
FROM uk_price_paid
ORDER BY postcode1, postcode2
SETTINGS
    max_threads = 20,
    allow_asynchronous_read_from_io_pool_for_merge_tree = 1,
    max_streams_for_merge_tree_reading = 60;


┌─explain───────────────────────┐
│ (Expression)                  │
│ ExpressionTransform           │
│ (Sorting)                     │
│ MergingSortedTransform 20 → 1 │
│ (Expression)                  │
│ ExpressionTransform × 20      │
│ (ReadFromMergeTree)           │
│ MergeTreeInOrder × 20 0 → 1   │
└───────────────────────────────┘


-- note that this is equivalent to disabling allow_asynchronous_read_from_io_pool_for_merge_tree

EXPLAIN PIPELINE
SELECT *
FROM uk_price_paid
ORDER BY postcode1, postcode2
SETTINGS
    max_threads = 20,
    allow_asynchronous_read_from_io_pool_for_merge_tree = 0,
    max_streams_for_merge_tree_reading = 0;


┌─explain───────────────────────┐
│ (Expression)                  │
│ ExpressionTransform           │
│ (Sorting)                     │
│ MergingSortedTransform 20 → 1 │
│ (Expression)                  │
│ ExpressionTransform × 20      │
│ (ReadFromMergeTree)           │
│ MergeTreeInOrder × 20 0 → 1   │
└───────────────────────────────┘

-- note that you can force asynchronous reading by disabling optimize_read_in_order

EXPLAIN PIPELINE
SELECT *
FROM uk_price_paid
ORDER BY
    postcode1 ASC,
    postcode2 ASC
SETTINGS
    max_threads = 20,
    allow_asynchronous_read_from_io_pool_for_merge_tree = 1,
    max_streams_for_merge_tree_reading = 60,
    optimize_read_in_order = 0;


┌─explain───────────────────────┐
│ (Expression)                  │
│ ExpressionTransform           │
│ (Sorting)                     │
│ MergingSortedTransform 20 → 1 │
│ MergeSortingTransform × 20    │
│ (Expression)                  │
│ ExpressionTransform × 20      │
│ (ReadFromMergeTree)           │
│ Resize 60 → 20                │
│ MergeTreeThread × 60 0 → 1    │
└───────────────────────────────┘


· 2 min read

There are several ways to define a setting for a user in ClickHouse, depending on the use case and how long you want the setting to be configured. Let's look at a few scenarios...

Configure a setting for a single query

A SELECT query can contain a SETTINGS clause where you can define any number of settings. The settings are only applied for that particular query. For example:

SELECT *
FROM my_table
SETTINGS max_threads = 8;

The maximum number of threads will be 8 for this particular query.

Configure a setting for a session

You can define a setting for the lifetime of a client session using a SET clause. This is handy for ad-hoc testing or for when you want a setting to live for the lifetime of a few queries - but not longer.

SET max_threads = 8;

SELECT *
FROM my_table;

Configure a setting for a particular user

Use ALTER USER to define a setting just for one user. For example:

ALTER USER my_user_name SETTINGS max_threads = 8;

You can verify it worked by logging out of your client, logging back in, and then using the getSetting function:

SELECT getSetting('max_threads');

· One min read

In order to execute the same query on all nodes of a ClickHouse Cloud service, we can use clusterAllReplicas.

For example, in order to get entries from a (node-local) system table from all nodes, you can use:

SELECT ... FROM clusterAllReplicas(default, system.TABLE) ...;
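
For example, here is a sketch that counts the merges currently running on each node (hostName() and the system.merges table are standard; default is the cluster name used above):

SELECT hostName() AS host, count() AS running_merges
FROM clusterAllReplicas(default, system.merges)
GROUP BY host;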

Similarly, you can execute the same SYSTEM statement on all nodes with a single statement, by using the ON CLUSTER clause:

SYSTEM ... ON CLUSTER default;

For example for dropping the filesystem cache from all nodes, you can use:

SYSTEM DROP FILESYSTEM CACHE ON CLUSTER default;

· One min read

ClickHouse provides a simple and intuitive way to write filtered aggregates. For example, compare the standard SQL way to write filtered aggregates (which work fine in ClickHouse) with the shorthand syntax using the -If aggregate function combinator, which can be appended to any aggregate function:

--standard SQL
SELECT avg(number) FILTER (WHERE number > 50)
FROM numbers(100)

--ClickHouse using an aggregate combinator
SELECT avgIf(number, number > 50)
FROM numbers(100)

Similarly, there is a -Distinct aggregate combinator:

--standard SQL
SELECT avg(DISTINCT number)

--ClickHouse using an aggregate combinator
SELECT avgDistinct(number)

Why are filtered aggregates important? Because they allow you to implement the "segment comparison" feature in web analytics services. For example:

WITH
    Region = 'us' AS segment1,
    Browser = 'Chrome' AS segment2
SELECT
    uniqIf(UserID, segment1),
    uniqIf(UserID, segment2)
WHERE segment1 OR segment2

Check out the aggregate function combinator page in the docs for more details.

· One min read

ClickHouse uses threads from the Global Thread pool to process queries and also perform background operations like merges and mutations. If there is no idle thread to process a query, then a new thread is created in the pool.

The maximum size of the global thread pool is determined by the max_thread_pool_size setting, which defaults to 10,000. You can modify this value in your config - here we set it to 20,000:

<max_thread_pool_size>20000</max_thread_pool_size>

If you modify max_thread_pool_size, we recommend changing thread_pool_queue_size to be the same value. The thread_pool_queue_size setting is the maximum number of jobs that can be scheduled on the Global Thread pool:

<thread_pool_queue_size>20000</thread_pool_queue_size>

You can also free up resources if your server has a lot of idle threads - using the max_thread_pool_free_size setting. The default is 1,000, which means your Global Thread pool will never have more than 1,000 idle threads. The following example increases the value to 2,000:

<max_thread_pool_free_size>2000</max_thread_pool_free_size>
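
To observe the pool at runtime, you can query the system.metrics table; a quick sketch (the GlobalThread and GlobalThreadActive metrics exist in stock ClickHouse):

SELECT metric, value
FROM system.metrics
WHERE metric IN ('GlobalThread', 'GlobalThreadActive');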

Check out the docs for more details on the settings above and other settings that affect the Global Thread pool.

· One min read

When a user-level setting is specified in the wrong place, the server won't start and an exception message is sent to the log. However, you can tell ClickHouse to ignore the incorrect setting using the skip_check_for_incorrect_settings setting:

Add the following to config.xml:

<skip_check_for_incorrect_settings>1</skip_check_for_incorrect_settings>
note

User-level settings should be specified in users.xml inside a <profile> section for the specific user profile (or in <default> for default settings).
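
For reference, a minimal users.xml sketch with a user-level setting placed correctly inside a profile (the setting and value are just examples):

<clickhouse>
    <profiles>
        <default>
            <max_threads>8</max_threads>
        </default>
    </profiles>
</clickhouse>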

· One min read

A client can view the server logs - even at a different level than what the server log level is configured to - by setting the send_logs_level client setting.

For example, suppose the client runs:

SET send_logs_level = 'trace';

The client will receive trace logs even if the server has log level set to info.

One useful scenario is to use send_logs_level to monitor the insertion of rows into a Distributed table:

  • Enable logs in clickhouse-client using SET send_logs_level = 'trace';
  • Run your INSERT query
  • Inserts into a distributed table are asynchronous by default. The data is written into a local buffer on disk, then sent to the remote servers in the background.
  • Logs will be sent from all nodes participating in the query processing (distributed tracing)

To check the status of distributed inserts, check the system.distribution_queue table. This table contains information about local files that are in the queue to be sent to the shards. These local files contain new parts that are created by inserting new data into the Distributed table in asynchronous mode.
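
For example, a sketch that inspects the queue (these columns exist in system.distribution_queue):

SELECT database, table, is_blocked, error_count, data_files, data_compressed_bytes, last_exception
FROM system.distribution_queue;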

· One min read

Pricing

For pricing information see the ClickHouse Cloud Pricing page. To understand what can affect your bill, and ways that you can manage your spend, keep reading.

Amazon Web Services (AWS) Examples

note

Prices reflect AWS us-east-1 pricing.

Development: From $51 per month

Best for: Starter projects & staging

  • Development service
  • 16 GiB RAM, 2 vCPU
  • 1 TB Data

Pricing breakdown for this example:

            10% active   50% active   Always on
Compute     $16          $79          $158
Storage     $35          $35          $35
Total       $51          $114         $193
note

Consumption can be even lower if less than 1 TB of disk is used.

Production (Idling, Auto-scaling): From $172 per month

Best for: Cost-sensitive ad-hoc analytics applications

  • Production Service
  • Active workload ~25% time
  • Idling on with default settings
  • Auto-scaling maximum set to prevent runaway bills

Pricing breakdown for this example:

            Example 1            Example 2             Example 3
Compute     24 GiB RAM, 6 vCPU   192 GiB RAM, 48 vCPU  720 GiB RAM, 180 vCPU
            $125                 $1,000                $3,750
Storage     1 TB Data            5 TB Data             10 TB Data
            $47                  $235                  $470
Total       $172                 $1,235                $4,220

Production (Always-on, Reserved capacity): From $550 per month

Best for: Latency-sensitive applications

  • Production Service
  • Active workload ~100% time
  • Auto-scaling minimum set to reserve capacity

Pricing breakdown for this example:

            Example 1            Example 2             Example 3
Compute     24 GiB RAM, 6 vCPU   96 GiB RAM, 24 vCPU   360 GiB RAM, 90 vCPU
            $503                 $2,012                $7,545
Storage     1 TB Data            4 TB Data             8 TB Data
            $47                  $188                  $376
Total       $550                 $2,200                $7,921

For help with further estimation, please contact support if you are already a ClickHouse Cloud user, or sales@clickhouse.com otherwise.

Google Cloud Platform (GCP) Examples

note

Prices reflect GCP us-central1 pricing.

Development: From $46 per month

Best for: Starter projects & staging

  • Development service
  • 16 GiB RAM, 2 vCPU
  • 1 TB Data

Pricing breakdown for this example:

            10% active   50% active   Always on
Compute     $15          $74          $147
Storage     $31          $31          $31
Total       $46          $105         $178
note

Consumption can be even lower if less than 1 TB of disk is used.

Production (Idling, Auto-scaling): From $146 per month

Best for: Cost-sensitive ad-hoc analytics applications

  • Production Service
  • Active workload ~25% time
  • Idling on with default settings
  • Auto-scaling maximum set to prevent runaway bills

Pricing breakdown for this example:

            Example 1            Example 2             Example 3
Compute     24 GiB RAM, 6 vCPU   192 GiB RAM, 48 vCPU  720 GiB RAM, 180 vCPU
            $105                 $843                  $3,162
Storage     1 TB Data            5 TB Data             10 TB Data
            $41                  $205                  $410
Total       $146                 $1,048                $3,572

Production (Always-on, Reserved capacity): From $463 per month

Best for: Latency-sensitive applications

  • Production Service
  • Active workload ~100% time
  • Auto-scaling minimum set to reserve capacity

Pricing breakdown for this example:

            Example 1            Example 2             Example 3
Compute     24 GiB RAM, 6 vCPU   96 GiB RAM, 24 vCPU   360 GiB RAM, 90 vCPU
            $422                 $1,686                $6,342
Storage     1 TB Data            4 TB Data             8 TB Data
            $41                  $164                  $328
Total       $463                 $1,850                $6,652

For help with further estimation, please contact support if you are already a ClickHouse Cloud user, or sales@clickhouse.com otherwise.

FAQs

How is compute metered?

ClickHouse Cloud meters compute on a per-minute basis, in 8 GiB RAM increments.

How is storage on disk calculated?

ClickHouse Cloud uses cloud object storage and is metered on the compressed size of data stored in ClickHouse tables.

Do backups count toward total storage?

ClickHouse Cloud offers two free backups for production services, and one free backup for development services. Backups do not count toward storage.

How do I estimate compression?

Compression can vary quite a bit by dataset. It is dependent on how compressible the data is in the first place (number of high vs. low cardinality fields), and how the user sets up the schema (using optional codecs or not, for instance). It can be on the order of 10x for common types of analytical data, but it can be significantly lower or higher as well. See the optimizing documentation for guidance and this Uber blog for a detailed logging use case example. The only practical way to know exactly is to ingest your dataset into ClickHouse and compare the size of the dataset with the size stored in ClickHouse.

You can use the query SELECT formatReadableSize(total_bytes) FROM system.tables WHERE name = '<your_table_name>'.
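
You can also break the compression ratio down per column; a sketch using the standard system.columns table (substitute your own table name):

SELECT
    name,
    formatReadableSize(data_compressed_bytes) AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
FROM system.columns
WHERE table = '<your_table_name>'
ORDER BY data_compressed_bytes DESC;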

What tools does ClickHouse offer to estimate the cost of running a service in the cloud if I have a self-managed deployment?

The ClickHouse query log captures key metrics that can be used to estimate the cost of running a workload in ClickHouse Cloud. For details on migrating from self-managed to ClickHouse Cloud please refer to the migration documentation, and contact ClickHouse Cloud support if you have further questions.

What billing options are available for ClickHouse Cloud?

ClickHouse Cloud supports the following billing options:

  • Self-service monthly (in USD, via credit card)
  • Direct-sales annual / multi-year (through pre-paid "ClickHouse Credits", in USD, with additional payment options)

How long is the billing cycle?

Billing follows a monthly billing cycle and the start date is tracked as the date when the ClickHouse Cloud organization was created.

What controls does ClickHouse Cloud offer to manage costs for Production services?

  • Trial and Annual Commit customers are notified by automated emails when their consumption hits certain thresholds: 50%, 75%, and 90%, so that users can take action.
  • ClickHouse Cloud allows users to set a maximum auto-scaling limit on their compute (a significant cost factor for analytical workloads) via the Advanced scaling control.
  • The Advanced scaling control also lets you set memory limits with an option to control the behavior of pausing/idling during inactivity.

What controls does ClickHouse Cloud offer to manage costs for Developer services?

  • The Advanced scaling control lets you control the behavior of pausing/idling during inactivity. Adjusting memory allocation is not supported for Developer services.
  • Note that the default setting pauses the service after a period of inactivity.

If I have multiple services, do I get an invoice per service or a consolidated invoice?

A consolidated invoice is generated for all services in a given organization for a billing period.

If I add my credit card and upgrade before my trial period and credits expire, will I be charged?

When a user converts from trial to paid before the 30-day trial period ends, but with credits remaining from the trial credit allowance, we continue to draw down from the trial credits during the initial 30-day trial period, and then charge the credit card.

How can I keep track of my spending?

The ClickHouse Cloud console includes a Usage display that gives detailed information about usage per service on compute and storage. This can be used to understand the cost breakdown by metered units.

How do I access my invoice for my AWS marketplace subscription to the ClickHouse Cloud service?

All marketplace subscriptions will be billed and invoiced by AWS. You can download the invoice from the AWS Billing Dashboard.

Why do the dates on the Usage statements not match my AWS Marketplace Invoice?

AWS Marketplace billing follows the calendar month cycle. For example, for usage between 01-Dec-2022 and 01-Jan-2023, an invoice is generated between 03-Jan-2023 and 05-Jan-2023.

ClickHouse Cloud usage statements follow a different billing cycle, where usage is metered and reported over 30 days starting from the day of sign-up.

The usage and invoice dates will differ if these dates are not the same. Since usage statements track usage by day for a given service, users can rely on statements to see the breakdown of costs.

What are the best practices?

There are several areas of optimization, some of these are:

  • Batching inserts in place of frequent small-size inserts
  • Having fewer columns in tables
  • Choosing a partition key so that inserts go into fewer partitions
  • Avoiding expensive operations and types, such as mutations, OPTIMIZE FINAL, and Nullable columns