· One min read

How can I start using ClickHouse Cloud connected to my AWS/GCP/Azure account billing?

AWS

  • Log into the AWS console using your AWS account
  • Navigate to the ClickHouse Cloud listing in the AWS Marketplace
  • Click "View purchase options"
  • In the "Contract Options" section of the page, enter any number in the Units field. This will not affect the price you pay, because the price of these units for the public offering is $0. These units are usually used when accepting a private offer from ClickHouse Cloud.
  • Click "Create contract"
  • Click "Set up your account".
  • You will be redirected to a dedicated AWS Marketplace login page for ClickHouse Cloud. Complete your sign-in or sign-up on this page so that we can bind your ClickHouse Cloud organization to AWS billing:
    • If you are a new ClickHouse Cloud user, click "Register" at the bottom of the page. You will be prompted to create a new user and verify your email. After verifying your email, you can leave the ClickHouse Cloud login page and log in with the new username at https://clickhouse.cloud.
    • If you are an existing ClickHouse Cloud user, simply log in using your credentials.
  • After a successful login, a new ClickHouse Cloud organization will be created. This organization will be connected to your AWS billing account.

GCP

  • Log into the GCP console using your GCP account and your preferred project
  • Navigate to the ClickHouse Cloud listing in the GCP Marketplace
  • Click “Subscribe”
  • Select a billing account that you want to use for billing
  • Accept the additional terms and check the checkboxes
  • Click “Subscribe”
  • Click “Sign in with ClickHouse”
  • You will be redirected to a dedicated GCP Marketplace login page for ClickHouse Cloud. Complete your sign-in or sign-up on this page so that we can bind your ClickHouse Cloud organization to GCP billing:
    • If you are a new ClickHouse Cloud user, click "Register" at the bottom of the page. You will be prompted to create a new user and verify your email. After verifying your email, you can leave the ClickHouse Cloud login page and log in with the new username at https://clickhouse.cloud.
    • If you are an existing ClickHouse Cloud user, simply log in using your credentials.
  • After a successful login, a new ClickHouse Cloud organization will be created. This organization will be connected to your GCP billing account.

Azure is coming soon

How can I tell whether my organization is connected to marketplace billing?

In the ClickHouse Cloud console, navigate to Admin -> Billing. You should see the name of the marketplace and the link in the Payment details section.

I am an existing ClickHouse Cloud user. What will happen if I subscribe to ClickHouse Cloud via the AWS Marketplace?

A separate organization connected to the marketplace will be created. Your existing services and organizations will remain and they will not be connected to the marketplace billing.

You can switch between organizations in the top right corner of the ClickHouse Cloud console.

I am an existing ClickHouse Cloud user and I want my existing services to be billed via marketplace.

Please contact ClickHouse Cloud support in this case.

I subscribed as a marketplace user and now I want to unsubscribe from the ClickHouse Cloud.

Note that you can simply stop using ClickHouse Cloud and delete all existing ClickHouse Cloud services. Even though the subscription will still be active, you will not be paying anything as ClickHouse Cloud doesn't have any recurring fees.

If you want to unsubscribe, navigate to the Cloud Provider console and cancel the subscription renewal there. Once the subscription ends, all existing services will be stopped and you will be prompted to add a credit card. If no card is added, all existing services will be deleted after two weeks.

Previously I subscribed to ClickHouse Cloud as a marketplace user, then I unsubscribed, but now I want to subscribe back.

In that case please subscribe to the ClickHouse Cloud as usual (see "How can I start using ClickHouse Cloud connected to my AWS/GCP/Azure account billing?"). Note that:

  • For AWS Marketplace, a new ClickHouse Cloud organization will be created and connected to the marketplace.
  • For GCP Marketplace, your old organization will be reactivated.

If you have any trouble reactivating your marketplace organization, please contact ClickHouse Cloud Support.

How do I access my invoice for my marketplace subscription to the ClickHouse Cloud service?

Why do the dates on the Usage statements not match my Marketplace Invoice?

Marketplace billing follows the calendar month cycle. For example, for usage between December 1st and January 1st, an invoice will be generated between January 3rd and January 5th.

ClickHouse Cloud usage statements follow a different billing cycle, where usage is metered and reported over 30 days starting from the day of sign-up.

The usage and invoice dates will differ if these dates are not the same. Since usage statements track usage by day for a given service, users can rely on statements to see the breakdown of costs.
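The difference between the two cycles can be illustrated with a minimal Python sketch. It assumes that usage statements cover contiguous 30-day windows counted from the sign-up date and that both windows are inclusive; the sign-up date and the function names are hypothetical and only serve the illustration.

```python
from datetime import date, timedelta


def marketplace_period(day: date) -> tuple[date, date]:
    """Calendar-month window (first day, last day) that contains `day`."""
    start = day.replace(day=1)
    # Jump safely into the next month, snap to its first day, step back one day.
    next_month = (start.replace(day=28) + timedelta(days=4)).replace(day=1)
    return start, next_month - timedelta(days=1)


def usage_statement_period(signup: date, day: date, length_days: int = 30) -> tuple[date, date]:
    """30-day window, counted from the sign-up date, that contains `day`."""
    if day < signup:
        raise ValueError("day is before sign-up")
    index = (day - signup).days // length_days
    start = signup + timedelta(days=index * length_days)
    return start, start + timedelta(days=length_days - 1)


signup = date(2023, 12, 10)  # hypothetical sign-up date
day = date(2023, 12, 20)
print("marketplace invoice period:", marketplace_period(day))
print("usage statement period:   ", usage_statement_period(signup, day))
```

With a sign-up date in the middle of a month, the same day of usage falls into windows with different start and end dates, which is why the two documents do not line up.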

Where can I find general billing information?

Please see the billing documentation.

· One min read

The root cause of these permission errors has been identified: the new release checks grants more strictly for async_inserts.

To fix this issue, the grants for your service must be updated. Check the error message to see which grants are missing and add them manually. The additional grant required for tables using async_inserts will be either the SELECT or the dictGet grant.

Add necessary GRANTs to affected tables and dictionaries

-- Add SELECT grant permissions
GRANT SELECT ON mydb.insertTable TO insert_role WITH GRANT OPTION

-- Add dictGet grant permissions
GRANT dictGet ON mydb.insertDictionary TO insert_role

To learn more about the GRANT command, please see this page.

If you are unable to perform this change, then please contact ClickHouse Support for assistance.
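If several tables and dictionaries are affected, the statements can be generated rather than typed by hand. The following is a minimal sketch, not an official tool: `build_grants` is a hypothetical helper, and the object and role names are the placeholder names used in the example above. It does no escaping, so it should only be fed trusted identifiers.

```python
def build_grants(role, tables=(), dictionaries=()):
    """Return GRANT statements for the given role.

    Tables receive SELECT, dictionaries receive dictGet. Identifiers are
    interpolated as-is, so pass only trusted names.
    """
    statements = [f"GRANT SELECT ON {table} TO {role}" for table in tables]
    statements += [f"GRANT dictGet ON {dictionary} TO {role}" for dictionary in dictionaries]
    return statements


for statement in build_grants(
    "insert_role",
    tables=["mydb.insertTable"],
    dictionaries=["mydb.insertDictionary"],
):
    print(statement + ";")
```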

· 3 min read

Problem

Map lookup such as a['key'] works with linear complexity (mentioned here) and can be inefficient. This is because selecting a value with a specific key from a table would require iterating through all keys (~M) across all rows (N) in the Map column, resulting in ~MxN lookups.
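To make the ~MxN cost concrete, here is a toy Python model of the lookup. It is only an illustration under a simplifying assumption (each row's Map is modelled as a list of key/value pairs that is scanned from the start); it is not how ClickHouse is implemented internally, and all names are hypothetical.

```python
def map_lookup(pairs, key):
    """Linear scan over one row's Map; returns (value, comparisons made)."""
    comparisons = 0
    for k, v in pairs:
        comparisons += 1
        if k == key:
            return v, comparisons
    return None, comparisons  # key not present in this row


N, M = 1000, 50  # N rows, M keys per row
rows = [[(str(i), f"v{i}") for i in range(M)] for _ in range(N)]

# Worst case: the key we want is the last one in every row.
total = sum(map_lookup(row, str(M - 1))[1] for row in rows)
print("key comparisons with Map lookup:", total)

# If the value had been extracted into its own column, reading it
# would touch one value per row instead.
print("values read from a dedicated column:", N)
```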

A lookup using Map can be 10x slower than a lookup on a String column. The experiment below also shows a ~10x slowdown for a cold query, and a difference of several orders of magnitude in the amount of data processed (7.21 MB vs 5.65 GB).

-- create table with SpanName as String and ResourceAttributes as Map
DROP TABLE IF EXISTS tbl;
CREATE TABLE tbl (
`Timestamp` DateTime64(9) CODEC (Delta(8), ZSTD(1)),
`TraceId` String CODEC (ZSTD(1)),
`ServiceName` LowCardinality(String) CODEC (ZSTD(1)),
`Duration` UInt8 CODEC (ZSTD(1)), -- Int64
`SpanName` LowCardinality(String) CODEC (ZSTD(1)),
`ResourceAttributes` Map(LowCardinality(String), String) CODEC (ZSTD(1))
)
ENGINE = MergeTree
PARTITION BY toDate(Timestamp)
ORDER BY (ServiceName, SpanName, toUnixTimestamp(Timestamp), TraceId);

-- create UDF to generate random Map data for ResourceAttributes
DROP FUNCTION IF EXISTS genmap;
CREATE FUNCTION genmap AS (n) -> arrayMap (x-> (x::String, (x*rand32())::String), range(1, n));

-- check that genmap is working as intended
SELECT genmap(10)::Map(String, String);

-- insert 1M rows
INSERT INTO tbl
SELECT
now() - randUniform(1, 1000000.) as Timestamp,
randomPrintableASCII(2) as TraceId,
randomPrintableASCII(2) as ServiceName,
rand32() as Duration,
randomPrintableASCII(2) as SpanName,
genmap(rand64()%500)::Map(String, String) as ResourceAttributes
FROM numbers(1_000_000);

-- querying for SpanName is faster
-- [cold] 0 rows in set. Elapsed: 0.642 sec. Processed 1.00 million rows, 7.21 MB (1.56 million rows/s., 11.22 MB/s.)
-- [warm] 0 rows in set. Elapsed: 0.164 sec. Processed 1.00 million rows, 7.21 MB (6.10 million rows/s., 43.99 MB/s.)
SELECT
COUNT(*),
avg(Duration/1E6) as average,
quantile(0.95)(Duration/1E6) as p95,
quantile(0.99)(Duration/1E6) as p99,
SpanName
FROM tbl
GROUP BY SpanName ORDER BY 1 DESC LIMIT 50 FORMAT Null;

-- query for ResourceAttributes is slower
-- [cold] 0 rows in set. Elapsed: 6.432 sec. Processed 1.00 million rows, 5.65 GB (155.46 thousand rows/s., 879.07 MB/s.)
-- [warm] 0 rows in set. Elapsed: 5.935 sec. Processed 1.00 million rows, 5.65 GB (168.50 thousand rows/s., 952.81 MB/s.)
SELECT
COUNT(*),
avg(Duration/1E6) as average,
quantile(0.95)(Duration/1E6) as p95,
quantile(0.99)(Duration/1E6) as p99,
ResourceAttributes['1'] as hostname
FROM tbl
GROUP BY hostname ORDER BY 1 DESC LIMIT 50 FORMAT Null;

Solution

To improve the query, we can add another column whose value defaults to a particular key in the Map column, and then materialize it to populate the value for existing rows. This way, we extract and store the necessary value at insertion time, thereby speeding up the lookup at query time.

-- solution is to add a column with value defaulting to a particular key in Map
ALTER TABLE tbl ADD COLUMN hostname LowCardinality(String) DEFAULT ResourceAttributes['1'];
ALTER TABLE tbl MATERIALIZE COLUMN hostname;

-- query for hostname (new column) is now faster
-- [cold] 0 rows in set. Elapsed: 2.215 sec. Processed 1.00 million rows, 21.67 MB (451.52 thousand rows/s., 9.78 MB/s.)
-- [warm] 0 rows in set. Elapsed: 0.541 sec. Processed 1.00 million rows, 21.67 MB (1.85 million rows/s., 40.04 MB/s.)
SELECT
COUNT(*),
avg(Duration/1E6) as average,
quantile(0.95)(Duration/1E6) as p95,
quantile(0.99)(Duration/1E6) as p99,
hostname
FROM tbl
GROUP BY hostname ORDER BY 1 DESC LIMIT 50 FORMAT Null;

-- drop cache to run query cold
SYSTEM DROP FILESYSTEM CACHE;
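The timings reported in the comments above can be compared with a few lines of arithmetic. This sketch only restates the numbers from this article; it assumes decimal units (5.65 GB taken as 5650 MB), and the dictionary keys are hypothetical labels.

```python
# Numbers copied from the query comments in this article.
timings = {
    "string_column": {"cold_s": 0.642, "warm_s": 0.164, "read_mb": 7.21},
    "map_lookup": {"cold_s": 6.432, "warm_s": 5.935, "read_mb": 5650.0},  # 5.65 GB
    "materialized_column": {"cold_s": 2.215, "warm_s": 0.541, "read_mb": 21.67},
}


def ratio(slow, fast, field):
    """How many times larger `field` is for `slow` than for `fast`."""
    return timings[slow][field] / timings[fast][field]


for baseline in ("string_column", "materialized_column"):
    print(f"map_lookup vs {baseline}:")
    for field in ("cold_s", "warm_s", "read_mb"):
        print(f"  {field}: {ratio('map_lookup', baseline, field):.1f}x")
```

The materialized column does not fully match the original String column in this experiment, but it reads far less data than the Map lookup.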

· 2 min read

The short answer is “yes”. ClickHouse has multiple mechanisms that allow freeing up disk space by removing old data. Each mechanism is aimed at a different scenario.

TTL

ClickHouse can automatically drop values when a condition is met. The condition is configured as an expression based on any columns, usually just a static offset from a timestamp column.

The key advantage of this approach is that it does not need any external system to trigger it: once TTL is configured, data removal happens automatically in the background.

note

TTL can be used not only to delete data (move it to /dev/null), but also to move it between different storage systems, for example from SSD to HDD.

More details on configuring TTL.

DELETE FROM

DELETE FROM allows standard DELETE queries to be run in ClickHouse. The rows targeted in the filter clause are marked as deleted, and removed from future result sets. Cleanup of the rows happens asynchronously.

note

DELETE FROM is generally available from version 23.3 and newer. On older versions, it is experimental and must be enabled with:

SET allow_experimental_lightweight_delete = true;

ALTER DELETE

ALTER DELETE removes rows using asynchronous batch operations. Unlike DELETE FROM, queries run after the ALTER DELETE and before the batch operations complete will include the rows targeted for deletion. For more details see the ALTER DELETE docs.

ALTER DELETE can be issued to flexibly remove old data. If you need to do it regularly, the main downside is the need for an external system to submit the query. There are also some performance considerations, since mutations rewrite complete parts even if there is only a single row to be deleted.

This is the most common approach to making a system based on ClickHouse GDPR-compliant.

More details on mutations.

DROP PARTITION

ALTER TABLE ... DROP PARTITION provides a cost-efficient way to drop a whole partition. It is less flexible and needs a proper partitioning scheme configured at table creation, but it still covers most common cases. Like mutations, it needs to be executed from an external system for regular use.

More details on manipulating partitions.

TRUNCATE

It’s rather radical to drop all data from a table, but in some cases it might be exactly what you need.

More details on table truncation.
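For regular cleanup driven by an external system, the statements for each mechanism can be generated from a small script. The sketch below only builds the SQL text and does not send it anywhere; the helper names, the `events` table, the `event_time` column, and the partition value are all hypothetical, and identifiers are interpolated without escaping.

```python
def ttl_statement(table, timestamp_column, days):
    """TTL: rows expire automatically `days` after `timestamp_column`."""
    return f"ALTER TABLE {table} MODIFY TTL {timestamp_column} + INTERVAL {days} DAY"


def lightweight_delete(table, condition):
    """DELETE FROM: rows are hidden from results, cleaned up asynchronously."""
    return f"DELETE FROM {table} WHERE {condition}"


def mutation_delete(table, condition):
    """ALTER DELETE: a mutation that rewrites the affected parts."""
    return f"ALTER TABLE {table} DELETE WHERE {condition}"


def drop_partition(table, partition_expr):
    """DROP PARTITION: `partition_expr` must match the table's partition key."""
    return f"ALTER TABLE {table} DROP PARTITION {partition_expr}"


def truncate(table):
    """TRUNCATE: removes all data from the table."""
    return f"TRUNCATE TABLE {table}"


cutoff = "event_time < now() - INTERVAL 90 DAY"
for statement in (
    ttl_statement("events", "event_time", 90),
    lightweight_delete("events", cutoff),
    mutation_delete("events", cutoff),
    drop_partition("events", "'2023-01-01'"),  # assumes PARTITION BY toDate(...)
    truncate("events"),
):
    print(statement + ";")
```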

· 2 min read

Question

How can I tell if a projection is used?

Answer

  1. Create a sample database
CREATE database db1;
  2. Create a sample table that will use column1 as the primary key
CREATE table db1.table1_projections
(
column1 Int32,
column2 Int32
)
engine = MergeTree()
order by column1;
  3. Add a projection for_column2 to use column2 as the primary key
ALTER table db1.table1_projections add projection for_column2
(
select *
order by column2
);
  4. Insert test data

*this inserts 100000 rows with random numbers in column1 and column2

INSERT INTO db1.table1_projections
select
floor(randNormal(50, 5)) as column1,
floor(randUniform(1, 100)) as column2
from numbers(100000);
  5. Check a sample set of the data
clickhouse-cloud :) SELECT * from db1.table1_projections limit 5;

SELECT *
FROM db1.table1_projections
LIMIT 5

Query id: d6940799-b507-4a5e-9843-df55ebe818ab

┌─column1─┬─column2─┐
│ 28 │ 41 │
│ 29 │ 12 │
│ 30 │ 73 │
│ 30 │ 75 │
│ 30 │ 70 │
└─────────┴─────────┘
  6. Test that it is using the original table with column1:
clickhouse-cloud :) explain indexes = 1 
SELECT count() from db1.table1_projections where column1 > 50;

EXPLAIN indexes = 1
SELECT count()
FROM db1.table1_projections
WHERE column1 > 50

Query id: e04d5236-1a05-4f1f-9502-7e41986beb44

┌─explain────────────────────────────────────────────┐
│ Expression ((Projection + Before ORDER BY)) │
│ Aggregating │
│ Expression (Before GROUP BY) │
│ Filter (WHERE) │
│ ReadFromMergeTree (db1.table1_projections) │
│ Indexes: │
│ PrimaryKey │
│ Condition: true │
│ Parts: 1/1 │
│ Granules: 12/12 │
└────────────────────────────────────────────────────┘

*notice that it is reading from db1.table1_projections

  7. Test reading from the projection by using column2 in the where clause
clickhouse-cloud :) explain indexes = 1 
SELECT * from db1.table1_projections where column2 > 50;

EXPLAIN indexes = 1
SELECT *
FROM db1.table1_projections
WHERE column2 > 50

Query id: d2b20e01-93bf-4b60-a370-4aac7b454267

┌─explain─────────────────────────────────────┐
│ Expression ((Projection + Before ORDER BY)) │
│ Filter │
│ ReadFromMergeTree (for_column2) │
│ Indexes: │
│ PrimaryKey │
│ Keys: │
│ column2 │
│ Condition: (column2 in [51, +Inf)) │
│ Parts: 1/1 │
│ Granules: 6/12 │
└─────────────────────────────────────────────┘

*notice that now the for_column2 projection is used.
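The same check can be scripted by looking at which name follows ReadFromMergeTree in the EXPLAIN text. This is a minimal sketch that relies on the output format shown in this article; other ClickHouse versions may print the plan differently, and the function names are hypothetical. The sample plans are abbreviated copies of the output above.

```python
import re


def read_source(explain_text):
    """Return the name inside 'ReadFromMergeTree (...)', or None if absent."""
    match = re.search(r"ReadFromMergeTree \(([^)]+)\)", explain_text)
    return match.group(1) if match else None


def uses_projection(explain_text, projection_name):
    """True if the plan reads from the named projection rather than the base table."""
    return read_source(explain_text) == projection_name


original_plan = """
Expression ((Projection + Before ORDER BY))
  Aggregating
    Expression (Before GROUP BY)
      Filter (WHERE)
        ReadFromMergeTree (db1.table1_projections)
"""

projection_plan = """
Expression ((Projection + Before ORDER BY))
  Filter
    ReadFromMergeTree (for_column2)
"""

print(read_source(original_plan), uses_projection(original_plan, "for_column2"))
print(read_source(projection_plan), uses_projection(projection_plan, "for_column2"))
```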

For more info

Projections: https://clickhouse.com/docs/en/sql-reference/statements/alter/projection

numbers table function: https://clickhouse.com/docs/en/sql-reference/table-functions/numbers

Blog for generating random data: https://clickhouse.com/blog/generating-random-test-distribution-data-for-clickhouse

· 2 min read

Can I just run HTTP requests against a ClickHouse server using the requests module?

Answer

Yes, here is the sample code.

import datetime

import requests

create_replace_stmt = 'CREATE OR REPLACE TABLE test_table (name String, age UInt8) Engine=MergeTree ORDER BY tuple();'
select_query = 'SELECT count() FROM test_table'
insert_query = 'INSERT INTO test_table SELECT * FROM generateRandom(\'name String, age UInt8\',1,1) LIMIT 300000000'

CH_URL = 'https://your_clickhouse_service_fqdn:8443'
CH_USER = 'default'
CH_PASSWORD = 'secret_pwd'

# credentials are passed as HTTP headers
headers = {
    "X-ClickHouse-User": CH_USER,
    "X-ClickHouse-Key": CH_PASSWORD,
}


def log(message):
    """Print a message prefixed with the current timestamp."""
    print("{} - {}".format(datetime.datetime.now(), message))


def run_query(query, **extra_params):
    """POST a query to ClickHouse in a shared session and return the response."""
    params = {"database": "default",
              "query": query,
              "session_id": "my-session-id-string"}
    params.update(extra_params)
    return requests.post(url=CH_URL, params=params, headers=headers)


log("starting...")

# create/replace table
log("creating/replacing table...")
run_query(create_replace_stmt)

# select count()
response = run_query(select_query)
log("elements in test_table before insert: {}".format(
    response.content.decode('utf-8')))

# insert
log("Inserting data...")
run_query(insert_query, wait_end_of_query=1)
log("Done inserting data...")

# select count() again
response = run_query(select_query)
log("elements in test_table after insert: {}".format(
    response.content.decode('utf-8')))

Sample expected output:

(venv) ➜  venv/bin/python main.py
2023-07-07 14:54:27.336450 - starting...
2023-07-07 14:54:27.336476 - creating/replacing table...
2023-07-07 14:54:28.125270 - elements in test_table before insert: 0

2023-07-07 14:54:28.125352 - Inserting data...
2023-07-07 14:55:23.788466 - Done inserting data...
2023-07-07 14:55:23.962134 - elements in test_table after insert: 299115357

requirements.txt

requests==2.31.0

See this other KB for steps on how to set up your Python venv.

· One min read

How do I enforce a time limit on my queries?

Answer

You can use the max_execution_time setting:

clickhouse-cloud :) SELECT 1 SETTINGS max_execution_time=0.0001

SELECT 1
SETTINGS max_execution_time = 0.0001

Query id: 3db752a7-b94f-4456-b3b9-ccbf290d1394


0 rows in set. Elapsed: 0.113 sec.

Received exception from server (version 23.5.1):
Code: 159. DB::Exception: Received from service.aws.clickhouse.cloud:9440. DB::Exception: Timeout exceeded: elapsed 0.000557862 seconds, maximum: 0.0001. (TIMEOUT_EXCEEDED)
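When queries are sent over the HTTP interface, settings can be passed as URL parameters, so the same limit can be applied from a script. The sketch below only builds the URL and recognizes the timeout error text; it does not contact a server. The host name is a placeholder, and the helper names are hypothetical.

```python
from urllib.parse import urlencode


def build_query_url(base_url, query, max_execution_time):
    """Build an HTTP-interface URL that applies max_execution_time (seconds)."""
    params = {"query": query, "max_execution_time": max_execution_time}
    return f"{base_url}/?{urlencode(params)}"


def is_timeout(error_text):
    """True if a server error message reports an exceeded execution time."""
    return "TIMEOUT_EXCEEDED" in error_text


url = build_query_url("https://your_clickhouse_service_fqdn:8443", "SELECT 1", 5)
print(url)

sample_error = (
    "Code: 159. DB::Exception: Timeout exceeded: "
    "elapsed 0.000557862 seconds, maximum: 0.0001. (TIMEOUT_EXCEEDED)"
)
print(is_timeout(sample_error))
```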

· One min read

The short answer is "yes". However, we recommend keeping latency between all regions/datacenters in the two-digit millisecond range; otherwise write performance will suffer, because writes go through a distributed consensus protocol. For example, replication between US coasts will likely work fine, but replication between the US and Europe won't.

Configuration-wise there's no difference compared to single-region replication; simply use hosts that are located in different locations for replicas.

For more information, see the full article on data replication.

· 2 min read

A columnar database stores the data of each column independently. This allows reading data from disk only for those columns that are used in any given query. The cost is that operations that affect whole rows become proportionally more expensive. The synonym for a columnar database is a column-oriented database management system. ClickHouse is a typical example of such a system.

Columnar databases are particularly well suited to:

  • Queries that use only a few columns out of many.
  • Aggregating queries against large volumes of data.
  • Column-wise data compression.

Here is the illustration of the difference between traditional row-oriented systems and columnar databases when building reports:

[Figure: Traditional row-oriented]

[Figure: Columnar]

A columnar database is the preferred choice for analytical applications because it allows a table to have many columns "just in case" without paying for unused columns at query execution time. (A traditional OLTP database reads all of the data during queries because the data is stored in rows, not columns.) Column-oriented databases are designed for big data processing and data warehousing, and they often scale natively using distributed clusters of low-cost hardware to increase throughput. ClickHouse does this with a combination of distributed and replicated tables.
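The read-cost difference can be illustrated with a toy Python model. It is a deliberately simplified sketch (in-memory lists, counting values rather than bytes, no compression), and the table contents and function names are made up for the illustration.

```python
# The same small table in a row-oriented layout...
rows = [
    {"id": 1, "country": "US", "clicks": 10, "revenue": 1.5},
    {"id": 2, "country": "DE", "clicks": 20, "revenue": 3.0},
    {"id": 3, "country": "US", "clicks": 5, "revenue": 0.5},
]

# ...and in a column-oriented layout: one independent list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def values_read_row_store(rows):
    """A row store reads whole rows, whatever columns the query needs."""
    return sum(len(row) for row in rows)


def values_read_column_store(columns, wanted):
    """A column store reads only the columns the query refers to."""
    return sum(len(columns[name]) for name in wanted)


# Query: total number of clicks. Both layouts give the same answer...
assert sum(row["clicks"] for row in rows) == sum(columns["clicks"])

# ...but touch a different amount of data to get there.
print("row store values read:   ", values_read_row_store(rows))
print("column store values read:", values_read_column_store(columns, ["clicks"]))
```

The gap grows with the number of columns in the table, which is why wide analytical tables benefit most from a columnar layout.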