Streaming real-time data into Snowflake using Kinesis Streams
Gennadiy Gashev
Solutions architect
Significant growth of a product’s user base always creates challenges for data engineering teams. The volume of events produced by millions (or billions) of users makes it almost impossible to use standard ingestion solutions as is; the pipeline always has to be tuned for the particular situation.
The real-life example (with code) in this article shares our experience of building an efficient real-time data streaming pipeline for the World project. The architecture is based on AWS Kinesis and Snowflake, both acknowledged industry leaders in data processing services.
High-level architecture
We will create an AWS CDK stack that streams data into a Snowflake table using Kinesis Streams and Kinesis Data Firehose. This solution allows ingesting data from different accounts and regions into a Snowflake table in real time.
Amazon Kinesis Data Streams is a scalable and durable real-time data streaming service that can continuously capture gigabytes of data per second from hundreds of sources. Snowflake is a cloud-based data warehousing platform that allows you to store and analyze large volumes of data.
There are two ways to load data into Snowflake tables: Snowpipe and Snowpipe Streaming. Snowpipe loads data from files in micro-batches: essentially, streams are aggregated, written to interim storage, and then loaded into Snowflake. For a streaming workload this is highly inefficient, adds several minutes of latency, and increases costs.
To optimize cost and latency further, and to keep the setup simple, we introduce Firehose into the approach. Firehose with Snowpipe Streaming writes individual rows of data directly into tables, delivering records as soon as they become available; in turn, the data becomes queryable in Snowflake within seconds.
Snowflake setup
We will need a Snowflake user with permissions to insert data into the table, and key pair authentication configured for that user. Follow this guide; it should be relatively simple.
Next, create a secret in AWS Secrets Manager containing this private key. This secret will be used by the Kinesis Data Firehose delivery stream to authenticate with Snowflake.
aws secretsmanager create-secret \
--name Snowflake/PrivateKey \
--secret-string file://path/to/private-key-file
AWS setup
Let’s build a functional pipeline for streaming real-time data from Kinesis into Snowflake. We will need to set up a few resources, as well as IAM policies.
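All of the snippets below assume they live inside the constructor of a single CDK stack. Here is a minimal skeleton (the stack and file names are assumptions) that the following sections build on:

import * as cdk from 'aws-cdk-lib'
import {Construct} from 'constructs'

// Hypothetical stack skeleton; the KMS key, Kinesis stream, and Firehose
// delivery stream from the following sections are defined in the constructor.
export class SnowflakeStreamingStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props)
    // ...resources from the sections below go here...
  }
}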
KMS Key
We will start by creating a new customer-managed KMS key (CMK) for the Kinesis stream. This key will be used to encrypt the stream and to share access with the data producers. After creation, we will add a resource policy to the key that allows access to the key metadata for our account, and allows use of the key through Amazon Kinesis for all principals in the account that are authorized to use Amazon Kinesis.
Alternatively, you can use the default AWS-managed KMS key to encrypt the stream, but the AWS-managed key restricts access to the stream to resources in the same account only.
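For reference, a stream encrypted with the AWS-managed key would be a one-liner; this is just a sketch, and the construct id is an assumption:

// Uses the AWS-managed key aws/kinesis; only resources in this account can use the stream.
const managedKeyStream = new cdk.aws_kinesis.Stream(this, 'ManagedKeyStream', {
  encryption: cdk.aws_kinesis.StreamEncryption.MANAGED,
})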
const encryptionKey = new cdk.aws_kms.Key(this, 'Key', {})

encryptionKey.addToResourcePolicy(
  new cdk.aws_iam.PolicyStatement({
    sid: 'Allow direct access to key metadata to the account',
    effect: cdk.aws_iam.Effect.ALLOW,
    principals: [new cdk.aws_iam.AccountRootPrincipal()],
    actions: ['kms:Describe*', 'kms:Get*', 'kms:List*'],
    resources: ['*'],
  }),
)

encryptionKey.addToResourcePolicy(
  new cdk.aws_iam.PolicyStatement({
    sid: 'Allow use of the key through Amazon Kinesis for principals in the account that are authorized to use Amazon Kinesis',
    effect: cdk.aws_iam.Effect.ALLOW,
    principals: [new cdk.aws_iam.AnyPrincipal()],
    actions: ['kms:Encrypt', 'kms:Decrypt', 'kms:ReEncrypt*', 'kms:GenerateDataKey*', 'kms:DescribeKey'],
    resources: ['*'],
    conditions: {
      StringEquals: {
        'kms:ViaService': `kinesis.${cdk.Stack.of(this).region}.amazonaws.com`,
        'kms:CallerAccount': cdk.Stack.of(this).account,
      },
    },
  }),
)
Kinesis stream
The creation of the Kinesis stream is straightforward. We will create a new Kinesis stream with default parameters and use the previously created KMS key for encryption.
The tricky part is setting up cross-account access to the stream for producers. When the data producers and the Kinesis stream are in the same account, you can grant access simply with the .grantWrite() method in CDK.
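As a sketch of the same-account case, assuming a hypothetical producer role and the stream created further below:

// Hypothetical producer in the same account, e.g. an ECS task that publishes events.
const producerRole = new cdk.aws_iam.Role(this, 'ProducerRole', {
  assumedBy: new cdk.aws_iam.ServicePrincipal('ecs-tasks.amazonaws.com'),
})

// grantWrite attaches kinesis:PutRecord/PutRecords to the role and, because the
// stream is encrypted with our CMK, also grants use of the key for encryption.
stream.grantWrite(producerRole)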
This does not work for cross-account producers, so you need to add a resource-based policy to the stream.
Unfortunately, the AWS CDK does not provide a high-level construct for adding resource-based policies to a Kinesis stream. We could fall back to the AWS CLI or AWS SDK, but a more elegant way is to write an AWS custom resource in CDK. Custom resources allow you to call AWS SDK methods during the deployment of the stack.
const stream = new cdk.aws_kinesis.Stream(this, 'Stream', {encryptionKey})

new cdk.custom_resources.AwsCustomResource(this, 'Resource', {
  onUpdate: {
    service: 'Kinesis',
    action: 'PutResourcePolicy',
    parameters: {
      ResourceARN: stream.streamArn,
      Policy: JSON.stringify({
        Version: '2012-10-17',
        Statement: [
          {
            Sid: 'Allow cross-account write access to the stream',
            Effect: 'Allow',
            Principal: {AWS: 'arn:aws:iam::123456789012:role/data-producer-example'},
            Action: ['kinesis:PutRecord', 'kinesis:PutRecords', 'kinesis:DescribeStreamSummary', 'kinesis:ListShards'],
            Resource: stream.streamArn,
          },
        ],
      }),
    },
    physicalResourceId: cdk.custom_resources.PhysicalResourceId.of('Resource'),
  },
  policy: cdk.custom_resources.AwsCustomResourcePolicy.fromSdkCalls({resources: [stream.streamArn]}),
})
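With this resource policy in place, a producer in the other account can write to the stream with the regular Kinesis API. Below is a minimal sketch using the AWS SDK for JavaScript v3; the region, the ARN placeholders, and the event shape are assumptions, and the producer role additionally needs an identity-based policy in its own account allowing these Kinesis actions. Note that cross-account calls must address the stream by ARN rather than by name.

import {KinesisClient, PutRecordCommand} from '@aws-sdk/client-kinesis'

const kinesis = new KinesisClient({region: 'eu-west-1'})

// Hypothetical producer-side helper that publishes a single event to the shared stream.
export async function publishEvent(event: {userId: string; eventType: string}) {
  await kinesis.send(
    new PutRecordCommand({
      // Cross-account access works only when the stream is addressed by its ARN.
      StreamARN: 'arn:aws:kinesis:eu-west-1:<stream-account-id>:stream/<stream-name>',
      PartitionKey: event.userId,
      Data: new TextEncoder().encode(JSON.stringify(event)),
    }),
  )
}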
Kinesis Data Firehose
The next step is to create a Kinesis Data Firehose delivery stream that will deliver records from the Kinesis stream to the Snowflake table. Firehose requires a role to read from the Kinesis stream. We will create a new role and grant it read access to the stream.
const role = new cdk.aws_iam.Role(this, 'Role', {
  assumedBy: new cdk.aws_iam.ServicePrincipal('firehose.amazonaws.com', {
    conditions: {StringEquals: {'sts:ExternalId': cdk.Stack.of(this).account}},
  }),
})
Let’s now create the Firehose delivery stream. It consumes data from the Kinesis stream and loads it into the Snowflake table. You should set the Snowflake account URL, database, schema, table, user, and role. Additionally, you need to provide the private key from the secret and the ARN of the IAM role created above.
const secret = cdk.aws_secretsmanager.Secret.fromSecretNameV2(this, 'Secret', 'Snowflake/PrivateKey')

const deliveryStream = new cdk.aws_kinesisfirehose.CfnDeliveryStream(this, 'DeliveryStream', {
  deliveryStreamType: 'KinesisStreamAsSource',
  kinesisStreamSourceConfiguration: {kinesisStreamArn: stream.streamArn, roleArn: role.roleArn},
  snowflakeDestinationConfiguration: {
    accountUrl: 'https://account.region.snowflakecomputing.com',
    database: 'database',
    schema: 'schema',
    table: 'table',
    user: 'username',
    snowflakeRoleConfiguration: {snowflakeRole: 'role', enabled: true},
    privateKey: secret.secretValue.toString(),
    roleArn: role.roleArn,
  },
})
Finally, we add the following line to grant the role read access to the Kinesis stream. The access must be granted before the delivery stream that uses this role is created, which is what applyBefore ensures.
stream.grantReadWrite(role).applyBefore(deliveryStream)
The full code can be downloaded here.
Additional Improvements
This article outlines the necessary processes for live stream ingestion into Snowflake, but there are additional optimizations that you can look into:
- Data Transformation. You can consider implementing data transformation before the ingest to cut down the processing time in Snowflake.
- S3 Backup Mode. Data durability in the case of anomalies can be improved by configuring s3BackupMode in Kinesis Firehose.
  - The FailedDataOnly mode stores only the failed records in S3; it is very helpful for troubleshooting and also cuts storage costs.
  - The alternative is the AllData mode, which stores all records (both successful and failed) in S3, making it the complete option for data audits and recovery.
- Stream Monitoring. Monitoring the Kinesis stream on a regular basis is a key factor in maintaining the reliability of the data flow (see the alarm sketch after this list).
  - Among the main metrics are IncomingRecords and IncomingBytes to track incoming data, PutRecords.FailedRecords, which indicates data delivery issues, and GetRecords.IteratorAgeMilliseconds, which reflects data processing latency. ReadProvisionedThroughputExceeded and WriteProvisionedThroughputExceeded are triggered when the stream throughput limits are exceeded.
  - Performance tracking and management is done rather straightforwardly using BytesPerSecondLimit and RecordsPerSecondLimit.
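As an example of the monitoring point, a CloudWatch alarm on the iterator age can be added to the same stack. This is only a sketch; the threshold and evaluation periods are assumptions to adjust to your latency requirements:

// Alerts when records wait in the stream for more than a minute at the maximum,
// i.e. the Firehose consumer is falling behind.
new cdk.aws_cloudwatch.Alarm(this, 'IteratorAgeAlarm', {
  metric: stream.metricGetRecordsIteratorAgeMilliseconds({statistic: 'Maximum'}),
  threshold: 60_000,
  evaluationPeriods: 3,
  comparisonOperator: cdk.aws_cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
})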