Ingesting Amazon Kinesis Data Streams into Data Lake Tables

Ingest Kinesis Data into a Data Lake in 3 Steps

Upsolver SQLake makes it easy to ingest data from Amazon Kinesis Data Streams (KDS) into your data lake. By using KDS to ingest real-time events into the data lake, you give data analysts access to fresh data and enable data scientists to train their models on the most recent user behavior data. You'll also be able to quickly detect reliability and data quality issues.

Using SQLake to ingest from KDS involves only 3 quick steps.

  1. Create a connection to your Kinesis Data Stream
  2. Create a staging table where events will be stored
  3. Create an ingestion job to copy events into your staging table

Let’s walk through the steps.

1. Create a connection to KDS

Connections in SQLake contain the authentication details needed to access your Kinesis Data Stream. SQLake lets you authenticate with either an AWS IAM role or AWS IAM access and secret keys. Examples of both authentication methods are shown below.

Note: If you have deployed SQLake into your own VPC, you can leave the authentication parameters undefined, and the connection will default to the role with which you’ve integrated SQLake.
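
For example, a connection in a VPC deployment could omit the credential parameters entirely. This is only a sketch: the connection name kinesis_conn_vpc is a placeholder, and REGION is assumed to still be required.

-- Credentials omitted; SQLake falls back to the role it was integrated with
CREATE KINESIS CONNECTION kinesis_conn_vpc
   REGION = '<region>';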

Typically you would use an IAM role to connect to your KDS instance. Here is the syntax:

CREATE KINESIS CONNECTION kinesis_stream
   AWS_ROLE = '<role arn>'
   EXTERNAL_ID = '<external id>'
   REGION = '<region>';
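
If you authenticate with an access key pair instead, the connection follows the same pattern. The parameter names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in this sketch are assumptions, as is the connection name kinesis_stream_keys; confirm the exact names in the SQLake documentation.

-- Assumed parameter names for key-based authentication; verify against the docs
CREATE KINESIS CONNECTION kinesis_stream_keys
   AWS_ACCESS_KEY_ID = '<access key>'
   AWS_SECRET_ACCESS_KEY = '<secret key>'
   REGION = '<region>';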

2. Create a staging table

Staging tables store a raw and immutable copy of the events read from the source system. You don't need to statically define the schema of your staging tables, because you want the flexibility to ingest any changes that the application team or source system may introduce, such as adding new fields. SQLake automatically detects, infers, and updates the AWS Glue Data Catalog with the table's schema and partition information. For most staging tables, a best practice is to partition them by $event_date, a system column that SQLake automatically generates and that holds the date of the incoming event.

Create a staging table as follows:

CREATE TABLE default_glue_catalog.database_11f174.orders_raw_data ()
PARTITIONED BY $event_date;

The full syntax of the CREATE TABLE command can be found in the documentation.

3. Create an ingestion job to copy data from KDS to the staging table

Finally, create a COPY FROM job that reads events from your Kinesis stream and copies them into your staging table.  The data is copied continuously, minute by minute, with SQLake automatically handling all table management (partitioning, compaction, cataloging, and so on).

Create the ingestion job that reads from the “webstore_orders” stream in KDS:

CREATE SYNC JOB ingest_orders_from_kinesis
   START_FROM = BEGINNING
   CONTENT_TYPE = JSON
AS COPY FROM KINESIS kinesis_stream 
   STREAM = 'webstore_orders'
INTO default_glue_catalog.database_11f174.orders_raw_data;
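
Once the job is running, you can sanity-check that events are landing by previewing the staging table. This is just an illustrative query; run it with whichever engine you use to query your Glue Data Catalog tables (Athena, a SQLake worksheet, and so on).

-- Preview a handful of ingested events from the staging table
SELECT *
FROM default_glue_catalog.database_11f174.orders_raw_data
LIMIT 10;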

The following are important parameters to consider when building your COPY FROM job:

START_FROM = BEGINNING | NOW | TIMESTAMP '<timestamp>'

Defines how far back to begin reading events from the stream.  

BEGINNING reads from the earliest event that is stored on the stream. 

NOW starts reading from the time the job was created.

TIMESTAMP '<timestamp>' starts reading from the referenced timestamp.
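
For example, to replay only events from a specific point in time onward, the same job could be defined with a timestamp. The job name and the timestamp literal format below are assumptions; check the documentation for the exact format SQLake expects.

-- Reads only events produced on or after the given timestamp (format assumed)
CREATE SYNC JOB ingest_orders_from_timestamp
   START_FROM = TIMESTAMP '2023-01-01 00:00:00'
   CONTENT_TYPE = JSON
AS COPY FROM KINESIS kinesis_stream
   STREAM = 'webstore_orders'
INTO default_glue_catalog.database_11f174.orders_raw_data;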

CONTENT_TYPE = JSON | CSV | PARQUET | AVRO | …

The format of the events read from the stream. Certain content types may require additional settings; for the full syntax of COPY FROM jobs that read from Kinesis streams, see the documentation: Copy From Kinesis options.
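
As a quick illustration, ingesting a CSV-formatted stream changes only the CONTENT_TYPE value in the same job skeleton. Any CSV-specific options (headers, delimiters, and so on) are covered in the linked documentation and omitted here, and the job name is a placeholder.

-- Same pattern as the JSON job above, with the content type swapped
CREATE SYNC JOB ingest_orders_csv
   START_FROM = NOW
   CONTENT_TYPE = CSV
AS COPY FROM KINESIS kinesis_stream
   STREAM = 'webstore_orders'
INTO default_glue_catalog.database_11f174.orders_raw_data;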

Summary

That’s it!

It takes only a few minutes to create a data pipeline that copies data from Kinesis Data Streams to your data lake table. SQLake automatically scales up and down to match the velocity of incoming events, manages the output in an optimal file format (Apache Parquet), partitions the data, and compacts (merges) small files into larger, more efficient ones, addressing a common challenge with streaming data in the data lake.

For a deeper dive into this topic, see An easier way to build a data ingestion pipeline for batch and streaming data.

Get started today for free with sample data or bring your own.
