What Makes Snowflake Different from Other DBMSs as a “Cloud-Based” Database

In recent years, Snowflake DB has become more and more popular because of its innovative features. Many companies that are leaders in their business domains have started their journey with Snowflake, such as Sony, Logitech and Electronic Arts. So, what are these “features” that make Snowflake so attractive and competitive? In this article, I will pick out some of the points that make me love this database management system. Rather than measuring its performance and abilities, I will focus on the unique features Snowflake has because it is a true “cloud-based” database.

What I will skip

I will skip some features that have been talked about too much everywhere on the Internet, such as the creative architecture, high performance and pricing model. Of course, these are things Snowflake is proud of. However, as a consultant, I have seen too many tools and software products that competed with each other on these “existing” dimensions, and none of them can be guaranteed to stay on top forever.

Specifically, most customers will not be interested in its creative architecture unless they are technology enthusiasts. As for performance, in fact, none of the popular database management systems will be terrible. Finally, price is not always the key factor in a buying decision.

Therefore, what I am doing here is trying to show you the features of Snowflake that may make you go WOW. Also, where applicable, I will compare Snowflake with two other popular cloud-based DBMS, Azure Synapse Analytics (formerly Azure SQL Data Warehouse) and AWS Redshift, to see how Snowflake lives up to its “cloud-based” label.

Zero effort for the development environment


Snowflake is not the only cloud-based DBMS, but it is probably the only one providing such a comprehensive UI that supports all popular web browsers.

Azure Synapse Analytics:
Need to download SQL Server Management Studio

AWS Redshift:
Need to download a third-party SQL client such as SQL Workbench

Snowflake:
In your web browser, go to the URL and press enter

It’s not only about downloading management software. Think about when you need to work on a new machine…

Retain your worksheet


I believe I am not the only one who tends to keep several worksheets open with many SQL statements for different purposes. Most of the time I will NEVER shut down the SQL client or my laptop for days until I reach a project milestone, because I just don’t want to lose my SQL statements. I know I could save them into script files and import them next time, but I am lazy, and re-importing them causes some headaches too.

With Snowflake, feel free to close your browser tab at any time. Whenever you open the URL, your worksheets will always be there, even if you switch to another machine, because everything is stored in the cloud.

Change your role


If you are an admin, or someone who holds the master permissions, you will probably be asked frequently to grant permissions to other users. With other DBMS, you do your part and then have to confirm with the user you granted the permission to whether the resource is actually available to them.

With Snowflake, simply switch your role to see what permissions that role has. This also lets you easily test whether you have over-granted permissions.
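As a minimal sketch (the role, table and privilege names below are hypothetical), the grant-and-verify flow might look like this:

-- Grant a privilege as an administrator
GRANT SELECT ON TABLE sales.public.orders TO ROLE analyst;

-- Switch to the grantee's role to verify the access yourself
USE ROLE analyst;
SELECT * FROM sales.public.orders LIMIT 10;

-- Switch back to your admin role
USE ROLE sysadmin;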

SQL statement execution history


It is not uncommon that we want to see what we have done before during development. At least I used to write a lot of SQL statements for testing purposes; once I got the query working, I deleted the testing code. But later on, I might realise that I want to see what I have actually tested.

With Snowflake, all your query history will be retained for 1 YEAR. Whenever you want to view what you have done before, simply go to the history to find it.
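If you prefer SQL over the web UI’s History page, you can also query the history yourself. A minimal sketch, assuming you have access to the SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY view:

SELECT query_text, user_name, start_time
FROM snowflake.account_usage.query_history
WHERE start_time > DATEADD(day, -30, CURRENT_TIMESTAMP())
ORDER BY start_time DESC;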

Configure notifications to email and SMS

As a cloud-based DBMS, Snowflake makes it extremely easy to configure your email or SMS notifications.

Azure Synapse Analytics:
Email: ask your IT department to provide you with the SMTP server and configure the white list for you. Then, configure your SMTP in Mail Profile
SMS: configure your Azure Action Group

AWS Redshift:
Need to create a Kinesis Data Firehose Delivery Stream and then configure it

Snowflake:
Fill in your email address and/or phone number. Then, enable it.

You can even enable MFA (Multi-Factor Authentication) in 30 seconds. Just provide your phone number, receive a verification code and fill it back.

Time Travel


This was the most WOW feature for me when I started exploring Snowflake DB. Specifically, with the Standard Edition, you can enable “time travel” on a table and have 1 day of data retained no matter how you delete or update it. With the Enterprise Edition, time travel can be extended to up to 90 days.

How can we use the “time travel” feature? Easy!

SELECT <columns>
FROM <your_table>
AT (timestamp => '<timestamp>');

Also, I have never seen such an expression in any other DBMS:

UNDROP TABLE <table_name>;

Think about any other DBMS: you would need a well-rounded plan for how your data will be backed up regularly, just in case of disaster recovery, and then automate everything. You may need at least 1–3 developers/DBAs working on this for several weeks and then testing it. With Snowflake, none of that is necessary.

Quick Clone


With Snowflake, you can clone a table, a schema or even a database in no time, occupying no extra space. This is because the cloned table actually creates pointers to the stored data rather than copying the actual data. In other words, the cloned table only stores the data that differs from its original table.
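As a minimal sketch (the table and schema names are hypothetical), cloning, and cloning combined with time travel, look like this:

CREATE TABLE orders_dev CLONE orders;

-- Cloning a whole schema or database works the same way
CREATE SCHEMA analytics_dev CLONE analytics;

-- Combine clone with time travel to snapshot the table as it was an hour ago
CREATE TABLE orders_snapshot CLONE orders
AT (OFFSET => -3600);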

Why do we need to clone a table?

  • When you want to perform some experiments with the data but don’t want to damage it by mistake.
  • When you want to capture a snapshot of the data at a certain time.
  • When you want to get a separate copy of a table at a point in its history, by using clone together with time travel!
  • When you want to share production data with others, but don’t want to give direct access to the production database.
  • etc.

Quick Integration with Cloud Storage


If you use one of the 3 biggest cloud providers, Azure, AWS or GCP, you may already use their cheap storage services: Blob Storage, S3 or GCS. It takes about a minute to configure Snowflake so that it can import data from these storage services.
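A minimal sketch of such a configuration, assuming a storage integration named my_azure_int has already been created and the URL is a hypothetical container path:

CREATE STAGE my_blob_stage
URL = 'azure://myaccount.blob.core.windows.net/mycontainer/path/'
STORAGE_INTEGRATION = my_azure_int
FILE_FORMAT = (TYPE = 'CSV');

-- Load files from the stage into a table
COPY INTO my_table
FROM @my_blob_stage;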

Automatic Data Ingestion (Snowpipe)


Snowflake, as a DBMS, is not designed to be a comprehensive ETL tool, but if your data ingestion is not that complicated, Snowpipe will most likely satisfy your requirements, with a very quick implementation.

For example, suppose you already have another process that puts data from whatever data sources into cloud storage (e.g. Azure Blob Storage). Then, all you need to do to automate the data ingestion into Snowflake is:

  • Create an Azure Event Grid subscription to emit events when a blob is created in Blob Storage
  • In Snowflake, create an “integration” with the Event Grid (can be done in a single SQL command, CREATE NOTIFICATION INTEGRATION)
  • In Snowflake, create a Snowpipe to monitor the “integration” and set the destination table
CREATE PIPE <snowpipe_name>
AUTO_INGEST = TRUE
INTEGRATION = '<integration_name>'
AS
COPY INTO <destination_table>
FROM @<external_stage_on_blob_storage>
FILE_FORMAT = (TYPE = '<e.g. JSON>');

Then, whenever a new file is created in the Azure Blob Storage container, an event will be emitted, and the Snowpipe will become aware of it and run the COPY INTO command to ingest the data into the destination table.
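To check whether the pipe is running as expected, Snowflake provides the SYSTEM$PIPE_STATUS function (the pipe name below is the same placeholder as in the statement above):

SELECT SYSTEM$PIPE_STATUS('<snowpipe_name>');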

If you want to know more about this, please check out one of my previous posts here: https://medium.com/@qiuyujx/building-snowpipe-on-azure-blob-storage-using-azure-portal-web-ui-for-snowflake-data-warehouse-f0cdd7997250

Easiest Scheduled SQL Tasks Ever


If you want to have your SQL statements run regularly, I have never found it as easy as it is with Snowflake. All you need to do is:

CREATE TASK <task_name>
WAREHOUSE = <warehouse_name>
SCHEDULE = '<frequency, e.g. 10 MINUTE>'
AS
<sql_statement>

Do you still remember what you need to do for this in SQL Server? Enable the SQL Server Agent Windows service, create a new job, and then fill in several pages of forms…

If you want, you can also use Snowpipe and Tasks together to do some complicated ETL. It is not “drag-n-drop”, but you also don’t need to involve a separate tool; it’s all in your DBMS.

Resource

Snowflake Architecture:
https://www.snowflake.com/product/architecture/

Written by Christopher Tao, Principal Consultant@AtoBI Australia

Article originally published on Medium.

ELT Data Pipelining in Snowflake Data Warehouse — using Streams and Tasks

Snowpipe was introduced to automatically discover new data files in cloud storage (e.g. Azure Blob Storage) and then load the data into a certain table.

Snowpipe is a very convenient tool for the above purposes. However, Snowpipe itself is only considered the “E” (Extract) part of ELT, because only the COPY INTO command is allowed in a Snowpipe creation statement. In other words, we can easily achieve the following with it:

  • Loading data files in different formats such as CSV, JSON, XML, Parquet and ORC
  • Adapting and tweaking the data source for better compatibility, such as stripping the outer array for JSON and stripping the outer element for XML
  • Changing column names
  • Changing column orders
  • Omitting columns
  • Parsing date/time strings into date/time objects

Therefore, for the following reasons, Streams and Tasks are needed for the rest of the data pipelining:

  1. Some data transformation might be necessary, such as number calculations and string concatenation.
  2. The data source may not be in typical third normal form, so it needs to be loaded into multiple tables based on certain relations.
  3. The ELT jobs may not be limited to appending to tables, but may also include more complex requirements such as SCD (Slowly Changing Dimension) implementations.

Therefore, we have to involve other objects in Snowflake to complete the data pipeline.

Snowflake Streams

A Snowflake Stream object tracks any changes to a table, including inserts, updates and deletes, and can then be consumed by other DML statements. One of the typical usages of a stream object is CDC (Change Data Capture).

Standard vs. Append-only Streams

There are 2 types of streams that we can define in Snowflake: standard and append-only. Standard streams capture any type of change to the table, whereas append-only streams only capture inserted rows. The former can be used for general purposes, and the latter can be utilised in a typical ELT model where we are only interested in the newly ingested rows.

For example, we can have a Snowpipe automatically ingest the data from CSV files in cloud storage and copy it into a staging table. Then, a stream will capture this bulk-insert action and record the offset of these new rows.
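A minimal sketch of creating such a stream (the table and stream names are hypothetical):

CREATE OR REPLACE STREAM staging_stream
ON TABLE staging_table
APPEND_ONLY = TRUE;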

Stream Offset

Snowflake Streams do not physically store, contain or copy any data. A stream just takes a snapshot of the tracked table at the current time (e.g. upon stream creation), and then every change made to this table is recorded by the stream as additional metadata. The difference between the previous version of the table and the current version is called the offset.

When another DML statement utilises the offset of the stream, the offset is reset, and the stream considers the consumed data as the “previous” version of the table. A typical DML statement that consumes a stream offset is INSERT INTO ... SELECT ... FROM <stream>.

Here is an example of the stream consumption flow:

  1. Suppose there are already 1 million rows in a table. We create a stream on this table; the stream has no offset because the current version of the table is snapshotted.
  2. Suppose 1 thousand rows are inserted into the table by a Snowpipe COPY INTO statement.
  3. The stream will have these 1 thousand rows recorded as its offset.
  4. Suppose we now read these 1 thousand rows from the stream and insert them into another table (see the sketch after this list).
  5. The stream will now have no offset, so it will be empty, because we have consumed the previous offset.
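For step 4, the consuming statement might look like the following minimal sketch (the table, stream and column names are hypothetical):

INSERT INTO production_table (col1, col2, col3)
SELECT col1, col2, col3
FROM staging_stream
WHERE METADATA$ACTION = 'INSERT';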

It is obvious that Snowflake Streams are designed for ELT processes. For example, suppose we have high-frequency data being ingested into the database, and we consume the data every 5 minutes. The stream guarantees that every time we consume the new data, there are no missing rows and no overlaps. This logic can be complicated to implement in other database management systems, or we may need an extra programming language to achieve it.

Stream Columns

Because a stream is actually a snapshot of the original table, all the columns in the original table are also accessible in the stream. For example, if the original table has 3 columns named col1, col2 and col3, then we can simply run

SELECT col1, col2, col3
FROM the_stream;

to retrieve the offset in this stream.

Additionally, there are 3 extra columns that are metadata specific to stream objects:

  • METADATA$ACTION indicates the type of change for this row. It can be either INSERT or DELETE. Please note that there is no “UPDATE” action. An update of a row is represented by 2 rows: one is a DELETE of the previous version of the row, and the other is an INSERT of the new version. You’ll find this very convenient, because by comparing these two rows you can see exactly which fields were updated (see the query sketch after this list).
  • METADATA$ISUPDATE indicates whether this row was generated because of an update. This column is an additional indicator that lets you pick out the rows marked INSERT and DELETE that were actually generated by an update.
  • METADATA$ROW_ID is a unique ID for a particular row. It will not change even if the row is updated, and it will remain unchanged for any further updates in future offsets.
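Sticking with the hypothetical table and stream from above, a query that exposes the metadata columns alongside the data might look like this:

SELECT col1, col2, col3,
       METADATA$ACTION,
       METADATA$ISUPDATE,
       METADATA$ROW_ID
FROM the_stream;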

Here is an image from the official documentation that presents the data flow very clearly.

Image courtesy: https://docs.snowflake.net/manuals/user-guide/streams.html

Snowflake Tasks

A Snowflake Task is an object that can schedule a single SQL statement to be automatically executed as a recurring event.

Yes, a single SQL statement per task. Currently, this is a limitation of Snowflake Tasks. Hopefully, future updates will support transactions with proper BEGIN and COMMIT keywords, so that we can execute a series of SQL statements in a transaction and utilise a task to schedule that transaction.

Despite this limitation, Snowflake does provide an approach for scheduling multiple SQL statements, which is called a Task Tree. As the name suggests, we can define multiple tasks in a tree structure. Several tips:

  • A task can have multiple tasks as its children, and the child tasks will be executed when the parent task has finished.
  • A task can have only one parent task.
  • Therefore, a task tree is really just a tree, not a DAG. In other words, it can “fork”, but there is no “join” (see the sketch after this list).
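A minimal sketch of such a task tree (the warehouse, task and table names are hypothetical):

CREATE TASK root_task
WAREHOUSE = mywh
SCHEDULE = '60 MINUTE'
AS
INSERT INTO staging_table SELECT * FROM raw_table;

-- Two child tasks “fork” from the root and run after it finishes
CREATE TASK child_task_1
WAREHOUSE = mywh
AFTER root_task
AS
INSERT INTO dim_table SELECT DISTINCT dim_col FROM staging_table;

CREATE TASK child_task_2
WAREHOUSE = mywh
AFTER root_task
AS
INSERT INTO fact_table SELECT fact_col, dim_col FROM staging_table;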

Here is a graph that illustrates a simple task tree.

Image courtesy: https://docs.snowflake.net/manuals/user-guide/tasks-intro.html

Trigger a Task

There are two ways of triggering a task: one is by defining a schedule, and the other is being triggered by another task.

Snowflake supports two types of task schedule definition: a “CRON” expression and a time interval. A CRON expression can define the exact date/day/time that the task will be triggered. It is very powerful and can achieve anything from simple to complex requirements, such as

  • Run task at certain time of every day
  • Run task at certain time of a certain day of week
  • Support timezones
  • Support daylight saving

For the specific syntax of CRON expressions, please visit this free website for detailed explanations and experiments: Free Online Cron Expression Generator and Describer (www.freeformatter.com).
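In Snowflake, a CRON schedule is wrapped in the USING CRON syntax inside the task definition. For example, this hypothetical schedule runs a task every Monday at 9:00 AM Sydney time:

SCHEDULE = 'USING CRON 0 9 * * MON Australia/Sydney'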

The other approach to defining a schedule is much more straightforward: simply define the time interval. For example, if I want the task to be triggered every 5 minutes, I just define it like this:

SCHEDULE = '5 MINUTE'

Another way of triggering a task is to define a parent task, which builds a task tree. For example, if we want TASK2 to be triggered when TASK1 has finished, we simply define:

CREATE TASK TASK2
...
AFTER "TASK1"
AS
<SQL>

Conditional Tasks

A task can be defined with a condition using a “WHEN” clause, which is used to decide whether the task should be executed.

This is an extremely convenient feature. For example, if you set a condition on your “Root Task” and the condition is not satisfied, the whole task tree will not run at all.

Also, if this “condition test” happens in the cloud services layer of Snowflake (using metadata rather than a SQL query on a table), the condition test itself has ZERO cost, and consequently there is NO cost when the condition is not satisfied.

Using Tasks with Streams

It is very common for Snowflake Tasks and Streams to be utilised together to build a data pipeline. A very typical usage pattern is:

  1. Snowpipe loads raw data into a staging table.
  2. A Snowflake Stream is created on the staging table, so the newly ingested rows will be recorded as the offset.
  3. A Snowflake Task then consumes the Stream offset with some DML statement to further load the data into production tables; some more complex transformations might be included.
  4. In practice, there will be a task tree that executes multiple SQL statements in order to perform the complex transformation, and sometimes populates the transformed entries into multiple production tables.
  5. Some intermediate tables might need to be created by the tasks in the task tree.
  6. There might be some tasks at the end of the task tree to clean up temporary tables.

We have already introduced the typical scenario where we need to integrate Snowpipe and Streams. Now, for Streams and Tasks, there is also a very typical usage pattern, namely the system function SYSTEM$STREAM_HAS_DATA.

Specifically, SYSTEM$STREAM_HAS_DATA returns a boolean value indicating whether there is a consumable offset in a stream. Therefore, we can simply put this condition in the “WHEN” clause of our Task definition, so that the task will not be executed if there is nothing new in the stream. For example:

CREATE TASK mytask1
WAREHOUSE = mywh
SCHEDULE = '5 minute'
WHEN
SYSTEM$STREAM_HAS_DATA('MYSTREAM')
AS
INSERT INTO ... ;

In this example, the Task “mytask1” will be executed only if there is a consumable offset in the Stream “MYSTREAM”. Otherwise, the task will be skipped and the condition will be checked again after 5 minutes.

One more tip

It is important to remember that a Task that has just been created is suspended by default. It is necessary to manually enable the task by “altering” it as follows:

ALTER TASK mytask1 RESUME;

Future work

In this article, Snowflake Streams and Tasks were introduced theoretically, and the general picture of using Snowpipe -> Stream -> Task to build an ELT data pipeline was drawn. In the next part, I’ll use a real example data source to build the ELT data pipeline using these Snowflake components, with the necessary SQL scripts demonstrated, which will be more hands-on.

Resources

Official documentation of Snowflake Streams: https://docs.snowflake.net/manuals/user-guide/streams.html

Official documentation of Snowflake Tasks:
https://docs.snowflake.net/manuals/user-guide/tasks-intro.html

Free online CRON expression generator:
https://www.freeformatter.com/cron-expression-generator-quartz.html

Written by Christopher Tao, Principal Consultant@AtoBI Australia

Article originally published on Towards Data Science.