What is the SnowFlake Data Platform?

Image

Snowflake, a cloud-based data warehousing platform, has been in buzz because of all the features it is providing for data storage and computing. Are you also confused if you should go with Snowflake or not?

In this blog, we shall be looking at a few points for you to decide if you should go with Snowflake or not.

Why do we need Snowflake?

Snowflake follows a ‘Software as a Service’ model, to provide warehousing to both structured and semistructured data in an efficient way.

Not long ago, companies had to spend a lot on setting up an infrastructure to store and compute their data in their data centers. But this, compared to Snowflake, which offers all its services totally on the cloud, is an ideal platform for companies who do not want to store their data on-premises and just want to store everything on the cloud.

Compared to traditional data warehouses, Snowflake is incredibly fast, flexible, and user-friendly.

What is Snowflake?

Snowflake is a data warehouse, to store and compute huge amounts of structured and unstructured data on the cloud. It follows a Software-as-a-Service (SaaS) model.

It offers a user-friendly UI for ingestion, staging of data, assigning roles, etc. Snowflake provides many formats like CSV, JSON, Parquet files, etc. to save a data set in. If an organization is looking for a platform that is easy to use and on the go, then Snowflake is the one.

Architecture

For source - click here

The architecture of Snowflake consist of 3 layers:

  • Database storage : This is the layer where all the data is stored. This layer manages all the organization, structure, file size, metadata, compression, and statistics of the data. The storage layer is independent of the compute resources.
  • Query Processing : The query is processed and executed in this layer using virtual warehouses. Warehouses consist of clusters of nodes. You can have multiple warehouses, where the performance of one warehouse does not affect the performance of another.
  • Cloud Services : This layer includes a collection of services that coordinate activities on Snowflake. These are the activities that give the user an end-to-end experience, from query to dispatch.

Services managed in this layer include:

  • Authentication
  • Infrastructure management
  • Metadata management
  • Query parsing and optimization
  • Access control

How is it better than others?

This architecture brings us to how this platform offers good performance and efficiency.

Snowflake offers a hybrid of ‘Shared disk’ and ‘Shared nothing’ architecture. In Snowflake, a central data repository is accessible from all the nodes, like in a ‘Shared disk architecture’, making data highly available at any point in time. Also, Snowflake uses massively parallel processing clusters, where every node stores some part of the data locally similar to ‘Shared nothing architecture’ enhancing its performance and scalability. This hybrid model allows Snowflake to have the advantages of both architectures.

These are the 3 major points that make Snowflake a good data warehousing platform.

  • Performance and speed : If you think that the load on a query is more, you can easily scale up your virtual warehouse to increase the performance and speed of your query and then scale down afterward. This will allow you to pay for more compute power only when you use it.
  • Storage and support for many types of data : In Snowflake, you can easily query many types of data including CSV, JSON, Parquet files, etc. And this is an advantage, as querying JSON and Parquet files are way faster than other formats of data.
  • Concurrency and accessibility : Because of the hybrid multi-cluster architecture, we can run as many queries on a warehouse as we want. And no query will affect each other’s performance. We can even scale up and down separately in different warehouses.

Snowflake or Amazon Redshift?

Currently, Snowflake has been compared with Amazon Redshift. Let’s see the key differences between both of them.

  Amazon Redshift Snowflake
Integration Redshift can be easily integrated with AWS services like Athena, Database Migration Service (DMS), DynamoDB, CloudWatch, and Kinesis Data Firehose. Snowflake seamlessly integrates with IBM Cognos, Informatica, Power BI, Tableau, Apache Spark, and Qlik, to name a few.
Database features Redshift does not offer sharing data between different accounts. We can easily share data between different accounts. This is very helpful when we are working on any 3rd party database.
Data types Redshift doesn’t support semi-structured data types like Array, Object, and Variant. In Snowflake, you can easily use different warehouses and use the same data without copying it.
Maintenance In Redshift, users can only use one cluster for all the computing. Managing it also involves a lot of complex steps. Snowflake supports all the semi-structured data types.
Security Redshift provides features and tools to manage it like Access management, Amazon Virtual Private Cloud, Cluster encryption, Cluster security groups, Data in transit, Load data encryption, Sign-in credentials, SSL connections, and Sign-in credentials. Snowflake also offers similar tools and features to ensure security and compliance with regulatory bodies. You can even define roles to keep a check on who can access what.
Support to JSON When JSON is loaded into Redshift, it’s split into strings, which makes it harder to work with and query. Snowflake’s support is decidedly more robust than Redshift. This means that with Snowflake you can store and query JSON with native, built-in functions.
Scalability Redshift Resize operations can also quickly become extremely expensive and lead to significant downtime. As compute and storage are separate in Snowflake, you don’t have to copy data to scale up or down. You can just switch data compute capacity at will.
Cost Redshift works on a pay-as-you-go basis. In other words, you pay for the particular services and the particular amount of those services that you use
Snowflake has a t-shirt size bucket model and charges credits accordingly. One interesting thing is that Snowflake charges for storage and computation separately.
 

Pricing

  • Snowflake charges per warehouse and usage pattern and according to time.
  • Snowflake’s pricing model is that customers are billed separately for their data storage and warehousing, this makes it easier to create new data warehouses of varying sizes.
  • Snowflake charges per second, For simplicity we will be seeing prices on hourly rates.
  • As Snowflake allows user to choose different sizes of warehouses, let’s discuss, how much a customer will spend on an average, while choosing any size:

1. Storage cost :

On Demand Storage - $40 per month per Terabyte, around $0.055 per hour (When you pay monthly)

Capacity Storage- $23 per month per terabyte (When you pay it upfront)

2. Compute Cost :

It will depend on how much duration have you used it for, and that time will be added to see the total number of hours, and accordingly you will be charged some credits, given in the following table :

1 credit = $3 per hour

Size Credits per hour Charge per hour
XS 1 $3
S 2 $6
M 4 $12
L 4 $12
XL 16 $48

If you are confused on how to choose the size of the warehouse, then, here is the list which will tell you the specifications of each size of warehouse.

1. Small

  • 10-20 analytics users
  • 10-20 ELT pipelines
  • Under 5 TB of data

2. Medium

  • 30-50 analytics users
  • 30-50 ELT pipelines
  • Under 50 TB of data

3. Large

  • 100+ analytics users
  • 100s – 1000s of ELT pipelines
  • Under 100+ TBs of data
  • A customer will only be charged for those seconds when their warehouse was getting used and not when it was kept idle.
  • Snowflake, though being expensive, than its competitors, offers seven levels of their computational warehouse services, otherwise known as “clusters”.
  • The clusters are based on a dynamic pricing model which allows for flexibility and resizing, which helps customers save money.

Real-time Use Case

There are many companies using Snowflake like Cisco, Square, Salesforce, Microsoft etc. This was the statement given by Cisco while moving from Hadoop to Snowflake:

"Our early POCs indicated that Snowflake was 2-4 times faster than Hadoop for complex workloads. The fact that this was ANSI SQL-based yielded several advantages, including a larger qualified talent pool, shorter development cycles, and improved time to capability. The platform also offered a higher concurrency and lower latency compared to Hadoop. Snowflake was a clear winner!"

For source - click here.

Let’s walk you through one case study, which can give you an idea of how you can use Snowflake efficiently and how much it will cost you.

Cluster Size Number of Clusters Credits per Hour Credit price Cost per hour Hours Used Total Cost
Small 3 4 $3 $18 1 $18
Medium 2 2 $3 $24 4 $96
Large 1 8 $3 $24 1 $24

For say you do the same task for 1 month = $138 * 30 = $4140

Storage of say 5 TB of data for 1 month= 5 * 1 * $40 = $200

So, the total you would be costed for this case study would be= $4140 + $200= $4340 for 1 month

This was an example, to explain the dynamic pricing of Snowflake. But, you must be thinking why to take multiple small clusters rather than taking one medium cluster. Here’s the answer:

With a standard, single-cluster warehouse, if your user/query load increases to the point where you need more compute resources:

  • You must either increase the size of the warehouse or start additional warehouses and explicitly redirect the additional users/queries to these warehouses.
  • Then, when the resources are no longer needed, to conserve credits, you must manually downsize the larger warehouse or suspend the additional warehouses.
  • In contrast, a multi-cluster warehouse enables larger numbers of users to connect to the same size warehouse. In addition:
  • In Auto-scale mode, a multi-cluster warehouse eliminates the need for resizing the warehouse or starting and stopping additional warehouses to handle fluctuating workloads. Snowflake automatically starts and stops additional warehouses as needed.
  • In Maximized mode, you can control the capacity of the multi-cluster warehouse by increasing or decreasing the number of warehouses as needed.

Implementation

Let’s see how to import a table lying in your S3 bucket to Snowflake.

Step 1: Create Warehouse.

Step 2: Select the size of warehouse.

Step 3: Go to Databases tab and click on ‘CREATE.

Step 4: Give the name to your new database.

Step 5: Click on the database and select ‘Tables’ to create a new table in which we will be uploading our data.

Step 6: Fill in the required details.

Step 7: Go to stages and create a new stage from Amazon S3 bucket

Step 8: Connect your S3 bucket by entering the AWS Key ID, AWS Security key, URL to your S3 bucket data.

Step 9: Your bucket is now staged in Snowflake. You just need to copy the data in that bucket to your table.

Step 10: Just go to the worksheets tab and type the following command syntax and copy your data to the test table.

  copy into test from @s3_bucket;   

Step 11: Your data is available for analytics on Snowflake!

Top N / Bottom N Simultaneously

The demand of extracting information from a huge amount of data has been rising exponentially, Snowflake provides us the platform to overcome all these challenges in an efficient yet easy way. And that’s the reason why Snowflake has now become popular.

Enabling access to data, when you want it, where you want it without delay is a core principle to their success.

With a unique architecture that delivers exceptional performance, concurrency and simplicity, Snowflake is a data warehouse built to bring together diverse data into one system for data analytics efforts.

Written by:

Yash Khare

Data Scientist

LinkedIn

Related Post

Leave a Reply