Data Warehousing... but in the Cloud

 

As data needs only increase, it’s important to make sure you’re finding storage methods that best suit you and your business’ needs. Today we’re going to provide a brief overview of cloud data storage – what it is, and who’s doing it.

Traditional Data Warehousing

To begin, we’re going to look at what traditional data warehouses are. Many of the data warehouses that are used today were developed during the 1980s and were built for the on-site data centers needed at the time. Within this type of data warehouse, there are three models:

-        A virtual data warehouse which are separate databases queried together so users can effectively access all of the data as if it was stored within a singular warehouse

-        A data mart which is used for a specific business area, such as sales or auditing, and data is compiled from a variety of source systems

-        And an enterprise data warehouse, where the data warehouse is a central component as the heart of the enterprise information system and integrates data from all the business units.

Additionally, traditional data warehouse architecture consists of a three-tier structure. The bottom tier contains the database server used to extract data from different sources, like transactional databases. The middle tier holds the OLAP server, which transforms the data to be better analyzed or queried. This can work in several ways. The top tier is the client layer and has the tools that are used for high-level analysis, reporting, and data mining. Traditional data warehouse options usually require a lot of attention on low-level infrastructure, and upfront costs to buy and maintain.

Cloud Data Warehousing

With a cloud-built data warehouse, you can gather more data from different sources instantly, and scale the warehouses to support multiple users, workloads, and be accessed anywhere. The appeal of a cloud data warehouse is that it’s accessible over the internet, meaning companies can avoid the initial setup costs of a traditional data warehouse, and adaptable which allows companies to scale up or down as the needs of the business change.

Cloud data warehouses are designed to optimize vast quantities of data quickly. One method to achieve this is by storing tables values in columns instead of rows, called columnar storage, that provides much faster aggregate queries. Additionally, by using massively parallel processing (MPP) cloud data warehouses are able to improve query speeds by combining processing from large datasets using many machines, as well as a combination of batch processing Extract-Transform-Load jobs, streaming data options, and other methods of integrating data into the warehouse.

Additionally, cloud data warehouses are fully managed by the service provider, meaning they assume all the responsibility and take on tasks such as updating the system.

Comparing Tradition & Cloud Data Warehouses

There are several ways that cloud data warehouses differ from traditional data warehouses:

-        No physical hardware is needed

-        Cloud data warehouses typically run faster

-        Scalability is easier and cheaper on the cloud, while time-consuming and costly in traditional data warehouses

-        There are more security concerns with a cloud data warehouse, and more opportunities for human error

It’s important to note that it’s normally difficult and expensive to switch from a traditional data warehouse to a cloud data warehouse, so typically organizations that use them are originally based in the cloud.

Vendors in the Space

Amazon Redshift

Redshift was one of the pioneers in the cloud data warehouse industry. With Redshift, database administrators can allocate resources, from choosing the nodes to the network connection, just like if they were in an on-premise warehouse. This is a huge draw for teams that want to maximum performance and high customizability. Additionally, it integrates easily to other AWS apparatuses, so another draw for teams already using any of them.

Loading Data: Redshift can only support structured data and can load it from many AWS sources (S3, DynamoDB, etc.) as well as an SSH-enabled host on premises, as well as ingest streaming data via Firehose.

Query Processing: In many tests, Redshift outperformed other cloud data warehouses in terms of speed and optimization. However, due to how customizable it is, those speeds depend on how you design your data warehouse. Important – Redshift queries are written in a specific dialect of PostgreSQL that doesn’t support a sizable amount of data types and functions.

Scalability: Clusters must be manually shut down and started via AWS.

Support and Maintenance: Redshift monitors all system components for failures and recovers them automatically, everything else is up to the user.

Google BigQuery

BigQuery is based on Dremel, which translates SQL queries into low-level instructions via a file management system, which allows it to sift through billions of data in seconds. The ease of use makes BigQuery ideal for small businesses who want something efficient and easy to use.

Loading Data: Data can be loaded from many different sources, and all the formatting and processing is handled by Google. Additionally, data can be queried externally from outside sources, like Google Drive, without loading them.

Query Processing: BigQuery supports standard SQL, as well as some semi-structured data types. BigQuery does limit the number of queries you can run at once, but allows you to prioritize queries over others.

Scalability: Pretty much everything is handled by Google in BigQuery.

Support and Maintenance: See above. There’s no need for periodic maintenance, but it makes editing or deleting old data is very cumbersome and involves truncating and recreating new tables.

Snowflake

Snowflake is a custom query engine built onto AWS software. Snowflake processes queries across virtual warehouses, which are clusters of MPP nodes. Each query has a warehouse, meaning running queries at the same time won’t affect each other.

Loading Data: Snowflake supports both structured and semi-structured data, which it’s able to preserve the flexibility of without sacrificing any performance. It does not support streaming data, but allows you to connect with other services that do.

Query Processing: As mentioned earlier, Snowflake supports structured and semi-structured data. There’s talk that Snowflake works better with smaller datasets, which may help smaller businesses.

Scalability: With Snowflake, computing and storage are separate, which allows these resources to be distributed efficiently.

Support and Maintenance: Snowflake administrators handle all maintenance, updates, and security.

 

As the field expands, we'll continue to update you on new developments, trends, and technologies. Hopefully this provided a quick overview as you assess your data needs for 2019!

 

Sources

https://panoply.io/data-warehouse-guide/data-warehouse-architecture-traditional-vs-cloud/

https://searchdatamanagement.techtarget.com/essentialguide/Cloud-data-warehouse-guide-Using-Redshift-rival-platforms

https://searchdatamanagement.techtarget.com/feature/Evaluating-the-key-features-of-data-warehouse-platforms

https://www.snowflake.com/blog/why-you-need-a-cloud-data-warehouse/

https://www.upwork.com/hiring/data/cloud-based-data-warehouses/

 

https://www.dataversity.net/difference-traditional-data-warehouse-cloud-data-warehouse/