A data warehouse is a technology that aggregates data from operational systems and external data sources from anywhere within an organization, formatted for reporting and analysis. The data warehouse serves as a central repository of information that can be used to provide an organization with both historical and current data points to support decision-making processes. This type of data is often difficult to access or present from a traditional operational data store (or database).
A data warehouse is often used as a way to break data silos, providing a point of correlation between different systems within an organization. By providing a central repository, and setting a standard data format within said repository, the data warehouse can connect, for example, the product inventory data stored in one system to the POs issued for a specific client or market region, stored in a different system. Business Intelligence relies on access to these disparate points, in a format that is accessible, presentable, and of reliable lineage. The data warehouse provides the required hub.
The key difference between a data warehouse and a data lake is structured data. Both aggregate and gather data into a large storage repository. A data warehouse deals with aggregated transactional data, formatted for business analysis. A data lake stores raw, structured and unstructured data in the format it is received until it is needed. Data Lakes leverage object storage, such as AWS S3, which can also mean cheaper storage.
Now that we’ve gone over several data warehouse options, we should discuss what binds them. All operate on key data warehouse concepts.
A data warehouse, no matter what sort of data it aggregates, formats its data based on facts, dimensions, and measures.
The data warehouse solution often makes use of ETL tools (Extract, Transform, and Load) to obtain and format the data based on these formatting principles. The ETL tool pulls the data from the various sources, and facilitates the organization of the data into the format defined by the data warehouse. Some solutions claim an ELT model, arguing that it helps to retrieve data quicker, and allows the data warehouse to do the work in transforming the data.
There are numerous types of data warehouses in the modern IT environment ranging from traditional to cloud options, with several variations in between.
The traditional data warehouse is an on-premise construct, typically obtaining data via batch processing, sometimes enhanced by ETL tools to aid in gathering and transforming the data. Data warehouse batch processing refers to the collection and storing of data in batches during a batch window. In other words, data is not processed in real-time, but at scheduled intervals. The drawback to this, of course, is that in the window between receipt and being processed, bad decisions can occur based on faulty or obsolete information. They are (or were) built to serve as the single and primary organizational data repository, as opposed to modern architectures which are more distributed. Such data warehouses are rarely deployed new anymore but are still present in large on-premise deployments that could be costly to migrate to the cloud.
Cloud Data Warehouses are, of course, based on cloud technologies, and hosted on cloud platforms. Some key examples of Cloud Data Warehouses include Amazon Redshift and Google BigQuery. Each leverage cloud compute and cloud storage in different ways, but have some common features.
Amazon Redshift operates in a cluster with several nodes, and uses what is termed as columnar storage (as opposed to row storage) to improve analytic query performance. This means that instead of storing records in rows, it stores values from a single column from multiple rows. This allows data to be read faster, as well as making it easier to compress, as each column can be compressed to the same format because it uses the same type of data.
Google BigQuery uses what is termed as “serverless architecture” This means that clients do not need to manage physical (or virtual) server units to run their data warehouse. Instead, BigQuery dynamically assigns computing resources, and the client pays based on storage used, and queries per terabyte. BigQuery also leverages columnar storage and compression.
BigQuery supports batch processing and has a Streaming API for streaming data.
The main advantages of the cloud data warehouse are flexibility and scalability, as well as potential cost savings stemming from not having to purchase expensive infrastructure. Or at the least, changing the capital expenditure model to a more operational expenditure model.
A modern data warehouse is almost always a real-time data warehouse (RTDW). Many companies rely on up-to-the-minute information and cannot wait for batch processing to compare datasets. It will also often pair this with advanced analytics and machine learning. It will also leverage the scalability of cloud platform technologies. But perhaps the most telling difference may be that a modern data warehouse often integrates the capabilities of object storage and data lakes, in order to access unstructured data as well as structured. This is because IoT data and other edge sources of data that are necessary for today’s analytics are often unstructured, and cannot be processed by traditional data warehouses. Data lakes gather this data and allow it to be processed separately prior to loading it into the data warehouse. Google BigQuery can be considered a cloud data warehouse, and a modern data warehouse, as it can leverage data lakes and S3, along with streaming data warehouse capabilities. Another example would be Snowflake, for the same reasons.