

The Big Data Blog

Solving the Upserts Challenge in Data Lakes

Apr 1, 2020 5:55:49 PM / by Roy Hegdish posted in Data Lake, Data Architecture, ETL, SQL, AWS S3, Amazon S3


Updating or deleting data (upserts) is basic functionality in databases, but is surprisingly difficult to do in data lake storage. In this article, we explain the challenge of data lake upserts and how we built a solution that enables efficient, fast update and delete operations on object storage using Upsolver’s SQL-based data transformation engine.


Read More

4 Guiding Principles for Modern Data Lake Architecture

Mar 18, 2020 5:44:06 PM / by Roy Hegdish posted in Data Lake, Data Architecture, ETL, SQL, AWS S3, Amazon S3, Event sourcing


Data lakes are the cornerstones of modern big data architecture, but getting them right can be tricky. How do you design a data lake that will serve the business, rather than weigh down your IT department with technical debt and constant data pipeline rejiggering? In this document we cover the four essential principles for effectively architecting your data lake.

Read More

Data Lake as a Service: Is There a GUI-based Data Lake?

Mar 1, 2020 2:11:00 PM / by Eran Levy posted in Big Data, Data Lake, ETL, SQL, AWS S3, Amazon S3, Apache Spark


Recent surveys have shown that the data lake market is expected to grow to $20.1 billion by 2024, with a growing number of organizations looking to deploy a data lake in the coming years. However, despite growing interest in big data initiatives, many organizations run into the same roadblock: building a data lake is a complex, manual process that requires hiring skilled personnel who are in short supply.

Read More

How (and Why) to Analyze CloudWatch Logs In AWS Athena

Feb 27, 2020 2:15:06 PM / by Roy Hegdish posted in Data Lake, ETL, SQL, AWS S3, Amazon S3, CloudWatch

Amazon CloudWatch is a monitoring service for AWS cloud resources and the applications you run on AWS. While CloudWatch enables you to view logs and understand some basic metrics, it’s often necessary to perform additional operations on the data such as aggregations, cleansing and SQL querying, which are not supported by CloudWatch out of the box.

Read More

Protecting PII & Sensitive Data on S3 with Tokenization

Feb 24, 2020 3:48:46 PM / by Roy Hegdish posted in Data Lake, Amazon S3, Data security, PII


Read More

Custom Partitioning for Embedded Analytics with Athena

Feb 13, 2020 7:18:50 PM / by Eran Levy posted in Amazon Athena, Partitioning, Amazon S3, Glue Data Catalog


Read More

Data Architecture for AWS Athena: 6 Examples to Learn From

Feb 5, 2020 4:43:18 PM / by Eran Levy posted in Amazon Athena, Data Engineering, Data Lake ETL, Amazon Redshift, Amazon Kinesis, Amazon S3, Apache Parquet


Read More

Apache Spark Limitations & the Self-service Alternative

Jan 23, 2020 3:41:15 PM / by Eran Levy posted in Data Engineering, Apache Spark, DevOps


Read More

14 Best Data Engineering Podcasts, Blogs and Websites

Jan 16, 2020 11:38:51 AM / by Eran Levy posted in Apache Kafka, Data Engineering, Blogs, Podcasts, Netflix, DevOps, NoSQL, Uber


Read More

Upsolver Lookup Tables: A Decoupled Alternative to Cassandra

Jan 9, 2020 3:16:54 PM / by Eran Levy posted in Streaming Data, Lookup Tables, Apache Cassandra, Schema Discovery


Read More