Explore our expert-made templates & start with the right one for you.
Data ingestion is a key component of big data architectures. While storing all your data on unstructured object storage such as Amazon S3 might seem simple enough, there are many potential pitfalls you’ll want to avoid if you want to prevent your data lake from becoming a data swamp. The way your data is ingested can dramatically impact the performance and usefulness of your data lake.
Having trouble keeping up with data lake best practices? We can automate that for you. Check out our data lake ingestion solution today, or schedule a demo to see Upsolver in action. To learn more about data pipelines and see interactive examples, check out our data pipeline examples.
In this article we’ll look at 7 best practices for ingesting data into your lake – from strategic principles down to the more tactical (and technical) issues that you should be aware of when building your ingest pipelines.
What data lake ingestion is, and how it’s different from writing to a data warehouse
Data lake ingestion is simply the process of collecting or absorbing data into object storage such as Hadoop, Amazon S3, or Google Cloud Storage. For a streaming source, ingestion would usually be continuous, with each event or log stored soon after it is received in the stream processor. For batch data, ingestion might be periodical – i.e., a script that runs once a day to import new records in the CRM and write them to the lake as CSV or JSON files.
In a data lake architecture, ingestion is simpler than in a data warehouse. This is because data lakes allows you to store semi-structured data in its native format (read more about data lakes and data lake platforms). This removes the need to manually wrangle the data and fit it into existing database schemas, which can become a complex and costly ELT process when dealing with large volumes of semi-structured data. Additionally, data lake ingestion can include types of data that would not be able to be stored in an EDW – including image, video, and audio data.
However, this doesn’t mean that data ingestion doesn’t matter – just that there are different areas to focus on.
Why data lake ingestion is important (and you should think about it early)
While data lakes are typically predicated on the notion of “store now, analyze later,” taking this approach too literally will not end well. Dumping all of your data into the lake without having at least a general idea of what you’d like to do with the data later on (and what that will entail) could result in your data lake becoming a “data swamp” – a murky bog of unfamiliar, uncategorized datasets which no one really knows how to make useful.
Data that is not ingested and stored according to best practices will be very difficult to access further down the line, and mistakes made early on could come back to haunt you for months or years because once data is in the lake, it becomes very difficult to reorganize.
In addition, proper data ingestion should address functional challenges such as:
- Optimizing storage for analytic performance, which is often best done upon ingest (data partitioning, columnar file format, etc.)
- Ensuring exactly-once processing of streaming event data without data loss or duplication
- Visibility into the data you’re bringing in, especially in the case of frequent schema changes
7 Best Practices for Big Data Ingestion Pipelines
Now that we’ve made the case for why effective ingestion matters, let’s look at the best practices you want to enforce in order to avoid the above pitfalls and build a performant, accessible data lake.
Note that enforcing some of these best practices requires a high level of technical expertise – if you feel your data engineering team could be put to better use, you might want to look into an ETL tool for Amazon S3 that can automatically ingest data to cloud storage and store it according to best practices.
For the purposes of this article we’ll assume you’re building your data lake on Amazon S3, but most of the advice applies to other types of cloud or on-premises object storage including HDFS, Azure Blob, or Google Cloud Storage; they also apply regardless of whichever framework or service you’re using to build your lake – Apache Kafka, Apache Flume, Amazon Kinesis Firehose, etc.
1. Have a plan
This is more of a general guideline but important to keep in mind throughout your big data initiative: blindly dumping your data into S3 is a bad strategy. You want to maintain flexibility and enable experimental use cases, but you also need to be aware of the general gist of what you’re going to be doing with the data, what types of tools you might want to use, and how those tools will require your data to be stored in order to work properly.
For example, if you’re going to be running ad-hoc analytic queries on terabytes of data, you’re likely going to be using Amazon Athena. Optimizing lake storage can dramatically impact the performance and costs of your queries in Athena (check out our webinar on ETL for Amazon Athena to learn more).
2. Create visibility upon ingest
You shouldn’t wait for data to actually be in your lake to know what’s in the data you’re bringing in. You should have visibility into the schema and a general idea of what your data contains even as it is being streamed into the lake, which will remove the need for “blind ETLing” or reliance on partial samples for schema discovery later on.
Upsolver’s schema discovery is built to do exactly this; an alternative would be to write ETL code that pulls a sample of the data from your message broker and infer schema and statistics based on that sample.
Watch this short video to learn how Upsolver deals with schema evolution, or check out the related pipeline example:
3. Lexicographic ordering on S3
You should use a lexicographic date format (yyyy/mm/dd) when storing your data on S3. Since files are listed by S3 in lexicographic order, failing to store them in the correct format will cause problems down the line when retrieving the data.
Note that while Amazon previously recommended randomizing prefix naming with hashed characters, this is no longer the case according to their most up-to-date documentation.
4. Compressing your data
While storage on S3 is cheap, everything in life is relative, and when you’re processing terabytes or petabytes of data on a daily basis, that’s going to have an impact on your cloud bill – which is why you want to compress your data.
When choosing the compression format, note that very strong compression might actually increase your costs when querying the data since data will need to be uncompressed before querying – which will incur compute costs that are greater than what you’re saving on storage. Using super-strong compression such as BZ2 will make your data lake completely unusable!
Instead, opt for “weaker” compression that reads fast and lowers CPU costs – we like Snappy for this – to reduce your overall cost of ownership.
5. Reduce the number of files
We’ve written previously about dealing with small files on S3 when it comes to query performance – but even before you run your compaction process, you should try to reduce the number of events you’re ingesting in a single write.
Kafka producers might be sending thousands of messages per second – storing each of these as separate files will increase disk reads and degrade performance. At Upsolver we write minutes instead of seconds for this reason.
6. Self-describing File Formats
You want every file you store to contain the metadata needed in order to understand the data contained in that file, as that will make analytic querying far simpler. For optimal performance during query execution you want to use a columnar format such as Parquet or ORC. For the ETL staging layer you can use row-based Avro or JSON.
7. Ensuring exactly-once processing
Duplicate events or missed events can significantly hurt the reliability of the data stored in your lake, but exactly-once processing is notoriously difficult to implement since your storage layer is only eventually (not instantly) consistent. Upsolver provides exactly-once processing from Kafka or Kinesis via idempotent operations – if coding your own solution you would need to do the same. This is a complex topic which we will try to cover elsewhere.
Learn more about data lake best practices:
- Read our guide to comparing data lake ETL tools
- Watch our webinar with AWS and ironSource to learn how ironSource operationalize a petabyte-scale data lake in the cloud
- Learn about Upsolver’s declarative data pipelines
- Try SQLake for free (early access). SQLake is Upsolver’s newest offering. It lets you build and run reliable data pipelines on streaming and batch data via an all-SQL experience. Try it for free. No credit card required.
Schedule a friendly chat with our solution architects to learn how to improve your data architecture