If you’re working with streaming data in 2019, odds are you’re using Kafka - either in its open-source distribution or as a managed service via Confluent or AWS. The stream processing platform, originally developed at LinkedIn and available under the Apache license, has become pretty much standard issue for event-based data, spanning use cases from sensor readings to application logs to clicks on online advertisements.
Every software development team makes build-vs-buy decisions on a regular basis. For most coding problems, someone is offering a packaged or white-label solution. The decision whether to purchase a tool or develop an alternative in-house - to ‘build or buy’ - is typically made ad hoc, based on cost, existing engineering skill sets, and organizational culture.
Stream processing is a critical part of the big data stack in data-intensive organizations. Tools like Apache Storm and Apache Samza have been around for years and are now joined by newcomers like Apache Flink and managed services like Amazon Kinesis Data Streams.
Is it time to move your data lake to the cloud? As with any infrastructure choice, there are advantages and trade-offs to deploying in the cloud vs on-premises, and the decision needs to be made on a case-by-case basis, weighing considerations such as scale, cost, and available technical resources.
This article covers best practices for reducing the cost of Elasticsearch using a data lake approach. Want to learn how to optimize your entire streaming data infrastructure? Check out our technical whitepaper to learn how leading organizations generate value from cloud data lakes.
Elasticsearch is a fantastic log analysis and search tool, used by everyone from tiny startups to the largest enterprises. It’s a robust solution for many operational use cases as well as for BI and reporting, and performs well at virtually any scale - which is why many developers get used to ‘dumping’ all of their log data into Elasticsearch and storing it there indefinitely.
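One common way to avoid that open-ended storage bill is to cap how long indices stay in Elasticsearch, keeping only recent, frequently queried data hot while the full history lives in cheaper storage such as a cloud data lake. The sketch below is a minimal illustration of that idea using Elasticsearch’s index lifecycle management (ILM) API. It assumes an Elasticsearch 7.x cluster at localhost:9200, a hypothetical logs-* index naming scheme, a policy name of logs-retention, and that raw events are already being archived to object storage before the delete phase runs - none of these names come from the article itself.

```python
import requests

ES_URL = "http://localhost:9200"  # assumed local/dev cluster; adjust for your deployment

# Hypothetical lifecycle policy: roll indices over daily (or at 50 GB),
# then delete them after 30 days. The assumption is that the raw events
# have already been copied to cheaper storage (e.g. S3) by that point,
# so deleting the hot copy does not mean losing the data.
ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {"rollover": {"max_age": "1d", "max_size": "50gb"}}
            },
            "delete": {
                "min_age": "30d",
                "actions": {"delete": {}},
            },
        }
    }
}
resp = requests.put(f"{ES_URL}/_ilm/policy/logs-retention", json=ilm_policy)
resp.raise_for_status()

# Attach the policy to future indices matching a hypothetical "logs-*" pattern
# via a (legacy) index template. Bootstrapping the initial write index and
# rollover alias is omitted here for brevity.
index_template = {
    "index_patterns": ["logs-*"],
    "settings": {
        "index.lifecycle.name": "logs-retention",
        "index.lifecycle.rollover_alias": "logs",
    },
}
resp = requests.put(f"{ES_URL}/_template/logs-retention-template", json=index_template)
resp.raise_for_status()
print("Retention policy in place:", resp.json())
```

The rollover thresholds and the 30-day retention window are placeholders - the right values depend on your query patterns and on how quickly the data loses operational relevance once it has been offloaded to the lake.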