Build a Lean, Mean Log Analytics Machine: An Architecture that Scales Performance and Cuts Costs

    Event logs are the archetypal source of streaming data: a continuously generated stream of semi-structured records that reflects the state of a software or hardware system at any given moment. Some of the most common sources are IoT sensors, server and network security logs, app activity, and advertising data.  The growth of software development, clickstream analytics, and app analytics has driven increased demand for log data analysis from data analytics teams, business units, and other non-IT stakeholders.

    Database Approach: Slow, Expensive, and Inefficient

    Traditionally, the task of collecting and analyzing logs fell mainly to IT and networking professionals and was served by search-oriented databases such as Splunk and Elasticsearch.  But as log data has grown in volume and velocity, and as demand has increased for using log data in new types of analysis – dashboards, operational analytics, ML model training, and more – the drawbacks of the database approach have become more apparent:

    1. High Cost. Databases lack elasticity, making them expensive and cumbersome to scale because you must constantly spin up, configure, and resize clusters based on changes in data volume and retention.
    2. Long Latency. Data isn’t available in real time due to the latencies built into batch ETL processes.
    3. Machine learning roadblocks. The database, which was built for OLAP analytics, isn’t a good fit for data scientists, who build models and train them against custom datasets. ML datasets require data from various points in time, which is easier to extract from an append-only log than from a datastore that allows updates and deletes (see the sketch after this list).  Further, data scientists want to script in Python/Scala to build their datasets and apply functionality that would be very cumbersome to express in SQL.
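    As a minimal illustration (the table and column names are hypothetical), a point-in-time training set can be carved out of an append-only event table with nothing more than a time filter – something far harder to reconstruct from a store whose rows have since been updated or deleted:

```sql
-- Hypothetical append-only event table; every record is immutable, so the
-- state of the world at any cutoff time can be reconstructed with a filter.
SELECT user_id,
       COUNT(*)        AS events_before_cutoff,
       MAX(event_time) AS last_seen
FROM   raw_events
WHERE  event_time <= TIMESTAMP '2023-01-01 00:00:00'  -- training cutoff
GROUP  BY user_id;
```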

    Managed Platform Approach

    Third-party managed log analytics platforms also exist.  They are efficient but costly, and lock-in can limit your data strategy.

    Log analytics platforms such as Splunk and Elasticsearch are great for needle-in-the-haystack searches by IT and security teams. But some users index everything into these platforms, only to find that the vast majority of the data is accessed infrequently and could therefore have been stored on cheaper alternatives such as Amazon S3. The cost of indexing data that is rarely or never searched can really add up, making for a very expensive haystack.  In addition, some end users prefer a SQL-based approach, which is challenging because these platforms’ data structures are not designed for SQL processing.

    Also, with managed platforms, technology lock-in can be the Achilles’ heel of your data strategy.  Log analytics platforms rely on a proprietary data format optimized for their own query engine. This limitation creates vendor lock-in and forces organizations to choose between two bad options: force a use case onto the existing platform, or replicate the data to another store, which introduces consistency and reliability issues.

    Decoupled, Open Source Architecture Approach

    So the database fell out of favor as the core datastore for log data, and organizations began searching for more scalable, cost-effective, and agile solutions. Recent years have witnessed the rise of the decoupled architecture, in which raw data is ingested and stored in the inexpensive object storage of a data lake, with compute resources provided on an ad-hoc basis to different analytics services per use case.

    This data lake approach bypasses the roadblocks of traditional databases described above.  It stores data as an append-only log on cloud storage while enabling consumers to analyze the data with best-of-breed engines per use case. With this architecture, scaling is elastic, cost is low, machine learning is natively supported, and data is stored in its raw format without ETL delays. The architecture relies heavily on open file formats (such as Apache Parquet) for storage and on metadata stores for data discovery and queries, as in the sketch below.  Logging databases are often still used, but at much lower cost, because so much of the data is offloaded onto inexpensive object storage.
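    For example, here is a minimal, purely illustrative sketch of the pattern using Amazon Athena DDL over Parquet files in S3, registered in the Glue Data Catalog (bucket, table, and column names are placeholders):

```sql
-- Illustrative: expose Parquet log files on S3 to SQL engines via the
-- Glue Data Catalog; bucket and column names are hypothetical.
CREATE EXTERNAL TABLE IF NOT EXISTS logs_raw (
  event_time  timestamp,
  level       string,
  service     string,
  message     string
)
PARTITIONED BY (event_date string)
STORED AS PARQUET
LOCATION 's3://my-data-lake/logs/';

-- After partitions are registered (e.g. MSCK REPAIR TABLE logs_raw), any
-- engine that reads the catalog – Athena, Spark, Trino – can query the
-- same files without copying them into a database.
SELECT service, COUNT(*) AS error_count
FROM logs_raw
WHERE event_date = '2023-01-01'
  AND level = 'ERROR'
GROUP BY service;
```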

    Here’s an example: building an enterprise data platform for IoT analytics via a data lake architecture using open-source stream processing frameworks and time-series databases such as Apache Spark/Hadoop, Apache Flink, InfluxDB, and many additional building blocks. This open-source toolset does get the job done, as the sketch below suggests.  But implementing it correctly can be an overwhelming task for all but the most data-savvy organizations. Deploying and orchestrating this type of data platform requires specialized big data engineers and a strong focus on data infrastructure, and there’s a good chance of delayed delivery, exorbitant costs, and thousands of engineering hours going to waste.
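    To make the point concrete, a single piece of that DIY stack might look like the following Flink SQL sketch, assuming a Kafka source and an S3 Parquet sink (topic, bucket, and schema are placeholders) – and this is before you account for cluster deployment, orchestration, catalog management, and monitoring:

```sql
-- Illustrative Flink SQL: read IoT events from Kafka, aggregate per minute,
-- and write Parquet files to S3. All names and options are placeholders.
CREATE TABLE iot_events (
  device_id   STRING,
  temperature DOUBLE,
  event_time  TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic'     = 'iot-events',
  'properties.bootstrap.servers' = 'broker:9092',
  'format'    = 'json',
  'scan.startup.mode' = 'earliest-offset'
);

CREATE TABLE device_minute_stats (
  device_id    STRING,
  window_start TIMESTAMP(3),
  avg_temp     DOUBLE
) WITH (
  'connector' = 'filesystem',
  'path'      = 's3://my-data-lake/iot/stats/',
  'format'    = 'parquet'
);

-- Continuous job: one-minute tumbling-window averages per device.
INSERT INTO device_minute_stats
SELECT device_id,
       TUMBLE_START(event_time, INTERVAL '1' MINUTE),
       AVG(temperature)
FROM iot_events
GROUP BY device_id, TUMBLE(event_time, INTERVAL '1' MINUTE);
```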

    The Upsolver SQLake Approach

    Upsolver SQLake is a declarative data pipeline platform for streaming and batch data. With SQLake you can easily develop, test, and deploy pipelines that extract, transform, and load data in the data lake and data warehouse in minutes instead of weeks.

    With SQLake, you can build reliable data pipelines using familiar SQL, with a fraction of the code required by other systems.  SQLake simplifies pipeline operations by automating tasks such as job orchestration and scheduling, file system optimization, data retention, and the scaling of compute resources.

    SQLake provides the flexibility of open-source log analytics stream processing without the complexity. Essentially, it sits between the AWS data storage layer (S3 + Glue Data Catalog) and analytics services such as Athena, Redshift, and Snowflake; a sketch of what a pipeline looks like follows below.  You can try SQLake for free, using either sample data or your own data.
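    The snippet below is a rough, hypothetical sketch of such an all-SQL pipeline – ingest raw events from S3 into a staging table in the lake, then continuously transform them into an analytics-ready table. Statement keywords, options, and object names are illustrative approximations rather than exact SQLake syntax; consult the SQLake documentation for the real statements:

```sql
-- Hypothetical sketch of a declarative, all-SQL pipeline; keywords and
-- names are approximations, not verbatim SQLake syntax.

-- 1. Connect to the S3 bucket holding the raw log files.
CREATE S3 CONNECTION raw_logs_s3
  AWS_ROLE = 'arn:aws:iam::111111111111:role/example_role';

-- 2. Ingestion job: copy raw JSON events into a staging table in the lake.
CREATE JOB load_raw_logs
  CONTENT_TYPE = JSON
  AS COPY FROM S3 raw_logs_s3
     BUCKET = 'my-log-bucket'
     PREFIX = 'app-logs/'
  INTO default_glue_catalog.raw.app_logs;

-- 3. Transformation job: continuously aggregate errors into an
--    analytics-ready table that Athena, Redshift, or Snowflake can query.
CREATE JOB aggregate_errors
  START_FROM = BEGINNING
  AS INSERT INTO default_glue_catalog.analytics.error_counts
     MAP_COLUMNS_BY_NAME
     SELECT service,
            COUNT(*)    AS error_count,
            $event_date AS event_date
     FROM default_glue_catalog.raw.app_logs
     WHERE level = 'ERROR'
       AND $event_time BETWEEN run_start_time() AND run_end_time()
     GROUP BY service, $event_date;
```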

    SQLake was built for turning event streams into analytics-ready datasets. This architecture enables a full breadth of functionality and use cases – operational reporting, ad-hoc analytics, and data preparation for machine learning.

    Note:  If you use Splunk as part of your log processing platform, you can continue using Splunk and just do the complex transformations in SQLake.  This will improve your overall performance.

    Read how Cox Automotive used log analytics to standardize cloud security analytics across 16 subsidiaries, saving $700k:

    Try SQLake for Free

    SQLake is Upsolver’s newest offering. It lets you build and run reliable data pipelines on streaming and batch data via an all-SQL experience. Try it for free. No credit card required. 

     
