The world generates an unfathomable amount of data every minute of every day, and data volumes continue to grow exponentially. This data spans a wide variety of structured and unstructured data from non-traditional sources, including system-generated log files from IoT networks and cybersecurity systems, transactional files generated by point-of-sale or supply chain systems, and customer-generated log files such as clickstream data.
As a result, organizations are shifting from batch processing to real-time stream processing. The need to process streaming data in near real time has grown in prominence, and stream processing is now an integral component of the enterprise data architecture.
Streaming technologies are not new, but they have matured in recent years. The earliest data streaming technologies were developed in the 1990s. Since then, the industry has moved from open-source Hadoop/Spark architectures to full-stack data streaming solutions, with end-to-end streaming data architectures built on the scalability of cloud data lakes.
4 data streaming architecture use cases
1. Peer39: Contextualizing billions of web pages for targeting and analysis
Peer39 is a leader in the digital marketing industry, providing page-level intelligence for targeting and analysis. The company analyzes over 450 million unique web pages every day to contextualize the true meaning of each page's topics and text, giving its clients the information they need to optimize their advertising spend by placing advertisements in the right place, at the right time, and in front of the right audience. Equally important, Peer39 helps advertisers comply with rigorous privacy regulations such as GDPR and CCPA, which is an integral part of their digital marketing campaigns.
Because of their legacy architecture and processes, Peer39 faced many challenges, including limited data availability, the production of inaccurate statistics, and reduced business agility.
Upsolver solved these challenges by implementing a cloud-native data stream processing platform with an easy-to-use UI, deploying a modernized data architecture in the cloud on time and under budget. The primary benefits of this data stack include the Upsolver compute engine and an automated, scalable workflow using AWS S3, Athena, and Kinesis, along with the Parquet file format.
2. ironSource: Deploying a petabyte-scale data lake with Upsolver, Amazon Athena, and Amazon S3
ironSource is an in-app monetization and video advertising platform, with over 80,000 apps using ironSource technologies to grow their business. The company operates at massive scale across its monetization platforms, namely apps, video, and mediation, with millions of end-user devices generating enormous amounts of streaming data. Its data generation statistics include 500K events per second and over 20 billion events daily.
The company’s use case requirements included data collection, storage, and preparation that support multiple use cases while minimizing infrastructure and engineering overhead. Additionally, ironSource had three requirements: the ability to scale up quickly, efficiently, and cost-effectively; flexibility; and resilience.
Firstly, because of the volume of data ironSource generates, the ability to store almost unlimited amounts of data in an S3 data lake without preprocessing it is critical. Secondly, ironSource uses the same data to support multiple business processes; this data must feed into many services, so the storage layer cannot be constrained by the rigidity and schema limitations of a relational database. Lastly, ironSource needs to recover from failure quickly and to ensure that errors further down the data pipelines do not affect the production environments. This requires a robust, resilient data architecture, one that benefits from all historical data being stored on S3.
Upsolver’s solution for these requirements starts with a data pipeline that runs from Apache Kafka, an open-source distributed event streaming platform, through the Upsolver platform and into the S3 data lake. The input stream is deployed through Upsolver. In other words, Upsolver pulls the data from Kafka and stores it in the AWS S3 data lake. This data is then made available to Athena, once again through the Upsolver interface, where it is processed or transformed into meaningful information and exposed to a Tableau interface.
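To make the ingestion step more concrete, here is a minimal Python sketch of one common way to land streamed events in a data lake: grouping them into date-partitioned objects that a query engine such as Athena can scan selectively. It is an illustration only, not Upsolver's implementation. The event fields, the `events` prefix, the partition layout, and the helper names `partition_key` and `batch_events` are all assumptions, and plain Python lists and dictionaries stand in for the Kafka topic and the S3 bucket.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(event, prefix="events"):
    """Build a Hive-style, date-partitioned key prefix for one event (hypothetical layout)."""
    ts = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
    return f"{prefix}/dt={ts:%Y-%m-%d}/hour={ts:%H}"


def batch_events(events, prefix="events"):
    """Group events by partition; each group becomes one newline-delimited JSON body."""
    batches = defaultdict(list)
    for event in events:
        batches[partition_key(event, prefix)].append(json.dumps(event, sort_keys=True))
    return {key: "\n".join(lines) for key, lines in batches.items()}


# Stand-in for messages pulled from a Kafka topic (made-up sample data).
sample_events = [
    {"event_id": "a1", "app": "game-1", "timestamp": 1700000000},
    {"event_id": "a2", "app": "game-2", "timestamp": 1700000100},
    {"event_id": "a3", "app": "game-1", "timestamp": 1700003600},
]

# Stand-in for the objects that would be written to the data lake.
objects = batch_events(sample_events)
for key in sorted(objects):
    print(key, len(objects[key].splitlines()))
```

In a real pipeline, each batch would be written to object storage, typically in a columnar format such as Parquet, and the partition columns would be registered with the query engine so that queries read only the partitions they need.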
3. Browsi: Managing ETL pipelines for 4 billion events
Browsi is an AI-powered advertising technology solution that helps publishers monetize content by analyzing web pages and using AI and machine learning to recommend where to create ad placements. Additionally, Browsi connects with the publisher’s ad service, embedding the impressions, requesting publisher demand per viewability, and tracking placement viewability. Browsi also tracks user engagement metrics and uses AI to improve its advertisement layout and viewability prediction.
Browsi employs a single data engineer tasked with maintaining its data architecture. The original data stack included a data lake infrastructure, with the data ingested by Amazon Kinesis and controlled by a Lambda function that ensured exactly-once processing. The ETL was handled by a batch process, coded in Spark/Hadoop, that ran on an Amazon EMR cluster once a day. Amazon Athena was then used to query this data. However, because of the batch latency, the data was 24 hours old, and the resulting information was out of date. Lastly, this solution was time-consuming and challenging to maintain.
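The exactly-once processing mentioned above is often approximated by making the consumer idempotent: streaming sources can deliver the same record more than once, so the consumer remembers which record IDs it has already handled and skips repeats. The sketch below shows that general technique in plain Python. It is not Browsi's Lambda function; the `id` field, the `process_once` helper, and the in-memory set are assumptions for illustration.

```python
def process_once(records, seen_ids, handler):
    """Apply handler to each record whose ID has not been seen; return the number processed."""
    processed = 0
    for record in records:
        record_id = record["id"]
        if record_id in seen_ids:
            continue  # duplicate delivery: already handled, so skip it
        handler(record)
        seen_ids.add(record_id)
        processed += 1
    return processed


seen = set()   # in-memory stand-in for a durable store of processed IDs
totals = []    # stand-in for the downstream side effect

first = process_once(
    [{"id": "e1", "value": 10}, {"id": "e2", "value": 5}],
    seen,
    lambda r: totals.append(r["value"]),
)
# A retried batch redelivers "e2" alongside a new record "e3".
retry = process_once(
    [{"id": "e2", "value": 5}, {"id": "e3", "value": 7}],
    seen,
    lambda r: totals.append(r["value"]),
)
print(first, retry, sum(totals))  # 2 1 22
```

In production, the set of seen IDs would need to live in durable storage and be updated atomically with the side effect, which is part of what makes hand-written exactly-once logic hard to maintain.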
Upsolver was therefore tasked with replacing Browsi’s Spark/Hadoop ingestion pipeline and ETL functions with an automated ETL pipeline. Because Browsi already had an AWS account, Upsolver was integrated seamlessly into its platform and quickly became Browsi’s central data lake ETL platform. Upsolver was also used to replace the Lambda architecture and the Spark/EMR structure, moving from batch to stream processing and reducing end-to-end latency from Kinesis to Athena. The company then built its output ETL flows to Athena for external reporting functions. For internal reporting, Upsolver creates daily data aggregations, which are processed by an internal reporting solution.
Lastly, the ETL pipelines are managed through Upsolver, freeing up time and reducing the overhead required to maintain custom code.
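As a rough illustration of what a daily data aggregation for internal reporting involves, the Python sketch below rolls raw events up into per-day, per-publisher counts. It is a simplified stand-in rather than Upsolver's actual output; the `publisher` and `timestamp` fields and the `daily_counts` helper are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def daily_counts(events):
    """Count events per (UTC day, publisher) pair."""
    counts = Counter()
    for event in events:
        day = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc).date().isoformat()
        counts[(day, event["publisher"])] += 1
    return dict(counts)


# Made-up sample events spanning two UTC days.
events = [
    {"publisher": "pub-a", "timestamp": 1700000000},
    {"publisher": "pub-a", "timestamp": 1700003600},
    {"publisher": "pub-b", "timestamp": 1700010000},
    {"publisher": "pub-a", "timestamp": 1700010000},
]

for (day, publisher), count in sorted(daily_counts(events).items()):
    print(day, publisher, count)
```

A streaming platform would typically maintain such aggregates incrementally as events arrive, instead of recomputing them from scratch in a once-a-day batch job.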
4. Clinch: Improving customer service with Upsolver
Clinch is an advertising technology company that delivers a hyper-personalized product, optimizing marketing campaigns in real time. Consequently, Clinch must process high volumes of data extremely quickly. The company tracks hundreds of millions of anonymous web users, generating 1 billion events per day.
Since its inception in 2012, the company’s data infrastructure has run on AWS, streaming all of its events into a NoSQL datastore where teams could query or aggregate the raw data for use by its products. However, as the company grew, this datastore became a bottleneck.
Upsolver was tasked with removing this bottleneck by implementing a data pipeline that transfers the data from source to store and by eliminating the manual work involved in data transformation. Because Clinch already had an AWS account, the Upsolver platform was integrated seamlessly with it, creating end-to-end data pipelines from source to an S3 data lake and enabling queries on the real-time streaming data. The result is an improved experience for Clinch’s customers.
Conclusion: Ready to deploy your own data streaming architecture?
Real-time data streaming is the way of the future, providing companies with near real-time information based on the billions of data events streamed per day. As the use cases above show, a data streaming architecture is a vital part of the modern enterprise: it delivers almost instantaneous, meaningful information to clients and management and drives strategic decision-making.
Are you ready to build a real-time data streaming architecture? Schedule a demo to learn how you can deploy an end-to-end automated data pipeline and streaming architecture in minutes with Upsolver.