Upsolver’s Execution Advantage
Besides simplifying data preparation, Upsolver also delivers high performance when processing intensive transformations at scale. It does this via the first indexing engine for data lakes, combined with breakthrough compression technology that lets you index 10X the data in the same RAM footprint without requiring a separate NoSQL database cluster. To keep your costs down, processing runs on extremely affordable spot instances, which can be 90% cheaper than reserved compute.
Indexing engine and innovative compression reduce RAM footprint by 90%
Upsolver created an indexing system called Lookup Tables to manage state. Lookup Tables are the world’s first fully decoupled key-value store. They store the results of a GROUP BY query that runs continuously on incoming data streams; the relevant results can be fetched using a lookup key and a snapshot timestamp.
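To make the fetch-by-key-and-snapshot idea concrete, here is a minimal Python sketch of a continuously updated group-by aggregate that can be queried as of any past timestamp. All class and method names are illustrative, not Upsolver's actual API, and the aggregate here is a simple count.

```python
import bisect
from collections import defaultdict

class LookupTable:
    """Sketch of a lookup table: a continuous group-by over a stream,
    queryable by key and snapshot timestamp (illustrative only)."""

    def __init__(self):
        # key -> sorted list of (timestamp, aggregate_value) snapshots
        self._snapshots = defaultdict(list)
        self._counts = defaultdict(int)

    def ingest(self, key, timestamp):
        # Continuously update the group-by aggregate (here: a count)
        self._counts[key] += 1
        self._snapshots[key].append((timestamp, self._counts[key]))

    def get(self, key, as_of):
        # Return the aggregate as it stood at snapshot time `as_of`
        history = self._snapshots[key]
        i = bisect.bisect_right(history, (as_of, float("inf")))
        return history[i - 1][1] if i else None

table = LookupTable()
table.ingest("user_42", 100)
table.ingest("user_42", 200)
print(table.get("user_42", 150))  # count as of t=150 -> 1
```

Keeping snapshots sorted by timestamp is what makes both the millisecond "get by key" and the time-travel queries described later cheap: each lookup is a hash probe plus a binary search.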
Lookup Tables are persisted to object storage and cached in RAM for fast retrieval (< 1 millisecond). They require a relatively small RAM footprint (roughly 10% of a traditional key/value store such as Apache Cassandra), reducing the needed size and cost of the underlying cloud compute instance. By relying on object storage and avoiding local disk storage, Lookup Tables dramatically reduce cost relative to an external key/value store, especially when it comes to scaling, replication and recovery.

Since Lookup Tables serve random-access “get by key” requests, storing data in a pure columnar format introduces significant overhead: the value from each column would need to be found individually, creating many random memory accesses that slow retrieval considerably. To address this, Lookup Tables are based on a new file format and compression algorithm that combine efficient columnar compression with millisecond key-based queries. Lookup Tables are computed using streaming SQL over multiple sources of streaming and batch data.
Using Spot Instances to reduce compute costs by 90%
Upsolver’s storage and compute layers are fully decoupled. To minimize your compute costs, Upsolver uses EC2 spot instances for all processing and uses only the object store for storage, lowering compute costs by up to 90%. Since AWS can interrupt spot instances without notice, Upsolver utilizes a pool of spot instances that mutually back one another up, ensuring no compromise between availability and cost.
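The pool-of-spot-instances idea can be sketched as follows. This is an illustrative model, not Upsolver's actual implementation: because state lives in object storage rather than on local disk, a surviving worker can pick up an interrupted worker's tasks and resume from the last persisted checkpoint.

```python
class SpotPool:
    """Sketch of a pool of spot workers that redistribute tasks
    when AWS reclaims an instance (names are hypothetical)."""

    def __init__(self, workers):
        self.assignments = {w: set() for w in workers}

    def assign(self, task):
        # Place the task on the least-loaded live worker
        worker = min(self.assignments, key=lambda w: len(self.assignments[w]))
        self.assignments[worker].add(task)
        return worker

    def on_interruption(self, worker):
        # AWS reclaimed this spot instance: redistribute its tasks
        orphaned = self.assignments.pop(worker)
        for task in orphaned:
            self.assign(task)

pool = SpotPool(["spot-a", "spot-b", "spot-c"])
for t in range(6):
    pool.assign(f"task-{t}")
pool.on_interruption("spot-a")
# All 6 tasks still run, now spread across the two surviving workers
```

The key design point is that nothing is lost on interruption: tasks are metadata, and their inputs and checkpoints sit in object storage, so reassignment is cheap.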
Automatic elastic scaling
Upsolver never stores data on local server storage, so processing can elastically and automatically scale up or down according to the workload. An Upsolver compute cluster can be set to scale out when average CPU utilization reaches a customer-defined threshold, so that you can optimize your scaling strategy for low cost or low latency. Scale-down rules also allow selection of cost or latency, and are configured independently of scale-up rules.
Learn more about scaling strategies.
Millisecond latency queries on data in cloud object storage
While Lookup Tables are an architectural feature of Upsolver that power our data transformation engine, we also make them available for lightning fast analytics queries. They are particularly useful for situations where low latency meets high cardinality, such as:
- Joins between streaming sources – joining between ad impressions and clicks based on a user ID.
- Data-driven applications – a mobile app that needs to retrieve purchase history upon user request.
- Device or user-level aggregations – creating a single view of user activity for advertising or analytics purposes.
- Real-time dashboards – create deeper real-time or near real-time analytics, rather than being constricted by batch processing latencies when reports require data to be joined or aggregated.
- Real-time machine learning – enables machine learning algorithms to take both real-time and historical data into account, resulting in more accurate modeling with less data engineering.
As a query tool, Lookup Tables provide the following features:
- Millisecond query response on high cardinality data
- Built-in aggregations to capture real-time behavior using window, nested and time-series aggregations.
- Time travel (Replay) on historical data, without needing to define the aggregation in advance.
- Dramatically lower infrastructure costs with 10x-15x more data indexed in the same memory footprint vs. Cassandra-based alternatives.
- Decoupled storage and compute to avoid the constant ETL effort associated with traditional key/value stores.