Amazon Athena is an interactive query service based on the open-source Apache Presto, that enables you to directly analyze data stored in Amazon S3 using ANSI SQL. Athena is serverless so there is no infrastructure to manage and maintain, and you only pay for the queries you run.
In order to start using Athena, you need to define the schema of your data stored in Amazon S3 and then you are ready to start querying it using SQL. The schema is defined using Amazon Glue Data Catalog which enables you to create a unified metadata repository across multiple services.
Athena can be used alongside or instead of traditional databases, depending on the specific business and technical scenario. However, it’s important to understand the differences between the two, and why you would choose one over the other.
How is Athena different from a database or data warehouse?
Athena is not a database but rather a query engine. This means that:
- Compute and storage are separate: databases both store data in rest, and provision the resources needed in order to perform queries and calculations. Each of these comes with direct and indirect overheads. Athena doesn’t store data – instead, storage is managed entirely on Amazon S3. Athena’s query service is fully managed, so that resources are allocated automatically by AWS as needed in order to perform a query.
- No DML interface – with Athena there is no need to model the data. I/O is a bottleneck in virtually every database, but with Athena this is a non-issue. And since you don’t need to waste I/O bandwidth on modeling the data, you can focus all compute resources on query processing.
Why use Athena? The Benefits
- Serverless design reduces IT overhead: Amazon Athena is serverless meaning there is no infrastructure to manage or configure on the user’s side. Using Athena is as simple as defining your query, and you only pay for the queries you run. There are no additional IT costs and no clusters to manage.
- SQL based: You can use Athena to run SQL queries on the required table which is configured in the Glue Data Catalog, or data sources you can connect to using the Athena Query Federation SDK. For users who are already fluent in SQL, there is no learning curve for getting started.
- Open architecture (no vendor lock-in): Athena facilitates an open approach to data, rather than lock-in to a specific tool or technology. This is manifested in different ways;
- Ubiquitous access – since your data is stored in an S3 bucket and schema is defined in the Glue Data Catalog, you can switch between query engines that can read from these sources without redefining the schema or creating a separate copy of the data.
- Separate storage and compute – in Athena there is a complete separation between compute and storage resources. The data is stored on your Amazon S3 account while Athena’s compute is provisioned by Amazon Web Services as a pooled resource among all Athena users.
- Open file formats – unlike many high-performance databases, Athena does not use a proprietary file format but rather supports standard open-source formats such as Apache Parquet, ORC, CSV, and JSON.
- Low cost: Athena’s pricing model is based on terabytes of data scanned. You can control costs and keep them low by scanning only the required data to answer a specific query (this can be done via data partitioning – see below).
- Access to all your data: Most organizations ingest only between 30 to 35 percent of their data to a traditional data warehouse due to the high operational and infrastructure costs of constantly resizing database clusters. Using Athena allows you to query large volumes of data stored in a data lake on Amazon S3; and since this storage costs a fraction of what you would pay to store the same data in a data warehouse, you can work with higher volumes of data without concern.
- Security: With the introduction of AWS Lake Formation, it’s possible to set fine-grained access control in Amazon Athena. You can choose which users can perform which operations and on which data.
- Custom connectors: Amazon Athena allows you to run SQL queries over multiple data sources, which can power multiple business intelligence and analytics processes. You can use JDBC to connect Athena to BI and machine learning tools.
What are the limitations of Athena?
- No built-in insert / update / delete operations: Because Athene is a query engine and doesn’t have a DML interface, upserts can be challenging.
- Optimization is limited to queries: You can optimize your queries, not your data. Your data is already stored in Amazon S3; performing transformations for the sake of using Athena may affect others using the same data for other purposes. This needs to be addressed as part of your ETL process.
- Multi-tenancy means pooled resources: All users of Athena receive a similar SLA for queries at any given time. In other words, the entire global user base is “competing” for the same resources – and while AWS provisions more of these as needed, it could mean that query performance fluctuates based on other people’s usage.
- No indexing: Indices are built-in to traditional databases, but do not exist in Athena. This makes joins between large tables a heavy operation that increases the load on Athena negatively impacts performance, while running a query by key requires scanning all of the data and then searching for the required key in the list of results. This is solved using Upsolver lookup tables.
- Partitioning: Efficient queries in Athena require you to partition your data. It is important to keep the number of partitions in a ballpark that suits your performance needs. As a rule of thumb, every 500 partitions you scan adds 1 second to your query.
Additional required products
Athena is never a stand-alone product but rather always part of a stack that includes:
- Amazon S3: Athena queries run directly on top of Amazon S3, so this is where your data will be stored.
- Glue Data Catalog: A centralized managed schema that enables you to replace or supplement Athena with additional services as needed (for example with Amazon Redshift Spectrum).
- ETL tools: While Athena can run almost any query out-of-the-box, reducing costs and improving performance requires adherence to a set of performance tuning best practices. The traditional way is to use Spark, which can process large volumes of unstructured data; however, this option requires significant coding knowledge. There are some solutions that offer managed Spark as a service, which simplifies the infrastructure aspects but do not remove the coding overhead. An alternative is to use self-service data lake ETL tools such as Upsolver.
Typical Use Cases for Amazon Athena
While Athena is a versatile analytical tool, there are certain instances where it might be your natural go-to (assuming your data is on AWS of course). To learn more about use cases, check out these Athena architecture examples, as well as the resources linked below.
Many organizations store their system logs on Amazon S3. These can be logs generated by software applications or hardware such as servers and remote devices. Athena can then be used to query and analyze the data.
- Costs – can define longer retention periods due to low storage cost vs traditional logging databases such as Elasticsearch.
- SQL – most log storages don’t offer an SQL engine for analytics, which requires analysts to use unfamiliar tools
Examples of log analysis use case using Athena:
Business Intelligence and Online Analytical Processing (OLAP)
Athena is often used as the querying layer, with query results being used to build BI dashboards or support analytical processing.
- Cost – avoid the ongoing compute and storage costs of a traditional database.
- Access to all data – no need to prune your data to reduce infrastructure overhead (see above).
Examples of BI and analytics using Athena:
Athena allows analysts and data scientists to quickly run ad-hoc queries against large volumes of data in order to quickly answer a specific business question.
- 100% of the data is accessible (see above)
- No dependency on others for data modeling
- Minimal ETL
Examples of using Athena for research:
Looking for more Athena comparisons? Check out the following resources:
Want to get even more in-depth into the pros and cons of Athena compared to other tools/ Check out Athena vs Google BigQuery performance and costs, or our comparison of Athena and Amazon Redshift. If you want to multiply the value your organization gets out of Athena and to easily build, manage and optimize your entire cloud data lake architecture, get a free demo of Upsolver today.
Learn everything you need to build performant cloud architecture on Amazon S3 with our ultimate Amazon Athena pack, including:
– Ebook: Partitioning data on S3 to improve Athena performance
– Recorded Webinar: Improving Athena + Looker Performance by 380%
– Recorded Webinar: 6 Must-know ETL tips for Amazon Athena
– Athena compared to Google BigQuery + performance benchmarks
And much more. Get the full bundle for FREE right here.