#AWS - Athena query result management.
#aws_tips_and_trick

#AWS - Athena query result management. #aws_tips_and_trick

·

3 min read

What is Athena?

  • Amazon Athena is a serverless interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL
  • It is a "pay as you go service", you pay only for the queries that you run

How does it work?

  • Athena uses Presto under the hood to run queries. Presto is a distributed query engine for big data using the SQL query language.
  • Athena also uses Apache Hive. Apache Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop

What are Athena's query results?

  • Athena stores query results and metadata information for each query that runs automatically.
  • These results are stored in an S3 bucket that we can specify.
  • We can also encrypt the files if required.
  • Athena supports DML, and DDL queries.
  • Files are saved based on the name of the query, and the query ID. QueryID is a unique identifier that Athena assigns to each query when it runs. I.e: 0000e975-9f92-4adb-b853-71fc97dac20e.csv
  • For each query, the result will contain two files
    1. Actual data
    2. Metadata
  • DML query results are saved (CSV) format.
  • DDL query results are saved as plain text files.
  • Metadata files are saved in binary format.

What are DML and DDL?

  • DDL (Data Definition Language) queries include CREATE TABLE and ALTER TABLE ADD PARTITION. ​​
  • There are no charges for DDL queries.

  • DML (Data Manipulation Language) queries include SELECT, CREATE TABLE AS (CTAS), and INSERT INTO queries.

  • DML queries are charged for the number of bytes scanned rounded up to the nearest megabyte

What is a query quota?

  • A number of queries can be executed at any given time.
  • Query quota includes both running and queued queries.
  • Exhausting the query will result in a TooManyRequestsException error.

How to access the query results?

  • We can directly access/download the files from Athena Console/S3.
  • We can use AWS CLI / API to get the query results from Athena based on the unique QueryIds.
  • To access and view query output files, IAM principals (users and roles) need permission for the Amazon S3 GetObject action for the query result location, as well as permission for the Athena GetQueryResults action.
  • If the location is encrypted, users must have the appropriate key permissions to encrypt and decrypt the query result location.

Why should we bother?

  • Athena keeps a query history for 45 days.
  • CloudTrail keeps the log for 90 days.
  • But the Athena query results are kept forever.
  • Over a period of time, the query results get accumulated and incur a cost when kept in S3.

What can we do?

  • Setup a data rotation policy to maintain the Athena query results.
  • This can be achieved using S3 lifecycle policies.

Is there a risk involved in deleting the query results?

  • You can safely delete the old files as long as you don't want to go back and review the results of that particular query.
  • But keep in mind once the data is deleted it can’t be restored.
  • Also while trying to access the result will throw a "Could not find results" error.
  • We can always run the query to get the results again.
  • For any reason such as unexpected Athena usage, if we want the query results we will not be able to get them.

What are the best practices to be followed?

  • Use Athena Saved queries wherever possible.
  • Create and maintain a separate bucket for Athena query results.
  • Setup an S3 lifecycle to maintain the query results.
  • Set Up a monitoring system for collecting and storing the Athena logs periodically into the S3 bucket from Cloud Trail.

Reference: