Flattening JSON arrays with Athena

If you have an S3 bucket chock-full of newline-separated JSON blobs (e.g. from a Firehose ingest pipeline), you may want to query those files with SQL. This is the promise of Athena.

If your JSON blobs all share the same shape, you can specify the schema when creating the table, and parsing is taken care of for you. The rest of the time, you are in the land of “semi-structured data”.

CREATE EXTERNAL TABLE `foo_202510`(
  `id` bigint COMMENT 'from deserializer',
  `j` string COMMENT 'from deserializer')
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES ( 
  'case.insensitive'='TRUE', 
  'dots.in.keys'='FALSE', 
  'ignore.malformed.json'='FALSE', 
  'mapping'='TRUE') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://some-bucket/2025/10/'
TBLPROPERTIES (
  'classification'='json', 
  'transient_lastDdlTime'=...)

The JSON files in the S3 bucket are organised by YYYY/MM/DD, which lets me partition the data and only scan what I am interested in (there are many GBs of it). Assuming each JSON blob looks like this:

{ "id": 124232, "j": { ... } }

I am loading the id into one column, and the nested object (as a string) into another. I can then run a query against that table:

SELECT
    id,
    json_extract(j, '$.abc')
FROM "foo_202510"
LIMIT 1
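
json_extract returns a value of type json. If the field being pulled out is a plain string or number, json_extract_scalar returns it as a varchar instead, which is usually easier to compare and group on. A quick variant of the query above (assuming $.abc is a scalar field):

SELECT
    id,
    json_extract_scalar(j, '$.abc') abc
FROM "foo_202510"
LIMIT 1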

A more complicated scenario is that the nested JSON contains an array that I wish to flatten, so that the output has one row per item in the array:

WITH data AS (
    SELECT
        id,
        cast(json_extract(j, '$.a') as array(json)) arr
    FROM "foo_202510"
    WHERE ...
)
SELECT
    id,
    json_extract(a, '$.b') b
FROM data
CROSS JOIN UNNEST(arr) AS t(a)

We start with a CTE, extracting the array and casting it to the correct type. We can then CROSS JOIN the temp table against UNNEST of the array, to get a row per array item.
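
One caveat: CROSS JOIN UNNEST drops rows where the array is empty or NULL. If those rows should survive (with a NULL item), a LEFT JOIN against UNNEST should do it on the Trino-based Athena engine (v3). This is a sketch, not part of the original query:

WITH data AS (
    SELECT
        id,
        cast(json_extract(j, '$.a') as array(json)) arr
    FROM "foo_202510"
    WHERE ...
)
SELECT
    id,
    json_extract(a, '$.b') b
FROM data
-- keep ids whose array is empty or NULL; they come through with NULL in b
LEFT JOIN UNNEST(arr) AS t(a) ON TRUE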
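
Finally, the DDL above points at a single month's prefix (s3://some-bucket/2025/10/), which implies one table per month. An alternative to that setup, sketched here with an assumed date range and bucket layout, is a single table using Athena partition projection over the YYYY/MM/DD keys:

-- the dt range and bucket layout below are assumptions, not from the original table
CREATE EXTERNAL TABLE `foo`(
  `id` bigint,
  `j` string)
PARTITIONED BY (`dt` string)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe'
LOCATION
  's3://some-bucket/'
TBLPROPERTIES (
  'classification'='json',
  'projection.enabled'='true',
  'projection.dt.type'='date',
  'projection.dt.format'='yyyy/MM/dd',
  'projection.dt.range'='2025/01/01,NOW',
  'projection.dt.interval'='1',
  'projection.dt.interval.unit'='DAYS',
  'storage.location.template'='s3://some-bucket/${dt}/')

Queries then filter on dt (e.g. WHERE dt = '2025/10/15') and Athena only reads the matching prefixes, with no ALTER TABLE ADD PARTITION housekeeping.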
