
Upstream

The Big Data Blog

Problems with Small Files on HDFS / S3? Make Them Bigger

Oct 23, 2018 2:08:54 PM / by Eran Levy posted in Amazon Athena, HDFS, S3

More often than not, big data is made up of a lot of small files. Event-based streams from IoT devices, servers, or applications typically arrive as kilobyte-scale JSON files, which easily add up to hundreds of thousands of new files ingested into your data lake every day.

Writing small files to storage such as Amazon S3, Azure Blob, or HDFS is easy enough; however, querying the data in this state with an SQL engine such as Athena or Presto will absolutely kill both your performance and your budget.
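To make the "make them bigger" idea concrete, here is a minimal sketch of compaction, assuming the small files are newline-delimited JSON sitting in a local directory. The function names (`plan_batches`, `compact`), the `part-00000.json` naming, and the target size are all hypothetical choices for illustration; a real pipeline would read from and write to S3 or HDFS rather than the local disk.

```python
from pathlib import Path


def plan_batches(files, target_bytes):
    """Group (path, size) pairs into batches whose total size stays at or under target_bytes.

    A single file larger than target_bytes still gets a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for path, size in files:
        # Start a new batch when adding this file would overshoot the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches


def compact(src_dir, dst_dir, target_bytes):
    """Concatenate small newline-delimited JSON files into fewer, larger files."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    # Sort by name so the output is deterministic.
    files = [(p, p.stat().st_size) for p in sorted(src.glob("*.json"))]
    outputs = []
    for i, batch in enumerate(plan_batches(files, target_bytes)):
        out = dst / f"part-{i:05d}.json"  # hypothetical naming scheme
        with out.open("wb") as sink:
            for path in batch:
                data = path.read_bytes()
                if not data:
                    continue  # skip empty files
                # Make sure records from different files never share a line.
                sink.write(data if data.endswith(b"\n") else data + b"\n")
        outputs.append(out)
    return outputs
```

Calling `compact("incoming", "compacted", 128 * 1024 * 1024)` would merge everything under `incoming/` into files of roughly 128 MB each; that figure is only an example, and the right target depends on the query engine and the file format.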
