For a job that needs to access hundreds of thousands of small files, the ability to read the metadata quickly is very important.
This is the wider issue with small files. On HDFS each file consumes some NameNode memory, and jobs that need to touch 100k+ files (which I have seen plenty of) put real strain on the NameNode as well.
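To put a rough number on the NameNode memory point: each file and block is tracked as an in-heap object, commonly estimated at around 150 bytes apiece. The figure and the helper below are illustrative assumptions, not exact Hadoop internals:

```python
# Back-of-envelope NameNode heap estimate. The ~150 bytes per
# file/block object is a commonly cited rule of thumb, not exact.
BYTES_PER_OBJECT = 150  # assumed heap cost per inode or block object

def namenode_heap_bytes(num_files, blocks_per_file=1):
    # Each file contributes one inode object plus its block objects.
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

# 100,000 small single-block files vs. the same data repacked
# into 1,000 large files of 100 blocks each.
small = namenode_heap_bytes(100_000)
large = namenode_heap_bytes(1_000, blocks_per_file=100)
print(f"small files: ~{small / 1e6:.1f} MB, repacked: ~{large / 1e6:.1f} MB")
```

The absolute numbers look modest, but they scale linearly with file count and apply to a single JVM heap that every metadata operation in the cluster also contends for.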
I don't have enough experience with S3 to know how it behaves for metadata queries over lots of small objects.