This kind of approach can probably scale pretty far before you actually need to resort to true distributed processing: compression, simple transforms, R, etc. You can probably get away with even more by just putting the files on a networked filesystem and using inotify to kick off work as they show up.
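To make the "shared directory as a work queue" idea concrete, here is a minimal sketch of one worker pass. It is an illustration under assumptions, not a description of any particular setup: the `inbox`/`outbox` directory layout and the `process_new_files` helper are hypothetical, and it scans the directory rather than calling inotify, since Python's standard library has no inotify binding. As far as I know, inotify also generally only sees changes made through the local kernel, so writes from other machines on an NFS mount may not fire events, and a periodic scan like this is a common fallback.

```python
import gzip
import shutil
from pathlib import Path


def process_new_files(inbox: Path, outbox: Path, seen: set) -> list:
    """Gzip every not-yet-seen regular file in `inbox` into `outbox`.

    `inbox`, `outbox`, and `seen` are hypothetical names for this sketch.
    Returns the list of output paths produced by this pass.
    """
    outbox.mkdir(parents=True, exist_ok=True)
    produced = []
    for src in sorted(inbox.iterdir()):
        if not src.is_file() or src.name in seen:
            continue
        dst = outbox / (src.name + ".gz")
        tmp = outbox / (src.name + ".gz.tmp")
        # Write to a temporary name first so readers never see a partial file.
        with src.open("rb") as fin, gzip.open(tmp, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        tmp.replace(dst)  # rename within the same directory
        seen.add(src.name)
        produced.append(dst)
    return produced
```

A worker would call this in a loop with a short sleep (or from an inotify callback when the writer is on the same machine), keeping `seen` around between passes. The "compress" step is just a stand-in for whatever simple transform you need.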