I have been following this discussion [1], and the challenges of using the decahose came up in a few places. That leads me to ask:
We are a brand-new research group with just a few hands, and this Twitter data is enormous (45-50 GB/day of JSON for East Asia).
We have limited experience, so for now we are saving it out as daily logs in flat JSON files.
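To be concrete, our setup is little more than appending each tweet as a line of JSON to a per-day compressed file and scanning it linearly when we need to search. A minimal sketch (filenames and field names below are illustrative, not our actual schema):

```python
import gzip
import json

def append_tweets(path, tweets):
    """Append tweets to a gzipped, newline-delimited JSON daily log."""
    with gzip.open(path, "at", encoding="utf-8") as f:
        for t in tweets:
            f.write(json.dumps(t) + "\n")

def scan(path, predicate):
    """Linear scan of one daily log, keeping tweets matching a predicate."""
    hits = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            tweet = json.loads(line)
            if predicate(tweet):
                hits.append(tweet)
    return hits

if __name__ == "__main__":
    log = "2022-06-10.jsonl.gz"  # hypothetical daily log name
    append_tweets(log, [
        {"id": 1, "lang": "ja", "text": "..."},
        {"id": 2, "lang": "ko", "text": "..."},
    ])
    japanese = scan(log, lambda t: t["lang"] == "ja")
    print(len(japanese))  # 1
```

This obviously does not scale to ad hoc queries over months of 45-50 GB/day data, which is exactly why we are asking.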
For those of you using the decahose, what kind of system architecture have you put in place for storing and searching data at this scale? We explored AWS DynamoDB and a MongoDB data lake, but the costs seemed just too high. Feedback and suggestions would be much appreciated.
[1] Twitter plans to comply with Musk's demands for data: https://news.ycombinator.com/item?id=31686055