Thanks for expressing interest. I'll respond in order.
> - At the risk of birthing a page-long subthread on ZFS-vs-everything-else... what storage solution are you using, and why?
Right now, XFS and HDFS. This is primarily due to my prior experience with them; I have considered ZFS, and might move to that later on. I may also dispense with HDFS for shared storage and to reduce the redundant JVM memory footprint. I'm attracted to ZFS for its compression and error correction features; I just don't have any actual experience using it.
> - What sort of hardware are you using? (This is a non-catalyzing question, and is just out of curiosity)
Four bare metal servers, three of which have dual Xeon E5-2630 v4s (10 core, 20 thread, 2.2 GHz), one of which has an i7-6900K (8 core, 16 thread, 3.2 GHz). The latter server has 10TB SATA SSD capacity and four GTX 1080 GPUs. The other three have 2TB of NVMe SSD capacity. Each of the four has 128GB DDR4 2400 MHz RAM. Currently all storage is local to the hardware, but there is a SuperMicro SC847 for future expansion. Every server has dual 10Gb/s SFP+ connections bonded with 802.3ad LACP, and all of them are connected via a 16 port 10G network switch in the same rack. I think that's everything at a high level, off the top of my head.
> - How did this get started, and how are you managing this?
I joined John Baez and a few other scientists/researchers working on the Azimuth Climate Project, which sought to preserve critical snapshots of climate data in the event of mass defunding. Later on I decided to take this a step further since those snapshots were becoming very out of date and the multifarious data repositories were never very well documented, organized or normalized. Not sure what you mean by "managing this" though.
> - Will you ever be interested in accepting donations or funding? (Including on a voluntary basis; and including with clear stipulations/structure about management)
I initiated the process of starting a formal nonprofit, but not for the purpose of soliciting donations; rather, just so that it would be clear it's a noncommercial activity. I might be interested in that kind of thing - in either way you mentioned - once I have exhausted my own reasonable resources for the task and have to significantly expand.
> - What sorts of decisions/motivated this initiative?
It began with reading worrydream's blog post, "What can a technologist do about climate change?": http://worrydream.com/ClimateChange/. Not much more to it than that.
> - Do you have any interest in creating this as a "community hub", with a central code repo that data scientists can push updates to that then get run on the cluster? If the visualization data (or prerendered bits of it) are openly cached/accessible, having the code that generated the data equivalently open/available could be interesting.
Yes, that's an interesting idea.
> - What kind of availability/openness are you looking at with data? (TL;DR translation: rate limiting)
Well, everything is intended to be extremely transparent, so I'm going to open source all software (infrastructure, development, research, etc.). I'm also going to keep all data open. In practice there will probably be a limit of 1000 or so queries per day per IP address, but I'll burn that bridge when I get to it. It really depends on how "real time" queries end up being, and how much abuse the system actually receives.
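For what it's worth, a per-IP daily cap like that could be sketched roughly as below. This is only an illustration, not what's deployed: the `DailyRateLimiter` class, its `allow` method, and the in-memory counter are all hypothetical, and the 1000 figure is just the ballpark mentioned above. A real setup would probably keep the counts in a shared store so every server sees them.

```python
from collections import defaultdict
from datetime import datetime, timezone

DAILY_LIMIT = 1000  # assumption: the "1000 or so" ballpark, not a settled number


class DailyRateLimiter:
    """Fixed-window limiter: counts queries per IP and resets each UTC day."""

    def __init__(self, limit=DAILY_LIMIT):
        self.limit = limit
        self._day = None                  # UTC date the current counts belong to
        self._counts = defaultdict(int)   # ip -> queries seen on that date

    def allow(self, ip, now=None):
        """Return True and count the query if `ip` is still under the limit."""
        now = now or datetime.now(timezone.utc)
        today = now.date()
        if today != self._day:
            # A new day has started, so drop all of yesterday's counts.
            self._day = today
            self._counts.clear()
        if self._counts[ip] >= self.limit:
            return False
        self._counts[ip] += 1
        return True
```

Calling `limiter.allow(client_ip)` before each query would then be enough to decide whether to serve or reject it. A fixed daily window is the simplest choice; it allows a burst right after the reset, which seems acceptable until the actual abuse patterns are known.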
Thanks for the reply! Not sure if you'll see this; had some unexpected delays finishing this comment.
Everything is duly noted; I'll just expand on some bits.
> I joined John Baez and a few other scientists/researchers working on the Azimuth Climate Project, which sought to preserve critical snapshots of climate data in the event of mass defunding. Later on I decided to take this a step further since those snapshots were becoming very out of date and the multifarious data repositories were never very well documented, organized or normalized.
Archive.org is very probably interested in this sort of thing then, FWIW.
> Not sure what you mean by "managing this" though.
Heh :) I was curious what sort of work you do in order for this initiative, which appears(?) to be somewhat of a side project, to be viable in terms of budget and adequate spare time. I'm also interested in storing/working with somewhat large amounts of data (one project I want to try at some point is implementing an infinite browser cache so I can "Google" the content of every version of every webpage I've ever visited), so I definitely want to optimize for something that doesn't require much time :) (don't we all)
> It began with reading worrydream's blog post, "What can a technologist do about climate change?" ... Not much more to it than that.
After having read that page I now understand the sentiment of that statement. Major TIL :)
And if only all webpages were that well designed... (The interactivity was a bit of an information overload, but I really liked the layout.)