
Question from ignorance: how do you get "petabytes of data" into the Google Cloud in a reasonable time? I find copying a mere few TB can take days, and that's on a local network, not over the internet.



The AWS Snowball service (https://aws.amazon.com/blogs/aws/aws-importexport-snowball-t...) can transfer 1 petabyte per week. Amazon mails you hard drives, you copy your data onto them, and then you mail the drives back to Amazon, which uploads the data for you.


There's also Snowmobile, 100 PB of storage in a shipping container, which can be filled within 10 days.

https://aws.amazon.com/snowmobile


I'd also be interested to hear this.

I'm running a project that's 10 GB in size, and uploading the data to AWS S3 was absurdly slow.

Is there any way you found to speed up the upload? 10 GB was painful enough; I can't imagine uploading terabytes.


I don't work in this specific field, but during the first decade of this century I worked in broadcast video distribution.

At the time, UDP-based tools such as Aspera[1], Signiant[2] and FileCatalyst[3] were all the rage for punting large amounts of data over the public Internet.

[1] http://asperasoft.com/

[2] http://www.signiant.com/

[3] http://filecatalyst.com/


Aspera is the current winner in bioinformatics. The European Bioinformatics Institute and the US NCBI are both big users of it, mainly for INSDC (GenBank/ENA/DDBJ) and SRA (Short Read Archive) uploads.

For UniProt, a smaller dataset, we just use it to clone servers and data from Switzerland to the UK and US at 1 GB/s over the wide-area internet.

Very fast, and quite affordable.


I used Aspera for a while, but plain old HTTP over commodity networks works fine if you balance your transfers over many TCP connections.
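
For example, a minimal sketch of that approach for S3 with boto3 (bucket and file names are placeholders; assumes credentials are already configured):

    # Parallel multipart upload: each in-flight part gets its own connection.
    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client("s3")

    config = TransferConfig(
        multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MB
        multipart_chunksize=64 * 1024 * 1024,  # 64 MB parts
        max_concurrency=10,                    # up to 10 parts in flight at once
        use_threads=True,
    )

    # "big_dataset.tar" and "my-bucket" are placeholders.
    s3.upload_file("big_dataset.tar", "my-bucket", "big_dataset.tar", Config=config)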


Jim Kent wrote a small program, parafetch - basically an FTP client that parallelized uploads. It worked reasonably well, speeding things up by maybe 10x. You can get it somewhere on the UCSC website in his software repository, though it involves compiling the C code.
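
Not Kent's code, but the general idea behind these parallel-transfer tools is easy to sketch: split the file into byte ranges and move each piece over its own connection. A rough Python version, shown here as a parallel HTTP range download (URL and filenames are made up; assumes the server honors Range requests):

    import concurrent.futures
    import requests

    URL = "https://example.org/big_dataset.tar"  # placeholder
    CHUNK = 64 * 1024 * 1024                     # 64 MB per range request

    def fetch_range(byte_range):
        start, end = byte_range
        resp = requests.get(URL, headers={"Range": f"bytes={start}-{end}"}, timeout=300)
        return start, resp.content

    total = int(requests.head(URL).headers["Content-Length"])
    ranges = [(i, min(i + CHUNK, total) - 1) for i in range(0, total, CHUNK)]

    # Fetch up to 10 ranges at a time and stitch them back together on disk.
    with open("big_dataset.tar", "wb") as out, \
            concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
        for start, data in pool.map(fetch_range, ranges):
            out.seek(start)
            out.write(data)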


For GCS, the gsutil program can saturate a 1 Gbit NIC using "gsutil -m cp -R"


The fastest way to upload is to ship hard drives in an airplane.


OK -- Tanenbaum's "station wagon full of tapes" updated for the 21st century.


Tanenbaum always forgot to include the time spent writing and reading the tapes. Typical 10 TB hard drives (which most people use for data interchange instead of tapes) only have ~100 MB/sec of bandwidth (about the same as a 1 Gbit NIC).
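
Back of the envelope: 10 TB at ~100 MB/sec is about 100,000 seconds, or roughly 28 hours, just to fill one drive, and about the same again to read it back out at the destination.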



