
Question from ignorance: how do you get "petabytes of data" into the Google Cloud in a reasonable time? I find copying a mere few TB can take days, and that's on a local network, not over the internet.



The AWS Snowball service (https://aws.amazon.com/blogs/aws/aws-importexport-snowball-t...) can transfer 1 petabyte per week. Amazon mails you hard drives, you copy your data onto them, and then you mail the drives back to Amazon, which uploads the data for you.


There's also Snowmobile, 100 PB of storage in a shipping container, which can be filled within 10 days.

https://aws.amazon.com/snowmobile


I'd also be interested to hear this.

I'm running a project that's 10 GB in size, and uploading the data to AWS S3 was absurdly slow.

Is there any way you found to speed up the upload? 10 GB was painful enough; I can't imagine uploading terabytes.


I don't work in this specific field, but during the first decade of this century I worked in broadcast video distribution.

At the time, UDP-based tools such as Aspera[1], Signiant[2] and FileCatalyst[3] were all the rage for punting large amounts of data over the public Internet.

[1] http://asperasoft.com/

[2] http://www.signiant.com/

[3] http://filecatalyst.com/


Aspera is the current winner in bioinformatics. The European Bioinformatics Institute and the US NCBI are both big users of it, mainly for INSDC (GenBank/ENA/DDBJ) and SRA (Short Read Archive) uploads.

For UniProt, a smaller dataset, we just use it to clone servers and data from Switzerland to the UK and US at 1 GB/s over the wide-area internet.

Very fast, and quite affordable.


I used Aspera for a while, but plain old HTTP over commodity networks works fine if you balance your transfers over many TCP connections.
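
For example, a minimal sketch of that approach for S3 with boto3 (bucket and file names are placeholders; assumes credentials are already configured):

    # Parallel multipart upload: each in-flight part gets its own connection.
    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client("s3")

    config = TransferConfig(
        multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MB
        multipart_chunksize=64 * 1024 * 1024,  # 64 MB parts
        max_concurrency=10,                    # up to 10 parts in flight at once
        use_threads=True,
    )

    # "big_dataset.tar" and "my-bucket" are placeholders.
    s3.upload_file("big_dataset.tar", "my-bucket", "big_dataset.tar", Config=config)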


Jim Kent wrote a small program, parafetch - basically an FTP client that parallelized uploads. It worked reasonably well, speeding things up by maybe 10x. You can get it somewhere on the UCSC website in his software repository, though it involves compiling the C code.
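
Not Kent's code, but the general idea behind these parallel-transfer tools is easy to sketch: split the file into byte ranges and move each piece over its own connection. A rough Python version, shown here as a parallel HTTP range download (URL and filenames are made up; assumes the server honors Range requests):

    import concurrent.futures
    import requests

    URL = "https://example.org/big_dataset.tar"  # placeholder
    CHUNK = 64 * 1024 * 1024                     # 64 MB per range request

    def fetch_range(byte_range):
        start, end = byte_range
        resp = requests.get(URL, headers={"Range": f"bytes={start}-{end}"}, timeout=300)
        return start, resp.content

    total = int(requests.head(URL).headers["Content-Length"])
    ranges = [(i, min(i + CHUNK, total) - 1) for i in range(0, total, CHUNK)]

    # Fetch up to 10 ranges at a time and stitch them back together on disk.
    with open("big_dataset.tar", "wb") as out, \
            concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
        for start, data in pool.map(fetch_range, ranges):
            out.seek(start)
            out.write(data)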


For GCS, the gsutil program can saturate a 1 Gbit NIC using "gsutil -m cp -R"


The fastest way to upload is to ship hard drives in an airplane.


OK -- Tanenbaum's "station wagon full of tapes" updated for the 21st century.


Tanenbaum always forgot to include the time spent writing and reading the tapes. Typical 10 TB hard drives (which most people use for data interchange instead of tapes) only have ~100 MB/sec of bandwidth (about the same as a 1 Gbit NIC).
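
Back of the envelope: 10 TB at ~100 MB/sec is about 100,000 seconds, or roughly 28 hours, just to fill one drive, and about the same again to read it back out at the destination.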



