See my comment about two-way. It wouldn't matter much as it's rare that both parties are speaking at the same time.
It's also possible that the compression techniques for long-term storage are vastly superior to realtime codecs. The lowest-bitrate realtime voice codecs run at 300-600 bits per second (they sound like shit), which at 300 bps is about 213x compression relative to standard 64 kbps telephone audio (so an hour would be well under a meg).
That comes to 81 TB a day at 600 bps. Again, this is assuming one hour of calls a day for each of 300 million people.
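For anyone who wants to check the arithmetic, here is a rough sketch of it. The 64 kbps baseline and the decimal units (1 KB = 1000 bytes) are assumptions on my part, and the names are made up for illustration.

```python
# Back-of-envelope storage estimate for very low bitrate voice.
# Assumptions: 64 kbps baseline for ordinary telephone audio,
# decimal units (1 KB = 10**3 bytes, 1 TB = 10**12 bytes).
BASELINE_BPS = 64_000
PEOPLE = 300_000_000
SECONDS_PER_HOUR = 3600


def bytes_for(bitrate_bps, seconds):
    """Bytes needed to store `seconds` of audio at `bitrate_bps`."""
    return bitrate_bps * seconds / 8


for bitrate in (300, 600):
    ratio = BASELINE_BPS / bitrate                 # compression vs. baseline
    per_hour = bytes_for(bitrate, SECONDS_PER_HOUR)
    daily_total = per_hour * PEOPLE                # one hour per person per day
    print(f"{bitrate} bps: {ratio:.0f}x, "
          f"{per_hour / 1e3:.0f} KB/hour, "
          f"{daily_total / 1e12:.1f} TB/day")
# 300 bps: 213x, 135 KB/hour, 40.5 TB/day
# 600 bps: 107x, 270 KB/hour, 81.0 TB/day
```

So the 213x figure goes with 300 bps, and the 81 TB figure goes with 600 bps.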
I did a quick search and found this snippet: "A telephia survey said that Americans average 13 talking hours a month – with the 18-24 age group averaging 22 hours."[1]
That averages out to under half an hour a day. So let's assume 300 bps (the lowest realtime voice codec I'm aware of) and half an hour a day, stick with 300M people, and we get:
http://www.wolframalpha.com/input/?i=6+MB+*+300+million
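As a cross-check, here is a rough sketch that uses only the figures stated above (13 hours a month, 300 bps, half an hour a day, 300M people). The 30-day month, the decimal units, and the function name are assumptions for illustration.

```python
# Daily storage estimate from the survey figure and a 300 bps codec.
# Assumptions: 30-day month, decimal units (1 TB = 10**12 bytes).
HOURS_PER_MONTH = 13        # survey average quoted above
BITRATE_BPS = 300           # lowest realtime codec bitrate mentioned
PEOPLE = 300_000_000

# Average talk time per person per day, in minutes.
minutes_per_day = HOURS_PER_MONTH * 60 / 30
print(minutes_per_day)      # 26.0, i.e. under half an hour


def daily_total_tb(bitrate_bps, minutes, people):
    """Total terabytes per day if each person talks `minutes` per day."""
    bytes_per_person = bitrate_bps * minutes * 60 / 8
    return bytes_per_person * people / 1e12


# Rounded up to half an hour a day, as in the estimate above.
print(daily_total_tb(BITRATE_BPS, 30, PEOPLE))   # 20.25
```

At those inputs each person needs about 67.5 KB a day, so the total is on the order of 20 TB a day.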