
I found the script they used for copying data really interesting: https://gist.github.com/peterwj/0614bf6b6fe339a3cbd42eb93dc5...

It's written in Python: it spins up a queue.Queue, populates it with ranges of rows that need to be copied (min < ID < max ranges), starts a bunch of Python threads, and then each of those threads uses os.system() to run this:

    psql "{source_url}" -c "COPY (SELECT * FROM ...) TO STDOUT" \
      | psql "{dest_url}" -c "COPY {table_name} FROM STDIN"
This feels really smart to me. The Python GIL won't be a factor here, since all the heavy lifting happens inside the two psql subprocesses; the threads mostly just sit waiting for os.system() to return.
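
Roughly this shape, as far as I can tell (a minimal sketch of the pattern, not the gist itself; the table name, connection URLs, batch size, and worker count below are made-up placeholders):

    import os
    import queue
    import threading

    # Placeholders -- the real gist parameterizes these differently.
    SOURCE_URL = "postgres://source"
    DEST_URL = "postgres://dest"
    TABLE_NAME = "events"
    NUM_WORKERS = 8
    BATCH_SIZE = 100_000
    MAX_ID = 10_000_000

    # Fill a queue with (min_id, max_id) ranges to copy.
    work = queue.Queue()
    for start in range(0, MAX_ID, BATCH_SIZE):
        work.put((start, start + BATCH_SIZE))

    def worker():
        while True:
            try:
                lo, hi = work.get_nowait()
            except queue.Empty:
                return
            # The copying itself runs in the two psql subprocesses,
            # so the GIL only coordinates handing out ranges.
            os.system(
                f'psql "{SOURCE_URL}" -c "COPY (SELECT * FROM {TABLE_NAME} '
                f'WHERE id >= {lo} AND id < {hi}) TO STDOUT" '
                f'| psql "{DEST_URL}" -c "COPY {TABLE_NAME} FROM STDIN"'
            )

    threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()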



For ETL out of Postgres, it is very hard to beat psql. Something as simple as this will happily saturate all your available network, CPU, and disk write bandwidth. Wrapping it in Python helps you batch it out cleanly.

    psql -c "..." | pigz -c > file.tsv.gz


Thanks Simon! I can indeed confirm that this script managed to saturate the database's hardware capacity (I recall CPU being the bottleneck, and I had to dial down the parallelism to leave some CPU for actual application queries).


Sounds to me like this is exactly what the regular parallel command (GNU parallel) was made for; not sure Python is needed here if the end result is shelling out via os.system() anyway. Something like the sketch below.
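
Untested sketch, assuming the ID ranges can be generated up front; the table name, range step, and the exported SOURCE_URL/DEST_URL variables are placeholders:

    # each job copies one [lo, hi) id slice using the same psql-to-psql pipe
    seq 0 100000 9900000 \
      | parallel -j 8 '
          lo={}; hi=$((lo + 100000))
          psql "$SOURCE_URL" -c "COPY (SELECT * FROM events WHERE id >= $lo AND id < $hi) TO STDOUT" \
            | psql "$DEST_URL" -c "COPY events FROM STDIN"'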



