Is the fastest parallel PDB parser written in Haskell?
2 points by miga on Nov 24, 2013 | 1 comment
The growing collection of bioinformatics libraries now includes BioHaskell: http://biohaskell.org/

It contains a fully parallel parser for Protein Data Bank (PDB) files, which happens to outperform the other benchmarked parsers written in C, Java, Python, and Ruby, including some parallel ones (like BioJava). Article here: http://www.biomedcentral.com/1756-0500/6/483/abstract

Do you know of any other parser that beats these parallel benchmarks?



I'm not sure that there would be much call for that. Surely the time taken to parse the input is a tiny fraction of the time taken to perform analyses with this data?

The one long-running calculation given is parsing the entire (textual) PDB, which takes 14 minutes of CPU time and 50 minutes of I/O time. Firstly, this is I/O bound, so do you gain anything from optimising the code, or even from running in parallel? (Also: that disk is capable of loading 16GB in 64 seconds at peak throughput. What's going on for 50 minutes?)
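To put that parenthetical in numbers, here is a quick back-of-the-envelope check in Python. It uses only the figures quoted above (16GB in 64 seconds, 50 minutes of I/O). The variable names are mine, and GB is assumed to mean 10^9 bytes.

```python
# Back-of-the-envelope check using only the figures quoted in the comment.
GB = 10**9  # assumption: decimal gigabytes

peak_bytes_per_s = 16 * GB / 64   # quoted peak: 16GB in 64 seconds
io_seconds = 50 * 60              # quoted I/O time: 50 minutes

# How much data the disk could deliver in 50 minutes at peak throughput.
readable_at_peak_gb = peak_bytes_per_s * io_seconds / GB

print(f"peak throughput: {peak_bytes_per_s / 10**6:.0f} MB/s")
print(f"readable in 50 min at peak: {readable_at_peak_gb:.0f} GB")
```

If the data set is much smaller than that second figure, the disk was running well below its peak rate for most of those 50 minutes.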

Secondly, the use case seems unrealistic: if this were a bottleneck, wouldn't you do it exactly once, to load your data into something that can be read faster? mmap'd data is ready to use immediately, for example (and there's a Haskell package for this).
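A minimal sketch of that "parse once, then mmap" idea, in Python rather than Haskell and using only the standard library. The function names and the flat list of doubles are assumptions for illustration; a real PDB record holds far more than coordinates, and this is not how BioHaskell stores anything.

```python
import mmap
import os
import tempfile
from array import array


def write_binary(coords, path):
    """One-off conversion: store already-parsed floats as raw doubles."""
    with open(path, "wb") as f:
        array("d", coords).tofile(f)


def read_binary(path):
    """Map the file and view it as doubles without parsing any text."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Both views are released before the mapping is closed.
            with memoryview(mm) as raw, raw.cast("d") as view:
                # Copied out here for simplicity; real code would work
                # on the view in place to avoid the copy.
                return list(view)


# Hypothetical coordinates standing in for the output of a text parser.
demo_path = os.path.join(tempfile.mkdtemp(), "coords.bin")
write_binary([1.0, 2.5, -3.25], demo_path)
loaded = read_binary(demo_path)
print(loaded)
```

The text parse happens once in whatever produces `coords`; every later run only pays for the mapping.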

It seems likely from the benchmarks that most of these parsers are 'fast enough' for their use cases, but the motivation for needing this to be faster is lacking from the paper. I'm sure your code is much cleaner for being in Haskell though, and it sounds like it's been worthwhile to get yourself a better API.

edited to add: so I looked at the code, and I see it does mmap in the data, but from a compressed text file. My next question would be, what is the overhead of compression here? I recall git having problems with zlib being a bottleneck http://marc.info/?l=git&m=117400704304354&w=2
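One rough way to get a feel for the decompression cost is to time zlib on its own. The sketch below uses Python's standard `zlib` module on synthetic data: the ATOM-like line is made up and repeated, so it compresses far better than a real PDB file would. Treat the timing as illustrative only, not as a measurement of the parser in the paper.

```python
import time
import zlib

# Made-up ATOM-like record, repeated. This is a stand-in, not real PDB data.
line = b"ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67\n"
raw = line * 200_000
compressed = zlib.compress(raw)

# Time only the decompression step.
start = time.perf_counter()
restored = zlib.decompress(compressed)
elapsed = time.perf_counter() - start

print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")
print(f"decompression took {elapsed:.4f} s")
```

Comparing that time against the time to parse the same uncompressed bytes would show whether zlib is a meaningful share of the total.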



