
Sometimes genomics researchers do things inefficiently on purpose. You rarely want to check whether two strings are exactly identical; instead, you usually look at how similar the sequences are, taking ambiguities and mismatches into account. In that scenario, storing sequences in a compressed 2-bit encoding isn't significantly faster, if it's faster at all.
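To make the distinction concrete, here's a minimal sketch (my own illustration, not from the comment) of a similarity measure that tolerates mismatches and treats N as a wildcard, which a plain equality check can't do:

```python
def similarity(a, b):
    """Fraction of aligned positions that match; N matches anything."""
    matches = sum(x == y or "N" in (x, y) for x, y in zip(a, b))
    return matches / max(len(a), len(b))

print(similarity("ACGTN", "ACGAA"))  # 0.8: four of five positions match
```

Real pipelines use alignment algorithms rather than position-by-position comparison, but the point stands: the operation is "how close?", not "equal?".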


It can be if you represent them correctly.

I count mismatches between sequences using bitwise operators by representing each string as two binary numbers: one stores a 1 for A/C and a 0 for G/T, the other a 1 for A/G and a 0 for C/T. It was at least an order of magnitude faster; I benchmarked it. (It wasn't just comparing one sequence to another; there were other factors too, such as having a red or green laser in a particular position.)
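A rough sketch of what that looks like, assuming the bitplane encoding described above (A=(1,1), C=(1,0), G=(0,1), T=(0,0)); the details of the commenter's actual implementation are unknown:

```python
# One bit per base in each of two "bitplanes":
# plane 1: 1 for A/C, 0 for G/T; plane 2: 1 for A/G, 0 for C/T.
ENCODE = {"A": (1, 1), "C": (1, 0), "G": (0, 1), "T": (0, 0)}

def to_bitplanes(seq):
    """Pack a DNA string into two integers, one bit per base."""
    p1 = p2 = 0
    for base in seq:
        b1, b2 = ENCODE[base]
        p1 = (p1 << 1) | b1
        p2 = (p2 << 1) | b2
    return p1, p2

def count_mismatches(seq_a, seq_b):
    """Count differing positions via XOR and popcount."""
    a1, a2 = to_bitplanes(seq_a)
    b1, b2 = to_bitplanes(seq_b)
    # A position mismatches if either bitplane differs there.
    diff = (a1 ^ b1) | (a2 ^ b2)
    return bin(diff).count("1")  # int.bit_count() on Python 3.10+

print(count_mismatches("ACGT", "ACGA"))  # 1: only the last base differs
```

The win comes from comparing many bases per machine word with a couple of XORs and a popcount, rather than looping over characters.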


And how do you represent an unknown base in that representation? Or how would you do the comparison with a gap in one of the sequences (which is incredibly common)? I've benchmarked it all too, and it is really difficult to represent DNA with just two bits. Once you move beyond the simplest use cases, it gets significantly harder. The most efficient mechanisms require three bits, or at least some mechanism for denoting where the ambiguous calls are (see UCSC's 2-bit format). And once you're dealing with three bits, the numbers get funky quickly: the smallest byte-aligned chunk is 24 bits, which holds only 8 bases, and you still have the problem of representing gaps efficiently... have you ever tried to insert 4 bits in the middle of another number?
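One common workaround for the unknown-base problem (this is my own sketch, not something either commenter proposed) is to keep the two bitplanes but add a third "known" mask, so N positions are simply excluded from the mismatch count; gaps remain hard either way:

```python
ENCODE = {"A": (1, 1), "C": (1, 0), "G": (0, 1), "T": (0, 0)}

def to_planes(seq):
    """Pack into (plane1, plane2, known); known bit is 0 for N."""
    p1 = p2 = known = 0
    for base in seq:
        p1 <<= 1; p2 <<= 1; known <<= 1
        if base in ENCODE:
            b1, b2 = ENCODE[base]
            p1 |= b1
            p2 |= b2
            known |= 1
        # N (or any ambiguity code) leaves the known bit at 0.
    return p1, p2, known

def mismatches(seq_a, seq_b):
    """Count mismatches, ignoring positions unknown in either sequence."""
    a1, a2, ka = to_planes(seq_a)
    b1, b2, kb = to_planes(seq_b)
    both_known = ka & kb
    diff = ((a1 ^ b1) | (a2 ^ b2)) & both_known
    return bin(diff).count("1")

print(mismatches("ACNT", "ACGT"))  # 0: the N position is skipped
```

This keeps the word-at-a-time speed, but at the cost of a third bit per base, which is exactly the trade-off being described; it still does nothing for insertions or deletions.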


I didn't need to for the problem I was solving.



