
Sometimes genomics researchers do things inefficiently on purpose. You rarely want to check whether two strings are exactly identical; instead, you usually look at how similar the sequences are, taking ambiguities and mismatches into account. In that scenario, storing sequences in a compressed 2-bit encoding isn't significantly faster, if it's faster at all.
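To make the distinction concrete, here's a minimal sketch (my own illustration, not from the comment) of a similarity measure that tolerates mismatches and treats N as a wildcard, which a plain equality check can't do:

```python
def similarity(a, b):
    """Fraction of aligned positions that match; N matches anything."""
    matches = sum(x == y or "N" in (x, y) for x, y in zip(a, b))
    return matches / max(len(a), len(b))

print(similarity("ACGTN", "ACGAA"))  # 0.8: four of five positions match
```

Real pipelines use alignment algorithms rather than position-by-position comparison, but the point stands: the operation is "how close?", not "equal?".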


It can be if you represent them correctly.

I count mismatches between sequences using bitwise operators by representing each string as two binary numbers: one stores a 1 for A/C and a 0 for G/T, the other a 1 for A/G and a 0 for C/T. It was at least an order of magnitude faster; I benchmarked it. (It wasn't just comparing one sequence to another; there were other factors too, such as having a red or green laser in a particular position.)
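A rough sketch of what that looks like, assuming the bitplane encoding described above (A=(1,1), C=(1,0), G=(0,1), T=(0,0)); the details of the commenter's actual implementation are unknown:

```python
# One bit per base in each of two "bitplanes":
# plane 1: 1 for A/C, 0 for G/T; plane 2: 1 for A/G, 0 for C/T.
ENCODE = {"A": (1, 1), "C": (1, 0), "G": (0, 1), "T": (0, 0)}

def to_bitplanes(seq):
    """Pack a DNA string into two integers, one bit per base."""
    p1 = p2 = 0
    for base in seq:
        b1, b2 = ENCODE[base]
        p1 = (p1 << 1) | b1
        p2 = (p2 << 1) | b2
    return p1, p2

def count_mismatches(seq_a, seq_b):
    """Count differing positions via XOR and popcount."""
    a1, a2 = to_bitplanes(seq_a)
    b1, b2 = to_bitplanes(seq_b)
    # A position mismatches if either bitplane differs there.
    diff = (a1 ^ b1) | (a2 ^ b2)
    return bin(diff).count("1")  # int.bit_count() on Python 3.10+

print(count_mismatches("ACGT", "ACGA"))  # 1: only the last base differs
```

The win comes from comparing many bases per machine word with a couple of XORs and a popcount, rather than looping over characters.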


And how do you represent an unknown base in that representation? Or how would you do the comparison with a gap in one of the sequences (which is incredibly common)? I've benchmarked it all too, and it is really difficult to represent DNA with just two bits. Once you move beyond the simplest use cases, it gets significantly harder. The most efficient mechanisms require three bits, or at least some mechanism for denoting where the ambiguous calls are (see UCSC's 2-bit format). And once you're dealing with three bits, the numbers get funky quickly: the smallest byte-aligned chunk is 24 bits, which holds only 8 bases, and you still have the problem of representing gaps efficiently... have you ever tried to insert 4 bits in the middle of another number?
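One common workaround for the unknown-base problem (this is my own sketch, not something either commenter proposed) is to keep the two bitplanes but add a third "known" mask, so N positions are simply excluded from the mismatch count; gaps remain hard either way:

```python
ENCODE = {"A": (1, 1), "C": (1, 0), "G": (0, 1), "T": (0, 0)}

def to_planes(seq):
    """Pack into (plane1, plane2, known); known bit is 0 for N."""
    p1 = p2 = known = 0
    for base in seq:
        p1 <<= 1; p2 <<= 1; known <<= 1
        if base in ENCODE:
            b1, b2 = ENCODE[base]
            p1 |= b1
            p2 |= b2
            known |= 1
        # N (or any ambiguity code) leaves the known bit at 0.
    return p1, p2, known

def mismatches(seq_a, seq_b):
    """Count mismatches, ignoring positions unknown in either sequence."""
    a1, a2, ka = to_planes(seq_a)
    b1, b2, kb = to_planes(seq_b)
    both_known = ka & kb
    diff = ((a1 ^ b1) | (a2 ^ b2)) & both_known
    return bin(diff).count("1")

print(mismatches("ACNT", "ACGT"))  # 0: the N position is skipped
```

This keeps the word-at-a-time speed, but at the cost of a third bit per base, which is exactly the trade-off being described; it still does nothing for insertions or deletions.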


I didn't need to for the problem I was solving.



