Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

The thing that strikes me is that while the data being searched for has been optimized down to 2 bytes of bitfields, the same hasn't been done with the phone numbers - area codes in India appear to be variable-length (2, 3, ? digits) with that length being part of the 10 digits, and presumably mobile numbers are handled similarly either with their own area codes or overlapping. Breaking the data by area code would allow the elimination of significant chunks of the total possible address space, and further separating (as they do) into dense and sparse would further reduce the requirements (since tries, skip-lists, etc. have their own overhead).

It comes down to this: while the Indian phone system could in theory allow 10 billion phone numbers, there are not in fact 10 billion legally-allocatable phone numbers and the rules in place likely effectively reduce the legally-allocatable number to somewhere between 1-3 billion numbers. Dividing based on those rules may both reduce the problem to something requiring less engineering time and simiplify possible marketing-related factors (e.g. selling access/systems targeted only to specific area codes).



And after a bit more thinking while doing something mindless, it seems that other interesting areas would be the distribution of digits within each of the first 4-5 positions and perhaps looking at the bit patterns of either run-length or Huffman encoding of the first 4-5 digits.

And separately, depending on the hit rate and particularly in the sparse sections, there might be situations where it would make sense to have a preliminary lookup that simply indicated whether there were any possible matches within a prefix range - before searching, get an overview of whether there's anything to search.

Overall this strikes me as something that could demonstrate the importance of having developers aware of the environment in which something will be used. If there are going to be 2 of something, throw hardware at it. If there are going to be 2,000 of "something" instead, an extra $1000 each in "throw hardware at it" could become a real issue.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: