The number of vectors is determined (other than by the original dataset, ofc) by how you choose to chunk the data you have available. Bigger chunks work better for search (empirically) and they also keep the number of vectors down. For OpenAI, based on prevalent norms and their cookbooks, 1M vectors likely means around 1M (more like 700K) PDF pages of text (at a chunk size of 1000 tokens per embedding). That is a lot of textual data for a decent-sized company. Enterprises might reach that stage. Consulting firms definitely would, though they have already trained and announced their own models.
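As a rough sketch of that arithmetic (the tokens-per-page figure is an assumption, dense PDF pages vary a lot, roughly 500-1500 tokens, so plug in your own numbers):

```python
def estimate_vectors(num_pages: int, tokens_per_page: int, chunk_tokens: int = 1000) -> int:
    """Rough count of embedding vectors for a corpus, ignoring chunk overlap."""
    total_tokens = num_pages * tokens_per_page
    return -(-total_tokens // chunk_tokens)  # ceiling division

if __name__ == "__main__":
    # ~700K dense pages at an assumed ~1400 tokens/page with 1000-token chunks
    # lands near 1M vectors
    print(estimate_vectors(num_pages=700_000, tokens_per_page=1400))  # 980000
    # sparser pages (~700 tokens each) roughly halve the vector count
    print(estimate_vectors(num_pages=700_000, tokens_per_page=700))   # 490000
```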
700K PDF pages is not a lot. Also, you might be a business serving other businesses and indexing their documents; at any reasonable scale (again, not Google scale), 700K pages is again not a lot.
Another way to look at 700K pages: it is roughly 2,333 books of 300 pages each.