Embedding text into vector spaces
-
Word embedding is an NLP technique that maps words or text to vectors
-
Allows vector space operations, such as summing vectors or computing distances between them
-
Once individual word vectors are generated, they can be combined into text vectors (aka document or sentence vectors)
- A simple approach sums or averages the word vectors together (see the sketch below)
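A minimal sketch of the averaging approach. The `word_vectors` dict here is a hypothetical stand-in; in practice the vectors would come from a pre-trained model such as word2vec, GloVe, or fastText.

```python
import numpy as np

# Hypothetical word vectors for illustration only; real ones come
# from a pre-trained embedding model.
word_vectors = {
    "vector": np.array([0.2, 0.5, -0.1]),
    "space": np.array([0.4, -0.3, 0.8]),
    "search": np.array([-0.1, 0.6, 0.2]),
}

def text_vector(text):
    """Combine word vectors into one text vector by averaging."""
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0)

print(text_vector("vector space search"))
```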
-
The similarity of two snippets of text is measured by mapping both into the vector space and computing the distance between the resulting vectors
- Typically the angular distance is used (see the sketch below)
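A small sketch of angular distance, here normalized to [0, 1] so that 0 means the vectors point the same way and 1 means opposite directions (the normalization choice is an assumption, not from the source):

```python
import numpy as np

def angular_distance(u, v):
    """Angle between two vectors, scaled to [0, 1]."""
    cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip to guard against floating-point values just outside [-1, 1].
    return float(np.arccos(np.clip(cos_sim, -1.0, 1.0)) / np.pi)

u = np.array([0.2, 0.5, -0.1])
v = np.array([0.4, -0.3, 0.8])
print(angular_distance(u, v))
```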
-
Nearest neighbor search can be used to find the most similar texts
- Exact methods typically break down on the high-dimensional vectors of word embeddings
- Approximate nearest neighbor (ANN) search must be used instead (baseline sketch below)
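For contrast, here is the exact brute-force baseline: a full linear scan, O(n·d) per query. It works on a toy corpus but is exactly what stops scaling at billions of vectors, which is why ANN is needed. The corpus here is random data for illustration.

```python
import numpy as np

def exact_nearest(query, corpus):
    """Exact nearest neighbor by cosine similarity: scans every row."""
    corpus_norm = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    # Dot product of unit vectors = cosine similarity; argmax = nearest.
    return int(np.argmax(corpus_norm @ query_norm))

corpus = np.random.rand(10_000, 200)  # toy size; real corpus is billions
print(exact_nearest(np.random.rand(200), corpus))
```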
-
Assuming 4 billion 200-dimensional query vectors
- Storing each dimension as a 4-byte float takes 4 × 10⁹ × 200 × 4 bytes ≈ 3.2 TB (~3 TB)
- Quantization/discretization was tried to shrink the footprint (see the sketch below)
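The arithmetic, plus one possible quantization scheme. The notes don't say which scheme was actually tried; simple linear quantization to one signed byte per dimension (a 4x reduction) is shown as an example only.

```python
import numpy as np

n_vectors, dims = 4_000_000_000, 200

# float32 storage: 4 bytes per dimension
float_bytes = n_vectors * dims * 4
print(float_bytes / 1e12, "TB")  # -> 3.2 TB

def quantize(v):
    """Lossy linear quantization of float32 values to one byte each."""
    scale = np.abs(v).max() / 127.0
    return np.round(v / scale).astype(np.int8), scale

v = np.random.randn(dims).astype(np.float32)
q, scale = quantize(v)
print(q.nbytes, "bytes instead of", v.nbytes)  # 200 vs. 800 per vector
```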
-
Cliqz uses graph-based approximate nearest neighbor search (granne); sketch of the core idea below
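granne itself is a Rust library and its actual API is not shown in these notes; the snippet below is only a minimal illustration of the greedy-search idea common to graph-based ANN methods: walk a proximity graph, always stepping to whichever neighbor is closest to the query, and stop at a local minimum. The toy graph and Euclidean distance are assumptions for the example.

```python
import numpy as np

def dist(u, v):
    """Euclidean distance for the toy example (granne supports angular)."""
    return np.linalg.norm(u - v)

def greedy_graph_search(graph, vectors, query, entry_point):
    """Greedily walk the proximity graph toward the query."""
    current = entry_point
    while True:
        best = min(graph[current], key=lambda n: dist(vectors[n], query))
        if dist(vectors[best], query) < dist(vectors[current], query):
            current = best   # a neighbor is closer; keep walking
        else:
            return current   # local minimum: the approximate nearest neighbor

# Toy proximity graph over 5 one-dimensional points.
vectors = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
graph = {0: [1, 2], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 2]}
print(greedy_graph_search(graph, vectors, np.array([3.4]), entry_point=0))  # -> 3
```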