If you haven't read our previous post on data maps, we recommend starting there for some useful background!
Embeddings, the data representations used within AI models, are a way for computers to quantitatively capture the semantics of data. We're pretty passionate about this at Nomic; we view embeddings as a revolution in computing as important as the invention of the computer itself. For centuries, people have organized things by putting them in lines, in grids, or in trees; arranging data as points in a 768-dimensional space seems elaborate and strange by comparison, so we're going to try to explain why it's useful.
While Retrieval Augmented Generation (RAG) has brought embeddings into the spotlight, their applications extend far beyond document retrieval for LLMs. The ability to represent semantic information as vectors in high-dimensional space enables capabilities like semantic search, clustering, classification, and recommendation across numerous domains. We think of embeddings as the semantic layer of an application.
Embedding models arrange data based on semantic patterns and relationships learned from their training data. Word2vec was an early success: it showed that neural networks could capture analogies like "man is to king as woman is to queen".
The diagram that launched a thousand ML blog posts
This breakthrough demonstrated that given sufficient parameters and training data, relatively simple neural networks could naturally discover and encode abstract semantic relationships and analogies. The models achieved this without extensive human annotation or explicit instructions – they simply learned these language patterns through exposure to large amounts of text.
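To make this concrete, here's a quick sketch that reproduces the classic analogy using the gensim library's pretrained Google News word2vec vectors. This is just an illustration, not part of any Nomic tooling; it assumes you have gensim installed and can fetch the (large) pretrained vectors:

```python
import gensim.downloader as api

# Load pretrained word2vec vectors (a sizable one-time download)
model = api.load("word2vec-google-news-300")

# "man is to king as woman is to ?" via vector arithmetic:
# vec("king") - vec("man") + vec("woman") lands nearest to vec("queen")
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# [('queen', ~0.71)]
```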
But the image of parallel vectors encoding analogies isn't all that relevant to how embeddings actually support downstream use cases. For embeddings to be useful, they don't need to strictly preserve this parallel-vectors-for-analogical-concepts structure: they just need to, in general, map qualitatively similar data points to quantitatively similar points in vector space.
Modern embedding models like Nomic Embed are trained using contrastive learning, where the model learns to map similar data points to similar representations in vector space, and dissimilar data points to dissimilar representations. Contrastive learning is a pretty amazing general learning paradigm, and may even be a principle in biological learning.
For example, during training the model might see pairs of related texts such as a Wikipedia article paired with its summary, a question paired with a passage that answers it, or a paper's title paired with its abstract.
The model learns to map each pair close together in vector space, while ensuring unrelated pairs (like a Wikipedia article and a summary of some other, unrelated article) end up far apart. Through millions of these comparisons, the model develops a rich understanding of semantic relationships without needing explicit labels or rules.
This training process is elegant because it doesn't require manually labeling data points with tags to encode all the different things they mean - the model uncovers meaning naturally through exposure to millions of comparisons.
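For the curious, here's a minimal sketch of what a contrastive (InfoNCE-style) objective looks like in PyTorch. This is an illustrative simplification, not Nomic Embed's actual training code: it assumes a batch of paired query/document embeddings and treats every other document in the batch as a negative:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """Contrastive loss over a batch of (query, document) embedding pairs.

    Row i of query_emb is assumed to pair with row i of doc_emb;
    every other row in the batch serves as an in-batch negative.
    """
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    # Cosine similarity of every query against every document
    logits = q @ d.T / temperature
    # The correct "class" for query i is document i
    labels = torch.arange(q.shape[0], device=q.device)
    return F.cross_entropy(logits, labels)

# Toy usage: 8 pairs of 768-dimensional embeddings
loss = info_nce_loss(torch.randn(8, 768), torch.randn(8, 768))
```

The temperature here is a tunable hyperparameter that sharpens or softens the similarity distribution before the cross-entropy step.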
The image of parallel directions encoding analogical concepts in vector space is fine for understanding a single example. But for rich exploration of large embedding datasets, we think interactive 2D scatterplots are the most effective approach:
Embeddings are remarkably efficient: once data is embedded into vectors, comparing them becomes extremely fast. Operations like cosine similarity (a high-dimensional analog of the familiar angle between two 2D vectors) and Hamming distance (a fast bitwise comparison for binary embeddings) can run in parallel across massive datasets. This means you can search through millions of embedded items in seconds. Additionally, systems often only need to compute the embedding vector once; these vectors can then be cached and reused for all future comparisons and searches. This combination of fast vector operations and one-time embedding computation makes embeddings ideal for applications that need high throughput and low latency for search, retrieval, and nearest-neighbor calculations.
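As an illustration, here's a minimal brute-force nearest-neighbor search over cached embeddings in NumPy. Production systems typically layer an approximate index on top, but the core operation really is just one matrix-vector product (the array sizes below are arbitrary stand-ins):

```python
import numpy as np

def cosine_search(query: np.ndarray, corpus: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k corpus rows most similar to query."""
    # Normalize once; cosine similarity then reduces to a dot product
    query = query / np.linalg.norm(query)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = corpus @ query            # one vectorized pass over all rows
    return np.argsort(-scores)[:k]     # indices of the top-k matches

# Toy usage with random stand-ins for cached embeddings
rng = np.random.default_rng(0)
corpus = rng.normal(size=(100_000, 768)).astype(np.float32)
query = rng.normal(size=768).astype(np.float32)
print(cosine_search(query, corpus))
```

In practice the normalized corpus matrix would be computed and cached once, which is exactly the one-time-embedding economy described above.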
One of the most exciting applications of embeddings is their ability to work with different types of data - like text and images - in a unified way. These advances are helping businesses unlock more value from their multimedia data assets through better organization, search, and analysis capabilities. The technology has evolved rapidly: OpenAI's CLIP showed in 2021 how to connect images and text in a shared space, then LiT by Google Brain improved the approach in 2022 by training a better text embedder while keeping the image embedding model fixed. Most recently, our models Nomic Embed Text and Nomic Embed Vision took the inverted approach - using our text embedding model to train an image embedding model against the semantic representations learned by the text embedder.
Modified diagram from the OpenAI CLIP paper
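Here's a sketch of embedding text and an image into the same space with the nomic Python client. It assumes you've installed the client, authenticated with `nomic login`, and have a local file; `dog.jpg` is hypothetical, and exact parameters may vary across client versions:

```python
import numpy as np
from nomic import embed

# Embed a text query into the shared text/image space
text_out = embed.text(
    texts=["a golden retriever playing in the snow"],
    model="nomic-embed-text-v1.5",
    task_type="search_query",
)

# Embed an image into the same space ("dog.jpg" is a hypothetical local file)
image_out = embed.image(
    images=["dog.jpg"],
    model="nomic-embed-vision-v1.5",
)

t = np.array(text_out["embeddings"][0])
i = np.array(image_out["embeddings"][0])
# Because both models share one space, text-to-image cosine similarity is meaningful
print(t @ i / (np.linalg.norm(t) * np.linalg.norm(i)))
```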
We are building our own embedding model series, Nomic Embed, with open-source data and training code. It is integrated into everything we make at Nomic and serves as the embedding space underpinning Atlas's core data mapping capabilities: the embeddings determine the layout of the map visualization, drive Atlas's semantic search, and enable the creation of custom semantic classifiers.
Nomic Embed models are used by hundreds of thousands of developers building their own AI tools. Both the text and image embedders are available on Hugging Face, where they have become some of the most downloaded embedding models thanks to their balance of strong performance, fast inference, and low memory usage.
Transparency and the open-source movement have also played a role in Nomic Embed's widespread adoption. By publishing both the model weights and the training code, we've made it easier for users to understand, trust, and adapt the model to their specific needs. This openness encourages long-term adoption: users can rely on the model weights directly in their projects without worrying about sudden changes or loss of support.
As we continue to develop and enhance Nomic Embed, we are focusing on expanding its capabilities even further. We’re adding support for more languages to make Nomic Embed truly global and versatile. And beyond text and images, we’re preparing to support audio, video, and customizability so that users can naturally extend our embedding space to their own user-defined modalities.
Atlas serves as a practical workspace for anyone, with or without ML experience, to visualize, explore, and interact with their data on an embeddings-powered data map. When you upload data to Atlas -- from Python or in the browser -- we automatically run Nomic Embed on your data as part of building your data map; you can then download the embeddings we compute and re-use them however you want.
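As a sketch, uploading from Python might look like the following, using the nomic client's `atlas.map_data` entry point. The records here are made up, and the exact signature may differ across client versions:

```python
from nomic import atlas

# Hypothetical records; Atlas runs Nomic Embed on the indexed field
records = [
    {"text": "Embeddings map data to points in high-dimensional space."},
    {"text": "Data maps make large text datasets visually explorable."},
]

dataset = atlas.map_data(data=records, indexed_field="text")
print(dataset.maps[0])  # link to the interactive map
```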
As a fundamental layer of Atlas, embeddings give anyone a powerful way to get value and understanding out of large datasets. Whether you're a data professional or just someone curious about data you have on hand, using embeddings in Atlas can help you see your data more clearly and explore it more deeply.