
Author: Nomic Team

Nomic Atlas Data Mapping Series



This is our Data Mapping Series, where we do a deep dive into the technologies and tools powering the unstructured data visualization and exploration capabilities of the Nomic Atlas platform. In this series you'll learn about how machine learning concepts like embeddings and dimensionality reduction blend with scalable web graphics to enable anyone to explore and work with massive datasets in their web browsers.

If you haven't read our previous post on data maps, we recommend starting there for some useful background!



How We Think About Embeddings At Nomic

Embeddings, the data representations used within AI models, are a way for computers to quantitatively capture the semantics of data. We're passionate about this at Nomic: we view embeddings as a revolution in computing as important as the invention of the computer itself. For centuries, people have organized things by putting them in lines, grids, or trees; arranging data as points in a 768-dimensional space seems elaborate and strange by comparison, so we're going to try to explain why it's useful.

Embeddings Are For So Much More Than RAG

While Retrieval Augmented Generation (RAG) has brought embeddings into the spotlight, their applications extend far beyond document retrieval for LLMs. The ability to represent semantic information as vectors in high-dimensional space enables a lot of different capabilities across numerous domains. We think of embeddings as being like the semantic layer of an application.

Diagram: data of many types (text, code, images, audio) flows into a shared embedding space of vector representations, which powers use cases such as RAG, semantic search, clustering, classification, deduplication, anomaly detection, and topic modeling.

The Usual Pictures Of Embeddings

Embedding models arrange data based on semantic patterns and relationships learned from training datasets. Word2vec was an early successful example of this working: it showed that neural networks could capture analogies like "man is to king as woman is to queen".

Visualizing word2vec embeddings with arrow directions

The diagram that launched a thousand ML blog posts

This breakthrough demonstrated that given sufficient parameters and training data, relatively simple neural networks could naturally discover and encode abstract semantic relationships and analogies. The models achieved this without extensive human annotation or explicit instructions – they simply learned these language patterns through exposure to large amounts of text.
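As a toy illustration of that analogy arithmetic (with small hypothetical 3D vectors, not real word2vec embeddings), "man is to king as woman is to ?" can be solved by vector addition and a nearest-neighbor lookup:

```python
import numpy as np

# Hypothetical 3D "word vectors" (illustrative values, not from a real
# word2vec model), constructed so the gender offset is roughly parallel.
vectors = {
    "man":   np.array([0.9, 0.1, 0.0]),
    "woman": np.array([0.9, 0.1, 1.0]),
    "king":  np.array([0.1, 0.9, 0.0]),
    "queen": np.array([0.1, 0.9, 1.0]),
    "apple": np.array([0.5, 0.5, 0.5]),
}

def analogy(a, b, c):
    """Solve 'a is to b as c is to ?' via vector arithmetic: b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]

    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    # Return the nearest remaining word by cosine similarity.
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cos(candidates[w], target))

print(analogy("man", "king", "woman"))  # "queen" in this toy setup
```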

But the image of parallel vectors encoding analogies isn't all that relevant to how embeddings actually support downstream use-cases. For embeddings to be useful, they don't need to maintain this kind of parallel-vectors-for-analogical-concepts strictly: they just need to, in general, map qualitatively similar points of data to quantitatively similar points in vector space.

Embeddings Contrast Data In Vector Space

Modern embedding models like Nomic Embed are trained using contrastive learning: the model learns to map similar data points to similar representations in vector space, and dissimilar data points to dissimilar representations. Contrastive learning is a remarkably general learning paradigm, and may even be a principle in biological learning.

Diagram: the embedding vectors for "oranges are delicious" and "apples are tasty" point in similar directions because they share semantic content about fruit and positive sentiment, while the vector for "cars are fast" points in a different direction since it represents an unrelated concept.
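This contrast is easy to demonstrate with a toy cosine-similarity computation; the 2D vectors below are illustrative stand-ins (not real Nomic Embed outputs) for the full 768-dimensional embeddings:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up 2D stand-ins for real 768-dimensional sentence embeddings
emb = {
    "oranges are delicious": np.array([0.2, 0.9]),
    "apples are tasty":      np.array([0.3, 0.8]),
    "cars are fast":         np.array([0.9, -0.2]),
}

fruit = cosine_similarity(emb["oranges are delicious"], emb["apples are tasty"])
cross = cosine_similarity(emb["apples are tasty"], emb["cars are fast"])
print(fruit > cross)  # True: the two fruit sentences are closer together
```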

For example, during training the model might see pairs of related texts, such as a Wikipedia article paired with its summary.

The model learns to map each pair close together in vector space, while ensuring unrelated pairs (like a Wikipedia article and a summary of some other, unrelated article) end up far apart. Through millions of these comparisons, the model develops a rich understanding of semantic relationships without needing explicit labels or rules.

This training process is elegant because it doesn't require manually labeling data points with tags to encode all the different things they mean - the model uncovers meaning naturally through exposure to millions of comparisons.
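A minimal sketch of this contrastive objective (an InfoNCE-style loss over a batch, using random NumPy vectors rather than a real model) shows how aligned pairs are rewarded: each query's positive document sits on the diagonal of a similarity matrix, and everything else in the batch acts as a negative.

```python
import numpy as np

def info_nce_loss(queries, documents, temperature=0.05):
    """InfoNCE-style contrastive loss: the positive document for each query
    is the one at the same batch index; all others are negatives."""
    # L2-normalize so dot products are cosine similarities
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = documents / np.linalg.norm(documents, axis=1, keepdims=True)
    logits = q @ d.T / temperature            # (batch, batch) similarities
    # Softmax cross-entropy with the diagonal as the correct "class"
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return float(-np.log(np.diag(probs)).mean())

rng = np.random.default_rng(0)
docs = rng.normal(size=(8, 16))
paired = docs + 0.01 * rng.normal(size=(8, 16))   # near-duplicates: related pairs
loss_related = info_nce_loss(paired, docs)
loss_random = info_nce_loss(rng.normal(size=(8, 16)), docs)
print(loss_related < loss_random)  # True: aligned pairs give a lower loss
```

Training pushes the model's parameters to make the related-pairs case the norm, so semantically similar texts end up close in vector space.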

Visualizing Embeddings As Neighborhoods

The image of parallel directions encoding analogical concepts in vector space is fine for understanding a single example. But for rich exploration of large embedding datasets, we think interactive 2D scatterplots are the most effective approach:

Plotting embeddings as points in a 2D scatterplot gives a better picture of the broad patterns contained in a dataset's embeddings.

Color-coding by sub-topic and zooming in reveals niche sub-neighborhoods within the high-level neighborhoods, visually surfacing the finer-grained aspects of the data that embeddings capture.

Exploring the different neighborhoods of your dataset like this makes it easy to see the structure of your data that embeddings have captured!

Efficiency And Scalability Of Embeddings

Embeddings are remarkably efficient: once data is embedded into vectors, comparing them becomes extremely fast. Operations like cosine similarity (a high-dimensional analog of the familiar notion of an angle between two directions) and Hamming distance can run in parallel across massive datasets. This means you can search through millions of embedded items in seconds. Additionally, systems often only need to compute the embedding vector once; these vectors can then be cached and reused for all future comparisons and searches. This combination of fast vector operations and one-time embedding computation makes embeddings ideal for applications that need high throughput and low latency for search, retrieval, and nearest-neighbor calculations.
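As a rough sketch of this pattern (with random vectors standing in for real embeddings): normalize a corpus of cached embeddings once, then answer each query with a single matrix-vector product over the whole corpus.

```python
import numpy as np

rng = np.random.default_rng(42)

# Pretend these are precomputed embeddings for a large corpus:
# computed once, normalized once, then cached and reused for every search.
corpus = rng.normal(size=(100_000, 64)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

def search(query, corpus, k=5):
    """Top-k nearest neighbors by cosine similarity: one matrix-vector
    product over the entire pre-normalized corpus."""
    q = query / np.linalg.norm(query)
    scores = corpus @ q                        # one pass over all rows
    top = np.argpartition(-scores, k)[:k]      # unordered top-k candidates
    return top[np.argsort(-scores[top])]       # sorted best-first

query = corpus[123] + 0.01 * rng.normal(size=64).astype(np.float32)
print(search(query, corpus)[0])  # row 123 ranks first
```

Because the corpus matrix is built once and only the cheap matrix product runs per query, the same cached embeddings serve every future search.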

Multimodality

One of the most exciting applications of embeddings is their ability to work with different types of data - like text and images - in a unified way. These advances are helping businesses unlock more value from their multimedia data assets through better organization, search, and analysis capabilities. The technology has evolved rapidly: CLIP by OpenAI showed in 2021 how to connect images and text in a shared space, then LiT by Google Brain improved the approach in 2021 by training a better text embedder while keeping the image embedding model locked. Most recently, our models Nomic Embed Text and Nomic Embed Vision took the inverted approach - using our text embedding model to train an image embedding model aligned to the semantic representations learned by the text embedder.

Training Nomic Embed Vision

Modified diagram from the OpenAI CLIP paper

Nomic Embed: The Embedding Space For Working With Data

We are building our own embedding model series, Nomic Embed, with open-source data and training code. It is integrated into everything we make at Nomic, and serves as the embedding space underpinning Atlas's core data mapping capabilities: the embeddings determine the layout of the map visualization, drive Atlas's semantic search functionality, and enable the creation of custom semantic classifiers.

Why Nomic Embed Has Become A Popular Choice

Nomic Embed models are used by hundreds of thousands of developers building their own AI tools. Both the text embedder and image embedder are available on Hugging Face, where they have become some of the most downloaded embedding models thanks to their balance of strong performance, fast inference speed, and low memory usage.

Transparency and the open-source movement have also played a role in Nomic Embed's widespread adoption. By publishing both the model weights and training code, we've made it easier for users to understand, trust, and adapt the model to their specific needs. This openness has encouraged long-term adoption, as users can use the model weights directly in their projects without concerns about sudden changes or lack of support.

The Future Of Nomic Embed

As we continue to develop and enhance Nomic Embed, we are focusing on expanding its capabilities even further. We’re adding support for more languages to make Nomic Embed truly global and versatile. And beyond text and images, we’re preparing to support audio, video, and customizability so that users can naturally extend our embedding space to their own user-defined modalities.

Atlas: The Workspace For Embeddings

Atlas serves as a practical workspace for anyone, with or without ML experience, to visualize, explore, and interact with their data on an embeddings-powered data map. When you upload data to Atlas -- from Python or in the browser -- we automatically run Nomic Embed on your data as part of building your data map; you can then download the embeddings we compute and re-use them however you want.

As a fundamental layer of Atlas, embeddings provide a powerful way for anyone to get more value and understanding from large datasets. Whether you're a data professional or just someone curious about understanding some data you have on hand, using embeddings in Atlas can help you see your data more clearly and explore it more deeply.

Data Mapping Series
Part 1: Data Maps
Part 3: Dimensionality Reduction
“Henceforth, it is the map that precedes the territory” – Jean Baudrillard