
Nomic builds AI models and conducts open research in embeddings, language models, and dimensionality reduction.
At Nomic, we believe that AI research should be open, transparent, and accessible to everyone. It should also be useful and solve real-world problems. Our research efforts and direction are driven by the hard problems we discover while working with our customers.
Research & Publications
Embeddings
Training Sparse MoE Text Embedding Models
Under Review
Introduces the first general-purpose mixture-of-experts text embedding model, which achieves state-of-the-art performance on the MIRACL benchmark. The model is truly open source: the training data, weights, and code are all available and permissively licensed.
CoRNStack: High-Quality Contrastive Data for Better Code Ranking
ICLR 2025
An open dataset for training state-of-the-art code embedding models. Work done in collaboration with the University of Illinois at Urbana-Champaign.
Nomic Embed: Training a Reproducible Long Context Text Embedder
TMLR 2024
The first truly open (i.e., with open data, weights, and code) text embedding model to outperform OpenAI Ada. Work done in collaboration with Cornell University.
100+ citations
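For readers who want to try the model, a minimal usage sketch follows. The checkpoint name, trust_remote_code flag, and the search_document/search_query task prefixes follow the model's Hugging Face card rather than this page, and details may differ from the paper's setup:

```python
# Minimal sketch: embedding documents and a query with nomic-embed-text-v1.
# Assumes the sentence-transformers package and the Hugging Face checkpoint
# "nomic-ai/nomic-embed-text-v1"; the task prefixes below follow the model
# card's convention.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

docs = [
    "search_document: Embeddings map text to vectors.",
    "search_document: UMAP projects vectors to a 2D map.",
]
query = ["search_query: What do embeddings do?"]

doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)

# Cosine similarity reduces to a dot product on normalized vectors.
print(doc_vecs @ query_vec.T)
```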
Nomic Embed Vision: Expanding the Latent Space
ArXiv 2024
The first multimodal embedding model to achieve high performance on text-text, text-image, and image-image tasks with a single unified latent space.
Embedding Based Inference on Generative Models
ArXiv 2024
An extension of data kernel methods to black-box settings. Work done in collaboration with Johns Hopkins University.
Language Models
Tracking the Perspectives of Interacting Language Models
EMNLP 2024
Develops and studies metrics for understanding information diffusion in communication networks of LLMs. Work done in collaboration with Johns Hopkins University.
GPT4All: An Ecosystem of Open Source Compressed Language Models
EMNLP 2023
How the first open-source LLM to surpass GPT-3.5's performance grew from a model into a movement. Work done in collaboration with the GPT4All community.
150+ citations
Comparing Foundation Models using Data Kernels
ArXiv 2023
A method for statistically rigorous comparison of embedding spaces without labeled data. Work done in collaboration with Johns Hopkins University.
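Schematically, the data kernel idea is to embed one shared set of inputs with two models and compare the similarity (Gram) matrices each model induces. Below is a minimal NumPy sketch using linear CKA as a stand-in comparison statistic; the papers' actual test statistics and significance machinery differ:

```python
# Minimal sketch of the data-kernel idea: compare two embedding models by
# comparing the Gram matrices they induce on the same inputs. Linear CKA
# serves here as a stand-in similarity measure between the two geometries.
import numpy as np

def gram(X: np.ndarray) -> np.ndarray:
    """Centered linear Gram matrix of an (n_points, dim) embedding matrix."""
    X = X - X.mean(axis=0)
    return X @ X.T

def cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two embedding matrices over the same n points."""
    K, L = gram(X), gram(Y)
    return (K * L).sum() / (np.linalg.norm(K) * np.linalg.norm(L))

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(100, 384))          # stand-in for model A's embeddings
emb_b = emb_a @ rng.normal(size=(384, 768))  # model B: a linear map of A
print(cka(emb_a, emb_b))  # near 1: the two spaces carry similar geometry
```

No labels appear anywhere in the comparison, which is the point: the statistic depends only on how each model arranges the same inputs relative to one another.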
Dimensionality Reduction
The Landscape of Biomedical Research
Cell Patterns Cover 2024
The first systematic study of the entirety of PubMed from an information cartography perspective. Work done in collaboration with the University of Tübingen.
Cover Story
Mapping Wikipedia with BERT and UMAP
IEEE Vis 2022
The first systematic study of the entirety of English Wikipedia from an information cartography perspective. Work done in collaboration with New York University.
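The basic map-making recipe behind both cartography studies is to embed every document and project the high-dimensional embeddings down to two dimensions. A minimal sketch with the umap-learn package follows, using random vectors as stand-ins for document embeddings; the actual pipelines (data cleaning, labeling, interactive rendering) are far more involved:

```python
# Minimal sketch of information cartography: project high-dimensional
# document embeddings to 2D map coordinates with UMAP.
import numpy as np
import umap

# Stand-in for document embeddings (e.g., from a BERT-style encoder).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 384))

reducer = umap.UMAP(n_components=2, metric="cosine", random_state=0)
coords = reducer.fit_transform(embeddings)
print(coords.shape)  # (200, 2): one (x, y) map position per document
```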