Use ESM-C Online

Commercially Available ESM-C No-Code Web Server

ESM-C: A State-of-the-Art Language Model for the Evolutionary Scale

ESM-C (ESM Cambrian) is a state-of-the-art protein language model family trained on approximately 2.8 billion sequences drawn from across all of life. Developed by Biohub, ESM-C represents the fourth generation of Evolutionary Scale Modeling (ESM). For the first time, this complete model family is being released as a fully open-source scientific engine under the MIT license.

By applying language modeling to proteins at the true scale of evolution, ESM-C internalizes the fundamental properties that govern protein biology—learning how they fold, interact, and function across the living world.

Key Innovations: Predictable Scaling & Latent Intelligence

ESM-C takes evolutionary sequence modeling to the next level by transitioning from millions to billions of training sequences.

  • Linear Scaling Returns: Built on a scaling law that ties training compute to biological accuracy, so accuracy improves linearly with scale, setting a new state of the art in protein representations.

  • Context-Driven Learning: Trained with no prior biological knowledge or explicit structural supervision, using a simple unsupervised objective: predicting randomly masked amino acids from sequence context alone.

  • Compositional Reduction of Biology: Learns internal representation spaces that automatically encode three-dimensional structures, evolutionary lineages, and abstract functional concepts.

  • Unbiased Biological Lens: Provides a mathematical substrate capable of connecting distantly related proteins and bridging the gap between known and completely unstudied "unknown" biology.

  • Open Foundation Backbone: Functions as the core representational space powering downstream discovery architectures, including the structure-prediction engine ESMFold2 and the 6.8-billion-sequence ESM Atlas.
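
The masked-residue objective described above can be illustrated with a minimal sketch. This is not ESM-C's training code: the `<mask>` token string, the 15% masking rate, and the `mask_sequence` helper are all assumptions chosen for illustration. It shows only how residues are hidden and which targets a model would be asked to recover.

```python
import random

MASK = "<mask>"  # placeholder token; the real tokenizer's symbol may differ


def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Hide a random subset of residues (hypothetical helper).

    Returns (tokens, targets), where targets maps each masked position
    to the residue the model would have to predict from context.
    The 15% rate is an illustrative assumption, not a documented value.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(seq) * mask_rate))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    tokens = list(seq)
    targets = {}
    for i in positions:
        targets[i] = tokens[i]  # remember the true residue
        tokens[i] = MASK        # hide it from the model
    return tokens, targets


tokens, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ")
print(tokens)
print(targets)
```

During training, the loss would be computed only at the positions listed in `targets`, so the model has to infer each hidden residue from the unmasked residues around it.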

Decomposing the Model: Structural & Biochemical Interpretability

To probe the mechanics of what ESM-C internalizes during pretraining, researchers leveraged Sparse Autoencoders (SAEs) to break its internal representations down into more than 16,000 distinct, interpretable features.

SAE analysis indicates that the model recovers, without supervision, the hierarchical organization of biology established by decades of empirical science:

  • Amino Acid Biochemistry: Features activate on specific amino acid identities, generalized biochemical classes (e.g., aromatics, small hydrophobics), or context-dependent configurations like lysine side chains in low-complexity disordered regions.

  • Local Secondary Structure Interactions: Internal circuits selectively track the local physical interactions that form stable alpha helices and beta sheets.

  • Abstract Evolutionary Themes: Features track higher-level patterns that recur across unrelated folds, including conserved subunit interfaces, short linear motifs, and cellular localization signals.
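
To make the decomposition concrete, here is a minimal sketch of how a sparse autoencoder turns one embedding vector into feature activations. It assumes the common ReLU formulation (encode, then linearly decode); the source does not specify the exact SAE variant used for ESM-C. The weights are toy values, and the function names are hypothetical.

```python
def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def sae_encode(x, w_enc, b_enc, b_dec):
    """Feature activations: ReLU(W_enc @ (x - b_dec) + b_enc)."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [p + b for p, b in zip(matvec(w_enc, centered), b_enc)]
    return [max(0.0, p) for p in pre]


def sae_decode(features, w_dec, b_dec):
    """Reconstruction: W_dec @ features + b_dec."""
    return [r + b for r, b in zip(matvec(w_dec, features), b_dec)]


# Toy setup: a 2-dimensional "embedding" and 3 features (illustrative only).
w_enc = [[1, 0], [0, 1], [-1, -1]]   # shape (n_features, d_model)
w_dec = [[1, 0, -1], [0, 1, -1]]     # shape (d_model, n_features)
b_enc = [0.0, 0.0, 0.0]
b_dec = [0.0, 0.0]

x = [2.0, -1.0]
features = sae_encode(x, w_enc, b_enc, b_dec)
active = [i for i, f in enumerate(features) if f > 0]
x_hat = sae_decode(features, w_dec, b_dec)
print(features, active, x_hat)
```

In a real SAE the feature dimension is far larger than the embedding dimension (here, more than 16,000 features), and training adds a sparsity penalty so that only a few features fire per residue. Those few active features are the ones inspected for interpretation.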

Scientific Case Studies in Applied Genetics

Unsupervised Recovery of Functional Motifs

Without task-specific training, ESM-C successfully converged on the concept of the nucleophilic elbow—a catalytic motif that evolution independently favored across 32 distinct protein folds. A single learned SAE feature automatically activates on this motif across 90 out of 102 structurally unrelated, non-ancestral enzymes.

Mapping Low-Homology Gene-Editing Tools

When mapped across the broader sequence space, ESM-C features independently discover deep functional and evolutionary relationships within RNA-guided DNA endonucleases. The model groups eukaryotic Fanzor proteins directly alongside their prokaryotic TnpB ancestors, accurately mapping shared gene-editing signatures despite extreme evolutionary divergence and minimal sequence similarity.

ESM-C on Tamarind Bio: Access the World Model

Through a strategic partnership with Tamarind Bio, the open-source ESM-C architecture is optimized for accelerated cloud deployment.

By incorporating highly optimized context-parallel kernels running on NVIDIA hardware, the Tamarind Bio platform allows researchers to navigate, interpret, and extract raw structural representations at a scale that is out of reach for most individual labs.

How to Use ESM-C on Tamarind Bio

  1. Access the Model: Navigate to tamarind.bio and select the open-source ESM-C (Cambrian) representation engine.

  2. Submit Input Sequences: Provide raw, unannotated amino acid sequences from your target libraries.

  3. Decompose Latent Space Features: Opt to pass sequence data through the integrated Sparse Autoencoder (SAE) module to interpret specific biochemical or functional features.

  4. Evaluate Site Likelihoods: Score specific sequences using the masked language modeling head to evaluate structural tolerability or zero-shot fitness landscapes.

  5. Downstream Integration: Export the high-dimensional latent vectors to drive downstream applications, such as running high-throughput structure predictions or evaluating de novo binder candidates.
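
As a rough illustration of step 4, the sketch below scores a substitution with a masked-marginal log-odds ratio, a common zero-shot heuristic for masked language models. It is an assumption that the platform scores sites this way. The probabilities are made-up numbers standing in for the model's output at one masked position, and `masked_marginal_score` is a hypothetical helper, not a Tamarind Bio or ESM API call.

```python
import math


def masked_marginal_score(probs_at_site, wild_type, mutant):
    """Log-odds of the mutant versus the wild-type residue at a masked site.

    probs_at_site: dict mapping residue -> predicted probability at that
    position (illustrative values here, not real model output).
    Negative scores mean the model finds the mutant less likely.
    """
    return math.log(probs_at_site[mutant]) - math.log(probs_at_site[wild_type])


# Made-up distribution for a single masked position (assumption).
site_probs = {"L": 0.60, "I": 0.30, "P": 0.001}

conservative = masked_marginal_score(site_probs, "L", "I")
disruptive = masked_marginal_score(site_probs, "L", "P")
print(f"L->I: {conservative:.3f}  L->P: {disruptive:.3f}")
```

Scores near zero suggest the substitution is tolerated in context, while strongly negative scores flag positions where the model expects the wild-type residue. Summing such scores over several sites gives one simple way to rank variants.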
