How to Use Dayhoff Atlas Online


Commercially Available Online Web Server

The Dayhoff Atlas: Scaling Sequence Diversity for Protein Generation

The Dayhoff Atlas is a centralized collection of giga-scale protein sequence data and state-of-the-art generative protein language models (PLMs). Named after Margaret Dayhoff's foundational 1965 work, this resource is designed to improve protein generation by dramatically expanding the scale and diversity of sequences available for training. Operating solely in amino acid sequence space, the Dayhoff models can natively predict mutation effects on fitness, scaffold structural motifs, and perform guided generation of new proteins within a specific family.

Learning from the datasets within the Dayhoff Atlas has been shown to increase the cellular expression rates of generated proteins, confirming the real-world value of this new resource.

How The Dayhoff Atlas Works

The Atlas is a suite of datasets and a family of PLMs that unify information from various sources and modeling techniques:

  • GigaRef Dataset: This is the largest open dataset of natural proteins to date, containing 3.34 billion protein sequences. It integrates and reclusters metagenomic sequences with UniRef100 to capture the vast, diverse set of unculturable organisms across the tree of life.

  • BackboneRef Dataset: A first-in-class set of 46 million synthetic protein sequences that fuse the richness of protein structure with the scalability of sequence space. This data is generated from 240,811 de novo designed backbones. Notably, models trained with this structure-based data augmentation produced the highest expression success rate (51.7%) for generated proteins in E. coli.

  • Hybrid Model Architecture: The Dayhoff PLMs use a hybrid SSM (State-Space Model) and transformer architecture. This design efficiently accommodates long input lengths and combines single protein sequences with sets of evolutionarily related homologs at scale.

  • Bidirectional Generation: The models are trained to generate in both the N-to-C (left-right) and C-to-N (right-left) directions, enabling them to perform arbitrary and flexible sequence infilling for complex design tasks.
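One way to picture bidirectional training is that each sequence can be presented to the model in two orders, once read N-to-C and once read C-to-N. The sketch below is only an illustration of that idea; the direction marker tokens and the `directional_examples` helper are hypothetical and are not taken from the Dayhoff tokenizer or codebase.

```python
# Hypothetical direction markers; the real Dayhoff models may use different tokens.
FWD_START, FWD_END = "<n2c>", "</n2c>"
REV_START, REV_END = "<c2n>", "</c2n>"


def directional_examples(sequence: str) -> dict:
    """Return the same protein as an N-to-C and a C-to-N token list."""
    residues = list(sequence)
    return {
        # Left-to-right: residues in their natural order.
        "n_to_c": [FWD_START, *residues, FWD_END],
        # Right-to-left: the same residues, reversed.
        "c_to_n": [REV_START, *reversed(residues), REV_END],
    }


examples = directional_examples("MKTAY")
print(examples["n_to_c"])
print(examples["c_to_n"])
```

A model that has seen both orders can, in principle, extend a fragment toward either terminus, which is what makes infilling around a fixed region possible.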

What is Tamarind Bio?

Tamarind Bio is a pioneering no-code bioinformatics platform built to democratize access to powerful computational tools for life scientists and researchers. Recognizing that many cutting-edge machine learning models are often difficult to deploy and use, Tamarind provides an intuitive, web-based environment that completely abstracts away the complexities of high-performance computing, software dependencies, and command-line interfaces.

The platform is designed to provide easy access for biologists, chemists, and other researchers who may not have a background in programming or cloud infrastructure but want to run experimental models on their own data. Key features include a user-friendly graphical interface for setting up and launching experiments, a robust API for integration into existing research pipelines, and an automated system for managing and scaling computational resources. By handling the technical heavy lifting, Tamarind lets researchers concentrate on their scientific questions and accelerate the pace of discovery.

Accelerating Discovery with Dayhoff Atlas on Tamarind Bio

Using the Dayhoff Atlas and its trained PLMs on a platform like Tamarind would provide researchers with an unprecedented, scalable engine for protein design.

  • Zero-Shot Prediction & Screening: Researchers can use the highly accurate, zero-shot prediction capabilities of the Dayhoff models to quickly score thousands of single-site mutations and indels for their effect on fitness, drastically reducing the cost and time of experimental screening.

  • Functional Design: The platform can leverage the model's ability to scaffold structural motifs and perform guided generation (homolog conditioning) to design new proteins with desired functions, such as shortened variants of gene editors like Cas9.

  • High-Quality, Expressible Candidates: By using the models trained on the BackboneRef dataset, researchers can generate novel protein sequences that have a significantly higher probability of successful cellular expression.

  • Integrated Workflow: Tamarind would handle the massive datasets (GigaRef) and the complex hybrid architecture, making powerful evolutionary and structural reasoning available through a simple interface.
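The zero-shot screening idea above can be sketched in a few lines: enumerate every single-site substitution of a seed sequence, score each variant, and rank by the score difference relative to the wild type. This is a minimal sketch under stated assumptions, not the Dayhoff or Tamarind implementation. The `toy_log_likelihood` function is a placeholder that stands in for a real model's sequence log-likelihood, and all function names here are hypothetical.

```python
from typing import Callable, Iterator, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def single_site_variants(wild_type: str) -> Iterator[Tuple[str, str]]:
    """Yield (label, sequence) for every single substitution, e.g. ('M1A', 'AKT')."""
    for pos, wt_res in enumerate(wild_type):
        for new_res in AMINO_ACIDS:
            if new_res != wt_res:
                label = f"{wt_res}{pos + 1}{new_res}"  # 1-based position
                yield label, wild_type[:pos] + new_res + wild_type[pos + 1:]


def rank_variants(
    wild_type: str, log_likelihood: Callable[[str], float]
) -> List[Tuple[str, float]]:
    """Rank variants by log-likelihood difference versus the wild type."""
    wt_score = log_likelihood(wild_type)
    scored = [
        (label, log_likelihood(seq) - wt_score)
        for label, seq in single_site_variants(wild_type)
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def toy_log_likelihood(sequence: str) -> float:
    """Placeholder scorer (counts alanines); NOT a real protein language model."""
    return float(sequence.count("A"))


ranked = rank_variants("MKT", toy_log_likelihood)
for label, delta in ranked[:3]:
    print(label, delta)
```

In practice the placeholder would be replaced by whatever score the chosen model exposes; the enumeration and ranking logic stays the same.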

How to Use Dayhoff Atlas on Tamarind Bio

To leverage the power of the Dayhoff Atlas, a researcher could follow this streamlined workflow on Tamarind:

  1. Access the Platform: Begin by logging in to the tamarind.bio website.

  2. Select Dayhoff: From the list of available computational models, choose the Dayhoff tool.

  3. Input a Seed Sequence: Provide the wild-type or parental protein sequence you wish to analyze or optimize.

  4. Select a Task: Choose a function: Zero-Shot Prediction (to score variants for fitness), Motif Scaffolding (to generate a scaffold around a desired motif), or Guided Generation (to generate a new member of a specific protein family).

  5. Run Dayhoff Model: The platform runs the appropriate Dayhoff PLM (e.g., the 3-billion-parameter model). For generation tasks, the model can automatically use aligned or unaligned homologous sequences as context.

  6. Acquire and Validate: The output provides a set of novel, high-likelihood sequences that are prioritized for expression and function, accelerating the path to experimental validation.
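Before step 3, it can help to check the seed sequence locally so that stray characters do not cause a failed run. The sketch below strips an optional FASTA header and whitespace, then rejects anything outside the 20 standard amino acid letters. It is a local pre-check only; the `clean_seed_sequence` helper is hypothetical, and whether Tamarind accepts non-standard residue codes is an assumption not confirmed here.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_seed_sequence(raw: str) -> str:
    """Drop FASTA header lines and whitespace, uppercase, and validate residues."""
    parts = []
    for line in raw.splitlines():
        line = line.strip()
        if line.startswith(">"):  # FASTA header line, not sequence data
            continue
        parts.append("".join(line.split()))  # remove internal whitespace
    sequence = "".join(parts).upper()
    if not sequence:
        raise ValueError("empty sequence")
    invalid = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
    if invalid:
        raise ValueError(f"non-standard residue letters: {', '.join(invalid)}")
    return sequence


seed = clean_seed_sequence(">my_protein\nmktay iak\nQRQ\n")
print(seed)
```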
