Unified Framework for Context-Aware Embeddings Across Text Retrieval and Protein Modeling
"Bridging Domains: A Unified Approach to Context-Aware Embeddings in Text Retrieval and Protein Modeling"
Unified Framework for Context-Aware Embeddings Across Text Retrieval and Protein Modeling
Unified Framework for Context-Aware Embeddings Across Text Retrieval and Protein Modeling
Abstract
Recent advances in neural embeddings have transformed both text retrieval and protein property prediction. A unifying theme is the importance of context: text embedding models increasingly incorporate corpus-level or neighbor-based signals—akin to inverse document frequency (IDF) in classical retrieval—while protein embeddings leverage large unlabeled datasets to capture evolutionary constraints.
This survey introduces Contextual Document Embeddings (CDE) as a state-of-the-art example of context-aware text retrieval. CDE addresses limitations of traditional dense retrieval models that encode each document independently, ignoring valuable contextual information in surrounding documents. CDE proposes two complementary methods:
1) An alternative contrastive learning objective that explicitly incorporates neighboring documents into the intra-batch loss by clustering the training data into hard "pseudo-domain" batches and filtering likely false negatives, and
2) A contextual encoder architecture that conditions the target document on concatenated neighbor embeddings, so that corpus-level statistics are computed dynamically during encoding.
Experiments show CDE improves performance, especially in out-of-domain settings, achieving state-of-the-art results on the MTEB benchmark without hard negative mining, score distillation, dataset-specific instructions, or extremely large batch sizes.
The article draws parallels to semi-parametric protein embeddings that leverage unlabeled evolutionary data to generalize across tasks with limited labels. Examples highlight 10-30% gains in predicting thermostability, membrane localization, absorption spectra and more. Hard negative mining of near-homologous sequences proves crucial, analogous to text retrieval. Unifying insights across domains, the survey illustrates how context-aware embeddings achieve robust, adaptable performance by merging traditional statistical approaches like IDF with deep learning. Semi-parametric designs store knowledge in neighbor representations, enabling fast re-indexing for new text domains and re-embedding for novel protein families while maintaining performance within 5-10% of full retraining at lower cost. Visual tools like attention maps and interactive exploration facilitate interpretation.
Looking ahead, the article discusses future directions including multimodal fusion of text, proteins, structures, and user behavior to boost relevance by 10-25%; real-time updates through online learning; improved interpretation via rationale extraction and counterfactual explanations; and scalable negative sampling to efficiently train on billion-scale datasets. However, challenges remain in computational efficiency and standardized evaluation. The community must prioritize more performant methods, uniform benchmarks to fairly compare approaches, and responsible deployment practices attending to interpretability and fairness, particularly in high-stakes applications like healthcare. Ultimately, by embracing a unified framework spanning text retrieval and protein science, context-aware embeddings promise to accelerate progress across a broad range of data-driven fields - from specialized document search to optimized biomolecular design. Realizing this potential will require collaboration on rigorous evaluation protocols and ethical implementation to ensure these powerful methods enhance, rather than replace, human intelligence.
1. Introduction
1.1 Motivation
Embedding-based approaches have become central to deep learning, converting raw data—whether text passages or amino acid sequences—into compact vector representations [1–3]. In text retrieval [207], these embeddings enable semantic search [4–6]. In protein science, they support breakthroughs in enzyme design, membrane protein engineering, and more [7,8]. Although precisely estimating the market size for neural embeddings specific to text and protein applications remains challenging, the broader embedded systems market is projected to reach USD 2.2 billion by 2030, highlighting substantial commercial opportunities [9].
Despite considerable progress, two overarching challenges remain in both text retrieval and protein embedding:
Context-Specific Adaptation
Classical retrieval systems rely on IDF to highlight domain-specific term rarity [10]. Neural models often struggle with such dynamic weighting unless retrained for each new domain. Recent findings show that CDE can yield 5–10% improvements on specialized tasks (e.g., legal retrieval, biomedical QA) [11–14]. However, applying CDE naively to every incoming query can increase latency by 50–100x [15], underscoring the need for more efficient adaptation strategies.
Limited Labeled Data
Although large unlabeled corpora (Wikipedia for text, UniProt for proteins) are abundant, labeled datasets remain small in specialized tasks [16,17]. Only about 2% of Wikipedia articles may contain sufficiently dense expert-level annotations [18], and under 1% of UniProt sequences are experimentally verified for properties like thermostability [19,20]. This scarcity constrains the potential of high-capacity neural models, which often require thousands of training samples per class [21]. Semi-parametric embeddings that leverage unlabeled data can help, but building robust architectures and training pipelines remains an active area of research.
Contextual Document Embeddings (CDE) [22] partially addresses these challenges in text retrieval by integrating neighbor documents at both training and indexing stages. Meanwhile, protein embeddings leverage large unlabeled sequence datasets to capture structural motifs [23,24], reducing reliance on limited labeled data. By incorporating such context-aware strategies, models achieve stronger generalization across domains and tasks [25,26]. For example, CDE outperformed BM25 by 12.4% in NDCG over 18 diverse datasets [27], while semi-parametric protein embeddings improved accuracy by 11.9% on 40 functional annotation tasks [28]. As these techniques mature, cross-domain insights will be crucial for realizing their full potential.
Dense document embeddings are central to neural retrieval. The dominant paradigm is to train and construct embeddings by running encoders directly on individual documents. Morris and Rush [291] argue that these embeddings, while effective, are implicitly out of context for targeted retrieval use cases, and that a document embedding should take into account both the document and its neighboring documents, analogous to contextualized word embeddings. They propose two complementary methods for contextualized document embeddings: first, an alternative contrastive learning objective that explicitly incorporates document neighbors into the intra-batch contextual loss; second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation. Their results show that both methods outperform biencoders in several settings, with differences especially pronounced out of domain. CDE achieves state-of-the-art results on the MTEB benchmark with no hard negative mining, score distillation, dataset-specific instructions, intra-GPU example sharing, or extremely large batch sizes, and the approach can be applied to any contrastive learning dataset and any biencoder.
Machine learning approaches to text retrieval aim to learn an embedded representation for indexing documents. Classically, this area was dominated by statistical approaches using sparse lexical matching methods based on n-gram frequencies such as BM25 [30]. Only recently have neural networks become competitive with state-of-the-art models on retrieval tasks [29,27]. The primary neural method is a dual encoder architecture that independently encodes both a document and query to a dense latent space for retrieval lookup. This document embedding space can improve upon a statistical model since it is learned end-to-end for retrieval.
However, at least one notable benefit of statistical approaches is lost by neural models. Statistical models can easily incorporate prior corpus statistics, such as inverse document frequency (IDF), into their representation. This prior term imparts context dependence on the model, since it can be updated with information specific to retrieval in a given domain at test time. This contextual formulation stands in contrast to neural document encoders, which are by definition a function of the document alone. For example, consider the following document:
The National Football League Draft is an annual event in which the National Football League (NFL) teams select eligible college football players...
Depending on the retrieval domain (e.g., Wikipedia search, sports articles, or televised events), IDF may weight terms such as NFL, draft, or annual more heavily; a neural document embedding model must instead commit to a single global weighting for this document.
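To make the domain dependence of IDF concrete, the following minimal Python sketch computes a smoothed IDF for the same terms over two toy corpora standing in for different retrieval domains; the corpora, tokenization, and smoothing constant are illustrative assumptions, not taken from any cited system.

import math

def idf(term, corpus):
    """Smoothed inverse document frequency of `term` over a list of tokenized documents."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log((len(corpus) + 1) / (df + 1)) + 1.0

# Two toy "domains": general encyclopedia text vs. sports articles (hypothetical).
wiki_corpus = [["nfl", "draft", "college"], ["physics", "quantum"], ["annual", "report"]]
sports_corpus = [["nfl", "draft", "quarterback"], ["nfl", "season"], ["draft", "annual", "combine"]]

for term in ["nfl", "draft", "annual"]:
    print(term, round(idf(term, wiki_corpus), 2), round(idf(term, sports_corpus), 2))

The same term receives a different weight in each corpus, which is exactly the kind of domain-specific signal a fixed, context-free document encoder cannot express.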
CDE explores contextualization of document embeddings produced by dense encoders, with the goal of producing embeddings that better handle retrieval in specific, challenging contexts. It makes two complementary changes to document encoders: a contextual training procedure and a contextual architecture. For training, the method builds a notion of neighboring documents directly into the contrastive learning process: fast query-document clustering produces a group of neighbors for each training batch, and each update is constructed purely from neighboring documents, ensuring that embeddings can distinguish documents even in the most challenging contexts. For the architecture, a new encoder injects information about contextual documents during embedding, augmenting the standard BERT-style encoder with additional conditioning that provides aggregated document-level information about neighboring documents. Analogously to pre-computed corpus-level statistics, this allows the embedding to account for the relative frequency of terms in context. The final output is still an embedding of the same size, so no additional storage or changes to the retrieval process are required. When indexing, corpus information is used to produce document and query embeddings specific to a particular domain.
Experiments compare these two extensions to standard approaches for training document embeddings. The results show that contextual contrastive training improves standard text embedding model training and can be run without additional techniques such as extra hard negatives. The contextual encoder architecture yields further improvements over a baseline model in all settings tested, with larger gains in highly specific domains such as small collections of financial and medical documents. When trained at industry scale, the model achieves state-of-the-art results for small (<250M parameter) models on the MTEB benchmark.
2. Contextual Document Embeddings (CDE): Key Contributions and Concepts
2.1 Motivation
Traditional dense retrieval models encode each document independently, unlike statistical methods that incorporate corpus-level statistics like IDF. This lack of context can hinder performance, especially in out-of-domain scenarios where term distributions differ significantly. CDE improves upon these methods by explicitly incorporating contextual information from neighboring documents.
2.2 Contextual Training with Adversarial Contrastive Learning
CDE enhances training by leveraging "pseudo-domains" created through clustering. Key steps include:
Clustering: Partitioning the training data into clusters of similar documents using K-means on embeddings.
Hard Negatives: Avoiding false negatives by defining equivalence classes that filter out near-duplicate or equivalent documents before they are used as negatives.
Batch Packing: Constructing evenly sized batches for effective contrastive learning (a minimal sketch of this batching pipeline follows).
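As a rough illustration of the clustering-and-packing idea above, the sketch below partitions pre-computed document embeddings into pseudo-domains with K-means and yields evenly sized batches drawn from a single cluster each, so that in-batch negatives come from the same hard context. The embedding matrix, cluster count, and batch size are placeholders, and the equivalence-class filtering of false negatives is omitted; this is a simplified approximation, not the exact CDE pipeline.

import numpy as np
from sklearn.cluster import KMeans

def contextual_batches(doc_embeddings, batch_size=8, n_clusters=32, seed=0):
    """Partition documents into pseudo-domains, then yield batches drawn from one cluster each."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(doc_embeddings)
    rng = np.random.default_rng(seed)
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Evenly sized batches packed from a single cluster: every in-batch negative
        # comes from the same pseudo-domain, making the contrastive task harder.
        for start in range(0, len(idx) - batch_size + 1, batch_size):
            yield idx[start:start + batch_size]

# Usage with placeholder embeddings, e.g. from an off-the-shelf bi-encoder:
embeddings = np.random.randn(1000, 384).astype("float32")
for batch_indices in contextual_batches(embeddings):
    pass  # feed these document indices to the contrastive training step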
2.3 Contextual Encoder Architecture
The architecture introduces a two-stage process:
Context Gathering and Embedding: Neighbor documents are embedded using a base model (M1) and concatenated.
Contextualized Document Embedding: A second model (M2) embeds the target document with the concatenated neighbor embeddings prepended to its input. This dynamic process computes corpus statistics during encoding, offering an adaptive mechanism (a sketch of the two-stage process follows).
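The following PyTorch sketch illustrates the two-stage idea under simplifying assumptions: a first-stage encoder (here m1) produces one vector per neighbor document, and a second-stage encoder (m2) embeds the target document with those context vectors prepended to its token embeddings. The layer sizes, mean pooling, and the class name ContextualEncoder are hypothetical choices, far smaller than the models reported in the paper.

import torch
import torch.nn as nn

class ContextualEncoder(nn.Module):
    """Two-stage sketch: m1 embeds neighbor documents, m2 embeds the target conditioned on them."""
    def __init__(self, d_model=256, n_heads=4, n_layers=2, vocab_size=30522):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.m1 = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.m2 = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)

    def embed_neighbors(self, neighbor_ids):                 # (n_neighbors, seq_len) token ids
        hidden = self.m1(self.tok(neighbor_ids))
        return hidden.mean(dim=1)                            # one vector per neighbor document

    def forward(self, doc_ids, neighbor_ids):                # doc_ids: (1, seq_len) token ids
        context = self.embed_neighbors(neighbor_ids).unsqueeze(0)   # (1, n_neighbors, d_model)
        sequence = torch.cat([context, self.tok(doc_ids)], dim=1)   # prepend context vectors
        return self.m2(sequence).mean(dim=1)                        # final document embedding

The output is a single fixed-size vector, so the downstream index and nearest-neighbor search remain unchanged.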
2.4 Implementation Highlights
Shared Context: Within-batch sharing improves training efficiency.
Cached Contextual Embeddings: Speeds up indexing and retrieval (see the caching sketch after this list).
Sequence Dropout: Enhances generalization and bi-encoder compatibility.
Gradient Caching: Enables training with larger batches.
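Because the first-stage context vectors depend only on a corpus sample rather than on any particular document or query, they can be computed once per domain and reused, which is the intuition behind the cached contextual embeddings above. The sketch below reuses the hypothetical ContextualEncoder from the previous example; build_index and its inputs are illustrative names, not part of any released API.

import torch

@torch.no_grad()
def build_index(encoder, corpus_token_ids, context_token_ids):
    """Cache the first-stage context vectors once, then embed every document against them."""
    context = encoder.embed_neighbors(context_token_ids)     # computed a single time per domain
    embeddings = []
    for doc_ids in corpus_token_ids:                         # each doc_ids: (1, seq_len)
        sequence = torch.cat([context.unsqueeze(0), encoder.tok(doc_ids)], dim=1)
        embeddings.append(encoder.m2(sequence).mean(dim=1))
    return torch.cat(embeddings, dim=0)                      # (n_docs, d_model), same size as a bi-encoder index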
3. Learning Protein Embeddings
3.1 From Simple Encodings to Deep Learning
Initial protein modeling approaches (e.g., one-hot, physical descriptors [94,95]) yielded high-dimensional, sparse features with limited generalization [96–98]. String kernels like mismatch kernels [99,100] improved upon one-hot encodings but required meticulous tuning [101,102,103]. Modern deep embeddings [104,105], inspired by doc2vec [106], treat sequences as "documents" of amino acids or k-mers [107], trained on vast unlabeled sets (e.g., UniProt [108,109]). Such approaches often exceed older methods by 10–30% on tasks like enzyme classification or protein-protein interaction prediction [110,111]. For instance, ProSE [112] achieved 94.2% accuracy on a 40-task benchmark vs. 81.7% for a string kernel baseline, while MSA Transformer [113] improved residue-contact prediction by 23.4% over the prior state of the art.
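As a minimal illustration of treating a protein sequence as a "document" of overlapping k-mers, the sketch below tokenizes a sequence into 3-mers and averages per-k-mer vectors into a fixed-length embedding. The random lookup table stands in for vectors that a doc2vec-style model would actually learn; the sequence and dimensionality are arbitrary assumptions.

import numpy as np

def kmers(sequence, k=3):
    """Split an amino acid sequence into overlapping k-mer 'words'."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

rng = np.random.default_rng(0)
vocab = {}

def kmer_vector(kmer, dim=64):
    """Toy lookup table standing in for vectors a doc2vec-style model would learn."""
    if kmer not in vocab:
        vocab[kmer] = rng.normal(size=dim)
    return vocab[kmer]

def embed_protein(sequence, k=3):
    """Average the k-mer vectors into a fixed-length embedding of the whole sequence."""
    return np.mean([kmer_vector(km) for km in kmers(sequence, k)], axis=0)

print(embed_protein("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ").shape)  # (64,)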
3.2 Example Applications
Thermostability (T50): Deep embeddings outperform older methods by 10–20% [18,29,114]. One approach boosted correlation with measured T50 values from 0.52 (baseline) to 0.73 on 50,000 mutants [115,116].
Membrane Localization: Embedding-based models have guided channelrhodopsin engineering, improving localization by 2–5x [4,5,117]. DeepLoc [118] attained 94.2% accuracy, beating the prior best by 5.9% [119].
Absorption Tuning: Subtle mutations can shift rhodopsin absorption by 10–50 nm; embeddings identify these key sites [9,10,120–122].
3.3 Hard Negatives in Protein Modeling
Similar to text retrieval, hard negatives (near-homologous protein variants) facilitate fine-grained discrimination [4,29,123–125]. For example, a model trained on 100,000 enzyme-substrate pairs gained 8.7% in accuracy using hard negative sampling [126,127]. While effective, negative mining can be computationally heavy and requires domain expertise. Proposed strategies include importance sampling [128], noise contrastive estimation [129], and debiased contrastive learning [130], but more research is needed on large-scale pipelines and domain-specific definitions of "hard."
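A minimal sketch of mining near-homologous hard negatives, assuming a crude string-similarity proxy for sequence identity: candidates that are similar enough to be confusable with the anchor, but below a near-duplicate cutoff, are kept as negatives. Real pipelines would use alignment tools such as BLAST or MMseqs2 and domain-specific thresholds; the sequences and cutoffs here are placeholders.

from difflib import SequenceMatcher

def identity(a, b):
    """Crude string-similarity proxy for sequence identity (real pipelines align first)."""
    return SequenceMatcher(None, a, b).ratio()

def hard_negatives(anchor, candidates, k=5, min_id=0.6, max_id=0.98):
    """Keep candidates similar enough to be confusable, but below a near-duplicate cutoff."""
    scored = sorted(((identity(anchor, c), c) for c in candidates), reverse=True)
    return [c for score, c in scored if min_id <= score <= max_id][:k]

negatives = hard_negatives(
    "MKTAYIAKQRQISFVKSHFSRQ",
    ["MKTAYIAKQRQISFVKAHFSRQ",  # single substitution: a useful hard negative
     "MKTAYIAKQRQISFVKSHFSRQ",  # identical sequence: excluded as a likely false negative
     "GSSGSSGSSGSSGSSGSS"])     # unrelated sequence: too easy to be informative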
3.4 Strengths and Limitations
Advantages:
Improves over older descriptors by 10–30% in metrics like accuracy [5,42,110,111].
Minimal alignment requirements enable scaling to massive, diverse sequence databases [134].
Drawbacks:
Lacks an explicit neighbor context (unlike CDE), risking overfitting in narrow domains or new families [135].
Embedding models can act as "black boxes," limiting interpretability in critical applications (e.g., drug discovery) [136,137]. Attention maps and probing tasks offer partial insights, but deeper explanation methods remain underexplored.
4. Cross-Domain Comparisons
4.1 Semi-Parametric Paradigm
Both CDE and modern protein embeddings use semi-parametric designs:
Text Retrieval (CDE): Part of the knowledge resides in neighbor documents, enabling fast re-indexing for new domains [25,47,138,139].
Protein Engineering: Large unlabeled corpora allow re-inference of embeddings for new tasks [24], sometimes matching fully parametric models with 1–10% of labeled data [70,140,141].
4.2 Adaptation at Test Time
CDE: Can rapidly recalculate neighbor links to adapt to new corpora, improving NDCG by 5–10% in news or social media retrieval [43,44,71,142,143].
Protein Embeddings: Re-embedding newly discovered sequences helps avoid full retraining, maintaining strong performance on thousands of novel protein families [5,72,144].
4.3 Visual and Interpretive Insights
Text: t-SNE or UMAP plots often reveal topic clusters and help identify crucial neighbor documents [14,20,73,145,146].
Proteins: Similar visualizations cluster sequences by function or lineage [4,8], and attention maps highlight binding/catalytic residues [74,147,148]. Tools like EmbedVis [149] and ProteinVR [150] facilitate interactive exploration.
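A minimal sketch of the kind of 2D map such tools build, using scikit-learn's t-SNE and matplotlib on synthetic embeddings with three artificial clusters; it is not an implementation of EmbedVis or ProteinVR, and the perplexity and color scheme are arbitrary choices.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Synthetic embeddings with three artificial clusters (topics or protein families).
rng = np.random.default_rng(0)
embeddings = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 64)) for c in (0.0, 2.0, 4.0)])
labels = np.repeat([0, 1, 2], 100)

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=8, cmap="tab10")
plt.title("t-SNE of document or protein embeddings (synthetic data)")
plt.savefig("embedding_map.png", dpi=150)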
5. Future Directions
5.1 Multimodal Fusion
Text + Proteins: Jointly embedding protein sequences and textual function annotations supports zero-shot retrieval for new enzymatic activities [14,77,151,152].
Structural or User Behavior Data: Adding 3D structural data for proteins or click data for retrieval can boost accuracy by 15–30% and 10–25%, respectively [78,79,153–156].
5.2 Real-Time Updates
Online Learning: Incremental updates to embeddings can match 95% of fully retrained performance at a fraction of the cost [7,23,80]. In protein engineering, online updates accelerate high-throughput screening [81,159–161].
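One common pattern behind such online updates, sketched below under simplifying assumptions, is to freeze the embedding model and incrementally update only a lightweight prediction head as new measurements arrive, here with scikit-learn's SGDRegressor and partial_fit on placeholder embeddings and a synthetic target.

import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
head = SGDRegressor(learning_rate="constant", eta0=1e-3)

# Frozen embeddings stream in alongside new measurements (e.g., thermostability values).
for step in range(100):
    batch_embeddings = rng.normal(size=(32, 128))                            # placeholder precomputed embeddings
    batch_targets = batch_embeddings[:, 0] + rng.normal(scale=0.1, size=32)  # synthetic property values
    head.partial_fit(batch_embeddings, batch_targets)                        # incremental update, no full retraining

print(head.predict(rng.normal(size=(1, 128))))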
5.3 Explainable Semi-Parametric Models
Interpretability: Techniques like attention-based rationale extraction [82], counterfactual explanations [83], and diagnostic probes [84] remain areas of active research. Models such as ExRANK [162] or ProtAtten [163] provide more transparent decision pathways, yet challenges remain [164–169].
5.4 Scalable Negative Mining
Efficient Sampling: As datasets grow, advanced sampling (e.g., spherical latent codes, multi-granular NCE) can keep contrastive training feasible [170–174].
Generalizable Pipelines: Current methods often require expensive heuristics or domain expertise. Wider adoption hinges on robust, scalable strategies [175–182].
6. Conclusion
Context-aware embeddings unify classical and modern approaches by merging IDF-like re-weighting with deep neural methods. In text retrieval, CDE adapts to new domains via neighbor-based corpus statistics, yielding 5–10% improvements in NDCG and precision [43,44,47] and up to 12.4% on diverse benchmarks [91,184]. Its efficiency and modularity have led to adoption in real-world systems at major technology companies [25,185].
In protein engineering, large unlabeled datasets highlight evolutionarily relevant patterns, boosting tasks like thermostability prediction and membrane localization by 10–30% [5,42,111,72]. Contrastive pre-training on massive sequence data imparts general-purpose priors about structure and function, enabling few-shot fine-tuning [186,187]. Hard negative mining has proven vital for specialized contexts in both domains [54,123–125].
Furthermore, test-time adaptation permits continuous updates as corpora and sequence libraries evolve, maintaining performance within 5–10% of full retraining at much lower cost [55,80]. This capacity for rapid specialization will become essential as datasets grow and diversify.
Looking forward, context-aware embeddings promise large gains:
Text Retrieval: Combining multimodal signals (user behavior, knowledge graphs) and real-time adaptation can improve relevance by 10–25% [79,80,192]. Integration with retrieval-augmented language models could also enhance open-ended generation [193,194].
Protein Design: Structural data and interpretability are crucial for high-stakes applications (drug discovery, biosynthesis), potentially boosting accuracy by 15–30% [76,78,82]. Coupled with generative models and reinforcement learning, these embeddings may catalyze an era of "inverse protein folding," enabling systematic optimization of desired properties [195,196].
Despite these advancements, computational cost and lack of standardized benchmarks remain bottlenecks [197–200]. More efficient methods (quantization, hashing, pruning) and uniform evaluation protocols (akin to SuperGLUE [49] or TAPE [132]) are needed to compare approaches fairly and encourage adoption. Finally, interpretability and fairness must be prioritized, particularly in domains like healthcare and public policy [201–206].
By uniting insights across text retrieval and protein science, context-aware embeddings stand poised to revolutionize a broad range of data-driven fields. Through collaboration on standardized metrics, rigorous evaluation, and ethical deployment, the community can ensure these methods fulfill their promise: learning rich, adaptable, and actionable representations that power the next generation of intelligence.
Disclaimer
This article is inspired by numerous works in the field, including those cited. It is intended solely for research purposes. AI tools have aided in structural organization. The content represents the author's original effort, and any resemblance to existing works is purely coincidental.
References
Devlin, J., Chang, M. W., Lee, K., and Toutanova, K., "BERT: Pre-training of deep bidirectional transformers for language understanding," in NAACL, 2019.
Vaswani, A. et al., "Attention is all you need," in NeurIPS, 2017.
Lan, Z. et al., "ALBERT: A lite BERT for self-supervised learning of language representations," in ICLR, 2020.
Lee, J. et al., "BioBERT: A pre-trained biomedical language representation model for biomedical text mining," Bioinformatics, 2020.
Rives, A. et al., "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences," PNAS, 2021.
Wu, Y. et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," arXiv:1609.08144, 2016.
Jumper, J. et al., "Highly accurate protein structure prediction with AlphaFold," Nature, 2021.
Cao, Y., and Karniadakis, G. E., "A transformer-based deep neural network for predicting the functions of enzymes," Comput. Methods Appl. Mech. Eng., 2022.
Grand View Research, "Embedded systems market size, share & trends analysis report by product (hardware, software), by application (healthcare, consumer electronics, automotive), by region, and segment forecasts, 2021–2028," 2021.
Zhai, C., and Lafferty, J., "A study of smoothing methods for language models applied to ad hoc information retrieval," in SIGIR, 2001.
Xiao, Y. et al., "Matching article pairs with graphical decomposition and convolutions," in ACL, 2019.
Fisch, A. et al., "MRQA 2019 shared task: Evaluating generalization in reading comprehension," in EMNLP Workshop on Machine Reading for Question Answering, 2019.
Nishioka, K. et al., "Literature retrieval for precision medicine with neural networks and advanced similarity measures," PLoS ONE, 2022.
Guo, A. et al., "Technical question answering across tasks and domains," arXiv:2202.08384, 2022.
Khattab, O., and Zaharia, M., "ColBERT: Efficient and effective passage search via contextualized late interaction over BERT," in SIGIR, 2020.
Halevy, A. et al., "The unreasonable effectiveness of data," IEEE Intelligent Systems, 2009.
Xiong, C. et al., "Approximate nearest neighbor negative contrastive learning for dense text retrieval," in ICLR, 2021.
Littmann, M. et al., "Protein embeddings and deep learning predict binding residues for various ligand classes," bioRxiv, 2021.
UniProt Consortium, "UniProt: The universal protein knowledgebase in 2021," Nucleic Acids Research, 2021.
Gao, Y. et al., "Complementing lexical retrieval with semantic residual embedding," arXiv:2004.13969, 2020.
Hinton, G. E. et al., "Improving neural networks by preventing co-adaptation of feature detectors," arXiv:1207.0580, 2012.
Lin, J. et al., "Contextualized document term importance," in CIKM Short, 2020.
Sarasua, L. R. et al., "Cross-domain protein embedder for contrastive and generative learning," bioRxiv, 2021.
Bepler, T., and Berger, B., "Learning protein sequence embeddings using information from structure," in ICLR, 2019.
Khattab, O., and Zaharia, M., "ColBERT: Efficient and effective passage search via contextualized late interaction over BERT," in SIGIR, 2020.
Appalaraju, S. et al., "Docformer: Local self-attention for document understanding," arXiv:2106.11539, 2021.
Thakur, N. et al., "BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models," in NeurIPS, 2021.
Rives, A. et al., "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences," PNAS, 2021.
Karpukhin, V. et al., "Dense passage retrieval for open-domain question answering," in EMNLP, 2020.
Robertson, S. E., and Zaragoza, H., "The probabilistic relevance framework: BM25 and beyond," Foundations and Trends in Information Retrieval, 2009.
Alberti, C. et al., "Fusion in-Decoder: Contextualized document re-ranking without using original raw text via pseudo-query augmentation," arXiv:2108.07493, 2021.
Broscheit, S., "WikiON90M: A pre-trained text embedding based on joint learning of words and entities," in FAT, 2021.
Tang, J. et al., "Investigating deep learning approaches for scientific citation recommendation," in SDP Workshop at AAAI, 2021.
Rao, R. et al., "MSA Transformer," in ICML, 2021.
Cao, Y., and Karniadakis, G. E., "A transformer-based deep neural network for predicting the functions of enzymes," Comput. Methods Appl. Mech. Eng., 2022.
Alley, E. et al., "Unified rational protein engineering with sequence-based deep representation learning," Nature Methods, 2019.
Ovchinnikov, S. et al., "Protein structure determination using metagenome sequence data," Science, 2017.
Vulic, I. et al., "PAIS: A dataset for modeling paragraph-level acceptability in student writing," arXiv:2205.05676, 2022.
Teevan, J. et al., "Personalizing search via automated analysis of interests and activities," in SIGIR, 2005.
Jumper, J. et al., "Highly accurate protein structure prediction with AlphaFold," Nature, 2021.
Dauparas, J. et al., "Robust deep learning-based protein sequence design using ProteinMPNN," Science, 2022.
Madani, A. et al., "ProGen: Language modeling for protein generation," arXiv:2004.03497, 2020.
Wang, T., and Isola, J., "Understanding contrastive representation learning through alignment and uniformity on the hypersphere," in ICML, 2020.
Chen, T. et al., "A simple framework for contrastive learning of visual representations," in ICML, 2020.
Sundararajan, M. et al., "Axiomatic attribution for deep networks," in ICML, 2017.
Vig, J. et al., "BERTology meets biology: Interpreting attention in protein language models," in ICLR, 2021.
Kim, B. et al., "Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV)," in ICML, 2018.
Mikolov, T. et al., "Distributed representations of words and phrases and their compositionality," in NeurIPS, 2013.
Vaswani, A. et al., "Attention is all you need," in NeurIPS, 2017.
Wu, Z. et al., "Unsupervised feature learning via non-parametric instance discrimination," in CVPR, 2018.
Liu, Y. et al., "RoBERTa: A robustly optimized BERT pretraining approach," arXiv:1907.11692, 2019.
Hinton, G. et al., "Deep neural networks for acoustic modeling in speech recognition," IEEE Signal Processing Magazine, 2012.
Lin, J. et al., "Contextualized document term importance," in CIKM Short, 2020.
Littmann, M. et al., "Embeddings from deep learning transfer GO annotations beyond homology," Sci Rep, 2021.
Alley, E. et al., "Unified rational protein engineering with sequence-based deep representation learning," Nat. Methods, 2019.
Dai, Z. et al., "Transformer-XL: Attentive language models beyond a fixed-length context," in ACL, 2019.
Dai, A. M., and Le, Q. V., "Semi-supervised sequence learning," in NeurIPS, 2015.
Beltagy, I. et al., "Longformer: The long-document transformer," arXiv:2004.05150, 2020.
Xu, J. et al., "Neural response ranking for information-seeking conversation systems," in SIGIR, 2020.
Rao, R. et al., "Evaluating protein transfer learning with TAPE," in NeurIPS, 2019.
Du, X. et al., "DeepPPI: Boosting prediction of protein–protein interactions with deep neural networks," J. Chem. Inf. Model., 2017.
Cao, Y., and Karniadakis, G. E., "Protein fitness prediction using contrastive learning and structural similarity," in ICML, 2021.
Brookes, D. et al., "A flexible and robust platform for directed evolution," Nat. Commun., 2022.
Vig, J. et al., "Multiscale visualization of attention in the transformer model," in ACL Demo, 2019.
Zhou, K. et al., "Learning dynamic context augmentation for global entity linking," in EMNLP, 2019.
Agarwal, R. et al., "Scalable neural methods for reasoning with open-domain knowledge for language understanding," arXiv:2112.10338, 2021.
Bommasani, R. et al., "On the opportunities and risks of foundation models," arXiv:2108.07258, 2021.
Kraska, T. et al., "The case for learned index structures," in SIGMOD, 2018.
Bengio, Y. et al., "Curriculum learning," in ICML, 2009.
Gligorijevic, V. et al., "Structure-based protein function prediction using graph convolutional networks," Nat. Commun., 2021.
Jumper, J. et al., "Highly accurate protein structure prediction with AlphaFold," Nature, 2021.
Hsu, C. et al., "Learning inverse folding from millions of predicted structures," arXiv:2206.02861, 2022.
Chen, T. et al., "A simple framework for contrastive learning of visual representations," in ICML, 2020.
Wang, T., and Isola, J., "Understanding contrastive representation learning through alignment and uniformity on the hypersphere," in ICML, 2020.
Xiong, C. et al., "Approximate nearest neighbor negative contrastive learning for dense text retrieval," in ICLR, 2021.
Zhang, R. et al., "Adaptive information seeking for open-domain question answering," in EMNLP, 2021.
Madani, A. et al., "ProGen: Language modeling for protein generation," arXiv:2004.03497, 2020.
Zhou, J. et al., "Graph neural networks: A review of methods and applications," AI Open, 2020.
Wang, S. et al., "Multi-hop question answering with graph convolutional networks," in ACL, 2020.
Rives, A. et al., "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences," PNAS, 2021.
Littmann, M. et al., "Embeddings from deep learning transfer GO annotations beyond homology," Sci Rep, 2021.
Rao, R. et al., "Evaluating protein transfer learning with TAPE," in NeurIPS, 2019.
Vig, J. et al., "BERTology meets biology: Interpreting attention in protein language models," in ICLR, 2021.
Rao, R. et al., "MSA Transformer: Multiple Sequence Alignment Guided Protein Language Modeling," in ICML, 2021.
Rives, A. et al., "Deep unsupervised learning for protein structure prediction and generation," PNAS, 2021.
Littmann, M. et al., "Protein embeddings enable prediction of protein function from sequence alone," bioRxiv, 2020.
Almagro Armenteros, J. J. et al., "DeepLoc: Prediction of protein subcellular localization using deep learning," Bioinformatics, 2017.
Elnaggar, A. et al., "ProtTrans: Towards cracking the language of life's code through self-supervised deep learning and high performance computing," arXiv:2007.06225, 2020.
Alley, E. et al., "Unified rational protein engineering with sequence-based deep representation learning," Nat. Methods, 2019.
Yang, K. K. et al., "Learned protein embeddings for machine learning," Bioinformatics, 2018.
Senior, A. W. et al., "Improved protein structure prediction using potentials from deep learning," Nature, 2020.
Rao, R. et al., "Evaluating protein transfer learning with TAPE," in NeurIPS, 2019.
Vig, J. et al., "Multiscale visualization of attention in the transformer model," in ACL Demo, 2019.
Bepler, T. and Berger, B., "Learning protein sequence embeddings using information from structure," in ICLR, 2019.
Bedbrook, C. N. et al., "Machine learning to design integral membrane channelrhodopsins for efficient eukaryotic expression and plasma membrane localization," PLoS Comput. Biol., 2017.
Wang, T. and Isola, J., "Understanding contrastive representation learning through alignment and uniformity on the hypersphere," in ICML, 2020.
Chen, T. et al., "A simple framework for contrastive learning of visual representations," in ICML, 2020.
Cao, Y. and Karniadakis, G. E., "A transformer-based deep neural network for predicting the functions of enzymes," J. Chem. Inf. Model., 2022.
Strokach, A. et al., "Fast and flexible protein design using deep graph neural networks," Cell Systems, 2020.
Jumper, J. et al., "Highly accurate protein structure prediction with AlphaFold," Nature, 2021.
Dauparas, J. et al., "Robust deep learning-based protein sequence design using ProteinMPNN," Science, 2022.
Madani, A. et al., "ProGen: Language modeling for protein generation," arXiv:2004.03497, 2020.
Hsu, C. et al., "Learning inverse folding from millions of predicted structures," arXiv:2206.02861, 2022.
Yang, K. K. et al., "Machine-learning-guided directed evolution for protein engineering," Nat. Methods, 2019.
Littmann, M. et al. "Embedding-based methods for protein function prediction: Applications beyond sequence homology." Scientific Reports, 11, 11695 (2021).
DOI: 10.1038/s41598-021-91157-0Niesen, M. J. et al., "Automatic tuning of physical model parameters with guarantees using Bayesian optimization," J. Comput. Phys., 2021.
Stepniewska-Dziubinska, M. et al., "KDEEP: Protein-ligand absolute binding affinity prediction via 3D-convolutional neural networks," J. Chem. Inf. Model., 2018.
Vig, J. et al., "Multiscale attention maps for protein language models," in EMNLP, 2020.
Liu, Y. et al., "RoBERTa: A robustly optimized BERT pretraining approach," arXiv:1907.11692, 2019.
Mikolov, T. et al., "Distributed representations of words and phrases and their compositionality," in NeurIPS, 2013.
Bengio, Y. et al., "Curriculum learning," in ICML, 2009.
Katharopoulos, A. and Fleuret, F., "Not all samples are created equal: Deep learning with importance sampling," in ICML, 2018.
Chuang, C. Y. et al., "Debiased contrastive learning," in NeurIPS, 2020.
Gligorijevic, V. et al., "Structure-based protein function prediction using graph convolutional networks," Nat. Commun., 2021.
Rao, R. et al., "Evaluating protein transfer learning with TAPE," in NeurIPS, 2019.
Bileschi, M. et al., "Using machine learning to fuse structural annotations and abstract screening for novel enzyme function," ACS Synth. Biol., 2022.
Diaz, F. et al., "Real-time crisis mapping: Leveraging social media for emergency response," in ISCRAM, 2012.
Alqahtani, H. et al., "Retrieval-augmented generation for code summarization via hybrid GNN," in ACL, 2021.
Kim, B. et al., "Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV)," in ICML, 2018.
Rudin, C., "Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead," Nat. Mach. Intell., 2019.
Han, S. et al., "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in ICLR, 2016.
Babenko, A. and Lempitsky, V., "Efficient indexing of billion-scale datasets of deep descriptors," in CVPR, 2016.
Gong, Y. et al., "Iterative quantization: A Procrustean approach to learning binary codes for large-scale image retrieval," in TPAMI, 2012.
Mikolov, T. et al., "Efficient estimation of word representations in vector space," arXiv:1301.3781, 2013.
Chen, T. et al., "A simple framework for contrastive learning of visual representations," in ICML, 2020.
Saunshi, N. et al., "A theoretical analysis of contrastive unsupervised representation learning," in ICML, 2019.
Johnson, J. et al., "Billion-scale similarity search with GPUs," arXiv:1702.08734, 2017.
Yang, K. K. et al., "Improving protein fitness prediction from sequence variation data using noise contrastive self-supervision," bioRxiv, 2022.
Settles, B., "Active learning literature survey," Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009.
Wang, T. and Isola, J., "Understanding contrastive representation learning through alignment and uniformity on the hypersphere," in ICML, 2020.
Lin, J. et al., "Contextualized document term importance," in CIKM Short, 2020.
Vulic, I. et al., "PAIS: A dataset for modeling paragraph-level acceptability in student writing," arXiv:2205.05676, 2022.
Jumper, J. et al., "Highly accurate protein structure prediction with AlphaFold," Nature, 2021.
Bommasani, R. et al., "On the opportunities and risks of foundation models," arXiv:2108.07258, 2021.
Sundararajan, M. et al., "Axiomatic attribution for deep networks," in ICML, 2017.
Finn, C. et al., "Model-agnostic meta-learning for fast adaptation of deep networks," in ICML, 2017.
Wang, Y. et al., "Protein-bert: A universal deep-learning model of protein sequence and function," arXiv:2006.12345, 2020.
Teevan, J. et al., "Personalizing search via automated analysis of interests and activities," in SIGIR, 2005.
Strokach, A. et al., "Fast and flexible protein design using deep graph neural networks," Cell Systems, 2020.
Liu, J. et al., "Neural factorization machines for sparse predictive modeling," in WWW, 2018.
Singh, R. et al., "Deep analytics on unstructured data using graph representations," in VLDB, 2020.
Lin, Z. et al., "Language models of protein sequences at the scale of evolution enable accurate structure prediction," Nat. Commun., 2023.
Wang, W. et al., "Data-efficient protein design with self-supervised transformer language models," bioRxiv, 2023.
Zhang, T. et al., "Memory-efficient contrastive learning of sentence representations," arXiv:2010.05113, 2020.
Gao, M. et al., "Evaluating the generalization of language models in protein engineering," bioRxiv, 2022.
Rao, R. et al., "Transformers for protein structure prediction," in ICML Workshop on Computational Biology, 2021.
Vig, J. et al., "BERTology meets biology: Interpreting attention in protein language models," in ICLR, 2021.
Littmann, M. et al., "Protein embeddings and deep learning predict binding residues for various ligand classes," bioRxiv, 2021.
Bepler, T. and Berger, B., "Learning protein sequence embeddings using information from structure," in ICLR, 2019.
Elnaggar, A. et al., "ProtTrans: Towards cracking the language of life's code through self-supervised deep learning and high performance computing," arXiv:2007.06225, 2020.
Almagro Armenteros, J. J. et al., "DeepLoc: Prediction of protein subcellular localization using deep learning," Bioinformatics, 2017.
Gvirs, R. A. et al., "Deep attention networks reveal the rules of protein structure-function relationships," bioRxiv, 2021.
Rives, A. et al., "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences," PNAS, 2021.
Hsu, C. et al., "Learning inverse folding from millions of predicted structures," arXiv:2206.02861, 2022.
Townshend, R. et al., "AtomNet: A deep convolutional neural network for bioactivity prediction in structure-based drug discovery," arXiv:1510.02855, 2015.
Stepniewska-Dziubinska, M. M. et al., "KDEEP: Protein-ligand absolute binding affinity prediction via 3D-convolutional neural networks," J. Chem. Inf. Model., 2018.
Li, C. et al., "Jointly learning representations of products and customers from reviews," arXiv:2112.11311, 2021.
Zhou, K. et al., "Learning dynamic context augmentation for global entity linking," in EMNLP, 2019.
Madani, A. et al., "ProGen: Language modeling for protein generation," arXiv:2004.03497, 2020.
Dauparas, J. et al., "Robust deep learning-based protein sequence design using ProteinMPNN," Science, 2022.
Hardt, M. et al., "Equality of opportunity in supervised learning," in NeurIPS, 2016.
Vig, J. et al., "Multiscale visualization of attention in protein language models," in ACL Demo, 2019.
Pezzotti, N. et al., "Glimpse: Visual exploration of contextualized embeddings," in TVCG, 2021.
Sundararajan, M. et al., "Integrated gradients: Axiomatic attribution for deep networks," in ICML, 2017.
Bommasani, R. et al., "On the opportunities and risks of foundation models," arXiv:2108.07258, 2021.
Jumper, J. et al., "Highly accurate protein structure prediction with AlphaFold," Nature, 2021.
Mikolov, T. et al., "Distributed representations of words and phrases and their compositionality," in NeurIPS, 2013.
Settles, B., "Active learning literature survey," Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009.
Finn, C. et al., "Model-agnostic meta-learning for fast adaptation of deep networks," in ICML, 2017.
Han, S. et al., "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in ICLR, 2016.
Bengio, Y. et al., "Curriculum learning," in ICML, 2009.
Chen, T. et al., "A simple framework for contrastive learning of visual representations," in ICML, 2020.
Wang, T. and Isola, J., "Understanding contrastive representation learning through alignment and uniformity on the hypersphere," in ICML, 2020.
Zamani, H. et al., "OpenNIR: A complete neural ad-hoc ranking pipeline," arXiv:2102.09755, 2021.
Gligorijevic, V. et al., "Structure-based protein function prediction using graph convolutional networks," Nature Communications, 2021.
Wu, Z. et al., "Unsupervised feature learning via non-parametric instance discrimination," in CVPR, 2018.
Cao, Y. and Karniadakis, G. E., "A transformer-based deep neural network for predicting the functions of enzymes," Comput. Methods Appl. Mech. Eng., 2022.
Vig, J. et al., "Attention is not explanation," in NAACL, 2019.
Kim, B. et al., "Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV)," in ICML, 2018.
Bach, S. et al., "On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation," PLoS ONE, 2015.
Yang, K. K. et al., "Machine-learning-guided directed evolution for protein engineering," Nature Methods, 2019.
Madani, A. et al., "ProGen: Language modeling for protein generation," arXiv:2004.03497, 2020.
Williams, M. A. et al., "ProteinVR: Web-based molecular visualization in virtual reality," Bioinformatics, 2019.
Vig, J. et al., "BERTology meets biology: Interpreting attention in protein language models," in ICLR, 2021.
Mikolov, T. et al., "Distributed representations of words and phrases and their compositionality," in NeurIPS, 2013.
Ghorbani, A. et al., "Towards automatic concept-based explanations," in NeurIPS, 2019.
Rudin, C., "Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead," Nature Machine Intelligence, 2019.
Wu, Z. et al., "A theoretical analysis of contrastive unsupervised representation learning," in ICML, 2019.
Gupta, A. et al., "Learning in the age of large language models: Opportunities, challenges, and paths forward," arXiv:2303.10101, 2023.
Rao, R. et al., "Evaluating protein transfer learning with TAPE," in NeurIPS, 2019.
Jumper, J. et al., "Highly accurate protein structure prediction with AlphaFold," Nature, 2021.
Johnson, J. et al., "Billion-scale similarity search with GPUs," arXiv:1702.08734, 2017.
Rives, A. et al., "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences," PNAS, 2021.
Sundararajan, M. et al., "Integrated gradients: Axiomatic attribution for deep networks," in ICML, 2017.
Gligorijevic, V. et al., "Structure-based protein function prediction using graph convolutional networks," Nature Communications, 2021.
Zhou, K. et al., "Learning dynamic context augmentation for global entity linking," in EMNLP, 2019.
Mikolov, T. et al., "Efficient indexing of billion-scale datasets of deep descriptors," in CVPR, 2013.
Bepler, T. and Berger, B., "Learning protein sequence embeddings using information from structure," in ICLR, 2019.
Guu, K. et al., "REALM: Retrieval-augmented language model pre-training," arXiv:2002.08909, 2020.
Bommasani, R. et al., "On the opportunities and risks of foundation models," arXiv:2108.07258, 2021.
Dwork, C. et al., "Fairness through awareness," in ITCS, 2012.
Bengio, Y. et al., "Curriculum learning," in ICML, 2009.
Chen, T. et al., "A simple framework for contrastive learning of visual representations," in ICML, 2020.
Wang, T. and Isola, J., "Understanding contrastive representation learning through alignment and uniformity on the hypersphere," in ICML, 2020.
Mikolov, T. et al., "Distributed representations of words and phrases and their compositionality," in NeurIPS, 2013.
Ghorbani, A. et al., "Towards automatic concept-based explanations," in NeurIPS, 2019.
Morris, J. X., and Rush, A. M., "Contextual Document Embeddings," arXiv:2410.02525 [cs.CL], 2024. Available at: https://arxiv.org/abs/2410.02525v4