Language + Molecules @ EACL 2024 Tutorial
Synthesizing Language and Molecules for Scientific Insight and Discovery
Welcome to the Language + Molecules Tutorial!
Join us for this introductory tutorial, targeted at NLP researchers with no knowledge of chemistry, as we explore cutting-edge work on the integration of molecules and natural language, with exciting applications such as developing new drugs, materials, and chemical processes.
Climate change, access to food and water, pandemics: these words, when uttered, immediately summon to mind global challenges with potentially disastrous outcomes. Molecular solutions will be critical to addressing these global problems on scales of complexity never before seen. However, these solutions exist in extremely large search spaces, which makes AI tools a necessity. Excitingly, the chemistry field is poised to be substantially accelerated via multimodal models combining language with molecules and drug structures.
Stay tuned by following us on Twitter @lang_plus_mols.
Our tutorial material is being uploaded to GitHub.
The slides from the tutorial can be found on GitHub as well!
Tutorial Outline [180 min.]
Applying language models to the scientific domain is becoming increasingly popular due to their potential to accelerate scientific discovery (Hope et al., 2022). Beyond extracting information from the scientific literature, NLP can offer greater control over the scientific discovery process through multimodal representations and generative language models.
1. Background [60 min.]
Scientific Information Extraction [15 min.] To start, we will provide a high-level overview of traditional NLP tasks used for scientific discovery (e.g., named entity recognition, entity linking, and relation extraction), as well as recent domain-specific LLMs designed for superior performance on scientific tasks (Beltagy et al., 2019).
What is a molecule? [15 min.]
Half of the title is molecules, but what is one? We will start from scratch and discuss what a molecule actually is, including the basic constituents of molecules, atoms and bonds, and how they essentially form graph structures. Then, we will focus on molecular string languages, which are a key building block for chemical language models. We will discuss the tradeoffs between these languages (Grisoni, 2023; Weininger, 1988; O’Boyle and Dalke, 2018; Krenn et al., 2020; Cheng et al., 2023). Krenn et al. (2020) propose a formal grammar approach, which may be of particular interest to the ACL community.
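To make the graph view of molecules and the string languages concrete, here is a minimal sketch (not part of the official tutorial materials) that parses a SMILES string into its atoms and bonds and converts it to SELFIES. It assumes the open-source RDKit and selfies packages are installed; the aspirin SMILES is just an illustrative example.

```python
# Minimal sketch: parse a molecular string into a graph and convert between
# string languages. Assumes `pip install rdkit selfies`.
from rdkit import Chem
import selfies as sf

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, written in SMILES
mol = Chem.MolFromSmiles(smiles)  # returns None if the SMILES is invalid

# Atoms are the graph's nodes...
print([atom.GetSymbol() for atom in mol.GetAtoms()])

# ...and bonds are its edges.
for bond in mol.GetBonds():
    print(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondType())

# SELFIES (Krenn et al., 2020) is an alternative string language in which
# every string decodes to a valid molecule.
selfies_str = sf.encoder(smiles)
print(selfies_str)
print(sf.decoder(selfies_str))  # round-trips back to a SMILES string
```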
Molecule Design using Language Models [15 min.]
Now that we know what a molecule is, we will give an overview of recent work applying NLP techniques to these molecular languages with impressive results. LLMs are trained with pretraining techniques adapted from (natural) language models to learn molecule representations from large collections of such molecule strings (Frey et al., 2022; Chithrananda et al., 2020; Ahmad et al., 2022; Fabian et al., 2020; Schwaller et al., 2021; NVIDIA Corporation, 2022; Flam-Shepherd and Aspuru-Guzik, 2023; Tysinger et al., 2023). Applications include molecule and material generation, property prediction, and protein binding site prediction.
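As a quick illustration of what such a chemical language model does, the hedged sketch below runs masked-token prediction over a SMILES string with a ChemBERTa-style checkpoint. The checkpoint name is an assumption (a publicly released ChemBERTa model on the Hugging Face Hub); any other chemical LM could be swapped in.

```python
# Hedged sketch: masked-token prediction over SMILES with a chemical LM.
# Assumes `pip install transformers` and that the checkpoint name below is
# available on the Hugging Face Hub.
from transformers import pipeline

fill = pipeline("fill-mask", model="seyonec/ChemBERTa-zinc-base-v1")

# Mask one token of a SMILES string and ask the model to complete it,
# exactly as one would with a masked natural-language model.
masked_smiles = "CC(=O)Oc1ccccc1C(=O)" + fill.tokenizer.mask_token
for prediction in fill(masked_smiles, top_k=3):
    print(prediction["token_str"], round(prediction["score"], 3))
```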
Drug Discovery – A Brief Primer [15 min.]
Ok, so NLP is being used for molecules now. What can we do with it? Here, we present a brief overview of drug discovery, an important but challenging problem. Historically, molecular discovery has commonly been done by humans who design and build individual molecules, but this can cost over a billion dollars and take over ten years (Gaudelet et al., 2021). We’ll discuss a little of the process here, including non-NLP deep learning methods, so that we know where NLP can improve it.
2. Integrating Language with Molecules [95 min.]
What does natural language have to offer? [15 min.]
At least at first, integrating language and molecules seems like an odd idea. Here, we’ll start an interactive discussion with the audience about what the potential benefits might be. We’ll make sure to mention the following major advantages, as discussed in the recent survey (Zhang et al., 2023, Section 10.3.3 “Natural Language-Guided Scientific Discovery”):
- Generative Modeling: One of the largest problems in current LLMs—hallucination—becomes a strength for discovering molecules with high-level functions and abstract properties. In particular, language is compositional by nature (Szabó, 2020; Partee et al., 1984; Han et al., 2023), and therefore holds promise for composing these high-level properties (e.g., antimalarial) (Liu et al., 2022).
- Bridging Modalities: Language can serve to “bridge” between modalities for scarce data.
- Domain Understanding: Grounding language models in external real-world knowledge can improve understanding of unseen molecules and advance many emerging tasks that use LLMs as scientific agents, such as experimental procedure planning.
- Democratization: Language enables scientists without computational expertise to leverage advances in scientific AI.
Do I want multimodality? [5 min.]
An important, yet often overlooked, question in multimodal NLP is: do I actually need multimodality? For example, if one wants to extract reactions from the literature, a text-to-text model (Vaucher et al., 2020) might be sufficient. However, editing a drug with high-level instructions requires language (Liu et al., 2023a; Fang et al., 2023). Here, we will dive into this question and work through example scenarios with the audience to see how to answer it.
Integrating Modalities [30 min.]
Ok, we’ve decided we want or need multimodality. Next, we need to discuss how people are currently tackling this—we’ll start with two primary methods, bi-encoder models and joint representation models.
Bi-Encoder Models (and beyond)
Bi-encoder models consist of an encoder branch for text and a branch for molecules. They have the advantage of not requiring direct, early integration of the two modalities, allowing existing single-modal models to be reused. Representative examples we will discuss include Text2Mol (Edwards et al., 2021), CLAMP (Seidl et al., 2023), and BioTranslator (Xu et al., 2023). Generally, bi-encoder models are effective for cross-modal retrieval (Edwards et al., 2021; Su et al., 2022; Liu et al., 2022; Zhao et al., 2023b), but they may also be integrated into molecule (Su et al., 2022; Liu et al., 2022) and protein (Liu et al., 2023b) generation frameworks. We’ll talk about all of these tasks and applications, and return to some important motivations (e.g., bridging modalities).
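To make the bi-encoder idea concrete, here is a minimal PyTorch sketch in the spirit of Text2Mol-style models: one encoder per modality and a contrastive objective that pulls paired (description, molecule) embeddings together. The stand-in MLP encoders over random features are assumptions made for brevity; real systems use a text transformer and a molecule (string or graph) encoder.

```python
# Minimal sketch of a contrastive bi-encoder for text-molecule retrieval.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiEncoder(nn.Module):
    def __init__(self, text_dim, mol_dim, shared_dim=256):
        super().__init__()
        # Each branch projects its modality into a shared embedding space.
        self.text_proj = nn.Sequential(nn.Linear(text_dim, shared_dim), nn.ReLU(),
                                       nn.Linear(shared_dim, shared_dim))
        self.mol_proj = nn.Sequential(nn.Linear(mol_dim, shared_dim), nn.ReLU(),
                                      nn.Linear(shared_dim, shared_dim))

    def forward(self, text_feats, mol_feats):
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        m = F.normalize(self.mol_proj(mol_feats), dim=-1)
        return t, m

def contrastive_loss(t, m, temperature=0.07):
    # Rows are descriptions, columns are molecules; the diagonal holds the
    # true pairs, and each row (and column) is classified against its match.
    logits = t @ m.T / temperature
    labels = torch.arange(len(t))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

# Toy batch of 8 paired (description, molecule) feature vectors.
model = BiEncoder(text_dim=768, mol_dim=300)
t, m = model(torch.randn(8, 768), torch.randn(8, 300))
print(contrastive_loss(t, m))
```

At retrieval time, the same similarity matrix ranks candidate molecules for a text query (or vice versa).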
Joint Molecule-Language Models
Joint models, on the other hand, seek to model interactions between multiple modalities inside the same network, allowing fine-grained interaction. We will discuss encoder-only models (Zeng et al., 2022), encoder-decoder models (Christofidellis et al., 2023), and decoder-only models (Liu et al., 2023c).
Model Differences
We will answer important questions such as: Which model should I use? What tasks can each do? Tasks include retrieval (Edwards et al., 2021), “translation” between molecules and language (Edwards et al., 2022a), editing molecules (Liu et al., 2022), and chemical reaction planning (Vaucher et al., 2020, 2021).
An Interactive Example - Microtubule Stabilization [20 min.]
At this point, a lot of ideas have been thrown around. We’ll consolidate them by exploring an interactive example of language-enabled drug design using Google Colab. Our hands-on example will consist of three components:
- Language-enabled Drug Design: Participants will explore inputs to multiple language→molecule models to generate molecules for different purposes (see the sketch after this list). Following this, they will use natural language to design their own candidate drugs.
- Language-Guided Assay Testing: Here, participants will test their proposed drugs for certain properties. We will follow Seidl et al. (2023), where natural language descriptions are used to test unknown properties.
- Interaction Prediction: Finally, we will test if proposed drugs interact with the target. Here, we will compare a language model-based approach (e.g., Blanchard et al., 2022; Kalakoti et al., 2022) with a traditional docking approach (Trott and Olson, 2010).
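For the drug design step, here is a hedged sketch of generating a molecule from a natural-language description with a text-to-molecule model in the spirit of MolT5 (Edwards et al., 2022a). The checkpoint name is an assumption about a publicly released model on the Hugging Face Hub, and the exact models used in the Colab notebook may differ.

```python
# Hedged sketch: text-to-molecule generation with an encoder-decoder model.
# Assumes `pip install transformers sentencepiece` and that the checkpoint
# name below is available on the Hugging Face Hub.
from transformers import T5Tokenizer, T5ForConditionalGeneration

name = "laituan245/molt5-small-caption2smiles"  # assumed checkpoint name
tokenizer = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)

description = ("The molecule is a nonsteroidal anti-inflammatory drug "
               "that inhibits cyclooxygenase enzymes.")
inputs = tokenizer(description, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=5, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # a SMILES string
```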
Applications [25 min.]
Here, we will discuss important application areas, with an eye toward improving cross-discipline communication, including drug discovery (Ferguson and Gray, 2018), organic photovoltaics (Kippelen et al., 2009), and catalyst discovery for renewable energy (Zitnick et al., 2020).
3. Recent Trends and Conclusion [25 min.]
Instruction-Following Molecular Design [10 min.]
In the last year, instruction-following language models (Wei et al., 2021) have surged in popularity. Following this trend, training methodologies and datasets have recently emerged to allow language models to follow instructions related to molecule properties (Liang et al., 2023; Fang et al., 2023; Zeng et al., 2023; Zhao et al., 2023a). We will give a brief overview of this new line of work.
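To give a feel for what instruction-following molecular design looks like, here is a purely hypothetical example of a training instance; the actual datasets cited above (e.g., Fang et al., 2023; Zeng et al., 2023) define their own schemas, property vocabularies, and molecule formats.

```python
# Hypothetical instruction-following instance for molecular design; the
# molecules and the edit are illustrative only, not drawn from any dataset.
example = {
    "instruction": "Modify the given molecule to increase its water solubility "
                   "while keeping the core scaffold intact.",
    "input": "CC(=O)Oc1ccccc1C(=O)O",      # starting molecule as a SMILES string
    "output": "CC(=O)Oc1ccc(O)cc1C(=O)O",  # an illustrative edited molecule
}
print(example["instruction"])
```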
LLMs as Scientific Agents [5 min.]
Further, we’ll focus on recent work which looks to control experiments with language models (Boiko et al., 2023) and to create tools for enabling domain-specific capabilities in general language models (Bran et al., 2023; Liu et al., 2023a).
Conclusion [10 min.]
We will discuss the key difficulties in the molecule-language domain that need to be addressed by the research community to allow progress similar to that in the vision-language domain. These include:
- data scarcity due to domain expertise requirements
- addressing inconsistency when training on scientific literature
- improved methods for integrating geometric structures into LLMs
- developing better evaluation metrics for chemical predictions without real-world experiments.
Reading List
- Molecule Representations and Language Models:
- “SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules.” Weininger, 1988.
- “Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation.” Krenn et al., 2020.
- “Group SELFIES: A robust fragment-based molecular string representation.” Cheng et al., 2023.
- “ChemBERTa: Large-scale self-supervised pretraining for molecular property prediction.” Chithrananda et al., 2020.
- “ChemBERTa-2: Towards chemical foundation models.” Ahmad et al., 2022.
- “Can we quickly learn to “translate” bioactive molecules with transformer models?” Tysinger et al., 2023.
- Molecule-Language Modeling:
- “Text2Mol: Cross-modal molecule retrieval with natural language queries.” Edwards et al., 2021.
- “A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals.” Zeng et al., 2022.
- “Translation between molecules and natural language.” Edwards et al., 2022.
- “A molecular multimodal foundation model associating molecule graphs with natural language.” Su et al., 2022.
- “Multi-modal molecule structure-text model for text-based retrieval and editing.” Liu et al., 2022.
- “Adversarial modality alignment network for cross-modal molecule retrieval.” Zhao et al., 2023.
- “GIMLET: A unified graph-text model for instruction-based molecule zero-shot learning.” Zhao et al., 2023.
- “MolXPT: Wrapping molecules with text for generative pre-training.” Liu et al., 2023.
- “Multilingual translation for zero-shot biomedical classification using BioTranslator.” Xu et al., 2023.
- “ChatGPT-powered conversational drug editing using retrieval and domain feedback.” Liu et al., 2023.
- Applications:
- “Drug discovery chemistry: a primer for the non-specialist.” Jordan et al., 2009.
- “Organic photovoltaics.” Kippelen et al., 2009.
- “Kinase inhibitors: the road ahead.” Ferguson et al., 2018.
- LLMs as Scientific Agents:
- “Emergent autonomous scientific research capabilities of large language models.” Boiko et al., 2023.
- “ChemCrow: Augmenting large-language models with chemistry tools.” Bran et al., 2023.
- “Do large language models understand chemistry? A conversation with ChatGPT.” Castro Nascimento et al., 2023.
- “Assessment of chemistry knowledge in large language models that generate code.” White et al., 2023.
- Survey:
- “Artificial intelligence for science in quantum, atomistic, and continuum systems.” Zhang et al., 2023. Section 10.3.3 “Natural Language-Guided Scientific Discovery”.
Reading these beforehand is not required; the tutorial is designed to be introductory.