Language + Molecules

@ ACL 2024 Workshop

August 15, 2024 hybrid in Bangkok, Thailand & Remote

Synthesizing Language and Molecules for Scientific Insight and Discovery


Welcome to the Language + Molecules Workshop! Join us as we explore the integration of molecules and natural language, with exciting applications such as developing new drugs, materials, and chemical processes. These molecular solutions will be critical for addressing global problems at never-before-seen scales of complexity, in areas such as climate change and healthcare. However, they exist in extremely large search spaces, which makes AI tools a necessity. Excitingly, the chemistry field is poised to be substantially accelerated by multimodal models that combine language with molecules and drug structures.

Stay tuned by following us on Twitter @lang_plus_mols.

Call for Papers

A natural question to ask is why we want to integrate natural language with molecules. Combining these types of information has the potential to accelerate scientific discovery: imagine a future where a doctor can write a few sentences describing a patient’s symptoms and then receive the exact structure of the drugs necessary to treat that patient’s ailment (taking into account the patient’s genotype, phenotype, and medical history). Or, imagine a world where a researcher can specify the function they want a molecule to perform (e.g., antimalarial or photovoltaic) rather than its low-level properties (e.g., pyridine-containing). This high-level control of molecules requires a method of abstract description, and humans have already developed one for communication: language. Integrating language with scientific modalities has the following major advantages, as discussed in this recent survey, section 10.3.3:

  1. Generative Modeling: One of the largest problems in current LLMs, hallucination, becomes a strength for discovering molecules with high-level functions, abstract properties, and compositions of many properties.
  2. Bridging Modalities: Language can serve as a “bridge” between modalities (e.g., cellular pathways and drugs) when data is scarce.
  3. Domain Understanding: Grounding language models in external real-world knowledge can improve understanding of unseen molecules and advance many emerging tasks, such as experimental procedure planning and reasoning, which use LLMs as scientific agents.
  4. Automation: Instruction-following, dialogue-capable, and tool-equipped models can guide automated discovery in silico and in robotic labs.
  5. Democratization: Language enables scientists without computational expertise to leverage advances in scientific AI.

Research in scientific NLP, integrating molecules with natural language, and multimodal AI for science/medicine has attracted significant attention and grown rapidly in recent months. We believe now is the time to begin organizing this nascent community. To do so, we propose a new ACL workshop: “Language + Molecules”. Further, to broaden the community’s understanding of the associated challenges, methodologies, and goals, we will be holding an EACL tutorial. In the workshop’s first year, we will focus on the following research themes:

  • Going beyond language to incorporate molecular structure and interactions into LLMs.
  • Addressing data scarcity and inconsistency: new training methodologies and methods for extracting data from scientific literature.
  • Language-enabled solutions for discovering new drugs and molecules.
  • Incorporating domain knowledge from human-constructed databases into LLMs.
  • Instruction-following, dialogue-capable, and tool-equipped LLMs for molecules.
  • Sequence representations for molecular structures, including organic molecules, proteins, DNA, and inorganic crystals.

Submission Instructions

We plan to accept non-archival papers, hosted on our website, and to publish opt-in archival proceedings of relevant papers. We will also hold a shared task to benchmark the progress of generative text-molecule models. Shared task participants are encouraged to submit papers. All submissions should be in PDF format and made through the OpenReview submission portal. Submissions must be anonymized following ACL guidelines, but a preprint policy will not be enforced. Information on submitting shared task predictions can be found in the Shared Task section.

Authors are invited to submit papers of 4 (short) or 8 (long) pages, with unlimited pages for references and appendices. In line with the ACL main conference policy, camera-ready versions of papers will be given one additional page of content. Submissions should follow the ACL template style, which can be found here.

The research presented in these papers should be substantially original. Regardless of their length, all submissions will undergo a single-track review process. All submissions must be anonymous for double-blind review. No author information should be included in the papers, and self-references that identify the authors should be avoided or anonymized. We expect each paper to be reviewed by at least three reviewers. To encourage higher quality submissions, we will offer Best Paper Award(s) based on nomination by the reviewers and extensive discussions among the chairs. Accepted papers will be presented as posters by default, and outstanding submissions will also be selected for oral or spotlight presentations.

In accordance with the ACL workshop guidelines, we do not encourage the re-submission of already-published papers, but you are allowed to submit arXiv pre-prints or papers currently under submission elsewhere. Moreover, work that is presented at a *CL main conference should not appear in a workshop. Please be sure to indicate conflicts of interest for all authors on your paper.

Important Dates (Tentative)

All deadlines are 11:59 pm UTC-12h (“Anywhere on Earth”).

Dec 15 2023 Call for Workshop Papers
Mar 17 2024 EACL Tutorial on Language + Molecules
Apr 7 2024 Early Feedback Request Deadline for Underrepresented Groups
May 17 2024 Paper Submission Deadline
May 17 2024 Shared Task Submission Deadline
Jun 22 2024 Notification of Acceptance
Jul 5 2024 Camera-Ready Due
Jul 31 2024 Release of Proceedings
Aug 12-17 2024 Workshop Date

Shared Task

The workshop will feature a novel competitive yet collaborative shared task based on “translating” between molecules and natural language (Edwards et al., 2022). The task is designed to leverage the advantages of natural language for molecular design and understanding and will cover four high-impact areas of molecular science (medicine, energy, etc.).

Task Formulation

Molecule translation consists of two tasks: generating (1) a caption given a molecule and (2) a molecule given a description. We will release labeled training and validation splits of our proposed dataset (see below). For the final evaluation, we will release a list of molecules for captioning and a list of descriptions for molecule generation; these lists will cover disjoint molecules. Participants will upload a corresponding list of generated molecules or captions.

Evaluation

We will use the publicly available evaluation metrics described in Edwards et al. (2022). We will also extract specific properties from generated captions using simple rules to report property-specific abilities. Scores and a leaderboard will be reported for the full dataset and for each of the four high-impact chemistry domains.
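As a rough illustration of two of the string-level molecule generation metrics that appear on the leaderboard (Exact and Levenshtein), here is a minimal Python sketch. It is not the official evaluation code: the function names are hypothetical, and it compares raw SMILES strings as given, whereas the released evaluation scripts may normalize molecules (for example, by canonicalization) before comparing them.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # drop a character of a
                               current[j - 1] + 1,       # drop a character of b
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def score_generation(references, predictions):
    """Exact-match rate and mean edit distance over paired SMILES strings.

    Assumption: strings are compared as-is, with no chemical normalization.
    """
    if not references or len(references) != len(predictions):
        raise ValueError("need two non-empty lists of equal length")
    pairs = list(zip(references, predictions))
    exact = sum(ref == pred for ref, pred in pairs) / len(pairs)
    mean_distance = sum(levenshtein(ref, pred) for ref, pred in pairs) / len(pairs)
    return {"exact": exact, "levenshtein": mean_distance}


if __name__ == "__main__":
    refs = ["CCO", "c1ccccc1"]
    preds = ["CCO", "c1ccccn1"]
    print(score_generation(refs, preds))  # {'exact': 0.5, 'levenshtein': 0.5}
```

A lower Levenshtein score and a higher Exact score both indicate generated strings that are closer to the references.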

Dataset Design

The dataset will consist of textual prompts and associated molecular structures. The key reason for creating this standardized synthetic dataset is to thoroughly evaluate the abstraction, functional, and compositional capabilities of submissions; natural language descriptions from the literature are noisy, making it difficult to evaluate abstraction and compositionality. In consultation with domain experts on our advisory board, we will select 20 properties requiring high-level understanding, spread across four high-impact chemistry domains.

Further, we will design the dataset to focus on compositionality. For example, a molecule may have two properties which are desirable together (e.g., low toxicity and high blood-brain barrier permeability). We will hold out some property combinations from the training data so that they appear only in the test set. Although we will focus on this synthetic dataset, we will also report a leaderboard for the existing real-world ChEBI-20 dataset.
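One way to picture a compositional split is to reserve every molecule that carries a held-out pair of properties for the test set, so that the pair never co-occurs in training. The sketch below is only an illustration under that assumption; the record layout, the function name, the molecule identifiers, and the property labels are all hypothetical and are not taken from the released dataset.

```python
from itertools import combinations


def split_by_held_out_pairs(records, held_out_pairs):
    """Send every record containing a held-out property pair to the test set.

    records: list of dicts with a "properties" entry (a set of strings).
    held_out_pairs: iterable of 2-tuples of property names.
    """
    held_out = {frozenset(pair) for pair in held_out_pairs}
    train, test = [], []
    for record in records:
        # All unordered property pairs present in this record.
        pairs = {frozenset(p) for p in combinations(sorted(record["properties"]), 2)}
        if pairs & held_out:
            test.append(record)   # contains a combination unseen in training
        else:
            train.append(record)
    return train, test


if __name__ == "__main__":
    # Placeholder records; labels are illustrative, not real annotations.
    records = [
        {"molecule": "mol_a", "properties": {"low toxicity", "bbb permeable"}},
        {"molecule": "mol_b", "properties": {"low toxicity"}},
        {"molecule": "mol_c", "properties": {"bbb permeable", "photovoltaic"}},
    ]
    train, test = split_by_held_out_pairs(
        records, [("low toxicity", "bbb permeable")]
    )
    print([r["molecule"] for r in train])  # ['mol_b', 'mol_c']
    print([r["molecule"] for r in test])   # ['mol_a']
```

In this toy example each property is still seen individually during training, so a model is tested on whether it can combine properties it has only seen apart.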

Collaboration via Competition

While a competition will promote innovation and creativity, we recognize that our ultimate goal is scientific discovery. Thus, we propose a novel collaborative element by including molecules of particular interest with unknown properties as roughly 10% of the final evaluation data. We will use an ensemble approach to combine all competitor submissions, and we will invite all shared task participants to a joint publication based on these results. Thus, our workshop will produce tangible results usable by chemistry and medicinal researchers.
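The ensembling method is not specified here. Purely as an illustration, the sketch below assumes that each team's captions have already been reduced to a set of predicted properties per molecule and combines them by majority vote. The function name, the data layout, the team and molecule identifiers, and the voting threshold are all assumptions made for this example.

```python
from collections import Counter


def majority_vote_ensemble(team_predictions, min_fraction=0.5):
    """Keep a property for a molecule if more than min_fraction of teams predict it.

    team_predictions: dict mapping team name -> dict mapping molecule id -> set of
    predicted property names. A team that omits a molecule counts as a "no" vote.
    """
    n_teams = len(team_predictions)
    votes = {}
    for predictions in team_predictions.values():
        for molecule, properties in predictions.items():
            # set() guards against a team listing the same property twice.
            votes.setdefault(molecule, Counter()).update(set(properties))
    return {
        molecule: {prop for prop, count in counter.items()
                   if count / n_teams > min_fraction}
        for molecule, counter in votes.items()
    }


if __name__ == "__main__":
    # Placeholder submissions; labels are illustrative only.
    submissions = {
        "team_1": {"mol_x": {"low toxicity", "bbb permeable"}},
        "team_2": {"mol_x": {"low toxicity"}},
        "team_3": {"mol_x": {"low toxicity", "photovoltaic"}},
    }
    print(majority_vote_ensemble(submissions))  # {'mol_x': {'low toxicity'}}
```

Raising `min_fraction` trades coverage for agreement: fewer properties survive, but each one is backed by more independent submissions.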

Download

The dataset is now available at https://github.com/language-plus-molecules/LPM-24-Dataset. It, along with finetuned baseline models, can be found on HuggingFace. A manuscript detailing the dataset’s creation can be found here.

Baseline

Baseline MolT5 models (Edwards et al., 2022) and Meditron-7B have been released on HuggingFace.

Submission

Evaluation set results should be submitted here for the synthetic shared task and here for ChEBI-20.

These should follow the same formatting as the dataset training files.

We encourage authors of shared task submissions to also submit a 4-page short paper detailing their submission. Alternatively, a shared task submission may accompany a full-length paper submission.

Leaderboard

Molecular Captioning:
Team | BLEU-2 | BLEU-4 | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR | Text2Mol
team 1 | xx | xx | xx | xx | xx | xx | xx
team 2 | xx | xx | xx | xx | xx | xx | xx
team 3 | xx | xx | xx | xx | xx | xx | xx

Text-Based Molecule Generation:
Team | BLEU | Exact | Levenshtein | MACCS FTS | RDK FTS | Morgan FTS | FCD | Text2Mol | Validity
team 1 | xx | xx | xx | xx | xx | xx | xx | xx | xx
team 2 | xx | xx | xx | xx | xx | xx | xx | xx | xx
team 3 | xx | xx | xx | xx | xx | xx | xx | xx | xx

Speakers

Kyunghyun Cho

NYU & Genentech

Marinka Zitnik

Harvard

Schedule (Tentative)

Time Program
9:00-9:10 Opening remarks
9:10-10:40 Keynote speeches
10:40-11:30 Panel discussion
11:30-12:30 Poster session
12:30-13:30 Lunch break
13:30-15:00 Keynote speeches
15:00-15:50 Panel discussion
15:50-16:50 Oral paper session (12 min talk + 3 min Q&A)
16:50-17:20 Challenge track spotlight session (6 min talk)
17:20-17:30 Closing remarks

Accepted Papers

To appear.

Organization

Organizing Committee

Carl Edwards

University of Illinois Urbana-Champaign

Heng Ji

University of Illinois Urbana-Champaign

Qingyun Wang

University of Illinois Urbana-Champaign

Tom Hope

HUJI, AI2

Manling Li

Northwestern University

Lawrence Zhao

Yale University

Scientific Advisory Board

Ying Diao

UIUC

Teodoro Laino

IBM Research

Contact us

Please email language.molecules@gmail.com if you have any questions.

Acknowledgements

This workshop will be partially supported by the Molecule Maker Lab Institute: an AI research institute program supported by NSF under award No. 2019897 and No. 2034562. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.