Language + Molecules

@ ACL 2024 Workshop

August 15, 2024 hybrid in Bangkok, Thailand & Remote

Synthesizing Language and Molecules for Scientific Insight and Discovery


Welcome to the Language + Molecules Workshop! Join us as we explore the integration of molecules and natural language, with exciting applications such as developing new drugs, materials, and chemical processes. These molecular solutions will be critical to address global problems on never-before-seen scales of complexity, in areas such as climate change and healthcare. However, they exist in extremely large search spaces, which makes AI tools a necessity. Excitingly, the chemistry field is poised to be substantially accelerated via multimodal models combining language with molecules and drug structures.

Stay tuned by following us on Twitter @lang_plus_mols.

News

The deadline is extended to May 24th for abstract registration and May 31st for full paper submissions. We believe this will remove conflicts with other deadlines and allow more time to improve submissions for the shared task.

Call for Papers

A natural question to ask is why we want to integrate natural language with molecules. Combining these types of information has the potential to accelerate scientific discovery: imagine a future where a doctor can write a few sentences describing a patient’s symptoms and then receive the exact structure of the drugs necessary to treat that patient’s ailment (taking into account the patient’s genotype, phenotype, and medical history). Or, imagine a world where a researcher can specify the function they want a molecule to perform (e.g., antimalarial or a photovoltaic) rather than its low-level properties (e.g., pyridine-containing). This high-level control of molecules requires a method of abstract description, and humans have already developed one for communication: language. Integrating language with scientific modalities has the following major advantages, as discussed in this recent survey, section 10.3.3:

  1. Generative Modeling: One of the largest problems in current LLMs, hallucination, becomes a strength for discovering molecules with high-level functions, abstract properties, and compositions of many properties.
  2. Bridging Modalities: Language can serve as a “bridge” between modalities (e.g., cellular pathways and drugs) when data is scarce.
  3. Domain Understanding: Grounding language models in external real-world knowledge can improve understanding of unseen molecules and advance many emerging tasks, such as experimental procedure planning and reasoning, which use LLMs as scientific agents.
  4. Automation: Instruction-following, dialogue-capable, and tool-equipped models can guide automated discovery in silico and in robotic labs.
  5. Democratization: Language enables scientists without computational expertise to leverage advances in scientific AI.

Research in scientific NLP, integrating molecules with natural language, and multimodal AI for science/medicine has experienced significant attention and growth in recent months. We believe now is the time to begin organizing this nascent community. To do so, we propose a new ACL workshop: “Language + Molecules”. Further, to broaden the community’s understanding of the associated challenges, methodologies, and goals, we will be holding an EACL tutorial. In the workshop’s first year, we will focus on the following research themes:

  • Going beyond language to incorporate molecular structure and interactions into LLMs.
  • Addressing data scarcity and inconsistency: new training methodologies and methods for extracting data from scientific literature.
  • Language-enabled solutions for discovering new drugs and molecules.
  • Incorporating domain knowledge from human-constructed databases into LLMs.
  • Instruction-following, dialogue-capable, and tool-equipped LLMs for molecules.
  • Sequence representations for molecular structures, including organic molecules, proteins, DNA, and inorganic crystals.

Submission Instructions

We plan to accept non-archival papers, hosted on our website, alongside opt-in archival proceedings of relevant papers. We will also run a shared task to benchmark the progress of generative text-molecule models. Shared task participants are encouraged to submit papers. All submissions should be in PDF format and made through the OpenReview submission portal. Submissions must be anonymized following ACL guidelines, but a preprint policy will not be enforced. Information on submitting shared task predictions can be found at the shared task.

Authors are invited to submit papers of 4 (short) or 8 (long) pages, with unlimited pages for references and appendices. In line with the ACL main conference policy, camera-ready versions of papers will be given one additional page of content. Papers should follow the ACL template style, which can be found here.

The research presented in these papers should be substantially original. Regardless of their length, all submissions will undergo a single-track review process. All submissions must be anonymous for double-blind review. No author information should be included in the papers, and self-references that identify the authors should be avoided or anonymized. We expect each paper to be reviewed by at least three reviewers. To encourage higher quality submissions, we will offer Best Paper Award(s) based on nomination by the reviewers and extensive discussions among the chairs. Accepted papers will be presented as posters by default, and outstanding submissions will also be selected for oral or spotlight presentations.

According to the ACL workshop guidelines, we do not encourage the re-submission of already-published papers, but you are allowed to submit arXiv pre-prints or papers currently under submission. Moreover, work that is presented at a *CL main conference should not appear in a workshop. Please be sure to indicate conflicts of interest for all authors on your paper.

Important Dates (Tentative)

All deadlines are 11:59 pm UTC-12h (“Anywhere on Earth”).

Dec 15 2023 Call for Workshop Papers
Mar 17 2024 EACL Tutorial on Language + Molecules
Apr 7 2024 Early Feedback Request Deadline for Underrepresented Groups
May 24 2024 Abstract Submission Deadline
May 31 2024 Paper Submission Deadline
May 31 2024 Shared Task Submission Deadline
Jul 5 2024 Notification of Acceptance
Jul 14 2024 Camera-Ready Due
Aug 2 2024 Release of Proceedings
Aug 15 2024 Workshop Date

Shared Task

The workshop will feature a novel competitive yet collaborative shared task based on “translating” between molecules and natural language (Edwards et al., 2022). The task is designed to leverage the advantages of natural language for molecular design and understanding and will cover four high-impact areas of molecular science (medicine, energy, etc.).

Task Formulation

Molecule translation consists of two tasks: generating (1) a caption given a molecule and (2) a molecule given a description. We will release labeled training and validation splits of our proposed dataset (see below). For the final evaluation, we will release a list of molecules for captioning and a list of descriptions for molecule generation; these lists will cover disjoint molecules. Participants will upload a corresponding list of generated molecules or captions.

Evaluation

We will use the publicly available evaluation metrics described in Edwards et al. (2022). We will also extract specific properties from generated captions using simple rules to report property-specific abilities. Scores and a leaderboard will be reported for the full dataset and each of the four high-impact chemistry domains.
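As a rough illustration of rule-based property scoring, the sketch below checks whether a property is mentioned in a generated caption and in its reference caption, then computes an F-1 score for that property. It is not the official evaluation code (that lives in the GitHub repository); the keyword rules, the `RULES` table, and the function names are all hypothetical and chosen only for this example.

```python
from typing import Dict, List, Set

# Hypothetical keyword rules: property name -> phrases that signal it in a caption.
# The real shared-task rules are defined in the official evaluation code.
RULES: Dict[str, List[str]] = {
    "toxin": ["toxic", "toxin"],
    "anti-cancer": ["anti-cancer", "anticancer"],
}


def extract_properties(caption: str) -> Set[str]:
    """Return every property whose keyword appears in the caption (case-insensitive)."""
    text = caption.lower()
    return {prop for prop, phrases in RULES.items() if any(p in text for p in phrases)}


def property_f1(prop: str, generated: List[str], references: List[str]) -> float:
    """F-1 for one property, comparing generated captions with reference captions."""
    tp = fp = fn = 0
    for gen, ref in zip(generated, references):
        in_gen = prop in extract_properties(gen)
        in_ref = prop in extract_properties(ref)
        tp += int(in_gen and in_ref)
        fp += int(in_gen and not in_ref)
        fn += int(in_ref and not in_gen)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


generated = ["The molecule is a toxin.", "It is an anticancer agent.", "It is toxic."]
references = ["The molecule is a toxin.", "It is a toxin.", "It is a solvent."]
print(property_f1("toxin", generated, references))
```

Plain substring rules like these are easy to audit but brittle (for example, "nontoxic" would match "toxic"), which is one reason to treat such scores as a coarse signal.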

Updates

We have updated the code for “text_property_metrics.py” to produce more intuitive results for missing properties in the validation set. We will update Table 3 in the dataset manuscript soon to address this change. Only the Overall and Biomedical columns of Table 3 change. See GitHub for more details.

Dataset Design

The dataset will consist of textual prompts and associated molecular structures. The key reason for creating this standardized synthetic dataset is to thoroughly evaluate the abstraction, functional, and compositional capabilities of submissions; natural language descriptions from the literature are noisy, making it difficult to evaluate abstraction and compositionality. In consultation with domain experts on our advisory board, we will select 20 properties requiring high-level understanding, spread across four high-impact chemistry domains.

Further, we will design the dataset to focus on compositionality. For example, a molecule may have two properties which are desirable together (e.g., low toxicity and high blood-brain barrier permeability). We will withhold some property combinations and reserve them for the test set. Although we will focus on this synthetic dataset, we will also report a leaderboard for the existing real-world ChEBI-20 dataset.

Collaboration via Competition

While a competition will promote innovation and creativity, we recognize that our ultimate goal is scientific discovery. Thus, we propose a novel collaborative element by including molecules of particular interest with unknown properties as roughly 10% of the final evaluation data. We will use an ensemble approach to combine all competitor submissions, and we will invite all shared task participants to a joint publication based on these results. Thus, our workshop will produce tangible results usable by chemistry and medicinal researchers.

Download

The dataset is now available at https://github.com/language-plus-molecules/LPM-24-Dataset. It, along with finetuned baseline models, can be found on HuggingFace. A manuscript detailing the dataset’s creation can be found here.

Baseline

Baseline MolT5 models (Edwards et al., 2022) and Meditron-7B have been released on HuggingFace.

Submission

Evaluation set results should be submitted on Codabench: here for molecule captioning and here for molecule generation.

Not all metrics are reported on this leaderboard due to computational limits associated with Codabench. We will take the 3 latest submissions posted to the leaderboard from each team and calculate all metrics for those after the submission window ends. If a team’s submissions are not posted to the leaderboard, we will use their best submission.

These submissions should be a zipped “submit.txt” file with results from the caption and molecule generation eval splits. They should be in the same order as these splits and contain the same number of data points. Example submission files for MolT5-Small are available on GitHub. Please contact language.molecules@gmail.com if you have any difficulties uploading.
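A minimal sketch of packaging predictions is shown below. It assumes one prediction per line with no header, which is an assumption rather than a documented format; the MolT5-Small example files on GitHub are the authoritative reference for the layout. The function name `write_submission` and the archive name `submission.zip` are hypothetical.

```python
import zipfile
from pathlib import Path
from typing import List


def write_submission(predictions: List[str], expected_count: int, out_dir: str = ".") -> Path:
    """Write predictions to submit.txt (one per line, in order) and zip it."""
    # The submission must contain the same number of data points as the eval split.
    if len(predictions) != expected_count:
        raise ValueError(
            f"expected {expected_count} predictions, got {len(predictions)}"
        )
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # A stray newline inside a prediction would shift every later row,
    # so collapse all internal whitespace to single spaces.
    lines = [" ".join(p.split()) for p in predictions]
    txt_path = out / "submit.txt"
    txt_path.write_text("\n".join(lines) + "\n", encoding="utf-8")

    # Store the file at the archive root under the name submit.txt.
    zip_path = out / "submission.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(txt_path, arcname="submit.txt")
    return zip_path
```

Passing the size of the eval split as `expected_count` catches missing or extra rows before upload, since a misaligned file would otherwise be scored against the wrong data points.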

We encourage authors of shared task submissions to also submit a 4-page short paper detailing their submission. Task submissions may instead accompany a full-length (8-page) paper submission.

Leaderboard

Molecular Captioning:

The leaderboard ranking is based on the “Overall Increase” metric. Metrics are spread across two tables: 1) translation metric scores (BLEU, ROUGE, METEOR), and 2) property metric scores (whether a caption contains the ground truth property). For details on evaluation, please see the GitHub repo.

“Overall Increase” is the average absolute percent improvement over MolT5-Small (across all metrics from both tables). Additionally, we report a “Translation Metric Increase”, which is the average improvement over all translation metric scores, and a “Property Metric Increase”, which is the average improvement over all captioning property scores. All property scores are F-1 scores.
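To make the definition concrete, here is a small sketch that reads “absolute percent improvement” as the difference in percentage points between a model's score and the MolT5-Small score, averaged over metrics. That reading is an assumption, but it reproduces the “Translation Metric Increase” values reported for MolT5-Base and MolT5-Large in the Translation Metrics table. The function name `mean_increase` is hypothetical.

```python
from typing import Dict


def mean_increase(scores: Dict[str, float], baseline: Dict[str, float]) -> float:
    """Average of (score - baseline), in percentage points, over the baseline's metrics."""
    diffs = [scores[metric] - baseline[metric] for metric in baseline]
    return sum(diffs) / len(diffs)


# Translation metric scores copied from the leaderboard rows.
molt5_small = {"BLEU-2": 66.82, "BLEU-4": 48.29, "ROUGE-1": 72.8,
               "ROUGE-2": 54.44, "ROUGE-L": 53.33, "METEOR": 68.14}
molt5_base = {"BLEU-2": 69.83, "BLEU-4": 50.56, "ROUGE-1": 73.34,
              "ROUGE-2": 54.55, "ROUGE-L": 52.86, "METEOR": 69.86}
molt5_large = {"BLEU-2": 73.63, "BLEU-4": 53.2, "ROUGE-1": 75.79,
               "ROUGE-2": 56.47, "ROUGE-L": 54.42, "METEOR": 72.16}

print(round(mean_increase(molt5_base, molt5_small), 2))   # 1.2
print(round(mean_increase(molt5_large, molt5_small), 2))  # 3.64
```

Under the same reading, “Overall Increase” would average these differences over the translation and property metrics together.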

Translation Metrics:

Team | Model | Overall Increase | Translation Metric Increase | BLEU-2 | BLEU-4 | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR
avaliev | RAG_SIM_098 | 27.08 | 6.37 | 73.81 | 53.04 | 80.06 | 60.17 | 57.5 | 77.45
qizhipei | BioT5+_large_voting | 14.66 | 6.45 | 75.58 | 54.77 | 79.41 | 59.89 | 57.46 | 75.43
protonunfold | SciMind | 12.39 | 5.77 | 75.66 | 54.98 | 78.24 | 58.42 | 56.34 | 74.76
NLPeople | Ensembled | 12.3 | 5.68 | 75.54 | 54.83 | 78.1 | 58.47 | 56.37 | 74.57
hecao | bioagent | 10.95 | 5.57 | 74.11 | 53.84 | 78.73 | 59.38 | 57.06 | 74.08
xwk89 | mistral_4b9_e1 | 10.8 | 5.56 | 74.38 | 54.08 | 78.49 | 59.49 | 56.82 | 73.91
mengmeng | Mistral | 10.37 | 5.48 | 75.04 | 54.56 | 78.1 | 58.81 | 56.42 | 73.73
langmolecules | Meditron | 10.34 | 5.47 | 75.16 | 54.72 | 77.97 | 58.75 | 56.33 | 73.69
NLPeople | Rank_model_1 | 9.94 | 4.8 | 74.73 | 54.1 | 77.3 | 57.7 | 55.53 | 73.26
dimitris | ALMol~10%DataTrained | 9.61 | 4.27 | 74.71 | 54.34 | 76.43 | 56.54 | 54.66 | 72.76
xygui | MDEG | 9.43 | 2.96 | 73.98 | 53.33 | 75.08 | 54.39 | 52.58 | 72.21
danielshao | SMol+LPM | 7.83 | 2.7 | 72.2 | 52.02 | 74.76 | 56.1 | 53.34 | 71.57
xwk89 | mistral_e1 | 7.45 | 3.46 | 73.18 | 53.24 | 75.29 | 56.74 | 54.39 | 71.75
duongttr | Mol2Lang-VLM | 4.52 | 4.11 | 73.43 | 53.19 | 76.72 | 57.67 | 55.43 | 72.05
langmolecules | MolT5-Large | 4.22 | 3.64 | 73.63 | 53.2 | 75.79 | 56.47 | 54.42 | 72.16
bluesky333 | phi3-knowchem-sft-beam1 | 2.87 | 0.81 | 70.56 | 50.83 | 72.61 | 53.61 | 52.01 | 69.01
langmolecules | MolT5-Base | 1.06 | 1.2 | 69.83 | 50.56 | 73.34 | 54.55 | 52.86 | 69.86
langmolecules | MolT5-Small | 0.0 | 0.0 | 66.82 | 48.29 | 72.8 | 54.44 | 53.33 | 68.14
guiyike | yike | -16.64 | -43.45 | 12.8 | 6.37 | 25.9 | 13.83 | 24.1 | 20.11

Property Metrics:

Team | Model | Overall Increase | Property Metric Increase | Overall Property F-1 | Biomedical | Human Interaction | Agr. + Industry | Light + Electro | X-icides | Toxins | Light | Electricity | Inhibitors | anti-X | Modulators | Antagonists | Treatments | Agonists | Cancer | Disease | Held-out Combos
avaliev | RAG_SIM_098 | 27.08 | 33.99 | 26.99 | 27.9 | 4.18 | 3.55 | 72.32 | 0.0 | 6.74 | 71.92 | 72.71 | 44.52 | 11.07 | 68.09 | 55.7 | 65.49 | 41.53 | 52.07 | 66.15 | 69.09
qizhipei | BioT5+_large_voting | 14.66 | 17.39 | 13.76 | 19.76 | 4.01 | 3.07 | 28.2 | 0.0 | 6.25 | 31.28 | 25.12 | 20.55 | 3.64 | 37.02 | 30.77 | 23.89 | 16.5 | 59.47 | 67.98 | 70.05
protonunfold | SciMind | 12.39 | 14.6 | 11.51 | 18.17 | 3.94 | 2.93 | 21.0 | 0.04 | 6.2 | 23.95 | 18.05 | 18.06 | 2.55 | 30.14 | 25.36 | 19.42 | 14.81 | 57.54 | 67.43 | 69.92
NLPeople | Ensembled | 12.3 | 14.5 | 11.63 | 17.86 | 4.02 | 2.97 | 21.67 | 0.04 | 6.33 | 24.74 | 18.59 | 17.02 | 2.77 | 30.96 | 25.89 | 18.61 | 15.1 | 53.69 | 67.44 | 69.99
hecao | bioagent | 10.95 | 12.74 | 9.94 | 16.86 | 3.9 | 2.76 | 16.24 | 0.0 | 6.43 | 18.12 | 14.37 | 18.84 | 2.48 | 26.78 | 23.41 | 16.33 | 14.02 | 50.28 | 67.22 | 69.64
xwk89 | mistral_4b9_e1 | 10.8 | 12.54 | 9.93 | 16.98 | 3.85 | 2.81 | 16.1 | 0.0 | 6.26 | 17.15 | 15.05 | 15.2 | 2.12 | 31.67 | 19.11 | 14.34 | 14.59 | 51.17 | 67.69 | 70.02
mengmeng | Mistral | 10.37 | 12.0 | 9.73 | 16.72 | 3.81 | 2.83 | 15.57 | 0.04 | 6.03 | 18.49 | 12.64 | 14.74 | 2.12 | 24.43 | 21.21 | 14.34 | 10.55 | 53.61 | 67.45 | 69.99
langmolecules | Meditron | 10.34 | 11.96 | 9.7 | 16.87 | 3.75 | 2.83 | 15.36 | 0.04 | 6.07 | 18.19 | 12.53 | 14.71 | 2.17 | 24.72 | 20.61 | 14.35 | 10.54 | 53.74 | 67.4 | 69.99
NLPeople | Rank_model_1 | 9.94 | 11.66 | 9.88 | 16.5 | 3.85 | 2.87 | 16.28 | 0.08 | 5.97 | 19.29 | 13.28 | 13.91 | 2.14 | 23.71 | 17.88 | 13.87 | 12.27 | 50.82 | 66.01 | 69.45
dimitris | ALMol~10%DataTrained | 9.61 | 11.39 | 10.05 | 15.75 | 3.74 | 2.78 | 17.94 | 0.21 | 6.0 | 17.11 | 18.77 | 12.39 | 1.83 | 17.34 | 18.53 | 11.14 | 11.16 | 51.66 | 67.09 | 69.76
xygui | MDEG | 9.43 | 11.59 | 8.74 | 16.08 | 3.57 | 2.72 | 12.6 | 0.0 | 5.86 | 12.68 | 12.52 | 13.2 | 1.87 | 29.96 | 19.61 | 12.47 | 14.08 | 54.55 | 67.1 | 69.32
danielshao | SMol+LPM | 7.83 | 9.54 | 8.55 | 14.97 | 2.39 | 2.62 | 14.2 | 0.0 | 3.19 | 16.38 | 12.01 | 10.9 | 1.61 | 22.13 | 17.43 | 9.66 | 9.93 | 40.41 | 64.84 | 68.74
xwk89 | mistral_e1 | 7.45 | 8.78 | 8.23 | 14.44 | 2.69 | 2.74 | 13.05 | 0.0 | 4.33 | 13.43 | 12.67 | 14.23 | 2.0 | 26.09 | 18.6 | 12.99 | 11.75 | 20.44 | 59.5 | 69.19
duongttr | Mol2Lang-VLM | 4.52 | 4.66 | 5.76 | 10.73 | 3.18 | 2.36 | 6.78 | 0.0 | 5.5 | 9.39 | 4.18 | 1.08 | 0.3 | 0.0 | 2.49 | 1.72 | 0.69 | 40.62 | 67.61 | 69.67
langmolecules | MolT5-Large | 4.22 | 4.42 | 5.81 | 10.33 | 3.2 | 2.36 | 7.36 | 0.0 | 5.58 | 7.77 | 6.94 | 0.65 | 0.13 | 0.39 | 0.11 | 2.02 | 0.58 | 38.34 | 66.97 | 69.21
bluesky333 | phi3-knowchem-sft-beam1 | 2.87 | 3.56 | 5.36 | 10.05 | 3.15 | 2.13 | 6.11 | 0.0 | 5.45 | 8.01 | 4.21 | 0.93 | 0.23 | 0.0 | 0.29 | 2.04 | 0.21 | 35.39 | 63.71 | 64.96
langmolecules | MolT5-Base | 1.06 | 1.02 | 3.99 | 8.27 | 2.68 | 2.0 | 3.0 | 0.0 | 4.63 | 5.52 | 0.48 | 0.08 | 0.04 | 0.0 | 0.0 | 1.79 | 0.0 | 18.04 | 47.64 | 68.46
langmolecules | MolT5-Small | 0.0 | 0.0 | 3.23 | 7.87 | 0.27 | 1.65 | 3.12 | 0.0 | 0.0 | 6.24 | 0.0 | 0.06 | 0.0 | 0.0 | 0.0 | 1.66 | 0.0 | 17.99 | 41.09 | 65.06
guiyike | yike | -16.64 | -7.7 | 1.23 | 4.93 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.12 | 0.0 | 3.42 | 0.0 | 0.0

Text-Based Molecule Generation:

The leaderboard ranking is based on the “Overall Increase” metric. Metrics are spread across two tables: 1) metrics computed for the entire Test Split, and 2) metrics computed for the “Withheld Combos” molecules within the test split.

“Overall Increase” is the average absolute percent improvement over MolT5-Small across all metrics from both tables except exact match. For FCD and Levenshtein, absolute improvement is used instead.

“Test Metric Increase” is this score using only the test split metrics. “Withheld Combo Increase” is this score using only the withheld combos split metrics.

Test Split Metrics:

Team | Model | Overall Increase | Test Metric Increase | BLEU | Exact Match | Levenshtein | Validity | MACCS FTS | RDK FTS | Morgan FTS | FCD
qizhipei | BioT5+_large | 12.97 | 13.2 | 73.17 | 0.01 | 41.05 | 100.0 | 76.05 | 68.7 | 50.05 | 3.13
protonunfold | SciMind | 12.68 | 12.76 | 73.44 | 0.0 | 40.35 | 99.78 | 75.06 | 67.05 | 47.82 | 2.54
avaliev | PLAIN | 12.41 | 12.39 | 71.82 | 0.01 | 43.91 | 98.98 | 74.97 | 66.85 | 48.92 | 0.28
mengmeng | Mistral | 12.26 | 12.22 | 70.56 | 0.0 | 43.75 | 99.4 | 75.6 | 67.57 | 48.62 | 2.01
langmolecules | Meditron | 11.81 | 11.66 | 68.84 | 0.01 | 46.47 | 99.54 | 75.59 | 67.66 | 48.72 | 2.44
dimitris | ALMol~10%Data | 10.41 | 9.26 | 69.74 | 0.01 | 43.24 | 92.84 | 70.22 | 62.79 | 42.96 | 3.05
hecao | bioagent_epoch5 | 10.21 | 10.67 | 61.98 | 0.02 | 47.12 | 99.67 | 75.94 | 68.38 | 46.92 | 2.17
langmolecules | MolT5-Base | 10.07 | 10.0 | 67.04 | 0.0 | 45.71 | 99.89 | 74.61 | 63.7 | 46.29 | nan
danielshao | SMol+LPM | 8.74 | 8.45 | 59.74 | 0.01 | 55.09 | 97.66 | 74.35 | 66.79 | 46.6 | 4.25
hecao | bioagent | 6.39 | 5.36 | 51.5 | 0.01 | 70.67 | 98.05 | 74.0 | 66.43 | 45.56 | 3.74
langmolecules | MolT5-Large | 3.65 | 4.78 | 55.31 | 0.0 | 56.47 | 99.12 | 74.14 | 63.4 | 38.54 | 17.63
erikxiong | PUF | 0.0 | 0.0 | 55.44 | 0.0 | 57.21 | 81.03 | 63.06 | 56.83 | 36.69 | nan
langmolecules | MolT5-Small | 0.0 | 0.0 | 55.44 | 0.0 | 57.21 | 81.03 | 63.06 | 56.83 | 36.69 | nan
ndhieunguyen | Lang2mol-diff | -1.21 | -0.75 | 54.15 | 0.0 | 55.26 | 100.0 | 59.6 | 32.44 | 31.98 | 10.71
guiyike | Nano | -7.39 | -6.64 | 43.51 | 0.0 | 83.38 | 100.0 | 49.25 | 37.82 | 23.52 | 5.64

Withheld Combos Split Metrics:

Team | Model | Overall Increase | Withheld Combo Increase | BLEU | Exact Match | Levenshtein | Validity | MACCS FTS | RDK FTS | Morgan FTS | FCD
qizhipei | BioT5+_large | 12.97 | 12.74 | 78.48 | 0.02 | 43.99 | 100.0 | 86.54 | 80.43 | 59.54 | 4.32
protonunfold | SciMind | 12.68 | 12.6 | 78.99 | 0.0 | 43.39 | 99.77 | 86.03 | 79.16 | 58.06 | 3.06
avaliev | PLAIN | 12.41 | 12.42 | 78.19 | 0.0 | 47.84 | 99.64 | 86.3 | 79.16 | 59.02 | 0.35
mengmeng | Mistral | 12.26 | 12.31 | 78.08 | 0.0 | 46.34 | 99.48 | 86.14 | 78.75 | 59.39 | 2.27
langmolecules | Meditron | 11.81 | 11.96 | 77.11 | 0.0 | 48.29 | 99.53 | 86.23 | 78.91 | 59.57 | 2.59
dimitris | ALMol~10%Data | 10.41 | 11.56 | 76.93 | 0.01 | 45.68 | 98.99 | 85.9 | 79.35 | 55.28 | 3.51
hecao | bioagent_epoch5 | 10.21 | 9.75 | 65.86 | 0.0 | 52.06 | 99.83 | 86.55 | 80.13 | 55.68 | 3.24
langmolecules | MolT5-Base | 10.07 | 10.15 | 72.48 | 0.0 | 50.9 | 99.84 | 86.22 | 78.23 | 57.56 | nan
danielshao | SMol+LPM | 8.74 | 9.03 | 65.95 | 0.0 | 56.31 | 99.1 | 86.14 | 79.77 | 56.75 | 4.36
hecao | bioagent | 6.39 | 7.42 | 64.6 | 0.0 | 65.63 | 99.11 | 85.86 | 79.58 | 54.81 | 4.18
langmolecules | MolT5-Large | 3.65 | 2.36 | 57.74 | 0.0 | 66.94 | 99.17 | 83.19 | 74.77 | 40.97 | nan
erikxiong | PUF | 0.0 | 0.0 | 56.96 | 0.0 | 72.44 | 91.89 | 81.03 | 73.36 | 41.6 | nan
langmolecules | MolT5-Small | 0.0 | 0.0 | 56.96 | 0.0 | 72.44 | 91.89 | 81.03 | 73.36 | 41.6 | nan
ndhieunguyen | Lang2mol-diff | -1.21 | -1.67 | 59.38 | 0.0 | 63.17 | 100.0 | 73.26 | 38.6 | 39.76 | 6.38
guiyike | Nano | -7.39 | -8.15 | 44.84 | 0.0 | 85.92 | 100.0 | 59.67 | 47.95 | 30.91 | 7.86

Speakers

Kyunghyun Cho

NYU & Genentech

Marinka Zitnik

Harvard

Huan Sun

The Ohio State University

Lei Li

Carnegie Mellon University

Schedule (Tentative)

Time Program
9:00-9:10 Opening remarks
9:10-10:40 Keynote speeches
10:40-11:30 Panel discussion
11:30-12:30 Poster session
12:30-13:30 Lunch break
13:30-15:00 Keynote speeches
15:00-15:50 Panel discussion
15:50-16:50 Oral paper session (12 min talk + 3 min QA)
16:50-17:20 Challenge track spotlight session (6 min talk)
17:20-17:30 Closing remarks

Accepted Papers

To appear.

Organization

Organizing Committee

Carl Edwards

University of Illinois Urbana-Champaign

Heng Ji

University of Illinois Urbana-Champaign

Qingyun Wang

University of Illinois Urbana-Champaign

Tom Hope

HUJI, AI2

Manling Li

Northwestern University

Lawrence Zhao

Yale University

Scientific Advisory Board

Ying Diao

UIUC

Teodoro Laino

IBM Research

Contact us

Please email language.molecules@gmail.com if you have any questions.

Acknowledgements

This workshop will be partially supported by the Molecule Maker Lab Institute: an AI research institute program supported by NSF under award No. 2019897 and No. 2034562. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.