Language + Molecules

@ ACL 2024 Workshop

August 15, 2024, hybrid in Bangkok, Thailand & remote

Synthesizing Language and Molecules for Scientific Insight and Discovery


Welcome to the Language + Molecules Workshop! Join us as we explore the integration of molecules and natural language, with exciting applications such as developing new drugs, materials, and chemical processes. These molecular solutions will be critical for addressing global problems, in areas such as climate change and healthcare, at scales of complexity never seen before. However, they exist in extremely large search spaces, which makes AI tools a necessity. Excitingly, the chemistry field is poised to be substantially accelerated by multimodal models combining language with molecule and drug structures.

Stay tuned by following us on Twitter @lang_plus_mols.

News

Thank you to everyone who participated in the workshop! It was great to hear from our wonderful speakers and submission presenters. The workshop recording is now available on Underline. Seeing such exciting work in this field was very inspiring!

Thank you again to the organizers, program committee, and advisory board for making this possible!

Call for Papers

A natural question to ask is why we want to integrate natural language with molecules. Combining these types of information has the potential to accelerate scientific discovery: imagine a future where a doctor can write a few sentences describing a patient’s symptoms and then receive the exact structure of the drugs needed to treat that patient’s ailment (taking into account the patient’s genotype, phenotype, and medical history). Or, imagine a world where a researcher can specify the function they want a molecule to perform (e.g., an antimalarial or a photovoltaic) rather than its low-level properties (e.g., pyridine-containing). This high-level control of molecules requires a method of abstract description, and humans have already developed one for communication: language. Integrating language with scientific modalities has the following major advantages, as discussed in Section 10.3.3 of this recent survey:

  1. Generative Modeling: One of the largest problems in current LLMs, hallucination, becomes a strength for discovering molecules with high-level functions, abstract properties, and compositions of many properties.
  2. Bridging Modalities: Language can serve as a “bridge” between modalities (e.g., cellular pathways and drugs) when data is scarce.
  3. Domain Understanding: Grounding language models into external real world knowledge can improve understanding of unseen molecules and advance many emerging tasks, such as experimental procedure planning and reasoning, which use LLMs as scientific agents.
  4. Automation: Instruction-following, dialogue-capable, and tool-equipped models can guide automated discovery in silico and in robotic labs.
  5. Democratization: Language enables scientists without computational expertise to leverage advances in scientific AI.

Research in scientific NLP, integrating molecules with natural language, and multimodal AI for science and medicine has experienced significant attention and growth in recent months. We believe now is the time to begin organizing this nascent community. To do so, we propose a new ACL workshop: “Language + Molecules”. Further, to broaden the community’s understanding of the associated challenges, methodologies, and goals, we will be holding an EACL tutorial. In the workshop’s first year, we will focus on the following research themes:

  • Going beyond language to incorporate molecular structure and interactions into LLMs.
  • Addressing data scarcity and inconsistency: new training methodologies and methods for extracting data from scientific literature.
  • Language-enabled solutions for discovering new drugs and molecules.
  • Incorporating domain knowledge from human-constructed databases into LLMs.
  • Instruction-following, dialogue-capable, and tool-equipped LLMs for molecules.
  • Sequence representations for molecular structures, including organic molecules, proteins, DNA, and inorganic crystals.

Submission Instructions

We plan to accept non-archival papers hosted on our website and to publish opt-in archival proceedings of relevant papers, as well as to run a shared task benchmarking the progress of generative text-molecule models. Shared task participants are encouraged to submit papers. All submissions should be in PDF format and made through the OpenReview submission portal. Submissions must be anonymized following ACL guidelines, but a preprint policy will not be enforced. Information on submitting shared task predictions can be found in the Shared Task section below.

Authors are invited to submit papers of 4 (short) or 8 (long) pages, with unlimited pages for references and appendices. In line with the ACL main conference policy, camera-ready versions of papers will be given one additional page of content. Papers should follow the ACL template style, which can be found here.

The research presented in these papers should be substantially original. Regardless of their length, all submissions will undergo a single-track review process. All submissions must be anonymous for double-blind review. No author information should be included in the papers, and self-references that identify the authors should be avoided or anonymized. We expect each paper to be reviewed by at least three reviewers. Accepted papers will be presented as posters by default, and outstanding submissions will also be selected for oral or spotlight presentations.

Following the ACL workshop guidelines, we do not encourage resubmission of already-published papers, but you are welcome to submit arXiv preprints or papers currently under submission elsewhere. Moreover, work presented at a *CL main conference should not appear in a workshop. Please be sure to indicate conflicts of interest for all authors on your paper.

Important Dates (Tentative)

All deadlines are 11:59 pm UTC-12h (“Anywhere on Earth”).

Dec 15 2023 Call for Workshop Papers
Mar 17 2024 EACL Tutorial on Language + Molecules
Apr 7 2024 Early Feedback Request Deadline for Underrepresented Groups
May 24 2024 Abstract Submission Deadline
May 31 2024 Paper Submission Deadline
May 31 2024 Shared Task Submission Deadline
Jul 5 2024 Notification of Acceptance
Jul 14 2024 Camera-Ready Due
Aug 2 2024 Release of Proceedings
Aug 15 2024 Workshop Date

Shared Task

The workshop will feature a novel competitive yet collaborative shared task based on “translating” between molecules and natural language (Edwards et al., 2022). The task is designed to leverage the advantages of natural language for molecular design and understanding and will cover four high-impact areas of molecular science (medicine, energy, etc.).

Task Formulation

Molecule translation consists of two tasks: (1) generating a caption given a molecule and (2) generating a molecule given a description. We will release labeled training and validation splits of our proposed dataset (see below). For the final evaluation, we will release a list of molecules for captioning and a list of descriptions for molecule generation; these lists will cover disjoint molecules. Participants will upload a corresponding list of generated captions or molecules. A minimal sketch of both directions is shown below.
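
For illustration, both directions can be run with the public MolT5 baselines from Edwards et al. (2022) via Hugging Face Transformers. This is a minimal sketch, assuming the publicly released MolT5 checkpoint names; verify them on the Hub, and substitute your own system as appropriate.

```python
# Minimal sketch of the two translation directions using public MolT5
# checkpoints (names assumed from the Edwards et al., 2022 release;
# verify on the Hugging Face Hub before use).
from transformers import T5ForConditionalGeneration, T5Tokenizer

def translate(model_name: str, text: str) -> str:
    tok = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    ids = tok(text, return_tensors="pt").input_ids
    out = model.generate(ids, num_beams=5, max_length=256)
    return tok.decode(out[0], skip_special_tokens=True)

# Task (1): molecule -> caption (input is a SMILES string; aspirin here)
caption = translate("laituan245/molt5-small-smiles2caption",
                    "CC(=O)Oc1ccccc1C(=O)O")

# Task (2): description -> molecule (output is a SMILES string)
smiles = translate("laituan245/molt5-small-caption2smiles",
                   "The molecule is a nonsteroidal anti-inflammatory drug.")
```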

Evaluation

We will use the publicly available evaluation metrics described by Edwards et al. (2022). We will also extract specific properties from generated captions using simple rules to report property-specific abilities. Scores and a leaderboard will be reported for the full dataset and for each of the four high-impact chemistry domains.
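
For orientation, the text-side metrics can be approximated with NLTK as in the sketch below; the official evaluation scripts in the workshop GitHub repository (including the property-extraction rules) are authoritative.

```python
# Illustrative computation of the caption metrics BLEU-2/4 and METEOR,
# in the spirit of Edwards et al. (2022). Use the official scripts in
# the workshop GitHub repo for actual submissions.
import nltk
from nltk.translate.bleu_score import corpus_bleu
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # required by METEOR
nltk.download("omw-1.4", quiet=True)

refs = ["The molecule is an aromatic carboxylic acid ."]
hyps = ["The molecule is a carboxylic acid ."]

ref_tokens = [[r.split()] for r in refs]  # one reference per example
hyp_tokens = [h.split() for h in hyps]

bleu2 = corpus_bleu(ref_tokens, hyp_tokens, weights=(0.5, 0.5))
bleu4 = corpus_bleu(ref_tokens, hyp_tokens, weights=(0.25,) * 4)
meteor = sum(meteor_score(r, h) for r, h in zip(ref_tokens, hyp_tokens)) / len(hyps)
print(f"BLEU-2 {bleu2:.3f}  BLEU-4 {bleu4:.3f}  METEOR {meteor:.3f}")
```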

Updates

We have updated the code for "text_property_metrics.py" to produce more intuitive results for missing properties in the validation set. We will soon update Table 3 in the dataset manuscript to reflect this change; only the Overall and Biomedical columns of Table 3 change. See GitHub for more details.
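
To give a flavor of the rule-based property scoring, the sketch below uses a simple substring rule to decide whether a caption asserts a property and scores predictions with F1. This is a hedged illustration only; the actual rules are defined in text_property_metrics.py in the GitHub repo.

```python
# Hedged illustration of rule-based property scoring: a substring rule
# decides whether each caption asserts a property, and predictions are
# scored with F1 against ground-truth labels. The real extraction rules
# live in text_property_metrics.py in the GitHub repo.
def caption_asserts(caption, property_phrase):
    return property_phrase.lower() in caption.lower()

def property_f1(captions, labels, property_phrase):
    preds = [caption_asserts(c, property_phrase) for c in captions]
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum(not p and l for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

caps = ["The molecule is an antimalarial agent.", "The molecule is a dye."]
print(property_f1(caps, [True, False], "antimalarial"))  # 1.0
```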

Dataset Design

The dataset will consist of textual prompts and associated molecular structures. The key reason for creating this standardized synthetic dataset is to thoroughly evaluate the abstraction, functional, and compositional capabilities of submissions; natural language descriptions from the literature are noisy, making it difficult to evaluate abstraction and compositionality. In consultation with domain experts on our advisory board, we will select 20 properties requiring high-level understanding, spread across four high-impact chemistry domains.

Further, we will design the dataset to focus on compositionality. For example, a molecule may possess two properties that are desirable together (e.g., low toxicity and high blood-brain barrier permeability). We will withhold some property combinations from training so that they appear only in the test set. Although we will focus on this synthetic dataset, we will also report a leaderboard for the existing real-world ChEBI-20 dataset.

Collaboration via Competition

While a competition will promote innovation and creativity, we recognize that our ultimate goal is scientific discovery. Thus, we propose a novel collaborative element: roughly 10% of the final evaluation data will consist of molecules of particular interest with unknown properties. We will use an ensemble approach to combine all competitor submissions, so the workshop will produce tangible results usable by researchers in chemistry and medicine.

Download

The dataset is now available at https://github.com/language-plus-molecules/LPM-24-Dataset. It, along with finetuned baseline models, can be found on Hugging Face. A manuscript detailing the dataset’s creation can be found here.
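
For example, the training data can be pulled with the Hugging Face datasets library. The dataset id below mirrors the GitHub organization but is an assumption; check the repository README for the exact Hub identifiers and split names.

```python
# Minimal sketch of loading the dataset from the Hugging Face Hub.
# The dataset id is assumed from the GitHub organization name; confirm
# the exact id and split names in the repository README.
from datasets import load_dataset

lpm24 = load_dataset("language-plus-molecules/LPM-24_train")
print(lpm24)              # available splits and sizes
print(lpm24["train"][0])  # expected fields: a SMILES molecule and its caption
```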

Baseline

Baseline MolT5 models (Edwards et al., 2022) and Meditron-7B have been released on Hugging Face.

Submission

Evaluation set results should be submitted on Codabench: here for molecule captioning and here for molecule generation.

Not all metrics are reported on this leaderboard due to computational limits of Codabench. After the submission window ends, we will take the three latest submissions posted to the leaderboard from each team and calculate all metrics for them. If a team’s submissions are not posted to the leaderboard, we will use their best submission.

These submissions should be a zipped "submit.txt" file with results from the caption and molecule generation eval splits. They should be in the same order as these splits and contain the same number of data points. Example submission files for MolT5-Small are available on the GitHub repo. Please contact language.molecules@gmail.com if you have any difficulties uploading.
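
A minimal packaging helper is sketched below (illustrative only; compare its output against the MolT5-Small example files on the GitHub repo before uploading).

```python
# Illustrative packaging helper: one prediction per line, in the same
# order (and with the same count) as the evaluation split, zipped for
# Codabench. Match the exact layout of the example files on the GitHub
# repo before uploading.
import zipfile

def package_submission(predictions, txt_name="submit.txt",
                       zip_name="submission.zip"):
    with open(txt_name, "w", encoding="utf-8") as f:
        for pred in predictions:
            # keep each prediction on a single line
            f.write(pred.replace("\n", " ").strip() + "\n")
    with zipfile.ZipFile(zip_name, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(txt_name)

package_submission(["prediction for example 1", "prediction for example 2"])
```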

We encourage shared task participants to also submit a 4-page short paper detailing their submission. Alternatively, task submissions may accompany full-length (8-page) paper submissions.

Leaderboard

Molecular Captioning:

The leaderboard ranking is based on the “Overall Increase” metric. Metrics are spread across two tables: 1) translation metric scores (BLEU, ROUGE, METEOR), and 2) property metric scores (whether a caption contains the ground-truth property). For details on evaluation, please see the GitHub repo.

“Overall Increase” is the average absolute percent improvement over MolT5-Small, across all metrics from both tables. Additionally, we report a “Translation Metric Increase”, the average improvement over all translation metric scores, and a “Property Metric Increase”, the average improvement over all captioning property scores. All property scores are F1 scores.
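
Concretely, the computation reduces to a mean of per-metric score differences (scores are percentages, so each difference is an absolute percent improvement). A minimal sketch with illustrative, non-leaderboard values:

```python
# Sketch of "Overall Increase": the mean of (system - MolT5-Small) over
# all reported metrics, where each score is a percentage, so each
# difference is an absolute percent improvement. Values are illustrative.
def average_increase(system_scores, baseline_scores):
    diffs = [s - b for s, b in zip(system_scores, baseline_scores)]
    return sum(diffs) / len(diffs)

system   = {"BLEU-2": 73.8, "BLEU-4": 53.0, "METEOR": 77.5}
baseline = {"BLEU-2": 66.8, "BLEU-4": 48.3, "METEOR": 68.1}  # MolT5-Small
print(average_increase(system.values(), baseline.values()))  # ~7.03
```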

Translation Metrics:

Team Model Overall Increase Translation Metric Increase BLEU-2 BLEU-4 ROUGE-1 ROUGE-2 ROUGE-L METEOR
Insilico Medicine Nach0 27.08 6.37 73.81 53.04 80.06 60.17 57.5 77.45
qizhipei BioT5+_large_voting 14.66 6.45 75.58 54.77 79.41 59.89 57.46 75.43
protonunfold SciMind 12.39 5.77 75.66 54.98 78.24 58.42 56.34 74.76
NLPeople Ensembled 12.3 5.68 75.54 54.83 78.1 58.47 56.37 74.57
hecao bioagent 10.95 5.57 74.11 53.84 78.73 59.38 57.06 74.08
xwk89 mistral_4b9_e1 10.8 5.56 74.38 54.08 78.49 59.49 56.82 73.91
mengmeng Mistral 10.37 5.48 75.04 54.56 78.1 58.81 56.42 73.73
langmolecules Meditron 10.34 5.47 75.16 54.72 77.97 58.75 56.33 73.69
NLPeople Rank_model_1 9.94 4.8 74.73 54.1 77.3 57.7 55.53 73.26
dimitris ALMol~10%DataTrained 9.61 4.27 74.71 54.34 76.43 56.54 54.66 72.76
xygui MDEG 9.43 2.96 73.98 53.33 75.08 54.39 52.58 72.21
danielshao SMol+LPM 7.83 2.7 72.2 52.02 74.76 56.1 53.34 71.57
xwk89 mistral_e1 7.45 3.46 73.18 53.24 75.29 56.74 54.39 71.75
duongttr Mol2Lang-VLM 4.52 4.11 73.43 53.19 76.72 57.67 55.43 72.05
langmolecules MolT5-Large 4.22 3.64 73.63 53.2 75.79 56.47 54.42 72.16
bluesky333 phi3-knowchem-sft-beam1 2.87 0.81 70.56 50.83 72.61 53.61 52.01 69.01
langmolecules MolT5-Base 1.06 1.2 69.83 50.56 73.34 54.55 52.86 69.86
langmolecules MolT5-Small 0.0 0.0 66.82 48.29 72.8 54.44 53.33 68.14
guiyike yike -16.64 -43.45 12.8 6.37 25.9 13.83 24.1 20.11

Property Metrics:

Team Model Overall Increase Property Metric Increase Overall Property F-1 Biomedical Human Interaction Agr. + Industry Light + Electro X-icides Toxins Light Electricity Inhibitors anti-X Modulators Antagonists Treatments Agonists Cancer Disease Held-out Combos
Insilico Medicine Nach0 27.08 33.99 26.99 27.9 4.18 3.55 72.32 0.0 6.74 71.92 72.71 44.52 11.07 68.09 55.7 65.49 41.53 52.07 66.15 69.09
qizhipei BioT5+_large_voting 14.66 17.39 13.76 19.76 4.01 3.07 28.2 0.0 6.25 31.28 25.12 20.55 3.64 37.02 30.77 23.89 16.5 59.47 67.98 70.05
protonunfold SciMind 12.39 14.6 11.51 18.17 3.94 2.93 21.0 0.04 6.2 23.95 18.05 18.06 2.55 30.14 25.36 19.42 14.81 57.54 67.43 69.92
NLPeople Ensembled 12.3 14.5 11.63 17.86 4.02 2.97 21.67 0.04 6.33 24.74 18.59 17.02 2.77 30.96 25.89 18.61 15.1 53.69 67.44 69.99
hecao bioagent 10.95 12.74 9.94 16.86 3.9 2.76 16.24 0.0 6.43 18.12 14.37 18.84 2.48 26.78 23.41 16.33 14.02 50.28 67.22 69.64
xwk89 mistral_4b9_e1 10.8 12.54 9.93 16.98 3.85 2.81 16.1 0.0 6.26 17.15 15.05 15.2 2.12 31.67 19.11 14.34 14.59 51.17 67.69 70.02
mengmeng Mistral 10.37 12.0 9.73 16.72 3.81 2.83 15.57 0.04 6.03 18.49 12.64 14.74 2.12 24.43 21.21 14.34 10.55 53.61 67.45 69.99
langmolecules Meditron 10.34 11.96 9.7 16.87 3.75 2.83 15.36 0.04 6.07 18.19 12.53 14.71 2.17 24.72 20.61 14.35 10.54 53.74 67.4 69.99
NLPeople Rank_model_1 9.94 11.66 9.88 16.5 3.85 2.87 16.28 0.08 5.97 19.29 13.28 13.91 2.14 23.71 17.88 13.87 12.27 50.82 66.01 69.45
dimitris ALMol~10%DataTrained 9.61 11.39 10.05 15.75 3.74 2.78 17.94 0.21 6.0 17.11 18.77 12.39 1.83 17.34 18.53 11.14 11.16 51.66 67.09 69.76
xygui MDEG 9.43 11.59 8.74 16.08 3.57 2.72 12.6 0.0 5.86 12.68 12.52 13.2 1.87 29.96 19.61 12.47 14.08 54.55 67.1 69.32
danielshao SMol+LPM 7.83 9.54 8.55 14.97 2.39 2.62 14.2 0.0 3.19 16.38 12.01 10.9 1.61 22.13 17.43 9.66 9.93 40.41 64.84 68.74
xwk89 mistral_e1 7.45 8.78 8.23 14.44 2.69 2.74 13.05 0.0 4.33 13.43 12.67 14.23 2.0 26.09 18.6 12.99 11.75 20.44 59.5 69.19
duongttr Mol2Lang-VLM 4.52 4.66 5.76 10.73 3.18 2.36 6.78 0.0 5.5 9.39 4.18 1.08 0.3 0.0 2.49 1.72 0.69 40.62 67.61 69.67
langmolecules MolT5-Large 4.22 4.42 5.81 10.33 3.2 2.36 7.36 0.0 5.58 7.77 6.94 0.65 0.13 0.39 0.11 2.02 0.58 38.34 66.97 69.21
bluesky333 phi3-knowchem-sft-beam1 2.87 3.56 5.36 10.05 3.15 2.13 6.11 0.0 5.45 8.01 4.21 0.93 0.23 0.0 0.29 2.04 0.21 35.39 63.71 64.96
langmolecules MolT5-Base 1.06 1.02 3.99 8.27 2.68 2.0 3.0 0.0 4.63 5.52 0.48 0.08 0.04 0.0 0.0 1.79 0.0 18.04 47.64 68.46
langmolecules MolT5-Small 0.0 0.0 3.23 7.87 0.27 1.65 3.12 0.0 0.0 6.24 0.0 0.06 0.0 0.0 0.0 1.66 0.0 17.99 41.09 65.06
guiyike yike -16.64 -7.7 1.23 4.93 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.12 0.0 3.42 0.0 0.0

Text-Based Molecule Generation:

The leaderboard ranking is based on the “Overall Increase” metric. Metrics are spread across two tables: 1) metrics computed for the entire Test Split, and 2) metrics computed for the “Withheld Combos” molecules within the test split.

“Overall Increase” is the average improvement over MolT5-Small across all metrics from both tables except exact match. For most metrics this is the absolute percent improvement; for FCD and Levenshtein, the absolute improvement is used instead.

“Test Metric Increase” is this score computed using only the test split metrics. “Withheld Combo Increase” is this score computed using only the withheld combos split metrics.
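
For reference, the fingerprint Tanimoto similarities (MACCS/RDK/Morgan FTS) reported below can be computed with RDKit, in the spirit of Edwards et al. (2022); the official scripts in the workshop GitHub repo (which also cover BLEU, Levenshtein, validity, and FCD) are authoritative.

```python
# Sketch of the fingerprint-similarity metrics (MACCS / RDK / Morgan FTS)
# between a generated and a ground-truth molecule, using RDKit.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, MACCSkeys

def fingerprint_similarities(pred_smiles, ref_smiles):
    pred, ref = Chem.MolFromSmiles(pred_smiles), Chem.MolFromSmiles(ref_smiles)
    if pred is None or ref is None:
        return None  # unparsable SMILES also count against Validity
    return {
        "MACCS FTS": DataStructs.TanimotoSimilarity(
            MACCSkeys.GenMACCSKeys(pred), MACCSkeys.GenMACCSKeys(ref)),
        "RDK FTS": DataStructs.TanimotoSimilarity(
            Chem.RDKFingerprint(pred), Chem.RDKFingerprint(ref)),
        "Morgan FTS": DataStructs.TanimotoSimilarity(
            AllChem.GetMorganFingerprintAsBitVect(pred, 2),
            AllChem.GetMorganFingerprintAsBitVect(ref, 2)),
    }

print(fingerprint_similarities("CCO", "CCN"))  # ethanol vs. ethylamine
```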

Test Split Metrics:

Team Model Overall Increase Test Metric Increase BLEU Exact Match Levenshtein Validity MACCS FTS RDK FTS Morgan FTS FCD
qizhipei BioT5+_large 12.97 13.2 73.17 0.01 41.05 100.0 76.05 68.7 50.05 3.13
protonunfold SciMind 12.68 12.76 73.44 0.0 40.35 99.78 75.06 67.05 47.82 2.54
Insilico Medicine Nach0 12.41 12.39 71.82 0.01 43.91 98.98 74.97 66.85 48.92 0.28
mengmeng Mistral 12.26 12.22 70.56 0.0 43.75 99.4 75.6 67.57 48.62 2.01
langmolecules Meditron 11.81 11.66 68.84 0.01 46.47 99.54 75.59 67.66 48.72 2.44
dimitris ALMol~10%Data 10.41 9.26 69.74 0.01 43.24 92.84 70.22 62.79 42.96 3.05
hecao bioagent_epoch5 10.21 10.67 61.98 0.02 47.12 99.67 75.94 68.38 46.92 2.17
langmolecules MolT5-Base 10.07 10.0 67.04 0.0 45.71 99.89 74.61 63.7 46.29 nan
danielshao SMol+LPM 8.74 8.45 59.74 0.01 55.09 97.66 74.35 66.79 46.6 4.25
hecao bioagent 6.39 5.36 51.5 0.01 70.67 98.05 74.0 66.43 45.56 3.74
langmolecules MolT5-Large 3.65 4.78 55.31 0.0 56.47 99.12 74.14 63.4 38.54 17.63
erikxiong PUF 0.0 0.0 55.44 0.0 57.21 81.03 63.06 56.83 36.69 nan
langmolecules MolT5-Small 0.0 0.0 55.44 0.0 57.21 81.03 63.06 56.83 36.69 nan
ndhieunguyen Lang2mol-diff -1.21 -0.75 54.15 0.0 55.26 100.0 59.6 32.44 31.98 10.71
guiyike Nano -7.39 -6.64 43.51 0.0 83.38 100.0 49.25 37.82 23.52 5.64

Withheld Combos Split Metrics:

Team Model Overall Increase Withheld Combo Increase BLEU Exact Match Levenshtein Validity MACCS FTS RDK FTS Morgan FTS FCD
qizhipei BioT5+_large 12.97 12.74 78.48 0.02 43.99 100.0 86.54 80.43 59.54 4.32
protonunfold SciMind 12.68 12.6 78.99 0.0 43.39 99.77 86.03 79.16 58.06 3.06
Insilico Medicine Nach0 12.41 12.42 78.19 0.0 47.84 99.64 86.3 79.16 59.02 0.35
mengmeng Mistral 12.26 12.31 78.08 0.0 46.34 99.48 86.14 78.75 59.39 2.27
langmolecules Meditron 11.81 11.96 77.11 0.0 48.29 99.53 86.23 78.91 59.57 2.59
dimitris ALMol~10%Data 10.41 11.56 76.93 0.01 45.68 98.99 85.9 79.35 55.28 3.51
hecao bioagent_epoch5 10.21 9.75 65.86 0.0 52.06 99.83 86.55 80.13 55.68 3.24
langmolecules MolT5-Base 10.07 10.15 72.48 0.0 50.9 99.84 86.22 78.23 57.56 nan
danielshao SMol+LPM 8.74 9.03 65.95 0.0 56.31 99.1 86.14 79.77 56.75 4.36
hecao bioagent 6.39 7.42 64.6 0.0 65.63 99.11 85.86 79.58 54.81 4.18
langmolecules MolT5-Large 3.65 2.36 57.74 0.0 66.94 99.17 83.19 74.77 40.97 nan
erikxiong PUF 0.0 0.0 56.96 0.0 72.44 91.89 81.03 73.36 41.6 nan
langmolecules MolT5-Small 0.0 0.0 56.96 0.0 72.44 91.89 81.03 73.36 41.6 nan
ndhieunguyen Lang2mol-diff -1.21 -1.67 59.38 0.0 63.17 100.0 73.26 38.6 39.76 6.38
guiyike Nano -7.39 -8.15 44.84 0.0 85.92 100.0 59.67 47.95 30.91 7.86

Speakers

Kyunghyun Cho

NYU & Genentech

Marinka Zitnik

Harvard

Huan Sun

The Ohio State University

Lei Li

Carnegie Mellon University

Schedule

Time Program
9:00-9:10 Opening Remarks
9:10-9:40 Keynote Speech: Kyunghyun Cho
9:40-10:10 Keynote Speech: Marinka Zitnik
10:10-10:40 Keynote Speech: Elsa Olivetti
10:40-11:00 Coffee Break
11:00-11:20 Shared Task Overview
11:30-11:45 Oral Presentation: Repurformer: Transformers for Repurposing-Aware Molecule Generation
11:45-12:00 Oral Presentation: SciMind: A Multimodal Mixture-of-Experts Model for Advancing Pharmaceutical Sciences
12:00-12:15 Oral Presentation: ChatMol Copilot: An Agent for Molecular Modeling and Computation Powered by LLMs
12:15-12:30 Oral Presentation: Enhanced BioT5+ for Molecule-Text Translation: A Three-Stage Approach with Data Distillation, Diverse Training, and Voting Ensemble
12:30-13:30 Lunch Break
13:30-14:05 Keynote Speech: Huan Sun
14:05-14:40 Keynote Speech: Lei Li
14:40-15:30 Panel Discussion: Huan Sun, Lei Li, Heng Ji
15:30-16:00 Coffee Break
16:00-17:15 Poster Session
17:20-17:30 Closing Remarks

Accepted Papers

Proceedings:

The archival proceedings are now online on the ACL Anthology.

Accepted papers can also be read on OpenReview.

Non-archival papers:

GLaD: Synergizing Molecular Graphs and Language Descriptors for Enhanced Power Conversion Efficiency Prediction in Organic Photovoltaic Devices

Thao Nguyen, Tiara Torres-Flores, Changhyun Hwang, Carl Edwards, Ying Diao, Heng Ji

This paper presents a novel approach for predicting Power Conversion Efficiency (PCE) of Organic Photovoltaic (OPV) devices, called GLaD: synergizing molecular Graphs and Language Descriptors for enhanced PCE prediction. Due to the lack of high-quality experimental data, we collect a dataset consisting of 500 pairs of OPV donor and acceptor molecules along with their corresponding PCE values, which we utilize as the training data for our predictive model. In this low-data regime, GLaD leverages properties learned from large language models (LLMs) pretrained on extensive scientific literature to enrich molecular structural representations, allowing for a multimodal representation of molecules. GLaD achieves precise predictions of PCE, thereby facilitating the synthesis of new OPV molecules with improved efficiency. Furthermore, GLaD showcases versatility, as it applies to a range of molecular property prediction tasks (BBBP, BACE, ClinTox and SIDER), not limited to those concerning OPV materials. In particular, GLaD proves valuable for tasks in low-data regimes within the chemical space, as it enriches molecular representations by incorporating molecular property descriptions learned from large-scale pretraining. This capability is significant in real-world scientific endeavors like drug and material discovery, where access to comprehensive data is crucial for informed decision-making and efficient exploration of the chemical space.

Molecule Language Model with Augmented Pairs and Expertise Transfer

Namkyeong Lee, Siddhartha Laghuvarapu, Chanyoung Park, Jimeng Sun

Understanding molecules and their textual descriptions via molecule language models (MoLM) has recently seen a surge of interest among researchers. However, unique challenges exist in the field of MoLM due to 1) a limited amount of molecule-text paired data and 2) missing expertise arising from the specialized areas of focus among experts. To this end, we propose AMOLE, which 1) augments molecule-text pairs with a structural-similarity-preserving loss, and 2) transfers expertise between molecules. Extensive experiments on various downstream tasks demonstrate the superiority of AMOLE in comprehending molecules and their descriptions, highlighting its potential for application in real-world drug discovery.

Extracting Materials Science Data from Scientific Tables

Defne Circi, Ghazal Khalighinejad, Anlan Chen, Bhuwan Dhingra, L. Brinson

Advances in materials science depend on leveraging data from the vast published literature. Extracting detailed data and metadata from these publications is challenging, leading current data repositories to rely on newly created data in narrow domains. Large Language Models (LLMs) offer a new opportunity to rapidly and accurately extract data and insights from the published literature, transforming it into structured formats for easy querying and reuse. This paper explores using LLMs for autonomous data extraction from materials science articles, focusing on polymer composites to demonstrate successes and challenges in extracting tabular data. We explored different table representations for use with LLMs, finding that a multimodal model with an image input yielded the most promising results. This model achieved an accuracy score of 0.910 for composition information extraction, which includes polymer names, molecule names used as fillers, and their respective compositions. Additionally, it achieved an F1 score of 0.863 for property name information extraction. With the most conservative evaluation for property extraction, requiring an exact match in all details, we obtained an F1 score of 0.419. We observed that by allowing varying degrees of flexibility in the evaluation, the score can increase to 0.769. We envision that the results and analysis from this study will promote further research directions in developing information extraction strategies from materials information sources.

LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset

Botao Yu, Frazier N. Baker, Ziqi Chen, Xia Ning, Huan Sun

Chemistry plays a crucial role in many domains, such as drug discovery and material science. While large language models (LLMs) such as GPT-4 exhibit remarkable capabilities on natural language processing tasks, existing research indicates that their performance on chemistry tasks is discouragingly low. In this paper, however, we demonstrate that our developed LLMs can achieve very strong results on a comprehensive set of chemistry tasks, outperforming the most advanced GPT-4 and Claude 3 Opus by a substantial margin. To accomplish this, we propose SMolInstruct, a large-scale, comprehensive, and high-quality dataset for instruction tuning. It contains 14 selected chemistry tasks and over three million samples, laying a solid foundation for training and evaluating LLMs for chemistry. Using SMolInstruct, we fine-tune a set of open-source LLMs named LlaSMol, among which we find that Mistral serves as the best base model for chemistry tasks. Our analysis further demonstrates the critical role of the proposed dataset in driving the performance improvements.

Organization

Organizing Committee

Carl Edwards

University of Illinois Urbana-Champaign

Heng Ji

University of Illinois Urbana-Champaign

Qingyun Wang

University of Illinois Urbana-Champaign

Tom Hope

HUJI, AI2

Manling Li

Northwestern University

Lawrence Zhao

Yale University

Scientific Advisory Board

Ying Diao

UIUC

Teodoro Laino

IBM Research

Contact us

Please email language.molecules@gmail.com if you have any questions.

Acknowledgements

This workshop will be partially supported by the Molecule Maker Lab Institute: an AI research institute program supported by NSF under award No. 2019897 and No. 2034562. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.