Language + Molecules
@ ACL 2024 Workshop
August 15, 2024 hybrid in Bangkok, Thailand & Remote
Synthesizing Language and Molecules for Scientific Insight and Discovery
Welcome to the Language + Molecules Workshop! Join us as we explore the integration of molecules and natural language, with exciting applications such as developing new drugs, materials, and chemical processes. These molecular solutions will be critical for addressing global problems, such as climate change and healthcare, at never-before-seen scales of complexity. However, they exist in extremely large search spaces, which makes AI tools a necessity. Excitingly, the chemistry field is poised to be substantially accelerated by multimodal models combining language with molecules and drug structures.
Stay tuned by following us on Twitter @lang_plus_mols.
Thank you to everyone who participated in the workshop! It was great to hear from our wonderful speakers and submission presenters. The workshop recording is now available on Underline. Seeing such exciting work in this field was very inspiring!
Thank you again to the organizers, program committee, and advisory board for making this possible!
A natural question to ask is why we want to integrate natural language with molecules. Combining these types of information has the potential to accelerate scientific discovery: imagine a future where a doctor can write a few sentences describing a patient’s symptoms and then receive the exact structures of the drugs necessary to treat that patient’s ailment (taking into account the patient’s genotype, phenotype, and medical history). Or, imagine a world where a researcher can specify the function they want a molecule to perform (e.g., antimalarial or photovoltaic) rather than its low-level properties (e.g., pyridine-containing). This high-level control of molecules requires a method of abstract description, and humans have already developed one for communication: language. Integrating language with scientific modalities has several major advantages, as discussed in Section 10.3.3 of this recent survey.
Research in scientific NLP, integrating molecules with natural language, and multimodal AI for science and medicine has seen significant attention and growth in recent months. We believe now is the time to begin organizing this nascent community. To do so, we propose a new ACL workshop: “Language + Molecules”. Further, to broaden the community’s understanding of the associated challenges, methodologies, and goals, we will be holding an EACL tutorial. In the workshop’s first year, we will focus on a core set of research themes.
Submission Instructions
We plan to accept non-archival papers hosted on our website, along with opt-in archival proceedings of relevant papers, as well as a shared task to benchmark the progress of generative text–molecule models. Shared task participants are encouraged to also submit papers. All submissions should be in PDF format and made through the OpenReview submission portal. Submissions must be anonymized following ACL guidelines, but a preprint policy will not be enforced. Information on submitting shared task predictions can be found on the shared task page.
Authors are invited to submit papers of 4 (short) or 8 (long) pages, with unlimited pages for references and appendices. In line with the ACL main conference policy, camera-ready versions of papers will be given one additional page of content. Papers should follow the ACL template style, which can be found here.
The research presented in these papers should be substantially original. Regardless of their length, all submissions will undergo a single-track review process. All submissions must be anonymous for double-blind review. No author information should be included in the papers, and self-references that identify the authors should be avoided or anonymized. We expect each paper to be reviewed by at least three reviewers. Accepted papers will be presented as posters by default, and outstanding submissions will also be selected for oral or spotlight presentations.
According to the ACL workshop guidelines, we do not encourage the re-submission of already-published papers, but you are allowed to submit arXiv preprints or papers currently under submission. Moreover, work presented at the *CL main conference should not also appear in a workshop. Please be sure to indicate conflicts of interest for all authors on your paper.
All deadlines are 11:59 pm UTC-12h (“Anywhere on Earth”).
Date | Event |
---|---|
Dec 15 2023 | Call for Workshop Papers |
Mar 17 2024 | EACL Tutorial on Language + Molecules |
Apr 7 2024 | Early Feedback Request Deadline for Underrepresented Groups |
May 24 2024 | Abstract Submission Deadline |
May 31 2024 | Paper Submission Deadline |
May 31 2024 | Shared Task Submission Deadline |
Jul 5 2024 | Notification of Acceptance |
Jul 14 2024 | Camera-Ready Due |
Aug 2 2024 | Release of Proceedings |
Aug 15 2024 | Workshop Date |
The leaderboard ranking is based on the “Overall Increase” metric. Metrics are spread across two tables: 1) translation metric scores (BLEU, ROUGE, METEOR), and 2) property metric scores (whether a generated caption contains the ground-truth property). For details on evaluation, please see the GitHub repo.
“Overall Increase” is the average absolute percent improvement over MolT5-Small, taken across all metrics from both tables. Additionally, we report a “Translation Metric Increase”, the average increase over the translation metric scores alone, and a “Property Metric Increase”, the average improvement over the captioning property scores alone. All property scores are F-1 scores. A minimal sketch of this aggregation is shown below.
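For illustration, here is a minimal Python sketch of how such an aggregate could be computed. The score values and metric names are hypothetical placeholders; the official evaluation code lives in the GitHub repo.

```python
# Minimal sketch of an "Overall Increase"-style aggregate.
# Scores below are hypothetical placeholders; see the GitHub repo
# for the official evaluation code.

# Baseline (MolT5-Small) scores, in percent.
baseline = {"BLEU-2": 66.82, "BLEU-4": 48.29, "METEOR": 68.14}

def average_increase(submission: dict, baseline: dict) -> float:
    """Average absolute percentage-point improvement over the baseline,
    taken across all metrics shared with the baseline."""
    deltas = [submission[metric] - baseline[metric] for metric in baseline]
    return sum(deltas) / len(deltas)

# A hypothetical submission's translation metric scores.
submission = {"BLEU-2": 73.81, "BLEU-4": 53.04, "METEOR": 77.45}
print(f"Translation Metric Increase: {average_increase(submission, baseline):.2f}")
```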
Team | Model | Overall Increase | Translation Metric Increase | BLEU-2 | BLEU-4 | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR |
---|---|---|---|---|---|---|---|---|---|
Insilico Medicine | Nach0 | 27.08 | 6.37 | 73.81 | 53.04 | 80.06 | 60.17 | 57.5 | 77.45 |
qizhipei | BioT5+_large_voting | 14.66 | 6.45 | 75.58 | 54.77 | 79.41 | 59.89 | 57.46 | 75.43 |
protonunfold | SciMind | 12.39 | 5.77 | 75.66 | 54.98 | 78.24 | 58.42 | 56.34 | 74.76 |
NLPeople | Ensembled | 12.3 | 5.68 | 75.54 | 54.83 | 78.1 | 58.47 | 56.37 | 74.57 |
hecao | bioagent | 10.95 | 5.57 | 74.11 | 53.84 | 78.73 | 59.38 | 57.06 | 74.08 |
xwk89 | mistral_4b9_e1 | 10.8 | 5.56 | 74.38 | 54.08 | 78.49 | 59.49 | 56.82 | 73.91 |
mengmeng | Mistral | 10.37 | 5.48 | 75.04 | 54.56 | 78.1 | 58.81 | 56.42 | 73.73 |
langmolecules | Meditron | 10.34 | 5.47 | 75.16 | 54.72 | 77.97 | 58.75 | 56.33 | 73.69 |
NLPeople | Rank_model_1 | 9.94 | 4.8 | 74.73 | 54.1 | 77.3 | 57.7 | 55.53 | 73.26 |
dimitris | ALMol~10%DataTrained | 9.61 | 4.27 | 74.71 | 54.34 | 76.43 | 56.54 | 54.66 | 72.76 |
xygui | MDEG | 9.43 | 2.96 | 73.98 | 53.33 | 75.08 | 54.39 | 52.58 | 72.21 |
danielshao | SMol+LPM | 7.83 | 2.7 | 72.2 | 52.02 | 74.76 | 56.1 | 53.34 | 71.57 |
xwk89 | mistral_e1 | 7.45 | 3.46 | 73.18 | 53.24 | 75.29 | 56.74 | 54.39 | 71.75 |
duongttr | Mol2Lang-VLM | 4.52 | 4.11 | 73.43 | 53.19 | 76.72 | 57.67 | 55.43 | 72.05 |
langmolecules | MolT5-Large | 4.22 | 3.64 | 73.63 | 53.2 | 75.79 | 56.47 | 54.42 | 72.16 |
bluesky333 | phi3-knowchem-sft-beam1 | 2.87 | 0.81 | 70.56 | 50.83 | 72.61 | 53.61 | 52.01 | 69.01 |
langmolecules | MolT5-Base | 1.06 | 1.2 | 69.83 | 50.56 | 73.34 | 54.55 | 52.86 | 69.86 |
langmolecules | MolT5-Small | 0.0 | 0.0 | 66.82 | 48.29 | 72.8 | 54.44 | 53.33 | 68.14 |
guiyike | yike | -16.64 | -43.45 | 12.8 | 6.37 | 25.9 | 13.83 | 24.1 | 20.11 |
Team | Model | Overall Increase | Property Metric Increase | Overall Property F-1 | Biomedical | Human Interaction | Agr. + Industry | Light + Electro | X-icides | Toxins | Light | Electricity | Inhibitors | anti-X | Modulators | Antagonists | Treatments | Agonists | Cancer | Disease | Held-out Combos |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Insilico Medicine | Nach0 | 27.08 | 33.99 | 26.99 | 27.9 | 4.18 | 3.55 | 72.32 | 0.0 | 6.74 | 71.92 | 72.71 | 44.52 | 11.07 | 68.09 | 55.7 | 65.49 | 41.53 | 52.07 | 66.15 | 69.09 |
qizhipei | BioT5+_large_voting | 14.66 | 17.39 | 13.76 | 19.76 | 4.01 | 3.07 | 28.2 | 0.0 | 6.25 | 31.28 | 25.12 | 20.55 | 3.64 | 37.02 | 30.77 | 23.89 | 16.5 | 59.47 | 67.98 | 70.05 |
protonunfold | SciMind | 12.39 | 14.6 | 11.51 | 18.17 | 3.94 | 2.93 | 21.0 | 0.04 | 6.2 | 23.95 | 18.05 | 18.06 | 2.55 | 30.14 | 25.36 | 19.42 | 14.81 | 57.54 | 67.43 | 69.92 |
NLPeople | Ensembled | 12.3 | 14.5 | 11.63 | 17.86 | 4.02 | 2.97 | 21.67 | 0.04 | 6.33 | 24.74 | 18.59 | 17.02 | 2.77 | 30.96 | 25.89 | 18.61 | 15.1 | 53.69 | 67.44 | 69.99 |
hecao | bioagent | 10.95 | 12.74 | 9.94 | 16.86 | 3.9 | 2.76 | 16.24 | 0.0 | 6.43 | 18.12 | 14.37 | 18.84 | 2.48 | 26.78 | 23.41 | 16.33 | 14.02 | 50.28 | 67.22 | 69.64 |
xwk89 | mistral_4b9_e1 | 10.8 | 12.54 | 9.93 | 16.98 | 3.85 | 2.81 | 16.1 | 0.0 | 6.26 | 17.15 | 15.05 | 15.2 | 2.12 | 31.67 | 19.11 | 14.34 | 14.59 | 51.17 | 67.69 | 70.02 |
mengmeng | Mistral | 10.37 | 12.0 | 9.73 | 16.72 | 3.81 | 2.83 | 15.57 | 0.04 | 6.03 | 18.49 | 12.64 | 14.74 | 2.12 | 24.43 | 21.21 | 14.34 | 10.55 | 53.61 | 67.45 | 69.99 |
langmolecules | Meditron | 10.34 | 11.96 | 9.7 | 16.87 | 3.75 | 2.83 | 15.36 | 0.04 | 6.07 | 18.19 | 12.53 | 14.71 | 2.17 | 24.72 | 20.61 | 14.35 | 10.54 | 53.74 | 67.4 | 69.99 |
NLPeople | Rank_model_1 | 9.94 | 11.66 | 9.88 | 16.5 | 3.85 | 2.87 | 16.28 | 0.08 | 5.97 | 19.29 | 13.28 | 13.91 | 2.14 | 23.71 | 17.88 | 13.87 | 12.27 | 50.82 | 66.01 | 69.45 |
dimitris | ALMol~10%DataTrained | 9.61 | 11.39 | 10.05 | 15.75 | 3.74 | 2.78 | 17.94 | 0.21 | 6.0 | 17.11 | 18.77 | 12.39 | 1.83 | 17.34 | 18.53 | 11.14 | 11.16 | 51.66 | 67.09 | 69.76 |
xygui | MDEG | 9.43 | 11.59 | 8.74 | 16.08 | 3.57 | 2.72 | 12.6 | 0.0 | 5.86 | 12.68 | 12.52 | 13.2 | 1.87 | 29.96 | 19.61 | 12.47 | 14.08 | 54.55 | 67.1 | 69.32 |
danielshao | SMol+LPM | 7.83 | 9.54 | 8.55 | 14.97 | 2.39 | 2.62 | 14.2 | 0.0 | 3.19 | 16.38 | 12.01 | 10.9 | 1.61 | 22.13 | 17.43 | 9.66 | 9.93 | 40.41 | 64.84 | 68.74 |
xwk89 | mistral_e1 | 7.45 | 8.78 | 8.23 | 14.44 | 2.69 | 2.74 | 13.05 | 0.0 | 4.33 | 13.43 | 12.67 | 14.23 | 2.0 | 26.09 | 18.6 | 12.99 | 11.75 | 20.44 | 59.5 | 69.19 |
duongttr | Mol2Lang-VLM | 4.52 | 4.66 | 5.76 | 10.73 | 3.18 | 2.36 | 6.78 | 0.0 | 5.5 | 9.39 | 4.18 | 1.08 | 0.3 | 0.0 | 2.49 | 1.72 | 0.69 | 40.62 | 67.61 | 69.67 |
langmolecules | MolT5-Large | 4.22 | 4.42 | 5.81 | 10.33 | 3.2 | 2.36 | 7.36 | 0.0 | 5.58 | 7.77 | 6.94 | 0.65 | 0.13 | 0.39 | 0.11 | 2.02 | 0.58 | 38.34 | 66.97 | 69.21 |
bluesky333 | phi3-knowchem-sft-beam1 | 2.87 | 3.56 | 5.36 | 10.05 | 3.15 | 2.13 | 6.11 | 0.0 | 5.45 | 8.01 | 4.21 | 0.93 | 0.23 | 0.0 | 0.29 | 2.04 | 0.21 | 35.39 | 63.71 | 64.96 |
langmolecules | MolT5-Base | 1.06 | 1.02 | 3.99 | 8.27 | 2.68 | 2.0 | 3.0 | 0.0 | 4.63 | 5.52 | 0.48 | 0.08 | 0.04 | 0.0 | 0.0 | 1.79 | 0.0 | 18.04 | 47.64 | 68.46 |
langmolecules | MolT5-Small | 0.0 | 0.0 | 3.23 | 7.87 | 0.27 | 1.65 | 3.12 | 0.0 | 0.0 | 6.24 | 0.0 | 0.06 | 0.0 | 0.0 | 0.0 | 1.66 | 0.0 | 17.99 | 41.09 | 65.06 |
guiyike | yike | -16.64 | -7.7 | 1.23 | 4.93 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.12 | 0.0 | 3.42 | 0.0 | 0.0 |
The leaderboard ranking is based on the “Overall Increase” metric. Metrics are spread across two tables: 1) metrics computed for the entire Test Split, and 2) metrics computed for the “Withheld Combos” molecules within the test split.
“Overall Increase” is the average absolute percent improvement over MolT5-Small, taken across all metrics from both tables except exact match; for FCD and Levenshtein, absolute improvement is used instead.
“Test Metric Increase” is this score computed using only the test split metrics. “Withheld Combo Increase” is this score computed using only the withheld combos split metrics. A short sketch of the fingerprint similarity metrics is shown below.
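For background on the fingerprint Tanimoto similarity (FTS) columns, here is a small sketch using RDKit. The RDKit usage and fingerprint parameters here are our assumptions for illustration; the exact settings used by the official evaluation are documented in the GitHub repo.

```python
# Sketch: fingerprint Tanimoto similarity (FTS) between a ground-truth and a
# generated molecule, in the spirit of the MACCS/RDK/Morgan FTS columns below.
# Fingerprint parameters are assumptions; the SMILES strings are illustrative.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, MACCSkeys

truth = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin (example)
generated = Chem.MolFromSmiles("CC(=O)Nc1ccc(O)cc1")  # paracetamol (example)

# A generated SMILES counts as valid if RDKit can parse it.
assert generated is not None, "invalid SMILES"

maccs = DataStructs.TanimotoSimilarity(
    MACCSkeys.GenMACCSKeys(truth), MACCSkeys.GenMACCSKeys(generated))
rdk = DataStructs.TanimotoSimilarity(
    Chem.RDKFingerprint(truth), Chem.RDKFingerprint(generated))
morgan = DataStructs.TanimotoSimilarity(
    AllChem.GetMorganFingerprintAsBitVect(truth, 2, nBits=2048),
    AllChem.GetMorganFingerprintAsBitVect(generated, 2, nBits=2048))
print(f"MACCS FTS: {maccs:.3f}, RDK FTS: {rdk:.3f}, Morgan FTS: {morgan:.3f}")
```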
Team | Model | Overall Increase | Test Metric Increase | BLEU | Exact Match | Levenshtein | Validity | MACCS FTS | RDK FTS | Morgan FTS | FCD |
---|---|---|---|---|---|---|---|---|---|---|---|
qizhipei | BioT5+_large | 12.97 | 13.2 | 73.17 | 0.01 | 41.05 | 100.0 | 76.05 | 68.7 | 50.05 | 3.13 |
protonunfold | SciMind | 12.68 | 12.76 | 73.44 | 0.0 | 40.35 | 99.78 | 75.06 | 67.05 | 47.82 | 2.54 |
Insilico Medicine | Nach0 | 12.41 | 12.39 | 71.82 | 0.01 | 43.91 | 98.98 | 74.97 | 66.85 | 48.92 | 0.28 |
mengmeng | Mistral | 12.26 | 12.22 | 70.56 | 0.0 | 43.75 | 99.4 | 75.6 | 67.57 | 48.62 | 2.01 |
langmolecules | Meditron | 11.81 | 11.66 | 68.84 | 0.01 | 46.47 | 99.54 | 75.59 | 67.66 | 48.72 | 2.44 |
dimitris | ALMol~10%Data | 10.41 | 9.26 | 69.74 | 0.01 | 43.24 | 92.84 | 70.22 | 62.79 | 42.96 | 3.05 |
hecao | bioagent_epoch5 | 10.21 | 10.67 | 61.98 | 0.02 | 47.12 | 99.67 | 75.94 | 68.38 | 46.92 | 2.17 |
langmolecules | MolT5-Base | 10.07 | 10.0 | 67.04 | 0.0 | 45.71 | 99.89 | 74.61 | 63.7 | 46.29 | nan |
danielshao | SMol+LPM | 8.74 | 8.45 | 59.74 | 0.01 | 55.09 | 97.66 | 74.35 | 66.79 | 46.6 | 4.25 |
hecao | bioagent | 6.39 | 5.36 | 51.5 | 0.01 | 70.67 | 98.05 | 74.0 | 66.43 | 45.56 | 3.74 |
langmolecules | MolT5-Large | 3.65 | 4.78 | 55.31 | 0.0 | 56.47 | 99.12 | 74.14 | 63.4 | 38.54 | 17.63 |
erikxiong | PUF | 0.0 | 0.0 | 55.44 | 0.0 | 57.21 | 81.03 | 63.06 | 56.83 | 36.69 | nan |
langmolecules | MolT5-Small | 0.0 | 0.0 | 55.44 | 0.0 | 57.21 | 81.03 | 63.06 | 56.83 | 36.69 | nan |
ndhieunguyen | Lang2mol-diff | -1.21 | -0.75 | 54.15 | 0.0 | 55.26 | 100.0 | 59.6 | 32.44 | 31.98 | 10.71 |
guiyike | Nano | -7.39 | -6.64 | 43.51 | 0.0 | 83.38 | 100.0 | 49.25 | 37.82 | 23.52 | 5.64 |
Team | Model | Overall Increase | Withheld Combo Increase | BLEU | Exact Match | Levenshtein | Validity | MACCS FTS | RDK FTS | Morgan FTS | FCD |
---|---|---|---|---|---|---|---|---|---|---|---|
qizhipei | BioT5+_large | 12.97 | 12.74 | 78.48 | 0.02 | 43.99 | 100.0 | 86.54 | 80.43 | 59.54 | 4.32 |
protonunfold | SciMind | 12.68 | 12.6 | 78.99 | 0.0 | 43.39 | 99.77 | 86.03 | 79.16 | 58.06 | 3.06 |
Insilico Medicine | Nach0 | 12.41 | 12.42 | 78.19 | 0.0 | 47.84 | 99.64 | 86.3 | 79.16 | 59.02 | 0.35 |
mengmeng | Mistral | 12.26 | 12.31 | 78.08 | 0.0 | 46.34 | 99.48 | 86.14 | 78.75 | 59.39 | 2.27 |
langmolecules | Meditron | 11.81 | 11.96 | 77.11 | 0.0 | 48.29 | 99.53 | 86.23 | 78.91 | 59.57 | 2.59 |
dimitris | ALMol~10%Data | 10.41 | 11.56 | 76.93 | 0.01 | 45.68 | 98.99 | 85.9 | 79.35 | 55.28 | 3.51 |
hecao | bioagent_epoch5 | 10.21 | 9.75 | 65.86 | 0.0 | 52.06 | 99.83 | 86.55 | 80.13 | 55.68 | 3.24 |
langmolecules | MolT5-Base | 10.07 | 10.15 | 72.48 | 0.0 | 50.9 | 99.84 | 86.22 | 78.23 | 57.56 | nan |
danielshao | SMol+LPM | 8.74 | 9.03 | 65.95 | 0.0 | 56.31 | 99.1 | 86.14 | 79.77 | 56.75 | 4.36 |
hecao | bioagent | 6.39 | 7.42 | 64.6 | 0.0 | 65.63 | 99.11 | 85.86 | 79.58 | 54.81 | 4.18 |
langmolecules | MolT5-Large | 3.65 | 2.36 | 57.74 | 0.0 | 66.94 | 99.17 | 83.19 | 74.77 | 40.97 | nan |
erikxiong | PUF | 0.0 | 0.0 | 56.96 | 0.0 | 72.44 | 91.89 | 81.03 | 73.36 | 41.6 | nan |
langmolecules | MolT5-Small | 0.0 | 0.0 | 56.96 | 0.0 | 72.44 | 91.89 | 81.03 | 73.36 | 41.6 | nan |
ndhieunguyen | Lang2mol-diff | -1.21 | -1.67 | 59.38 | 0.0 | 63.17 | 100.0 | 73.26 | 38.6 | 39.76 | 6.38 |
guiyike | Nano | -7.39 | -8.15 | 44.84 | 0.0 | 85.92 | 100.0 | 59.67 | 47.95 | 30.91 | 7.86 |
Time | Program |
---|---|
9:00-9:10 | Opening Remarks |
9:10-9:40 | Keynote Speech: Kyunghyun Cho |
9:40-10:10 | Keynote Speech: Marinka Zitnik |
10:10-10:40 | Keynote Speech: Elsa Olivetti |
10:40-11:00 | Coffee Break |
11:00-11:20 | Shared Task Overview |
11:30-11:45 | Oral Presentation: Repurformer: Transformers for Repurposing-Aware Molecule Generation |
11:45-12:00 | Oral Presentation: SciMind: A Multimodal Mixture-of-Experts Model for Advancing Pharmaceutical Sciences |
12:00-12:15 | Oral Presentation: ChatMol Copilot: An Agent for Molecular Modeling and Computation Powered by LLMs |
12:15-12:30 | Oral Presentation: Enhanced BioT5+ for Molecule-Text Translation: A Three-Stage Approach with Data Distillation, Diverse Training, and Voting Ensemble |
12:30-13:30 | Lunch Break |
13:30-14:05 | Keynote Speech: Huan Sun |
14:05-14:40 | Keynote Speech: Lei Li |
14:40-15:30 | Panel Discussion: Huan Sun, Lei Li, Heng Ji |
15:30-16:00 | Coffee Break |
16:00-17:15 | Poster Session |
17:20-17:30 | Closing Remarks |
The archival proceedings are now online on the ACL Anthology.
Accepted papers can also be read on OpenReview.
Thao Nguyen, Tiara Torres-Flores, Changhyun Hwang, Carl Edwards, Ying Diao, Heng Ji
This paper presents a novel approach for predicting Power Conversion Efficiency (PCE) of Organic Photovoltaic (OPV) devices, called GLaD: synergizing molecular Graphs and Language Descriptors for enhanced PCE prediction. Due to the lack of high-quality experimental data, we collect a dataset consisting of 500 pairs of OPV donor and acceptor molecules along with their corresponding PCE values, which we utilize as the training data for our predictive model. In this low-data regime, GLaD leverages properties learned from large language models (LLMs) pretrained on extensive scientific literature to enrich molecular structural representations, allowing for a multimodal representation of molecules. GLaD achieves precise predictions of PCE, thereby facilitating the synthesis of new OPV molecules with improved efficiency. Furthermore, GLaD showcases versatility, as it applies to a range of molecular property prediction tasks (BBBP, BACE, ClinTox, and SIDER), not limited to those concerning OPV materials. In particular, GLaD proves valuable for tasks in low-data regimes within the chemical space, as it enriches molecular representations by incorporating molecular property descriptions learned from large-scale pretraining. This capability is significant in real-world scientific endeavors like drug and material discovery, where access to comprehensive data is crucial for informed decision-making and efficient exploration of the chemical space.
Namkyeong Lee, Siddhartha Laghuvarapu, Chanyoung Park, Jimeng Sun
Understanding molecules and their textual descriptions via molecule language models (MoLM) has recently seen a surge of interest among researchers. However, unique challenges exist in the field of MoLM due to 1) a limited amount of molecule-text paired data and 2) missing expertise caused by the specialized areas of focus among experts. To this end, we propose AMOLE, which 1) augments molecule-text pairs with a structural-similarity-preserving loss, and 2) transfers expertise between molecules. Extensive experiments on various downstream tasks demonstrate the superiority of AMOLE in comprehending molecules and their descriptions, highlighting its potential for application in real-world drug discovery.
Defne Circi, Ghazal Khalighinejad, Anlan Chen, Bhuwan Dhingra, L. Brinson
Advances in materials science depend on leveraging data from the vast published literature. Extracting detailed data and metadata from these publications is challenging, leading current data repositories to rely on newly created data in narrow domains. Large Language Models (LLMs) offer a new opportunity to rapidly and accurately extract data and insights from the published literature, transforming it into structured formats for easy querying and reuse. This paper explores using LLMs for autonomous data extraction from materials science articles, focusing on polymer composites to demonstrate successes and challenges in extracting tabular data. We explored different table representations for use with LLMs, finding that a multimodal model with an image input yielded the most promising results. This model achieved an accuracy score of 0.910 for composition information extraction, which includes polymer names, molecule names used as fillers, and their respective compositions. Additionally, it achieved an F_1 score of 0.863 for property name information extraction. With the most conservative evaluation for property extraction, requiring an exact match in all details, we obtained an F_1 score of 0.419. We observed that by allowing varying degrees of flexibility in the evaluation, the score can increase to 0.769. We envision that the results and analysis from this study will promote further research directions in developing information extraction strategies from materials information sources.
Botao Yu, Frazier N. Baker, Ziqi Chen, Xia Ning, Huan Sun
Chemistry plays a crucial role in many domains, such as drug discovery and material science. While large language models (LLMs) such as GPT-4 exhibit remarkable capabilities on natural language processing tasks, existing research indicates that their performance on chemistry tasks is discouragingly low. In this paper, however, we demonstrate that our developed LLMs can achieve very strong results on a comprehensive set of chemistry tasks, outperforming the most advanced GPT-4 and Claude 3 Opus by a substantial margin. To accomplish this, we propose SMolInstruct, a large-scale, comprehensive, and high-quality dataset for instruction tuning. It contains 14 selected chemistry tasks and over three million samples, laying a solid foundation for training and evaluating LLMs for chemistry. Using SMolInstruct, we fine-tune a set of open-source LLMs named LlaSMol, among which we find that Mistral serves as the best base model for chemistry tasks. Our analysis further demonstrates the critical role of the proposed dataset in driving the performance improvements.
Please email language.molecules@gmail.com if you have any questions.
This workshop will be partially supported by the Molecule Maker Lab Institute: an AI research institute program supported by NSF under award No. 2019897 and No. 2034562. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.