Digital Discovery | From Subtle to Significant: The PANIP Model Enables High-Precision Characterization of Non-Covalent Interactions Between Protein Fragments

Publication Date: 2026/06/24

Macroscopic biological processes are ultimately driven by atomic-level interactions. Among them, non-covalent interactions (NCIs) dominate core processes such as protein folding, molecular recognition, and drug–target binding, and they form the foundation for deciphering biological function and for the rational design of new drugs. In computational biology, describing NCIs involves a trade-off: some methods pursue physical accuracy to the extreme, while others emphasize computational speed. Combining both strengths within conventional classical force fields or quantum mechanical frameworks remains challenging. This is a critical bottleneck that limits broader applications of computational biology, a “last mile” still awaiting resolution.


Classical force fields are renowned for their efficiency and are currently the mainstream choice for biomolecular simulation. However, their physical approximations introduce inherent errors. Fixed atomic point charges cannot capture electronic polarization, which leads to insufficient accuracy in describing distinctive NCIs such as ion–π interactions, π–π stacking, halogen bonding, and metal chelation. At the same time, traditional van der Waals potentials tend to overestimate short-range repulsion, which can misrepresent molecular interactions and undermine the reliability of simulation predictions. Quantum mechanics (QM) can resolve NCIs with first-principles accuracy, but its prohibitive computational cost makes it difficult to apply to large-scale simulations of complex biological systems.


In recent years, machine‑learning interatomic potentials (MLIPs) have been regarded as a promising solution to break the long‑standing trade‑off between accuracy and efficiency, offering the prospect of achieving both simultaneously. However, the precision and generalizability of such models heavily depend on the quality, diversity, and representativeness of the training datasets.


To address the long-standing challenge of reconciling accuracy and efficiency, the research team led by Dr. Niu Huang at the National Institute of Biological Sciences, Beijing (NIBS) and the Tsinghua Institute of Multidisciplinary Biomedical Research, Tsinghua University, published their latest work in Digital Discovery. Drawing on the iterative development of classical force fields, the study adopts a “bottom-up, divide-and-conquer” modeling strategy: starting from the interactions of small molecular fragments, the team gradually constructed a multilayer, comprehensive QM dataset of pairwise fragment NCIs. They then developed PANIP (PAirwise Non-covalent Interaction Potential), an MLIP tailored specifically to protein fragment NCIs. A multi-fidelity active learning (MFAL) strategy efficiently screens representative samples from massive protein fragment datasets, achieving near-QM-level accuracy with a small data volume. PANIP is a new tool for studying protein NCIs and shows broad application potential in drug docking and virtual screening.


Intelligent Data Selection: Reducing Tens of Millions of Samples to 8.7% While Balancing Diversity and Cost


The research team constructed the dataset from the Protein Data Bank (PDB). They first screened 29,204 high-resolution protein structures and decomposed them into 17 types of chemical fragments, including amino-acid side chains, backbone motifs, and water molecules. Fragment dimers within a heavy-atom distance of 2–4 Å were selected as NCI candidates. This process yielded an initial dataset of 36.3 million dimers covering 153 combination types.
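
The figure of 153 combination types is consistent with counting unordered pairs of the 17 fragment types when a type is also allowed to pair with itself (136 + 17 = 153). The sketch below checks that count and illustrates one possible reading of the 2–4 Å criterion. The fragment layout (lists of `(element, x, y, z)` tuples), the use of the minimum heavy-atom distance, and the function names are assumptions made for illustration, not details taken from the paper.

```python
import math
from itertools import combinations_with_replacement

N_FRAGMENT_TYPES = 17  # number of fragment types reported in the article

# Unordered pairs of fragment types, where a type may pair with itself:
# C(17, 2) + 17 = 136 + 17 = 153.
pair_types = list(combinations_with_replacement(range(N_FRAGMENT_TYPES), 2))
print(len(pair_types))  # 153


def min_heavy_atom_distance(frag_a, frag_b):
    """Smallest distance (in Å) between any two non-hydrogen atoms.

    Each fragment is a list of (element, x, y, z) tuples (hypothetical layout).
    """
    heavy_a = [atom[1:] for atom in frag_a if atom[0] != "H"]
    heavy_b = [atom[1:] for atom in frag_b if atom[0] != "H"]
    return min(math.dist(p, q) for p in heavy_a for q in heavy_b)


def is_candidate_dimer(frag_a, frag_b, lo=2.0, hi=4.0):
    """Assumed filter: keep dimers whose closest heavy atoms are 2-4 Å apart."""
    return lo <= min_heavy_atom_distance(frag_a, frag_b) <= hi


# Toy example: two water-like fragments with oxygens 2.8 Å apart.
water_a = [("O", 0.0, 0.0, 0.0), ("H", 0.96, 0.0, 0.0), ("H", -0.24, 0.93, 0.0)]
water_b = [("O", 2.8, 0.0, 0.0), ("H", 3.76, 0.0, 0.0), ("H", 2.56, 0.93, 0.0)]
water_far = [("O", 6.0, 0.0, 0.0), ("H", 6.96, 0.0, 0.0), ("H", 5.76, 0.93, 0.0)]

print(is_candidate_dimer(water_a, water_b))    # oxygens 2.8 Å apart
print(is_candidate_dimer(water_a, water_far))  # oxygens 6.0 Å apart
```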


Labelling every sample with high-level QM calculations would be prohibitive in both computation and time. To overcome this challenge, the team proposed a hierarchical MFAL workflow (Figure 1). First, the low-cost r²SCAN-3c method was used to perform preliminary energy calculations for all samples. A machine-learning surrogate model was then used to iteratively identify “key samples” with high prediction errors and high information value, gradually expanding the training set. Ultimately, about 3.15 million representative samples were selected from the 36.3 million dimers to construct the PDB-FRAGID dataset. This compact dataset is only 8.7% of the original size, yet it preserves the chemical features and conformational diversity of all 17 fragment types and 153 fragment combinations, covering typical protein NCI motifs such as hydrogen bonding, electrostatics, cation–π interactions, and sulfur-driven contacts. The team then labelled the PDB-FRAGID dataset with the high-accuracy ωB97X-D3BJ/def2-TZVPP method, providing a robust foundation of high-quality data for training the PANIP model.
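
The selection idea described above can be sketched as a simple error-driven active learning loop. This is a toy illustration only: the article does not state the surrogate architecture, the selection criterion beyond "high prediction errors and high information value", or the batch sizes, so the nearest-neighbour surrogate, the function names, and all parameters below are assumptions.

```python
import math
import random


def predict_1nn(train_x, train_y, query_x):
    """Toy surrogate: predict each query with the label of its nearest training point."""
    preds = []
    for q in query_x:
        j = min(range(len(train_x)), key=lambda i: abs(train_x[i] - q))
        preds.append(train_y[j])
    return preds


def active_select(pool_x, cheap_y, n_init=5, batch=5, n_rounds=4, seed=0):
    """Grow a training set by repeatedly adding the worst-predicted pool samples.

    cheap_y plays the role of low-cost labels available for the whole pool.
    """
    rng = random.Random(seed)
    selected = set(rng.sample(range(len(pool_x)), n_init))
    for _ in range(n_rounds):
        idx = sorted(selected)
        preds = predict_1nn([pool_x[i] for i in idx], [cheap_y[i] for i in idx], pool_x)
        # Rank the not-yet-selected samples by surrogate error on the cheap labels.
        errors = sorted(
            ((abs(p - y), i) for i, (p, y) in enumerate(zip(preds, cheap_y))
             if i not in selected),
            reverse=True,
        )
        selected.update(i for _, i in errors[:batch])
    return sorted(selected)


# Toy pool: a 1-D descriptor with a cheap label for every sample.
pool_x = [i / 100 for i in range(200)]
cheap_y = [math.sin(3 * x) + 0.5 * x for x in pool_x]

chosen = active_select(pool_x, cheap_y)
print(len(chosen), "of", len(pool_x), "samples selected")
# Only the selected samples would then be sent to the expensive reference method.

# Retained fraction reported in the article: 3.15 million of 36.3 million dimers.
retained_percent = round(100 * 3.15 / 36.3, 1)
print(retained_percent)  # 8.7
```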



Figure 1. Workflow for Dataset Construction and Model Training


Outstanding Model Performance: Quantum‑Level Accuracy and Industry‑Leading Generalization


PANIP is built on the NequIP equivariant graph neural network framework, which can accurately capture NCI differences caused by the spatial orientation of atoms. Across multiple independent benchmark datasets, the model demonstrates remarkable accuracy, conformational adaptability, and cross-system generalization capability:

  1. Superior performance on protein‑derived systems: For equilibrium dimers extracted from proteins, the PANIP mean absolute error (MAE) is as low as 0.09 kcal/mol, highly consistent with high-precision QM‑level results. Even for geometry‑optimized structures and non‑equilibrium conformations, errors remain within chemically acceptable limits.
  2. Robust cross‑system generalization: In external test sets including small‑molecule fragments from the Cambridge Structural Database (CSD) and randomly sampled non‑equilibrium conformations, PANIP maintains high-precision output, demonstrating that the model is not limited to protein environments and can adapt to diverse molecular systems.
  3. Demonstrated superiority over mainstream models: Compared to the widely used AIMNet2 potential, PANIP significantly reduces errors in challenging scenarios such as charged systems, strongly interacting dimers, and sulfur‑driven interactions. In widely adopted benchmarks such as GMTKN55 and a recent ChemRxiv benchmark developed for general‑purpose ML potentials targeting intermolecular and noncovalent interactions, PANIP achieves across‑the‑board superior performance.
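
For readers unfamiliar with the metric quoted above, the mean absolute error is the average of the absolute differences between predicted and reference interaction energies. The sketch below shows the calculation; the energy values are invented for illustration and are not results from the paper.

```python
def mean_absolute_error(predicted, reference):
    """Average absolute difference between two equal-length lists of energies."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Hypothetical interaction energies in kcal/mol (illustrative values only).
reference_energies = [-5.0, -2.0, -0.5]
predicted_energies = [-5.1, -1.9, -0.6]

mae = mean_absolute_error(predicted_energies, reference_energies)
print(f"MAE = {mae:.2f} kcal/mol")
```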


In terms of computational efficiency, PANIP achieves an even greater leap forward: compared to high‑precision ωB97X‑D3BJ/def2‑TZVPP level calculations, the model processing speed is improved by more than two orders of magnitude. Even compared to AIMNet2, end‑to‑end computing efficiency is about 1.3‑fold higher, truly realizing quantum‑level accuracy at force‑field speed.


Leveraging PANIP’s efficient computational capability, the team further completed large‑scale energy analyses of 36.3 million protein fragment dimers, systematically analyzing the spatial distribution and energetic characteristics of typical non‑covalent interactions such as cation–π contacts and methionine–aromatic sulfur interactions. This study uncovered multiple previously under‑reported interaction motifs and deepened the understanding of protein microscopic interaction mechanisms (Figure 2).



Figure 2. Spatial distribution and representative low‑energy structures of ETAM-PMPO (a), ETAM-MIND (b), and MBZ-MSM (c) dimers


Application to Protein–Ligand Systems: Transforming into an Efficient Scoring Function for Docking and Conformation Prediction


The study further expanded PANIP's application scenarios by combining it with a fragmentation-based energy prediction scheme, turning it into a fragmentation-based scoring function for protein–ligand docking and binding conformation ranking. Classic model systems such as T4 lysozyme mutants and pyruvate kinase M2 (PKM2) were selected, for a total of 22 protein–ligand complexes. In half of the test systems, PANIP ranked the native crystal binding conformation as the top docking result. Compared with the traditional AMBER force-field scoring function embedded in DOCK, PANIP significantly improves the ranking accuracy of the native conformation and reduces the root-mean-square deviation (RMSD) of the best-predicted conformations. Remarkably, even without long-range electrostatics or solvation corrections, PANIP's precise short-range non-covalent interaction calculations deliver overall performance comparable to mainstream machine-learning scoring models, supporting its practical value in industrial scenarios such as drug docking and virtual screening.
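
A fragmentation-based scoring function of this kind can be sketched as a sum of pairwise fragment energies per pose, followed by ranking and an RMSD check against the crystal pose. The code below is a minimal sketch under stated assumptions: `toy_pair_energy` is a placeholder standing in for a trained potential (it is not PANIP), fragments are reduced to single points, and the article does not describe how the actual fragmentation or summation is done.

```python
import math


def toy_pair_energy(protein_frag, ligand_frag):
    """Placeholder pairwise energy with a well near 3 Å (NOT the PANIP model)."""
    d = math.dist(protein_frag, ligand_frag)
    return -1.0 / (1.0 + (d - 3.0) ** 2)


def score_pose(protein_frags, ligand_frags, pair_energy):
    """Sum pairwise energies over all protein-fragment / ligand-fragment pairs."""
    return sum(pair_energy(p, l) for p in protein_frags for l in ligand_frags)


def rank_poses(protein_frags, poses, pair_energy):
    """Return pose indices ordered from lowest (most favourable) score upward."""
    scores = [score_pose(protein_frags, pose, pair_energy) for pose in poses]
    return sorted(range(len(poses)), key=scores.__getitem__)


def rmsd(coords_a, coords_b):
    """In-place RMSD between matched coordinates (no superposition)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


# Toy system: one protein fragment at the origin, two candidate ligand poses.
protein = [(0.0, 0.0, 0.0)]
poses = [
    [(6.0, 0.0, 0.0)],  # pose 0: far from the protein fragment
    [(3.0, 0.0, 0.0)],  # pose 1: at the energy minimum of the toy potential
]
ranking = rank_poses(protein, poses, toy_pair_energy)
print(ranking)  # pose 1 ranks first with this toy potential

native = [(3.0, 0.0, 0.0)]
print(rmsd(poses[ranking[0]], native))  # deviation of the top-ranked pose
```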


Outlook


Methodologically, this work demonstrates that multi-fidelity active learning provides an efficient way to address redundancy in large-scale biomolecular datasets and to balance annotation cost against model performance. It also establishes a reusable paradigm for developing similar machine-learning interatomic potentials. At the application level, PANIP strengthens protein-specific, high-accuracy ML potentials, offering a low-cost, high-precision computational tool for protein engineering, molecular interaction mechanism analysis, and lead compound screening.


The research team stated that, at present, PANIP focuses on pairwise fragment interactions. Going forward, they plan to expand fragment chemical diversity and to introduce multi-body interactions, long-range electrostatics, and solvation models to further improve the model's capabilities. In the future, the tool is expected to complement classical force fields and general ML models, supporting the continued development of biomolecular simulation and computational drug development toward "high precision, high efficiency, and large scale."


Authors and Project Information


The first author of this work is Lejia Zeng, a doctoral student in the TIMBR program in Niu Huang's laboratory, and Researcher Niu Huang is the corresponding author. Doctoral students Xintong Zhang and Yuchan Pei, together with former postdoctoral researchers Lifeng Zhao, Lan Hua, and Jincai Yang, contributed to the early exploration and parts of the related work. This work was supported by the Beijing Municipal Science & Technology Commission, the Administrative Committee of Zhongguancun Science Park, and Tsinghua University. All research was carried out at the National Institute of Biological Sciences, Beijing.


The PANIP model, PDB‑FRAGID dataset, benchmark sets, and related code have been fully open‑sourced, and the resources are available via the project’s GitHub homepage.