
The Yang lab is building an integrated enzymology data ecosystem that makes the “dark matter” of enzyme kinetics accessible for predictive modeling and method development. IntEnzyDB provides a fast, flattened relational architecture that unifies enzyme structure and function data across six EC classes, exposes a public web interface for streamlined access, and enables facile statistical modeling and machine learning. Using 1,050 structure-kinetics pairs from IntEnzyDB, we found that efficiency-enhancing mutations are globally encoded throughout the enzyme, whereas deleterious mutations concentrate near the active site.
To transform literature into model-ready data at scale, we developed EnzyExtract, a large language model pipeline that processes full-text PDF and XML files to automatically extract, verify, and structure enzyme-substrate-kinetics records. From 137,892 publications, EnzyExtract assembled >218,095 entries, including 218,095 kcat and 167,794 Km values, mapped across 3,569 unique four-digit EC numbers (84,464 entries were assigned at least a first-digit EC number). The pipeline uncovered 89,544 kinetic entries absent from BRENDA and, after enzymes and substrates were aligned to UniProt and PubChem, yielded 92,286 high-confidence sequence-mapped records, compiled as EnzyExtractDB. Benchmarking shows high accuracy against manual curation and strong consistency with BRENDA. Retraining state-of-the-art kcat predictors (MESI, DLKcat, TurNuP) on EnzyExtractDB improves RMSE, MAE, and R² on held-out test sets.
Together, IntEnzyDB and EnzyExtractDB supply the breadth, structure linkage, and quality needed to power generalizable, data-driven enzyme engineering.
IntEnzyDB development was led by Bailu Yan and Xinchun Ran; EnzyExtractDB development was led by Galen Wei and Xinchun Ran.
Source code:
https://github.com/ChemBioHTP/IntEnzyDB
https://github.com/ChemBioHTP/EnzyExtract
Web Interface:
https://colab.research.google.com/drive/1MwKSEZzLPNOseksRshbzkkFoO_cgJhva
Publications:
https://onlinelibrary.wiley.com/doi/full/10.1002/pro.70251
https://pubs.acs.org/doi/10.1021/acs.jcim.2c01139