Computer Science > Computational Engineering, Finance, and Science
[Submitted on 30 Apr 2024]
Title:ZSMILES: an approach for efficient SMILES storage for random access in Virtual Screening
View PDFAbstract:Virtual screening is a technique used in drug discovery to select the most promising molecules to test in a lab. To perform virtual screening, we need a large set of molecules as input, and storing these molecules can become an issue. In fact, extreme-scale high-throughput virtual screening applications require a big dataset of input molecules and produce an even bigger dataset as output. These molecules' databases occupy tens of TB of storage space, and domain experts frequently sample a small portion of this data. In this context, SMILES is a popular data format for storing large sets of molecules since it requires significantly less space to represent molecules than other formats (e.g., MOL2, SDF). This paper proposes an efficient dictionary-based approach to compress SMILES-based datasets. This approach takes advantage of domain knowledge to provide a readable output with separable SMILES, enabling random access. We examine the benefits of storing these datasets using ZSMILES to reduce the cold storage footprint in HPC systems. The main contributions concern a custom dictionary-based approach and a data pre-processing step. From experimental results, we can notice how ZSMILES leverage domain knowledge to compress x1.13 more than state of the art in similar scenarios and up to $0.29$ compression ratio. We tested a CUDA version of ZSMILES targetting NVIDIA's GPUs, showing a potential speedup of 7x.
References & Citations
Bibliographic and Citation Tools
Bibliographic Explorer (What is the Explorer?)
Litmaps (What is Litmaps?)
scite Smart Citations (What are Smart Citations?)
Code, Data and Media Associated with this Article
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub (What is DagsHub?)
Gotit.pub (What is GotitPub?)
Papers with Code (What is Papers with Code?)
ScienceCast (What is ScienceCast?)
Demos
Recommenders and Search Tools
Influence Flower (What are Influence Flowers?)
Connected Papers (What is Connected Papers?)
CORE Recommender (What is CORE?)
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs.