Quantitative Biology
- [1] arXiv:2405.03707 [pdf, ps, other]
Title: Quantifying indirect and direct vaccination effects arising in the SIR model
Subjects: Populations and Evolution (q-bio.PE)
Vaccination campaigns have both direct and indirect effects that act to control an infectious disease as it spreads through a population. Indirect effects arise when vaccinated individuals block disease transmission in any infection chains they are part of, and this in turn can benefit both vaccinated and unvaccinated individuals. Indirect effects are difficult to quantify in practice, but here, working with the Susceptible-Infected-Recovered (SIR) model, they are analytically calculated in important cases, through pivoting on the Final Size formula for epidemics. Their relationship to herd immunity is also clarified. Furthermore, we identify the important distinction between quantifying indirect effects of vaccination at the "population level" versus the "per capita" individual level, which often results in radically different conclusions. As an important example, the analysis unpacks why the population-level indirect effect can appear significantly larger than its per capita analogue. In addition, we consider a recently proposed epidemiological non-pharmaceutical intervention used during COVID-19, referred to as "shielding", and study its impact in our mathematical analysis. The shielding scheme is extended by inclusion of limited vaccination.
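As a rough illustration of the Final Size formula this abstract refers to, the sketch below solves the standard SIR final-size relation a = 1 - exp(-R0 * s0 * a) by fixed-point iteration. The function name `final_size`, the parameter `s0` (fraction of the population left susceptible), and the assumption of a perfectly protective vaccine given before the outbreak are illustrative choices, not taken from the paper, and the code does not reproduce the paper's direct/indirect decomposition.

```python
import math


def final_size(r0: float, s0: float = 1.0, tol: float = 1e-12,
               max_iter: int = 10_000) -> float:
    """Attack rate `a` among initially susceptible individuals.

    Solves a = 1 - exp(-r0 * s0 * a), the classic SIR final-size relation,
    where s0 is the fraction of the population that is susceptible at the
    start (assumption: vaccinated individuals are fully immune).
    """
    c = r0 * s0  # effective reproduction number at the start
    if c <= 1.0:
        return 0.0  # at or below threshold: no major epidemic
    a = 1.0  # start above the root; the iteration decreases towards it
    for _ in range(max_iter):
        a_next = 1.0 - math.exp(-c * a)
        if abs(a_next - a) < tol:
            return a_next
        a = a_next
    return a


if __name__ == "__main__":
    r0 = 2.0  # hypothetical basic reproduction number
    for coverage in (0.0, 0.25, 0.5):
        s0 = 1.0 - coverage
        a = final_size(r0, s0)
        # s0 * a is the fraction of the whole population that gets infected
        print(f"coverage={coverage:.2f}  attack rate={a:.4f}  total infected={s0 * a:.4f}")
```

With these assumptions the epidemic vanishes once coverage reaches 1 - 1/R0, which is the usual herd immunity threshold.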
- [2] arXiv:2405.03726 [pdf, ps, other]
Title: sc-OTGM: Single-Cell Perturbation Modeling by Solving Optimal Mass Transport on the Manifold of Gaussian Mixtures
Authors: Andac Demir, Elizaveta Solovyeva, James Boylan, Mei Xiao, Fabrizio Serluca, Sebastian Hoersch, Jeremy Jenkins, Murthy Devarakonda, Bulent Kiziltan
Comments: ICLR 2024, Machine Learning for Genomics Explorations Workshop
Subjects: Genomics (q-bio.GN); Machine Learning (cs.LG)
Influenced by breakthroughs in LLMs, single-cell foundation models are emerging. While these models show successful performance in cell type clustering, phenotype classification, and gene perturbation response prediction, it remains to be seen if a simpler model could achieve comparable or better results, especially with limited data. This is important, as the quantity and quality of single-cell data typically fall short of the standards in textual data used for training LLMs. Single-cell sequencing often suffers from technical artifacts, dropout events, and batch effects. These challenges are compounded in a weakly supervised setting, where the labels of cell states can be noisy, further complicating the analysis. To tackle these challenges, we present sc-OTGM, streamlined with fewer than 500K parameters, making it approximately 100x more compact than the foundation models and offering an efficient alternative. sc-OTGM is an unsupervised model grounded in the inductive bias that scRNA-seq data can be generated from a finite combination of multivariate Gaussian distributions. The core function of sc-OTGM is to create a probabilistic latent space utilizing a GMM as its prior distribution and to distinguish between distinct cell populations by learning their respective marginal PDFs. It uses a Hit-and-Run Markov chain sampler to determine the OT plan across these PDFs within the GMM framework. We evaluated our model against a CRISPR-mediated perturbation dataset, called CROP-seq, consisting of 57 one-gene perturbations. Our results demonstrate that sc-OTGM is effective in cell state classification, aids in the analysis of differential gene expression, and ranks genes for target identification through a recommender system. It also predicts the effects of single-gene perturbations on downstream gene regulation and generates synthetic scRNA-seq data conditioned on specific cell states.
- [3] arXiv:2405.03829 [pdf, ps, html, other]
Title: Unsupervised Machine Learning Identifies Latent Ultradian States in Multi-Modal Wearable Sensor Signals
Subjects: Neurons and Cognition (q-bio.NC)
Wearable sensors such as smartwatches have become ubiquitous in recent years, allowing the easy and continual measurement of physiological parameters such as heart rate, physical activity, body temperature, and blood glucose in an every-day setting. This multi-modal data offers the potential to identify latent states occurring across physiological measures, which may represent important bio-behavioural states that could not be observed in any single measure. Here we present an approach, utilising a hidden semi-Markov model, to identify such states in data collected using a smartwatch, electrocardiogram, and blood glucose monitor, over two weeks from a sample of 9 participants. We found 26 latent ultradian states across the sample, with many occurring at particular times of day. Here we describe some of these, as well as their association with subjective mood and time use diaries. These methods provide a novel avenue for developing insights into the physiology of everyday life.
- [4] arXiv:2405.03861 [pdf, ps, html, other]
Title: Homeostasis in Input-Output Networks: Structure, Classification and Applications
Comments: 45 pages, 26 figures, submitted to the MBS special issue "Dynamical Systems in Life Sciences"
Subjects: Molecular Networks (q-bio.MN); Combinatorics (math.CO); Dynamical Systems (math.DS); Biological Physics (physics.bio-ph)
Homeostasis is concerned with regulatory mechanisms, present in biological systems, where some specific variable is kept close to a set value as some external disturbance affects the system. Mathematically, the notion of homeostasis can be formalized in terms of an input-output function that maps the parameter representing the external disturbance to the output variable that must be kept within a fairly narrow range. This observation inspired the introduction of the notion of infinitesimal homeostasis, namely, the derivative of the input-output function is zero at an isolated point. This point of view allows for the application of methods from singularity theory to characterize infinitesimal homeostasis points (i.e. critical points of the input-output function). In this paper we review the infinitesimal approach to the study of homeostasis in input-output networks. An input-output network is a network with two distinguished nodes, 'input' and 'output', and the dynamics of the network determines the corresponding input-output function of the system. This class of dynamical systems provides an appropriate framework to study homeostasis, and several important biological systems can be formulated in this context. Moreover, this approach, coupled to graph-theoretic ideas from combinatorial matrix theory, provides a systematic way of classifying different types of homeostasis (homeostatic mechanisms) in input-output networks, in terms of the network topology. In turn, this leads to new mathematical concepts, such as homeostasis subnetworks, homeostasis patterns, and homeostasis mode interaction. We illustrate the usefulness of this theory with several biological examples: biochemical networks, chemical reaction networks (CRN), gene regulatory networks (GRN), intracellular metal ion regulation and so on.
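The definition of infinitesimal homeostasis given in this abstract can be written out compactly. The notation below ($X$ for the network state, $F$ for the vector field, $\mathcal{I}$ for the input parameter, $x_o$ for the output coordinate) is an assumption made here for illustration and is not quoted from the paper.

```latex
% Network dynamics with input parameter \mathcal{I} (notation assumed):
\dot{X} = F(X, \mathcal{I}), \qquad F\bigl(X(\mathcal{I}), \mathcal{I}\bigr) = 0 .
% The input-output function is the output coordinate of the equilibrium branch:
\mathcal{I} \mapsto x_o(\mathcal{I}) .
% Infinitesimal homeostasis occurs at an isolated point \mathcal{I}_0 where
x_o'(\mathcal{I}_0) = 0 .
```

Near such a point the output changes only at second or higher order in the disturbance, which is why it is read as approximate homeostasis.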
- [5] arXiv:2405.03913 [pdf, ps, html, other]
Title: Digital Twin Calibration for Biological System-of-Systems: Cell Culture Manufacturing Process
Comments: 12 pages, 5 figures
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Machine Learning (stat.ML)
Biomanufacturing innovation relies on an efficient design of experiments (DoE) to optimize processes and product quality. Traditional DoE methods, ignoring the underlying bioprocessing mechanisms, often suffer from a lack of interpretability and sample efficiency. This limitation motivates us to create a new optimal learning approach that can guide sequential DoEs for digital twin model calibration. In this study, we consider a multi-scale mechanistic model for the cell culture process, also known as Biological Systems-of-Systems (Bio-SoS), as our digital twin. This model, with a modular design composed of sub-models, allows us to integrate data across various production processes. To calibrate the Bio-SoS digital twin, we evaluate the mean squared error of model prediction and develop a computational approach to quantify the impact of the parameter estimation error of individual sub-models on the prediction accuracy of the digital twin, which can guide sample-efficient and interpretable DoEs.
- [6] arXiv:2405.04011 [pdf, ps, html, other]
Title: Adjoint Sensitivity Analysis on Multi-Scale Bioprocess Stochastic Reaction Network
Comments: 11 pages, 2 figures
Subjects: Molecular Networks (q-bio.MN); Machine Learning (stat.ML)
Motivated by the pressing challenges in digital twin development for biomanufacturing processes, we introduce an adjoint sensitivity analysis (SA) approach to expedite the learning of mechanistic model parameters. In this paper, we consider enzymatic stochastic reaction networks representing a multi-scale bioprocess mechanistic model that allows us to integrate disparate data from diverse production processes and leverage the information from existing macro-kinetic and genome-scale models. To support forward prediction and backward reasoning, we develop a convergent adjoint SA algorithm that studies how perturbations of model parameters and inputs (e.g., initial state) propagate through enzymatic reaction networks and affect output trajectory predictions. This SA can provide a sample-efficient and interpretable way to assess the sensitivities between inputs and outputs, accounting for their causal dependencies. Our empirical study underscores the resilience of these sensitivities and shows how they deepen understanding of the regulatory mechanisms behind the bioprocess.
- [7] arXiv:2405.04248 [pdf, ps, other]
Title: Neurocomputational Phenotypes in Female and Male Autistic Individuals
Comments: 10 pages, 2 figures, 4 tables. Submitted to Journal of Science and Health, University of Alabama
Subjects: Neurons and Cognition (q-bio.NC); Chaotic Dynamics (nlin.CD)
Autism Spectrum Disorder (ASD) is characterized by an altered phenotype in social interaction and communication. Additionally, autism typically manifests differently in females as opposed to males: a phenomenon that has likely led to long-term problems in diagnostics of autism in females. These sex-based differences in communicative behavior may originate from differences in neurocomputational properties of brain organization. The present study looked to examine the relationship between one neurocomputational measure of brain organization, the local power-law exponent, in autistic vs. neurotypical, as well as male vs. female participants. To investigate the autistic phenotype in neural organization based on biological sex, we collected continuous resting-state EEG data for 19 autistic young adults (10 F), and 23 controls (14 F), using a 64-channel Net Station EEG acquisition system. The data was analyzed to quantify the 1/f power spectrum. Correlations between power-law exponent and behavioral measures were calculated in a between-group (female vs. male; autistic vs. neurotypical) design. On average, the power-law exponent was significantly greater in the male ASD group than in the female ASD group in fronto-central regions. The differences were more pronounced over the left hemisphere, suggesting neural organization differences in regions responsible for language complexity. These differences provide a potential explanation for behavioral variances in female vs. male autistic young adults.
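For readers unfamiliar with the 1/f power-law exponent mentioned in this abstract, one common way to estimate it is a straight-line fit of log power against log frequency. The sketch below is a generic least-squares version on synthetic data; the function name `power_law_exponent` and the synthetic spectrum are hypothetical, and the abstract does not state which fitting procedure the authors used.

```python
import math


def power_law_exponent(freqs, power):
    """Estimate beta in P(f) ~ f**(-beta) by least squares in log-log space."""
    xs = [math.log10(f) for f in freqs]
    ys = [math.log10(p) for p in power]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least-squares slope of log-power versus log-frequency
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope  # the exponent is the negative of the log-log slope


if __name__ == "__main__":
    # Synthetic, noise-free spectrum with a known exponent (illustration only)
    freqs = [float(f) for f in range(1, 41)]
    power = [f ** -1.5 for f in freqs]
    print(power_law_exponent(freqs, power))
```

Real EEG spectra contain oscillatory peaks and noise, so in practice the fit range and the handling of peaks matter considerably more than in this toy case.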
New submissions for Wednesday, 8 May 2024 (showing 7 of 7 entries)
- [8] arXiv:2405.03799 (cross-list from cs.LG) [pdf, ps, html, other]
Title: Synthetic Data from Diffusion Models Improve Drug Discovery Prediction
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Artificial intelligence (AI) is increasingly used in every stage of drug development. Continuing breakthroughs in AI-based methods for drug discovery require the creation, improvement, and refinement of drug discovery data. We posit a new data challenge that slows the advancement of drug discovery AI: datasets are often collected independently from each other, often with little overlap, creating data sparsity. Data sparsity makes data curation difficult for researchers looking to answer key research questions requiring values posed across multiple datasets. We propose a novel diffusion GNN model Syngand capable of generating ligand and pharmacokinetic data end-to-end. We show and provide a methodology for sampling pharmacokinetic data for existing ligands using our Syngand model. We show the initial promising results on the efficacy of the Syngand-generated synthetic target property data on downstream regression tasks with AqSolDB, LD50, and hERG central. Using our proposed model and methodology, researchers can easily generate synthetic ligand data to help them explore research questions that require data spanning multiple datasets.
- [9] arXiv:2405.03879 (cross-list from stat.ML) [pdf, ps, html, other]
Title: Scalable Amortized GPLVMs for Single Cell Transcriptomics Data
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Genomics (q-bio.GN); Applications (stat.AP)
Dimensionality reduction is crucial for analyzing large-scale single-cell RNA-seq data. Gaussian Process Latent Variable Models (GPLVMs) offer an interpretable dimensionality reduction method, but current scalable models lack effectiveness in clustering cell types. We introduce an improved model, the amortized stochastic variational Bayesian GPLVM (BGPLVM), tailored for single-cell RNA-seq with specialized encoder, kernel, and likelihood designs. This model matches the performance of the leading single-cell variational inference (scVI) approach on synthetic and real-world COVID datasets and effectively incorporates cell-cycle and batch information to reveal more interpretable latent structures as we demonstrate on an innate immunity dataset.
- [10] arXiv:2405.03931 (cross-list from math.DS) [pdf, ps, other]
Title: Incorporating changeable attitudes toward vaccination into an SIR infectious disease model
Comments: 30 pages, 3 tables, 10 figures
Subjects: Dynamical Systems (math.DS); Populations and Evolution (q-bio.PE)
We develop a mechanistic model that classifies individuals both in terms of epidemiological status (SIR) and vaccination attitude (willing or unwilling), with the goal of discovering how disease spread is influenced by changing opinions about vaccination. Analysis of the model identifies existence and stability criteria for both disease-free and endemic disease equilibria. The analytical results, supported by numerical simulations, show that attitude changes induced by disease prevalence can destabilize endemic disease equilibria, resulting in limit cycles.
- [11] arXiv:2405.04078 (cross-list from cs.LG) [pdf, ps, html, other]
Title: WISER: Weak supervISion and supErvised Representation learning to improve drug response prediction in cancer
Comments: ICML 2024
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Cancer, a leading cause of death globally, occurs due to genomic changes and manifests heterogeneously across patients. To advance research on personalized treatment strategies, the effectiveness of various drugs on cells derived from cancers ('cell lines') is experimentally determined in laboratory settings. Nevertheless, variations in the distribution of genomic data and drug responses between cell lines and humans arise due to biological and environmental differences. Moreover, while genomic profiles of many cancer patients are readily available, the scarcity of corresponding drug response data limits the ability to train machine learning models that can predict drug response in patients effectively. Recent cancer drug response prediction methods have largely followed the paradigm of unsupervised domain-invariant representation learning followed by a downstream drug response classification step. Introducing supervision in both stages is challenging due to heterogeneous patient response to drugs and limited drug response data. This paper addresses these challenges through a novel representation learning method in the first phase and weak supervision in the second. Experimental results on real patient data demonstrate the efficacy of our method (WISER) over state-of-the-art alternatives on predicting personalized drug response.
Cross submissions for Wednesday, 8 May 2024 (showing 4 of 4 entries)
- [12] arXiv:1610.09637 (replaced) [pdf, ps, other]
Title: Nonequilibrium and nonlinear kinetics as key determinants for bistability in fission yeast G2-M transition
Authors: De Zhao (1 and 2), Teng Wang (1), Jian Zhao (2 and 3), Dianjie Li (1), Zhili Lin (1), Zeyan Chen (1), Qi Ouyang (1), Hong Qian (4), Yu V. Fu (2 and 3), Fangting Li (1) ((1) Peking University, Beijing, (2) Chinese Academy of Sciences, Beijing, (3) University of Chinese Academy of Sciences, Beijing, (4) University of Washington, Seattle)
Comments: 53 pages, 4 figures
Subjects: Molecular Networks (q-bio.MN); Subcellular Processes (q-bio.SC)
A living cell is an open, nonequilibrium biochemical system where ATP hydrolysis serves as the energy source for a wide range of intracellular processes, possibly including the assurance for decision-making. In the fission yeast cell cycle, the transition from G2 to M phase is driven by the activation of Cdc13/Cdc2 and Cdc25 and the deactivation of Wee1 through phosphorylation-dephosphorylation cycles with feedback loops. Here, we present a kinetic description of the G2-M circuit which reveals that both cellular ATP level and ATP hydrolysis free energy critically control Cdc2 activation. Using fission yeast nucleoplasmic extract (YNPE), we experimentally verify that increased ATP level drives the activation of Cdc2 which exhibits bistability and hysteresis in response to changes in cellular ATP level and ATP hydrolysis energy. These findings suggest that cellular ATP level and ATP hydrolysis energy are determinants of the bistability and robustness of Cdc2 activation during G2-M transition.
- [13] arXiv:2201.03193 (replaced) [pdf, ps, html, other]
Title: The impact of life-history strategies on the stability of competitive ecological network
Subjects: Populations and Evolution (q-bio.PE)
In natural ecosystems, species can be characterized by the nonlinear density-dependent self-regulation of their growth profile. Species of many taxa show a substantial density-dependent reduction for low population size. Nevertheless, many show the opposite trend; density regulation is minimal for small populations and increases significantly when the population size is near the carrying capacity. The theta-logistic growth equation can portray the intraspecific density regulation in the growth profile, theta being the density regulation parameter. In this study, we examine the role of these different growth profiles on the stability of a competitive ecological community with the help of a mathematical model of competitive species interactions. This manuscript deals with the random matrix theory to understand the stability of the classical theta-logistic models of competitive interactions. Our results suggest that having more species with strong density dependence, which self-regulate at low densities, leads to more stable communities. With this, stability also depends on the complexity of the ecological network. Species network connectance (link density) shows a consistent trend of increasing stability, whereas community size (species richness) shows a context-dependent effect. We also interpret our results from the aspect of two different life history strategies: r and K-selection. Our results show that the stability of a competitive network increases with the fraction of r-selected species in the community. Our result is robust, irrespective of different network architectures.
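To make the role of theta in this abstract concrete, the sketch below uses the standard single-species theta-logistic equation dN/dt = r N (1 - (N/K)^theta) and a simple forward-Euler integration. The function names, parameter values, and step size are illustrative assumptions; the paper's multi-species competition model and random matrix analysis are not reproduced here.

```python
def per_capita_growth(n: float, r: float, k: float, theta: float) -> float:
    """Per-capita growth rate r * (1 - (n / k) ** theta) of the theta-logistic model."""
    return r * (1.0 - (n / k) ** theta)


def simulate(n0: float, r: float, k: float, theta: float,
             dt: float = 0.01, steps: int = 5000) -> float:
    """Forward-Euler integration of dN/dt = N * per_capita_growth(N)."""
    n = n0
    for _ in range(steps):
        n += dt * n * per_capita_growth(n, r, k, theta)
    return n


if __name__ == "__main__":
    r, k = 1.0, 100.0  # hypothetical growth rate and carrying capacity
    for theta in (0.5, 1.0, 2.0):
        # At a quarter of carrying capacity, small theta already suppresses growth
        # strongly, while large theta leaves per-capita growth close to r.
        print(theta, per_capita_growth(25.0, r, k, theta), simulate(1.0, r, k, theta))
```

With theta below 1, density regulation is already strong at low population size; with theta above 1, it stays weak until the population approaches the carrying capacity, matching the two growth profiles the abstract contrasts.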
- [14] arXiv:2303.11833 (replaced) [pdf, ps, html, other]
Title: Materials Discovery with Extreme Properties via Reinforcement Learning-Guided Combinatorial Chemistry
Authors: Hyunseung Kim (1), Haeyeon Choi (2), Dongju Kang (1), Won Bo Lee (1), Jonggeol Na (2) ((1) Seoul National University, (2) Ewha Womans University)
Comments: 18 pages, 8 figures
Journal-ref: Chemical Science, 2024
Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
The goal of most materials discovery is to discover materials that are superior to those currently known. Fundamentally, this is close to extrapolation, which is a weak point for most machine learning models that learn the probability distribution of data. Herein, we develop reinforcement learning-guided combinatorial chemistry, which is a rule-based molecular designer driven by trained policy for selecting subsequent molecular fragments to get a target molecule. Since our model has the potential to generate all possible molecular structures that can be obtained from combinations of molecular fragments, unknown molecules with superior properties can be discovered. We theoretically and empirically demonstrate that our model is more suitable for discovering better compounds than probability distribution-learning models. In an experiment aimed at discovering molecules that hit seven extreme target properties, our model discovered 1,315 of all target-hitting molecules and 7,629 of five target-hitting molecules out of 100,000 trials, whereas the probability distribution-learning models failed. Moreover, it has been confirmed that every molecule generated under the binding rules of molecular fragments is 100% chemically valid. To illustrate the performance in actual problems, we also demonstrate that our models work well on two practical applications: discovering protein docking molecules and HIV inhibitors.
- [15] arXiv:2308.16093 (replaced) [pdf, ps, other]
Title: Linking discrete and continuous models of cell birth and migration
Authors: W. Duncan Martinson, Alexandria Volkening, Markus Schmidtchen, Chandrasekhar Venkataraman, José A. Carrillo
Comments: 25 pages, 11 figures in main manuscript. 24 pages, 14 figures in supplementary information
Subjects: Cell Behavior (q-bio.CB)
Self-organisation of individuals within large collectives occurs throughout biology. Mathematical models can help elucidate the individual-level mechanisms behind these dynamics, but analytical tractability often comes at the cost of biological intuition. Discrete models provide straightforward interpretations by tracking each individual yet can be computationally expensive. Alternatively, continuous models supply a large-scale perspective by representing the "effective" dynamics of infinite agents, but their results are often difficult to translate into experimentally relevant insights. We address this challenge by quantitatively linking spatio-temporal dynamics of continuous models and individual-based data in settings with biologically realistic, time-varying cell numbers. Specifically, we introduce and fit scaling parameters in continuous models to account for discrepancies that can arise from low cell numbers and localised interactions. We illustrate our approach on an example motivated by zebrafish-skin pattern formation, in which we create a continuous framework describing the movement and proliferation of a single cell population by upscaling rules from a discrete model. Our resulting continuous models accurately depict ensemble average agent-based solutions when migration or proliferation act alone. Interestingly, the same parameters are not optimal when both processes act simultaneously, highlighting a rich difference in how combining migration and proliferation affects discrete and continuous dynamics.
- [16] arXiv:2309.02708 (replaced) [pdf, ps, other]
Title: Cooling down and waking up: feedback cooling switches an unconsciousness neural computer into a conscious quantum computer
Comments: 37 pages, 3 figures. Text reorganised; some text split off and placed at this https URL
Subjects: Neurons and Cognition (q-bio.NC); Biological Physics (physics.bio-ph)
This paper presents a theory of how feedback cooling in the brain reduces thermal noise to the point where macroscale quantum phenomena - crucially Bose-Einstein condensation - can operate at body temperature. It takes the core idea from Stapp that mind and brain interact via some sort of oscillator and identifies a likely candidate, neuronal arrays identified by Stapp as cortical minicolumns. Feedback cooling allows amplifiers to act like refrigerators, and when applied to minicolumns it is suggested they perform as quantum accelerators, solid-state devices able to supercharge standard computers. When the accelerator is idle, as in sleep, we have a neural computer operating unconsciously, but feedback cooling produces a Bose-Einstein condensate, quantum computation, and consciousness. The model explains how macroscale quantum phenomena can operate in a warm and noisy brain, how and why consciousness evolved, and gives insight into unconscious states like sleepwalking. The model is testable, predicting that cold states in the brain are detectable by magnetic resonance thermometry.
- [17] arXiv:2310.09758 (replaced) [pdf, ps, other]
Title: Genome hybridization: A universal way for the origin and diversification of organelles as well as the origin and speciation of eukaryotes
Comments: 22 pages with two tables; added references for section 2; revised testable predictions for Section 5
Subjects: Other Quantitative Biology (q-bio.OT)
The origin of organelles (mitochondrion, chloroplast and nucleus) remains enigmatic. The endosymbiotic hypothesis that chloroplasts, mitochondria and nuclei descend from the endosymbiotic cyanobacterium, bacterium and archaebacterium respectively is dominant yet uncompelling, while our discovery of de novo organelle biogenesis in the cyanobacterium TDX16 that had acquired the genome of its green algal host Haematococcus pluvialis overturns this hypothesis. In light of organelle biogenesis in the cyanobacterium TDX16, in combination with the relevant cellular and molecular evidence, we propose the genome hybridization hypothesis (GHH): the origin of organelles and origin of eukaryotes, as well as the diversification of organelles and speciation of eukaryotes, are unified and achieved by genome hybridization. The endosymbiotic cyanobacteria or bacteria obtain genomes of their archaebacterial or eukaryotic hosts and hybridize them with their own, resulting in expanded genomes containing a mixture of hybrid prokaryotic genes and eukaryotic genes. The cyanobacteria or bacteria thus have to compartmentalize to accommodate different genes for the specialized functions of photosynthesis (chloroplast), respiration (mitochondrion) and DNA preservation (nucleus), and consequently turn into photosynthetic or heterotrophic eukaryotes. Accordingly, eukaryotes and their organelles are of multiple origins, while the formation of cancer cells is the speciation of eukaryotes, as cancer cells are new species of unicellular eukaryotes arising from bacteria. Therefore, GHH provides a theoretical framework unifying evolutionary biology, cancer biology and cell biology and directing integrated multidisciplinary research.
- [18] arXiv:2311.18142 (replaced) [pdf, ps, other]
Title: Emergence of multiphase condensates from a limited set of chemical building blocks
Comments: Includes supplementary information
Subjects: Soft Condensed Matter (cond-mat.soft); Biological Physics (physics.bio-ph); Biomolecules (q-bio.BM)
Biomolecules composed of a limited set of chemical building blocks can co-localize into distinct, spatially segregated compartments known as biomolecular condensates. While many condensates are known to form spontaneously via phase separation, it has been unclear how immiscible condensates with precisely controlled molecular compositions assemble from a small number of chemical building blocks. We address this question by establishing a connection between the specificity of biomolecular interactions and the thermodynamic stability of coexisting condensates. By computing the minimum interaction specificity required to assemble condensates with target molecular compositions, we show how to design heteropolymer mixtures that produce compositionally complex condensates using only a small number of monomer types. Our results provide insight into how compositional specificity arises in naturally occurring multicomponent condensates and demonstrate a rational algorithm for engineering complex artificial condensates from simple chemical building blocks.
- [19] arXiv:2401.13858 (replaced) [pdf, ps, other]
Title: Graph Diffusion Transformer for Multi-Conditional Molecular Generation
Comments: 21 pages, 9 figures, 7 tables
Subjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Inverse molecular design with diffusion models holds great potential for advancements in material and drug discovery. Despite success in unconditional molecule generation, integrating multiple properties such as synthetic score and gas permeability as condition constraints into diffusion models remains unexplored. We present the Graph Diffusion Transformer (Graph DiT) for multi-conditional molecular generation. Graph DiT has a condition encoder to learn the representation of numerical and categorical properties and utilizes a Transformer-based graph denoiser to achieve molecular graph denoising under conditions. Unlike previous graph diffusion models that add noise separately on the atoms and bonds in the forward diffusion process, we propose a graph-dependent noise model for training Graph DiT, designed to accurately estimate graph-related noise in molecules. We extensively validate the Graph DiT for multi-conditional polymer and small molecule generation. Results demonstrate our superiority across metrics from distribution learning to condition control for molecular properties. A polymer inverse design task for gas separation with feedback from domain experts further demonstrates its practical utility.
- [20] arXiv:2404.17626 (replaced) [pdf, ps, other]
Title: Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Applications (stat.AP); Computation (stat.CO)
Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of individuals of non-European descent, underscoring a critical gap in genetic research. Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data. We evaluate the performance of Group-LASSO INTERaction-NET (glinternet) and pretrained lasso in disease prediction, focusing on diverse ancestries in the UK Biobank. Models were trained on data from White British and other ancestries and validated across a cohort of over 96,000 individuals for 8 diseases. Out of 96 models trained, we report 16 with statistically significant incremental predictive performance in terms of ROC-AUC scores (p-value < 0.05), found for diabetes, arthritis, gall stones, cystitis, asthma and osteoarthritis. For the interaction and pretrained models that outperformed the baseline, the PRS score was the primary driver behind prediction. Our findings indicate that both interaction terms and pre-training can enhance prediction accuracy, but only for a limited set of diseases and with moderate improvements in accuracy.
- [21] arXiv:2405.01015 (replaced) [pdf, ps, other]
Title: Network reconstruction via the minimum description length principle
Comments: 17 pages, 10 figures. Code and documentation are available at this https URL
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Data Analysis, Statistics and Probability (physics.data-an); Populations and Evolution (q-bio.PE)
A fundamental problem associated with the task of network reconstruction from dynamical or behavioral data consists in determining the most appropriate model complexity in a manner that prevents overfitting, and produces an inferred network with a statistically justifiable number of edges. The status quo in this context is based on $L_{1}$ regularization combined with cross-validation. However, besides its high computational cost, this commonplace approach unnecessarily ties the promotion of sparsity with weight "shrinkage". This combination forces a trade-off between the bias introduced by shrinkage and the network sparsity, which often results in substantial overfitting even after cross-validation. In this work, we propose an alternative nonparametric regularization scheme based on hierarchical Bayesian inference and weight quantization, which does not rely on weight shrinkage to promote sparsity. Our approach follows the minimum description length (MDL) principle, and uncovers the weight distribution that allows for the most compression of the data, thus avoiding overfitting without requiring cross-validation. The latter property renders our approach substantially faster to employ, as it requires a single fit to the complete data. As a result, we have a principled and efficient inference scheme that can be used with a large variety of generative models, without requiring the number of edges to be known in advance. We also demonstrate that our scheme yields systematically increased accuracy in the reconstruction of both artificial and empirical networks. We highlight the use of our method with the reconstruction of interaction networks between microbial communities from large-scale abundance samples involving in the order of $10^{4}$ to $10^{5}$ species, and demonstrate how the inferred model can be used to predict the outcome of interventions in the system.
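The coupling between sparsity and weight "shrinkage" that this abstract attributes to $L_{1}$ regularization can be seen in the simplest possible case. For a single weight with unregularized estimate `w_hat`, minimizing 0.5 * (w - w_hat)^2 + lam * |w| gives the soft-thresholding rule sketched below. The function name `soft_threshold` and the example values are hypothetical; this illustrates the status quo the abstract criticizes, not the authors' MDL-based method.

```python
import math


def soft_threshold(w_hat: float, lam: float) -> float:
    """Minimizer of 0.5 * (w - w_hat)**2 + lam * abs(w) over w."""
    # Weights with |w_hat| <= lam are set to zero (sparsity); all others are
    # pulled towards zero by exactly lam (shrinkage). Both effects are
    # controlled by the same parameter.
    return math.copysign(max(abs(w_hat) - lam, 0.0), w_hat)


if __name__ == "__main__":
    lam = 1.0  # hypothetical regularization strength
    for w_hat in (-3.0, -0.5, 0.5, 3.0):
        print(w_hat, soft_threshold(w_hat, lam))
```

Raising `lam` to remove more spurious edges also biases every surviving weight by the same amount, which is the trade-off the abstract describes.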