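# Pull the titles and summaries of the bioinformatics preprints from the example data
title <- example_preprints |> dplyr::filter(subject == "bioinformatics") |> dplyr::pull(title)
summary <- example_preprints |> dplyr::filter(subject == "bioinformatics") |> dplyr::pull(summary)
# Build the prompt that asks for a summary of recent advances in the subject
build_prompt_subject(subject = "bioinformatics", title = title, summary = summary)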
-#> [1] "I am giving you information about preprints published in bioRxiv recently. I'll give you the subject, preprint titles, and short summary of each paper. Please provide a general summary new advances in this subject/field in general. Provide this summary of the field in as many sentences as I instruct. Do not include any preamble text to the summary just give me the summary with no preface or intro sentence.\n\nSubject: bioinformatics\nNumber of sentences in summary: 5\n\nHere are the titles and summaries:\n\nTitle: Integrity and miss grouping as support for clusters in agglomerative hierarchical methods: the R-package octopucs\nSummary: The proposed method assesses cluster support throughout hierarchical analyses by compiling a consensus topology and using ecological concepts of reciprocal complementarities to define cluster integrity and contamination. This approach allows for building support for groups even when there is partial membership match after resampling, and was implemented in the R package octopucs, which showed robust detection of changes in group memberships compared to other methods.\n\nTitle: Sainsc: a computational tool for segmentation-free analysis of in-situ capture\nSummary: Sainsc is a computational tool that enables segmentation-free analysis of spatially resolved transcriptomics data, allowing for accurate cell-type assignment at the subcellular level without requiring manual cell border delineation. 
The tool provides efficient processing of high-resolution spatial data and can generate maps of cell types with corresponding confidence scores, making it a valuable resource for biomedical researchers working with complex tissue samples.\n\nTitle: BRACE: A novel Bayesian-based imputation approach for dimension reduction analysis of alternative splicing at single-cell resolution\nSummary: Alternative splicing represents an additional layer of complexity in gene expression profiles, but analyzing it at single-cell resolution is challenging due to missing data. This paper introduces BRACE, a Bayesian-based imputation approach that improves upon existing methods and enables dimension reduction analysis of alternative splicing events at single-cell resolution.\n\nTitle: Topological embedding and directional feature importance in ensemble classifiers for multi-class classification\nSummary: Researchers developed a new metric called class-based direction feature importance (CLIFI) to provide interpretable insights into the decision-making process of ensemble classifiers for multi-class classification problems, specifically in the context of cancer biomarker identification. The CLIFI metric was incorporated into four algorithms and applied to The Cancer Genome Atlas proteomics data, resulting in high F1-scores and allowing for the visualization of model decision-making functions and the identification of heterogeneity in several proteins across different cancer types.\n\nTitle: SeuratExtend: Streamlining Single-Cell RNA-Seq Analysis Through an Integrated and Intuitive Framework\nSummary: SeuratExtend is an R package that integrates essential tools and databases for single-cell RNA sequencing (scRNA-seq) data analysis, streamlining the process through a user-friendly interface. 
The package offers various analyses, including functional enrichment and gene regulatory network reconstruction, and seamlessly integrates multiple databases and popular Python tools.\n\nTitle: An Evolutionary Statistics Toolkit for Simplified Sequence Analysis on Web with Client-Side Processing\nSummary: The \"Evolutionary Statistics Toolkit\" is a web-based platform that integrates multiple evolutionary statistics tools for simplified sequence analysis, including Tajima's D calculator and Shannon's Entropy. The open-source toolkit facilitates streamlined workflows for researchers in evolutionary biology and genomics, and also serves as an educational interactive website for beginners in evolutionary statistics.\n\nTitle: A map of integrated cis-regulatory elements enhances gene regulatory analysis in maize\nSummary: The authors integrated various methods for profiling cis-regulatory elements (CREs) in maize, resulting in maps of integrated CREs that show increased completeness and precision. These maps were used to infer drought-specific gene regulatory networks and identify candidate regulators of maize drought response, as well as to study the potential role of transposable elements in regulating gene expression.\n\nTitle: MOSTPLAS: A Self-correction Multi-label Learning Model for Plasmid Host Range Prediction\nSummary: Plasmid host range prediction tools are essential for understanding how plasmids promote bacterial evolution, but existing learning-based tools struggle due to limited well-annotated training samples. 
The proposed model, MOSTPLAS, addresses this issue with a self-correction multi-label learning approach that uses pseudo label learning and asymmetric loss to facilitate training with incomplete labels.\n\nTitle: Bootstrap Evaluation of Association Matrices (BEAM) for Integrating Multiple Omics Profiles with Multiple Outcomes\nSummary: The authors propose Bootstrap Evaluation of Association Matrices (BEAM), a new statistical method that integrates multiple omics profiles with multiple clinical endpoints to identify significant associations between them. BEAM outperformed other integrated analysis methods in simulations and identified biologically relevant genes in a pediatric leukemia application that were missed by univariate screens and other methods.\n\nTitle: Thermodynamic modeling of Csr/Rsm- RNA interactions capture novel, direct binding interactions across the Pseudomonas aeruginosa transcriptome\nSummary: Researchers developed a thermodynamic model to predict interactions between the post-transcriptional regulator RsmA and mRNAs in Pseudomonas aeruginosa, predicting 1043 direct binding interactions, including 457 novel targets. The predictions were validated through in vitro binding assays and in vivo translational reporters, revealing direct regulation of genes involved in quorum sensing and the Type IV Secretion system, expanding the known pool of RsmA target genes.\n\nTitle: Assessing the ability of ChatGPT to extract natural product bioactivity and biosynthesis data from publications\nSummary: ChatGPT was tested on its ability to extract data from publications on natural product bioactivity and biosynthesis, which is crucial for training models that predict natural product activity from biosynthetic gene clusters. 
The results showed that ChatGPT performed well in identifying papers describing natural product discovery and extracting information about the product's bioactivity, but struggled with extracting accession numbers for the biosynthetic gene cluster or producer's genome.\n\nTitle: Genome-Wide Analysis of TCP Family Genes and Their Constitutive Expression Pattern Analysis in the Melon (Cucumis melo)\nSummary: This study identified and characterized 29 putative TCP genes in melon, classifying them into two classes and analyzing their chromosomal location, gene structure, and expression patterns. The results suggest that some CmTCP genes may have similar functions to their homologs in other plant species, while others may have undergone functional diversification, providing a resource for future investigations into their roles in melon development.\n\nTitle: Single-cell differential expression analysis between conditions within nested settings\nSummary: Researchers compared various methods for differential expression analysis of single-cell transcriptomics data and found that methods designed specifically for single-cell data do not offer performance advantages over conventional pseudobulk methods like DESeq2 when applied to individual datasets. However, permutation-based methods excel in performance for atlas-level analysis, but require significantly longer run times, making DREAM a compromise between quality and runtime.\n\nTitle: CoMPHI: A Novel Composite Machine Learning Approach Utilizing Multiple FeatureRepresentation to Predict Hosts of Bacteriophages\nSummary: Here is a 2-sentence summary of the paper: This study introduces CoMPHI, a novel composite machine learning approach that combines multiple feature representations to predict hosts of bacteriophages, with potential applications in phage therapy for treating bacterial infections. 
The model achieves high prediction accuracy, with an Area Under the ROC Curve (AUC) of up to 96.7% and accuracy of up to 95.1%, outperforming existing methods due to its inclusion of alignment scores and use of both nucleotide and protein sequences from phages and hosts.\n\nTitle: FourierMIL: Fourier filtering-based multiple instance learning for whole slide image analysis\nSummary: The paper presents FourierMIL, a multiple instance learning framework that uses the discrete Fourier transform to analyze whole-slide images (WSIs) in digital pathology. The method captures both global and local dependencies within WSIs and outperforms existing state-of-the-art methods in tumor classification tasks on gigapixel-resolution WSIs.\n\nTitle: Multiple Protein Structure Alignment at Scale with FoldMason\nSummary: Here is a 2-sentence summary of the paper: FoldMason is a new method for multiple protein structure alignment that can handle hundreds of thousands of structures at scale with high speed and accuracy. It leverages the structural alphabet from Foldseek to compute confidence scores, provide interactive visualizations, and support large-scale protein structure analysis and phylogenetic studies.\n\nTitle: Deciphering octoploid strawberry evolution with serial LTR similarity matrices for subgenome partition\nSummary: A novel approach was developed to assign polyploid genome assemblies to subgenomes using long terminal repeat retrotransposons (LTR-RTs) and the Serial Similarity Matrix (SSM) method, which is particularly useful for genomes without known diploid ancestors. 
The SSM approach was validated using well-studied allopolyploidy genomes and then applied to the octoploid strawberry genome, revealing three allopolyploidization events in its evolutionary history.\n\nTitle: IDENTIFICATION OF IMMUNE RESPONSE AND RNA NETWORK OF RHEUMATOID ARTHRITIS AND MOLECULAR DOCKING OF CELASTRUS PANICULATUS AS POTENTIAL THERAPEUTIC AGENT\nSummary: This study used bioinformatics analysis to identify immune responses, microRNA-hub genes networks, and potential therapeutic agents for rheumatoid arthritis (RA), a complex autoimmune disease with an unknown pathogenesis. The researchers found several hub genes and miRNAs associated with RA, and identified oleic acid and zeylasterone as potential novel drug candidates against the disease through molecular docking analysis of Celastrus paniculatus phytochemical compounds.\n\nTitle: Imputing abundance of over 2500 surface proteins from single-cell transcriptomes with context-agnostic zero-shot deep ensembles\nSummary: SPIDER is a deep ensemble model that predicts the abundance of over 2500 surface proteins from single-cell transcriptomes with improved generalization across diverse contexts such as tissues or disease states. The model outperforms other state-of-the-art methods and has various applications including cell type annotation, biomarker/target identification, and cell-cell interaction analysis in cancer research.\n\nTitle: Modelling Protein-Glycan Interactions with HADDOCK\nSummary: Glycans play important roles in living organisms by interacting with proteins for information transfer and signalling purposes, making it essential to determine the three-dimensional structures of protein-glycan complexes. 
The molecular docking approach HADDOCK was used to predict protein-glycan complexes with a top 5 success rate of 70% for bound datasets and 40% for unbound datasets using a benchmark of 89 complexes.\n\nTitle: Machine Learning Reveals Key Glycoprotein Mutations and Rapidly Assigns Lassa Virus Lineages\nSummary: Machine learning and phylogenetic analysis of Lassa virus glycoprotein sequences revealed key mutations and genetic differences between Nigerian lineages and those from other West African countries. The study identified specific amino acid positions that are highly variable among the lineages, which may explain structural and phenotypical differences, and developed a machine learning-based tool for rapid lineage classification.\n\nTitle: RESP2: An uncertainty aware multi-target multi-property optimization AI pipeline for antibody discovery\nSummary: The RESP2 pipeline is an AI-powered tool designed to optimize the discovery of therapeutic antibodies against infectious disease pathogens, taking into account multiple targets and properties such as specificity, low immunogenicity, and high affinity. The pipeline uses a suite of methods to estimate uncertainty in predictions and has been successfully applied to discover a highly human antibody with broad binding to variants of the COVID-19 spike protein receptor binding domain.\n\nTitle: Extending the capabilities of deconvolution to provide cell type specific pathway analysis of bulk RNA-seq data for idiopathic pulmonary fibrosis\nSummary: A deconvolution method was applied to bulk RNA-seq data from idiopathic pulmonary fibrosis (IPF) samples to correct for changes in cell type proportions and provide cell-type specific pathway analysis. 
The results showed significant increases in fibroblasts and myofibroblasts, decreases in vascular endothelial capillary cells, and IPF-related changes in extracellular matrix organization and TGF-{beta} regulation, as well as the involvement of interferon signaling in ATII cells.\n\nTitle: A survey of ADP-ribosyltransferase families in the pathogenic Legionella\nSummary: A comprehensive bioinformatic survey of 41 Legionella species identified 63 proteins with significant sequence or structural similarity to known ADP-ribosyltransferases (ARTs), organized into 39 ART-like families, including 26 novel families. The study found that most members of the novel ART families are predicted effectors, presenting promising targets for understanding Legionella pathogenicity and developing therapeutic strategies.\n\nTitle: A replicable and modular benchmark for long-read transcript quantification methods\nSummary: Researchers have developed a replicable benchmark for evaluating long-read transcript quantification methods using synthetic RNA-seq datasets, which can be easily extended to include new tools or data sets. The study reveals discrepancies with previously published results and highlights the importance of high-quality simulated data in assessing the robustness of certain approaches.\n\nTitle: Logan: Planetary-Scale Genome Assembly Surveys Life's Diversity\nSummary: The NCBI Sequence Read Archive contains over 50 petabases of DNA sequencing data across 27 million datasets, but its size makes it impractical to search for specific genetic sequences within a reasonable time frame. 
To address this issue, the authors used cloud computing to perform genome assembly on each dataset and created the Logan assemblage, which is now freely available and enables faster querying of the data, with some queries completing in as little as 11 hours.\n\nTitle: Cell-type specific epigenetic clocks to quantify biological age at cell-type resolution\nSummary: Epigenetic clocks have been developed to estimate biological age, but most are based on heterogeneous bulk tissues and reflect both changes in cell-type composition and individual cell aging. This study created neuron- and hepatocyte-specific DNA methylation clocks that provide improved estimates of chronological age and detect accelerated biological aging in Alzheimer's disease and liver pathology.\n\nTitle: Genomic and transcriptomic analyses of Heteropoda venatoria reveal the expansion of P450 family for starvation resistance in spider\nSummary: The genome of Heteropoda venatoria was sequenced and comparative genomic analysis revealed significant expansions in gene families related to lipid metabolism, including cytochrome P450 and steroid hormone biosynthesis genes. The study found that during starvation, H. venatoria undergoes a series of physiological changes, including the activation of fatty acid metabolism and protein degradation pathways, and the expression of expanded P450 gene families, which help the spider maintain a low-energy metabolic state and endure longer periods of starvation.\n\nTitle: Annotation Vocabulary (Might Be) All You Need\nSummary: The authors introduce the \"Annotation Vocabulary\", a language of protein properties defined by structured ontologies that can be used to train transformer models without reference to amino acid sequences. 
They demonstrate the effectiveness of this approach in various experiments, achieving state-of-the-art results on several common datasets with competitive performance on others, and generating high-quality de novo protein sequences from annotation-only prompts.\n\nTitle: AncFlow: An Ancestral Sequence Reconstruction Approach for Determining Novel Protein Structural\nSummary: Here is the summary in 2 sentences: AncFlow is an automated software pipeline that integrates phylogenetic analysis, subfamily identification, and ancestral sequence reconstruction (ASR) to generate ancestral protein sequences for structural prediction using state-of-the-art tools like AlphaFold. The pipeline was validated on two well-characterized protein families, providing insights into the evolutionary mechanisms underpinning functional diversification within these families and demonstrating its potential to guide protein engineering efforts."
+#> [1] "I am giving you information about recent bioRxiv/medRxiv preprints. I'll give you the subject, preprint titles, and short summary of each paper. Please provide a general summary new advances in this subject/field in general. Provide this summary of the field in as many sentences as I instruct. Do not include any preamble text to the summary just give me the summary with no preface or intro sentence.\n\nSubject: bioinformatics\nNumber of sentences in summary: 5\n\nHere are the titles and summaries:\n\nTitle: MedGraphNet: Leveraging Multi-Relational Graph Neural Networks and Text Knowledge for Biomedical Predictions\nSummary: MedGraphNet leverages multi-relational Graph Neural Networks and text knowledge to improve biomedical predictions by initializing nodes using informative embeddings from existing text knowledge, allowing for robust integration of various data types and improved generalizability. The model demonstrates superior performance compared to traditional single-relation approaches in scenarios with isolated or sparsely connected nodes, particularly in identifying disease-gene associations and drug-phenotype relationships, and shows promising results in accurately inferring drug side effects without direct training on such data.\n\nTitle: High-throughput bacterial aggregation analysis in droplets\nSummary: The communal lifestyle of bacteria can contribute significantly to antimicrobial resistance by promoting biofilm formation. A key approach to addressing this issue is to develop novel techniques for analyzing bacterial behavior, such as those enabled by droplet-based platforms and image analysis methods.\n\nTitle: scParadise: Tunable highly accurate multi-task cell type annotation and surface protein abundance prediction\nSummary: scAdam outperforms existing methods in annotating rare cell types with high accuracy and consistency across diverse datasets. 
scEve enhances clustering and cell type separation through improved surface protein prediction, leading to better characterization of complex tissues.\n\nTitle: Camera Paths, Modeling, and Image Processing Tools for ArtiaX\nSummary: ArtiaX is a plugin that has been extended to improve the analysis and visualization of cryo-electron tomography data through advanced visualization techniques. The plugin allows for the generation of diverse models with putative particle positions and orientations, as well as a coarse grained algorithm to rectify overlaps in template matching, driving camera position and facilitating movie creation with fundamental image filtering options.\n\nTitle: dScaff - an automatic bioinformatics framework for scaffolding draft de novo assemblies based on reference genome data\nSummary: dScaff is an automatic bioinformatics framework designed for scaffolding draft de novo assemblies based on reference genome data. The tool uses a series of bash and R scripts to create a minimal complete scaffold from a genome assembly, with potential future features to be implemented, including using reference chromosomes or scaffolds.\n\nTitle: Jaeger: an accurate and fast deep-learning tool to detect bacteriophage sequences\nSummary: Jaeger's accuracy and speed in identifying bacteriophage sequences outperform existing deep-learning tools by consistently producing few false positives despite encountering diverse viral sequences. The novel method achieves an estimated 2-27% false discovery rate when applied to over 16,000 metagenomic assemblies, which is significantly lower than the benchmarking paper where deep-learning tools produced many false positives.\n\nTitle: AI-Augmented R-Group Exploration in Medicinal Chemistry\nSummary: The paper presents a novel approach to enhancing free-wing QSAR models by embedding R-groups with atom-centric pharmacophoric features, allowing for the distinction of regioisomers and improved predictivity across 12 public datasets. 
The proposed method is integrated into an open-source program, enabling its application in various scenarios, including classic free-Wilson analysis and exploration of uncharted chemical space facilitated by AI-generated building blocks.\n\nTitle: OPLS-based Multiclass Classification and Data-Driven Inter-Class Relationship Discovery\nSummary: OPLS-DA models are widely used in metabolomics for two-class comparisons due to their strong discrimination capabilities, but these models face challenges in multiclass settings. An extension of OPLS-DA called OPLS-HDA integrates Hierarchical Cluster Analysis with the OPLS-DA framework to create a decision tree that addresses multiclass classification challenges and provides intuitive visualization of inter-class relationships.\n\nTitle: STANCE: a unified statistical model to detect cell-type-specific spatially variable genes in spatial transcriptomics\nSummary: STANCE, a unified statistical model to detect cell-type-specific spatially variable genes in spatial transcriptomics, was developed to address the challenges posed by existing methods in detecting spatially variable genes (SVGs) and cell type-specific spatially variable genes (ctSVGs). The proposed method integrates gene expression, spatial location, and cell type composition through a linear mixed-effect model to identify both SVGs and ctSVGs in an initial stage, followed by a second stage test dedicated to ctSVG detection.\n\nTitle: AsaruSim: a single-cell and spatial RNA-Seq Nanopore long-reads simulation workflow\nSummary: AsaruSim simulates synthetic single-cell long-read Nanopore datasets that closely mimic real experimental data by employing a multi-step process. 
It includes the creation of a synthetic UMI count matrix, generation of perfect reads, optional PCR amplification, introduction of sequencing errors, and comprehensive quality control reporting.\n\nTitle: Building a literature knowledge base towards transparent biomedical AI\nSummary: LiteralGraph extracts biomedical terms and relationships from PubMed literature, establishing a comprehensive knowledge graph. The resulting Genomic Literature Knowledge Base consolidates over 263 million biomedical terms, 14 million relationships, and 10 million genomic events across multiple sources, including nine established repositories.\n\nTitle: Accurate non-invasive quantification of astaxanthin content using hyperspectral images and machine learning\nSummary: The authors investigated a method to accurately quantify astaxanthin content in Haematococcus pluvialis microalgae cultures using hyperspectral images and machine learning. They found that this approach, combining reflectance hyperspectral imaging with a 1-dimensional convolutional neural network, had low average prediction error across a range of astaxanthin contents, although it was unreliable at very low levels (<0.6 micrograms mg-1).\n\nTitle: AlphaMut: a deep reinforcement learning model to suggest helix-disrupting mutations\nSummary: The authors propose a deep reinforcement learning model called AlphaMut to predict helix-disrupting mutations in proteins. AlphaMut identifies amino acids crucial for maintaining structural integrity and predicts key mutations that could alter protein function.\n\nTitle: Beyond Static Brain Atlases: AI-Powered Open Databasing and Dynamic Mining of Brain-Wide Neuron Morphometry\nSummary: NeuroXiv is a large-scale database that provides detailed 3D morphologies of individual neurons mapped to a standard brain atlas, allowing for dynamic, interactive neuroscience applications. 
The database offers a comprehensive collection of 175,149 atlas-oriented reconstructed morphologies of individual neurons from over 518 mouse brains, classified into 292 distinct types and mapped into the Common Coordinate Framework Version 3 (CCFv3).\n\nTitle: Metabolic modeling identifies determinants of thermal growth responses in Arabidopsis thaliana\nSummary: The paper developed an enzyme-constrained model of Arabidopsis thaliana's metabolism, which facilitates predictions of growth-related phenotypes at different temperatures and identifies genes affecting plant growth at suboptimal temperatures. This model was validated using mutant lines, demonstrating its potential in accurately predicting plant thermal responses and providing a template for developing climate-resilient crops.\n\nTitle: Decoding Protein Dynamics: ProFlex as a Linguistic Bridge in Normal Mode Analysis\nSummary: Artificial intelligence has revolutionized structural bioinformatics with AlphaFold being arguably the most impactful development to date. The structural atlases generated by these methods present significant opportunities for unraveling biological mysteries, but also pose challenges in leveraging such massive datasets effectively.\n\nTitle: Exploring midgut expression dynamics: longitudinal transcriptomic analysis of adult female Amblyomma americanum midgut and comparative insights with other hard tick species\nSummary: The study investigates the transcriptomic dynamics of the midgut in adult female Amblyomma americanum ticks during different feeding stages, revealing 15,599 putative DNA coding sequences and highlighting dynamic transcriptional changes as feeding progresses. 
The analysis also identified conserved transcripts across three hard tick species, providing insight into the physiological pathways relevant to the tick midgut and potential avenues for developing control methods targeting multiple tick species.\n\nTitle: Designing of thermostable proteins with a desired melting temperature\nSummary: We developed a regression method for predicting protein melting temperatures (Tm) using 17,312 non-redundant proteins and achieved the highest Pearson correlation of 0.80 with an R2 of 0.63 between predicted and actual Tm values. Our best model, fine-tuned on large language models such as ProtBert, achieved a maximum correlation of 0.89 with an R2 of 0.80, demonstrating improved performance in predicting protein stability at higher temperatures.\n\nTitle: Joint Modeling of Cellular Heterogeneity and Condition Effects with scPCA in Single-Cell RNA-Seq\nSummary: scRNA-seq in multi-condition experiments enables the systematic assessment of treatment effects by analyzing gene expression profiles. scPCA is a flexible DR framework that jointly models cellular heterogeneity and conditioning variables, allowing for an integrated factor representation and revealing transcriptional changes across conditions and components.\n\nTitle: Identification of potential inhibitors against Inosine 5'-Monophosphate Dehydrogenase of Cryptosporidium parvum through an integrated in silico approach\nSummary: A total of 24 bioactive phytochemicals were screened virtually using molecular docking and ADMET analyses to identify potential inhibitors against Inosine 5'-Monophosphate Dehydrogenase (IMPDH) of Cryptosporidium parvum, with four lead compounds identified as Brevelin A, Vernodalin, Luteolin, and Pectolinarigenin. 
The lead compounds were found to possess favorable pharmacokinetic and pharmacodynamic properties, satisfactory toxicity analysis results, and no major side effects or violation of Lipinski's rules of five, indicating the possibility of oral bioavailability as potential drug candidates.\n\nTitle: Identification and Diagnostic Potential of Pyroptosis-Related Genes in Endometriosis: A Novel Bioinformatics Analysis\nSummary: Pyroptosis-related genes were identified through a bioinformatics analysis of endometriosis (EM) transcriptomic datasets, resulting in 26 differentially expressed genes that play a crucial role in the pathogenesis of EM. A novel diagnostic model was constructed using LASSO regression based on pyroptosis scores, which included five key genes: KIF13B, BAG6, MYO5A, HEATR, and AK055981.\n\nTitle: Improving the accuracy of pose prediction by incorporating symmetry-related molecules\nSummary: The study aimed to improve the accuracy of pose prediction in molecular docking by incorporating symmetry-related molecules (SRMs). Redocking protein-ligand complexes with and without SRMs revealed that using SRMs significantly improved the prediction of biologically significant poses, as indicated by MM-GBSA calculations.\n\nTitle: Identification and study of Prolyl Oligopeptidases and related sequences in bacterial lineages\nSummary: The study examined ~32000 completely annotated bacterial genomes from the NCBI RefSeq Assembly database to identify annotated S9 family proteins, resulting in the discovery of ~53,000 bacterial S9 family proteins (referred to as POP homologues) which can be classified into distinct subfamilies through various machine-learning approaches and comprehensive analysis. 
These sequence homologues display distinct subclusters and class-specific motifs suggesting differences in substrate specificity in POP homologues.\n\nTitle: Learning-Augmented Sketching Offers Improved Performance for Privacy Preserving and Secure GWAS\nSummary: The introduction of trusted execution environments (TEEs) such as Intel SGX technology has enabled secure and privacy-preserving computation on the cloud, but stringent resource limitations pose a challenge for some TEEs. The SkSES method, which identifies significant SNPs in GWAS without disclosing sensitive genotype information, has been improved upon with a learning-augmented approach that achieves up to 40% accuracy gain compared to the original SkSES method.\n\nTitle: Liberality is More Explainable than PCA of Transcriptome for Vertebrate Embryo Development\nSummary: Liberality is a quantitative index of cellular differentiation and dedifferentiation that has been widely used for genome-scale data analysis, particularly in understanding vertebrate embryo development. The study analyzed a time course transcriptome dataset on vertebrate embryo development and found a trend that historically annotated embryo developmental stages matched changes in liberality, indicating the potential of liberality to analyze biological phenomena beyond just embryo development.\n\nTitle: Bacopa monnieri phytochemicals as promising BACE1 inhibitors for Alzheimers Disease Therapy\nSummary: Bacopa monnieri phytochemicals are investigated as potential BACE1 inhibitors for Alzheimer's Disease Therapy, with Bacopaside I showing superior binding affinity and interaction profile compared to established synthetic inhibitors. 
The study highlights the promising role of natural compounds in AD treatment, emphasizing their potential to overcome limitations faced in clinical settings, and advocates for a paradigm shift towards integrating traditional medicinal knowledge into contemporary drug discovery efforts.\n\nTitle: Accurate Multiple Sequence Alignment of Ultramassive Genome Sets\nSummary: The current state of multiple sequence alignment (MSA) is insufficient for handling ultramassive genome sets due to challenges in scalability and accuracy. The proposed algorithms, including directed acyclic graph construction, profile hidden Markov model training, and graph-based alignment, significantly improve accuracy and acceleration of MSA compared to widely used MAFFT for genome set sizes ranging from 40,000 to over 4 million.\n\nTitle: Machine Learning Driven Simulations of SARS-CoV-2 Fitness Landscape\nSummary: The SARS-CoV-2 infection is caused by interactions between the receptor binding domain of viral spike proteins and host cell ACE2 receptors, with mutations in the spike protein leading to neutralizing antibody escape and breakthrough infections. Machine learning-driven simulations combined with deep mutational scanning data predict variants of concern not seen in the training data and sample statistics of the fitness landscape, providing insight into the relationship between RBD sequence elements and emerging viral strains.\n\nTitle: Modelling dynamics of human NDPK hexamer structure, stability and interactions\nSummary: The precise assembly of the NDPK hexameric structure into homo- /hetero-oligomeric complexes is necessary for kinase activity but has been poorly understood due to high subunit homology, experimental challenges, and limited data on in vivo heterohexamer formation and subunit abundances across cellular compartments. 
A conserved Arg27 residue plays a key role in hexamer assembly, mediating inter- and intra-molecular monomeric interactions and ensuring similar hexameric assembly across subunits.\n\nTitle: GuaCAMOLE: GC-bias aware estimation improves the accuracy of metagenomic species abundances\nSummary: GuaCAMOLE is a novel computational method that detects and removes GC bias from metagenomic sequencing data, which affects the accuracy of quantifying microbial community compositions. The algorithm reports unbiased abundances and corrects the abundance of clinically relevant GC-poor species by up to a factor of two in gut microbiomes of colorectal cancer patients."