
Automating the construction of datasets from SynBioHub to streamline the training of ML models on standardized data in SBOL #246

Open
Gonza10V opened this issue Jan 17, 2025 · 5 comments

Comments

@Gonza10V

Gonza10V commented Jan 17, 2025

Background

Machine Learning (ML) has advanced rapidly in recent years. Central to any ML model is the training process, in which the model learns from data, so data quality and quantity are fundamental to model performance. Datasets such as ImageNet drove the adoption of ML in Computer Vision for image classification, leading to major advances such as AlexNet and ResNet. Synthetic Biology (SynBio) aims to engineer biological systems and has created abstractions and standards for that endeavor. Sequence-to-expression prediction is a hallmark problem for SynBio, but the field lacks easy-to-use datasets that let researchers and developers focus on creating new models rather than gathering and preprocessing data.

The Synthetic Biology Open Language (SBOL) [1] was developed by the SynBio community as a standard to represent biological designs hierarchically. SynBioHub is a repository of designs in SBOL, with an API for easy programmatic access. To promote the development and training of ML models from SBOL, we will collect data from the literature [2], encode it in SBOL, and make it available on SynBioHub. We will then develop a Python package to facilitate the creation of datasets from data in SynBioHub, including querying and preprocessing the data so it is ready for ML model training. We will test the package by training ML models that reproduce results from the paper from which the data was extracted. Finally, we will explore the performance of Graph Neural Networks (GNNs): since SBOL designs are represented as graphs, GNNs may be better suited to learning from them.

[1] Buecherl, Lukas, et al. "Synthetic Biology Open Language (SBOL) version 3.1.0." Journal of Integrative Bioinformatics 20.1 (2023): 20220058.
[2] Urtecho, Guillaume, et al. "Genome-wide Functional Characterization of Escherichia coli Promoters and Sequence Elements Encoding Their Regulation." eLife 12 (2023).
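
To make the planned dataset-building workflow concrete, here is a minimal sketch of how the package might query SynBioHub over its REST API and download SBOL records. The `GET /search/<query>` and `GET <uri>/sbol` patterns follow the public SynBioHub API documentation, but the query string, the result field names, and the helper functions are illustrative assumptions, not part of the final design.

```python
import requests

SYNBIOHUB = "https://synbiohub.org"  # public instance (assumed host for the data)


def search_designs(query: str) -> list[dict]:
    """Search SynBioHub and return matching records as JSON.

    Field names in the returned records (e.g. 'uri', 'name') may differ;
    check the live API response.
    """
    resp = requests.get(
        f"{SYNBIOHUB}/search/{query}",
        headers={"Accept": "application/json"},
    )
    resp.raise_for_status()
    return resp.json()


def fetch_sbol(design_uri: str) -> str:
    """Download the SBOL document for a design by appending '/sbol' to its URI."""
    resp = requests.get(f"{design_uri}/sbol", headers={"Accept": "text/plain"})
    resp.raise_for_status()
    return resp.text


if __name__ == "__main__":
    # Hypothetical query for promoter parts; the real collection will hold
    # the data from reference [2] once it has been encoded and uploaded.
    for hit in search_designs("promoter")[:5]:
        print(hit.get("uri"), "->", hit.get("name"))
```

The package would wrap calls like these and add the preprocessing needed to turn the downloaded SBOL (sequences, measurements, annotations) into model-ready tables or tensors.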

Goal

Develop a Python package for dataset building from SynBioHub.

Specific Goals:
  • Encode data from reference [2] in SBOL and upload it to SynBioHub.
  • Create a Python package to query and preprocess data from SynBioHub.
  • Train an ML model replicating the results of reference [2].
  • Explore the performance of GNNs on the same data (see the sketch after this list).
  • Document the package and create example notebooks.
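
Since the GNN goal may be unfamiliar, here is a minimal sketch of the idea, assuming PyTorch Geometric: an SBOL design is treated as a graph whose nodes are components (promoter, RBS, CDS) connected in transcription order, and a small graph network predicts an expression value. The node features, graph structure, and model below are toy placeholders; the real featurization and targets would come from the encoded data of reference [2].

```python
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

# Toy graph for one design: promoter -> RBS -> CDS, with one-hot node features.
x = torch.tensor([[1.0, 0.0, 0.0],   # promoter
                  [0.0, 1.0, 0.0],   # RBS
                  [0.0, 0.0, 1.0]])  # CDS
edge_index = torch.tensor([[0, 1],
                           [1, 2]], dtype=torch.long).t().contiguous()
design = Data(x=x, edge_index=edge_index)


class ExpressionGNN(torch.nn.Module):
    """Two GCN layers plus a mean-pooled readout to one expression value."""

    def __init__(self, in_dim: int = 3, hidden: int = 16):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.readout = torch.nn.Linear(hidden, 1)

    def forward(self, data: Data) -> torch.Tensor:
        h = F.relu(self.conv1(data.x, data.edge_index))
        h = F.relu(self.conv2(h, data.edge_index))
        return self.readout(h.mean(dim=0))  # graph-level prediction


model = ExpressionGNN()
print(model(design))  # untrained prediction for the toy design
```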

Difficulty Level: Medium

This project involves encoding designs in SBOL, developing a Python package for dataset creation, and training ML models.

Size and Length of Project

  • medium: 175 hours
  • 16 weeks

Skills

Essential skills: Python, GitHub, Git, ML
Nice-to-have skills: SBOL

Public Repository

https://github.com/synbiodex

Potential Mentors

Gonzalo Vidal ([email protected])
Chris Myers ([email protected])

@Artinong

Hey @khanspers, is this available right now? If so, please assign me to this issue.

@cywlol

cywlol commented Jan 18, 2025

Hi, I am interested in working on this project.

@Gonza10V
Author

To learn more about the project, here is the SynBioHub paper, and here is the GitHub repository. The main interface will be the API, used to upload data to the repository and to download data for dataset building; here is the API documentation. Then skim the Keras examples to familiarize yourself with the workflow where this tool will be used.
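
As a rough illustration of that upload path, here is a sketch of submitting an SBOL file through the API. The `/login` and `/submit` endpoints and form fields follow the API documentation linked above, but the credentials, collection id, and file name are placeholders, and the exact fields should be checked against the docs.

```python
import requests

SYNBIOHUB = "https://synbiohub.org"

# 1. Log in to obtain an API token (placeholder credentials).
token = requests.post(
    f"{SYNBIOHUB}/login",
    data={"email": "user@example.org", "password": "********"},
    headers={"Accept": "text/plain"},
).text

# 2. Submit an SBOL file as a new collection (id, name, and file are placeholders).
with open("promoter_library.xml", "rb") as fh:
    resp = requests.post(
        f"{SYNBIOHUB}/submit",
        headers={"X-authorization": token, "Accept": "text/plain"},
        data={
            "id": "promoter_library",
            "version": "1",
            "name": "E. coli promoter library",
            "description": "Promoter characterization data from reference [2], encoded in SBOL",
            "overwrite_merge": "0",
        },
        files={"file": fh},
    )
resp.raise_for_status()
print(resp.text)
```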

@angkul07

Hello @Gonza10V and @khanspers, I have read the SynBioHub paper mentioned above and went through the SynBioHub API documentation, and I now understand many of the SynBioHub concepts. I already have experience working with PyTorch, so it was easy for me to understand the Keras workflow.

Could you please tell me what I should do next to dive deeper into the project? Thanks

@AryanPrakhar

Hey, I’m Aryan Prakhar, a pre-final-year student at IIT BHU, and I’m excited about the chance to contribute to this project. I’ve been wanting to dive into GNNs for a while, and this project seems like the perfect opportunity to do that, along with a bunch of other cool things! I have previously worked on dataset and benchmark creation and on data-driven scientific discovery through multi-agent systems, and I have co-authored a paper at ICLR. I also contributed to another open-source program, C4GT, last year and had a fulfilling time.

Looking forward to discussing this further!
