
Automating the construction of datasets from SynBioHub to streamline the training of ML models on standardized data in SBOL #246

Open
Gonza10V opened this issue Jan 17, 2025 · 5 comments

Comments

@Gonza10V

Gonza10V commented Jan 17, 2025

Background

Machine Learning (ML) has advanced rapidly in recent years. Central to any ML model is the training process, in which the model learns from data, so data quality and quantity are fundamental to model performance. Datasets such as ImageNet drove the adoption of ML in Computer Vision for image classification, leading to major advances such as AlexNet and ResNet. Synthetic Biology (SynBio) aims to engineer biological systems and has created abstractions and standards for that endeavor. Sequence-to-expression prediction is a hallmark problem for SynBio, but the field lacks easy-to-use datasets that let researchers and developers focus on creating new models rather than gathering and preprocessing data.

The Synthetic Biology Open Language (SBOL) [1] was developed by the SynBio community as a standard to represent biological designs hierarchically. SynBioHub is a repository of designs in SBOL, with an API for easy programmatic access. To promote the development and training of ML models from SBOL, we will collect data from the literature [2], encode it in SBOL, and make it available on SynBioHub. We will then develop a Python package to facilitate the creation of datasets from data in SynBioHub, including querying and preprocessing the data so it is ready for ML model training. We will test the package by training ML models that reproduce results from the paper from which the data was extracted. Finally, we will explore the performance of Graph Neural Networks (GNNs): since SBOL designs are represented as graphs, GNNs may be better suited to learning from them.

[1] Buecherl, Lukas, et al. "Synthetic Biology Open Language (SBOL) version 3.1.0." Journal of Integrative Bioinformatics 20.1 (2023): 20220058.
[2] Urtecho, Guillaume, et al. "Genome-wide Functional Characterization of Escherichia coli Promoters and Sequence Elements Encoding Their Regulation." eLife 12 (2023).
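
To make the planned dataset-building workflow concrete, here is a minimal sketch of how the package might query SynBioHub over its REST API and download SBOL records. The `GET /search/<query>` and `GET <uri>/sbol` patterns follow the public SynBioHub API documentation, but the query string, the result field names, and the helper functions are illustrative assumptions, not part of the final design.

```python
import requests

SYNBIOHUB = "https://synbiohub.org"  # public instance (assumed host for the data)


def search_designs(query: str) -> list[dict]:
    """Search SynBioHub and return matching records as JSON.

    Field names in the returned records (e.g. 'uri', 'name') may differ;
    check the live API response.
    """
    resp = requests.get(
        f"{SYNBIOHUB}/search/{query}",
        headers={"Accept": "application/json"},
    )
    resp.raise_for_status()
    return resp.json()


def fetch_sbol(design_uri: str) -> str:
    """Download the SBOL document for a design by appending '/sbol' to its URI."""
    resp = requests.get(f"{design_uri}/sbol", headers={"Accept": "text/plain"})
    resp.raise_for_status()
    return resp.text


if __name__ == "__main__":
    # Hypothetical query for promoter parts; the real collection will hold
    # the data from reference [2] once it has been encoded and uploaded.
    for hit in search_designs("promoter")[:5]:
        print(hit.get("uri"), "->", hit.get("name"))
```

The package would wrap calls like these and add the preprocessing needed to turn the downloaded SBOL (sequences, measurements, annotations) into model-ready tables or tensors.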

Goal

Develop a Python package for dataset building from SynBioHub.

Specific Goals:
  • Encode data from reference [2] in SBOL and upload it to SynBioHub.
  • Create a Python package to query and preprocess data from SynBioHub.
  • Train an ML model replicating the results of reference [2].
  • Explore the performance of GNNs on the same data (see the sketch after this list).
  • Document the package and create example notebooks.
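
Since the GNN goal may be unfamiliar, here is a minimal sketch of the idea, assuming PyTorch Geometric: an SBOL design is treated as a graph whose nodes are components (promoter, RBS, CDS) connected in transcription order, and a small graph network predicts an expression value. The node features, graph structure, and model below are toy placeholders; the real featurization and targets would come from the encoded data of reference [2].

```python
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

# Toy graph for one design: promoter -> RBS -> CDS, with one-hot node features.
x = torch.tensor([[1.0, 0.0, 0.0],   # promoter
                  [0.0, 1.0, 0.0],   # RBS
                  [0.0, 0.0, 1.0]])  # CDS
edge_index = torch.tensor([[0, 1],
                           [1, 2]], dtype=torch.long).t().contiguous()
design = Data(x=x, edge_index=edge_index)


class ExpressionGNN(torch.nn.Module):
    """Two GCN layers plus a mean-pooled readout to one expression value."""

    def __init__(self, in_dim: int = 3, hidden: int = 16):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.readout = torch.nn.Linear(hidden, 1)

    def forward(self, data: Data) -> torch.Tensor:
        h = F.relu(self.conv1(data.x, data.edge_index))
        h = F.relu(self.conv2(h, data.edge_index))
        return self.readout(h.mean(dim=0))  # graph-level prediction


model = ExpressionGNN()
print(model(design))  # untrained prediction for the toy design
```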

Difficulty Level: Medium

This project involves encoding designs in SBOL, developing a Python package for dataset creation, and training ML models.

Size and Length of Project

  • medium: 175 hours
  • 16 weeks

Skills

Essential skills: Python, GitHub, Git, ML
Nice-to-have skills: SBOL

Public Repository

https://github.com/synbiodex

Potential Mentors

Gonzalo Vidal ([email protected])
Chris Myers ([email protected])

@Artinong

Hey @khanspers, is this available right now? If so, please assign me to this issue.

@cywlol

cywlol commented Jan 18, 2025

Hi, I am interested in working on this project.

@Gonza10V
Author

To learn more about the project, here is the SynBioHub paper, and here is the GitHub repository. The main interface will be the API, used to upload data to the repository and to download data for dataset building; here is the API documentation. Then skim the Keras examples to familiarize yourself with the workflow where this tool will be used.
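
As a rough illustration of that upload path, here is a sketch of submitting an SBOL file through the API. The `/login` and `/submit` endpoints and form fields follow the API documentation linked above, but the credentials, collection id, and file name are placeholders, and the exact fields should be checked against the docs.

```python
import requests

SYNBIOHUB = "https://synbiohub.org"

# 1. Log in to obtain an API token (placeholder credentials).
token = requests.post(
    f"{SYNBIOHUB}/login",
    data={"email": "user@example.org", "password": "********"},
    headers={"Accept": "text/plain"},
).text

# 2. Submit an SBOL file as a new collection (id, name, and file are placeholders).
with open("promoter_library.xml", "rb") as fh:
    resp = requests.post(
        f"{SYNBIOHUB}/submit",
        headers={"X-authorization": token, "Accept": "text/plain"},
        data={
            "id": "promoter_library",
            "version": "1",
            "name": "E. coli promoter library",
            "description": "Promoter characterization data from reference [2], encoded in SBOL",
            "overwrite_merge": "0",
        },
        files={"file": fh},
    )
resp.raise_for_status()
print(resp.text)
```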

@angkul07

Hello @Gonza10V and @khanspers, I have read the SynBioHub paper mentioned above and went through the SynBioHub API documentation, and I now understand many of the SynBioHub concepts. I already have experience working with PyTorch, so it was easy for me to understand the Keras workflow.

Could you please tell me what I should do next to dive deeper into the project? Thanks

@AryanPrakhar

Hey, I’m Aryan Prakhar, a pre-final-year student at IIT BHU, and I’m excited about the chance to contribute to this project. I’ve been wanting to dive into GNNs for a while, and this project seems like the perfect opportunity to do that, along with a bunch of other cool things! I have previously worked on dataset and benchmark creation and on data-driven scientific discovery through multi-agent systems, and I have co-authored a paper at ICLR. I also contributed to another open-source program, C4GT, last year and had a fulfilling time.

Looking forward to discussing this further!
