Machine Learning (ML) has advanced quickly in the last few years. Key to ML models is the training process, in which models learn from data, so data quality and quantity are fundamental to model performance. Datasets like ImageNet promoted the use of ML in Computer Vision for image classification, leading to major advances such as AlexNet and ResNet. Synthetic Biology (SynBio) aims to engineer biological systems and has created abstractions and standards for that endeavor. Sequence-to-expression modeling is a hallmark problem for SynBio, but the field lacks easy-to-use datasets that would let researchers and developers focus on creating new models instead of gathering and preprocessing data. The Synthetic Biology Open Language (SBOL) [1] was developed by the SynBio community as a standard to represent biological designs hierarchically. SynBioHub is a repository of designs in SBOL, with an API for programmatic access. To promote the development and training of ML models from SBOL, we will collect data from the literature [2], encode it in SBOL, and make it available on SynBioHub. Then we will develop a Python package to facilitate the creation of datasets from data in SynBioHub, including querying and preprocessing the data so that it is usable for ML model training. We will test the package by training ML models that reproduce results from the paper from which the data was extracted. Finally, we will explore the performance of Graph Neural Networks (GNNs): because SBOL data is represented as a graph, GNNs may be better suited to extracting information from it.
[1] Buecherl, Lukas, et al. "Synthetic Biology Open Language (SBOL) Version 3.1.0." Journal of Integrative Bioinformatics 20.1 (2023): 20220058.
[2] Urtecho, Guillaume, et al. "Genome-wide Functional Characterization of Escherichia coli Promoters and Sequence Elements Encoding Their Regulation." eLife 12 (2023).
Goal
Develop a Python package for dataset building from SynBioHub.
Specific Goals:
Encode data from reference [2] in SBOL and upload it to SynBioHub.
Create a Python package to query and preprocess data from SynBioHub.
Train an ML model replicating reference [2] results.
Explore the performance of GNN on the same data.
Document the package and create example notebooks.
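The preprocessing step in the goals above could look roughly like the following. This is only a minimal sketch, assuming records arrive as (sequence, expression) pairs once they have been retrieved from SynBioHub; the names `one_hot`, `build_dataset`, and `toy_records` are hypothetical and are not part of any existing package, and the toy values are made up rather than taken from reference [2].

```python
from typing import List, Sequence, Tuple

# Channel order for the one-hot encoding (an arbitrary but fixed choice).
ALPHABET = "ACGT"


def one_hot(sequence: str) -> List[List[int]]:
    """Encode a DNA sequence as one 4-element row per base.

    Bases outside A/C/G/T (for example N) become an all-zero row.
    """
    rows = []
    for base in sequence.upper():
        row = [0] * len(ALPHABET)
        idx = ALPHABET.find(base)
        if idx >= 0:
            row[idx] = 1
        rows.append(row)
    return rows


def build_dataset(
    records: Sequence[Tuple[str, float]]
) -> Tuple[List[List[List[int]]], List[float]]:
    """Turn (sequence, expression) records into parallel feature and label lists."""
    features = [one_hot(seq) for seq, _ in records]
    labels = [float(expr) for _, expr in records]
    return features, labels


# Made-up records for illustration only; not data from reference [2].
toy_records = [("ACGT", 1.5), ("TTGN", 0.2)]
X, y = build_dataset(toy_records)
```

Here `X` holds one matrix per sequence and `y` the matching expression values, which is the shape most Keras or PyTorch sequence models expect after conversion to arrays.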
Difficulty Level: Medium
This project involves encoding designs in SBOL, developing a Python package for dataset creation, and training ML models.
Size and Length of Project
medium: 175 hours
16 weeks
Skills
Essential skills: Python, GitHub, Git, ML
Nice to have skills: SBOL
To learn more about the project, read the SynBioHub paper and browse the GitHub repository. The main interface will be the API, used to upload data to the library and to download data for dataset building; see the API documentation. Then skim the Keras examples to familiarize yourself with the workflow where this tool will be used.
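As a rough illustration of downloading data through the API, the sketch below builds a request URL for a SynBioHub SPARQL query. It rests on several assumptions that should be checked against the API documentation: that the instance lives at `https://synbiohub.org`, that it exposes a `/sparql` endpoint taking a `query` parameter, and that the stored data uses SBOL2 predicates. The helper names `sparql_url` and `fetch_json` and the example query are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: the public SynBioHub instance; replace with your own instance URL.
SYNBIOHUB_URL = "https://synbiohub.org"

# Hypothetical query listing sequences and their elements. The sbol2 predicate
# names assume the data is stored as SBOL2; adjust them to the actual encoding.
EXAMPLE_QUERY = """
PREFIX sbol2: <http://sbols.org/v2#>
SELECT ?seq ?elements WHERE {
  ?seq a sbol2:Sequence ;
       sbol2:elements ?elements .
} LIMIT 10
"""


def sparql_url(base_url: str, query: str) -> str:
    """Build the GET URL for a SPARQL query against a SynBioHub instance."""
    return base_url.rstrip("/") + "/sparql?" + urlencode({"query": query.strip()})


def fetch_json(url: str):
    """Request the URL asking for JSON and return the parsed response.

    Not called here, so the sketch runs without network access.
    """
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


url = sparql_url(SYNBIOHUB_URL, EXAMPLE_QUERY)
```

Keeping URL construction separate from the network call makes the query logic easy to test offline, which may be useful when the package later wraps these calls.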
Hello @Gonza10V and @khanspers, I have read the SynBioHub paper mentioned above and gone through the SynBioHub API documentation, and I now understand many of the SynBioHub concepts. I am already experienced with PyTorch, so the Keras workflow was easy for me to understand.
Can you please tell me what I can do next to deep dive into the project? Thanks
Hey, I’m Aryan Prakhar, a pre-final-year student at IIT BHU, and I’m super excited about the chance to contribute to this project. I’ve been wanting to dive into GNNs for a while, and this project seems like the perfect opportunity to do that, besides a bunch of other cool stuff! I have previously worked on dataset and benchmark creation and on data-driven scientific discovery through multi-agent systems, among other things, and have co-authored a paper at ICLR. I also contributed to another open-source program, C4GT, last year and had a fulfilling time.
Public Repository
https://github.com/synbiodex
Potential Mentors
Gonzalo Vidal ([email protected])
Chris Myers ([email protected])