Upload all the files for extraction again

commons-research · Aug 6, 2024 · commit 5e5fe7c

Showing 20 changed files with 5,171 additions and 0 deletions.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -0,0 +1,133 @@
+# Contributing to `dataset-extractor-lotus`
+
+Contributions are welcome, and they are greatly appreciated!
+Every little bit helps, and credit will always be given.
+
+You can contribute in many ways:
+
+# Types of Contributions
+
+## Report Bugs
+
+Report bugs at https://github.com/commons-research/dataset-extractor-lotus/issues
+
+If you are reporting a bug, please include:
+
+- Your operating system name and version.
+- Any details about your local setup that might be helpful in troubleshooting.
+- Detailed steps to reproduce the bug.
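The system details requested in the list above can be collected in one go. This is a minimal sketch using only the Python standard library; the helper name `describe_environment` is hypothetical and not part of this repository.

```python
import platform
import sys


def describe_environment() -> dict:
    """Collect basic system details that are useful in a bug report."""
    return {
        "os": platform.system(),              # e.g. "Linux", "Darwin", "Windows"
        "os_version": platform.release(),     # OS or kernel release string
        "machine": platform.machine(),        # CPU architecture
        "python": platform.python_version(),  # interpreter version
        "executable": sys.executable,         # path of the running interpreter
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Pasting the printed lines into the issue covers the first two bullet points; the steps to reproduce still have to be written by hand.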
+
+## Fix Bugs
+
+Look through the GitHub issues for bugs.
+Anything tagged with "bug" and "help wanted" is open to whoever wants to implement a fix for it.
+
+## Implement Features
+
+Look through the GitHub issues for features.
+Anything tagged with "enhancement" and "help wanted" is open to whoever wants to implement it.
+
+## Write Documentation
+
+`dataset-extractor-lotus` could always use more documentation, whether as part of the official docs, in docstrings, or even on the web in blog posts and articles.
+
+## Submit Feedback
+
+The best way to send feedback is to file an issue at https://github.com/commons-research/dataset-extractor-lotus/issues.
+
+If you are proposing a new feature:
+
+- Explain in detail how it would work.
+- Keep the scope as narrow as possible, to make it easier to implement.
+- Remember that this is a volunteer-driven project, and that contributions
+  are welcome :)
+
+# Get Started!
+
+Ready to contribute? Here's how to set up `dataset-extractor-lotus` for local development.
+Please note this documentation assumes you already have `poetry` and `Git` installed and ready to go.
+
+1. Fork the `dataset-extractor-lotus` repo on GitHub.
+
+2. Clone your fork locally:
+
+```bash
+cd <directory_in_which_repo_should_be_created>
+git clone git@github.com:commons-research/dataset-extractor-lotus.git
+```
+
+3. Now install the environment. First, navigate into the directory:
+
+```bash
+cd dataset-extractor-lotus
+```
+
+If you are using `pyenv`, select a version to use locally. (See installed versions with `pyenv versions`)
+
+```bash
+pyenv local <x.y.z>
+```
+
+Then, install and activate the environment with:
+
+```bash
+poetry install
+poetry shell
+```
+
+4. Install pre-commit to run linters/formatters at commit time:
+
+```bash
+poetry run pre-commit install
+```
+
+5. Create a branch for local development:
+
+```bash
+git checkout -b name-of-your-bugfix-or-feature
+```
+
+Now you can make your changes locally.
+
+6. Don't forget to add test cases for your added functionality to the `tests` directory.
+
+7. When you're done making changes, check that your changes pass the formatting tests.
+
+```bash
+make check
+```
+
+8. Now, validate that all unit tests are passing:
+
+```bash
+make test
+```
+
+9. Before raising a pull request you should also run tox.
+   This will run the tests across different versions of Python:
+
+```bash
+tox
+```
+
+This requires you to have multiple versions of Python installed.
+This step also runs in the CI/CD pipeline, so you can skip it locally.
+
+10. Commit your changes and push your branch to GitHub:
+
+```bash
+git add .
+git commit -m "Your detailed description of your changes."
+git push origin name-of-your-bugfix-or-feature
+```
+
+11. Submit a pull request through the GitHub website.
+
+# Pull Request Guidelines
+
+Before you submit a pull request, check that it meets these guidelines:
+
+1. The pull request should include tests.
+
+2. If the pull request adds functionality, the docs should be updated.
+   Put your new functionality into a function with a docstring, and add the feature to the list in `README.md`.
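The two guidelines above (tests, and a function with a docstring) can be illustrated together. The following is only a sketch: `sample_members` and both test functions are hypothetical names, not code from this repository. The doctest inside the docstring would also be picked up by the `pytest --doctest-modules` call in the Makefile.

```python
import random
from typing import Optional


def sample_members(members: list, amount: int, seed: Optional[int] = None) -> list:
    """Return `amount` distinct members chosen at random.

    >>> sorted(sample_members(["a", "b", "c"], 3, seed=0))
    ['a', 'b', 'c']
    """
    if amount > len(members):
        raise ValueError(f"cannot sample {amount} from {len(members)} members")
    # A dedicated Random instance keeps the global random state untouched.
    return random.Random(seed).sample(members, amount)


def test_sample_members_returns_requested_amount() -> None:
    picked = sample_members(["a", "b", "c", "d"], 2, seed=1)
    assert len(picked) == 2
    assert len(set(picked)) == 2
    assert set(picked) <= {"a", "b", "c", "d"}


def test_sample_members_rejects_oversized_request() -> None:
    try:
        sample_members(["a"], 2)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

In a real pull request the function would live in the package and the two tests in the `tests` directory.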
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,33 @@
+GNU GENERAL PUBLIC LICENSE
+                      Version 3, 29 June 2007
+
+    Extract specific elements from the LOTUS dataset.
+    Copyright (C) 2024  Pascal Amrein
+
+    This program is free software: you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation, either version 3 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License
+    along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+Also add information on how to contact you by electronic and paper mail.
+
+  You should also get your employer (if you work as a programmer) or school,
+if any, to sign a "copyright disclaimer" for the program, if necessary.
+For more information on this, and how to apply and follow the GNU GPL, see
+<http://www.gnu.org/licenses/>.
+
+  The GNU General Public License does not permit incorporating your program
+into proprietary programs.  If your program is a subroutine library, you
+may consider it more useful to permit linking proprietary applications with
+the library.  If this is what you want to do, use the GNU Lesser General
+Public License instead of this License.  But first, please read
+<http://www.gnu.org/philosophy/why-not-lgpl.html>.
+
diff --git a/Makefile b/Makefile
@@ -0,0 +1,35 @@
+.PHONY: install
+install: ## Install the poetry environment and install the pre-commit hooks
+	@echo "🚀 Creating virtual environment using pyenv and poetry"
+	@poetry install
+	@poetry run pre-commit install
+	@poetry shell
+
+.PHONY: check
+check: ## Run code quality tools.
+	@echo "🚀 Checking Poetry lock file consistency with 'pyproject.toml': Running poetry check --lock"
+	@poetry check --lock
+	@echo "🚀 Linting code: Running pre-commit"
+	@poetry run pre-commit run -a
+	@echo "🚀 Static type checking: Running mypy"
+	@poetry run mypy
+
+.PHONY: test
+test: ## Test the code with pytest
+	@echo "🚀 Testing code: Running pytest"
+	@poetry run pytest --doctest-modules
+
+.PHONY: build
+build: clean-build ## Build wheel file using poetry
+	@echo "🚀 Creating wheel file"
+	@poetry build
+
+.PHONY: clean-build
+clean-build: ## clean build artifacts
+	@rm -rf dist
+
+.PHONY: help
+help:
+	@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-20s\033[0m %s\n", $$1, $$2}'
+
+.DEFAULT_GOAL := help
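The `help` target above works by scanning the Makefile for lines of the form `target: ... ## description` and printing the two parts in columns. The same idea is sketched below in Python for readers less familiar with `grep` and `awk`; `list_targets` and `EXAMPLE` are illustrative names, and the regular expression is a close translation of the one in the Makefile rather than an exact copy.

```python
import re

# Same shape as the grep pattern in the Makefile: "target: [deps] ## description".
TARGET_LINE = re.compile(r"^([a-zA-Z_-]+):.*?## (.*)$")


def list_targets(makefile_text: str) -> list:
    """Return (target, description) pairs for every documented target."""
    targets = []
    for line in makefile_text.splitlines():
        match = TARGET_LINE.match(line)
        if match:
            targets.append((match.group(1), match.group(2)))
    return targets


# Shortened excerpt of the Makefile above.
EXAMPLE = (
    ".PHONY: test\n"
    "test: ## Test the code with pytest\n"
    "\t@poetry run pytest --doctest-modules\n"
    "\n"
    ".PHONY: build\n"
    "build: clean-build ## Build wheel file using poetry\n"
    "\n"
    ".PHONY: help\n"
    "help:\n"
)

for name, description in list_targets(EXAMPLE):
    # Left-align the target name in a 20-character column, like the awk printf.
    print(f"{name:<20} {description}")
```

Targets without a `## ` comment, such as `help` itself, are not listed, so adding that comment is what makes a new target show up in `make help`.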
diff --git a/README.md b/README.md
@@ -0,0 +1,93 @@
+# dataset extractor LOTUS
+
+[![Release](https://img.shields.io/github/v/release/commons-research/dataset-extractor-lotus)](https://img.shields.io/github/v/release/commons-research/dataset-extractor-lotus)
+[![Build status](https://img.shields.io/github/actions/workflow/status/commons-research/dataset-extractor-lotus/main.yml?branch=main)](https://github.com/commons-research/dataset-extractor-lotus/actions/workflows/main.yml?query=branch%3Amain)
+[![codecov](https://codecov.io/gh/commons-research/dataset-extractor-lotus/branch/main/graph/badge.svg)](https://codecov.io/gh/commons-research/dataset-extractor-lotus)
+[![Commit activity](https://img.shields.io/github/commit-activity/m/commons-research/dataset-extractor-lotus)](https://img.shields.io/github/commit-activity/m/commons-research/dataset-extractor-lotus)
+[![License](https://img.shields.io/github/license/commons-research/dataset-extractor-lotus)](https://img.shields.io/github/license/commons-research/dataset-extractor-lotus)
+[![DOI](https://zenodo.org/badge/DOI/records/7534071.svg)](https://zenodo.org/records/7534071)
+
+Extract specific elements/rows from the LOTUS dataset.
+
+## How to run the LOTUS extractor
+One possible way to run the script:
+
+```bash
+# navigate to the folder where you want to put the project
+cd ./path/to/folder/
+
+# clone it from git and enter the folder
+git clone git@github.com:commons-research/dataset-extractor-lotus.git
+cd dataset-extractor-lotus
+
+# install the environment with poetry
+## if poetry is not already installed, run the next two commands first
+pip install pipx
+pipx install poetry
+
+# Install from the poetry.lock file (command should be run in the same folder as the poetry.lock file)
+poetry install
+
+# run the script in poetry and get the help page
+poetry run python dataset_extractor_lotus/main.py -h
+
+# run the script in the interactive mode
+poetry run python dataset_extractor_lotus/main.py
+
+```
+
+## Interactive Mode
+An example of how to proceed:
+
+```bash
+Start interactive mode.
+? Welcome to the "toydataset extractor" for the LOTUS datasets.
+            Chose your action: 
+  sampling
+❯ download
+  exit (Ctrl+C)
+```
+
+### Download dataset
+
+```bash
+Start interactive mode.
+? Welcome to the "toydataset extractor" for the LOTUS datasets.
+            Chose your action: download
+The datasets will be searched from the internet. One moment please...
+? Please choose your dataset to download (sorted from newest to oldest): 220916_frozen_metadata.csv.gz (record: 7085063)
+? Enter path to download: data/
+The dataset will be downloaded to data//220916_frozen_metadata.csv.gz (record: 7085063).
+Download complete: 220916_frozen_metadata.csv.gz
+```
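The doubled slash in `data//220916_frozen_metadata.csv.gz` above is harmless on POSIX systems. It would be consistent with the directory and the file name being joined as plain strings, but that is an assumption, since the joining code is not shown here. A small sketch of how `pathlib` normalises such a join; `build_target_path` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def build_target_path(directory: str, filename: str) -> str:
    """Join a user-supplied directory and a file name without doubling slashes."""
    # PurePosixPath drops the trailing slash of "data/" before joining.
    return str(PurePosixPath(directory) / filename)


print(build_target_path("data/", "220916_frozen_metadata.csv.gz"))
# data/220916_frozen_metadata.csv.gz
```

`PurePosixPath` is used here so the result looks the same on every platform; `pathlib.Path` would be the usual choice when the path is actually opened.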
+
+### Sampling from dataset
+
+```bash
+Start interactive mode.
+? Welcome to the "toydataset extractor" for the LOTUS datasets.
+            Chose your action: sampling
+? Enter the filepath to sample from: data/220916_frozen_metadata.csv.gz
+Before type:  String
+After type:  Int32
+? Please choose the taxonomy level to sample from: organism_taxonomy_05order
+? Choose from which member to sample (522 options): Anthoathecata
+? Please enter the amount of members to sample (max. 77): 32
+? Please choose the output format: full
+? Enter the output file name or existing filename to append: test_sampling.csv
+File test_sampling.csv does not exist. Creating new file.
+```
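The prompts above boil down to three steps: pick a taxonomy column, pick one value in it, then draw a fixed number of rows at random. Below is a rough sketch of that flow using only the standard library. It is not the project's implementation; the rows, the `structure_id` column and the function name `sample_rows` are invented for illustration.

```python
import csv
import io
import random

# Tiny stand-in for the LOTUS metadata file; the rows are invented.
CSV_TEXT = """\
structure_id,organism_taxonomy_05order
1,Anthoathecata
2,Anthoathecata
3,Anthoathecata
4,Rosales
5,Rosales
"""


def sample_rows(rows, column, member, amount, seed=None):
    """Randomly pick `amount` rows whose `column` equals `member`."""
    candidates = [row for row in rows if row[column] == member]
    if amount > len(candidates):
        # Mirrors the "max." hint in the interactive prompt.
        raise ValueError(f"only {len(candidates)} rows available for {member!r}")
    return random.Random(seed).sample(candidates, amount)


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
picked = sample_rows(rows, "organism_taxonomy_05order", "Anthoathecata", 2, seed=42)
for row in picked:
    print(row["structure_id"], row["organism_taxonomy_05order"])
```

For a real `.csv.gz` file, the in-memory text could be replaced by `gzip.open(path, "rt", newline="")` from the standard library, which yields a text stream that `csv.DictReader` accepts.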
+
+
+## Documentation
+- **GitHub repository**: <https://github.com/commons-research/dataset-extractor-lotus.git>
+- **Documentation with mkdocs**: <https://commons-research.github.io/dataset-extractor-lotus/>
+
+## Feedback
+Please follow these [guidelines](CONTRIBUTING.md) for reporting a bug.  
+For other issues, please use <https://github.com/commons-research/dataset-extractor-lotus/issues>.
+
+
+## Release history
+21.03.2024: v1.1 is released (interactive mode)  
+07.03.2024: v1.0 is released (basic random sampling)
diff --git a/data/.gitignore b/data/.gitignore
@@ -0,0 +1,2 @@
+*
+!.gitignore
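The two patterns in `data/.gitignore` keep the `data/` folder in the repository while ignoring everything downloaded into it: `*` ignores every entry, and `!.gitignore` re-includes the ignore file itself. The sketch below imitates that "last matching pattern wins" rule for plain file names. It is a deliberate simplification, since real gitignore matching also handles directories, `/` and `**`, and `is_ignored` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

PATTERNS = ["*", "!.gitignore"]  # the two lines of data/.gitignore


def is_ignored(filename: str, patterns: list) -> bool:
    """Simplified gitignore check: the last matching pattern decides."""
    ignored = False
    for pattern in patterns:
        negated = pattern.startswith("!")
        glob = pattern[1:] if negated else pattern
        if fnmatchcase(filename, glob):
            # A negated pattern re-includes the file, a plain one ignores it.
            ignored = not negated
    return ignored


print(is_ignored("220916_frozen_metadata.csv.gz", PATTERNS))  # True
print(is_ignored(".gitignore", PATTERNS))                     # False
```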