upload again all the files for extraxtion
pamrein committed Aug 6, 2024
0 parents commit 5e5fe7c
Showing 20 changed files with 5,171 additions and 0 deletions.
133 changes: 133 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,133 @@
# Contributing to `dataset-extractor-lotus`

Contributions are welcome, and they are greatly appreciated!
Every little bit helps, and credit will always be given.

You can contribute in many ways:

# Types of Contributions

## Report Bugs

Report bugs at https://github.com/commons-research/dataset-extractor-lotus/issues

If you are reporting a bug, please include:

- Your operating system name and version.
- Any details about your local setup that might be helpful in troubleshooting.
- Detailed steps to reproduce the bug.
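
The first two items can be collected with a few lines of Python. This is only a sketch using the standard library; the function name `environment_report` is illustrative and not part of the package.

```python
import platform
import sys


def environment_report() -> str:
    """Return a short, copy-pasteable summary of the local setup."""
    lines = [
        # Operating system name and version, e.g. "Linux 6.5.0"
        f"OS: {platform.system()} {platform.release()}",
        # CPU architecture, useful when binary wheels are involved
        f"Machine: {platform.machine()}",
        # Interpreter version and the path of the interpreter in use
        f"Python: {platform.python_version()} ({sys.executable})",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(environment_report())
```

Running it inside the project environment (for example via `poetry run python`) reports the interpreter that the project actually uses.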

## Fix Bugs

Look through the GitHub issues for bugs.
Anything tagged with "bug" and "help wanted" is open to whoever wants to implement a fix for it.

## Implement Features

Look through the GitHub issues for features.
Anything tagged with "enhancement" and "help wanted" is open to whoever wants to implement it.

## Write Documentation

`dataset-extractor-lotus` could always use more documentation, whether as part of the official docs, in docstrings, or even on the web in blog posts, articles, and such.

## Submit Feedback

The best way to send feedback is to file an issue at https://github.com/commons-research/dataset-extractor-lotus/issues.

If you are proposing a new feature:

- Explain in detail how it would work.
- Keep the scope as narrow as possible, to make it easier to implement.
- Remember that this is a volunteer-driven project, and that contributions
are welcome :)

# Get Started!

Ready to contribute? Here's how to set up `dataset-extractor-lotus` for local development.
Please note this documentation assumes you already have `poetry` and `Git` installed and ready to go.

1. Fork the `dataset-extractor-lotus` repo on GitHub.

2. Clone your fork locally:

```bash
cd <directory_in_which_repo_should_be_created>
git clone git@github.com:<your-github-username>/dataset-extractor-lotus.git
```

3. Now we need to install the environment. Navigate into the directory:

```bash
cd dataset-extractor-lotus
```

If you are using `pyenv`, select a version to use locally. (See installed versions with `pyenv versions`)

```bash
pyenv local <x.y.z>
```

Then, install and activate the environment with:

```bash
poetry install
poetry shell
```

4. Install pre-commit to run linters/formatters at commit time:

```bash
poetry run pre-commit install
```

5. Create a branch for local development:

```bash
git checkout -b name-of-your-bugfix-or-feature
```

Now you can make your changes locally.

6. Don't forget to add test cases for your added functionality to the `tests` directory.

7. When you're done making changes, check that your changes pass the formatting tests.

```bash
make check
```

8. Now, validate that all unit tests are passing:

```bash
make test
```

9. Before raising a pull request you should also run tox.
This will run the tests across different versions of Python:

```bash
tox
```

This requires multiple versions of Python to be installed.
This step also runs in the CI/CD pipeline, so you may skip it locally.

10. Commit your changes and push your branch to GitHub:

```bash
git add .
git commit -m "Your detailed description of your changes."
git push origin name-of-your-bugfix-or-feature
```

11. Submit a pull request through the GitHub website.

# Pull Request Guidelines

Before you submit a pull request, check that it meets these guidelines:

1. The pull request should include tests.

2. If the pull request adds functionality, the docs should be updated.
Put your new functionality into a function with a docstring, and add the feature to the list in `README.md`.
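
As an illustration of guidelines 1 and 2, the sketch below shows a function with a docstring, doctest examples (which `make test` picks up through `pytest --doctest-modules`), and a pytest-style test. The helper `sample_members` is hypothetical and only serves as an example; it is not part of the package.

```python
import random
from typing import List, Optional, Sequence


def sample_members(members: Sequence[str], amount: int, seed: Optional[int] = None) -> List[str]:
    """Return `amount` distinct members drawn at random.

    Raises ValueError if `amount` is negative or exceeds the number of members.

    >>> len(sample_members(["a", "b", "c"], 2, seed=0))
    2
    >>> sorted(sample_members(["a", "b", "c"], 3, seed=0))
    ['a', 'b', 'c']
    """
    if not 0 <= amount <= len(members):
        raise ValueError(f"amount must be between 0 and {len(members)}, got {amount}")
    # A dedicated Random instance keeps the result reproducible for a given seed
    # without touching the global random state.
    return random.Random(seed).sample(list(members), amount)


def test_sample_members_rejects_too_many() -> None:
    """pytest-style test; a function like this would live in the `tests` directory."""
    try:
        sample_members(["a", "b"], 3)
    except ValueError:
        return
    raise AssertionError("expected ValueError for amount > len(members)")
```

In a real test module, `pytest.raises(ValueError)` would be the more idiomatic way to express the same check.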
33 changes: 33 additions & 0 deletions LICENSE
@@ -0,0 +1,33 @@
GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007

Extract specific elements from the LOTUS dataset.
Copyright (C) 2024 Pascal Amrein

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.

Also add information on how to contact you by electronic and paper mail.

You should also get your employer (if you work as a programmer) or school,
if any, to sign a "copyright disclaimer" for the program, if necessary.
For more information on this, and how to apply and follow the GNU GPL, see
<http://www.gnu.org/licenses/>.

The GNU General Public License does not permit incorporating your program
into proprietary programs. If your program is a subroutine library, you
may consider it more useful to permit linking proprietary applications with
the library. If this is what you want to do, use the GNU Lesser General
Public License instead of this License. But first, please read
<http://www.gnu.org/philosophy/why-not-lgpl.html>.

35 changes: 35 additions & 0 deletions Makefile
@@ -0,0 +1,35 @@
.PHONY: install
install: ## Install the poetry environment and install the pre-commit hooks
	@echo "🚀 Creating virtual environment using pyenv and poetry"
	@poetry install
	@poetry run pre-commit install
	@poetry shell

.PHONY: check
check: ## Run code quality tools.
	@echo "🚀 Checking Poetry lock file consistency with 'pyproject.toml': Running poetry check --lock"
	@poetry check --lock
	@echo "🚀 Linting code: Running pre-commit"
	@poetry run pre-commit run -a
	@echo "🚀 Static type checking: Running mypy"
	@poetry run mypy

.PHONY: test
test: ## Test the code with pytest
	@echo "🚀 Testing code: Running pytest"
	@poetry run pytest --doctest-modules

.PHONY: build
build: clean-build ## Build wheel file using poetry
	@echo "🚀 Creating wheel file"
	@poetry build

.PHONY: clean-build
clean-build: ## clean build artifacts
	@rm -rf dist

.PHONY: help
help:
	@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-20s\033[0m %s\n", $$1, $$2}'

.DEFAULT_GOAL := help
93 changes: 93 additions & 0 deletions README.md
@@ -0,0 +1,93 @@
# dataset extractor LOTUS

[![Release](https://img.shields.io/github/v/release/commons-research/dataset-extractor-lotus)](https://img.shields.io/github/v/release/commons-research/dataset-extractor-lotus)
[![Build status](https://img.shields.io/github/actions/workflow/status/commons-research/dataset-extractor-lotus/main.yml?branch=main)](https://github.com/commons-research/dataset-extractor-lotus/actions/workflows/main.yml?query=branch%3Amain)
[![codecov](https://codecov.io/gh/commons-research/dataset-extractor-lotus/branch/main/graph/badge.svg)](https://codecov.io/gh/commons-research/dataset-extractor-lotus)
[![Commit activity](https://img.shields.io/github/commit-activity/m/commons-research/dataset-extractor-lotus)](https://img.shields.io/github/commit-activity/m/commons-research/dataset-extractor-lotus)
[![License](https://img.shields.io/github/license/commons-research/dataset-extractor-lotus)](https://img.shields.io/github/license/commons-research/dataset-extractor-lotus)
[![DOI](https://zenodo.org/badge/DOI/records/7534071.svg)](https://zenodo.org/records/7534071)

Extract specific elements/rows from the LOTUS dataset.

## How to run the LOTUS extractor
The following is one way to set up and run the script.

```bash
# navigate to the folder where you want to put the project
cd ./path/to/folder/

# clone it from git and enter the folder
git clone git@github.com:commons-research/dataset-extractor-lotus.git
cd dataset-extractor-lotus

# install the environment with poetry
## if poetry is not already installed, run the following two commands first
pip install pipx
pipx install poetry

# Install from the poetry.lock file (command should be run in the same folder as the poetry.lock file)
poetry install

# run the script in poetry and get the help page
poetry run python dataset_extractor_lotus/main.py -h

# run the script in the interactive mode
poetry run python dataset_extractor_lotus/main.py

```

## Interactive Mode
Example of how to proceed:

```bash
Start interactive mode.
? Welcome to the "toydataset extractor" for the LOTUS datasets.
Chose your action:
sampling
❯ download
exit (Ctrl+C)
```

### Download dataset

```bash
Start interactive mode.
? Welcome to the "toydataset extractor" for the LOTUS datasets.
Chose your action: download
The datasets will be searched from the internet. One moment please...
? Please choose your dataset to download (sorted from newest to oldest): 220916_frozen_metadata.csv.gz (record: 7085063)
? Enter path to download: data/
The dataset will be downloaded to data//220916_frozen_metadata.csv.gz (record: 7085063).
Download complete: 220916_frozen_metadata.csv.gz
```
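
The README does not show how the download is implemented. As a rough sketch only, the step above could be reproduced with the Python standard library as follows. The function names are hypothetical, and the Zenodo file URL pattern is an assumption rather than something taken from this project. Building the target with `pathlib` also avoids the double slash seen in `data//220916_frozen_metadata.csv.gz`.

```python
from pathlib import Path
from urllib.request import urlretrieve


def zenodo_file_url(record_id: str, filename: str) -> str:
    """Build a download URL for a file of a Zenodo record.

    Assumption: Zenodo serves record files under this URL pattern.
    """
    return f"https://zenodo.org/records/{record_id}/files/{filename}?download=1"


def destination_path(download_dir: str, filename: str) -> Path:
    """Join directory and file name; pathlib drops the redundant trailing slash."""
    return Path(download_dir) / filename


def download(record_id: str, filename: str, download_dir: str) -> Path:
    """Download one file of a record into `download_dir` and return its path."""
    target = destination_path(download_dir, filename)
    # Create the target folder if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(zenodo_file_url(record_id, filename), str(target))
    return target
```

With the values from the session above, the call would be `download("7085063", "220916_frozen_metadata.csv.gz", "data/")`.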

### Sampling from dataset

```bash
Start interactive mode.
? Welcome to the "toydataset extractor" for the LOTUS datasets.
Chose your action: sampling
? Enter the filepath to sample from: data/220916_frozen_metadata.csv.gz
Before type: String
After type: Int32
? Please choose the taxonomy level to sample from: organism_taxonomy_05order
? Choose from which member to sample (522 options): Anthoathecata
? Please enter the amount of members to sample (max. 77): 32
? Please choose the output format: full
? Enter the output file name or existing filename to append: test_sampling.csv
File test_sampling.csv does not exist. Creating new file.
```
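
Conceptually, the session above filters the dataset to one member of a taxonomy level and then draws a random subset. The sketch below illustrates that idea with the standard library only. It is not the package's implementation: the function names are hypothetical, and it assumes that the unit being sampled is a row of the CSV file.

```python
import csv
import gzip
import random
from typing import Dict, Iterable, List, Optional


def read_rows(path: str) -> List[Dict[str, str]]:
    """Load a gzip-compressed CSV file into a list of row dictionaries."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def sample_rows(
    rows: Iterable[Dict[str, str]],
    level: str,
    member: str,
    amount: int,
    seed: Optional[int] = None,
) -> List[Dict[str, str]]:
    """Keep rows whose `level` column equals `member`, then sample `amount` of them."""
    candidates = [row for row in rows if row.get(level) == member]
    if not 0 <= amount <= len(candidates):
        # Mirrors the "(max. N)" hint of the interactive prompt.
        raise ValueError(f"amount must be between 0 and {len(candidates)}, got {amount}")
    # Sampling without replacement; a seed makes the draw reproducible.
    return random.Random(seed).sample(candidates, amount)
```

For the example session, this would correspond to `sample_rows(read_rows("data/220916_frozen_metadata.csv.gz"), "organism_taxonomy_05order", "Anthoathecata", 32)`.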


## Documentation
- **GitHub repository**: <https://github.com/commons-research/dataset-extractor-lotus.git>
- **Documentation with mkdocs**: <https://commons-research.github.io/dataset-extractor-lotus/>

## Feedback
Please follow these [guidelines](CONTRIBUTING.md) when reporting a bug.
For other issues, please use <https://github.com/commons-research/dataset-extractor-lotus/issues>.


## Releasing a new version
- 21.03.2024: v1.1 is released (interactive mode)
- 07.03.2024: v1.0 is released (basic random sampling)
2 changes: 2 additions & 0 deletions data/.gitignore
@@ -0,0 +1,2 @@
*
!.gitignore