diff --git a/docs/api/benchmark.md b/docs/api/benchmark.md index 738678aa..48dc2864 100644 --- a/docs/api/benchmark.md +++ b/docs/api/benchmark.md @@ -1,16 +1,10 @@ -# Base class - -::: polaris.benchmark.BenchmarkSpecification +::: polaris.benchmark.BenchmarkV2Specification options: filters: ["!^_", "!md5sum", "!get_cache_path"] ---- -## Subclasses - -::: polaris.benchmark.SingleTaskBenchmarkSpecification - ---- -::: polaris.benchmark.MultiTaskBenchmarkSpecification +::: polaris.benchmark.BenchmarkSpecification + options: + filters: ["!^_", "!md5sum", "!get_cache_path"] --- \ No newline at end of file diff --git a/docs/api/dataset.md b/docs/api/dataset.md index 225d6390..9321f8fa 100644 --- a/docs/api/dataset.md +++ b/docs/api/dataset.md @@ -1,9 +1,3 @@ -::: polaris.dataset.Dataset - options: - filters: ["!^_"] - ---- - ::: polaris.dataset.DatasetV2 options: filters: ["!^_"] @@ -26,4 +20,4 @@ options: filters: ["!^_"] ---- +--- diff --git a/docs/images/zarr.png b/docs/images/zarr.png new file mode 100644 index 00000000..d21d36cc Binary files /dev/null and b/docs/images/zarr.png differ diff --git a/docs/index.md b/docs/index.md index d9c45e93..5b51a744 100644 --- a/docs/index.md +++ b/docs/index.md @@ -2,21 +2,15 @@ Welcome to the Polaris documentation! - - ---- - ## What is Polaris? -!!! info "Our vision" +!!! info "Our mission" - Polaris aims to **foster the development of impactful AI models in drug discovery** by establishing a new - and adaptive standard for measuring progress of computational tools in drug discovery. + Polaris is on a mission to bring innovators and practitioners closer together to develop methods that matter. -Polaris is a suite of tools to implement, host and run benchmarks in computational drug discovery. Existing benchmarks leave several key challenges - related to the characteristics of datasets in drug discovery - unaddressed. 
This can lead to a situation in which newly proposed models do not perform as well _as advertised_ in real drug discovery programs, ultimately risking misalignment between the scientists developing the models and downstream users. With Polaris, we aim to further close that gap. +Polaris is an optimistic community that fundamentally believes in the ability of Machine Learning to radically improve lives by disrupting the drug discovery process. However, we recognize that the absence of standardized, domain-appropriate datasets, guidelines, and tools for method evaluation is limiting its current impact. -### Polaris Hub -A quick word on the [Polaris Hub](https://polarishub.io/). The hub hosts a variety of high-quality benchmarks and datasets. While the hub is built to easily integrate with the Polaris library, you can use them independently. +Polaris is a Python library designed to interact with the [Polaris Hub](https://www.polarishub.io). Our aim is to build the leading benchmarking platform for drug discovery, promoting the use of high-quality resources and domain-appropriate evaluation protocols. Learn more through our [blog posts](https://polarishub.io/blog). ## Where to next? @@ -35,7 +29,7 @@ If you are entirely new to Polaris, this is the place to start! Learn about the Dive deeper into the Polaris code and learn about advanced concepts to create your own benchmarks and datasets. -[:material-arrow-right: Let's get started](./tutorials/basics.ipynb) +[:material-arrow-right: Let's get started](./tutorials/submit_to_benchmark.ipynb) --- diff --git a/docs/quickstart.md b/docs/quickstart.md index 3021ea91..33bb0add 100644 --- a/docs/quickstart.md +++ b/docs/quickstart.md @@ -1,39 +1,48 @@ # Quickstart +Welcome to the Polaris Quickstart guide! This page will introduce you to core concepts and you'll submit a first result to a benchmark on the [Polaris Hub](https://www.polarishub.io). ## Installation +!!! 
warning "`polaris-lib` vs `polaris`" + Be aware that the package name differs between _pip_ and _conda_. -First things first, let's install Polaris! +Polaris can be installed via _pip_: -We highly recommend using a [Conda Python distribution](https://github.com/conda-forge/miniforge), such as `mamba`: +```bash +pip install polaris-lib +``` +or _conda_: ```bash -mamba install -c conda-forge polaris +conda install -c conda-forge polaris ``` -??? info "Other installation options" - You can replace `mamba` by `conda`. The package is also pip installable if you need it: `pip install polaris-lib`. +## Core concepts +Polaris explicitly distinguishes between **datasets** and **benchmarks**. + +- A _dataset_ is simply a tabular collection of data, storing datapoints in a row-wise manner. +- A _benchmark_ defines the ML task and evaluation logic (e.g. split and metrics) for a dataset. + +One dataset can therefore be associated with multiple benchmarks. + +## Login +To interact with the [Polaris Hub](https://polarishub.io/) from the client, you must first authenticate yourself. If you don't have an account yet, you can create one [here](https://polarishub.io/sign-up). -## Authenticating to the Polaris Hub -To interact with the [Polaris Hub](https://polarishub.io/) from the client, you must first login. You can do this -via the following command in your terminal: +You can log in via the following command in your terminal: ```bash polaris login ``` -This will redirect you to a login page on the Polaris Hub where you can either sign in or sign up. Once either -of these options have been completed, you will see an authorization code on your screen. Copy this and paste it -back into your terminal when prompted by the client. +or in Python: +```py +from polaris.hub.client import PolarisHubClient -That's it! You're now all set to interact with datasets and benchmarks across Polaris. - -## Benchmarking API - -At its core, Polaris is a benchmarking library. 
It provides a simple API to run benchmarks. While it can be used -independently, it is built to easily integrate with the Polaris Hub. The hub hosts -a variety of high-quality datasets, benchmarks and associated results. +with PolarisHubClient() as client: + client.login() +``` -If all you care about is to partake in a benchmark that is hosted on the hub, it is as simple as: +## Benchmark API +To get started, we will submit a result to the [`polaris/hello-world-benchmark`](https://polarishub.io/benchmarks/polaris/hello-world-benchmark). ```python import polaris as po @@ -57,17 +66,18 @@ predictions = [0.0 for x in test] results = benchmark.evaluate(predictions) # Submit your results -results.upload_to_hub(owner="dummy-user") +results.upload_to_hub(owner="dummy-user", access="public") ``` -That's all there is to it to partake in a benchmark. No complicated, custom data-loaders or evaluation protocol. With just a few lines of code, you can feel confident that you are properly evaluating your model and focus on what you do best: Solving the hard problems in our domain! +Through immutable datasets and standardized benchmarks, Polaris aims to serve as a source of truth for machine learning in drug discovery. The limited flexibility might differ from your typical experience, but this is by design to improve reproducibility. Learn more [here](https://polarishub.io/blog/reproducible-machine-learning-in-drug-discovery-how-polaris-serves-as-a-single-source-of-truth). -Similarly, you can easily access a dataset. +## Dataset API +Loading a benchmark will automatically load the underlying dataset. We can also directly access the [`polaris/hello-world`](https://polarishub.io/datasets/polaris/hello-world) dataset. 
```python import polaris as po -# Load the dataset from the hub +# Load the dataset from the Hub dataset = po.load_dataset("polaris/hello-world") # Get information on the dataset size @@ -82,21 +92,14 @@ dataset.get_data( # Or, similarly: dataset[dataset.rows[0], dataset.columns[0]] -# Get an entire row +# Get an entire data point dataset[0] ``` -## Core concepts - -At the core of our API are 4 core concepts, each associated with a class: - -1. [`Dataset`][polaris.dataset.Dataset]: The dataset class is carefully designed data-structure, stress-tested on terra-bytes of data, to ensure whatever dataset you can think of, you can easily create, store and use it. -2. [`BenchmarkSpecification`][polaris.benchmark.BenchmarkSpecification]: The benchmark specification class wraps a `Dataset` with additional meta-data to produce a the benchmark. Specifically, it specifies how to evaluate a model's performance on the underlying dataset (e.g. the train-test split and metrics). It provides a simple API to run said evaluation protocol. -3. [`Subset`][polaris.dataset.Subset]: The subset class should be used as a starting-point for any framework-specific (e.g. PyTorch or Tensorflow) data loaders. To facilitate this, it abstracts away the non-trivial logic of accessing the data and provides several style of access to built upon. -4. [`BenchmarkResults`][polaris.evaluate.BenchmarkResults]: The benchmark results class stores the results of a benchmark, along with additional meta-data. This object can be easily uploaded to the Polaris Hub and shared with the broader community. +Drug discovery research involves a maze of file formats (e.g. PDB for 3D structures, SDF for small molecules, and so on). Each format requires specialized knowledge to parse and interpret properly. At Polaris, we wanted to remove that barrier. We use a universal data format based on [Zarr](https://zarr.dev/). Learn more [here](https://polarishub.io/blog/dataset-v2-built-to-scale). ## Where to next? 
-Now that you've seen how easy it is to use Polaris, let's dive into the details through a set of tutorials! +Now that you've seen how easy it is to use Polaris, let's dive into the details through [a set of tutorials](./tutorials/submit_to_benchmark.ipynb)! --- diff --git a/docs/resources.md b/docs/resources.md new file mode 100644 index 00000000..a3305628 --- /dev/null +++ b/docs/resources.md @@ -0,0 +1,13 @@ +# Resources + +## Publications + +- Correspondence in Nature Machine Intelligence: [10.1038/s42256-024-00911-w](https://doi.org/10.1038/s42256-024-00911-w). +- Preprint on Method Comparison Protocols: [10.26434/chemrxiv-2024-6dbwv-v2](https://doi.org/10.26434/chemrxiv-2024-6dbwv-v2). + +## Talks + +- PyData London (June, 2024): [https://www.youtube.com/watch?v=YZDfD9D7mtE](https://www.youtube.com/watch?v=YZDfD9D7mtE) +- MoML (June, 2024): [https://www.youtube.com/watch?v=Tsz_T1WyufI](https://www.youtube.com/watch?v=Tsz_T1WyufI) + +--- \ No newline at end of file diff --git a/docs/tutorials/basics.ipynb b/docs/tutorials/basics.ipynb deleted file mode 100644 index 24cf03da..00000000 --- a/docs/tutorials/basics.ipynb +++ /dev/null @@ -1,839 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "40f99374-b47e-4f84-bdb9-148a11f9c07d", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [] - }, - "source": [ - "# The Basics\n", - "\n", - "
\n", - "In short\n", - "This tutorial walks you through the basic usage of Polaris. We will first login to the hub and will then see how easy it is to load a dataset or benchmark from it. Finally, we will train a simple baseline to submit a first set of results!\n", - "
\n", - "\n", - "Polaris is designed to standardize the process of constructing datasets, specifying benchmarks and evaluating novel machine learning techniques within the realm of drug discovery.\n", - "\n", - "While the Polaris library can be used independently from the Polaris Hub, the two were designed to seamlessly work together. The hub provides various pre-made, high quality datasets and benchmarks to develop and evaluate novel ML methods. In this tutorial, we will see how easy it is to load and use these datasets and benchmarks." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "3d66f466", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [ - "remove_cell" - ] - }, - "outputs": [], - "source": [ - "# Note: Cell is tagged to not show up in the mkdocs build\n", - "%load_ext autoreload\n", - "%autoreload 2" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "9b465ea4-7c71-443b-9908-3f9e567ee4c4", - "metadata": {}, - "outputs": [], - "source": [ - "import polaris as po\n", - "from polaris.hub.client import PolarisHubClient" - ] - }, - { - "cell_type": "markdown", - "id": "168c7f21-f9ec-43e2-b123-2bdcba2e8a71", - "metadata": {}, - "source": [ - "### Login\n", - "To be able to complete this step, you will require a Polaris Hub account. Go to [https://polarishub.io/](https://polarishub.io/) to create one. You only have to log in once at the start or when you haven't used your account in a while." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "de8bf4bf-4dbd-42eb-8f74-bf8aa0339469", - "metadata": {}, - "outputs": [], - "source": [ - "client = PolarisHubClient()\n", - "client.login()" - ] - }, - { - "cell_type": "markdown", - "id": "0ea6d6c0", - "metadata": {}, - "source": [ - "Instead of through the Python API, you could also use the Polaris CLI. 
See:\n", - "```sh\n", - "polaris login --help\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "5edee39f-ce29-4ae6-91ce-453d9190541b", - "metadata": {}, - "source": [ - "### Load from the Hub\n", - "Both datasets and benchmarks are identified by a `owner/name` id. You can easily find and copy these through the Hub. Once you have the id, loading a dataset or benchmark is incredibly easy. " - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "4e004589-6c48-4232-b353-b1700536dde6", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "\u001b[32m2024-06-26 09:52:08.706\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36mpolaris._artifact\u001b[0m:\u001b[36m_validate_version\u001b[0m:\u001b[36m66\u001b[0m - \u001b[1mThe version of Polaris that was used to create the artifact (0.0.0) is different from the currently installed version of Polaris (0.0.2.dev191+g82e7db2).\u001b[0m\n", - "\u001b[32m2024-06-26 09:52:10.327\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36mpolaris._artifact\u001b[0m:\u001b[36m_validate_version\u001b[0m:\u001b[36m66\u001b[0m - \u001b[1mThe version of Polaris that was used to create the artifact (0.0.0) is different from the currently installed version of Polaris (0.0.2.dev191+g82e7db2).\u001b[0m\n", - "\u001b[32m2024-06-26 09:52:10.338\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36mpolaris._artifact\u001b[0m:\u001b[36m_validate_version\u001b[0m:\u001b[36m66\u001b[0m - \u001b[1mThe version of Polaris that was used to create the artifact (0.0.0) is different from the currently installed version of Polaris (0.0.2.dev191+g82e7db2).\u001b[0m\n" - ] - } - ], - "source": [ - "dataset = po.load_dataset(\"polaris/hello-world\")\n", - "benchmark = po.load_benchmark(\"polaris/hello-world-benchmark\")" - ] - }, - { - "cell_type": "markdown", - "id": "1ce8e0e5-88c8-4d3b-9292-e75c97315833", - "metadata": {}, - "source": [ - "### Use the benchmark\n", - "The polaris library is designed to make it 
easy to participate in a benchmark. In just a few lines of code, we can get the train and test partition, access the associated data in various ways and evaluate our predictions. There's two main API endpoints. \n", - "\n", - "- `get_train_test_split()`: For creating objects through which we can access the different dataset partitions.\n", - "- `evaluate()`: For evaluating a set of predictions in accordance with the benchmark protocol." - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "b55195cd-84da-4cd9-951b-c148265b303c", - "metadata": {}, - "outputs": [], - "source": [ - "train, test = benchmark.get_train_test_split()" - ] - }, - { - "cell_type": "markdown", - "id": "c14e189c", - "metadata": {}, - "source": [ - "The created objects support various flavours to access the data.\n", - "\n", - "- The objects are iterable;\n", - "- The objects can be indexed;\n", - "- The objects have properties to access all data at once." - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "43cbe460", - "metadata": {}, - "outputs": [], - "source": [ - "for x, y in train:\n", - " pass" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "2f317c10", - "metadata": {}, - "outputs": [], - "source": [ - "for i in range(len(train)):\n", - " x, y = train[i]" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "08ce24c7-992a-40a7-b8ef-c862fab99e6e", - "metadata": {}, - "outputs": [], - "source": [ - "x = train.inputs\n", - "y = train.targets" - ] - }, - { - "cell_type": "markdown", - "id": "d5fa35c5-e2d0-4d75-a2cb-75b4749d91ef", - "metadata": {}, - "source": [ - "To avoid accidental access to the test targets, the test object does not expose the labels and will throw an error if you try access them explicitly." 
- ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "8c33b7d4-fa82-4994-a7ab-5d0821ad5fd4", - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "for x in test:\n", - " pass" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "id": "8b4ac073", - "metadata": {}, - "outputs": [], - "source": [ - "for i in range(len(test)):\n", - " x = test[i]" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "5664eb87", - "metadata": {}, - "outputs": [], - "source": [ - "x = test.inputs\n", - "\n", - "# NOTE: The below will throw an error!\n", - "# y = test.targets" - ] - }, - { - "cell_type": "markdown", - "id": "955ad9db-3468-4f34-b303-18e6d642be56", - "metadata": {}, - "source": [ - "### Partake in the benchmark\n", - "\n", - "To complete our example, let's participate in the benchmark. We will train a simple random forest model on the ECFP representation through scikit-learn and datamol." - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "748dd278-0fd0-4c5b-ac6a-8d974143c3b9", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "\u001b[32m2024-06-26 09:52:12.003\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36mpolaris._artifact\u001b[0m:\u001b[36m_validate_version\u001b[0m:\u001b[36m66\u001b[0m - \u001b[1mThe version of Polaris that was used to create the artifact (0.0.0) is different from the currently installed version of Polaris (0.0.2.dev191+g82e7db2).\u001b[0m\n", - "\u001b[32m2024-06-26 09:52:12.014\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36mpolaris._artifact\u001b[0m:\u001b[36m_validate_version\u001b[0m:\u001b[36m66\u001b[0m - \u001b[1mThe version of Polaris that was used to create the artifact (0.0.0) is different from the currently installed version of Polaris (0.0.2.dev191+g82e7db2).\u001b[0m\n" - ] - }, - { - "data": { - "text/html": [ - "
RandomForestRegressor(max_depth=2, random_state=0)
" - ], - "text/plain": [ - "RandomForestRegressor(max_depth=2, random_state=0)" - ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import datamol as dm\n", - "from sklearn.ensemble import RandomForestRegressor\n", - "\n", - "# Load the benchmark (automatically loads the underlying dataset as well)\n", - "benchmark = po.load_benchmark(\"polaris/hello-world-benchmark\")\n", - "\n", - "# Get the split and convert SMILES to ECFP fingerprints by specifying an featurize function.\n", - "train, test = benchmark.get_train_test_split(featurization_fn=dm.to_fp)\n", - "\n", - "# Define a model and train\n", - "model = RandomForestRegressor(max_depth=2, random_state=0)\n", - "model.fit(train.X, train.y)" - ] - }, - { - "cell_type": "markdown", - "id": "a75a9f01", - "metadata": {}, - "source": [ - "To evaluate a model within Polaris, you should use the `evaluate()` endpoint. This requires you to just provide the predictions. The targets of the test set are automatically extracted so that the chance of the user accessing the test labels is minimal" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "6633ec79-a6ff-4ce0-bc7d-cdb9e1042462", - "metadata": {}, - "outputs": [], - "source": [ - "predictions = model.predict(test.X)" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "79c072cf-683e-4257-b31e-59fdbcf5e979", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
name: None
description:
tags:
user_attributes:
owner: None
polaris_version: 0.0.2.dev191+g82e7db2
benchmark_name: hello-world-benchmark
benchmark_owner: slug: polaris; external_id: org_2gtoaJIVrgRqiIR8Qm5BnpFCbxu; type: organization
github_url: None
paper_url: None
contributors: None
artifact_id: None
benchmark_artifact_id: polaris/hello-world-benchmark
results:
| Test set | Target label | Metric | Score |
| --- | --- | --- | --- |
| test | SOL | mean_squared_error | 2.6875139821 |
| test | SOL | mean_absolute_error | 1.2735690161 |
" - ], - "text/plain": [ - "{\n", - " \"name\": null,\n", - " \"description\": \"\",\n", - " \"tags\": [],\n", - " \"user_attributes\": {},\n", - " \"owner\": null,\n", - " \"polaris_version\": \"0.0.2.dev191+g82e7db2\",\n", - " \"benchmark_name\": \"hello-world-benchmark\",\n", - " \"benchmark_owner\": {\n", - " \"slug\": \"polaris\",\n", - " \"external_id\": \"org_2gtoaJIVrgRqiIR8Qm5BnpFCbxu\",\n", - " \"type\": \"organization\"\n", - " },\n", - " \"github_url\": null,\n", - " \"paper_url\": null,\n", - " \"contributors\": null,\n", - " \"artifact_id\": null,\n", - " \"benchmark_artifact_id\": \"polaris/hello-world-benchmark\",\n", - " \"results\": [\n", - " {\n", - " \"Test set\": \"test\",\n", - " \"Target label\": \"SOL\",\n", - " \"Metric\": \"mean_squared_error\",\n", - " \"Score\": 2.6875139821\n", - " },\n", - " {\n", - " \"Test set\": \"test\",\n", - " \"Target label\": \"SOL\",\n", - " \"Metric\": \"mean_absolute_error\",\n", - " \"Score\": 1.2735690161\n", - " }\n", - " ]\n", - "}" - ] - }, - "execution_count": 14, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "results = benchmark.evaluate(predictions)\n", - "results" - ] - }, - { - "cell_type": "markdown", - "id": "90114c20-4c01-432b-9f4d-b31863881cc6", - "metadata": {}, - "source": [ - "Before uploading the results to the Hub, you can provide some additional information about the results that will be displayed on the Polaris Hub." 
- ] - }, - { - "cell_type": "code", - "execution_count": 15, - "id": "a601f415-c563-4efe-94c3-0d44f3fd6576", - "metadata": {}, - "outputs": [], - "source": [ - "# For a complete list of meta-data, check out the BenchmarkResults object\n", - "results.name = \"hello-world-result\"\n", - "results.github_url = \"https://github.com/polaris-hub/polaris-hub\"\n", - "results.paper_url = \"https://polarishub.io/\"\n", - "results.description = \"Hello, World!\"" - ] - }, - { - "cell_type": "markdown", - "id": "4e7cc06d", - "metadata": {}, - "source": [ - "Finally, let's upload the results to the Hub! The result will be private, but visiting the link in the logs you can decide to make it public through the Hub." - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "id": "60cbf4b9-8514-480d-beda-8a50e5f7c9a6", - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "client.upload_results(results, owner=\"cwognum\")\n", - "client.close()" - ] - }, - { - "cell_type": "markdown", - "id": "78fe8d63", - "metadata": {}, - "source": [ - "That's it! Just like that you have partaken in your first Polaris benchmark. In next tutorials, we will consider more advanced use cases of Polaris, such as creating and uploading your own datasets and benchmarks. 
\n", - "\n", - "The End.\n", - "\n", - "---" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.3" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/docs/tutorials/competition.participate.ipynb b/docs/tutorials/competition.participate.ipynb deleted file mode 100644 index f31f1bd4..00000000 --- a/docs/tutorials/competition.participate.ipynb +++ /dev/null @@ -1,249 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "40f99374-b47e-4f84-bdb9-148a11f9c07d", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [] - }, - "source": [ - "# Participating in a Competition\n", - "\n", - "
\n", - "In short\n", - "This tutorial walks you through how to fetch an active competition from Polaris, prepare your predictions and then submit them for secure evaluation by the Polaris Hub.\n", - "
\n", - "\n", - "Participating in a competition on Polaris is very similar to participating in a standard benchmark. The main difference lies in how predictions are prepared and how they are evaluated. We'll touch on each of these topics later in the tutorial. \n", - "\n", - "Before continuing, please ensure you are logged into Polaris." - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "3d66f466", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [ - "remove_cell" - ] - }, - "outputs": [], - "source": [ - "# Note: Cell is tagged to not show up in the mkdocs build\n", - "%load_ext autoreload\n", - "%autoreload 2" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9b465ea4-7c71-443b-9908-3f9e567ee4c4", - "metadata": {}, - "outputs": [], - "source": [ - "import polaris as po\n", - "from polaris.hub.client import PolarisHubClient\n", - "\n", - "# Don't forget to add your Polaris Hub username below!\n", - "MY_POLARIS_USERNAME = \"\"\n", - "\n", - "client = PolarisHubClient()\n", - "client.login()" - ] - }, - { - "cell_type": "markdown", - "id": "5edee39f-ce29-4ae6-91ce-453d9190541b", - "metadata": {}, - "source": [ - "## Fetching a Competition\n", - "\n", - "As with standard benchmarks, Polaris provides simple APIs that allow you to quickly fetch a competition from the Polaris Hub. All you need is the unique identifier for the competition which follows the format of `competition_owner`/`competition_name`." - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "id": "4e004589-6c48-4232-b353-b1700536dde6", - "metadata": {}, - "outputs": [], - "source": [ - "competition_id = \"polaris/hello-world-competition\"\n", - "competition = po.load_competition(competition_id)" - ] - }, - { - "cell_type": "markdown", - "id": "36f3e829", - "metadata": {}, - "source": [ - "## Participate in the Competition\n", - "The Polaris library is designed to make it easy to participate in a competition. 
In just a few lines of code, we can get the train and test partition, access the associated data in various ways and evaluate our predictions. There's two main API endpoints. \n", - "\n", - "- `get_train_test_split()`: For creating objects through which we can access the different dataset partitions.\n", - "- `submit_predictions()`: For submitting the predictions to an active competition." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d8605928", - "metadata": {}, - "outputs": [], - "source": [ - "train, test = competition.get_train_test_split()" - ] - }, - { - "cell_type": "markdown", - "id": "e78bf878", - "metadata": {}, - "source": [ - "Similar to benchmarks, the created test and train objects support various flavours to access the data." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7b17bb31", - "metadata": {}, - "outputs": [], - "source": [ - "# The objects are iterable\n", - "for x, y in train:\n", - " pass\n", - "\n", - "# The objects can be indexed\n", - "for i in range(len(train)):\n", - " x, y = train[i]\n", - "\n", - "# The objects have properties to access all data at once. Use this with\n", - "# caution if the underlying dataset is large!\n", - "x = train.inputs\n", - "y = train.targets" - ] - }, - { - "cell_type": "markdown", - "id": "5ec12825", - "metadata": {}, - "source": [ - "Now, let's create some predictions against the imaginary `hello-world-competition`. Let's assume we train a simple random forest model on the ECFP representation through scikit-learn and datamol, and then we submit our results for secure evaluation by the Polaris Hub." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "902353bc", - "metadata": {}, - "outputs": [], - "source": [ - "import datamol as dm\n", - "from sklearn.ensemble import RandomForestRegressor\n", - "\n", - "# Load the competition (automatically loads the underlying dataset as well)\n", - "competition = po.load_competition(\"polaris/hello-world-benchmark\")\n", - "\n", - "# Get the split and convert SMILES to ECFP fingerprints by specifying a featurize function.\n", - "train, test = competition.get_train_test_split(featurization_fn=dm.to_fp)\n", - "\n", - "# Define a model and train\n", - "model = RandomForestRegressor(max_depth=2, random_state=0)\n", - "model.fit(train.X, train.y)\n", - "\n", - "predictions = model.predict(test.X)" - ] - }, - { - "cell_type": "markdown", - "id": "1a36e334", - "metadata": {}, - "source": [ - "Now that we have created some predictions, we can construct a `CompetitionPredictions` object that will prepare our predictions for evaluation by the Polaris Hub. Here, you can also add metadata to your predictions to better describe your results and how you achieved them. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2b36e09b", - "metadata": {}, - "outputs": [], - "source": [ - "from polaris.evaluate import CompetitionPredictions\n", - "\n", - "competition_predictions = CompetitionPredictions(\n", - " name=\"hello-world-result\",\n", - " predictions=predictions,\n", - " target_labels=competition.target_cols,\n", - " test_set_labels=competition.test_set_labels,\n", - " test_set_sizes=competition.test_set_sizes,\n", - " github_url=\"https://github.com/polaris-hub/polaris-hub\",\n", - " paper_url=\"https://polarishub.io/\",\n", - " description=\"Hello, World!\",\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "5ff06a9c", - "metadata": {}, - "source": [ - "Once your `CompetitionPredictions` object is created, you're ready to submit them for evaluation! 
This will automatically save your result to the Polaris Hub, but it will be private until the competition closes." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e684c611", - "metadata": {}, - "outputs": [], - "source": [ - "results = competition.evaluate(competition_predictions)\n", - "\n", - "client.close()" - ] - }, - { - "cell_type": "markdown", - "id": "44973556", - "metadata": {}, - "source": [ - "That's it! Just like that you have partaken in your first Polaris competition. Keep an eye on that leaderboard when it goes public and best of luck in your future competitions!\n", - "\n", - "The End.\n", - "\n", - "---" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.2" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/docs/tutorials/create_a_benchmark.ipynb b/docs/tutorials/create_a_benchmark.ipynb new file mode 100644 index 00000000..54d06887 --- /dev/null +++ b/docs/tutorials/create_a_benchmark.ipynb @@ -0,0 +1,267 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Polaris explicitly distinguishes datasets from benchmarks. A benchmark defines the ML task and evaluation logic (e.g. split and metrics) for a dataset. Because of this, a single dataset can be the basis of multiple benchmarks.\n", + "\n", + "## Create a Benchmark\n", + "\n", + "To create a benchmark, you need to instantiate the `BenchmarkV2Specification` class. This requires you to specify: \n", + "\n", + "1. The **dataset**, which can be stored either locally or on the Hub.\n", + "2. The **task**, where a task is defined by input and target columns.\n", + "3. 
The **split**, where a split is defined by sets of indices.\n", + "4. The **metric**, where a metric needs to be officially supported by Polaris.\n", + "5. The **metadata** to contextualize your benchmark.\n", + "\n", + "### Define the dataset\n", + "To learn how to create a dataset, see [this tutorial](./create_a_dataset.html). \n", + "\n", + "Alternatively, we can also load an existing dataset from the Hub.\n", + "\n", + "
\n", + "

Not all Hub datasets are supported

\n", + "

You can only create benchmarks for DatasetV2 instances, not for DatasetV1 instances. Some of the datasets stored on the Hub are still V1 datasets.

\n", + "
\n", + "\n", + "### Define the task\n", + "Currently, Polaris only supports predictive tasks. Specifying a predictive task is simply done by specifying the input and target columns." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "input_columns = [\"SMILES\"]\n", + "target_columns = [\"LOG_SOLUBILITY\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this case, we specified just a single input and target column, but a benchmark can have multiple (e.g. a multi-task benchmark).\n", + "\n", + "### Define the split\n", + "\n", + "To ensure reproducible results, Polaris represents a split through a bunch of sets of indices.\n", + "\n", + "_But there is a catch_: We want Polaris to scale to extra large datasets. If we are to naively store millions of indices as lists of integers, this would impose a significant memory footprint. We therefore use bitmaps, more specifically [roaring bitmaps](https://roaringbitmap.org/) to store the splits in a memory efficient way." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from polaris.benchmark._split_v2 import IndexSet\n", + "\n", + "# To specify a set of integers, you can directly pass in a list of integers\n", + "# This will automatically convert the indices to a BitMap\n", + "training = IndexSet(indices=[0, 1])\n", + "test = IndexSet(indices=[2])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from pyroaring import BitMap\n", + "\n", + "# Or you can create the BitMap manually and iteratively\n", + "indices = BitMap()\n", + "indices.add(0)\n", + "indices.add(1)\n", + "\n", + "training = IndexSet(indices=indices)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from polaris.benchmark._split_v2 import SplitV2\n", + "\n", + "# Finally, we create the actual split object\n", + "split = SplitV2(training=training, test=test)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Define the metrics\n", + "Even something as widely used as Mean Absolute Error (MAE) can be implemented in subtly different ways. Some people apply a log transform first, others might clip outliers, and sometimes an off-by-one or a bug creeps in. Over time, these variations add up. We decided to codify each metric for a Polaris benchmark in a single, transparent implementation. Our priority here is eliminating “mystery differences” that have nothing to do with actual model performance. Learn more [here](https://polarishub.io/blog/reproducible-machine-learning-in-drug-discovery-how-polaris-serves-as-a-single-source-of-truth).\n", + "\n", + "Specifying a metric is easy. You can simply specify its label." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "metrics = [\"mean_absolute_error\", \"mean_squared_error\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can also specify a main metric, which will be the metric used to rank the leaderboard." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "main_metric = \"mean_absolute_error\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To get a list of all supported metrics, you can use:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from polaris.evaluate._metric import DEFAULT_METRICS\n", + "\n", + "DEFAULT_METRICS.keys()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can also create more complex metrics that wrap these base metrics." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from polaris.evaluate import Metric\n", + "\n", + "mae_agg = Metric(label=\"mean_absolute_error\", config={\"group_by\": \"UNIQUE_ID\", \"on_error\": \"ignore\", \"aggregation\": \"mean\"})\n", + "metrics.append(mae_agg)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "

What if my metric isn't supported yet?

\n", + "

Using a metric that's not supported yet currently requires adding it to the Polaris codebase. We're always looking to improve support. Reach out to us on GitHub and we'll be happy to help!

\n", + "
\n", + "\n", + "### Bringing it all together\n", + "Now we can create the `BenchmarkV2Specification` instance." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "type(dataset)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from polaris.benchmark._benchmark_v2 import BenchmarkV2Specification\n", + "\n", + "benchmark = BenchmarkV2Specification(\n", + " # 1. The dataset\n", + " dataset=dataset,\n", + " # 2. The task\n", + " input_cols=input_columns,\n", + " target_cols=target_columns,\n", + " # 3. The split\n", + " split=split,\n", + " # 4. The metrics\n", + " metrics=metrics,\n", + " main_metric=main_metric,\n", + " # 5. The metadata\n", + " name=\"my-first-benchmark\",\n", + " owner=\"your-username\", \n", + " description=\"Created using the Polaris tutorial\",\n", + " tags=[\"tutorial\"], \n", + " user_attributes={\"Key\": \"Value\"}\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Share your benchmark\n", + "Want to share your benchmark with the community? Upload it to the Polaris Hub!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "benchmark.upload_to_hub(owner=\"your-username\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "The End." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.8" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/tutorials/create_a_dataset.ipynb b/docs/tutorials/create_a_dataset.ipynb new file mode 100644 index 00000000..d237c9dc --- /dev/null +++ b/docs/tutorials/create_a_dataset.ipynb @@ -0,0 +1,371 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "On the surface, a dataset in Polaris is simply a tabular collection of data, storing datapoints in a row-wise manner. However, as you try to create your own, you'll realize that there is some additional complexity under the hood.\n", + "\n", + "## Create a Dataset\n", + "\n", + "To create a dataset, you need to instantiate the `DatasetV2` class. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from polaris.dataset import DatasetV2, ColumnAnnotation\n", + "\n", + "dataset = DatasetV2(\n", + " \n", + " # Specify metadata on the dataset level\n", + " name=\"tutorial-example\",\n", + " owner=\"your-username\",\n", + " tags=[\"small-molecules\", \"predictive\", \"admet\"],\n", + " source=\"https://example.com\",\n", + " license=\"CC-BY-4.0\",\n", + " \n", + " # Specify metadata on the column level\n", + " annotations = {\n", + " \"Ligand Pose\": ColumnAnnotation(\n", + " description=\"The 3D pose of the ligand\", \n", + " user_attributes={\"Object Type\": \"rdkit.Chem.Mol\"}, \n", + " modality=\"MOLECULE_3D\"\n", + " ),\n", + " \"Ligand SMILES\": ColumnAnnotation(\n", + " description=\"The 2D graph structure of the ligand, as SMILES\", \n", + " user_attributes={\"Object Type\": \"str\"}, \n", + " modality=\"MOLECULE\"\n", + " ),\n", + " \"Permeability\": ColumnAnnotation(\n", + " description=\"MDR1-MDCK efflux ratio (B-A/A-B)\", \n", + " user_attributes={\"Unit\": \"\tmL/min/kg\"}\n", + " )\n", + " },\n", + " \n", + " # Specify the actual data\n", + " zarr_root_path=\"path/to/root.zarr\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For the rest of this tutorial, we will take a deeper look at the `zarr_root_path` parameter.\n", + "\n", + "First, some context.\n", + "\n", + "## Universal and ML-ready\n", + "\n", + "![image](../images/zarr.png)\n", + "_An illustration of Zarr, which is core to Polaris its datamodel_\n", + "\n", + "With the Polaris Hub we set out to design a universal data format for ML scientists in drug discovery. Whether you’re working with phenomics, small molecules, or protein structures, you shouldn’t have to spend time learning about domain-specific file formats, APIs, and software tools to be able to run some ML experiments. 
Beyond modalities, drug discovery datasets also come in different sizes, from kilobytes to terabytes.\n", + "
 \n", + "\n", + "We found such a universal data format in [Zarr](https://zarr.readthedocs.io/). Zarr is a powerful library for storage of n-dimensional arrays, supporting chunking, compression, and various backends, making it a versatile choice for scientific and large-scale data. It's similar to HDF5, if you're familiar with that. \n", + "\n", + "Want to learn more? \n", + "- Learn about the motivation of our dataset implementation [here](https://polarishub.io/blog/dataset-v2-built-to-scale).\n", + "- Learn what we mean by ML-ready [here](https://polarishub.io/blog/dataset-v2-built-to-scale).\n", + "\n", + "## Zarr basics\n", + "Zarr is well [documented](https://zarr.readthedocs.io/en/stable/index.html), and before continuing this tutorial, we recommend that you at least read through the [Quickstart](https://zarr.readthedocs.io/en/stable/quickstart.html).\n", + "\n", + "## Converting to Zarr\n", + "In its most basic form, a Polaris-compatible Zarr archive is a single Zarr group (the _root_) with equal-length Zarr arrays for each of the columns in the dataset.\n", + "\n", + "Chances are that your dataset is currently not stored in a Zarr archive. We will show you how to convert a few common formats to a Polaris-compatible Zarr archive.\n", + "\n", + "### From a NumPy Array\n", + "The simplest case is if you have your data in a NumPy array." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "\n", + "data = np.random.random(2048)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import zarr\n", + "\n", + "# Placeholder path to the Zarr archive (matches the DatasetV2 example above)\n", + "zarr_root_path = \"path/to/root.zarr\"\n", + "\n", + "# Create an empty Zarr group\n", + "root = zarr.open(zarr_root_path, \"w\")\n", + "\n", + "# Populate it with the array\n", + "root.array(\"column_name\", data=data)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### From a DataFrame\n", + "Since Pandas DataFrames can be thought of as labeled NumPy arrays, converting a DataFrame is straightforward too." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "df = pd.DataFrame({\n", + "    \"A\": np.random.random(2048),\n", + "    \"B\": np.random.random(2048)\n", + "})" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Converting it to Zarr is as simple as creating equally named Zarr Arrays." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "import zarr\n", + "\n", + "# Create an empty Zarr group\n", + "root = zarr.open(zarr_root_path, \"w\")\n", + "\n", + "# Populate it with one array per column\n", + "for col in df.columns:\n", + "    root.array(col, data=df[col].values)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Things get a little trickier if you have columns with the `object` dtype, for example text." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df[\"C\"] = [\"test\"] * 2048" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In that case you need to tell Zarr how to encode the Python object." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numcodecs\n", + "\n", + "root.array(\"C\", data=df[\"C\"].values, dtype=object, object_codec=numcodecs.VLenUTF8())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### From RDKit (e.g. SDF)\n", + "\n", + "The ability to encode custom Python objects is powerful. \n", + "\n", + "Using the custom object codecs that Polaris provides, we can, for example, also store RDKit [`Chem.Mol`](https://www.rdkit.org/docs/source/rdkit.Chem.rdchem.html#rdkit.Chem.rdchem.Mol) objects in a Zarr array." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from rdkit import Chem\n", + "\n", + "# Create an exemplary molecule\n", + "mol = Chem.MolFromSmiles('Cc1ccccc1')\n", + "mol" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from polaris.dataset.zarr.codecs import RDKitMolCodec\n", + "\n", + "# Write it to a Zarr array\n", + "root = zarr.open(zarr_root_path, \"w\")\n", + "root.array(\"molecules\", data=[mol] * 100, dtype=object, object_codec=RDKitMolCodec())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A common use case of this is to convert a number of **SDF files** to a Zarr array.\n", + "\n", + "1. Load the SDF files into `Chem.Mol` objects using RDKit.\n", + "2. Create a Zarr array with the `RDKitMolCodec`.\n", + "3. Store all RDKit objects in the Zarr array.\n", + "\n", + "### From Biotite (e.g. mmCIF)\n", + "Similarly, we can also store entire protein structures, as represented by the Biotite [`AtomArray`](https://www.biotite-python.org/latest/apidoc/biotite.structure.AtomArray.html) class." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from tempfile import TemporaryDirectory\n", + "\n", + "import biotite.database.rcsb as rcsb\n", + "from biotite.structure.io import load_structure\n", + "\n", + "# Load an exemplary structure\n", + "with TemporaryDirectory() as tmpdir:\n", + "    path = rcsb.fetch(\"1l2y\", \"pdb\", tmpdir)\n", + "    struct = load_structure(path, model=1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from polaris.dataset.zarr.codecs import AtomArrayCodec\n", + "\n", + "# Write it to a Zarr array\n", + "root = zarr.open(zarr_root_path, \"w\")\n", + "root.array(\"molecules\", data=[struct] * 100, dtype=object, object_codec=AtomArrayCodec())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### From Images (e.g. PNG)\n", + "For more conventional formats, such as images, codecs likely exist already.\n", + "\n", + "For images, for example, these codecs are bundled in [`imagecodecs`](https://github.com/cgohlke/imagecodecs), which is an optional dependency of Polaris.\n", + "\n", + "An image is commonly represented as a 3D array (i.e. width x height x channels). You therefore don't need an `object_codec` here. Instead, we specify the _compressor_ Zarr should use to compress its _chunks_." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from imagecodecs.numcodecs import Jpeg2k\n", + "\n", + "# You need to explicitly register the codec\n", + "numcodecs.register_codec(Jpeg2k)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "root = zarr.open(zarr_root_path, \"w\")\n", + "\n", + "# Array with a single 3 channel image\n", + "arr = root.zeros(\n", + " \"image\",\n", + " shape=(1, 512, 512, 3),\n", + " chunks=(1, 512, 512, 3),\n", + " dtype='u1',\n", + " compressor=Jpeg2k(level=52, reversible=True),\n", + ")\n", + "\n", + "arr[0] = img" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Share your dataset\n", + "Want to share your dataset with the community? Upload it to the Polaris Hub!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dataset.upload_to_hub(owner=\"your-username\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Advanced: Optimization\n", + "In this tutorial, we only briefly touched on the high-level concepts that need to be understood to create a Polaris compatible dataset using Zarr. However, Zarr has a lot more to offer and tweaking the settings **can drastically improve storage or data access efficiency.**\n", + "\n", + "If you would like to learn more, please see the [Zarr documentation](https://zarr.readthedocs.io/en/stable/user-guide/performance.html#changing-chunk-shapes-rechunking).\n", + "\n", + "---\n", + "\n", + "The End." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.8" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/tutorials/custom_dataset_benchmark.ipynb b/docs/tutorials/custom_dataset_benchmark.ipynb deleted file mode 100644 index 52e6f7ab..00000000 --- a/docs/tutorials/custom_dataset_benchmark.ipynb +++ /dev/null @@ -1,608 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "172ae3e5", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [] - }, - "source": [ - "
\n", - "

In short

\n", - "

This tutorial walks you through the dataset and benchmark data-structures. After creating our own custom dataset and benchmark, we will learn how to upload it to the Hub!

\n", - "
\n", - "\n", - "We have already seen how easy it is to load a benchmark or dataset from the Polaris Hub. Let's now learn a bit more about the underlying data model by creating our own dataset and benchmark!" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "833650d7", - "metadata": { - "tags": [ - "remove_cell" - ] - }, - "outputs": [], - "source": [ - "# Note: Cell is tagged to not show up in the mkdocs build\n", - "%load_ext autoreload\n", - "%autoreload 2" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "2621b09b", - "metadata": { - "tags": [ - "remove_cell" - ] - }, - "outputs": [], - "source": [ - "# Note: Cell is tagged to not show up in the mkdocs build\n", - "import warnings\n", - "\n", - "warnings.filterwarnings(\"ignore\")" - ] - }, - { - "cell_type": "markdown", - "id": "ece200dc", - "metadata": {}, - "source": [ - "## Create the dataset\n", - "\n", - "A dataset in Polaris is at its core a tabular data-structure in which each row stores a single datapoint. For this example, we will process a multi-task DMPK dataset from [Fang et al.](https://doi.org/10.1021/acs.jcim.3c00160). For the sake of simplicity, we don't do any curation and will just download the dataset as-is from their Github.\n", - "\n", - "
\n", - "

The importance of curation

\n", - "

While we do not address it in this tutorial, data curation is essential to an impactful benchmark. Because of this, we have not just made several high-quality benchmarks readily available on the Polaris Hub, but also open-sourced some of the tools we've built to curate these datasets.

\n", - "
" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "9ecc653f-ec84-4102-aa22-cada0377c964", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
Internal IDVendor IDSMILESCollectionNameLOG HLM_CLint (mL/min/kg)LOG MDR1-MDCK ER (B-A/A-B)LOG SOLUBILITY PH 6.8 (ug/mL)LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound)LOG PLASMA PROTEIN BINDING (RAT) (% unbound)LOG RLM_CLint (mL/min/kg)
0Mol1317714313CNc1cc(Nc2cccn(-c3ccccn3)c2=O)nn2c(C(=O)N[C@@H...emolecules0.6756871.4931670.0899050.9912260.5185141.392169
1Mol2324056965CCOc1cc2nn(CCC(C)(C)O)cc2cc1NC(=O)c1cccc(C(F)F)n1emolecules0.6756871.0407800.5502280.0996810.2683441.027920
2Mol3304005766CN(c1ncc(F)cn1)[C@H]1CCCNC1emolecules0.675687-0.358806NaN2.0000002.0000001.027920
3Mol4194963090CC(C)(Oc1ccc(-c2cnc(N)c(-c3ccc(Cl)cc3)c2)cc1)C...emolecules0.6756871.0266621.657056-1.158015-1.4034031.027920
4Mol5324059015CC(C)(O)CCn1cc2cc(NC(=O)c3cccc(C(F)(F)F)n3)c(C...emolecules0.9963801.010597NaN1.0156111.0922641.629093
\n", - "
" - ], - "text/plain": [ - " Internal ID Vendor ID SMILES \\\n", - "0 Mol1 317714313 CNc1cc(Nc2cccn(-c3ccccn3)c2=O)nn2c(C(=O)N[C@@H... \n", - "1 Mol2 324056965 CCOc1cc2nn(CCC(C)(C)O)cc2cc1NC(=O)c1cccc(C(F)F)n1 \n", - "2 Mol3 304005766 CN(c1ncc(F)cn1)[C@H]1CCCNC1 \n", - "3 Mol4 194963090 CC(C)(Oc1ccc(-c2cnc(N)c(-c3ccc(Cl)cc3)c2)cc1)C... \n", - "4 Mol5 324059015 CC(C)(O)CCn1cc2cc(NC(=O)c3cccc(C(F)(F)F)n3)c(C... \n", - "\n", - " CollectionName LOG HLM_CLint (mL/min/kg) LOG MDR1-MDCK ER (B-A/A-B) \\\n", - "0 emolecules 0.675687 1.493167 \n", - "1 emolecules 0.675687 1.040780 \n", - "2 emolecules 0.675687 -0.358806 \n", - "3 emolecules 0.675687 1.026662 \n", - "4 emolecules 0.996380 1.010597 \n", - "\n", - " LOG SOLUBILITY PH 6.8 (ug/mL) \\\n", - "0 0.089905 \n", - "1 0.550228 \n", - "2 NaN \n", - "3 1.657056 \n", - "4 NaN \n", - "\n", - " LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound) \\\n", - "0 0.991226 \n", - "1 0.099681 \n", - "2 2.000000 \n", - "3 -1.158015 \n", - "4 1.015611 \n", - "\n", - " LOG PLASMA PROTEIN BINDING (RAT) (% unbound) LOG RLM_CLint (mL/min/kg) \n", - "0 0.518514 1.392169 \n", - "1 0.268344 1.027920 \n", - "2 2.000000 1.027920 \n", - "3 -1.403403 1.027920 \n", - "4 1.092264 1.629093 " - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import pandas as pd\n", - "\n", - "PATH = (\n", - " \"https://raw.githubusercontent.com/molecularinformatics/Computational-ADME/main/ADME_public_set_3521.csv\"\n", - ")\n", - "table = pd.read_csv(PATH)\n", - "table.head(5)" - ] - }, - { - "cell_type": "markdown", - "id": "b330cf1a-bcb8-44d1-a6d2-54f468749083", - "metadata": {}, - "source": [ - "While not required, a good dataset will specify additional meta-data to give further explanations on the data is contained within the dataset. This can be done on both the column level and on the dataset level." 
- ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "3145fc25-e670-413a-8926-8ab9d6fcb3b0", - "metadata": {}, - "outputs": [], - "source": [ - "from polaris.dataset import ColumnAnnotation\n", - "\n", - "# Additional meta-data on the column level\n", - "# Of course, for a real dataset we should annotate all columns.\n", - "annotations = {\n", - " \"LOG HLM_CLint (mL/min/kg)\": ColumnAnnotation(\n", - " desription=\"Microsomal stability\",\n", - " user_attributes={\"unit\": \"mL/min/kg\"},\n", - " ),\n", - " \"SMILES\": ColumnAnnotation(desription=\"Molecule SMILES string\", modality=\"molecule\"),\n", - "}" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "9e606547-f711-4e4d-8c93-97002a8a2236", - "metadata": {}, - "outputs": [], - "source": [ - "from polaris.dataset import Dataset\n", - "from polaris.utils.types import HubOwner\n", - "\n", - "dataset = Dataset(\n", - " # The table is the core data-structure required to construct a dataset\n", - " table=table,\n", - " # Additional meta-data on the dataset level.\n", - " name=\"Fang_2023_DMPK\",\n", - " description=\"120 prospective data sets, collected over 20 months across six ADME in vitro endpoints\",\n", - " source=\"https://doi.org/10.1021/acs.jcim.3c00160\",\n", - " annotations=annotations,\n", - " tags=[\"DMPK\", \"ADME\"],\n", - " owner=HubOwner(user_id=\"cwognum\", slug=\"cwognum\"),\n", - " license=\"CC-BY-4.0\",\n", - " user_attributes={\"year\": \"2023\"},\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "19bfee67-cda4-43d8-b9fd-31e8bf09de68", - "metadata": {}, - "source": [ - "## Save and load the dataset \n", - "\n", - "We can now save the dataset either to a local path or directly to the hub!" 
- ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "9c7766c1-ae85-4d0e-9757-aa5c495bc95e", - "metadata": {}, - "outputs": [], - "source": [ - "import tempfile\n", - "\n", - "temp_dir = tempfile.TemporaryDirectory().name" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "ff8fff6d", - "metadata": {}, - "outputs": [], - "source": [ - "import datamol as dm\n", - "\n", - "save_dir = dm.fs.join(temp_dir, \"dataset\")" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "1b8d6aa6-9ca7-41dd-b219-9fbee813bc61", - "metadata": {}, - "outputs": [], - "source": [ - "path = dataset.to_json(save_dir)" - ] - }, - { - "cell_type": "markdown", - "id": "45881174", - "metadata": {}, - "source": [ - "Looking at the save destination, we see this created two files: A JSON with all the meta-data and a `.parquet` file with the tabular data. " - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "114f8ceb", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['/var/folders/1y/1v1blh6x56zdn027g5g9bwph0000gr/T/tmpe_g26lrl/dataset/table.parquet',\n", - " '/var/folders/1y/1v1blh6x56zdn027g5g9bwph0000gr/T/tmpe_g26lrl/dataset/dataset.json']" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "fs = dm.fs.get_mapper(save_dir).fs\n", - "fs.ls(save_dir)" - ] - }, - { - "cell_type": "markdown", - "id": "f7a0d35e", - "metadata": {}, - "source": [ - "Loading the dataset can be done through this JSON file." - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "id": "6161d30d", - "metadata": {}, - "outputs": [], - "source": [ - "import polaris as po\n", - "\n", - "dataset = po.load_dataset(path)" - ] - }, - { - "cell_type": "markdown", - "id": "637525f8-5d00-4937-8c0c-29876a309e46", - "metadata": {}, - "source": [ - "We can also upload the dataset to the hub!" 
- ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "e98d9a66-dcd2-451e-8fb2-b603c344e87f", - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "# from polaris.hub.client import PolarisHubClient\n", - "\n", - "# NOTE: Commented out to not flood the DB\n", - "# with PolarisHubClient() as client:\n", - "# client.upload_dataset(dataset=dataset)" - ] - }, - { - "cell_type": "markdown", - "id": "e146f9a6", - "metadata": {}, - "source": [ - "## Create the benchmark specification\n", - "A benchmark is represented by the `BenchmarkSpecification`, which wraps a `Dataset` with additional data to produce a benchmark.\n", - "\n", - "It specifies:\n", - "1. Which dataset to use (see Dataset);\n", - "2. Which columns are used as input and which columns are used as target;\n", - "3. Which metrics should be used to evaluate performance on this task;\n", - "4. A predefined, static train-test split to use during evaluation." - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "3313a76b", - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "from polaris.benchmark import SingleTaskBenchmarkSpecification\n", - "\n", - "# For the sake of simplicity, we use a very simple, ordered split\n", - "split = (np.arange(3000).tolist(), (np.arange(521) + 3000).tolist()) # train # test\n", - "\n", - "benchmark = SingleTaskBenchmarkSpecification(\n", - " dataset=dataset,\n", - " target_cols=\"LOG SOLUBILITY PH 6.8 (ug/mL)\",\n", - " input_cols=\"SMILES\",\n", - " split=split,\n", - " metrics=\"mean_absolute_error\",\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "560c105b-0a01-41ed-a5da-08fcf0db820d", - "metadata": {}, - "source": [ - "Metrics should be supported in the polaris framework.\n", - "\n", - "For more information, see the `Metric` class.\n", - "\n", - "To support the vast flexibility in specifying a benchmark, we have different classes that correspond to different types of benchmarks. 
Each of these subclasses makes the data-model or logic more specific to a particular case. For example, trying to create a multitask benchmark with the same arguments as we used above will throw an error as there is just a single target column specified." - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "1a036795", - "metadata": {}, - "outputs": [ - { - "ename": "ValidationError", - "evalue": "1 validation error for MultiTaskBenchmarkSpecification\ntarget_cols\n Value error, A multi-task benchmark should specify at least two target columns [type=value_error, input_value='LOG SOLUBILITY PH 6.8 (ug/mL)', input_type=str]\n For further information visit https://errors.pydantic.dev/2.4/v/value_error", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mValidationError\u001b[0m Traceback (most recent call last)", - "\u001b[1;32m/Users/cas.wognum/Documents/repositories/polaris/docs/tutorials/custom_dataset_benchmark.ipynb Cell 25\u001b[0m line \u001b[0;36m3\n\u001b[1;32m 1\u001b[0m \u001b[39mfrom\u001b[39;00m \u001b[39mpolaris\u001b[39;00m\u001b[39m.\u001b[39;00m\u001b[39mbenchmark\u001b[39;00m \u001b[39mimport\u001b[39;00m MultiTaskBenchmarkSpecification\n\u001b[0;32m----> 3\u001b[0m benchmark \u001b[39m=\u001b[39m MultiTaskBenchmarkSpecification(\n\u001b[1;32m 4\u001b[0m dataset\u001b[39m=\u001b[39;49mdataset,\n\u001b[1;32m 5\u001b[0m target_cols\u001b[39m=\u001b[39;49m\u001b[39m\"\u001b[39;49m\u001b[39mLOG SOLUBILITY PH 6.8 (ug/mL)\u001b[39;49m\u001b[39m\"\u001b[39;49m,\n\u001b[1;32m 6\u001b[0m input_cols\u001b[39m=\u001b[39;49m\u001b[39m\"\u001b[39;49m\u001b[39mSMILES\u001b[39;49m\u001b[39m\"\u001b[39;49m,\n\u001b[1;32m 7\u001b[0m split\u001b[39m=\u001b[39;49msplit,\n\u001b[1;32m 8\u001b[0m metrics\u001b[39m=\u001b[39;49m\u001b[39m\"\u001b[39;49m\u001b[39mmean_absolute_error\u001b[39;49m\u001b[39m\"\u001b[39;49m,\n\u001b[1;32m 9\u001b[0m 
)\n", - "File \u001b[0;32m~/micromamba/envs/polaris/lib/python3.11/site-packages/pydantic/main.py:164\u001b[0m, in \u001b[0;36mBaseModel.__init__\u001b[0;34m(__pydantic_self__, **data)\u001b[0m\n\u001b[1;32m 162\u001b[0m \u001b[39m# `__tracebackhide__` tells pytest and some other tools to omit this function from tracebacks\u001b[39;00m\n\u001b[1;32m 163\u001b[0m __tracebackhide__ \u001b[39m=\u001b[39m \u001b[39mTrue\u001b[39;00m\n\u001b[0;32m--> 164\u001b[0m __pydantic_self__\u001b[39m.\u001b[39;49m__pydantic_validator__\u001b[39m.\u001b[39;49mvalidate_python(data, self_instance\u001b[39m=\u001b[39;49m__pydantic_self__)\n", - "\u001b[0;31mValidationError\u001b[0m: 1 validation error for MultiTaskBenchmarkSpecification\ntarget_cols\n Value error, A multi-task benchmark should specify at least two target columns [type=value_error, input_value='LOG SOLUBILITY PH 6.8 (ug/mL)', input_type=str]\n For further information visit https://errors.pydantic.dev/2.4/v/value_error" - ] - } - ], - "source": [ - "from polaris.benchmark import MultiTaskBenchmarkSpecification\n", - "\n", - "benchmark = MultiTaskBenchmarkSpecification(\n", - " dataset=dataset,\n", - " target_cols=\"LOG SOLUBILITY PH 6.8 (ug/mL)\",\n", - " input_cols=\"SMILES\",\n", - " split=split,\n", - " metrics=\"mean_absolute_error\",\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "4c1bae4e", - "metadata": {}, - "source": [ - "## Save and load the benchmark\n", - "Saving the benchmark is easy and can be done with a single line of code." 
- ] - }, - { - "cell_type": "code", - "execution_count": 15, - "id": "bfe454ff", - "metadata": {}, - "outputs": [], - "source": [ - "save_dir = dm.fs.join(temp_dir, \"benchmark\")" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "id": "adbaf57f", - "metadata": {}, - "outputs": [], - "source": [ - "path = benchmark.to_json(save_dir)" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "id": "06ca03cb", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['/var/folders/1y/1v1blh6x56zdn027g5g9bwph0000gr/T/tmpe_g26lrl/benchmark/table.parquet',\n", - " '/var/folders/1y/1v1blh6x56zdn027g5g9bwph0000gr/T/tmpe_g26lrl/benchmark/benchmark.json',\n", - " '/var/folders/1y/1v1blh6x56zdn027g5g9bwph0000gr/T/tmpe_g26lrl/benchmark/dataset.json']" - ] - }, - "execution_count": 17, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "fs = dm.fs.get_mapper(save_dir).fs\n", - "fs.ls(save_dir)" - ] - }, - { - "cell_type": "markdown", - "id": "73a691ae", - "metadata": {}, - "source": [ - "This created three files. Two `json` files and a single `parquet` file. The `parquet` file saves the tabular structure at the base of the `Dataset` class, whereas the `json` files save all the meta-data for the `Dataset` and `BenchmarkSpecification`.\n", - "\n", - "As before, loading the benchmark can be done through the JSON file. " - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "id": "71e3fe8c", - "metadata": {}, - "outputs": [], - "source": [ - "benchmark = po.load_benchmark(path)" - ] - }, - { - "cell_type": "markdown", - "id": "1a88c7b9-63ab-4d33-aa46-614aa0f2bda9", - "metadata": {}, - "source": [ - "And as before, we can also upload the benchmark directly to the hub." 
- ] - }, - { - "cell_type": "code", - "execution_count": 19, - "id": "ca2d1529-3773-4caa-b12d-62865c477e76", - "metadata": {}, - "outputs": [], - "source": [ - "# NOTE: Commented out to not flood the DB\n", - "# with PolarisHubClient() as client:\n", - "# client.upload_benchmark(dataset=dataset)" - ] - }, - { - "cell_type": "markdown", - "id": "d16789db", - "metadata": {}, - "source": [ - "The End. \n", - "\n", - "---" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.2" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/docs/tutorials/dataset_pdb.ipynb b/docs/tutorials/dataset_pdb.ipynb deleted file mode 100644 index 56123c12..00000000 --- a/docs/tutorials/dataset_pdb.ipynb +++ /dev/null @@ -1,427 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": 1, - "id": "217690be-9836-4e06-930e-ba7efbb37d91", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [ - "remove_cell" - ] - }, - "outputs": [], - "source": [ - "# Note: Cell is tagged to not show up in the mkdocs build\n", - "%load_ext autoreload\n", - "%autoreload 2" - ] - }, - { - "cell_type": "markdown", - "id": "39b58e71", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [] - }, - "source": [ - "
\n", - "

In short

\n", - "

This tutorial shows how to create datasets with PDBs through the .zarr format.

\n", - "
\n", - "\n", - "
\n", - "

This feature is still very new.

\n", - "

The features we will show in this tutorial are still experimental. We would love to learn from the community how we can make it easier to create datasets.

\n", - "
" - ] - }, - { - "cell_type": "markdown", - "id": "e154bb54", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [] - }, - "source": [ - "### Dummy PDB example" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "5e201379", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "import platformdirs\n", - "\n", - "import datamol as dm\n", - "\n", - "from polaris.dataset import DatasetFactory\n", - "from polaris.dataset.converters import PDBConverter\n", - "\n", - "SAVE_DIR = dm.fs.join(platformdirs.user_cache_dir(appname=\"polaris-tutorials\"), \"dataset_pdb\")" - ] - }, - { - "cell_type": "markdown", - "id": "f4a4a9c7", - "metadata": {}, - "source": [ - "### Fetch PDB files from RCSB PDB" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fda9b878", - "metadata": {}, - "outputs": [], - "source": [ - "import biotite.database.rcsb as rcsb\n", - "\n", - "pdb_path = rcsb.fetch(\"6s89\", \"pdb\", SAVE_DIR)\n", - "print(pdb_path)" - ] - }, - { - "cell_type": "markdown", - "id": "8a47ae20", - "metadata": {}, - "source": [ - "### Create dataset from PDB file" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "07442028", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "save_dst = dm.fs.join(SAVE_DIR, \"tutorial_pdb.zarr\")\n", - "\n", - "factory = DatasetFactory(zarr_root_path=save_dst)\n", - "factory.reset(save_dst)\n", - "\n", - "factory.register_converter(\"pdb\", PDBConverter(pdb_column=\"pdb\"))\n", - "factory.add_from_file(pdb_path)\n", - "\n", - "# Build the dataset\n", - "dataset = factory.build()" - ] - }, - { - "cell_type": "markdown", - "id": "35bb183e", - "metadata": {}, - "source": [ - "### Check the dataset" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "id": "05712cbd", - "metadata": {}, - "outputs": 
[ - { - "data": { - "text/html": [ - "
nameNone
description
tags
user_attributes
ownerNone
polaris_version0.7.10.dev22+g8edf177.d20240814
default_adapters
pdbARRAY_TO_PDB
zarr_root_path/Users/lu.zhu/Library/Caches/polaris-tutorials/002/tutorial_pdb.zarr
readme
annotations
pdb
is_pointerTrue
modalityPROTEIN_3D
descriptionNone
user_attributes
dtypeobject
sourceNone
licenseNone
curation_referenceNone
cache_dir/Users/lu.zhu/Library/Caches/polaris/datasets/b0895f92-5a11-4e48-953f-3f969c6a9ca6
md5sum66f3c7774e655bc6d48c907100d6912f
artifact_idNone
n_rows1
n_columns1
" - ], - "text/plain": [ - "{\n", - " \"name\": null,\n", - " \"description\": \"\",\n", - " \"tags\": [],\n", - " \"user_attributes\": {},\n", - " \"owner\": null,\n", - " \"polaris_version\": \"0.7.10.dev22+g8edf177.d20240814\",\n", - " \"default_adapters\": {\n", - " \"pdb\": \"ARRAY_TO_PDB\"\n", - " },\n", - " \"zarr_root_path\": \"/Users/lu.zhu/Library/Caches/polaris-tutorials/002/tutorial_pdb.zarr\",\n", - " \"readme\": \"\",\n", - " \"annotations\": {\n", - " \"pdb\": {\n", - " \"is_pointer\": true,\n", - " \"modality\": \"PROTEIN_3D\",\n", - " \"description\": null,\n", - " \"user_attributes\": {},\n", - " \"dtype\": \"object\"\n", - " }\n", - " },\n", - " \"source\": null,\n", - " \"license\": null,\n", - " \"curation_reference\": null,\n", - " \"cache_dir\": \"/Users/lu.zhu/Library/Caches/polaris/datasets/b0895f92-5a11-4e48-953f-3f969c6a9ca6\",\n", - " \"md5sum\": \"66f3c7774e655bc6d48c907100d6912f\",\n", - " \"artifact_id\": null,\n", - " \"n_rows\": 1,\n", - " \"n_columns\": 1\n", - "}" - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "dataset" - ] - }, - { - "cell_type": "markdown", - "id": "e5f904bc", - "metadata": {}, - "source": [ - "### Check data table" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "id": "6b7017ad", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
pdb
0pdb/6s89
\n", - "
" - ], - "text/plain": [ - " pdb\n", - "0 pdb/6s89" - ] - }, - "execution_count": 16, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "dataset.table" - ] - }, - { - "cell_type": "markdown", - "id": "a89953b8", - "metadata": {}, - "source": [ - "### Get PDB data from specific row\n", - "A array of list of `biotite.Atom` will be returned.\n", - "See more details at [fastpdb](https://github.com/biotite-dev/fastpdb) and [Atom](https://github.com/biotite-dev/biotite/blob/main/src/biotite/structure/atoms.py)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f2583c8d", - "metadata": {}, - "outputs": [], - "source": [ - "dataset.get_data(0, \"pdb\")" - ] - }, - { - "cell_type": "markdown", - "id": "5b3c1be6", - "metadata": {}, - "source": [ - "### Create dataset from multiple PDB files" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "5647c8ff", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "['/Users/lu.zhu/Library/Caches/polaris-tutorials/002/1l2y.pdb', '/Users/lu.zhu/Library/Caches/polaris-tutorials/002/4i23.pdb']\n" - ] - } - ], - "source": [ - "pdb_paths = rcsb.fetch([\"1l2y\", \"4i23\"], \"pdb\", SAVE_DIR)\n", - "print(pdb_paths)" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "1bd32964", - "metadata": {}, - "outputs": [], - "source": [ - "factory = DatasetFactory(SAVE_DIR.join(\"pdbs.zarr\"))\n", - "\n", - "converter = PDBConverter()\n", - "factory.register_converter(\"pdb\", converter)\n", - "\n", - "factory.add_from_files(pdb_paths, axis=0)\n", - "dataset = factory.build()" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "1e05109e", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
pdb
0pdb/1l2y
1pdb/4i23
\n", - "
" - ], - "text/plain": [ - " pdb\n", - "0 pdb/1l2y\n", - "1 pdb/4i23" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "dataset.table" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a064942f", - "metadata": {}, - "outputs": [], - "source": [ - "dataset.get_data(1, \"pdb\")" - ] - }, - { - "cell_type": "markdown", - "id": "72767ef2", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [] - }, - "source": [ - "The process of completing the dataset's metadata and uploading it to the hub follows the same steps as outlined in the tutorial [dataset_zarr.ipynb](docs/tutorials/dataset_zarr.ipynb)\n", - "\n", - "The End. " - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.2" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/docs/tutorials/dataset_sdf.ipynb b/docs/tutorials/dataset_sdf.ipynb deleted file mode 100644 index 96914316..00000000 --- a/docs/tutorials/dataset_sdf.ipynb +++ /dev/null @@ -1,520 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": 1, - "id": "e558d600-68d2-473f-89b4-4a356277c078", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [ - "remove_cell" - ] - }, - "outputs": [], - "source": [ - "# Note: Cell is tagged to not show up in the mkdocs build\n", - "%load_ext autoreload\n", - "%autoreload 2" - ] - }, - { - "cell_type": "markdown", - "id": "f842d55c-6327-4e81-ba07-79eafe9d47a3", - "metadata": {}, - "source": [ - "
\n", - "

In short

\n", - "

This tutorial shows how to create more complex datasets from SDF files by leveraging the dataset factory in Polaris.

\n", - "
\n", - "\n", - "
\n", - "

This feature is still very new.

\n", - "

The features we will show in this tutorial are still experimental. We would love to learn from the community how we can make it easier to create datasets.

\n", - "
" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "278ce19e-0b47-43f1-9876-b3b69a2154e1", - "metadata": {}, - "outputs": [], - "source": [ - "import platformdirs\n", - "import datamol as dm" - ] - }, - { - "cell_type": "markdown", - "id": "856afc27-6f9e-40a1-97c2-e4c1188b2faf", - "metadata": {}, - "source": [ - "## Dataset Factory\n", - "Datasets in Polaris are expected to be saved in a very specific format. This format has been carefully designed to be as universal and performant as possible. Nevertheless, we expect very few datasets to be readily available in this format. We therefore provide the `DatasetFactory` as a way to more easily convert datasets to the Polaris specific format.\n", - "\n", - "Let's assume we have a dataset in the SDF format. " - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "d8b3087e-8c50-45b4-ada7-44bf783cc929", - "metadata": {}, - "outputs": [], - "source": [ - "SAVE_DIR = dm.fs.join(platformdirs.user_cache_dir(appname=\"polaris-tutorials\"), \"dataset_sdf\")" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "0776f067-d01b-4b7c-89f6-a3c817f934fb", - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": 
"[base64-encoded PNG omitted]", - "text/html": [ - "\n", - "
my_propertymy_value
" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Let's generate a toy dataset with a single molecule\n", - "smiles = \"Cn1cnc2c1c(=O)n(C)c(=O)n2C\"\n", - "mol = dm.to_mol(smiles)\n", - "\n", - "# We will generate 3D conformers for this molecule with some conformers\n", - "mol = dm.conformers.generate(mol, align_conformers=True)\n", - "\n", - "# Let's also set a molecular property\n", - "mol.SetProp(\"my_property\", \"my_value\")\n", - "\n", - "mol" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "d5b6aa13-3951-461d-b4fc-dfcaeb169301", - "metadata": {}, - "outputs": [], - "source": [ - "path = dm.fs.join(SAVE_DIR, \"caffeine.sdf\")\n", - "dm.to_sdf(mol, path)" - ] - }, - { - "cell_type": "markdown", - "id": "a79bf673-1bed-4c78-be92-e98a10cf5ec0", - "metadata": {}, - "source": [ - "This being a toy example, it is a very small dataset. However, for many real-world datasets SDF files can quickly get large, at which point it is no longer efficient to store everything directly in the Pandas DataFrame. This is why Polaris supports [pointer columns](./dataset_zarr.html) to store large data outside of the DataFrame in a Zarr archive. But... How to convert from SDF to Zarr? \n", - "\n", - "There are a lot of considerations here: \n", - "- You want read and write operations to be quick.\n", - "- You want to reduce the storage requirements.\n", - "- You want the conversion to be lossless.\n", - "\n", - "Chances are you've no in-depth understanding of how Zarr works, making it a big investment to convert your SDF dataset to Zarr.\n", - "\n", - "`DatasetFactory` to the rescue!" 
- ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "2955c572-6d1d-47ff-8101-5c2781fc1c4d", - "metadata": {}, - "outputs": [], - "source": [ - "from polaris.dataset import DatasetFactory\n", - "from polaris.dataset.converters import SDFConverter\n", - "\n", - "# Create a new factory object\n", - "save_dst = dm.fs.join(SAVE_DIR, \"data.zarr\")\n", - "factory = DatasetFactory(zarr_root_path=save_dst)\n", - "\n", - "# Register a converter for the SDF file format\n", - "factory.register_converter(\"sdf\", SDFConverter())\n", - "\n", - "# Process your SDF file\n", - "factory.add_from_file(path)\n", - "\n", - "# Build the dataset\n", - "dataset = factory.build()" - ] - }, - { - "cell_type": "markdown", - "id": "b5f0b66f-36c0-48ab-8da9-817414ba6083", - "metadata": {}, - "source": [ - "That's all! Let's take a closer look at what this has actually done." - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "34022d65-7d1f-41ca-902d-a8385c4b6e40", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'smiles': ColumnAnnotation(is_pointer=False, modality=, description=None, user_attributes={}, dtype=dtype('O')),\n", - " 'my_property': ColumnAnnotation(is_pointer=False, modality=, description=None, user_attributes={}, dtype=dtype('O')),\n", - " 'molecule': ColumnAnnotation(is_pointer=True, modality=, description=None, user_attributes={}, dtype=dtype('O'))}" - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "dataset.annotations" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "451b687e-34dd-4a86-9d36-b39d6247a24e", - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAcIAAACWCAIAAADCEh9HAAAABmJLR0QA/wD/AP+gvaeTAAAZI0lEQVR4nO3de1hU1d4H8O8w3DHlYggaat4V0YRTjyhlgiAaCCKopGLlm+Xr8fR27Bw1FDnkBctL9qqlnkxMjEAkDK947PCGICcvKaAVISgoDigoSTAwM/v9Y0gDxYTZe7bK9/OXrplZ6zfPM8+Xvfdae22FIAggIqK2MpG7ACKiRxtjlIjIIIxRIiKDMEaJiAzCGCUiMghjlIjIIKZyF0BkdA0NOHwY589DEDBwIPz8YG4ud030CFNw3Si1LwUFCAhARQX+9CcAOHECDg7YuxcDB8pdGT2qGKPUnmg0GDoUdnbYvx8dOwJAdTXGj0d5OfLzYWYmd330SOK1UWpPDh7EuXNYt64xQwF07Ij161FQgLQ0WSujRxhjlNqT7Gx06oRnn23S6OEBBwdkZ8tUEz3yGKPUnlRWwtn5Hu3duuH6daNXQ48Jxii1J9bWqKy8R/v16+jQwejV0GOCMUrtiZsbKipQVtakUaVCWRnc3GSqiR55jFFqT4KC0KEDli9v0rhiBWxsEBIiU030yOPye2pP7OywdSumT4dKBV9fKBRIT0dKCnbsgL293MXRo4rrRql9+PZbqNUYMwYATp7Exo3Iy4MgYPBgzJ3buBSfqE0Yo9QOCAI8PHD6NBISMGWK3NXQ44bXRqkdSEnB6dNwdsaECQDQ0IDMTLlroscHY5Qed4KAmBgAWLwYVlYAEBeH55/HnDny1kWPDcYoyUMQhNraWmOMtHs3zpyBiwtmzQKAhgasWAEAL75ojNGpHWCMkgy+/vrrXr16RUVFST6STodlywAgMhIWFgDw6acoKoKrK8LCJB+d2gfGKMnAwcGhuLj4yy+/1Ol00o6UmIizZ9G9O159FQDq67FqFQBER8OEP34SB39JJANPT8+ePXuWlJRkZWVJOIxW23hVdMmSxo2Zt25FcTEGD+ZiexIRY5RkoFAoJk+eDCAhIUHCYRIScP48evZERAQAqNWIjQWAmBgeipKI+GMieUydOhVAYmKiRqORZACttvGq6NKljYeimzejtBTDhiE4WJIRqb1ijJI8hg0bNnDgwIqKim+++UaSAeLj8cMP6NMH06cDQF0d3n8fAKKjoVBIMiK1V4xRko105/UajUal73bJEpiaAsDHH+PyZbi7IzBQ9OGonWOMkmzCw8MB7NmzR61Wi9vzzp07nQ4cWB0QgGnTAKCuDqtXA0BMDA9FSXSMUZJN//79n3nmGVtbx3//+6KI3TY0NLz33nsAuoaHQ6kEUPPPf+LKFTz3HF56ScSBiPQYoySnV15JLy7+MS6un4h9xsXFXbhwoV+/flOmTAFQU1PTZ9my5c89V6ufcSISG2OU5DRxYmeFAqmpqKkRp8OGhoYVK1YAiImJUSqVADZs2HBVpdqnVFr5+oozBlFTjFGSU/fu8PTEr7+K9njjbdu2FRUVubq6hoWFAaipqVm7di2AGP06fCIJMEZJZvr9P0WZrq+vr4+NjQUQHR1tYmICYP369eXl5SNHjhyj37CZSALctplkdvUqnnoKpqa4ehW2tgZ1tWnTprlz5w4ePPjMmTMmJia3bt3q1atXRUXF0aNHR48eLVK9RM3xaJRk5uSEF1+EWo2vvjKoH7VavXLlSgAxMTH6Q9F169ZVVFR4eXkxQ0lSjFGSnyjn9Vu2bCktLR02bFhwcDCAmzdvfvjhhwCWcYKeJMYYJfmFhsLcHP/6F8rL29hDXV3dqlWrAPTr1y8iIuLq1avr1q2rrKz08fEZNWqUmLUS3YUPWCb52dlh+nTY2KANu5TU1tYeO3Zs9erVly9fBvDll1+amZmVlpaeOXMGgDF2hqZ2j1NMJL/SUmRn47nn0KNHY4tOh+RkjBiBbt3u/ZHCwuv79sXv27cvIyPj9r2knTt39vX1TUhIUCqVGo3Gz8/v0KFDRvkG1K7xaJTkl5O
DyZPx3HPIzm7cCFSjweTJSElpEqNaLbKzkZaGI0dQXl5dUvIWABMTEw8PjzFjxgQEBIwcOVKhUFy8eFG/G/SSJUvk+T7UzjBG6aFgYoKiIvzzn5g9u/lLV69i/37s34/0dFRXNzba2T39+uv/8/zz7uPGjevcufPv3x8REZGVldW9e3cvLy+j1E7tHU/qSX7JyZg6FWvXIjoa58/D0RH19bCwQEoKdu3C7t24/SN1c8P48Rg/HiNGNG6Ad7eqqipnZ2eNRlNaWurk5GS0b0HtFmfq6WExZw6cnfHOO00au3WDpSXGjMGHH6K4GGfPIjYWL7zQYoYCsLOz8/f312q1iYmJUtdMBMYoPTxMTfHxx9i5E//6153GxYtx/TrS0/HWW3cmoP7QtGnTAMTHx0tQJlFzjFF6iDz/PF5+GfPmoaGhscXBAVZWre4nMDCwU6dOP/xQVlBwXdwKie7GGKWHy+rVuHIFmzcb1ImlpeWbb56tr78UH+8gUl1ELWKM0sPFyQkxMTD8Bs4xY7rX1WHnTnAOlaTGGKWHzty56NnT0E68vdGtGwoL8d13IpREdB+MUZKfoyN8fO78V6nEpk0YMwaOjm3v08SkcccTzjOR1LhulOS3YQMmTYKzs8jdnjoFDw84OuLy5fstkCIyEI9GSWYZGZg3Dx4ed2bnxeLuDldXlJc3WUFliLKysvT0dHH6oscIY5RktmgRAMydCzMz8TufOhUw+Ly+qKho/fr1vr6+3bt3DwkJqaurE6U2emzwVIfklJKC7Gw4OuIvf5Gk/5dfxvLlUCha/UFBEE6cOJGSkpKamnru3Dl9o5WVlY+PT2VlZdeuXUUulB5ljFGSjVYL/R5MUVF44glJhnByQkFBk22iGhpw6xZsbe+drQ0NyMjAv/+9Pi5udWlpqb7R3t4+ICAgKCjI39/f2tpakkLpUcYYJdns2IH8fDz9NF5/Xaoh3n4bW7bg888xfXpjy8GDmDABv/7a5Oao2locOYK0NHz1FcrLMWqUtrS01MXFZdy4cQEBAWPHjjU3N5eqRHr0MUZJHvX1eO89AFi2DJJmlKUl5s/HSy/Bzq75S+Xl2LsXqak4cgS3L3i6uWHs2Olr1oxyd3dXtOFyALU/jFGSx8aNKCqCm1vjLJB0XnwRV65g4cImN5jeuAE/P2RlQacDAKUSXl4IDkZwMHr3BuAIGLBmldoZxijJ4NYtxMYCQGxs43b30lEqsW4dfH0xcyZGjGhstLXFlSswN4eXFwICMHmy+KtWqf1gjJIMPvgA5eXw8sL48cYYztsboaGYMwcnT95pTElBr17o0MEYBdDjjetGydgqKrBuHYDGA1LjWLsWRUX43/+90zJkCDOUxMEYJWPbtGnvk0/WBwZi5EjjDdqtG5YuRUwMrl0z3qDUTvCeejKq4uLiAQMGCILy9OkLgwZ1kXq4N97A5ctISwMAjQbDhkGtRkFB8wVPRIbg0SgZVVRUlFqtDg8PM0KGNqN/SMnPPxt5WHr8MUbJePLy8uLj483NzaOiomQpwMsLM2bIMjI9zhijZDzvvvuuTqebM2dOr169jDCcIOC11xrP6G+Li4Mg8IyexMQYJSPJyclJS0vr0KHDIv2eTtJLSsLw4Zg1yzijUfvFGCUjWbhwoSAI8+fP79LFGFdFtVpERwOAp6cRRqN2jTP1ZAzx8fHTp0/v3LlzYWFhx44djTDitm2YNQt9++LcOW59T9Li74skdOnSpZSUlKSkpGPHjgGIjIw0TobW1zc+W/Qf/2CGkuR4NEri+/HHH/fs2ZOcnHzyd3dfOjo6Xrp0ycLCwggFfPQR3noLbm74/nvJ79kn4l9qEk9u7t7DhyO3b8/Ly9M3dOzY0d/fPyMjQ6VSrVq1yjgZWlODlSsBYPlyZigZA39lZLD8fERHY9AgDBnSLzU1Ly/Pzs5uxowZiYmJV65c8fPzq6ysfPLJJ2cYa8XmRx/h6lU8+ywCAowzILV3PBqlNhEEHD+O5GQ
kJ6O4uLHR0bGHh8fhJUtGjx5t+tslyRs3bjQ0NNTX1yuVSiPUdfMmVq8GgNjYtjyCiagNGKN0X9XV+OILnD0LrRb9+mHqVHTtiuJieHnh8uXG93TrhpAQTJoELy8rpdL3d5++efOmvb29mZnZzZs3z58/P3DgQKnr/eADVFbCxwfe3lIPRdSIU0zUstxcjB0LS0v4+MDcHJmZKCxEYiL8/eHiAlNTBAcjLAwjRjS7BllVVfX1118nJSWlp6er1Wp947JlyyIjIyWt99o19OqFX35BVhaXi5LxMEapBVotBg9Gly44dAj6qSH9zZUpKSgogFqNp55q/hGVCl99VX/wYKf9++vq6wEolcpRo0b179//448/HjZs2KlTpyQteenSwpiY3hMmIDVV0nGImmCMUguOHIGvb/PjOpUKPXrg/febPFe+ogIHDiApCQcPQqMB8OdnnvnexiYsLGzKlClOTk5qtdrR0bG6uvrnn3/u3bu3RPWWlJT069fP1fWFzz5Lc3Mzk2gUorvx2ii14ORJmJri2WebNHbpgt69G5/FUVTUOMWUkwP9H2NLS4wfj0mTPgoMNPndczgtLCxeeumlL774IiUl5Z133pGo3piYmLq6uv79OzNDyci44IlaUF0Ne/t73APUpQtu3gSA5cvxt7/h+HFYWiIgAHFxUKmQmoqICJO7nmUcEhICIDk5WaJiCwoKtm/frlQq5dqCj9ozHo1SCzp1wvXr0GiaJ6lKhe7dASA8HLW1mDQJ/v6wtr5/Z+PHj7exscnJySkpKXFxcRG92KioKI1G8/rrr/fv31/0zonuj0ej1AJ3d2i1yM9v0lhejsJCeHgAgI8P4uMREvKHGQrA2traz89PEIRUCWZ/cnNzExMTLS0tlyxZInrnRH+IMUotGD0affvi3Xf1s0aNoqNhYYHw8Db0N2nSJEhzXh8ZGanT6d58800pjnOJ/hBn6qllJ09i7Fg4O2PcOJibIyMDJ04gIQFBQW3o7JdffnF0dGxoaLhy5Yqjo6NYNf7nP/8ZPny4tbV1YWGhcXYyJWqGR6PUMg8P/PADZszA5csoKIC3N86da1uGAnjiiSe8vb21Wq245/WLFi0SBOHtt99mhpJceDRKxrNt27ZZs2aNHTv24MGDonR45MgRX19fW1vbCxcu2N21PIDIOHg0SsYTFBRkamp69OjRyspKUTrUzyktWLCAGUoyYoyS8Tg4OIwaNaqhoSGt2eM6W+/06dMRERHHjx93cnKaN2+eKOURtQ1jlIzKwPn6/Pz86OjoAQMGuLu7f/755507dw4MDLSxsRG1RqLW4fJ7MqqQkJB58+YdOnSourr6AZ/LJAhCTk7O7t27k5OTi3/b29TZ2bl3796ZmZnWD7BqlUhSPBolo+rSpYunp6darT5w4MD936nT6U6ePBkdHd23b19PT881a9YUFxc/9dRTs2fP3rt376VLl958800AKpXKKIUTtYhHo2RsISEhmZmZycnJU6ZMuftVnU6XlZWVlJSUnJx8+bedoV1cXCZOnBgWFjZixAiT3/Y21a9wKi8vN1rlRPfEBU9kbCUlJT169LCysqqoqLh9Sq7VarOzs5OSkpKSksrKyvSNPXr0CAoKCgsLGzlypOKuR4Lk5uYOGTLE1dX19hP0iGTBo1EyNhcXFw8PjxMnThw+fDgwMFCfnomJiVevXtW/oWfPnhMmTLhnel67dq1z5876f+uPRnlST7Lj0SjJIDY2dtGiRX369KmsrLy9hnTAgAGhoaGhoaFDhw5t9v6SkpI9e/YkJSXl5ORcvHixa9euAHQ6nYWFhVarVavVZmbcY5Rkw6NRksELL7zg7OxcWFgoCMKgQYMCAwMDAgK8vLyave3ChQvJycm7d+/+7rvv9H/vra2tz549q49RExMTBwcHlUp17do1Z2dnGb4GEQDGKMkiOzu7rKzM3d09Pj5+wIABzV4tLi5OTU1NSkrKysrSp6eVlZWPj09YWNjEiROfeOK
J2+/s0qWLSqVSqVSMUZIRY5RksGPHDgBRUVG/z9ALFy7onyd67NgxfYu1tbW3t3dYWFhISEiHDh3u7ke/UxQvj5K8GKNkbKdPnz579qyDg8O4ceP0LcnJyUuXLs3/bYtoOzu7wMDA0NBQPz8/C/1DSVswZMgKlSq6qmqw5EUTtYwxSsYWFxcHYNq0aebm5voWhUKRn59vZ2cXEBAQFhY2duzY2y/9kWdzc1FaKlmtRA+AMUpGpdFoEhISAMycOfN2o7+//6FDh7y9vU3vfoLefem3GOU5PcmLMUpGtW/fPpVK5erq6u7ufrtR/6SmNvSm30Sf9zGRvHhPPRmV/oz+lVdeEaU3Ho3Sw4AxSsZTWVm5f/9+U1PTadOmidIhY5QeBoxRMp5du3ap1Wo/Pz+xlnnqT+oZoyQvxigZj/6M/veTSwZydIRCgYoK6HRidUnUarynnozk3Llzrq6unTp1Kisrs7KyEqtbe3tUVeHaNTg4iNUlUetwpp6MZNeuviNHFo0alS1ihgLo0gVVVVCpGKMkGx6NkjHodOjRA6WlyMqCp6eYPR89CqUSf/oT+EAmkguPRskY0tNRWoq+fTF8uDgd1tUhIgKurli69E7jP/6Bbt3wX/8lzhBED4hTTGQMcXEAMHMm7trDvo00GiQlIToaR47caczIwKlT4vRP9OAYoyS56mqkpkKhwMsvi9yzpyfmzoVaLXK3RK3CGCXJJSbi118xejSeflrknv/2N1RXIzZW5G6JWoUxSpK7fUYvuo4dsXIlYmNRUCB+50QPiDFK0ioqwrFjsLFBSIgk/c+cCQ8P/Pd/S9I50YNgjJK0tm+HICA0FPfavb4Vbt68d7tCgU8+QUYG9uwxqH+iNmOMkoQEATt3Agaf0efno29fbN5871cHD8af/4y//x319QaNQtQ2jFGS0P/9Hy5cQI8eGDWq7Z0UF8PPDxUVOHwYLd0sEh2N2lpkZ7d9FKI2Y4yShFJTAWDGDJi09Yd27Rr8/XHlCl58EfHxLS477dgRa9ZwgxKSB+9iIgmtXo2AAPTt28aP//IL/P3x448YMgQpKbC0BIC0NOzejU2bsHkzfv9s5qlT0dAg/poqoj/Ee+pJTIGBUKuxd29j5AF4/30UFGDr1lZ3VV+PwEAcPozevZGZCScnAMjMhJ8famuxfbskK6iI2oAn9SSm3Fykp2PlyjstJSX46adW96PTYfp0HD6Mrl2Rnt6YoXl5mDABtbWYPZsZSg8RxiiJbNw4xMbi/HmDOlm6tCApCba2OHCg8Ty9qAh+fqiqQlAQNm4UpVIicTBGSWT+/vDxwZw5Lc6q/6HFixcvXz5gzJj/fP01hgwBgIoKjBuHsjKMHo2EBLTyMcxE0mKMkvjWrkV2duOK0dbatGnT8uXLTUwUc+aUenkBQHV140TT0KHYs+fOVVeihwRjlMQ3YADmz8c776CqqnUf3LVr17x58xQKxZYtW0JCQgDU1yM0FKdOoU8fHDoEW1tJCiYyBGOUJLF4MWxsEBPTio8cOXLk1Vdf1el0H3zwwWuvvQZAq8W0aUhPb5xo0j9OmehhwxglcVRWNvmvtTXWrsXGjXem6e+/Nj4nJyc4OLi+vn7hwoXz588HIAhCVNT+r76CnR0OHULPntLUTWQwxigZqrYWCxdi4EBcvdqkPTgYfn44fBgAyssxaBASEu7dQ35+/vjx42tqaiIiIlasWKFvjIyMXLHipeHDV6WlYfBgSb8BkUEYo2SQzEy4uWHVKty4gays5q9u3AhrawDYtAk//ojwcMyejZqa5m9LTk6urKwMDg7etm2bQqEAsGHDhpUrV5qZmS1a5DZihBG+B5EBBKI2qa0VFiwQlEoBEAYPFk6dEgRB+PZb4dKlJm87fVr47jtBpxM2bxasrQVAePpp4dix5r3t2LGjtrZW/++dO3eamJgoFIrPPvtM+u9BZCjGKLXFmTPC0KECIJiaCgsWCHV1D/Sp3FxhyBABEMzMhDVrqjQ
azd3vSUtLMzMzA7BmzRqRiyaSBmOUWqehQYiNFczNBUDo3Vv49tvWfbyuTliwQDA11Q4b5uPp6VlYWPj7V48fP25jYwMgMjJSzKKJpMQYpVbIyxM8PARAUCiE2bOFW7fa2M8331xwdnYGYGtr+8UXX+gbc3Nz7e3tAcycOVOn04lWNJHEGKP0QDQaTWxs7JAhtwChVy8hI8PQDisqKoKDg/UX6MPCwnJzc11cXABMmDChoaFBjJKJjIQb5dEf++mnn1555ZXs7Oy+fYPHjNnz/vsKAx+sdNuWLVv++te/1tTUKJVKrVY7evTo/fv3W/J+T3qkMEbpfgRB2Lp1qz7pnJyctm7dGhAQIO4QRUVF4eHh+fn5Dg4O33//vS3v96RHDWOUWnTx4sXXXnvt6NGjAMLCwj755BP9tUvR3bp16+LFi3369LGwsJCifyJJMUbp3pKSkt54442qqipHR8dPPvlk4sSJcldE9JDiXUzUnEqlCgoKmjx5clVVVWhoaF5eHjOU6D64/y01kZmZGRQUVFlZaW9vv2HDhvDwcLkrInrY8aSemrhx44abm5urq+unn37arVs3ucshegQwRqm5kpIS/RJOInoQjFEiIoNwiomIyCCMUSIigzBGiYgMwhglIjIIY5SIyCD/D9+G3rhq5bBLAAABTnpUWHRyZGtpdFBLTCByZGtpdCAyMDIzLjA5LjUAAHice79v7T0GIOBnQAA+KL+BkY0hAyTAyMzOoAFiMEMFmBkRAmCaBZ3mgNBMaBoZmQkq4GZgZGBkYmBi5mBiZmFgYeVgYmVjYGPnYGLjYODgZODgYuDi5mDi4mHg4WVgZWTgYWEQYQJqZGUEKmdlY+Pg4mFhFd8EMgqKGfiWv+A4EMzqfeAh9+T9qasm7JdQkz+wae76fb+tPuxjNbE9sOuWlf2P4MN2sicZDxgfmWl/TnKiXfyMnP0T627b/arT2h/tNG//60ds+3u8qvY36/fse1i1Z/+O9a/3u/zi3a/3X/RA1r2N+5oDM+2ntG8Fmm+w//3Jz/Y6V6QOeL8SsZcsnm5vzfh2n334ZPt9B4Udls1+su+DWIZ93K5OuwWdH+yuhX2xf28hat9UvM9eDACM1GEAYrR3BQAAAaF6VFh0TU9MIHJka2l0IDIwMjMuMDkuNQAAeJx9U0tOxDAM3fcUvgCR7TgfL5kZhBCiI8HAHdhzf2EnHZJsaOsqcZ+d52d3A7/eL6/fP/B3xcu2AeA/j6rCV0TE7Q18Aaen55cdzrfH091zvn7utw8gAUoWY/eKfbxd3+4egjM8cKCojAgPMVDhtqJAUitOwQx7c0dsAA5Sc/EVhijOaECjZ8UgWQ+ophwdEAqrzkixpBZfKXUgcvHvFBgpzsBkKTFwjdKPVBZuGYVrmYG5n51zxl4Hx3zEJM7L4aVDtZSjDi6x9iApnGZohau7qzRWnitJ7pSt+IWAuk6uSZFGUInkSEqVZyThwVWRWtGK96RCukKp1Z8qlYakGrkTkcxLUcRG1QCJPKfRw3pUJ5KXmig28RVdKhfX1WmSZl06b5N0buxqO8lYcG+S1LKc/bRflunq83a67pcxb37zmCnbQBxzYxuQMRxklsYIsFkejSazMprJZnU0jGyroynkNkvfHDQJzP4inoSk5omTYNReMinjJEZM6og0KzLX7/v7v2rr7RdbH7+0RVgL8gAAAOF6VFh0U01JTEVTIHJka2l0IDIwMjMuMDkuNQAAeJwlT0uuxDAIu8pbtlIaBUP4qOoq+86F5vAPMiuC7dhmvbSed+FZtI7nc77HOn8PrL/vcaETB9rFnQw5qZO43zV55I4urtau0Vn4ziEaG46p3EY3RNxJOs1CBywadQziROEs9TUgSK3ArSxUtYLAutkJjYLDbAfB2IsXw6wiLsRbN0UrI4tsGxaTdA0i2XJybPcYlBVi/NRCgawynSxRckbbVyDuXCdRS8fhO1lEZ2pjZMHsodH2vX6Xndc2HXWfuMX
5/QcEX0W59Lht5AAAAABJRU5ErkJggg==", - "text/plain": [ - "" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "dataset.get_data(row=0, col=\"molecule\")" - ] - }, - { - "cell_type": "markdown", - "id": "6c822bda-1fe1-4c9e-8437-4714a6855baf", - "metadata": {}, - "source": [ - "We can see that Polaris has: \n", - "- Saved the molecule in an external Zarr archive and set the column annotations accordingly.\n", - "- Extracted the molecule-level properties as additional columns.\n", - "- Added an additional column with the SMILES.\n", - "- Saved and loaded the molecule object from the Zarr archive without extra effort." - ] - }, - { - "cell_type": "markdown", - "id": "8913a820-3649-4566-8049-195480070d9c", - "metadata": {}, - "source": [ - "## Factory Design Pattern\n", - "If you've been diligently going through the tutorials, you might remember that there is a function that seems to be doing something similar. And you would be right!" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "18beb7e0-95f2-4fd2-917d-8d4bcceb65af", - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAcIAAACWCAIAAADCEh9HAAAABmJLR0QA/wD/AP+gvaeTAAAZI0lEQVR4nO3de1hU1d4H8O8w3DHlYggaat4V0YRTjyhlgiAaCCKopGLlm+Xr8fR27Bw1FDnkBctL9qqlnkxMjEAkDK947PCGICcvKaAVISgoDigoSTAwM/v9Y0gDxYTZe7bK9/OXrplZ6zfPM8+Xvfdae22FIAggIqK2MpG7ACKiRxtjlIjIIIxRIiKDMEaJiAzCGCUiMghjlIjIIKZyF0BkdA0NOHwY589DEDBwIPz8YG4ud030CFNw3Si1LwUFCAhARQX+9CcAOHECDg7YuxcDB8pdGT2qGKPUnmg0GDoUdnbYvx8dOwJAdTXGj0d5OfLzYWYmd330SOK1UWpPDh7EuXNYt64xQwF07Ij161FQgLQ0WSujRxhjlNqT7Gx06oRnn23S6OEBBwdkZ8tUEz3yGKPUnlRWwtn5Hu3duuH6daNXQ48Jxii1J9bWqKy8R/v16+jQwejV0GOCMUrtiZsbKipQVtakUaVCWRnc3GSqiR55jFFqT4KC0KEDli9v0rhiBWxsEBIiU030yOPye2pP7OywdSumT4dKBV9fKBRIT0dKCnbsgL293MXRo4rrRql9+PZbqNUYMwYATp7Exo3Iy4MgYPBgzJ3buBSfqE0Yo9QOCAI8PHD6NBISMGWK3NXQ44bXRqkdSEnB6dNwdsaECQDQ0IDMTLlroscHY5Qed4KAmBgAWLwYVlYAEBeH55/HnDny1kWPDcYoyUMQhNraWmOMtHs3zpyBiwtmzQKAhgasWAEAL75ojNGpHWCMkgy+/vrrXr16RUVFST6STodlywAgMhIWFgDw6acoKoKrK8LCJB+d2gfGKMnAwcGhuLj4yy+/1Ol00o6UmIizZ9G9O159FQDq67FqFQBER8OEP34SB39JJANPT8+ePXuWlJRkZWVJOIxW23hVdMmSxo2Zt25FcTEGD+ZiexIRY5RkoFAoJk+eDCAhIUHCYRIScP48evZERAQAqNWIjQWAmBgeipKI+GMieUydOhVAYmKiRqORZACttvGq6NKljYeimzejtBTDhiE4WJIRqb1ijJI8hg0bNnDgwIqKim+++UaSAeLj8cMP6NMH06cDQF0d3n8fAKKjoVBIMiK1V4xRko105/UajUal73bJEpiaAsDHH+PyZbi7IzBQ9OGonWOMkmzCw8MB7NmzR61Wi9vzzp07nQ4cWB0QgGnTAKCuDqtXA0BMDA9FSXSMUZJN//79n3nmGVtbx3//+6KI3TY0NLz33nsAuoaHQ6kEUPPPf+LKFTz3HF56ScSBiPQYoySnV15JLy7+MS6un4h9xsXFXbhwoV+/flOmTAFQU1PTZ9my5c89V6ufcSISG2OU5DRxYmeFAqmpqKkRp8OGhoYVK1YAiImJUSqVADZs2HBVpdqnVFr5+oozBlFTjFGSU/fu8PTEr7+K9njjbdu2FRUVubq6hoWFAaipqVm7di2AGP06fCIJMEZJZvr9P0WZrq+vr4+NjQUQHR1tYmICYP369eXl5SNHjhyj37CZSALctplkdvUqnnoKpqa4ehW2tgZ1tWnTprlz5w4ePPjMmTMmJia3bt3q1atXRUXF0aNHR48eLVK9RM3xaJRk5uSEF1+EWo2vvjKoH7VavXLlSgAxMTH6Q9F169ZVVFR4eXkxQ0lSjFGSnyjn9Vu2bCktLR02bFhwcDCAmzdvfvjhhwCWcYKeJMYYJfmFhsLcHP/6F8rL29hDXV3dqlWrAPTr1y8iIuLq1avr1q2rrKz08fEZNWqUmLUS3YUPWCb52dlh+nTY2KANu5TU1tYeO3Zs9erVly9fBvDll1+amZmVlpaeOXMGgDF2hqZ2j1NMJL/SUmRn47nn0KNHY4tOh+RkjBiBbt3u/ZHCwuv79sXv27cvIyPj9r2knTt39vX1TUhIUCqVGo3Gz8/v0KFDRvkG1K7xaJTkl5O
DyZPx3HPIzm7cCFSjweTJSElpEqNaLbKzkZaGI0dQXl5dUvIWABMTEw8PjzFjxgQEBIwcOVKhUFy8eFG/G/SSJUvk+T7UzjBG6aFgYoKiIvzzn5g9u/lLV69i/37s34/0dFRXNzba2T39+uv/8/zz7uPGjevcufPv3x8REZGVldW9e3cvLy+j1E7tHU/qSX7JyZg6FWvXIjoa58/D0RH19bCwQEoKdu3C7t24/SN1c8P48Rg/HiNGNG6Ad7eqqipnZ2eNRlNaWurk5GS0b0HtFmfq6WExZw6cnfHOO00au3WDpSXGjMGHH6K4GGfPIjYWL7zQYoYCsLOz8/f312q1iYmJUtdMBMYoPTxMTfHxx9i5E//6153GxYtx/TrS0/HWW3cmoP7QtGnTAMTHx0tQJlFzjFF6iDz/PF5+GfPmoaGhscXBAVZWre4nMDCwU6dOP/xQVlBwXdwKie7GGKWHy+rVuHIFmzcb1ImlpeWbb56tr78UH+8gUl1ELWKM0sPFyQkxMTD8Bs4xY7rX1WHnTnAOlaTGGKWHzty56NnT0E68vdGtGwoL8d13IpREdB+MUZKfoyN8fO78V6nEpk0YMwaOjm3v08SkcccTzjOR1LhulOS3YQMmTYKzs8jdnjoFDw84OuLy5fstkCIyEI9GSWYZGZg3Dx4ed2bnxeLuDldXlJc3WUFliLKysvT0dHH6oscIY5RktmgRAMydCzMz8TufOhUw+Ly+qKho/fr1vr6+3bt3DwkJqaurE6U2emzwVIfklJKC7Gw4OuIvf5Gk/5dfxvLlUCha/UFBEE6cOJGSkpKamnru3Dl9o5WVlY+PT2VlZdeuXUUulB5ljFGSjVYL/R5MUVF44glJhnByQkFBk22iGhpw6xZsbe+drQ0NyMjAv/+9Pi5udWlpqb7R3t4+ICAgKCjI39/f2tpakkLpUcYYJdns2IH8fDz9NF5/Xaoh3n4bW7bg888xfXpjy8GDmDABv/7a5Oao2locOYK0NHz1FcrLMWqUtrS01MXFZdy4cQEBAWPHjjU3N5eqRHr0MUZJHvX1eO89AFi2DJJmlKUl5s/HSy/Bzq75S+Xl2LsXqak4cgS3L3i6uWHs2Olr1oxyd3dXtOFyALU/jFGSx8aNKCqCm1vjLJB0XnwRV65g4cImN5jeuAE/P2RlQacDAKUSXl4IDkZwMHr3BuAIGLBmldoZxijJ4NYtxMYCQGxs43b30lEqsW4dfH0xcyZGjGhstLXFlSswN4eXFwICMHmy+KtWqf1gjJIMPvgA5eXw8sL48cYYztsboaGYMwcnT95pTElBr17o0MEYBdDjjetGydgqKrBuHYDGA1LjWLsWRUX43/+90zJkCDOUxMEYJWPbtGnvk0/WBwZi5EjjDdqtG5YuRUwMrl0z3qDUTvCeejKq4uLiAQMGCILy9OkLgwZ1kXq4N97A5ctISwMAjQbDhkGtRkFB8wVPRIbg0SgZVVRUlFqtDg8PM0KGNqN/SMnPPxt5WHr8MUbJePLy8uLj483NzaOiomQpwMsLM2bIMjI9zhijZDzvvvuuTqebM2dOr169jDCcIOC11xrP6G+Li4Mg8IyexMQYJSPJyclJS0vr0KHDIv2eTtJLSsLw4Zg1yzijUfvFGCUjWbhwoSAI8+fP79LFGFdFtVpERwOAp6cRRqN2jTP1ZAzx8fHTp0/v3LlzYWFhx44djTDitm2YNQt9++LcOW59T9Li74skdOnSpZSUlKSkpGPHjgGIjIw0TobW1zc+W/Qf/2CGkuR4NEri+/HHH/fs2ZOcnHzyd3dfOjo6Xrp0ycLCwggFfPQR3noLbm74/nvJ79kn4l9qEk9u7t7DhyO3b8/Ly9M3dOzY0d/fPyMjQ6VSrVq1yjgZWlODlSsBYPlyZigZA39lZLD8fERHY9AgDBnSLzU1Ly/Pzs5uxowZiYmJV65c8fPzq6ysfPLJJ2cYa8XmRx/h6lU8+ywCAowzILV3PBqlNhEEHD+O5GQ
kJ6O4uLHR0bGHh8fhJUtGjx5t+tslyRs3bjQ0NNTX1yuVSiPUdfMmVq8GgNjYtjyCiagNGKN0X9XV+OILnD0LrRb9+mHqVHTtiuJieHnh8uXG93TrhpAQTJoELy8rpdL3d5++efOmvb29mZnZzZs3z58/P3DgQKnr/eADVFbCxwfe3lIPRdSIU0zUstxcjB0LS0v4+MDcHJmZKCxEYiL8/eHiAlNTBAcjLAwjRjS7BllVVfX1118nJSWlp6er1Wp947JlyyIjIyWt99o19OqFX35BVhaXi5LxMEapBVotBg9Gly44dAj6qSH9zZUpKSgogFqNp55q/hGVCl99VX/wYKf9++vq6wEolcpRo0b179//448/HjZs2KlTpyQteenSwpiY3hMmIDVV0nGImmCMUguOHIGvb/PjOpUKPXrg/febPFe+ogIHDiApCQcPQqMB8OdnnvnexiYsLGzKlClOTk5qtdrR0bG6uvrnn3/u3bu3RPWWlJT069fP1fWFzz5Lc3Mzk2gUorvx2ii14ORJmJri2WebNHbpgt69G5/FUVTUOMWUkwP9H2NLS4wfj0mTPgoMNPndczgtLCxeeumlL774IiUl5Z133pGo3piYmLq6uv79OzNDyci44IlaUF0Ne/t73APUpQtu3gSA5cvxt7/h+HFYWiIgAHFxUKmQmoqICJO7nmUcEhICIDk5WaJiCwoKtm/frlQq5dqCj9ozHo1SCzp1wvXr0GiaJ6lKhe7dASA8HLW1mDQJ/v6wtr5/Z+PHj7exscnJySkpKXFxcRG92KioKI1G8/rrr/fv31/0zonuj0ej1AJ3d2i1yM9v0lhejsJCeHgAgI8P4uMREvKHGQrA2traz89PEIRUCWZ/cnNzExMTLS0tlyxZInrnRH+IMUotGD0affvi3Xf1s0aNoqNhYYHw8Db0N2nSJEhzXh8ZGanT6d58800pjnOJ/hBn6qllJ09i7Fg4O2PcOJibIyMDJ04gIQFBQW3o7JdffnF0dGxoaLhy5Yqjo6NYNf7nP/8ZPny4tbV1YWGhcXYyJWqGR6PUMg8P/PADZszA5csoKIC3N86da1uGAnjiiSe8vb21Wq245/WLFi0SBOHtt99mhpJceDRKxrNt27ZZs2aNHTv24MGDonR45MgRX19fW1vbCxcu2N21PIDIOHg0SsYTFBRkamp69OjRyspKUTrUzyktWLCAGUoyYoyS8Tg4OIwaNaqhoSGt2eM6W+/06dMRERHHjx93cnKaN2+eKOURtQ1jlIzKwPn6/Pz86OjoAQMGuLu7f/755507dw4MDLSxsRG1RqLW4fJ7MqqQkJB58+YdOnSourr6AZ/LJAhCTk7O7t27k5OTi3/b29TZ2bl3796ZmZnWD7BqlUhSPBolo+rSpYunp6darT5w4MD936nT6U6ePBkdHd23b19PT881a9YUFxc/9dRTs2fP3rt376VLl958800AKpXKKIUTtYhHo2RsISEhmZmZycnJU6ZMuftVnU6XlZWVlJSUnJx8+bedoV1cXCZOnBgWFjZixAiT3/Y21a9wKi8vN1rlRPfEBU9kbCUlJT169LCysqqoqLh9Sq7VarOzs5OSkpKSksrKyvSNPXr0CAoKCgsLGzlypOKuR4Lk5uYOGTLE1dX19hP0iGTBo1EyNhcXFw8PjxMnThw+fDgwMFCfnomJiVevXtW/oWfPnhMmTLhnel67dq1z5876f+uPRnlST7Lj0SjJIDY2dtGiRX369KmsrLy9hnTAgAGhoaGhoaFDhw5t9v6SkpI9e/YkJSXl5ORcvHixa9euAHQ6nYWFhVarVavVZmbcY5Rkw6NRksELL7zg7OxcWFgoCMKgQYMCAwMDAgK8vLyave3ChQvJycm7d+/+7rvv9H/vra2tz549q49RExMTBwcHlUp17do1Z2dnGb4GEQDGKMkiOzu7rKzM3d09Pj5+wIABzV4tLi5OTU1NSkrKysrSp6eVlZWPj09YWNjEiROfeOK
J2+/s0qWLSqVSqVSMUZIRY5RksGPHDgBRUVG/z9ALFy7onyd67NgxfYu1tbW3t3dYWFhISEiHDh3u7ke/UxQvj5K8GKNkbKdPnz579qyDg8O4ceP0LcnJyUuXLs3/bYtoOzu7wMDA0NBQPz8/C/1DSVswZMgKlSq6qmqw5EUTtYwxSsYWFxcHYNq0aebm5voWhUKRn59vZ2cXEBAQFhY2duzY2y/9kWdzc1FaKlmtRA+AMUpGpdFoEhISAMycOfN2o7+//6FDh7y9vU3vfoLefem3GOU5PcmLMUpGtW/fPpVK5erq6u7ufrtR/6SmNvSm30Sf9zGRvHhPPRmV/oz+lVdeEaU3Ho3Sw4AxSsZTWVm5f/9+U1PTadOmidIhY5QeBoxRMp5du3ap1Wo/Pz+xlnnqT+oZoyQvxigZj/6M/veTSwZydIRCgYoK6HRidUnUarynnozk3Llzrq6unTp1Kisrs7KyEqtbe3tUVeHaNTg4iNUlUetwpp6MZNeuviNHFo0alS1ihgLo0gVVVVCpGKMkGx6NkjHodOjRA6WlyMqCp6eYPR89CqUSf/oT+EAmkguPRskY0tNRWoq+fTF8uDgd1tUhIgKurli69E7jP/6Bbt3wX/8lzhBED4hTTGQMcXEAMHMm7trDvo00GiQlIToaR47caczIwKlT4vRP9OAYoyS56mqkpkKhwMsvi9yzpyfmzoVaLXK3RK3CGCXJJSbi118xejSeflrknv/2N1RXIzZW5G6JWoUxSpK7fUYvuo4dsXIlYmNRUCB+50QPiDFK0ioqwrFjsLFBSIgk/c+cCQ8P/Pd/S9I50YNgjJK0tm+HICA0FPfavb4Vbt68d7tCgU8+QUYG9uwxqH+iNmOMkoQEATt3Agaf0efno29fbN5871cHD8af/4y//x319QaNQtQ2jFGS0P/9Hy5cQI8eGDWq7Z0UF8PPDxUVOHwYLd0sEh2N2lpkZ7d9FKI2Y4yShFJTAWDGDJi09Yd27Rr8/XHlCl58EfHxLS477dgRa9ZwgxKSB+9iIgmtXo2AAPTt28aP//IL/P3x448YMgQpKbC0BIC0NOzejU2bsHkzfv9s5qlT0dAg/poqoj/Ee+pJTIGBUKuxd29j5AF4/30UFGDr1lZ3VV+PwEAcPozevZGZCScnAMjMhJ8famuxfbskK6iI2oAn9SSm3Fykp2PlyjstJSX46adW96PTYfp0HD6Mrl2Rnt6YoXl5mDABtbWYPZsZSg8RxiiJbNw4xMbi/HmDOlm6tCApCba2OHCg8Ty9qAh+fqiqQlAQNm4UpVIicTBGSWT+/vDxwZw5Lc6q/6HFixcvXz5gzJj/fP01hgwBgIoKjBuHsjKMHo2EBLTyMcxE0mKMkvjWrkV2duOK0dbatGnT8uXLTUwUc+aUenkBQHV140TT0KHYs+fOVVeihwRjlMQ3YADmz8c776CqqnUf3LVr17x58xQKxZYtW0JCQgDU1yM0FKdOoU8fHDoEW1tJCiYyBGOUJLF4MWxsEBPTio8cOXLk1Vdf1el0H3zwwWuvvQZAq8W0aUhPb5xo0j9OmehhwxglcVRWNvmvtTXWrsXGjXem6e+/Nj4nJyc4OLi+vn7hwoXz588HIAhCVNT+r76CnR0OHULPntLUTWQwxigZqrYWCxdi4EBcvdqkPTgYfn44fBgAyssxaBASEu7dQ35+/vjx42tqaiIiIlasWKFvjIyMXLHipeHDV6WlYfBgSb8BkUEYo2SQzEy4uWHVKty4gays5q9u3AhrawDYtAk//ojwcMyejZqa5m9LTk6urKwMDg7etm2bQqEAsGHDhpUrV5qZmS1a5DZihBG+B5EBBKI2qa0VFiwQlEoBEAYPFk6dEgRB+PZb4dKlJm87fVr47jtBpxM2bxasrQVAePpp4dix5r3t2LGjtrZW/++dO3eamJgoFIrPPvtM+u9BZCjGKLXFmTPC0KECIJiaCgsWCHV1D/Sp3FxhyBABEMzMhDVrqjQ
azd3vSUtLMzMzA7BmzRqRiyaSBmOUWqehQYiNFczNBUDo3Vv49tvWfbyuTliwQDA11Q4b5uPp6VlYWPj7V48fP25jYwMgMjJSzKKJpMQYpVbIyxM8PARAUCiE2bOFW7fa2M8331xwdnYGYGtr+8UXX+gbc3Nz7e3tAcycOVOn04lWNJHEGKP0QDQaTWxs7JAhtwChVy8hI8PQDisqKoKDg/UX6MPCwnJzc11cXABMmDChoaFBjJKJjIQb5dEf++mnn1555ZXs7Oy+fYPHjNnz/vsKAx+sdNuWLVv++te/1tTUKJVKrVY7evTo/fv3W/J+T3qkMEbpfgRB2Lp1qz7pnJyctm7dGhAQIO4QRUVF4eHh+fn5Dg4O33//vS3v96RHDWOUWnTx4sXXXnvt6NGjAMLCwj755BP9tUvR3bp16+LFi3369LGwsJCifyJJMUbp3pKSkt54442qqipHR8dPPvlk4sSJcldE9JDiXUzUnEqlCgoKmjx5clVVVWhoaF5eHjOU6D64/y01kZmZGRQUVFlZaW9vv2HDhvDwcLkrInrY8aSemrhx44abm5urq+unn37arVs3ucshegQwRqm5kpIS/RJOInoQjFEiIoNwiomIyCCMUSIigzBGiYgMwhglIjIIY5SIyCD/D9+G3rhq5bBLAAABTnpUWHRyZGtpdFBLTCByZGtpdCAyMDIzLjA5LjUAAHice79v7T0GIOBnQAA+KL+BkY0hAyTAyMzOoAFiMEMFmBkRAmCaBZ3mgNBMaBoZmQkq4GZgZGBkYmBi5mBiZmFgYeVgYmVjYGPnYGLjYODgZODgYuDi5mDi4mHg4WVgZWTgYWEQYQJqZGUEKmdlY+Pg4mFhFd8EMgqKGfiWv+A4EMzqfeAh9+T9qasm7JdQkz+wae76fb+tPuxjNbE9sOuWlf2P4MN2sicZDxgfmWl/TnKiXfyMnP0T627b/arT2h/tNG//60ds+3u8qvY36/fse1i1Z/+O9a/3u/zi3a/3X/RA1r2N+5oDM+2ntG8Fmm+w//3Jz/Y6V6QOeL8SsZcsnm5vzfh2n334ZPt9B4Udls1+su+DWIZ93K5OuwWdH+yuhX2xf28hat9UvM9eDACM1GEAYrR3BQAAAaF6VFh0TU9MIHJka2l0IDIwMjMuMDkuNQAAeJx9U0tOxDAM3fcUvgCR7TgfL5kZhBCiI8HAHdhzf2EnHZJsaOsqcZ+d52d3A7/eL6/fP/B3xcu2AeA/j6rCV0TE7Q18Aaen55cdzrfH091zvn7utw8gAUoWY/eKfbxd3+4egjM8cKCojAgPMVDhtqJAUitOwQx7c0dsAA5Sc/EVhijOaECjZ8UgWQ+ophwdEAqrzkixpBZfKXUgcvHvFBgpzsBkKTFwjdKPVBZuGYVrmYG5n51zxl4Hx3zEJM7L4aVDtZSjDi6x9iApnGZohau7qzRWnitJ7pSt+IWAuk6uSZFGUInkSEqVZyThwVWRWtGK96RCukKp1Z8qlYakGrkTkcxLUcRG1QCJPKfRw3pUJ5KXmig28RVdKhfX1WmSZl06b5N0buxqO8lYcG+S1LKc/bRflunq83a67pcxb37zmCnbQBxzYxuQMRxklsYIsFkejSazMprJZnU0jGyroynkNkvfHDQJzP4inoSk5omTYNReMinjJEZM6og0KzLX7/v7v2rr7RdbH7+0RVgL8gAAAOF6VFh0U01JTEVTIHJka2l0IDIwMjMuMDkuNQAAeJwlT0uuxDAIu8pbtlIaBUP4qOoq+86F5vAPMiuC7dhmvbSed+FZtI7nc77HOn8PrL/vcaETB9rFnQw5qZO43zV55I4urtau0Vn4ziEaG46p3EY3RNxJOs1CBywadQziROEs9TUgSK3ArSxUtYLAutkJjYLDbAfB2IsXw6wiLsRbN0UrI4tsGxaTdA0i2XJybPcYlBVi/NRCgawynSxRckbbVyDuXCdRS8fhO1lEZ2pjZMHsodH2vX6Xndc2HXWfuMX
5/QcEX0W59Lht5AAAAABJRU5ErkJggg==", - "text/plain": [ - "" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from polaris.dataset import create_dataset_from_file\n", - "\n", - "dataset = create_dataset_from_file(path, save_dst)\n", - "dataset.get_data(row=0, col=\"molecule\")" - ] - }, - { - "cell_type": "markdown", - "id": "f31f61de-3818-4b52-9548-a8f8d2cc752d", - "metadata": {}, - "source": [ - "The `DatasetFactory` is based on the factory design pattern. That way, you can easily create and add your own file converters. However, the defaults are set to be a good option for most people. \n", - "\n", - "Let's consider two cases that show the power of the `DatasetFactory` design. \n", - "\n", - "### Configuring the converter\n", - "Let's assume we do not want to extract the properties as separate columns, but rather keep them in the RDKit object. We cannot do this with the default converter, but we can configure its behavior to achieve this. 
" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "id": "35b6e2cb-3b45-4944-903d-7da81ff1e7a4", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "\u001b[32m2024-03-26 13:16:43.897\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36mpolaris.dataset._factory\u001b[0m:\u001b[36mregister_converter\u001b[0m:\u001b[36m112\u001b[0m - \u001b[1mYou are overwriting the converter for the sdf extension.\u001b[0m\n" - ] - } - ], - "source": [ - "save_dst = dm.fs.join(SAVE_DIR, \"data2.zarr\")\n", - "factory.reset(save_dst)\n", - "\n", - "# Configure the converter\n", - "converter = SDFConverter(mol_prop_as_cols=False)\n", - "\n", - "# Overwrite the converter for SDF files\n", - "factory.register_converter(\"sdf\", converter)\n", - "\n", - "# Process the SDF file again\n", - "factory.add_from_file(path)\n", - "\n", - "# Build the dataset\n", - "dataset = factory.build()" - ] - }, - { - "cell_type": "markdown", - "id": "d066c9b5-2c5c-471c-8739-c63f1eca8b54", - "metadata": {}, - "source": [ - "And voila! The property is saved to the Zarr instead of to a separate column. 
" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "dbd94922-a9b9-4096-b42b-4e593581b947", - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "iVBORw0KGgoAAAANSUhEUgAAAcIAAACWCAIAAADCEh9HAAAABmJLR0QA/wD/AP+gvaeTAAAZI0lEQVR4nO3de1hU1d4H8O8w3DHlYggaat4V0YRTjyhlgiAaCCKopGLlm+Xr8fR27Bw1FDnkBctL9qqlnkxMjEAkDK947PCGICcvKaAVISgoDigoSTAwM/v9Y0gDxYTZe7bK9/OXrplZ6zfPM8+Xvfdae22FIAggIqK2MpG7ACKiRxtjlIjIIIxRIiKDMEaJiAzCGCUiMghjlIjIIKZyF0BkdA0NOHwY589DEDBwIPz8YG4ud030CFNw3Si1LwUFCAhARQX+9CcAOHECDg7YuxcDB8pdGT2qGKPUnmg0GDoUdnbYvx8dOwJAdTXGj0d5OfLzYWYmd330SOK1UWpPDh7EuXNYt64xQwF07Ij161FQgLQ0WSujRxhjlNqT7Gx06oRnn23S6OEBBwdkZ8tUEz3yGKPUnlRWwtn5Hu3duuH6daNXQ48Jxii1J9bWqKy8R/v16+jQwejV0GOCMUrtiZsbKipQVtakUaVCWRnc3GSqiR55jFFqT4KC0KEDli9v0rhiBWxsEBIiU030yOPye2pP7OywdSumT4dKBV9fKBRIT0dKCnbsgL293MXRo4rrRql9+PZbqNUYMwYATp7Exo3Iy4MgYPBgzJ3buBSfqE0Yo9QOCAI8PHD6NBISMGWK3NXQ44bXRqkdSEnB6dNwdsaECQDQ0IDMTLlroscHY5Qed4KAmBgAWLwYVlYAEBeH55/HnDny1kWPDcYoyUMQhNraWmOMtHs3zpyBiwtmzQKAhgasWAEAL75ojNGpHWCMkgy+/vrrXr16RUVFST6STodlywAgMhIWFgDw6acoKoKrK8LCJB+d2gfGKMnAwcGhuLj4yy+/1Ol00o6UmIizZ9G9O159FQDq67FqFQBER8OEP34SB39JJANPT8+ePXuWlJRkZWVJOIxW23hVdMmSxo2Zt25FcTEGD+ZiexIRY5RkoFAoJk+eDCAhIUHCYRIScP48evZERAQAqNWIjQWAmBgeipKI+GMieUydOhVAYmKiRqORZACttvGq6NKljYeimzejtBTDhiE4WJIRqb1ijJI8hg0bNnDgwIqKim+++UaSAeLj8cMP6NMH06cDQF0d3n8fAKKjoVBIMiK1V4xRko105/UajUal73bJEpiaAsDHH+PyZbi7IzBQ9OGonWOMkmzCw8MB7NmzR61Wi9vzzp07nQ4cWB0QgGnTAKCuDqtXA0BMDA9FSXSMUZJN//79n3nmGVtbx3//+6KI3TY0NLz33nsAuoaHQ6kEUPPPf+LKFTz3HF56ScSBiPQYoySnV15JLy7+MS6un4h9xsXFXbhwoV+/flOmTAFQU1PTZ9my5c89V6ufcSISG2OU5DRxYmeFAqmpqKkRp8OGhoYVK1YAiImJUSqVADZs2HBVpdqnVFr5+oozBlFTjFGSU/fu8PTEr7+K9njjbdu2FRUVubq6hoWFAaipqVm7di2AGP06fCIJMEZJZvr9P0WZrq+vr4+NjQUQHR1tYmICYP369eXl5SNHjhyj37CZSALctplkdvUqnnoKpqa4ehW2tgZ1tWnTprlz5w4ePPjMmTMmJia3bt3q1atXRUXF0aNHR48eLVK9RM3xaJRk5uSEF1+EWo2vvjKoH7VavXLlSgAxMTH6Q9F169ZVVFR4eXkxQ0lSjFGSnyjn9Vu2bCktLR02bFhwcDCAmzdvfvjhhwCWcYKeJMYYJfmFhsLcHP/6F8rL29hDXV3dqlWrAPTr1y8iIuLq1avr1q2rrKz08fEZNWqU
mLUS3YUPWCb52dlh+nTY2KANu5TU1tYeO3Zs9erVly9fBvDll1+amZmVlpaeOXMGgDF2hqZ2j1NMJL/SUmRn47nn0KNHY4tOh+RkjBiBbt3u/ZHCwuv79sXv27cvIyPj9r2knTt39vX1TUhIUCqVGo3Gz8/v0KFDRvkG1K7xaJTkl5ODyZPx3HPIzm7cCFSjweTJSElpEqNaLbKzkZaGI0dQXl5dUvIWABMTEw8PjzFjxgQEBIwcOVKhUFy8eFG/G/SSJUvk+T7UzjBG6aFgYoKiIvzzn5g9u/lLV69i/37s34/0dFRXNzba2T39+uv/8/zz7uPGjevcufPv3x8REZGVldW9e3cvLy+j1E7tHU/qSX7JyZg6FWvXIjoa58/D0RH19bCwQEoKdu3C7t24/SN1c8P48Rg/HiNGNG6Ad7eqqipnZ2eNRlNaWurk5GS0b0HtFmfq6WExZw6cnfHOO00au3WDpSXGjMGHH6K4GGfPIjYWL7zQYoYCsLOz8/f312q1iYmJUtdMBMYoPTxMTfHxx9i5E//6153GxYtx/TrS0/HWW3cmoP7QtGnTAMTHx0tQJlFzjFF6iDz/PF5+GfPmoaGhscXBAVZWre4nMDCwU6dOP/xQVlBwXdwKie7GGKWHy+rVuHIFmzcb1ImlpeWbb56tr78UH+8gUl1ELWKM0sPFyQkxMTD8Bs4xY7rX1WHnTnAOlaTGGKWHzty56NnT0E68vdGtGwoL8d13IpREdB+MUZKfoyN8fO78V6nEpk0YMwaOjm3v08SkcccTzjOR1LhulOS3YQMmTYKzs8jdnjoFDw84OuLy5fstkCIyEI9GSWYZGZg3Dx4ed2bnxeLuDldXlJc3WUFliLKysvT0dHH6oscIY5RktmgRAMydCzMz8TufOhUw+Ly+qKho/fr1vr6+3bt3DwkJqaurE6U2emzwVIfklJKC7Gw4OuIvf5Gk/5dfxvLlUCha/UFBEE6cOJGSkpKamnru3Dl9o5WVlY+PT2VlZdeuXUUulB5ljFGSjVYL/R5MUVF44glJhnByQkFBk22iGhpw6xZsbe+drQ0NyMjAv/+9Pi5udWlpqb7R3t4+ICAgKCjI39/f2tpakkLpUcYYJdns2IH8fDz9NF5/Xaoh3n4bW7bg888xfXpjy8GDmDABv/7a5Oao2locOYK0NHz1FcrLMWqUtrS01MXFZdy4cQEBAWPHjjU3N5eqRHr0MUZJHvX1eO89AFi2DJJmlKUl5s/HSy/Bzq75S+Xl2LsXqak4cgS3L3i6uWHs2Olr1oxyd3dXtOFyALU/jFGSx8aNKCqCm1vjLJB0XnwRV65g4cImN5jeuAE/P2RlQacDAKUSXl4IDkZwMHr3BuAIGLBmldoZxijJ4NYtxMYCQGxs43b30lEqsW4dfH0xcyZGjGhstLXFlSswN4eXFwICMHmy+KtWqf1gjJIMPvgA5eXw8sL48cYYztsboaGYMwcnT95pTElBr17o0MEYBdDjjetGydgqKrBuHYDGA1LjWLsWRUX43/+90zJkCDOUxMEYJWPbtGnvk0/WBwZi5EjjDdqtG5YuRUwMrl0z3qDUTvCeejKq4uLiAQMGCILy9OkLgwZ1kXq4N97A5ctISwMAjQbDhkGtRkFB8wVPRIbg0SgZVVRUlFqtDg8PM0KGNqN/SMnPPxt5WHr8MUbJePLy8uLj483NzaOiomQpwMsLM2bIMjI9zhijZDzvvvuuTqebM2dOr169jDCcIOC11xrP6G+Li4Mg8IyexMQYJSPJyclJS0vr0KHDIv2eTtJLSsLw4Zg1yzijUfvFGCUjWbhwoSAI8+fP79LFGFdFtVpERwOAp6cRRqN2jTP1ZAzx8fHTp0/v3LlzYWFhx44djTDitm2YNQt9++LcOW59T9Li74skdOnSpZSUlKSkpGPHjgGIjIw0TobW1zc+W/Qf/2CGkuR4NEri+/HHH/fs2ZOcnHzyd3dfOjo6Xrp0ycLCwggFfPQR3noL
3zV55I4urtau0Vn4ziEaG46p3EY3RNxJOs1CBywadQziROEs9TUgSK3ArSxUtYLAutkJjYLDbAfB2IsXw6wiLsRbN0UrI4tsGxaTdA0i2XJybPcYlBVi/NRCgawynSxRckbbVyDuXCdRS8fhO1lEZ2pjZMHsodH2vX6Xndc2HXWfuMX5/QcEX0W59Lht5AAAAABJRU5ErkJggg==", - "text/html": [ - "\n", - "
my_property: my_value
" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "dataset.get_data(row=0, col=\"molecule\")" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "2b12c7c0-23be-4286-8dca-23d0e7a606cf", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
| | smiles | molecule |
| 0 | CN1C=NC2=C1C(=O)N(C)C(=O)N2C | molecule#0 |
\n", - "
" - ], - "text/plain": [ - " smiles molecule\n", - "0 CN1C=NC2=C1C(=O)N(C)C(=O)N2C molecule#0" - ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "dataset.table" - ] - }, - { - "cell_type": "markdown", - "id": "4db64c4d-4712-4dfe-81ae-c8daa01066de", - "metadata": {}, - "source": [ - "### Merging data from different sources\n", - "\n", - "Another case is when you want to merge data from multiple sources. Maybe you have two different SDF files." - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "ef15bf98-f301-465d-9e93-2531f9f1f98c", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "\u001b[32m2024-03-26 13:16:43.938\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36mpolaris.dataset._factory\u001b[0m:\u001b[36mregister_converter\u001b[0m:\u001b[36m112\u001b[0m - \u001b[1mYou are overwriting the converter for the sdf extension.\u001b[0m\n", - "\u001b[32m2024-03-26 13:16:43.945\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36mpolaris.dataset._factory\u001b[0m:\u001b[36mregister_converter\u001b[0m:\u001b[36m112\u001b[0m - \u001b[1mYou are overwriting the converter for the sdf extension.\u001b[0m\n" - ] - } - ], - "source": [ - "save_dst = dm.fs.join(SAVE_DIR, \"data3.zarr\")\n", - "factory.reset(save_dst)\n", - "\n", - "# Let's pretend these are two different SDF files\n", - "factory.register_converter(\"sdf\", SDFConverter(mol_column=\"molecule1\", smiles_column=None))\n", - "factory.add_from_file(path)\n", - "\n", - "# We change the configuration between files\n", - "factory.register_converter(\"sdf\", SDFConverter(mol_column=\"molecule2\", mol_prop_as_cols=False))\n", - "factory.add_from_file(path)\n", - "\n", - "dataset = factory.build()" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "65960c85-ee0d-4d37-b50d-b1c8ba1cec64", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
| | my_property | molecule1 | smiles | molecule2 |
| 0 | my_value | molecule1#0 | CN1C=NC2=C1C(=O)N(C)C(=O)N2C | molecule2#0 |
\n", - "
" - ], - "text/plain": [ - " my_property molecule1 smiles molecule2\n", - "0 my_value molecule1#0 CN1C=NC2=C1C(=O)N(C)C(=O)N2C molecule2#0" - ] - }, - "execution_count": 14, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "dataset.table" - ] - }, - { - "cell_type": "markdown", - "id": "a5d7bf37-7950-4026-b6c1-3fac556754ba", - "metadata": {}, - "source": [ - "The End. " - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.2" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/docs/tutorials/dataset_zarr.ipynb b/docs/tutorials/dataset_zarr.ipynb deleted file mode 100644 index 54764c8f..00000000 --- a/docs/tutorials/dataset_zarr.ipynb +++ /dev/null @@ -1,739 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": 1, - "id": "217690be-9836-4e06-930e-ba7efbb37d91", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [ - "remove_cell" - ] - }, - "outputs": [], - "source": [ - "# Note: Cell is tagged to not show up in the mkdocs build\n", - "%load_ext autoreload\n", - "%autoreload 2" - ] - }, - { - "cell_type": "markdown", - "id": "39b58e71", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [] - }, - "source": [ - "
\n", - "

In short

\n", - "

This tutorial shows how to create datasets with more advanced data modalities, such as images or conformers, using the .zarr format.

\n", - "
\n", - "\n", - "## Pointer columns\n", - "\n", - "Not all data might fit the tabular format, e.g. images or conformers. In that case, we have _pointer_ columns. Pointer columns do not contain the data itself, but rather store a reference to an external file from which the content can be loaded.\n", - "\n", - "For now, we only support `.zarr` files as references. To learn more about `.zarr`, visit their documentation. Their [tutorial](https://zarr.readthedocs.io/en/stable/tutorial.html) specifically is a good read to better understand the main features. " - ] - }, - { - "cell_type": "markdown", - "id": "e154bb54", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [] - }, - "source": [ - "### Dummy example\n", - "For the sake of simplicity, let's assume we have just two datapoints. We will use this to demonstrate the idea behind pointer columns. " - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "5e201379", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [] - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/mnt/ps/home/CORP/lu.zhu/miniconda3/envs/po_datasets/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. 
See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", - " from .autonotebook import tqdm as notebook_tqdm\n" - ] - } - ], - "source": [ - "import zarr\n", - "import platformdirs\n", - "\n", - "import numpy as np\n", - "import datamol as dm\n", - "import pandas as pd\n", - "\n", - "SAVE_DIR = dm.fs.join(platformdirs.user_cache_dir(appname=\"polaris-tutorials\"), \"dataset_zarr\")" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "07442028", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [] - }, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Create two images and save them to a Zarr archive\n", - "base_path = dm.fs.join(SAVE_DIR, \"data.zarr\")\n", - "inp_col_name = \"images\"\n", - "\n", - "images = np.random.random((2, 64, 64, 3))\n", - "root = zarr.open(base_path, \"w\")\n", - "root.array(inp_col_name, images)" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "05712cbd", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Consolidate the dataset for efficient loading from the cloud bucket\n", - "zarr.consolidate_metadata(base_path)" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "15df9619-e659-4558-9c69-416a186c1f3a", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# For performance reasons, Polaris expects all data related to a column to be saved in a single Zarr array.\n", - "# To index a specific element in that array, the pointer path can have a suffix to specify the index.\n", - "train_path = f\"{inp_col_name}#0\"\n", - "test_path = f\"{inp_col_name}#1\"" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - 
"id": "16543db7", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "tgt_col_name = \"target\"\n", - "\n", - "table = pd.DataFrame(\n", - " {\n", - " inp_col_name: [train_path, test_path], # Instead of the content, we specify paths\n", - " tgt_col_name: np.random.random(2),\n", - " }\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "a257b09d", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "from polaris.dataset import Dataset, ColumnAnnotation\n", - "\n", - "dataset = Dataset(\n", - " table=table,\n", - " # To indicate that we are dealing with a pointer column here,\n", - " # we need to annotate the column.\n", - " annotations={\"images\": ColumnAnnotation(is_pointer=True)},\n", - " # We also need to specify the path to the root of the Zarr archive\n", - " zarr_root_path=base_path,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "2524c795", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [] - }, - "source": [ - "Note how the table does not contain the image data, but rather stores a path relative to the root of the Zarr. " - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "19a39fab", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [] - }, - "outputs": [ - { - "data": { - "text/plain": [ - "'images#0'" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "dataset.table.loc[0, \"images\"]" - ] - }, - { - "cell_type": "markdown", - "id": "5c051877", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [] - }, - "source": [ - "To load the data that is being pointed to, you can simply use the `Dataset.get_data()` utility method. 
" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "8189f312", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [] - }, - "outputs": [ - { - "data": { - "text/plain": [ - "(64, 64, 3)" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "dataset.get_data(col=\"images\", row=0).shape" - ] - }, - { - "cell_type": "markdown", - "id": "17aaff10", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [] - }, - "source": [ - "Creating a benchmark and the associated `Subset` objects will automatically do so! " - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "id": "6f1c8766", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "from polaris.benchmark import SingleTaskBenchmarkSpecification\n", - "\n", - "benchmark = SingleTaskBenchmarkSpecification(\n", - " dataset=dataset,\n", - " input_cols=inp_col_name,\n", - " target_cols=tgt_col_name,\n", - " metrics=\"mean_absolute_error\",\n", - " split=([0], [1]),\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "9a0c635c", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [] - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "(64, 64, 3)\n" - ] - } - ], - "source": [ - "train, test = benchmark.get_train_test_split()\n", - "\n", - "for x, y in train:\n", - " # At this point, the content is loaded from the path specified in the table\n", - " print(x.shape)" - ] - }, - { - "cell_type": "markdown", - "id": "67d2e77d", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [] - }, - "source": [ - "## Creating datasets from `.zarr` arrays\n", - "\n", - "While the above example works, creating the table with all paths from scratch is time-consuming when datasets get large. 
Instead, you can also automatically parse a Zarr archive into the expected tabular data structure. \n", - "\n", - "A Zarr archive can contain groups and arrays, where each group can again contain groups and arrays. Within Polaris, we expect the root to be a flat hierarchy that contains a single array per column.\n" - ] - }, - { - "cell_type": "markdown", - "id": "d6977165", - "metadata": {}, - "source": [ - "### A single array for _all_ datapoints \n", - "\n", - "Polaris expects a flat zarr hierarchy, with a single array per pointer column: \n", - "```\n", - "/\n", - " column_a\n", - "```\n", - "\n", - "Which will get parsed into a table like: \n", - "\n", - "| column_a |\n", - "| ----------------- |\n", - "| column_a/array#1 |\n", - "| column_a/array#2 |\n", - "| ... |\n", - "| column_a/array#N |\n", - "\n", - "
\n", - "

Note

\n", - "

Note the # suffix in the path: it gives the index at which the data point is stored within the column's array.

\n", - "
" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "622287ed-16ad-484e-a0d7-ca6cf648ed5d", - "metadata": {}, - "outputs": [], - "source": [ - "# Let's first create some dummy dataset with 1000 64x64 \"images\"\n", - "images = np.random.random((1000, 64, 64, 3))" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "12a06b89", - "metadata": {}, - "outputs": [], - "source": [ - "path = dm.fs.join(SAVE_DIR, \"zarr\", \"data.zarr\")\n", - "\n", - "with zarr.open(path, \"w\") as root:\n", - " root.array(inp_col_name, images)" - ] - }, - { - "cell_type": "markdown", - "id": "59ddcf4b-6858-45d0-afd2-b396ee0bc498", - "metadata": {}, - "source": [ - "To create a dataset from a Zarr archive, we can use the convenience function `create_dataset_from_file()`." - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "3c7c11ac", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'images#0'" - ] - }, - "execution_count": 14, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from polaris.dataset import create_dataset_from_file\n", - "\n", - "# Because Polaris might restructure the Zarr archive,\n", - "# we need to specify a location to save the Zarr file to.\n", - "dataset = create_dataset_from_file(path, zarr_root_path=dm.fs.join(SAVE_DIR, \"zarr\", \"processed.zarr\"))\n", - "\n", - "# The path refers to the original zarr directory we created in the above code block\n", - "dataset.table.iloc[0][inp_col_name]" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "id": "f8d1b42d", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(64, 64, 3)" - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "dataset.get_data(col=inp_col_name, row=0).shape" - ] - }, - { - "cell_type": "markdown", - "id": "51493c81", - "metadata": {}, - "source": [ - "## Saving the dataset\n", - "\n", - "We can still easily save the 
dataset. All the pointer columns will be automatically updated. " - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "id": "1cd94077", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "\u001b[32m2024-07-21 13:11:49.273\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36mpolaris._mixins\u001b[0m:\u001b[36mmd5sum\u001b[0m:\u001b[36m27\u001b[0m - \u001b[1mComputing the checksum. This can be slow for large datasets.\u001b[0m\n", - "Finding all files in the Zarr archive: 60%|██████ | 79/131 [00:00<00:00, 375.21it/s]" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Finding all files in the Zarr archive: 100%|██████████| 131/131 [00:00<00:00, 396.17it/s]\n", - "\u001b[32m2024-07-21 13:11:49.616\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36mpolaris.dataset._dataset\u001b[0m:\u001b[36mto_json\u001b[0m:\u001b[36m431\u001b[0m - \u001b[1mCopying Zarr archive to /mnt/ps/home/CORP/lu.zhu/.cache/polaris-tutorials/002/json/data.zarr. This may take a while.\u001b[0m\n" - ] - } - ], - "source": [ - "savedir = dm.fs.join(SAVE_DIR, \"json\")\n", - "json_path = dataset.to_json(savedir)" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "id": "c5147684", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['/mnt/ps/home/CORP/lu.zhu/.cache/polaris-tutorials/002/json',\n", - " '/mnt/ps/home/CORP/lu.zhu/.cache/polaris-tutorials/002/data.zarr',\n", - " '/mnt/ps/home/CORP/lu.zhu/.cache/polaris-tutorials/002/zarr']" - ] - }, - "execution_count": 17, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "fs = dm.fs.get_mapper(path).fs\n", - "fs.ls(SAVE_DIR)" - ] - }, - { - "cell_type": "markdown", - "id": "b9bf6c19", - "metadata": {}, - "source": [ - "Besides the `table.parquet` and `dataset.yaml`, we can now also see a `data` folder which stores the content for the additional content from the pointer columns." 
- ] - }, - { - "cell_type": "markdown", - "id": "3801c96f", - "metadata": {}, - "source": [ - "## Load the dataset" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "id": "33c25a55", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "\u001b[32m2024-07-21 13:12:16.485\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36mpolaris._mixins\u001b[0m:\u001b[36mmd5sum\u001b[0m:\u001b[36m27\u001b[0m - \u001b[1mComputing the checksum. This can be slow for large datasets.\u001b[0m\n", - "Finding all files in the Zarr archive: 17%|█▋ | 22/131 [00:00<00:00, 211.62it/s]" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Finding all files in the Zarr archive: 100%|██████████| 131/131 [00:00<00:00, 246.81it/s]\n" - ] - }, - { - "data": { - "text/html": [ - "
name: None
description:
tags:
user_attributes:
owner: None
polaris_version: dev
default_adapters:
zarr_root_path: /mnt/ps/home/CORP/lu.zhu/.cache/polaris-tutorials/002/zarr/processed.zarr
readme:
annotations:
  images:
    is_pointer: True
    modality: UNKNOWN
    description: None
    user_attributes:
    dtype: object
source: None
license: None
curation_reference: None
cache_dir: /mnt/ps/home/CORP/lu.zhu/.cache/polaris/datasets/97d642a2-001c-40aa-ac98-0e24353005d2
md5sum: b7c52acfbda1f9bba47ae218e9c4717f
artifact_id: None
n_rows: 1000
n_columns: 1
" - ], - "text/plain": [ - "{\n", - " \"name\": null,\n", - " \"description\": \"\",\n", - " \"tags\": [],\n", - " \"user_attributes\": {},\n", - " \"owner\": null,\n", - " \"polaris_version\": \"dev\",\n", - " \"default_adapters\": {},\n", - " \"zarr_root_path\": \"/mnt/ps/home/CORP/lu.zhu/.cache/polaris-tutorials/002/zarr/processed.zarr\",\n", - " \"readme\": \"\",\n", - " \"annotations\": {\n", - " \"images\": {\n", - " \"is_pointer\": true,\n", - " \"modality\": \"UNKNOWN\",\n", - " \"description\": null,\n", - " \"user_attributes\": {},\n", - " \"dtype\": \"object\"\n", - " }\n", - " },\n", - " \"source\": null,\n", - " \"license\": null,\n", - " \"curation_reference\": null,\n", - " \"cache_dir\": \"/mnt/ps/home/CORP/lu.zhu/.cache/polaris/datasets/97d642a2-001c-40aa-ac98-0e24353005d2\",\n", - " \"md5sum\": \"b7c52acfbda1f9bba47ae218e9c4717f\",\n", - " \"artifact_id\": null,\n", - " \"n_rows\": 1000,\n", - " \"n_columns\": 1\n", - "}" - ] - }, - "execution_count": 18, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "Dataset.from_json(json_path)" - ] - }, - { - "cell_type": "markdown", - "id": "0503a3a7", - "metadata": {}, - "source": [ - "### Upload zarr dataset to Hub" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "id": "cf0f7e69", - "metadata": {}, - "outputs": [], - "source": [ - "# Define the zarr dataset metadata before uploading\n", - "dataset.name = \"tutorial_zarr\"\n", - "dataset.license = \"CC-BY-4.0\"\n", - "dataset.source = \"https://github.com/polaris-hub/polaris\"" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "id": "5251b027", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "⠙ Uploading dataset... " - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "⠦ Uploading dataset... 
" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "\u001b[32m2024-07-21 13:19:12.188\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36mpolaris.hub.client\u001b[0m:\u001b[36mupload_dataset\u001b[0m:\u001b[36m602\u001b[0m - \u001b[1mCopying Zarr archive to the Hub. This may take a while.\u001b[0m\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ SUCCESS: \u001b[1mYour dataset has been successfully uploaded to the Hub. View it here: https://polarishub.io/datasets/polaris/tutorial_zarr\u001b[0m\n", - " \n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/mnt/ps/home/CORP/lu.zhu/miniconda3/envs/po_datasets/lib/python3.12/site-packages/yaspin/core.py:228: UserWarning: color, on_color and attrs are not supported when running in jupyter\n", - " self._color = self._set_color(value) if value else value\n" - ] - } - ], - "source": [ - "dataset.upload_to_hub(owner=\"polaris\")" - ] - }, - { - "cell_type": "markdown", - "id": "72767ef2", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [] - }, - "source": [ - "The End. 
" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.4" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/docs/tutorials/optimization.ipynb b/docs/tutorials/optimization.ipynb deleted file mode 100644 index d086821f..00000000 --- a/docs/tutorials/optimization.ipynb +++ /dev/null @@ -1,409 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": 1, - "id": "217690be-9836-4e06-930e-ba7efbb37d91", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [ - "remove_cell" - ] - }, - "outputs": [], - "source": [ - "# Note: Cell is tagged to not show up in the mkdocs build\n", - "%load_ext autoreload\n", - "%autoreload 2" - ] - }, - { - "cell_type": "markdown", - "id": "39b58e71", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [] - }, - "source": [ - "
\n", - "

In short

\n", - "

This tutorial shows how to optimize a Polaris dataset to improve its efficiency.

\n", - "
\n", - "\n", - "
\n", - "

No magic bullet

\n", - "

What works best depends on the specific dataset you're using, so you will benefit from trying out different ways of storing the data.

\n", - "
\n", - "\n", - "## Datasets that fit in memory\n", - "Through the Polaris `Subset` class, we aim to provide a _general purpose_ data loader that serves as a good default for a variety of use cases.\n", - "\n", - "**As a dataset creator**, it is important to be mindful of some design decisions you can make to improve performance for your downstream users. These design decisions are most impactful!\n", - "\n", - "**As a dataset user**, we provide the `Dataset.load_to_memory()` method to load the uncompressed dataset into memory. This is limited though, because there is only so much we can do automatically without risking data integrity.\n", - "\n", - "Despite our best efforts to provide a data loader that is as efficient as possible, you will always be able to optimize things further for a specific use case if needed.\n", - "\n", - "### _Without_ Zarr\n", - "Without pointer columns, the best way to optimize your dataset's performance is by making sure you use the appropriate dtype. A smaller memory footprint not only reduces storage requirements, but also speeds up moving data around (e.g. to the GPU or to create `torch.Tensor` objects)." - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "7c7d338f-8021-4331-a030-289cc9c7e5cc", - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "import pandas as pd" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "04920850-621b-4f08-bc2c-6dfaade87c90", - "metadata": {}, - "outputs": [], - "source": [ - "# Let's create a dummy dataset with two columns\n", - "rng = np.random.default_rng(0)\n", - "col_a = rng.choice(list(range(100)), 10000)\n", - "col_b = rng.random(10000)\n", - "table = pd.DataFrame({\"A\": col_a, \"B\": col_b})" - ] - }, - { - "cell_type": "markdown", - "id": "877b05eb-794c-4770-81fa-c1bdf5a4e103", - "metadata": {}, - "source": [ - "By default, Pandas (and NumPy) use the largest dtype available." 
- ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "c8531c7b-612b-4ac0-95c1-a085cff6a44d", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "A int64\n", - "B float64\n", - "dtype: object" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "table.dtypes" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "640fab44-1fd8-473a-ba57-1f7b8284344f", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "160132" - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "table.memory_usage().sum()" - ] - }, - { - "cell_type": "markdown", - "id": "399b8e83-e08a-48db-805a-a45d36609d79", - "metadata": {}, - "source": [ - "However, we know that column A only has values between 0 and 99, so we won't need the full `int64` dtype. The `np.int16` is already more appropriate! " - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "3cc1132b-ebf8-4bb2-815c-71627dc8a3b6", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "100132" - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "table[\"A\"] = table[\"A\"].astype(np.int16)\n", - "table.memory_usage().sum()" - ] - }, - { - "cell_type": "markdown", - "id": "c55b5c5d-32fb-4a1b-8e23-f6b1ac0628ba", - "metadata": {}, - "source": [ - "We managed to reduce the number of bytes by ~60k (or 60KB). **That's 37.5% less!**\n", - "\n", - "Now imagine we would be talking about gigabyte-sized dataset!" - ] - }, - { - "cell_type": "markdown", - "id": "490fd21e-db29-4539-b514-e83493060a55", - "metadata": {}, - "source": [ - "### _With_ Zarr\n", - "If part of the dataset is stored in a Zarr archive - and that Zarr archive fits in memory (remember to optimize the `dtype`) - the most efficient thing to do is to just convert from Zarr to a NumPy array. 
Zarr is not built to support this use case specifically and NumPy is optimized for it. For more information, see e.g. [this Github issue](https://github.com/zarr-developers/zarr-python/issues/1395).\n", - "\n", - "Luckily, you don't have to do this yourself. You can use Polaris its `Dataset.load_to_memory()`.\n", - "\n", - "Let's again start by creating a dummy dataset!" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "624b55c3-389f-4e2c-bc49-58db2542e3a4", - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "import zarr\n", - "from tempfile import mkdtemp\n", - "\n", - "tmpdir = mkdtemp()\n", - "\n", - "# For the ones familiar with Zarr, this is not optimized at all.\n", - "# If you wouldn't want to convert to NumPy, you would want to\n", - "# optimize the chunking / compression.\n", - "\n", - "path = os.path.join(tmpdir, \"data.zarr\")\n", - "root = zarr.open(path, \"w\")\n", - "root.array(\"A\", rng.random(10000))\n", - "root.array(\"B\", rng.random(10000));" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "dcf9f62e-7be3-4d85-b41c-4e45a73e5f34", - "metadata": {}, - "outputs": [], - "source": [ - "from polaris.dataset import create_dataset_from_file\n", - "\n", - "root_path = os.path.join(tmpdir, \"data\", \"data.zarr\")\n", - "dataset = create_dataset_from_file(path, zarr_root_path=root_path)" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "e2e441ef-7a9f-4306-81c6-5f2414655130", - "metadata": {}, - "outputs": [], - "source": [ - "from polaris.dataset import Subset\n", - "\n", - "subset = Subset(dataset, np.arange(len(dataset)), \"A\", \"B\")" - ] - }, - { - "cell_type": "markdown", - "id": "9fb12a2d-733b-4e52-83f1-f6a3a3e02054", - "metadata": {}, - "source": [ - "For the sake of this example, we will use PyTorch." 
- ] - }, - { - "cell_type": "code", - "execution_count": 10, - "id": "6c562fb4-0a7e-49a7-81bd-0a65cf8c8109", - "metadata": {}, - "outputs": [], - "source": [ - "from torch.utils.data import DataLoader\n", - "\n", - "dataloader = DataLoader(subset, batch_size=64, shuffle=True)" - ] - }, - { - "cell_type": "markdown", - "id": "e4f3cca7-4a3f-46ba-b50f-15f6af99120d", - "metadata": {}, - "source": [ - "Let's see how fast this is!" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "f53c3249-aa24-4323-8acf-7593bdd12dc0", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "1.45 s ± 22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" - ] - } - ], - "source": [ - "%%timeit\n", - "for batch in dataloader:\n", - " pass" - ] - }, - { - "cell_type": "markdown", - "id": "2465116b-e69f-47e7-a77f-a93f69a55ec3", - "metadata": {}, - "source": [ - "That's pretty slow... Let's see if Polaris its optimization helps. " - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "861ecad6-a141-4527-a583-19ec7ed7ea78", - "metadata": {}, - "outputs": [], - "source": [ - "dataset.load_to_memory()" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "be163e3a-b054-4496-9bc4-bfdc063a42aa", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "99.4 ms ± 2.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" - ] - } - ], - "source": [ - "%%timeit\n", - "for batch in dataloader:\n", - " pass" - ] - }, - { - "cell_type": "markdown", - "id": "e85e1089-9969-4ed0-999d-7b6327148a37", - "metadata": {}, - "source": [ - "That's a lot faster! \n", - "\n", - "Now all that's left to do, is to clean up the temporary directory." 
- ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "24781758-8c10-447f-a56b-d20da5fa297f", - "metadata": {}, - "outputs": [], - "source": [ - "from shutil import rmtree\n", - "\n", - "rmtree(tmpdir)" - ] - }, - { - "cell_type": "markdown", - "id": "47f8babc-ae30-402e-80e6-1039ac60207e", - "metadata": {}, - "source": [ - "## Datasets that fit on a local disk\n", - "\n", - "For datasets that don't fit in memory, but that can be stored on a local disk, the most impactful design decision is how the dataset is chunked. \n", - "\n", - "Zarr datasets are chunked. When you try to load one piece of data, the entire chunk that data is part of has to be loaded into memory and decompressed. Remember that in ML, data access is typically random, which is a terrible access pattern because you are likely to reload chunks into memory.\n", - "\n", - "Most efficient is thus to chunk the data such that each chunk only contains a single data point.\n", - "\n", - "- Benefit: No longer induce a performance penalty due to loading additional data into memory that it might not need.\n", - "- Downside: You might be able to compress the data more if you can consider similarities across data points while compressing.\n", - "\n", - "**A note on rechunking**: Within Polaris, you do not have control over how a dataset on the Hub is chunked. In that case, rechunking is needed. This can induce a one-time, but nevertheless big performance penalty (see also the [Zarr docs](https://zarr.readthedocs.io/en/stable/tutorial.html#changing-chunk-shapes-rechunking)). I don’t expect this to be an issue in the short-term given the size of the dataset we will be working with, but Zarr recommends using the [rechunker](https://github.com/pangeo-data/rechunker?tab=readme-ov-file) Python package to improve performance." 
- ] - }, - { - "cell_type": "markdown", - "id": "160b3cb4-0069-402e-ae35-c5410d68285a", - "metadata": {}, - "source": [ - "## Remote Datasets\n", - "In this case, you really benefit from improving memory storage by trying different compressors.\n", - "\n", - "See also [this article](https://earthmover.io/blog/cloud-native-dataloader)." - ] - }, - { - "cell_type": "markdown", - "id": "72767ef2", - "metadata": { - "editable": true, - "slideshow": { - "slide_type": "" - }, - "tags": [] - }, - "source": [ - "The End. " - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.8" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/docs/tutorials/submit_to_benchmark.ipynb b/docs/tutorials/submit_to_benchmark.ipynb new file mode 100644 index 00000000..d8db1516 --- /dev/null +++ b/docs/tutorials/submit_to_benchmark.ipynb @@ -0,0 +1,374 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "40f99374-b47e-4f84-bdb9-148a11f9c07d", + "metadata": { + "editable": true, + "slideshow": { + "slide_type": "" + }, + "tags": [] + }, + "source": [ + "This tutorial is an extended version of the [Quickstart Guide](../quickstart.html)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3d66f466", + "metadata": { + "editable": true, + "slideshow": { + "slide_type": "" + }, + "tags": [ + "remove_cell" + ] + }, + "outputs": [], + "source": [ + "# Note: Cell is tagged to not show up in the mkdocs build\n", + "%load_ext autoreload\n", + "%autoreload 2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9b465ea4-7c71-443b-9908-3f9e567ee4c4", + "metadata": {}, + "outputs": [], + "source": [ + "import polaris as po" + 
] + }, + { + "cell_type": "markdown", + "id": "168c7f21-f9ec-43e2-b123-2bdcba2e8a71", + "metadata": {}, + "source": [ + "## Login\n", + "We first need to authenticate ourselves using our Polaris account. If you don't have an account yet, you can create one [here](https://polarishub.io/sign-up)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "de8bf4bf-4dbd-42eb-8f74-bf8aa0339469", + "metadata": {}, + "outputs": [], + "source": [ + "from polaris.hub.client import PolarisHubClient\n", + "\n", + "with PolarisHubClient() as client:\n", + " client.login()" + ] + }, + { + "cell_type": "markdown", + "id": "5edee39f-ce29-4ae6-91ce-453d9190541b", + "metadata": {}, + "source": [ + "## Load from the Hub\n", + "Datasets and benchmarks are identified by a `owner/slug` id. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4e004589-6c48-4232-b353-b1700536dde6", + "metadata": {}, + "outputs": [], + "source": [ + "benchmark = po.load_benchmark(\"polaris/hello-world-benchmark\")" + ] + }, + { + "cell_type": "markdown", + "id": "9c6efb7f-b59f-4d28-a374-9a5336e5c817", + "metadata": {}, + "source": [ + "Loading a benchmark will automatically load the underlying dataset. \n", + "\n", + "You can also load the dataset directly. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1e65d085-7f93-4b6f-8c2b-03b547b89e69", + "metadata": {}, + "outputs": [], + "source": [ + "dataset = po.load_dataset(\"polaris/hello-world\")" + ] + }, + { + "cell_type": "markdown", + "id": "1ce8e0e5-88c8-4d3b-9292-e75c97315833", + "metadata": {}, + "source": [ + "## The Benchmark API\n", + "The benchmark object provides two main API endpoints. 
\n", + "\n", + "- `get_train_test_split()`: For creating objects through which we can access the different dataset partitions.\n", + "- `evaluate()`: For evaluating a set of predictions in accordance with the benchmark protocol.\n", + "\n", + "### Train-test split" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "054563dd-fe8b-4681-89d6-869b35d8a210", + "metadata": {}, + "outputs": [], + "source": [ + "train, test = benchmark.get_train_test_split()" + ] + }, + { + "cell_type": "markdown", + "id": "1926b12f-2c19-4be8-8d8f-d7eef606b2da", + "metadata": {}, + "source": [ + "The created objects support various flavours to access the data.\n", + "- The objects are iterable;\n", + "- The objects can be indexed;\n", + "- The objects have properties to access all data at once." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "43cbe460", + "metadata": {}, + "outputs": [], + "source": [ + "for x, y in train:\n", + " pass" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2f317c10", + "metadata": {}, + "outputs": [], + "source": [ + "for i in range(len(train)):\n", + " x, y = train[i]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "08ce24c7-992a-40a7-b8ef-c862fab99e6e", + "metadata": {}, + "outputs": [], + "source": [ + "x = train.inputs\n", + "y = train.targets" + ] + }, + { + "cell_type": "markdown", + "id": "d5fa35c5-e2d0-4d75-a2cb-75b4749d91ef", + "metadata": {}, + "source": [ + "To avoid accidental access to the test targets, the test object does not expose the labels and will throw an error if you try access them explicitly." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8c33b7d4-fa82-4994-a7ab-5d0821ad5fd4", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "for x in test:\n", + " pass" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8b4ac073", + "metadata": {}, + "outputs": [], + "source": [ + "for i in range(len(test)):\n", + " x = test[i]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5664eb87", + "metadata": {}, + "outputs": [], + "source": [ + "x = test.inputs\n", + "\n", + "# NOTE: The below will throw an error!\n", + "# y = test.targets" + ] + }, + { + "cell_type": "markdown", + "id": "2f9f2b23-2621-461d-95bc-3a8ddb2d3970", + "metadata": {}, + "source": [ + "We also support conversion to other typical formats." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ece710e5-e271-4c25-9d7b-32e098db194d", + "metadata": {}, + "outputs": [], + "source": [ + "df_train = train.as_dataframe()" + ] + }, + { + "cell_type": "markdown", + "id": "955ad9db-3468-4f34-b303-18e6d642be56", + "metadata": {}, + "source": [ + "### Submit your results\n", + "\n", + "In this example, we will train a simple Random Forest model on the ECFP representation through [scikit-learn](https://scikit-learn.org/stable/) and [datamol](https://github.com/datamol-io/datamol)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "748dd278-0fd0-4c5b-ac6a-8d974143c3b9", + "metadata": {}, + "outputs": [], + "source": [ + "import datamol as dm\n", + "from sklearn.ensemble import RandomForestRegressor\n", + "\n", + "# We will recreate the split to pass a featurization function.\n", + "train, test = benchmark.get_train_test_split(featurization_fn=dm.to_fp)\n", + "\n", + "# Define a model and train\n", + "model = RandomForestRegressor(max_depth=2, random_state=0)\n", + "model.fit(train.X, train.y)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6633ec79-a6ff-4ce0-bc7d-cdb9e1042462", + "metadata": {}, + "outputs": [], + "source": [ + "predictions = model.predict(test.X)" + ] + }, + { + "cell_type": "markdown", + "id": "d59b969e-5a66-4626-a865-f2b2aeea890d", + "metadata": {}, + "source": [ + "As said before, evaluating the submissions should be done through the `evaluate()` endpoint." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "79c072cf-683e-4257-b31e-59fdbcf5e979", + "metadata": {}, + "outputs": [], + "source": [ + "results = benchmark.evaluate(predictions)\n", + "results" + ] + }, + { + "cell_type": "markdown", + "id": "90114c20-4c01-432b-9f4d-b31863881cc6", + "metadata": {}, + "source": [ + "Before uploading the results to the Hub, you can provide some additional information about the results that will be displayed on the Polaris Hub." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a601f415-c563-4efe-94c3-0d44f3fd6576", + "metadata": {}, + "outputs": [], + "source": [ + "# For a complete list of meta-data, check out the BenchmarkResults object\n", + "results.name = \"hello-world-result\"\n", + "results.github_url = \"https://github.com/polaris-hub/polaris-hub\"\n", + "results.paper_url = \"https://polarishub.io/\"\n", + "results.description = \"Hello, World!\"\n", + "results.tags = [\"random_forest\", \"ecfp\"]\n", + "results.user_attributes = {\"Framework\": \"Scikit-learn\"}" + ] + }, + { + "cell_type": "markdown", + "id": "4e7cc06d", + "metadata": {}, + "source": [ + "Finally, let's upload the results to the Hub!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "60cbf4b9-8514-480d-beda-8a50e5f7c9a6", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "results.upload_to_hub(owner=\"my-username\", access=\"public\")" + ] + }, + { + "cell_type": "markdown", + "id": "78fe8d63", + "metadata": {}, + "source": [ + "That's it! 
Just like that you have submitted a result to a Polaris benchmark\n", + "\n", + "---\n", + "\n", + "The End.\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.8" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/tutorials/submit_to_competition.ipynb b/docs/tutorials/submit_to_competition.ipynb new file mode 100644 index 00000000..89779cc9 --- /dev/null +++ b/docs/tutorials/submit_to_competition.ipynb @@ -0,0 +1,169 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "40f99374-b47e-4f84-bdb9-148a11f9c07d", + "metadata": { + "editable": true, + "slideshow": { + "slide_type": "" + }, + "tags": [] + }, + "source": [ + "On Polaris, submitting to a competition is very similar to submitting to a benchmark. \n", + "\n", + "The main difference lies in how predictions are prepared and how they are evaluated" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "3d66f466", + "metadata": { + "editable": true, + "slideshow": { + "slide_type": "" + }, + "tags": [ + "remove_cell" + ] + }, + "outputs": [], + "source": [ + "# Note: Cell is tagged to not show up in the mkdocs build\n", + "%load_ext autoreload\n", + "%autoreload 2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "66cd175c-1f8d-4209-ad78-8d959ea31d9f", + "metadata": {}, + "outputs": [], + "source": [ + "import polaris as po" + ] + }, + { + "cell_type": "markdown", + "id": "84b6d1b9-3ee8-4ff4-9d92-8ed91ffa2f51", + "metadata": {}, + "source": [ + "## Login\n", + "As before, we first need to authenticate ourselves using our Polaris account. 
If you don't have an account yet, you can create one [here](https://polarishub.io/sign-up)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9b465ea4-7c71-443b-9908-3f9e567ee4c4", + "metadata": {}, + "outputs": [], + "source": [ + "from polaris.hub.client import PolarisHubClient\n", + "\n", + "with PolarisHubClient() as client:\n", + " client.login()" + ] + }, + { + "cell_type": "markdown", + "id": "5edee39f-ce29-4ae6-91ce-453d9190541b", + "metadata": {}, + "source": [ + "## Load the Competition\n", + "As with regular benchmarks, a competition is identified by the `owner/slug` id." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "4e004589-6c48-4232-b353-b1700536dde6", + "metadata": {}, + "outputs": [], + "source": [ + "competition = po.load_competition(\"polaris/hello-world-competition\")" + ] + }, + { + "cell_type": "markdown", + "id": "36f3e829", + "metadata": {}, + "source": [ + "## The Competition API\n", + "Similar to the benchmark API, the competition exposes two main API endpoints:\n", + "\n", + "- `get_train_test_split()`, which does exactly the same as for benchmarks. \n", + "- `submit_predictions()`, which is used to submit your predictions to a competition.\n", + "\n", + "Note that different from regular benchmarks, competitions don't have an `evaluate()` endpoint. \n", + "\n", + "That's because the evaluation happens server side. This gives the competition organizers precise control over how and when the test set and associated results get published, providing a unique opportunity for unbiased evaluation and comparison of different methods.\n", + "\n", + "### Submit your _predictions_\n", + "Similar to your actual results, you can also provide metadata about your predictions." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2b36e09b", + "metadata": {}, + "outputs": [], + "source": [ + "competition.submit_predictions(\n", + " predictions=predictions,\n", + " prediction_name=\"my-first-predictions\",\n", + " prediction_owner=\"my-username\",\n", + " report_url=\"https://www.example.com\", \n", + " # The below metadata is optional, but recommended.\n", + " github_url=\"https://github.com/polaris-hub/polaris\",\n", + " description=\"Just testing the Polaris API here!\",\n", + " tags=[\"tutorial\"],\n", + " user_attributes={\"Framework\": \"Scikit-learn\", \"Method\": \"Gradient Boosting\"}\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "44973556", + "metadata": {}, + "source": [ + "That's it! Just like that you have partaken in your first Polaris competition. \n", + "\n", + "
\n", + "

Where are my results?

\n", + "

Results are only published at predetermined intervals, as specified in the competition details. Keep an eye on the leaderboard when it goes public, and best of luck!

\n", + "
\n", + "\n", + "\n", + "---\n", + "\n", + "The End. " + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.8" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/mkdocs.yml b/mkdocs.yml index d7bb02fd..0714b1a4 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -16,16 +16,14 @@ nav: - Getting started: - Polaris: index.md - Quickstart: quickstart.md + - Resources: resources.md - Tutorials: - - Basics: tutorials/basics.ipynb - - Data Models: tutorials/custom_dataset_benchmark.ipynb - - Creating Datasets: - - Zarr Datasets: tutorials/dataset_zarr.ipynb - - PDB Datasets: tutorials/dataset_pdb.ipynb - - SDF Datasets: tutorials/dataset_sdf.ipynb - - Optimization: tutorials/optimization.ipynb - - Competitions: - - tutorials/competition.participate.ipynb + - Submit: + - Submit to a Benchmark: tutorials/submit_to_benchmark.ipynb + - Submit to a Competition: tutorials/submit_to_competition.ipynb + - Create: + - Create a Dataset: tutorials/create_a_dataset.ipynb + - Create a Benchmark: tutorials/create_a_benchmark.ipynb - API Reference: - Load: api/load.md - Core: @@ -38,9 +36,6 @@ nav: - Client: api/hub.client.md - External Auth Client: api/hub.external_client.md - Additional: - - Dataset Factory: api/factory.md - - Data Converters: api/converters.md - - Data Adapters: api/adapters.md - Base classes: api/base.md - Types: api/utils.types.md - Community: https://discord.gg/vBFd8p6H7u diff --git a/polaris/benchmark/__init__.py b/polaris/benchmark/__init__.py index ea800576..c0762bf4 100644 --- a/polaris/benchmark/__init__.py +++ b/polaris/benchmark/__init__.py @@ -1,9 +1,11 @@ from polaris.benchmark._base import BenchmarkSpecification, 
BenchmarkV1Specification +from polaris.benchmark._benchmark_v2 import BenchmarkV2Specification from polaris.benchmark._definitions import MultiTaskBenchmarkSpecification, SingleTaskBenchmarkSpecification __all__ = [ "BenchmarkSpecification", "BenchmarkV1Specification", + "BenchmarkV2Specification", "SingleTaskBenchmarkSpecification", "MultiTaskBenchmarkSpecification", ] diff --git a/polaris/benchmark/_base.py b/polaris/benchmark/_base.py index 6be6ba83..5df71b9c 100644 --- a/polaris/benchmark/_base.py +++ b/polaris/benchmark/_base.py @@ -15,7 +15,7 @@ model_validator, ) from sklearn.utils.multiclass import type_of_target -from typing_extensions import Self +from typing_extensions import Self, deprecated from polaris._artifact import BaseArtifactModel from polaris.benchmark._split import SplitSpecificationV1Mixin @@ -75,19 +75,14 @@ def n_test_datapoints(self) -> dict[str, int]: class BenchmarkSpecification( PredictiveTaskSpecificationMixin, BaseArtifactModel, BaseSplitSpecificationMixin, abc.ABC ): - """This class wraps a [`Dataset`][polaris.dataset.Dataset] with additional data - to specify the evaluation logic. + """This class wraps a dataset with additional data to specify the evaluation logic. Specifically, it specifies: - 1. Which dataset to use (see [`Dataset`][polaris.dataset.Dataset]); + 1. Which dataset to use; 2. A task definition (we currently only support predictive tasks); 3. A predefined, static train-test split to use during evaluation. - info: Subclasses - Polaris includes various subclasses of the `BenchmarkSpecification` that provide a more precise data-model or - additional logic, e.g. [`SingleTaskBenchmarkSpecification`][polaris.benchmark.SingleTaskBenchmarkSpecification]. 
- Examples: Basic API usage: ```python @@ -236,7 +231,6 @@ def to_json(self, destination: str) -> str: Warning: Multiple files Perhaps unintuitive, this method creates multiple files in the destination directory as it also saves the dataset it is based on to the specified destination. - See the docstring of [`Dataset.to_json`][polaris.dataset.Dataset.to_json] for more information. Args: destination: The _directory_ to save the associated data to. @@ -273,6 +267,9 @@ def __str__(self): return self.__repr__() +@deprecated( + "Use BenchmarkV2Specification instead. If you're loading this dataset from the Polaris Hub, you can ignore this warning." +) class BenchmarkV1Specification(SplitSpecificationV1Mixin, ChecksumMixin, BenchmarkSpecification): _version: ClassVar[Literal[1]] = 1 diff --git a/polaris/experimental/_benchmark_v2.py b/polaris/benchmark/_benchmark_v2.py similarity index 96% rename from polaris/experimental/_benchmark_v2.py rename to polaris/benchmark/_benchmark_v2.py index 7ea916db..1578a229 100644 --- a/polaris/experimental/_benchmark_v2.py +++ b/polaris/benchmark/_benchmark_v2.py @@ -4,8 +4,8 @@ from typing_extensions import Self from polaris.benchmark import BenchmarkSpecification +from polaris.benchmark._split_v2 import SplitSpecificationV2Mixin from polaris.dataset import DatasetV2, Subset -from polaris.experimental._split_v2 import SplitSpecificationV2Mixin from polaris.utils.errors import InvalidBenchmarkError from polaris.utils.types import ColumnName @@ -14,7 +14,7 @@ class BenchmarkV2Specification(SplitSpecificationV2Mixin, BenchmarkSpecification _version: ClassVar[Literal[2]] = 2 dataset: DatasetV2 = Field(exclude=True) - n_classes: dict[ColumnName, int] + n_classes: dict[ColumnName, int] = Field(default_factory=dict) @field_validator("dataset", mode="before") @classmethod diff --git a/polaris/experimental/_split_v2.py b/polaris/benchmark/_split_v2.py similarity index 100% rename from polaris/experimental/_split_v2.py rename to 
polaris/benchmark/_split_v2.py diff --git a/polaris/competition/__init__.py b/polaris/competition/__init__.py index 57e2bb9b..43e070af 100644 --- a/polaris/competition/__init__.py +++ b/polaris/competition/__init__.py @@ -23,9 +23,7 @@ class CompetitionSpecification(DatasetV2, PredictiveTaskSpecificationMixin, SplitSpecificationV1Mixin): - """An instance of this class represents a Polaris competition. It defines fields and functionality - that in combination with the [`DatasetV2`][polaris.dataset.DatasetV2] class, allow - users to participate in competitions hosted on Polaris Hub. + """An instance of this class represents a Polaris competition. Examples: Basic API usage: diff --git a/polaris/dataset/_base.py b/polaris/dataset/_base.py index 92f17761..880d2337 100644 --- a/polaris/dataset/_base.py +++ b/polaris/dataset/_base.py @@ -50,8 +50,7 @@ class BaseDataset(BaseArtifactModel, abc.ABC): At its core, a dataset in Polaris can _conceptually_ be thought of as tabular data structure that stores data-points in a row-wise manner, where each column correspond to a variable associated with that datapoint. - A Dataset can have multiple modalities or targets, can be sparse and can be part of one or multiple - [`BenchmarkSpecification`][polaris.benchmark.BenchmarkSpecification] objects. + A Dataset can have multiple modalities or targets, can be sparse and can be part of one or multiple benchmarks. Attributes: default_adapters: The adapters that the Dataset recommends to use by default to change the format of the data @@ -164,7 +163,7 @@ def zarr_root(self) -> zarr.Group | None: Note: Different to `zarr_data` The `zarr_data` attribute references either to the Zarr archive or to a in-memory copy of the data. - See also [`Dataset.load_to_memory`][polaris.dataset.Dataset.load_to_memory]. + See also `dataset.load_to_memory()`. 
""" from polaris.hub.storage import StorageSession diff --git a/polaris/dataset/_column.py b/polaris/dataset/_column.py index cb0b81bc..d4d36512 100644 --- a/polaris/dataset/_column.py +++ b/polaris/dataset/_column.py @@ -23,7 +23,7 @@ class Modality(enum.Enum): class ColumnAnnotation(BaseModel): """ - The `ColumnAnnotation` class is used to annotate the columns of the [`Dataset`][polaris.dataset.Dataset] object. + The `ColumnAnnotation` class is used to annotate the columns of the object. This mostly just stores meta-data and does not affect the logic. The exception is the `is_pointer` attribute. Attributes: @@ -37,7 +37,7 @@ class ColumnAnnotation(BaseModel): molecules (e.g. "chemical/x-smiles"), visualization for its content will be activated on the Hub side """ - is_pointer: bool = False + is_pointer: bool = Field(False, deprecated=True) modality: Modality = Modality.UNKNOWN description: str | None = None user_attributes: dict[str, str] = Field(default_factory=dict) diff --git a/polaris/dataset/_dataset.py b/polaris/dataset/_dataset.py index 5829b837..abdc12b3 100644 --- a/polaris/dataset/_dataset.py +++ b/polaris/dataset/_dataset.py @@ -10,7 +10,7 @@ import zarr from datamol.utils import fs as dmfs from pydantic import PrivateAttr, computed_field, field_validator, model_validator -from typing_extensions import Self +from typing_extensions import Self, deprecated from polaris.dataset._adapters import Adapter from polaris.dataset._base import BaseDataset @@ -29,6 +29,9 @@ _INDEX_SEP = "#" +@deprecated( + "Use DatasetV2 instead. If you're loading this dataset from the Polaris Hub, you can ignore this warning." +) class DatasetV1(BaseDataset, ChecksumMixin): """First version of a Polaris Dataset. 
diff --git a/polaris/dataset/_factory.py b/polaris/dataset/_factory.py index 83597c9a..3df6fca0 100644 --- a/polaris/dataset/_factory.py +++ b/polaris/dataset/_factory.py @@ -5,6 +5,7 @@ import datamol as dm import pandas as pd import zarr +from typing_extensions import deprecated from polaris.dataset import ColumnAnnotation, DatasetV1 from polaris.dataset._adapters import Adapter @@ -13,6 +14,9 @@ logger = logging.getLogger(__name__) +@deprecated( + "Please create the Zarr archive directly. For guidance, see https://polaris-hub.github.io/polaris/stable/tutorials/create_a_dataset.html." +) def create_dataset_from_file(path: str, zarr_root_path: str | None = None) -> DatasetV1: """ This function is a convenience function to create a dataset from a file. @@ -29,6 +33,9 @@ def create_dataset_from_file(path: str, zarr_root_path: str | None = None) -> Da return factory.build() +@deprecated( + "Please create the Zarr archive directly. For guidance, see https://polaris-hub.github.io/polaris/stable/tutorials/create_a_dataset.html." +) def create_dataset_from_files( paths: list[str], zarr_root_path: str | None = None, axis: Literal[0, 1, "index", "columns"] = 0 ) -> DatasetV1: @@ -52,6 +59,9 @@ def create_dataset_from_files( return factory.build() +@deprecated( + "Please create the Zarr archive directly. For guidance, see https://polaris-hub.github.io/polaris/stable/tutorials/create_a_dataset.html." +) class DatasetFactory: """ The `DatasetFactory` makes it easier to create complex datasets. @@ -196,7 +206,7 @@ def add_columns( If not specifying a key to merge on, the columns will simply be added to the dataset that has been built so far without any reordering. They are therefore expected to meet all - the same expectations as for [`add_column`][polaris.dataset.DatasetFactory.add_column]. + the same expectations as for `add_column()`. Args: df: A Pandas DataFrame with the columns that we want to add to the dataset. 
diff --git a/polaris/dataset/converters/_pdb.py b/polaris/dataset/converters/_pdb.py index c76d1e97..db9f4df0 100644 --- a/polaris/dataset/converters/_pdb.py +++ b/polaris/dataset/converters/_pdb.py @@ -6,6 +6,7 @@ import pandas as pd import zarr from fastpdb import struc +from typing_extensions import deprecated from polaris.dataset import ColumnAnnotation, Modality from polaris.dataset._adapters import Adapter @@ -76,6 +77,7 @@ def zarr_to_pdb(atom_dict: zarr.Group): return struc.array(atom_array) +@deprecated("Please use the custom codecs in `polaris.dataset.zarr.codecs` instead.") class PDBConverter(Converter): """ Converts PDB files into a Polaris dataset based on fastpdb. diff --git a/polaris/dataset/converters/_sdf.py b/polaris/dataset/converters/_sdf.py index f998f5e1..1f5f9d77 100644 --- a/polaris/dataset/converters/_sdf.py +++ b/polaris/dataset/converters/_sdf.py @@ -4,6 +4,7 @@ import datamol as dm import pandas as pd from rdkit import Chem +from typing_extensions import deprecated from polaris.dataset import ColumnAnnotation, Modality from polaris.dataset._adapters import Adapter @@ -13,6 +14,7 @@ from polaris.dataset import DatasetFactory +@deprecated("Please use the custom codecs in `polaris.dataset.zarr.codecs` instead.") class SDFConverter(Converter): """ Converts a SDF file into a Polaris dataset. 
diff --git a/polaris/dataset/converters/_zarr.py b/polaris/dataset/converters/_zarr.py index 417c20fb..c8f9d436 100644 --- a/polaris/dataset/converters/_zarr.py +++ b/polaris/dataset/converters/_zarr.py @@ -1,9 +1,10 @@ +import os from collections import defaultdict from typing import TYPE_CHECKING -import os import pandas as pd import zarr +from typing_extensions import deprecated from polaris.dataset import ColumnAnnotation from polaris.dataset.converters._base import Converter, FactoryProduct @@ -12,13 +13,10 @@ from polaris.dataset import DatasetFactory +@deprecated("Please use the custom codecs in `polaris.dataset.zarr.codecs` instead.") class ZarrConverter(Converter): """Parse a [.zarr](https://zarr.readthedocs.io/en/stable/index.html) archive into a Polaris `Dataset`. - Tip: Tutorial - To learn more about the zarr format, see the - [tutorial](../tutorials/dataset_zarr.ipynb). - Warning: Loading from `.zarr` Loading and saving datasets from and to `.zarr` is still experimental and currently not fully supported by the Hub. diff --git a/polaris/evaluate/_results.py b/polaris/evaluate/_results.py index e04ed0d6..7b85f3f5 100644 --- a/polaris/evaluate/_results.py +++ b/polaris/evaluate/_results.py @@ -136,7 +136,8 @@ def _serialize_results(self, value: pd.DataFrame) -> list[ResultRecords]: class BenchmarkResults(EvaluationResult): """Class specific to results for standard benchmarks. - This object is returned by [`BenchmarkSpecification.evaluate`][polaris.benchmark.BenchmarkSpecification.evaluate]. + This object is returned by `benchmark.evaluate()`. + In addition to the metrics on the test set, it contains additional meta-data and logic to integrate the results with the Polaris Hub. 
diff --git a/polaris/hub/client.py b/polaris/hub/client.py index 52f2e6a5..69986c3f 100644 --- a/polaris/hub/client.py +++ b/polaris/hub/client.py @@ -20,10 +20,10 @@ MultiTaskBenchmarkSpecification, SingleTaskBenchmarkSpecification, ) +from polaris.benchmark._benchmark_v2 import BenchmarkV2Specification from polaris.competition import CompetitionSpecification from polaris.dataset import Dataset, DatasetV1, DatasetV2 from polaris.evaluate import BenchmarkResults, CompetitionPredictions -from polaris.experimental._benchmark_v2 import BenchmarkV2Specification from polaris.hub.external_client import ExternalAuthClient from polaris.hub.oauth import CachedTokenAuth from polaris.hub.settings import PolarisHubSettings diff --git a/polaris/loader/load.py b/polaris/loader/load.py index 60108b4e..05077a9d 100644 --- a/polaris/loader/load.py +++ b/polaris/loader/load.py @@ -4,8 +4,8 @@ from datamol.utils import fs from polaris.benchmark import MultiTaskBenchmarkSpecification, SingleTaskBenchmarkSpecification +from polaris.benchmark._benchmark_v2 import BenchmarkV2Specification from polaris.dataset import DatasetV1, create_dataset_from_file -from polaris.experimental._benchmark_v2 import BenchmarkV2Specification from polaris.hub.client import PolarisHubClient from polaris.utils.types import ChecksumStrategy @@ -24,8 +24,7 @@ def load_dataset(path: str, verify_checksum: ChecksumStrategy = "verify_unless_z provide the `owner/name` slug. This can be easily copied from the relevant dataset page on the Hub. - **Directory**: When loading the dataset from a directory, you should provide the path - as returned by [`Dataset.to_json`][polaris.dataset.Dataset.to_json]. - The path can be local or remote. + as returned by `dataset.to_json()`. The path can be local or remote. """ extension = fs.get_extension(path) @@ -64,8 +63,7 @@ def load_benchmark(path: str, verify_checksum: ChecksumStrategy = "verify_unless provide the `owner/name` slug. 
This can be easily copied from the relevant benchmark page on the Hub. - **Directory**: When loading the benchmark from a directory, you should provide the path - as returned by [`BenchmarkSpecification.to_json`][polaris.benchmark._base.BenchmarkSpecification.to_json]. - The path can be local or remote. + as returned by `benchmark.to_json()`. The path can be local or remote. """ is_file = fs.is_file(path) or fs.get_extension(path) == "zarr" diff --git a/pyproject.toml b/pyproject.toml index 4d5ec5ab..dab2e6eb 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -144,3 +144,4 @@ lint.per-file-ignores."__init__.py" = [ ] line-length = 110 target-version = "py310" +extend-exclude = ["*.ipynb"] \ No newline at end of file diff --git a/tests/test_benchmark_v2.py b/tests/test_benchmark_v2.py index 732c4b67..2bf46fed 100644 --- a/tests/test_benchmark_v2.py +++ b/tests/test_benchmark_v2.py @@ -2,8 +2,8 @@ from pydantic import ValidationError from pyroaring import BitMap -from polaris.experimental._benchmark_v2 import BenchmarkV2Specification -from polaris.experimental._split_v2 import IndexSet, SplitV2 +from polaris.benchmark._benchmark_v2 import BenchmarkV2Specification +from polaris.benchmark._split_v2 import IndexSet, SplitV2 @pytest.fixture