Merge pull request #53 from rgommers/sphinx-site
Add content for a Sphinx site specifically for the protocol
rgommers authored Sep 2, 2021
2 parents 8498cf1 + 070b9cf commit 27b8e1c
Showing 9 changed files with 636 additions and 174 deletions.
72 changes: 72 additions & 0 deletions protocol/API.md
@@ -0,0 +1,72 @@
# API of the `__dataframe__` protocol

Specification for objects to be accessed, for the purpose of dataframe
interchange between libraries, via the `__dataframe__` method on a library's
data frame object.

For guiding requirements, see {ref}`design-requirements`.


## Concepts in this design

1. A `Buffer` class. A *buffer* is a contiguous block of memory - this is the
only thing that actually maps to a 1-D array in the sense that it could be
converted to NumPy, CuPy, et al.
2. A `Column` class. A *column* has a single dtype. It can consist
of multiple *chunks*. A single chunk of a column (which may be the whole
column if ``num_chunks == 1``) is again modeled as a `Column` instance, and
contains one data *buffer* and (optionally) one *mask* for missing data.
3. A `DataFrame` class. A *data frame* is an ordered collection of *columns*,
which are identified with names that are unique strings. All the data
frame's rows are the same length. It can consist of multiple *chunks*. A
single chunk of a data frame is again modeled as a `DataFrame` instance.
4. A *mask* concept. A *mask* of a single-chunk column is a *buffer*.
5. A *chunk* concept. A *chunk* is a sub-dividing element that can be applied
to a *data frame* or a *column*.

Note that the only way to access these objects is through a call to
`__dataframe__` on a data frame object. These classes are NOT meant as public
API; they serve only to describe the API of what a call to `__dataframe__`
returns. They are the concepts needed to capture the memory layout and data
access of a data frame.
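
To make the relationship between these concepts concrete, here is a minimal
sketch, not the specification itself. Only `__dataframe__` and `num_chunks`
come from the text above; the `Toy*` class names and the methods `get_chunks`,
`buffers`, `column_names` and `get_column_by_name` are assumptions made up for
illustration, as is the one-byte-per-element mask layout.

```python
import struct
from typing import Dict, Iterator, List, Optional, Tuple


class ToyBuffer:
    """A contiguous block of memory; here simply an immutable bytes object."""

    def __init__(self, raw: bytes) -> None:
        self.raw = raw


class ToyColumn:
    """A single-dtype (float64) column stored as one or more chunks."""

    def __init__(self, chunks: List[List[Optional[float]]]) -> None:
        self._chunks = chunks

    def num_chunks(self) -> int:
        return len(self._chunks)

    def get_chunks(self) -> Iterator["ToyColumn"]:
        # A chunk of a column is modeled as a column again.
        for chunk in self._chunks:
            yield ToyColumn([chunk])

    def buffers(self) -> Tuple[ToyBuffer, Optional[ToyBuffer]]:
        # Only a single-chunk column maps onto one data buffer plus an optional mask.
        if self.num_chunks() != 1:
            raise ValueError("multi-chunk column: iterate over get_chunks() first")
        values = self._chunks[0]
        filled = [0.0 if v is None else v for v in values]
        data = ToyBuffer(struct.pack(f"<{len(values)}d", *filled))
        mask = None
        if any(v is None for v in values):
            # One byte per element: 1 = valid, 0 = missing (an arbitrary toy convention).
            mask = ToyBuffer(bytes(0 if v is None else 1 for v in values))
        return data, mask


class ToyDataFrame:
    """An ordered collection of columns keyed by unique string names."""

    def __init__(self, columns: Dict[str, ToyColumn]) -> None:
        self._columns = dict(columns)  # dicts preserve insertion order

    def __dataframe__(self) -> "ToyDataFrame":
        # The only entry point: consumers call this and use the returned object.
        return self

    def column_names(self) -> List[str]:
        return list(self._columns)

    def get_column_by_name(self, name: str) -> ToyColumn:
        return self._columns[name]


df = ToyDataFrame({"x": ToyColumn([[1.0, None], [3.0]])})
column = df.__dataframe__().get_column_by_name("x")
for chunk in column.get_chunks():
    data, mask = chunk.buffers()
    print(len(data.raw), None if mask is None else mask.raw)
```

A consumer in this sketch never touches the producer's internals: it walks from
the data frame to a column, from the column to its chunks, and only at the
single-chunk level asks for raw buffers.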


## Design decisions

1. Use a separate column abstraction in addition to a dataframe interface.

Rationales:

- This is how it works in R, Julia and Apache Arrow.
- Semantically, most existing applications and users treat a column much like a 1-D array.
- We should be able to connect a column to the array data interchange mechanism(s).

Note that this does not imply a library must have such a public user-facing
abstraction (e.g., ``pandas.Series``) - the column abstraction may be
reachable only via ``__dataframe__``.

2. Use methods and properties on an opaque object rather than returning
hierarchical dictionaries describing memory.

This is better for implementations that may rely on, for example, lazy
computation.

3. No row names. If a library uses row names, use a regular column for them.

See discussion at
[wesm/dataframe-protocol/pull/1](https://github.com/wesm/dataframe-protocol/pull/1/files#r394316241).
Optional row names are not a good idea, because people will assume they're
present (see the cuDF experience: it was forced to add them because pandas has
them). Requiring row names seems worse than leaving them out. Note that row
labels could be added in the future - right now there are no clear requirements
for more complex row labels that cannot be represented by a single column. These
do exist, for example Modin has table and tree-based row labels.
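
Design decisions 2 and 3 can be illustrated together with a small sketch. It
is an assumption-laden toy, not the protocol's interface: `LazyColumn`,
`InterchangeFrame`, `LabeledFrame`, the `to_list` method and the `row_label`
column name are all hypothetical. The point is only that an opaque object with
methods can defer work until a consumer asks for it, and that row labels can
travel as an ordinary column.

```python
from typing import Callable, Dict, List, Optional


class LazyColumn:
    """Opaque column whose values are computed only on first access."""

    def __init__(self, compute: Callable[[], list]) -> None:
        self._compute = compute
        self._cache: Optional[list] = None
        self.materialized = False  # exposed only so the laziness is observable

    def to_list(self) -> list:
        if self._cache is None:
            self._cache = self._compute()
            self.materialized = True
        return self._cache


class InterchangeFrame:
    """Opaque object returned by __dataframe__; exposes methods, not dicts."""

    def __init__(self, columns: Dict[str, LazyColumn]) -> None:
        self._columns = columns

    def column_names(self) -> List[str]:
        return list(self._columns)

    def get_column_by_name(self, name: str) -> LazyColumn:
        return self._columns[name]


class LabeledFrame:
    """Hypothetical producer-side frame that natively has row labels."""

    def __init__(self, row_labels: List[str], columns: Dict[str, List[float]]) -> None:
        if "row_label" in columns:
            raise ValueError("column name 'row_label' is reserved in this sketch")
        self._row_labels = row_labels
        self._columns = columns

    def __dataframe__(self) -> InterchangeFrame:
        # Row labels are exported as a regular column rather than a special index.
        cols = {"row_label": LazyColumn(lambda: list(self._row_labels))}
        for name, values in self._columns.items():
            # Bind `values` as a default argument to avoid late-binding closures.
            cols[name] = LazyColumn(lambda v=values: list(v))
        return InterchangeFrame(cols)


frame = LabeledFrame(["a", "b"], {"x": [1.0, 2.0], "y": [3.0, 4.0]})
exchanged = frame.__dataframe__()
print(exchanged.column_names())
print(exchanged.get_column_by_name("x").to_list())
print(exchanged.get_column_by_name("y").materialized)
```

Had `__dataframe__` returned a nested dictionary describing memory, every
column would have needed to be materialized up front; with methods, column `y`
above is never computed because nobody asks for it.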

## Interface



```{literalinclude} dataframe_protocol.py
---
language: python
---
```
20 changes: 20 additions & 0 deletions protocol/Makefile
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
146 changes: 146 additions & 0 deletions protocol/conf.py
@@ -0,0 +1,146 @@
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# -- Path setup --------------------------------------------------------------

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))

import sphinx_material

# -- Project information -----------------------------------------------------

project = 'Python dataframe interchange protocol'
copyright = '2021, Consortium for Python Data API Standards'
author = 'Consortium for Python Data API Standards'

# The full version, including alpha/beta/rc tags
release = '2021-DRAFT'


# -- General configuration ---------------------------------------------------

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    'myst_parser',
    'sphinx.ext.extlinks',
    'sphinx.ext.intersphinx',
    'sphinx.ext.todo',
    'sphinx_markdown_tables',
    'sphinx_copybutton',
]

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']

# MyST options
myst_heading_anchors = 3
myst_enable_extensions = ["colon_fence"]

# -- Options for HTML output -------------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
extensions.append("sphinx_material")
html_theme_path = sphinx_material.html_theme_path()
html_context = sphinx_material.get_html_context()
html_theme = 'sphinx_material'

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']


# -- Material theme options (see theme.conf for more information) ------------
html_show_sourcelink = False
html_sidebars = {
    "**": ["logo-text.html", "globaltoc.html", "localtoc.html", "searchbox.html"]
}

html_theme_options = {

    # Set the name of the project to appear in the navigation.
    'nav_title': 'Python dataframe interchange protocol',

    # Set your GA account ID to enable tracking
    #'google_analytics_account': 'UA-XXXXX',

    # Specify a base_url used to generate sitemap.xml. If not
    # specified, then no sitemap will be built.
    #'base_url': 'https://project.github.io/project',

    # Set the color and the accent color (see
    # https://material.io/design/color/the-color-system.html)
    'color_primary': 'indigo',
    'color_accent': 'green',

    # Set the repo location to get a badge with stats
    #'repo_url': 'https://github.com/project/project/',
    #'repo_name': 'Project',

    "html_minify": False,
    "html_prettify": True,
    "css_minify": True,
    "logo_icon": "&#xe869",
    "repo_type": "github",
    "touch_icon": "images/apple-icon-152x152.png",
    "theme_color": "#2196f3",
    "master_doc": False,

    # Visible levels of the global TOC; -1 means unlimited
    'globaltoc_depth': 2,
    # If False, expand all TOC entries
    'globaltoc_collapse': True,
    # If True, show hidden TOC entries
    'globaltoc_includehidden': True,

    "nav_links": [
        {"href": "index", "internal": True, "title": "Dataframe interchange protocol"},
        {
            "href": "https://data-apis.org",
            "internal": False,
            "title": "Consortium for Python Data API Standards",
        },
    ],
    "heroes": {
        "index": "A protocol for zero-copy data interchange between Python dataframe libraries",
        #"customization": "Configuration options to personalize your site.",
    },

    #"version_dropdown": True,
    #"version_json": "_static/versions.json",
    "table_classes": ["plain"],
}


todo_include_todos = True
#html_favicon = "images/favicon.ico"

html_use_index = True
html_domain_indices = True

extlinks = {
    "duref": (
        "http://docutils.sourceforge.net/docs/ref/rst/" "restructuredtext.html#%s",
        "",
    ),
    "durole": ("http://docutils.sourceforge.net/docs/ref/rst/" "roles.html#%s", ""),
    "dudir": ("http://docutils.sourceforge.net/docs/ref/rst/" "directives.html#%s", ""),
}
67 changes: 0 additions & 67 deletions protocol/dataframe_protocol.py
@@ -1,70 +1,3 @@
"""
Specification for objects to be accessed, for the purpose of dataframe
interchange between libraries, via the ``__dataframe__`` method on a library's
data frame object.
For guiding requirements, see https://github.com/data-apis/dataframe-api/pull/35
Concepts in this design
-----------------------
1. A `Buffer` class. A *buffer* is a contiguous block of memory - this is the
only thing that actually maps to a 1-D array in a sense that it could be
converted to NumPy, CuPy, et al.
2. A `Column` class. A *column* has a single dtype. It can consist
of multiple *chunks*. A single chunk of a column (which may be the whole
column if ``num_chunks == 1``) is modeled as again a `Column` instance, and
contains 1 data *buffer* and (optionally) one *mask* for missing data.
3. A `DataFrame` class. A *data frame* is an ordered collection of *columns*,
which are identified with names that are unique strings. All the data
frame's rows are the same length. It can consist of multiple *chunks*. A
single chunk of a data frame is modeled as again a `DataFrame` instance.
4. A *mask* concept. A *mask* of a single-chunk column is a *buffer*.
5. A *chunk* concept. A *chunk* is a sub-dividing element that can be applied
to a *data frame* or a *column*.
Note that the only way to access these objects is through a call to
``__dataframe__`` on a data frame object. This is NOT meant as public API;
only think of instances of the different classes here to describe the API of
what is returned by a call to ``__dataframe__``. They are the concepts needed
to capture the memory layout and data access of a data frame.
Design decisions
----------------
**1. Use a separate column abstraction in addition to a dataframe interface.**
Rationales:
- This is how it works in R, Julia and Apache Arrow.
- Semantically most existing applications and users treat a column similar to a 1-D array
- We should be able to connect a column to the array data interchange mechanism(s)
Note that this does not imply a library must have such a public user-facing
abstraction (ex. ``pandas.Series``) - it can only be accessed via ``__dataframe__``.
**2. Use methods and properties on an opaque object rather than returning
hierarchical dictionaries describing memory**
This is better for implementations that may rely on, for example, lazy
computation.
**3. No row names. If a library uses row names, use a regular column for them.**
See discussion at https://github.com/wesm/dataframe-protocol/pull/1/files#r394316241
Optional row names are not a good idea, because people will assume they're present
(see cuDF experience, forced to add because pandas has them).
Requiring row names seems worse than leaving them out.
Note that row labels could be added in the future - right now there's no clear
requirements for more complex row labels that cannot be represented by a single
column. These do exist, for example Modin has table and tree-based row
labels.
"""


class Buffer:
    """
    Data in the buffer is guaranteed to be contiguous in memory.
Expand Down