* added test data
* added first class
* added pandas to package requirements
* fix linter issues
* remove link checker config
* added "from_convokit" function that showcases desired functionality to parse and write corpora
* sync ipynb
* add requirements and update corpus functions
* add documentation
* fix flake8 errors
* add tests
* remove json from cfg
* add save to jsonl method
* notebook update
* play with user experience
* add metadata property to corpus
* add modules to top level for neater imports
* add to_json method to conversation
* initialize corpus with metadata in function call
* first block explains the use of the conversation to json dump
* simplify to json method
* add docstring to method
* move use of asdict to conversation class
* remove initial files and corpus test
* update notebook to reflect current workflow
* delete old corpus python code
* setup testfile for conversation class
* remove unused test data
* return markdown link checker config
* add a test to verify corpus initialization
* add conversations property and always return empty list
* add conversation to a corpus and check type
* check for error if conversation is the wrong type
* fix test
* make asdict method for utterance class
* autopep8 run to fix linting errors
* newline at eof
* json instead of jsonl, add encoding and remove duplicate code
* fix linter error
* noqa imports
* apply isort
* remove pandas and jupyter
* move json write to corpus
* apply isort
* add example notebook to docs
* typo
* add nbsphinx to dependencies
* typo again
* place fixtures in external file
* rename fixtures
* instantiate conversation object in test
* add utterance objects separately
* simplify Utterance fixtures by removing Participant, made issue #32
* increase robustness of conversation init and add tests
* autopep8 and add corpus append test
* age optional and autopep8
* add tests for asdict
* fix linting issues
* fix linting issues
* move all usage examples to example notebook
* run isort
* update notebook
* implement write_json method as abstract class
* add test for json write to conversation
* add json write test to corpus
* update example notebooks with write_json
* autopep8 and isort
* fix linter issue
* Update sktalk/__init__.py Co-authored-by: Carsten Schnober <[email protected]>
* Update sktalk/corpus/conversation.py Co-authored-by: Carsten Schnober <[email protected]>
* Update tests/corpus/test_conversation.py Co-authored-by: Carsten Schnober <[email protected]>
* update notebook
* process review comments
* add docstrings for asdict method
* use pathlib for json file path

Co-authored-by: Barbara Vreede <[email protected]>
Co-authored-by: Carsten Schnober <[email protected]>
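Several of the commits above describe one serialization design: each class provides an `asdict` method, and `write_json` is implemented once on an abstract base class, using `pathlib` for the file path. A minimal sketch of that pattern; the class names, fields, and output structure here are illustrative assumptions, not sktalk's actual code:

```python
import json
from abc import ABC, abstractmethod
from dataclasses import asdict, dataclass
from pathlib import Path


class JSONWritable(ABC):
    """Abstract base: subclasses supply asdict(); write_json is shared."""

    @abstractmethod
    def asdict(self) -> dict:
        ...

    def write_json(self, path):
        # accept str or Path, per the "use pathlib for json file path" commit
        with Path(path).open("w", encoding="utf-8") as file:
            json.dump(self.asdict(), file)


@dataclass
class Utterance:
    utterance: str
    participant: str

    def asdict(self) -> dict:
        # dataclasses.asdict converts the fields directly
        return asdict(self)


class Conversation(JSONWritable):
    def __init__(self, utterances, metadata=None):
        self.utterances = utterances
        self.metadata = metadata or {}

    def asdict(self) -> dict:
        return {"Utterances": [u.asdict() for u in self.utterances],
                "Metadata": self.metadata}


convo = Conversation([Utterance("hello", "A")], {"source": "demo.cha"})
convo.write_json("demo.json")
```

With this layout, any container that can turn itself into a dictionary gets JSON output for free, which matches the "move json write to corpus" and "implement write_json method as abstract class" commits.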
1 parent 7e7e9c8 · commit 9399445 · 17 changed files with 564 additions and 53 deletions.
@@ -0,0 +1,288 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting started with `scikit-talk`\n",
"\n",
"`scikit-talk` can be used to explore and analyse conversation files.\n",
"\n",
"It contains three main levels of objects:\n",
"\n",
"- Corpora; described with the `Corpus` class\n",
"- Conversations; described with the `Conversation` class\n",
"- Utterances; described with the `Utterance` class\n",
"\n",
"To explore the power of `scikit-talk`, the best entry point is a parser. With the parsers, we can load data into a `scikit-talk` object.\n",
"\n",
"`scikit-talk` currently has the following parsers:\n",
"\n",
"- `ChaFile.parse()`, which parses .cha files.\n",
"\n",
"Future plans include the creation of parsers for:\n",
"\n",
"- .eaf files\n",
"- .TextGrid files\n",
"- .xml files\n",
"- .csv files\n",
"- .json files\n",
"\n",
"Parsers return an object of the `Conversation` class.\n",
"\n",
"To get started with `scikit-talk`, import the module:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import sktalk"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see it in action, we will need to start with a transcription file.\n",
"\n",
"For example, you can download a file from the\n",
"[Griffith Corpus of Spoken Australian English](https://ca.talkbank.org/data-orig/GCSAusE/). This publicly available corpus contains transcription files in `.cha` format.\n",
"\n",
"We use the `ChaFile.parse` method to create the `Conversation` object:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<sktalk.corpus.conversation.Conversation at 0x10ea2bd60>"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cha01 = sktalk.ChaFile('GCSAusE_01.cha').parse()\n",
"\n",
"cha01"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A parsed `.cha` file is a `Conversation` object. It has metadata and a collection of utterances:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Utterance(utterance='0', participant='S', time=(0, 1500), begin='00:00:00.000', end='00:00:01.500', metadata=None),\n",
" Utterance(utterance=\"mm I'm glad I saw you⇗\", participant='S', time=(1500, 2775), begin='00:00:01.500', end='00:00:02.775', metadata=None),\n",
" Utterance(utterance=\"I thought I'd lost you (0.3)\", participant='S', time=(2775, 3773), begin='00:00:02.775', end='00:00:03.773', metadata=None),\n",
" Utterance(utterance=\"⌈no I've been here for a whi:le⌉,\", participant='H', time=(4052, 5515), begin='00:00:04.052', end='00:00:05.515', metadata=None),\n",
" Utterance(utterance='⌊xxx⌋ (0.3)', participant='S', time=(4052, 5817), begin='00:00:04.052', end='00:00:05.817', metadata=None),\n",
" Utterance(utterance=\"⌊hm:: (.) if ʔI couldn't boʔrrow, (1.3) the second (0.2) book of readings fo:r\", participant='S', time=(6140, 9487), begin='00:00:06.140', end='00:00:09.487', metadata=None),\n",
" Utterance(utterance='commu:nicating acro-', participant='H', time=(12888, 14050), begin='00:00:12.888', end='00:00:14.050', metadata=None),\n",
" Utterance(utterance='no: for family gender and sexuality', participant='H', time=(14050, 17014), begin='00:00:14.050', end='00:00:17.014', metadata=None),\n",
" Utterance(utterance=\"+≋ ah: that's the second on is itʔ\", participant='S', time=(17014, 18611), begin='00:00:17.014', end='00:00:18.611', metadata=None),\n",
" Utterance(utterance=\"+≋ I think it's s⌈ame family gender⌉ has a second book\", participant='H', time=(18611, 21090), begin='00:00:18.611', end='00:00:21.090', metadata=None)]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cha01.utterances[:10]"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'source': 'GCSAusE_01.cha',\n",
" 'UTF8': '',\n",
" 'PID': '11312/t-00017232-1',\n",
" 'Languages': ['eng'],\n",
" 'Participants': {'S': {'name': 'Sarah',\n",
" 'language': 'eng',\n",
" 'corpus': 'GCSAusE',\n",
" 'age': '',\n",
" 'sex': '',\n",
" 'group': '',\n",
" 'ses': '',\n",
" 'role': 'Adult',\n",
" 'education': '',\n",
" 'custom': ''},\n",
" 'H': {'name': 'Hannah',\n",
" 'language': 'eng',\n",
" 'corpus': 'GCSAusE',\n",
" 'age': '',\n",
" 'sex': '',\n",
" 'group': '',\n",
" 'ses': '',\n",
" 'role': 'Adult',\n",
" 'education': '',\n",
" 'custom': ''}},\n",
" 'Options': 'CA',\n",
" 'Media': '01, audio'}"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cha01.metadata"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can write the conversation to a JSON file:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"cha01.write_json(path = \"GCSAusE_01.json\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The `Corpus` object\n",
"\n",
"A Corpus is a way to collect conversations.\n",
"\n",
"A Corpus can be initialized from a single conversation, or a list of conversations.\n",
"It can also be initialized as an empty object, with metadata."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'name': 'Griffith Corpus of Spoken Australian English',\n",
" 'url': 'https://ca.talkbank.org/data-orig/GCSAusE/'}"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"GCSAusE = sktalk.Corpus(name = \"Griffith Corpus of Spoken Australian English\",\n",
" url = \"https://ca.talkbank.org/data-orig/GCSAusE/\")\n",
"\n",
"GCSAusE.metadata"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can add conversations to a `Corpus`:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[<sktalk.corpus.conversation.Conversation at 0x10ea2bd60>]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"GCSAusE.append(cha01)\n",
"\n",
"GCSAusE.conversations"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can turn objects of type `Conversation` and `Corpus` into dictionaries with `cha01.asdict()` and `GCSAusE.asdict()`, respectively.\n",
"\n",
"A `Corpus` object can also be stored as a `.json` file:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"GCSAusE.write_json(path = \"GCSAusE.json\")\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
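The `Corpus` behavior the notebook demonstrates (metadata supplied at initialization, a `conversations` property that starts out empty, and an `append` that rejects anything other than a `Conversation`) can be sketched roughly as follows. This is a simplified stand-in for illustration under those assumptions, not sktalk's actual implementation:

```python
class Conversation:
    """Stand-in for sktalk's Conversation: utterances plus metadata."""

    def __init__(self, utterances=None, metadata=None):
        self.utterances = utterances or []
        self.metadata = metadata or {}


class Corpus:
    """Container for Conversation objects, with free-form metadata."""

    def __init__(self, conversations=None, **metadata):
        self._conversations = list(conversations or [])
        self._metadata = metadata

    @property
    def metadata(self):
        return self._metadata

    @property
    def conversations(self):
        # always a list; empty for a freshly created corpus
        return self._conversations

    def append(self, conversation):
        # reject anything that is not a Conversation, raising TypeError
        if not isinstance(conversation, Conversation):
            raise TypeError("Corpus can only contain Conversation objects")
        self._conversations.append(conversation)


corpus = Corpus(name="Griffith Corpus of Spoken Australian English",
                url="https://ca.talkbank.org/data-orig/GCSAusE/")
corpus.append(Conversation())
```

Accepting metadata as keyword arguments is one way to reproduce the `sktalk.Corpus(name=..., url=...)` call shown in the notebook; the real constructor signature may differ.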
@@ -48,6 +48,7 @@ dev =
coverage [toml]
prospector[with_pyroma]
isort
nbsphinx
pytest
pytest-cov
sphinx