generate json object (#21)
* added test data

* added first class

* added pandas to package requirements

* fix linter issues

* remove link checker config

* added "from_convokit" function that showcases desired funcationality to parse and write corpora

* sync ipynb

* add requirements and update corpus functions

* add documentation

* fix flake8 errors

* add tests

* remove json from cfg

* add save to jsonl method

* notebook update

* play with user experience

* add metadata property to corpus

* add modules to top level for neater imports

* add to_json method to conversation

* initialize corpus with metadata in function call

* first block explains the use of the conversation to json dump

* simplify to json method

* add docstring to method

* move use of asdict to conversation class

* remove initial files and corpus test

* update notebook to reflect current workflow

* delete old corpus python code

* setup testfile for conversation class

* remove unused test data

* return markdown link checker config

* add a test to verify corpus initialization

* add conversations property and always return empty list

* add conversation to a corpus and check type

* check for error if conversation is the wrong type

* fix test

* make asdict method for utterance class

* autopep8 run to fix linting errors

* newline at eof

* json instead of jsonl add encoding and remove duplicate code

* fix linter error

* noqa imports

* apply isort

* remove pandas and jupyter

* move json write to corpus

* apply isort

* add example notebook to docs

* typo

* add nbsphinx to dependencies

* typo again

* place fixtures in external file

* rename fixtures

* instantiate conversation object in test

* add utterance objects separately

* simplify Utterance fixtures by removing Participant, made issue #32

* increase robustness of conversation init and add tests

* autopep8 and add corpus append test

* age optional and autopep8

* add tests for asdict

* fix linting issues

* fix linting issues

* move all usage examples to example notebook

* run isort

* update notebook

* implement write_json method as abstract class

* add test for json write to conversation

* add json write test to corpus

* update example notebooks with write_json

* autopep8 and isort

* fix linter issue

* Update sktalk/__init__.py

Co-authored-by: Carsten Schnober <[email protected]>

* Update sktalk/corpus/conversation.py

Co-authored-by: Carsten Schnober <[email protected]>

* Update tests/corpus/test_conversation.py

Co-authored-by: Carsten Schnober <[email protected]>

* update notebook

* process review comments

* add docstrings for asdict method

* use pathlib for json file path

---------

Co-authored-by: Barbara Vreede <[email protected]>
Co-authored-by: Carsten Schnober <[email protected]>
3 people authored Nov 2, 2023
1 parent 7e7e9c8 commit 9399445
Showing 17 changed files with 564 additions and 53 deletions.
1 change: 1 addition & 0 deletions docs/conf.py
```diff
@@ -46,6 +46,7 @@
     "sphinx.ext.viewcode",
     "autoapi.extension",
     "myst_parser",
+    "nbsphinx",
 ]

 # Add any paths that contain templates here, relative to this directory.
```
30 changes: 0 additions & 30 deletions docs/getting_started.rst

This file was deleted.

2 changes: 1 addition & 1 deletion docs/index.rst
```diff
@@ -22,7 +22,7 @@ You can read the full paper here: `Liesenfeld et al. (2021) <https://aclantholog
    :caption: Contents:

    installation
-   getting_started
+   notebooks/example


 .. Indices and tables
```
288 changes: 288 additions & 0 deletions docs/notebooks/example.ipynb
## Getting started with `scikit-talk`

`scikit-talk` can be used to explore and analyse conversation files.

It contains three main levels of objects:

- Corpora, described with the `Corpus` class
- Conversations, described with the `Conversation` class
- Utterances, described with the `Utterance` class

To explore the power of `scikit-talk`, the best entry point is a parser. With the parsers, we can load data into a `scikit-talk` object.

`scikit-talk` currently has the following parsers:

- `ChaFile.parse()`, which parses `.cha` files.

Future plans include the creation of parsers for:

- `.eaf` files
- `.TextGrid` files
- `.xml` files
- `.csv` files
- `.json` files

Parsers return an object of the `Conversation` class.

To get started with `scikit-talk`, import the module:
```python
import sktalk
```
To see it in action, we will need to start with a transcription file.

For example, you can download a file from the [Griffith Corpus of Spoken Australian English](https://ca.talkbank.org/data-orig/GCSAusE/). This publicly available corpus contains transcription files in `.cha` format.

We use the `ChaFile.parse()` method to create the `Conversation` object:
```python
cha01 = sktalk.ChaFile('GCSAusE_01.cha').parse()

cha01
```

```
<sktalk.corpus.conversation.Conversation at 0x10ea2bd60>
```
A parsed `.cha` file is a `Conversation` object. It has metadata and a collection of utterances:
```python
cha01.utterances[:10]
```

```
[Utterance(utterance='0', participant='S', time=(0, 1500), begin='00:00:00.000', end='00:00:01.500', metadata=None),
 Utterance(utterance="mm I'm glad I saw you⇗", participant='S', time=(1500, 2775), begin='00:00:01.500', end='00:00:02.775', metadata=None),
 Utterance(utterance="I thought I'd lost you (0.3)", participant='S', time=(2775, 3773), begin='00:00:02.775', end='00:00:03.773', metadata=None),
 Utterance(utterance="⌈no I've been here for a whi:le⌉,", participant='H', time=(4052, 5515), begin='00:00:04.052', end='00:00:05.515', metadata=None),
 Utterance(utterance='⌊xxx⌋ (0.3)', participant='S', time=(4052, 5817), begin='00:00:04.052', end='00:00:05.817', metadata=None),
 Utterance(utterance="⌊hm:: (.) if ʔI couldn't boʔrrow, (1.3) the second (0.2) book of readings fo:r", participant='S', time=(6140, 9487), begin='00:00:06.140', end='00:00:09.487', metadata=None),
 Utterance(utterance='commu:nicating acro-', participant='H', time=(12888, 14050), begin='00:00:12.888', end='00:00:14.050', metadata=None),
 Utterance(utterance='no: for family gender and sexuality', participant='H', time=(14050, 17014), begin='00:00:14.050', end='00:00:17.014', metadata=None),
 Utterance(utterance="+≋ ah: that's the second on is itʔ", participant='S', time=(17014, 18611), begin='00:00:17.014', end='00:00:18.611', metadata=None),
 Utterance(utterance="+≋ I think it's s⌈ame family gender⌉ has a second book", participant='H', time=(18611, 21090), begin='00:00:18.611', end='00:00:21.090', metadata=None)]
```
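Each `Utterance` is a small data object whose fields are visible in the output above, so plain Python is enough to slice a conversation. A minimal sketch, using only the `utterances` list and the `participant` and `time` fields shown above:

```python
# Utterances from participant "H" only.
hannah = [u for u in cha01.utterances if u.participant == "H"]

# Total talk time for "H" in seconds, from the (begin, end) millisecond tuples.
talk_time = sum(end - begin for begin, end in (u.time for u in hannah)) / 1000
print(len(hannah), talk_time)
```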
```python
cha01.metadata
```

```
{'source': 'GCSAusE_01.cha',
 'UTF8': '',
 'PID': '11312/t-00017232-1',
 'Languages': ['eng'],
 'Participants': {'S': {'name': 'Sarah',
   'language': 'eng',
   'corpus': 'GCSAusE',
   'age': '',
   'sex': '',
   'group': '',
   'ses': '',
   'role': 'Adult',
   'education': '',
   'custom': ''},
  'H': {'name': 'Hannah',
   'language': 'eng',
   'corpus': 'GCSAusE',
   'age': '',
   'sex': '',
   'group': '',
   'ses': '',
   'role': 'Adult',
   'education': '',
   'custom': ''}},
 'Options': 'CA',
 'Media': '01, audio'}
```
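The metadata is a plain dictionary, so nested fields can be read with ordinary indexing; the keys below are exactly those shown in the output above:

```python
# Look up the participants recorded in the .cha header.
participants = cha01.metadata["Participants"]
print(participants["S"]["name"])  # 'Sarah'
print(participants["H"]["role"])  # 'Adult'
```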
We can write the conversation to a `.json` file:
```python
cha01.write_json(path="GCSAusE_01.json")
```
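The result is ordinary JSON, so it can be inspected with the standard library. A sketch: the exact field layout follows the `asdict` structure that `write_json` serializes, which this notebook does not print:

```python
import json

# Read back the file written above and peek at its top-level keys.
with open("GCSAusE_01.json", encoding="utf-8") as f:
    data = json.load(f)
print(list(data))
```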
## The `Corpus` object

A `Corpus` is a way to collect conversations.

A `Corpus` can be initialized from a single conversation or a list of conversations. It can also be initialized as an empty object, with metadata:
```python
GCSAusE = sktalk.Corpus(name="Griffith Corpus of Spoken Australian English",
                        url="https://ca.talkbank.org/data-orig/GCSAusE/")

GCSAusE.metadata
```

```
{'name': 'Griffith Corpus of Spoken Australian English',
 'url': 'https://ca.talkbank.org/data-orig/GCSAusE/'}
```
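The notebook only demonstrates the empty initialization. Initializing from conversations directly, as described above, might look like the following; the `conversations` keyword is an assumption about the signature, not API shown in this commit:

```python
# Hypothetical: the parameter name `conversations` is assumed,
# not confirmed anywhere in this diff.
corpus = sktalk.Corpus(conversations=[cha01],
                       name="Griffith Corpus of Spoken Australian English")
```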
We can add conversations to a `Corpus`:
```python
GCSAusE.append(cha01)

GCSAusE.conversations
```

```
[<sktalk.corpus.conversation.Conversation at 0x10ea2bd60>]
```
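`append` combines naturally with the parser, so a whole directory of transcripts can be collected in one loop. A sketch, assuming the `.cha` files have been downloaded into a local `data/` directory:

```python
from pathlib import Path

# Parse every downloaded .cha transcript and add it to the corpus.
for cha_path in sorted(Path("data").glob("*.cha")):
    GCSAusE.append(sktalk.ChaFile(str(cha_path)).parse())

len(GCSAusE.conversations)
```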
Objects of type `Conversation` and `Corpus` can be turned into a dictionary with `cha01.asdict()` and `GCSAusE.asdict()`, respectively.

A `Corpus` object can also be stored as a `.json` file:
```python
GCSAusE.write_json(path="GCSAusE.json")
```
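The commit list above includes "implement write_json method as abstract class": `Conversation` and `Corpus` share one serialization path built on their `asdict` methods. A minimal sketch of that pattern, with the class name and method bodies assumed beyond what this diff shows:

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path


class Writer(ABC):
    """Illustrative base class; the real name in sktalk is not shown here."""

    @abstractmethod
    def asdict(self) -> dict:
        """Each subclass defines its own dictionary representation."""

    def write_json(self, path: str):
        # The commits mention pathlib for the path and an explicit encoding.
        with Path(path).open("w", encoding="utf-8") as f:
            json.dump(self.asdict(), f)
```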
1 change: 1 addition & 0 deletions setup.cfg
```diff
@@ -48,6 +48,7 @@ dev =
     coverage [toml]
     prospector[with_pyroma]
     isort
+    nbsphinx
     pytest
     pytest-cov
     sphinx
```
5 changes: 5 additions & 0 deletions sktalk/__init__.py
```diff
@@ -1,5 +1,10 @@
 """Documentation about scikit-talk"""
 import logging
+# Import the modules that are part of the sktalk package
+from .corpus.conversation import Conversation  # noqa: F401
+from .corpus.corpus import Corpus  # noqa: F401
+from .corpus.parsing.cha import ChaFile  # noqa: F401
+from .corpus.utterance import Utterance  # noqa: F401


 logging.getLogger(__name__).addHandler(logging.NullHandler())
```
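These top-level re-exports (from the commit "add modules to top level for neater imports") are what let the example notebook write `sktalk.ChaFile` and `sktalk.Corpus` directly:

```python
# Short imports, enabled by the re-exports above:
from sktalk import ChaFile, Conversation, Corpus, Utterance

# Without them, each class would need its full module path, e.g.:
# from sktalk.corpus.parsing.cha import ChaFile
```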