* added test data
* added first class
* added pandas to package requirements
* fix linter issues
* remove link checker config
* added "from_convokit" function that showcases desired functionality to parse and write corpora
* sync ipynb
* add requirements and update corpus functions
* add documentation
* fix flake8 errors
* add tests
* remove json from cfg
* add save to jsonl method
* notebook update
* play with user experience
* add metadata property to corpus
* add modules to top level for neater imports
* add to_json method to conversation
* initialize corpus with metadata in function call
* first block explains the use of the conversation to json dump
* simplify to json method
* add docstring to method
* move use of asdict to conversation class
* remove initial files and corpus test
* update notebook to reflect current workflow
* delete old corpus python code
* setup testfile for conversation class
* remove unused test data
* return markdown link checker config
* add a test to verify corpus initialization
* add conversations property and always return empty list
* add conversation to a corpus and check type
* check for error if conversation is the wrong type
* fix test
* make asdict method for utterance class
* autopep8 run to fix linting errors
* newline at eof
* json instead of jsonl, add encoding and remove duplicate code
* fix linter error
* noqa imports
* apply isort
* remove pandas and jupyter
* move json write to corpus
* apply isort
* add example notebook to docs
* typo
* add nbsphinx to dependencies
* typo again
* place fixtures in external file
* rename fixtures
* instantiate conversation object in test
* add utterance objects separately
* simplify Utterance fixtures by removing Participant, made issue #32
* increase robustness of conversation init and add tests
* autopep8 and add corpus append test
* age optional and autopep8
* add tests for asdict
* fix linting issues
* fix linting issues
* move all usage examples to example notebook
* run isort
* update notebook
* implement write_json method as abstract class
* add test for json write to conversation
* add json write test to corpus
* update example notebooks with write_json
* autopep8 and isort
* fix linter issue
* Update sktalk/__init__.py Co-authored-by: Carsten Schnober <[email protected]>
* Update sktalk/corpus/conversation.py Co-authored-by: Carsten Schnober <[email protected]>
* Update tests/corpus/test_conversation.py Co-authored-by: Carsten Schnober <[email protected]>
* update notebook
* process review comments
* add docstrings for asdict method
* use pathlib for json file path

Co-authored-by: Barbara Vreede <[email protected]>
Co-authored-by: Carsten Schnober <[email protected]>
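Several of the commits above describe one serialization design: each class provides an `asdict` method, and `write_json` is implemented once on an abstract base class, using `pathlib` for the file path. A minimal sketch of that pattern; the class names, fields, and output structure here are illustrative assumptions, not sktalk's actual code:

```python
import json
from abc import ABC, abstractmethod
from dataclasses import asdict, dataclass
from pathlib import Path


class JSONWritable(ABC):
    """Abstract base: subclasses supply asdict(); write_json is shared."""

    @abstractmethod
    def asdict(self) -> dict:
        ...

    def write_json(self, path):
        # accept str or Path, per the "use pathlib for json file path" commit
        with Path(path).open("w", encoding="utf-8") as file:
            json.dump(self.asdict(), file)


@dataclass
class Utterance:
    utterance: str
    participant: str

    def asdict(self) -> dict:
        # dataclasses.asdict converts the fields directly
        return asdict(self)


class Conversation(JSONWritable):
    def __init__(self, utterances, metadata=None):
        self.utterances = utterances
        self.metadata = metadata or {}

    def asdict(self) -> dict:
        return {"Utterances": [u.asdict() for u in self.utterances],
                "Metadata": self.metadata}


convo = Conversation([Utterance("hello", "A")], {"source": "demo.cha"})
convo.write_json("demo.json")
```

With this layout, any container that can turn itself into a dictionary gets JSON output for free, which matches the "move json write to corpus" and "implement write_json method as abstract class" commits.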
1 parent 7e7e9c8 · commit 9399445 · 17 changed files with 564 additions and 53 deletions.
@@ -0,0 +1,288 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting started with `scikit-talk`\n",
"\n",
"`scikit-talk` can be used to explore and analyse conversation files.\n",
"\n",
"It contains three main levels of objects:\n",
"\n",
"- Corpora; described with the `Corpus` class\n",
"- Conversations; described with the `Conversation` class\n",
"- Utterances; described with the `Utterance` class\n",
"\n",
"To explore the power of `scikit-talk`, the best entry point is a parser. With the parsers, we can load data into a `scikit-talk` object.\n",
"\n",
"`scikit-talk` currently has the following parsers:\n",
"\n",
"- `ChaFile.parse()`, which parses .cha files.\n",
"\n",
"Future plans include the creation of parsers for:\n",
"\n",
"- .eaf files\n",
"- .TextGrid files\n",
"- .xml files\n",
"- .csv files\n",
"- .json files\n",
"\n",
"Parsers return an object of the `Conversation` class.\n",
"\n",
"To get started with `scikit-talk`, import the module:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import sktalk"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see it in action, we will need to start with a transcription file.\n",
"\n",
"For example, you can download a file from the\n",
"[Griffith Corpus of Spoken Australian English](https://ca.talkbank.org/data-orig/GCSAusE/). This publicly available corpus contains transcription files in `.cha` format.\n",
"\n",
"We use the `ChaFile.parse` method to create the `Conversation` object:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<sktalk.corpus.conversation.Conversation at 0x10ea2bd60>"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cha01 = sktalk.ChaFile('GCSAusE_01.cha').parse()\n",
"\n",
"cha01"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A parsed `.cha` file is a `Conversation` object. It has metadata and a collection of utterances:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Utterance(utterance='0', participant='S', time=(0, 1500), begin='00:00:00.000', end='00:00:01.500', metadata=None),\n",
" Utterance(utterance=\"mm I'm glad I saw you⇗\", participant='S', time=(1500, 2775), begin='00:00:01.500', end='00:00:02.775', metadata=None),\n",
" Utterance(utterance=\"I thought I'd lost you (0.3)\", participant='S', time=(2775, 3773), begin='00:00:02.775', end='00:00:03.773', metadata=None),\n",
" Utterance(utterance=\"⌈no I've been here for a whi:le⌉,\", participant='H', time=(4052, 5515), begin='00:00:04.052', end='00:00:05.515', metadata=None),\n",
" Utterance(utterance='⌊xxx⌋ (0.3)', participant='S', time=(4052, 5817), begin='00:00:04.052', end='00:00:05.817', metadata=None),\n",
" Utterance(utterance=\"⌊hm:: (.) if ʔI couldn't boʔrrow, (1.3) the second (0.2) book of readings fo:r\", participant='S', time=(6140, 9487), begin='00:00:06.140', end='00:00:09.487', metadata=None),\n",
" Utterance(utterance='commu:nicating acro-', participant='H', time=(12888, 14050), begin='00:00:12.888', end='00:00:14.050', metadata=None),\n",
" Utterance(utterance='no: for family gender and sexuality', participant='H', time=(14050, 17014), begin='00:00:14.050', end='00:00:17.014', metadata=None),\n",
" Utterance(utterance=\"+≋ ah: that's the second on is itʔ\", participant='S', time=(17014, 18611), begin='00:00:17.014', end='00:00:18.611', metadata=None),\n",
" Utterance(utterance=\"+≋ I think it's s⌈ame family gender⌉ has a second book\", participant='H', time=(18611, 21090), begin='00:00:18.611', end='00:00:21.090', metadata=None)]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cha01.utterances[:10]"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'source': 'GCSAusE_01.cha',\n",
" 'UTF8': '',\n",
" 'PID': '11312/t-00017232-1',\n",
" 'Languages': ['eng'],\n",
" 'Participants': {'S': {'name': 'Sarah',\n",
" 'language': 'eng',\n",
" 'corpus': 'GCSAusE',\n",
" 'age': '',\n",
" 'sex': '',\n",
" 'group': '',\n",
" 'ses': '',\n",
" 'role': 'Adult',\n",
" 'education': '',\n",
" 'custom': ''},\n",
" 'H': {'name': 'Hannah',\n",
" 'language': 'eng',\n",
" 'corpus': 'GCSAusE',\n",
" 'age': '',\n",
" 'sex': '',\n",
" 'group': '',\n",
" 'ses': '',\n",
" 'role': 'Adult',\n",
" 'education': '',\n",
" 'custom': ''}},\n",
" 'Options': 'CA',\n",
" 'Media': '01, audio'}"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cha01.metadata"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can write the conversation to a JSON file:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"cha01.write_json(path = \"GCSAusE_01.json\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The `Corpus` object\n",
"\n",
"A Corpus is a way to collect conversations.\n",
"\n",
"A Corpus can be initialized from a single conversation, or a list of conversations.\n",
"It can also be initialized as an empty object, with metadata."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'name': 'Griffith Corpus of Spoken Australian English',\n",
" 'url': 'https://ca.talkbank.org/data-orig/GCSAusE/'}"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"GCSAusE = sktalk.Corpus(name = \"Griffith Corpus of Spoken Australian English\",\n",
" url = \"https://ca.talkbank.org/data-orig/GCSAusE/\")\n",
"\n",
"GCSAusE.metadata"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can add conversations to a `Corpus`:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[<sktalk.corpus.conversation.Conversation at 0x10ea2bd60>]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"GCSAusE.append(cha01)\n",
"\n",
"GCSAusE.conversations"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can turn objects of type `Conversation` and `Corpus` into dictionaries with `cha01.asdict()` and `GCSAusE.asdict()`, respectively.\n",
"\n",
"A `Corpus` object can also be stored as a `.json` file:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"GCSAusE.write_json(path = \"GCSAusE.json\")\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
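The `Corpus` behavior the notebook demonstrates (metadata supplied at initialization, a `conversations` property that starts out empty, and an `append` that rejects anything other than a `Conversation`) can be sketched roughly as follows. This is a simplified stand-in for illustration under those assumptions, not sktalk's actual implementation:

```python
class Conversation:
    """Stand-in for sktalk's Conversation: utterances plus metadata."""

    def __init__(self, utterances=None, metadata=None):
        self.utterances = utterances or []
        self.metadata = metadata or {}


class Corpus:
    """Container for Conversation objects, with free-form metadata."""

    def __init__(self, conversations=None, **metadata):
        self._conversations = list(conversations or [])
        self._metadata = metadata

    @property
    def metadata(self):
        return self._metadata

    @property
    def conversations(self):
        # always a list; empty for a freshly created corpus
        return self._conversations

    def append(self, conversation):
        # reject anything that is not a Conversation, raising TypeError
        if not isinstance(conversation, Conversation):
            raise TypeError("Corpus can only contain Conversation objects")
        self._conversations.append(conversation)


corpus = Corpus(name="Griffith Corpus of Spoken Australian English",
                url="https://ca.talkbank.org/data-orig/GCSAusE/")
corpus.append(Conversation())
```

Accepting metadata as keyword arguments is one way to reproduce the `sktalk.Corpus(name=..., url=...)` call shown in the notebook; the real constructor signature may differ.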
@@ -48,6 +48,7 @@ dev =
coverage [toml]
prospector[with_pyroma]
isort
nbsphinx
pytest
pytest-cov
sphinx