
Generate turn type dynamics #1

Merged · 55 commits · Dec 1, 2023
Commits
8dbd860
code from notebook
bvreede Mar 31, 2023
477b7e9
merge main
bvreede Apr 7, 2023
214a7b3
add turndynamics code from notebook
bvreede Apr 7, 2023
2d1a6f0
instructions for disabling the bloody githook
bvreede Apr 7, 2023
d2b11b0
move code from test to notebook
bvreede Apr 7, 2023
6289d45
Merge branch 'main' into first-functions
bvreede Nov 2, 2023
da8ef4d
remove notebook from repository
bvreede Nov 2, 2023
1f77d4d
start adding utterance functions
bvreede Nov 3, 2023
f76aaba
add calculated fields to dataclass
bvreede Nov 3, 2023
3736596
add python 3.6 to show that it breaks
bvreede Nov 7, 2023
99a18d2
add python 3.8 to show that it breaks
bvreede Nov 7, 2023
808e3e2
add python 3.6 to show that it breaks
bvreede Nov 7, 2023
be49481
add python 3.8 to show that it breaks
bvreede Nov 7, 2023
5d8862b
remove earlier python versions again
bvreede Nov 7, 2023
8d55ced
add fields to initial data class
bvreede Nov 8, 2023
ea95de3
implement until method calculating time differences between utterances
bvreede Nov 8, 2023
bf1bb60
implement subconversation and until next method at conversation object
bvreede Nov 8, 2023
4edab6b
autopep8
bvreede Nov 8, 2023
31a98d3
elaborate subconversation
bvreede Nov 9, 2023
642c941
move time processing to utterance
bvreede Nov 10, 2023
308ab94
object oriented and small fixes
bvreede Nov 10, 2023
d5c62f9
refactor post init
bvreede Nov 10, 2023
12a3d3a
add subconversation functionality
bvreede Nov 10, 2023
1fb7323
subconversation can select based on index or time
bvreede Nov 14, 2023
51f1a9e
make linter happy
bvreede Nov 14, 2023
f7db0f6
also pack arguments for second subconversation test
bvreede Nov 14, 2023
47dc536
fix linting issues
bvreede Nov 14, 2023
aaff48a
add dyadic property
bvreede Nov 14, 2023
7ee747f
apply conversation wide calculations dyadic and time to nxt
bvreede Nov 16, 2023
c0593cb
subconversation is internal
bvreede Nov 21, 2023
319ca96
count number of participants
bvreede Nov 21, 2023
b8ca53c
calculate FTO
bvreede Nov 21, 2023
6a34489
remove old code
bvreede Nov 21, 2023
c5c3430
address linter comments
bvreede Nov 21, 2023
7802063
address linter issues and update fto calculation
bvreede Nov 21, 2023
3bbeb00
fix noqa
bvreede Nov 21, 2023
cc91bbc
fix noqa
bvreede Nov 21, 2023
61a1950
calculations update in metadata corrected
bvreede Nov 21, 2023
f79b5ea
allow warning suppression on empty conversations inside subconversation
bvreede Nov 24, 2023
d4c9880
refer to hidden _utterances instead of property
bvreede Nov 24, 2023
20c65db
allow participant counting to exclude None
bvreede Nov 24, 2023
00ad82a
add test for FTO calculation
bvreede Nov 24, 2023
90c4ac6
ensure participant count does not include future utterances
bvreede Nov 24, 2023
69912f6
split subconversation into two functions
bvreede Nov 28, 2023
d62a868
fix linter issue
bvreede Nov 28, 2023
9229e8a
update example notebook
bvreede Nov 28, 2023
574a729
Update sktalk/corpus/conversation.py
bvreede Nov 29, 2023
473e753
Update sktalk/corpus/conversation.py
bvreede Nov 29, 2023
e076b6c
add comments re: error
bvreede Nov 29, 2023
3af95fc
Update sktalk/corpus/conversation.py
bvreede Nov 29, 2023
4c7ac0c
rewrite FTO calculation
bvreede Nov 30, 2023
ada29ce
rename overlap function to make it available
bvreede Nov 30, 2023
13829bb
update FTO calculation to account for partial overlap
bvreede Dec 1, 2023
2f64c94
refactor overlap functions
bvreede Dec 1, 2023
b1e096f
Update sktalk/corpus/parsing/cha.py
bvreede Dec 1, 2023
286 changes: 286 additions & 0 deletions scikit-talk/turndynamics.py
@@ -0,0 +1,286 @@
import os
import glob
import math
import re
import datetime
from collections import Counter

import grapheme
import numpy as np
import pandas as pd
from tqdm.autonotebook import tqdm
from joblib import Parallel, delayed

def readcorpus(filename, langshort=None, langfull=None):
    """Return a formatted language corpus with turn and transition measures.

    :param filename: filename of the corpus
    :type filename: string
    :param langshort: short version of the language name, defaults to None
    :type langshort: string, optional
    :param langfull: full version of the language name, defaults to None
    :type langfull: string, optional

    :return: formatted dataframe of the language corpus
    """
    # convert time strings in hh:mm:ss.sss format to milliseconds
    def _converttime(text):
        if pd.isna(text):
            return pd.NA
        h, m, s = text.split(':')
        return int(datetime.timedelta(hours=int(h),
                                      minutes=int(m),
                                      seconds=float(s)).total_seconds() * 1000)
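    # a worked example of the conversion above (illustrative value):
    # _converttime('00:01:02.500') -> 62500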

    # 1-based index of a source in the list of unique sources
    def _getsourceindex(source):
        return n_sources.index(source) + 1

    # classify conduct as talk, laugh, breath, or other
    def _getnature(utterance):
        if pd.isna(utterance):
            return pd.NA
        if utterance == '[laugh]':
            return 'laugh'
        if utterance == '[breath]':
            return 'breath'
        if utterance in ['[cough]', '[sneeze]', '[nod]', '[blow]', '[sigh]',
                         '[yawn]', '[sniff]', '[clearsthroat]',
                         '[lipsmack]', '[inhales]', '[groan]']:
            return utterance
        return 'talk'
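    # illustrative examples of the classification above:
    # _getnature('[laugh]') -> 'laugh'; _getnature('hello there') -> 'talk'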

    # count the number of non-space characters in an utterance
    def _getnchar(utterance):
        if pd.isna(utterance):
            return pd.NA
        return sum(Counter(utterance.replace(" ", "")).values())
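    # illustrative example: _getnchar('ja ja') -> 4 (spaces are not counted)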

    # create a 'window' for each utterance
    # The window looks back 10s from the begin of the current utterance.
    # Only turns that begin within this lookback are included in the
    # window: if the prior turn began more than 10s before the current
    # utterance, it is not part of the window.
    def _createwindow(begin, participant):
        lookback = 10000
        lookfwd = 0
        mask = (df_transitions['begin'] >= (begin - lookback)) & \
               (df_transitions['begin'] <= (begin + lookfwd))
        # work on a copy to avoid pandas SettingWithCopyWarning
        window = df_transitions.loc[mask].copy()
        # identify who produced each utterance in the window
        window['turnby'] = np.where(window['participant'] == participant,
                                    'self', 'other')
        # calculate the duration of all turns in the window
        stretch = window['end'].max() - window['begin'].min()
        # calculate the sum of all turn durations
        talk_all = window['duration'].sum()
        # calculate the amount of talk produced by the participant relative
        # to the total amount of talk in the window
        # (float division by zero yields inf rather than raising, so guard
        # explicitly instead of catching ZeroDivisionError)
        talk_rel = (window.loc[window['turnby'] == 'self']['duration'].sum() / talk_all
                    if talk_all else pd.NA)
        # calculate the loading of the channel
        # (1 = no empty space; > 1 = overlap; < 1 = silences)
        load = talk_all / stretch
        # calculate the total number of turns in this time window
        turns_all = len(window.index)
        # calculate the number of turns by this participant relative to the
        # total number of turns in the window
        turns_rel = (len(window[window['turnby'] == 'self'].index) / turns_all
                     if turns_all else pd.NA)

        participants = window['participant'].nunique()
        # create a list of all computed measures
        measures = [talk_all, talk_rel, load, turns_all, turns_rel, participants]
        return measures

This is a first (untested) python port of my R 'windowed transitions' code. As it needs to run on every single row in the db (creating a 10s window and computing these summary measures for it) it's likely to be quite a drag on performance — optimization and modularization needed.
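As an illustration only (not part of this PR), one possible direction for that optimization: with utterances sorted by 'begin', the 10s window of each row is a contiguous slice, so per-window sums can be computed with searchsorted plus prefix sums instead of a row-by-row filter. A rough, untested sketch with hypothetical names:

import numpy as np

def windowed_sums(begin, duration, lookback=10_000):
    # begin and duration are 1-d float arrays sorted by begin, times in ms.
    # For row i the window is rows lo[i]..i, where lo[i] is the first row
    # with begin >= begin[i] - lookback (lookfwd = 0, as above).
    lo = np.searchsorted(begin, begin - lookback, side='left')
    hi = np.arange(len(begin)) + 1
    # prefix sums make every per-window sum an O(1) lookup
    cum = np.concatenate(([0.0], np.nancumsum(duration)))
    talk_all = cum[hi] - cum[lo]
    turns_all = hi - lo
    return talk_all, turns_all

The per-participant measures (talk_rel, turns_rel) would need the same trick per speaker, e.g. via a groupby, so this is a sketch of the idea rather than a drop-in replacement.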

[also just testing the review mechanics — don't mind me if I'm doing this wrong]


    df = pd.read_csv(filename)
    filename = re.sub(r'\.csv$', '', filename)
    filename = re.sub(r'\./ElPaCo Dataset/', '', filename)
    # filename = re.sub(r'ElPaCo dataset/', '', filename)
    df['language'] = re.sub('[0-9]', '', filename)
if langshort is not None:
df['langshort'] = langshort
else:
df['langshort'] = df['language']
if langfull is not None:
df['langfull'] = langfull
else:
df['langfull'] = df['language']
df['corpus'] = filename
df['begin'] = df['begin'].apply(_converttime)
df['end'] = df['end'].apply(_converttime)

# calculate duration of the turn
df['duration'] = df['end'] - df['begin']

    # flag improbably long (more than 40 seconds) and negative durations
    weird_duration = (df['duration'] > 40000) | (df['duration'] < 0)
    n_weird_durations = weird_duration.sum()

    # set weird durations to NA in the begin, end, and duration columns
    # (compute the mask once: setting 'duration' to NA first would otherwise
    # hide those rows from the 'end' and 'begin' assignments)
    df.loc[weird_duration, ['begin', 'end', 'duration']] = pd.NA

    # create UID

    # list of unique sources in the corpus
    n_sources = df['source'].unique().tolist()
    # width of the source counter (i.e. 20 sources = 2 chars), for padding
    x = len(str(len(n_sources)))
    # width of the turn counter within a source
    # (i.e. 100 conversations = 3 chars), for padding
    y = len(str(len(df.groupby(['source', 'utterance']).size())))

    # UID format: language-source number-turn number (within a source)-begin
    uidbegin = np.where(pd.isna(df['begin']), 'NA', df['begin'].astype(str))
    df['uid'] = (df['language'] + '-'
                 + df['source'].apply(_getsourceindex).astype(str).str.zfill(x)
                 + '-'
                 + (df.groupby(['source']).cumcount() + 1).astype(str).str.zfill(y)
                 + '-' + uidbegin)
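    # an illustrative UID, assuming a corpus named 'dutch' with 20 sources:
    # 'dutch-01-042-61500' (source 1, turn 42, beginning at 61500 ms)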

# deal with "unknown utterance" content
na_strings = ['[unk_utterance', '[unk_noise]', '[distortion]',
'[background]', '[background] M', '[static]', 'untranscribed',
'[noise]', '[inintel]', '[distorted]', 'tlyam kanəw']

    # count unknown utterances before setting them to NA
    # (counting afterwards would always yield zero)
    n_unknown = df['utterance'].isin(na_strings).sum()
    df.loc[df['utterance'].isin(na_strings), ['utterance']] = pd.NA

# get nature of utterance
df['nature'] = df['utterance'].apply(_getnature)

    # create a stripped version of the utterance
    df['utterance_stripped'] = df['utterance'].str.strip()
    # remove bracketed annotations such as [laugh]
    df['utterance_stripped'] = df['utterance_stripped'].str.replace(
        r'\[[^[]*\]', '', regex=True)
    # remove parentheses (in a Python raw string the R-style double
    # escapes are not needed)
    df['utterance_stripped'] = df['utterance_stripped'].str.replace(
        r'[()]+', '', regex=True)
    # set blank utterances to NA
    df.loc[df['utterance_stripped'] == '', 'utterance_stripped'] = pd.NA

    # measure the number of words by counting spaces
    df['nwords'] = df['utterance_stripped'].str.count(' ') + 1

    # measure the number of characters
    df['nchar'] = df['utterance_stripped'].apply(_getnchar)

    # add turn and frequency rank measures

    # create a new dataframe without NA utterances (for easier calculations)
    df_ranking = df.dropna(subset=['utterance_stripped']).copy()
    # count how often the utterance occurs in the corpus
    df_ranking['n'] = df_ranking.groupby('utterance')['utterance'].transform('count').astype(float)
    # rank the frequency of the utterance
    df_ranking['rank'] = df_ranking['n'].rank(method='dense', ascending=False)
    # calculate the total number of utterances
    df_ranking['total'] = df_ranking['n'].sum()
    # calculate the frequency of the utterance relative to the total
    # number of utterances
    df_ranking['frequency'] = df_ranking['n'] / df_ranking['total']
    # merge the new dataframe with the original dataframe; a left merge
    # keeps the rows whose utterance is NA
    df = pd.merge(df, df_ranking, how='left')

    # categorize overlap, looking at overlap with turns up to four positions back
    # overlap can be either full or partial
    # set to NA if no overlap is found
    full_overlap = ((df['begin'] > df['begin'].shift(1)) & (df['end'] < df['end'].shift(1)) |
                    (df['begin'] > df['begin'].shift(2)) & (df['end'] < df['end'].shift(2)) |
                    (df['begin'] > df['begin'].shift(3)) & (df['end'] < df['end'].shift(3)) |
                    (df['begin'] > df['begin'].shift(4)) & (df['end'] < df['end'].shift(4)))
    partial_overlap = ((df['begin'] > df['begin'].shift()) &
                       (df['begin'] <= df['end'].shift()))
    df['overlap'] = np.where(full_overlap, 'full',
                             np.where(partial_overlap, 'partial', pd.NA))
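    # illustrative examples: a turn [200-800] inside a prior turn [100-900]
    # is 'full'; a turn [400-700] against a prior turn [100-500] is 'partial'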

    # identify who produced the prior utterance: other, self,
    # or self during other (if the previous utterance by the same participant
    # was fully overlapped by an utterance of a different participant)
    # the priorby of the first utterance in the corpus is set to NA
    df['priorby'] = np.where(df['participant'].index == 0, pd.NA,
                    np.where(df['participant'] != df['participant'].shift(),
                             'other',
                    np.where((df['overlap'].shift() == 'full') &
                             (df['participant'].shift() == df['participant']),
                             'self_during_other', 'self')))

    # calculate FTO (Floor Transfer Offset)
    # This refers to the duration of the overlap between the current utterance
    # and the most relevant prior turn by other, which is not necessarily the
    # prior row in the df. By default we only get 0, 1 and 5 right. Cases 2
    # and 3 are covered by a rule that looks at turns coming in early for which
    # the prior turn is by self but T-2 is by other. Some cases of 4 (but not
    # all) are covered by looking for turns that do not come in early but have
    # a prior turn in overlap, and looking for the turn at T-2 by a different
    # participant.

# A turn doesn't receive an FTO if it follows a row in the db that doesn't
# have timing information, or if it is such a row.

# A [------------------] [0--]
# B [1-] [2--] [3--] [4--] [5--]

    df['FTO'] = np.where((df['priorby'] == 'other') &
                         (df['begin'] - df['begin'].shift() < 200) &
                         (df['priorby'].shift() != 'other'),
                         df['begin'] - df['end'].shift(2),
                np.where((df['priorby'] == 'other') &
                         (df['begin'] - df['begin'].shift() < 200) &
                         (df['priorby'].shift() != 'self') &
                         # parentheses are required here: & binds more
                         # tightly than ==
                         (df['priorby'].shift(2) == 'other'),
                         df['begin'] - df['end'].shift(3),
                np.where((df['priorby'] == 'self_during_other') &
                         (df['participant'].shift(2) != df['participant']),
                         df['begin'] - df['end'].shift(2),
                np.where((df['priorby'] == 'self_during_other') &
                         (df['priorby'].shift() == 'self_during_other'),
                         df['begin'] - df['end'].shift(3),
                np.where(df['priorby'] == 'other',
                         df['begin'] - df['end'].shift(),
                         pd.NA)))))
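    # an illustrative FTO for the fallback 'other' branch above: if the prior
    # turn by other ends at 5000 ms, a turn beginning at 4800 ms gets
    # FTO = -200 (overlap) and one beginning at 5300 ms gets FTO = 300 (gap)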

    # identify whether a turn is overlapped by what succeeds it
    # if not, set to NA
    df['overlapped'] = np.where((df['begin'] < df['begin'].shift(-1)) &
                                (df['end'] > df['begin'].shift(-1)),
                                'overlapped', pd.NA)

    # set FTO to NA if it is higher than 10s or lower than -10s, on the
    # grounds that (a) psycholinguistically it is implausible that these
    # actually relate to the end of the 'prior' turn, and (b) conversation
    # analytically it is necessary to treat such cases on their own terms
    # rather than take an FTO at face value
    df['FTO'] = np.where(df['FTO'] > 9999, pd.NA,
                np.where(df['FTO'] < -9999, pd.NA, df['FTO']))

# add transitions metadata

# create new dataframe with only the relevant columns
df_transitions = df.copy()
df_transitions = df_transitions.drop(columns=['langshort', 'langfull',
'corpus', 'nature',
'utterance_stripped',
'nwords', 'nchar', 'n',
'rank', 'total',
'frequency', 'overlap'])

    # put all the calculated transition measures into one column
    df['transitions'] = df.apply(lambda x: _createwindow(x['begin'],
                                                         x['participant']),
                                 axis=1)

    # split the list into six columns, one column per measure
    df_split = pd.DataFrame(df['transitions'].tolist(),
                            columns=['talk_all', 'talk_rel', 'load',
                                     'turns_all', 'turns_rel', 'participants'])

# add transition measures to original df
df = pd.concat([df, df_split], axis=1)
# drop column containing list of transition measures
df = df.drop(columns='transitions')

return df
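For reference, a minimal usage sketch of the function above (the path and language labels are illustrative; the CSV is assumed to have the source, participant, utterance, begin, and end columns used above):

df = readcorpus('./ElPaCo Dataset/dutch1.csv',
                langshort='dut', langfull='Dutch')
print(df[['uid', 'participant', 'begin', 'end', 'FTO']].head())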
16 changes: 16 additions & 0 deletions tests/test_turndynamics.py
@@ -0,0 +1,16 @@
import pandas as pd

corpora_list = []

metadatafile = pd.read_csv("_overview.csv", encoding="ISO-8859-1", sep=',')
# loop over the csv files (one per language corpus) listed in the metadata
for index, row in metadatafile.iterrows():
    if row["ElPaCo_included"] == "yes":
        corpus_name = row["File_name"]
        corpus_path = './Elpaco dataset/' + corpus_name + '.csv'
        langshort = row["langshort"]
        langfull = row["Langfull"]
        corpora_list.append([corpus_path, langshort, langfull])

corpora_for_d_latest = pd.DataFrame(corpora_list,
                                    columns=['corpus_path', 'langshort', 'langfull'])

print(corpora_for_d_latest)
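Since turndynamics.py already imports joblib, the list built above is presumably meant to feed readcorpus; a minimal sketch of that step (hypothetical, assuming the paths in corpora_list exist):

from joblib import Parallel, delayed
from turndynamics import readcorpus

# load every corpus in parallel; n_jobs=-1 uses all available cores
corpora = Parallel(n_jobs=-1)(
    delayed(readcorpus)(path, langshort, langfull)
    for path, langshort, langfull in corpora_list
)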