
Generate turn type dynamics #1

Merged: bvreede merged 55 commits into main from first-functions on Dec 1, 2023

Conversation

@bvreede (Contributor) commented Mar 31, 2023

Adding utterance properties:

  • Number of words in an utterance (nwords)
  • Number of characters in an utterance (nchar)
  • List of words in the utterance
  • FTO (Floor Transfer Offset), composed of:
    • a check that the conversation is dyadic within the window (boolean; must be True for an FTO to be returned)
    • the FTO calculation itself
  • Overlap properties moved to Overlap calculations #46

Closes #5
Closes #8
Closes #42
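
To make the list above concrete, here is a purely illustrative Python sketch of an utterance object carrying these properties; the class and attribute names are hypothetical and not the actual sktalk API (FTO is left as a placeholder since it depends on the surrounding turns).

    from dataclasses import dataclass, field
    from typing import Optional


    @dataclass
    class UtteranceSketch:
        """Hypothetical utterance record; not the sktalk Utterance class."""
        utterance: str
        begin: int  # onset in ms
        end: int    # offset in ms
        fto: Optional[int] = None  # Floor Transfer Offset, only set in dyadic windows
        words: list = field(init=False)
        nwords: int = field(init=False)
        nchar: int = field(init=False)

        def __post_init__(self):
            # derive word list, word count, and character count from the text
            self.words = self.utterance.split()
            self.nwords = len(self.words)
            self.nchar = len(self.utterance)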

@bvreede bvreede changed the title code from notebook Generate turn type dynamics Mar 31, 2023
@mdingemanse

Issue #8 is relevant here too — FTO (Floor Transfer Offset) is computed as part of the turn dynamics.

Since there's already some windowed computation done (which needs checking, by the way, as that code is not our production code), the number of participants in the prior 10s window should be known, which also means we know whether it's dyadic or triadic.

Comment on lines 63 to 101
# Create a 'window' for each utterance.
# The window looks back 10s from the begin of the current utterance (lookback).
# Only turns that begin within this lookback are included in the window:
# if the prior turn began more than 10s before the current utterance,
# it is not included in the window.
def _createwindow(begin, participant):
    lookback = 10000
    lookfwd = 0
    filter = (df_transitions['begin'] >= (begin - lookback)) & (df_transitions['begin'] <= (begin + lookfwd))
    window = df_transitions.loc[filter]
    # identify who produced each utterance in the window
    window['turnby'] = np.where(window['participant'] == participant, 'self', 'other')
    # calculate the stretch of time covered by the window
    stretch = window['end'].max() - window['begin'].min()
    # calculate sum of all turn durations
    talk_all = window['duration'].sum()
    # calculate amount of talk produced by the participant relative
    # to the total amount of talk in the window
    try:
        talk_rel = window.loc[window['turnby'] == 'self']['duration'].sum() / talk_all
    except ZeroDivisionError:
        talk_rel = pd.NA
    # calculate loading of the channel
    # (1 = no empty space, >1 = overlap, <1 = silences)
    load = talk_all / stretch
    # calculate total number of turns in this time window
    turns_all = len(window.index)
    # calculate number of turns by this participant relative to turns by others
    try:
        turns_rel = len(window[window['turnby'] == 'self'].index) / turns_all
    except ZeroDivisionError:
        turns_rel = pd.NA

    participants = window['participant'].nunique()
    # collect all computed measures
    measures = [talk_all, talk_rel, load, turns_all, turns_rel, participants]
    return measures


This is a first (untested) Python port of my R 'windowed transitions' code. As it needs to run on every single row in the database (creating a 10s window and computing these summary measures for it), it's likely to be quite a drag on performance; optimization and modularization are needed.
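
For illustration only, a hedged sketch of how such a per-row window computation might be driven and collected; it assumes the df_transitions frame and the _createwindow function from the snippet above and is not part of the code under review.

    import pandas as pd

    MEASURE_NAMES = ["talk_all", "talk_rel", "load", "turns_all", "turns_rel", "participants"]

    def add_window_measures(df_transitions):
        # Compute the window measures for every utterance (row); this is the
        # per-row pattern flagged above as a likely performance bottleneck.
        rows = [
            _createwindow(row.begin, row.participant)
            for row in df_transitions.itertuples()
        ]
        measures = pd.DataFrame(rows, columns=MEASURE_NAMES, index=df_transitions.index)
        return pd.concat([df_transitions, measures], axis=1)

Sorting by begin and slicing with searchsorted, or a merge_asof-style join, would likely be cheaper than filtering the full frame once per row.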


(also just testing the review mechanics — don't mind me if I'm doing this wrong)


sonarqubecloud bot commented Apr 7, 2023

SonarCloud Quality Gate failed.

  • Bugs: 0 (rating A)
  • Vulnerabilities: 0 (rating A)
  • Security Hotspots: 0 (rating A)
  • Code Smells: 37 (rating A)
  • Coverage: 0.0%
  • Duplication: 0.0%

@mdingemanse

Here's some code that is based on trawling the diverse corpora.

First a bit I use to set utterance to NA:

    na_strings <- c("[unk_utterance]","[unk_noise]","[distortion]","[background]",
                    "[background] M","[static]","untranscribed",
                    "[noise]","[inintel]","[distorted]","tlyam kanəw")

Then some code I use to classify types of conduct other than talk (this should really not be hardcoded, ideally)

    # add talk / other conduct classification
    d <- d %>%
      mutate(nature = case_when(
        utterance == "[laugh]" ~ "laugh",
        utterance == "[breath]" ~ "breath",
        utterance %in% c("[cough]","[sneeze]","[nod]","[blow]","[sigh]",
                         "[yawn]","[sniff]","[clearsthroat]","[lipsmack]",
                         "[inhales]","[groan]") ~ utterance,
        is.na(utterance) ~ as.character(NA),
        TRUE ~ "talk"
      ))
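
For the Python side of this PR, a rough pandas counterpart of the classification above might look like this (a sketch only; the utterance and nature column names simply mirror the R code, not necessarily the sktalk schema):

    import pandas as pd

    NON_TALK = ["[cough]", "[sneeze]", "[nod]", "[blow]", "[sigh]",
                "[yawn]", "[sniff]", "[clearsthroat]", "[lipsmack]",
                "[inhales]", "[groan]"]

    def classify_nature(d: pd.DataFrame) -> pd.DataFrame:
        """Label each utterance as laugh, breath, a specific non-talk token, talk, or NA."""
        def nature(u):
            if pd.isna(u):
                return pd.NA
            if u == "[laugh]":
                return "laugh"
            if u == "[breath]":
                return "breath"
            if u in NON_TALK:
                return u
            return "talk"

        d["nature"] = d["utterance"].map(nature)
        return d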

Then some code I use to generate nwords and nchar:

    # create stripped and squished version of utterance and add length measures (note that
    # "\\W+" doesn't play well with unicode, so we count spaces)
    d <- d %>%
      mutate(utterance_stripped = gsub("\\[+[^()]*\\]+?", "", trimws(utterance))) %>%
      mutate(utterance_stripped = gsub("[\\(\\)]+", "", utterance_stripped)) %>%
      mutate(utterance_stripped = ifelse(utterance_stripped == "",NA,str_squish(utterance_stripped))) %>%
      mutate(
           nwords = str_count(utterance_stripped, " ") + 1, # crudely count spaces
           nchar = str_count(utterance_stripped)            # NOTE that nchar = bad for ideographs
           ) 
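
And a hedged Python counterpart of the stripping and length measures, again only as a sketch (the regular expressions approximate the R ones and have not been tested against the corpora):

    import pandas as pd

    def add_length_measures(d: pd.DataFrame) -> pd.DataFrame:
        """Strip bracketed annotations and parentheses, squish whitespace, then count words and characters."""
        stripped = (
            d["utterance"]
            .str.replace(r"\[[^\[\]]*\]", "", regex=True)  # drop [annotations]
            .str.replace(r"[()]+", "", regex=True)         # drop parentheses
            .str.replace(r"\s+", " ", regex=True)          # squish whitespace
            .str.strip()
            .replace("", pd.NA)
        )
        d["utterance_stripped"] = stripped
        d["nwords"] = stripped.str.count(" ") + 1  # crudely count spaces, as in the R code
        d["nchar"] = stripped.str.len()            # same caveat for ideographic scripts
        return d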

@mdingemanse

mdingemanse commented Nov 3, 2023

Also, this reminds me that we have two more utterance-level columns that are really quite useful, but that relate to the corpus (and therefore often the language) as a whole:

  • rank : the rank of this utterance in the frequency distribution of utterances in this language
  • n : the number of tokens of this utterance type in this language

E.g. below you can see that the n of î (third row) is 53: there are 53 utterances like this in the language. Its rank is 1: it is the most frequently attested turn format in the language.

Debatable whether this is truly an utterance feature, but we do use it for a lot of things, e.g. to pull up the most frequent utterance formats in a language, identify 'streaks' of similarly frequent utterances, etc.

[image: table of utterance types with their n and rank values]
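
For what it's worth, a rough pandas sketch of how n and rank per utterance type could be derived (illustrative only; the language and utterance column names are assumptions):

    import pandas as pd

    def add_rank_and_n(d: pd.DataFrame) -> pd.DataFrame:
        """Per language: n = token count of this utterance type, rank = 1 for the most frequent type."""
        d["n"] = d.groupby(["language", "utterance"])["utterance"].transform("size")
        d["rank"] = d.groupby("language")["n"].rank(method="dense", ascending=False)
        return d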

@bvreede bvreede mentioned this pull request Nov 21, 2023
@bvreede bvreede marked this pull request as ready for review November 28, 2023 10:07
@jiqicn (Contributor) left a comment


Hi @bvreede, I have left mostly minor questions and remarks.

Let me know if you want to further discuss any of the comments.

Resolved review threads on sktalk/corpus/conversation.py, sktalk/corpus/parsing/cha.py, and tests/corpus/test_conversation.py.
@@ -1,11 +1,12 @@
import warnings
from typing import Optional
from .utterance import Utterance
from .write.writer import Writer


class Conversation(Writer):
@jiqicn (Contributor) commented:

A minor one: from a semantics and OOP perspective, I'm having a bit of a hard time imagining what/how a conversation can inherit from a writer. Is there any special reason that the base class is called Writer?

@bvreede (Contributor, Author) replied:

Thanks for this question, and I hope my answer makes sense; this is all quite new to me too!
Writer is not a proper parent class, but a class that is used to collect writer functionality that can be used by different objects. Is this weird architecture? Please let me know 😅 CC @carschno

@carschno (Collaborator) replied:

I fully understand the conceptual doubts, as Writer really does not make sense as a parent class for a Conversation. However, the Writer here takes the role of a MixIn in Python terminology (similar to an Interface in other languages). Python does not make a technical distinction between a class serving as a parent class in the strict sense, and a class serving as a MixIn.
The distinction is typically visible in the order of inheritance, where the parent class comes first, followed by one or multiple MixIns (e.g. Conversation(AbstractConversation, Writer)). In this case, however, Conversation does not have a parent class in the common sense, but only a MixIn.

Specifically, the Writer class enables other classes to inherit and/or override common serialization methods -- for writing, in this case.

@jiqicn (Contributor) replied:

Nice explanation @bvreede and @carschno! I didn't really check the other modules/classes defined outside this PR, so I was mostly missing the context there. But since you called it a MixIn, I'm expecting to either see some other similar classes provided as optional features to be inherited, or classes other than Conversation that will also use this particular feature (and maybe some other features). Otherwise, making Writer a MixIn may not make a lot of sense (at least to me). But I also understand that this can be future work.

@bvreede (Contributor, Author) replied:

Thanks for the explanation @carschno! I didn't know about MixIns yet, but having a separate class that provides some default methods here is logical to me. We do indeed have another class that inherits from Writer: Corpus. There is some common functionality that we want to keep DRY, as both Conversation and Corpus are objects that will, for example, be saved as JSON (#21) and CSV (#47).
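
To illustrate the pattern in miniature (hypothetical names and methods, not the actual sktalk Writer interface):

    import json


    class WriterMixin:
        """Collects shared serialization behaviour; classes using it must provide asdict()."""

        def write_json(self, path):
            with open(path, "w", encoding="utf-8") as f:
                json.dump(self.asdict(), f)


    class Conversation(WriterMixin):
        """Simplified stand-in; only shows the MixIn relationship."""

        def __init__(self, utterances):
            self.utterances = utterances

        def asdict(self):
            return {"utterances": self.utterances}


    class Corpus(WriterMixin):
        """A second user of the same MixIn, keeping the writing logic DRY."""

        def __init__(self, conversations):
            self.conversations = conversations

        def asdict(self):
            return {"conversations": [c.asdict() for c in self.conversations]}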

@mdingemanse

@bvreede some belated responses to this

  • you have access to that Basecamp thread now
  • I agree that your Scenario 1 is perfectly sensible, and that Scenarios 2 and 3 are increasingly questionable. However, there is nothing we can do about that without making another arbitrary determination of when an utterance stops being "short". This is why I introduced prior_by (link to Basecamp comment). Quoting from my Basecamp thread:

this allows us to decide whether FTO should only be done when prior is by "other" (most conservative), versus perhaps also when prior row in db is by "self during other" (also reasonable)

So my thinking at the time was that folks might want to choose between the most conservative measure versus a slightly looser one. And for ElPaCo we went with the looser one:

next we deal with self during other. The difference is hard to see but now turns whose prior is a self-during-other are also getting an FTO, timed relative to the nearest prior turn by other (instead of being treated as a self-transition)
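
Purely for illustration, a hedged sketch of what such a policy switch might look like in code (field names like prior_by and the exact arithmetic are assumptions, not the sktalk or ElPaCo implementation):

    def compute_fto(turn, nearest_prior_by_other, prior_by, allow_self_during_other=True):
        """Floor Transfer Offset: onset of this turn minus offset of the nearest prior turn by other.

        Conservative policy: only compute FTO when the immediately prior row is by 'other'.
        Looser policy (allow_self_during_other=True): also compute it when the prior row is a
        'self during other' turn, still timed against the nearest prior turn by other.
        """
        eligible = prior_by == "other" or (allow_self_during_other and prior_by == "self_during_other")
        if not eligible or nearest_prior_by_other is None:
            return None
        return turn["begin"] - nearest_prior_by_other["end"]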


sonarqubecloud bot commented Dec 1, 2023

Kudos, SonarCloud Quality Gate passed!

  • Bugs: 0 (rating A)
  • Vulnerabilities: 0 (rating A)
  • Security Hotspots: 0 (rating A)
  • Code Smells: 1 (rating A)
  • Coverage: 93.3%
  • Duplication: 0.0%

@bvreede bvreede requested a review from mdingemanse December 1, 2023 14:03
@bvreede bvreede dismissed mdingemanse’s stale review December 1, 2023 14:04

review comments were used to provide clarifications

@bvreede bvreede merged commit ccd4bc4 into main Dec 1, 2023
9 checks passed
@bvreede bvreede deleted the first-functions branch December 1, 2023 14:06