Corpus construction from pandas Dataframes #69
I've picked this up. Will submit a PR soon!
Hi @calebchiam, was hoping you could help with a question I had regarding this issue. I noticed that functionality to read json files for utterances, speakers, and conversations and create a corpus is already present. Pandas has an inbuilt "to_json" method which can create those files out of the user's dataframes. Do you think we could internally create json files from the 3 dfs, then read them as a corpus (through the corpus directory) as usual? This process could be made into a static method as you suggested. Or do you think this intermediate step of creating json files from dataframes should be avoided? Because if it must be avoided, I think this issue would also invariably have to solve #78. Let me know what you think. Thanks!
Hi @Ap1075, thanks for picking this up. I think we'd want to avoid the intermediate step of creating json files from dataframes, because the json file writing process in ConvoKit comes with its own set of complexities and specificities that would only complicate this. Instead, we might expect this method to look something like convert_df_to_corpus() in this piece of code -- albeit slightly more complex, because it would have to handle conversation and speaker metadata as well. You would not have to solve #78, since that is more about abstracting the metadata update step into its own method, whereas here you can do something much simpler, like the metadata initialisation in L41-43 of the linked code. Does that make sense?
Just to add on, the basic structure of this method would probably look something like:
Might be missing some smaller steps, but that's the rough idea. |
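The rough structure sketched above might look something like the following. This is a hypothetical, stdlib-only sketch: plain dicts stand in for pandas DataFrame rows and for ConvoKit's Speaker/Utterance/Corpus objects, and the function name `convert_records_to_corpus` is illustrative, not ConvoKit API.

```python
# Hypothetical sketch of the conversion flow. In the real implementation,
# rows would come from pandas DataFrames and the objects built would be
# ConvoKit Speaker / Utterance instances passed to Corpus(utterances=...).

def convert_records_to_corpus(utt_records):
    # 1. Create one Speaker per unique speaker id seen in the utterances.
    speakers = {r["speaker"]: {"id": r["speaker"], "meta": {}} for r in utt_records}

    # 2. Create Utterances, linking each one to its Speaker object.
    utterances = []
    for r in utt_records:
        utterances.append({
            "id": r["id"],
            "speaker": speakers[r["speaker"]],
            "conversation_id": r["conversation_id"],
            "reply_to": r.get("reply_to"),
            "text": r["text"],
            # 3. Initialise metadata from 'meta.'-prefixed fields,
            #    stripping the prefix (cf. L41-43 of the linked code).
            "meta": {k[len("meta."):]: v
                     for k, v in r.items() if k.startswith("meta.")},
        })

    # 4. A real implementation would return Corpus(utterances=utterances),
    #    then attach conversation- and speaker-level metadata similarly.
    return {"speakers": speakers, "utterances": utterances}
```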
Thanks @calebchiam, that clears things up a lot. I think I might've already implemented steps 2 and 3 for a project, using the convert_df_to_corpus function as reference. I'll tie it all together. I thought linking metadata across speakers, utterances, and conversations might need an update function as suggested in #78, but perhaps that's not necessary. Thanks again!
This could be implemented as a static method in Corpus, i.e. `Corpus.from_pandas(...)`, that takes in three arguments: a speakers dataframe, an utterances dataframe, and a conversations dataframe. The columns of the dataframes should mirror the primary data fields of the respective components exactly. All additional metadata should be specified in columns that are prefixed with 'meta.'. For example, an utterance with a subreddit metadata attribute would have a column called 'meta.subreddit' in the utterances dataframe.
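The 'meta.' column convention can be illustrated with a small stdlib-only sketch. The helper `split_row` and the example row are hypothetical, not ConvoKit API; they just show how a dataframe row would be partitioned into primary fields and a metadata dict.

```python
# Illustrative only: separate 'meta.'-prefixed columns from primary
# fields when building a corpus component from one dataframe row.

def split_row(row):
    """Partition a row dict into (primary fields, metadata dict)."""
    primary = {k: v for k, v in row.items() if not k.startswith("meta.")}
    meta = {k[len("meta."):]: v for k, v in row.items() if k.startswith("meta.")}
    return primary, meta

row = {"id": "u1", "speaker": "s1", "conversation_id": "c1",
       "reply_to": None, "text": "hello", "meta.subreddit": "askscience"}
primary, meta = split_row(row)
# primary holds the core utterance fields; meta holds {"subreddit": "askscience"}
```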