Corpus construction from pandas Dataframes #69
I've picked this up. Will submit a PR soon!
Hi @calebchiam, was hoping you could help with a question I had regarding this issue. I noticed that functionality to read json files for utterances, speakers, and conversations and create a corpus is already present. Pandas has an inbuilt "to_json" method which can create those files out of the user's dataframes. Do you think we could internally create json files from the 3 dfs, then read them as a corpus (through the corpus directory) as usual? This process could be made into a static method as you suggested. Or do you think this intermediate step of creating json files from dataframes should be avoided? Because if it must be avoided, I think this issue would also invariably have to solve #78. Let me know what you think. Thanks!
Hi @Ap1075, thanks for picking this up. I think we'd want to avoid the intermediate step of creating json files from dataframes, because the json file writing process in ConvoKit comes with its own set of complexities and specificities that would only complicate this. Instead, we might expect this method to look something like convert_df_to_corpus() in this piece of code -- albeit slightly more complex, because it would have to handle conversation and speaker metadata as well. You would not have to solve #78, since that is more about abstracting the metadata update step into its own method, whereas here you can do something much simpler, like the metadata initialisation in L41-43 of the linked code. Does that make sense?
Just to add on, the basic structure of this method would probably look something like:
Might be missing some smaller steps, but that's the rough idea. |
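The rough structure sketched above might look something like the following. This is a hypothetical, stdlib-only sketch: plain dicts stand in for pandas DataFrame rows and for ConvoKit's Speaker/Utterance/Corpus objects, and the function name `convert_records_to_corpus` is illustrative, not ConvoKit API.

```python
# Hypothetical sketch of the conversion flow. In the real implementation,
# rows would come from pandas DataFrames and the objects built would be
# ConvoKit Speaker / Utterance instances passed to Corpus(utterances=...).

def convert_records_to_corpus(utt_records):
    # 1. Create one Speaker per unique speaker id seen in the utterances.
    speakers = {r["speaker"]: {"id": r["speaker"], "meta": {}} for r in utt_records}

    # 2. Create Utterances, linking each one to its Speaker object.
    utterances = []
    for r in utt_records:
        utterances.append({
            "id": r["id"],
            "speaker": speakers[r["speaker"]],
            "conversation_id": r["conversation_id"],
            "reply_to": r.get("reply_to"),
            "text": r["text"],
            # 3. Initialise metadata from 'meta.'-prefixed fields,
            #    stripping the prefix (cf. L41-43 of the linked code).
            "meta": {k[len("meta."):]: v
                     for k, v in r.items() if k.startswith("meta.")},
        })

    # 4. A real implementation would return Corpus(utterances=utterances),
    #    then attach conversation- and speaker-level metadata similarly.
    return {"speakers": speakers, "utterances": utterances}
```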
Thanks @calebchiam, that clears things up a lot. I think I might've already implemented steps 2 and 3 for a project, using the convert_df_to_corpus function as reference. I'll tie it all together. I thought linking metadata across speakers, utterances, and conversations might need an update function as suggested in #78, but perhaps that's not necessary. Thanks again!
This could be implemented as a static method in Corpus, i.e. `Corpus.from_pandas(...)`, that takes in three arguments: a speakers dataframe, an utterances dataframe, and a conversations dataframe. The columns of the dataframes should mirror the primary data fields of the respective components exactly. All additional metadata should be specified in columns that are prefixed with 'meta.'. For example, an utterance with a subreddit metadata attribute would have a column called 'meta.subreddit' in the utterances dataframe.
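The 'meta.' column convention can be illustrated with a small stdlib-only sketch. The helper `split_row` and the example row are hypothetical, not ConvoKit API; they just show how a dataframe row would be partitioned into primary fields and a metadata dict.

```python
# Illustrative only: separate 'meta.'-prefixed columns from primary
# fields when building a corpus component from one dataframe row.

def split_row(row):
    """Partition a row dict into (primary fields, metadata dict)."""
    primary = {k: v for k, v in row.items() if not k.startswith("meta.")}
    meta = {k[len("meta."):]: v for k, v in row.items() if k.startswith("meta.")}
    return primary, meta

row = {"id": "u1", "speaker": "s1", "conversation_id": "c1",
       "reply_to": None, "text": "hello", "meta.subreddit": "askscience"}
primary, meta = split_row(row)
# primary holds the core utterance fields; meta holds {"subreddit": "askscience"}
```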