-
Notifications
You must be signed in to change notification settings - Fork 135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added 'from_pandas' method to create Corpus from DataFrames. #136
Conversation
…ed a couple of helper functions.
Hi @Ap1075, nice work! I made some fixes + added a test (convokit/tests/general/test_from_pandas.py) to evaluate if the method is working correctly. Note that if the column name in the dataframe is 'meta.stickied' for example, then we want to store it in the metadata key as simply 'stickied', but I've fixed that. Another thing to note is that we should be able to regenerate an almost-identical corpus (the only difference would be the absence of Corpus metadata) using the dataframes generated from the |
Thanks @calebchiam! Sure, I'll check this out |
Hi @calebchiam, so I believe the test is failing because of a KeyError, when trying to access metadata columns in the dataframe. I reckon from your commit that you'd like to drop the "meta" prefix when adding metadata to the corpus object. But removing the prefix from the helper function (extract_meta_from_df) causes a KeyError to be raised because the pandas columns still has the "meta" prefix. I plan on making the following change: I'll keep the "meta" prefix while querying the dataframe, but edit the key in the metadata dictionary before adding it to the corpus. Hope that makes sense. |
Ah yep, great catch. Will look out for your fix tomorrow morning then! (: |
Done @calebchiam. I decided to just edit the dataframe column names instead by removing the "meta." prefix. Thought that would be cleaner/faster. Should work as expected now :) |
Thanks, @Ap1075! Really nice work. Fixed a few bugs: had to add 'meta.' for Utterance dataframe column access and fix the Speaker initialisation, but otherwise this looks good to go. I'll merge it in and it'll be available in the next release. Let me know if it'd be helpful for your own projects and we can move up the next release to some time in the next few days. |
@all-contributors please add @Ap1075 for code |
I've put up a pull request to add @Ap1075! 🎉 |
In response to #69.
Added a static method in the corpus class to create a corpus object from speakers, utterances and conversations dataframes. Also added helper functions to extract required metadata from the dataframes.
Utterances DataFrame is expected to contain all primary fields as columns. Speakers and Conversations are expected to have an "id" column; metadata columns are optional for all dataframes. Metadata columns are expected to have a "meta" prefix, as specified in #69.