Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added 'from_pandas' method to create Corpus from DataFrames. #136

Merged
merged 7 commits into from
Sep 27, 2021

Conversation

Ap1075
Copy link
Contributor

@Ap1075 Ap1075 commented Sep 26, 2021

In response to #69.
Added a static method in the corpus class to create a corpus object from speakers, utterances and conversations dataframes. Also added helper functions to extract required metadata from the dataframes.

Utterances DataFrame is expected to contain all primary fields as columns. Speakers and Conversations are expected to have an "id" column; metadata columns are optional for all dataframes. Metadata columns are expected to have a "meta" prefix, as specified in #69.

@calebchiam
Copy link
Collaborator

calebchiam commented Sep 26, 2021

Hi @Ap1075, nice work! I made some fixes + added a test (convokit/tests/general/test_from_pandas.py) to evaluate if the method is working correctly. Note that if the column name in the dataframe is 'meta.stickied' for example, then we want to store it in the metadata key as simply 'stickied', but I've fixed that.

Another thing to note is that we should be able to regenerate an almost-identical corpus (the only difference would be the absence of Corpus metadata) using the dataframes generated from the get_[objects]_dataframe() methods -- which is what the test is evaluating. The test is not currently passing however, could you take a look?

@Ap1075
Copy link
Contributor Author

Ap1075 commented Sep 27, 2021

Thanks @calebchiam! Sure, I'll check this out

@Ap1075
Copy link
Contributor Author

Ap1075 commented Sep 27, 2021

Hi @calebchiam, so I believe the test is failing because of a KeyError, when trying to access metadata columns in the dataframe. I reckon from your commit that you'd like to drop the "meta" prefix when adding metadata to the corpus object. But removing the prefix from the helper function (extract_meta_from_df) causes a KeyError to be raised because the pandas columns still has the "meta" prefix.

I plan on making the following change: I'll keep the "meta" prefix while querying the dataframe, but edit the key in the metadata dictionary before adding it to the corpus. Hope that makes sense.

@calebchiam
Copy link
Collaborator

Ah yep, great catch. Will look out for your fix tomorrow morning then! (:

@Ap1075
Copy link
Contributor Author

Ap1075 commented Sep 27, 2021

Done @calebchiam. I decided to just edit the dataframe column names instead by removing the "meta." prefix. Thought that would be cleaner/faster. Should work as expected now :)

@calebchiam
Copy link
Collaborator

Thanks, @Ap1075! Really nice work. Fixed a few bugs: had to add 'meta.' for Utterance dataframe column access and fix the Speaker initialisation, but otherwise this looks good to go. I'll merge it in and it'll be available in the next release. Let me know if it'd be helpful for your own projects and we can move up the next release to some time in the next few days.

@calebchiam
Copy link
Collaborator

@all-contributors please add @Ap1075 for code

@allcontributors
Copy link
Contributor

@calebchiam

I've put up a pull request to add @Ap1075! 🎉

@calebchiam calebchiam merged commit 0f3030c into CornellNLP:master Sep 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants