Added 'from_pandas' method to create Corpus from DataFrames. #136

Ap1075 · 2021-09-26T17:52:16Z

In response to #69.
Added a static method to the Corpus class that creates a Corpus object from speakers, utterances, and conversations DataFrames. Also added helper functions to extract the required metadata from the DataFrames.

Utterances DataFrame is expected to contain all primary fields as columns. Speakers and Conversations are expected to have an "id" column; metadata columns are optional for all dataframes. Metadata columns are expected to have a "meta" prefix, as specified in #69.
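As a rough illustration of the input shape described above, the three DataFrames might look like the sketch below. The utterance column names (`id`, `speaker`, `conversation_id`, `reply_to`, `timestamp`, `text`) are assumed to be the primary fields, the `meta.` columns and all values are made up, and the call to the new method is left out because its exact signature is not shown in this thread.

```python
import pandas as pd

# Assumed primary fields for utterances; "meta.stickied" is an illustrative metadata column.
utterances_df = pd.DataFrame({
    "id": ["u0", "u1"],
    "speaker": ["alice", "bob"],
    "conversation_id": ["u0", "u0"],
    "reply_to": [None, "u0"],
    "timestamp": [0, 1],
    "text": ["Is this stickied?", "Yes."],
    "meta.stickied": [True, False],
})

# Speakers and conversations only need an "id" column; metadata columns are optional.
speakers_df = pd.DataFrame({"id": ["alice", "bob"], "meta.karma": [10, 3]})
conversations_df = pd.DataFrame({"id": ["u0"], "meta.title": ["Example thread"]})

# Metadata columns are recognised by their prefix.
meta_cols = [c for c in utterances_df.columns if c.startswith("meta")]
print(meta_cols)  # ['meta.stickied']
```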

…ed a couple of helper functions.

calebchiam · 2021-09-26T19:17:28Z

Hi @Ap1075, nice work! I made some fixes + added a test (convokit/tests/general/test_from_pandas.py) to check whether the method is working correctly. Note that if a column in the dataframe is named 'meta.stickied', for example, we want to store it under the metadata key 'stickied'; I've fixed that.

Another thing to note is that we should be able to regenerate an almost-identical corpus (the only difference would be the absence of Corpus metadata) using the dataframes generated from the get_[objects]_dataframe() methods -- which is what the test is evaluating. The test is not currently passing, however. Could you take a look?

Ap1075 · 2021-09-27T04:45:48Z

Thanks @calebchiam! Sure, I'll check this out

Ap1075 · 2021-09-27T05:03:04Z

Hi @calebchiam, I believe the test is failing because of a KeyError raised when trying to access metadata columns in the dataframe. I gather from your commit that you'd like to drop the "meta" prefix when adding metadata to the corpus object. But removing the prefix in the helper function (extract_meta_from_df) causes a KeyError because the DataFrame columns still have the "meta" prefix.

I plan on making the following change: I'll keep the "meta" prefix while querying the dataframe, but edit the key in the metadata dictionary before adding it to the corpus. Hope that makes sense.
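A minimal sketch of that plan, assuming a `meta.` prefix and a DataFrame indexed by id: the lookup uses the full column name, and only the dictionary key is shortened. `extract_meta` here is a hypothetical stand-in, not the PR's actual `extract_meta_from_df`.

```python
import pandas as pd

META_PREFIX = "meta."  # assumed prefix, following the 'meta.stickied' example in this thread

def extract_meta(df: pd.DataFrame, row_id: str) -> dict:
    """Hypothetical helper: collect one row's metadata as {key_without_prefix: value}."""
    meta_cols = [c for c in df.columns if c.startswith(META_PREFIX)]
    row = df.loc[row_id]
    # Query the DataFrame with the full column name; strip the prefix only for the stored key.
    return {col[len(META_PREFIX):]: row[col] for col in meta_cols}

utts = pd.DataFrame(
    {"text": ["hi", "hello"], "meta.stickied": [False, True]},
    index=pd.Index(["u0", "u1"], name="id"),
)

# Looking up the stripped name directly fails, because the column is still "meta.stickied".
try:
    utts["stickied"]
except KeyError:
    print("KeyError: no column named 'stickied'")

meta = extract_meta(utts, "u1")
print(list(meta.keys()))  # ['stickied']
```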

calebchiam · 2021-09-27T05:35:22Z

Ah yep, great catch. Will look out for your fix tomorrow morning then! (:

Ap1075 · 2021-09-27T06:03:11Z

Done @calebchiam. I decided to just edit the dataframe column names instead by removing the "meta." prefix. Thought that would be cleaner/faster. Should work as expected now :)
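For illustration, renaming the columns up front could look like the sketch below. `strip_meta_prefix` is a hypothetical name and the data is made up; the code actually merged in this PR may differ.

```python
import pandas as pd

def strip_meta_prefix(df: pd.DataFrame, prefix: str = "meta.") -> pd.DataFrame:
    """Return a copy of df whose column names have a leading prefix removed."""
    return df.rename(columns=lambda c: c[len(prefix):] if c.startswith(prefix) else c)

speakers = pd.DataFrame({"id": ["alice", "bob"], "meta.karma": [10, 3]})
renamed = strip_meta_prefix(speakers)
print(list(renamed.columns))  # ['id', 'karma']
```

One trade-off of this approach: once the prefix is gone, metadata columns can only be told apart from primary fields by knowing the primary field names, so any later lookup has to agree on which form of the column name is in use.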

calebchiam · 2021-09-27T06:30:22Z

Thanks, @Ap1075! Really nice work. Fixed a few bugs: had to add 'meta.' for Utterance dataframe column access and fix the Speaker initialisation, but otherwise this looks good to go. I'll merge it in and it'll be available in the next release. Let me know if it'd be helpful for your own projects and we can move up the next release to some time in the next few days.

calebchiam · 2021-09-27T06:31:16Z

@all-contributors please add @Ap1075 for code

allcontributors · 2021-09-27T06:31:24Z

@calebchiam

I've put up a pull request to add @Ap1075! 🎉

Ap1075 and others added 2 commits September 26, 2021 23:02

Added 'from_pandas' method to create Corpus from DataFrames. Also add…

9d10440

…ed a couple of helper functions.

bug fixes + added CorpusFromPandas test

1fa0e47

calebchiam and others added 2 commits September 27, 2021 01:36

Update test_from_pandas.py

34109af

Fixed KeyError raised while querying the dataframe

edbccaa

Ap1075 and others added 3 commits September 27, 2021 11:36

Fixed placement of replace command (outside for loop).

b94d437

fixing metadata key name for utts + speaker initialization

d09dff1

expanded tests

d4629b2

allcontributors bot mentioned this pull request Sep 27, 2021

docs: add Ap1075 as a contributor for code #137

Merged

calebchiam merged commit 0f3030c into CornellNLP:master Sep 27, 2021
