Heartbeat duplicate detection potentially yielding false positives #454
Did you get any error messages in the server's console? For how long did you wait? If the amount of data is very large, it might well take a couple of hours, since it's currently downloaded in small batches (see #323).
About 10 minutes, after which I got an email saying the import had completed successfully. Here are the full logs since server startup: http://sprunge.us/mbNHSN

Edit: I do not see any failed logs. Could it be that WakaTime is not returning the entire data?
It's probably related to #334 (comment). As already explained there, Wakapi and WakaTime calculate coding duration differently, so there will naturally be a discrepancy. Also, a lot of duplicates seem to have been filtered out during your import (only 373086 of 389859 downloaded heartbeats were actually persisted). Could you maybe check the WakaTime CSV dump to see if you can spot any duplicates or other irregularities? Might well be that there is a bug in Wakapi that causes too many heartbeats to be filtered out. Would love to get your support on investigating this!
As mentioned on that other issue, we hash every heartbeat object to check for duplicates. I just briefly reviewed the implementation of that again and there is a chance that the hashing of a heartbeat's attributes does not work as intended.

Could you potentially send me a subset of your CSV export (pseudonymized, if you prefer that) that includes a portion of records where all relevant attributes (entity, type, project, language, ...) are identical, except for the timestamp? But don't worry, I can also just handcraft a bit of fake data for that! Will keep you posted.
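For illustration, here is a minimal sketch of what hash-based duplicate detection of this kind can look like. This is not Wakapi's actual implementation (Wakapi is written in Go); the field names are assumed from the WakaTime dump format.

```python
import hashlib

# Sketch only: deduplicate heartbeats by hashing their identifying fields.
# Field names ('time', 'entity', ...) are assumptions based on the dump format.
def heartbeat_hash(hb):
    # Two heartbeats with the same hash are treated as duplicates.
    key = '|'.join(str(hb.get(f, '')) for f in ('time', 'entity', 'type', 'project', 'language'))
    return hashlib.sha256(key.encode('utf-8')).hexdigest()

def deduplicate(heartbeats):
    seen, unique = set(), []
    for hb in heartbeats:
        h = heartbeat_hash(hb)
        if h not in seen:
            seen.add(h)
            unique.append(hb)
    return unique

# Two heartbeats differing only in 'time' hash differently, so both are kept.
# If 'time' were accidentally excluded from the hash, the second one would be
# dropped as a false-positive duplicate -- the kind of bug suspected here.
a = {'time': 1000.0, 'entity': 'main.go', 'type': 'file', 'project': 'demo', 'language': 'Go'}
b = dict(a, time=1001.0)
assert len(deduplicate([a, b])) == 2
```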
@muety I am not sure what you want. Would the complete export work? WakaTime gives a download link, you can download it from there. It is about 250 MB.
A complete export would help as well, but it obviously contains quite a lot of potentially personally identifiable information. So feel free to only take a portion of it and/or replace project names and the like. Send it to [email protected]. But no worries if that's too much to ask for! I can probably also just write a quick script to generate test data for debugging the above.
Looks like you will have to generate the data yourself; the files are just too large to upload. Sorry :(
The CSVs can probably be compressed quite tremendously. But no worries if not! Thanks for your help.
I'm not in front of my system anymore. Will try to convert the JSON to CSV tomorrow and update here.
@muety I was able to compress the entire data to 23 MB and sent you the JSON. I used https://github.com/ouch-org/ouch to compress it; you can decompress it with the same tool.
Where did you send it? I didn't receive an e-mail yet. Btw., I did some testing and the hashing seems to be working fine. The cause of this problem has to be somewhere else. Looking at your data will hopefully reveal something in that regard.
I sent it to the email you wrote above. If you still did not receive it, perhaps you can share your Discord username?
What is your overall, total coding time shown in WakaTime and what is it in Wakapi?
I checked the data you sent. The discrepancy between how many heartbeats were downloaded from WakaTime and how many were imported into Wakapi actually seems to be due solely to duplicate timestamps. I wrote a small script to analyze your dump, and it outputs that around 4.5 % of WakaTime heartbeats have non-unique timestamps, which is something that Wakapi cannot handle.

```python
import json

with open('wakatime-dump.json', 'r') as f:
    data = json.load(f)

# Collect the timestamp of every heartbeat across all days in the dump.
timestamps = [heartbeat['time'] for day in data['days'] for heartbeat in day['heartbeats']]
timestamps_unique = frozenset(timestamps)

print(f'got {len(timestamps_unique)} / {len(timestamps)}')  # got 373095 / 389886
```

To be honest, I tend to think that there's nothing wrong with Wakapi and the difference is just due to the different methodology of interpolating between heartbeats.
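As a possible follow-up for anyone investigating further, a short extension of the script above (reusing its `timestamps` list, under the same assumptions about the dump structure) that surfaces the most frequently duplicated timestamps:

```python
from collections import Counter

# Count how often each timestamp occurs and show the worst offenders.
counts = Counter(timestamps)
duplicates = {ts: n for ts, n in counts.items() if n > 1}
print(f'{len(duplicates)} timestamps occur more than once')
for ts, n in sorted(duplicates.items(), key=lambda kv: -kv[1])[:10]:
    print(f'{ts} occurs {n} times')
```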
I am not sure about the overall total since I do not have a WakaTime Pro account. However, one of my projects (incento-server) shows a total time of 700 hrs on WakaTime, while it shows about 500 hrs on Wakapi.

If that is the case, then I think this issue is solved?
Frankly, yes. I don't see anything we could do on Wakapi's side at this point, sorry. Once we have #156, you'll be able to tweak the interpolation methodology to your needs. Please stay tuned until then :-).
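To illustrate why two trackers can report different totals from the exact same heartbeats, here is a toy example of one common approach: summing gaps between consecutive heartbeats, capped at a timeout. The timeout values below are made up and do not correspond to WakaTime's or Wakapi's actual parameters.

```python
# Toy illustration of how interpolation methodology changes the total.
# Gaps between consecutive heartbeats are capped at `timeout`; the values
# used here are arbitrary assumptions, not the real WakaTime/Wakapi settings.
def total_seconds(timestamps, timeout):
    ts = sorted(timestamps)
    return sum(min(b - a, timeout) for a, b in zip(ts, ts[1:]))

beats = [0, 60, 400, 460, 1500]  # fake heartbeat times in seconds
print(total_seconds(beats, timeout=120))  # 360  -> gaps capped at 2 minutes
print(total_seconds(beats, timeout=900))  # 1360 -> gaps capped at 15 minutes
```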
@muety Thank you for looking into this!
Describe the bug
I made an import from WakaTime, but the entire data is not imported. I have about 100 hrs missing from each of my major worked-on projects. I could not find any issues like this. Is this a known bug?
System information

Output of `uname -ar`:
Hosted using dokku.