Heartbeat duplicate detection potentially yielding false positives #454
Did you get any error messages in the server's console? For how long did you wait? If the amount of data is very large, it might well take a couple of hours, since it's currently downloaded in small batches (see #323).
About 10 minutes, after which I got an email saying the import had completed successfully. Here are the full logs since server startup: http://sprunge.us/mbNHSN

Edit: I do not see any failed logs. Could it be that WakaTime is not returning the entire data?
It's probably related to #334 (comment). As already explained there, Wakapi and WakaTime calculate coding duration differently, so there will naturally be a discrepancy. Also, a lot of duplicates seem to have been filtered out during your import (only 373086 of 389859 downloaded heartbeats were actually persisted). Could you maybe check the WakaTime CSV dump to see if you can spot any duplicates or other irregularities? Might well be that there is a bug in Wakapi that causes too many heartbeats to be filtered out. Would love to get your support on investigating this!
As mentioned on that other issue, we hash every heartbeat object to check for duplicates. I just briefly reviewed the implementation of that again and there is a chance that the hashing of a heartbeat's attributes does not work as intended.

Could you potentially send me a subset of your CSV export (pseudonymized, if you prefer that) that includes a portion of records where all relevant attributes (entity, type, project, language, ...) are identical, except for the timestamp? But don't worry, I can also just handcraft a bit of fake data for that! Will keep you posted.
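For illustration, here is a minimal sketch of what hash-based duplicate detection of this kind can look like. This is not Wakapi's actual implementation (Wakapi is written in Go); the field names are assumed from the WakaTime dump format.

```python
import hashlib

# Sketch only: deduplicate heartbeats by hashing their identifying fields.
# Field names ('time', 'entity', ...) are assumptions based on the dump format.
def heartbeat_hash(hb):
    # Two heartbeats with the same hash are treated as duplicates.
    key = '|'.join(str(hb.get(f, '')) for f in ('time', 'entity', 'type', 'project', 'language'))
    return hashlib.sha256(key.encode('utf-8')).hexdigest()

def deduplicate(heartbeats):
    seen, unique = set(), []
    for hb in heartbeats:
        h = heartbeat_hash(hb)
        if h not in seen:
            seen.add(h)
            unique.append(hb)
    return unique

# Two heartbeats differing only in 'time' hash differently, so both are kept.
# If 'time' were accidentally excluded from the hash, the second one would be
# dropped as a false-positive duplicate -- the kind of bug suspected here.
a = {'time': 1000.0, 'entity': 'main.go', 'type': 'file', 'project': 'demo', 'language': 'Go'}
b = dict(a, time=1001.0)
assert len(deduplicate([a, b])) == 2
```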
@muety I am not sure what you want. Would the complete export work? WakaTime gives a download link, you can download it from there. It is about 250 MB.
A complete export would help as well, but it obviously contains quite a lot of potentially personally identifiable information. So feel free to only take a portion of it and/or replace project names and the like. Send it to [email protected]. But no worries if that's too much to ask for! I can probably also just write a quick script to generate test data for debugging the above.
Looks like you will have to generate the data yourself; the files are just too large to upload. Sorry :(
The CSVs can probably be compressed quite tremendously. But no worries if not! Thanks for your help.
I'm not in front of my system anymore. Will try to convert the JSON to CSV tomorrow and update here.
@muety I was able to compress the entire data to 23 MB and sent you the JSON. I used https://github.com/ouch-org/ouch to compress it; you can decompress it with the same tool.
Where did you send it? I didn't receive an e-mail yet. Btw., I did some testing and the hashing seems to be working fine. The cause of this problem has to be somewhere else. Looking at your data will hopefully reveal something in that regard.
I sent it to the email you wrote above. If you still did not receive it, perhaps you can share your Discord username?
What is your overall, total coding time shown in WakaTime and what is it in Wakapi?
I checked the data you sent. The discrepancy between how many heartbeats were downloaded from WakaTime and how many were imported into Wakapi actually seems to be due solely to duplicate timestamps. I wrote a small script to analyze your dump, and it outputs that around 4.5 % of WakaTime heartbeats have non-unique timestamps, which is something that Wakapi cannot handle.

```python
import json

with open('wakatime-dump.json', 'r') as f:
    data = json.load(f)

# Collect the timestamp of every heartbeat across all days in the dump.
timestamps = [heartbeat['time'] for day in data['days'] for heartbeat in day['heartbeats']]
timestamps_unique = frozenset(timestamps)

print(f'got {len(timestamps_unique)} / {len(timestamps)}')  # got 373095 / 389886
```

To be honest, I tend to think that there's nothing wrong with Wakapi and the difference is just due to the different methodology of interpolating between heartbeats.
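As a possible follow-up for anyone investigating further, a short extension of the script above (reusing its `timestamps` list, under the same assumptions about the dump structure) that surfaces the most frequently duplicated timestamps:

```python
from collections import Counter

# Count how often each timestamp occurs and show the worst offenders.
counts = Counter(timestamps)
duplicates = {ts: n for ts, n in counts.items() if n > 1}
print(f'{len(duplicates)} timestamps occur more than once')
for ts, n in sorted(duplicates.items(), key=lambda kv: -kv[1])[:10]:
    print(f'{ts} occurs {n} times')
```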
I am not sure about the overall total since I do not have a WakaTime Pro account. However, one of my projects (incento-server) shows a total time of 700 hrs on WakaTime, while it shows about 500 hrs on Wakapi.

If that is the case, then I think this issue is solved?
Frankly, yes. I don't see anything we could do on Wakapi's side at this point, sorry. Once we have #156, you'll be able to tweak the interpolation methodology to your needs. Please stay tuned until then :-).
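To illustrate why two trackers can report different totals from the exact same heartbeats, here is a toy example of one common approach: summing gaps between consecutive heartbeats, capped at a timeout. The timeout values below are made up and do not correspond to WakaTime's or Wakapi's actual parameters.

```python
# Toy illustration of how interpolation methodology changes the total.
# Gaps between consecutive heartbeats are capped at `timeout`; the values
# used here are arbitrary assumptions, not the real WakaTime/Wakapi settings.
def total_seconds(timestamps, timeout):
    ts = sorted(timestamps)
    return sum(min(b - a, timeout) for a, b in zip(ts, ts[1:]))

beats = [0, 60, 400, 460, 1500]  # fake heartbeat times in seconds
print(total_seconds(beats, timeout=120))  # 360  -> gaps capped at 2 minutes
print(total_seconds(beats, timeout=900))  # 1360 -> gaps capped at 15 minutes
```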
@muety Thank you for looking into this!
Describe the bug
I made an import from WakaTime, but the entire data is not imported. I have about 100 hrs missing from each of my major worked-on projects. I could not find any issues like this. Is this a known bug?
System information

Output of `uname -ar`:
Hosted using dokku.