Need regular clean-up of /mnt/ebs/tmp on 10-aws-syd #395

Open
mhidas opened this issue Mar 23, 2016 · 13 comments · Fixed by #418

@mhidas
Contributor

mhidas commented Mar 23, 2016

This directory is used for temp storage during processing (see https://github.com/aodn/chef-private/pull/1776), and it looks like files are not always cleaned up. There's currently 29 GB worth of stuff in there. For the moment this is not a problem as there's still 156 GB free on /mnt/ebs, but as this filesystem is also used for incoming, error, logging and various other things, some bad things could happen if it fills up.

So, it would be good to periodically clean out the oldest files in /mnt/ebs/tmp. Could easily add a cron job here, but maybe this should be set up in chef?

@jonescc @julian1 @danfruehauf any thoughts?

@julian1
Contributor

julian1 commented Apr 14, 2016

Currently 93 GB:

$ du -hs /mnt/ebs/tmp/
93G     /mnt/ebs/tmp/
$ find /mnt/ebs/tmp/ -type f | wc -l
7614

Not really sure why a tmp directory is needed.

Ideally all file handling steps (reception, validity checking, talend harvest) should complete as atomic actions.

Easy to implement using stack unwinding with exception handlers doing cleanup, or else using a queue of command objects with do() and restore() methods.
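
As a rough illustration of the do()/restore() idea in shell (the language of the incoming handlers), here is a minimal sketch. All names in it (`run_steps`, `do_stage`, `undo_stage`, `do_check`, `undo_check`, `SRC_FILE`, `WORK_DIR`) are hypothetical and not taken from the real pipeline; the "check" step is only a stand-in for a real validity check.

```shell
# run_steps STEP...: call do_STEP for each step in order. If one fails,
# call undo_STEP for every step that already completed, newest first,
# then return non-zero. Step names must not contain whitespace.
run_steps() {
    done_steps=""
    for step in "$@"; do
        if "do_$step"; then
            done_steps="$step $done_steps"   # prepend so the newest is undone first
        else
            for finished in $done_steps; do
                "undo_$finished"
            done
            return 1
        fi
    done
}

# Hypothetical steps operating on $SRC_FILE and $WORK_DIR.
do_stage()   { cp -- "$SRC_FILE" "$WORK_DIR/staged"; }
undo_stage() { rm -f -- "$WORK_DIR/staged"; }
do_check()   { [ -s "$WORK_DIR/staged" ]; }   # stand-in check: staged file is non-empty
undo_check() { :; }                           # nothing to undo for a read-only check
```

With `SRC_FILE` and `WORK_DIR` set, `run_steps stage check` leaves the staged file in place on success and removes it again if the check fails. A step that fails part-way is expected to tidy up after itself; only completed steps are undone.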

@lbesnard
Contributor

lbesnard commented Apr 18, 2016

Lots of logs from harvesters, and lots of glider data. See the fix in #417.

I will also clean up the tmp dirs containing ANFOG data:

find . -type d -name '*glider' -print0 | du -sch --files0-from=-
107 GB

I think cleaning 107GB will make everyone happy

@lbesnard
Contributor

lbesnard commented Apr 18, 2016

Also many soop_ba files in the tmp folder, and temporary subfolders called Raw:

find . -type d -name '*Raw' -print0 | du -sch --files0-from=-
63 GB

and

./tmp.AFEnxbUCXX/0256_Cairns20151130
./tmp.iyPybsNrmm/0016_AIMS20151127
./tmp.RBoYOpOrGJ/0015_AIMS20151021
./tmp.RnuUWkAiYq/0016_AIMS20151127
./tmp.FRlyQ7ukvJ/0017_CharlotteBay20151124
./tmp.5R5ZJxhNA5/0255_Yamba20151110

@lbesnard
Contributor

@mhidas
Also some IMOS_ANMN=NRS and QLD files. No big deal, but it would be nice to add some clean-up code to your pipeline functions.

@mhidas
Contributor Author

mhidas commented Apr 19, 2016

@lbesnard I do have code to clean up temp files at the end of the incoming handler. The problem is when there is an error, the whole thing exits and the clean-up code doesn't get executed. I could add a bunch of rm $tmp_file etc... statements before every file_error, but that's a lot of extra lines to do the same thing.

It would make more sense for the clean-up to be more generic. I did suggest one solution (#334) a while ago, but that was vetoed. Probably a better solution would be for each incoming_handler process to be provided with its own temp directory, which is removed with all its contents after execution, no matter what. This could be done in https://github.com/aodn/chef/blob/master/cookbooks/imos_po/templates/default/watch-exec-wrapper.sh.erb
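
A minimal sketch of the per-process temp directory idea, assuming the wrapper is a shell script. The names `run_with_private_tmp` and `TMP_BASE` are hypothetical and nothing here is taken from the actual watch-exec-wrapper.sh.erb. The handler command runs in a subshell whose EXIT trap removes the directory whether the handler succeeds or fails.

```shell
# Base directory for per-handler temp dirs (assumed; override via environment).
TMP_BASE="${TMP_BASE:-/mnt/ebs/tmp}"

# run_with_private_tmp CMD [ARG...]: run CMD with TMPDIR pointing at a fresh
# directory that is deleted, with all its contents, when CMD finishes.
run_with_private_tmp() {
    (
        tmp_dir=$(mktemp -d "$TMP_BASE/handler.XXXXXXXXXX") || exit 1
        # The EXIT trap fires on normal exit and on error exit of this subshell.
        trap 'rm -rf -- "$tmp_dir"' EXIT
        export TMPDIR="$tmp_dir"
        "$@"
        status=$?
        exit "$status"   # preserve the handler's exit status
    )
}
```

A handler would then create its temp files under `$TMPDIR` (for example with `mktemp`), and no `rm` statements would be needed before each error exit. The trap cannot run if the process is killed with SIGKILL, so this reduces orphaned files rather than ruling them out.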

@mhidas
Contributor Author

mhidas commented Apr 19, 2016

Or the other alternative is to have a separate process that runs every day and removes everything from /mnt/ebs/tmp that is more than X days old, which is what I was suggesting with this issue. (Granted, that wouldn't guarantee that the temp dir won't fill up, but if X is small enough it should be very unlikely).
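
For illustration, a minimal sketch of such a clean-up job, assuming GNU find is available on the host. The function name `clean_old_tmp` and the example value of X are hypothetical.

```shell
# clean_old_tmp DIR DAYS: delete regular files under DIR whose modification
# time is more than DAYS days old, then delete empty directories that are
# equally old. DIR itself is never removed.
clean_old_tmp() {
    dir=$1
    days=$2
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    find "$dir" -mindepth 1 -type f -mtime +"$days" -delete
    # The age test on directories avoids removing a fresh, still-empty temp
    # dir that a running handler has just created. A directory emptied by
    # the line above gets a new mtime, so it is removed on a later run.
    find "$dir" -mindepth 1 -type d -empty -mtime +"$days" -delete
}
```

Saved in a script that calls `clean_old_tmp /mnt/ebs/tmp 7`, this could be run daily from cron, or the cron entry could be managed in chef. Both the 7-day threshold and the schedule are assumptions to be tuned.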

@mhidas
Contributor Author

mhidas commented Jun 6, 2016

Not sure why this was closed. We still don't have a proper solution for cleaning up these files.

@mhidas mhidas reopened this Jun 6, 2016
@julian1
Contributor

julian1 commented Jun 7, 2016

If pipeline files are being orphaned in any of the /tmp dirs then we should find out why - and whether that's a hint of other more serious issues.

After the last round of fixes to data-services - all tmp files were being cleaned up correctly. I spent quite a bit of time verifying this.

I would like to know if this is a new issue, or whether it's a pre-existing/intermittent issue that for some reason wasn't seen or triggered before.

Currently, I feel it's a problem that we cannot actually trace the progress of a file through the "pipeline" or generate any type of audit log of what happened to it.

@mhidas
Contributor Author

mhidas commented Jun 7, 2016

@julian1 I mentioned above one basic reason those files are being left in the temp directory, and that's going to keep happening quite a lot.

This is not a new issue; it is the reason this issue was created in the first place. My initial suggestion, to have a separate job doing a regular clean-up, is just the simplest thing I could think of at the time. I'd be happy for us to come up with a better solution.

This is really part of https://github.com/aodn/backlog/issues/326 , which we need to re-open.

@mhidas mhidas closed this as completed Jun 7, 2016
@mhidas mhidas reopened this Jun 7, 2016
@julian1
Contributor

julian1 commented Jun 7, 2016

> I mentioned above one basic reason those files are being left in the temp directory, and that's going to keep happening quite a lot.

Being able to specify resource management on an error condition and making it an invariant enforced by the type-system is something that was solved in modern programming languages about 20 years ago (https://en.wikipedia.org/wiki/Resource_Acquisition_Is_Initialization).

I agree we should re-open the original issue.

@mhidas
Contributor Author

mhidas commented Jun 7, 2016

aodn/backlog#371

@ggalibert
Contributor

I recently (2017-03-09 16:07:41) faced a similar problem in ACORN with IMOS_ACORN_RV_20160413T014000Z_TAN_FV01_radial.nc .

The IMOS_ACORN_RV_20160413T014000Z_TAN_FV01_radial.nc.20170309-160741.log file says:

Going to process a total of '1' files
Processing slice with '1' files
No space left on device - /tmp/d20170309-21516-13ae91v

Might need to have a look at what's going on with /tmp in the incoming handler code.

@mhidas
Copy link
Contributor Author

mhidas commented Mar 14, 2017

@ggalibert I think that's a Talend issue. As far as I know, all incoming handlers now use /mnt/ebs/tmp.
