Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File disappears after going to $INCOMING_DIR #410

Closed
lbesnard opened this issue Apr 6, 2016 · 26 comments
Closed

File disappears after going to $INCOMING_DIR #410

lbesnard opened this issue Apr 6, 2016 · 26 comments
Labels

Comments

@lbesnard
Copy link
Contributor

lbesnard commented Apr 6, 2016

Some files completely disappear from the system after going to $INCOMING_DIR

How to reproduce

on aws10, connect as PO user, and do a source ./env at the base of data-services dir (would be good to add this automatically via chef btw)

# copy a SOOP ASF MT file to its incoming dir
cp $WIP_DIR/SOOP/SOOP_XBT_ASF_SST/data_unsorted/ASF_SST/ship/2016/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc $INCOMING_DIR/SOOP/ASF/MT   

input_logf SOOP_ASF_MT
Apr  6 14:44:36 10-aws-syd SOOP_ASF_MT: '/mnt/ebs/tmp/tmp.oltFjHYb6U/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc' is not a valid file, aborting.

### similar behaviour with 
cp $WIP_DIR/SOOP/SOOP_XBT_ASF_SST/data_unsorted/ASF_SST/ship/2016/IMOS_SOOP-SST_T_20160405T000300Z_VRDU8_FV01_C-20160406T001307Z.nc $INCOMING_DIR/SOOP/SST

input_logf SOOP_SST
Apr  6 15:16:27 10-aws-syd SOOP_SST: '/mnt/ebs/tmp/tmp.hr9huOeRry/IMOS_SOOP-SST_T_20160405T000300Z_VRDU8_FV01_C-20160406T001307Z.nc' is not a valid file, aborting.

Ok why not, the file is 'not valid'. Well the original one is. But it looks like there is an issue here https://github.com/aodn/data-services/blob/master/lib/common/util.sh#L165 with the tmp file. The definition of 'valid' is a bash check to see if the file exists or not. So we're really at a low level check.

The issue is that we should still be able to locate the 'bad' file. But it is nor in $ERROR_DIR nor in $INCOMING_DIR. Anyway, this is EXTREMELY concerning. Of course, Nagios won't warn us on anything because there is no files in the ERROR_DIR. So we have potentially lost many files.

Both these folders are empty, as well as the thredds folder where the file should go to.

ls -la $INCOMING_DIR/SOOP/ASF/MT 
ls -la $ERROR_DIR/SOOP_ASF_MT

I wish I was wrong, I've tried and checked many times.

@smancini @jonescc @pblain

@lbesnard lbesnard added the bug label Apr 6, 2016
@lbesnard
Copy link
Contributor Author

lbesnard commented Apr 6, 2016

@danfruehauf if you're bored in NZ !

@mhidas
Copy link
Contributor

mhidas commented Apr 7, 2016

@lbesnard Would you expect that file to be published? I.e. does it pass the checker and does the dest_path get correctly determined?

Do you get the same behaviour if you do this in the PO box?

It's not surprising that the file isn't in $INCOMING_DIR because it is moved to a temp directory before the incoming_handler is even called (see https://github.com/aodn/chef/blob/master/cookbooks/imos_po/templates/default/watch-exec-wrapper.sh.erb). As you can see the error in the log is referring to the file in /mnt/ebs/tmp/.

@lbesnard
Copy link
Contributor Author

lbesnard commented Apr 7, 2016

@lbesnard Would you expect that file to be published? I.e. does it pass the checker and does the dest_path get correctly determined?

yeap, and it works fine on nsp14.

It's not surprising that the file isn't in $INCOMING_DIR because it is moved to a temp directory before the incoming_handler is even called (see

yeap, that's the all point. Where does this file go then ?

@mhidas
Copy link
Contributor

mhidas commented Apr 7, 2016

I think the problem is running it as projectofficer. Maybe there's a file permissions issue that isn't handled well.

I just tried running your first example above and it worked fine:

Apr  7 10:47:03 10-aws-syd SOOP_ASF_MT: Indexing file 'IMOS/SOOP/SOOP-ASF/VLMJ_Investigator/meteorological_sst_observations/2016/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc', source file '/mnt/ebs/tmp/tmp.UbbGBgfnBM/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc'
Apr  7 10:47:12 10-aws-syd SOOP_ASF_MT: Moving '/mnt/ebs/tmp/tmp.UbbGBgfnBM/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc' -> 's3://imos-data/IMOS/SOOP/SOOP-ASF/VLMJ_Investigator/meteorological_sst_observations/2016/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc'
Apr  7 10:47:12 10-aws-syd SOOP_ASF_MT: '/mnt/ebs/tmp/tmp.UbbGBgfnBM/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc' -> 's3://imos-data/IMOS/SOOP/SOOP-ASF/VLMJ_Investigator/meteorological_sst_observations/2016/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc'  [1 of 1]
Apr  7 10:47:12 10-aws-syd SOOP_ASF_MT: #015   4096 of 420168     0% in    0s  1842.47 kB/s#015 420168 of 420168   100% in    0s     5.31 MB/s  done
Apr  7 10:47:12 10-aws-syd SOOP_ASF_MT: Done. Uploaded 420168 bytes in 1.0 seconds, 410.32 kB/s. Copied 0 files saving 0 bytes transfer.

@lbesnard
Copy link
Contributor Author

lbesnard commented Apr 7, 2016

hum, that's bizarre. How about the crons ? because they all run as projectofficer . And that's the issue, my soop python script pushes data to incoming

@pblain
Copy link
Contributor

pblain commented Apr 7, 2016

@anguss00 , @julian1 ( @jonescc ) can help out if you need support.

@lbesnard
Copy link
Contributor Author

lbesnard commented Apr 7, 2016

Well I think we (I) definitely need support if it works with a normal user but not PO

@lbesnard
Copy link
Contributor Author

lbesnard commented Apr 8, 2016

Also another similar bug, but not as problematic maybe.
If a watch.d script exists, which points to a non existing incoming_handler script, the files landing in the INCOMING _DIR also disappears from the system

Happens on the PO box.

Apr  8 12:40:51 packer-virtualbox-iso-1449624312 AUV: /usr/local/bin/watch-exec-wrapper.sh: line 59: /vagrant/src/data-services/AUV/auv_viewer_processing/incoming_handler.sh: No such file or directory
Apr  8 12:40:51 packer-virtualbox-iso-1449624312 AUV: Could not process file 'File not handled':
Apr  8 12:40:51 packer-virtualbox-iso-1449624312 AUV: Moving 'File not handled' -> '/vagrant/src/error/AUV/.20160408-124051'
Apr  8 12:40:51 packer-virtualbox-iso-1449624312 AUV: Could not move 'File not handled' -> '/vagrant/src/error/AUV/.20160408-124051'

but the file /vagrant/src/error/AUV/.20160408-124051 does not exist

@lbesnard
Copy link
Contributor Author

lbesnard commented Apr 8, 2016

I won't weight my words, it's just disastrous and happens to easily. Any small issue in an incoming handler would result to the same thing.

This bug should really be seen as a priority, we can not guarantee we're not loosing any data.

@julian1
Copy link
Contributor

julian1 commented Apr 10, 2016

Could someone point me at the documentation?

@lbesnard
Copy link
Contributor Author

Just saying but this looks to me as a very important bug. It wouldn't be a bad idea to spend a little bit of time on this just to see this is actually real as we have no idea how long this 'potential bug' has been on, and how many files have already completely disappeared from the FS.

@anguss00
Copy link
Contributor

Yes @lbesnard, I'm on it as of this arvo. Replicating and checking out the chef code atm

@lbesnard
Copy link
Contributor Author

👍 @anguss00 !

@anguss00
Copy link
Contributor

Let's hustle this bug :-)

@lbesnard
Copy link
Contributor Author

serious question: Should all pipelines be disabled while this is being fixed ? I'd say yes.
@pblain @smancini

@anguss00
Copy link
Contributor

in process.7.log.gz
string found containing

Apr  6 12:22:44 10-aws-syd SOOP_ASF_MT: '/mnt/ebs/tmp/tmp.bwrZwTYHpo/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc' is not a valid file, aborting.

tried to reprocess same file as PO user

cp $WIP_DIR/SOOP/SOOP_XBT_ASF_SST/data_unsorted/ASF_SST/ship/2016/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc  $INCOMING_DIR/SOOP/ASF/MT $INCOMING_DIR/SOOP/ASF/MT
Apr 13 16:18:01 10-aws-syd SOOP_ASF_MT: Indexing file 'IMOS/SOOP/SOOP-ASF/VLMJ_Investigator/meteorological_sst_observations/2016/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc', source file '/mnt/ebs/tmp/tmp.qwS5fhjUcs/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc'
Apr 13 16:18:06 10-aws-syd SOOP_ASF_MT: Moving '/mnt/ebs/tmp/tmp.qwS5fhjUcs/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc' -> 's3://imos-data/IMOS/SOOP/SOOP-ASF/VLMJ_Investigator/meteorological_sst_observations/2016/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc'
Apr 13 16:18:07 10-aws-syd SOOP_ASF_MT: '/mnt/ebs/tmp/tmp.qwS5fhjUcs/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc' -> 's3://imos-data/IMOS/SOOP/SOOP-ASF/VLMJ_Investigator/meteorological_sst_observations/2016/IMOS_SOOP-ASF_MT_20160404T000000Z_VLMJ_FV01_C-20160405T040053Z.nc'  [1 of 1]
Apr 13 16:18:07 10-aws-syd SOOP_ASF_MT: #015   4096 of 420168     0% in    0s  1861.33 kB/s#015 420168 of 420168   100% in    0s     2.18 MB/s  done
Apr 13 16:18:07 10-aws-syd SOOP_ASF_MT: Done. Uploaded 420168 bytes in 1.0 seconds, 410.32 kB/s. Copied 0 files saving 0 bytes transfer.

@anguss00
Copy link
Contributor

OK, we have a suspicion that the size of /tmp is not sufficient at just over 8GB. If this fills up, we can't add any more files, and its adios muchachos. Suggest to increase the max size of this directory. Investigating system logs further.

@mhidas
Copy link
Contributor

mhidas commented Apr 13, 2016

Are you sure \tmp is relevant? Note that the default temp directory ($TMP) is now \mnt\ebs\tmp, so maybe it's the size of that, and regular clearing out of old tmp data ( #395) that we need to look at?

@anguss00
Copy link
Contributor

Ah yes, sorry @mhidas. I totally agree with you there, we need to look at /mnt/ebs/tmp. Firstly, nagios monitoring on the du of this dir. Secondly, cleanup cronjobs as #395

@julian1
Copy link
Contributor

julian1 commented Apr 14, 2016

List of recent not valid files (some are correctly identified).

not_valid.txt

$ cd /mnt/ebs/log/data-services/
$ zgrep 'not a valid' $( ls -vr process.log* ) > not_valid.txt

@lbesnard
Copy link
Contributor Author

👍 @julian1 glad to see the problem is actually really serious

@atkinsn
Copy link

atkinsn commented Apr 15, 2016

@lbesnard stop working!

@julian1
Copy link
Contributor

julian1 commented Apr 15, 2016

Is that the real @atkinsn on github?

@atkinsn
Copy link

atkinsn commented Apr 15, 2016

@julian1 an imposter..

@julian1
Copy link
Contributor

julian1 commented Apr 19, 2016

#422

@pmbohm pmbohm closed this as completed in 0758315 Apr 20, 2016
pmbohm added a commit that referenced this issue Apr 20, 2016
@pmbohm pmbohm reopened this Apr 20, 2016
@pmbohm
Copy link
Contributor

pmbohm commented Apr 20, 2016

0758315 was only a possible fix

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants