
Fix deadlock when a path is re-locked while the cache is locked globally #133

Open

wants to merge 1 commit into master
Conversation

bilts (Contributor) commented Sep 28, 2016

See comments for details on the deadlock condition.

I don't like that it relies on the internal property _RLock__owner, but it was the least invasive way I could think of to do this. I'd appreciate someone more in-the-know finding a better way.
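To make the deadlock condition concrete, here is a minimal, self-contained sketch (not yas3fs code; the lock names and the Barrier-based orchestration are purely illustrative, and it uses Python 3 for brevity). One thread holds a per-path lock and then waits for the global cache lock, while a cache-reset thread holds the global lock and tries to re-lock the same path; neither can proceed. The ownership check against _RLock__owner is, as I understand it, meant to let the code notice that situation instead of blocking, though that attribute is an implementation detail of CPython's pure-Python RLock.

```python
# Illustrative sketch only -- not yas3fs code. Two RLocks stand in for
# yas3fs's per-path lock and global cache lock; the barrier makes the
# interleaving deterministic so the deadlock reproduces every run.
import threading

path_lock = threading.RLock()          # stands in for one per-path lock
global_cache_lock = threading.RLock()  # stands in for the cache-wide lock
both_hold_first_lock = threading.Barrier(2)

def reader():
    with path_lock:                    # reader locks the path...
        both_hold_first_lock.wait()
        with global_cache_lock:        # ...then blocks waiting for the cache lock
            pass

def cache_reset():
    with global_cache_lock:            # reset locks the cache globally...
        both_hold_first_lock.wait()
        with path_lock:                # ...then blocks re-locking the path: deadlock
            pass

threads = [threading.Thread(target=f, daemon=True) for f in (reader, cache_reset)]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=2)
print("deadlocked" if any(t.is_alive() for t in threads) else "completed")
```

That mutual wait appears to match the hang reproduced by the test script further down in this thread (workers reading while the caches are reset globally).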

liath (Collaborator) commented Apr 27, 2017

I hate watching PRs sit, but I don't know enough about this to verify the change without reproduction steps. Could I trouble you to tell me how to trigger this bug?

danilop (Owner) commented Sep 11, 2017

Can anyone test this fix? I like the approach, but I don't know how to test the solution.

bilts (Contributor, Author) commented Sep 11, 2017

It's been a long time, but I dug up this test case that I believe was triggering the problem. It's a little arcane (and may no longer work), but the gist is you run a bunch of workers which all read some random bytes from yas3fs, then, while they're running, you reset the caches. Before this patch, they were locking up every time.

#!/bin/bash

trap "trap - SIGTERM && kill -- -$$" SIGINT SIGTERM EXIT

set -e

if [ "$#" -ne 5 ]; then
    echo "readrandom topicqueuename bucket filename chunksize workers"
    echo
    exit 1
fi

if ! hash aws; then
    sudo pip install awscli --ignore-installed six
fi

if hash sha1sum 2>/dev/null; then
    checksum=sha1sum
else
    checksum=shasum
fi

topicqueue=$1
bucket=$2
input=$3
size=$4
workers=$5

if pgrep -f yas3fs; then
    pgrep -f yas3fs | xargs sudo kill -9
    sudo fusermount -uz /myapp/data
fi
sudo rm -f /var/log/yas3fs.log;

sudo /usr/bin/python /usr/bin/yas3fs --region $AWS_REGION --log /var/log/yas3fs.log --log-mb-size 256 --debug --s3-num 100 --download-num 100 --no-metadata --read-retries-num 5 --download-retries-num 10 --cache-mem-size 2048 --cache-disk-size 20480 --cache-check 10 --cache-path /mnt/cache/data --topic arn:aws:sns:us-west-2:$AWS_ACCOUNT_ID:$topicqueue --new-queue s3://$bucket /myapp/data

filesize=$(ls -l $input | awk '{print $5}')
maxstart=$((filesize/size-1))

dowork() {
    SECONDS=0
    i=0
    ptag=$1
    pinput=$2
    psize=$3
    pmaxstart=$4
    pchecksum=$5

    while [[ 1 ]]; do
        randint=$(od -vAn -N8 -tu8 < /dev/urandom | tr -d '[[:space:]]' | tail -c +2 | sed 's/^0*//')
        start=$((randint % pmaxstart))
        out=$(dd if=$pinput bs=$psize skip=$start count=1 2> /dev/null | $pchecksum | awk '{print $1}')
        i=$((i+1))
        printf "[Proc %-3s] [%3s] [%s] [%10s] [%s] %s\n" $1 $SECONDS $(date -u +"%Y-%m-%dT%H:%M:%SZ") $i $out $start
    done
}

export -f dowork

seq 1 $workers | xargs -P $workers -n 1 -I {} bash -c "dowork {} $input $size $maxstart $checksum" &

sleep 5

echo 'Breaking...'

aws sns publish --region us-west-2 --topic-arn arn:aws:sns:$AWS_REGION:$AWS_ACCOUNT_ID:$topicqueue --message '{"default": "[ \"all-myapp-caches\", \"reset\" ]"}' --message-structure json

echo 'Waiting...'

sleep 2

pgrep -f dowork | xargs kill || true
pgrep -f yas3fs | xargs sudo kill -9 || true
sudo fusermount -uz /myapp/data

liath (Collaborator) commented Mar 26, 2018

Finally managed to get a test case that reliably triggers the deadlock and this patch does resolve the issue.

Could I get someone who actually knows python to take a look at the test and make sure I'm not testing nothing? https://gist.github.com/Liath/49fece9fc6dca640a4d9ecedb7b3c4a7

liath added a commit to liath/yas3fs that referenced this pull request Mar 28, 2018
liath added a commit to liath/yas3fs that referenced this pull request Mar 28, 2018