
Fix deadlock when a path is re-locked while the cache is locked globally #133

Open

wants to merge 1 commit into master
Conversation

bilts (Contributor) commented Sep 28, 2016

See comments for details on the deadlock condition.

I don't like that it relies on the internal property _RLock__owner, but it was the least invasive way I could think of to do this. I'd appreciate someone more in-the-know finding a better way.
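To make the deadlock condition concrete, here is a minimal, self-contained sketch (not yas3fs code; the lock names and the Barrier-based orchestration are purely illustrative, and it uses Python 3 for brevity). One thread holds a per-path lock and then waits for the global cache lock, while a cache-reset thread holds the global lock and tries to re-lock the same path; neither can proceed. The ownership check against _RLock__owner is, as I understand it, meant to let the code notice that situation instead of blocking, though that attribute is an implementation detail of CPython's pure-Python RLock.

```python
# Illustrative sketch only -- not yas3fs code. Two RLocks stand in for
# yas3fs's per-path lock and global cache lock; the barrier makes the
# interleaving deterministic so the deadlock reproduces every run.
import threading

path_lock = threading.RLock()          # stands in for one per-path lock
global_cache_lock = threading.RLock()  # stands in for the cache-wide lock
both_hold_first_lock = threading.Barrier(2)

def reader():
    with path_lock:                    # reader locks the path...
        both_hold_first_lock.wait()
        with global_cache_lock:        # ...then blocks waiting for the cache lock
            pass

def cache_reset():
    with global_cache_lock:            # reset locks the cache globally...
        both_hold_first_lock.wait()
        with path_lock:                # ...then blocks re-locking the path: deadlock
            pass

threads = [threading.Thread(target=f, daemon=True) for f in (reader, cache_reset)]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=2)
print("deadlocked" if any(t.is_alive() for t in threads) else "completed")
```

That mutual wait appears to match the hang reproduced by the test script further down in this thread (workers reading while the caches are reset globally).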

liath (Collaborator) commented Apr 27, 2017

I hate watching PRs sit, but I don't know enough about this to verify the change without reproduction steps. Could I trouble you to tell me how to trigger this bug?

danilop (Owner) commented Sep 11, 2017

Can anyone test this fix? I like the approach, but I don't know how to test the solution.

bilts (Contributor, Author) commented Sep 11, 2017

It's been a long time, but I dug up this test case that I believe was triggering the problem. It's a little arcane (and may no longer work), but the gist is you run a bunch of workers which all read some random bytes from yas3fs, then, while they're running, you reset the caches. Before this patch, they were locking up every time.

#!/bin/bash

trap "trap - SIGTERM && kill -- -$$" SIGINT SIGTERM EXIT

set -e

if [ "$#" -ne 5 ]; then
    echo "readrandom topicqueuename bucket filename chunksize workers"
    echo
    exit 1
fi

if ! hash aws; then
    sudo pip install awscli --ignore-installed six
fi

if hash sha1sum 2>/dev/null; then
    checksum=sha1sum
else
    checksum=shasum
fi

topicqueue=$1
bucket=$2
input=$3
size=$4
workers=$5

if pgrep -f yas3fs; then
    pgrep -f yas3fs | xargs sudo kill -9
    sudo fusermount -uz /myapp/data
fi
sudo rm -f /var/log/yas3fs.log;

sudo /usr/bin/python /usr/bin/yas3fs --region $AWS_REGION --log /var/log/yas3fs.log --log-mb-size 256 --debug --s3-num 100 --download-num 100 --no-metadata --read-retries-num 5 --download-retries-num 10 --cache-mem-size 2048 --cache-disk-size 20480 --cache-check 10 --cache-path /mnt/cache/data --topic arn:aws:sns:us-west-2:$AWS_ACCOUNT_ID:$topicqueue --new-queue s3://$bucket /myapp/data

filesize=$(ls -l $input | awk '{print $5}')
maxstart=$((filesize/size-1))

dowork() {
    SECONDS=0
    i=0
    ptag=$1
    pinput=$2
    psize=$3
    pmaxstart=$4
    pchecksum=$5

    while [[ 1 ]]; do
        randint=$(od -vAn -N8 -tu8 < /dev/urandom | tr -d '[[:space:]]' | tail -c +2 | sed 's/^0*//')
        start=$((randint % pmaxstart))
        out=$(dd if=$pinput bs=$psize skip=$start count=1 2> /dev/null | $pchecksum | awk '{print $1}')
        i=$((i+1))
        printf "[Proc %-3s] [%3s] [%s] [%10s] [%s] %s\n" $1 $SECONDS $(date -u +"%Y-%m-%dT%H:%M:%SZ") $i $out $start
    done
}

export -f dowork

seq 1 $workers | xargs -P $workers -n 1 -I {} bash -c "dowork {} $input $size $maxstart $checksum" &

sleep 5

echo 'Breaking...'

aws sns publish --region us-west-2 --topic-arn arn:aws:sns:$AWS_REGION:$AWS_ACCOUNT_ID:$topicqueue --message '{"default": "[ \"all-myapp-caches\", \"reset\" ]"}' --message-structure json

echo 'Waiting...'

sleep 2

pgrep -f dowork | xargs kill || true
pgrep -f yas3fs | xargs sudo kill -9 || true
sudo fusermount -uz /myapp/data

liath (Collaborator) commented Mar 26, 2018

Finally managed to get a test case that reliably triggers the deadlock and this patch does resolve the issue.

Could I get someone who actually knows python to take a look at the test and make sure I'm not testing nothing? https://gist.github.com/Liath/49fece9fc6dca640a4d9ecedb7b3c4a7

liath added a commit to liath/yas3fs that referenced this pull request Mar 28, 2018
liath added a commit to liath/yas3fs that referenced this pull request Mar 28, 2018