Tss 1158/related barrier merged 2 #722

Closed
wants to merge 82 commits into from
82 commits
87da53c
wip
osimuka Nov 3, 2023
f556e8e
installing dependencies
osimuka Nov 3, 2023
eee702b
fixing error on api
osimuka Nov 3, 2023
724fba0
reverting changes
osimuka Nov 3, 2023
2d48cdc
api returning related barriers
osimuka Nov 3, 2023
e8b21c2
isort fix
osimuka Nov 3, 2023
33ef1fd
need to install cuda for pytorch
osimuka Nov 6, 2023
b154a19
added treshold adjuster
osimuka Nov 6, 2023
e29a463
Merge branch 'feature/TSS-1064-display-related-barriers' of https://g…
osimuka Nov 6, 2023
0b27ced
circleci fix
osimuka Nov 6, 2023
d98f3be
fixing pytorch bug
osimuka Nov 7, 2023
a29ec77
issue with freeze time
osimuka Nov 7, 2023
5e607ee
fixing freegun issue
osimuka Nov 7, 2023
631bf80
updating freezegun
osimuka Nov 7, 2023
948d974
feat: fixing freezegun bug
osimuka Nov 7, 2023
b7978a4
fixing dependencies
osimuka Nov 7, 2023
7ddafc7
Merge branch 'master' into feature/TSS-1064-display-related-barriers
osimuka Nov 7, 2023
2f9835b
upgrading freezgun
osimuka Nov 7, 2023
f50c432
Merge branch 'feature/TSS-1064-display-related-barriers' of https://g…
osimuka Nov 7, 2023
42a57c7
fixing tests with freeze time
osimuka Nov 7, 2023
81e40eb
fixing broken tests
osimuka Nov 7, 2023
2b83bb6
fixing lint issues
osimuka Nov 7, 2023
9334681
fixing lint issues
osimuka Nov 7, 2023
ff3968f
removed dependencies on tokenize and nltk.
chris-pettinga Nov 8, 2023
a330d64
regenerating requirments
osimuka Nov 8, 2023
57b254a
increasing memory
chris-pettinga Nov 8, 2023
77a6930
boosting mem
chris-pettinga Nov 8, 2023
9adc4f0
Update Related barrier serializer
ferozerub Nov 9, 2023
77dbcbb
Merge branch 'feature/TSS-1064-display-related-barriers' of github.co…
ferozerub Nov 9, 2023
80001bc
removing some dependencies from pyproject.toml
chris-pettinga Nov 9, 2023
08e73e1
Merge remote-tracking branch 'origin/feature/TSS-1064-display-related…
chris-pettinga Nov 9, 2023
24127a5
rebuild requirements
osimuka Nov 9, 2023
231d9a7
removing memory hardpin
chris-pettinga Nov 9, 2023
4a8a143
trying
chris-pettinga Nov 9, 2023
2178a35
fixing requirements
osimuka Nov 10, 2023
91611f1
fixing dependencies
osimuka Nov 10, 2023
1d30c9f
fixing toml file space
osimuka Nov 10, 2023
107bc2b
refactor related barrier code
osimuka Nov 14, 2023
a0c7b49
changing back to normal settings
osimuka Nov 15, 2023
95bf432
fix lint issues
osimuka Nov 15, 2023
16304f0
refactoring the variable names and general stuff
chris-pettinga Nov 15, 2023
6eb1ae5
reverting changes
chris-pettinga Nov 15, 2023
7841310
using full summary
osimuka Nov 15, 2023
097accc
Merge branch 'feature/TSS-1064-display-related-barriers' of https://g…
osimuka Nov 15, 2023
8bb4785
adding ID to serializer
chris-pettinga Nov 15, 2023
d6309b9
Merge remote-tracking branch 'origin/feature/TSS-1064-display-related…
chris-pettinga Nov 15, 2023
36584d7
Fine tuning the model
ferozerub Nov 20, 2023
1fbbc2e
Remove GPU dependancies
ferozerub Nov 20, 2023
f01ab4d
Add extra detail to serialiser
ferozerub Nov 20, 2023
d5ee02a
Feature - TSS-1064 - Display related barriers deployment (#645)
chris-pettinga Dec 8, 2023
d5d8e82
getting rid of Barrier objects method as unnecessary, return a DataFr…
chris-pettinga Dec 8, 2023
ab2656c
work
chris-pettinga Dec 13, 2023
4592b6b
getting rid of Barrier objects method as unnecessary, return a DataFr…
chris-pettinga Dec 14, 2023
9981e5b
think it's working now, created a new BarrierEmbedding dict subclass …
chris-pettinga Dec 18, 2023
eddb5b7
moving to barrier_ids rather than barrier objects, reduced some dupli…
chris-pettinga Dec 18, 2023
e188dc0
moving to goold old numpy arrays, finally lol.
chris-pettinga Dec 18, 2023
d5d7ae2
fixing bugs
chris-pettinga Dec 18, 2023
e19e0b1
removing redundant code
chris-pettinga Dec 18, 2023
1dc358d
updating branch with master
osimuka Jan 12, 2024
a2f9de4
added freezegun
osimuka Jan 12, 2024
a1ec34b
fixing related barriers pyproject.toml file
osimuka Jan 12, 2024
219f6e5
new lock files
osimuka Jan 12, 2024
f502129
wip
osimuka Jan 12, 2024
6e2b6d7
updating rquirements
osimuka Jan 12, 2024
90ca979
wip
osimuka Jan 12, 2024
fa9d168
isort
osimuka Jan 12, 2024
51979b3
feat: removing configure
osimuka Jan 12, 2024
82fd4da
fixing freeze gon configure
osimuka Jan 12, 2024
f9ef219
removing configure
osimuka Jan 12, 2024
e3a44af
fixing error open_llama tokenization modules import
osimuka Jan 12, 2024
4c1919f
updated freeze libary
osimuka Jan 12, 2024
84d4478
align with master
osimuka Jan 25, 2024
ac4cf28
fixing poetry issues
osimuka Jan 25, 2024
c5bca39
merge
Feb 1, 2024
89a8345
import error
Feb 1, 2024
3265c43
format
Feb 1, 2024
be3d205
Fix - saving changes to barriers were failing due to comparison of st…
ferozerub Feb 7, 2024
1e8942e
Move to related-barriers app
Feb 8, 2024
2bd8790
format
Feb 8, 2024
06d652d
Remove local storage
ferozerub Feb 8, 2024
a15f405
Merge branch 'TSS-1158/related-barrier-merged' of github.com:uktrade/…
ferozerub Feb 8, 2024
e964caf
lock
Feb 23, 2024
4 changes: 3 additions & 1 deletion .gitignore
@@ -48,7 +48,7 @@ coverage.xml
.hypothesis/
/codecov.sh
/test-reports/
/.pytest_cache/
*/.pytest_cache/

# Translations
*.mo
@@ -129,3 +129,5 @@ src/
requirements-dev.txt
/local.env
config/settings/debug.py
/barrier_embeddings_dict.pkl
/barrier_similarity_scores_df.json
18 changes: 18 additions & 0 deletions .profile
@@ -0,0 +1,18 @@
#!/usr/bin/env bash
# custom initialisation tasks
# ref - https://docs.cloudfoundry.org/devguide/deploy-apps/deploy-app.html

echo "---- RUNNING release tasks (.profile) ------"

echo "---- Installing Related Barrier ML Packages ------"
/tmp/lifecycle/shell
python -m pip install sentence-transformers==2.2.2 --no-deps
python -m pip install torch==2.0.0 torchvision==0.15.1 --extra-index-url https://download.pytorch.org/whl/cpu

echo "---- Collecting static ------"
python manage.py collectstatic --noinput

echo "---- Apply Migrations ------"
python manage.py migrate

echo "---- FINISHED release tasks (.profile) ------"
2 changes: 1 addition & 1 deletion Makefile
@@ -81,7 +81,7 @@ pip-install: ## Install pip requirements inside the container.
@echo "$$(tput setaf 3)🙈 Installing Pip Packages 🙈$$(tput sgr 0)"
@docker-compose exec web poetry lock
@docker-compose exec web poetry export --without-hashes -f requirements.txt -o requirements.txt
@docker-compose exec web poetry export --dev --without-hashes -f requirements.txt -o requirements-dev.txt
@docker-compose exec web poetry export --with dev --without-hashes -f requirements.txt -o requirements-dev.txt
@docker-compose exec web pip install -r requirements-dev.txt
@docker-compose exec web sed -i '1i# ======\n# DO NOT EDIT - use pyproject.toml instead!\n# Generated: $(__timestamp)\n# ======' requirements.txt
@docker-compose exec web sed -i '1i# ======\n# DO NOT EDIT - use pyproject.toml instead!\n# Generated: $(__timestamp)\n# ======' requirements-dev.txt
4 changes: 4 additions & 0 deletions api/barriers/apps.py
@@ -1,4 +1,6 @@
from django.apps import AppConfig
from django.conf import settings
from django.core.cache import cache


class BarriersConfig(AppConfig):
@@ -48,3 +50,5 @@ def ready(self):
            public_barrier_light_touch_reviews_changed,
            sender=PublicBarrierLightTouchReviews,
        )

        cache.delete(settings.BARRIER_SIMILARITY_MATRIX_CACHE_KEY)
7 changes: 4 additions & 3 deletions api/barriers/models.py
@@ -23,9 +23,6 @@
from notifications_python_client.notifications import NotificationsAPIClient
from simple_history.models import HistoricalRecords

from api.barriers import validators
from api.barriers.report_stages import REPORT_CONDITIONS, report_stage_status
from api.barriers.utils import random_barrier_reference
from api.collaboration import models as collaboration_models
from api.commodities.models import Commodity
from api.commodities.utils import format_commodity_code
@@ -55,6 +52,10 @@
    PublicBarrierStatus,
)

from . import validators
from .report_stages import REPORT_CONDITIONS, report_stage_status
from .utils import random_barrier_reference

logger = logging.getLogger(__name__)

MAX_LENGTH = settings.CHAR_FIELD_MAX_LENGTH
21 changes: 20 additions & 1 deletion api/barriers/signals/handlers.py
@@ -1,7 +1,7 @@
import logging

from django.conf import settings
from django.db.models.signals import post_save
from django.db.models.signals import post_save, pre_save
from django.dispatch import receiver
from notifications_python_client.notifications import NotificationsAPIClient

@@ -21,6 +21,7 @@
    PublicBarrierLightTouchReviews,
)
from api.metadata.constants import TOP_PRIORITY_BARRIER_STATUS
from api.related_barriers import service as related_barrier_service

logger = logging.getLogger(__name__)

@@ -350,6 +351,24 @@ def barrier_completion_top_priority_barrier_resolved(
)


@receiver(pre_save, sender=Barrier)
def barrier_update_similarity_scores(sender, instance, *args, **kwargs):
    try:
        current_barrier_object = sender.objects.get(pk=instance.pk)
    except sender.DoesNotExist:
        pass  # the barrier is new, we handle this elsewhere
    else:
        changed = any(
            getattr(current_barrier_object, field) != getattr(instance, field)
            for field in related_barrier_service.RELEVANT_BARRIER_FIELDS
        )
        if changed and not current_barrier_object.draft:
            similarity_score_matrix = (
                related_barrier_service.SimilarityScoreMatrix.retrieve_matrix()
            )
            similarity_score_matrix.update_matrix(instance)


def barrier_changed_after_published(sender, instance, **kwargs):
    try:
        obj = sender.objects.get(pk=instance.pk)
5 changes: 3 additions & 2 deletions api/barriers/utils.py
@@ -6,14 +6,15 @@
CHARSET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def random_barrier_reference():
def random_barrier_reference() -> str:

    """
    function to produce a random reference number for barriers
    format: B-YY-XXXX
    where YY is year and Xs are random alpha-numerics
    """
    dd = datetime.datetime.now()
    ref_code = f"B-{str(dd.year)[-2:]}-"
    for i in range(settings.REF_CODE_LENGTH):
    for _ in range(settings.REF_CODE_LENGTH):
        ref_code += CHARSET[randrange(0, len(CHARSET))]
    return ref_code
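The B-YY-XXXX format described in the docstring can also be produced without the explicit loop. The following is only a minimal sketch and not part of this PR: the `ref_code_length` and `now` parameters are hypothetical stand-ins for `settings.REF_CODE_LENGTH` and the current time, added so the function can be exercised outside Django.

```python
import datetime
import random
import string

# Same alphabet as CHARSET in api/barriers/utils.py: digits, then uppercase letters.
CHARSET = string.digits + string.ascii_uppercase


def random_barrier_reference(ref_code_length=4, now=None):
    """Return a reference of the form B-YY-XXXX.

    ref_code_length and now are illustrative parameters (assumptions),
    not the signature used in the PR.
    """
    now = now or datetime.datetime.now()
    # random.choices draws ref_code_length characters with replacement.
    suffix = "".join(random.choices(CHARSET, k=ref_code_length))
    # %y formats the year as two digits, matching str(dd.year)[-2:].
    return f"B-{now:%y}-{suffix}"
```

Passing a fixed `now` makes the year part deterministic, which avoids freezing time in a test for this helper.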
5 changes: 5 additions & 0 deletions api/related_barriers/apps.py
@@ -0,0 +1,5 @@
from django.apps import AppConfig


class RelatedBarriersConfig(AppConfig):
    name = "api.related_barriers"
Empty file.
Empty file.
@@ -0,0 +1,10 @@
from django.core.management import BaseCommand

from api.related_barriers.service import SimilarityScoreMatrix


class Command(BaseCommand):
    help = "Compute and store all barrier similarity scores"

    def handle(self, *args, **options):
        SimilarityScoreMatrix.create_matrix()
15 changes: 15 additions & 0 deletions api/related_barriers/serializers.py
@@ -0,0 +1,15 @@
from rest_framework import serializers

from api.barriers.fields import StatusField


# TODO: the standard list serialiser may suffice, making the following unnecessary, depending on final designs
class BarrierRelatedListSerializer(serializers.Serializer):
    summary = serializers.CharField(read_only=True)
    title = serializers.CharField(read_only=True)
    id = serializers.UUIDField(read_only=True)
    reported_on = serializers.DateTimeField(read_only=True)
    modified_on = serializers.DateTimeField(read_only=True)
    status = StatusField(required=False)
    location = serializers.CharField(read_only=True)
    similarity = serializers.FloatField(read_only=True)
195 changes: 195 additions & 0 deletions api/related_barriers/service.py
@@ -0,0 +1,195 @@
import os

import numpy as np
import pandas as pd
from django.conf import settings
from django.core.cache import cache
from django.db.models import CharField, QuerySet
from django.db.models import Value as V
from django.db.models.functions import Concat
from sentence_transformers import SentenceTransformer, util
from typing_extensions import Self

from api.barriers.models import Barrier

SIMILARITY_THRESHOLD = 0.19
SIMILAR_BARRIERS_LIMIT = 5

# https://www.sbert.net/docs/pretrained_models.html
# Load the sentence transformer model
# albert-small-v2 # 43 MB size
# smaller model
transformer_model = SentenceTransformer("paraphrase-MiniLM-L3-v2")

RELEVANT_BARRIER_FIELDS = [
    "title",
    "summary",
]


class SimilarityScoreMatrix(pd.DataFrame):
    similarity_score_df_path = os.path.join(
        settings.ROOT_DIR, "barrier_similarity_scores_df.json"
    )

    @property
    def base_class_view(self):
        # use this to view the base class, needed for debugging in some IDEs.
        return pd.DataFrame(self)

    @property
    def _constructor(self):
        return SimilarityScoreMatrix

    @staticmethod
    def get_annotated_barrier_queryset(barrier_ids: list = None):
        if barrier_ids:
            starting_queryset = Barrier.objects.filter(id__in=barrier_ids)
        else:
            starting_queryset = Barrier.objects.all()

        annotated_queryset = (
            starting_queryset.exclude(draft=True)
            .annotate(
                barrier_corpus=Concat(
                    "title", V(". "), "summary", output_field=CharField()
                )
            )
            .values("id", "barrier_corpus")
        )
        if barrier_ids:
            annotated_queryset = annotated_queryset.filter(id__in=barrier_ids)
        return annotated_queryset

    @classmethod
    def create_matrix(cls) -> Self:
        """Create a similarity scores matrix for all barriers.

        The similarity scores matrix is a square matrix where the rows and columns are barrier ids and the value
        at each intersection is the similarity score between the two barriers.

        The similarity score matrix is then serialised to JSON and cached (writing it to a file is disabled)."""
        barriers = cls.get_annotated_barrier_queryset()
        barrier_ids = [str(barrier["id"]) for barrier in barriers]
        barrier_corpuses = [barrier["barrier_corpus"] for barrier in barriers]
        barrier_embeddings = transformer_model.encode(
            barrier_corpuses, convert_to_tensor=True
        )
        cosine_scores = util.cos_sim(barrier_embeddings, barrier_embeddings)
        new_matrix = cls(
            cosine_scores.numpy(),
            index=barrier_ids,
            columns=barrier_ids,
        )

        return new_matrix.save_matrix()  # saving to JSON and caching

    @classmethod
    def retrieve_matrix(cls) -> Self:
        """Retrieve the similarity scores matrix from the cache, creating it if it is not cached."""
        if similarity_score_json := cache.get(
            settings.BARRIER_SIMILARITY_MATRIX_CACHE_KEY
        ):
            return cls(pd.read_json(similarity_score_json))  # retrieving from cache
        # else:
        # if os.path.isfile(cls.similarity_score_df_path):
        # with open(cls.similarity_score_df_path, "r") as f:
        # dataframe_json = json.load(f)
        # cache.set( # caching the matrix, so it can be retrieved from the cache next time
        # settings.BARRIER_SIMILARITY_MATRIX_CACHE_KEY,
        # dataframe_json,
        # timeout=None,
        # )
        # return cls(pd.read_json(dataframe_json))
        else:
            return cls.create_matrix()  # creating the matrix if it doesn't exist

    def update_matrix(self, barrier_object) -> Self:
        """Given a barrier, update the similarity scores matrix column for that barrier.

        Similarity scores are computed using cosine similarity between the barrier embeddings.
        """
        barrier_id = str(barrier_object.id)

        if barrier_id not in self.columns:
            return self.add_barrier(barrier_object)

        barrier_ids = self.index.tolist()
        print("************ barrier ids :", barrier_ids)
        annotated_barrier_queryset = self.get_annotated_barrier_queryset(barrier_ids)
        barrier_corpuses = [
            barrier["barrier_corpus"] for barrier in annotated_barrier_queryset
        ]
        barrier_corpus = barrier_object.title + ". " + barrier_object.summary
        this_barrier_embedding = transformer_model.encode(
            barrier_corpus, convert_to_tensor=True
        )
        # TODO: We should store the barrier embeddings rather than regenerate them on every update
        # potential race/load conditions here where multiple barriers are updated at the same time
        all_barrier_embeddings = transformer_model.encode(
            barrier_corpuses, convert_to_tensor=True
        )
        # print("barrier embeddings : ", barrier_embeddings)
        # # this_barrier_embedding = annotated_barrier_queryset.get(id=barrier_id)[
        # # "barrier_corpus"
        # # ]
        # print("++++++++ barrier embedding :", this_barrier_embedding)
        similarity_scores = util.cos_sim(this_barrier_embedding, all_barrier_embeddings)
        self[barrier_id] = similarity_scores.numpy()[0]

        return self.save_matrix()

    def save_matrix(self) -> Self:
        """Serialise the similarity scores matrix to JSON and cache it."""

        # converting to string for JSON serialisation
        dataframe_json = self.to_json(default_handler=str)
        cache.set(
            settings.BARRIER_SIMILARITY_MATRIX_CACHE_KEY, dataframe_json, timeout=None
        )
        # with open(self.similarity_score_df_path, "w") as f:
        # json.dump(dataframe_json, f)

        return self

    def add_barrier(self, barrier_object) -> Self:
        """Add a barrier to the similarity scores matrix.

        This is made a little difficult by Pandas, so we create a new matrix with the new barrier and then return that
        instead."""
        barrier_id = str(barrier_object.id)

        self[barrier_id] = np.nan  # creating the empty column
        new_row = {}
        for barrier in self:
            new_row[barrier] = np.nan
        new_row = pd.Series(new_row, name=barrier_id)

        new_matrix = pd.concat([self, pd.DataFrame([new_row], columns=new_row.index)])
        return new_matrix.update_matrix(barrier_object)

    def retrieve_similar_barriers(
        self,
        barrier_object,
        limit=SIMILAR_BARRIERS_LIMIT,
        threshold=SIMILARITY_THRESHOLD,
    ) -> QuerySet:
        """Retrieve similar barriers for a given barrier object."""
        barrier_id = str(barrier_object.id)

        if barrier_id not in self.columns:
            new_matrix = self.add_barrier(barrier_object)
            return new_matrix.retrieve_similar_barriers(
                barrier_object, limit, threshold
            )

        barrier_similarity_scores = (
            self[barrier_id].drop(barrier_id).sort_values(ascending=False)[:limit]
        )
        barrier_similarity_scores = barrier_similarity_scores[
            barrier_similarity_scores > threshold
        ]
        barrier_queryset = Barrier.objects.filter(
            id__in=barrier_similarity_scores.index
        )
        return barrier_queryset
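The square matrix built by `create_matrix` and the lookup done by `retrieve_similar_barriers` can be illustrated without the model or Django. The following is a minimal sketch under stated assumptions: toy 2-dimensional vectors stand in for the output of `transformer_model.encode(...)`, nested dicts stand in for the DataFrame, and the helper names `cosine`, `build_matrix`, and `similar` are hypothetical. The default `limit` and `threshold` mirror `SIMILAR_BARRIERS_LIMIT` and `SIMILARITY_THRESHOLD`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def build_matrix(embeddings):
    """Square matrix as nested dicts: matrix[col_id][row_id] = similarity."""
    return {
        col: {row: cosine(embeddings[col], embeddings[row]) for row in embeddings}
        for col in embeddings
    }


def similar(matrix, barrier_id, limit=5, threshold=0.19):
    """Ids of the most similar barriers, excluding the barrier itself."""
    # Drop the self-match first so it does not use up one of the `limit` slots.
    scores = {k: v for k, v in matrix[barrier_id].items() if k != barrier_id}
    ranked = sorted(scores, key=scores.get, reverse=True)[:limit]
    # Keep only scores strictly above the threshold.
    return [k for k in ranked if scores[k] > threshold]


# Toy embeddings standing in for sentence-transformer output (assumed values).
embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
matrix = build_matrix(embeddings)
related = similar(matrix, "a")
```

In this toy data, "b" points almost the same way as "a" while "c" is orthogonal to it, so only "b" passes the threshold for "a".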
15 changes: 15 additions & 0 deletions api/related_barriers/urls.py
@@ -0,0 +1,15 @@
from django.urls import path
from rest_framework.routers import DefaultRouter

from api.related_barriers.views import related_barriers

app_name = "related_barriers"

router = DefaultRouter(trailing_slash=False)


urlpatterns = router.urls + [
    path(
        "barriers/<uuid:pk>/related-barriers", related_barriers, name="related-barriers"
    ),
]