Feature/tss-1064-display-related-barriers #638

Closed
wants to merge 66 commits into from
Commits
87da53c
wip
osimuka Nov 3, 2023
f556e8e
installing dependencies
osimuka Nov 3, 2023
eee702b
fixing error on api
osimuka Nov 3, 2023
724fba0
reverting changes
osimuka Nov 3, 2023
2d48cdc
api returning related barriers
osimuka Nov 3, 2023
e8b21c2
isort fix
osimuka Nov 3, 2023
33ef1fd
need to install cuda for pytorch
osimuka Nov 6, 2023
b154a19
added treshold adjuster
osimuka Nov 6, 2023
e29a463
Merge branch 'feature/TSS-1064-display-related-barriers' of https://g…
osimuka Nov 6, 2023
0b27ced
circleci fix
osimuka Nov 6, 2023
d98f3be
fixing pytorch bug
osimuka Nov 7, 2023
a29ec77
issue with freeze time
osimuka Nov 7, 2023
5e607ee
fixing freegun issue
osimuka Nov 7, 2023
631bf80
updating freezegun
osimuka Nov 7, 2023
948d974
feat: fixing freezegun bug
osimuka Nov 7, 2023
b7978a4
fixing dependencies
osimuka Nov 7, 2023
7ddafc7
Merge branch 'master' into feature/TSS-1064-display-related-barriers
osimuka Nov 7, 2023
2f9835b
upgrading freezgun
osimuka Nov 7, 2023
f50c432
Merge branch 'feature/TSS-1064-display-related-barriers' of https://g…
osimuka Nov 7, 2023
42a57c7
fixing tests with freeze time
osimuka Nov 7, 2023
81e40eb
fixing broken tests
osimuka Nov 7, 2023
2b83bb6
fixing lint issues
osimuka Nov 7, 2023
9334681
fixing lint issues
osimuka Nov 7, 2023
ff3968f
removed dependencies on tokenize and nltk.
chris-pettinga Nov 8, 2023
a330d64
regenerating requirments
osimuka Nov 8, 2023
57b254a
increasing memory
chris-pettinga Nov 8, 2023
77a6930
boosting mem
chris-pettinga Nov 8, 2023
9adc4f0
Update Related barrier serializer
ferozerub Nov 9, 2023
77dbcbb
Merge branch 'feature/TSS-1064-display-related-barriers' of github.co…
ferozerub Nov 9, 2023
80001bc
removing some dependencies from pyproject.toml
chris-pettinga Nov 9, 2023
08e73e1
Merge remote-tracking branch 'origin/feature/TSS-1064-display-related…
chris-pettinga Nov 9, 2023
24127a5
rebuild requirements
osimuka Nov 9, 2023
231d9a7
removing memory hardpin
chris-pettinga Nov 9, 2023
4a8a143
trying
chris-pettinga Nov 9, 2023
2178a35
fixing requirements
osimuka Nov 10, 2023
91611f1
fixing dependencies
osimuka Nov 10, 2023
1d30c9f
fixing toml file space
osimuka Nov 10, 2023
107bc2b
refactor related barrier code
osimuka Nov 14, 2023
a0c7b49
changing back to normal settings
osimuka Nov 15, 2023
95bf432
fix lint issues
osimuka Nov 15, 2023
16304f0
refactoring the variable names and general stuff
chris-pettinga Nov 15, 2023
6eb1ae5
reverting changes
chris-pettinga Nov 15, 2023
7841310
using full summary
osimuka Nov 15, 2023
097accc
Merge branch 'feature/TSS-1064-display-related-barriers' of https://g…
osimuka Nov 15, 2023
8bb4785
adding ID to serializer
chris-pettinga Nov 15, 2023
d6309b9
Merge remote-tracking branch 'origin/feature/TSS-1064-display-related…
chris-pettinga Nov 15, 2023
36584d7
Fine tuning the model
ferozerub Nov 20, 2023
1fbbc2e
Remove GPU dependancies
ferozerub Nov 20, 2023
f01ab4d
Add extra detail to serialiser
ferozerub Nov 20, 2023
d5ee02a
Feature - TSS-1064 - Display related barriers deployment (#645)
chris-pettinga Dec 8, 2023
4592b6b
getting rid of Barrier objects method as unnecessary, return a DataFr…
chris-pettinga Dec 14, 2023
1dc358d
updating branch with master
osimuka Jan 12, 2024
a2f9de4
added freezegun
osimuka Jan 12, 2024
a1ec34b
fixing related barriers pyproject.toml file
osimuka Jan 12, 2024
219f6e5
new lock files
osimuka Jan 12, 2024
f502129
wip
osimuka Jan 12, 2024
6e2b6d7
updating rquirements
osimuka Jan 12, 2024
90ca979
wip
osimuka Jan 12, 2024
fa9d168
isort
osimuka Jan 12, 2024
51979b3
feat: removing configure
osimuka Jan 12, 2024
82fd4da
fixing freeze gon configure
osimuka Jan 12, 2024
f9ef219
removing configure
osimuka Jan 12, 2024
e3a44af
fixing error open_llama tokenization modules import
osimuka Jan 12, 2024
4c1919f
updated freeze libary
osimuka Jan 12, 2024
84d4478
align with master
osimuka Jan 25, 2024
ac4cf28
fixing poetry issues
osimuka Jan 25, 2024
2 changes: 1 addition & 1 deletion .gitignore
@@ -48,7 +48,7 @@ coverage.xml
 .hypothesis/
 /codecov.sh
 /test-reports/
-/.pytest_cache/
+*/.pytest_cache/

 # Translations
 *.mo
18 changes: 18 additions & 0 deletions .profile
@@ -0,0 +1,18 @@
+#!/usr/bin/env bash
+# custom initialisation tasks
+# ref - https://docs.cloudfoundry.org/devguide/deploy-apps/deploy-app.html
+
+echo "---- RUNNING release tasks (.profile) ------"
+
+echo "---- Installing Related Barrier ML Packages ------"
+/tmp/lifecycle/shell
+python -m pip install sentence-transformers==2.2.2 --no-deps
+python -m pip install torch==2.0.0 torchvision==0.15.1 --extra-index-url https://download.pytorch.org/whl/cpu
+
+echo "---- Collecting static ------"
+python manage.py collectstatic --noinput
+
+echo "---- Apply Migrations ------"
+python manage.py migrate
+
+echo "---- FINISHED release tasks (.profile) ------"
2 changes: 1 addition & 1 deletion Makefile
@@ -81,7 +81,7 @@ pip-install: ## Install pip requirements inside the container.
 	@echo "$$(tput setaf 3)🙈 Installing Pip Packages 🙈$$(tput sgr 0)"
 	@docker-compose exec web poetry lock
 	@docker-compose exec web poetry export --without-hashes -f requirements.txt -o requirements.txt
-	@docker-compose exec web poetry export --dev --without-hashes -f requirements.txt -o requirements-dev.txt
+	@docker-compose exec web poetry export --with dev --without-hashes -f requirements.txt -o requirements-dev.txt
 	@docker-compose exec web pip install -r requirements-dev.txt
 	@docker-compose exec web sed -i '1i# ======\n# DO NOT EDIT - use pyproject.toml instead!\n# Generated: $(__timestamp)\n# ======' requirements.txt
 	@docker-compose exec web sed -i '1i# ======\n# DO NOT EDIT - use pyproject.toml instead!\n# Generated: $(__timestamp)\n# ======' requirements-dev.txt
75 changes: 75 additions & 0 deletions api/barriers/related_barrier.py
@@ -0,0 +1,75 @@
+import numpy as np
+import pandas as pd
+from django.db.models import QuerySet
+from sentence_transformers import SentenceTransformer, util
+
+SIMILARITY_THRESHOLD = 0.19
+
+# https://www.sbert.net/docs/pretrained_models.html
+# Load the sentence transformer model
+# albert-small-v2 # 43 MB size
+# smaller model
+transformer_model = SentenceTransformer("paraphrase-MiniLM-L3-v2")
+
+
+def query_set_to_pandas_df(queryset: QuerySet) -> pd.DataFrame:
+    """
+    function to convert a django query set to a pandas dataframe
+    """
+
+    return pd.DataFrame.from_records(queryset.values())
+
+
+def __get_similar_barriers(
+    barrier_row: pd.DataFrame, barrier_id: str, df: pd.DataFrame, limit: int
+) -> pd.DataFrame:
+    """
+    function to get similar barriers based on cosine similarity
+    """
+
+    # Obtain embeddings for each processed_text
+    df["embeddings"] = df["barrier_corpus"].apply(
+        lambda x: transformer_model.encode(x, convert_to_tensor=True)
+    )
+
+    # Create a matrix of the embeddings
+    embeddings_matrix = np.vstack(df["embeddings"].to_numpy())
+
+    # Obtain embeddings for title
+    barrier_embeddings = transformer_model.encode(
+        barrier_row["barrier_corpus"].values[0]
+    )
+
+    # Calculate cosine similarity between title and all other processed_text
+    cosine_scores = util.cos_sim(barrier_embeddings, embeddings_matrix)[0]
+
+    # Add cosine scores to dataframe
+    df["similarity"] = cosine_scores
+
+    # Sort dataframe by cosine scores
+    df = df[df["similarity"] > SIMILARITY_THRESHOLD].sort_values(
+        by=["similarity"], ascending=False
+    )
+
+    # removing the barrier itself from the dataframe
+    df = df[df["id"] != barrier_id]
+
+    # trimming the dataframe according to the defined limit
+    return df.head(limit)
+
+
+def get_similar_barriers(
+    values_query_set: QuerySet, barrier_id: str, limit: int
+) -> pd.DataFrame:
+    df = query_set_to_pandas_df(values_query_set)
+
+    # Getting the barrier row from the dataframe
+    barrier_row = df[df["id"] == barrier_id].copy()
+
+    # Check if title exists
+    if barrier_row.empty:
+        raise ValueError("Barrier ID not found in data")
+
+    df = __get_similar_barriers(barrier_row, barrier_id, df, limit)
+
+    return df
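Review note: the ranking logic in `api/barriers/related_barrier.py` (score every barrier against the target, keep scores above `SIMILARITY_THRESHOLD`, sort descending, drop the target itself, cut to `limit`) can be illustrated without loading the model. The sketch below is not the PR's code. It uses plain Python and hand-written 2-d vectors in place of sentence embeddings, and the names `cosine`, `rank_similar` and `toy` are hypothetical.

```python
import math

SIMILARITY_THRESHOLD = 0.19  # same value as the constant in the diff


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_similar(target_id, embeddings, limit):
    """Return (id, score) pairs above the threshold, best first, excluding the target."""
    target = embeddings[target_id]
    scored = [
        (other_id, cosine(target, vec))
        for other_id, vec in embeddings.items()
        if other_id != target_id
    ]
    kept = [pair for pair in scored if pair[1] > SIMILARITY_THRESHOLD]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]


# Toy vectors standing in for model embeddings (hypothetical data).
toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
ranked = rank_similar("a", toy, limit=20)
```

With these toy vectors, `c` is orthogonal to `a` and falls below the threshold, so only `b` and `d` are returned, in that order.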
6 changes: 5 additions & 1 deletion api/barriers/serializers/__init__.py
@@ -1,4 +1,8 @@
-from .barriers import BarrierDetailSerializer, BarrierListSerializer # noqa
+from .barriers import ( # noqa
+    BarrierDetailSerializer,
+    BarrierListSerializer,
+    BarrierRelatedListSerializer,
+)
 from .csv import BarrierCsvExportSerializer # noqa
 from .data_workspace import DataWorkspaceSerializer # noqa
 from .progress_updates import ProgressUpdateSerializer # noqa
12 changes: 12 additions & 0 deletions api/barriers/serializers/barriers.py
@@ -133,3 +133,15 @@ def get_current_valuation_assessment(instance):
             return f"{rating}"
         else:
             return None
+
+
+# TODO : standard list serialiser may suffice and the following not required base on final designs
+class BarrierRelatedListSerializer(serializers.Serializer):
+    summary = serializers.CharField(read_only=True)
+    title = serializers.CharField(read_only=True)
+    id = serializers.UUIDField(read_only=True)
+    reported_on = serializers.DateTimeField(read_only=True)
+    modified_on = serializers.DateTimeField(read_only=True)
+    status = StatusField(required=False)
+    location = serializers.CharField(read_only=True)
+    similarity = serializers.FloatField(read_only=True)
4 changes: 4 additions & 0 deletions api/barriers/urls.py
@@ -24,6 +24,7 @@
     PublicBarrierActivity,
     PublicBarrierViewSet,
     barrier_count,
+    related_barriers,
 )

 app_name = "barriers"
@@ -175,6 +176,9 @@
         BarrierStatusChangeUnknown.as_view(),
         name="unknown-barrier",
     ),
+    path(
+        "barriers/<uuid:pk>/related-barriers", related_barriers, name="related-barriers"
+    ),
     path("counts", barrier_count, name="barrier-count"),
     path("reports", BarrierReportList.as_view(), name="list-reports"),
     path("reports/<uuid:pk>", BarrierReportDetail.as_view(), name="get-report"),
5 changes: 3 additions & 2 deletions api/barriers/utils.py
@@ -6,14 +6,15 @@
 CHARSET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


-def random_barrier_reference():
+def random_barrier_reference() -> str:
+
     """
     function to produce a random reference number for barriers
     format: B-YY-XXXX
     where YY is year and Xs are random alpha-numerics
     """
     dd = datetime.datetime.now()
     ref_code = f"B-{str(dd.year)[-2:]}-"
-    for i in range(settings.REF_CODE_LENGTH):
+    for _ in range(settings.REF_CODE_LENGTH):
         ref_code += CHARSET[randrange(0, len(CHARSET))]
     return ref_code
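Review note: a self-contained sketch of the `B-YY-XXXX` reference format described in the docstring of `random_barrier_reference`, for anyone who wants to try it outside Django. It is an illustration, not the PR's code. `REF_CODE_LENGTH = 4` is an assumption standing in for `settings.REF_CODE_LENGTH` (inferred from the four Xs in the docstring), and the optional `now` parameter is a hypothetical addition that lets the year be pinned without freezing time.

```python
import datetime
from random import randrange

CHARSET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
REF_CODE_LENGTH = 4  # assumption: stands in for settings.REF_CODE_LENGTH


def random_barrier_reference(now=None) -> str:
    """Build a reference of the form B-YY-XXXX from a two-digit year and random characters."""
    now = now or datetime.datetime.now()
    suffix = "".join(
        CHARSET[randrange(len(CHARSET))] for _ in range(REF_CODE_LENGTH)
    )
    # str(year)[-2:] mirrors the slicing used in the diff
    return f"B-{str(now.year)[-2:]}-{suffix}"


ref = random_barrier_reference(datetime.datetime(2023, 11, 3))
```

Passing the date in explicitly keeps a test of the year prefix deterministic. Only the random suffix varies between calls.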
28 changes: 27 additions & 1 deletion api/barriers/views.py
@@ -5,7 +5,11 @@

 from dateutil.parser import parse
 from django.db import transaction
-from django.db.models import Case, CharField, Count, F, Value, When
+from django.db.models import Case, CharField, Count, F
+from django.db.models import Value
+from django.db.models import Value as V
+from django.db.models import When
+from django.db.models.functions import Concat
 from django.http import JsonResponse, StreamingHttpResponse
 from django.shortcuts import get_object_or_404
 from django.utils import timezone
@@ -34,6 +38,7 @@
     BarrierCsvExportSerializer,
     BarrierDetailSerializer,
     BarrierListSerializer,
+    BarrierRelatedListSerializer,
     BarrierReportSerializer,
     PublicBarrierSerializer,
 )
@@ -66,6 +71,7 @@

 from .models import BarrierFilterSet, BarrierProgressUpdate, PublicBarrierFilterSet
 from .public_data import public_release_to_s3
+from .related_barrier import get_similar_barriers
 from .tasks import generate_s3_and_send_email

 logger = logging.getLogger(__name__)
@@ -1199,3 +1205,23 @@ def update(self, request, *args, **kwargs):
             instance._prefetched_objects_cache = {}

         return Response(serializer.data)
+
+
+@api_view(["GET"])
+def related_barriers(request, pk) -> Response:
+    """
+    Return a list of related barriers
+    """
+    values_query_set = (
+        Barrier.objects.exclude(draft=True)
+        .annotate(
+            barrier_corpus=Concat("title", V(". "), "summary", output_field=CharField())
+        )
+        .values("id", "barrier_corpus")
+    )
+
+    related_barriers_df = get_similar_barriers(values_query_set, pk, limit=20)
+    serializer = BarrierRelatedListSerializer(
+        related_barriers_df.to_dict("records"), many=True
+    )
+    return Response(serializer.data)
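Review note: the `barrier_corpus` annotation in `related_barriers` joins each barrier's title and summary with `". "` before the text is embedded. The sketch below shows the rough shape of the rows that reach `get_similar_barriers`, using plain dictionaries instead of a queryset. `build_corpus` and the sample row are hypothetical, and database NULL handling is left out of this sketch.

```python
def build_corpus(rows):
    """Mimic Concat("title", V(". "), "summary") for a list of plain dicts."""
    return [
        {"id": row["id"], "barrier_corpus": f'{row["title"]}. {row["summary"]}'}
        for row in rows
    ]


# Hypothetical sample row; real rows come from Barrier.objects in the view.
rows = [{"id": 1, "title": "Tariff on steel", "summary": "Import duty raised"}]
corpus = build_corpus(rows)
```

Each resulting record carries only `id` and `barrier_corpus`, matching the two fields named in the view's `.values(...)` call.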
3 changes: 3 additions & 0 deletions docker/local/Dockerfile
@@ -28,3 +28,6 @@ RUN pip install --upgrade pip \
     && poetry export --without-hashes -f requirements.txt -o requirements.txt \
     && poetry export --dev --without-hashes -f requirements.txt -o requirements-dev.txt \
     && pip install -r requirements-dev.txt
+
+# install sentence-transformers model for similarity search
+RUN python -c 'from sentence_transformers import SentenceTransformer; model = SentenceTransformer("paraphrase-MiniLM-L3-v2"); model.save(".models")'