Update Pipfile #32

Merged (13 commits) on Nov 1, 2024
21 changes: 21 additions & 0 deletions .github/workflows/ci.yml
@@ -23,3 +23,24 @@ jobs:

- name: Run tests
run: pipenv run pytest

ruff:
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v4

- name: Install Python
uses: actions/setup-python@v5
with:
python-version: "3.11"

- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install ruff

# Update output format to enable automatic inline annotations.
- name: Run Ruff
run: ruff check --output-format=github .
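
To reproduce this check locally, `ruff check .` from the project root should report the same findings; `--output-format=github` only changes how the results are annotated inline on the PR.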

8 changes: 8 additions & 0 deletions .vscode/extensions.json
@@ -0,0 +1,8 @@
{
"recommendations": [
"charliermarsh.ruff"
],
"unwantedRecommendations": [

]
}
9 changes: 9 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,9 @@
{
"[python]": {
"editor.formatOnSave": true,
"editor.defaultFormatter": "charliermarsh.ruff",
"editor.codeActionsOnSave": {
"source.organizeImports": "explicit"
}
}
}
11 changes: 6 additions & 5 deletions Pipfile
@@ -4,13 +4,14 @@ verify_ssl = true
name = "pypi"

[packages]
requests = "==2.26.0"
python-dotenv = "==1.0.1"
tqdm = "==4.66.5"
pytest = "==8.3.3"
pytest-cov = "==5.0.0"
python-dotenv = "~=1.0"
requests = "~=2.26"
ruff = "~=0.7"
tqdm = "~=4.66"

[dev-packages]
pytest = "~=8.3"
pytest-cov = "~=5.0"

[requires]
python_version = "3.12"
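
The pins also move from exact `==` versions to compatible-release `~=` specifiers: `requests = "~=2.26"` accepts any 2.x release at or above 2.26, but not 3.0. A minimal sketch of the semantics, using the third-party `packaging` library (an assumption; it is not one of this project's dependencies):

from packaging.specifiers import SpecifierSet

compatible = SpecifierSet("~=2.26")  # equivalent to ">=2.26, ==2.*"
print(compatible.contains("2.26.0"))  # True
print(compatible.contains("2.31.0"))  # True
print(compatible.contains("3.0.0"))   # False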
272 changes: 257 additions & 15 deletions Pipfile.lock

Large diffs are not rendered by default.

12 changes: 11 additions & 1 deletion README.md
@@ -10,7 +10,7 @@ Currently, we are only accepting contributions from members of the project who m

This code uses Python 3. It is tested on Python 3.12, but will probably work on versions back to 3.9.

To install the project dependencies, first install pipenv globally with `pip install pipenv`. Then create a virtual env/install dependencies with `pipenv install`.
To install the project dependencies, first install pipenv globally with `pip install pipenv`. Then create a virtual env/install dependencies with `pipenv install --dev`.

To run code in the project, prefix your command with `pipenv run`, a la `pipenv run python -m mediabridge.main`.

@@ -20,3 +20,13 @@ To run unit tests,

1. Ensure `pipenv` is installed
2. Run `pipenv run pytest`

There is a GitHub Actions check that runs the tests; it must pass before your PR can be merged.
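
To run a single test file, pass its path to pytest, e.g. `pipenv run pytest mediabridge/data_processing/wiki_to_netflix_test.py`.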

## Code formatting

We use [ruff](https://docs.astral.sh/ruff/) for code formatting, linting, and import sorting. If you've installed the project following the instructions above, you'll have access to the `ruff` binary.

The repo comes with a `.vscode` directory containing a recommendation for the ruff extension, as well as settings that make ruff your Python formatter and that format code and sort imports on save. If you're not using VSCode, you can run `ruff format` from the project root directory to format all Python code.

There is a GitHub Actions check for code formatting, which will fail if your PR contains unformatted code.
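
A typical local workflow (a suggestion, not project policy) is `pipenv run ruff format .` to format and `pipenv run ruff check --fix .` to lint and apply safe fixes, mirroring what the CI checks enforce.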
2 changes: 1 addition & 1 deletion mediabridge/config/setting.py
@@ -1 +1 @@
# Configuration settings (e.g., MongoDB URI, paths)
# Configuration settings (e.g., MongoDB URI, paths)
2 changes: 1 addition & 1 deletion mediabridge/data_processing/build_matrices.py
@@ -1 +1 @@
# Scripts to build interaction and feature matrices
# Scripts to build interaction and feature matrices
2 changes: 1 addition & 1 deletion mediabridge/data_processing/preprocess.py
@@ -1 +1 @@
# Data preprocessing scripts (e.g., feature extraction)
# Data preprocessing scripts (e.g., feature extraction)
118 changes: 75 additions & 43 deletions mediabridge/data_processing/wiki_to_netflix.py
@@ -1,45 +1,61 @@
import requests
import csv
import os
import sys
import time

import requests
from tqdm import tqdm
import sys


class WikidataServiceTimeoutException(Exception):
pass

data_dir = os.path.join(os.path.dirname(__file__), '../../data')
out_dir = os.path.join(os.path.dirname(__file__), '../../out')
user_agent = 'Noisebridge MovieBot 0.0.1/Audiodude <[email protected]>'

data_dir = os.path.join(os.path.dirname(__file__), "../../data")
out_dir = os.path.join(os.path.dirname(__file__), "../../out")
user_agent = "Noisebridge MovieBot 0.0.1/Audiodude <[email protected]>"


# Reading netflix text file
def read_netflix_txt(txt_file, test):
num_rows = None
if test == True:
if test:
num_rows = 100

with open(txt_file, "r", encoding = "ISO-8859-1") as netflix_data:
with open(txt_file, "r", encoding="ISO-8859-1") as netflix_data:
for i, line in enumerate(netflix_data):
if num_rows is not None and i >= num_rows:
break
yield line.rstrip().split(',', 2)
yield line.rstrip().split(",", 2)


# Writing netflix csv file
def create_netflix_csv(csv_name, data_list):
with open(csv_name, 'w') as netflix_csv:
def create_netflix_csv(csv_name, data_list):
with open(csv_name, "w") as netflix_csv:
csv.writer(netflix_csv).writerows(data_list)


# Extracting movie info from Wiki data
def wiki_feature_info(data, key):
if len(data['results']['bindings']) < 1 or key not in data['results']['bindings'][0]:
if (
len(data["results"]["bindings"]) < 1
or key not in data["results"]["bindings"][0]
):
return None
if key == 'genreLabel':
return list({d['genreLabel']['value'] for d in data['results']['bindings'] if 'genreLabel' in d})
return data['results']['bindings'][0][key]['value'].split('/')[-1]
if key == "genreLabel":
return list(
{
d["genreLabel"]["value"]
for d in data["results"]["bindings"]
if "genreLabel" in d
}
)
return data["results"]["bindings"][0][key]["value"].split("/")[-1]


# Formatting SPARQL query for Wiki data
def format_sparql_query(title, year):
QUERY = '''
QUERY = """
SELECT * WHERE {
SERVICE wikibase:mwapi {
bd:serviceParam wikibase:api "EntitySearch" ;
@@ -83,15 +99,16 @@ def format_sparql_query(title, year):
SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}

'''
return QUERY % {'Title': title, 'Year': year}
"""
return QUERY % {"Title": title, "Year": year}


# Getting list of movie IDs, genre IDs, and director IDs from request
def wiki_query(data_csv, user_agent):
wiki_movie_ids = []
wiki_genres = []
wiki_directors = []

for row in tqdm(data_csv):
if row[1] is None:
continue
@@ -101,61 +118,76 @@ def wiki_query(data_csv, user_agent):
tries = 0
while True:
try:
response = requests.post('https://query.wikidata.org/sparql',
headers={'User-Agent': user_agent},
data={
'query': SPARQL,
'format': 'json',
},
timeout=20,
response = requests.post(
"https://query.wikidata.org/sparql",
headers={"User-Agent": user_agent},
data={
"query": SPARQL,
"format": "json",
},
timeout=20,
)
break
except requests.exceptions.Timeout:
wait_time = 2 ** tries * 5
wait_time = 2**tries * 5
time.sleep(wait_time)
tries += 1
if tries > 5:
raise WikidataServiceTimeoutException(
f'Tried {tries} time, could not reach Wikidata '
f'(movie: {row[2]} {row[1]})'
f"Tried {tries} time, could not reach Wikidata "
f"(movie: {row[2]} {row[1]})"
)

response.raise_for_status()
data = response.json()
wiki_movie_ids.append(wiki_feature_info(data, 'item'))
wiki_genres.append(wiki_feature_info(data, 'genreLabel'))
wiki_directors.append(wiki_feature_info(data, 'directorLabel'))

wiki_movie_ids.append(wiki_feature_info(data, "item"))
wiki_genres.append(wiki_feature_info(data, "genreLabel"))
wiki_directors.append(wiki_feature_info(data, "directorLabel"))

return wiki_movie_ids, wiki_genres, wiki_directors


# Calling all functions
def process_data(test=False):
missing_count = 0
processed_data = []

netflix_data = read_netflix_txt(os.path.join(data_dir, 'movie_titles.txt'), test)
netflix_data = read_netflix_txt(os.path.join(data_dir, "movie_titles.txt"), test)

netflix_csv = os.path.join(out_dir, 'movie_titles.csv')
netflix_csv = os.path.join(out_dir, "movie_titles.csv")

wiki_movie_ids_list, wiki_genres_list, wiki_directors_list = wiki_query(netflix_data, user_agent)
wiki_movie_ids_list, wiki_genres_list, wiki_directors_list = wiki_query(
netflix_data, user_agent
)

num_rows = len(wiki_movie_ids_list)

for index, row in enumerate(netflix_data):
netflix_id, year, title = row
if wiki_movie_ids_list[index] is None:
missing_count += 1
movie = [netflix_id, wiki_movie_ids_list[index], title, year, wiki_genres_list[index], wiki_directors_list[index]]
movie = [
netflix_id,
wiki_movie_ids_list[index],
title,
year,
wiki_genres_list[index],
wiki_directors_list[index],
]
processed_data.append(movie)

create_netflix_csv(netflix_csv, processed_data)

print(f'missing: {missing_count} ({missing_count / num_rows * 100}%)')
print(f'found: {num_rows - missing_count} ({(num_rows - missing_count) / num_rows * 100}%)')
print(f'total: {num_rows}')
print(f"missing: {missing_count} ({missing_count / num_rows * 100}%)")
print(
f"found: {num_rows - missing_count} ({(num_rows - missing_count) / num_rows * 100}%)"
)
print(f"total: {num_rows}")


if __name__ == '__main__':
if __name__ == "__main__":
# Test is true if no argument is passed or if the first argument is not '--prod'.
test = len(sys.argv) < 2 or sys.argv[1] != '--prod'
test = len(sys.argv) < 2 or sys.argv[1] != "--prod"
process_data(test=test)
process_data(test=test)
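
The retry loop in wiki_query backs off exponentially (5 s, 10 s, 20 s, ...) between timed-out requests and gives up after five retries. A minimal standalone sketch of the same pattern, with the illustrative name post_with_backoff (not part of the module):

import time

import requests


def post_with_backoff(url, max_tries=5, base_delay=5, **kwargs):
    # Retry on timeout, waiting 5 s, 10 s, 20 s, ... between attempts.
    tries = 0
    while True:
        try:
            return requests.post(url, timeout=20, **kwargs)
        except requests.exceptions.Timeout:
            if tries >= max_tries:
                raise
            time.sleep(2**tries * base_delay)
            tries += 1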
5 changes: 3 additions & 2 deletions mediabridge/data_processing/wiki_to_netflix_test.py
@@ -1,6 +1,7 @@
from wiki_to_netflix import format_sparql_query, wiki_query, process_data
from wiki_to_netflix import format_sparql_query
from wiki_to_netflix_test_data import EXPECTED_SPARQL_QUERY


def test_format_sparql_query():
QUERY = format_sparql_query("The Room", 2003)
assert QUERY == EXPECTED_SPARQL_QUERY
assert QUERY == EXPECTED_SPARQL_QUERY
4 changes: 2 additions & 2 deletions mediabridge/data_processing/wiki_to_netflix_test_data.py
@@ -1,4 +1,4 @@
EXPECTED_SPARQL_QUERY ='''
EXPECTED_SPARQL_QUERY = """
SELECT * WHERE {
SERVICE wikibase:mwapi {
bd:serviceParam wikibase:api "EntitySearch" ;
@@ -42,4 +42,4 @@
SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}

'''
"""
2 changes: 1 addition & 1 deletion mediabridge/db/connect.py
@@ -1 +1 @@
# MongoDB connection setup
# MongoDB connection setup
2 changes: 1 addition & 1 deletion mediabridge/db/queries.py
@@ -1 +1 @@
# Functions to query MongoDB for movies and interactions
# Functions to query MongoDB for movies and interactions
2 changes: 1 addition & 1 deletion mediabridge/main.py
@@ -1,4 +1,4 @@
from mediabridge.data_processing import wiki_to_netflix

q = wiki_to_netflix.format_sparql_query('The Room', 2003)
q = wiki_to_netflix.format_sparql_query("The Room", 2003)
print(q)
2 changes: 1 addition & 1 deletion mediabridge/models/predict.py
@@ -1 +1 @@
# Script to make predictions using the trained model
# Script to make predictions using the trained model
2 changes: 1 addition & 1 deletion mediabridge/models/train_model.py
@@ -1 +1 @@
# Script to train the LightFM model
# Script to train the LightFM model
2 changes: 1 addition & 1 deletion mediabridge/models/utils.py
@@ -1 +1 @@
# Utility functions (e.g., for building matrices)
# Utility functions (e.g., for building matrices)
2 changes: 2 additions & 0 deletions ruff.toml
@@ -0,0 +1,2 @@
# Default selections for ruff, plus isort.
lint.select = ["E4", "E7", "E9", "F", "I001"]
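
For reference: `E4`, `E7`, and `E9` are the import, statement, and runtime-error groups from pycodestyle and `F` is Pyflakes (together, ruff's default selection), while `I001` is the isort-style "unsorted imports" rule. `I001` is what enforces the stdlib-before-third-party import order seen in the wiki_to_netflix.py diff above, e.g. flagging `import csv` placed after `import requests`.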