Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

80 recommend movies #81

Merged
merged 70 commits into from
Feb 11, 2025
Merged

80 recommend movies #81

merged 70 commits into from
Feb 11, 2025

Conversation

jhanley634
Copy link
Collaborator

@jhanley634 jhanley634 commented Feb 6, 2025

This PR puts Netflix prize data in a format where it can be conveniently queried and subsetted. Call me crazy, but I tend to think in JOINs, and I appreciate having an RDBMS sweat the memory management details for cases where not everything will conveniently fit in core. So sqlalchemy is managing some sqlite tables. They are safe to DROP, or alternatively just rm out/movies.sqlite and they're all gone and will be rebuilt on the next run.

This PR also trains a LightFM model on (some of) the prize data, and shows how to make predictions. They are, frankly, unimpressive, but I figured we need something in the code base that demonstrates how to do it, so we can all build upon it. The tests run quickly, to support an interactive edit-run cycle.


The etl(max_rows=101_000_000) call ingests everything. And then it will be preserved -- we skip ETL if a table is already populated. Setting it to a smaller number makes interactive edit-debug cycles go much faster. pytest (pipenv run test) regrettably suppresses the initial ETL progress bar, so consider instead using "pipenv run python -m unittest tests//_test.py" during that first twenty-minute run.

The "give me two movies you like and I will recommend some more" is frankly still aspirational at this point, but we're not far from it. That's where two_movies_rec_test.py got its name.

There are some design notes in make_recommendation.py, which may serve as the basis for future PRs. Once this merges down to main, it's possible we will view the lightfm_recommendation.py module as obsolete and slated for deletion. It did offer me some inspiration for the current work.

I'm happy to accept any and all comments, but please be sensitive to what is in-scope for the current PR as opposed to a subsequent PR. It would be useful to get this merged down, with light edits, this week.

…ot entirely clear what its signature ought to be, but this seems like a decent guess
…hing". Also, DRY up the docstring so it doesn't repeat what the signature already told us.
Copy link
Collaborator

@skyfenton skyfenton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid work, glad we're getting movies now. Have a few thoughts and recommendations, but thanks for putting in the time while in Tahoe!

mediabridge/recommender/etl.py Outdated Show resolved Hide resolved
mediabridge/recommender/etl.py Outdated Show resolved Hide resolved
mediabridge/data_processing/wiki_to_netflix.py Outdated Show resolved Hide resolved
mediabridge/recommender/etl.py Outdated Show resolved Hide resolved
mediabridge/recommender/etl.py Outdated Show resolved Hide resolved
mediabridge/recommender/make_recommendation.py Outdated Show resolved Hide resolved
tests/recommender/two_movies_rec_test.py Outdated Show resolved Hide resolved
@jhanley634
Copy link
Collaborator Author

Implemented the requested changes.

@jhanley634 jhanley634 requested a review from skyfenton February 8, 2025 05:00
Pipfile Show resolved Hide resolved
Pipfile Show resolved Hide resolved
Copy link
Collaborator

@skyfenton skyfenton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few additional comments, but it's basically all set.

tests/recommender/two_movies_rec_test.py Show resolved Hide resolved
tests/recommender/two_movies_rec_test.py Show resolved Hide resolved
@jhanley634
Copy link
Collaborator Author

jhanley634 commented Feb 11, 2025

Several days ago I responded to all requested code changes with either a SHA1 hash or "this is out-of-scope for the current PR". I don't recall hearing any pushback on "out-of-scope" arguments, but perhaps my memory is faulty.

ATM on review number 2 of PR 80 I am seeing that "Sky requests changes". But honestly I have no idea what changes you require in this branch before you would approve it. Please clarify. It is my hope that we can get some edits merged down to main before Thursday. Moving to review number 3 currently seems essential, but also has a whiff of "process failure". I confess this PR includes more edits than I would like to see in a PR. The big question here is "what went wrong? How should we have scoped it down?" There are twelve file we're contemplating merging to main. Should we perhaps have focused on ten of them, and deferred the other two to a subsequent PR?

I have made many fewer edits that I would have liked to see included. In particular I have been reluctant to mess with the 83-engine-hints branch and its series of bug fixes, based on my (incorrect) belief that 80 would soon hit main and therefore could be conveniently merged into other dev branches. A sequence of "small", "rapid" PRs would be preferable to the somewhat large 80 branch.

@jhanley634 jhanley634 requested a review from skyfenton February 11, 2025 02:11
Copy link
Collaborator

@skyfenton skyfenton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't totally know how github reviews work in the context of approvals/comments/requests, and I don't have the best reference yet of when something isn't worth making a comment for, so I'll clarify my intention. I wanted to request the DATA_DIR still be changed after the because the codebase and others on the project have expected the dataset to live in the /data directory. The rest are just observations that weren't crucial, like an unused dependency that I assumed was left over from testing.

We should definitely clarify after this week how large we expect prs to be in a week and what sorts of things merit a request for a current pr instead of pushing to another pr, but this still looks great. Thanks for all the help.

@jhanley634
Copy link
Collaborator Author

I wanted to request the DATA_DIR still be changed

I don't exactly understand what that means, but that's cool. I invite you to submit a PR to that effect, which I will quickly approve. I agree that we do not often see project participants doing something like $ make clean install etl recommend, so there's room for drift in what the various laptop filesystems look like. It's not obvious to me how to improve that, given that make clean has been ruled out as not part of the solution space.

@jhanley634 jhanley634 merged commit f09bf1b into main Feb 11, 2025
4 checks passed
@jhanley634 jhanley634 deleted the 80-recommend-movies branch February 11, 2025 04:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

recommend movies
2 participants