80 recommend movies #81

jhanley634 · 2025-02-06T21:19:20Z

This PR puts the Netflix Prize data in a format where it can be conveniently queried and subsetted. Call me crazy, but I tend to think in JOINs, and I appreciate having an RDBMS sweat the memory management details for cases where not everything will conveniently fit in core. So SQLAlchemy is managing some SQLite tables. They are safe to DROP, or alternatively just rm out/movies.sqlite and they're all gone and will be rebuilt on the next run.
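To illustrate the "think in JOINs" point, here is a minimal sketch using only the standard-library sqlite3 module. The movie and rating tables, their columns, and the top_rated() and demo_db() helpers are hypothetical stand-ins for illustration; the actual schema in this PR is defined through SQLAlchemy and may differ.

```python
import sqlite3


def demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
        CREATE TABLE rating (
            user_id  INTEGER NOT NULL,
            movie_id INTEGER NOT NULL REFERENCES movie(id),
            rating   INTEGER NOT NULL,
            PRIMARY KEY (movie_id, user_id)
        );
        """
    )
    conn.executemany(
        "INSERT INTO movie VALUES (?, ?)",
        [(1, "Alpha"), (2, "Beta"), (3, "Gamma")],
    )
    conn.executemany(
        "INSERT INTO rating VALUES (?, ?, ?)",
        [(10, 1, 5), (11, 1, 4), (10, 2, 3), (11, 2, 2), (10, 3, 5)],
    )
    return conn


def top_rated(
    conn: sqlite3.Connection, min_votes: int = 2
) -> list[tuple[str, float, int]]:
    """Return (title, mean rating, vote count) per movie, best first."""
    sql = """
        SELECT m.title, AVG(r.rating) AS mean_rating, COUNT(*) AS votes
        FROM rating r
        JOIN movie m ON m.id = r.movie_id
        GROUP BY m.id
        HAVING COUNT(*) >= ?
        ORDER BY mean_rating DESC, m.title
    """
    return conn.execute(sql, (min_votes,)).fetchall()
```

The database does the aggregation, so only the small result set has to fit in Python's memory, which is the appeal when the full ratings table will not.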

This PR also trains a LightFM model on (some of) the prize data, and shows how to make predictions. They are, frankly, unimpressive, but I figured we need something in the code base that demonstrates how to do it, so we can all build upon it. The tests run quickly, to support an interactive edit-run cycle.

The etl(max_rows=101_000_000) call ingests everything, and the result is then preserved -- we skip ETL if a table is already populated. Setting max_rows to a smaller number makes interactive edit-debug cycles go much faster. pytest (pipenv run test) regrettably suppresses the initial ETL progress bar, so consider instead using "pipenv run python -m unittest tests//_test.py" during that first twenty-minute run.
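The skip-if-populated behavior can be sketched roughly as follows, again with plain sqlite3. This is an assumption-laden illustration, not the PR's code: the real etl() reads the prize files itself and goes through SQLAlchemy, whereas this hypothetical version takes a connection and an iterable of rows so that it stays self-contained.

```python
import itertools
import sqlite3
from collections.abc import Iterable

Row = tuple[int, int, int]  # hypothetical layout: (user_id, movie_id, rating)


def etl(
    conn: sqlite3.Connection,
    rows: Iterable[Row],
    max_rows: int = 101_000_000,
) -> int:
    """Load at most max_rows ratings; return how many rows were inserted.

    Returns 0 without touching the data when the table is already populated.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rating "
        "(user_id INTEGER, movie_id INTEGER, rating INTEGER)"
    )
    # A single-row probe is enough to decide whether an earlier run succeeded.
    if conn.execute("SELECT 1 FROM rating LIMIT 1").fetchone() is not None:
        return 0

    # islice caps the input lazily, so a small max_rows never reads the rest.
    capped = itertools.islice(rows, max_rows)
    with conn:  # commit on success, roll back on error
        conn.executemany("INSERT INTO rating VALUES (?, ?, ?)", capped)
    (count,) = conn.execute("SELECT COUNT(*) FROM rating").fetchone()
    return count
```

Calling it twice on the same database does the work once; dropping the table (or deleting the database file) is what forces a rebuild on the next run.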

The "give me two movies you like and I will recommend some more" feature is frankly still aspirational at this point, but we're not far from it. That's where two_movies_rec_test.py got its name.

There are some design notes in make_recommendation.py, which may serve as the basis for future PRs. Once this merges down to main, it's possible we will view the lightfm_recommendation.py module as obsolete and slated for deletion. It did offer me some inspiration for the current work.

I'm happy to accept any and all comments, but please be sensitive to what is in-scope for the current PR as opposed to a subsequent PR. It would be useful to get this merged down, with light edits, this week.

skyfenton

Solid work, glad we're getting movies now. Have a few thoughts and recommendations, but thanks for putting in the time while in Tahoe!

mediabridge/recommender/etl.py

mediabridge/data_processing/wiki_to_netflix.py

mediabridge/recommender/etl.py

mediabridge/recommender/make_recommendation.py

tests/recommender/two_movies_rec_test.py

jhanley634 · 2025-02-08T04:37:20Z

Implemented the requested changes.

Pipfile

skyfenton

Just a few additional comments, but it's basically all set.

tests/recommender/two_movies_rec_test.py

jhanley634 · 2025-02-11T02:10:24Z

Several days ago I responded to all requested code changes with either a SHA1 hash or "this is out-of-scope for the current PR". I don't recall hearing any pushback on "out-of-scope" arguments, but perhaps my memory is faulty.

ATM on review number 2 of PR 80 I am seeing that "Sky requests changes". But honestly I have no idea what changes you require in this branch before you would approve it. Please clarify. It is my hope that we can get some edits merged down to main before Thursday. Moving to review number 3 currently seems essential, but also has a whiff of "process failure". I confess this PR includes more edits than I would like to see in a PR. The big question here is "what went wrong? How should we have scoped it down?" There are twelve files we're contemplating merging to main. Should we perhaps have focused on ten of them, and deferred the other two to a subsequent PR?

I have made many fewer edits than I would have liked to see included. In particular I have been reluctant to mess with the 83-engine-hints branch and its series of bug fixes, based on my (incorrect) belief that 80 would soon hit main and therefore could be conveniently merged into other dev branches. A sequence of "small", "rapid" PRs would be preferable to the somewhat large 80 branch.

skyfenton

I still don't totally know how GitHub reviews work in the context of approvals/comments/requests, and I don't yet have a good sense of when something isn't worth a comment, so I'll clarify my intention. I wanted to request the DATA_DIR still be changed, because the codebase and others on the project have expected the dataset to live in the /data directory. The rest are just observations that weren't crucial, like an unused dependency that I assumed was left over from testing.

We should definitely clarify after this week how large we expect PRs to be in a week, and what sorts of things merit a change request in the current PR instead of being pushed to another PR, but this still looks great. Thanks for all the help.

jhanley634 · 2025-02-11T04:52:16Z

"I wanted to request the DATA_DIR still be changed"

I don't exactly understand what that means, but that's cool. I invite you to submit a PR to that effect, which I will quickly approve. I agree that we do not often see project participants doing something like $ make clean install etl recommend, so there's room for drift in what the various laptop filesystems look like. It's not obvious to me how to improve that, given that make clean has been ruled out as not part of the solution space.

jhanley634 added 30 commits January 26, 2025 21:31

Flattening returns dict[str, str]

b9f671b

Annotate bulk_insert(). Given that nothing calls the function, it's n…

0bd5cf4

…ot entirely clear what its signature ought to be, but this seems like a decent guess

Enable --strict mode for mypy linting

5ad14ba

Merge branch 'main' into 76_enable_--strict_mode

780829e

LightFM is a recommender, so move it into the recommender/ folder

2446c0e

Add test_two_movies_rec()

9a70872

Merge branch 'main' of https://github.com/noisebridge/MediaBridge int…

489ac6f

…o 80-recommend-movies

Add pandas + sqlalchemy

74e8078

Add pandas dep

43d3a75

We use tuple for a C struct, and list for arbitrary number of "same t…

8a7dd92

…hing". Also, DRY up the docstring so it doesn't repeat what the signature already told us.

Deal with "unknown year"

a64b527

Dump dataframe to sqlite table

aedb699

Add _etl_user_rating

35305a3

Read files in lexical order

5d5509e

Avoid ETL of movie titles if they're not available, e.g. during a CI run

84cfbec

Push silence_logging() down a level

36543c5

etl() now takes a glob parameter

c422549

Add Rating table

41eb3e9

Insert movie ratings

343a6a5

Write to compressed rating.csv.gz file

c334c4c

use pv | gzip compression pipeline (child won't hold the GIL)

2bd9816

Omit the pv child, as buffering only buys us a 10% reduction in elaps…

bbb60f7

…ed time

Don't re-create CSV if it's already there

899f5ad

Measure the time to write rating_temp

1891445

Prefer the ORM bulk insert API

e6ecf89

Use a class_mapper()

d0a7b80

Order first by movie, then by user

0a43209

Mention why we go to the trouble of spawning a gzip child

b119fb5

Add timing remarks

a1396eb

Looks like the DB belongs in the out/ folder

bbbdd83

Delete the magic number 4

62e925e

skyfenton requested changes Feb 8, 2025

jhanley634 added 11 commits February 7, 2025 18:40

Bugfix: avoid trying to convert y-m-d to an integer

a3b6051

Bugfix: avoid changing the sparsity structure of a csr_matrix, as it …

06bfb94

…is expensive

Add tqdm progress bar

30e90bb

Adjust docstring description of "txt_file" parameter

d66d161

Rename module to data_processing/etl.py

c77012e

Elide ORDER BY cnt DESC, as the PK overrules it

2b1fa5c

Expand the docstring critique to mention the integer ID comparison is…

fe5272d

… against large_movie_id

Make a range check a little looser

d131f2a

Add _normalize_rating

ffe34a0

Add comments

410b8b7

The ETL glob parameter, for rapid test runs, has outlived its usefuln…

94e6b89

…ess; replace with a constant

jhanley634 requested a review from skyfenton February 8, 2025 05:00

jhanley634 added 6 commits February 8, 2025 07:59

Default to full 101 M row ETL, skipping it if rows are already there

4df2937

LightFM is returning somewhat deterministic results now, so test that…

84d84f2

… the expected recommended titles came back

Ignore Private Ryan noise

8bc94c6

Add pipenv script to browse through ETL'd data.

f4b008a

Add EDA convenience view: rating_v

2a3fcd1

Rename to RATING_V_DDL (data definition language)

1aa1f5a

skyfenton reviewed Feb 10, 2025

Pipfile

skyfenton reviewed Feb 10, 2025

Pipfile

skyfenton reviewed Feb 10, 2025

tests/recommender/two_movies_rec_test.py

tests/recommender/two_movies_rec_test.py

jhanley634 requested a review from skyfenton February 11, 2025 02:11

skyfenton approved these changes Feb 11, 2025

jhanley634 merged commit f09bf1b into main Feb 11, 2025
4 checks passed

jhanley634 deleted the 80-recommend-movies branch February 11, 2025 04:52
