Data Fusion Contest 2022. Matching

Solutions to the Data Fusion Contest 2022 user matching challenge.
The repository contains 4 solutions (in separate branches) that use 2 different approaches to user embedding generation.
All solutions share the same basic idea:

  1. Create feature vectors for users from transaction or clickstream data.
  2. Train a siamese neural network with triplet loss.
  3. Get user embeddings such that paired users are close to each other and unpaired users are far apart.
  4. Rank match probability by the distance between embeddings.
  5. Assign 0 if the pairwise distance is above a threshold.
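
Steps 2 to 5 can be sketched in plain Python as follows. This is a minimal illustration, not the repository's code: the function names, the margin of 1.0, the Euclidean metric, and the `1 / (1 + d)` score mapping are all assumptions made for the example. How the 0 from step 5 is used in the final submission is not specified here, so the sketch simply zeroes the score.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def rank_candidates(query, candidates, threshold):
    """Rank candidate ids by ascending distance to the query embedding.

    candidates: dict mapping a candidate id to its embedding.
    Returns (ranked_ids, scores). A score is 0.0 when the distance
    exceeds the threshold (assumed reading of step 5).
    """
    distances = {cid: euclidean(query, emb) for cid, emb in candidates.items()}
    ranked = sorted(distances, key=distances.get)
    scores = {
        cid: (1.0 / (1.0 + d) if d <= threshold else 0.0)
        for cid, d in distances.items()
    }
    return ranked, scores
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, which is what pushes unpaired users apart during training.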

The solutions differ mainly in how the user feature vectors are generated.

Best result:

Solution 4.
XGBoost classifier on top of siamese model embeddings:

| R1 (Harm. Mean) | MRR @100 | Precision @100 |
| --- | --- | --- |
| 0.1794725306 | 0.1711484254 | 0.1886477462 |
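
The R1 column is labelled as a harmonic mean, and the reported values are consistent with it being the harmonic mean of MRR @100 and Precision @100. A small sketch of that calculation (the function name `r1_score` is made up for this example):

```python
def r1_score(mrr, precision):
    """Harmonic mean of MRR@100 and Precision@100."""
    if mrr + precision == 0:
        return 0.0
    return 2 * mrr * precision / (mrr + precision)


# Values from the best result above.
best_r1 = r1_score(0.1711484254, 0.1886477462)
```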

Solution 1:

Solution branch.

| R1 (Harm. Mean) | MRR @100 | Precision @100 |
| --- | --- | --- |
| 0.0032381007 | 0.0017015599 | 0.0333889816 |

Embeddings of transaction and clickstream categories were created from the category descriptions.
User features are calculated as weighted category embeddings, where the weights are derived from the number of occurrences of each category for a given user.

Category embeddings:

  1. Translate the category descriptions from RU to EN.
  2. Normalize the category descriptions.
  3. Create word embeddings.
  4. For the mcc_codes.csv descriptions, select only the top-k words closest to the click_categories.csv corpus.
  5. Create the final category embeddings by averaging the embeddings of the k closest description words.
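
Steps 4 and 5 can be illustrated with a short sketch. It assumes that "closest to the corpus" means highest cosine similarity to the mean vector of the corpus words. That is one possible reading and not necessarily what the solution branch does. All function names here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(values) / n for values in zip(*vectors)]


def category_embedding(description_vectors, corpus_vectors, k):
    """Average the k description word vectors closest to the corpus centroid.

    Assumption: closeness is cosine similarity to the corpus mean vector.
    """
    centroid = mean_vector(corpus_vectors)
    ranked = sorted(
        description_vectors,
        key=lambda v: cosine_similarity(v, centroid),
        reverse=True,
    )
    return mean_vector(ranked[:k])
```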

User embeddings:

  1. Calculate the number of occurrences of each category.
  2. Calculate category weights with a softmax over the non-zero categories.
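
A minimal sketch of these two steps, under stated assumptions: the softmax is applied to raw counts (any scaling of the counts beforehand is not described here), and the user vector is the weighted sum of category embeddings. The names `softmax_weights` and `user_embedding` are illustrative only.

```python
import math


def softmax_weights(counts):
    """Softmax over categories with a non-zero count.

    counts: dict mapping category id to number of occurrences.
    Categories with zero occurrences get no weight at all.
    """
    nonzero = {c: n for c, n in counts.items() if n > 0}
    if not nonzero:
        return {}
    peak = max(nonzero.values())  # subtract the max for numerical stability
    exps = {c: math.exp(n - peak) for c, n in nonzero.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def user_embedding(counts, category_embeddings):
    """Weighted sum of category embeddings (assumed combination rule)."""
    weights = softmax_weights(counts)
    dim = len(next(iter(category_embeddings.values())))
    result = [0.0] * dim
    for category, weight in weights.items():
        for i, value in enumerate(category_embeddings[category]):
            result[i] += weight * value
    return result
```

Restricting the softmax to non-zero categories keeps categories the user never touched from receiving weight, which a softmax over all categories would otherwise assign.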

An example can be found in the notebook.


Solution 2:

Solution branch.

| R1 (Harm. Mean) | MRR @100 | Precision @100 |
| --- | --- | --- |
| 0.00746767 | 0.0039526563 | 0.0674457429 |

User feature vectors are the log of the summed occurrences of each unique category (across transactions and clickstream).
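
A sketch of such a feature vector is shown below. It assumes `log1p` (that is, log(1 + count)) so that categories with zero occurrences map to 0 instead of an undefined log(0). Whether the solution uses this offset is not stated. The function name and the category ids are made up for the example.

```python
import math
from collections import Counter


def log_count_features(events, categories):
    """Log-scaled occurrence counts per category for one user.

    events: category ids from the user's transactions and clickstream combined.
    categories: fixed, ordered list of all category ids (defines vector layout).
    """
    counts = Counter(events)
    return [math.log1p(counts[c]) for c in categories]


# Hypothetical category ids, for illustration only.
features = log_count_features(
    ["mcc_1", "mcc_1", "click_2"],
    ["mcc_1", "click_2", "click_3"],
)
```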

Solution 3:

Solution branch.

| R1 (Harm. Mean) | MRR @100 | Precision @100 |
| --- | --- | --- |
| 0.0081377321 | 0.0042996904 | 0.0757929883 |

Same as Solution 2, but the final embeddings are the average of the embeddings from 5 siamese models trained on 5 folds.
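
The averaging step can be sketched as an element-wise mean over the per-fold embeddings of one user. The function name is hypothetical, and the fold models themselves are not shown.

```python
def average_embeddings(fold_embeddings):
    """Element-wise mean of one user's embeddings from several fold models.

    fold_embeddings: list of equal-length vectors, one per fold model.
    """
    n = len(fold_embeddings)
    return [sum(values) / n for values in zip(*fold_embeddings)]
```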


Solution 4:

Solution branch.

| R1 (Harm. Mean) | MRR @100 | Precision @100 |
| --- | --- | --- |
| 0.1794725306 | 0.1711484254 | 0.1886477462 |

A solution based on an XGBoost classifier on top of the siamese neural network embeddings. Matches are ranked by the classifier's predicted probabilities. An example can be found in the notebook.
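
The sketch below shows one way a second-stage classifier could consume the embeddings. The pair features (both embeddings, their absolute difference, and the Euclidean distance) are an assumption, since the actual feature set is only given in the notebook. `predict_match_proba` is a hypothetical callable standing in for the trained classifier. With XGBoost it could wrap the positive-class column returned by `XGBClassifier.predict_proba`.

```python
import math


def pair_features(emb_a, emb_b):
    """Feature vector for a candidate pair (assumed layout, not the notebook's)."""
    abs_diff = [abs(x - y) for x, y in zip(emb_a, emb_b)]
    distance = math.sqrt(sum(d * d for d in abs_diff))
    return list(emb_a) + list(emb_b) + abs_diff + [distance]


def rank_by_classifier(query_emb, candidates, predict_match_proba, top_n=100):
    """Rank candidates by predicted match probability, highest first.

    candidates: dict mapping a candidate id to its embedding.
    predict_match_proba: hypothetical callable, features -> probability.
    """
    scored = [
        (cid, predict_match_proba(pair_features(query_emb, emb)))
        for cid, emb in candidates.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]
```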


Analysis:

A siamese neural network with triplet loss does not reliably separate the pairwise distances of positive and negative pairs:
(Left image - Euclidean distance, right image - cosine distance).

Thus ranking based on pairwise distances alone is not effective enough.
Adding a second model (XGBoost, for example) on top of the siamese embeddings improves the score significantly (R1 rises from 0.0081 for Solution 3 to 0.1795 for Solution 4).
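
One way to put a number on how well distances separate positive from negative pairs is the fraction of (positive, negative) combinations in which the positive pair has the smaller distance. This is equivalent to ROC AUC on the negated distances. It is offered here as an illustrative diagnostic, not as the analysis used in this repository.

```python
def separation_auc(positive_distances, negative_distances):
    """Probability that a random positive pair is closer than a random negative pair.

    1.0 means perfect separation, 0.5 means no separation. Ties count as 0.5.
    """
    wins = 0.0
    for p in positive_distances:
        for n in negative_distances:
            if p < n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_distances) * len(negative_distances))
```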