Solutions for the Data Fusion 2022 user matching challenge.
The repository contains 4 solutions (in different branches) that use 2 different approaches to user embedding generation.
All solutions share the same basic idea:
- Create feature vectors for users based on transactions or clickstream data.
- Train a siamese neural network with triplet loss.
- Obtain user embeddings such that paired users are close to each other and unpaired users are far apart.
- Rank match probability based on an embedding distance metric.
- Assign 0 if the pairwise distance is above a threshold.
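
The training objective and the ranking step above can be sketched roughly as follows. This is a minimal NumPy illustration, not the repository's code: the function names, the margin value, and the `1 / (1 + distance)` score are assumptions, and "assign 0" is read here as zeroing the match score.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Mean triplet margin loss over a batch of embedding triplets.

    The margin of 1.0 is an illustrative default, not a value from the source.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=1)  # anchor to paired user
    d_neg = np.linalg.norm(anchor - negative, axis=1)  # anchor to unpaired user
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))

def match_scores(query_emb, candidate_embs, threshold):
    """Turn Euclidean distances into scores; zero out candidates beyond the threshold."""
    dists = np.linalg.norm(candidate_embs - query_emb, axis=1)
    scores = 1.0 / (1.0 + dists)       # assumed monotone distance-to-score mapping
    scores[dists > threshold] = 0.0    # "assign 0" interpreted as a zero score
    return scores
```

Candidates would then be ranked by descending score, for example with `np.argsort(-scores)`.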
The solutions differ mainly in how the user feature vectors are generated.
Best result: solution 4, an XGBoost classifier on top of siamese model embeddings:
R1 (Harm. Mean) | MRR @100 | Precision @100 |
---|---|---|
0.1794725306 | 0.1711484254 | 0.1886477462 |
Solution 1.
R1 (Harm. Mean) | MRR @100 | Precision @100 |
---|---|---|
0.0032381007 | 0.0017015599 | 0.0333889816 |
Embeddings of transaction and clickstream categories are created from their descriptions.
User features are calculated as weighted category embeddings, with weights derived from the number of category occurrences for each user.
- Translate category descriptions from RU to EN.
- Normalize category descriptions.
- Create word embeddings.
- For each `mcc_codes.csv` description, select only the top-k words closest to the `click_categories.csv` corpus.
- Create the final category embeddings by averaging the embeddings of the k closest description words.
- Calculate the sum of category occurrences.
- Calculate category weights with a softmax over the non-zero categories.
An example can be found in the notebook.
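
The last two steps (occurrence counts, then softmax weights over the non-zero categories) might look like the sketch below. The function name and array layout are hypothetical, and applying the softmax directly to raw counts is an assumption; the source does not say whether counts are scaled first.

```python
import numpy as np

def user_feature_vector(category_counts, category_embeddings):
    """Weighted sum of category embeddings for one user.

    category_counts:     shape (n_categories,), occurrences per category
    category_embeddings: shape (n_categories, dim)
    """
    counts = np.asarray(category_counts, dtype=float)
    emb = np.asarray(category_embeddings, dtype=float)
    weights = np.zeros_like(counts)
    nonzero = counts > 0
    if nonzero.any():
        # Softmax restricted to categories the user actually has;
        # subtracting the max keeps the exponentials numerically stable.
        exp = np.exp(counts[nonzero] - counts[nonzero].max())
        weights[nonzero] = exp / exp.sum()
    return weights @ emb  # zero vector if the user has no categories
```

Categories with zero occurrences receive zero weight, so they do not contribute to the user vector.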
Solution 2.
R1 (Harm. Mean) | MRR @100 | Precision @100 |
---|---|---|
0.00746767 | 0.0039526563 | 0.0674457429 |
Each user feature vector is the log of the summed occurrences of each unique category (in transactions and clickstream).
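
A minimal sketch of such a feature vector is shown below. The function name and the `vocabulary` argument are hypothetical, and `log1p` (that is, `log(1 + count)`) is an assumption made here so that categories with zero occurrences stay finite.

```python
from collections import Counter

import numpy as np

def log_count_features(user_categories, vocabulary):
    """Log-scaled occurrence counts for one user.

    user_categories: category ids seen in the user's transactions and clickstream
    vocabulary:      fixed ordered list of all category ids (hypothetical input)
    """
    counts = Counter(user_categories)
    raw = np.array([counts.get(c, 0) for c in vocabulary], dtype=float)
    return np.log1p(raw)  # log(1 + count); absent categories map to 0
```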
Solution 3.
R1 (Harm. Mean) | MRR @100 | Precision @100 |
---|---|---|
0.0081377321 | 0.0042996904 | 0.0757929883 |
Same as solution 2, but the final embeddings are the average of the embeddings from 5 siamese models trained on 5 folds.
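
The averaging step could be sketched as follows, assuming each fold model has already produced an embedding matrix for the same users in the same order (the function name is hypothetical):

```python
import numpy as np

def average_fold_embeddings(fold_embeddings):
    """Element-wise mean of per-fold embedding matrices.

    fold_embeddings: list of arrays, each of shape (n_users, dim),
                     one per siamese model (5 in the description above).
    """
    stacked = np.stack(fold_embeddings, axis=0)  # (n_folds, n_users, dim)
    return stacked.mean(axis=0)                  # (n_users, dim)
```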
Solution 4.
R1 (Harm. Mean) | MRR @100 | Precision @100 |
---|---|---|
0.1794725306 | 0.1711484254 | 0.1886477462 |
Solution based on an XGBoost classifier on top of siamese neural network embeddings. Matches are ranked by the classifier's predicted probabilities. An example can be found in the notebook.
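
One way the classifier inputs and the final ranking might be assembled is sketched below. The feature layout (both embeddings, their absolute difference, and two distances) and both function names are assumptions for illustration; the actual features used are in the notebook.

```python
import numpy as np

def pair_features(emb_a, emb_b):
    """Feature row for one candidate pair of user embeddings (assumed layout)."""
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    euclid = np.linalg.norm(a - b)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Cosine distance; fall back to 1.0 when either embedding is all zeros.
    cosine = 1.0 - (a @ b) / denom if denom > 0 else 1.0
    return np.concatenate([a, b, np.abs(a - b), [euclid, cosine]])

def rank_by_proba(candidate_ids, probas, top_k=100):
    """Candidate ids ordered by descending match probability."""
    order = np.argsort(-np.asarray(probas, dtype=float))[:top_k]
    return [candidate_ids[i] for i in order]
```

Rows built this way, labeled 1 for paired users and 0 otherwise, could be used to fit an `xgboost.XGBClassifier`; the second column of its `predict_proba` output would then serve as `probas`.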
A siamese neural network with triplet loss does not reliably separate the pairwise distances of positive and negative pairs:
(Left image: Euclidean distance; right image: cosine distance.)
Ranking based on pairwise distances alone is therefore not effective enough.
Adding a second model (XGBoost, for example) on top of the siamese embeddings improves the score significantly: R1 rises from below 0.01 for the distance-based solutions to about 0.18.