Data Fusion Contest 2022. Matching

Solutions to the Data Fusion 2022 user matching challenge.
The repository contains 4 solutions (in different branches) that use 2 different approaches to user embedding generation.
All solutions share the same basic idea:

Create feature vectors for users based on transaction or clickstream data.
Train a siamese neural network with triplet loss.
Get user embeddings such that paired users are close to each other and unpaired users are far apart.
Rank match probability by an embedding distance metric.
Assign 0 if the pairwise distance is above a threshold.
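
The training objective and the ranking steps above can be sketched in plain Python. This is a minimal illustration, not the repository's code: the function names, the margin value, and the `1 / (1 + distance)` mapping from distance to score are assumptions, and "assign 0" is read here as setting the match score to 0.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: pull the positive closer than the negative by `margin`."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


def match_scores(query, candidates, threshold):
    """Score candidates by distance to `query` and sort by descending score.

    `candidates` maps a candidate id to its embedding. The distance-to-score
    mapping is an assumption made for illustration only.
    """
    scored = []
    for cand_id, emb in candidates.items():
        d = euclidean(query, emb)
        # Pairs farther apart than the threshold get a score of 0.
        score = 0.0 if d > threshold else 1.0 / (1.0 + d)
        scored.append((cand_id, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

For example, `match_scores([0, 0], {"a": [0, 1], "b": [3, 4]}, threshold=2.0)` ranks `"a"` first and gives `"b"` a score of 0, because its distance of 5 exceeds the threshold.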

The solutions differ mainly in how the user feature vectors are generated.

Best result:

Solution 4.
XGBoost classifier on top of siamese model embeddings:

R1 (Harm. Mean)	MRR @100	Precision @100
0.1794725306	0.1711484254	0.1886477462
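
The R1 column is consistent with the harmonic mean of the MRR@100 and Precision@100 columns. A small sketch of that calculation follows; the function name `r1_score` is illustrative and not taken from the contest code.

```python
def r1_score(mrr, precision):
    """Harmonic mean of MRR@100 and Precision@100."""
    if mrr + precision == 0:
        return 0.0
    return 2 * mrr * precision / (mrr + precision)
```

Calling `r1_score(0.1711484254, 0.1886477462)` reproduces the reported R1 of about 0.17947.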

Solution 1:

Solution branch.

R1 (Harm. Mean)	MRR @100	Precision @100
0.0032381007	0.0017015599	0.0333889816

Embeddings of transaction and clickstream categories are built from the category descriptions.
User features are computed as weighted category embeddings, where the weights are derived from the number of times each category occurs for a given user.

Category embeddings:

Translate the category descriptions from RU to EN.
Normalize the category descriptions.
Create word embeddings.
For the mcc_codes.csv descriptions, select only the top-k words closest to the click_categories.csv corpus.
Create the final category embedding as the average of the embeddings of the k closest description words.
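
The last two steps can be sketched as follows. How "closest to the corpus" is measured is not stated here, so this sketch assumes cosine similarity to the mean vector of the click_categories.csv word embeddings; the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def category_embedding(description_vectors, corpus_vectors, k):
    """Average the k description word vectors closest to the corpus centroid.

    Assumption: closeness is cosine similarity to the centroid of the
    clickstream word vectors.
    """
    centroid = mean_vector(corpus_vectors)
    ranked = sorted(description_vectors, key=lambda v: cosine(v, centroid), reverse=True)
    return mean_vector(ranked[:k])
```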

User embeddings:

Calculate the sum of category occurrences for each user.
Calculate category weights with a softmax over the non-zero categories.
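
A minimal sketch of these two steps, combined with the weighted category embeddings described earlier. The dictionary-based data layout and the function names are assumptions made for illustration.

```python
import math


def softmax_nonzero(counts):
    """Softmax over categories with a non-zero count; zero-count categories are dropped."""
    nonzero = {cat: n for cat, n in counts.items() if n > 0}
    if not nonzero:
        return {}
    peak = max(nonzero.values())  # subtract the max for numerical stability
    exps = {cat: math.exp(n - peak) for cat, n in nonzero.items()}
    total = sum(exps.values())
    return {cat: e / total for cat, e in exps.items()}


def user_embedding(counts, category_embeddings):
    """Weighted sum of category embeddings using the softmax weights."""
    weights = softmax_nonzero(counts)
    dim = len(next(iter(category_embeddings.values())))
    result = [0.0] * dim
    for cat, weight in weights.items():
        for i, value in enumerate(category_embeddings[cat]):
            result[i] += weight * value
    return result
```

Note that a softmax over raw counts concentrates almost all weight on the most frequent category when counts differ by more than a few units.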

An example can be found in the notebook.

Solution 2:

Solution branch.

R1 (Harm. Mean)	MRR @100	Precision @100
0.00746767	0.0039526563	0.0674457429

User feature vectors are the logarithm of the summed occurrences of each unique category (in transactions and clickstream).
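
A sketch of such a feature vector, assuming one entry per category in a fixed order. It uses `log1p` (the log of 1 plus the count) so that categories with zero occurrences map to 0; whether the original code handles zeros this way is an assumption.

```python
import math


def log_count_features(events, categories):
    """Return log(1 + count) for each category, in the order given by `categories`.

    `events` is an iterable of category ids observed for one user;
    ids that are not listed in `categories` are ignored.
    """
    counts = {cat: 0 for cat in categories}
    for cat in events:
        if cat in counts:
            counts[cat] += 1
    return [math.log1p(counts[cat]) for cat in categories]
```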

Solution 3:

Solution branch.

R1 (Harm. Mean)	MRR @100	Precision @100
0.0081377321	0.0042996904	0.0757929883

Same as solution 2, but the final embeddings are the average of the embeddings from 5 siamese models trained on 5 folds.
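
Averaging over fold models amounts to an element-wise mean of the embeddings that each model produces for the same user. A minimal sketch, with a hypothetical function name:

```python
def average_embeddings(fold_embeddings):
    """Element-wise mean of one user's embeddings from several fold models."""
    n = len(fold_embeddings)
    return [sum(values) / n for values in zip(*fold_embeddings)]
```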

Solution 4:

Solution branch.

R1 (Harm. Mean)	MRR @100	Precision @100
0.1794725306	0.1711484254	0.1886477462

The solution is based on an XGBoost classifier on top of siamese neural network embeddings. Matches are ranked by the classifier's predicted probabilities. An example can be found in the notebook.
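
The sketch below shows one possible way to turn a pair of embeddings into classifier input and to rank candidates by predicted probability. The pair features (absolute difference, element-wise product, Euclidean distance) are an assumption, not the repository's feature set, and `predict_proba` is a stand-in callable for a trained classifier.

```python
import math


def pair_features(u, v):
    """Hypothetical pair features: |u - v|, u * v, and the Euclidean distance."""
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    product = [a * b for a, b in zip(u, v)]
    distance = math.sqrt(sum(d * d for d in abs_diff))
    return abs_diff + product + [distance]


def rank_by_probability(query, candidates, predict_proba, top_n=100):
    """Rank candidate ids by the match probability of each (query, candidate) pair.

    `predict_proba` takes one feature list and returns a probability.
    """
    scored = [
        (cand_id, predict_proba(pair_features(query, emb)))
        for cand_id, emb in candidates.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [cand_id for cand_id, _ in scored[:top_n]]
```

The default `top_n=100` mirrors the @100 cutoff of the reported metrics.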

Analysis:

A siamese neural network with triplet loss does not reliably separate the pairwise distances of positive and negative pairs:
(Left image: Euclidean distance; right image: cosine distance.)

Ranking on pairwise distances alone is therefore not effective enough.
Adding a second model (for example, XGBoost) on top of the siamese embeddings raises R1 from 0.0081 (solution 3) to 0.1795 (solution 4).

kumgleb/data_fusion_matching
