forked from eyra/fertility-prediction-challenge
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
1 parent 807890e
commit 66c76c6
Showing 4 changed files with 23 additions and 5 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,9 +1,16 @@ | ||
# Description of submission | ||
|
||
## The Model | ||
## Summary | ||
|
||
We fit an xgboost model with 66 hand-picked variables, which are converted to 43 predictors. An additional predictor is whether the observation is time-shifted (see the section below). | ||
We fit an XGBoost model with the following strategies: (1) we expanded the sample size with "time-shifted" data, (2) we merged in data from the partner's survey for households where both partners participated, and (3) we combined data from related features into "scales". | ||
|
||
## The Data | ||
## Details | ||
|
||
(1) We roughly tripled the amount of training data using a "time-shift" strategy. By adapting the outcome calculation code, which was generously provided by the PreFer organizing team, we calculated whether suitably aged people in the training and supplementary data had children between 2018 and 2020, thus creating additional outcome data. For these additional rows, we renamed each feature measured 1, 2, etc. years before the shifted outcome window so that it has the same name as the feature measured 1, 2, etc. years before the outcome window in the original data. For example, in a time-shifted row, cf17j128 is renamed cf20m128 so that it lines up with data collected 3 years later. To help account for temporal distribution shift, we include an indicator feature for whether the row comes from the time-shifted data or the original data. | ||
|
||
(2) For households where both partners participated in the survey, we merge in the partner's fertility intentions from 2019 and 2020 (cf19l128 to cf19l130 and cf20m128 to cf20m130), plus the partner's answers to questions about how many kids they have. | ||
|
||
(3) We generate "scales" in the feature data by averaging related features together. Our scales are: feelings toward current child, gendered religiosity, attitudes about traditional fertility, attitudes about traditional motherhood, attitudes about traditional fatherhood, attitudes about traditional marriage, attitudes toward working mothers, and sexism. | ||
|
||
We choose the hyperparameters for our XGBoost model via grid search with 5-fold cross-validation. | ||
|
||
We roughly tripled the amount of training data using a time-shift strategy. By adapting outcome_time_shift.Rmd generously provided by the PreFer organizing team, we calculated whether suitably aged people in the training and supplementary data had children between 2018 and 2020. We then found earlier versions of our predictors and surmised that these earlier predictors predict childbirths between 2018 and 2020 in much the same way our predictors predict childbirths between 2021 and 2023. We then time-shifted those earlier measures to create additional rows in our training data. |
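
The time-shift renaming described in the diff above can be illustrated with a small sketch. It is not the submission's code. It assumes that variable names follow the pattern seen in the example (a module prefix, a two-digit year, a one-letter wave code, and an item number) and that the wave letter advances by one per year, which matches `cf17j128` becoming `cf20m128` but may not hold for every survey module. The function names and the `time_shifted` indicator column are hypothetical.

```python
import re

# Assumed naming pattern: prefix + two-digit year + wave letter + item number,
# e.g. "cf17j128". Names that do not match (IDs, background variables) are left alone.
VAR_PATTERN = re.compile(r"^([a-z]+)(\d{2})([a-z])(\d+)$")


def shift_name(name, years=3):
    """Rename a wave-specific variable as if it had been measured `years` later."""
    match = VAR_PATTERN.match(name)
    if match is None:
        return name
    prefix, year, wave, item = match.groups()
    new_wave = chr(ord(wave) + years)
    if new_wave > "z":
        raise ValueError(f"cannot shift wave letter of {name!r} by {years}")
    return f"{prefix}{int(year) + years:02d}{new_wave}{item}"


def time_shift_row(row, years=3):
    """Return a copy of `row` with shifted names and the indicator set to 1."""
    shifted = {shift_name(key, years): value for key, value in row.items()}
    shifted["time_shifted"] = 1  # hypothetical indicator column name
    return shifted


def mark_original(row):
    """Return a copy of an original row with the indicator set to 0."""
    marked = dict(row)
    marked["time_shifted"] = 0
    return marked
```

Stacking the output of `mark_original` for the original rows with the output of `time_shift_row` for the earlier rows would give one table in which both groups share column names, with the indicator separating them.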
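
The partner merge in item (2) of the diff above might look roughly like the following. This is a sketch under stated assumptions, not the authors' implementation: the `person_id` and `household_id` keys, the `_partner` suffix, and the rule that a partner is the only other respondent in the same household are all hypothetical, since the diff does not say how partners are identified. Only the fertility-intention columns named in the diff are listed.

```python
# Columns named in the diff: cf19l128 to cf19l130 and cf20m128 to cf20m130.
PARTNER_COLUMNS = [
    "cf19l128", "cf19l129", "cf19l130",
    "cf20m128", "cf20m129", "cf20m130",
]


def merge_partner_features(rows, columns=PARTNER_COLUMNS):
    """Attach the partner's answers to each row as '<column>_partner'.

    Assumption: a partner is the single other respondent in the same
    household; otherwise the partner columns are set to None (missing).
    """
    by_household = {}
    for row in rows:
        by_household.setdefault(row["household_id"], []).append(row)

    merged = []
    for row in rows:
        others = [
            member for member in by_household[row["household_id"]]
            if member["person_id"] != row["person_id"]
        ]
        partner = others[0] if len(others) == 1 else None
        out = dict(row)
        for column in columns:
            out[column + "_partner"] = partner.get(column) if partner else None
        merged.append(out)
    return merged
```

Leaving the partner columns missing for single-respondent households suits tree-based models such as XGBoost, which can handle missing values natively.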
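
The "scales" in item (3) of the diff above are averages of related features. A minimal sketch of that idea follows. The item names in the usage example are placeholders, and skipping missing answers when averaging is an assumption; the diff does not say which items form each scale or how missing values are treated.

```python
def scale_score(row, items):
    """Average the non-missing values of `items` in `row`; None if all are missing."""
    values = [row[item] for item in items if row.get(item) is not None]
    if not values:
        return None
    return sum(values) / len(values)


def add_scales(row, scales):
    """Return a copy of `row` with one new column per scale.

    `scales` maps a scale name to the list of item columns it averages.
    """
    out = dict(row)
    for name, items in scales.items():
        out[name] = scale_score(row, items)
    return out
```

For example, `add_scales({"item_a": 1, "item_b": 3}, {"example_scale": ["item_a", "item_b"]})` adds an `example_scale` column equal to the mean of the two placeholder items.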