feat: reported DCR_share with the description when holdout provided #103

Open
wants to merge 3 commits into
base: main
34 changes: 32 additions & 2 deletions mostlyai/qa/assets/html/report_template.html
@@ -152,7 +152,7 @@ <h1 id="summary"><span>{{ meta.report_title }}</span>{{ meta.report_subtitle }}<
<td style="width: 70px;">
<div class="result-box-title">
Distances
- <div data-bs-toggle="tooltip" data-bs-title='This metric represents the average distance between synthetic samples and their nearest training samples. For comparison, the average distances between synthetic samples and samples from a holdout dataset is shown in light gray to assess if the trained model learned the general patterns that are common in training as well as in holdout sets.'>
+ <div data-bs-toggle="tooltip" data-bs-title='Distances represent the proximity between synthetic samples and their nearest training samples, with an identical match having a distance of zero. For comparison, average distances to holdout samples are shown in light gray, helping assess if the model has learned general patterns common in both training and holdout sets. The DCR share indicates the proportion of synthetic samples that are closer to a training sample than to a holdout sample, and ideally, this value should not significantly exceed 50%, as a higher value could indicate overfitting.'>
{{html_assets['info.svg']}}
</div>
</div>
@@ -180,6 +180,16 @@ <h1 id="summary"><span>{{ meta.report_title }}</span>{{ meta.report_subtitle }}<
{% endif %}
</td>
</tr>
{% if metrics.distances.dcr_share is not none %}
<tr>
<td>DCR share</td>
<td align="left">
{% if metrics.distances.dcr_holdout is not none %}
{{ "{:.1%}".format(metrics.distances.dcr_share) }}
{% endif %}
</td>
</tr>
{% endif %}
</table>
</td>
</tr>
@@ -388,11 +398,30 @@ <h2 id="distances" class="anchor">Distances</h2>
</tr>
</tbody>
</table>
<br />
<div class="white-box p-3">
{{ distances_dcr_html_chart }}
</div>
<br />
{% if metrics.distances.dcr_share is not none %}
<div class="table-responsive col-md-12">
<table class='table' style="text-align: left">
<thead>
<tr>
<td style="width: 33%"> </td>
<td style="width: 33%">Observed</td>
<td style="width: 33%"><small class="muted-text">(Optimum)</small></td>
</tr>
</thead>
<tbody>
<tr>
<td>DCR Share</td>
<td>{{ "{:.1%}".format(metrics.distances.dcr_share) }}</td>
<td><small class="muted-text">({{ "{:.1%}".format(0.5) }})</small></td>
</tr>
</tbody>
</table>
</div>
{% endif %}
</div>
<br />
<div class="explainer" style="margin-bottom: 30px">
@@ -403,6 +432,7 @@ <h2 id="distances" class="anchor">Distances</h2>
<div class="explainer-body">
Synthetic data shall be as close to the original training samples, as it is close to original holdout samples, which serve us as a reference.
This can be asserted empirically by measuring distances between synthetic samples to their closest original samples, whereas training and holdout sets are sampled to be of equal size.
DCR Share is the share of synthetic samples that are closer to a training sample than to a holdout sample. This shall not be significantly larger than 50%. <br />
For the visualization above, the distances of synthetic samples to the training samples are displayed in green, and the distances of synthetic samples to the holdout samples (if available) displayed in gray.
A green line that is significantly left of the gray line implies that synthetic samples are closer to the training samples than to the holdout samples, indicating that the data has overfitted to the training data.
A green line that overlays with the gray line validates that the trained model indeed represents the general rules, that can be found in training just as well as in holdout samples.
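
Note on the metric: the explainer text in this diff defines DCR share as the share of synthetic samples that are closer to a training sample than to a holdout sample. A minimal Python sketch of that definition follows. It is not the project's implementation: the function names, the Euclidean metric, the brute-force neighbor search, and the choice to count ties as one half are all assumptions made for illustration.

```python
import math


def nearest_distance(point, reference):
    """Euclidean distance from `point` to its closest row in `reference`."""
    return min(math.dist(point, ref) for ref in reference)


def dcr_share(synthetic, training, holdout):
    """Share of synthetic rows closer to training than to holdout.

    Hypothetical helper. Returns None when no holdout is given, mirroring
    the `is not none` guard used in the template. Ties count as 0.5, which
    is an assumption of this sketch.
    """
    if holdout is None:
        return None
    score = 0.0
    for row in synthetic:
        d_trn = nearest_distance(row, training)
        d_hol = nearest_distance(row, holdout)
        if d_trn < d_hol:
            score += 1.0
        elif d_trn == d_hol:
            score += 0.5
    return score / len(synthetic)


# Toy data: half of the synthetic rows copy training rows, half copy holdout rows.
training = [(0.0, 0.0), (5.0, 5.0)]
holdout = [(1.0, 1.0), (9.0, 9.0)]
synthetic = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (9.0, 9.0)]

share = dcr_share(synthetic, training, holdout)
# Same format string as the template uses for the report cell.
print("{:.1%}".format(share))  # 50.0%
```

A model that only reproduces training rows would push the value toward 100%, which is the overfitting signal the tooltip describes.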