Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

data integrity test: SIMBAD-resolvable sources? #64

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

arjunsavel
Copy link
Collaborator

@arjunsavel arjunsavel commented Jun 11, 2024

do we want to test whether sources are simbad-resolvable?

@arjunsavel arjunsavel marked this pull request as draft June 11, 2024 20:10
@arjunsavel
Copy link
Collaborator Author

Pretty sure tests will fail — still working on this.

@arjunsavel arjunsavel marked this pull request as ready for review July 9, 2024 20:12
@arjunsavel arjunsavel requested review from dr-rodriguez and kelle July 9, 2024 20:12
@arjunsavel
Copy link
Collaborator Author

we want to instead do this for the ingest_sources function. will make the tests take too long, and tests will fail if SIMBAD is down.

@arjunsavel
Copy link
Collaborator Author

SIMBAD as part of a set of monthly tests?

@arjunsavel arjunsavel marked this pull request as draft July 9, 2024 20:34
@arjunsavel
Copy link
Collaborator Author

todo: make this a ~monthly check.

@kelle
Copy link
Contributor

kelle commented Jul 9, 2024

In SIMPLE, we called it scheduled_checks.py so it's not auto-discovered by pytest.

Here's also the action file to use for inspiration. This runs every 30 days:
https://github.com/SIMPLE-AstroDB/SIMPLE-db/blob/main/.github/workflows/scheduled-tests.yml

@arjunsavel arjunsavel marked this pull request as ready for review July 31, 2024 18:27
Copy link
Contributor

@kelle kelle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks super cool but I don't really understand the difference between the two tests. I think it will be more clear if you make the initial query a fixture that's only executed once.

tests/scheduled_checks.py Outdated Show resolved Hide resolved
tests/scheduled_checks.py Outdated Show resolved Hide resolved
tests/scheduled_checks.py Outdated Show resolved Hide resolved
Comment on lines +43 to +48
# Examine DB for each input, displaying results when more than one source matches
t = db.search_object(
simbad_names, output_table="Sources", fmt="astropy", fuzzy_search=False
)
if len(t) != 1:
duplicate_rows += [row]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this is a very thorough test. This is making sure every possibly SIMBAD identifier also identifies only one database source, right? This seems extra to me. It seems like just checking that the first identifier matches only one database source should be enough.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it currently does the latter? simbad_results is only 3 rows for the 3 sources in the database.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how simbad_results can be only 3 rows. A quick search for WASP-76 yielded about a dozen names. The logic here should have that looping over each one.

Copy link
Contributor

@dr-rodriguez dr-rodriguez Aug 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I see why it's 3 rows now: the DB has 3 rows and that's what Simbad returns back. Each row can have multiple names so for example simbad_names can contain ['BD+01 316b', 'WASP-76b'] for a single row and search_object considers both of them when searching (since it can take a list as an input).

tests/scheduled_checks.py Outdated Show resolved Hide resolved
tests/scheduled_checks.py Outdated Show resolved Hide resolved
Copy link
Contributor

@kelle kelle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just trying to wrap my head around this test so I ended up making some readability comments.

@arjunsavel
Copy link
Collaborator Author

thanks for all the comments!

@dr-rodriguez
Copy link
Contributor

Tried running in CodeSpaces but this doesn't run as-is, so some more iteration is needed; see the comments above.

@dr-rodriguez
Copy link
Contributor

Pushed a fix to the various issues I noticed above. The script runs successfully for me in CodeSpaces.

@arjunsavel
Copy link
Collaborator Author

todo: rename the files to be the source names

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants