Lack of explanation of difference between data that describes tasks #478

Open · Make42 opened this issue Feb 16, 2023 · 11 comments

@Make42 (Contributor) commented Feb 16, 2023

It is unclear what the differences are conceptually between the following things.

In the web app there are

(screenshot: the task details tabs in the web UI)

  • hyperparameters (in tab "configuration") - which are set using task.connect or task.connect_configuration (see the sketch after this list)
  • user properties (in tab "configuration") - which are set using task.set_user_properties
  • execution (tab)
  • info (tab)
  • tags
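
For concreteness, here is a minimal sketch of how I understand these are set from code (project/task names and values are made up; the calls are the standard clearml SDK ones as far as I can tell):

```python
from clearml import Task

task = Task.init(project_name="examples", task_name="demo")  # hypothetical names

# Hyperparameters: a connected dict, shown under CONFIGURATION > Hyperparameters
params = {"batch_size": 64, "lr": 1e-3}
task.connect(params)

# Configuration object: an arbitrary config blob, shown under Configuration Objects
task.connect_configuration({"model": {"layers": 4}}, name="model_config")

# User properties: key-value pairs that stay editable after the task has finished
task.set_user_properties(dataset_version="v2", owner="team-a")

# Tags: boolean-style markers that can also be used when searching for tasks
task.add_tags(["baseline"])
```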

Programmatically, the hyperparameters are more convenient to set and more flexible than the user properties.
Conceptually they are all data describing the task, but the differences between them are not clear to me.
Also, it seems to me that the hyperparameters can do anything the user properties can do, so what is the point of the user properties?

The main advantage of user properties over hyperparameters seems to be that they can be changed after the task has finished; but tags can also be changed after the task has finished, and tags can be used as query arguments when searching for tasks.
However, tags are effectively only boolean (either a task has a tag or it does not), while user properties have a value, e.g., an explicit boolean, integer, float, or string.
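
To illustrate what I mean (a sketch under my assumptions; the task ID and values are made up), both tags and user properties can be edited on an already-finished task, but only tags double as a built-in search filter:

```python
from clearml import Task

# Fetch an already finished task (hypothetical ID) and edit its metadata after the fact
task = Task.get_task(task_id="abc123")

task.add_tags(["reviewed"])                                      # boolean-style marker
task.set_user_properties(reviewer="alice", review_score="0.93")  # key-value pairs with values

# Tags double as a built-in search filter
reviewed_tasks = Task.get_tasks(project_name="examples", tags=["reviewed"])
```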

The tabs "execution" and "info" both seem to contain information that I can not set, but that are extracted by ClearML automatically from running the task. But what is the reasoning behind those two different tabs?

Everything here is "information", so the name "info" for the tab is not very descriptive.

Here is my interpretation:

  • The tab "execution" contains data that describes the code that produced the task, including
    • commit
    • changes of code compared to the commit
    • the code dependencies
  • The tab "info" contains data that describes the execution of the code, including
    • time: start and duration
    • hardware used
    • user
    • code environment: python version, python execution script
    • position of the task in a task hierarchy: parent task
  • The tab "configuration" contains data that describes the task conceptually and the information is provided by the user.
    • "hyperparameters" and "configuration objects" can be set programmatically during the task, but is read-only after the task finished
    • "user properties" can be set programmatically during the task, but can also be added and changed after the task finished; so they are the same as tags, but they cannot be queried when searching for a task.
    • "hyperparameters" can be monitored during the task and the last value is saved, while "configuration objects" need to be overwritten explicitly during the task.
    • "configuration objects" can be large, arbitrary files, while "hyperparameters" can only be certain types of Python objects
    • "configuration objects" are just blobs of data, while "hyperparameters" can be searched later.
  • Tags can be arguments for a query, in contrast to the other type of descriptive data. They can be changed after the task finished. I am not sure what their purpose is conceptually.
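
As an illustration of the "configuration objects can be large, arbitrary files" point (a sketch, assuming connect_configuration accepts a local file path as well as a dict; the file name is made up):

```python
from pathlib import Path
from clearml import Task

task = Task.init(project_name="examples", task_name="demo")

# Hyperparameters: individual, searchable key-value pairs
task.connect({"batch_size": 64, "optimizer": "adam"})

# Configuration object taken from a whole file: stored as one opaque blob
task.connect_configuration(Path("configs/model.yaml"), name="model_config")
```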

Assuming my interpretation is correct, this raises a couple of questions for me:

  • Shouldn't the "code environment: python version, python execution script" be under the tab "execution"?
  • Shouldn't the tab "info" be called "execution"?
  • Shouldn't the tab "execution" be called "code"?
  • Why are we not able to use any type of data above as a query argument, not only "tags"?

Maybe all of this can be documented in more detail.


If someone explains those things here, I am happy to make a pull request.

@idantene (Contributor) commented Mar 7, 2023

Hey @Make42!
I'll try and answer these to the best of my understanding as an external ClearML user; I hope it helps.
I think your interpretation is quite close, except that the "configuration" tab is a bit more fine-grained than you describe.

My mental mapping is as follows:

  • execution contains data that is relevant to create an exact replica of this task, e.g. commits, uncommitted changes, packages used, etc. These can only be edited when a task is in Draft mode.
  • info contains arbitrary data about the actual task's execution - machine, runtime, user, etc. These are determined by ClearML.
  • configuration contains data that is relevant for a task's execution but can be changed dynamically. ClearML logs some of these automagically (e.g. hyperparameters from your models), and lets you "connect" existing configurations to support remote execution. These are either edited by your code during runtime, or when a task is in Draft mode. The various sections are just an arbitrary division of these and do not represent anything specific IMO. For example, we use the Hyperparameters to store runtime arguments (such as references to other tasks), we never use the User properties, and the Configuration objects are automatically populated when we connect a configuration.
  • Tags are simply used for filtering and act as additional metadata that's also visible from outside the detailed experiment view.

With these in mind, for your questions:

Shouldn't the "code environment: python version, python execution script" be under the tab "execution"?

The ones that appear in the Info tab represent what the code actually used. The desired python version is captured in the Execution tab, and ClearML will match it as best as it can to an available Python version on the remote machine.
The execution script is not something that describes the execution, and can be changed if you clone a task so that it refers to a different entrypoint.

Shouldn't the tab "info" be called "execution"?
Shouldn't the tab "execution" be called "code"?

I agree that Info and Execution are perhaps not the best names for these tabs. At the same time, I cannot come up with good alternatives to suggest for ClearML.

The Execution tab does not relate only to the code though, so I don't think that's an appropriate name.
As per my mental vision of this, Execution represents mutable information that pertains to replicating the exact environment and task, whereas Info represents immutable information that simply describes the runtime environment on a larger scale.

Why are we not able to use any type of data above as a query argument, not only "tags"?

I think this is a great suggestion, maybe @ainoam can comment on this?
One thing from my past experience, though, is that these types of queries tend to be cumbersome and hard to read. For example, mlflow offers this functionality, but then you end up with queries such as task.execution.uncomitted_changes like '%added line that I\'m looking for%'.

@ainoam (Collaborator) commented Mar 8, 2023

That's an excellent summary @idantene.

I will add that user properties and tags share post-execution mutability to facilitate easier organization. They simply provide different interfaces (as @Make42 noted): existence vs. key-value.

Why are we not able to use any type of data above as a query argument, not only "tags"?

Tags are not the only query arguments (see task_filter in Task.get_tasks()).
ClearML tries to strike a balance between common usability and its resource requirements (which would grow considerably if any field could be indexed for querying) - so any specific field is a potential extension (as long as there's a compelling use case behind it), but having all fields categorically indexed is simply impractical.
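
For example (a sketch; the exact task_filter keys follow the backend Tasks.get_all API, so treat the field names below as indicative rather than definitive):

```python
from clearml import Task

# Tags are a first-class filter...
tasks = Task.get_tasks(project_name="my project", tags=["baseline"])

# ...and task_filter exposes further backend query fields
tasks = Task.get_tasks(
    project_name="my project",
    task_filter={"status": ["completed"], "order_by": ["-last_update"]},
)
```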

@Make42 (Contributor, Author) commented Apr 14, 2023

@idantene, @ainoam: Thank you so much, you helped me out a lot! Based on your explanations, I would like to improve on my suggestions. Maybe it is helpful; I put quite a bit of thought into it, but it is probably not yet perfect.

Queries

I did not mean that one should be able to query every field in the entire Web UI. I wanted to suggest that one could query the fields in CONFIGURATION in addition to tags.

Renaming Suggestion for Tasks

Considering what @idantene wrote, I would suggest naming them as you have described them.

Current name → Suggestion:

  • "Execution" → "Environment Requirements", or "Env. Reqmts" for short (A Dictionary of Abbreviations, Burt Vance, Oxford University Press), or "Requirements"
  • "Configuration" → "Attributes", because that is the stuff that the user changes, be it configuration or statistics.
  • "User Properties" → "Mutable Key-Values" (we already know they are in the context "Attributes")
  • "Hyperparameters" → "Immutable Key-Values" (we already know they are in the context "Attributes")
  • "Configuration Objects" → "Files" (we already know they are in the context "Attributes")
  • "Info" → "Execution" or "Runtime"

Renaming Suggestion for Models

Current name → Suggestion:

  • "Network" → "Model Configuration"
  • "Metadata" → "Attributes"

The model configuration does not have to relate to neural networks.
Also, remove "Model Configuration" from within the box under the tab - it's redundant.
Metadata is actually data about data - this is not the case here - and I suggest making it consistent with the above.
Likewise, remove the text "Metadata" as a heading - it's redundant.

However, I would in fact suggest to simply have the same scheme as for tasks.
In fact, I would also suggest that the same scheme is used for Datasets.

@ainoam (Collaborator) commented Apr 16, 2023

Thanks for summarizing @Make42 :)

I did not mean that one should be able to query every field in the entire Web UI. I wanted to suggest that one could query the fields in CONFIGURATION in addition to tags.

Querying Hyperparameters does indeed make a lot of sense - we should definitely make it available in a near-future version.
In the interim you can already achieve this with some internals hacking, e.g.:

```python
Task.get_tasks(project_name='my project', task_filter={'hyperparams.Args.batch_size.value': '64', '_allow_extra_fields_': True})
```

Perhaps better open this as a dedicated issue in the package repo?

Renaming Suggestions

We much appreciate the time you took to consider these - we'll take them under advisement, though I'm not sure all the terms will make it through.
For example, Configuration objects are not necessarily files, and Hyperparameters are not immutable in every circumstance (quite the opposite - being able to modify them on draft tasks is a key feature).

@Make42 (Contributor, Author) commented Apr 17, 2023

@ainoam:

Configuration

Configuration or Parameters suggests that this is "input". However, currently I am reporting statistics about the task in "Hyperparameters". "Properties", on the other hand, suggests something like statistics about the task - something that results from the way the task is set up, or something that is part of the nature of the task.

I suggested "Attributes", because it (kind of) entails both input ("configuration", "parameters", etc.) and resulting information (statistics, properties, etc.). "Metadata" is a type of "resulting information" as well, I think.

User Properties

How about "Descriptors" or "Annotations"? Everything in "Attributes" / "Configuration" is "from the user", but with "User Properties" you mean something that can be changed even after it is final, so it should not be a real property of the task or a configuration or something like that. However, a description or annotation might be changed later.

Configuration Objects

What are "configuration objects" then? They are blobs of data right? If so, "object" (as in "object store") would be right, so "attribute object" might actually be good. One might store not only configuration in them that is why the change might be sensible.

@idantene (Contributor) commented:

Chiming in, hope that's fine with you 😉 Per @Make42's suggestions, here's my two cents on these (plus explanations, in the hope that it yields a fruitful discussion):

tl;dr: I think Attribute is too vague of a term, and key-value offers no insight about the intended use-case.

  • Execution -> Environment Requirements: This feels like a lengthy renaming to me, which pretty much conveys the same information. The Requirements alternative is perhaps more suitable.
  • Configuration -> Attributes: Attributes is a very generic term and it can be applied anywhere. These can be considered input in a way, but since some of these are automagically captured by ClearML (and may not be used by the user otherwise), they're "not exactly input". I think the Configuration title is very accurate in this case. On your end, you should probably not report statistics about the task under Hyperparameters, but rather as a scalar or similar.
  • User Properties, Hyperparameters, Configuration Objects - The suggestion for all of these is identical with "key-value pairs", which makes it less accessible to a "common" user and more oriented toward a developer. To make things more complicated, a user can set their own parameters and create various headings from there. For example, we set parameters under a ClearML topic. My point with this is that ultimately these are more "free form" than you might anticipate, and the default three are sensible defaults. Maybe the default User Properties is not needed though. @ainoam it would be nice if these were not visible if they have no content though.
  • Info -> Runtime/Execution: Since the Execution is taken, I do think that Runtime Info is a better description. Info on its own is kinda vague.
  • Network -> Model Configuration: This is an important step that ClearML must take IMO. Not all models are neural nets. Not all models have a concept of "iteration".
  • Metadata -> Attributes: Same argument against this (attributes is too vague). Metadata (under "Model", of course) is self-explanatory, and the "data about data" concept still holds - the data in this case is the model. So it's data about the model, hence metadata.

@Make42 (Contributor, Author) commented Apr 17, 2023

@idantene, sure chime away :-D. Let me respond in the same spirit. I am going to be rather matter-of-fact; I hope you can still read it as being in a friendly spirit. And because this kind of terminology stuff is part of my academic research, I might be a bit picky.

  • "Attribute is too vague" - yes, it is supposed to be vague and offer little insight. I believe, one should use vague terms is something vague is supposed to be conveyed and specific terms if something specific is conveyed. Please note, that I, for example, use "Configuration" for both input-like data like configurations/parameters etc. and output-like data like statistics etc.. In the current state of the Web UI, the term used must include both.
  • "key-value offers no insight about the intended use-case" - true, and that is on purpose, because the way ClearML is currently build, those are arbitrary key-value pairs. If we had two tabs, one for input-like data and one for output-like data, that would be a different story. In fact, that is what I suggest and then we should have more specific words and I would be against "Attribute" and "key-value".
  • Considering my two previous points: It is bad to be too general when a more specific word would be possible, because the more general word conveys less information. However, it is worse to be too specific, because then one has two problems, namely (a) one excludes things that belong under the same rubric, and the user cannot find them by thinking logically, and (b) it is logically wrong. An example for (b): Let's say "Attributes" is the title and in it are only configurations; then, since those configurations are all attributes, this is vague but true. But if the title is "Configurations" and it also contains things that are not configurations, then it is false for those non-configurations.
  • "Environment Requirements [...] feels like a length renaming to me" - that is why I suggested "Env Rqrmt", but there are other ways to shorten this, like "Env. Requirements", should is nearly as long as "Requirements".
  • The term "attributes" does not only convey the idea of being input-like, but also being simply descriptive of the thing, like the term "property", and sometimes it refers to statistics specifically (see https://en.wikipedia.org/wiki/Attribute). Thus, I believe your worries regarding automagically captured things is in fact unfounded. Specifically for those "attribute" is perfect.
  • "My point with this is that ultimately these are more "free form" then you might anticipate" - I am not sure as precisely this is why I chose vague terms before.
  • "Since the Execution is taken" - "Execution" would not be taken as per my suggestion, because "Execution" would be renamed to "Requirements". The term "info" is kind of redundant.
  • "Not all models have a concept of "iteration"." well strictly speaking, even neural network models do not have iterations, but the methods used to train a neural network models have iterations. Considering this, other training methods (usually / always ?) also have iterations. The problem is though, that those iterations are not exposed by all programming frameworks that practitioners use.
  • Models are not data, although I can agree that in ML models are data if they are supposed to be recorded, as any digital record is persisted as data. If we go this route, we might as well call "Configuration"/"Attributes" simply "Metadata" (which for the sake of consistency I could get behind, following the reasoning that everything here is recorded as data.)

@idantene (Contributor) commented:

I very much agree with the majority of your comments there 👍🏻 (and also highlighting that I'm not a ClearML member in any way - I'm mostly sharing your frustration with this terminology issue). My last thoughts on this before chiming out 😁:

One thing I perhaps disagree with is that you use Configuration for outputs. It's something we've been struggling with as well - misusing the system to fit our paradigm rather than trying to understand the ClearML paradigm. There are many ways to achieve the same end result, but if one goes with the intended approach, other parts of the SDK fall neatly into place. I'm not sure that will solve all or any of your troubles - I think it's just a side effect of the terminology confusion introduced by ClearML.

The input/output data is available to varying degrees - inputs include the Requirements and Configuration tabs, and outputs (considering runtime as output as well) include pretty much everything else, with some exceptions relating to the Configuration tab.
In my mind, this then boils down to the Configuration tab, which is "a bit of both but also neither" (calling it Metadata would be also accurate, I just think it would look weird in the SDK?).

Finally, I completely agree about the concept of iterations generally in ML. My issue with it (in ClearML) is that it is of no practical use in many cases, and making it a mandatory argument (and forcing it to be integer as well) is extremely annoying IMO.

@Make42 (Contributor, Author) commented Apr 17, 2023

@idantene: Regarding "going with ClearML's paradigm": I would love to, but I do not understand the paradigm. That is what my original question here was about. Your last post cleared the fog a bit more. So, how do you save statistics/output, "going with the intended approach"?


Finally, I completely agree about the concept of iterations generally in ML. My issue with it (in ClearML) is that it is of no practical use in many cases, and making it a mandatory argument (and forcing it to be integer as well) is extremely annoying IMO.

is so true!

@idantene (Contributor) commented:

We also discussed this on Slack, but in case anyone follows up on this -

We save any statistics/outputs using the Logger class, accessed by task.get_logger(). If one needs to access those later dynamically, we use task.get_reported_plots(), task.get_reported_scalars(), task.get_reported_single_value(), etc.
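
A minimal sketch of that flow (project/task names and values are made up; the reporting calls are the standard Logger methods as far as I know, and the read-back calls are the ones mentioned above):

```python
from clearml import Task

task = Task.init(project_name="examples", task_name="demo")
logger = task.get_logger()

# Report outputs/statistics through the Logger rather than via hyperparameters
logger.report_scalar(title="loss", series="train", value=0.42, iteration=10)
logger.report_single_value(name="final_accuracy", value=0.93)

# Read them back later, e.g. from another process that fetched the task
scalars = task.get_reported_scalars()
final_accuracy = task.get_reported_single_value(name="final_accuracy")
```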

One more side note to the ClearML team (@ainoam) - it is quite weird that a dataframe is under the "Plots" tab. I would say it's more of "Scalars" -- could just be another naming issue though. I believe the original "Scalars" were meant to be used with iterations (on the X axis), and anything else was "Plots". If the concept of iterations is dropped (as well it should!), then DataFrames definitely belong in the "Scalars" tab, and "Debug Samples" can fit nicely in the "Plots".

@Make42 (Contributor, Author) commented Apr 19, 2023

@ainoam , @idantene

I have thought some more about the whole topic. I think one should stick to one of two approaches:

  • We don't care about the semantic types that are inside our "containers", we just care about the technical types (e.g. key-value store, object store).
  • We care about the semantic types (e.g., input configuration, output statistics) and any technical type can be stored in the respective category.

ClearML seems to follow both approaches simultaneously, which leads to considerable confusion:

  • "plot", "artifact", "scalars" are technical types
  • "configuration", "hyperparameter" are a semantic types
  • "console" is neither (but is still very clear)
  • "info" says nothing
