Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support getting downloads from all time #2325

Closed
osanseviero opened this issue Jun 11, 2024 · 7 comments
Closed

Support getting downloads from all time #2325

osanseviero opened this issue Jun 11, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@osanseviero
Copy link
Contributor

osanseviero commented Jun 11, 2024

This can be done by using the /models endpoint and adding ?expand:downloadsAllTime. As it's a frequent request, it might be useful to support this directly in the library. Right now, I can do it with internal utils

from huggingface_hub.utils._pagination import paginate
api_path = "https://huggingface.co/api/models"
models = list(paginate(api_path, params={"expand": "downloadsAllTime"}, headers={}))
@Wauplin
Copy link
Contributor

Wauplin commented Jun 11, 2024

Agree on the need to provide that.

I would rather add the expand parameter to list_models/list_datasets/list_spaces since these methods already exist but the problem is that the returned value is often very shallow: only the mongo id + the repo id + the expand values are returned, making it unsuitable for ModelInfo/DatasetInfo/SpaceInfo. I'm open to suggestion on how to make this consistent with the existing API.

For the record:

  • /models supports [author, cardData, config, createdAt, disabled, downloads, downloadsAllTime, gated, gitalyUid, lastModified, library_name, likes, mask_token, model-index, pipeline_tag, private, safetensors, sha, siblings, spaces, tags, transformersInfo, widgetData]
  • /datasets supports [author, cardData, citation, createdAt, disabled, description, downloads, downloadsAllTime, gated, gitalyUid, lastModified, likes, paperswithcode_id, private, siblings, sha, tags]
  • /spaces supports [author, cardData, datasets, disabled, gitalyUid, lastModified, createdAt, likes, private, runtime, sdk, siblings, sha, subdomain, tags, models]

(those lists can be updated in the future)

@Wauplin Wauplin added this to the in next release? milestone Jun 11, 2024
@Wauplin Wauplin added the enhancement New feature or request label Jun 11, 2024
@julien-c
Copy link
Member

maybe some kind of get_models_attributes(property="downloadsAllTime", ...)? not sure

@Wauplin
Copy link
Contributor

Wauplin commented Jun 12, 2024

Made some tests and to be consistent we should also update model_info/dataset_info/space_info as well (request for a single repo). Thinking twice about it, I would rather reuse the existing methods since all filters/search features remain the same when using expand[]=.

What I suggest for models:

  1. we add a expand parameter to model_info and list_models that accepts a list of strings (defaults to None)
  2. we check that existing parameters like full/cardData/fetch_config and the new expand parameter are not set at the same time since the API cannot fulfill both.
  3. we make private / tags / downloads / likes optional in ModelInfo. At the moment we consider them as always returned by the server which is not the case.
  4. we had downloads_all_time attribute to ModelInfo since it'll be a quite common use case.
  5. when expand is passed, we forward it to the server. Response is parsed as a ModelInfo (or list of ModelInfo) which is mostly empty but that's fine. Only the requested fields + id are populated.
>>> from huggingface_hub import model_info

>>> model = model_info("gpt2", expand=["downloadsAllTime"])
>>> model
ModelInfo(id='openai-community/gpt2', author=None, sha=None, created_at=None, last_modified=None, private=None, gated=None, disabled=None, downloads=None, downloads_all_time=585955419, likes=None, library_name=None, tags=None, pipeline_tag=None, mask_token=None, card_data=None, widget_data=None, model_index=None, config=None, transformers_info=None, siblings=None, spaces=None, safetensors=None)
>>> model.downloads_all_time
585955419

Anticipating the question, I prefer ModelInfo.downloads_all_time rather than ModelInfo.downloadsAllTime for consistency with others attributes + following pythonic convention. We already do that for lastModified, createdAt, cardData.

The "only" thing that bothers me with this solution is that it's not explicit to the user which expand values they can use just looking at the method signature. If they pass an invalid expand value, the API returns an error with the list of available properties which is already nice. A partial solution to that would be to add correct type annotation + docstring but happy to hear if you have a better idea @osanseviero @julien-c

@julien-c
Copy link
Member

sounds like a good plan.

For 3., would it make sense to return something else than a ModelInfo in that case? like ModelInfoProperties where everything is optional? (having all properties optional in ModelInfo sounds dirty to me but not sure if it makes sense in Python world)

@Wauplin
Copy link
Contributor

Wauplin commented Jun 12, 2024

having all properties optional in ModelInfo sounds dirty to me but not sure if it makes sense in Python world

Not very common IMO but that's the path we went for with the ModelInfo class already:

    id: str
    author: Optional[str]
    sha: Optional[str]
    created_at: Optional[datetime]
    last_modified: Optional[datetime]
    private: bool
    gated: Optional[Literal["auto", "manual", False]]
    disabled: Optional[bool]
    downloads: int
    likes: int
    library_name: Optional[str]
    tags: List[str]
    pipeline_tag: Optional[str]
    mask_token: Optional[str]
    card_data: Optional[ModelCardData]
    widget_data: Optional[Any]
    model_index: Optional[Dict]
    config: Optional[Dict]
    transformers_info: Optional[TransformersInfo]
    siblings: Optional[List[RepoSibling]]
    spaces: Optional[List[str]]
    safetensors: Optional[SafeTensorsInfo]

Having a few more optional fields would at least make things consistent even though it's not so satisfying. I'll check in parallel how I can draft a ModelInfoProperties class and see if it makes more sense.

@Wauplin
Copy link
Contributor

Wauplin commented Jun 27, 2024

Updated PR for it (#2333) with everything we talked above. It's ready for review :)

@Wauplin
Copy link
Contributor

Wauplin commented Jul 17, 2024

Closed by #2333.

@Wauplin Wauplin closed this as completed Jul 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants