invalid warning message #341

Open
jordane95 opened this issue Feb 12, 2025 · 1 comment

Comments

@jordane95
Contributor

def _default_adapter(self, data: dict, path: str, id_in_file: int | str):
    """
    The default data adapter to adapt input data into the datatrove Document format

    Args:
        data: a dictionary with the "raw" representation of the data
        path: file path or source for this sample
        id_in_file: its id in this particular file or source

    Returns: a dictionary with text, id, media and metadata fields
    """
    return {
        "text": data.pop(self.text_key, ""),
        "id": data.pop(self.id_key, f"{path}/{id_in_file}"),
        "media": data.pop("media", []),
        "metadata": data.pop("metadata", {}) | data,  # remaining data goes into metadata
    }

def get_document_from_dict(self, data: dict, source_file: str, id_in_file: int | str):
    """
    Applies the adapter to each sample, instantiates a Document object and adds `default_metadata`.

    Args:
        data: a dictionary with the "raw" representation of the data
        source_file: file path or source for this sample
        id_in_file: its id in this particular file or source

    Returns: a Document
    """
    parsed_data = self.adapter(data, source_file, id_in_file)
    if not parsed_data.get("text", None):
        if not self._empty_warning:
            self._empty_warning = True
            logger.warning(
                f"Found document without text, skipping. "
                f'Is your `text_key` ("{self.text_key}") correct? Available keys: {list(data.keys())}'
            )

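The text key is always popped by the default adapter, so by the time the warning is built `data` no longer contains it. From the message alone we can never tell whether the text key is actually missing or one record simply has an empty text field.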
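To make the behaviour concrete, here is a minimal standalone sketch. It copies only the adapter logic quoted above into plain functions; `default_adapter`, `keys_shown_in_warning`, `keys_snapshotted_first`, and the sample records are hypothetical names made up for illustration and are not part of datatrove. The last helper shows one possible workaround (reading the keys before the adapter runs), offered as an assumption rather than as the maintainers' fix.

```python
def default_adapter(data: dict, path: str, id_in_file, text_key: str = "text", id_key: str = "id") -> dict:
    """Standalone copy of the adapter logic quoted above (illustration only)."""
    return {
        "text": data.pop(text_key, ""),  # removes text_key from `data` if present
        "id": data.pop(id_key, f"{path}/{id_in_file}"),
        "media": data.pop("media", []),
        "metadata": data.pop("metadata", {}) | data,  # remaining data goes into metadata
    }


def keys_shown_in_warning(record: dict, text_key: str = "text") -> list:
    """Keys the warning would list: read from `data` after the adapter has popped from it."""
    data = dict(record)  # copy so the caller's record is left untouched
    default_adapter(data, "file.jsonl", 0, text_key=text_key)
    return list(data.keys())


def keys_snapshotted_first(record: dict, text_key: str = "text") -> list:
    """Hypothetical workaround: capture the keys before the adapter mutates `data`."""
    data = dict(record)
    available = list(data.keys())
    default_adapter(data, "file.jsonl", 0, text_key=text_key)
    return available


empty_field = {"text": "", "url": "a"}  # text key present, but the field is empty
missing_key = {"url": "a"}              # text key absent altogether

print(keys_shown_in_warning(empty_field))   # ['url']
print(keys_shown_in_warning(missing_key))   # ['url']
print(keys_snapshotted_first(empty_field))  # ['text', 'url']
print(keys_snapshotted_first(missing_key))  # ['url']
```

With the keys read after the adapter runs, both records produce the same list, which is the ambiguity described above. Snapshotting the keys first keeps the two cases distinguishable.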

@guipenedo
Collaborator

You can tell from the message itself, since it displays the keys from `data` and not from `parsed_data`.


2 participants