ENH: Pass col_len to on_bad_lines callable #60999

Open
1 of 3 tasks
ansalls opened this issue Feb 24, 2025 · 1 comment
Labels
Enhancement · IO CSV (read_csv, to_csv) · Needs Info (Clarification about behavior needed to assess issue)

Comments


ansalls commented Feb 24, 2025

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I process the output of a third-party component that appends an arbitrary number of additional fields to lines in a CSV file. These fields represent a variable number of qualifying attributes associated with the row. In my context, I don't care about these attributes at all and I'm happy to just let them be dropped.

The issue is that, to avoid the "Length of header or names does not match length of data. This leads to a loss of data with index_col=False." warning, the callable needs to return the truncated list, but the parser doesn't provide the expected length. It would be nice if the parser passed col_len (the expected number of fields) to the callable to make it easier to drop the additional fields.
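
For illustration, here is a minimal sketch of the situation. The sample data is made up (it is not output from the real component), and the names `drop_extra_fields` and `EXPECTED_FIELDS` are hypothetical. The callable only receives the bad line as a list of strings, so it has to know the expected width by some other means, hard-coded here:

```python
import io

import pandas as pd

# Made-up sample: 3 header columns, some rows carry extra trailing fields.
CSV_TEXT = (
    "id,name,value\n"
    "1,alpha,10\n"
    "2,beta,20,extra1,extra2\n"
    "3,gamma,30,extra1\n"
)

EXPECTED_FIELDS = 3  # hard-coded, because the parser does not pass it in


def drop_extra_fields(bad_line: list[str]) -> list[str]:
    """Keep only the leading fields that match the header width."""
    return bad_line[:EXPECTED_FIELDS]


# A callable for on_bad_lines is only supported by the Python engine.
df = pd.read_csv(
    io.StringIO(CSV_TEXT),
    engine="python",
    on_bad_lines=drop_extra_fields,
)
print(df)
```

This works, but only because the value 3 was known ahead of time; the handler itself has no way to learn the width from the parser.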

Feature Description

In PythonParser:

    def _rows_to_cols(self, content: list[list[Scalar]]) -> list[np.ndarray]:
        col_len = self.num_original_columns

        if self._implicit_index:
            col_len += len(self.index_col)

        max_len = max(len(row) for row in content)

        # Check that there are no rows with too many
        # elements in their row (rows with too few
        # elements are padded with NaN).
        # error: Non-overlapping identity check (left operand type: "List[int]",
        # right operand type: "Literal[False]")
        if (
            max_len > col_len
            and self.index_col is not False  # type: ignore[comparison-overlap]
            and self.usecols is None
        ):
            footers = self.skipfooter if self.skipfooter else 0
            bad_lines = []

            iter_content = enumerate(content)
            content_len = len(content)
            content = []

            for i, _content in iter_content:
                actual_len = len(_content)

                if actual_len > col_len:
                    if callable(self.on_bad_lines):
                        new_l = self.on_bad_lines(_content, col_len)  # <-- proposed change: pass col_len to the callable
                        if new_l is not None:
                            content.append(new_l)
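
To show what a handler might look like under the proposed signature, here is a rough sketch. Everything in it is hypothetical: the two-argument callable is not part of the current pandas API, and `filter_rows` is only a stand-in that mirrors the shape of the loop above so the handler can be exercised without the parser.

```python
from typing import Callable, Optional

Row = list[str]


def drop_extra_fields(bad_line: Row, col_len: int) -> Row:
    """Hypothetical two-argument handler: trim the row to the expected width."""
    return bad_line[:col_len]


def filter_rows(
    rows: list[Row],
    col_len: int,
    on_bad_lines: Callable[[Row, int], Optional[Row]],
) -> list[Row]:
    """Stand-in for the loop above; NOT pandas code, it only mirrors its shape."""
    kept = []
    for row in rows:
        if len(row) > col_len:
            new_row = on_bad_lines(row, col_len)
            if new_row is not None:  # None means "skip this line"
                kept.append(new_row)
        else:
            # Short or exact-width rows pass through untouched.
            kept.append(row)
    return kept


rows = [["1", "2", "3"], ["4", "5", "6", "x", "y"], ["7", "8"]]
result = filter_rows(rows, col_len=3, on_bad_lines=drop_extra_fields)
print(result)
```

With this shape the handler needs no outside knowledge of the file, since the width arrives as an argument.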

Alternative Solutions

Use an alternative method to determine the expected number of columns, like processing the header separately to count the columns or hard coding a specific value.
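
A sketch of the header-counting workaround, using only the standard library. The helper names `count_header_fields` and `truncate` and the sample text are assumptions for illustration. It assumes the first line is the header and that there are no implicit index columns.

```python
import csv
import io
from functools import partial


def count_header_fields(text: str, sep: str = ",") -> int:
    """Return the number of fields in the first line of the CSV text."""
    first_row = next(csv.reader(io.StringIO(text), delimiter=sep))
    return len(first_row)


def truncate(bad_line: list[str], col_len: int) -> list[str]:
    """Drop every field past the expected width."""
    return bad_line[:col_len]


# Made-up sample data.
CSV_TEXT = "id,name,value\n1,alpha,10\n2,beta,20,extra1,extra2\n"

col_len = count_header_fields(CSV_TEXT)
# Bind the width so the handler takes a single argument.
handler = partial(truncate, col_len=col_len)

print(col_len)
print(handler(["2", "beta", "20", "extra1", "extra2"]))
```

Because `handler` takes one argument, it has the shape that `on_bad_lines` currently expects with `engine="python"`. The cost is a separate pass over the header before parsing, which is what passing col_len would avoid.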

Additional Context

No response

ansalls added the Enhancement and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Feb 24, 2025
rhshadrach (Member) commented

Thanks for the report. Can you give an example CSV (just a sample, a few lines) that demonstrates the issue?

rhshadrach added the IO CSV (read_csv, to_csv) and Needs Info (Clarification about behavior needed to assess issue) labels and removed the Needs Triage label Feb 25, 2025