Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(chunking): minimal fix for combine arg typo #345

Merged
merged 3 commits into from
Jan 22, 2024

Conversation

scanny
Copy link
Contributor

@scanny scanny commented Jan 19, 2024

This is the minimal fix for the combine_text_under_n_chars arg-name typo bug described in #337.

Chunking parameters were added to the API about 4 months ago in this commit: 49fe710

At that time, the identifier combine_under_n_chars was used instead of combine_text_under_n_chars (as that parameter appears in chunk_by_title()). As a result, any value supplied for this argument via the REST API is ignored and defaults to 500. The combine_under_n_chars parameter name appears in the API documentation and elsewhere (README.md etc.) and differs from the parameter name documented for chunk_by_title().

A fuller remedy will be to add a separate combine_text_under_n_chars parameter to the API endpoint such that either name is accepted for now, then update the documentation such that all mentions are combine_text_under_n_chars. At some point in the future we may remove the combine_under_n_chars parameter from the endpoint.

However, making those changes requires a synchronized release of the unstructured library and in the interest of time we're making this "quick-fix" that can be released independently of any other dependencies.

Copy link
Collaborator

@awalker4 awalker4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Looks like that unit test I mentioned is catching this as expected. Parallel mode isn't so relevant right now, but that test is asserting that the api parameters are properly sent to partition, so I think you'll just need to rename the arg over there as well.

I'd also make a note in the changelog that this is fixed in case users are checking the release notes.

@scanny scanny force-pushed the scanny/fix-337-combine_chunks branch from cbdc673 to 5acd8a7 Compare January 20, 2024 05:17
@scanny
Copy link
Contributor Author

scanny commented Jan 20, 2024

... that test is asserting that the api parameters are properly sent to partition, so I think you'll just need to rename the arg over there as well.

The parameter name in partition() is correct, that's where the disconnect was. combine_under_n_chars just got lumped into **kwargs and hung around unused; combine_text_under_n_chars then took its default (500).

Actually, partition() doesn't define that parameter explicitly, it doesn't appear until partition_pdf() and even there it's not strictly required because partition_pdf() has **kwargs as well. It's the @add_chunking_strategy() decorator that actually goes looking for that specific name.

It's one of the challenges with taking advantage of the **kwargs behaviors, you lose type checking and then bugs like this one can find a home. I think the way I would solve this if we had the engineering will for it would be to create a PartitionOptions object that can be created at the earliest opportunity (by first receiver of call) and passed around whole from there on. I'd have to noodle it some more but for now I think we just have to be very diligent about arg names.

@scanny
Copy link
Contributor Author

scanny commented Jan 20, 2024

@awalker4 how does the versioning work for this repo? Should I be bumping the version to 0.0.62-dev0 in prepline_general/api/app.py and add my CHANGELOG entry under that new heading?

If you can help me understand how this ends up appearing in production after merging, that would help me a lot so I can let the end-users know what to expect :)

@scanny scanny force-pushed the scanny/fix-337-combine_chunks branch from 5acd8a7 to edd5d2f Compare January 20, 2024 21:18
@awalker4
Copy link
Collaborator

@scanny The version change is driven by the changelog - so you can add a section there under 0.0.63-dev0. Running make version-sync will then replace the 0.0.62 string in the code. The endpoint will then live at /general/v0/general and /general/v0.0.63/general. Will dm you with deployment details.

For consistency and to enable further lint-error fixing in future PRs,
add the configuration required for `black` and `ruff` to run from the
command-line.

This also enables these to be run by language servers (LSPs) in popular
editors. Settings are the same as in the `unstructured` repo.
A pair of anomalies, possibly related to Apple M3 Silicon or just newest
MacbookPro. The reported available virtual memory is 90 GB so bumped
that up to 300 GB in `test_general_api_returns_503()`. Also, for
whatever reason, `starlette.Request.client` is `None` when running the
tests locally, at least on my machine, so include that possible case
(which should be accounted for anyway, `None` is a legitimate
value for `Request.client`).

All tests pass locally after these two changes.
@scanny scanny force-pushed the scanny/fix-337-combine_chunks branch from edd5d2f to 6d936e6 Compare January 22, 2024 23:05
@scanny scanny force-pushed the scanny/fix-337-combine_chunks branch from 6d936e6 to a8935ad Compare January 22, 2024 23:26
@scanny scanny merged commit fe93852 into main Jan 22, 2024
6 checks passed
@scanny scanny deleted the scanny/fix-337-combine_chunks branch January 22, 2024 23:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants