fix: CI dependency conflicts, deprecated example file, version update #2

Open
wants to merge 78 commits into base: main
Conversation

huangrpablo
Collaborator

No description provided.

rbiseck3 and others added 30 commits June 25, 2024 18:00
### Description
Move astradb destination connector over to the new v2 ingest framework
…ing (Unstructured-IO#3287)

## Summary

This PR addresses an issue where the code could attempt to run `soffice`
in multiple processes simultaneously; closes Unstructured-IO#3284.
The fix is to add a wait mechanism for when another `soffice` process is
already running.

## Diagnosis of issue

- `soffice` allows only one running process when invoked with the plain
`soffice` command.
- On the `main` branch, the function `partition.common.convert_office_doc`
simply spawns a subprocess that runs the `soffice` command to convert a `doc`
or `ppt` file into `docx` or `pptx` format.
- If multiple partition calls process `doc` or `ppt` files and they all
spawn `soffice` subprocesses, only one will succeed; the other processes
simply fail and return 1 from the subprocess.
- Downstream, this leads to errors like `PackageNotFoundError:
Package not found at '/tmp/tmpac6lcu4w/document.docx'`.

## Solution

While there are
[ways](https://www.reddit.com/r/libreoffice/comments/agk3os/how_to_open_more_than_one_calc_instance_under/)
to circumvent the single-instance limit of `soffice` by setting a temporary
file as the user installation environment, such solutions rely on the
internals of `soffice` and add maintenance cost to track its changes.

This PR solves the problem by adding a wait mechanism (see the sketch after this list):
- we first spawn a subprocess to run `soffice`;
- if `stdout` is empty and we still have wait-time budget left, the
function first checks whether another `soffice` process is running:
  * if yes, the function waits 0.01s before checking again;
  * if no, the function spawns a subprocess to run `soffice` and returns to
the beginning of this step;
- we need to return to the beginning and re-check whether `stdout` is empty
because another collision could occur right after `soffice` becomes
available.
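
Below is a minimal sketch of this wait loop, assuming a 0.01s polling
interval and a `pgrep`-based process check; the names and timeout value are
illustrative, not the actual implementation:

```
import subprocess
import time

def _soffice_is_running() -> bool:
    # pgrep exits 0 when at least one matching process exists
    return subprocess.run(["pgrep", "soffice"], capture_output=True).returncode == 0

def convert_with_wait(command: list[str], max_wait: float = 60.0) -> None:
    # Spawn soffice; if it loses the single-instance race (empty stdout),
    # wait for the competing process to exit, then try again.
    waited = 0.0
    while waited < max_wait:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.stdout:
            return  # conversion ran in this process; done
        while _soffice_is_running() and waited < max_wait:
            time.sleep(0.01)
            waited += 0.01
    raise RuntimeError(f"soffice conversion did not start within {max_wait}s")
```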

## Test

This PR adds two unit tests.
Additionally, this can be tested by running partition of `.doc` files
locally with multiprocessing.
Moved numpy pin to `base.in` where it will be picked up by packaging.

Side note:
`constraints.txt` (formerly `constraints.in`) is a really useful
pattern: you put a constraint there, add that file as a `-c` requirement
in other files, and the constraint will be applied when pip-compiling
*only when needed*, i.e. only when the library is required by something else.
Neat! However, unfortunately, in my searches I've never found a similar
pattern for packaging, so any pins we want to propagate to user installs
need to be explicitly placed in the `.in` files.
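
For illustration, a minimal, hypothetical instance of the pattern (the file
names and pins are examples only):

```
# constraints.txt
numpy<2.0

# base.in -- the constraint applies only if numpy is pulled in transitively
-c constraints.txt
python-docx
```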

So what is `constraints.txt` really doing for us? Well, in the past I
think there have been instances where something was temporarily broken in
an upstream dependency but we expected it to be patched soon; in the
meantime we wanted things to work in our CI builds and development
installs, so it wasn't worth pinning it everywhere it's used. Having said
that, I'm coming to the conclusion that `constraints.txt` causes more
harm than good through the confusion it causes with respect to packaging --
maybe we should remove that pattern at some point.
**Summary**
Remedy a gap where the `strategy` argument passed to `partition()` was not
forwarded to `partition_doc()` or `partition_odt()` and so never made its
way to `partition_docx()`.
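
For example (a minimal usage sketch; the file path is hypothetical):

```
from unstructured.partition.auto import partition

# With the fix, the strategy specified here is forwarded through
# partition_doc()/partition_odt() down to partition_docx().
elements = partition(filename="example-docs/fake.doc", strategy="fast")
```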
…tructured-IO#3234)

This PR adds new capabilities for drawing bounding boxes for each layout
(extracted, inferred, OCR, and final), plus a dump of the OD model output as
a JSON file for better analysis.

---------

Co-authored-by: Christine Straub <[email protected]>
Co-authored-by: Michal Martyniak <[email protected]>
### Description
Isolate all log statements that happen per record and make them debug
level to avoid bloating the console output.
### Summary

Bumps to the latest `langchain-community` version to resolve
[CVE-2024-2965](https://nvd.nist.gov/vuln/detail/CVE-2024-2965).

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: MthwRobinson <[email protected]>
### Description
Migrate the OneDrive source connector to v2, adding richer content
pulled from the SDK response to attach further metadata to the
`FileData` produced by the indexer.

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: rbiseck3 <[email protected]>
**Summary**
The `python-docx` error `docx.opc.exceptions.PackageNotFoundError`
arises both when no file exists at the given path and when the file
exists but is not a ZIP archive (and so is not a DOCX file).

This ambiguity is unwelcome when diagnosing the error, as the two
possible conditions generally call for different courses of action to
resolve it.

Add detailed validation to `DocxPartitionerOptions` to distinguish these
two and provide more precise exception messages.
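
A minimal sketch of the distinction being drawn (a hypothetical helper, not
the actual `DocxPartitionerOptions` code):

```
import os
import zipfile

def _validate_docx_source(path: str) -> None:
    # Separate the two conditions python-docx conflates in PackageNotFoundError:
    if not os.path.isfile(path):
        raise FileNotFoundError(f"no such file: {path!r}")
    if not zipfile.is_zipfile(path):
        raise ValueError(f"not a ZIP archive (so not a DOCX file): {path!r}")
```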

**Additional Context**
- `python-pptx` shares the same OPC-Package (file) loading code used by
`python-docx`, so the same ambiguity will be present in `python-pptx`.
- It would be preferable for this distinguished exception behavior to be
upstream in `python-docx` and `python-pptx`. If we're willing to take
the version bump it might be worth considering doing that instead.
### Description
Using an `isinstance` check on the destination registry mapping breaks when
inheritance is used for the associated uploader types. This adds a
connector-type field to all uploaders so that the registry entry can be
fetched deterministically when the pipeline checks for the associated
stager.
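
A toy sketch of the problem and the fix (illustrative names only):

```
class Uploader:
    connector_type: str = "base"

class S3Uploader(Uploader):
    connector_type = "s3"

class CustomS3Uploader(S3Uploader):  # isinstance(x, S3Uploader) also matches here,
    pass                             # but connector_type stays an exact key

registry = {"s3": {"stager": "S3UploadStager"}}
entry = registry[CustomS3Uploader().connector_type]  # deterministic lookup
print(entry)
```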
### Summary

Release for `0.14.9`.
…3293)

Migrates the OpenSearch destination connector to V2. Relies heavily on the
Elasticsearch connector where possible (this is expected).
### Summary

Adds links to the serverless API. README updates look like the
following:

<img width="904" alt="image"
src="https://github.com/Unstructured-IO/unstructured/assets/1635179/fcb2b0c5-0dff-4612-8f18-62836ca6de8b">
When we switched community Slack from Paid to Free we lost the CI test
bot. Also, if messages are deleted after 90 days, our expected test data
will disappear.

- created a new bot in our paid company slack
(test_unstructured_ingest_bot)
- added a new private channel (test-ingest)
- invited the bot to the channel
- adjusted the end datetime of the test to cover the first few messages
in the channel

Still to do:
- update the CI secrets with the new bot token
- update the LastPass entry with the new bot token (I don't have write
access... :( )
…nstructured-IO#3310)

### Summary

Updates to the latest version of the `wolfi-base` image. Changes
include:
- Version bumps to address CVEs.
- `libreoffice` is now included in the `arm64` image, so `.doc` files are
now supported on `arm64`. `.ppt` files do not work with the `libreoffice`
package currently available on `wolfi-os`; we have follow-on work to look
into that.
- Updates the location of the `tesseract` `tessdata` files in the
`arm64` build. Closes Unstructured-IO#3290.
- Closes Unstructured-IO#3319 and adds `psutil` to the base dependencies.

### Testing

- `test_dockerfile` should continue to pass with the updates.
Updates the OpenSearch source connector to v2. Leverages the Elasticsearch
v2 connector heavily.

Expected test fixtures are renamed because that's how Elasticsearch names them.
This PR adds a V2 version of the Pinecone destination connector
…Unstructured-IO#3300)

This pull request fixes the counting-tables metric for three cases:
- False negatives: when a table exists in ground truth but none of the
predicted tables matches it, that table should count as 0 and the file
should not be skipped entirely (before, it was np.NaN).
- False positives: when a predicted table doesn't match any ground-truth
table, it should be counted as 0; right now it is skipped in processing
(matched_indices == -1).
- The file should be skipped entirely only if there are no tables in either
ground truth or prediction.

In short, the previous metric calculation didn't account for OD mistakes.
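
A toy sketch of the corrected counting, assuming `matched_indices` maps each
predicted table to a ground-truth index (-1 for no match) and `scores` holds
per-prediction match scores; names are illustrative:

```
def file_table_score(matched_indices, scores, n_ground_truth):
    per_table = []
    for gt in range(n_ground_truth):
        if gt in matched_indices:
            per_table.append(scores[matched_indices.index(gt)])
        else:
            per_table.append(0.0)  # false negative counts as 0, not NaN
    per_table += [0.0 for m in matched_indices if m == -1]  # false positives count as 0
    if not per_table:
        return None  # skip the file only when neither GT nor prediction has tables
    return sum(per_table) / len(per_table)
```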
### Description
This PR handles two things:
* Exposing all the connectors via the connector registries by simply
importing the connector module. This should be safe, assuming all
connector-specific dependencies are imported inside the methods where they
are used and wrapped in the `@requires_dependencies` decorator (a simplified
sketch follows this list).
* Removing any import that pulls from the v2 `ingest.cli` package.
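
A simplified sketch of that decorator pattern (not the actual
implementation in the codebase):

```
import functools
import importlib

def requires_dependencies(deps, extras=None):
    # Import dependencies only when the wrapped method runs, so importing the
    # connector module (to register it) never needs the extras installed.
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for dep in deps:
                try:
                    importlib.import_module(dep)
                except ImportError as e:
                    hint = f' (try pip install "unstructured[{extras}]")' if extras else ""
                    raise ImportError(f"{fn.__name__} requires {dep}{hint}") from e
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```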
This PR provides support for the V2 MongoDB destination connector.
…3298)

Change the `unstructured-client` pin to set a minimum version instead of a
maximum version, and run `make pip-compile`.

Integration tests that were dependent on the old version of the client
are removed. These tests should be replicated in/moved to the SDK
repo(s).
### Description
Adds [SingleStore](https://www.singlestore.com/) database destination
connector with associated ingest test.
### Description
Allow users to pass in a reference to a custom-defined stager via the
CLI. Checks are run to ensure the instance passed in is a subclass of the
`UploadStager` interface.
This pull request adds table detection metrics.

One case I considered:

Case: two tables are predicted and matched with one table in ground truth.
Question: is this matching correct in both cases, or just for one table?

There are two subcases:
- the table was predicted by OD as two non-overlapping sub-tables (split in
half) -> in my opinion both are correct
- it is a false positive from the table-matching script in
`get_table_level_alignment` -> 1 good, 1 wrong

As we don't have bounding boxes, I followed the notebook calculation
script and assumed the pessimistic second subcase.
The table metrics considering spans are not used and they mess with the
output, so I have removed that code. However, I have left
`table_as_cells` in the source code - it may still be useful for users.
…3355)

**Summary**
In preparation for further work on auto-partitioning (`partition()`),
improve typing and organize `test_auto.py` by introducing categories.
rbiseck3 and others added 17 commits July 17, 2024 15:53
…IO#3408)

### Description

Looks like some connectors were never added to the registry explicitly
since that change was introduced. All of them are now updated.
…IO#3411)

**Summary**
Elaborate the `FileType` enum to be a complete descriptor of file-types.
Add methods to allow `STR_TO_FILETYPE`, `EXT_TO_FILETYPE` and
`FILETYPE_TO_MIMETYPE` mappings to be replaced, removing those redundant
and noisy declarations.

In the process, fix some lingering file-type identification and
`.metadata.filetype` errors that had been skipped in the tests.

**Additional Context**
Gathering the various attributes of a file-type into the `FileType` enum
eliminates the duplication inherent in the separate `STR_TO_FILETYPE`
etc. mappings and makes access to those values convenient for callers.
These attributes include what MIME-type a file-type should record in
metadata and what MIME-types and extensions map to that file-type. These
values and others are made available as methods and properties directly
on the `FileType` class and members. Because all attributes are defined
in the `FileType` enum there is no risk of inconsistency across multiple
locations and any changes happen in one and only one place. Further
attributes and methods will be added in later commits to support other
file-type related operations like mapping to a partitioner and verifying
its dependencies are installed.
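
A condensed sketch of the idea (the members and attributes shown are
illustrative only; the real enum carries many more):

```
from enum import Enum

class FileType(Enum):
    DOCX = ("application/vnd.openxmlformats-officedocument.wordprocessingml.document", (".docx",))
    PDF = ("application/pdf", (".pdf",))

    def __init__(self, mime_type: str, extensions: tuple):
        self.mime_type = mime_type    # MIME-type recorded in metadata
        self.extensions = extensions  # extensions that map to this type

    @classmethod
    def from_extension(cls, ext: str):
        return next((ft for ft in cls if ext in ft.extensions), None)
```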
)

### Description

At times, the Google Drive response doesn't include some of the metadata
we grab to populate the `FileData` metadata. This is fine, but without
added safeguards it can cause a `KeyError`.
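
A minimal sketch of the safeguard (the response fields are hypothetical):

```
response = {"id": "abc123", "name": "report.pdf"}  # "modifiedTime" absent this time
date_modified = response.get("modifiedTime")  # None instead of raising KeyError
```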
…ctured-IO#3410)

This PR aims to improve the organization and readability of our example
documents used in unit tests, specifically focusing on PDF and image
files.

### Summary
- Created two new subdirectories in the `example-docs` folder:
  - `pdf/`: for all PDF example files
  - `img/`: for all image example files
- Moved relevant PDF files from `example-docs/` to `example-docs/pdf/`
- Moved relevant image files from `example-docs/` to `example-docs/img/`
- Updated file paths in affected unit & ingest tests to reflect the new
directory structure

### Testing
All unit & ingest tests should be updated and verified to work with the
new file structure.

## Notes
Other file types (e.g., office documents, HTML files) remain in the root
of `example-docs/` for now.

## Next Steps
Consider similar reorganization for other file types if this structure
proves to be beneficial.

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: christinestraub <[email protected]>
…-IO#3405)

### Description

If the id value exists in the stats response from fsspec, save it as a
`file_id` field in the metadata being persisted on each element.

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: rbiseck3 <[email protected]>
…nstructured-IO#3422)

### Summary
Currently, the email partitioner removes only `=\n` characters during
the clearing process. However, email content sometimes contains `=\r\n`
characters, especially when read from file-like objects such as
`SpooledTemporaryFile` (the file type used in our API). This PR updates
the email partitioner to remove both `=\n` and `=\r\n` characters during
the clearing process.
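
The broadened clearing step amounts to treating both sequences as
quoted-printable soft line breaks; a minimal sketch (not the partitioner's
exact code):

```
import re

def clear_soft_breaks(text: str) -> str:
    return re.sub(r"=\r?\n", "", text)  # strips both "=\n" and "=\r\n"

assert clear_soft_breaks("Make sure to =\r\nRSVP!") == "Make sure to RSVP!"
```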

### Testing

```
import tempfile

from unstructured.partition.email import partition_email

filename = "example-docs/eml/family-day.eml"

elements = partition_email(
    filename=filename,
)
print(f"From filename: {elements[3].text}")

with open(filename, "rb") as test_file:
    spooled_temp_file = tempfile.SpooledTemporaryFile()
    spooled_temp_file.write(test_file.read())
    spooled_temp_file.seek(0)
    elements = partition_email(file=spooled_temp_file)
    print(f"From spooled_temp_file: {elements[3].text}")
```

**Results:**
- on `main`
```
From filename: Make sure to RSVP!
From spooled_temp_file: Make sure to = RSVP!
```
- on `PR`
```
From filename: Make sure to RSVP!
From spooled_temp_file: Make sure to RSVP!
```
**Summary**
Replace conditional explicit import of partitioner modules in
`.partition.auto` with the new `_PartitionerLoader` class. This avoids
unbound variable warnings and is much less noisy.

`_PartitionerLoader` makes use of the new `FileType` property
`.importable_package_dependencies` to determine whether all required
packages are importable before dispatching the file to its partitioner.
It uses `FileType.extra_name` to form a helpful error message when a
dependency is not installed, so the caller knows which `pip install`
extra to specify to remedy the error.

`_PartitionerLoader` uses the `FileType` properties
`.partitioner_module_qname` and `.partitioner_function_name` to load
the partitioner once its dependencies are verified. Loaded partitioners
are cached with module-lifetime scope for efficiency.
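
A sketch of that load-and-cache behavior, assuming `FileType` exposes the
named properties (not the actual implementation):

```
import importlib

class _PartitionerLoader:
    _cache: dict = {}  # module-lifetime cache of loaded partitioners

    def get(self, file_type):
        if file_type not in self._cache:
            module = importlib.import_module(file_type.partitioner_module_qname)
            partitioner = getattr(module, file_type.partitioner_function_name)
            self._cache[file_type] = partitioner
        return self._cache[file_type]
```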
…#3432)

The pinecone Python package moved its import of `PineconeApiException`.

A Chroma `sleep` was added because even though there is a `wait`, there is
still some sort of timing issue.
**Summary**
In preparation for fixing a cluster of bugs with automatic file-type
detection and paving the way for some reliability improvements, refactor
`unstructured.file_utils.filetype` module and improve thoroughness of
tests.

**Additional Context**
Factor the type-recognition process into three distinct strategies that are
attempted in sequence. Attempted in order of preference, type recognition
falls through to the next strategy when the one before it is not applicable
or cannot determine the file-type. This provides a clear basis for
organizing the code and tests at the top level.

Consolidate the existing tests around these strategies, adding
additional cases to achieve better coverage.

Several bugs were uncovered in the process. Small ones were just fixed,
bigger ones will be remedied in following PRs.
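
A schematic of the sequenced-strategy shape (the strategy functions and
preference order shown are hypothetical; each returns a file-type or None,
and the first definite answer wins):

```
def detect_filetype(ctx: dict, strategies: list):
    for strategy in strategies:
        file_type = strategy(ctx)
        if file_type is not None:
            return file_type
    return None

# Preference order: caller-asserted content-type, then a guessed MIME-type,
# then the filename extension.
strategies = [
    lambda ctx: ctx.get("content_type"),
    lambda ctx: ctx.get("guessed_mime_type"),
    lambda ctx: ctx.get("extension"),
]
print(detect_filetype({"extension": ".pdf"}, strategies))  # ".pdf"
```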
…of `langchain-community` (Unstructured-IO#3433)

Closes Unstructured-IO#3378.

### Summary
This PR aims to update `OpenAIEmbeddingEncoder` to use
`OpenAIEmbeddings` from `langchain-openai` package instead of the
deprecated version from `langchain-community`. This resolves the
deprecation warning and ensures compatibility with future versions of
langchain.
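
The migration is essentially an import swap (the model name is shown only as
an example):

```
# pip install langchain-openai
from langchain_openai import OpenAIEmbeddings  # was: langchain_community.embeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
```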
…` instead of `langchain-community` (Unstructured-IO#3436)

Similar to Unstructured-IO#3433.

### Summary
This PR aims to update `HuggingFaceEmbeddingEncoder` to use
`HuggingFaceEmbeddings` from `langchain_huggingface` package instead of
the deprecated version from `langchain-community`. This resolves the
deprecation warning and ensures compatibility with future versions of
langchain.

### Testing
```
from unstructured.documents.elements import Text
from unstructured.embed.huggingface import HuggingFaceEmbeddingConfig, HuggingFaceEmbeddingEncoder

embedding_encoder = HuggingFaceEmbeddingEncoder(
    config=HuggingFaceEmbeddingConfig()
)
elements = embedding_encoder.embed_documents(
    elements=[Text("This is sentence 1"), Text("This is sentence 2")],
)

query = "This is the query"
query_embedding = embedding_encoder.embed_query(query=query)

[print(e.embeddings, e) for e in elements]
print(query_embedding, query)
print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())
```
**Expected behavior**
No deprecation warning should be displayed. The code should use the
updated `HuggingFaceEmbeddings` class from the `langchain_huggingface`
package.
…ypes (Unstructured-IO#3434)

**Summary**
The `content_type` argument received by `partition()` from the API is
sometimes unreliable for MS-Office 2007+ MIME-types. What we've observed
is that it gets the MS-Office bit right but falls down on distinguishing
PPTX from DOCX or XLSX.

Confirmation of these types is simple, fast, and reliable. Confirm all
MS-Office `content_type` argument values asserted by callers of
`detect_filetype()` and correct swapped values.
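
One cheap, reliable confirmation is to look for the distinctive top-level
part each OOXML subtype carries inside its ZIP container; a sketch, not
necessarily the exact check used:

```
import zipfile

_OOXML_MARKERS = {
    "word/document.xml": "docx",
    "ppt/presentation.xml": "pptx",
    "xl/workbook.xml": "xlsx",
}

def confirm_ooxml_subtype(path: str):
    with zipfile.ZipFile(path) as z:
        names = set(z.namelist())
    return next((t for marker, t in _OOXML_MARKERS.items() if marker in names), None)
```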
**Summary**
A DOC, PPT, or XLS file sent to `partition()` as a file-like object is
misidentified as an MSG file and raises an exception in `python-oxmsg`
(which is used to process MSG files).

**Fix**
DOC, PPT, XLS, and MSG are all Microsoft OLE-based files, a.k.a. Compound
File Binary Format (CFBF). These can be reliably distinguished by
inspecting magic bytes in certain locations. `libmagic` is unreliable at
this or doesn't try, reporting the generic `"application/x-ole-storage"`,
which corresponds to the "container" CFBF format (vaguely, a Microsoft
analog of ZIP) in which all these document types are stored.

Unconditionally use `filetype.guess_mime()` provided by the `filetype`
package that is part of the base unstructured install. Unlike
`libmagic`, this package reliably detects the distinguished MIME-type
(e.g. `"application/msword"`) for OLE file subtypes.
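
Minimal usage of that package (the file path is hypothetical):

```
import filetype

mime = filetype.guess_mime("example-docs/fake.doc")
print(mime)  # "application/msword" for a legacy Word file
```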

Fixes Unstructured-IO#3364