Pipeline hanging on partition #246

Open
mahmoudaymo opened this issue Nov 19, 2024 · 3 comments
mahmoudaymo commented Nov 19, 2024

I am using unstructured-ingest version 0.3.0 with the following code:

from pathlib import Path

from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.processes.chunker import ChunkerConfig
from unstructured_ingest.v2.processes.connectors.local import (
    LocalConnectionConfig,
    LocalDownloaderConfig,
    LocalIndexerConfig,
    LocalUploaderConfig,
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig


def ingest(work_dir: Path, storage_dir: Path, partition_dir: Path):
    """ingest data from different file types using unstructured-ingest"""
    num_processes = 8
    process_config = ProcessorConfig(
        reprocess=True,
        verbose=True,
        num_processes=num_processes,
        work_dir=str(work_dir.resolve()),
        iter_delete=True,
    )
    indexer_config = LocalIndexerConfig(input_path=str(storage_dir.resolve()), recursive=True)
    downloader_config = LocalDownloaderConfig()
    source_connection_config = LocalConnectionConfig()
    partitioner_config = PartitionerConfig(
        strategy="auto",
        hi_res_model_name="yolox",
        fields_include=["element_id", "text", "type"],
        additional_partition_args=dict(
            include_page_breaks=True, analysis=True, process_attachments=False
        ),
    )
    uploader_config = LocalUploaderConfig(output_dir=str(partition_dir.resolve()))
    chunker_config = ChunkerConfig(
        chunking_strategy="by_title",
        chunk_combine_text_under_n_chars=500,
        chunk_max_characters=1500,
        chunk_new_after_n_chars=1000,
        chunk_overlap=100,
    )

    Pipeline.from_configs(
        context=process_config,
        indexer_config=indexer_config,
        downloader_config=downloader_config,
        source_connection_config=source_connection_config,
        partitioner_config=partitioner_config,
        chunker_config=chunker_config,
        uploader_config=uploader_config,
    ).run()

I am running this code in a Docker container, and I have also tested it locally, but I always run into the same issue: the pipeline never finishes and never enters the chunking phase. I am processing about 43K documents of different types (html, xml, pdf, docx, json, ...); since JSON is not supported, those files simply don't get processed. The pipeline hangs in the partition phase and never reaches chunking.

@mahmoudaymo (Author)

sample.log

EvanMWard commented Dec 13, 2024

I ran into this problem with my pipeline as well; my quick fix was to set num_processes to 1, and that seemed to work. I'm not sure what's causing it, but there appears to be some kind of problem with running more than one process -- even the default of 2 hangs for me.

If anyone has experienced this and found a way to run it with multiple processes, it'd be great to hear.
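For reference, here is a minimal sketch of that workaround applied to the config from the original post (it assumes the rest of the pipeline stays exactly as written there; only num_processes changes):

# Workaround: force single-process execution; everything else is the
# same ProcessorConfig from the original snippet.
process_config = ProcessorConfig(
    reprocess=True,
    verbose=True,
    num_processes=1,  # more than one worker process hangs during partitioning
    work_dir=str(work_dir.resolve()),
    iter_delete=True,
)

This only avoids the hang; it doesn't address whatever is going wrong in the multi-process path.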

@boredjeff

I also ran into this problem, but my pipeline hangs between the chunker and the embedder. I was not able to find a solution.
