Google Docs/Sheets/Slides not working in the V2 SDK Google Drive source connector #74

Open
ninalopatina opened this issue Jul 11, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@ninalopatina

Describe the bug
Google Docs/Sheets/Slides not working in the V2 SDK Google Drive source connector

To Reproduce

Ingesting from Google Drive, partitioning via the Unstructured API, embedding via OpenAI, and writing to AstraDB:

runner = GoogleDriveRunner(
    processor_config=ProcessorConfig(
        verbose=True,
        output_dir=os.environ['GOOGLE_DRIVE_OUTPUT'],
        num_processes=2,
    ),
    read_config=ReadConfig(),
    partition_config=PartitionConfig(
        partition_by_api=True,
        api_key=os.getenv("UNSTRUCTURED_API_KEY")
    ),
    connector_config=SimpleGoogleDriveConfig(
        access_config=GoogleDriveAccessConfig(
            service_account_key=os.getenv("GOOGLE_DRIVE_ACCOUNT_KEY")
        ),
        recursive=True,
        drive_id=os.getenv("GOOGLE_DRIVE_FOLDER_ID"),
    ),
    chunking_config=ChunkingConfig(chunk_elements=True),
    embedding_config=EmbeddingConfig(
        provider="langchain-openai",
        api_key=os.getenv("OPENAI_API_KEY"),
    ),
    writer=get_writer(),
    writer_kwargs={},
)

Expected behavior
As in V1, I expect the file to be parsed.

Screenshots

KeyError                                  Traceback (most recent call last)
in <cell line: 1>()
     33     stager_config=WeaviateUploadStagerConfig(),
     34     uploader_config=WeaviateUploaderConfig(),
---> 35 ).run()

7 frames
/usr/local/lib/python3.10/dist-packages/unstructured/ingest/v2/processes/connectors/google_drive.py in map_file_data(f)
    131     file_id = f["id"]
    132     filename = f.pop("name")
--> 133     url = f.pop("webContentLink")
    134     version = f.pop("version", None)
    135     permissions = f.pop("permissions", None)

KeyError: 'webContentLink'
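
A note on the traceback: in the Google Drive API, `webContentLink` is only returned for files with binary content stored in Drive, so Google-native Docs, Sheets, and Slides do not carry that key, and an unconditional `f.pop("webContentLink")` raises for them. The following is only a minimal sketch of a more tolerant lookup, not the connector's actual fix. The helper name `extract_url` and the fallback to `webViewLink` are assumptions made for illustration, and the fallback only helps if that field was requested in the file listing.

```python
from typing import Any, Optional


def extract_url(f: dict[str, Any]) -> Optional[str]:
    """Hypothetical helper: return a link for a Drive file record without raising.

    Prefers "webContentLink" (present only for binary files) and falls back
    to "webViewLink" when it is available. Returns None if neither key exists.
    """
    # pop() with a default avoids the KeyError seen in the traceback
    url = f.pop("webContentLink", None)
    if url is None:
        # Assumption: the listing request included the webViewLink field
        url = f.get("webViewLink")
    return url
```

A missing link does not by itself make the file downloadable: Google-native files still have to be exported to a concrete format rather than downloaded directly.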

Environment Info
This is not specific to my environment; it also happens for anyone else who tries this snippet.

@ninalopatina ninalopatina added the bug Something isn't working label Jul 11, 2024
@adrian-ciz-intive

The same thing happens to me when trying to parse a Word document from Google Drive, about 30 pages long, with tables, images, a table of contents, a header, a footer, etc.

@MthwRobinson MthwRobinson transferred this issue from Unstructured-IO/unstructured Aug 26, 2024
SantoshKumarRavi commented Nov 19, 2024

Is anyone else getting this issue with Google Drive V2 ingestion?

2024-11-19 22:12:47,003 SpawnProcess-18 ERROR    
C:\Users\SANTHOSH\.cache\unstructured\ingest\pipeline\index\34b4026053f1.json: [download] 'GoogleDriveDownloader' object has no attribute 'meta'

@micmarty-deepsense
Contributor

Thanks for reporting that @SantoshKumarRavi! It's a bug (see this line). We need to prepare a fix for that.