File Upload API: problem with mime type detection #8344

Closed
landreev opened this issue Jan 12, 2022 · 7 comments · Fixed by #8392

@landreev
Contributor

landreev commented Jan 12, 2022

[this issue is still a work in progress; I may need to investigate some more and add more info, but I'm creating the issue now so that I don't forget, again]

Short version:

When files are uploaded via /api/datasets/{id}/add, it appears that the mime type identification step is skipped if the file stream is passed to the API in a certain way, and the file always ends up classified as text/plain.
This is not a fatal problem when using the API on the command line via curl (it works properly when used exactly as specified in our guide). But it becomes a problem when trying to use the API from some software clients. Specifically, it appears to be impossible to upload a file via pyDataverse as anything but text/plain.

Excruciating details:

1. Uploading an image file following the example in the API guide:

curl -H X-Dataverse-key:XXX -X POST -F "file=@test.jpg" "http://localhost:8080/api/datasets/NNN/add"

This works: the file is uploaded and identified as image/jpeg.

2. But try to pipe the same input to the API instead:

cat test.jpg | curl -H X-Dataverse-key:XXX -X POST -F "file=@-" -F 'jsonData={"label":"test_stream.jpg"}' http://localhost:8080/api/datasets/NNN/add

The file still uploads and is saved as "test_stream.jpg", but it is identified as "text/plain".

Note that in the first example the mime type is not necessarily derived from the filename extension. You can rename a jpeg as test.xxx, and it will still be typed properly. Meaning, our detection code reads the file and identifies it as a jpeg; but for whatever reason this isn't done when the same file is piped in. I couldn't immediately tell why from looking at the API code.

It appears that when the API is executed from pyDataverse (via api.upload_datafile()), the POST request is also formatted (using Python requests library) in a way that makes our code skip the type detection.

More info/potential explanation:

OK, looking at the POST requests formatted by curl (via curl ... --trace-ascii /dev/stdout), it looks like the difference is straightforward enough:

case 1.:

0000: --------------------------fe5bea7b618b9c79
002c: Content-Disposition: form-data; name="file"; filename="test.xxx"
006e: Content-Type: application/octet-stream
0096: 
0098: ...

vs. case 2.:

0000: --------------------------c56c65bb0215ed20
002c: Content-Disposition: form-data; name="file"; filename="-"
0067: 
0069: ...

I.e., when standard input is used, curl encodes the multipart form part without any Content-Type: header. That somehow causes the mime type to default to text/plain, which we accept as a good-enough type (?) and either skip the type check or disregard its result. With the filename supplied, the Content-Type: is set, at least to application/octet-stream - which we recognize on the application side as a polite way of saying "type unknown", so we replace it if the file can be typed as something more specific.

The same thing must be happening in pyDataverse - no Content-Type: in the multipart file entry. While it's not possible to explicitly specify the mime type in pyDataverse/upload_datafile(), it appears to be possible to do so with the standard requests library used by pyDataverse. So it should be possible to make a PR into https://github.com/gdcc/pyDataverse that would fix this on their end (?).
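
For reference, here is a minimal sketch of what supplying an explicit Content-Type looks like with the requests library on its own (this is an illustration, not the actual pyDataverse fix; the token, dataset id and file name are placeholders matching the curl examples above):

import json
import mimetypes
import requests

api_url = "http://localhost:8080/api/datasets/NNN/add"  # placeholder dataset id
api_token = "XXX"                                        # placeholder API token
file_path = "test.jpg"

# Guess a specific type from the file name; fall back to the generic
# application/octet-stream so the server still runs its own detection.
mime_type = mimetypes.guess_type(file_path)[0] or "application/octet-stream"

with open(file_path, "rb") as f:
    # The 3-tuple (filename, fileobj, content_type) makes requests emit a
    # Content-Type: header for this part of the multipart body.
    files = {"file": (file_path, f, mime_type)}
    data = {"jsonData": json.dumps({"label": "test_stream.jpg"})}
    r = requests.post(api_url, headers={"X-Dataverse-key": api_token},
                      files=files, data=data)

print(r.status_code, r.text)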

We may still want to change something in our (Dataverse) code and see if we can easily prevent it from defaulting to text/plain when the type is not supplied explicitly in the multipart POST. (The defaulting may be happening outside of our code; but we can still make our code smarter about picking the best/most specific type possible.)
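
Roughly, the preference rule could look something like this (a sketch in Python for brevity; the actual server-side code is Java, and the exact set of "generic" types below is an assumption, not what Dataverse currently does):

# Client-supplied types that effectively mean "unknown". If the framework
# has already defaulted a missing header to text/plain, that value would
# need to be treated as unknown here as well (assumption).
UNKNOWN_TYPES = {None, "", "application/octet-stream", "text/plain"}

def pick_mime_type(supplied, detected):
    """Prefer the detected type whenever the supplied one is missing or
    generic; otherwise trust the client-supplied type."""
    if supplied in UNKNOWN_TYPES and detected:
        return detected
    return supplied or "application/octet-stream"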

@landreev
Contributor Author

I made a PR into pyDataverse - gdcc/pyDataverse#142
I don't necessarily expect them to accept it (they may have a different solution in mind, etc.).
The defaulting to "text/plain" behavior should probably be addressed on the Dataverse side as well.

landreev added a commit that referenced this issue Feb 2, 2022
…mime type, when no Content-Type: header is found in the incoming data part. (#8344)
@isdchris

Thank you. I can report that I ran into a somewhat similar issue that is affected by this bug.

We have to deal with multiple data files of up to 11GiB (high-resolution tissue images) that get marked as "image/tiff" no matter what we try, whether via the web upload or through the API with pyDataverse or curl. The workaround listed in the documentation (https://guides.dataverse.org/en/latest/api/native-api.html?highlight=mimetype#add-a-file-to-a-dataset), appending ";type=application/octet-stream", is ignored. It does work with an invalid MIME type like ";type=application/data", though.

The problem with "image/tiff" is that the thumbnail generator wants to make a thumbnail of the 11GiB file and fails time and time again, so the dataverse/dataset gets incredibly sluggish and doesn't recover. Our administrators resorted to editing the database and setting the mime type by hand to "application/octet-stream" in order to prevent the thumbnail generator from going nuts.
This helped... a lot!

@landreev
Contributor Author

@isdchris
From your description, I don't think this is the exact same bug/problem. This issue is about a very specific, narrow condition where the API does not receive ANY type definition from the client and ends up assigning "text/plain" to the file instead of trying to identify the type.

The fact that your files are identified as "image/tiff" means that the type detection library that Dataverse uses (Jhove) recognizes these files as such. There is a very simple workaround for the problem with thumbnails for large tiff files though: you should set the size limit for thumbnail generation to something sensible (via the JVM option dataverse.dataAccess.thumbnail.image.limit; see https://guides.dataverse.org/en/latest/installation/config.html).
Also, starting with Dataverse 5.9 (the currently released version), there is a pre-set default value for this limit of 3000000 bytes. Meaning, if the JVM option is not explicitly set by the admin, thumbnail generation will be skipped automatically for files larger than 3000000 bytes.

@landreev
Contributor Author

"no matter what we try with the web-upload or through the API with pydataverse, curl..."

It is interesting that you mention pyDataverse among the other ways of uploading a file. Based on what I was seeing, an upload with pyDataverse always resulted in the file being tagged "text/plain", regardless of its actual type. In other words, pyDataverse appeared to always trigger the specific bug described in this issue. So I'm surprised that you appear to suggest that you saw files recognized as tiff with pyDataverse... It is possible that I missed something there.

Regardless, the thumbnail size limit should definitely solve your immediate problem. The new default limits are mentioned in the 5.9 release notes (https://github.com/IQSS/dataverse/releases/tag/v5.9). We should have had some implied limits there all along, of course - but better late than never.

@isdchris

@landreev: Thank you for the hint about v5.9's new dataverse.dataAccess.thumbnail.image.limit. I think this will work. Though I need to add a clarification, if I may: I'm a "user" of my university's own Dataverse installation. For me this means that I just need to wait until the administrator finishes the update to v5.9 😌.

As for pyDataverse uploads: in my case all upload methods behaved in the same way. I wasn't seeing any text/plain with pyDataverse. (I tested with the "other" dataverse-uploader; it is technically meant as a GitHub Action, but its dataverse.py uses pyDataverse and can easily be modified to run as a standalone script.)

@landreev
Contributor Author

To clarify, this doesn't really need to wait until v5.9 is deployed. The administrator of your Dataverse can set this limit using the JVM option above in any earlier version.
Starting with v5.9, there is an implicit default limit for thumbnail generation. Meaning, even if the limit is not set by an admin explicitly, Dataverse will behave as if it is set to 3MB and will skip trying to generate thumbnails for files larger than that.

Sorry if my explanation was confusing. Feel free to encourage your administrator to contact us if they have any questions about this.

@isdchris

Oh, the administrator wanted to do the update for other reasons as well. And he just finished...

Sorry for all the noise. The last thing I wanted was to derail your issue. But I'll stick around.
