File Upload API: problem with mime type detection #8344
Comments
I made a PR into pyDataverse - gdcc/pyDataverse#142 |
…mime type, when no Content-Type: header is found in the incoming data part. (#8344)
Thank you. I can report that I ran into a similar issue that is affected by this bug. We have to deal with multiple data files of up to 11 GiB (high-resolution tissue images) that get marked as "image/tiff" no matter what we try, whether through the web upload or through the API with pyDataverse or curl. The workaround listed in the documentation (https://guides.dataverse.org/en/latest/api/native-api.html?highlight=mimetype#add-a-file-to-a-dataset), appending ";type=application/octet-stream", is ignored, though it does work with an invalid MIME type such as ";type=application/data". The problem with "image/tiff" is that the thumbnail generator tries to make a thumbnail of the 11 GiB file and fails time and time again, so the dataverse/dataset becomes incredibly sluggish and doesn't recover. Our administrators resorted to editing the database and setting the MIME type by hand to "application/octet-stream" to keep the thumbnail generator from going nuts. |
@isdchris The fact that your files are identified as "image/tiff" means that the type detection library that Dataverse uses (jhove) recognizes these files as such. There is a very simple workaround for the problem with thumbnails for large tiff files though: You should set the size limit for thumbnail generation to something sensible (via the JVM option |
It is interesting that you mention pyDataverse among the other ways of uploading a file. Based on what I was seeing, an upload with pyDataverse always resulted in the file being tagged "text/plain" regardless of its actual type. In other words, pyDataverse appeared to always trigger the specific bug described in this issue. So I'm surprised that you appear to suggest that you saw files recognized as tiff with pyDataverse... It is possible that I missed something there. Regardless, a thumbnail size limit should definitely solve your immediate problem. The new default limits are mentioned in the 5.9 release notes (https://github.com/IQSS/dataverse/releases/tag/v5.9). We should have had some implied limits there all along of course, but better late than never. |
@landreev: Thank you for the hint about v5.9's new As for pydataverse uploads. In my case all upload methods behaved in the same way. I wasn't seeing any |
To clarify, this doesn't really need to wait until v5.9 is deployed. The administrator of your Dataverse can set this limit using the JVM option above in any earlier version. Sorry if my explanation was confusing. Feel free to encourage your administrator to contact us if they have any questions about this. |
Oh, the administrator wanted to do the update for other reasons as well. And he just finished... Sorry for all the noise. The last thing I wanted was to derail your issue. But I'll stick around. |
[this issue is still work in progress; I may need to investigate some more/add more info; but going to create an issue so that I don't forget, again]
Short version:
When files are uploaded via /api/datasets/{id}/add, it appears that the mime type identification step is skipped if the file stream is passed to the API in a certain way, with the file always ending up classified as text/plain.
This is not a fatal problem when using the API on the command line via curl (it works properly when used exactly as specified in our guide). But it becomes a problem when trying to use it from some software clients. Specifically, it appears to be impossible to upload a file via pyDataverse as anything but text/plain.
Excruciating details:
1. Uploading an image file following the example in the API guide:
curl -H X-Dataverse-key:XXX -X POST -F "file=@test.jpg" "http://localhost:8080/api/datasets/NNN/add"
this works, the file is uploaded and identified as image/jpeg.
2. But try to pipe the same input to the API instead:
cat test.jpg | curl -H X-Dataverse-key:XXX -X POST -F "file=@-" -F 'jsonData={"label":"test_stream.jpg"}' http://localhost:8080/api/datasets/NNN/add
the file still uploads, saved as "test_stream.jpg", but identified as "text/plain".
Note that in the first example the mime type is not necessarily derived from the filename extension. You can rename a jpeg as test.xxx, and it will still be typed properly. Meaning, our detection code reads the file and identifies it as a jpeg; but for whatever reason this isn't done when the same file is piped in. I couldn't immediately tell why from looking at the API code.
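To illustrate what typing a file by its content rather than its name means, here is a minimal sketch that checks a few well-known leading byte signatures. This is not Dataverse's actual detection code; the function name sniff_mime and the signature table are hypothetical and exist only for this example.

```python
from typing import Optional

# A few well-known leading byte signatures ("magic numbers").
# This table is illustrative only, not the list any real detector uses.
SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
]


def sniff_mime(data: bytes) -> Optional[str]:
    """Return a mime type guessed from the first bytes, or None if unknown.

    The filename is never consulted, so renaming test.jpg to test.xxx
    does not change the result.
    """
    for magic, mime in SIGNATURES:
        if data.startswith(magic):
            return mime
    return None
```

Because only the bytes are inspected, the same result would be expected whether the content arrives from a named file or from a pipe, which is why the text/plain outcome in the second example points at a step before or around detection rather than at detection itself.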
It appears that when the API is executed from pyDataverse (via api.upload_datafile()), the POST request is also formatted (using the Python requests library) in a way that makes our code skip the type detection.
More info/potential explanation:
OK, looking at the POST requests formatted by curl (via curl ... --trace-ascii /dev/stdout), it looks like the difference is straightforward enough:
case 1.:
vs. case 2.:
i.e., when the standard input is used, curl encodes the multipart form without any Content-Type: header, which somehow causes the mime type to default to text/plain, which we accept as a good enough type (?) and either skip the type check or disregard its result. With the filename supplied, the Content-Type: is set, at least to application/octet-stream, which we recognize on the application side as a nice way to say "type unknown", so we replace it if the file can be typed as something more specific.
The same thing must be happening in pyDataverse: no Content-Type: in the multipart file entry. While it's not possible to explicitly specify the mime type in pyDataverse/upload_datafile(), it appears to be possible to do so with the standard requests library used by pyDataverse. So it should be possible to make a PR into https://github.com/gdcc/pyDataverse that would fix this on their end (?).
We may still want to change something in our (Dataverse) code and see if we can easily prevent it from defaulting to text/plain when the type is not supplied explicitly in the multipart POST. (The defaulting may be happening outside of our code, but we can still make our code smarter about picking the best/most specific type possible.)
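The explanation above can be sketched in a few lines of Python. The block builds one multipart/form-data file part with or without a Content-Type: header, reads that header back, and applies one possible rule for choosing the final type. All three function names and the rule itself are assumptions made for illustration; they do not reproduce curl's output, pyDataverse's request, or the Dataverse server code.

```python
from typing import Optional

# Types treated as a polite way of saying "type unknown" (assumption).
GENERIC_TYPES = {"application/octet-stream"}


def encode_file_part(boundary: str, field: str, filename: str,
                     payload: bytes,
                     content_type: Optional[str] = None) -> bytes:
    """Encode a single multipart/form-data file part.

    When content_type is None, no Content-Type: header is written,
    which is the situation described for the piped upload.
    """
    headers = [
        f"--{boundary}",
        f'Content-Disposition: form-data; name="{field}"; '
        f'filename="{filename}"',
    ]
    if content_type is not None:
        headers.append(f"Content-Type: {content_type}")
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + payload + b"\r\n"


def part_content_type(part: bytes) -> Optional[str]:
    """Return the part's Content-Type header value, or None if absent."""
    head, _, _ = part.partition(b"\r\n\r\n")
    for line in head.split(b"\r\n"):
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"content-type":
            return value.strip().decode("utf-8")
    return None


def choose_type(supplied: Optional[str], detected: Optional[str]) -> str:
    """Pick the most specific type available (hypothetical rule).

    A missing or generic supplied type defers to content detection
    instead of silently becoming text/plain.
    """
    if supplied is None or supplied in GENERIC_TYPES:
        return detected or "application/octet-stream"
    return supplied
```

On the client side, the requests library accepts a (filename, fileobj, content_type) tuple as a value in its files argument, which is one way a caller could set the part's Content-Type: explicitly. Whether and how pyDataverse should expose that is left to the PR mentioned above.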