Canonical types and aliases #101

Open
wants to merge 1 commit into
base: main

@jeremy jeremy commented Mar 6, 2024

  • MIME type aliases are now supported.
  • Aliases are resolved to their canonical type in all APIs.
  • Introduce `MimeType.canonicalize type, instead_of: old` to override a Tika canonical type with our own, essentially renaming the type and making the old type an alias of the new one. This is a common scenario for types like WAV, which have multiple competing historical types, RFCs that aren't actually followed, and browser support trumping them all. This allows us to give preference to browsers' actual MIME type support while keeping Tika's file extensions and magic byte matchers.
  • Warns when extending a type with preexisting extensions, parents, etc. and when extending an aliased type.
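
As a rough illustration of the renaming behaviour described above, here is a self-contained sketch. `TypeTable` and all of its methods are hypothetical stand-ins, not Marcel's internals. It only assumes that canonicalizing means moving the old type's data to the new name and recording the old name as an alias.

```ruby
# Hypothetical toy registry; not Marcel's actual data structures.
class TypeTable
  def initialize
    @extensions = Hash.new { |h, k| h[k] = [] } # type => file extensions
    @aliases = {}                               # alias => canonical type
  end

  def add(type, extensions: [])
    @extensions[type] |= extensions
  end

  # Rename `instead_of` to `type`, keeping its data and leaving the
  # old name behind as an alias.
  def canonicalize(type, instead_of:)
    @extensions[type] |= @extensions.delete(instead_of) || []
    # Anything that pointed at the old name now points at the new one.
    @aliases.each { |name, target| @aliases[name] = type if target == instead_of }
    @aliases[instead_of] = type
  end

  def canonical(type)
    @aliases.fetch(type, type)
  end

  def extensions(type)
    @extensions[canonical(type)]
  end
end

table = TypeTable.new
table.add "audio/vnd.wave", extensions: %w( wav )
table.canonicalize "audio/x-wav", instead_of: "audio/vnd.wave"

table.canonical("audio/vnd.wave")  # => "audio/x-wav"
table.extensions("audio/vnd.wave") # => ["wav"]
```

Lookups by either name land on the same entry, which is the property the description asks for in all APIs.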

@jeremy jeremy force-pushed the aliases branch 3 times, most recently from 216b5b3 to 075b285 on March 6, 2024 23:39

Marcel::MimeType.extend "audio/aac", extensions: %w( aac ), parents: "audio/x-aac"

This style of "overriding" a MIME type results in .aac files resolving to audio/aac but leaves existing magic bytes matches resolving to audio/x-aac.

Replaced with canonicalize "audio/aac", instead_of: "audio/x-aac".
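
A minimal sketch of the mismatch, using hypothetical lookup tables rather than Marcel's real ones. The byte prefix is illustrative only and is not taken from the Tika data.

```ruby
# Hypothetical tables modelling the state after
#   extend "audio/aac", extensions: %w( aac ), parents: "audio/x-aac"
# The extension moved to the new type, but the magic matcher still
# belongs to the old one.
EXTENSION_TYPES = { "aac" => "audio/aac" }
MAGIC_TYPES = { "\xFF\xF1".b => "audio/x-aac" } # illustrative prefix

def type_by_extension(filename)
  EXTENSION_TYPES[File.extname(filename).delete_prefix(".")]
end

def type_by_magic(bytes)
  MAGIC_TYPES.find { |prefix, _| bytes.b.start_with?(prefix) }&.last
end

type_by_extension("song.aac")   # => "audio/aac"
type_by_magic("\xFF\xF1rest".b) # => "audio/x-aac"
```

The same file yields two different answers depending on which lookup runs, which is what renaming the canonical type avoids.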

Marcel::MimeType.extend "application/vnd.apple.keynote", extensions: %w( key ), parents: "application/zip"
Marcel::MimeType.extend "application/vnd.apple.pages", parents: "application/zip"
Marcel::MimeType.extend "application/vnd.apple.numbers", parents: "application/zip"
Marcel::MimeType.extend "application/vnd.apple.keynote", parents: "application/zip"

Including the file extensions suggests they're being extended, but they aren't. Omit them to clarify that we're changing the parent.

Marcel::MimeType.extend "audio/ogg", extensions: %w( ogg oga ), magic: [[0, 'OggS', [[29, 'vorbis']]]]
Marcel::MimeType.canonicalize "audio/aac", instead_of: "audio/x-aac"
Marcel::MimeType.canonicalize "audio/flac", instead_of: "audio/x-flac"
Marcel::MimeType.canonicalize "audio/x-wav", instead_of: "audio/vnd.wave"

@jeremy jeremy Mar 6, 2024

Again, change the canonical type rather than introducing misleading MIME subtypes.


Marcel::MimeType.extend "image/vnd.dwg", magic: [[0, "AC10"]]
Marcel::MimeType.extend "audio/mpc", magic: [[0, "MPCKSH"]], extensions: %w( mpc )
Marcel::MimeType.extend "audio/ogg", extensions: %w( ogg oga ), magic: [[0, 'OggS', [[29, 'vorbis']]]]

Really odd one, breaking the MIME hierarchy entirely and using the same magic matcher as audio/vorbis. Leaving this for later.
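
For readers unfamiliar with the matcher syntax in these lines, the sketch below shows one way nested `[offset, bytes, children]` entries could be evaluated. `magic_match?` is a hypothetical helper written for this illustration. It handles only integer offsets and assumes a child list must match in addition to its parent entry.

```ruby
# Hypothetical evaluator for matchers shaped like [offset, bytes, children].
def magic_match?(data, matchers)
  matchers.any? do |offset, bytes, children|
    data.byteslice(offset, bytes.bytesize) == bytes &&
      (children.nil? || magic_match?(data, children))
  end
end

OGG_VORBIS = [[0, 'OggS', [[29, 'vorbis']]]]

# Synthetic buffer: "OggS", padding up to offset 29, then "vorbis".
data = ("OggS" + "\0" * 25 + "vorbis").b

magic_match?(data, OGG_VORBIS)     # => true
magic_match?("OggS".b, OGG_VORBIS) # => false
```

Because the matcher list is the only input, two types registered with an identical list cannot be told apart by content alone.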

Marcel::MimeType.extend "image/heic", magic: [[4, "ftypheic"]], extensions: %w( heic )
Marcel::MimeType.extend "image/avif", magic: [[4, "ftypavif"]]
Marcel::MimeType.extend "image/heif", magic: [[4, "ftypmif1"]]
Marcel::MimeType.extend "image/heic", magic: [[4, "ftypheic"]]

These matchers are already in the Tika data. Unclear whether these are simply defunct and can be removed.


Marcel::MimeType.extend "image/x-raw-sony", extensions: %w( arw ), parents: "image/tiff"
Marcel::MimeType.extend "image/x-raw-canon", extensions: %w( cr2 crw ), parents: "image/tiff"
Marcel::MimeType.extend "image/x-raw-canon", parents: "image/tiff"

Drop duplicate extensions.


Marcel::MimeType.extend "video/mp4", magic: [[4, "ftypisom"], [4, "ftypM4V "]], extensions: %w( mp4 m4v )

Marcel::MimeType.extend "audio/flac", magic: [[0, 'fLaC']], extensions: %w( flac ), parents: "audio/x-flac"
Marcel::MimeType.extend "audio/x-wav", magic: [[0, 'RIFF', [[8, 'WAVE']]]], extensions: %w( wav ), parents: "audio/vnd.wav"

Switched to canonicalize.

  test "detects #{content_type} given magic bytes from #{name} and aliased type #{aliased}" do
    assert_equal content_type, Marcel::MimeType.for(file, declared_type: aliased)
  end
end

Test that declared_type: alias resolves to the canonical type. This is what will allow browser uploads that use alias types to be identified correctly.
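
A small sketch of the resolution step this test relies on. The `DECLARED_ALIASES` table and `canonical_declared_type` are hypothetical names for illustration. Marcel's real alias data comes from Tika and its lookup may differ.

```ruby
# Hypothetical alias table for illustration only.
DECLARED_ALIASES = {
  "audio/vnd.wave" => "audio/x-wav",
  "audio/x-aac"    => "audio/aac"
}.freeze

# Resolve a declared type (e.g. from a browser upload) to its canonical
# name, after dropping parameters and normalizing case.
def canonical_declared_type(declared_type)
  return nil if declared_type.nil? || declared_type.strip.empty?
  type = declared_type.split(";").first.strip.downcase
  DECLARED_ALIASES.fetch(type, type)
end

canonical_declared_type("audio/x-aac")              # => "audio/aac"
canonical_declared_type("Audio/VND.WAVE; codecs=1") # => "audio/x-wav"
canonical_declared_type("image/png")                # => "image/png"
```

Types without an alias entry pass through unchanged, so the step is safe to apply to every declared type.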

  Marcel::Magic.add('canonical/child', aliases: 'alias/child', parents: 'canonical/parent')

  assert Marcel::Magic.child?('alias/child', 'alias/parent')
end

Arguments may be alias types, so check that they're resolved to canonicals before checking parentage.
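
The check can be sketched as follows. The tables and helper names here are hypothetical and only mirror the shape of the test above. The `alias/parent` mapping is assumed to be registered elsewhere in the test setup.

```ruby
# Hypothetical tables mirroring the test above.
TYPE_ALIASES = {
  "alias/child"  => "canonical/child",
  "alias/parent" => "canonical/parent"
}
TYPE_PARENTS = { "canonical/child" => ["canonical/parent"] }

def canonical(type)
  TYPE_ALIASES.fetch(type, type)
end

# True when `child` is, or descends from, `parent`; both may be aliases.
def child?(child, parent)
  child, parent = canonical(child), canonical(parent)
  child == parent ||
    TYPE_PARENTS.fetch(child, []).any? { |p| child?(p, parent) }
end

child?("alias/child", "alias/parent") # => true
child?("alias/parent", "alias/child") # => false
```

Resolving both arguments first means callers can pass any mix of alias and canonical names and get the same answer.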

# Collect the matchers already registered for this type.
existing_magic = Marcel::MAGIC.select { |magic_type, _| magic_type == type }.map(&:last)
if magic.any? && magic == existing_magic
  warn "#{type} already has magic matchers #{magic.inspect}"
end

@jeremy jeremy Mar 6, 2024

This is all pretty grotesque. Trying to limit the API footprint and not mess with the carefully tuned data tables.

@jeremy jeremy changed the title from "Respect MIME type aliases" to "Canonical types and aliases" on Mar 7, 2024