Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize cross reference object offset validation by avoiding nested loop #935

Conversation

madelson
Copy link
Contributor

I've been using PdfPig for text extraction on various files.

Recently, I found 1 file which was taking ~23s to open. Profiling showed that the hotspot was:

Dictionary<IndirectReference, long>.TryInsert(...)
at CrossReferenceObjectOffsetValidator.ValidateCrossReferenceOffsets(...)
at CrossReferenceParser.Parse(...)

After the changes here, the time to open the document drops to just 3s!

@@ -6,7 +6,7 @@
/// <summary>
/// Used to uniquely identify and refer to objects in the PDF file.
/// </summary>
public readonly struct IndirectReference
public readonly struct IndirectReference : IEquatable<IndirectReference>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing that dictionary insertion was the bottleneck initially made me think that Equals boxing might be the issue so I added IEquatable. However, this didn't make a noticeable difference in my particular case. Given how frequently this type is used as a dictionary key I think it is worth keeping (I notice that other PdfPig types do have this).

@@ -49,7 +49,7 @@ internal CrossReferenceTable(CrossReferenceType type, IReadOnlyDictionary<Indire
Trailer = trailer ?? throw new ArgumentNullException(nameof(trailer));
CrossReferenceOffsets = crossReferenceOffsets ?? throw new ArgumentNullException(nameof(crossReferenceOffsets));

var result = new Dictionary<IndirectReference, long>();
var result = new Dictionary<IndirectReference, long>(capacity: objectOffsets.Count);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fly-by optimization since we're just copying the dictionary.

var bruteForceOffsets = BruteForceSearcher.GetObjectLocations(bytes);
if (bruteForceOffsets.Count > 0)
{
var builderOffsets = new Dictionary<IndirectReference, long>();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this inside the if since it wasn't referenced outside

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about doing var builderOffsets = new Dictionary<IndirectReference, long>(bruteForceOffsets.Count); here too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed & updated!


foreach (var item in bruteForceOffsets)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving this loo outside the foreach loop that starts on 33 was the real win. As far as I can tell, the two loops are independent, and we're still maintaining the order such that bruteForceOffsets are added after object offsets and thus could override them in builderOffsets.

I don't see a reason why this needs to be an inner loop.

@BobLd BobLd self-requested a review November 12, 2024 06:51
@mikethea1
Copy link

Thanks for taking a look @BobLd !

@BobLd
Copy link
Collaborator

BobLd commented Nov 13, 2024

@madelson thanks a lot for the PR! I've added a comment, let me know what you think.

By any chance, is there a way for you to share the problematic pdf (not required, would just be nice for benchmarking)?

@madelson
Copy link
Contributor Author

Thanks @BobLd !

@madelson thanks a lot for the PR! I've added a comment, let me know what you think.

Agreed & addressed!

By any chance, is there a way for you to share the problematic pdf (not required, would just be nice for benchmarking)?

I looked into it and unfortunately it's not one I can share. However, any large PDF that hits this codepath probably reproduces the issue. Not sure if any of those are in the benchmark set already. This was over 300 pages of slides.

@BobLd
Copy link
Collaborator

BobLd commented Nov 13, 2024

looks great! thanks a lot for the contribution @madelson

@BobLd BobLd merged commit 8ca5399 into UglyToad:master Nov 13, 2024
@madelson
Copy link
Contributor Author

Appreciate the quick engagement @BobLd ! Is there an ETA for this appearing in a published NuGet version? Not sure what the release schedule is like.

@BobLd
Copy link
Collaborator

BobLd commented Nov 13, 2024

@madelson your changes will be made available as a nuget package overnight (pre released version), so you'll have that by tomorrow morning.

The official release is not planned yet, but should be made available before end of year

@BobLd
Copy link
Collaborator

BobLd commented Nov 14, 2024

@madelson see https://www.nuget.org/packages/PdfPig/0.1.10-alpha-20241114-8ca53

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants