Optimize cross reference object offset validation by avoiding nested loop #935

madelson · 2024-11-12T03:25:16Z

I've been using PdfPig for text extraction on various files.

Recently, I found one file that was taking ~23s to open. Profiling showed that the hotspot was:

Dictionary<IndirectReference, long>.TryInsert(...)
at CrossReferenceObjectOffsetValidator.ValidateCrossReferenceOffsets(...)
at CrossReferenceParser.Parse(...)

After the changes here, the time to open the document drops to just 3s!

madelson · 2024-11-12T03:26:52Z

src/UglyToad.PdfPig.Core/IndirectReference.cs

@@ -6,7 +6,7 @@
    /// <summary>
    /// Used to uniquely identify and refer to objects in the PDF file.
    /// </summary>
-    public readonly struct IndirectReference
+    public readonly struct IndirectReference : IEquatable<IndirectReference>


Seeing that dictionary insertion was the bottleneck initially made me think that boxing in Equals might be the issue, so I added IEquatable. However, this didn't make a noticeable difference in my particular case. Given how frequently this type is used as a dictionary key, I think it is worth keeping (I notice that other PdfPig types already implement it).
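
For readers unfamiliar with the boxing concern, here is a minimal sketch of the pattern. It assumes a hypothetical key struct named ObjectKey with two fields; it is not PdfPig's actual IndirectReference. When a struct implements IEquatable<T>, the default equality comparer used by Dictionary calls the typed Equals overload, so key comparisons do not box the struct.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a struct used as a dictionary key.
// This is an illustration only, not PdfPig's IndirectReference.
public readonly struct ObjectKey : IEquatable<ObjectKey>
{
    public long ObjectNumber { get; }
    public int Generation { get; }

    public ObjectKey(long objectNumber, int generation)
    {
        ObjectNumber = objectNumber;
        Generation = generation;
    }

    // Typed comparison: EqualityComparer<ObjectKey>.Default dispatches here,
    // so dictionary lookups compare keys without boxing.
    public bool Equals(ObjectKey other) =>
        ObjectNumber == other.ObjectNumber && Generation == other.Generation;

    // Kept consistent with the typed overload for callers holding an object.
    public override bool Equals(object obj) => obj is ObjectKey other && Equals(other);

    public override int GetHashCode() => HashCode.Combine(ObjectNumber, Generation);
}
```

As noted above, this is unlikely to be the dominant cost on its own; it mainly removes a small per-comparison allocation.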

madelson · 2024-11-12T03:27:54Z

src/UglyToad.PdfPig/CrossReference/CrossReferenceTable.cs

@@ -49,7 +49,7 @@ internal CrossReferenceTable(CrossReferenceType type, IReadOnlyDictionary<Indire
            Trailer = trailer ?? throw new ArgumentNullException(nameof(trailer));
            CrossReferenceOffsets = crossReferenceOffsets ?? throw new ArgumentNullException(nameof(crossReferenceOffsets));

-            var result = new Dictionary<IndirectReference, long>();
+            var result = new Dictionary<IndirectReference, long>(capacity: objectOffsets.Count);


Fly-by optimization since we're just copying the dictionary.
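
A small sketch of why the capacity argument helps, using a hypothetical helper named CopyPresized (not part of PdfPig). Passing the known element count to the Dictionary constructor lets it allocate its internal storage once instead of growing and rehashing repeatedly while the copy runs.

```csharp
using System.Collections.Generic;

public static class DictionaryCopy
{
    // Hypothetical helper: copies a read-only dictionary into a mutable one.
    // The target is presized so it does not need to resize during the copy.
    public static Dictionary<TKey, TValue> CopyPresized<TKey, TValue>(
        IReadOnlyDictionary<TKey, TValue> source) where TKey : notnull
    {
        var result = new Dictionary<TKey, TValue>(capacity: source.Count);
        foreach (var pair in source)
        {
            result[pair.Key] = pair.Value;
        }

        return result;
    }
}
```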

madelson · 2024-11-12T03:28:11Z

src/UglyToad.PdfPig/Parser/FileStructure/CrossReferenceObjectOffsetValidator.cs

            var bruteForceOffsets = BruteForceSearcher.GetObjectLocations(bytes);
            if (bruteForceOffsets.Count > 0)
            {
+                var builderOffsets = new Dictionary<IndirectReference, long>();


Moved this inside the if since it wasn't referenced outside.

BobLd · 2024-11-13

What about doing var builderOffsets = new Dictionary<IndirectReference, long>(bruteForceOffsets.Count); here too?

madelson · 2024-11-13

Agreed & updated!

madelson · 2024-11-12T03:30:11Z

src/UglyToad.PdfPig/Parser/FileStructure/CrossReferenceObjectOffsetValidator.cs


-                    foreach (var item in bruteForceOffsets)


Moving this loop outside the foreach loop that starts on line 33 was the real win. As far as I can tell, the two loops are independent, and we still maintain the ordering: bruteForceOffsets are added after the object offsets and thus can override them in builderOffsets.

I don't see a reason why this needs to be an inner loop.
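
To illustrate the shape of the change, here is a simplified sketch. The class OffsetMerge, its method names, and the int keys are hypothetical and chosen for brevity; the real validator works with IndirectReference keys and does more per entry. The nested form rewrites every brute-force entry once per table entry, costing O(n * m) dictionary writes, while the hoisted form does O(n + m) writes. For a non-empty table, both leave the same final contents, with brute-force values winning on key collisions.

```csharp
using System.Collections.Generic;

public static class OffsetMerge
{
    // Nested form: the inner loop re-inserts all brute-force entries on every
    // outer iteration, so the work grows as O(n * m).
    public static Dictionary<int, long> MergeNested(
        IReadOnlyDictionary<int, long> tableOffsets,
        IReadOnlyDictionary<int, long> bruteForceOffsets)
    {
        var merged = new Dictionary<int, long>(bruteForceOffsets.Count);
        foreach (var entry in tableOffsets)
        {
            merged[entry.Key] = entry.Value;
            foreach (var item in bruteForceOffsets)
            {
                merged[item.Key] = item.Value;
            }
        }

        return merged;
    }

    // Hoisted form: each collection is walked once, so the work is O(n + m).
    // Brute-force entries are still written last and override table entries.
    // Note: with an empty table this form still copies the brute-force entries,
    // whereas the nested form would return an empty dictionary.
    public static Dictionary<int, long> MergeHoisted(
        IReadOnlyDictionary<int, long> tableOffsets,
        IReadOnlyDictionary<int, long> bruteForceOffsets)
    {
        var merged = new Dictionary<int, long>(bruteForceOffsets.Count);
        foreach (var entry in tableOffsets)
        {
            merged[entry.Key] = entry.Value;
        }

        foreach (var item in bruteForceOffsets)
        {
            merged[item.Key] = item.Value;
        }

        return merged;
    }
}
```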

mikethea1 · 2024-11-12T21:11:24Z

Thanks for taking a look @BobLd !

BobLd · 2024-11-13T06:57:03Z

@madelson thanks a lot for the PR! I've added a comment, let me know what you think.

By any chance, is there a way for you to share the problematic pdf (not required, would just be nice for benchmarking)?

madelson · 2024-11-13T09:43:06Z

Thanks @BobLd !

> @madelson thanks a lot for the PR! I've added a comment, let me know what you think.

Agreed & addressed!

> By any chance, is there a way for you to share the problematic pdf (not required, would just be nice for benchmarking)?

I looked into it and unfortunately it's not one I can share. However, any large PDF that hits this codepath probably reproduces the issue. Not sure if any of those are in the benchmark set already. This was over 300 pages of slides.

BobLd · 2024-11-13T19:47:14Z

looks great! thanks a lot for the contribution @madelson

madelson · 2024-11-13T20:34:09Z

Appreciate the quick engagement @BobLd ! Is there an ETA for this appearing in a published NuGet version? Not sure what the release schedule is like.

BobLd · 2024-11-13T20:38:11Z

@madelson your changes will be made available as a NuGet package overnight (pre-release version), so you'll have that by tomorrow morning.

The official release is not planned yet, but it should be available before the end of the year.

BobLd · 2024-11-14T06:11:41Z

@madelson see https://www.nuget.org/packages/PdfPig/0.1.10-alpha-20241114-8ca53

Commit 67a6401: Optimize cross reference object offset validation by avoiding nested loops

BobLd self-requested a review November 12, 2024 06:51

Commit 067d5d5: Address UglyToad#935 (comment)

BobLd approved these changes Nov 13, 2024

BobLd merged commit 8ca5399 into UglyToad:master Nov 13, 2024
