Incorrectly identified split pages #349

johnlabonte · 2023-08-01T05:33:20Z

When navigating to https://ecf.ca5.uscourts.gov/docs1/00506701831?caseId=213057 and deselecting everything except the petition, I still receive the error that there are multiple pages that cannot be split and therefore cannot be uploaded. This is incorrect, as I was only accessing one attachment from the group.

"This document will not be uploaded to the RECAP Archive because the extension has detected that this page may return a combined PDF and consistently splitting these files in a proper manner is not possible for now."

ERosendo · 2023-08-01T11:42:46Z

Thanks for creating this issue.

The extension adds the warning because clicking the view selected button appears to take you to a page for downloading multiple PDF documents, even when only one file is selected. The page title indicates this behavior: it includes the phrase "Multiple PDF Documents".

Here are screenshots of the download page for a single document and the download page for multiple documents:

Single PDF:

Note: This single-pdf download page can be accessed by clicking the document icon next to the number 1.

Multiple PDF:

mlissner · 2023-08-01T12:33:47Z

This bug feels valid to me. If somebody clicks the "View Documents" button with only one item selected, we should let them upload, right? I think we just need to tweak our detector to make this work a bit better?

ERosendo · 2023-08-01T12:40:07Z

@mlissner You're right, we should check the number of documents on the page and let users upload files when they're trying to retrieve a single item.

gcklema · 2023-08-24T18:04:01Z

It's unclear to me why this error started showing up. Previously, multiple documents selected and at least downloaded were separate .PDF files within the one .ZIP file. So there shouldn't be an error dividing them under those circumstances. Viewing multiple documents, on the other hand, I don't know how PACER and/or the browser would handle that request because that's not how I use PACER. It might well be that selecting "view" multiple results in a singular concatenated .PDF file. Nevertheless, each docket entry should still have its respective docket number on it (e.g., 30-0 plus attachments to it, like 30-1 and 30-2) together perhaps also with page number for each. Parsing a single document (after scraping and obtaining the corresponding docket numbers from the web page) should be possible so that RECAP can separate one monolithically viewed document into each separate docket filing.

mlissner · 2023-08-24T18:49:50Z

"It's unclear to me why this error started showing up. Previously, multiple documents selected and at least downloaded were separate .PDF files within the one .ZIP file."

That still works, in fact. Zips work as they always have. Combined docs never have, so the only change is that now we're warning (too much) about it.

"Nevertheless, each docket entry should still have its respective docket number on it (e.g., 30-0 plus attachments to it, like 30-1 and 30-2) together perhaps also with page number for each. Parsing a single document (after scraping and obtaining the corresponding docket numbers from the web page) should be possible so that RECAP can separate one monolithically viewed document into each separate docket filing."

Thanks for the comment. Alas, this isn't as easy as it may seem:

Sometimes docs are filed in multiple cases and accrue multiple headings from the various cases.
Sometimes people have this feature turned off, so we don't get a heading.
These headings vary across courts (b/c of course they do).

It's one of those things we decided was too hard, but we do have a breakthrough over in #347 that should make it possible!

gbronner · 2023-12-06T03:44:27Z

From what I've seen of the actual merged pages, if you have the index page (or even if you don't), and you have the pdf, the watermark at the top of the page will tell you which subdocument you have and which page of it you are on. So this error message not only shows up on single file downloads, but seems like it could be worked around.

mlissner · 2023-12-06T18:47:57Z

Unfortunately, the watermarks on the PDFs aren't reliable. Some users disable them, and others upload documents that they re-purposed from other cases without removing the watermark. The result is that a watermark is usually fine, but can be missing, wrong, or duplicated.

The solution is over here though: #347

gbronner · 2023-12-06T19:10:47Z

Here's an example of a watermark I downloaded a couple of days ago.

Is there some reason not to try to read the merged files, look for the watermarks, and split them? It seems like the benefit of getting it right exceeds the cost of trying and throwing the result out.

Do we have an example of a mis-identified watermark?

gcklema · 2023-12-06T19:22:12Z

I've seen filings without a clerk/docket stamp. I have also seen filings with two such stamps--but it seems to me that the later-in-time one should govern as a rule since it's seemingly impossible to have a future stamp on a current filing, but not unheard of to re-file old, previously-stamped documents.


mlissner · 2023-12-06T20:13:54Z

I don't have an example, but I still think the better and easier solution is #347. Good point about selecting the later date when encountering duplicates.
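
For illustration only, the "later stamp governs" rule discussed above could be sketched roughly as follows. The stamp format in the regular expression (`Document 30-1 Filed 12/04/23 Page 2 of 10`) is an assumption made for this sketch; as noted earlier in the thread, headings vary across courts and can be missing entirely. `Stamp`, `STAMP_RE`, and `latest_stamp` are hypothetical names, not part of the extension.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# ASSUMPTION: a stamp shaped like "Document 30-1 Filed 12/04/23 Page 2 of 10".
# Real headings differ between courts, so this pattern is illustrative only.
STAMP_RE = re.compile(
    r"Document\s+(?P<doc>\d+(?:-\d+)?)\s+"
    r"Filed\s+(?P<filed>\d{2}/\d{2}/\d{2})\s+"
    r"Page\s+(?P<page>\d+)\s+of\s+(?P<total>\d+)"
)


class Stamp(NamedTuple):
    document: str    # docket number, e.g. "30-1"
    filed: datetime  # filing date parsed from the stamp
    page: int        # page number within the subdocument
    total: int       # total pages in the subdocument


def latest_stamp(page_text: str) -> Optional[Stamp]:
    """Return the most recently filed stamp on a page, or None if there is none."""
    stamps = [
        Stamp(
            m["doc"],
            datetime.strptime(m["filed"], "%m/%d/%y"),
            int(m["page"]),
            int(m["total"]),
        )
        for m in STAMP_RE.finditer(page_text)
    ]
    if not stamps:
        return None  # stamp disabled or unreadable: the page cannot be attributed
    return max(stamps, key=lambda s: s.filed)
```

A page carrying an old stamp from a re-filed document plus a newer one would resolve to the newer stamp, while a page with no stamp returns `None`. That last case is the one that makes this approach unreliable on its own.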

mlissner · 2024-11-13T06:44:42Z

@ERosendo, can you please give me a size estimate for analyzing the remaining task here (I haven't read through it in a year or so)?

ERosendo · 2024-11-13T11:51:06Z

"can you please give me a size estimate for analyzing the remaining task here (I haven't read through it in a year or so)?"

We can likely close this issue once pull request #402 is merged. The original problem was that the extension incorrectly labeled single-document download pages as "Multiple PDF" and displayed an unnecessary warning. This happened because we were relying on page elements rather than counting the documents. PR #402 addresses this by refining the logic we're using to identify these pages and avoid the warning.
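
As a rough sketch of the difference described above (not the actual code in PR #402), compare a check based on the page title with one based on the number of documents being retrieved. The function names are hypothetical, and how the document count is obtained from the page is left out because it depends on markup not shown in this thread.

```python
# Phrase that, per the discussion above, appears in the title of the
# download page even when only one document is selected.
MULTI_DOC_TITLE_MARKER = "Multiple PDF Documents"


def warns_by_title(page_title: str) -> bool:
    """Old-style check (hypothetical): warn whenever the title mentions multiple PDFs."""
    return MULTI_DOC_TITLE_MARKER in page_title


def warns_by_count(document_count: int) -> bool:
    """Count-based check (hypothetical): warn only when more than one document is requested."""
    return document_count > 1


# The case reported in this issue: the "multiple documents" page with one item selected.
title = "Multiple PDF Documents"
print(warns_by_title(title))  # the title-based check warns even for a single document
print(warns_by_count(1))      # the count-based check lets the upload proceed
```

The title-based check cannot distinguish the two cases because the title is the same either way, which is why counting documents is the more reliable signal.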

@mlissner There's a separate issue about uploading combined PDFs that was discussed earlier. Are you referring to that issue when you mention the 'remaining task here'?

mlissner · 2024-11-14T00:53:24Z

Sounds great. I don't know what I was referring to, so I think we're OK here. I've put this on the current sprint so it can get wrapped up as part of this one and so it's not on your old board.

mlissner added this to @erosendo's backlog Aug 1, 2023

mlissner moved this to RECAP Backlog in @erosendo's backlog Sep 12, 2023

mlissner moved this from RECAP Backlog to Main Backlog in @erosendo's backlog Mar 4, 2024

ERosendo linked a pull request Oct 1, 2024 that will close this issue

feat(pacer): Refine multi-document page handling logic freelawproject/recap-chrome#402

Open

mlissner removed this from @erosendo's backlog Nov 14, 2024

mlissner added this to Sprint Nov 14, 2024

mlissner moved this to In review in Sprint Nov 14, 2024

mlissner assigned ERosendo Nov 14, 2024
