Incorrectly identified split pages #349

Open
johnlabonte opened this issue Aug 1, 2023 · 13 comments · May be fixed by freelawproject/recap-chrome#402

@johnlabonte

johnlabonte commented Aug 1, 2023

When I navigate to https://ecf.ca5.uscourts.gov/docs1/00506701831?caseId=213057 and deselect every document except the petition, I still receive the error saying that there are multiple pages, that they cannot be split, and that the document therefore cannot be uploaded. This is incorrect, as I was only accessing one attachment from the group.

[Screenshot: 2023-07-31 22_33_58-Document]

[Screenshot: 2023-07-31 22_31_19-Download Confirmation]

"This document will not be uploaded to the RECAP Archive because the extension has detected that this page may return a combined PDF and consistently splitting these files in a proper manner is not possible for now."

@ERosendo

ERosendo commented Aug 1, 2023

Thanks for creating this issue.

The extension adds the warning because clicking the view selected button appears to take you to a page for downloading multiple PDF documents, even when only one file is selected. (The page title indicates this behavior: it includes the phrase "Multiple PDF Documents".)

Here are screenshots of the download page for a single document and the download page for multiple documents:

  • Single PDF:

[Screenshot: download page for a single PDF]

Note: This single-pdf download page can be accessed by clicking the document icon next to the number 1.

  • Multiple PDF:

[Screenshot: download page for multiple PDFs]

@mlissner
Member

mlissner commented Aug 1, 2023

This bug feels valid to me. If somebody clicks the "View Documents" button with only one item selected, we should let them upload, right? I think we just need to tweak our detector to make this work a bit better?

@ERosendo

ERosendo commented Aug 1, 2023

@mlissner You're right, we should check the number of documents on the page and let users upload files when they're trying to retrieve a single item.
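A minimal sketch of the check described above, written in Python purely for illustration (the extension itself is JavaScript). The checkbox markup and the names `CheckedBoxCounter`, `count_selected_documents`, and `should_warn` are assumptions made for this example, not the extension's actual code. Only the "Multiple PDF Documents" title phrase comes from this thread.

```python
from html.parser import HTMLParser

# Phrase reported in the title of the combined-download page (from this thread).
MULTI_PDF_TITLE = "Multiple PDF Documents"


class CheckedBoxCounter(HTMLParser):
    """Counts checked checkboxes. The markup is hypothetical, not PACER's real page."""

    def __init__(self) -> None:
        super().__init__()
        self.checked = 0

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        is_checkbox = tag == "input" and attr_map.get("type") == "checkbox"
        if is_checkbox and "checked" in attr_map:
            self.checked += 1


def count_selected_documents(html: str) -> int:
    """Return how many documents are selected on the (assumed) selection page."""
    parser = CheckedBoxCounter()
    parser.feed(html)
    return parser.checked


def should_warn(page_title: str, selected_count: int) -> bool:
    """Warn only if the page looks combined AND more than one document is selected."""
    return MULTI_PDF_TITLE in page_title and selected_count > 1
```

With this shape, a page titled "Multiple PDF Documents" that has a single selected item would not trigger the warning, which is the behavior requested in this issue.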

@gcklema

gcklema commented Aug 24, 2023

It's unclear to me why this error started showing up. Previously, multiple documents selected and at least downloaded were separate .PDF files within the one .ZIP file. So there shouldn't be an error dividing them under those circumstances. Viewing multiple documents, on the other hand, is not how I use PACER, so I don't know how PACER and/or the browser would handle that request. It may well be that selecting "view" for multiple documents results in a single concatenated .PDF file. Nevertheless, each docket entry should still have its respective docket number on it (e.g., 30-0 plus attachments to it, like 30-1 and 30-2) together perhaps also with page number for each. Parsing a single document (after scraping and obtaining the corresponding docket numbers from the web page) should be possible so that RECAP can separate one monolithically viewed document into each separate docket filing.

@mlissner
Member

> It's unclear to me why this error started showing up. Previously, multiple documents selected and at least downloaded were separate .PDF files within the one .ZIP file.

That still works, in fact. Zips work as they always have. Combined docs never have, so the only change is that now we're warning (too much) about it.

> Nevertheless, each docket entry should still have its respective docket number on it (e.g., 30-0 plus attachments to it, like 30-1 and 30-2) together perhaps also with page number for each. Parsing a single document (after scraping and obtaining the corresponding docket numbers from the web page) should be possible so that RECAP can separate one monolithically viewed document into each separate docket filing.

Thanks for the comment. Alas, this isn't as easy as it may seem:

  • Sometimes docs are filed in multiple cases and accrue multiple headings from the various cases.
  • Sometimes people have this feature turned off, so we don't get a heading.
  • These headings vary across courts (b/c of course they do).

It's one of those things we decided was too hard, but we do have a breakthrough over in #347 that should make it possible!

mlissner moved this to RECAP Backlog in @erosendo's backlog Sep 12, 2023

@gbronner

gbronner commented Dec 6, 2023

From what I've seen of the actual merged pages, if you have the index page (or even if you don't), and you have the pdf, the watermark at the top of the page will tell you which subdocument you have and which page of it you are on. So this error message not only shows up on single file downloads, but seems like it could be worked around.

@mlissner
Member

mlissner commented Dec 6, 2023

Unfortunately, the watermarks on the PDFs aren't reliable. Some users disable them, and others upload documents that they re-purposed from other cases without removing the watermark. The result is that a watermark is usually fine, but can be missing, wrong, or duplicated.

The solution is over here though: #347

@gbronner

gbronner commented Dec 6, 2023

[Screenshot: example watermark]
Here's an example of a watermark I downloaded a couple of days ago.

Is there some reason not to try to read the merged files, look for the watermarks, and split them? It seems like the benefit of getting it right exceeds the cost of trying and throwing it out.

Do we have an example of a mis-identified watermark?
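The try-then-discard idea in this comment could look roughly like the sketch below. It assumes the text of each page's watermark has already been extracted, and it assumes a header of the form "Document 30-1 Filed 08/01/23 Page 1 of 2". That format is a guess for illustration; as noted earlier in the thread, real headers vary across courts and can be missing. `HEADER_RE` and `split_by_watermark` are hypothetical names.

```python
import re
from typing import Dict, List, Optional

# Assumed header format; real PACER headers differ between courts.
HEADER_RE = re.compile(
    r"Document (?P<doc>\d+(?:-\d+)?) .*?Page (?P<page>\d+) of (?P<total>\d+)"
)


def split_by_watermark(
    page_headers: List[Optional[str]],
) -> Optional[Dict[str, List[int]]]:
    """Map each document number to the 0-based page indexes it occupies.

    Returns None (meaning: discard the attempt) when any page has no parsable
    header, when a page counter is out of sequence, or when a document ends up
    with fewer or more pages than its header claims.
    """
    groups: Dict[str, List[int]] = {}
    totals: Dict[str, int] = {}
    for index, header in enumerate(page_headers):
        match = HEADER_RE.search(header or "")
        if match is None:
            return None  # watermark missing or unreadable
        doc = match["doc"]
        pages = groups.setdefault(doc, [])
        if int(match["page"]) != len(pages) + 1:
            return None  # counter skipped or repeated
        totals[doc] = int(match["total"])
        pages.append(index)
    if any(len(pages) != totals[doc] for doc, pages in groups.items()):
        return None  # a document is incomplete
    return groups
```

A `None` result would simply fall back to the current behavior of not uploading. The sketch does not detect a watermark that is present but wrong, which is the failure mode raised in the reply above.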

@gcklema

gcklema commented Dec 6, 2023 via email

@mlissner
Member

mlissner commented Dec 6, 2023

I don't have an example, but I still think the better and easier solution is #347. Good point about selecting the later date when encountering duplicates.
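As a rough illustration of the duplicate-handling point, the sketch below picks the most recently filed stamp when one page carries more than one. The stamp format ("Document 30-1 Filed 08/01/23") and the names `STAMP_RE` and `latest_stamp` are assumptions for this example only. The emailed comment it responds to is not shown in this thread, so this is one reading of the suggestion rather than its exact content.

```python
import re
from datetime import date, datetime
from typing import Optional, Tuple

# Assumed stamp format; real headers vary across courts.
STAMP_RE = re.compile(
    r"Document (?P<doc>\d+(?:-\d+)?) Filed (?P<filed>\d{2}/\d{2}/\d{2})"
)


def latest_stamp(page_text: str) -> Optional[Tuple[str, date]]:
    """Return (document number, filing date) for the newest stamp on a page.

    Returns None when the page has no recognizable stamp at all.
    """
    stamps = [
        (datetime.strptime(m["filed"], "%m/%d/%y").date(), m["doc"])
        for m in STAMP_RE.finditer(page_text)
    ]
    if not stamps:
        return None
    filed, doc = max(stamps)  # tuples compare by date first
    return doc, filed
```

A document re-used from an older case would carry the older stamp plus the new one, so taking the later date favors the current filing.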

mlissner moved this from RECAP Backlog to Main Backlog in @erosendo's backlog Mar 4, 2024

@mlissner
Member

@ERosendo, can you please give me a size estimate for analyzing the remaining task here (I haven't read through it in a year or so)?

@ERosendo

ERosendo commented Nov 13, 2024

> can you please give me a size estimate for analyzing the remaining task here (I haven't read through it in a year or so)?

We can likely close this issue once pull request #402 is merged. The original problem was that the extension incorrectly labeled single-document download pages as "Multiple PDF" and displayed an unnecessary warning. This happened because we were relying on page elements rather than counting the documents. PR #402 addresses this by refining the logic we're using to identify these pages and avoid the warning.

@mlissner There's a separate issue about uploading combined PDFs that was discussed earlier. Are you referring to that issue when you mention the 'remaining task here'?

mlissner added this to Sprint Nov 14, 2024
mlissner moved this to In review in Sprint Nov 14, 2024

@mlissner
Member

Sounds great. I don't know what I was referring to, so I think we're OK here. I've put this on the current sprint so it can get wrapped up as part of this one and so it's not on your old board.
