addition to support multi page invoices #372

Closed
noxqs opened this issue Mar 21, 2022 · 5 comments

Comments

@noxqs

noxqs commented Mar 21, 2022

Thanks for your work on this project, you made my Monday enjoyable.
I would like to return the favour and contribute the code changes I made over the day to add a missing feature: support for multi-page invoices. The start/end for lines is only calculated once, which makes it impossible to find all line items when they are spread across several pages.

Hence in parser/lines.py I suggest making a small modification:

  # start = re.search(settings["start"], content)
  # end = re.search(settings["end"], content)

  # Collect every start/end match so the line block on each page is found.
  starts = list(re.finditer(settings["start"], content))
  ends = list(re.finditer(settings["end"], content))

  if not starts or not ends:
      logger.warning("no lines found - start %s, end %s", settings["start"], settings["end"])
      return
  assert len(starts) == len(ends), "start/end mismatch"

  # Concatenate the text between each start/end pair.
  new_content = ""
  for start, end in zip(starts, ends):
      new_content += content[start.end() : end.start()]

  content = new_content
  # content = content[start.end() : end.start()]

This worked for me.
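
To see the effect on a small input, here is a self-contained sketch of the same idea. The `extract_line_blocks` helper, the sample text and the two patterns are made up for illustration and are not part of the library.

```python
import re


def extract_line_blocks(content, start_pattern, end_pattern):
    """Concatenate the text between each start/end pair (hypothetical helper)."""
    starts = list(re.finditer(start_pattern, content))
    ends = list(re.finditer(end_pattern, content))
    if not starts or not ends:
        return ""
    if len(starts) != len(ends):
        raise ValueError("start/end mismatch")
    # Pair the n-th start with the n-th end, as in the snippet above.
    return "".join(content[s.end():e.start()] for s, e in zip(starts, ends))


# Made-up two-page invoice text: one block of items per page.
sample = (
    "Header\nItems start\nA 1.00\nB 2.00\nItems end\n"
    "Page 2\nItems start\nC 3.00\nItems end\nTotal 6.00\n"
)
blocks = extract_line_blocks(sample, r"Items start\n", r"Items end\n")
print(blocks)
```

The page header between the two blocks is dropped, and the items from both pages end up in one string that the existing line parsing can work on.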

There is some cleaning up to do when you get the lines back, not sure if this belongs in your library but I used:

        item = {}   # fields of the line item currently being assembled
        for line in res['lines']:   # res is the output of extract_data
            # keep only the fields that actually carry a value
            add = {lk: lv for lk, lv in line.items() if lv not in ['', 0.0, None]}

            if line['index'] != '':    # first item on each line to be parsed: a new item starts here
                if item != {}:
                    print(",".join(["%s:%s" % (str(k).strip(), str(v).strip()) for k, v in item.items()]))
                item = add
            else:
                # continuation row: append its values to the current item
                for kk, vv in add.items():
                    if kk in item:
                        item[kk] += " " + vv
                    else:
                        item[kk] = vv

        # print the last item
        if item != {}:
            print(",".join(["%s:%s" % (str(k).strip(), str(v).strip()) for k, v in item.items()]))

There are so many combinations possible with regex (for example, start can be cond1 | cond2).
Anyway, this is quite a useful little gem of a library.

@m3nu
Collaborator

m3nu commented Mar 21, 2022

Thanks for the improvement. Wanna make a pull request for it to make it official and make sure all tests are still working?

@noxqs
Author

noxqs commented Mar 25, 2022

Sure! Thanks

@rmilecki
Collaborator

starts = list(re.finditer(settings["start"], content))
ends = list(re.finditer(settings["end"], content))

I think we need some better logic.

Consider invoice with something like:

Lines begin
Lines begin
LINE 1
LINE 2
LINE 3
Line end
Line end

I think that your change would result in parsing all three lines twice.

We should probably make the search for starts and ends iterative: look for the next start only after the last end that was found.
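
A rough sketch of that iterative search, as one possible reading of the suggestion and not necessarily what was eventually implemented. The `find_line_blocks` name and the patterns are hypothetical; the sample is the invoice text from the example above.

```python
import re


def find_line_blocks(content, start_pattern, end_pattern):
    """Return the text of each block, scanning start -> end -> next start."""
    start_re = re.compile(start_pattern)
    end_re = re.compile(end_pattern)
    blocks = []
    pos = 0
    while True:
        start = start_re.search(content, pos)
        if start is None:
            break
        # Only look for an end after the start that was just found.
        end = end_re.search(content, start.end())
        if end is None:
            break  # unterminated block: ignored in this sketch
        blocks.append(content[start.end():end.start()])
        pos = end.end()  # resume the search after the last found end
    return blocks


sample = (
    "Lines begin\nLines begin\n"
    "LINE 1\nLINE 2\nLINE 3\n"
    "Line end\nLine end\n"
)
blocks = find_line_blocks(sample, r"Lines begin\n", r"Line end\n")
print(blocks)
```

With this approach the three lines land in a single block, so each is parsed once. The repeated "Lines begin" marker stays inside that block and would still have to be skipped by the per-line pattern.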

@bosd
Collaborator

bosd commented Sep 24, 2022

Would be interesting to see if PR #378 (comment) would accomplish this.

Technically one could call the lines parser multiple times.
It will be slow, as it goes through every line in the PDF multiple times.
But it could get the job done.

Sorry, I'm in a bit of a hurry, so for now I can't provide a more detailed answer and tests.

@rmilecki
Collaborator

Implemented
