addition to support multi page invoices #372

noxqs · 2022-03-21T10:14:41Z

Thanks for your work on this project, you made my Monday enjoyable.
I would like to return the favour and contribute the code changes I made over the day to add a missing feature: support for multi-page invoices. The start and end of the lines block are only calculated once, which makes it impossible to find all line items when they are spread over several pages.

Hence in parser/lines.py I suggest making a small modification:

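  # Before: only the first start/end match was used.
  # start = re.search(settings["start"], content)
  # end = re.search(settings["end"], content)

  # Collect every start and end marker instead of only the first pair.
  starts = list(re.finditer(settings["start"], content))
  ends = list(re.finditer(settings["end"], content))

  assert len(starts) == len(ends), "start/end mismatch"
  if not starts or not ends:
      # Log the patterns themselves; start and end are no longer defined here.
      logger.warning("no lines found - start %s, end %s", settings["start"], settings["end"])
      return

  # Concatenate the text between each start/end pair.
  new_content = ""
  for start, end in zip(starts, ends):
      new_content += content[start.end() : end.start()]

  content = new_content
  # content = content[start.end() : end.start()]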

This worked for me.

There is some cleaning up to do when you get the lines back. I'm not sure whether this belongs in your library, but I used:

        item = {}   # accumulates the fields of the current line item
        for line in res['lines']:   # res is the output of extract_data
            # keep only the fields that actually carry a value
            add = {lk: lv for lk, lv in line.items() if lv not in ['', 0.0, None]}

            if line['index'] != '':    # first item on each line to be parsed
                if item != {}:
                    print(",".join(["%s:%s" % (str(k).strip(), str(v).strip()) for k, v in item.items()]))
                item = add
            else:
                # continuation row: append its values to the current item
                for kk, vv in add.items():
                    if kk in item:
                        item[kk] = "%s %s" % (item[kk], vv)
                    else:
                        item[kk] = vv

        # print the last item, which the loop never flushes
        if item != {}:
            print(",".join(["%s:%s" % (str(k).strip(), str(v).strip()) for k, v in item.items()]))

There are so many combinations possible with regex (for example, start can be cond1 | cond2).
Anyway, this is quite a useful little gem of a library.
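To illustrate the "cond1 | cond2" idea, here is a minimal sketch using only the standard `re` module. The marker texts and the sample content are made up for the example; a real template would define its own start pattern.

```python
import re

# Hypothetical start pattern: two alternative table headers joined with "|".
start_pattern = r"Item\s+Description|Pos\.\s+Article"

# Made-up content with one header of each kind.
content = "Item  Description\n1 Widget\nPos. Article\n2 Gadget\n"

# finditer returns one match per header, whichever alternative it used.
matches = [m.group(0) for m in re.finditer(start_pattern, content)]
print(matches)
```

Both headers are found by the same pattern, so a single start setting can cover pages that use different header wording.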

m3nu · 2022-03-21T11:00:08Z

Thanks for the improvement. Wanna make a pull request for it to make it official and make sure all tests are still working?

noxqs · 2022-03-25T06:32:55Z

Sure! Thanks

rmilecki · 2022-09-24T17:07:03Z

starts = list(re.finditer(settings["start"], content))
ends = list(re.finditer(settings["end"], content))

I think we need some better logic.

Consider an invoice with something like:

Lines begin
Lines begin
LINE 1
LINE 2
LINE 3
Line end
Line end

I think that your change would result in parsing all three lines twice.

We should probably make looking for starts and ends iterative. You should probably look for the next start only after the last found end.
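A minimal sketch of that iterative search, using only the standard `re` module. The function name `find_blocks` and the sample strings are assumptions made for illustration; this is not the code that was eventually merged.

```python
import re


def find_blocks(content, start_pattern, end_pattern):
    """Return the text between each start/end pair, scanning left to right.

    Hypothetical helper: each new start is searched for only after the
    previously found end, so overlapping markers are not paired twice.
    """
    start_re = re.compile(start_pattern)
    end_re = re.compile(end_pattern)
    blocks = []
    pos = 0
    while pos <= len(content):
        start = start_re.search(content, pos)
        if start is None:
            break
        # The end must come after the start that was just found.
        end = end_re.search(content, start.end())
        if end is None:
            break
        blocks.append(content[start.end():end.start()])
        # Always move forward, even if a pattern matched an empty string.
        pos = max(end.end(), pos + 1)
    return blocks


# The duplicated-marker example from above: the three lines are captured once.
sample = "Lines begin\nLines begin\nLINE 1\nLINE 2\nLINE 3\nLine end\nLine end\n"
blocks = find_blocks(sample, r"Lines begin\n", r"Line end")
print(len(blocks), blocks[0].count("LINE 1"))
```

With this approach the second "Lines begin" ends up inside the captured block as ordinary text rather than opening a second block, so each line item is seen only once.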

bosd · 2022-09-24T17:59:51Z

Would be interesting to see if PR #378 (comment) would accomplish this.

Technically one could call the lines parser multiple times.
It will be slow, as it goes through every line in the PDF multiple times.
But it could get the job done.

Sorry, I'm in a bit of a hurry, so for now I can't provide a more detailed answer and tests.

rmilecki · 2022-10-23T21:08:44Z

Implemented

rmilecki mentioned this issue Oct 16, 2022

parsers: lines: support multiple occurrence of blocks to parse #423

Merged

rmilecki closed this as completed Oct 23, 2022
