Clarify readme
Christophe-Regouby committed Dec 1, 2023
1 parent 135af18 commit 13dd927
Showing 2 changed files with 113 additions and 28 deletions.
50 changes: 38 additions & 12 deletions README.Rmd
@@ -19,7 +19,8 @@ knitr::opts_chunk$set(

<!-- badges: end -->

- A R implementation of LMDX ([Perot et al. 2023](https://arxiv.org/pdf/2309.10952.pdf)). You provides a pdf page or file in, and a decoding json schema, and get all entities extracted from the pdf.
An R implementation of LMDX ([Perot et al. 2023](https://arxiv.org/pdf/2309.10952.pdf)).\
You provide a pdf page (or pdf file) and a decoding schema (json), and you get all the entities extracted from the pdf.

## Installation

@@ -33,19 +34,22 @@ pak::pak("cregouby/LMDX")

## Example

- We want here to extract the R short reference card pdf file content, and turn it into a data.frame:
Here we want to extract the content of the [**R short reference card pdf**](https://cran.r-project.org/doc/contrib/Short-refcard.pdf) file and turn it into a data.frame:

- [![R reference card page 1 screenshot](inst/extdata/Short-refcard_1.jpg)](https://cran.r-project.org/doc/contrib/Short-refcard.pdf)
![R reference card page 1 screenshot](inst/extdata/Short-refcard_1.jpg)

It is a challenging document, as it is laid out in 3 tight columns packed with code and highly condensed sentences.

- ## Step 1 : Form the LLM prompt
## Step 1 : Design your taxonomy

- prompt is made with the assembly of the document text with layout information, and the taxonomy, a json representation of the entities to extract. Taxonomy can be hierarchical like in the following example:
The taxonomy here is a json representation of the entities to extract from the document. Depending on the capacity of the LLM model, the taxonomy can be hierarchical, as in the following example:

- ```{r example}
- library(LMDX)
- document <- system.file("extdata", "Short-refcard_1.pdf", package = "LMDX")
Here we can see that the document is structured in paragraphs like *Getting Help*, then *Input and output*, and so on. This is the first layer of the hierarchy, and each paragraph has a title and a description.\
Then, each paragraph contains multiple blocks, each made of an R command, a description, and possibly an example.

So this is what the taxonomy looks like:

```{r}
taxonomy <- jsonlite::minify('{
"title" : "",
"paragraph_item": [
@@ -62,10 +66,31 @@ taxonomy <- jsonlite::minify('{
}
]
}')
```
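
Since the taxonomy is plain json, it can be worth a quick well-formedness check before building the prompt. A minimal sketch using {jsonlite} (our own suggestion, not part of the LMDX API):

```{r}
# Sanity-check the minified taxonomy string (illustrative only)
jsonlite::validate(taxonomy)  # TRUE when the json is well formed
jsonlite::prettify(taxonomy)  # re-indented view of the schema
```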

## Step 2 : Forge the LLM prompt

The **prompt** is assembled from the document text with layout information, plus the taxonomy.

```{r example}
library(LMDX)
document <- system.file("extdata", "Short-refcard_1.pdf", package = "LMDX")
prompt <- lmdx_prompt(document, taxonomy, segment = "line")
```

- ## Step 2 : Query the model
Let's have a look at the resulting prompt:

```{r}
prompt[[1]] |> stringr::str_trunc(500)
prompt[[1]] |> stringr::str_trunc(500, side = "left")
```

`prompt` is a list of textual prompts, conforming to the original paper, that we want the LLM model to process.
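
For instance, a quick look at the list (an illustrative sketch, not part of the package API):

```{r}
# One prompt per document chunk: check how many there are and how big they are
length(prompt)         # number of prompts to submit to the LLM
sapply(prompt, nchar)  # size of each prompt, in characters
```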

## Step 3 : Query the model

The usual way to do this is to call an LLM model served online. We use the {chattr} package for that, as it also provides the ability to run a local model.

@@ -79,12 +104,13 @@ response <- ch_submit_job(
)
```

- ## Step 3 : Decode the output
This is not run here; the paper reports good results with the PaLMv2 model, but choose your own model and report your results!

## Step 4 : Decode the output

This consists of decoding the output and passing it to a majority-vote engine:

```{r eval=FALSE}
# response
r_reference_card_df <- majority_vote(decode_json_result(response))
```
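
As a toy illustration of the majority-vote idea (our own sketch, not the package's internal implementation), each entity value is taken as the most frequent value across the sampled generations:

```{r}
# Toy majority vote over hypothetical values decoded from 4 generations
votes <- c("help(topic)", "help(topic)", "?topic", "help(topic)")
names(which.max(table(votes)))  # "help(topic)" wins with 3 votes out of 4
```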

91 changes: 75 additions & 16 deletions README.md
@@ -7,9 +7,9 @@
<!-- badges: end -->

An R implementation of LMDX ([Perot et
- al. 2023](https://arxiv.org/pdf/2309.10952.pdf)). You provides a pdf
- page or file in, and a decoding json schema, and get all entities
- extracted from the pdf.
al. 2023](https://arxiv.org/pdf/2309.10952.pdf)).
You provide a pdf page (or pdf file) and a decoding schema (json),
and you get all the entities extracted from the pdf.

## Installation

@@ -24,24 +24,36 @@ pak::pak("cregouby/LMDX")

## Example

- We want here to extract the R short reference card pdf file content, and
- turn it into a data.frame:
Here we want to extract the content of the [**R short reference card
pdf**](https://cran.r-project.org/doc/contrib/Short-refcard.pdf) file
and turn it into a data.frame:

- [![R reference card page 1
- screenshot](inst/extdata/Short-refcard_1.jpg)](https://cran.r-project.org/doc/contrib/Short-refcard.pdf)
<figure>
<img src="inst/extdata/Short-refcard_1.jpg"
alt="R reference card page 1 screenshot" />
<figcaption aria-hidden="true">R reference card page 1
screenshot</figcaption>
</figure>

It is a challenging document, as it is laid out in 3 tight columns
packed with code and highly condensed sentences.

- ## Step 1 : Form the LLM prompt
## Step 1 : Design your taxonomy

- prompt is made with the assembly of the document text with layout
- information, and the taxonomy, a json representation of the entities to
- extract. Taxonomy can be hierarchical like in the following example:
The taxonomy here is a json representation of the entities to extract
from the document. Depending on the capacity of the LLM model, the
taxonomy can be hierarchical, as in the following example:

Here we can see that the document is structured in paragraphs like
*Getting Help*, then *Input and output*, and so on. This is the first
layer of the hierarchy, and each paragraph has a title and a
description.
Then, each paragraph contains multiple blocks, each made of an R
command, a description, and possibly an example.

So this is what the taxonomy looks like:

``` r
- library(LMDX)
- document <- system.file("extdata", "Short-refcard_1.pdf", package = "LMDX")
taxonomy <- jsonlite::minify('{
"title" : "",
"paragraph_item": [
@@ -58,16 +70,59 @@ taxonomy <- jsonlite::minify('{
}
]
}')
```

## Step 2 : Forge the LLM prompt

The **prompt** is assembled from the document text with layout
information, plus the taxonomy.

``` r
library(LMDX)
document <- system.file("extdata", "Short-refcard_1.pdf", package = "LMDX")
prompt <- lmdx_prompt(document, taxonomy, segment = "line")
```

- ## Step 2 : Query the model
Let's have a look at the resulting prompt:

``` r
prompt[[1]] |> stringr::str_trunc(500)
#> <Document>
#> R Reference Card 132|63
#> by Tom Short, EPRI PEAC, [email protected] 2004-11-07 88|87
#> Granted to the public domain. See www.Rpad.org for the source and latest 141|97
#> version. Includes material from R for Beginners by Emmanuel Paradis (with 141|106
#> permission). 37|116
#> Getting help 53|153
#> Most R functions have online documentation. 73|165
#> help(topic) documentation on topic 101|174
#> ?topic id. 42|184
#> help.search("topic") search the help system 132|193
#> apropos("topic") the names of all...

prompt[[1]] |> stringr::str_trunc(500, side = "left")
#> ...s 661|540
#> = n!/[(n − k)!k!] 576|549
#> na.omit(x) suppresses the observations with missing data (NA) (sup- 672|559
#> presses the corresponding line if x is a matrix or a data frame) 663|569
#> na.fail(x) returns an error message if x contains at least one NA 658|578
#> </Document><Task>
#> From the document, extract the text values and tags of the following entities:
#> {"title":"","paragraph_item":[{"title":"","description":[],"line_item":[{"command":"","description":"","example":[]}]}]}
#> </Task>
#> <Extraction>
```
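
Each text segment is followed by its quantized `x|y` coordinates, as in
the original paper. As an illustrative sketch (our own, not part of the
package API), the coordinates can be pulled out with a regex:

``` r
# Extract the quantized x|y coordinates from the first prompt (illustrative)
coords <- stringr::str_match_all(prompt[[1]], "(\\d+)\\|(\\d+)")[[1]]
head(coords[, 2:3])  # x and y columns, as character
```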

`prompt` is a list of textual prompts, conforming to the original
paper, that we want the LLM model to process.

## Step 3 : Query the model

The usual way to do this is to call an LLM model served online. We use
the {chattr} package for that, as it also provides the ability to run
a local model.

- We query 16 generation of the model with a temperature of 0.5.
We query **16 generations** of the model with a **temperature of 0.5**.

``` r
library(chattr)
@@ -77,11 +132,15 @@ response <- ch_submit_job(
)
```

- ## Step 3 : Decode the output
This is not run here; the paper reports good results with the PaLMv2
model, but choose your own model and report your results!

## Step 4 : Decode the output

This consists of decoding the output and passing it to a majority-vote
engine:

``` r
# response
r_reference_card_df <- majority_vote(decode_json_result(response))
```
