Clarify readme
Christophe-Regouby committed Dec 1, 2023
1 parent 135af18 commit 13dd927
Showing 2 changed files with 113 additions and 28 deletions.
50 changes: 38 additions & 12 deletions README.Rmd
@@ -19,7 +19,8 @@ knitr::opts_chunk$set(

<!-- badges: end -->

- A R implementation of LMDX ([Perot et al. 2023](https://arxiv.org/pdf/2309.10952.pdf)). You provides a pdf page or file in, and a decoding json schema, and get all entities extracted from the pdf.
An R implementation of LMDX ([Perot et al. 2023](https://arxiv.org/pdf/2309.10952.pdf)).\
You provide a pdf page (or pdf file) and a decoding schema (json), and you get all the entities extracted from the pdf.

## Installation

@@ -33,19 +34,22 @@ pak::pak("cregouby/LMDX")

## Example

- We want here to extract the R short reference card pdf file content, and turn it into a data.frame:
Here we want to extract the content of the [**R short reference card pdf**](https://cran.r-project.org/doc/contrib/Short-refcard.pdf) file and turn it into a data.frame:

- [![R reference card page 1 screenshot](inst/extdata/Short-refcard_1.jpg)](https://cran.r-project.org/doc/contrib/Short-refcard.pdf)
![R reference card page 1 screenshot](inst/extdata/Short-refcard_1.jpg)

It is a challenging document, as it is laid out in 3 tight columns packed with code and highly condensed sentences.

- ## Step 1 : Form the LLM prompt
## Step 1 : Design your taxonomy

- prompt is made with the assembly of the document text with layout information, and the taxonomy, a json representation of the entities to extract. Taxonomy can be hierarchical like in the following example:
The taxonomy here is a json representation of the entities to extract from the document. Depending on the capacity of the LLM model, the taxonomy can be hierarchical, as in the following example:

- ```{r example}
- library(LMDX)
- document <- system.file("extdata", "Short-refcard_1.pdf", package = "LMDX")
Here we can see that the document is structured in paragraphs like *Getting Help*, then *Input and output*, and so on. This is the first layer of the hierarchy, and each paragraph has a title and a description.\
Then, each paragraph contains multiple blocks, each made of an R command, a description, and possibly an example.

So this is what the taxonomy looks like:

```{r}
taxonomy <- jsonlite::minify('{
"title" : "",
"paragraph_item": [
@@ -62,10 +66,31 @@ taxonomy <- jsonlite::minify('{
}
]
}')
```
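
Since the taxonomy is plain json, it can be worth a quick well-formedness check before building the prompt. A minimal sketch using {jsonlite} (our own suggestion, not part of the LMDX API):

```{r}
# Sanity-check the minified taxonomy string (illustrative only)
jsonlite::validate(taxonomy)  # TRUE when the json is well formed
jsonlite::prettify(taxonomy)  # re-indented view of the schema
```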

## Step 2 : Forge the LLM prompt

The **prompt** is assembled from the document text with layout information, plus the taxonomy.

```{r example}
library(LMDX)
document <- system.file("extdata", "Short-refcard_1.pdf", package = "LMDX")
prompt <- lmdx_prompt(document, taxonomy, segment = "line")
```

- ## Step 2 : Query the model
Let's have a look at the resulting prompt:

```{r}
prompt[[1]] |> stringr::str_trunc(500)
prompt[[1]] |> stringr::str_trunc(500, side = "left")
```

`prompt` is a list of textual prompts, conforming to the original paper, that we want the LLM model to process.
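
For instance, a quick look at the list (an illustrative sketch, not part of the package API):

```{r}
# One prompt per document chunk: check how many there are and how big they are
length(prompt)         # number of prompts to submit to the LLM
sapply(prompt, nchar)  # size of each prompt, in characters
```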

## Step 3 : Query the model

The usual way to do this is to call an LLM model served online. We use the {chattr} package for that, as it also provides the ability to run a local model.

@@ -79,12 +104,13 @@ response <- ch_submit_job(
)
```

- ## Step 3 : Decode the output
This is not run here; the paper reports good results with the PaLMv2 model, but choose your own model and report your results!

## Step 4 : Decode the output

This consists of decoding the output and passing it to a majority-vote engine:

```{r eval=FALSE}
# response
r_reference_card_df <- majority_vote(decode_json_result(response))
```
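
As a toy illustration of the majority-vote idea (our own sketch, not the package's internal implementation), each entity value is taken as the most frequent value across the sampled generations:

```{r}
# Toy majority vote over hypothetical values decoded from 4 generations
votes <- c("help(topic)", "help(topic)", "?topic", "help(topic)")
names(which.max(table(votes)))  # "help(topic)" wins with 3 votes out of 4
```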

91 changes: 75 additions & 16 deletions README.md
@@ -7,9 +7,9 @@
<!-- badges: end -->

An R implementation of LMDX ([Perot et
- al. 2023](https://arxiv.org/pdf/2309.10952.pdf)). You provides a pdf
- page or file in, and a decoding json schema, and get all entities
- extracted from the pdf.
al. 2023](https://arxiv.org/pdf/2309.10952.pdf)).
You provide a pdf page (or pdf file) and a decoding schema (json),
and you get all the entities extracted from the pdf.

## Installation

@@ -24,24 +24,36 @@ pak::pak("cregouby/LMDX")

## Example

- We want here to extract the R short reference card pdf file content, and
- turn it into a data.frame:
Here we want to extract the content of the [**R short reference card
pdf**](https://cran.r-project.org/doc/contrib/Short-refcard.pdf) file
and turn it into a data.frame:

- [![R reference card page 1
- screenshot](inst/extdata/Short-refcard_1.jpg)](https://cran.r-project.org/doc/contrib/Short-refcard.pdf)
<figure>
<img src="inst/extdata/Short-refcard_1.jpg"
alt="R reference card page 1 screenshot" />
<figcaption aria-hidden="true">R reference card page 1
screenshot</figcaption>
</figure>

It is a challenging document, as it is laid out in 3 tight columns
packed with code and highly condensed sentences.

- ## Step 1 : Form the LLM prompt
## Step 1 : Design your taxonomy

- prompt is made with the assembly of the document text with layout
- information, and the taxonomy, a json representation of the entities to
- extract. Taxonomy can be hierarchical like in the following example:
The taxonomy here is a json representation of the entities to extract
from the document. Depending on the capacity of the LLM model, the
taxonomy can be hierarchical, as in the following example:

Here we can see that the document is structured in paragraphs like
*Getting Help*, then *Input and output*, and so on. This is the first
layer of the hierarchy, and each paragraph has a title and a
description.
Then, each paragraph contains multiple blocks, each made of an R
command, a description, and possibly an example.

So this is what the taxonomy looks like:

``` r
- library(LMDX)
- document <- system.file("extdata", "Short-refcard_1.pdf", package = "LMDX")
taxonomy <- jsonlite::minify('{
"title" : "",
"paragraph_item": [
@@ -58,16 +70,59 @@ taxonomy <- jsonlite::minify('{
}
]
}')
```

## Step 2 : Forge the LLM prompt

The **prompt** is assembled from the document text with layout
information, plus the taxonomy.

``` r
library(LMDX)
document <- system.file("extdata", "Short-refcard_1.pdf", package = "LMDX")
prompt <- lmdx_prompt(document, taxonomy, segment = "line")
```

- ## Step 2 : Query the model
Let's have a look at the resulting prompt:

``` r
prompt[[1]] |> stringr::str_trunc(500)
#> <Document>
#> R Reference Card 132|63
#> by Tom Short, EPRI PEAC, [email protected] 2004-11-07 88|87
#> Granted to the public domain. See www.Rpad.org for the source and latest 141|97
#> version. Includes material from R for Beginners by Emmanuel Paradis (with 141|106
#> permission). 37|116
#> Getting help 53|153
#> Most R functions have online documentation. 73|165
#> help(topic) documentation on topic 101|174
#> ?topic id. 42|184
#> help.search("topic") search the help system 132|193
#> apropos("topic") the names of all...

prompt[[1]] |> stringr::str_trunc(500, side = "left")
#> ...s 661|540
#> = n!/[(n − k)!k!] 576|549
#> na.omit(x) suppresses the observations with missing data (NA) (sup- 672|559
#> presses the corresponding line if x is a matrix or a data frame) 663|569
#> na.fail(x) returns an error message if x contains at least one NA 658|578
#> </Document><Task>
#> From the document, extract the text values and tags of the following entities:
#> {"title":"","paragraph_item":[{"title":"","description":[],"line_item":[{"command":"","description":"","example":[]}]}]}
#> </Task>
#> <Extraction>
```
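
Each text segment is followed by its quantized `x|y` coordinates, as in
the original paper. As an illustrative sketch (our own, not part of the
package API), the coordinates can be pulled out with a regex:

``` r
# Extract the quantized x|y coordinates from the first prompt (illustrative)
coords <- stringr::str_match_all(prompt[[1]], "(\\d+)\\|(\\d+)")[[1]]
head(coords[, 2:3])  # x and y columns, as character
```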

`prompt` is a list of textual prompts, conforming to the original
paper, that we want the LLM model to process.

## Step 3 : Query the model

The usual way to do this is to call an LLM model served online. We use
the {chattr} package for that, as it also provides the ability to run
a local model.

- We query 16 generation of the model with a temperature of 0.5.
We query **16 generations** of the model with a **temperature of 0.5**.

``` r
library(chattr)
@@ -77,11 +132,15 @@ response <- ch_submit_job(
)
```

- ## Step 3 : Decode the output
This is not run here; the paper reports good results with the PaLMv2
model, but choose your own model and report your results!

## Step 4 : Decode the output

This consists of decoding the output and passing it to a majority-vote
engine:

``` r
# response
r_reference_card_df <- majority_vote(decode_json_result(response))
```
