This parser works for all the variations that I have in my My Clippings.txt
file as of now.
This project started as a replacement for https://github.com/icyflame/excerpts_bot in Golang. I
rewrote the logic to parse the My Clippings.txt
file inside a Kindle, rather than using the CSV
file containing highlights which can be emailed to onself. I recently started using Calibre and
keeping the My Clippings.txt
file in sync is far easier.
Some additional advantages include the fact that I can now export clippings from books that I have not completed reading yet (such as collection of multiple novels from a single author, out of which I have completed reading only one novel), from books that are in the open domain and were not bought from the Amazon store, and from articles which were converted to Epub and imported into my Kindle using Calibre.
After converting the clippings text file to YAML, I found many more opportunities for using it such as generating summaries based on my notes, seeing the best quotes in a book, generating a list of references for future reading, and so on.
Verification:
# Find the number of clippings inside a My Clippings.txt file
# Exclude bookmarks
$ rg -c -- 'ブックマーク' ~/notes/kindle-clippings/My\ Clippings.txt
1
$ rg -c -- '- Your Bookmark' ~/notes/kindle-clippings/My\ Clippings.txt
59
$ rg -c -- '={10}' ~/notes/kindle-clippings/My\ Clippings.txt
8233
$ echo $((8233-59-1))
8173
# Use the parser to extract the clippings to a YAML file
$ go run kindle-my-clippings-parser.go -input-file-path ~/notes/kindle-clippings/My\ Clippings.txt
Read 8172 clippings from file%
$ rg -c -- '- source' parsed-clippings.yaml
8172
# Off by 1, good enough!
This project is a set of useful libraries for parsing which are present inside the internal/
folder, and a set of commands which use these libraries inside the cmd/
folder. It follows the
usual structure of Golang projects and works well with gopls
.
All the commands in this project can be built using the following oneliner:
$ ls -1 cmd | while read p; do rm -f $p; go build -v -o $p ./cmd/$p/; done
$ ./parse -help
Usage of ./parse:
-input-file-path string
Input file. Supports the My Clippings.txt file from any Kindle
-output-file-path string
Output file. Output will be written in the YAML format.
-remove-clipping-limit
Remove clippings which indicate that the clipping text was not saved to the text file
-remove-duplicates
Remove duplicate clippings of type Highlight from the generated YAML file
-verbose
Enable verbose logging
This command is the primary command that I use to convert a text file containing clippings into a YAML file containing all types of clippings. Two flags are worth mentioning.
Kindle’s software does not track existing highlight entries, when an existing note is updated. The
text file seems to be append-only. So, if you write a note, and later, go back to the note and edit
it, there will be 2 entries in the Clippings text file. The -remove-duplicates
flag will remove
any highlights which are from the same source (book and author) and begin at the same position,
retaining only the most recently created highlight.
When you have highlighted more than 10% of a book which you bought on the Amazon ebook store, the Kindle will stop writing the content of clippings into the clippings text file. Instead, it will be replaced by the following message:
<You have reached the clipping limit for this item>
The -remove-clipping-limit
flag will remove such highlights from the parsed YAML file.
Note that although clippings will still be shown on the Kindle device itself, they will not be
exportable through the clippings text file beyond the 10% limit. See the
supplement-with-bookcision
command below for one option to export highlights which the Kindle
software refuses to export.
$ ./supplement-with-bookcision -help
Usage of ./supplement-with-bookcision:
-input-file-path string
Input file. Input file should be the YAML file that is output by the cmd/parse command in this project.
-output-file-path string
Output file. Output will be written in the YAML format.
-source-filter string
Regular expression for filtering the source of clippings
-supplement-file-path string
JSON file with all the clippings, exported using Bookcision
-verbose
Enable verbose logging
If you highlight more than 10% of a book’s content, Kindle’s software stops writing the content of
highlights to its text file. This text is still available to Kindle and is shown in the “notebooks”
view, however it can not be easily exported natively. To get around this limitation, I use the tool
Bookcision. Bookcision is an excellent script which runs on the online eReader provided by Amazon at
read.amazon.com
: Open your book on read.amazon.com
, open the highlights page overlay, and run
this JavaScript, and download a JSON file which has the content of all the highlights from that
book. Once this is done, there remains the task of merging the downloaded JSON with the existing
YAML file which we have parsed from the clippings text file on the Kindle. This is the task of the
./supplement-with-bookcision
command.
This command works with only one source at a time, so the appropriate -source-filter
flag is a
necessity. After merging highlights from the Bookcision file into the YAML input file, the output
YAML file will be in the same structure as before but will have all your highlights from a book.
$ ./deduper -help
Usage of ./deduper:
-input-file-path string
Input file. Input file should be the YAML file that is output by the cmd/parse command in this project.
-output-file-path string
Output file. Output will be written in the YAML format.
-verbose
Enable verbose logging
This command isolates the “deduplication” function that is implemented by the -remove-duplicates
flag of the parse
command. You can use this command, along with the excellent YAML syntactic diff
program dyff to see what highlights will be removed, and whether they are truly duplicates.
$ ./identify-duplicate-pairs -help
Usage of ./identify-duplicate-pairs:
-input-file-path string
Input file. Input file should be the YAML file that is output by the cmd/parse command in this project.
-source-filter string
Regular expression for filtering the source of clippings
-verbose
Enable verbose logging
This command generates a side-by-side view of the duplicates which were identified in a parsed clippings file. It takes a YAML file and shows any clippings which are from the same source and start at the same position. It identifies only pairs, and outputs a readable HTML file which can be viewed in any web browser. I wrote this command mainly to confirm that the logic I was using to identify duplicates was identifying true duplicates.
The output HTML file from this command looks like this:
This HTML file was generated using the following command:
$ ./identify-duplicate-pairs -input-file-path ./parsed-clippings-with-clipping-limits.yml -source-filter 'Anna' > output.html
It shows the duplicates from some of my notes on a book. In most of the quotes, I have added something to the quote after a few minutes or seconds.
This HTML files uses Bootstrap’s table related classes.
When taking notes on the Kindle, I wanted to be able to auto-generate summaries of books and a collection of quotes from the books which I want to view inside my editor and use when I am writing notes or a blog post about the book. In order to do this, I have come up with some rudimentary specifications:
#quote
: Quote from the book which I want to highlight in my review#cn [1-9]+?
: Name of a chapter with the level at which the chapter is nested#cs
: Summary of a chapter#read
: References in the book that I want to add to my reading list
The following commands help me to do this.
$ ./quote-extractor -help
Usage of ./quote-extractor:
-input-file-path string
Input file. Input file should be the YAML file that is output by the cmd/parse command in this project.
-source-filter string
Regular expression for filtering the source of clippings
-verbose
Enable verbose logging
This command simply extracts any quote from the book which is marked with the highlight #quote
. I
use this in order to find the quotes I liked the most in a book. The source filter can be used if
you want to get the quotes from only a single source at a time. Note that the output of this
command is in the Org mode format. Org mode is a commonly used plaintext file format inside
Emacs. If you are used to Markdown, then you may use Pandoc to convert Org mode into Markdown (or
any other format of your choice.)
$ ./summary-builder -help
Usage of ./summary-builder:
-input-file-path string
Input file. YAML file output from the parse command
-source-filter string
Regular expression for filtering the source of clippings
-verbose
Enable verbose logging
This command extracts a summary of the book using the highlights that I added while I was reading the book. The output is in the Org mode format, with chapter names as headings and the chapter summaries appended to each heading appropriately:
* On 42
In this chapter, the author delves into the reason that 42 is considered the answer to all the
questions in the world.
* Knee Socks
The author has great insights on why Knee Socks is the best Arctic Monkeys song of all time.
$ ./email-random -help
Usage of ./email-random:
-input-file-path string
Input file. Input file should be the YAML file that is output by the cmd/parse command in this project.
-verbose
Enable verbose logging
-version
Print the build version
This is a rewrite from scratch of the excerpts_bot project; an excellent idea originally though up by Nishant. While the original bot was written in Python and posted to Twitter, this version sends an e-mail everyday and is running on a Raspberry Pi that is connected to my router at home.
This project has been tested with Golang 1.20 on Linux running on AMD64 architecture.
$ go version
go version go1.20.2 linux/amd64
With the appropriate Golang version, this project will probably work on any operating system and architecture. In case it doesn’t work on some setup, pull requests improving support are welcome!
I use Emacs and Org mode as my primary editor and text file format for notes. So, the output of some commands is in this repository is in the Org mode format. Org mode is a readable text file format. You may use Pandoc to convert Org mode into Markdown or any other format of your choice.
There is a GitHub actions workflow set up in this repository which builds the ./cmd/email-random
command, puts the output in an archive, and uploads it as a release artifact to the appropriate Git
tag. The builds are performed for 3 architectures right now: amd64, arm (32 bit), and arm64. My
motivation for this is to improve my personal setup to avoid having to download and build code on a
Raspberry Pi which is annoyingly slow, compared to my other machines.
This is the sample output of a binary built for AMD64 running Linux:
$ file ./email-random-linux-amd64
./email-random-linux-amd64: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, Go BuildID=2xAAgiEbz0YaeTwVaLvY/TPl7Yke5m3o19Q8eJw4G/CoolvryF_ih8mxQTF0-9/heFRCu0IGe9Ljjo-wXRM, with debug_info, not stripped
$ ./email-random-linux-amd64 --version
refs/tags/v0.0.3-alpha 11a11b367ac315be12403463dea06f01ea234d3c