Parse My Clippings.txt from Kindle to YAML

This parser works for all the variations that I have in my My Clippings.txt file as of now.

This project started as a Golang replacement for https://github.com/icyflame/excerpts_bot. I rewrote the logic to parse the My Clippings.txt file stored on a Kindle, rather than the CSV file of highlights which can be emailed to oneself. I recently started using Calibre, and keeping the My Clippings.txt file in sync is far easier.

There are some additional advantages. I can now export clippings from books that I have not finished reading (such as a collection of novels by a single author, of which I have finished only one), from books that are in the public domain and were not bought from the Amazon store, and from articles which were converted to EPUB and imported into my Kindle using Calibre.

After converting the clippings text file to YAML, I found many more uses for it, such as generating summaries based on my notes, seeing the best quotes in a book, and generating a list of references for future reading.

Verification:

# Find the number of clippings inside a My Clippings.txt file,
# excluding bookmarks (ブックマーク is the Japanese word for "bookmark")
$ rg -c -- 'ブックマーク' ~/notes/kindle-clippings/My\ Clippings.txt
1

$ rg -c -- '- Your Bookmark' ~/notes/kindle-clippings/My\ Clippings.txt
59

$ rg -c -- '={10}' ~/notes/kindle-clippings/My\ Clippings.txt
8233

$ echo $((8233-59-1))
8173

# Use the parser to extract the clippings to a YAML file
$ go run kindle-my-clippings-parser.go -input-file-path ~/notes/kindle-clippings/My\ Clippings.txt
Read 8172 clippings from file%

$ rg -c -- '- source' parsed-clippings.yaml
8172

# Off by 1, good enough!
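
The same count can be cross-checked with a few lines of Go. This is only a minimal sketch and not the parser in this repository: it assumes that every entry ends with a line of ten equals signs (the pattern counted by rg above) and that English bookmark entries contain the string "- Your Bookmark". The function name countClippings and the sample entries are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// separator is the line that ends every entry in My Clippings.txt
// (the same pattern that `rg -c -- '={10}'` counts above).
const separator = "=========="

// countClippings is a hypothetical helper: it returns the number of
// entries in the file content that are not bookmarks.
func countClippings(content string) int {
	count := 0
	for _, entry := range strings.Split(content, separator) {
		entry = strings.TrimSpace(entry)
		if entry == "" {
			continue // whatever follows the final separator
		}
		if strings.Contains(entry, "- Your Bookmark") {
			continue // bookmarks are excluded from the count
		}
		count++
	}
	return count
}

func main() {
	// Two made-up entries: one highlight and one bookmark.
	sample := "Book A (Author)\n- Your Highlight on page 1\n\nSome text\n==========\n" +
		"Book A (Author)\n- Your Bookmark on page 2\n\n\n==========\n"
	fmt.Println(countClippings(sample))
}
```

Bookmarks written in other languages (such as the ブックマーク entry above) would need an additional check.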

Project Structure

This project consists of a set of parsing libraries inside the internal/ folder, and a set of commands inside the cmd/ folder which use these libraries. It follows the usual structure of Golang projects and works well with gopls.

Commands

All the commands in this project can be built using the following one-liner:

$ ls -1 cmd | while read -r p; do rm -f "$p"; go build -v -o "$p" "./cmd/$p/"; done

Commands related to parsing

`parse`

 $ ./parse -help
 Usage of ./parse:
	-input-file-path string
		  Input file. Supports the My Clippings.txt file from any Kindle
	-output-file-path string
		  Output file. Output will be written in the YAML format.
	-remove-clipping-limit
		  Remove clippings which indicate that the clipping text was not saved to the text file
	-remove-duplicates
		  Remove duplicate clippings of type Highlight from the generated YAML file
	-verbose
		  Enable verbose logging

This is the primary command that I use to convert a text file containing clippings into a YAML file containing all types of clippings. Two flags are worth mentioning.

Kindle’s software does not update existing entries when an existing note or highlight is edited. The text file seems to be append-only. So, if you write a note and later go back and edit it, there will be two entries in the clippings text file. The -remove-duplicates flag will remove any highlights which are from the same source (book and author) and begin at the same position, retaining only the most recently created highlight.

When you have highlighted more than 10% of a book which you bought on the Amazon ebook store, the Kindle will stop writing the content of clippings into the clippings text file. Instead, it will be replaced by the following message:

<You have reached the clipping limit for this item>

The -remove-clipping-limit flag will remove such highlights from the parsed YAML file.
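
A minimal sketch of such a filter is shown below. It assumes that the placeholder appears as the entire text of a clipping; the function name dropClippingLimit is hypothetical and the real flag may be implemented differently.

```go
package main

import (
	"fmt"
	"strings"
)

// limitMessage is the placeholder quoted above.
const limitMessage = "<You have reached the clipping limit for this item>"

// dropClippingLimit is a hypothetical helper: it returns the clipping
// texts that are not the clipping limit placeholder.
func dropClippingLimit(texts []string) []string {
	var kept []string
	for _, t := range texts {
		if strings.TrimSpace(t) == limitMessage {
			continue // the Kindle did not save the text of this highlight
		}
		kept = append(kept, t)
	}
	return kept
}

func main() {
	fmt.Println(dropClippingLimit([]string{"a real highlight", limitMessage}))
}
```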

Note that although clippings will still be shown on the Kindle device itself, they will not be exportable through the clippings text file beyond the 10% limit. See the supplement-with-bookcision command below for one option to export highlights which the Kindle software refuses to export.

`supplement-with-bookcision`

 $ ./supplement-with-bookcision -help
 Usage of ./supplement-with-bookcision:
	-input-file-path string
		  Input file. Input file should be the YAML file that is output by the cmd/parse command in this project.
	-output-file-path string
		  Output file. Output will be written in the YAML format.
	-source-filter string
		  Regular expression for filtering the source of clippings
	-supplement-file-path string
		  JSON file with all the clippings, exported using Bookcision
	-verbose
		  Enable verbose logging

If you highlight more than 10% of a book’s content, Kindle’s software stops writing the content of highlights to its text file. The text is still available to the Kindle and is shown in the “notebooks” view, but it cannot easily be exported natively. To get around this limitation, I use Bookcision, an excellent script which runs on the online eReader provided by Amazon at read.amazon.com. Open your book on read.amazon.com, open the highlights page overlay, run the Bookcision JavaScript, and download a JSON file containing the text of all the highlights from that book. What remains is to merge the downloaded JSON with the YAML file parsed from the clippings text file on the Kindle. This is the task of the ./supplement-with-bookcision command.

This command works with only one source at a time, so the appropriate -source-filter flag is a necessity. After merging highlights from the Bookcision file into the YAML input file, the output YAML file will be in the same structure as before but will have all your highlights from a book.
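
One plausible merge strategy is sketched below; the actual command may work differently. The sketch assumes that the Bookcision export has a top-level "highlights" array whose items carry "text" and "location.value" fields, and that a placeholder clipping can be matched to a Bookcision highlight by its starting location. The Clipping struct and the supplement function are hypothetical, and filtering by source is left to the caller.

```go
package main

import (
	"encoding/json"
	"fmt"
)

const limitMessage = "<You have reached the clipping limit for this item>"

// Clipping is a hypothetical, simplified record.
type Clipping struct {
	Source string
	Start  int
	Text   string
}

// bookcisionExport mirrors only the fields used here. The JSON field
// names are an assumption about the Bookcision export format.
type bookcisionExport struct {
	Highlights []struct {
		Text     string `json:"text"`
		Location struct {
			Value int `json:"value"`
		} `json:"location"`
	} `json:"highlights"`
}

// supplement replaces placeholder texts with the Bookcision text found
// at the same location. Clippings without a match are left untouched.
func supplement(clippings []Clipping, exportJSON []byte) ([]Clipping, error) {
	var export bookcisionExport
	if err := json.Unmarshal(exportJSON, &export); err != nil {
		return nil, err
	}
	byLocation := make(map[int]string)
	for _, h := range export.Highlights {
		byLocation[h.Location.Value] = h.Text
	}
	out := make([]Clipping, len(clippings))
	copy(out, clippings)
	for i, c := range out {
		if text, ok := byLocation[c.Start]; ok && c.Text == limitMessage {
			out[i].Text = text
		}
	}
	return out, nil
}

func main() {
	data := []byte(`{"highlights":[{"text":"recovered text","location":{"value":120}}]}`)
	merged, err := supplement([]Clipping{{"Book (Author)", 120, limitMessage}}, data)
	if err != nil {
		panic(err)
	}
	fmt.Println(merged[0].Text)
}
```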

Commands related to deduplication

`deduper`

 $ ./deduper -help
 Usage of ./deduper:
	-input-file-path string
		  Input file. Input file should be the YAML file that is output by the cmd/parse command in this project.
	-output-file-path string
		  Output file. Output will be written in the YAML format.
	-verbose
		  Enable verbose logging

This command isolates the “deduplication” function that is implemented by the -remove-duplicates flag of the parse command. You can use this command along with dyff, an excellent syntactic diff program for YAML, to see which highlights will be removed and whether they are truly duplicates.

`identify-duplicate-pairs`

 $ ./identify-duplicate-pairs -help
 Usage of ./identify-duplicate-pairs:
	-input-file-path string
		  Input file. Input file should be the YAML file that is output by the cmd/parse command in this project.
	-source-filter string
		  Regular expression for filtering the source of clippings
	-verbose
		  Enable verbose logging

This command generates a side-by-side view of the duplicates which were identified in a parsed clippings file. It takes a YAML file and shows any clippings which are from the same source and start at the same position. It identifies only pairs, and outputs a readable HTML file which can be viewed in any web browser. I wrote this command mainly to confirm that the logic I was using to identify duplicates was identifying true duplicates.

A sample output HTML file was generated using the following command:

$ ./identify-duplicate-pairs -input-file-path ./parsed-clippings-with-clipping-limits.yml -source-filter 'Anna' > output.html

It shows the duplicates from some of my notes on a book. In most of the quotes, I have added something to the quote after a few minutes or seconds.

This HTML file uses Bootstrap’s table-related classes.

Commands related to auto-generated summaries

When taking notes on the Kindle, I want to be able to auto-generate a summary of a book and a collection of quotes from it, which I can view inside my editor and use when writing notes or a blog post about the book. To do this, I have come up with a rudimentary specification of tags:

#quote: Quote from the book which I want to highlight in my review
#cn [1-9]+?: Name of a chapter with the level at which the chapter is nested
#cs: Summary of a chapter
#read: References in the book that I want to add to my reading list
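
The tags above could be recognized with a small amount of Go. This is a sketch under assumptions: it supposes that the tag appears at the start of a note, that the optional level after #cn is separated by whitespace, and that a missing level means level 1. The Tag type and parseTag function are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// Tag is a hypothetical representation of a recognized note tag.
type Tag struct {
	Kind  string // "quote", "cn", "cs" or "read"
	Level int    // nesting level, only set for "cn"
	Rest  string // note text that follows the tag (and level)
}

var (
	tagPattern = regexp.MustCompile(`(?s)^#(quote|cn|cs|read)\b\s*(.*)$`)
	// levelPattern follows the [1-9]+ pattern from the specification.
	levelPattern = regexp.MustCompile(`(?s)^([1-9]+)\s*(.*)$`)
)

// parseTag reports whether the note starts with one of the known tags.
func parseTag(note string) (Tag, bool) {
	m := tagPattern.FindStringSubmatch(strings.TrimSpace(note))
	if m == nil {
		return Tag{}, false
	}
	tag := Tag{Kind: m[1], Rest: m[2]}
	if tag.Kind == "cn" {
		tag.Level = 1 // assumed default when no level is given
		if lm := levelPattern.FindStringSubmatch(tag.Rest); lm != nil {
			if level, err := strconv.Atoi(lm[1]); err == nil {
				tag.Level = level
				tag.Rest = lm[2]
			}
		}
	}
	return tag, true
}

func main() {
	for _, note := range []string{"#quote", "#cn 2", "#cs The author explains 42.", "plain note"} {
		tag, ok := parseTag(note)
		fmt.Printf("%q -> %+v %v\n", note, tag, ok)
	}
}
```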

The following commands help me to do this.

`quote-extractor`

 $ ./quote-extractor -help
 Usage of ./quote-extractor:
	-input-file-path string
		  Input file. Input file should be the YAML file that is output by the cmd/parse command in this project.
	-source-filter string
		  Regular expression for filtering the source of clippings
	-verbose
		  Enable verbose logging

This command extracts every quote from the book which is marked with #quote. I use it to find the quotes I liked the most in a book. The source filter can be used to get the quotes from a single source at a time. The output of this command is in the Org mode format, a commonly used plaintext file format inside Emacs. If you are used to Markdown, you may use Pandoc to convert Org mode into Markdown (or any other format of your choice).

`summary-builder`

 $ ./summary-builder -help
 Usage of ./summary-builder:
	-input-file-path string
		  Input file. YAML file output from the parse command
	-source-filter string
		  Regular expression for filtering the source of clippings
	-verbose
		  Enable verbose logging

This command extracts a summary of the book using the highlights that I added while I was reading the book. The output is in the Org mode format, with chapter names as headings and the chapter summaries appended to each heading appropriately:

* On 42

In this chapter, the author delves into the reason that 42 is considered the answer to all the
questions in the world.

* Knee Socks

The author has great insights on why Knee Socks is the best Arctic Monkeys song of all time.
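
Output of this shape can be produced by walking the tagged notes in order. The sketch below is an assumption about how such a builder might work, not the command's implementation: the Entry type is hypothetical, and it maps the chapter level to the number of stars in the Org heading.

```go
package main

import (
	"fmt"
	"strings"
)

// Entry is a hypothetical, already-classified note.
type Entry struct {
	Kind  string // "cn" (chapter name) or "cs" (chapter summary)
	Level int    // nesting level for "cn" entries
	Text  string
}

// buildSummary renders chapter names as Org headings and appends each
// chapter summary as a paragraph under the heading that precedes it.
func buildSummary(entries []Entry) string {
	var b strings.Builder
	for _, e := range entries {
		switch e.Kind {
		case "cn":
			level := e.Level
			if level < 1 {
				level = 1 // treat a missing level as a top-level chapter
			}
			fmt.Fprintf(&b, "%s %s\n\n", strings.Repeat("*", level), e.Text)
		case "cs":
			fmt.Fprintf(&b, "%s\n\n", e.Text)
		}
	}
	return b.String()
}

func main() {
	fmt.Print(buildSummary([]Entry{
		{"cn", 1, "On 42"},
		{"cs", 0, "In this chapter, the author delves into the reason that 42 is considered the answer."},
		{"cn", 1, "Knee Socks"},
	}))
}
```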

Utilities

`email-random`

 $ ./email-random -help
 Usage of ./email-random:
	-input-file-path string
		  Input file. Input file should be the YAML file that is output by the cmd/parse command in this project.
	-verbose
		  Enable verbose logging
	-version
		  Print the build version

This is a rewrite from scratch of the excerpts_bot project, an excellent idea originally thought up by Nishant. While the original bot was written in Python and posted to Twitter, this version sends an e-mail every day and runs on a Raspberry Pi that is connected to my router at home.

Environment

This project has been tested with Golang 1.20 on Linux running on AMD64 architecture.

$ go version
go version go1.20.2 linux/amd64

With the appropriate Golang version, this project will probably work on any operating system and architecture. In case it doesn’t work on some setup, pull requests improving support are welcome!

I use Emacs as my primary editor and Org mode as my text file format for notes. So, the output of some commands in this repository is in the Org mode format. Org mode is a readable text file format. You may use Pandoc to convert Org mode into Markdown or any other format of your choice.

Binaries

A GitHub Actions workflow in this repository builds the ./cmd/email-random command, puts the output in an archive, and uploads it as a release artifact for the appropriate Git tag. Builds are currently performed for 3 architectures: amd64, arm (32 bit), and arm64. My motivation is to avoid having to download and build code on a Raspberry Pi, which is annoyingly slow compared to my other machines.

This is the sample output of a binary built for AMD64 running Linux:

$ file ./email-random-linux-amd64
./email-random-linux-amd64: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, Go BuildID=2xAAgiEbz0YaeTwVaLvY/TPl7Yke5m3o19Q8eJw4G/CoolvryF_ih8mxQTF0-9/heFRCu0IGe9Ljjo-wXRM, with debug_info, not stripped

$ ./email-random-linux-amd64 --version
refs/tags/v0.0.3-alpha 11a11b367ac315be12403463dea06f01ea234d3c
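
A version string of this shape (a Git ref followed by a commit hash) is commonly injected at build time with the Go linker's -X flag. The sketch below shows that general technique; it is an assumption about how the workflow might do it, and the variable names buildRef and buildCommit are hypothetical.

```go
package main

import (
	"flag"
	"fmt"
)

// These defaults can be overridden at build time, for example:
//
//	go build -ldflags "-X main.buildRef=$GITHUB_REF -X main.buildCommit=$GITHUB_SHA" ./cmd/email-random/
//
// GITHUB_REF and GITHUB_SHA are set by GitHub Actions; the variable
// names below are hypothetical.
var (
	buildRef    = "unknown"
	buildCommit = "unknown"
)

// versionString returns the ref and the commit separated by a space.
func versionString() string {
	return buildRef + " " + buildCommit
}

func main() {
	version := flag.Bool("version", false, "Print the build version")
	flag.Parse()
	if *version {
		fmt.Println(versionString())
		return
	}
	fmt.Println("run with -version to print the build version")
}
```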
