add maintainers readme

namehash · Nov 18, 2024 · 769338c
1 parent 133c5b8
Showing 2 changed files with 67 additions and 0 deletions.
diff --git a/DEV.md b/DEV.md
@@ -0,0 +1,63 @@
+# NameHash Label Inspector
+
+This is additional documentation for maintainers.
+
+## Updating
+
+> **IMPORTANT** Regenerate the cache after updating!
+
+```bash
+python namehash_common/generate_cache.py
+```
+
+### Dependencies
+
+To update dependencies, modify package versions in `pyproject.toml` and run:
+
+```bash
+poetry update
+```
+
+### Unicode
+
+When a new Unicode version is released, update the character data by running:
+
+```bash
+UNICODE_VERSION=15.1.0 ./download_latest_data.sh
+```
+
+Replace `15.1.0` with the latest **official** Unicode version (not a draft).
+You can also specify an older version.
+The script will download:
+
+- list of confusable characters from <https://www.unicode.org/Public/security/latest/> into `inspector_data/inspector/confusables.json`
+- latest character data from <https://www.unicode.org/Public/UNIDATA/> into `myunicode/myunicode.json`
+- test data for numeric characters from <https://www.unicode.org/Public/UNIDATA/> into `tests/data/unicode_numerics.txt`
+
+You can then inspect and commit the changes with git.
+
+## Tests
+
+Run:
+
+```bash
+pytest
+```
+
+or without slow tests:
+
+```bash
+pytest -m "not slow"
+```
+
+## Dictionaries
+
+LabelInspector tokenizes labels using a dictionary. The dictionary is built from:
+
+1. All tokens from `inspector_data/words.txt` longer than 3 characters.
+2. All tokens from `inspector_data/custom_dictionary.txt`.
+
+Probabilities are calculated using an n-gram language model built from:
+
+1. `inspector_data/inspector/bigram_freq.csv` - for bigrams
+2. `inspector_data/inspector/unigram_freq.csv` - for unigrams. Tokens from `inspector_data/custom_dictionary.txt` that are missing from the unigram data get the count set in the config by `inspector.custom_token_frequency: 500000`.
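Note on the Dictionaries section of DEV.md: the two rules it describes (which tokens enter the dictionary, and which count a custom token gets when the unigram data lacks it) can be sketched roughly as below. This is a minimal illustration, not the repository's code: the function names `build_dictionary` and `unigram_counts` are hypothetical, "longer than 3 characters" is read literally as `len(token) > 3`, and `500_000` is simply the value quoted for `inspector.custom_token_frequency`.

```python
from typing import Dict, Iterable, Set

# Value quoted in DEV.md for `inspector.custom_token_frequency`.
CUSTOM_TOKEN_FREQUENCY = 500_000


def build_dictionary(words: Iterable[str], custom_tokens: Iterable[str]) -> Set[str]:
    """Hypothetical helper: combine the two token sources named in DEV.md."""
    # Rule 1: keep only word-list tokens longer than 3 characters.
    dictionary = {word for word in words if len(word) > 3}
    # Rule 2: custom tokens are always included, whatever their length.
    dictionary.update(custom_tokens)
    return dictionary


def unigram_counts(
    unigrams: Dict[str, int],
    custom_tokens: Iterable[str],
    custom_frequency: int = CUSTOM_TOKEN_FREQUENCY,
) -> Dict[str, int]:
    """Hypothetical helper: fill in counts for custom tokens missing from unigrams."""
    counts = dict(unigrams)  # copy so the caller's mapping is not mutated
    for token in custom_tokens:
        # Existing unigram counts win; only absent tokens get the configured value.
        counts.setdefault(token, custom_frequency)
    return counts
```

Under these assumptions, a short word such as `cat` is dropped from the word list, while a short custom token such as `eth` is kept and, if the unigram data has no entry for it, receives the configured count.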
diff --git a/README.md b/README.md
@@ -51,6 +51,10 @@ The Label Inspector includes a handler for [Amazon AWS Lambda](https://aws.amazo
 
 See the included [Dockerfile](/Dockerfile) for an example of how to build a Lambda deployment package.
 
+## For maintainers
+
+See [DEV.md](DEV.md).
+
 ## License
 
 Licensed under the MIT License, Copyright © 2023-present [NameHash Labs](https://namehashlabs.org).