Code to the Paper "Automatic Detection of Semantic Primitives Using Optimization Based on Genetic Algorithm"

Setup

  1. Clone the repository:
git clone https://github.com/YevhenKost/SemPrimsDetectionGA.git
  2. Install the requirements:
pip install -r requirements.txt
  3. Fill in the configs
    1. PageRank model fitting parameters (conf/params_pagerank.json). A description of the parameters can be found via the following link: PageRank. An example config is sketched after this list.
    2. Word vectorization paths and save names (conf/vectorization_configs.py). For each vectorizer, provide the required model paths on your local machine.
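
A filled conf/params_pagerank.json might look like the following; the exact keys depend on the PageRank implementation linked above, so alpha, max_iter and tol here are illustrative assumptions rather than a required schema:

{
    "alpha": 0.85,
    "max_iter": 100,
    "tol": 1e-06
}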

Usage

  1. Prepare the dictionary in the following format and save it to a json file. It is suggested to create a dedicated directory for the dictionary, so that all results are stored in one place. For example:
import json, os

# load dictionary
my_dict = {
    "cat": [
        {"definition": "a very cute animal"},
        {"definition": "makes muuuuuur"}
    ],
    "buy": [
        {"definition": "exchange something for a money"}
    ]
}

# save to the dir
SAVE_DIR = "cat_but_directory/"
os.makedirs(SAVE_DIR, exist_ok=True)
with open(os.path.join(SAVE_DIR, "dictionary.json"), "w") as f:
    json.dump(my_dict, f)
  2. Convert the dictionary to a directed graph via the following command (paths are taken from the previous example):
python dict2graph.py --word_dictionary_path cat_but_directory/dictionary.json --stanza_dir LOADED_STANZA_MODELS/en --stanza_lang en --stop_words_lang english --save_dir cat_but_directory/ --drop_self_cycles true --lemm_always false

The arguments required:

  • --word_dictionary_path: path to the dictionary saved in json format (see previous example)
  • --stanza_dir: directory with downloaded stanza models; the stanza package is used for lemmatization. Can be "", in which case stanza will download everything it needs based on the language given in --stanza_lang. For model details, see Pipeline.
  • --stanza_lang: language of the dictionary. The list of available languages can be found here.
  • --stop_words_lang: stop words language to use. The list of available languages can be found here.
  • --save_dir: path to a directory where the graph dict files will be stored: the word encoding dictionary and the graph edges dictionary, both in json format (see the illustrative sketch after this list). It is suggested to use the same directory as for the dictionary.
  • --drop_self_cycles: boolean, whether to drop definitions that contain the word they are supposed to define. For example, for the word "bark" the definition "to bark" will not be used during graph building.
  • --lemm_always: boolean, whether to always lemmatize words, or only when a word is not found in the dictionary vocabulary.
  • --vocabulary_list_path (Optional): str, path to a json file with the vocabulary to use for graph building. If None, the keys from the file at --word_dictionary_path will be used.
  • --lemm_vocabulary (Optional): boolean, whether to lemmatize the words in the vocabulary list; duplicates will be removed. Ignored if --vocabulary_list_path is empty.
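
The on-disk format of the two output files is not spelled out in this README; purely as an illustration, they could look like the following (hypothetical contents, shown as Python literals):

# encoding_dict.json: word -> integer node id (hypothetical example)
encoding_dict = {"cat": 0, "buy": 1, "animal": 2}

# graph.json: node id -> ids of the words occurring in its definitions (hypothetical example)
graph_edges = {"0": [2], "1": [0, 2]}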

For more details:

python dict2graph.py -h
  3. Run the generation of permutation-based semantic primitives (SP) sets:
python sp_generation.py --load_dir cat_but_directory/ --N 1000 --n_cores 12 --seed 2

Note that this can take a while: for a WordNet dictionary, generating 1,000 SP lists took around a week with multiprocessing. The command will save the generated lists to --load_dir in the following format and under the following filename:

sp_sets_format = [
    [1, 2, 3],  # sp set
    [10, 2, 5]  # sp set
]

filename = f"candidates_{str(N)}_random{str(seed)}.json" # N and seed are taken from the arguments

The arguments required:

  • --load_dir: path to the directory which contains the graph.json file (generated in the previous step). The generated SP lists will be saved here.
  • --N: int, number of SP lists to generate (there is no guarantee that they will all be unique).
  • --n_cores: int, how many cores to use during multiprocessing.
  • --seed: int, random seed (for reproducibility).
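
To inspect the generated sets, the integer ids can be decoded back into words. A minimal sketch, assuming encoding_dict.json maps words to the integer ids used in the SP sets (the actual file layout may differ):

import json

# load the generated SP sets and the word encoding
with open("cat_but_directory/candidates_1000_random2.json") as f:
    sp_sets = json.load(f)
with open("cat_but_directory/encoding_dict.json") as f:
    word2id = json.load(f)

# invert the encoding and decode the first SP set
id2word = {word_id: word for word, word_id in word2id.items()}
print([id2word[i] for i in sp_sets[0]])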

For more details:
python sp_generation.py -h
  4. Fit the PageRank model:
python page_rank.py --load_dir cat_but_directory/ --fit_params_path conf/params_pagerank.json

The fitted model will be saved to --load_dir.

The arguments required:

  • --load_dir: path to the directory containing the graph.json file (generated in the first step). The trained pagerank model will be saved in this directory.
  • --fit_params_path: path to a json file with the pagerank parameters. See conf/params_pagerank.json. A conceptual sketch of this step follows this list.
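
Conceptually, this step scores the graph nodes with PageRank and pickles the result. A minimal sketch of the idea, assuming a networkx-backed implementation and the hypothetical edge-list format sketched earlier (the repository's actual code may differ):

import json, pickle
import networkx as nx

with open("cat_but_directory/graph.json") as f:
    edges = json.load(f)
with open("conf/params_pagerank.json") as f:
    params = json.load(f)

# build a directed graph from the edge lists and score its nodes
G = nx.DiGraph([(int(src), dst) for src, dsts in edges.items() for dst in dsts])
scores = nx.pagerank(G, **params)  # node id -> PageRank score

with open("cat_but_directory/pagerank.pickle", "wb") as f:
    pickle.dump(scores, f)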

For more details:
python page_rank.py -h
  5. Run the algorithm:
python run.py --load_dir cat_but_directory/ --sp_gen_lists_path cat_but_directory/candidates_1000_random2.json --n_threads 8 --val_prank_fill -1.0 --pop_size 100 --card_diff 50 --card_upper 2800 --save_dir GA_fit_model

The algorithm results will be saved to --save_dir (see https://pymoo.org/interface/result.html). The decoded results will be stored in save_dir/sp_wordlists/.


The arguments required:

  • --load_dir: str, path to the directory which contains the graph.json, encoding_dict.json and pagerank.pickle files (generated in the previous steps).
  • --chp_path (Optional): str, path to a .npy checkpoint (if you want to continue training). After training, this checkpoint will be saved in the --save_dir.
  • --n_threads: int, number of cores to use for multiprocessing.
  • --sp_gen_lists_path: str, path to the json file with the stored generated SP lists (see step 3).
  • --val_prank_fill: negative float, value returned by the mean pagerank objective function if a cycle is still detected in the graph.
  • --pop_size: int, population size (see https://pymoo.org/algorithms/soo/ga.html#nb-ga).
  • --card_diff: int, maximum possible cardinality deviation (constraint function: f(X) = (X - card_mean) ** 2 <= card_diff ** 2; see the sketch after this list).
  • --card_mean: int, mean cardinality for the constraint (same constraint function as above).
  • --max_mutate: int, maximum number of elements to mutate per population. Default: 60.
  • --min_mutate: int, minimum number of elements to mutate per population. Default: 0.
  • --n_max_gen: int, maximum number of generations (iterations) to fit the algorithm. Default: 30.
  • --save_dir: path where the training args, checkpoint and results will be stored.
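
The cardinality constraint above can be read in pymoo's g(x) <= 0 convention. A minimal sketch, assuming X is a boolean selection mask over graph nodes (an assumption about the encoding, not necessarily what run.py uses):

import numpy as np

def cardinality_constraint(X, card_mean, card_diff):
    # feasible when (|X| - card_mean)^2 <= card_diff^2,
    # i.e. when the returned value is <= 0
    cardinality = np.sum(X)
    return (cardinality - card_mean) ** 2 - card_diff ** 2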

For more details see:

python run.py -h

Testing

  1. Prepare word lists
    Create a directory where each word list is stored in a text file, one word per line.
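
For example, creating one such file in Python (the directory and file names are illustrative):

import os

os.makedirs("wordlists", exist_ok=True)
with open("wordlists/animals.txt", "w") as f:
    f.write("\n".join(["cat", "dog", "mouse"]))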

  2. Fill up the preprocessing configs
    Before vectorizing, fill in the conf/vectorization_configs.py and word_preprocessing_utils.py files. word_preprocessing_utils.py currently supports preprocessing for English, Spanish and Ukrainian, but it is possible to add new classes for other languages. In conf/vectorization_configs.py, fill in the stemming/lemmatization fields with the suitable classes.

  3. Vectorize target word lists

python vectorize_words.py --lists_dir wordlists/ --save_dir wordlists/embeddings/

The arguments required:

  • --lists_dir: path to the directory which contains the word lists (see step 1).
  • --save_dir: path where the embeddings should be saved. A directory will be generated for each word list, with the same name as the file; in each directory the embeddings will be saved in .npy format.
  4. Vectorize the obtained word lists (see step 5 of Usage):
python vectorize_words.py --lists_dir GA_fit_model/sp_wordlists --save_dir GA_fit_model/sp_embeddings/
  5. Calculate and save the metrics:
python evaluate.py --pred_wordlist_embeddings_dir GA_fit_model/sp_embeddings --target_wordlist_dir wordlists/embeddings/ --save_dir GA_fit_model/ --metric cosine

The arguments required:

  • --pred_wordlist_embeddings_dir: path to directory, where the embeddings for generated populations are stored (see previous step).
  • --target_wordlist_dir: path to directory, where the embeddings for target word lists are stored (see step 3).
  • --metric: metric to use; see the metric argument at https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html.
  • --save_dir: path where the metrics should be saved. A json file named metrics_<metric>.json will be generated, where <metric> is the specified metric.
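
Under the hood, the comparison reduces to pairwise distances between the two embedding matrices. A sketch of the idea using scipy's cdist (the paths are illustrative, and the actual aggregation in evaluate.py may differ):

import numpy as np
from scipy.spatial.distance import cdist

pred = np.load("GA_fit_model/sp_embeddings/list_0/embeddings.npy")  # illustrative path
target = np.load("wordlists/embeddings/animals/embeddings.npy")     # illustrative path

# (n_pred_words x n_target_words) matrix of cosine distances
distances = cdist(pred, target, metric="cosine")
print(distances.mean())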
