add more options for train set generation

tech-srl · Jun 25, 2018 · 5232815
1 parent e7f83c0
commit 5232815

Showing 2 changed files with 16 additions and 9 deletions.
diff --git a/Training_Functions.py b/Training_Functions.py
@@ -1,15 +1,18 @@
 from Helper_Functions import n_words_of_length
 
-def make_train_set_for_target(target,alphabet,lengths=None):
+def make_train_set_for_target(target,alphabet,lengths=None,max_train_samples_per_length=300,search_size_per_length=1000,provided_examples=None):
     train_set = {}
+    if None == provided_examples:
+        provided_examples = []
     if None == lengths:
         lengths = list(range(15))+[15,20,25,30] 
     for l in lengths:
-        more = n_words_of_length(1000,l,alphabet)
-        pos = [w for w in more if target(w)]
-        neg = [w for w in more if not target(w)]
-        pos = pos[:150]
-        neg = neg[:150]
+        samples = [w for w in provided_examples if len(w)==l]
+        samples += n_words_of_length(search_size_per_length,l,alphabet)
+        pos = [w for w in samples if target(w)]
+        neg = [w for w in samples if not target(w)]
+        pos = pos[:int(max_train_samples_per_length/2)]
+        neg = neg[:int(max_train_samples_per_length/2)]
         minority = min(len(pos),len(neg))
         pos = pos[:minority+20]
         neg = neg[:minority+20]

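Review note (not part of the commit): the per-length trimming in the hunk above can be isolated as a small sketch. The helper name `trim_classes` is hypothetical; the halving of `max_train_samples_per_length` and the slack of 20 are taken from the diff.

```python
def trim_classes(pos, neg, max_train_samples_per_length=300):
    """Mirror the trimming steps from the hunk (hypothetical helper name)."""
    half = int(max_train_samples_per_length / 2)
    # Cap each class at half of the per-length budget.
    pos, neg = pos[:half], neg[:half]
    # Let the larger class exceed the smaller one by at most 20 words.
    minority = min(len(pos), len(neg))
    return pos[:minority + 20], neg[:minority + 20]


# 5 positive and 900 negative candidates of some length:
pos, neg = trim_classes(["p%d" % i for i in range(5)],
                        ["n%d" % i for i in range(900)])
print(len(pos), len(neg))  # 5 25
```

With a rare positive class, the negative class is cut to `minority + 20` words, so the train set for that length stays small unless more positives are supplied.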
diff --git a/dfa_from_rnn.ipynb b/dfa_from_rnn.ipynb
@@ -139,13 +139,17 @@
    "metadata": {},
    "source": [
     "#### 2.2. Create a Train Set\n",
-    "`make_train_set_for_target` returns for the target function a dictionary of words of different lengths, each mapped to its classification by the target. It uses words of various lengths over the given alphabet, trying to get up to 300 words from each length, but also trying to get an even divide in the classes in those words for each length (e.g., 150 positive and 150 negative examples for each length). The optional parameter `lengths`, a list of the integers, defines which lengths will appear in the train set. If not set, the function will work with the lengths $0-15,20,25$, and $30$.\n",
+    "Given a target function and an alphabet, `make_train_set_for_target` returns a dictionary that maps words of various lengths to their classification by the target. It tries to return a train set with an even split between positive and negative samples for each word length. Its optional parameters are:\n",
+    ">1. `max_train_samples_per_length` (default 300): the maximum number of words of each length in the train set\n",
+    ">2. `search_size_per_length` (default 1000): the maximum number of words to be sampled from each length while generating the train set\n",
+    ">3. `provided_examples` (default `None`): hand-crafted samples to add to the train set (helpful if random sampling is unlikely to find one of the classes); only examples whose length appears in `lengths` are used\n",
+    ">4. `lengths` (a list of integers, default $0-15,20,25,30$): the lengths that will appear in the train set\n",
     "\n",
-    "This function works by randomly sampling the words of various lengths. If the target is such that the positive or negative class is relatively rare, it is unlikely to create an evenly split test set. In this case it is best to add some examples to the train set manually. For instance, after getting the initial train set for the language of all words containing the sequence `0123` over the alphabet $\\{0,1,2,3\\}$, you may also run the following:\n",
+    "If the positive or negative class is relatively rare for the target, `make_train_set_for_target` is unlikely to create an evenly split train set on its own. In this case it is best to pass it some provided examples. For instance, for the language of all words containing the sequence `0123` over the alphabet $\\{0,1,2,3\\}$, you may want to run:\n",
     "```\n",
     "short_strings = [\"\",\"0\",\"1\",\"2\",\"3\"]\n",
     "positive_examples = [a+\"0123\"+b for a,b in itertools.product(short_strings,short_strings)]\n",
-    "train_set.update({w:True for w in positive_examples})\n",
+    "train_set = make_train_set_for_target(target,alphabet,provided_examples=positive_examples)\n",
     "```\n"
    ]
   },
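
Review note (not part of the commit): a self-contained sketch of how the new `provided_examples` parameter behaves for the `0123` language from the notebook text. Several parts are assumptions rather than the repository's code: `n_words_of_length` here is a stand-in that draws uniformly random words (the real helper lives in `Helper_Functions`), and the two `train_set.update` calls plus the `return` are a guess at the part of the function that lies below the hunk. The body otherwise follows the diff, written with `is None` and `// 2`, which behave the same as the committed forms for these inputs.

```python
import itertools
import random


def n_words_of_length(n, length, alphabet):
    # Stand-in (assumption): n uniformly random words of the given length.
    return ["".join(random.choice(alphabet) for _ in range(length))
            for _ in range(n)]


def make_train_set_for_target(target, alphabet, lengths=None,
                              max_train_samples_per_length=300,
                              search_size_per_length=1000,
                              provided_examples=None):
    train_set = {}
    if provided_examples is None:
        provided_examples = []
    if lengths is None:
        lengths = list(range(15)) + [15, 20, 25, 30]
    for l in lengths:
        # Provided examples come first, so they survive the slicing below.
        samples = [w for w in provided_examples if len(w) == l]
        samples += n_words_of_length(search_size_per_length, l, alphabet)
        pos = [w for w in samples if target(w)]
        neg = [w for w in samples if not target(w)]
        pos = pos[:max_train_samples_per_length // 2]
        neg = neg[:max_train_samples_per_length // 2]
        minority = min(len(pos), len(neg))
        pos = pos[:minority + 20]
        neg = neg[:minority + 20]
        # Assumption: the code below the hunk records the labels like this.
        train_set.update({w: True for w in pos})
        train_set.update({w: False for w in neg})
    return train_set


alphabet = "0123"
target = lambda w: "0123" in w  # words containing the sequence 0123

short_strings = ["", "0", "1", "2", "3"]
positive_examples = [a + "0123" + b
                     for a, b in itertools.product(short_strings, short_strings)]

random.seed(0)
train_set = make_train_set_for_target(target, alphabet,
                                      provided_examples=positive_examples)
# Every hand-crafted positive word ends up in the train set.
assert all(train_set[w] for w in positive_examples)
```

Because the provided words are placed ahead of the random samples, they are kept even when the positive class is then truncated, which is what makes them useful for short lengths where random sampling rarely produces `0123`.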