Random-walk procedure for sample generation

For each problem, disjoint training and test sets have been generated by randomly walking the target automaton. The procedure can be summarized as follows:

  1. Using the random walk algorithm described below, a first sample is generated from the target automaton. This sample contains exactly 20,000 strings but may contain duplicates. The random walk algorithm is tuned to get samples containing roughly equal number of positive and negative strings.
  2. The distinct strings of the initial sample are then equally partitioned into two sets, taking care of respecting the positive and negative balance in each one. The first set is used to generate the test sample, the second one to generate the learning sample.
  3. The test sample contains 1,500 strings randomly drawn without replacement from the first set. The test set neither contains duplicates (to avoid favoring repeated strings in the scoring metric), nor intersects with the training set.
  4. The official training sets are sampled from the second set with different levels of sparsity (100%, 50%, 25%, 12.5%) and usually contain duplicates, as a consequence of the random walk generation from the target machine.

Remark: the different parameters above (20,000 strings initially generated, 1,500 strings maximum in the test set and the learning sample size) have all been empirically adjusted to ensure good induction results using Blue-Fringe on the simplest problems (alphabet of size 2 with a learning sample sparsity of 100%) without breaking the cell.

Positive and negative strings

A dedicated random walk algorithm has been implemented to generate positive and negative strings of the initial sample described above. It generates positive strings by walking through the automaton from the initial state, randomly selecting outgoing transitions with a uniform distribution. When an accepting state v is reached, the generation ends with a probability of 1.0/(1 + 2*outdegree(v)). This procedure simulates an “end of string” transition from state v with half the probability of an existing transition. The length distribution of the strings generated is approximately centered on 5 + depth(automaton), a feature shared with the Abbadingo competition.

Negative strings are generated by randomly perturbing strings generated from the target machine. Three kinds of edit operation are considered: substituting, inserting, or deleting a symbol. The editing position is randomly chosen according to a uniform distribution over the string length. Substitution and insertion also use a uniform distribution over the alphabet. The number of editing operations is chosen according to a Poisson distribution (with a mean of 3) and the three editing operations have an equal change of being selected. The randomly edited string is included in the negative sample provided it is indeed rejected by the target machine. Otherwise, it is simply discarded.