This page describes the competition protocol in detail. It also provides information about the target machines, the training and test samples, and the evaluation of submitted test results.

Competition set up

The competition server has a total of 100 problems – each problem is a sample of positive and negative strings generated by randomly walking a target machine. These problems vary in difficulty according to two factors:

  • the size of the alphabet in the target machine
  • the sparsity of the sample (the extent to which it covers the behaviour of the target machine).

The 100 problems are divided into cells of 5 problems each, where each cell corresponds to a particular combination of sparsity and alphabet size. The table below shows how the cells are organised. Easier problems (with a smaller alphabet and a larger sample) are towards the upper-left of the table, and the harder problems (larger alphabet and smaller sample) are towards the bottom-right.

5 problems to solve in each cell

                     Sparsity of the training sample
  Alphabet size    100%        50%        25%        12.5%
        2          1 – 5       6 – 10     11 – 15    16 – 20
        5          21 – 25     26 – 30    31 – 35    36 – 40
       10          41 – 45     46 – 50    51 – 55    56 – 60
       20          61 – 65     66 – 70    71 – 75    76 – 80
       50          81 – 85     86 – 90    91 – 95    96 – 100

Cell points

                     Sparsity of the training sample
  Alphabet size    100%        50%        25%        12.5%
        2          1           1          3          3
        …

Participation in the competition consists of attempting to solve cells. A cell is considered solved when all 5 of the problems it contains are solved with a BCR score of at least 0.99. Each cell is assigned a number of points (from 1 to 4) according to its difficulty, based on the average performance of the Blue-Fringe algorithm, computed with our scoring procedure. The winner is the first technique to solve the hardest cell among those solved during the competition. The organizers will add further problems if all cells appear to be solved more than one month before the final deadline.


The target machines are generated using a variant of the Forest-Fire algorithm and exhibit characteristics similar to those of state machines that model the behaviour of software systems:

Number of states
State machines in the competition have approximately 50 states. Although this is somewhat larger than the conventional state machines described in the Software Engineering literature, it ensures that any technique submitted to this competition could scale to infer models of reasonably complex software systems. A roughly equal proportion of accepting and rejecting states has been chosen, a feature shared with Abbadingo.
Larger alphabets
State machine transitions are labelled by an alphabet symbol. Such a symbol may typically correspond to the input that triggers a transition or to a system function that governs the transition. For simplicity in this competition, alphabet symbols are represented by integer literals. Alphabet sizes in the competition range from 2 to 50 symbols (whereas previous competitions such as Abbadingo focussed on two-letter alphabets).
Degree distribution
State machines of software systems tend to contain a small proportion of states with a high out-degree while most states have in- and out-degrees of one or two transitions. Some states are sink accepting states, that is, they have an out-degree of 0.
Deterministic and minimal
The machines considered in this competition are both deterministic and minimal. A machine is deterministic if, for all states, there is at most one outgoing transition on any alphabet symbol. A machine is minimal if no smaller deterministic machine defines the same regular language.

Learn more about target machine generation…
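The determinism property defined above can be checked mechanically. Below is a minimal sketch; the tuple-list encoding of a machine is an illustrative assumption, not the competition's file format:

```python
def is_deterministic(transitions):
    """Check that no state has two outgoing transitions on the same symbol.

    `transitions` is a list of (source_state, symbol, target_state) tuples;
    this representation is illustrative only.
    """
    seen = set()
    for src, symbol, _dst in transitions:
        if (src, symbol) in seen:
            return False  # second transition out of `src` on `symbol`
        seen.add((src, symbol))
    return True

# A two-state machine over the alphabet {0, 1}:
dfa = [(0, 0, 1), (0, 1, 0), (1, 0, 1), (1, 1, 0)]
print(is_deterministic(dfa))                # True
print(is_deterministic(dfa + [(0, 0, 0)]))  # False: state 0 already moves on 0
```

Minimality is harder to test; it is usually established by running a DFA minimisation algorithm (such as Hopcroft's) and comparing state counts.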

Training and test samples

For each problem, disjoint training and test samples have been generated using a dedicated generation procedure whose aim is to simulate the way examples of system behaviour are usually obtained in the software engineering community (a collection of program traces at an implementation level or the generation of scenarios at a design level, for instance). Samples present the following characteristics:

Generated by the target
Positive strings have been generated by randomly walking the target machines. Negative strings are generated by randomly perturbing positive strings. Three kinds of edit operation are considered: substituting, inserting, or deleting a symbol.
Training samples
Training sets are sampled with different levels of sparsity (100%, 50%, 25%, 12.5%) and usually contain duplicates, as a consequence of the random walk generation from the target machine.
Test samples
Each test sample contains 1,500 strings with no duplicates (so that the scoring metric does not favor repeated strings). Training and test sets do not intersect.
Empirically adjusted
The different parameters of the generation procedure (the sample sizes, in particular) have all been empirically adjusted to ensure good induction results using Blue-Fringe on the simplest problems (alphabet of size 2 with a training sample sparsity of 100%) without Blue-Fringe solving the cell outright.

Learn more about sample generation…
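The perturbation step described above (one random substitution, insertion, or deletion) can be sketched as follows. This is an illustration of the idea only; the competition's actual generator additionally verifies that the edited string is rejected by the target machine before keeping it as a negative example:

```python
import random

def perturb(string, alphabet, rng=random):
    """Apply one random edit operation to a positive string:
    substitute, insert, or delete a symbol."""
    s = list(string)
    # An empty string can only be edited by insertion.
    op = rng.choice(["substitute", "insert", "delete"] if s else ["insert"])
    if op == "substitute":
        i = rng.randrange(len(s))
        s[i] = rng.choice([a for a in alphabet if a != s[i]])
    elif op == "insert":
        i = rng.randrange(len(s) + 1)
        s.insert(i, rng.choice(alphabet))
    else:  # delete
        del s[rng.randrange(len(s))]
    return s

positive = [0, 1, 1, 0]
negative = perturb(positive, alphabet=[0, 1])
print(negative)  # differs from `positive` by exactly one edit
```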


Once the test set for each problem has been downloaded, it is up to the competitor to establish whether each test string is accepted or rejected by their candidate machine for that problem. The competitor should produce a binary sequence of labels where, for each test string, a 1 is added to the sequence if the string is considered to be accepted, and a 0 otherwise:

0100100101101010001011010011100101... (up to the size of the test sample)
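For example, the label sequence can be produced by running each test string through the candidate machine. The dictionary encoding and the convention of rejecting on a missing transition are assumptions for this sketch, not requirements of the protocol:

```python
def accepts(dfa, accepting, string, start=0):
    """Run `string` through a candidate DFA; reject on a missing transition."""
    state = start
    for symbol in string:
        if (state, symbol) not in dfa:
            return False
        state = dfa[(state, symbol)]
    return state in accepting

# Toy candidate machine: accepts strings over {0, 1} with an even number of 1s.
dfa = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
accepting = {0}

test_strings = [[0, 1, 1], [1, 0], [0, 0], [1, 1, 1]]
labels = "".join("1" if accepts(dfa, accepting, s) else "0" for s in test_strings)
print(labels)  # "1010"
```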

The score is computed according to a Balanced Classification Rate (BCR) measure. The submitted binary sequence of labels is compared to the actual labels, and four counts are derived:

TP : True positives
TN : True negatives
FP : False positives
FN : False negatives

These counts are used to compute two rates, Cplus and Cminus:

Cplus  = TP / (TP + FN)
Cminus = TN / (TN + FP)

The BCR value is the harmonic mean of Cplus and Cminus, defined as follows:

BCR = (2 * Cplus * Cminus) / (Cplus + Cminus)
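The scoring formulas above can be sketched in a few lines. The string encoding of the label sequences matches the submission format described earlier; the function name is illustrative:

```python
def bcr(submitted, actual):
    """Balanced Classification Rate: harmonic mean of Cplus (rate of true
    positives correctly accepted) and Cminus (rate of true negatives
    correctly rejected). Both arguments are equal-length '0'/'1' strings."""
    tp = sum(1 for s, a in zip(submitted, actual) if s == "1" and a == "1")
    tn = sum(1 for s, a in zip(submitted, actual) if s == "0" and a == "0")
    fp = sum(1 for s, a in zip(submitted, actual) if s == "1" and a == "0")
    fn = sum(1 for s, a in zip(submitted, actual) if s == "0" and a == "1")
    c_plus = tp / (tp + fn)
    c_minus = tn / (tn + fp)
    return 2 * c_plus * c_minus / (c_plus + c_minus)

print(bcr("1100", "1100"))  # 1.0: perfect classification
print(bcr("1110", "1100"))  # ≈ 0.667: one false positive lowers Cminus
```

Note that the harmonic mean punishes imbalance: a classifier that accepts everything scores Cplus = 1 but Cminus = 0, giving a BCR of 0.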

A problem is deemed solved if the BCR value is greater than or equal to 0.99. A cell is deemed solved if all 5 of its problems are solved. Cells in the competition and participant result grids turn green when solved.