Tool Overview

Werner Dietl edited this page Dec 17, 2016 · 6 revisions

The system uses a set of static and dynamic program analysis tools to identify frequently used concepts in a corpus of Java projects and to assign ‘ontic’ labels to these concepts that capture their intent. A concept can be any program artifact that is easily identified by its name, such as a textbook algorithm, a class representing a known concept from the real world (e.g., a box) or from science (e.g., a vector), or an entire package such as an image processing library.

The ontic labels and their links to concrete program constructs such as classes or methods are stored in a triple store to make them accessible to the programmer’s apprentice, which we discuss later. Further, ontic labels associated with classes, methods, and fields are added to the source code of the corpus programs and can be used for further analysis when associated with a semantic meaning. For example, ontic annotations referring to types of measurement can be used to identify bugs in algorithms.


Employed tools:

partitions.jar

Input: set of corpus projects.

Output: clustering of the input projects into clusters that share a distinct set of typical words.

Motivation: Many of the employed analysis tools that search for similarities are computationally expensive. Hence, it is important to pre-filter the corpus to avoid searching for similarities in unrelated projects. Further, identifying typical words in the corpus generates a taxonomy that can be used for search and other purposes.
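The pre-filtering idea can be pictured roughly as follows. This is a minimal sketch, not partitions.jar's actual algorithm, and the project names and word lists are invented: each project is reduced to a bag of identifier words, weighted by TF-IDF, and projects are compared by cosine similarity so that projects sharing typical words end up close together.

```python
import math
from collections import Counter

def tfidf_vectors(project_words):
    """Compute a TF-IDF vector per project from its bag of identifier words."""
    n = len(project_words)
    # document frequency: in how many projects does each word appear?
    df = Counter()
    for words in project_words.values():
        df.update(set(words))
    vectors = {}
    for name, words in project_words.items():
        tf = Counter(words)
        vectors[name] = {w: tf[w] * math.log(n / df[w]) for w in tf}
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse word-weight vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy corpus: identifier words extracted from three hypothetical projects.
projects = {
    "Sort07": ["sort", "array", "swap", "pivot"],
    "Sort09": ["sort", "array", "merge", "swap"],
    "Image01": ["pixel", "filter", "image", "blur"],
}
vecs = tfidf_vectors(projects)
```

With such a similarity measure, any standard clustering method groups the two sorting projects together and keeps the image-processing project separate.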

clusterer.jar

Input: set of compiled corpus projects.

Output:

  • class_field_map.json file containing a map from classes to all fields in the corpus that are instances of this class.
  • word_based_field_clusters.json file clustering fields of the same type based on typical names. This clustering serves as initial assignment for type propagation.
  • clusters.json file clustering classes with similar names across projects. Several strategies for name-based clustering are available.

Motivation:

For clusters.json output

Our program similarity checking is based on flow-graphs whose nodes are labeled with types. To facilitate cross-project similarity checking, we need to re-label these graphs such that nodes labeled with class names that we consider similar receive a shared label. For example, consider two methods in different projects that multiply vectors in 3D space. The methods are likely to be similar, but the names and implementation details of the vector class may differ significantly from one project to the other. Hence, we use the clustering to identify that both classes are named ‘vector’, and we use the information that they occur in the same cluster to re-label our flow-graphs.
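The re-labeling step can be illustrated with a small sketch. The cluster name and class names below are invented; the point is only that inverting the clustering gives a map from original class names to one shared label per cluster, which makes flow-graphs from different projects comparable.

```python
# Hypothetical clustering output: one shared label per cluster of similar classes.
clusters = {"cluster_vector": ["com.a.Vec3", "org.b.Vector3D"]}

# Invert the clustering: original class name -> shared cluster label.
shared_label = {cls: c for c, members in clusters.items() for cls in members}

def relabel(node_labels):
    """Replace class-name node labels by their cluster label; keep others as-is."""
    return [shared_label.get(l, l) for l in node_labels]

# Node labels of flow-graphs from two different projects.
graph_a = ["com.a.Vec3", "double", "com.a.Vec3"]
graph_b = ["org.b.Vector3D", "double", "org.b.Vector3D"]
```

After re-labeling, both graphs carry the same label sequence, so a graph-based similarity check can match them.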

For word_based_field_clusters.json output

Java types often do not capture the intent of what a variable is used for. However, Java programmers tend to use human-readable variable names to provide hints of this intent. For example, variables that describe the details of a person could be named height, weight, and ssn. While the purpose of these variables is obvious to a human, a static analysis tool would only see three integers. To retain some of this ontic information provided by the programmer, the word_based_field_clusters.json output groups fields of the same type based on similarity in their naming. This does not tell the static analysis what height actually means, but it preserves the information that height is different from ssn, which may in turn allow an analysis to infer better invariants (e.g., height is between 30 and 250, and ssn is between 100000000 and 999999999).
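One simple way to realize such a grouping is to split identifiers into words and greedily merge fields of the same type whose word sets overlap. This is a hedged sketch with invented field names, not necessarily the strategy clusterer.jar uses:

```python
import re

def tokens(name):
    """Split a camelCase/underscore identifier into a set of lowercase words."""
    parts = re.sub(r'([a-z0-9])([A-Z])', r'\1 \2', name).replace('_', ' ').split()
    return {p.lower() for p in parts}

def cluster_fields(fields):
    """Greedily group (java_type, field_name) pairs that share a type and a word."""
    clusters = []
    for ftype, fname in fields:
        toks = tokens(fname)
        for c in clusters:
            if c["type"] == ftype and c["tokens"] & toks:
                c["fields"].append(fname)
                c["tokens"] |= toks
                break
        else:
            clusters.append({"type": ftype, "tokens": set(toks), "fields": [fname]})
    return clusters

fields = [("int", "height"), ("int", "bodyHeight"), ("int", "ssn")]
clusters = cluster_fields(fields)
```

Here height and bodyHeight land in one cluster while ssn stays separate, even though all three are plain ints to the Java type system.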

run_simprog.py

The tool "simprog" identifies similar programs across a set of projects. It takes as input graph representations of each program in the corpus and uses a graph kernel method to compute a similarity score between programs across projects.

How to run it: We provide a script called run_simprog.py in the integration-test2 master branch:

Usage:

    run_simprog.py [-h] [-c CLUSTER] -d DIR [-p PLIST] [-k]

    optional arguments:
      -h, --help            show this help message and exit
      -c CLUSTER, --cluster CLUSTER
                            path to the input json file that contains the
                            clustering information.
      -d DIR, --dir DIR     output folder.
      -p PLIST, --plist PLIST
                            a comma separated list of projects to work with.
      -k, --kernel          recompute kernel vectors.

Example usage:

    python run_simprog.py -d output -p "Sort07,Sort09,Sort10" -k

This uses the projects "Sort07", "Sort09", and "Sort10" as the corpus, computes the kernel vectors and stores them as "Sort0X_kernel.txt" in the "output" folder, and produces as output "Sort0X_result.txt", also in the "output" folder. The json version "Sort0X_result.json" of "Sort0X_result.txt" is also produced.

"Sort07_result.txt" contains, for each program in "Sort07", the top few most similar programs from "Sort09" and "Sort10", together with their similarity scores.
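As a rough intuition for how a graph kernel turns two labeled graphs into a similarity score, the toy sketch below uses the simplest possible kernel, a dot product of node-label histograms, on graphs represented as label lists. This is an illustration only; simprog's actual kernel operates on full flow-graph structure and will differ.

```python
import math
from collections import Counter

def kernel(g1, g2):
    """Bag-of-node-labels kernel: dot product of the two label histograms."""
    h1, h2 = Counter(g1), Counter(g2)
    return sum(h1[l] * h2[l] for l in h1)

def similarity(g1, g2):
    """Kernel score normalized into [0, 1]."""
    denom = math.sqrt(kernel(g1, g1) * kernel(g2, g2))
    return kernel(g1, g2) / denom if denom else 0.0

# Toy re-labeled flow-graphs (node-label lists only; labels are invented).
g1 = ["int", "cluster_vector", "int"]
g2 = ["int", "cluster_vector"]
g3 = ["java.lang.String"]
```

Programs whose graphs share many labels score close to 1, unrelated ones close to 0, which is exactly the shape of output reported in "Sort0X_result.txt".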


Ontology Type Inference (OTI)

General description

OTI supports other code analysis tools by propagating the ground truths about ontic types in the corpus. For example, ontic types help to distinguish two functions with the same Java types and control flow graphs. OTI takes a minimal set of ground truths from other tools and then propagates them properly in the corpus by type inference based on ontology type rules.

Input

Ground truths about ontic types in the corpus. A dictionary that contains:

  1. mappings from Java types to an ontology concept, e.g.
{
    "Sequence": [
        "TypeKind.ARRAY",
        "java.util.List"
    ]
}

This input tells OTI that Java array types and java.util.List are related to the ontology concept Sequence.

  2. mappings from fields to an ontology concept, e.g.
{
     "mappings": [
        {
         "fields":[
            "demo.package.Demo.externalVelocity"
         ],
         "label":[
            "velocity"
         ]
        },
        {
         "fields":[
            "demo.package.Demo.externalForce"
         ],
         "label":[
            "force"
         ]
        },
        ...
    ]
}

This input tells OTI that in package demo.package, there is a class Demo, whose field externalVelocity is related to ontology concept velocity, and field externalForce is related to ontology concept force.
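Both inputs are most useful to a tool once inverted into lookup tables from concrete program elements to ontology concepts. A minimal sketch of parsing and inverting the two mappings (using the example values from above):

```python
import json

# Mapping 1: ontology concept -> Java types, inverted to a per-type lookup.
type_mapping = json.loads('{"Sequence": ["TypeKind.ARRAY", "java.util.List"]}')
concept_of_type = {jt: c for c, jts in type_mapping.items() for jt in jts}

# Mapping 2: field groups -> label, inverted to a per-field lookup.
field_mapping = json.loads('''
{"mappings": [
    {"fields": ["demo.package.Demo.externalVelocity"], "label": ["velocity"]},
    {"fields": ["demo.package.Demo.externalForce"], "label": ["force"]}
]}
''')
concept_of_field = {f: m["label"][0]
                    for m in field_mapping["mappings"]
                    for f in m["fields"]}
```

These tables give OTI its ground truths: the first answers "which concept does this Java type denote?", the second "which concept does this field denote?".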

Output

The corpus annotated with @Ontology type annotations propagated from the ground truths. For example, given the above mappings, the corpus would be annotated as below:

Original file:

public class Demo {
    Vector externalVelocity;
    Vector externalForce;

    public void applyVelocity(Vector velocity) {
        externalVelocity.add(velocity);
    }

    public void applyForce(Vector force) {
        externalForce.add(force);
    }
}

Annotated file:

import ontology.qual.Ontology;
import ontology.qual.OntologyValue;

public class Demo {
    @Ontology(OntologyValue.VELOCITY_3D) Vector externalVelocity;
    @Ontology(OntologyValue.FORCE_3D) Vector externalForce;

    public void applyVelocity(@Ontology(OntologyValue.VELOCITY_3D) Vector velocity) {
        ((@Ontology(OntologyValue.VELOCITY_3D) Vector) (externalVelocity.add(velocity)));
    }

    public void applyForce(@Ontology(OntologyValue.FORCE_3D) Vector force) {
        ((@Ontology(OntologyValue.FORCE_3D) Vector) (externalForce.add(force)));
    }
}

Note how the input contained annotations for the two fields. These annotations have been propagated to the method parameters and the polymorphic method invocations.
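The propagation in this example can be pictured as a fixed-point computation over type constraints. The sketch below is a deliberately simplified stand-in for OTI's inference, with invented element names: each constraint links two program elements whose ontic labels must agree (e.g., a field and the parameter flowing into it), and seed labels spread in both directions until nothing changes.

```python
def propagate(seeds, constraints):
    """Spread seed labels across equality constraints until a fixed point.

    seeds: {program element: ontic label} ground truths.
    constraints: pairs of elements that must share one ontic label.
    """
    labels = dict(seeds)
    changed = True
    while changed:
        changed = False
        for a, b in constraints:
            for x, y in ((a, b), (b, a)):
                if x in labels and y not in labels:
                    labels[y] = labels[x]
                    changed = True
    return labels

# Ground truth: only the field is labeled, as in the field mapping above.
seeds = {"Demo.externalVelocity": "VELOCITY_3D"}
constraints = [
    # externalVelocity.add(velocity) links parameter and field ...
    ("Demo.externalVelocity", "applyVelocity.velocity"),
    # ... and the polymorphic return of add() to the parameter.
    ("applyVelocity.velocity", "applyVelocity.call.return"),
]
result = propagate(seeds, constraints)
```

Starting from a single annotated field, the label reaches the method parameter and the invocation result, mirroring how the @Ontology annotations spread in the Demo example.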

Original file:

public static int [] sort(int [] unsorted) {
    int [] sorted = new int[unsorted.length];
    for (int i = 0; i < unsorted.length; i++) {
        sorted[i] = unsorted[i];
    }
    Arrays.sort(sorted);
    return sorted;
}

Annotated file:

public static int @Ontology(OntologyValue.SEQUENCE) [] sort(int @Ontology(OntologyValue.SEQUENCE) [] unsorted) {
    int @Ontology(OntologyValue.SEQUENCE) [] sorted = new int[unsorted.length];
    for (int i = 0; i < unsorted.length; i++) {
        sorted[i] = unsorted[i];
    }
    Arrays.sort(sorted);
    return sorted;
}

Note how the general rule for arrays applies in this example, resulting in the SEQUENCE annotation on all array types.

Processing steps

  1. take a json file describing the mappings from Java types to ontology concepts and mappings from class fields to ontology concepts

  2. update the OTI system

  3. create new OntologyValue enum values in the OntologyValue class

  4. insert mapping rules from Java types to ontology values in the OntologyUtils#determineOntologyValue() method

  5. re-compile OTI

  6. create a .jaif file describing the annotation information on the class fields

  7. insert ground truth type annotations into the corpus (Instead of performing steps 0-3, this can also be done manually by annotating fields with ground truths.)

  8. run OTI on inserted corpus to further propagate type annotations

Data

The corpus source code, before and after OTI.

How to quantify the output

The number of `@Ontology` annotations that have been inserted into the corpus. Before running OTI, the source code contains no `@Ontology` annotations. After running OTI, the source code contains `@Ontology` annotations marking the ontic types that occur in the application.
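Counting the annotations can be done with a simple scan over the sources; a sketch (the regular expression counts annotation uses, not the import of the Ontology class):

```python
import re

def count_ontology_annotations(java_source):
    """Count occurrences of @Ontology(...) annotations in Java source text."""
    return len(re.findall(r'@Ontology\(', java_source))

# Source before and after OTI, abbreviated from the Demo example above.
before = 'public class Demo { Vector externalVelocity; }'
after = ('public class Demo { '
         '@Ontology(OntologyValue.VELOCITY_3D) Vector externalVelocity; }')
```

Running this over the corpus before and after OTI gives the annotation counts used to quantify the output.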

Running the scripts for OTI

The mapping_2_annotation module in integration-test2 provides a command-line interface for running OTI with given mappings.

Usage:

  1. insert rules mapping Java types to ontology concepts, then annotate the indicated projects in the corpus based on these rules:

     python map2annotation --type-mapping <type_mappings>.json --project-list projectA,projectB,...

  2. propagate annotations in the indicated projects in the corpus by using mappings from fields to an ontology concept as ground truth:

     python map2annotation --field-mapping <field_mappings>.json --project-list projectA,projectB,...

  3. do both:

     python map2annotation --type-mapping <type_mappings>.json --field-mapping <field_mappings>.json --project-list projectA,projectB,...

Note: --project-list is optional. If this argument isn't provided, map2annotation will run the propagation process on the whole corpus.

Note: when called from the command line, map2annotation will clean the source code of OTI before updating the type rules and Ontology values.

Note: for the json format of the mapping files, please refer to the examples in mapping_2_annotation/json_file_examples/