Version 67 (modified by sunshine, 13 years ago)


This page describes the language used to design experiments. It should be used both to design metadescriptions and, possibly, to make them more specific to a particular experiment the user wants to run.


I'll start with a few examples of experiments that we should be able to design in this language.

  1. [BotnetExample]
  2. [CachePoisonExample]
  3. [MitmExample]

Expressiveness of the Meta-description

We may end up with a single language or a set of related languages. Here is what we need to express:

  1. Logical topology - at the level of either individual nodes or groups of nodes. We are expressing a logical topology of the experiment, in which objects do something in the experiment: generate traffic, change state, hold data, and so on. It does not matter whether these objects are generated individually or as a group of entities, or whether they are physical or virtual nodes. The actual implementation of objects and the cardinality of each object should be orthogonal to the topology description. We should, however, be able to give hints such as "these objects are in the same network", "these objects are on the same physical node" or "object A resides on object B". Here is a rough list of hints we'd like to be able to give:
    • What type of object this is - we need to enumerate all possible types such as Node, Info, DNSRecord, ... As new meta-descriptions and new generators are added, the type list will grow
    • What is the cardinality of this object - this is just a hint about cardinality. I need to be able to say "this is one of many" vs "one and only one". For example, in [BotnetExample] I should have many vulnerable hosts, but in [MitmExample] I have exactly two nodes and one attacker between them. I could have multiple MITM triplets, but the minimum is one triplet.
    • Can objects overlap or are they distinct
    • An object is (not) located on another object (e.g., cache on a node)
    • An object is (not) contained in another object (e.g. cache record in cache)
  2. Timeline of events - we need to express the ordering of actions that some objects will take in the experiment, their duration, repetition and concurrency. We also need to express state transitions in objects. In some domains this is called a workflow. It could be pre-created in the experiment design stage, generated during the experiment (mined from events that happen as the user takes manual actions), or a mix of the two. Each experiment class must have some default workflow that the user can manipulate during experiment design. Here is a rough list of things to express here:
    • Parameters of an action
    • An action must happen vs may happen
    • Additional actions are allowed vs not allowed
    • An action can(not) be split into multiple smaller pieces that have the same effect
    • State changes in objects due to some action, at random, due to a timeout ... State changes that generate an action
    • Conditions that lead to state change or to an action
    • One (or none or N) of a number of actions must happen
    • Loops with and without conditions
  3. Invariants - we need to express what MUST happen in the experiment for it to be valid. This is not a complete set, just the necessary one. If any invariant is violated, the experiment becomes invalid. "Valid" here means that the experiment belongs to the class of experiments whose metadescription we used and satisfies any other conditions that the user wants to impose. There are two types of invariants:
    1. those that deal with objects and their states ("cache must be poisoned")
    2. those that deal with events and their features ("traffic must flow from A to B for 5 minutes at 100Mbps")
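
To make the topology hints from item 1 more concrete, here is a minimal sketch of how object type, cardinality, "located on" and "contained in" could be written down. This is only an illustration, not the proposed language: the class and field names (`LogicalObject`, `Cardinality`, `located_on`, `contained_in`) and the object names are hypothetical, and the overlap and "(not)" hints are left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ObjectType(Enum):
    # Hypothetical starter enum; the page only names Node, Info, DNSRecord.
    NODE = "Node"
    INFO = "Info"
    DNS_RECORD = "DNSRecord"


@dataclass(frozen=True)
class Cardinality:
    minimum: int = 1
    maximum: Optional[int] = None  # None means "no upper bound"

    def allows(self, count: int) -> bool:
        if count < self.minimum:
            return False
        return self.maximum is None or count <= self.maximum


ONE_ONLY = Cardinality(1, 1)        # "one and only one"
ONE_OF_MANY = Cardinality(1, None)  # "this is one of many"


@dataclass
class LogicalObject:
    name: str
    type: ObjectType
    cardinality: Cardinality = ONE_ONLY
    located_on: Optional[str] = None    # e.g. a cache located on a node
    contained_in: Optional[str] = None  # e.g. a cache record contained in a cache


def dangling_references(objects):
    """Return (object, referenced name) pairs whose target is not declared."""
    names = {o.name for o in objects}
    missing = []
    for o in objects:
        for ref in (o.located_on, o.contained_in):
            if ref is not None and ref not in names:
                missing.append((o.name, ref))
    return missing


# Hypothetical cache topology: a record contained in a cache located on a node.
cache_topology = [
    LogicalObject("resolver", ObjectType.NODE),
    LogicalObject("cache", ObjectType.INFO, located_on="resolver"),
    LogicalObject("record", ObjectType.DNS_RECORD, ONE_OF_MANY, contained_in="cache"),
]
print(dangling_references(cache_topology))
```

Nothing here says how the objects are implemented or how many instances exist, which is the point: cardinality stays a hint that a later mapping step can check with `allows()`.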

In the general case, invariants are defined in the logical topology and timeline of events. Additional invariants may be defined in this section, but so far I have had a hard time coming up with any for a meta-description. Naturally, when the user designs her experiment starting from the metadescription, more invariants will be defined automatically, and the user can choose to define others.
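
As a rough illustration of invariants falling out of the timeline, the sketch below treats "must happen" steps as implied invariants and checks an observed action log against them. It is an assumption-laden toy: `Step`, `violations` and the action names are hypothetical, and only ordering, must/may and "additional actions allowed" are modelled (no durations, loops or state).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    action: str
    required: bool = True  # "must happen" vs "may happen"


def violations(timeline, observed, allow_extra=True):
    """Check an observed action log against a timeline; return a list of problems."""
    problems = []
    known = {s.action for s in timeline}
    if not allow_extra:
        problems += [f"unexpected action: {a}" for a in observed if a not in known]
    # Required steps must appear in the observed log in timeline order.
    position = 0
    for step in timeline:
        if not step.required:
            continue  # optional steps may be absent; their order is not checked here
        try:
            position = observed.index(step.action, position) + 1
        except ValueError:
            problems.append(f"missing or out-of-order action: {step.action}")
    return problems


# Hypothetical default workflow; the action names are made up for illustration.
timeline = [
    Step("send_query"),
    Step("flood_forged_replies"),
    Step("retry", required=False),
    Step("check_cache"),
]

print(violations(timeline, ["send_query", "flood_forged_replies", "check_cache"]))
print(violations(timeline, ["flood_forged_replies", "send_query", "check_cache"]))
```

The first log satisfies every implied invariant; the second reports the flood step as missing or out of order because it happened before the query.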

Note that this is all intentionally high-level and orthogonal to any generator used to generate topologies, traffic, etc. There must be a mapping process that selects eligible generators for each dimension, takes their output, and maps objects and events to it. More about this mapping process later.

Domain Knowledge

The entire system has a domain knowledge database that contains domain-specific information, such as:

  • Context:
    • what is an IP address
    • what is an IP flow
    • what is an IP packet pair
    • what is a TCP connection (its setup consists of three packets with the following features ..)
    • what is a DNS request/reply, what makes a reply match the request
    • what is an HTTP request/reply, what makes a reply match the request
    • what is the syntax of a DNS record
  • Parameterization:
    • for each type of object and each type of event in any metadescription, what kind of parameters can be defined. E.g., for an HTTP request this would be the timing of requests and the content of requests (type=GET/PUT/POST, filename). Note that this is divorced from the actual generator of such an event: parameters are strictly related to events, regardless of how they are generated.
  • Relationships:
    • One type of traffic induces another: HTTP traffic leads to DNS traffic leads to ARP traffic
    • Nodes have an IP address and a MAC address (the address that ARP resolves) on each interface
    • Web files may contain forms that require user input
    • HTTP replies depend on Web files stored at a server-specific location
    • DNS records reside in caches and have info about name=address mappings, and info about authorities for domains

The context part would be mostly populated by us. The parameterization and relationship parts would be seeded by us and then extended by other experts who create meta-descriptions. There is an automated way of identifying unknowns in a meta-description that must be defined in the knowledge DB.
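
The page does not say how the unknowns are identified, so the following is only one simple possibility: treat the context part as a set of defined terms and report anything a meta-description uses that is not in it. The same toy DB also holds the "induces" relationship from the list above. `KNOWLEDGE_DB`, `unknown_terms`, `induced_traffic` and the "BGP update" term are hypothetical.

```python
# Toy knowledge DB; the entries mirror examples from the lists above.
KNOWLEDGE_DB = {
    "context": {"IP address", "IP flow", "TCP connection", "DNS request", "HTTP request"},
    "induces": {"HTTP": ["DNS"], "DNS": ["ARP"]},
}


def unknown_terms(used_terms, db=KNOWLEDGE_DB):
    """Terms a meta-description uses that the context part does not define."""
    return sorted(set(used_terms) - db["context"])


def induced_traffic(kind, db=KNOWLEDGE_DB):
    """All traffic types transitively induced by `kind`, in discovery order."""
    seen = []
    queue = list(db["induces"].get(kind, []))
    while queue:
        nxt = queue.pop(0)
        if nxt not in seen and nxt != kind:
            seen.append(nxt)
            queue.extend(db["induces"].get(nxt, []))
    return seen


# "BGP update" is a made-up term that the toy DB does not define.
print(unknown_terms(["IP flow", "DNS request", "BGP update"]))  # ['BGP update']
print(induced_traffic("HTTP"))  # ['DNS', 'ARP']
```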

Generator Descriptions

Our system keeps the following info about each generator:

  • Name
  • Contact of the author or N/A for non-supported generators
  • URL of the manual for the generator
  • What is the type of its output (e.g. Topology) - the types mentioned here are from the same enum list as object types in the logical topology
  • What parameters users can customize - these come from the same list as parameters in the parameterization part of the knowledge DB. A generator may have some parameters specific to itself; we don't care about those in this item. We care about the parameters that describe a certain event/object and which of them we can manipulate if we choose this generator.
  • How the parameters can be customized, e.g., randomized, input by user, selected from a list ...
  • Other inputs of the generator, their range of values and their relationship to the parameters from the above two items
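
The record above could be sketched as follows, together with a naive eligibility check of the kind the mapping process might start from. This is a guess at a shape, not a settled design: `GeneratorInfo`, `eligible`, the generator names, the "Traffic" type, the parameter names and the example.org URLs are all placeholders, and the "other inputs" item is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class GeneratorInfo:
    name: str
    contact: str      # author contact, or "N/A" for non-supported generators
    manual_url: str
    output_type: str  # drawn from the same enum as topology object types
    # parameter name -> how it can be customized (randomized, input by user, ...)
    parameters: dict = field(default_factory=dict)


def eligible(generators, output_type, needed_parameters):
    """Names of generators that produce output_type and expose every needed parameter."""
    return [
        g.name
        for g in generators
        if g.output_type == output_type and set(needed_parameters) <= set(g.parameters)
    ]


# Placeholder generators for illustration only.
generators = [
    GeneratorInfo("gen-a", "N/A", "https://example.org/gen-a", "Topology",
                  {"node_count": "input by user"}),
    GeneratorInfo("gen-b", "author@example.org", "https://example.org/gen-b", "Traffic",
                  {"timing": "randomized", "content": "selected from a list"}),
    GeneratorInfo("gen-c", "N/A", "https://example.org/gen-c", "Traffic",
                  {"timing": "randomized"}),
]

print(eligible(generators, "Traffic", ["timing", "content"]))  # ['gen-b']
```

Generator-specific parameters are deliberately absent from `parameters`; only those shared with the knowledge DB matter for selection.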

Mapping elements of metadescription to generators during experiment design


  • How is ordering of events defined?
  • How do we denote "all", "each", "none", "some"?
  • How do we denote state transitions because of an event, vs. self-initiated ones, vs. those that emit an event?