
This page discusses the language used to design experiments. It should be usable both to design metadescriptions and, possibly, to make them more specific to a particular experiment the user wants to run.

Examples

I'll start with a few examples of experiments that we should be able to design in this language.

A botnet experiment where a worm infects some vulnerable hosts, which then organize into a P2P botnet with a botmaster and start exchanging C&C traffic. The experimenter wants to observe the evolution of the botnet and the amount of traffic that the master receives. There are two classes of experiments here that need to be combined:
1. an experiment where worm spreads and infects vulnerable hosts
2. an experiment where some hosts organize into P2P network and somehow elect a leader who then sends commands to them and they may send reports back

A cache poisoning experiment where the attacker poisons a DNS cache to take over authority for a given domain. The attacker then creates a phishing page and tries to steal users' usernames/passwords. There are two classes of experiments that need to be combined:
1. an experiment where a DNS cache is poisoned, subclass of cache poisoning experiments
2. an experiment where a phishing attack is conducted via a Web page to steal usernames/passwords

An ARP spoofing experiment where the attacker puts himself in between two nodes and then modifies their traffic. There are two classes of experiments that need to be combined:
1. an experiment where ARP poisoning happens between two nodes by the attacker
2. an experiment where an attacker changes traffic passing through it

Requirements

We may end up with a single language or a set of related languages. Here is what we need to express:

Logical topologies - at the level of individual nodes or of groups of nodes. We are expressing a logical topology of the experiment, in which objects do something in the experiment: generate traffic, change state, hold data, and so on. It does not matter whether these objects are generated individually or as a group of entities, or whether they are physical or virtual nodes. The language should be expressive enough that the actual implementation of objects and the cardinality of each object are orthogonal to the topology description. We should, however, be able to give hints such as "these objects are in the same network", "these objects are on the same physical node" or "object A resides on object B". These hints are called constraints.
Timeline of events - we need to express the ordering of actions that objects take in the experiment, and their duration, repetition and concurrency. We also need to express state transitions in objects. In some domains this is called a workflow. It could be pre-created in the experiment design stage, generated manually during the experiment (mined from events that happen as the user takes manual actions), or a mix of the two. Each experiment class must have a default workflow that the user can manipulate during experiment design.
Invariants - we need to express what MUST happen in the experiment for it to be valid. Valid here means "for it to belong to a class of experiments whose metadescription we used" plus any other conditions that user wants to impose. There are two types of invariants:
1. those that deal with objects and their states ("cache must be poisoned")
2. those that deal with events and their features ("traffic must flow from A to B for 5 minutes at 100Mbps")
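
The two invariant types can be read as predicates over a trace of the experiment. Below is a minimal Python sketch, not a proposed syntax. The names `Flow`, `state_invariant` and `flow_invariant` are hypothetical, and the trace format (a dictionary of object states plus a list of observed flows) is an assumption made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    """One observed traffic event (assumed trace format)."""
    src: str
    dst: str
    duration_s: float
    rate_mbps: float


def state_invariant(states, obj, required):
    """Type 1: an object must be in a given state."""
    return states.get(obj) == required


def flow_invariant(flows, src, dst, min_duration_s, min_rate_mbps):
    """Type 2: some event with the required features must occur."""
    return any(
        f.src == src and f.dst == dst
        and f.duration_s >= min_duration_s
        and f.rate_mbps >= min_rate_mbps
        for f in flows
    )


# "cache must be poisoned"
states = {"cache": "poisoned"}
# "traffic must flow from A to B for 5 minutes at 100Mbps"
flows = [Flow("A", "B", duration_s=300, rate_mbps=100)]

print(state_invariant(states, "cache", "poisoned"))
print(flow_invariant(flows, "A", "B", 300, 100))
```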

Note that all of this is intentionally high-level and orthogonal to any generator used to generate topologies, traffic, etc. There must be a mapping process that selects eligible generators for each dimension, takes their output, and maps objects and events to it. More about this mapping process later.

Diving in

For now I'll ignore the question of which language to use to design experiments, because I think that pretty much any language can be used once we know what we want to say. To figure this out I'll try to use some variation of UML that can express both protocol diagrams and state transitions. If the level of detail is right, we can decide on an appropriate language in the next step.

Example 1: Botnet

This example uses two metadescriptions. Let's go through each of them:

Worm spread metadescription

Dimensions:

Logical topology:

(in English: There must be two sets of hosts, at least one infected host in infected set and at least one vulnerable host in vulnerable set. These sets are disjoint. All objects here are of type Nodes.)
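
The English description above boils down to two cardinality constraints and one disjointness constraint. A minimal Python sketch of such a check follows. The function name `check_worm_topology` and the use of plain Python sets of host names are assumptions for illustration, not part of any metadescription syntax.

```python
def check_worm_topology(infected, vulnerable):
    """Return the list of violated constraints (empty if the topology is valid)."""
    errors = []
    if len(infected) < 1:
        errors.append("infected set needs at least one host")
    if len(vulnerable) < 1:
        errors.append("vulnerable set needs at least one host")
    if infected & vulnerable:
        errors.append("infected and vulnerable sets must be disjoint")
    return errors


print(check_worm_topology({"h1"}, {"h2", "h3"}))
print(check_worm_topology({"h1"}, {"h1"}))
```

Note that the check says nothing about how many hosts there are beyond the minimum, or how they are implemented, which matches the requirement that cardinality and implementation stay orthogonal to the topology description.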

Timeline of events:

(in English: Each infected host generates scan events that target a vulnerable host, scanning for vulnerability x. Once a scan hits a vulnerable host with vulnerability x, an infection event occurs and the vulnerable host becomes infected.)

Note that I haven't yet defined very well what scan event means. I have to do this somewhere but I think the right place for this would be a common repository of domain knowledge.
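
To make the timeline concrete, here is a minimal Python sketch that plays it out on two sets of host names. Since the scan event is not defined yet, the sketch assumes the simplest possible reading: a scan picks one remaining vulnerable host and always succeeds. The function name `spread` and the `(event, source, target)` log format are hypothetical.

```python
import random


def spread(infected, vulnerable, seed=0):
    """Run scans until no vulnerable host is left; return (infected, event log)."""
    rng = random.Random(seed)            # seeded so that runs are repeatable
    infected = set(infected)
    vulnerable = set(vulnerable)
    log = []
    while vulnerable and infected:
        for src in sorted(infected):     # snapshot: new hosts start scanning next round
            if not vulnerable:
                break
            target = rng.choice(sorted(vulnerable))
            log.append(("scan", src, target))
            # Assumption: every scan of a vulnerable host leads to infection.
            log.append(("infect", src, target))
            vulnerable.remove(target)
            infected.add(target)         # state transition: vulnerable -> infected
    return infected, log


final, events = spread({"h1"}, {"h2", "h3", "h4"})
print(sorted(final))
print(len([e for e in events if e[0] == "infect"]))
```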

Invariants: There are some in definition of topology and timeline above. No additional ones are needed here.

P2P with leader and C&C traffic metadescription

Logical topology:

(in English: There must be two sets of hosts, with at least two peers and at least one leader. Nothing is said about the relationship between the sets, so their intersection may be non-empty. All objects here are of type Nodes.)

Timeline of events:

(in English: Each peer contacts some other peer asking them to peer with it - the contacted peer may reply with a "yes". In parallel with this, a peer somehow learns about a leader. If a leader object is known to a given peer, the peer will send it a "hello" message. The leader will then send commands to the peers it knows and may get reports back from them.)

Note that I haven't defined what the wannapeer, yespeer, leaderis, hello, cmd and report events are; I should define them in the common domain knowledge base.
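
One way to read the peer side of this timeline is as an event handler that updates local state and emits follow-up events. The Python sketch below is an assumption-laden illustration: the `Peer` class and its `handle` method are hypothetical, and the optional steps (a peer "may" say yes, a peer "may" report) are simplified to always happen.

```python
class Peer:
    def __init__(self, name):
        self.name = name
        self.peers = set()     # peers this node has paired with
        self.leader = None     # leader identity, once learned

    def handle(self, event, sender, payload=None):
        """Update local state and return the (event, recipient) pairs to emit."""
        if event == "wannapeer":
            self.peers.add(sender)
            return [("yespeer", sender)]     # simplification: always accept
        if event == "yespeer":
            self.peers.add(sender)
            return []
        if event == "leaderis":
            self.leader = payload            # payload carries the leader's identity
            return [("hello", payload)]
        if event == "cmd" and sender == self.leader:
            return [("report", sender)]      # simplification: always report
        return []


p = Peer("p1")
print(p.handle("wannapeer", "p2"))
print(p.handle("leaderis", "p2", payload="L"))
print(p.handle("cmd", "L"))
```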

Invariants: There are some in definition of topology and timeline above. No additional ones are needed here.

Experiment design

Now I'm a user who wants to design my experiment. I need to combine two metadescriptions and somehow tie them down to generator choices. To combine I need to specify how outputs of worm metadescription match inputs of P2P metadescription. I'll do something like this:

i.e. each infected host becomes a peer.
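
A binding of this kind can be thought of as a rule that feeds the members of one metadescription's set into a set of another. Below is a minimal Python sketch. The dotted set names such as `worm.infected` and the function `apply_bindings` are hypothetical and only stand in for whatever notation the language ends up using.

```python
def apply_bindings(sets, bindings):
    """Copy members of each source set into its bound destination set."""
    combined = {name: set(members) for name, members in sets.items()}
    for source, destination in bindings:
        combined[destination] |= combined[source]
    return combined


sets = {
    "worm.infected": {"h1", "h2"},
    "worm.vulnerable": {"h3"},
    "p2p.peers": set(),
    "p2p.leaders": {"h9"},
}
# "each infected host becomes a peer"
combined = apply_bindings(sets, [("worm.infected", "p2p.peers")])
print(sorted(combined["p2p.peers"]))
```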

The system now needs to offer me several generators:

It should offer a topology generator and map the initial infected set, vulnerable set and leader set to the topology that gets generated.
It should offer an event generator for each of the events: scan, yespeer, wannapeer, leaderis, hello, cmd and report. Specifically, for scan, yespeer, wannapeer, hello, cmd and report it should offer traffic generators. For leaderis it could offer either a traffic generator or an option to hardcode the leader ID into the peer software.
It should offer a malware generator for vulnerability x.

User either chooses each generator or agrees to use a default one for each choice. User can then manipulate the generators (their parameters) and the workflow. For example the user may add "patched" state after the "infected" one with the "patch" event to make the transition.
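
The "patched" example can be sketched as a small edit to a state transition table. The Python below is only an illustration under assumptions: the table format (a dictionary keyed by `(state, event)`) and the names `WORM_WORKFLOW`, `add_transition` and `run` are hypothetical.

```python
# Default workflow of the worm spread metadescription, as a transition table.
WORM_WORKFLOW = {("vulnerable", "infect"): "infected"}


def add_transition(workflow, state, event, new_state):
    """Return a copy of the workflow with one extra transition."""
    extended = dict(workflow)
    extended[(state, event)] = new_state
    return extended


def run(workflow, state, events):
    """Apply events in order; events with no matching transition are ignored."""
    for event in events:
        state = workflow.get((state, event), state)
    return state


# The user adds a "patched" state reached from "infected" via a "patch" event.
patched = add_transition(WORM_WORKFLOW, "infected", "patch", "patched")
print(run(patched, "vulnerable", ["infect", "patch"]))
```

Returning a copy keeps the default workflow of the experiment class intact, so other experiments built from the same metadescription are not affected by one user's changes.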

Example 2: DNS cache poisoning for phishing

This example uses two metadescriptions. Let's go through each of them:

Cache poisoning metadescription

Dimensions:

Logical topology:

(in English: There is one attacker node. There is one fake resource, of type Info which means that it is a piece of text that is given to someone upon request. A cache is simply a collection of Info items, one or more. Cache does not reside at the attacker.)

Timeline of events:

(in English: An attacker sends a reply to the cache (may be solicited or not) with information about resource=fakeresource. This leads to cache accepting the fake information. Notice that nothing is said about WHY cache chooses to accept this - this is specific to the poisoning flavor.)
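
Because the general metadescription deliberately leaves out why the cache accepts a reply, a sketch of it can take the acceptance rule as a parameter that each poisoning flavor fills in. The Python below is a minimal illustration; the `Cache` class, the `accepts` hook and `placement_ok` are hypothetical names.

```python
class Cache:
    def __init__(self, host, accepts=lambda resource, value: True):
        self.host = host          # node the cache resides on
        self.items = {}           # resource -> Info value
        self.accepts = accepts    # flavor-specific acceptance rule (assumed hook)

    def on_reply(self, resource, value):
        """Store the value if the flavor-specific rule accepts the reply."""
        if self.accepts(resource, value):
            self.items[resource] = value
            return True
        return False


def placement_ok(cache, attacker_host):
    """Constraint from the topology: the cache does not reside at the attacker."""
    return cache.host != attacker_host


cache = Cache(host="n1")
print(placement_ok(cache, "attacker"))
print(cache.on_reply("resource", "fakeresource"))
print(cache.items)
```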

Invariants:

Nothing in addition to the topology and timeline above.

DNS Cache poisoning metadescription

This is a special case of cache poisoning where the target is DNS cache. I've highlighted customizations from the general cache poisoning metadescriptions to arrive at this one.

(in English: There is one attacker node. There is either a fakeIP or fake authority that the attacker wants to inject into the cache. The first is of type IPaddress, the second of type DNSname. Both of these types are subtypes of Info, and this is recorded somewhere in the domain knowledge DB. A cache is simply a collection of DNSRecord items, one or more. These are also subtypes of Info and in the domain knowledge DB there's syntax defined for a DNSRecord. Cache does not reside at the attacker.)
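
The subtype relationships described above map naturally onto a small class hierarchy. The Python sketch below is an assumption: the real syntax of a DNSRecord lives in the domain knowledge DB, so the two fields used here (`name` and `value`) are hypothetical placeholders.

```python
from dataclasses import dataclass


class Info:
    """Base type: a piece of text given to someone upon request."""


@dataclass(frozen=True)
class IPaddress(Info):
    value: str


@dataclass(frozen=True)
class DNSname(Info):
    value: str


@dataclass(frozen=True)
class DNSRecord(Info):
    name: DNSname
    value: Info    # assumed: an IPaddress, or a DNSname naming an authority


# A DNS cache is a collection of one or more DNSRecord items.
record = DNSRecord(DNSname("name.domain"), IPaddress("192.0.2.1"))
dns_cache = {record}
print(all(isinstance(r, Info) for r in dns_cache))
```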

Timeline of events:

(in English: The attacker asks for name.domain; nothing is said about whether the name is made up or real. The cache then finds the authority for that domain; again, nothing is said about how. Once it is found, the cache asks the authority for name.domain and gets a reply. At the same time, one of two scenarios can happen. If the name was made up (selected randomly), the attacker tries to replace the authority for the domain with a fake authority. Otherwise, he tries to replace the IP of name.domain with a fakeIP. In both cases the attacker's reply should arrive before the authority's reply. Note that nothing is said about what a reply has to contain to be accepted as the right reply for the query, such as a matching queryID. This is protocol dependent; at this model level what matters is that the "reply should fit the query". How it fits is a question for the lower levels. Also, at the lower level guesswork will often be needed for the reply to fit the query, so there may be looping in the actual experiment; that too is left to the lower level.)
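
The ordering requirement in this timeline (the attacker's reply arrives before the authority's, and it fits the query) can be sketched as "the earliest fitting reply wins". In the Python below, `fits` is a deliberate placeholder for the protocol-specific check, and both function names and the reply format are assumptions for illustration.

```python
def fits(reply, query):
    """Placeholder for the protocol-specific check that a reply fits a query."""
    return reply["name"] == query["name"]


def accepted_reply(replies, query):
    """Return (sender, reply) for the earliest-arriving reply that fits the query."""
    for _, sender, reply in sorted(replies, key=lambda r: r[0]):
        if fits(reply, query):
            return sender, reply
    return None


query = {"name": "name.domain"}
replies = [
    (2.0, "auth", {"name": "name.domain", "ip": "198.51.100.7"}),
    (1.0, "attacker", {"name": "name.domain", "ip": "192.0.2.66"}),
]
sender, reply = accepted_reply(replies, query)
print(sender, reply["ip"])
```

If the attacker's reply arrived at time 3.0 instead, the authority's reply would be accepted and the experiment would fail the "cache must be poisoned" invariant.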

Invariants:

Nothing in addition to the topology and timeline above.

Confidential input metadescription

The phishing attempt is essentially the same as presenting a valid page to the user that asks for confidential info - it's just that the location of that page is not what the user expected.

(in English: There is a user and a server node. There are two pieces of information - some confidential information about the user and optionally some public information. The user is a human, which means that his actions should be generated by some process that mimics human behavior.)

Timeline of events:

(in English: User accesses the server and the server asks the user for some confidential info and possibly some public info. The user then sends those over.)

Invariants:

Nothing in addition to the topology and timeline above.

Experiment design

Now I'm a user who wants to design an experiment. I need to combine two metadescriptions (DNS cache poisoning and phishing) and somehow tie them down to generator choices. To combine I'll do something like this:

i.e. the fakeIP from DNS cache poisoning metadescription belongs to the server from confidential input metadescription. Notice that I did nothing to say that fakeIP should match the IP address of the server but that's obvious from the context. Since fakeIP is an IP address it must match an IP address that somehow has to be related to the server.
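
The "obvious from the context" step can be made mechanical by resolving a binding through types: find the one attribute of the target object whose type matches the bound item. The Python sketch below illustrates the idea under assumptions; `Server`, its `address` field and `resolve_binding` are hypothetical names.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class IPaddress:
    value: str


@dataclass
class Server:
    name: str
    address: IPaddress


def resolve_binding(info_type, target):
    """Find the single attribute of `target` whose value has type `info_type`."""
    matches = [
        getattr(target, f.name)
        for f in fields(target)
        if isinstance(getattr(target, f.name), info_type)
    ]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one {info_type.__name__}, found {len(matches)}"
        )
    return matches[0]


server = Server("web", IPaddress("192.0.2.10"))
fake_ip = resolve_binding(IPaddress, server)   # fakeIP takes the server's address
print(fake_ip.value)
```

Raising an error when there is no match, or more than one, forces the user to disambiguate explicitly instead of letting the system guess.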

The system now needs to offer me several generators:

It should offer a topology generator and map the nodes (Auth, Attacker, Server) to the topology that gets generated. The cache has to reside somewhere, and it can't be at the attacker or Auth, so it will need an extra node. Note that there's a little vagueness here: I said nothing about the server, so theoretically the cache could go there. That wouldn't make sense, though, since the DNS at the server would already know this server's IP. Ultimately this would violate an invariant during setup, when it would become obvious that the findauth step will never point to Auth because the DNS info hard-coded at the Server node has all the right information.
It should offer an event generator for each of the events: query, reply, access, askconfidential. Specifically, for query and reply it should offer DNS traffic generators. For access and askconfidential it should offer HTTP traffic generators.
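
The mapping from events to generators, with defaults that the user may override, can be sketched as a simple lookup. The generator names below (`dns_traffic`, `http_traffic`, `replay_pcap`) are made up for illustration and do not refer to any existing tool; `select_generators` is likewise a hypothetical helper.

```python
# Assumed default generator per event in this experiment.
DEFAULT_GENERATORS = {
    "query": "dns_traffic",
    "reply": "dns_traffic",
    "access": "http_traffic",
    "askconfidential": "http_traffic",
}


def select_generators(events, overrides=None):
    """Pick one generator per event, preferring the user's explicit choices."""
    overrides = overrides or {}
    missing = [
        e for e in events
        if e not in overrides and e not in DEFAULT_GENERATORS
    ]
    if missing:
        raise KeyError(f"no generator available for: {missing}")
    return {e: overrides.get(e, DEFAULT_GENERATORS.get(e)) for e in events}


events = ["query", "reply", "access", "askconfidential"]
print(select_generators(events))
print(select_generators(events, overrides={"access": "replay_pcap"}))
```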

Example 3: ARP poisoning with MITM attack

This example uses two metadescriptions. The first is ARP poisoning, which is a flavor of cache poisoning, and the second is a MITM attack.

ARP poisoning metadescription

This is a special case of cache poisoning where the target is ARP cache. I've highlighted customizations from the general cache poisoning metadescriptions to arrive at this one.

(in English: There is one attacker node. There is a fakeIP of type IPaddress. A cache is simply a collection of ARPRecord items, one or more. These are subtypes of Info and in the domain knowledge DB there's syntax defined for an ARPRecord. Cache does not reside at the attacker.)

Timeline of events:

(in English: The attacker sends an ARP reply that maps a MAC address to somebody's IP. This really could be anybody's MAC address, but in most cases it is the attacker's.)
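
At this level of abstraction, the poisoning step is just an overwrite of one record in the ARP cache. The Python sketch below assumes that the cache accepts every reply, solicited or not, and replaces any older mapping; `ArpCache` is a hypothetical name, and the addresses are documentation and locally administered example values.

```python
class ArpCache:
    def __init__(self):
        self.records = {}    # IP address -> hardware (MAC) address

    def on_reply(self, ip, mac):
        """Accept an ARP reply, solicited or not, and overwrite any old mapping."""
        self.records[ip] = mac


cache = ArpCache()
cache.on_reply("192.0.2.20", "02:00:00:00:00:0b")   # legitimate mapping for node B
cache.on_reply("192.0.2.20", "02:00:00:00:00:aa")   # attacker's reply for B's IP
print(cache.records["192.0.2.20"])
```

After the second reply, traffic addressed to B's IP is sent to the attacker's hardware address, which is the precondition for the MITM part of the experiment.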

Invariants:

Nothing in addition to the topology and timeline above.

TODO

How is the ordering of events defined?
How do we denote "all", "each", "none" and "some"?
How do we denote state transitions caused by an event vs. self-initiated transitions vs. transitions that emit an event?