The catnet package

Overview

The catnet package provides a set of inference routines for categorical Bayesian networks based on the Maximum Likelihood estimation. Exhaustive search for a given node order and stochastic search in the space of orders are implemented. The package aims at diversifying the existing tools for network learning by focusing and facilitating the model selection problem.

Generating Networks

We start by describing several ways to create a catNetwork object. In most cases, networks are inferred from data and are created implicitly. There are occasions, however, when the user may want to create a network manually and the package provides such means.

A categorical network can be created explicitly by calling the cnNew function. The function takes following arguments: a vector of node names (nodes), a list of node categories (cats), a list of parents (parents) and an optional list of conditional probabilities (probs). The probability structure is defined by a nested hierarchical list and therefore specifying the probability argument directly can be very elaborated task, especially for large networks.

In the following example we create a small network with only three nodes. Note that all inner most vectors in the probs argument, such as (0.4,0.6), represent conditional distributions and thus must sum to 1. If probs parameter is omitted, a random probability structure will be assigned - the probability slots in the conditional distributions will be sampled uniformly in [0.01,0.99] and then normalized to probability mass functions.

 > library(catnet)
 > cnet <- cnNew(nodes = c("a", "b", "c"), cats = list(c("1", "2"), 
 +     c("1", "2"), c("1", "2")), parents = list(NULL, c(1), c(1, 
 +     2)), probs = list(c(0.2, 0.8), list(c(0.6, 0.4), c(0.4, 0.6)), 
 +     list(list(c(0.3, 0.7), c(0.7, 0.3)), list(c(0.9, 0.1), c(0.1, 
 +         0.9)))))

Randomly generated networks can be useful for simulation and evaluation purposes. By calling the cnRandomCatnet function, the user may generate a catNetwork with random DAG and probability model. The number of nodes, maximum parent size and the number of categories have to be given. All nodes are assigned equal number of categories.

 > set.seed(123)
 > cnet1 <- cnRandomCatnet(numnodes = 4, maxParents = 2, numCategories = 2)
 > cnet1
A catNetwork object with  4  nodes,  2  parents,  2  categories,
 Likelihood =  0 , Complexity =  8 .

A catNetwork object can be also created by inheriting an existing graph object as defined in the graph package. The graph package provides greater number of function for creating and manipulating graphs but does not deal with probability structures. A graph object can be created directly by specifying its nodes (myNodes) and edges (myEdges). It contains only a graphical structure description, not a probability one. Then, a catNetwork is created by calling the cnCatnetFromGraph function.

 > bgraph <- FALSE
 > try(bgraph <- require(graph), TRUE)
 > if (bgraph) {
 +     myNodes <- c("a", "s", "p", "q", "r", "t", "u")
 +     myEdges <- list(a = list(edges = NULL), s = list(edges = c("p", 
 +         "q")), p = list(edges = c("q")), q = list(edges = c("r")), 
 +         r = list(edges = c("u")), t = list(edges = c("q")), u = list(edges = NULL))
 +     g <- new("graphNEL", nodes = myNodes, edgeL = myEdges, edgemode = "directed")
 +     cnet2 <- cnCatnetFromGraph(g)
 + }

It is also possible to import graphs from Simple Interaction Format (SIF) files by calling the cnCatnetFromSif function. Since SIF files can describe any graph structure one has to be sure that the imported graph is DAG.

Accessing Network Attributes and Characteristics

There are several functions that give access to the main components of a catNetwork object, or more technically, its slots. Such are the functions cnNumNodes, cnNodes, cnEdges, cnMatEdges, cnParents, cnMatParents and cnProb, which are briefly described below.

Note that all attributes can be accessed directly using the @attribute mechanism, but that opens the possibility for accidental object changes. Direct manipulation with object components is discouraged for it may destroy the object integrity.

Functions cnNumNodes and cnNodes return the number and the list of names, respectively, of network nodes. For each node of a network, for example cnet1, one can obtain its parents by calling the cnParents function or find its children with the cnEdges function.

 > cnNumNodes(cnet1)
[1] 4
 > cnNodes(cnet1)
[1] "N1" "N2" "N3" "N4"
 > cnEdges(cnet1)
$N2
[1] "N3" "N4"
$N3
[1] "N4"
 > cnParents(cnet1)
$N3
[1] "N2"
$N4
[1] "N2" "N3"

In addition, the corresponding to cnParents and cnEdges functions, cnMatParents and cnMatEdges, return matrices instead of lists. cnMatParents returns the adjacency binary matrix representing the pairwise node connections, and it is especially useful for comparing networks with the same number of nodes. The function cnMatEdges returns a two-column matrix of ordered pairs encoding the edges with direction from the first to the second.

 > cnMatParents(cnet1)
   N1 N2 N3 N4
N1  0  0  0  0
N2  0  0  0  0
N3  0  1  0  0
N4  0  1  1  0
 > cnMatEdges(cnet1)
     [,1] [,2]
[1,] "N2" "N3"
[2,] "N2" "N4"
[3,] "N3" "N4"

The cnProb function provides an access to the conditional probability table of a network. Conditional probabilities are reported in the following format. First, a node name and its parents are given, then a list of probability values corresponding to all combinations of parent categories (put in brackets) and node categories. In the following example the first node (N1) has two categories (C1 and C2) and no parents, thus two numbers are given, probability 0.68 for the first category and 0.32 for the second. The third node (N3) has two categories and one parent (N2) and consequently two pairs of probabilities are reported, one for N2=C1 and another for N2=C2.

 > cnProb(cnet1)
Node[N1], Parents: 
[]C1  0.68
[]C2  0.32
 Node[N2], Parents: 
[]C1  0.55
[]C2  0.45
 Node[N3], Parents: N2
[ C1]C1  0.12
[ C1]C2  0.88
[ C2]C1  0.84
[ C2]C2  0.16
 Node[N4], Parents: N2 N3
[ C1 C1]C1  0.26
[ C1 C1]C2  0.74
[ C1 C2]C1  0.57
[ C1 C2]C2  0.43
[ C2 C1]C1  0.4
[ C2 C1]C2  0.6
[ C2 C2]C1  0.49
[ C2 C2]C2  0.51

An important characteristic of any categorical Bayesian network is its complexity. The complexity is an integer number specifying the number of independent parameters needed to define the probability structure of the network. For example, the complexity of a network with nodes without parents and two categories per node equals the number of nodes.

The complexity of a catNetwork object can be obtained by calling the cnComplexity function.

 > cnComplexity(cnet1)
[1] 8

The network complexity plays a key role in network estimation and model selection problems.

Drawing Networks

The catnet package provides two alternatives for visualizing a network. They are implemented in the function cnPlot. If the igraph package is installed, a catNetwork object will be coerced to a igraph object and plotted. Alternatively, cnPlot may generate a dot-file, compatible with the external \pkg{Graphviz} software package. Dot-files can be rendered to a POSTSCRIPT or a PDF files using the dot executable from Graphviz or directly visualized by dotty or tcldot. There is an auxiliary to cnPlot function called cnDot that generates and saves dot-files specifically.

cnPlot(cnet1)
cnDot(cnet1, "cnet1.dot")
cnPlot(cnet2)

:packages:cran:catnet:plot_cnet1.jpg :packages:cran:catnet:plot_cnet2.jpg

Network Estimation for Given Node Order

The function \code{cnSearchOrder} is the main computational tool provided by \pkg{catnet}. The function implements a Dynamic Programming (DP) algorithm for searching in the space of networks having a node order specified by the user. The result is a set of networks with increasing complexity up to a given maximum value. Each resulting network is the exact MLE for the corresponding complexity.

 > set.seed(789)
 > ## create a random network
 > cnet2 <- cnRandomCatnet(numnodes = 10, maxParents = 2, numCategories = 2)
 > ## select a random order  for the nodes of cnet2
 > nodeOrder <- order(runif(cnNumNodes(cnet2)))
 > ## display network summary
 > cnet2
A catNetwork object with  10  nodes,  2  parents,  2  categories,
 Likelihood =  0 , Complexity =  11 .
 > ## generate samples from 'cnet2'
 > samples <- cnSamples(object = cnet2, numsamples = 100, output = "frame")
 > ## search for networks fitting the samples and compatible with 'nodeOrder'
 > netlist <- cnSearchOrder(data = samples, perturbations = NULL, 
 +     maxParentSet = 2, maxComplexity = 20, nodeOrder, parentsPool = NULL, 
 +     fixedParentsPool = NULL)
 > ## select a network with target complexity 20
 > bnet <- cnFind(netlist, 20)
 > ## display bnet summary
 > bnet
A catNetwork object with  10  nodes,  2  parents,  2  categories,
 Likelihood =  -623.9366 , Complexity =  20 .

cnSearchOrder has two mandatory parameters: data and maxParentSet. The data is given as a matrix or data frame of character strings and the function extracts the categorical set for each node from the data. If maxComplexity is not specified, the network search is carried on up the maximum possible complexity. In the latter, more extreme case, the computational complexity of cnSearchOrder function is of order O((nc)^{maxParentSet + 1}), where $n$ is the number of nodes and $c$ is the maximum number of categories per node.

There are situations where some prior information about the network structure is available and the user may want to utilize. In its search functions, catnet includes the parameters parentsPool and fixedParentsPool, that can be used for defining edge constraints. The first parameter, parentsPool, specifies a set of possible parents for each node and in this way, some nodes can be excluded as potential parents. Additionally, the fixedParentsPool parameter specifies edge inclusion rules - the user may assign mandatory parents to some nodes. These two parameters allow a variety of edge constraints to be implemented.

In the next example, we generate a random network with 12 nodes and then search for the best fitting networks that comply with the following requirements: first, the last node, node 12, is not a parent of anyone else, and second, the first two nodes, 1 and 2, are mandatory parents to all other nodes. The search is restricted to the ‘true’ node order, the one of the network from which the data is generated, as obtained by cnOrder(cnet) function. Finally, we draw the log-likelihood vs. complexity curve for the resultant list of networks.

 > set.seed(123)
 > nnodes <- 12
 > cnet <- cnRandomCatnet(numnodes = nnodes, maxParents = 5, numCategories = 2)
 > norder <- cnOrder(cnet)
 > parPool <- vector("list", nnodes)
 > for (i in 1:nnodes) parPool[[i]] <- 1:(nnodes - 1)
 > fixparPool <- vector("list", nnodes)
 > for (i in 3:nnodes) fixparPool[[i]] <- c(1, 2)
 > samples <- cnSamples(cnet, numsamples=200)
 > eval <- cnSearchOrder(data = samples, perturbations = NULL, maxParentSet = 2, 
 +     maxComplexity = 200, nodeOrder = norder, parentsPool = parPool, 
 +     fixedParentsPool = fixparPool)
 > eval
 Number of nodes    = 12, 
 Sample size        = 200,
 Number of networks = 31
 Processing time    = 0.042

:packages:cran:catnet:catnet-ex14.jpg

Released package info

  • Version: 1.02.0
  • Date: 2010-04-13
  • Authors: Nikolay Balov, Peter Salzman * Maintainer: Nikolay Balov * License: GPL version 2 or newer. * Correspondence Author:**
Nikolay Balov
University of Rochester Medical Center,
Department of Biostatistics and Computational Biology
e-mails: nikolay_balov at urmc.rochester.edu
Peter Salzman
University of Rochester Medical Center,
Department of Biostatistics and Computational Biology
e-mails: psalzman at bst.rochester.edu
 
packages/cran/catnet.txt · Last modified: 2010/08/03 by nbalov
 
Recent changes RSS feed R Wiki powered by Driven by DokuWiki and optimized for Firefox Creative Commons License