Table of Contents

The *catnet* package provides a set of inference routines for categorical Bayesian networks based on the Maximum Likelihood estimation. Exhaustive search for a given node order and stochastic search in the space of orders are implemented. The package aims at diversifying the existing tools for network learning by focusing and facilitating the model selection problem.

We start by describing several ways to create a *catNetwork* object. In most cases, networks are inferred from data and are created implicitly. There are occasions, however, when the user may want to create a network manually and the package provides such means.

A categorical network can be created explicitly by calling the *cnNew* function. The function takes following arguments: a vector of node names (*nodes*), a list of node categories (*cats*), a list of parents (*parents*) and an optional list of conditional probabilities (*probs*). The probability structure is defined by a nested hierarchical list and therefore specifying the probability argument directly can be very elaborated task, especially for large networks.

In the following example we create a small network with only three nodes. Note that all inner most vectors in the *probs* argument, such as (0.4,0.6), represent conditional distributions and thus must sum to 1. If *probs* parameter is omitted, a random probability structure will be assigned - the probability slots in the conditional distributions will be sampled uniformly in [0.01,0.99] and then normalized to probability mass functions.

> library(catnet) > cnet <- cnNew(nodes = c("a", "b", "c"), cats = list(c("1", "2"), + c("1", "2"), c("1", "2")), parents = list(NULL, c(1), c(1, + 2)), probs = list(c(0.2, 0.8), list(c(0.6, 0.4), c(0.4, 0.6)), + list(list(c(0.3, 0.7), c(0.7, 0.3)), list(c(0.9, 0.1), c(0.1, + 0.9)))))

Randomly generated networks can be useful for simulation and evaluation purposes. By calling the *cnRandomCatnet* function, the user may generate a *catNetwork* with random DAG and probability model. The number of nodes, maximum parent size and the number of categories have to be given. All nodes are assigned equal number of categories.

> set.seed(123) > cnet1 <- cnRandomCatnet(numnodes = 4, maxParents = 2, numCategories = 2) > cnet1 A catNetwork object with 4 nodes, 2 parents, 2 categories, Likelihood = 0 , Complexity = 8 .

A *catNetwork* object can be also created by inheriting an existing *graph* object as defined in the *graph* package. The *graph* package provides greater number of function for creating and manipulating graphs but does not deal with probability structures. A *graph* object can be created directly by specifying its nodes (*myNodes*) and edges (*myEdges*). It contains only a graphical structure description, not a probability one. Then, a *catNetwork* is created by calling the *cnCatnetFromGraph* function.

> bgraph <- FALSE > try(bgraph <- require(graph), TRUE) > if (bgraph) { + myNodes <- c("a", "s", "p", "q", "r", "t", "u") + myEdges <- list(a = list(edges = NULL), s = list(edges = c("p", + "q")), p = list(edges = c("q")), q = list(edges = c("r")), + r = list(edges = c("u")), t = list(edges = c("q")), u = list(edges = NULL)) + g <- new("graphNEL", nodes = myNodes, edgeL = myEdges, edgemode = "directed") + cnet2 <- cnCatnetFromGraph(g) + }

It is also possible to import graphs from Simple Interaction Format (SIF) files by calling the *cnCatnetFromSif* function. Since SIF files can describe any graph structure one has to be sure that the imported graph is DAG.

There are several functions that give access to the main components of a *catNetwork* object, or more technically, its slots. Such are the functions *cnNumNodes*, *cnNodes*, *cnEdges*, *cnMatEdges*, *cnParents*, *cnMatParents* and *cnProb*, which are briefly described below.

Note that all attributes can be accessed directly using the *@attribute* mechanism, but that opens the possibility for accidental object changes. Direct manipulation with object components is discouraged for it may destroy the object integrity.

Functions *cnNumNodes* and *cnNodes* return the number and the list of names, respectively, of network nodes. For each node of a network, for example *cnet1*, one can obtain its parents by calling the *cnParents* function or find its children with the *cnEdges* function.

> cnNumNodes(cnet1) [1] 4 > cnNodes(cnet1) [1] "N1" "N2" "N3" "N4" > cnEdges(cnet1) $N2 [1] "N3" "N4" $N3 [1] "N4" > cnParents(cnet1) $N3 [1] "N2" $N4 [1] "N2" "N3"

In addition, the corresponding to *cnParents* and *cnEdges* functions, *cnMatParents* and *cnMatEdges*, return matrices instead of lists. *cnMatParents* returns the adjacency binary matrix representing the pairwise node connections, and it is especially useful for comparing networks with the same number of nodes. The function *cnMatEdges* returns a two-column matrix of ordered pairs encoding the edges with direction from the first to the second.

> cnMatParents(cnet1) N1 N2 N3 N4 N1 0 0 0 0 N2 0 0 0 0 N3 0 1 0 0 N4 0 1 1 0 > cnMatEdges(cnet1) [,1] [,2] [1,] "N2" "N3" [2,] "N2" "N4" [3,] "N3" "N4"

The *cnProb* function provides an access to the conditional probability table of a network. Conditional probabilities are reported in the following format. First, a node name and its parents are given, then a list of probability values corresponding to all combinations of parent categories (put in brackets) and node categories. In the following example the first node (*N1*) has two categories (*C1* and *C2*) and no parents, thus two numbers are given, probability 0.68 for the first category and 0.32 for the second. The third node (*N3*) has two categories and one parent (*N2*) and consequently two pairs of probabilities are reported, one for *N2=C1* and another for *N2=C2*.

> cnProb(cnet1) Node[N1], Parents: []C1 0.68 []C2 0.32 Node[N2], Parents: []C1 0.55 []C2 0.45 Node[N3], Parents: N2 [ C1]C1 0.12 [ C1]C2 0.88 [ C2]C1 0.84 [ C2]C2 0.16 Node[N4], Parents: N2 N3 [ C1 C1]C1 0.26 [ C1 C1]C2 0.74 [ C1 C2]C1 0.57 [ C1 C2]C2 0.43 [ C2 C1]C1 0.4 [ C2 C1]C2 0.6 [ C2 C2]C1 0.49 [ C2 C2]C2 0.51

An important characteristic of any categorical Bayesian network is its complexity. The complexity is an integer number specifying the number of independent parameters needed to define the probability structure of the network. For example, the complexity of a network with nodes without parents and two categories per node equals the number of nodes.

The complexity of a *catNetwork* object can be obtained by calling the *cnComplexity* function.

> cnComplexity(cnet1) [1] 8

The network complexity plays a key role in network estimation and model selection problems.

The *catnet* package provides two alternatives for visualizing a network. They are implemented in the function *cnPlot*. If the *igraph* package is installed, a *catNetwork* object will be coerced to a *igraph* object and plotted. Alternatively, *cnPlot* may generate a dot-file, compatible with the external \pkg{Graphviz} software package. Dot-files can be rendered to a POSTSCRIPT or a PDF files using the *dot* executable from *Graphviz* or directly visualized by *dotty* or *tcldot*. There is an auxiliary to *cnPlot* function called *cnDot* that generates and saves dot-files specifically.

cnPlot(cnet1) cnDot(cnet1, "cnet1.dot") cnPlot(cnet2)

The function \code{cnSearchOrder} is the main computational tool provided by \pkg{catnet}. The function implements a Dynamic Programming (DP) algorithm for searching in the space of networks having a node order specified by the user. The result is a set of networks with increasing complexity up to a given maximum value. Each resulting network is the exact MLE for the corresponding complexity.

> set.seed(789) > ## create a random network > cnet2 <- cnRandomCatnet(numnodes = 10, maxParents = 2, numCategories = 2) > ## select a random order for the nodes of cnet2 > nodeOrder <- order(runif(cnNumNodes(cnet2))) > ## display network summary > cnet2 A catNetwork object with 10 nodes, 2 parents, 2 categories, Likelihood = 0 , Complexity = 11 . > ## generate samples from 'cnet2' > samples <- cnSamples(object = cnet2, numsamples = 100, output = "frame") > ## search for networks fitting the samples and compatible with 'nodeOrder' > netlist <- cnSearchOrder(data = samples, perturbations = NULL, + maxParentSet = 2, maxComplexity = 20, nodeOrder, parentsPool = NULL, + fixedParentsPool = NULL) > ## select a network with target complexity 20 > bnet <- cnFind(netlist, 20) > ## display bnet summary > bnet A catNetwork object with 10 nodes, 2 parents, 2 categories, Likelihood = -623.9366 , Complexity = 20 .

*cnSearchOrder* has two mandatory parameters: *data* and *maxParentSet*. The data is given as a matrix or data frame of character strings and the function extracts the categorical set for each node from the data. If *maxComplexity* is not specified, the network search is carried on up the maximum possible complexity. In the latter, more extreme case, the computational complexity of *cnSearchOrder* function is of order *O((nc)^{maxParentSet + 1})*, where $n$ is the number of nodes and $c$ is the maximum number of categories per node.

There are situations where some prior information about the network structure is available and the user may want to utilize. In its search functions, *catnet* includes the parameters *parentsPool* and *fixedParentsPool*, that can be used for defining edge constraints. The first parameter, *parentsPool*, specifies a set of possible parents for each node and in this way, some nodes can be excluded as potential parents. Additionally, the *fixedParentsPool* parameter specifies edge inclusion rules - the user may assign mandatory parents to some nodes. These two parameters allow a variety of edge constraints to be implemented.

In the next example, we generate a random network with 12 nodes and then search for the best fitting networks that comply with the following requirements: first, the last node, node 12, is not a parent of anyone else, and second, the first two nodes, 1 and 2, are mandatory parents to all other nodes. The search is restricted to the ‘true’ node order, the one of the network from which the data is generated, as obtained by *cnOrder(cnet)* function. Finally, we draw the log-likelihood vs. complexity curve for the resultant list of networks.

> set.seed(123) > nnodes <- 12 > cnet <- cnRandomCatnet(numnodes = nnodes, maxParents = 5, numCategories = 2) > norder <- cnOrder(cnet) > parPool <- vector("list", nnodes) > for (i in 1:nnodes) parPool[[i]] <- 1:(nnodes - 1) > fixparPool <- vector("list", nnodes) > for (i in 3:nnodes) fixparPool[[i]] <- c(1, 2) > samples <- cnSamples(cnet, numsamples=200) > eval <- cnSearchOrder(data = samples, perturbations = NULL, maxParentSet = 2, + maxComplexity = 200, nodeOrder = norder, parentsPool = parPool, + fixedParentsPool = fixparPool) > eval Number of nodes = 12, Sample size = 200, Number of networks = 31 Processing time = 0.042

**Version:**1.02.0**Date:**2010-04-13**Authors:**Nikolay Balov,**Peter Salzman ***Maintainer:**Nikolay Balov ***License:**GPL version 2 or newer. ***Correspondence Author:**

Nikolay Balov University of Rochester Medical Center, Department of Biostatistics and Computational Biology e-mails: nikolay_balov at urmc.rochester.edu

Peter Salzman University of Rochester Medical Center, Department of Biostatistics and Computational Biology e-mails: psalzman at bst.rochester.edu