# graph4lg - Tutorial

## Introduction

The creation of the graph4lg package on R should make easier the construction and analysis of genetic and landscape graphs in landscape genetics studies (hence the name graph4lg). The package allows to weight the links and to prune the graphs by several ways. To our knowledge, it is the first software which enables to create genetic graphs with such a large variety of parameters. Besides, it allows to carry out preliminary analyses of the spatial pattern of genetic differentiation in order to choose the best genetic distance and pruning method in every context. Lastly, it makes possible the comparison of two spatial graphs sharing the same nodes.

In the following sections, we describe the functions of the package following their separation into three parts. We illustrate them with results’ examples and report the command lines to copy in order to reproduce them.

The package includes genetic and spatial data allowing users to discover its different functionalities. We included two datasets. The first is from French Guadeloupe and consists of 432 Plumbeous Warblers (Setophaga plumbea) genotyped at 12 microsatellite loci (169 different alleles in total) from 20 populations whose spatial coordinates are available. These data were collected and analysed by Khimoun et al. (2017). They are also freely available on the website datadryad.org. They were included in the package in different formats: data.frame format with columns of locus class from the package gstud, genind class from adegenet, loci class for pegas, text files (.txt) for GENEPOP or STRUCTURE.

The second data set was created during simulations done with CDPOP (Landguth and Cushman 2010) on a simulated landscape. It consists of 1500 individuals from 50 populations genotyped at 20 microsatellite loci. Individuals dispersed less when the cost-distance between populations was large. These data were included in format genind and as a text file compatible with GENEPOP software. A landscape graph was created with GRAPHAB (Foltête, Clauzel, and Vuidel 2012) whose nodes were the 50 simulated populations and the links were weighted by cost-distance values between populations. The project created with GRAPHAB was included into the package such that the landscape graphs and the cost-distance matrix can be easily imported into the R environment.

## Input data processing

The first type of functions from this package allows to process spatial and genetic data used as input data in subsequent analyses performed with other functions. These functions convert genetic data, compute genetic or geographical distances, import landscape graphs and convert cost-distance values into Euclidean geographical distances.

### Genetic data conversion

In order to make the package user-friendly and compatible with genetic data commonly used in landscape genetics, the functions gstud_to_genind, loci_to_genind, structure_to_genind and genepop_to_genind allow to convert genetic data from formats used respectively in the GENETIC STUDIO (Dyer 2009) and pegas (Paradis 2010) packages on R and in STRUCTURE (Pritchard, Stephens, and Donnelly 2000) and GENEPOP (Raymond 1995) software into R objects with the class attribute genind from ADEGENET package (Jombart 2008). The format genind allows to use time-efficient functions from ADEGENET (coded in C). This package was developed and is regularly maintained by Thibaut Jombart (his website). The function genind_to_genepop enables to convert genind object into text files in format genepop in order to perform easily analyses with this commonly used R package and executable software.

Genetic data in formats used by software commonly used in population genetics can be converted into genind objects in R. This class of object was created for the package adegenet (Jombart 2008).

#### GENEPOP to genind

The GENEPOP software (Raymond 1995) developed by M. Raymond and F. Rousset (Montpellier) can be used as an executable file, with or without graphical user interface, or as a R package. It is frequently used to compute FST values and to test for Hardy-Weinberg equilibrium, linkage disequilibrium or genetic differentiation.

When performing simulations with CDPOP, individuals’ genotypes can be saved as GENEPOP files at the end of the simulation.

The function genepop_to_genind loads a GENEPOP file (.txt extension) and converts it into a genind object. To use it, the path to the file, the total number and names of the loci and the populations’ names must be indicated.

data_genind <- genepop_to_genind(path = paste0(system.file('extdata',
package = 'graph4lg'), "/gpop_51_sim22_01_25.txt"),
n.loci = 20, pop_names = as.character(1:50))
#> Registered S3 method overwritten by 'spdep':
#>   method   from
#>   plot.mst ape
#> Registered S3 method overwritten by 'pegas':
#>   method      from
data_genind
#> /// GENIND OBJECT /////////
#>
#>  // 1,500 individuals; 20 loci; 568 alleles; size: 3.4 Mb
#>
#>  // Basic content
#>    @tab:  1500 x 568 matrix of allele counts
#>    @loc.n.all: number of alleles per locus (range: 26-30)
#>    @loc.fac: locus factor for the 568 columns of @tab
#>    @all.names: list of allele names for each locus
#>    @ploidy: ploidy of each individual  (range: 2-2)
#>    @type:  codom
#>    @call: adegenet::df2genind(X = as.matrix(x[, attr(x, "locicol"), drop = FALSE]),
#>     sep = "/", pop = pop, ploidy = ploidy)
#>
#>  // Optional content
#>    @pop: population of each individual (group size range: 30-30)

We get a genind object. It contains the genotypes of the 1500 individuals from the 50 populations created during the simulation.

#### genind to GENEPOP

Similarly, the function genind_to_genepop allows to perform the reverse conversion, i.e. to convert a genind object into a GENEPOP file. This file is created and saved in the working directory defined earlier.

genind_to_genepop(x = data_genind, output = "data_gpop_test.txt")

This function allows for example to create a GENEPOP file to test for between populations genetic differentiation or to compute fixation indices with GENEPOP software.

#### STRUCTURE to genind

STRUCTURE software (Pritchard, Stephens, and Donnelly 2000) is frequently used in population genetics and landscape genetics. It enables to create populations’ clusters via a Bayesian approach aiming at minimizing the deviation from Hardy-Weinberg equilibrium when gathering populations with one another. The input files have a particular structure. The function structure_to_genind allows to convert this type of file into a genind object.

To use the function, we need to indicate the path to the file, the names of the loci, the individuals’ ID and the populations’ names in the same order as in the original file.

loci_names <- c("DkiD104", "DkiD124", "DkiD102", "CAM19",
"DkiC118", "DkiD128", "DkiB12",  "Lswmu7",
"DkiD109", "Lswmu5",  "TG12_15", "DkiD12" )
data(data_pc_genind)
ind_names <- row.names(data_pc_genind@tab)
pop_names <- c("BT-1", "BT-10", "BT-11", "BT-12", "BT-13", "BT-2",
"BT-3",  "BT-4",  "BT-5",  "BT-6",  "BT-7",  "BT-8",
"BT-9",  "GT-1",  "GT-2", "GT-3",  "GT-4",  "GT-5",
"GT-6", "GT-7")
data_paru <- structure_to_genind(path = paste0(system.file('extdata',
package = 'graph4lg'),
"/data_PC_str.txt"),
loci_names = loci_names,
pop_names = pop_names,
ind_names = ind_names)
data_paru
#> /// GENIND OBJECT /////////
#>
#>  // 432 individuals; 12 loci; 169 alleles; size: 352.3 Kb
#>
#>  // Basic content
#>    @tab:  432 x 169 matrix of allele counts
#>    @loc.n.all: number of alleles per locus (range: 5-23)
#>    @loc.fac: locus factor for the 169 columns of @tab
#>    @all.names: list of allele names for each locus
#>    @ploidy: ploidy of each individual  (range: 2-2)
#>    @type:  codom
#>    @call: adegenet::df2genind(X = as.matrix(x[, attr(x, "locicol"), drop = FALSE]),
#>     sep = "/", pop = pop, ploidy = ploidy)
#>
#>  // Optional content
#>    @pop: population of each individual (group size range: 6-33)

#### gstud to genind

Packages gstudio and popgraph developed by R. Dyer (Dyer 2009) use as input data R data.frames with columns of class locus. These data.frame objects constitute gstud objects. Given these packages are often used to create genetic graphs, we created a function to convert them into the genind format.

A gstud object generally has the following structure (Plumbeous Warbler data from Guadeloupe as an example):

head(data_pc_gstud)
#>   Species Cluster        ID   id Latitude Longitude DkiD104 DkiD124 DkiD102
#> 1      PC    BT-1 03PCGD495 F011 16.32928 -61.77361 177:177 255:263 302:314
#> 2      PC    BT-1 03PCGD499 F011 16.32928 -61.77361 177:177 259:267 286:314
#> 3      PC    BT-1 03PCGD500 F011 16.32928 -61.77361 177:177 263:263 306:322
#> 4      PC    BT-1 03PCGD501 F011 16.32928 -61.77361 177:189 263:267 314:334
#> 5      PC    BT-1 03PCGD502 F011 16.32928 -61.77361 177:177 263:267 306:318
#> 6      PC    BT-1 03PCGD503 F011 16.32928 -61.77361 177:177 263:267 306:314
#>     CAM19 DkiC118 DkiD128  DkiB12  Lswmu7 DkiD109  Lswmu5 TG12_15  DkiD12
#> 1 238:240 313:313 257:280 182:185 181:181 203:203 254:254 298:300 205:205
#> 2 240:240 318:318 261:292 182:182 171:179 167:170 254:254 302:304 197:205
#> 3 234:238 317:318 261:268 182:182 171:179 170:170 254:256 302:302 197:209
#> 4 236:240 313:313 265:288 182:182 171:179 167:167 254:254 304:306 197:209
#> 5 234:240 317:317 261:264 182:182 171:179 163:207 254:262 300:304 200:209
#> 6 238:238 313:313 260:272 182:182 179:179 159:167 254:256 302:304 200:209

To convert it with the function gstud_to_genind, we indicate the name of the data.frame and of the columns with populations’ names and individuals’ names:

gstud_to_genind(x = data_pc_gstud, pop_col = "Cluster",
ind_col = "ID")
#> /// GENIND OBJECT /////////
#>
#>  // 432 individuals; 12 loci; 169 alleles; size: 352.3 Kb
#>
#>  // Basic content
#>    @tab:  432 x 169 matrix of allele counts
#>    @loc.n.all: number of alleles per locus (range: 5-23)
#>    @loc.fac: locus factor for the 169 columns of @tab
#>    @all.names: list of allele names for each locus
#>    @ploidy: ploidy of each individual  (range: 2-2)
#>    @type:  codom
#>    @call: adegenet::df2genind(X = as.matrix(x[, attr(x, "locicol"), drop = FALSE]),
#>     sep = "/", pop = pop, ploidy = ploidy)
#>
#>  // Optional content
#>    @pop: population of each individual (group size range: 6-33)

### Genetic and geographical distances calculation

#### Genetic distances

From a genind object, the function mat_gen_dist calculates several types of between populations genetic distances:

• FST (Weir and Cockerham 1984) or linearised FST (Rousset 1997) (options ‘dist=FST’ and ‘dist=FST_lin’).

• G’ST (Hedrick 2005) (option ‘dist=GST’).

• DJost (Jost 2008) (option ‘dist=D’).

• DPS (1 - proportion of shared alleles) (Bowcock et al. 1994, @murphy_graph_2015) (option ‘dist=DPS’).

• Euclidean genetic distance (Excoffier, Smouse, and Quattro 1992) (option ‘dist=basic’).

• Euclidean genetic distance with a weighting depending on allelic frequencies giving more weight to rare alleles (Fortuna et al. 2009) (option ‘dist=weight’).

• Euclidean genetic distance computed after a PCA of the matrix of allelic frequencies by population. The axes considered to compute the Euclidean distance are the non-collinear principal components (total number of alleles - number of loci) (Paschou et al. 2014, @shirk2017comparison) (option ‘dist=PCA’).

• Euclidean genetic distance computed in the same way as with the function popgraph from popgraph package, i.e. after a PCA and two SVD, among other computation steps (option ‘dist=PG’). This distance is different from the conditional genetic distance (cGD) computed from a population graph by summing genetic distances along shortest paths.

To do these calculations with the function mat_gen_dist, you just have to indicate the name of the genind object which includes the genetic data of the individuals as well as the populations to which each of them belongs. The other argument of the function is the type of genetic distance to compute. Here are two examples:

mat_dps <- mat_gen_dist(x = data_genind, dist = "DPS")
mat_dps[1:5, 1:5]
#>            1        10        11        12        13
#> 1  0.0000000 0.7883333 0.5900000 0.6816667 0.8733333
#> 10 0.7883333 0.0000000 0.5741667 0.5450000 0.8816667
#> 11 0.5900000 0.5741667 0.0000000 0.3750000 0.8658333
#> 12 0.6816667 0.5450000 0.3750000 0.0000000 0.8633333
#> 13 0.8733333 0.8816667 0.8658333 0.8633333 0.0000000
mat_pg <- mat_gen_dist(x = data_genind, dist = "PG")
mat_pg[1:5, 1:5]
#>           1       10       11       12       13
#> 1   0.00000 29.93508 21.67789 25.40865 29.17504
#> 10 29.93508  0.00000 19.15163 20.32432 29.99936
#> 11 21.67789 19.15163  0.00000 13.50181 24.69579
#> 12 25.40865 20.32432 13.50181  0.00000 25.18634
#> 13 29.17504 29.99936 24.69579 25.18634  0.00000

#### Import of landscape graphs created with GRAPHAB

The function graphab_to_igraph allows to import landscape graphs created with software into the R environment. It takes the path to the directory of a project created with as a first argument. Then, the names of the shapefile layers corresponding to the nodes and links (the least-cost paths or a set of graph links) of the graphs must be indicated (for the nodes, by default, the shapefile layers “patches.shp” is imported). The links of the imported graph are weighted by cost-distance values or by geographical distance values depending on the weight option. The graph can be plotted on a geographical map. Besides, an object of type data.frame with the spatial coordinates of the nodes’ centroids can be created.

As an example, the following command allows to import the project created to parameterize the gene-flow simulation on performed to create data_simul_genind:

land_graph <- graphab_to_igraph(dir_path = system.file('extdata',
package = 'graph4lg'),
nodes = "patches",
weight = "cost", fig = FALSE, crds = TRUE)
#> OGR data source with driver: ESRI Shapefile
#> Source: "C:\Users\psavary\AppData\Local\Temp\RtmpEbE3Db\Rinst78c6f4fa71\graph4lg\extdata", layer: "patches"
#> with 50 features
#> It has 4 fields
#> OGR data source with driver: ESRI Shapefile
#> with 1225 features
#> It has 5 fields

The land_graph object is a list composed of an object of class igraph corresponding to the landscape graph, and of a table of class data.frame with the coordinates of the patches’ centroids.

crds_patches <- land_graph[[2]]
land_graph <- land_graph[[1]]

We create a matrix with the cost-distances between populations from the links of the landscape graph imported, which is a complete graph. You have to pay attention to the order of the populations’ names in the distance matrices, even if error messages will appear when using the functions of this package if you use two matrices in which rows’ and columns’ names do not match. The function reorder_mat reorders the rows and columns of a symmetric matrix according to a specified order. To use it, you just need to create a character vector with the rows and columns names from the symmetric matrix in the order you want them to be ordered in the new matrix.

mat_ld <- as_adjacency_matrix(land_graph, attr = "weight",
type = "both", sparse = FALSE)
order <- row.names(mat_ld)[order(c(as.character(row.names(mat_ld))))]
mat_ld <- reorder_mat(mat = mat_ld, order = order)

#### Geographical distances

You can also calculate Euclidean geographical distances between populations with the function mat_geo_dist. It takes as an argument a data.frame with 3 columns:

• ID : populations’ ID (or simple points’ ID when using this function in another context)

• x : populations’ or points’ longitude

• y : populations’ or points’ latitude

Geographical coordinates must be expressed in a projected coordinates system (metric) because Pythagoras’s theorem is used to compute the distances (a warning message is displayed in every case).

Example :

head(crds_patches)
#>   ID     x     y
#> 0  1 25650 54150
#> 1  2 22550 53650
#> 2  3  7250 51750
#> 3  4 32350 50550
#> 4  5 27350 49650
#> 5  6 15950 49550
mat_geo <- mat_geo_dist(data = crds_patches, ID = "ID", x = "x", y = "y")
#> Coordinates were treated as projected coordinates. Check whether
#>               it is the case.
mat_geo <- reorder_mat(mat = mat_geo, order = order) 
mat_geo[1:5, 1:5]
#>           1        10        11        12       13
#> 1      0.00 25292.687 15033.296 20465.825 13354.40
#> 10 25292.69     0.000 12001.667  5622.277 31907.68
#> 11 15033.30 12001.667     0.000  6407.027 19906.28
#> 12 20465.83  5622.277  6407.027     0.000 26300.76
#> 13 13354.40 31907.679 19906.280 26300.760     0.00

Cost-distances are expressed in cost units arbitrarily defined based on the cost values assigned to every land cover type when creating resistance surfaces. However, species dispersal distances are usually known as distances expressed in metric units. When we know that the probability that a species covers 10 km is 5 %, we can ask what is the equivalent of this distance in cost distance units according to the scenario of cost values assumed. It allows to prune a landscape graph with a given distance threshold e.g. or provides an order of magnitude of the cost-distance values.

To that purpose, you can perform a regression of the between populations cost-distance values against the corresponding geographical distances. It estimates the relationship between both types of distance. Then, the resulting parameters estimates enable to convert a geographical distance into its cost-distance equivalent according to the cost scenario.

The function convert_cd performs the linear regression or log-log linear regression between the geographical distance matrix and the cost-distance matrix, in the same way as Tournant et al. (2013) and as performed by GRAPHAB software.

Below, we estimate the relationship between geographical distance and cost-distance between the populations used to perform the gene-flow simulation. We convert 10 km into cost-distance units. The function also plots the relationship between these distances.

convert_res <- convert_cd(mat_euc = mat_geo, mat_ld = mat_ld,
to_convert = 10000, fig = TRUE,
method = "log-log", pts_col = "grey")
convert_res
#> $Converted values #> 10000 #> 1605.543 #> #>$Model parameters
#> Intercept     Slope
#> -2.251216  1.045828
#>
#> $Model multiple R-squared #> Multiple R-squared #> 0.6870757 #> #>$Plot

In this case, we can see that 10 km are equivalent to 1606 cost-distance units. The log-log linear model estimates the relationship between geographical distance (GD) and cost-distance (CD) such that: $$log(CD)=-2.2512+1.0458 \times log(GD)$$. The determination coefficient $$R^2$$ associated to this linear model is 0.69. The black dot represented on the figure refers to the 10 km value on the regression line characterizing the relationship between cost-distance and geographical distance.

## Graphs’ construction

Some functions from graph4lg package construct graphs from either genind objects or genetic or landscape distance matrices. To choose a genetic distance and a pruning method for the genetic graphs’ construction, we developed functions to perform preliminary analyses of the spatial pattern of genetic differentiation. Indeed, a genetic graph can be created in order to i) identify the direct dispersal paths between populations or to ii) select the set of population pairs to consider to infer landscape effects on dispersal. According to the use of a genetic graph and to the spatial pattern of genetic differentiation (type-I or type-IV pattern of IBD (Hutchison and Templeton 1999, @van2015isolation)), the choice of a genetic distance and of a pruning method will not be the same.

Van Strien, Holderegger, and Van Heck (2015) computed the so-called distance of maximum correlation (DMC) as the distance between populations below which population pairs should be considered in order to maximize the correlation between landscape distance (geographical distance in their case, but applies similarly to cost-distance) and genetic distance. This distance threshold is computed by increasing iteratively the maximum distance between populations above which population pairs are not taken into account to compute the correlation. Thus, an increasing number of population pairs is considered in the inference. When the correlation coefficient between landscape distance and genetic distance reaches a maximum, the distance threshold considered is the DMC. When the DMC is equal to the maximum distance between populations, it means that an equilibrium established between gene flow and genetic drift at the scale of the study area. Conversely, when the DMC is lower than this maximum distance, it means that there is a “plateau” in the relationship between landscape distance and genetic distance because migration-drift equilibrium has not been reached yet at the scale considered. It can be due to recent modifications of the landscape which consistently reduced the connectivity in a previously connected context. In this case, graph pruning is needed to well infer landscape effect on dispersal. Similarly, genetic distances that do not assume this equilibrium should be used.

The function dist_max_corr calculates the DMC from two distance matrices. You need to specify the interval between two distance thresholds iteratively considered to select population pairs and compute the correlation coefficient. The function scatter_dist, on the other hand, allows to visualize the relationship between two distance matrices by making a scatter plot. The shape of this relationship can be compared to the four different types of IBD patterns described by Hutchison and Templeton (1999) in order to characterize the spatial pattern of genetic differentiation.

Here are the command lines to execute to use these two functions:

dmc <- dist_max_corr(mat_gd = mat_dps, mat_ld = mat_ld,
interv = 500, pts_col = "black")

The dmc object is a list with 1) the DMC value, 2) a vector containing all the computed correlation coefficients, 3) a vector with all the distance thresholds tested and 4) a graphic object created with the ggplot2 package.

# DMC value
dmc[[1]]
#> [1] 4500
# Correlation coefficients
dmc[[2]]
#>  [1]        NA 0.2986565 0.3154498 0.5188747 0.7059633 0.7559539 0.7850267
#>  [8] 0.7947691 0.8038470 0.7853646 0.7760106 0.7641339 0.7530264 0.7462445
#> [15] 0.7386713 0.7333936 0.7305631 0.7226695 0.7137972 0.7110962 0.7041702
# Threshold distances tested
dmc[[3]]
#>  [1]   500.00  1000.00  1500.00  2000.00  2500.00  3000.00  3500.00  4000.00
#>  [9]  4500.00  5000.00  5500.00  6000.00  6500.00  7000.00  7500.00  8000.00
#> [17]  8500.00  9000.00  9500.00 10000.00 10230.05

The figure below represents the evolution of the correlation coefficient values when distance thresholds increase.

dmc[[4]]

The scatter plot is displayed with the function scatter_dist.

scatter_dist(mat_gd = mat_dps, mat_ld = mat_ld,
pts_col = "black")
#> 1225 out of 1225 values were used.

In this particular case, we notice a type-IV pattern of isolation by distance with a “plateau” in the relationship between cost-distance and genetic-distance (DPS). Graph pruning will be needed to select the population pairs to include in the inference of landscape effects on dispersal.

In the following section, we present the different pruning methods available.

### Pruning based on distance thresholds

To prune a graph whose links are weighted by distances, you can remove all the links associated to geographical or genetic distances larger (or lower) than a specific threshold distance. This distance can for example be equal to the maximum dispersal distance of an individual of the study species at the scale of its lifespan so that the resulting graph represents the direct dispersal paths of the species. It can also be equal to the DMC if the objective is to infer landscape effects on dispersal.

The function gen_graph_thr takes as arguments a distance matrix used to weight the links of the resulting graph (mat_w) and a distance matrix on which the “thresholding” is based (mat_thr). The selected links are selected according to the values of this latter matrix. The argument thr is the numerical value of the threshold distance. If mat_thr is not specified, mat_w is used by default for the thresholding. Lastly, you have to specify if the links to remove take larger or lower values than the threshold value.

graph_thr <- gen_graph_thr(mat_w = mat_dps, mat_thr = mat_geo,
thr = 12000, mode = "larger")
graph_thr
#> IGRAPH f478cbc UNW- 50 162 --
#> + attr: name (v/c), weight (e/n)
#> + edges from f478cbc (vertex names):
#>  [1] 1 --2  1 --4  1 --5  1 --6  1 --9  10--12 10--20 10--21 10--8  11--12
#> [11] 11--16 11--18 11--20 11--4  11--5  11--8  11--9  12--18 12--20 12--21
#> [21] 12--8  13--14 13--15 13--16 13--17 13--2  13--22 13--5  13--6  13--7
#> [31] 14--15 14--16 14--17 14--22 14--5  14--6  15--16 15--17 15--22 15--5
#> [41] 15--9  16--17 16--18 16--22 16--26 16--4  16--5  16--9  17--19 17--22
#> [51] 17--23 17--27 17--28 17--6  18--20 18--21 18--24 18--25 18--26 18--8
#> [61] 18--9  19--23 19--27 19--7  2 --4  2 --5  2 --6  2 --9  20--21 20--24
#> [71] 20--25 20--26 20--8  21--24 21--29 22--26 22--28 22--30 23--27 23--28
#> + ... omitted several edges

The function returns a graph in the form of an igraph object, which is consequently compatible with all functions from igraph package (Csardi and Nepusz 2006), one of the most used R package to create and analyse graphs (together with sna and networks). In the latter example, the graph has 50 nodes and 162 links when we prune it using a 12-km distance threshold. Its links are weighted with the values of the mat_dps matrix.

### Pruning based on topological constraints

A graph can be pruned according to a topological criterion. The function gen_graph_topo can use 4 different criteria. As with the previous function, topological criteria are applied by considering the distance values of the mat_topo matrix, but the links are weighted with the values of the mat_w matrix (except when mat_topo is not specified, cf. previous section).

Gabriel graph: in the created graph, two nodes are connected by a link if, when we draw a circle whose center is set at the middle of the segment linking them and whose radius is equal to half the length of this segment, there is no other node inside the circle. In mathematical terms, it means that there is a segment between $$x$$ and $$y$$ if and only if for every other point $$z$$, we have: $$d_{xy}\leq \sqrt{d_{xz}^{2}+d_{yz}^{2}}$$. We can compute such a graph from geographical distances (Gabriel and Sokal 1969) (graph_gab_geo below) but also, less commonly, from genetic distances (Naujokaitis-Lewis et al. 2013) (graph_gab_gen below). In the latter case, it is to some extent as if Pythagoras’s theorem was applied to genetic distances, which can seem a bit strange even if this method has already been used by Naujokaitis-Lewis et al. (2013).

graph_gab_geo <- gen_graph_topo(mat_w = mat_dps, mat_topo = mat_geo,
topo = "gabriel")
graph_gab_geo
#> IGRAPH f4d349f UNW- 50 98 --
#> + attr: name (v/c), weight (e/n)
#> + edges from f4d349f (vertex names):
#>  [1] 1 --2  1 --5  10--12 10--21 11--12 11--16 11--18 11--8  11--9  12--20
#> [11] 12--8  13--14 13--17 13--5  13--6  14--15 14--17 14--5  15--16 15--22
#> [21] 15--5  16--18 16--22 16--9  17--19 17--23 17--27 17--28 18--20 18--25
#> [31] 19--23 19--7  2 --6  20--21 20--24 20--25 21--24 21--29 22--26 22--28
#> [41] 23--27 24--25 24--29 24--31 25--26 25--31 25--32 26--30 27--28 27--33
#> [51] 28--30 28--34 28--36 29--31 29--37 3 --7  30--32 30--34 31--32 31--35
#> [61] 31--37 32--34 32--35 33--36 33--42 34--36 34--44 35--37 35--39 35--45
#> [71] 36--40 37--38 37--39 37--41 38--41 38--43 39--41 39--49 4 --5  4 --9
#> + ... omitted several edges
graph_gab_gen <- gen_graph_topo(mat_w = mat_dps, mat_topo = mat_dps,
topo = "gabriel")

Minimum Spanning Tree (MST): it creates a minimum spanning tree, i.e a graph in which every node is connected by a link to at least another node and whose total links’ weight is minimum. By definition, its number of links is equal to the number of nodes - 1.

graph_mst <- gen_graph_topo(mat_w = mat_dps, mat_topo = mat_dps,
topo = "mst")
graph_mst
#> IGRAPH f52b476 UNW- 50 49 --
#> + attr: name (v/c), weight (e/n)
#> + edges from f52b476 (vertex names):
#>  [1] 1 --2  1 --4  10--8  11--12 11--18 11--20 11--4  12--8  13--14 13--6
#> [11] 14--15 15--17 15--22 15--28 16--23 17--23 19--23 2 --7  20--25 21--24
#> [21] 23--27 24--29 25--29 25--30 26--30 27--33 3 --6  3 --7  30--34 31--32
#> [31] 32--34 32--35 32--37 36--40 37--38 38--43 39--43 4 --9  40--42 41--43
#> [41] 41--49 42--48 43--45 43--47 44--45 44--48 46--47 48--50 5 --9

“Percolation” graph: the graph is created by removing iteratively some links, beginning with those with the highest weights until the graph breaks into more than one component. We conserve the link whose removal entails the creation of another component to obtain a connected graph. This method is also called the edge-thinning method (Urban et al. 2009). Such a method is linked to percolation theory (Rozenfeld et al. 2008). The function gen_graph_topo indicates the number of conserved links and the weight of the link whose removal disconnects the graph (maximum link weight of the created graph).

graph_percol <- gen_graph_topo(mat_w = mat_dps, mat_topo = mat_dps,
topo = "percol")
#> Number of conserved links : 325
#> Maximum weight of the conserved links : 0.7525

Complete graph: the function allows to create a complete graph from a distance matrix. In that case, there is no pruning and, by definition, all population pairs are connected.

graph_comp <- gen_graph_topo(mat_w = mat_dps, mat_topo = mat_dps,
topo = "comp")

### Pruning based on the conditional independence principle

The last pruning method implemented by the graph4lg package is based upon the conditional independence principle. The function gen_graph_indep is largely inspired by the function popgraph created by R. Dyer (Dyer and Nason 2004), but does not need the package popgraph to function. Besides, as some calculations are performed with functions from the adegenet package (coded in C), it is faster than the original popgraph function. It is also more flexible than popgraph function given you can vary i) the way we compute genetic distances used to weight the links and to compute the covariance between populations, ii) the formula used to compute the covariance from squared distances or alternatively simple distances, iii) the statistical tolerance threshold, iv) the p-values adjustment and v) the returned objects created by the function. Without entering further into the details, here is an implementation example.

graph_ci <- gen_graph_indep(x = data_genind,
dist = "PCA",
cov = "sq",
adj = "holm")
graph_ci
#> IGRAPH 326a1c8 UNW- 50 105 --
#> + attr: name (v/c), weight (e/n)
#> + edges from 326a1c8 (vertex names):
#>  [1] 1 --2  1 --3  1 --4  1 --9  10--11 10--8  11--12 11--18 11--20 12--20
#> [11] 12--25 12--8  13--14 13--15 13--22 13--6  14--15 14--17 14--19 14--27
#> [21] 14--32 14--6  15--17 15--23 15--28 16--22 16--23 16--33 16--41 17--23
#> [31] 17--33 18--19 18--20 18--9  19--23 19--27 2 --5  2 --7  20--4  20--5
#> [41] 21--24 21--29 21--45 22--23 22--8  23--27 23--49 24--29 25--26 25--3
#> [51] 25--30 25--32 25--37 26--3  26--30 26--34 26--39 26--44 26--8  27--33
#> [61] 3 --31 3 --6  3 --7  30--34 30--7  31--32 31--35 31--36 32--34 32--35
#> [71] 32--8  33--40 34--6  35--37 36--40 36--48 37--38 37--39 37--46 38--43
#> + ... omitted several edges

## Graphs’ processing and analysis

Once the graphs have been created, you can perform calculation from them, visualize and export them.

### Graphical application

#### Representation of the graph on a map

The function plot_graph_lg integrates functions from igraph and ggplot2 to represent graphs on a map. Most frequently, graphs are spatial and a table with populations’ coordinates must be given as an argument. It must have exactly the same structure as the table given as an argument to mat_geo_dist (3 columns : ID, x, y). The visual representation can make visible the links’ weights by plotting the links with a width proportional to the weight (width = "w") or the inverse weight (width = "inv") of the links.

For example, with the graph graph_mst :

p <- plot_graph_lg(graph = graph_mst, crds = crds_patches,
mode = "spatial", weight = TRUE, width = "inv")
p

If the populations’ spatial coordinates are not available, you can still display the graph on a two-dimensional plane. In that case, the nodes’ positions are computed with Fruchterman and Reingold (1991) algorithm to optimize the representation. With the graph graph_mst, we obtain:

p <- plot_graph_lg(graph = graph_mst, crds = crds_patches,
mode = "aspatial", weight = TRUE, width = "inv")
p

#### Representation of the graph’s modules on a map

You can also plot the results of a graph’s partition carried out by computing a modularity index with the function plot_graph_modul. This function is similar to plot_graph_lg. The main difference is that the graph’s nodes from the same modules are displayed in the same color.

Several algorithms can be used: fast greedy (Clauset, Newman, and Moore 2004), louvain (Blondel et al. 2008), optimal (Brandes et al. 2008) and walktrap (Pons and Latapy 2006). The number of created modules in each graph is adjustable but can be fixed depending on the optimal values of the modularity indices. Besides, the modularity calculation can take into account the links’ weights or not. When taken into account, the weight given to a link in the calculation will be i) proportional to or ii) inversely proportional to the genetic or landscape distance to which it is associated.

In the following example, the graph is partitioned into 6 modules with the fast_greedy algorithm (default option).

plot_graph_modul(graph = graph_gab_geo, crds = crds_patches)

### Spatial layers export

Even if the function plot_graph_lg enables to visualize a spatial graph on a geographical plane, it is often useful to confront the populations’ and links’ locations to other types of spatial data. To that purpose, you can export the graph into shapefile layers in order to open them in a GIS. The graphs’ nodes must have spatial coordinates. When exporting, you can choose to export only the nodes’ shapefile layer, the links’ shapefile layer or both. You can also compute metrics associated to the nodes (degree, betweenness, sum of inverse links weights). These metrics will be included in the attribute table of the exported nodes’ shapefile layer. For the links, the attribute table contains the weights associated to every link. The function graph_to_shp takes also as an argument the coordinates reference system (CRS) in which the points’ coordinates from the table are expressed. It will be the CRS of the created shapefile layers. The last argument is the suffix given to the shapefile layers’ names beginning with “node” or “link”.

graph_to_shp(graph = graph_mst, crds = crds_patches, mode = "both",
layer_name = "test_shp_mst",
dir_path = "wd",
metrics = TRUE,
crds_crs = "+proj=lcc +lat_1=49 +lat_2=44 +lat_0=46.5 +lon_0=3
+x_0=700000 +y_0=6600000 +ellps=GRS80 +units=m +no_defs")

Shapefile layers are created in the working directory and can be imported into a GIS.

### Metrics calculation

You can also compute metrics from graphs with functions from igraph package on R. For example, to compute the nodes’ degree:

igraph::degree(graph_percol)
#>  1 10 11 12 13 14 15 16 17 18 19  2 20 21 22 23 24 25 26 27 28 29  3 30 31 32
#> 11  6 14 14 12 18 14 10 17 14 11 12 18  2 19 14  2 28 17 13 11  3 11 19 13 14
#> 33 34 35 36 37 38 39  4 40 41 42 43 44 45 46 47 48 49  5 50  6  7  8  9
#> 13 17 12  5 17 13 16 12 19 15 12 20 12 12 16 20 13 11  8 13 13  9  7  8

You can also perform modularity analyses:

igraph::cluster_fast_greedy(graph_thr,
weights = 1/igraph::E(graph_thr)$weight) #> IGRAPH clustering fast greedy, groups: 6, mod: 0.62 #> + groups: #>$1
#>    [1] "13" "14" "15" "16" "17" "19" "22" "23" "27" "28" "3"  "33" "36" "6"
#>   [15] "7"
#>
#>   $2 #> [1] "25" "26" "30" "31" "32" "34" "35" #> #>$3
#>    [1] "1"  "10" "11" "12" "18" "2"  "20" "4"  "5"  "8"  "9"
#>
#>   + ... omitted several groups/vertices

Metrics’ values can be associated to the nodes to which they correspond. To that purpose, 1) you can create a data.frame to store the values. It must have a column with the nodes’ names in the same form as the nodes’ names of the igraph object from which you compute the metric.

df_met <- data.frame(ID = V(graph_percol)$name) df_met$deg <- igraph::degree(graph_percol)
df_met$modul <- igraph::cluster_fast_greedy(graph_thr, weights = 1/igraph::E(graph_thr)$weight)\$membership

Then, you add the metrics’ values to the igraph object as nodes’ attributes.

graph_percol <- add_nodes_attr(graph_percol,
input = "df",
data = df_met,
index = "ID")

It is another and more flexible way to export metrics’ values in the attribute table of the shapefile layers created with the function graph_to_shp.

You can also 2) associate metrics’ values computed with GRAPHAB software to the nodes of the igraph object. In this case, you have to specify the path to a shapefile layer created with GRAPHAB whose attribute table contains a field with the graph’s nodes names.

land_graph <- add_nodes_attr(land_graph,
input = "shp",
dir_path = system.file('extdata', package = 'graph4lg'),
layer = "patches",
index = "Id",
include = "Area")

## Graphs’ comparison

In landscape genetics, the analysis of the link between landscape and genetic data can be performed at several scales. Indeed, node-, link-, neighbourhood- and boundary-based analyses are distinguished (Wagner and Fortin 2013). Similarly, you can compare landscape and genetic graphs at several scales, in particular when they share the same nodes. Two graphs are similar if the correlation coefficient (non parametric rank correlation ) between the metric values calculated for the same nodes or between the distance values associated to the same links is high. Similarly, they are similar if the conserved links of one graph are the same ones as those conserved in the other. Finally, two graphs have similar topological and connectivity properties if their modules match, i.e. if two nodes classified together in the same module when the partition in modules is computed from one graph are also classified together in the same module when the partition is computed from the other graph.

We included two functions allowing such comparisons. The function graph_topo_compar compares the topologies of two graphs sharing the same nodes. To do that, it creates a contingency table whose modalities are “Presence of a link” and “Absence of a link”. We consider the topology of one graph to represent the “reality” of the dispersal flows that the other one is supposed to reproduce. In the (double entry) contingency table, when there is a 10 in the cell “Presence of a link” $$\times$$ “Presence of a link”, it means that 10 links between the same population pairs are present in the two graphs compared. The 10 value corresponds to the number of true positives (TP). The three other values of the table are the number of false positives (FP), true negatives (TN) and false negatives (FN). From this table, you can compute several metrics often used to evaluate the performance of classification methods: the Matthews’ correlation coefficient (Matthews 1975), the Kappa index, the False Discovery Rate, the Accuracy, the Sensitivity, the Specificity and the Precision.

For example, you can create a landscape graph from a cost-distance matrix using a threshold of 2000 cost-distance units and then compare the topology of this graph to that of the Gabriel graph created from the same nodes by computing the Matthews’ correlation coefficient.


land_graph2 <- gen_graph_thr(mat_w = mat_ld, mat_thr = mat_ld,
thr = 2000, mode = "larger")

graph_topo_compar(obs_graph = land_graph2,
pred_graph = graph_gab_geo,
mode = "mcc",
directed = FALSE)
#> Matthews Correlation Coefficient : 0.541875568864072
#> [1] 0.5418756

We get a Matthews’ correlation coefficient of 0.54. This coefficient takes a value of 0 when the matches between topologies are no more frequent than by simple chance. It reaches 1 when the topologies are identical.

Besides, you can compare the topologies of two graphs sharing the same nodes visually. To that purpose, their links are displayed on a map with a color depending on their presence in both graphs or in only one of them. The function graph_plot_compar can be used:

graph_plot_compar(x = land_graph2, y = graph_gab_geo,
crds = crds_patches)

Studies on graph theory developed a range of metrics. Among them, some are computed at the node level. Hence, two graphs sharing the same nodes can be compared by correlating the metrics computed at the node level in both graphs. These metrics can also be correlated to other variables characterizing the nodes, such as the allelic richness. The function graph_node_compar allows to compute such correlations. The metrics which can be computed at the node level are: the degree, the closeness centrality, the betweenness centrality, the strength (sum of the weights of the links connected to a node), the sum or the mean of the weights of links connected to a node (Koen, Bowman, and Wilson 2016). Besides, other variables can be studied by associating them to the graph’s nodes as attributes with the function add_nodes_attr. Pearson’s, Kendall’s and Spearman’s correlation coefficients between the metrics or variables can be computed. A significance test can then be performed.

In the following example, the betweenness centrality index is computed in both graphs. Then, Spearman’s correlation coefficient is computed.

graph_node_compar(x = graph_gab_geo, y = land_graph2,
metrics = c("btw", "btw"), method = "spearman",
weight = TRUE, test = TRUE)
#> [[1]]
#> [1] "Metric from graph x: btw"
#>
#> [[2]]
#> [1] "Metric from graph y: btw"
#>
#> [[3]]
#> [1] "Method used: spearman's correlation coefficient"
#>
#> [[4]]
#> [1] "Sample size: 50"
#>
#> [[5]]
#> [1] "Correlation coefficient: 0.435435371318489"
#>
#> [[6]]
#> [1] "p-value of the significance test: 0.00157527417929113"

The function graph_modul_compar compares the nodes partitions into modules. To do that, it computes the Adjusted Rand Index (ARI) (Hubert and Arabie 1985), a standardised index which counts the number of nodes pairs classified in the same module in both graphs. The function also performs the nodes partition into modules by using the modularity calculations available from igraph. You can specify the algorithm used to compute the modularity, the link weighting, the number of modules, among others (cf. function plot_graph_modul).

In the following example, we compare the same graphs as previously with the default parameters (‘fast_greedy’ algorithm, optimal number of modules, links weighted by inverse distances).

graph_modul_compar(x = land_graph2, y = graph_gab_geo)
#> [1] "6 modules in graph 1 and 6 modules in graph 2"
#> [1] "Adjusted Rand Index: 0.667565013147675"
#> [1] 0.667565

The ARI value is relatively high. There are 6 modules in both graph partitions.

## References

Blondel, Vincent D, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. “Fast Unfolding of Communities in Large Networks.” Journal of Statistical Mechanics - Theory and Experiment 10.

Bowcock, Anne M, A Ruiz-Linares, James Tomfohrde, Eric Minch, Judith R Kidd, and L Luca Cavalli-Sforza. 1994. “High Resolution of Human Evolutionary Trees with Polymorphic Microsatellites.” Nature 368 (6470): 455–57.

Brandes, Ulrik, Daniel Delling, Marco Gaertler, Robert Gorke, Martin Hoefer, Zoran Nikoloski, and Dorothea Wagner. 2008. “On Modularity Clustering.” IEEE Transactions on Knowledge and Data Engineering 20 (2): 172–88.

Clauset, Aaron, Mark EJ Newman, and Cristopher Moore. 2004. “Finding Community Structure in Very Large Networks.” Physical Review E 70 (6).

Csardi, Gabor, and Tamas Nepusz. 2006. “The Igraph Software Package for Complex Network Research.” International Journal of Complex Systems 1695 (5): 1–9.

Dyer, Rodney J. 2009. “GeneticStudio: A Suite of Programs for Spatial Analysis of Genetic-Marker Data.” Molecular Ecology Resources 9 (1): 110–13.

Dyer, Rodney J, and John D Nason. 2004. “Population Graphs: The Graph Theoretic Shape of Genetic Structure.” Molecular Ecology 13 (7): 1713–27.

Excoffier, Laurent, Peter E Smouse, and Joseph M Quattro. 1992. “Analysis of Molecular Variance Inferred from Metric Distances Among Dna Haplotypes: Application to Human Mitochondrial Dna Restriction Data.” Genetics 131 (2): 479–91.

Foltête, Jean-Christophe, Céline Clauzel, and Gilles Vuidel. 2012. “A Software Tool Dedicated to the Modelling of Landscape Networks.” Environmental Modelling & Software 38: 316–27.

Fortuna, Miguel A, Rafael G Albaladejo, Laura Fernández, Abelardo Aparicio, and Jordi Bascompte. 2009. “Networks of Spatial Genetic Variation Across Species.” Proceedings of the National Academy of Sciences 106 (45): 19044–9.

Fruchterman, Thomas MJ, and Edward M Reingold. 1991. “Graph Drawing by Force-Directed Placement.” Software: Practice and Experience 21 (11): 1129–64.

Gabriel, K Ruben, and Robert R Sokal. 1969. “A New Statistical Approach to Geographic Variation Analysis.” Systematic Zoology 18 (3): 259–78.

Hedrick, Philip W. 2005. “A Standardized Genetic Differentiation Measure.” Evolution 59 (8): 1633–8.

Hubert, Lawrence, and Phipps Arabie. 1985. “Comparing Partitions.” Journal of Classification 2 (1): 193–218.

Hutchison, Delbert W, and Alan R Templeton. 1999. “Correlation of Pairwise Genetic and Geographic Distance Measures: Inferring the Relative Influences of Gene Flow and Drift on the Distribution of Genetic Variability.” Evolution 53 (6): 1898–1914.

Jombart, Thibaut. 2008. “Adegenet: A R Package for the Multivariate Analysis of Genetic Markers.” Bioinformatics 24 (11): 1403–5.

Jost, Lou. 2008. “GST and Its Relatives Do Not Measure Differentiation.” Molecular Ecology 17 (18): 4015–26.

Khimoun, Aurélie, William Peterman, Cyril Eraud, Bruno Faivre, Nicolas Navarro, and Stéphane Garnier. 2017. “Landscape Genetic Analyses Reveal Fine-Scale Effects of Forest Fragmentation in an Insular Tropical Bird.” Molecular Ecology.

Koen, Erin L, Jeff Bowman, and Paul J Wilson. 2016. “Node-Based Measures of Connectivity in Genetic Networks.” Molecular Ecology Resources 16 (1): 69–79.

Landguth, Erin L, and SA Cushman. 2010. “CDPOP: A Spatially Explicit Cost Distance Population Genetics Program.” Molecular Ecology Resources 10 (1): 156–61.

Matthews, Brian W. 1975. “Comparison of the Predicted and Observed Secondary Structure of T4 Phage Lysozyme.” Biochimica et Biophysica Acta (BBA)-Protein Structure 405 (2): 442–51.

Murphy, Melanie, Rodney Dyer, and Samuel A. Cushman. 2015. “Graph Theory and Network Models in Landscape Genetics.” In Landscape Genetics: Concepts, Methods, Applications, edited by Niko Balkenhol, Samuel Cushman, Andrew Storfer, and Lisette Waits, 1st ed., 165–80. John Wiley & Sons.

Naujokaitis-Lewis, Ilona R, Yessica Rico, John Lovell, Marie-Josée Fortin, and Melanie A Murphy. 2013. “Implications of Incomplete Networks on Estimation of Landscape Genetic Connectivity.” Conservation Genetics 14 (2): 287–98.

Paradis, Emmanuel. 2010. “Pegas: An R Package for Population Genetics with an Integrated–Modular Approach.” Bioinformatics 26 (3): 419–20.

Paschou, Peristera, Petros Drineas, Evangelia Yannaki, Anna Razou, Katerina Kanaki, Fotis Tsetsos, Shanmukha Sampath Padmanabhuni, et al. 2014. “Maritime Route of Colonization of Europe.” Proceedings of the National Academy of Sciences.

Pons, Pascal, and Matthieu Latapy. 2006. “Computing Communities in Large Networks Using Random Walks.” J. Graph Algorithms Appl. 10 (2): 191–218.

Pritchard, Jonathan K, Matthew Stephens, and Peter Donnelly. 2000. “Inference of Population Structure Using Multilocus Genotype Data.” Genetics 155 (2): 945–59.

Raymond, Michel. 1995. “GENEPOP: Population Genetics Software for Exact Tests and Ecumenism. Vers. 1.2.” Journal of Heredity 86: 248–49.

Rousset, François. 1997. “Genetic Differentiation and Estimation of Gene Flow from F-Statistics Under Isolation by Distance.” Genetics 145 (4): 1219–28.

Rozenfeld, Alejandro F, Sophie Arnaud-Haond, Emilio Hernández-Garcı́a, Vı́ctor M Eguı́luz, Ester A Serrão, and Carlos M Duarte. 2008. “Network Analysis Identifies Weak and Strong Links in a Metapopulation System.” Proceedings of the National Academy of Sciences 105 (48): 18824–9.

Shirk, AJ, EL Landguth, and SA Cushman. 2017. “A Comparison of Indiviudal-Based Genetic Distance Metrics for Landscape Genetics.” Molecular Ecology 17 (6).

Tournant, Pierline, Eve Afonso, Sébastien Roué, Patrick Giraudoux, and Jean-Christophe Foltête. 2013. “Evaluating the Effect of Habitat Connectivity on the Distribution of Lesser Horseshoe Bat Maternity Roosts Using Landscape Graphs.” Biological Conservation 164: 39–49.

Urban, Dean L, Emily S Minor, Eric A Treml, and Robert S Schick. 2009. “Graph Models of Habitat Mosaics.” Ecology Letters 12 (3): 260–73.

Van Strien, Maarten J, Rolf Holderegger, and Hein J Van Heck. 2015. “Isolation-by-Distance in Landscapes: Considerations for Landscape Genetics.” Heredity 114 (1): 27.

Wagner, Helene H, and Marie-Josée Fortin. 2013. “A Conceptual Framework for the Spatial Analysis of Landscape Genetic Data.” Conservation Genetics 14 (2): 253–61.

Weir, Bruce S, and C Clark Cockerham. 1984. “Estimating F-Statistics for the Analysis of Population Structure.” Evolution 38 (6): 1358–70.