RIdeogram: drawing SVG graphics to visualize and map genome-wide data on idiograms

Zhaodong Hao

2020-01-20

Introduction

RIdeogram is an R package that draws SVG (Scalable Vector Graphics) graphics to visualize and map genome-wide data on idiograms.

Citation

If you use this package in a published paper, please cite this paper:

Hao Z, Lv D, Ge Y, Shi J, Weijers D, Yu G, Chen J. 2020. RIdeogram: drawing SVG graphics to visualize and map genome-wide data on the idiograms. PeerJ Comput. Sci. 6:e251 http://doi.org/10.7717/peerj-cs.251

Usage and Examples

This is a simple package with only three functions: ideogram, convertSVG and GFFex.

First, load the package after you have installed it.

require(RIdeogram)
#> Loading required package: RIdeogram

Then, you need to load the data from the RIdeogram package.

data(human_karyotype, package="RIdeogram")
data(gene_density, package="RIdeogram")
data(Random_RNAs_500, package="RIdeogram")

You can use the function “head()” to see the data format.

head(human_karyotype)
#>   Chr Start       End  CE_start    CE_end
#> 1   1     0 248956422 122026459 124932724
#> 2   2     0 242193529  92188145  94090557
#> 3   3     0 198295559  90772458  93655574
#> 4   4     0 190214555  49712061  51743951
#> 5   5     0 181538259  46485900  50059807
#> 6   6     0 170805979  58553888  59829934

Specifically, the ‘karyotype’ file contains the karyotype information and has five columns (or three, see below). The first column is the chromosome ID, the second and third columns are the start and end positions of the corresponding chromosomes, and the fourth and fifth columns are the start and end positions of the corresponding centromeres.
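As a minimal sketch of this layout, the following builds a five-column karyotype table by hand for a hypothetical three-chromosome genome. The object names and all lengths and centromere positions are invented for illustration; only the column names are taken from the head() output above.

```r
# Hypothetical karyotype for a three-chromosome genome (all values invented).
my_karyotype <- data.frame(
  Chr      = c("1", "2", "3"),
  Start    = c(0, 0, 0),
  End      = c(5000000, 4200000, 3100000),
  CE_start = c(2100000, 1500000, 1200000),
  CE_end   = c(2400000, 1800000, 1450000),
  stringsAsFactors = FALSE
)

# Basic sanity checks: each centromere must lie inside its chromosome.
stopifnot(
  all(my_karyotype$Start < my_karyotype$End),
  all(my_karyotype$CE_start >= my_karyotype$Start),
  all(my_karyotype$CE_end <= my_karyotype$End),
  all(my_karyotype$CE_start < my_karyotype$CE_end)
)

# Dropping the last two columns gives the three-column form.
my_karyotype_3col <- my_karyotype[, 1:3]
```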

head(gene_density)
#>   Chr   Start     End Value
#> 1   1       1 1000000    65
#> 2   1 1000001 2000000    76
#> 3   1 2000001 3000000    35
#> 4   1 3000001 4000000    30
#> 5   1 4000001 5000000    10
#> 6   1 5000001 6000000    10

The ‘mydata’ file contains the heatmap information and has four columns. The first column is the chromosome ID, the second and third columns are the start and end positions of windows in the corresponding chromosomes, and the fourth column is a characteristic value for each window, such as the number of genes.
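To illustrate how a table in this four-column format could be produced from raw positions, here is a rough base-R sketch that counts features in fixed 1 Mb windows. The gene positions, chromosome lengths and helper name count_windows are hypothetical, and this is not how the package itself computes its example data.

```r
# Hypothetical gene start positions on two chromosomes (invented values).
genes <- data.frame(
  Chr = c("1", "1", "1", "1", "2", "2"),
  Pos = c(150000, 900000, 1200000, 2999999, 10, 1000001),
  stringsAsFactors = FALSE
)
chr_len <- c("1" = 3000000, "2" = 2500000)  # hypothetical chromosome lengths
window  <- 1000000

count_windows <- function(chr) {
  starts <- seq(1, chr_len[[chr]], by = window)
  ends   <- pmin(starts + window - 1, chr_len[[chr]])  # last window may be shorter
  pos    <- genes$Pos[genes$Chr == chr]
  # findInterval() returns, for each position, the index of its window.
  idx    <- findInterval(pos, starts)
  data.frame(Chr = chr, Start = starts, End = ends,
             Value = tabulate(idx, nbins = length(starts)),
             stringsAsFactors = FALSE)
}

my_density <- do.call(rbind, lapply(names(chr_len), count_windows))
```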

head(Random_RNAs_500)
#>    Type    Shape Chr    Start      End  color
#> 1  tRNA   circle   6 69204486 69204568 6a3d9a
#> 2  rRNA      box   3 68882967 68883091 33a02c
#> 3  rRNA      box   5 55777469 55777587 33a02c
#> 4  rRNA      box  21 25202207 25202315 33a02c
#> 5 miRNA triangle   1 86357632 86357687 ff7f00
#> 6 miRNA triangle  11 74399237 74399333 ff7f00

The ‘mydata_interval’ file contains the label information and has six columns. The first column is the label type; the second column is the shape of the label, with three available options: box, triangle and circle; the third column is the chromosome ID; the fourth and fifth columns are the start and end positions of the corresponding labels in the chromosomes; and the sixth column is the color of the label.
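The following sketch builds a small label table in this six-column layout and checks it before plotting. The rows are invented, and the checks (allowed shapes, six-digit hex colors without a leading "#") are inferred from the description and the example output above rather than from documented validation rules.

```r
# Hypothetical label table (positions and types invented for illustration).
my_labels <- data.frame(
  Type  = c("tRNA", "rRNA", "miRNA"),
  Shape = c("circle", "box", "triangle"),
  Chr   = c("1", "2", "2"),
  Start = c(100000, 250000, 900000),
  End   = c(100080, 250120, 900060),
  color = c("6a3d9a", "33a02c", "ff7f00"),
  stringsAsFactors = FALSE
)

# The three shapes listed in the text.
allowed_shapes <- c("box", "triangle", "circle")
shapes_ok <- all(my_labels$Shape %in% allowed_shapes)

# In the example data, colors are six-digit hex strings without a leading "#".
colors_ok <- all(grepl("^[0-9A-Fa-f]{6}$", my_labels$color))
```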

Alternatively, you can load your own data with the function read.table, for example:

human_karyotype <- read.table("karyotype.txt", sep = "\t", header = TRUE, stringsAsFactors = FALSE)
gene_density <- read.table("data_1.txt", sep = "\t", header = TRUE, stringsAsFactors = FALSE)
Random_RNAs_500 <- read.table("data_2.txt", sep = "\t", header = TRUE, stringsAsFactors = FALSE)

The “karyotype.txt” file contains karyotype information; the “data_1.txt” file contains heatmap data; the “data_2.txt” contains track label data.
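As a self-contained illustration of the tab-delimited layout that read.table expects here, this sketch writes a tiny hypothetical karyotype to a temporary file and reads it back. The values are invented, and reading Chr as character is only a convenience assumption (so that IDs such as "1" and "X" share one type), not a documented requirement of the package.

```r
# Write a small hypothetical karyotype to a temporary tab-delimited file.
# Integer literals (L suffix) avoid scientific notation such as 5e+06 in the file.
tmp <- tempfile(fileext = ".txt")
toy <- data.frame(Chr = c("1", "2"), Start = c(0L, 0L),
                  End = c(5000000L, 4200000L), stringsAsFactors = FALSE)
write.table(toy, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back; colClasses keeps the chromosome IDs as character strings.
toy_in <- read.table(tmp, sep = "\t", header = TRUE, stringsAsFactors = FALSE,
                     colClasses = c(Chr = "character"))
unlink(tmp)
```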

In addition, we provide a simple function, GFFex, to extract heatmap information (such as gene density) from a GFF file. First, download the GFF file for your species, for example, the human genome annotation file from GENCODE (ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.annotation.gff3.gz). Then, prepare the karyotype file in the same format as described above. Note that the chromosome IDs in the first column of the karyotype file must match those in the GFF file (in this case, chr1, chr2,…). Next, you can run the following code:

gene_density <- GFFex(input = "gencode.v32.annotation.gff3.gz", karyotype = "human_karyotype.txt", feature = "gene", window = 1000000)

You can use the argument “feature” (default value is “gene”) to select the feature you want to extract from the GFF file and the argument “window” (default value is “1000000”) to set the window size.
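Because the chromosome IDs in the karyotype file and the GFF file must agree, it can help to compare them before running GFFex. The sketch below uses two hypothetical ID vectors; in practice they would come from the first column of each file, and adding a "chr" prefix is just one possible way to reconcile them.

```r
# Hypothetical IDs: the karyotype uses "1", "2", ... while the GFF uses "chr1", ...
karyotype_ids <- c("1", "2", "X")
gff_ids       <- c("chr1", "chr2", "chrX", "chrM")

# IDs present in the karyotype but absent from the GFF (here: all of them).
missing_before <- setdiff(karyotype_ids, gff_ids)

# One possible fix: add the "chr" prefix to the karyotype IDs.
karyotype_ids_fixed <- paste0("chr", karyotype_ids)
missing_after <- setdiff(karyotype_ids_fixed, gff_ids)
```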

Now, you can visualize this information using the ideogram function.

Basic usage

ideogram(karyotype, overlaid = NULL, label = NULL, label_type = NULL, synteny = NULL, colorset1, colorset2, width, Lx, Ly, output = "chromosome.svg")
convertSVG(svg, device, width, height, dpi)

Now, let’s begin.

First, we draw an idiogram with no mapping data.

ideogram(karyotype = human_karyotype)
convertSVG("chromosome.svg", device = "png")

Then, you will find an SVG file and a PNG file in your working directory.

Next, we can map genome-wide data on the chromosome idiogram. In this case, we visualize the gene density across the human genome.

ideogram(karyotype = human_karyotype, overlaid = gene_density)
convertSVG("chromosome.svg", device = "png")

Alternatively, we can map some genome-wide data with track labels next to the chromosome idiograms.

ideogram(karyotype = human_karyotype, label = Random_RNAs_500, label_type = "marker")
convertSVG("chromosome.svg", device = "png")

We can also map the overlaid heatmap and track labels on the chromosome idiograms at the same time.

ideogram(karyotype = human_karyotype, overlaid = gene_density, label = Random_RNAs_500, label_type = "marker")
convertSVG("chromosome.svg", device = "png")

If you want to change the color of heatmap, you can modify the argument ‘colorset1’ (default set is colorset1 = c(“#4575b4”, “#ffffbf”, “#d73027”)). You can use either color names as listed by colors() or hexadecimal strings of the form “#rrggbb” or “#rrggbbaa”.
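If you prefer to work with hexadecimal strings throughout, named R colors can be converted with base grDevices functions. This is a small sketch; the three color names are arbitrary choices for illustration.

```r
# Convert R color names to "#RRGGBB" strings with base grDevices functions.
named <- c("steelblue", "white", "firebrick")
hex   <- rgb(t(col2rgb(named)), maxColorValue = 255)
```

The resulting character vector has the same three-element shape as the default value of ‘colorset1’ shown above.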

ideogram(karyotype = human_karyotype, overlaid = gene_density, label = Random_RNAs_500, label_type = "marker", colorset1 = c("#fc8d59", "#ffffbf", "#91bfdb"))
convertSVG("chromosome.svg", device = "png")

If you do not know the centromere information for your species, you do not need to modify the script. In this case, the ‘karyotype’ file has only three columns.

To simulate this case, we deleted the last two columns of the ‘human_karyotype’ file.

human_karyotype <- human_karyotype[,1:3]
ideogram(karyotype = human_karyotype, overlaid = gene_density, label = Random_RNAs_500, label_type = "marker")
convertSVG("chromosome.svg", device = "png")

If there are only ten chromosomes in your species, you may need to modify the argument ‘width’ (default value is “170”).

To simulate this case, we keep only the first ten rows of the ‘human_karyotype’ file.

Before

human_karyotype <- human_karyotype[1:10,]
ideogram(karyotype = human_karyotype, overlaid = gene_density, label = Random_RNAs_500, label_type = "marker")
convertSVG("chromosome.svg", device = "png")

After

human_karyotype <- human_karyotype[1:10,]
ideogram(karyotype = human_karyotype, overlaid = gene_density, label = Random_RNAs_500, label_type = "marker", width = 100)
convertSVG("chromosome.svg", device = "png")

If you want to move the legend, modify the arguments ‘Lx’ and ‘Ly’ (default values are “160” and “35”, respectively).

‘Lx’ means the distance between upper-left point of the Legend and the left margin; ‘Ly’ means the distance between upper-left point of the Legend and the upper margin.

ideogram(karyotype = human_karyotype, overlaid = gene_density, label = Random_RNAs_500, label_type = "marker", width = 100, Lx = 80, Ly = 25)
convertSVG("chromosome.svg", device = "png")

We also provide other label types, such as “heatmap”, “line” and “polygon”. For a heatmap label, you can use the following code to map and visualize the data on idiograms.

data(human_karyotype, package="RIdeogram") #reload the karyotype data
data(LTR_density, package="RIdeogram") #load the LTR density data for the heatmap label
ideogram(karyotype = human_karyotype, overlaid = gene_density, label = LTR_density, label_type = "heatmap", colorset1 = c("#f7f7f7", "#e34a33"), colorset2 = c("#f7f7f7", "#2c7fb8")) #use the arguments 'colorset1' and 'colorset2' to set the colors for the gene and LTR heatmaps, respectively.
convertSVG("chromosome.svg", device = "png")

For one-line label,

data(liriodendron_karyotype, package="RIdeogram") #load the karyotype data
data(Fst_between_CE_and_CW, package="RIdeogram") #load the Fst data for overlaid heatmap
data(Pi_for_CE, package="RIdeogram") #load the Pi data for one-line label
head(Pi_for_CE) #this data has a format similar to the heatmap data, with an additional "Color" column that indicates the color of the line.
#>   Chr   Start     End      Value  Color
#> 1   1       1 2000000 0.00273566 fc8d62
#> 2   1 1000001 3000000 0.00239580 fc8d62
#> 3   1 2000001 4000000 0.00319407 fc8d62
#> 4   1 3000001 5000000 0.00286900 fc8d62
#> 5   1 4000001 6000000 0.00186596 fc8d62
#> 6   1 5000001 7000000 0.00186182 fc8d62
ideogram(karyotype = liriodendron_karyotype, overlaid = Fst_between_CE_and_CW, label = Pi_for_CE, label_type = "line", colorset1 = c("#e5f5f9", "#99d8c9", "#2ca25f"))
convertSVG("chromosome.svg", device = "png")
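The Start and End columns in head(Pi_for_CE) follow a sliding-window pattern (2 Mb windows moved in 1 Mb steps). As a sketch of how a table in the same layout could be assembled for your own data, assuming a hypothetical 6 Mb chromosome and placeholder values:

```r
# Sliding windows of 2 Mb with a 1 Mb step, matching the Start/End pattern
# seen in head(Pi_for_CE). The chromosome length and values are hypothetical.
chr_length <- 6000000
win  <- 2000000
step <- 1000000

starts <- seq(1, chr_length - win + 1, by = step)
my_line <- data.frame(
  Chr   = "1",
  Start = starts,
  End   = starts + win - 1,
  Value = round(seq(0.001, 0.003, length.out = length(starts)), 5),  # placeholder values
  Color = "fc8d62",
  stringsAsFactors = FALSE
)
```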

For two-line label,

data(liriodendron_karyotype, package="RIdeogram") #load the karyotype data
data(Fst_between_CE_and_CW, package="RIdeogram") #load the Fst data for overlaid heatmap
data(Pi_for_CE_and_CW, package="RIdeogram") #load the Pi data for two-line label
head(Pi_for_CE_and_CW) #this data has a format similar to the one-line label data, with two additional columns for the second feature you want to show. When you prepare your own data, please keep exactly the same column names.
#>   Chr   Start     End    Value_1 Color_1    Value_2 Color_2
#> 1   1       1 2000000 0.00273566  fc8d62 0.00385702  8da0cb
#> 2   1 1000001 3000000 0.00239580  fc8d62 0.00331109  8da0cb
#> 3   1 2000001 4000000 0.00319407  fc8d62 0.00374530  8da0cb
#> 4   1 3000001 5000000 0.00286900  fc8d62 0.00339141  8da0cb
#> 5   1 4000001 6000000 0.00186596  fc8d62 0.00305246  8da0cb
#> 6   1 5000001 7000000 0.00186182  fc8d62 0.00323655  8da0cb
ideogram(karyotype = liriodendron_karyotype, overlaid = Fst_between_CE_and_CW, label = Pi_for_CE_and_CW, label_type = "line", colorset1 = c("#e5f5f9", "#99d8c9", "#2ca25f"))
convertSVG("chromosome.svg", device = "png")
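If your two series are stored as separate one-line tables over the same windows, they can be combined into the seven-column layout shown in head(Pi_for_CE_and_CW). The sketch below uses base merge() on two small hypothetical tables; series_a and series_b are illustrative names, not package objects.

```r
# Two hypothetical single-series tables sharing the same windows.
series_a <- data.frame(Chr = "1", Start = c(1, 1000001), End = c(2000000, 3000000),
                       Value = c(0.0027, 0.0024), Color = "fc8d62",
                       stringsAsFactors = FALSE)
series_b <- data.frame(Chr = "1", Start = c(1, 1000001), End = c(2000000, 3000000),
                       Value = c(0.0039, 0.0033), Color = "8da0cb",
                       stringsAsFactors = FALSE)

# Join on the window coordinates; the suffixes yield Value_1, Color_1, Value_2, Color_2.
two_line <- merge(series_a, series_b, by = c("Chr", "Start", "End"),
                  suffixes = c("_1", "_2"))

# Restore genomic order explicitly and put the columns in the expected order.
two_line <- two_line[order(two_line$Chr, two_line$Start),
                     c("Chr", "Start", "End",
                       "Value_1", "Color_1", "Value_2", "Color_2")]
```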

For one-polygon label,

data(liriodendron_karyotype, package="RIdeogram") #load the karyotype data
data(Fst_between_CE_and_CW, package="RIdeogram") #load the Fst data for overlaid heatmap
data(Pi_for_CE, package="RIdeogram") #load the Pi data for one-polygon label
ideogram(karyotype = liriodendron_karyotype, overlaid = Fst_between_CE_and_CW, label = Pi_for_CE, label_type = "polygon", colorset1 = c("#e5f5f9", "#99d8c9", "#2ca25f"))
convertSVG("chromosome.svg", device = "png")

For two-polygon label,

data(liriodendron_karyotype, package="RIdeogram") #load the karyotype data
data(Fst_between_CE_and_CW, package="RIdeogram") #load the Fst data for overlaid heatmap
data(Pi_for_CE_and_CW, package="RIdeogram") #load the Pi data for two-polygon label
ideogram(karyotype = liriodendron_karyotype, overlaid = Fst_between_CE_and_CW, label = Pi_for_CE_and_CW, label_type = "polygon", colorset1 = c("#e5f5f9", "#99d8c9", "#2ca25f"))
convertSVG("chromosome.svg", device = "png")

Compared with the two-line label plot, we shift all x coordinates of the second polygon label to the right by 0.2× the chromosome width for better visualization.

In addition, you can use the argument “device” (default value is “png”) to set the format of the output file, such as “tiff”, “pdf”, “jpg”, etc. You can also use the argument “dpi” (default value is “300”) to set the resolution of the output image file.

convertSVG("chromosome.svg", device = "tiff", dpi = 600)

There are also four shortcut functions that convert the SVG image to these formats without the need to set the argument “device”:

svg2tiff("chromosome.svg")
svg2pdf("chromosome.svg")
svg2jpg("chromosome.svg")
svg2png("chromosome.svg")

For genome synteny analysis, we can use the ideogram function to visualize the genome synteny results between two or three genomes.

For dual genome comparison, load the example data first,

data(karyotype_dual_comparison, package="RIdeogram")
head(karyotype_dual_comparison)
#>   Chr Start      End   fill species size  color
#> 1  I      1 23037639 969696   Grape   12 252525
#> 2  II     1 18779884 969696   Grape   12 252525
#> 3 III     1 19341862 969696   Grape   12 252525
#> 4  IV     1 23867706 969696   Grape   12 252525
#> 5   V     1 25021643 969696   Grape   12 252525
#> 6  VI     1 21508407 0ab276   Grape   12 252525
table(karyotype_dual_comparison$species)
#> 
#>   Grape Populus 
#>      19      19

data(synteny_dual_comparison, package="RIdeogram")
head(synteny_dual_comparison)
#>   Species_1  Start_1    End_1 Species_2 Start_2   End_2   fill
#> 1         1 12226377 12267836         2 5900307 5827251 cccccc
#> 2        15  5635667  5667377        17 4459512 4393226 cccccc
#> 3         9  7916366  7945659         3 8618518 8486865 cccccc
#> 4         2  8214553  8242202        18 5964233 6027199 cccccc
#> 5        13  2330522  2356593        14 6224069 6138821 cccccc
#> 6        11 10861038 10886821        10 8099058 8011502 cccccc

If you want to import your own data, use the read.table function as mentioned above. Note that the karyotype format for genome synteny visualization is slightly different: the first three columns are the same, the fourth is the fill color of the idiograms, the fifth is the species name, and the last two columns are the size and color of the species name. This karyotype file contains information for two genomes (species A: Grape and species B: Populus), with species A sorted first. In the dual genome synteny file, the first three columns give the position in species A (Grape) and the next three columns give the position in species B (Populus) of each synteny block; the last column is the color of the Bezier curve that links the corresponding synteny blocks. As far as possible, place the rows for the colored lines at the end of the file.
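Moving the colored lines to the end of a synteny table can be done with a single order() call. This is a minimal sketch on a hypothetical table; it assumes, as in the example data, that “cccccc” is the fill used for ordinary (grey) links and that any other fill marks a highlighted link.

```r
# Hypothetical dual-genome synteny table (all coordinates invented).
my_synteny <- data.frame(
  Species_1 = c(1, 2, 3, 4),
  Start_1   = c(100, 200, 300, 400),
  End_1     = c(150, 250, 350, 450),
  Species_2 = c(2, 1, 4, 3),
  Start_2   = c(500, 600, 700, 800),
  End_2     = c(550, 650, 750, 850),
  fill      = c("cccccc", "e41a1c", "cccccc", "cccccc"),
  stringsAsFactors = FALSE
)

# order() on a logical key puts FALSE (grey) rows first and TRUE (colored)
# rows last; tied rows keep their original relative order.
is_colored <- my_synteny$fill != "cccccc"
my_synteny_sorted <- my_synteny[order(is_colored), ]
```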

Then, run the code as follows:

ideogram(karyotype = karyotype_dual_comparison, synteny = synteny_dual_comparison)
convertSVG("chromosome.svg", device = "png")

For ternary genome comparison, load the example data first,

data(karyotype_ternary_comparison, package="RIdeogram")
head(karyotype_ternary_comparison)
#>   Chr Start      End   fill   species size  color
#> 1  NA     1 15980527 fcb06b Amborella   10 fcb06b
#> 2  NA     1 11522362 fcb06b Amborella   10 fcb06b
#> 3  NA     1 11085951 fcb06b Amborella   10 fcb06b
#> 4  NA     1 10537363 fcb06b Amborella   10 fcb06b
#> 5  NA     1  9585472 fcb06b Amborella   10 fcb06b
#> 6  NA     1  9414115 fcb06b Amborella   10 fcb06b
table(karyotype_ternary_comparison$species)
#> 
#>    Amborella        Grape Liriodendron 
#>          100           19           19

data(synteny_ternary_comparison, package="RIdeogram")
head(synteny_ternary_comparison)
#>   Species_1 Start_2   End_2 Species_2  Start_1    End_1   fill type
#> 1         1 4761181 2609697         1   342802   981451 cccccc    1
#> 2         6 6344197 8074393         1 15387184 16716190 cccccc    1
#> 3        10 6457890 9052487         1 11224953 14959548 cccccc    1
#> 4        13 6318795 1295413         1 20564870 21386271 cccccc    1
#> 5        16 1398101 2884119         1 21108654 22221088 cccccc    1
#> 6        16 1482529 2093625         1 21864494 22364888 cccccc    1
tail(synteny_ternary_comparison, n = 20)
#>     Species_1  Start_2    End_2 Species_2  Start_1    End_1   fill type
#> 571        16 19278042 20828694         2 95267449 93334736 cccccc    3
#> 572        12 20546006 22461088         2 22647943 18365764 cccccc    3
#> 573         4 22259262 23453956         2 15068249 17839485 cccccc    3
#> 574        14 22377895 23821929         2 97299880 96033346 cccccc    3
#> 575         6  1538773  2808373         1 91285578 95681546 cccccc    3
#> 576        11  3381792  4954528         1 67689752 75286468 cccccc    3
#> 577         9  4814481  6975840         1 69506847 76015710 cccccc    3
#> 578        10  7091825  9742616         1 19333526 24516133 cccccc    3
#> 579        13 22063957 23402389         1 95843870 92195256 cccccc    3
#> 580         7   679765  1881756         6  7365421  7531534 e41a1c    1
#> 581         7   679765  2752867        13   501561   766473 e41a1c    1
#> 582         7   679765  3012501         8  7406703  8222490 e41a1c    1
#> 583         7  2049369  2942034        14 29350547 34369929 e41a1c    2
#> 584         7  2075095  1538540        10 28985737 30815217 e41a1c    2
#> 585        13   531939   834472        14 28866243 35278211 e41a1c    3
#> 586         8  7427221  8894821        14 28632063 34805893 e41a1c    3
#> 587         6  7567597  7690342        14 32050301 34913801 e41a1c    3
#> 588        13   501561   876423        10 30496700 27874100 e41a1c    3
#> 589         6  7171014  7815454        10 31408837 27660041 e41a1c    3
#> 590         8  5773528  9346871        10 31408837 26585934 e41a1c    3

The format of the karyotype file for ternary genome synteny visualization is similar to that for dual genome synteny visualization; it contains karyotype information for one more species and is sorted in the order of species A (Amborella), B (Grape) and C (Liriodendron). However, the synteny file differs from the dual genome one. Because it contains three comparisons, i.e., species A_vs_B, A_vs_C and B_vs_C, we add one additional column in which “1” represents A_vs_B, “2” represents A_vs_C and “3” represents B_vs_C. Again, as far as possible, place the rows for the colored lines at the end of the file.
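As a sketch of how the additional “type” column could be attached when you start from three separate pairwise tables, the following tags each table and stacks them. The helper make_pair and all coordinates are hypothetical; the column names are copied from the head(synteny_ternary_comparison) output above.

```r
# Helper that builds one hypothetical pairwise synteny table (values invented).
make_pair <- function(n) {
  data.frame(Species_1 = seq_len(n), Start_2 = seq_len(n) * 1000,
             End_2 = seq_len(n) * 1000 + 500, Species_2 = rev(seq_len(n)),
             Start_1 = seq_len(n) * 2000, End_1 = seq_len(n) * 2000 + 500,
             fill = "cccccc", stringsAsFactors = FALSE)
}

a_vs_b <- make_pair(2)
a_vs_c <- make_pair(3)
b_vs_c <- make_pair(2)

# Tag each comparison: 1 = A_vs_B, 2 = A_vs_C, 3 = B_vs_C.
a_vs_b$type <- 1
a_vs_c$type <- 2
b_vs_c$type <- 3

my_ternary  <- rbind(a_vs_b, a_vs_c, b_vs_c)
type_counts <- table(my_ternary$type)
```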

Then, run the code as follows:

ideogram(karyotype = karyotype_ternary_comparison, synteny = synteny_ternary_comparison)
convertSVG("chromosome.svg", device = "png")

In addition, if you want to use a gradient color for the Bezier curves you want to highlight (the red lines in the above picture), just replace the red color “e41a1c” with “gradient” in the seventh column (as in the example data “synteny_ternary_comparison_graident”). Here, we first load the example data and visualize the ternary genome synteny using the ideogram function. Since R graphics does not support the SVG gradient fill element, we use the rsvg_pdf function from the rsvg package to convert the SVG file directly into a PDF file. So, you may need to install the rsvg package if you want to show the gradient fill; alternatively, you can open the SVG file with Inkscape and save it as a PDF file.

data(synteny_ternary_comparison_graident, package="RIdeogram")
ideogram(karyotype = karyotype_ternary_comparison, synteny = synteny_ternary_comparison_graident)
library("rsvg")
rsvg_pdf("chromosome.svg", "chromosome.pdf")
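For your own synteny table, replacing the red fill with “gradient” can be sketched in base R as follows. The table here is a hypothetical stand-in with only an id and a fill column.

```r
# Hypothetical synteny rows: two grey links and two highlighted in red.
my_syn <- data.frame(id = 1:4,
                     fill = c("cccccc", "e41a1c", "cccccc", "e41a1c"),
                     stringsAsFactors = FALSE)

# Replace the highlight color with the keyword "gradient"; other rows are untouched.
my_syn$fill[my_syn$fill == "e41a1c"] <- "gradient"
```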