Viola’s Data Structure – Why tidy data?
Viola breaks down SV information into multiple tidy tables to enable flexible data processing. These tables follow the principles of tidy data: each SV record is a row, each variable is a column, and each type of observational unit is a table (see also Wickham, 2014). Consequently, no single element stores multiple values, in contrast to the INFO and FORMAT columns of a VCF file.
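To make the contrast with the INFO column concrete, here is a minimal sketch that splits one INFO string into tidy rows, one value per row. The helper `tidy_info` and the sample record are hypothetical and exist only for illustration; this is not Viola's parser.

```python
def tidy_info(sv_id, info):
    """Split a VCF INFO string into tidy rows (one value per row).

    Hypothetical helper for illustration only; not part of Viola.
    """
    rows = []
    for entry in info.split(";"):
        if not entry:
            continue
        if "=" in entry:
            key, raw = entry.split("=", 1)
            values = raw.split(",")
        else:
            # Flag-type INFO such as IMPRECISE carries no explicit value.
            key, values = entry, [True]
        for value_idx, value in enumerate(values):
            rows.append({"id": sv_id, "info": key.lower(),
                         "value_idx": value_idx, "value": value})
    return rows


# Made-up record: CIPOS holds two values, so it becomes two rows.
rows = tidy_info("sv1", "SVTYPE=DEL;SVLEN=-500;CIPOS=-10,10;IMPRECISE")
for row in rows:
    print(row)
```

Each multi-valued INFO ends up as several rows distinguished by `value_idx`, so no cell ever needs to be parsed again.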
The most important benefit of this data structure is its extensibility for future functions. Software always requires updates to keep up with trends and developments. Because the SV information is broken down into multiple tidy tables, even complex requirements can be implemented with little effort.
These are benefits for developers rather than for users. However, maintaining a data structure that is easy to develop against is an important part of keeping software healthy, and that benefit is ultimately passed on to the user.
Here, we explain the specific data structure of the Bedpe and Vcf classes.
Tables composing Vcf object
positions table
Name of the table: “positions”
| Column | Description |
|---|---|
| id | Identifiers of SV records |
| chrom1(FK) | Chromosome of the first breakend |
| pos1 | Position of the first breakend |
| chrom2(FK) | Chromosome of the second breakend |
| pos2 | Position of the second breakend |
| strand1 | Strand of the first breakend |
| strand2 | Strand of the second breakend |
| qual | Quality of the SV call |
| ref | Reference nucleotides at chrom1:pos1 |
| alt(FK) | Details about the alteration |
| svtype | SV type of the event |
filters table
Name of the table: “filters”
| Column | Description |
|---|---|
| id(FK) | Identifiers of SV records |
| filter | Filters assigned to the id |
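Because filters live in their own table keyed by id, selecting records by filter is a lookup followed by a join on id. The sketch below uses plain lists of dicts as stand-ins for the two tables; the records, the filter names, and the helper `ids_with_filter` are made up for illustration.

```python
# Stand-ins for the "positions" and "filters" tables (hypothetical data).
positions = [
    {"id": "sv1", "chrom1": "chr1", "pos1": 1000,
     "chrom2": "chr1", "pos2": 1500, "svtype": "DEL"},
    {"id": "sv2", "chrom1": "chr2", "pos1": 300,
     "chrom2": "chr5", "pos2": 900, "svtype": "BND"},
]
filters = [
    {"id": "sv1", "filter": "PASS"},
    # A record with two filters simply occupies two rows.
    {"id": "sv2", "filter": "LowQual"},
    {"id": "sv2", "filter": "MinSomaticScore"},
]


def ids_with_filter(filters, name):
    """Return the set of SV ids that carry the given filter."""
    return {row["id"] for row in filters if row["filter"] == name}


# Join back to "positions" through the shared id column.
passed = ids_with_filter(filters, "PASS")
passed_positions = [row for row in positions if row["id"] in passed]
print([row["id"] for row in passed_positions])
```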
info tables
A Vcf object contains one info table for each INFO field (e.g. SVLEN, CIPOS, IMPRECISE, …) in the original VCF file. Here, we show the ‘svlen’ table as an example of an info table.
Name of the table: “svlen”
| Column | Description |
|---|---|
| id(FK) | Identifiers of SV records |
| value_idx | Zero-based index (this column is necessary because some INFO fields, such as CIPOS, have multiple values) |
| svlen | Length of SV |
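The role of `value_idx` is easiest to see for a multi-valued INFO. The sketch below assumes a ‘cipos’ table laid out by analogy with ‘svlen’ (columns id, value_idx, cipos) and regroups its rows into one tuple per SV record. The data and the helper `collect_values` are hypothetical.

```python
from collections import defaultdict

# Assumed layout of a "cipos" info table, by analogy with "svlen".
cipos = [
    {"id": "sv1", "value_idx": 0, "cipos": -10},
    {"id": "sv1", "value_idx": 1, "cipos": 10},
    # Row order is irrelevant: value_idx records the original order.
    {"id": "sv2", "value_idx": 1, "cipos": 5},
    {"id": "sv2", "value_idx": 0, "cipos": -5},
]


def collect_values(table, column):
    """Group a tidy info table into {id: (value0, value1, ...)}."""
    grouped = defaultdict(list)
    for row in sorted(table, key=lambda r: (r["id"], r["value_idx"])):
        grouped[row["id"]].append(row[column])
    return {sv_id: tuple(values) for sv_id, values in grouped.items()}


intervals = collect_values(cipos, "cipos")
print(intervals)
```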
formats table
Name of the table: “formats”
| Column | Description |
|---|---|
| id(FK) | Identifiers of SV records |
| sample(FK) | Sample name |
| format(FK) | Name of the FORMAT |
| value_idx | Zero-based index |
| value | Value of the FORMAT |
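In this layout, a single FORMAT value is addressed by the combination of id, sample, format, and value_idx. A minimal sketch of such a lookup follows; the sample names, the two-valued FORMAT named "PR", its read counts, and the helper `format_value` are all assumptions made for illustration.

```python
# Stand-in for the "formats" table (hypothetical data).
formats = [
    {"id": "sv1", "sample": "tumor", "format": "PR", "value_idx": 0, "value": 20},
    {"id": "sv1", "sample": "tumor", "format": "PR", "value_idx": 1, "value": 8},
    {"id": "sv1", "sample": "normal", "format": "PR", "value_idx": 0, "value": 25},
    {"id": "sv1", "sample": "normal", "format": "PR", "value_idx": 1, "value": 0},
]


def format_value(formats, sv_id, sample, fmt, value_idx=0):
    """Return the value stored under (id, sample, format, value_idx), or None."""
    key = (sv_id, sample, fmt, value_idx)
    for row in formats:
        if (row["id"], row["sample"], row["format"], row["value_idx"]) == key:
            return row["value"]
    return None


# Second value (value_idx=1) of "PR" in each sample.
tumor_second = format_value(formats, "sv1", "tumor", "PR", value_idx=1)
normal_second = format_value(formats, "sv1", "normal", "PR", value_idx=1)
print(tumor_second, normal_second)
```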
contigs_meta table
This table corresponds to the contig section in the header of a VCF file.
Name of the table: “contigs_meta”
| Column | Description |
|---|---|
| id | Name of a contig (in most cases, a chromosome) |
| length | Length of the contig |
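Since chrom1 and chrom2 in the positions table are foreign keys into this table, breakends can be checked against the declared contigs. The following sketch assumes 1-based positions and uses made-up records; the helper `check_breakends` is hypothetical and not part of Viola.

```python
# Stand-ins for "contigs_meta" and "positions" (illustrative values).
contigs_meta = [
    {"id": "chr1", "length": 248956422},
    {"id": "chr2", "length": 242193529},
]
positions = [
    {"id": "sv1", "chrom1": "chr1", "pos1": 1000, "chrom2": "chr1", "pos2": 1500},
    {"id": "sv2", "chrom1": "chr2", "pos1": 300, "chrom2": "chr5", "pos2": 900},
    {"id": "sv3", "chrom1": "chr2", "pos1": 999999999, "chrom2": "chr2", "pos2": 500},
]


def check_breakends(positions, contigs_meta):
    """Report breakends on unknown contigs or beyond the contig length."""
    lengths = {row["id"]: row["length"] for row in contigs_meta}
    problems = []
    for row in positions:
        for chrom_col, pos_col in (("chrom1", "pos1"), ("chrom2", "pos2")):
            chrom, pos = row[chrom_col], row[pos_col]
            if chrom not in lengths:
                problems.append((row["id"], chrom_col, "unknown contig"))
            elif not 1 <= pos <= lengths[chrom]:  # assumes 1-based positions
                problems.append((row["id"], pos_col, "position out of range"))
    return problems


problems = check_breakends(positions, contigs_meta)
for problem in problems:
    print(problem)
```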
alts_meta table
Name of the table: “alts_meta”
| Column | Description |
|---|---|
| id | Name of an ALT |
| description | Description of the ALT |
filters_meta table
Name of the table: “filters_meta”
| Column | Description |
|---|---|
| id | Name of a FILTER |
| description | Description of the FILTER |
infos_meta table
Name of the table: “infos_meta”
| Column | Description |
|---|---|
| id | Name of an INFO |
| number | Number of values included with the INFO |
| type | Type of the value of the INFO |
| description | Description of the INFO |
| source | Source of the annotation (e.g. “dbsnp”) |
| version | Version of the source (e.g. “138”) |
formats_meta table
Name of the table: “formats_meta”
| Column | Description |
|---|---|
| id | Name of a FORMAT |
| number | Number of values included with the FORMAT |
| type | Type of the value of the FORMAT |
| description | Description of the FORMAT |
| source | Source of the annotation (e.g. “dbsnp”) |
| version | Version of the source (e.g. “138”) |
samples_meta table
This table has only a single column.
Name of the table: “samples_meta”
| Column | Description |
|---|---|
| id | Name of a sample |