Quick Start

This is a short example how to manage VCF files.

After you have read this page, you can

Import Viola package to your script.
Read VCF files to create viola.Vcf object.
Convert VCF files into BEDPE files, with any features you wish, such as INFO fields and FORMAT fields.
Filter SV records by any features, including genomic positions, INFO fields, FILTER fields, and FORMAT fields.

Import Viola to your script

In [1]: import viola

Create Vcf object from VCF file.

In order to use Viola’s features, you must first run read_vcf to create an viola.Vcf object.

In [2]: url = 'https://raw.githubusercontent.com/dermasugita/Viola-SV/master/docs/_static/tutorial.manta.vcf'

In [3]: vcf = viola.read_vcf(url, variant_caller='manta', patient_name="manta1") # filepath, url, and file-like object are acceptable.

Now you’re ready to perform a number of functions that the Viola package has.

To learn detail data structure of viola.Vcf object, see Viola’s Data Structure.

Viewing the contents of viola.Vcf object

You may want to view the content of Vcf objects after creating them. To do so, you have three options.

1) Simply print the object.

# The column name 'be' stands for 'breakend'.
In [4]: print(vcf)
INFO=imprecise,svtype,svlen,end,cipos,ciend,cigar,mateid,event,homlen,homseq,svinslen,svinsseq,left_svinsseq,right_svinsseq,contig,bnd_depth,mate_bnd_depth,somatic,somaticscore,junction_somaticscore,inv3,inv5
Documentation of Vcf object ==> https://dermasugita.github.io/ViolaDocs/docs/html/reference/vcf.html
        id              be1              be2 strand  qual svtype
0    test1    chr1:82550461    chr1:82554226     +-  None    DEL
1    test2    chr1:22814217    chr1:92581132     --  None    INV
2    test3    chr1:60567906    chr1:60675941     +-  None    DEL
3    test4    chr1:69583190    chr1:69590948     +-  None    DEL
4    test5  chr11:104534877  chr11:104536574     +-  None    DEL
5  test6_1  chr11:111134697   chr17:26470495     +-  None    BND
6  test6_2   chr17:26470495  chr11:111134697     -+  None    BND

2) Generate BEDPE like pandas.DataFrame.

In [5]: vcf_bedpe_like = vcf.to_bedpe_like()

In [6]: print(vcf_bedpe_like)
  chrom1     start1       end1 chrom2     start2       end2     name score strand1 strand2
 chr1   82550460   82550461   chr1   82554225   82554226    test1  None       +       -
 chr1   22814216   22814217   chr1   92581131   92581132    test2  None       -       -
 chr1   60567905   60567906   chr1   60675940   60675941    test3  None       +       -
 chr1   69583189   69583190   chr1   69590947   69590948    test4  None       +       -
chr11  104534876  104534877  chr11  104536573  104536574    test5  None       +       -
chr11  111134696  111134697  chr17   26470494   26470495  test6_1  None       +       -
chr17   26470494   26470495  chr11  111134696  111134697  test6_2  None       -       +

The way to add INFO, FILTER, and FORMAT to this bedpe-like DataFrame is explained here.

3) Use the get_table() method to get individual tables composing the Vcf object.

# the returned value is pd.DataFrame
In [7]: vcf.get_table('positions')
Out[7]: 
        id chrom1       pos1 chrom2       pos2  ... strand2 ref                 alt  qual svtype
0    test1   chr1   82550461   chr1   82554226  ...       -   G               <DEL>  None    DEL
1    test2   chr1   22814217   chr1   92581132  ...       -   T               <INV>  None    INV
2    test3   chr1   60567906   chr1   60675941  ...       -   T               <DEL>  None    DEL
3    test4   chr1   69583190   chr1   69590948  ...       -   T               <DEL>  None    DEL
4    test5  chr11  104534877  chr11  104536574  ...       -   C               <DEL>  None    DEL
5  test6_1  chr11  111134697  chr17   26470495  ...       -   T   T[chr17:26470495[  None    BND
6  test6_2  chr17   26470495  chr11  111134697  ...       +   T  ]chr11:111134697]T  None    BND

[7 rows x 11 columns]

The names of all the tables in vcf are stored in the table_list attribute:

In [8]: vcf.table_list
Out[8]: 
['positions',
 'filters',
 'imprecise',
 'svtype',
 'svlen',
 'end',
 'cipos',
 'ciend',
 'cigar',
 'mateid',
 'event',
 'homlen',
 'homseq',
 'svinslen',
 'svinsseq',
 'left_svinsseq',
 'right_svinsseq',
 'contig',
 'bnd_depth',
 'mate_bnd_depth',
 'somatic',
 'somaticscore',
 'junction_somaticscore',
 'inv3',
 'inv5',
 'formats',
 'contigs_meta',
 'alts_meta',
 'infos_meta',
 'formats_meta',
 'filters_meta',
 'samples_meta']

You can get any table you want.

In [9]: vcf.get_table('formats_meta') # get header information of FORMAT field
Out[9]: 
   id number     type                    description
0  PR   None  Integer  Spanning paired-read suppo...
1  SR   None  Integer  Split reads for the ref an...

Export as VCF/BEDPE file

You can export VCF/BEDPE files by to_vcf/to_bedpe method.

vcf.to_vcf('/path/to/the/output.vcf')
vcf.to_bedpe('/path/to/the/output.bedpe')

Filter Vcf object

Filtering VCF file is an essential step of bioinformatics study. viola.Vcf object provides an intuitive way to filter SV in almost any item.

You have two options for filtering.

1) Filter with queries using filter method

Viola has a query system that is easy to understand.

First, let’s look at a couple of examples.

a. Filter with SVTYPE of the INFO field. syntax: "<INFO name> [<value index>] <operator> <value>"

Please do not omit spaces.
<value indexer> is optional.
The <value indexer> is a 0-origin indexer which allows you to specify which of the comma-separated INFOs, such as CIPOS, should be filtered.
The following syntax can also be used for other INFO.

# filter with svtype.
In [10]: vcf.filter('svtype == DEL')
Out[10]: 
INFO=imprecise,svtype,svlen,end,cipos,ciend,cigar,mateid,event,homlen,homseq,svinslen,svinsseq,left_svinsseq,right_svinsseq,contig,bnd_depth,mate_bnd_depth,somatic,somaticscore,junction_somaticscore,inv3,inv5
Documentation of Vcf object ==> https://dermasugita.github.io/ViolaDocs/docs/html/reference/vcf.html
      id              be1              be2 strand  qual svtype
0  test1    chr1:82550461    chr1:82554226     +-  None    DEL
1  test3    chr1:60567906    chr1:60675941     +-  None    DEL
2  test4    chr1:69583190    chr1:69590948     +-  None    DEL
3  test5  chr11:104534877  chr11:104536574     +-  None    DEL

# example of filtering using <value index>
# The code below means that if the right value of CIPOS (e.g. the value "20" of CIPOS=-10,20) is
# lower than 60, the SV record will be output.
In [11]: using_value_idx = vcf.filter('cipos 1 < 60').to_vcf_like()

In [12]: print(using_value_idx)
  chrom1      pos1     id ref    alt  ...           filter                           info format  sample1_N  sample1_T
0   chr1  82550461  test1   G  <DEL>  ...  MinSomaticScore  IMPRECISE;SVTYPE=DEL;SVLEN...  PR:SR  21,0:10,0  43,4:15,3
1   chr1  22814216  test2   T  <INV>  ...  MinSomaticScore  IMPRECISE;SVTYPE=INV;SVLEN...     PR       24,0       35,5
2   chr1  60567906  test3   T  <DEL>  ...  MinSomaticScore  SVTYPE=DEL;SVLEN=-108034;E...     PR       23,0       44,6

[3 rows x 11 columns]

In [13]: print(using_value_idx['info'].values)
['IMPRECISE;SVTYPE=DEL;SVLEN=-3764;END=82554225;CIPOS=-51,52;CIEND=-51,52;SOMATIC;SOMATICSCORE=10'
 'IMPRECISE;SVTYPE=INV;SVLEN=69766915;END=92581131;CIPOS=-51,51;CIEND=-89,90;SOMATIC;SOMATICSCORE=11;INV5'
 'SVTYPE=DEL;SVLEN=-108034;END=60675940;CIPOS=-44,44;CIEND=-38,39;SOMATIC;SOMATICSCORE=18']

Note

What is <value index>?

Some INFO and FORMAT entries are separated by commas to store more than one value. For example, the CIPOS and CIEND entries of INFO field always store two values:

CIPOS=-1,1;CIEND=0,2

<value index> is assigned to such data with 0-origin manner as shown below:

As another example, the values in the FORMAT field of Manta VCF are separated by commas.

b. Filter with genomic coordinates. syntax: "<'be1'|'be2'> <chromosome>[:[<start position>]-[<end position>]]"

‘be’ stands for ‘breakend’.
If you skip <start position> with the minus sign kept, you can get all SV record younger than <end position>, and vice versa if you skip <end position>.
Note that <start position>-<end position> specifies genomic coordinates with left-closed, right-open interval, that is, [<start position>, <end position>).

# filter with genomic coordinates.
In [14]: vcf.filter('be1 chr11')
Out[14]: 
INFO=imprecise,svtype,svlen,end,cipos,ciend,cigar,mateid,event,homlen,homseq,svinslen,svinsseq,left_svinsseq,right_svinsseq,contig,bnd_depth,mate_bnd_depth,somatic,somaticscore,junction_somaticscore,inv3,inv5
Documentation of Vcf object ==> https://dermasugita.github.io/ViolaDocs/docs/html/reference/vcf.html
        id              be1              be2 strand  qual svtype
0    test5  chr11:104534877  chr11:104536574     +-  None    DEL
1  test6_1  chr11:111134697   chr17:26470495     +-  None    BND

In [15]: vcf.filter('be1 !chr1')
Out[15]: 
INFO=imprecise,svtype,svlen,end,cipos,ciend,cigar,mateid,event,homlen,homseq,svinslen,svinsseq,left_svinsseq,right_svinsseq,contig,bnd_depth,mate_bnd_depth,somatic,somaticscore,junction_somaticscore,inv3,inv5
Documentation of Vcf object ==> https://dermasugita.github.io/ViolaDocs/docs/html/reference/vcf.html
        id              be1              be2 strand  qual svtype
0    test5  chr11:104534877  chr11:104536574     +-  None    DEL
1  test6_1  chr11:111134697   chr17:26470495     +-  None    BND
2  test6_2   chr17:26470495  chr11:111134697     -+  None    BND

In [16]: vcf.filter('be2 chr1:69583189-')
Out[16]: 
INFO=imprecise,svtype,svlen,end,cipos,ciend,cigar,mateid,event,homlen,homseq,svinslen,svinsseq,left_svinsseq,right_svinsseq,contig,bnd_depth,mate_bnd_depth,somatic,somaticscore,junction_somaticscore,inv3,inv5
Documentation of Vcf object ==> https://dermasugita.github.io/ViolaDocs/docs/html/reference/vcf.html
      id            be1            be2 strand  qual svtype
0  test1  chr1:82550461  chr1:82554226     +-  None    DEL
1  test2  chr1:22814217  chr1:92581132     --  None    INV
2  test4  chr1:69583190  chr1:69590948     +-  None    DEL

c. Filter with FORMAT table syntax: "<sample name> <FORMAT name> [<FORMAT index>] <operator> <value>

FORMAT indexer is optional. It is not required when the FORMAT isn’t separated by commas.
FORMAT indexer is 0-origin. Default value is 0.

# The meaning of this code is that if the right value of the PR of sample1_T
# in the FORMAT field (e.g. the value "3" in PR:SR 6,7:8,9  1,3:5,2) is greater than 5,
# the SV records will be returned.
In [17]: vcf.filter('sample1_T PR 1 > 5').to_bedpe_like(add_formats=True)
Out[17]: 
  chrom1     start1       end1 chrom2     start2  ...  strand2 sample1_N_PR_0 sample1_N_PR_1 sample1_T_PR_0 sample1_T_PR_1
0   chr1   60567905   60567906   chr1   60675940  ...        -             23              0             44              6
1   chr1   69583189   69583190   chr1   69590947  ...        -             21              0             20             12
2  chr11  104534876  104534877  chr11  104536573  ...        -             22              0             57             14

[3 rows x 14 columns]

c. Filter with FILTER table syntax: "[!]<FILTER name>"

When “!” mark is prepended, the SV records excluding <FILTER name> are returned.

In [18]: vcf.filter('PASS').to_bedpe_like(add_filters=True)
Out[18]: 
  chrom1     start1       end1 chrom2     start2  ...   name score strand1 strand2  PASS
0   chr1   69583189   69583190   chr1   69590947  ...  test4  None       +       -  True
1  chr11  104534876  104534877  chr11  104536573  ...  test5  None       +       -  True

[2 rows x 11 columns]

In [19]: vcf.filter('!PASS').to_bedpe_like(add_filters=True)
Out[19]: 
  chrom1     start1       end1 chrom2     start2  ...     name score strand1 strand2 MinSomaticScore
0   chr1   82550460   82550461   chr1   82554225  ...    test1  None       +       -            True
1   chr1   22814216   22814217   chr1   92581131  ...    test2  None       -       -            True
2   chr1   60567905   60567906   chr1   60675940  ...    test3  None       +       -            True
3  chr11  111134696  111134697  chr17   26470494  ...  test6_1  None       +       -            True
4  chr17   26470494   26470495  chr11  111134696  ...  test6_2  None       -       +            True

[5 rows x 11 columns]

d. Query can be a list

In [20]: query2_1 = 'svlen < -4000'

In [21]: query2_2 = 'svlen > -10000'

In [22]: vcf2 = vcf.filter([query2_1, query2_2], query_logic='and')

In [23]: result2 = vcf2.to_bedpe_like(custom_infonames=['svtype', 'svlen'])

In [24]: print(result2)
  chrom1    start1      end1 chrom2    start2  ...  score strand1 strand2 svtype_0 svlen_0
0   chr1  69583189  69583190   chr1  69590947  ...   None       +       -      DEL   -7757

[1 rows x 12 columns]

You can perform set operations by passing expressions to query_logic.

In [25]: query3_1 = 'svtype == DEL'

In [26]: query3_2 = 'svtype == BND'

In [27]: query3_3 = 'somaticscore > 20'

# (query3_1 or query3_2) and (query3_3)
In [28]: vcf3 = vcf.filter([query3_1, query3_2, query3_3], query_logic='(0 | 1) & 2')

In [29]: print(vcf3.to_bedpe_like(custom_infonames=['svtype', 'somaticscore']))
  chrom1     start1       end1 chrom2     start2  ...  score strand1 strand2 svtype_0 somaticscore_0
0   chr1   69583189   69583190   chr1   69590947  ...   None       +       -      DEL             47
1  chr11  104534876  104534877  chr11  104536573  ...   None       +       -      DEL             38
2  chr11  111134696  111134697  chr17   26470494  ...   None       +       -      BND             24
3  chr17   26470494   26470495  chr11  111134696  ...   None       -       +      BND             24

[4 rows x 12 columns]

2) Filter with SV ID using filter_by_id method

Vcf.filter is very useful, but in some situation, you may have to filter with much more complex criteria. In such cases we recommend to use Vcf.filter_by_id method.

Suppose you obtained the list of SV id, such as ['test2', 'test4'], as a result of quite complex criterion. In this case, your code should be:

In [30]: vcf.filter_by_id(['test2', 'test4'])
Out[30]: 
INFO=imprecise,svtype,svlen,end,cipos,ciend,cigar,mateid,event,homlen,homseq,svinslen,svinsseq,left_svinsseq,right_svinsseq,contig,bnd_depth,mate_bnd_depth,somatic,somaticscore,junction_somaticscore,inv3,inv5
Documentation of Vcf object ==> https://dermasugita.github.io/ViolaDocs/docs/html/reference/vcf.html
      id            be1            be2 strand  qual svtype
0  test2  chr1:22814217  chr1:92581132     --  None    INV
1  test4  chr1:69583190  chr1:69590948     +-  None    DEL