VCF Merging

Here, we describe how to merge VCF files with Viola.

First of all, let’s load the four VCF files from the different SV callers.

In [1]: manta = viola.read_vcf('https://raw.githubusercontent.com/dermasugita/Viola-SV/master/examples/demo_merge/test.merge.manta.vcf', variant_caller='manta', patient_name='manta1')

In [2]: delly = viola.read_vcf('https://raw.githubusercontent.com/dermasugita/Viola-SV/master/examples/demo_merge/test.merge.delly.vcf', variant_caller='delly', patient_name='delly1')

In [3]: lumpy = viola.read_vcf('https://raw.githubusercontent.com/dermasugita/Viola-SV/master/examples/demo_merge/test.merge.lumpy.vcf', variant_caller='lumpy', patient_name='lumpy1')

In [4]: gridss = viola.read_vcf('https://raw.githubusercontent.com/dermasugita/Viola-SV/master/examples/demo_merge/test.merge.gridss.vcf', variant_caller='gridss', patient_name='gridss')

These are synthetic files that we have created to briefly test the functionality of viola.merge, and they share some SVs.

You can see which SV records in a file are sharing SVs with other files by looking at their SV ID.

In [5]: print(manta)
INFO=imprecise,svtype,svlen,end,cipos,ciend,cigar,mateid,event,homlen,homseq,svinslen,svinsseq,left_svinsseq,right_svinsseq,contig,bnd_depth,mate_bnd_depth,somatic,somaticscore,junction_somaticscore,inv3,inv5
Documentation of Vcf object ==> https://dermasugita.github.io/ViolaDocs/docs/html/reference/vcf.html
       id          be1          be2 strand  qual svtype
0      M1  chr1:100000  chr1:200001     +-  None    DEL
1     MD1  chr1:200000  chr1:300001     +-  None    DEL
2     ML1  chr1:300000  chr1:400001     +-  None    DEL
3     MG1  chr1:400000  chr1:500001     +-  None    DEL
4    MDL1  chr2:100001  chr2:200000     -+  None    DUP
5    MDG1  chr2:100501  chr2:200500     -+  None    DUP
6    MLG1  chr2:200001  chr2:300000     -+  None    DUP
7  MDLG1o  chr3:100000  chr4:200000     ++  None    BND
8  MDLG1h  chr4:200000  chr3:100000     ++  None    BND

In the above case, for example, SV ID “ML1” means “the SV record is intended to be shared by Manta and Lumpy”.

In [6]: print(delly)
INFO=precise,imprecise,svtype,svmethod,end,chr2,pos2,pe,mapq,ct,cipos,ciend,srmapq,inslen,homlen,sr,srq,consensus,ce,rdratio,svlenorg,somatic,svlen
Documentation of Vcf object ==> https://dermasugita.github.io/ViolaDocs/docs/html/reference/vcf.html
      id           be1           be2 strand  qual svtype
0     D1   chr1:100030   chr1:600001     +-   180    DEL
1    MD1   chr1:200030   chr1:299921     +-   180    DEL
2    DL1  chr1:1000000  chr1:2000001     +-   180    DEL
3    DG1  chr1:2000000  chr1:3000001     +-   180    DEL
4   MDL1   chr2:100021   chr2:200080     -+    29    DUP
5   MDG1   chr2:100451   chr2:200480     -+    29    DUP
6   DLG1  chr2:1000001  chr2:2000000     -+    29    DUP
7  MDLG1   chr3:100030   chr4:200050     ++   193    BND

In [7]: print(lumpy)
INFO=svtype,strands,secondary,event,mateid,svlen,end,cipos,ciend,cipos95,ciend95,imprecise,su,pe,sr,bd,ev,prpos,prend,suorg
Documentation of Vcf object ==> https://dermasugita.github.io/ViolaDocs/docs/html/reference/vcf.html
       id           be1           be2 strand  qual svtype
0      L1    chr1:10000    chr1:10101     +-  None    DEL
1     ML1   chr1:300020   chr1:400005     +-  None    DEL
2     DL1  chr1:1000020  chr1:2000031     +-  None    DEL
3     LG1  chr1:3000000  chr1:4000031     +-  None    DEL
4    MDL1   chr2:100061   chr2:200085     -+  None    DUP
5    MLG1   chr2:200061   chr2:300065     -+  None    DUP
6    DLG1  chr2:1000041  chr2:2000065     -+  None    DUP
7  MDLG1o   chr3:100020   chr4:200010     ++  None    BND
8  MDLG1h   chr4:200010   chr3:100020     ++  None    BND

In [8]: print(gridss)
INFO=as,asc,asq,asrp,assr,ba,banrp,banrpq,bansr,bansrq,baq,basrp,bassr,bealn,beid,beidh,beidl,benames,bmq,bmqn,bmqx,bpnames,bq,bsc,bscq,bum,bumq,bvf,cas,casq,ciend,cipos,cirpos,cq,end,event,homlen,homseq,ic,ihompos,imprecise,insrm,insrmp,insrmrc,insrmro,insrmrt,instaxid,iq,mateid,mq,mqn,mqx,ras,rasq,ref,refpair,rf,rp,rpq,rsi,sb,sc,self,si,sr,srq,svlen,svtype,vf
Documentation of Vcf object ==> https://dermasugita.github.io/ViolaDocs/docs/html/reference/vcf.html
        id           be1           be2 strand    qual svtype
0      G1o   chr1:100050   chr1:110000     +-   13.03    BND
1      G1h   chr1:110000   chr1:100050     -+   13.03    BND
2     MG1o   chr1:400050   chr1:500003     +-  618.98    BND
3     MG1h   chr1:500003   chr1:400050     -+  618.98    BND
4     DG1o  chr1:2000050  chr1:3000090     +-  618.98    BND
5     DG1h  chr1:3000090  chr1:2000050     -+  618.98    BND
6     LG1o  chr1:3000050  chr1:4000090     +-  618.98    BND
7     LG1h  chr1:4000090  chr1:3000050     -+  618.98    BND
8    MDG1o   chr2:100550   chr2:200505     -+   13.03    BND
9    MDG1h   chr2:200505   chr2:100550     +-   13.03    BND
10   MLG1o   chr2:200020   chr2:300015     -+   13.03    BND
11   MLG1h   chr2:300015   chr2:200020     +-   13.03    BND
12   DLG1o  chr2:1000020  chr2:2000015     -+   13.03    BND
13   DLG1h  chr2:2000015  chr2:1000020     +-   13.03    BND
14  MDLG1o   chr3:100020   chr4:200015     ++   13.03    BND
15  MDLG1h   chr4:200015   chr3:100020     ++   13.03    BND

Default Behaviour

Now, let’s use viola.merge function.

In [9]: merged_vcf = viola.merge([manta, delly, lumpy, gridss], threshold=100)

In [10]: print(merged_vcf)
INFO=imprecise,svtype,svlen,end,cipos,ciend,cigar,mateid,event,homlen,homseq,svinslen,svinsseq,left_svinsseq,right_svinsseq,contig,bnd_depth,mate_bnd_depth,somatic,somaticscore,junction_somaticscore,inv3,inv5,precise,svmethod,chr2,pos2,pe,mapq,ct,srmapq,inslen,sr,srq,consensus,ce,rdratio,svlenorg,strands,secondary,cipos95,ciend95,su,bd,ev,prpos,prend,suorg,as,asc,asq,asrp,assr,ba,banrp,banrpq,bansr,bansrq,baq,basrp,bassr,bealn,beid,beidh,beidl,benames,bmq,bmqn,bmqx,bpnames,bq,bsc,bscq,bum,bumq,bvf,cas,casq,cirpos,cq,ic,ihompos,insrm,insrmp,insrmrc,insrmro,insrmrt,instaxid,iq,mq,mqn,mqx,ras,rasq,ref,refpair,rf,rp,rpq,rsi,sb,sc,self,si,vf,mergedid,originalid,caller,supportingid,supportingcaller,supportingidcount,supportingcallercount
Documentation of Vcf object ==> https://dermasugita.github.io/ViolaDocs/docs/html/reference/vcf.html
              id           be1           be2 strand   qual svtype
0       manta_M1   chr1:100000   chr1:200001     +-   None    DEL
1      manta_MD1   chr1:200000   chr1:300001     +-   None    DEL
2      manta_ML1   chr1:300000   chr1:400001     +-   None    DEL
3      manta_MG1   chr1:400000   chr1:500001     +-   None    DEL
4     manta_MDL1   chr2:100001   chr2:200000     -+   None    DUP
5     manta_MDG1   chr2:100501   chr2:200500     -+   None    DUP
6     manta_MLG1   chr2:200001   chr2:300000     -+   None    DUP
7   manta_MDLG1o   chr3:100000   chr4:200000     ++   None    BND
8       delly_D1   chr1:100030   chr1:600001     +-    180    DEL
9      delly_DL1  chr1:1000000  chr1:2000001     +-    180    DEL
10     delly_DG1  chr1:2000000  chr1:3000001     +-    180    DEL
11    delly_DLG1  chr2:1000001  chr2:2000000     -+     29    DUP
12      lumpy_L1    chr1:10000    chr1:10101     +-   None    DEL
13     lumpy_LG1  chr1:3000000  chr1:4000031     +-   None    DEL
14    gridss_G1o   chr1:100050   chr1:110000     +-  13.03    BND

You can see the SV records that are within 100bp of each other have been successfully merged.

By default, for each merged event, only one SV record is kept; regarding the “ML1”, the SV record from Manta is selected and that from Lumpy is discarded.

You can control which SV callers are given priority for selection by changing the order of the list of input Vcf objects.

Here is the example.

In [11]: merged_vcf = viola.merge([delly, lumpy, manta, gridss], threshold=100)

In [12]: print(merged_vcf)
INFO=precise,imprecise,svtype,svmethod,end,chr2,pos2,pe,mapq,ct,cipos,ciend,srmapq,inslen,homlen,sr,srq,consensus,ce,rdratio,svlenorg,somatic,svlen,strands,secondary,event,mateid,cipos95,ciend95,su,bd,ev,prpos,prend,suorg,cigar,homseq,svinslen,svinsseq,left_svinsseq,right_svinsseq,contig,bnd_depth,mate_bnd_depth,somaticscore,junction_somaticscore,inv3,inv5,as,asc,asq,asrp,assr,ba,banrp,banrpq,bansr,bansrq,baq,basrp,bassr,bealn,beid,beidh,beidl,benames,bmq,bmqn,bmqx,bpnames,bq,bsc,bscq,bum,bumq,bvf,cas,casq,cirpos,cq,ic,ihompos,insrm,insrmp,insrmrc,insrmro,insrmrt,instaxid,iq,mq,mqn,mqx,ras,rasq,ref,refpair,rf,rp,rpq,rsi,sb,sc,self,si,vf,mergedid,originalid,caller,supportingid,supportingcaller,supportingidcount,supportingcallercount
Documentation of Vcf object ==> https://dermasugita.github.io/ViolaDocs/docs/html/reference/vcf.html
             id           be1           be2 strand   qual svtype
0      delly_D1   chr1:100030   chr1:600001     +-    180    DEL
1     delly_MD1   chr1:200030   chr1:299921     +-    180    DEL
2     delly_DL1  chr1:1000000  chr1:2000001     +-    180    DEL
3     delly_DG1  chr1:2000000  chr1:3000001     +-    180    DEL
4    delly_MDL1   chr2:100021   chr2:200080     -+     29    DUP
5    delly_MDG1   chr2:100451   chr2:200480     -+     29    DUP
6    delly_DLG1  chr2:1000001  chr2:2000000     -+     29    DUP
7   delly_MDLG1   chr3:100030   chr4:200050     ++    193    BND
8      lumpy_L1    chr1:10000    chr1:10101     +-   None    DEL
9     lumpy_ML1   chr1:300020   chr1:400005     +-   None    DEL
10    lumpy_LG1  chr1:3000000  chr1:4000031     +-   None    DEL
11   lumpy_MLG1   chr2:200061   chr2:300065     -+   None    DUP
12     manta_M1   chr1:100000   chr1:200001     +-   None    DEL
13    manta_MG1   chr1:400000   chr1:500001     +-   None    DEL
14   gridss_G1o   chr1:100050   chr1:110000     +-  13.03    BND

Look again at “ML1” and you will see that the Lumpy-derived SV record has been chosen. In this case, Lumpy is the 2nd priority and Manta is the 3rd.

INFO relating to VCF Merging

viola.merge adds three INFOs:

  • mergedid: ID given to merged SV records.

  • originalid: Original ID of each SV record; the name of the caller is prepended after the merge to avoid conflict between the names of the SV IDs of different callers.

  • caller: The name of the caller which identified the SV record.

  • supportingid: IDs of SV records supporting the merged SV record.

  • supportingcaller: SV callers supporting the variant.

  • supportingidcount: Number of SV records supporting the merged SV record.

  • supportingcallercount: Count of SV callers supporting the variant.

As examples, ‘mergedid’ and ‘supportingcallercount’ are shown below.

In [13]: print(merged_vcf.view(custom_infonames=['mergedid', 'supportingcallercount']))
INFO=precise,imprecise,svtype,svmethod,end,chr2,pos2,pe,mapq,ct,cipos,ciend,srmapq,inslen,homlen,sr,srq,consensus,ce,rdratio,svlenorg,somatic,svlen,strands,secondary,event,mateid,cipos95,ciend95,su,bd,ev,prpos,prend,suorg,cigar,homseq,svinslen,svinsseq,left_svinsseq,right_svinsseq,contig,bnd_depth,mate_bnd_depth,somaticscore,junction_somaticscore,inv3,inv5,as,asc,asq,asrp,assr,ba,banrp,banrpq,bansr,bansrq,baq,basrp,bassr,bealn,beid,beidh,beidl,benames,bmq,bmqn,bmqx,bpnames,bq,bsc,bscq,bum,bumq,bvf,cas,casq,cirpos,cq,ic,ihompos,insrm,insrmp,insrmrc,insrmro,insrmrt,instaxid,iq,mq,mqn,mqx,ras,rasq,ref,refpair,rf,rp,rpq,rsi,sb,sc,self,si,vf,mergedid,originalid,caller,supportingid,supportingcaller,supportingidcount,supportingcallercount
Documentation of Vcf object ==> https://dermasugita.github.io/ViolaDocs/docs/html/reference/vcf.html
             id           be1           be2 strand   qual svtype  mergedid_0  supportingcallercount_0
0      delly_D1   chr1:100030   chr1:600001     +-    180    DEL           0                        1
1     delly_MD1   chr1:200030   chr1:299921     +-    180    DEL           1                        2
2     delly_DL1  chr1:1000000  chr1:2000001     +-    180    DEL           2                        2
3     delly_DG1  chr1:2000000  chr1:3000001     +-    180    DEL           3                        2
4    delly_MDL1   chr2:100021   chr2:200080     -+     29    DUP           4                        3
5    delly_MDG1   chr2:100451   chr2:200480     -+     29    DUP           5                        3
6    delly_DLG1  chr2:1000001  chr2:2000000     -+     29    DUP           6                        3
7   delly_MDLG1   chr3:100030   chr4:200050     ++    193    BND           7                        4
8      lumpy_L1    chr1:10000    chr1:10101     +-   None    DEL           8                        1
9     lumpy_ML1   chr1:300020   chr1:400005     +-   None    DEL           9                        2
10    lumpy_LG1  chr1:3000000  chr1:4000031     +-   None    DEL          10                        2
11   lumpy_MLG1   chr2:200061   chr2:300065     -+   None    DUP          11                        3
12     manta_M1   chr1:100000   chr1:200001     +-   None    DEL          12                        1
13    manta_MG1   chr1:400000   chr1:500001     +-   None    DEL          13                        2
14   gridss_G1o   chr1:100050   chr1:110000     +-  13.03    BND          14                        1

Now, in order to increase the accuracy of SV detection, let’s get SVs detected by more than one SV caller. To do this, just select the ones with ‘supportingcallercount’ greater than or equal to 2.

In [14]: filtered_vcf = merged_vcf.filter('supportingcallercount >= 2')

In [15]: print(filtered_vcf.view(custom_infonames=['mergedid', 'supportingcallercount']))
INFO=precise,imprecise,svtype,svmethod,end,chr2,pos2,pe,mapq,ct,cipos,ciend,srmapq,inslen,homlen,sr,srq,consensus,ce,rdratio,svlenorg,somatic,svlen,strands,secondary,event,mateid,cipos95,ciend95,su,bd,ev,prpos,prend,suorg,cigar,homseq,svinslen,svinsseq,left_svinsseq,right_svinsseq,contig,bnd_depth,mate_bnd_depth,somaticscore,junction_somaticscore,inv3,inv5,as,asc,asq,asrp,assr,ba,banrp,banrpq,bansr,bansrq,baq,basrp,bassr,bealn,beid,beidh,beidl,benames,bmq,bmqn,bmqx,bpnames,bq,bsc,bscq,bum,bumq,bvf,cas,casq,cirpos,cq,ic,ihompos,insrm,insrmp,insrmrc,insrmro,insrmrt,instaxid,iq,mq,mqn,mqx,ras,rasq,ref,refpair,rf,rp,rpq,rsi,sb,sc,self,si,vf,mergedid,originalid,caller,supportingid,supportingcaller,supportingidcount,supportingcallercount
Documentation of Vcf object ==> https://dermasugita.github.io/ViolaDocs/docs/html/reference/vcf.html
             id           be1           be2 strand  qual svtype  mergedid_0  supportingcallercount_0
0     delly_MD1   chr1:200030   chr1:299921     +-   180    DEL           1                        2
1     delly_DL1  chr1:1000000  chr1:2000001     +-   180    DEL           2                        2
2     delly_DG1  chr1:2000000  chr1:3000001     +-   180    DEL           3                        2
3    delly_MDL1   chr2:100021   chr2:200080     -+    29    DUP           4                        3
4    delly_MDG1   chr2:100451   chr2:200480     -+    29    DUP           5                        3
5    delly_DLG1  chr2:1000001  chr2:2000000     -+    29    DUP           6                        3
6   delly_MDLG1   chr3:100030   chr4:200050     ++   193    BND           7                        4
7     lumpy_ML1   chr1:300020   chr1:400005     +-  None    DEL           9                        2
8     lumpy_LG1  chr1:3000000  chr1:4000031     +-  None    DEL          10                        2
9    lumpy_MLG1   chr2:200061   chr2:300065     -+  None    DUP          11                        3
10    manta_MG1   chr1:400000   chr1:500001     +-  None    DEL          13                        2

Export VCF file

Finally, let’s export the Vcf object filtered by ‘supportingcallercount’ as a VCF file.

filtered_vcf.to_vcf('/path/to/the/output.vcf')