VCF Merging
Here, we describe how to merge VCF files with Viola.
First of all, let’s load the four VCF files from the different SV callers.
In [1]: manta = viola.read_vcf('https://raw.githubusercontent.com/dermasugita/Viola-SV/master/examples/demo_merge/test.merge.manta.vcf', variant_caller='manta', patient_name='manta1')
In [2]: delly = viola.read_vcf('https://raw.githubusercontent.com/dermasugita/Viola-SV/master/examples/demo_merge/test.merge.delly.vcf', variant_caller='delly', patient_name='delly1')
In [3]: lumpy = viola.read_vcf('https://raw.githubusercontent.com/dermasugita/Viola-SV/master/examples/demo_merge/test.merge.lumpy.vcf', variant_caller='lumpy', patient_name='lumpy1')
In [4]: gridss = viola.read_vcf('https://raw.githubusercontent.com/dermasugita/Viola-SV/master/examples/demo_merge/test.merge.gridss.vcf', variant_caller='gridss', patient_name='gridss')
These are synthetic files that we have created to briefly test the functionality of viola.merge, and they share some SVs.
You can see which SV records in a file are sharing SVs with other files by looking at their SV ID.
In [5]: print(manta)
INFO=imprecise,svtype,svlen,end,cipos,ciend,cigar,mateid,event,homlen,homseq,svinslen,svinsseq,left_svinsseq,right_svinsseq,contig,bnd_depth,mate_bnd_depth,somatic,somaticscore,junction_somaticscore,inv3,inv5
Documentation of Vcf object ==> https://dermasugita.github.io/ViolaDocs/docs/html/reference/vcf.html
id be1 be2 strand qual svtype
0 M1 chr1:100000 chr1:200001 +- None DEL
1 MD1 chr1:200000 chr1:300001 +- None DEL
2 ML1 chr1:300000 chr1:400001 +- None DEL
3 MG1 chr1:400000 chr1:500001 +- None DEL
4 MDL1 chr2:100001 chr2:200000 -+ None DUP
5 MDG1 chr2:100501 chr2:200500 -+ None DUP
6 MLG1 chr2:200001 chr2:300000 -+ None DUP
7 MDLG1o chr3:100000 chr4:200000 ++ None BND
8 MDLG1h chr4:200000 chr3:100000 ++ None BND
In the above case, for example, SV ID “ML1” means “the SV record is intended to be shared by Manta and Lumpy”.
In [6]: print(delly)
INFO=precise,imprecise,svtype,svmethod,end,chr2,pos2,pe,mapq,ct,cipos,ciend,srmapq,inslen,homlen,sr,srq,consensus,ce,rdratio,svlenorg,somatic,svlen
Documentation of Vcf object ==> https://dermasugita.github.io/ViolaDocs/docs/html/reference/vcf.html
id be1 be2 strand qual svtype
0 D1 chr1:100030 chr1:600001 +- 180 DEL
1 MD1 chr1:200030 chr1:299921 +- 180 DEL
2 DL1 chr1:1000000 chr1:2000001 +- 180 DEL
3 DG1 chr1:2000000 chr1:3000001 +- 180 DEL
4 MDL1 chr2:100021 chr2:200080 -+ 29 DUP
5 MDG1 chr2:100451 chr2:200480 -+ 29 DUP
6 DLG1 chr2:1000001 chr2:2000000 -+ 29 DUP
7 MDLG1 chr3:100030 chr4:200050 ++ 193 BND
In [7]: print(lumpy)
INFO=svtype,strands,secondary,event,mateid,svlen,end,cipos,ciend,cipos95,ciend95,imprecise,su,pe,sr,bd,ev,prpos,prend,suorg
Documentation of Vcf object ==> https://dermasugita.github.io/ViolaDocs/docs/html/reference/vcf.html
id be1 be2 strand qual svtype
0 L1 chr1:10000 chr1:10101 +- None DEL
1 ML1 chr1:300020 chr1:400005 +- None DEL
2 DL1 chr1:1000020 chr1:2000031 +- None DEL
3 LG1 chr1:3000000 chr1:4000031 +- None DEL
4 MDL1 chr2:100061 chr2:200085 -+ None DUP
5 MLG1 chr2:200061 chr2:300065 -+ None DUP
6 DLG1 chr2:1000041 chr2:2000065 -+ None DUP
7 MDLG1o chr3:100020 chr4:200010 ++ None BND
8 MDLG1h chr4:200010 chr3:100020 ++ None BND
In [8]: print(gridss)
INFO=as,asc,asq,asrp,assr,ba,banrp,banrpq,bansr,bansrq,baq,basrp,bassr,bealn,beid,beidh,beidl,benames,bmq,bmqn,bmqx,bpnames,bq,bsc,bscq,bum,bumq,bvf,cas,casq,ciend,cipos,cirpos,cq,end,event,homlen,homseq,ic,ihompos,imprecise,insrm,insrmp,insrmrc,insrmro,insrmrt,instaxid,iq,mateid,mq,mqn,mqx,ras,rasq,ref,refpair,rf,rp,rpq,rsi,sb,sc,self,si,sr,srq,svlen,svtype,vf
Documentation of Vcf object ==> https://dermasugita.github.io/ViolaDocs/docs/html/reference/vcf.html
id be1 be2 strand qual svtype
0 G1o chr1:100050 chr1:110000 +- 13.03 BND
1 G1h chr1:110000 chr1:100050 -+ 13.03 BND
2 MG1o chr1:400050 chr1:500003 +- 618.98 BND
3 MG1h chr1:500003 chr1:400050 -+ 618.98 BND
4 DG1o chr1:2000050 chr1:3000090 +- 618.98 BND
5 DG1h chr1:3000090 chr1:2000050 -+ 618.98 BND
6 LG1o chr1:3000050 chr1:4000090 +- 618.98 BND
7 LG1h chr1:4000090 chr1:3000050 -+ 618.98 BND
8 MDG1o chr2:100550 chr2:200505 -+ 13.03 BND
9 MDG1h chr2:200505 chr2:100550 +- 13.03 BND
10 MLG1o chr2:200020 chr2:300015 -+ 13.03 BND
11 MLG1h chr2:300015 chr2:200020 +- 13.03 BND
12 DLG1o chr2:1000020 chr2:2000015 -+ 13.03 BND
13 DLG1h chr2:2000015 chr2:1000020 +- 13.03 BND
14 MDLG1o chr3:100020 chr4:200015 ++ 13.03 BND
15 MDLG1h chr4:200015 chr3:100020 ++ 13.03 BND
Default Behaviour
Now, let’s use viola.merge function.
In [9]: merged_vcf = viola.merge([manta, delly, lumpy, gridss], threshold=100)
In [10]: print(merged_vcf)
INFO=imprecise,svtype,svlen,end,cipos,ciend,cigar,mateid,event,homlen,homseq,svinslen,svinsseq,left_svinsseq,right_svinsseq,contig,bnd_depth,mate_bnd_depth,somatic,somaticscore,junction_somaticscore,inv3,inv5,precise,svmethod,chr2,pos2,pe,mapq,ct,srmapq,inslen,sr,srq,consensus,ce,rdratio,svlenorg,strands,secondary,cipos95,ciend95,su,bd,ev,prpos,prend,suorg,as,asc,asq,asrp,assr,ba,banrp,banrpq,bansr,bansrq,baq,basrp,bassr,bealn,beid,beidh,beidl,benames,bmq,bmqn,bmqx,bpnames,bq,bsc,bscq,bum,bumq,bvf,cas,casq,cirpos,cq,ic,ihompos,insrm,insrmp,insrmrc,insrmro,insrmrt,instaxid,iq,mq,mqn,mqx,ras,rasq,ref,refpair,rf,rp,rpq,rsi,sb,sc,self,si,vf,mergedid,originalid,caller,supportingid,supportingcaller,supportingidcount,supportingcallercount
Documentation of Vcf object ==> https://dermasugita.github.io/ViolaDocs/docs/html/reference/vcf.html
id be1 be2 strand qual svtype
0 manta_M1 chr1:100000 chr1:200001 +- None DEL
1 manta_MD1 chr1:200000 chr1:300001 +- None DEL
2 manta_ML1 chr1:300000 chr1:400001 +- None DEL
3 manta_MG1 chr1:400000 chr1:500001 +- None DEL
4 manta_MDL1 chr2:100001 chr2:200000 -+ None DUP
5 manta_MDG1 chr2:100501 chr2:200500 -+ None DUP
6 manta_MLG1 chr2:200001 chr2:300000 -+ None DUP
7 manta_MDLG1o chr3:100000 chr4:200000 ++ None BND
8 delly_D1 chr1:100030 chr1:600001 +- 180 DEL
9 delly_DL1 chr1:1000000 chr1:2000001 +- 180 DEL
10 delly_DG1 chr1:2000000 chr1:3000001 +- 180 DEL
11 delly_DLG1 chr2:1000001 chr2:2000000 -+ 29 DUP
12 lumpy_L1 chr1:10000 chr1:10101 +- None DEL
13 lumpy_LG1 chr1:3000000 chr1:4000031 +- None DEL
14 gridss_G1o chr1:100050 chr1:110000 +- 13.03 BND
You can see the SV records that are within 100bp of each other have been successfully merged.
By default, for each merged event, only one SV record is kept; regarding the “ML1”, the SV record from Manta is selected and that from Lumpy is discarded.
You can control which SV callers are given priority for selection by changing the order of the list of input Vcf objects.
Here is the example.
In [11]: merged_vcf = viola.merge([delly, lumpy, manta, gridss], threshold=100)
In [12]: print(merged_vcf)
INFO=precise,imprecise,svtype,svmethod,end,chr2,pos2,pe,mapq,ct,cipos,ciend,srmapq,inslen,homlen,sr,srq,consensus,ce,rdratio,svlenorg,somatic,svlen,strands,secondary,event,mateid,cipos95,ciend95,su,bd,ev,prpos,prend,suorg,cigar,homseq,svinslen,svinsseq,left_svinsseq,right_svinsseq,contig,bnd_depth,mate_bnd_depth,somaticscore,junction_somaticscore,inv3,inv5,as,asc,asq,asrp,assr,ba,banrp,banrpq,bansr,bansrq,baq,basrp,bassr,bealn,beid,beidh,beidl,benames,bmq,bmqn,bmqx,bpnames,bq,bsc,bscq,bum,bumq,bvf,cas,casq,cirpos,cq,ic,ihompos,insrm,insrmp,insrmrc,insrmro,insrmrt,instaxid,iq,mq,mqn,mqx,ras,rasq,ref,refpair,rf,rp,rpq,rsi,sb,sc,self,si,vf,mergedid,originalid,caller,supportingid,supportingcaller,supportingidcount,supportingcallercount
Documentation of Vcf object ==> https://dermasugita.github.io/ViolaDocs/docs/html/reference/vcf.html
id be1 be2 strand qual svtype
0 delly_D1 chr1:100030 chr1:600001 +- 180 DEL
1 delly_MD1 chr1:200030 chr1:299921 +- 180 DEL
2 delly_DL1 chr1:1000000 chr1:2000001 +- 180 DEL
3 delly_DG1 chr1:2000000 chr1:3000001 +- 180 DEL
4 delly_MDL1 chr2:100021 chr2:200080 -+ 29 DUP
5 delly_MDG1 chr2:100451 chr2:200480 -+ 29 DUP
6 delly_DLG1 chr2:1000001 chr2:2000000 -+ 29 DUP
7 delly_MDLG1 chr3:100030 chr4:200050 ++ 193 BND
8 lumpy_L1 chr1:10000 chr1:10101 +- None DEL
9 lumpy_ML1 chr1:300020 chr1:400005 +- None DEL
10 lumpy_LG1 chr1:3000000 chr1:4000031 +- None DEL
11 lumpy_MLG1 chr2:200061 chr2:300065 -+ None DUP
12 manta_M1 chr1:100000 chr1:200001 +- None DEL
13 manta_MG1 chr1:400000 chr1:500001 +- None DEL
14 gridss_G1o chr1:100050 chr1:110000 +- 13.03 BND
Look again at “ML1” and you will see that the Lumpy-derived SV record has been chosen. In this case, Lumpy is the 2nd priority and Manta is the 3rd.
INFO relating to VCF Merging
viola.merge adds three INFOs:
mergedid: ID given to merged SV records.
originalid: Original ID of each SV record; the name of the caller is prepended after the merge to avoid conflict between the names of the SV IDs of different callers.
caller: The name of the caller which identified the SV record.
supportingid: IDs of SV records supporting the merged SV record.
supportingcaller: SV callers supporting the variant.
supportingidcount: Number of SV records supporting the merged SV record.
supportingcallercount: Count of SV callers supporting the variant.
As examples, ‘mergedid’ and ‘supportingcallercount’ are shown below.
In [13]: print(merged_vcf.view(custom_infonames=['mergedid', 'supportingcallercount']))
INFO=precise,imprecise,svtype,svmethod,end,chr2,pos2,pe,mapq,ct,cipos,ciend,srmapq,inslen,homlen,sr,srq,consensus,ce,rdratio,svlenorg,somatic,svlen,strands,secondary,event,mateid,cipos95,ciend95,su,bd,ev,prpos,prend,suorg,cigar,homseq,svinslen,svinsseq,left_svinsseq,right_svinsseq,contig,bnd_depth,mate_bnd_depth,somaticscore,junction_somaticscore,inv3,inv5,as,asc,asq,asrp,assr,ba,banrp,banrpq,bansr,bansrq,baq,basrp,bassr,bealn,beid,beidh,beidl,benames,bmq,bmqn,bmqx,bpnames,bq,bsc,bscq,bum,bumq,bvf,cas,casq,cirpos,cq,ic,ihompos,insrm,insrmp,insrmrc,insrmro,insrmrt,instaxid,iq,mq,mqn,mqx,ras,rasq,ref,refpair,rf,rp,rpq,rsi,sb,sc,self,si,vf,mergedid,originalid,caller,supportingid,supportingcaller,supportingidcount,supportingcallercount
Documentation of Vcf object ==> https://dermasugita.github.io/ViolaDocs/docs/html/reference/vcf.html
id be1 be2 strand qual svtype mergedid_0 supportingcallercount_0
0 delly_D1 chr1:100030 chr1:600001 +- 180 DEL 0 1
1 delly_MD1 chr1:200030 chr1:299921 +- 180 DEL 1 2
2 delly_DL1 chr1:1000000 chr1:2000001 +- 180 DEL 2 2
3 delly_DG1 chr1:2000000 chr1:3000001 +- 180 DEL 3 2
4 delly_MDL1 chr2:100021 chr2:200080 -+ 29 DUP 4 3
5 delly_MDG1 chr2:100451 chr2:200480 -+ 29 DUP 5 3
6 delly_DLG1 chr2:1000001 chr2:2000000 -+ 29 DUP 6 3
7 delly_MDLG1 chr3:100030 chr4:200050 ++ 193 BND 7 4
8 lumpy_L1 chr1:10000 chr1:10101 +- None DEL 8 1
9 lumpy_ML1 chr1:300020 chr1:400005 +- None DEL 9 2
10 lumpy_LG1 chr1:3000000 chr1:4000031 +- None DEL 10 2
11 lumpy_MLG1 chr2:200061 chr2:300065 -+ None DUP 11 3
12 manta_M1 chr1:100000 chr1:200001 +- None DEL 12 1
13 manta_MG1 chr1:400000 chr1:500001 +- None DEL 13 2
14 gridss_G1o chr1:100050 chr1:110000 +- 13.03 BND 14 1
Now, in order to increase the accuracy of SV detection, let’s get SVs detected by more than one SV caller. To do this, just select the ones with ‘supportingcallercount’ greater than or equal to 2.
In [14]: filtered_vcf = merged_vcf.filter('supportingcallercount >= 2')
In [15]: print(filtered_vcf.view(custom_infonames=['mergedid', 'supportingcallercount']))
INFO=precise,imprecise,svtype,svmethod,end,chr2,pos2,pe,mapq,ct,cipos,ciend,srmapq,inslen,homlen,sr,srq,consensus,ce,rdratio,svlenorg,somatic,svlen,strands,secondary,event,mateid,cipos95,ciend95,su,bd,ev,prpos,prend,suorg,cigar,homseq,svinslen,svinsseq,left_svinsseq,right_svinsseq,contig,bnd_depth,mate_bnd_depth,somaticscore,junction_somaticscore,inv3,inv5,as,asc,asq,asrp,assr,ba,banrp,banrpq,bansr,bansrq,baq,basrp,bassr,bealn,beid,beidh,beidl,benames,bmq,bmqn,bmqx,bpnames,bq,bsc,bscq,bum,bumq,bvf,cas,casq,cirpos,cq,ic,ihompos,insrm,insrmp,insrmrc,insrmro,insrmrt,instaxid,iq,mq,mqn,mqx,ras,rasq,ref,refpair,rf,rp,rpq,rsi,sb,sc,self,si,vf,mergedid,originalid,caller,supportingid,supportingcaller,supportingidcount,supportingcallercount
Documentation of Vcf object ==> https://dermasugita.github.io/ViolaDocs/docs/html/reference/vcf.html
id be1 be2 strand qual svtype mergedid_0 supportingcallercount_0
0 delly_MD1 chr1:200030 chr1:299921 +- 180 DEL 1 2
1 delly_DL1 chr1:1000000 chr1:2000001 +- 180 DEL 2 2
2 delly_DG1 chr1:2000000 chr1:3000001 +- 180 DEL 3 2
3 delly_MDL1 chr2:100021 chr2:200080 -+ 29 DUP 4 3
4 delly_MDG1 chr2:100451 chr2:200480 -+ 29 DUP 5 3
5 delly_DLG1 chr2:1000001 chr2:2000000 -+ 29 DUP 6 3
6 delly_MDLG1 chr3:100030 chr4:200050 ++ 193 BND 7 4
7 lumpy_ML1 chr1:300020 chr1:400005 +- None DEL 9 2
8 lumpy_LG1 chr1:3000000 chr1:4000031 +- None DEL 10 2
9 lumpy_MLG1 chr2:200061 chr2:300065 -+ None DUP 11 3
10 manta_MG1 chr1:400000 chr1:500001 +- None DEL 13 2
Export VCF file
Finally, let’s export the Vcf object filtered by ‘supportingcallercount’ as a VCF file.
filtered_vcf.to_vcf('/path/to/the/output.vcf')