Csq Doesn't Handle Chromosome Names Correctly

Introduction

The csq command in bcftools annotates variants with their predicted consequences. It has been reported that csq does not handle chromosome names correctly, and that the problem appears when the GFF file contains CDS or exon entries. This article walks through the report and explores possible solutions.

Background

csq is a tool that uses GFF files to annotate variants. It takes a VCF file as input and outputs a new VCF file with additional information about the consequences of the variants. The GFF file is used to provide information about the genomic features, such as genes, transcripts, and exons.

The Issue

The reported failure depends on what the GFF file contains. With only gene and mRNA features, csq runs to completion. Once CDS or exon entries are added, the run ends with an error message indicating that the sequence "chromosome_6" was not found.

Example Use Cases

To demonstrate the issue, we will use two gzip-compressed GFF files: annot_works.gz and annot_broken.gz. The first file contains a gene and an mRNA feature, while the second contains the same gene and mRNA plus four CDS and four exon features.

$ zcat annot_works.gz
Chromosome_6    .       gene    502788  503671  .       +       .       ID=gene_10623;Name=jgi.p|Magor1|10623;used=1
Chromosome_6    .       mRNA    502788  503671  .       +       .       ID=mRNA_10623;Parent=gene_10623;biotype=protein_coding;used=1

$ zcat annot_broken.gz
Chromosome_6    .       gene    502788  503671  .       +       .       ID=gene_10623;Name=jgi.p|Magor1|10623;used=1
Chromosome_6    .       mRNA    502788  503671  .       +       .       ID=mRNA_10623;Parent=gene_10623;biotype=protein_coding;used=1
Chromosome_6    .       CDS     502788  502990  .       +       0       Parent=mRNA_10623
Chromosome_6    .       CDS     503047  503396  .       +       1       Parent=mRNA_10623
Chromosome_6    .       CDS     503485  503527  .       +       2       Parent=mRNA_10623
Chromosome_6    .       CDS     503593  503671  .       +       1       Parent=mRNA_10623
Chromosome_6    .       exon    502788  502990  .       +       .       Parent=mRNA_10623
Chromosome_6    .       exon    503047  503396  .       +       .       Parent=mRNA_10623
Chromosome_6    .       exon    503485  503527  .       +       .       Parent=mRNA_10623
Chromosome_6    .       exon    503593  503671  .       +       .       Parent=mRNA_10623
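
To confirm what each annotation file actually contains, it can help to tally features per sequence name and feature type. The following is a minimal sketch, not part of bcftools: the function name gff_summary is hypothetical, and it assumes a gzip-compressed, tab-separated GFF3 file with nine columns per feature line.

```shell
# gff_summary FILE.gz: count features per (sequence name, feature type).
# Hypothetical helper; assumes gzip-compressed, tab-separated GFF3 input.
# Comment lines and lines with fewer than nine columns are skipped.
gff_summary() {
    gzip -dc "$1" |
        awk -F'\t' '!/^#/ && NF >= 9 { n[$1 "\t" $3]++ }
                    END { for (k in n) print k "\t" n[k] }' |
        sort
}

# Usage (file name taken from the example above):
#   gff_summary annot_broken.gz
```

Applied to annot_broken.gz as listed above, this should report one gene, one mRNA, four CDS and four exon lines, all under Chromosome_6.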

Expected Behavior

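Both files describe the same gene on Chromosome_6, so both runs should produce an annotated VCF. With annot_works.gz, csq parses the annotation and writes annotated records; because that file has no exon or CDS entries, every variant inside the transcript is reported as intron. With annot_broken.gz, the parser emits warnings about the GFF features, and the run is reported to fail instead of producing the expected annotation.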

$ bcftools csq -f genome.fasta.gz -g annot_works.gz -Ov geno.vcf.gz --dump-gff dump_works.gz
Parsing annot_works.gz ...
Indexed 1 transcripts, 0 exons, 0 CDSs, 0 UTRs
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=Chromosome_1,length=7978604>
##contig=<ID=Chromosome_2,length=8319966>
##contig=<ID=Chromosome_3,length=6606598>
##contig=<ID=Chromosome_4,length=5546968>
##contig=<ID=Chromosome_5,length=4490059>
##contig=<ID=Chromosome_6,length=4133993>
##contig=<ID=Chromosome_7,length=3415785>
##source=GenotypeGVCFs
##source=HaplotypeCaller
##bcftools/csqVersion=1.21+htslib-1.21
##bcftools/csqCommand=csq -f genome.fasta.gz -g annot_works.gz -Ov --dump-gff dump_works.gz geno.vcf.gz; Date=Tue Mar 11 06:51:00 2025
##INFO=<ID=BCSQ,Number=.,Type=String,Description="Haplotype-aware consequence annotation from BCFtools/csq, see http://samtools.github.io/bcftools/howtos/csq-calling.html for details. Format: Consequence|gene|transcript|biotype|strand|amino_acid_change|dna_change">
##FORMAT=<ID=BCSQ,Number=.,Type=Integer,Description="Bitmask of indexes to INFO/BCSQ, with interleaved first/second haplotype. Use \"bcftools query -f'[%CHROM\t%POS\t%SAMPLE\t%TBCSQ\n]'\" to translate.">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1
Calling...
Chromosome_6    502802  .       ATTG    A       1869.01 .       BCSQ=intron|jgi.p|Magor1|10623||protein_coding  GT      1
Chromosome_6    503050  .       C       T       1912.04 .       BCSQ=intron|jgi.p|Magor1|10623||protein_coding  GT      1
Chromosome_6    503088  .       G       A       2135.04 .       BCSQ=intron|jgi.p|Magor1|10623||protein_coding  GT      1
Chromosome_6    503090  .       A       G       2313.04 .       BCSQ=intron|jgi.p|Magor1|10623||protein_coding  GT      1
Chromosome_6    503106  .       A       G       2118.04 .       BCSQ=intron|jgi.p|Magor1|10623||protein_coding  GT      1
Chromosome_6    503155  .       T       G       1452.04 .       BCSQ=intron|jgi.p|Magor1|10623||protein_coding  GT      1

$ bcftools csq -f genome.fasta.gz -g annot_broken.gz -Ov geno.vcf.gz --dump-gff dump_broken.gz
Parsing annot_broken.gz ...
Warning: Ignoring GFF feature with unknown phase .. Chromosome_6        .       exon    502788  502990  .       +       .       Parent=mRNA_10623
Indexed 1 transcripts, 4 exons, 4 CDSs, 0 UTRs
Warning: 3 warnings were suppressed, increase verbosity to see them all
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=Chromosome_1,length=7978604>
##contig=<ID=Chromosome_2,length=8319966>
##contig=<ID=Chromosome_3,length=6606598>
##contig=<ID=Chromosome_4,length=5546968>
##contig=<ID=Chromosome_5,length=4490059>
##contig=<ID=Chromosome_6,length=4133993>
##contig=<ID=Chromosome_7,length=3415785>
##source=GenotypeGVCFs
##source=HaplotypeCaller
##bcftools/csqVersion=1.21+htslib-1.21
##bcftools/csqCommand=csq -f genome.fasta.gz -g annot_broken.gz -Ov --dump-gff dump_broken.gz geno.vcf.gz; Date=Tue Mar 11 06:51:11 2025
##INFO=<ID=BCSQ,Number=.,Type=String,Description="Haplotype-aware consequence annotation from BCFtools/csq, see http://samtools.github.io/bcftools/howtos/csq-calling.html for details. Format: Consequence|gene|transcript|biotype|strand|amino_acid_change|dna_change">
##FORMAT=<ID=BCSQ,Number=.,Type=Integer

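
The parser warning in the log above concerns the phase column (column 8). In GFF3, CDS features are expected to carry a phase of 0, 1 or 2, while other feature types use ".". As a rough aid, the sketch below lists CDS lines whose phase is anything else. The function name bad_phase_cds is hypothetical, and the check is independent of how csq itself validates phase.

```shell
# bad_phase_cds FILE.gz: print CDS lines whose phase (column 8) is not 0, 1 or 2.
# Hypothetical helper; assumes gzip-compressed, tab-separated GFF3 input.
bad_phase_cds() {
    gzip -dc "$1" |
        awk -F'\t' '!/^#/ && $3 == "CDS" && $8 !~ /^[012]$/'
}

# Usage (file name taken from the example above):
#   bad_phase_cds annot_broken.gz
```

An empty result means every CDS line has a valid phase, so any remaining phase warnings relate to other feature types.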
**csq Doesn't Handle Chromosome Names Correctly: Q&A**
=====================================================

**Q: What is the issue with csq?**
------------------------------

A: `csq` reportedly does not handle chromosome names correctly. The same annotation works when the GFF file contains only gene and mRNA features, but fails once CDS or exon entries are added.

**Q: What are the symptoms of this issue?**
-----------------------------------------

A: The symptoms of this issue include:

* `csq` failing to parse the GFF file correctly
* An error message indicating that the sequence "chromosome_6" was not found
* No output from `csq` when CDS or exon entries are present in the GFF file

**Q: What are the possible causes of this issue?**
------------------------------------------------

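A: The report does not pin down a root cause. Candidates worth ruling out include:

* Chromosome names that differ between the GFF file, the reference FASTA, and the VCF, including differences in case (the error quotes "chromosome_6", while the example files use "Chromosome_6")
* A truncated or corrupted GFF file
* An outdated `bcftools` build (`csq` is a `bcftools` subcommand, so the two always share one version)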

**Q: How can I troubleshoot this issue?**
------------------------------------------

A: To troubleshoot this issue, you can try the following:

* Compare the chromosome names in the GFF file with those in the VCF header and the reference FASTA; they must match exactly
* Verify that the GFF file is complete and not corrupted
* Run `bcftools --version` to see which `bcftools` and htslib versions are in use
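
The name comparison can be sketched with standard tools. This is a hedged example rather than a `bcftools` feature: all function names are hypothetical, inputs are assumed to be gzip- or bgzip-compressed, and VCF names are taken from `##contig` header lines only.

```shell
# Print the distinct sequence names used in each input (hypothetical helpers).
gff_names()   { gzip -dc "$1" | awk -F'\t' '!/^#/ && NF >= 9 { print $1 }' | sort -u; }
fasta_names() { gzip -dc "$1" | awk '/^>/ { sub(/^>/, ""); print $1 }' | sort -u; }
vcf_names()   { gzip -dc "$1" | sed -n 's/^##contig=<ID=\([^,>]*\).*/\1/p' | sort -u; }

# missing_names A B: print names listed in A (one per line) that are absent
# from B, flagging names that differ from an entry in B only by letter case.
missing_names() {
    awk 'FILENAME == ARGV[1] { seen[$0] = 1; lower[tolower($0)] = $0; next }
         !($0 in seen) {
             key = tolower($0)
             hint = (key in lower) ? " (differs only in case from " lower[key] ")" : ""
             print $0 hint
         }' "$2" "$1"
}

# Usage (file names taken from the commands shown earlier):
#   gff_names annot_broken.gz > gff.txt
#   vcf_names geno.vcf.gz     > vcf.txt
#   fasta_names genome.fasta.gz > fasta.txt
#   missing_names gff.txt vcf.txt
#   missing_names gff.txt fasta.txt
```

No output from `missing_names` means every name in the first list also appears in the second.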

**Q: How can I fix this issue?**
---------------------------

A: To fix this issue, you can try the following:

* Rename the chromosome names in the GFF file so they match the VCF and the reference FASTA
* Regenerate the GFF file with a tool that writes valid GFF3
* Update `bcftools` to the latest release, which also updates `csq`
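
Renaming sequences in a GFF file can be sketched as a rewrite of column 1. The example below is a minimal sketch under stated assumptions: the function name `rename_gff_seqs` and the mapping file are hypothetical, the mapping is a two-column, tab-separated list of old and new names, and the GFF is gzip-compressed. On the VCF side, `bcftools annotate --rename-chrs` serves a similar purpose.

```shell
# rename_gff_seqs MAP GFF.gz: rewrite column 1 of each feature line using MAP
# (old name <TAB> new name). Lines with unmapped names pass through unchanged.
# Comment and directive lines (including ##sequence-region) are NOT rewritten.
rename_gff_seqs() {
    gzip -dc "$2" |
        awk -F'\t' -v OFS='\t' -v map="$1" '
            BEGIN {
                while ((getline line < map) > 0) {
                    split(line, f, "\t")
                    new[f[1]] = f[2]
                }
            }
            !/^#/ && ($1 in new) { $1 = new[$1] }
            { print }'
}

# Usage (map.tsv and the output name are hypothetical):
#   rename_gff_seqs map.tsv annot_broken.gz | gzip > annot_renamed.gz
```

Writing to a new file rather than editing in place keeps the original annotation available for comparison.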

**Q: What are the implications of this issue?**
------------------------------------------------

A: The implications of this issue include:

* Inaccurate or incomplete annotation of variants
* Coding consequences that go unreported, for example variants labelled as intron when exon and CDS records are not used
* Downstream analyses that depend on the annotation cannot proceed

**Q: How can I prevent this issue in the future?**
------------------------------------------------

A: To prevent this issue in the future, you can try the following:

* Verify that the chromosome names in the GFF file match the VCF and the reference FASTA before running `csq`
* Generate the GFF file with a tool that writes valid GFF3
* Keep `bcftools` up to date, since `csq` ships as part of it

**Q: Where can I find more information about this issue?**
---------------------------------------------------------

A: For more information about this issue, you can refer to the following resources:

* The `bcftools` manual
* The `csq` how-to page referenced in the VCF header: http://samtools.github.io/bcftools/howtos/csq-calling.html
* Online forums and communities dedicated to bioinformatics and genomics

**Conclusion**
----------

When `csq` does not handle chromosome names correctly, variant annotation can end up inaccurate or incomplete. Knowing the symptoms, the candidate causes, and the implications makes it easier to troubleshoot and fix the problem. Checking that sequence names agree across the GFF, VCF, and FASTA files before running `csq` helps prevent it in the future.