-
Notifications
You must be signed in to change notification settings - Fork 13
Add support for generating CSI file from bgzip compressed VCF. #190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
athos
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on another feature! I just added some comments on relatively superficial matters.
Codecov Report
@@ Coverage Diff @@
## master #190 +/- ##
==========================================
+ Coverage 86.67% 86.81% +0.14%
==========================================
Files 76 77 +1
Lines 6057 6205 +148
Branches 502 516 +14
==========================================
+ Hits 5250 5387 +137
+ Misses 305 302 -3
- Partials 502 516 +14
Continue to review full report at Codecov.
|
9ae9ce7 to
803b2d5
Compare
|
Thank you for reviewing. |
alumi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates! I added a few more comments.
|
Sorry for late, I fixed. |
cf1ce51 to
cddf8c2
Compare
alumi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like lein test :all is failing 🔥
|
I'm so sorry,I will fix test cases same as others. |
athos
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some more trivial comments.
src/cljam/io/csi.clj
Outdated
| [bin (->> offsets | ||
| (map #(chunk/->Chunk (:file-beg %) (:file-end %))) | ||
| reverse)])) | ||
| sort |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| sort |
src/cljam/io/csi.clj
Outdated
| [offsets shift depth] | ||
| (let [chr-offsets (->> (partition-by :chr offsets) | ||
| (map #(vector (:chr (first %)) %)) | ||
| sort |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| sort |
src/cljam/io/csi.clj
Outdated
| from offsets {:file-beg :file-end :beg :end :chr }" | ||
| [offsets shift depth] | ||
| (let [chr-offsets (->> (partition-by :chr offsets) | ||
| (map #(vector (:chr (first %)) %)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two lines seem to be equivalent to (group-by :chr offsets) in this case.
src/cljam/algo/vcf_indexer.clj
Outdated
|
|
||
| (defn create-index | ||
| "Creates a CSI index file from the VCF file." | ||
| [in-bgziped-vcf out-csi {:keys [shift depth] :or {shift 14 depth 5}}] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/bgziped/bgzipped/
d55f8f2 to
79451d1
Compare
|
Thank you for pointing. I made the following changes.
|
alumi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Run formatter and linters every time before you push your commits
- It seems like generated CSI files are not compatible with bcftools
- Please make sure that both
cljam.io.vcf.read-variants-randomlyandbcftools viewworks as well
- Please make sure that both
- Add a test case for VCF files with skipped contigs as in #190 (comment)
79451d1 to
3bf6899
Compare
|
Sorry, I applied linter. |
e655e42 to
4fd65bd
Compare
|
Bcftools (hts_lib) will generate an error if some tabix data (such as format,col_seq , names and so on) is not inserted into CSI aux. |
alumi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! The indexing looks good. Added some more trivial comments.
5d3a3aa to
985fed3
Compare
|
Thank you for pointing.I fixed them. |
alumi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the changes. I've added some suggestions around reg->bin. Would you check them, please?
src/cljam/io/sam/util.clj
Outdated
| (defn compute-bin | ||
| "Returns indexing bin based on alignment start and end." | ||
| [aln] | ||
| (let [beg (dec (:pos aln)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| (let [beg (dec (:pos aln)) | |
| (let [beg (:pos aln) |
src/cljam/io/bam_index/writer.clj
Outdated
| (let [bin (util-bin/reg->bin (.pos aln) (inc (.end aln)) | ||
| linear-index-shift linear-index-depth) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| (let [bin (util-bin/reg->bin (.pos aln) (inc (.end aln)) | |
| linear-index-shift linear-index-depth) | |
| (let [bin (util-bin/reg->bin | |
| (.pos aln) (.end aln) linear-index-shift linear-index-depth) |
src/cljam/io/util/bin.clj
Outdated
| beg (Math/min max-pos (Math/max 0 (dec beg))) | ||
| end (Math/min max-pos (Math/max 0 (dec end)))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| beg (Math/min max-pos (Math/max 0 (dec beg))) | |
| end (Math/min max-pos (Math/max 0 (dec end)))] | |
| beg (dec (Math/max 1 (Math/min max-pos beg))) | |
| end (dec (Math/max 1 (Math/min max-pos end)))] |
test/cljam/io/util/bin_test.clj
Outdated
| (deftest reg->bin-test | ||
| (is (= (util-bin/reg->bin 1 1 14 6) 37449)) | ||
| (is (= (util-bin/reg->bin 1 1 14 7) 299593)) | ||
| (is (= (util-bin/reg->bin 1 32769 15 6) 4681)) | ||
| (is (= (util-bin/reg->bin 1 2 0 2) 1)) | ||
| (is (= (util-bin/reg->bin 240877561 240877568 14 6) 52150))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| (deftest reg->bin-test | |
| (is (= (util-bin/reg->bin 1 1 14 6) 37449)) | |
| (is (= (util-bin/reg->bin 1 1 14 7) 299593)) | |
| (is (= (util-bin/reg->bin 1 32769 15 6) 4681)) | |
| (is (= (util-bin/reg->bin 1 2 0 2) 1)) | |
| (is (= (util-bin/reg->bin 240877561 240877568 14 6) 52150))) | |
| (deftest reg->bin-test | |
| (testing "BAI-compatible" | |
| (are [?start ?end ?bin] | |
| (= ?bin (util-bin/reg->bin ?start ?end 14 5)) | |
| 0 0 4681 | |
| 0 1 4681 | |
| 1 1 4681 | |
| 1 16384 4681 | |
| 16384 16384 4681 | |
| 16384 16385 585 | |
| 16385 16385 4682 | |
| 131072 131072 4688 | |
| 131072 131073 73 | |
| 8388608 8388609 1 | |
| 67108864 67108865 0 | |
| 536870912 536870912 37448 | |
| 536870912 536870913 37448)) | |
| (testing "1-by-1" | |
| (are [?start ?end ?bin] | |
| (= ?bin (util-bin/reg->bin ?start ?end 0 2)) | |
| 0 1 9 | |
| 1 1 9 | |
| 2 2 10 | |
| 1 2 1 | |
| 1 8 1 | |
| 1 9 0 | |
| 8 9 0 | |
| 9 9 17 | |
| 63 64 8 | |
| 64 64 72 | |
| 64 64 72)) | |
| (testing "various min-shfits and depths" | |
| (is (= (util-bin/reg->bin 1 1 14 6) 37449)) | |
| (is (= (util-bin/reg->bin 1 1 14 7) 299593)) | |
| (is (= (util-bin/reg->bin 1 32769 15 6) 4681)) | |
| (is (= (util-bin/reg->bin 240877561 240877568 14 6) 52150)))) |
|
Thank you,I fixed. |
alumi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
I found a bug in the random reading of VCF files. Could you fix it in another PR?
;; with test-resources/vcf/test-chr-skipped.vcf.gz.csi
(with-open [r (vcf/reader "test-resources/vcf/test-chr-skipped.vcf.gz")]
(doall (vcf/read-variants-randomly r {:chr "chr3"} {})))
Execution error (NullPointerException) at cljam.io.csi.CSI/get_chunks (csi.clj:19).|
@niyarin Please don't forget to squash the commits before merging this branch 😀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@niyarin Thanks for your persistent effort to improve the code for this feature! 🙏Now the implementation works fine 👍
Just one more thing, can you add a mention to CSI, as well as Tabix, in the docstring of cljam.io.vcf/read-variants-randomly as another supported index format? The current description looks to me like it only supports the Tabix format. Also, you may want to avoid using the name tabix-data in the body of that function.
Once you make a commit for the above point and squash all the commits into one (or a few), I will merge the patch soon.
2c71be5 to
2444372
Compare
|
Thank you,I added a comment that VCF can also use CSI and renamed 'tabix-data' to 'index-data'. |
|
🎉 🎉 🎉 |
Add support for generating CSI file from VCF.