On-the-fly compression in R: using vroom for genomic data processing - CSDN Blog

Article link: https://blog.csdn.net/qq_33408922/article/details/133715937

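Creating a compressed file on the fly simply means writing results directly into a compressed file, rather than writing a plain file first and compressing it afterwards.

In R, the vroom package can do this.

Both the line-oriented writer vroom_write_lines and the table writer vroom_write create the compressed file on the fly by recognizing the extension of the output file name.

The most common compression format in bioinformatics is .gz, so the test here uses the Arabidopsis thaliana genome sequence.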
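The genome example in this post only exercises vroom_write_lines, so here is a minimal sketch of the table case with vroom_write. The built-in mtcars data set and the temporary file path are chosen purely for illustration and are not from the original post.

```r
library(vroom)

# Temporary output path (illustrative); the .gz suffix is what triggers compression.
out <- tempfile(fileext = ".tsv.gz")

# Write the built-in mtcars data frame as a tab-separated table.
vroom_write(mtcars, file = out, delim = "\t")

# A gzip stream starts with the magic bytes 1f 8b.
magic <- readBin(out, what = "raw", n = 2)
print(magic)

# vroom() decompresses on read, again based on the file suffix.
back <- vroom(out, delim = "\t", show_col_types = FALSE)
print(dim(back))
```

Checking the first two bytes is a quick way to confirm that the file on disk really is gzip data and not plain text with a misleading name.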

Download

axel is also installed on my Windows machine, so I use it here to download the genome directly. The downloaded compressed file is about 35 MB.

library(vroom)

command <- "axel -n 20 https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-57/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz"
system(command = command)

ath_file <- "Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz"

file.size(ath_file)
# [1] 36462703  # about 35 MB
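For readers without axel, the same file can probably be fetched with base R's download.file. The helper below is a hypothetical convenience wrapper (its name and behaviour are assumptions, not part of vroom or of the original post); it skips the download when the file already exists.

```r
# Hypothetical helper: download a file only if it is not already present.
download_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # mode = "wb" writes the bytes unchanged, which matters for .gz files on Windows.
    download.file(url, destfile = destfile, mode = "wb")
  }
  invisible(destfile)
}

url <- "https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-57/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz"

# Local file name, taken from the last component of the URL.
ath_file <- basename(url)
```

Calling download_if_missing(url, ath_file) then performs the actual download of roughly 35 MB. It runs single-threaded, so it may be slower than axel with 20 connections.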

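Reading line by line

vroom_lines reads the data line by line and returns a character vector.

vroom reads tabular data and returns a data frame (classes "spec_tbl_df", "tbl_df", "tbl", "data.frame").

Here the FASTA data is read line by line.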

ath <- vroom_lines(file = ath_file)

class(ath)
# [1] "character"

ath[1:2]
# [1] ">1 dna:chromosome chromosome:TAIR10:1:1:30427671:1 REF"      
# [2] "CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT"
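Because the result is an ordinary character vector, base R string functions can be applied to it directly. As a rough sketch, the snippet below finds the FASTA header lines and sums the sequence length per record. It runs on a tiny made-up vector (toy data, not the real genome) and assumes every header is followed by at least one sequence line.

```r
# Toy FASTA lines standing in for the output of vroom_lines().
fa <- c(
  ">1 dna:chromosome chromosome:TAIR10:1:1:30427671:1 REF",
  "CCCTAAACCC",
  "TAAACC",
  ">2 dna:chromosome",
  "ATCG"
)

# Header lines start with ">".
is_header <- startsWith(fa, ">")

# Record ID = first whitespace-delimited token after ">".
ids <- sub("^>(\\S+).*", "\\1", fa[is_header], perl = TRUE)

# Each line belongs to the most recent header seen so far.
record <- cumsum(is_header)

# Sum the characters of the sequence lines within each record.
seq_lengths <- setNames(
  as.vector(tapply(nchar(fa[!is_header]), record[!is_header], sum)),
  ids
)
print(seq_lengths)
```

On the real data the same steps would apply with ath in place of fa.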

Writing an uncompressed file

The resulting file is about 116 MB.

vroom_write_lines(ath, file = "ath.fa", eol = "\n")
file.size("ath.fa")
# [1] 121662600  # about 116 MB

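Writing a compressed file

The .gz extension is recognized and the compressed file is created on the fly. The resulting file is about 35 MB.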

vroom_write_lines(ath, file = "ath.fa.gz", eol = "\n")
file.size("ath.fa.gz")

# [1] 36461631  # about 35 MB

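As you can see, the output file "ath.fa.gz" with the .gz extension is close in size to the original compressed file and far smaller than the uncompressed "ath.fa". This shows that the compressed file was created on the fly successfully.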
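File size alone does not prove that the content survived. A round-trip check is one way to confirm it. The sketch below uses randomly generated DNA-like lines and temporary files (all illustrative, not from the original post), so it runs without downloading the genome.

```r
library(vroom)

# Synthetic FASTA-like data: one header plus 1000 lines of 60 random bases.
set.seed(1)
lines <- c(
  ">chr_demo",
  replicate(1000, paste(sample(c("A", "C", "G", "T"), 60, replace = TRUE), collapse = ""))
)

plain <- tempfile(fileext = ".fa")
gz <- tempfile(fileext = ".fa.gz")

# Same call, different suffix: only the second file is compressed.
vroom_write_lines(lines, file = plain, eol = "\n")
vroom_write_lines(lines, file = gz, eol = "\n")

# The compressed file should be the smaller of the two.
print(file.size(c(plain, gz)))

# Read the compressed file back and compare it with the input line by line.
back <- vroom_lines(gz)
print(length(back) == length(lines) && all(back == lines))
```

The same comparison could be run on the genome by reading "ath.fa.gz" back with vroom_lines and comparing it with ath.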

Full code

library(vroom)

command <- "axel -n 20 https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-57/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz"
system(command = command)

ath_file <- "Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz"

file.size(ath_file)
# [1] 36462703  # about 35 MB

ath <- vroom_lines(file = ath_file)
class(ath)
# [1] "character"

# Uncompressed
vroom_write_lines(ath, file = "ath.fa", eol = "\n")
file.size("ath.fa")
# [1] 121662600  # about 116 MB

# Compressed
vroom_write_lines(ath, file = "ath.fa.gz", eol = "\n")
file.size("ath.fa.gz")
# [1] 36461631  # about 35 MB