BioRecordsProcessing

In BioRecordsProcessing, records are processed with a Pipeline built from three parts: a source that produces records, a user-defined function that processes them, and a sink that stores the output of the processing function. The pipeline can then be run.

In this example, a FASTA file is read from disk, the sequence is extracted from each record, and the sequences are collected in an array:

using BioRecordsProcessing, FASTX, BioSequences

p = Pipeline(
    Reader(FASTX.FASTA, File(filepath)),
    record -> begin
        sequence(LongDNA{4}, record)
    end,
    Collect(LongDNA{4}),
)
run(p)

# output
2-element Vector{LongSequence{DNAAlphabet{4}}}:
 CTTGGCATACTCAAACTCTT
 CTTGGCATACTCAAACTCTT

Different combinations of source and sink, together with a user-defined processing function, make it possible to handle many common cases of biological record processing.

Conventions

  • If the processing function returns nothing, the record is not written to the sink, which makes it possible to filter out records (see the sketch after this list).
  • When writing a file to disk, the sink takes the filename from the source, so the source needs to provide a filename in this case.
  • Paired records are passed as a tuple to the processing function, and the function should generally return a tuple of records.
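
For example, here is a minimal sketch of a filtering pipeline, reusing the filepath FASTA file from the example above: records whose sequence is shorter than 20 bases are dropped by returning nothing.

p = Pipeline(
    Reader(FASTX.FASTA, File(filepath)),
    record -> begin
        seq = sequence(LongDNA{4}, record)
        # returning nothing drops this record from the output
        length(seq) < 20 ? nothing : seq
    end,
    Collect(LongDNA{4}),
)
run(p)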

Sources

BioRecordsProcessing.Reader - Type
Reader(record_module::Module, file_provider::F) where {F <: AbstractFileProvider}

Read a file or a directory on disk and produce records of type record_module.Record. The second argument can be a File or a Directory.

If a string is passed, the second argument defaults to File.

Reader(FASTX.FASTA, "test.fa")
Reader(FASTX.FASTA, File("test.fa"))
Reader(FASTX.FASTQ, Directory("data/", "*.fastq"))
BioRecordsProcessing.Buffer - Type
Buffer(data::Vector{T}; filename = "")

Use the array data as a source of records. An optional filename can be provided when a Writer is used as a sink.

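A minimal sketch with an in-memory source (the sequences below are made up for illustration): the records never touch the disk.

using BioRecordsProcessing, BioSequences

p = Pipeline(
    Buffer([dna"ACGTACGT", dna"TTGGCCAA"]),
    seq -> reverse_complement(seq),  # any function of one record will do
    Collect(LongDNA{4}),
)
run(p)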

File Providers

Reader can take one of these file providers as its second argument:

BioRecordsProcessing.File - Type
File(filename; second_in_pair = nothing)

For paired files, a function can be provided that takes the filename of the first file in the pair and returns the filename of the second. For example, one can use replace or a dictionary, e.g. second_in_pair = f1 -> replace(f1, "_1" => "_2").
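
For instance, with hypothetical filenames sample_1.fastq and sample_2.fastq:

File("sample_1.fastq"; second_in_pair = f1 -> replace(f1, "_1" => "_2"))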

BioRecordsProcessing.Directory - Type
Directory(directory::String, glob_pattern::String; second_in_pair = nothing)

List all files matching the glob_pattern (see Glob.jl) in directory. For paired files, a function can be provided that takes the filename of the first file in the pair and returns the filename of the second.

Directory(input_directory, "*.fastq")

Sinks

BioRecordsProcessing.Writer - Type
Writer(record_module::Module, output_directory::String; 
    suffix = "", 
    paired = false, 
    second_in_pair = nothing, 
    extension = nothing, 
    header = nothing
)

Write the output of the processing function into a file. The first argument is the module that owns the Record type (e.g. FASTX.FASTA, VCF, ...), and the second is the output directory. The filename is determined by the source, to which an optional suffix can be added. If the type of the output is different from the type of the input (e.g. SAM to BAM), the extension (".bam") should be specified. For SAM & BAM, a SAM.Header should be provided.

To avoid overwriting existing files, the pipeline will check that the output file is different from the input file.

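As a sketch of a typical use, assuming FASTQ files in a data/ directory and an existing output/ directory: each FASTQ record is converted to a FASTA record, and since the output type differs from the input type the extension is given explicitly.

using BioRecordsProcessing, FASTX

p = Pipeline(
    Reader(FASTX.FASTQ, Directory("data/", "*.fastq")),
    # build a FASTA record from the identifier and sequence of the FASTQ record
    record -> FASTA.Record(identifier(record), sequence(record)),
    Writer(FASTX.FASTA, "output/"; extension = ".fasta"),
)
run(p)
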
BioRecordsProcessing.Collect - Type
Collect(T::DataType; paired=false)

Write the output of the processing function into a vector in memory. The type of the output has to be provided. For paired files, the option paired needs to be set to true; the output will then consist of a vector of tuples.

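A sketch of paired collection, assuming hypothetical paired files sample_1.fastq / sample_2.fastq in a data/ directory: the processing function receives a tuple and returns a tuple, and the collected output is a vector of tuples.

using BioRecordsProcessing, FASTX

p = Pipeline(
    Reader(FASTX.FASTQ, Directory("data/", "*_1.fastq"; second_in_pair = f1 -> replace(f1, "_1" => "_2"))),
    pair -> begin
        r1, r2 = pair
        (r1, r2)  # here the records are returned unchanged
    end,
    Collect(FASTQ.Record; paired = true),
)
run(p)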

Pipeline

Base.run - Function
run(p::Pipeline; max_records = Inf, verbose = true)

Run the pipeline; processing stops after max_records records have been read. Depending on the sink, it returns either a path to the output file or an array.
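
For example, to try a pipeline on the first 100 records of the source only:

run(p; max_records = 100)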
