BioRecordsProcessing
In BioRecordsProcessing, records are processed using a Pipeline, which is built from a source (producing records), a user-defined function that processes each record, and a sink that stores the output of the processing function. The pipeline can then be run.
In this example, a FASTA file is read from disk, the sequence is extracted from each record, and the sequences are collected in an array:
using BioRecordsProcessing, FASTX, BioSequences

# `filepath` is the path to a FASTA file on disk
p = Pipeline(
    Reader(FASTX.FASTA, File(filepath)),
    record -> begin
        sequence(LongDNA{4}, record)
    end,
    Collect(LongDNA{4}),
)
run(p)
# output
2-element Vector{LongSequence{DNAAlphabet{4}}}:
CTTGGCATACTCAAACTCTT
CTTGGCATACTCAAACTCTT
Combining different sources and sinks with a user-defined processing function makes it possible to handle many common cases of biological record processing.
Conventions
- If the processing function returns nothing, the record is not written to the sink, which makes it possible to filter out records.
- When writing a file to disk, the sink gets the filename from the source, so the source needs to have a filename in this case.
- Paired records are passed as a tuple to the processing function, and this function should generally return a tuple of records.
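The first convention can be illustrated with a small in-memory pipeline. This is a minimal sketch, not taken from the package itself: the record identifiers, sequences, and the length threshold of 10 are made up, and it assumes that FASTX.FASTA.Record can be constructed from a description string and a sequence string.

```julia
using BioRecordsProcessing, FASTX, BioSequences

# Two in-memory FASTA records; identifiers and sequences are made up.
records = [
    FASTX.FASTA.Record("short", "ACGT"),
    FASTX.FASTA.Record("long", "CTTGGCATACTCAAACTCTT"),
]

p = Pipeline(
    Buffer(records),
    # Returning `nothing` drops the record; returning the record keeps it.
    record -> length(sequence(LongDNA{4}, record)) >= 10 ? record : nothing,
    Collect(FASTX.FASTA.Record),
)

# Only records of at least 10 bases should reach the sink.
kept = run(p; verbose = false)
```

Because the sink is a Collect, the result is an in-memory vector, so no filename is needed on the source here.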
Sources
BioRecordsProcessing.Reader (Type)
Reader(record_module::Module, file_provider::F) where {F <: AbstractFileProvider}
Read a file or a directory on disk and produce records of type record_module.Record. The second argument can be a File or a Directory.
If a string is passed, the second argument defaults to a File.
Reader(FASTX.FASTA, "test.fa")
Reader(FASTX.FASTA, File("test.fa"))
Reader(FASTX.FASTQ, Directory("data/", "*.fastq"))
BioRecordsProcessing.Buffer (Type)
Buffer(data::Vector{T}; filename = "")
Use the array data as a source of records. An optional filename can be provided when a Writer is used as a sink.
File Providers
Reader can take one of these file providers as an argument:
BioRecordsProcessing.File (Type)
File(filename; second_in_pair = nothing)
For paired files, a function that takes the filename of the first file in the pair and returns the filename of the second file can be provided. For example, one can use replace or a dictionary, e.g. second_in_pair = f1 -> replace(f1, "_1" => "_2").
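The two approaches mentioned above (replace and a dictionary) can be sketched in plain Julia. The file names and the helper names by_rule, by_table, and by_suffix are hypothetical; the last variant is an extra suggestion, not something the package requires.

```julia
# Rule-based mapping, as in the text above.
by_rule = f1 -> replace(f1, "_1" => "_2")

# Explicit lookup table; the file names are hypothetical.
mates = Dict(
    "sampleA_1.fastq" => "sampleA_2.fastq",
    "sampleB_1.fastq" => "sampleB_2.fastq",
)
by_table = f1 -> mates[f1]

# `replace` rewrites every occurrence of "_1", so a name such as
# "run_1_1.fastq" would become "run_2_2.fastq". Anchoring the pattern
# to the end of the name changes only the final pair marker.
by_suffix = f1 -> replace(f1, r"_1(\.fastq)$" => s"_2\1")
```

Any of these functions could then be passed as the second_in_pair keyword. The dictionary form throws a KeyError for a file name that is not listed, which can be a useful safeguard against unexpected inputs.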
BioRecordsProcessing.Directory (Type)
Directory(directory::String, glob_pattern::String; second_in_pair = nothing)
List all files matching the glob_pattern (see Glob.jl) in directory. For paired files, a function that takes the filename of the first file in the pair and returns the filename of the second file can be provided.
Directory(input_directory, "*.fastq")
Sinks
BioRecordsProcessing.Writer (Type)
Writer(record_module::Module, output_directory::String;
    suffix = "",
    paired = false,
    second_in_pair = nothing,
    extension = nothing,
    header = nothing
)
Write the output of the processing function to a file. The first argument is the module that owns the Record type (e.g. FASTX.FASTA, VCF, ...), and the second is the output directory. The filename is determined by the source, and an optional suffix can be added to it. If the output type differs from the input type (e.g. SAM to BAM), the extension (".bam") should be specified. For SAM and BAM, a SAM.Header should be provided.
To avoid overwriting existing files, the pipeline will check that the output file is different from the input file.
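As an illustration, a filtered copy of a FASTA file can be written to a separate directory. This is a sketch under assumptions: the file name reads.fa, its contents, the ".filtered" suffix, and the length threshold are all made up, and the exact name of the output file is not asserted here.

```julia
using BioRecordsProcessing, FASTX, BioSequences

# Create a small input file in a temporary directory (contents are made up).
input_dir = mktempdir()
output_dir = mktempdir()
input_path = joinpath(input_dir, "reads.fa")
write(input_path, ">seq1\nCTTGGCATACTCAAACTCTT\n>seq2\nACGT\n")

p = Pipeline(
    Reader(FASTX.FASTA, File(input_path)),
    # Keep only sequences of at least 10 bases.
    record -> length(sequence(LongDNA{4}, record)) >= 10 ? record : nothing,
    # The output filename is derived from the source filename plus the suffix.
    Writer(FASTX.FASTA, output_dir; suffix = ".filtered"),
)
output = run(p; verbose = false)
```

Writing to a different directory than the input keeps the output path distinct from the input path, which is consistent with the overwrite check described above.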
BioRecordsProcessing.Collect (Type)
Collect(T::DataType; paired=false)
Write the output of the processing function into a vector in memory. The type of the output has to be provided. For paired files, the option paired needs to be set to true; the output will then consist of a vector of tuples.
Pipeline
BioRecordsProcessing.Pipeline (Type)
Pipeline(source, processor, sink)
Pipeline(source, sink)
Build a Pipeline. If processor is omitted, it defaults to identity.
Base.run (Function)
run(p::Pipeline; max_records = Inf, verbose = true)
Run the pipeline. Processing stops after max_records records have been read. Depending on the sink, run returns either a path to the output file or an array.
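The max_records option can be handy for trying a pipeline on a small subset before a full run. The following sketch uses the two-argument Pipeline form (no processing function) on made-up in-memory records; it assumes that FASTX.FASTA.Record can be constructed from a description string and a sequence string.

```julia
using BioRecordsProcessing, FASTX

# Five made-up records held in memory.
records = [FASTX.FASTA.Record("seq$i", "ACGT") for i in 1:5]

# No processing function: records are passed to the sink unchanged.
p = Pipeline(Buffer(records), Collect(FASTX.FASTA.Record))

# Stop early instead of consuming the whole source.
subset = run(p; max_records = 2, verbose = false)
```

With a Collect sink the returned value is a vector, so subset can be inspected directly.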