Field Reference¶
SINA generates a number of named meta-data values for each processed
sequence. By default, all values are written to ARB output files,
while CSV output requires that each meta-data field be specified using
-f/--fields. The fields generated by SINA and the
typical fields present in ARB databases are described below.
Basic Fields¶
The fields below are standard ARB meta-data fields describing each sequence.
-
name
¶
The Id of the sequence. In FASTA input and output, this value is mapped to the first word of the header line.
-
full_name
¶
The textual description of the sequence. In FASTA input and output, this value is mapped to all but the first word of the header line.
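The FASTA mapping described for name and full_name can be sketched as follows. This is only an illustration of the split, not SINA's own parser; the function name split_fasta_header and the sample header are hypothetical.

```python
def split_fasta_header(header: str) -> tuple[str, str]:
    """Split a FASTA header line into (name, full_name)."""
    # Drop the leading '>' marker if present.
    text = header[1:] if header.startswith(">") else header
    # The first whitespace-delimited word becomes name,
    # everything after it becomes full_name.
    parts = text.strip().split(None, 1)
    name = parts[0] if parts else ""
    full_name = parts[1].strip() if len(parts) > 1 else ""
    return name, full_name


# Hypothetical header line, for illustration only.
print(split_fasta_header(">seq1 Escherichia coli 16S ribosomal RNA"))
```

A header consisting of a single word yields an empty full_name in this sketch.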
-
acc
¶
The sequence accession number. This field is relevant for ARB input/output because, together with the start field, it defines the unique identity of the sequence when regenerating sequence names. For sequences read from FASTA and written to ARB, SINA will generate a pseudo-accession as ARB_ followed by an 8-character hexadecimal CRC32 checksum. This matches the behavior of ARB during FASTA import.
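To make the shape of such a pseudo-accession concrete, here is a minimal sketch. The text above does not say which bytes are checksummed or how the hex digits are cased, so the choices below (gap-free uppercase sequence data, the standard zlib CRC32, uppercase digits) are assumptions and the result is not guaranteed to match what SINA or ARB produce.

```python
import zlib


def pseudo_accession(sequence: str) -> str:
    """Build an 'ARB_' + 8 hex digit identifier from sequence data."""
    # ASSUMPTION: the checksum covers the uppercased sequence with
    # gap characters ('-' and '.') removed.
    bases = sequence.upper().replace("-", "").replace(".", "")
    # ASSUMPTION: the standard CRC32 as implemented by zlib.
    checksum = zlib.crc32(bases.encode("ascii")) & 0xFFFFFFFF
    # ASSUMPTION: uppercase hex digits, zero padded to 8 characters.
    return f"ARB_{checksum:08X}"


# Hypothetical input sequence, for illustration only.
print(pseudo_accession("ACGU--ACGU"))
```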
-
version
¶
The version part of the accession number reference.
-
start
¶
The start position of the gene sequence within the sequence referenced by the accession number.
-
stop
¶
The stop position of the gene sequence within the sequence referenced by the accession number.
SINA specific fields¶
-
align_quality_slv
¶
The alignment “quality”. The alignment score, normalized to remove weighting effects and scaled to an integer between 0 and 100. If the alignment for the sequence was copied from an identical match to a reference sequence, the value is set to 100.
-
align_cutoff_head_slv
¶
The number of unaligned basepairs at the beginning of the sequence.
-
align_cutoff_tail_slv
¶
The number of unaligned basepairs at the end of the sequence.
-
aligned_slv
¶
The time and date at which the sequence was aligned.
-
align_startpos_slv
¶
The position of the first base of the sequence within the reference alignment.
-
align_stoppos_slv
¶
The position of the last base of the sequence within the reference alignment.
-
align_ident_slv
¶
The highest fractional identity of the aligned sequence with any of the used reference sequences. The value is computed using optimistic IUPAC comparison (N matches anything) over the overlapping region of each pair of sequences.
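The idea of an optimistic IUPAC comparison over the overlapping region can be sketched for a single pair of aligned sequences as shown below. The handling of gaps is not specified above, so this sketch assumes that columns where both sequences have a gap are skipped and that a base opposite a gap counts as a mismatch. All names are hypothetical, and SINA's actual computation may differ.

```python
# IUPAC nucleotide codes mapped to the bases they can stand for (T read as U).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U", "T": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG", "N": "ACGU",
}
GAPS = "-."


def span(aligned: str):
    """Return (first, last) column holding a base, or None if all gaps."""
    cols = [i for i, ch in enumerate(aligned) if ch not in GAPS]
    return (cols[0], cols[-1]) if cols else None


def optimistic_identity(a: str, b: str) -> float:
    """Fractional identity of two aligned rows over their overlap."""
    sa, sb = span(a), span(b)
    if sa is None or sb is None:
        return 0.0
    lo, hi = max(sa[0], sb[0]), min(sa[1], sb[1])
    matches = compared = 0
    for x, y in zip(a[lo:hi + 1], b[lo:hi + 1]):
        if x in GAPS and y in GAPS:
            continue  # ASSUMPTION: shared gap columns are ignored
        compared += 1
        if x in GAPS or y in GAPS:
            continue  # ASSUMPTION: base against gap is a mismatch
        # Optimistic: codes match if they share at least one possible base.
        if set(IUPAC.get(x.upper(), "")) & set(IUPAC.get(y.upper(), "")):
            matches += 1
    return matches / compared if compared else 0.0


# Hypothetical aligned rows, for illustration only.
print(optimistic_identity("--ACGUAA", "GGACGA--"))
```

The field itself would then hold the maximum of this value over all reference sequences used.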
-
nuc_gene_slv
¶
The number of basepairs aligned within the gene. (Currently not computed).
-
align_bp_score_slv
¶
A score indicating the average binding strength of basepairs aligned into helix regions. Each pair of bases aligned to opposing sides of a helix specified in the reference database is assigned a score (AG = 0.5, AU = 1.1, CG = 1.5, GG = 0.4, GU = 0.9). The sum of these scores is divided by the number of helix positions with bases on either side and multiplied by 100.
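The calculation described for align_bp_score_slv can be sketched as follows. The pair scores are taken from the description above; everything else is an assumption made for illustration: the input is a hypothetical list of (left, right) characters found at paired helix positions, pairings not listed score 0, and "bases on either side" is read as positions where both sides hold a base.

```python
# Pair scores from the field description; order within a pair is irrelevant.
PAIR_SCORES = {
    frozenset("AG"): 0.5,
    frozenset("AU"): 1.1,
    frozenset("CG"): 1.5,
    frozenset("GG"): 0.4,  # frozenset("GG") collapses to {"G"}
    frozenset("GU"): 0.9,
}
GAPS = "-."


def bp_score(helix_pairs) -> float:
    """Average pair score over helix positions, multiplied by 100."""
    total = 0.0
    counted = 0
    for left, right in helix_pairs:
        # ASSUMPTION: positions with a gap on one side are not counted.
        if left in GAPS or right in GAPS:
            continue
        counted += 1
        pair = frozenset((left.upper().replace("T", "U"),
                          right.upper().replace("T", "U")))
        # ASSUMPTION: pairings without a listed score contribute 0.
        total += PAIR_SCORES.get(pair, 0.0)
    return 100.0 * total / counted if counted else 0.0


# Hypothetical helix positions: one CG pair, one AU pair, one half-empty.
print(bp_score([("C", "G"), ("A", "U"), ("G", "-")]))
```

With this reading, the value can exceed 100 when strong pairs such as CG dominate.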
-
align_family_slv
¶
The reference sequences used to align the query sequence. Each reference is listed as ACC.START:SCORE, where ACC and START are the contents of the reference sequence’s respective acc and start fields and SCORE is the score assigned by the sequence search engine (ARB PT server or internal kmer search).
-
align_log_slv
¶
A log of events that occurred during the alignment of a query sequence.
-
align_filter_slv
¶
The weighting filter selected for the query sequence, if any.
-
nearest_slv
¶
The results from the sequence search. Available only when the search stage is enabled (-S/--search). Each matched sequence is given as ACC.VERSION.START.STOP~SCORE, where ACC, VERSION, START, and STOP are the contents of the matched sequence’s respective acc, version, start and stop fields and SCORE is the score calculated according to the search settings.
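A value in the ACC.VERSION.START.STOP~SCORE form used by nearest_slv could be taken apart as sketched below. The separator between entries is not stated above, so whitespace is assumed here; START and STOP are assumed to be integers and SCORE a decimal number. The Match class, the function name and the sample value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Match:
    acc: str
    version: str
    start: int
    stop: int
    score: float


def parse_nearest(value: str) -> list[Match]:
    """Parse entries of the form ACC.VERSION.START.STOP~SCORE."""
    matches = []
    # ASSUMPTION: entries are separated by whitespace.
    for entry in value.split():
        ident, _, score = entry.rpartition("~")
        # Split from the right so a dot inside ACC would not shift fields.
        acc, version, start, stop = ident.rsplit(".", 3)
        matches.append(Match(acc, version, int(start), int(stop), float(score)))
    return matches


# Hypothetical field content, for illustration only.
for match in parse_nearest("AB000001.1.1.1500~0.98 X00002.2.10.1400~0.95"):
    print(match)
```

Malformed entries raise ValueError in this sketch instead of being skipped silently.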
SILVA taxonomy fields:¶
The SILVA SSU and LSU databases in ARB format contain taxonomic meta-data
suitable for generating taxonomic assignments using the
--lca-fields option. Each of the following fields contains
the taxonomic assignment as a “materialized path” (Domain; Phylum;
...). The _name field contains the sequence name assigned by the
respective taxonomy.
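Because each assignment is stored as a materialized path, a lowest common ancestor of several assignments is simply their longest shared prefix of ranks. The sketch below shows that idea only; it is not the procedure behind --lca-fields, which may apply further rules. The function names and example paths are hypothetical.

```python
def split_path(path: str) -> list[str]:
    """Split 'Domain;Phylum;...' into a list of non-empty rank names."""
    return [rank.strip() for rank in path.split(";") if rank.strip()]


def common_ancestor(paths: list[str]) -> str:
    """Return the longest rank prefix shared by all paths."""
    if not paths:
        return ""
    shared = split_path(paths[0])
    for path in paths[1:]:
        ranks = split_path(path)
        keep = 0
        # Count leading ranks that agree with the prefix found so far.
        for a, b in zip(shared, ranks):
            if a != b:
                break
            keep += 1
        shared = shared[:keep]
    return ";".join(shared)


# Hypothetical taxonomy paths, for illustration only.
print(common_ancestor([
    "Bacteria;Proteobacteria;Gammaproteobacteria;",
    "Bacteria;Proteobacteria;Alphaproteobacteria;",
]))
```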
-
tax_embl
¶
The EMBL-EBI/ENA taxonomy. Note that the name was changed to tax_embl_ebi_ena in newer releases of SILVA.
-
tax_embl_ebi_ena
¶
The EMBL-EBI/ENA taxonomy.
-
tax_ltp
¶
The Living Tree Project (LTP) taxonomy.
-
tax_gg
¶
The Greengenes taxonomy. (Discontinued)
-
tax_gtdb
¶
The Genome Taxonomy Database (GTDB) taxonomy.
Additional standard ARB fields:¶
-
ali_16s/data
¶
The actual sequence alignment. This field type always has the form ali_<name>/data, with <name> indicating the alignment (ARB databases may contain multiple alignments).
-
ARB_color
¶
A number indicating in which color the sequence should be highlighted inside of ARB.
-
used_rels
¶
The names of the reference sequences used during alignment, separated by spaces. This field is generated only if --write-used-rels is given. It allows selecting the reference sequences via a special menu item from within ARB.
-
nuc
¶
The length of the sequence.