# Filtering

General filter configuration
Available filters and examples

ELLA uses filter chains when loading an analysis to reduce the number of variants for interpretation. These filter chains are modular, and can consist of multiple filters, each with different configurations.

# General filter configuration

All filter configurations follow this pattern:

{
  "name": "FilterA",
  "config": {
    "<FilterA specific config key>": "<FilterA specific config value>"
  }
},

The configuration is specific to the filter with the given name. The available filters can be found in this documentation, and in ella/datalayer/allelefilter/allelefilter.py.

# Exceptions

For each filter, we can specify a list of exceptions, i.e. rules for when the filter should not apply. Many filter rules can be used either as filters or as exceptions to other filters.

To specify exceptions to the config above, we add the exceptions key to the config:

{
  "name": "FilterA",
  "config": {
    "<FilterA specific config key>": "<FilterA specific config value>"
  },
  "exceptions": [
    {
      "name": "FilterB",
      "config": {
        "<FilterB specific config key>": "<FilterB specific config value>"
      }
    }
  ]
},

This will first run FilterA with the given config, and, before filtering out variants, it will exclude from filtering the variants caught by FilterB with the given config.
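The interplay between a filter and its exceptions can be sketched as set arithmetic. This is a minimal illustration, not ELLA's implementation: `apply_with_exceptions` and the allele IDs are hypothetical, and each filter is assumed to return the set of allele IDs it matches.

```python
def apply_with_exceptions(filter_hits, exception_hits_list):
    """Return the allele IDs that end up filtered out.

    filter_hits: set of allele IDs matched by the main filter (hypothetical).
    exception_hits_list: one set of matched allele IDs per exception filter.
    """
    rescued = set()
    for hits in exception_hits_list:
        rescued |= hits
    # Alleles matched by any exception are excluded from filtering.
    return filter_hits - rescued


filter_a_hits = {1, 2, 3, 4}  # alleles FilterA would filter out
filter_b_hits = {3, 4, 5}     # alleles matched by FilterB, used as an exception
filtered = apply_with_exceptions(filter_a_hits, [filter_b_hits])
print(sorted(filtered))
```

Allele 5 is matched by the exception only, so it is unaffected: an exception can rescue alleles from its parent filter, but it never filters anything out itself.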

# Filter chains

Filters can be chained together to create a filter chain. This will run filters in the order specified.

"filterconfig": {
    "filters": [
        {
          "name": "FilterA",
          "config": {
            "<FilterA specific config key>": "<FilterA specific config value>"
          },
          "exceptions": [
            {
              "name": "FilterB",
              "config": {
                "<FilterB specific config key>": "<Value suitable for exception>"
              }
            }
          ]
        },
        {
          "name": "FilterB",
          "config": {
            "<FilterB specific config key>": "<FilterB specific config value>"
          },
          "exceptions": [
            {
              "name": "FilterB",
              "config": {
                "<FilterB specific config key>": "<Value suitable for exception>"
              }
            }
          ]
        }
    ]
}

This filter config specifies to first run FilterA on all passed variants/analysis variants, then run FilterB on the variants not filtered by FilterA. Note that both filters have FilterB as an exception.
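The sequential behaviour of a chain can be sketched as follows. This is an illustration under stated assumptions, not ELLA's code: `run_filter_chain` and the toy filter functions are hypothetical, and every function is assumed to map a set of allele IDs to the subset it matches.

```python
def run_filter_chain(allele_ids, chain):
    """Run filters in order; each filter only sees alleles not yet filtered.

    chain: list of (name, filter_fn, exception_fns) tuples (hypothetical shape).
    Returns (remaining, filtered_by_name).
    """
    remaining = set(allele_ids)
    filtered_by_name = {}
    for name, filter_fn, exception_fns in chain:
        hits = filter_fn(remaining)
        for exception_fn in exception_fns:
            hits -= exception_fn(hits)  # exceptions only rescue this filter's hits
        filtered_by_name[name] = hits
        remaining -= hits
    return remaining, filtered_by_name


# Toy stand-ins for FilterA, FilterB, and FilterB configured as an exception.
def filter_a(ids):
    return {i for i in ids if i % 2 == 0}  # matches even IDs


def filter_b(ids):
    return {i for i in ids if i > 4}  # matches IDs above 4


def filter_b_exception(ids):
    return {i for i in ids if i > 7}  # matches IDs above 7


chain = [
    ("FilterA", filter_a, [filter_b_exception]),
    ("FilterB", filter_b, [filter_b_exception]),
]
remaining, filtered_by_name = run_filter_chain(range(1, 10), chain)
print(sorted(remaining), filtered_by_name)
```

Because FilterB only receives the alleles that FilterA left behind, the order of the filters in the chain determines which filter an allele is attributed to.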

# Update filter configuration

To update the filter configurations, run the following command:

ella-cli filterconfigs update <path to filterconfigs.json>

# Available filters and examples

A set of filters is implemented in ELLA, and they are described below. The modularity of these filters makes it easy to construct complex filter chains, and reasonably easy to implement new filters. Most filters can be used either as a normal filter (filtering out variants) or as an exception to another filter (rescuing variants that would otherwise be filtered).

Generic filters

Generic annotation filter
Pre-filter (before import)

Specific filters

Classification filter
Consequence filter
External filter
Frequency filter
Gene filter
Inheritance model filter
Polypyrimidine filter
Quality filter
Region filter
Segregation filter

See also filterconfigs.json for examples of a complete setup. The schema is located in ella/vardb/datamodel/pydantic/filterconfig.py.

# Generic annotation filter

The generic_annotation filter filters out or rescues alleles based on arbitrary parts of the annotation JSONB object (annotations column of the Annotation table). Similar to the annotations column, this filter is highly configurable, allowing for complex filtering on any type and combination of allele annotations.

# Configuration

Option	Key	Description
Annotation target	`target`	Target path of the JSONB object to filter on. The path is specified as a `.`-separated string (e.g. `variantDBs.varde`).
Array	`is_array`	Whether the JSONB annotation target is an array or not. If true, alleles will be filtered out (or rescued) if any element in the array matches the filter criteria.
Rules	`rules`	List of filter rules to apply. Each rule contains the following options:
	`key`	The JSONB key of the annotation target to filter on.
	`operator`	The column operator to use for comparison. See the list of available operators for alternatives and supported types.
	`value`	The value to run the operator against.
	`type_str`	Primitive type of the target value. This is used internally when constructing the SQL subquery to cast each element to the correct type. If the value of the target key is an array, this type refers to the type of the array elements. Must be one of `string`, `number`, or `boolean`.
Mode	`mode`	Either `all` or `any`. If `all` (default), all rules must be fulfilled for the filter to apply. If `any`, at least one rule must be fulfilled for the filter to apply.

# Example

This configuration will filter out alleles where both SpliceAI DS_AG (delta score, acceptor gain) and DS_AL (delta score, acceptor loss) are smaller than 0.5:

{
    "name": "generic_annotation",
    "config": {
        "target": "prediction.spliceai",
        "is_array": true,
        "rules": [
            {
                "key": "DS_AG",
                "operator": "<",
                "value": 0.5,
                "type_str": "number"
            },
            {
                "key": "DS_AL",
                "operator": "<",
                "value": 0.5,
                "type_str": "number"
            }
        ],
        "mode": "all"
    }
}
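How `target`, `is_array`, `rules` and `mode` combine can be sketched in plain Python. ELLA evaluates these rules as SQL against the JSONB column, so the code below is only an approximation: the function names, the shape of the sample annotation dicts and the reduced operator set are assumptions made for illustration.

```python
import operator

# Subset of the operators, mapped to Python equivalents (illustrative only).
OPERATORS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
             ">=": operator.ge, "==": operator.eq, "!=": operator.ne}


def resolve_target(annotation, target):
    """Follow a '.'-separated path into a nested dict; None if missing."""
    node = annotation
    for part in target.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def matches(annotation, config):
    """True if the annotation fulfils the filter criteria in `config`."""
    target = resolve_target(annotation, config["target"])
    if target is None:
        return False
    elements = target if config.get("is_array") else [target]
    combine = all if config.get("mode", "all") == "all" else any

    def element_matches(element):
        results = []
        for rule in config["rules"]:
            value = element.get(rule["key"])
            # Missing (NULL-like) values never match a comparison.
            results.append(value is not None
                           and OPERATORS[rule["operator"]](value, rule["value"]))
        return combine(results)

    # For arrays, a single matching element is enough.
    return any(element_matches(e) for e in elements)


config = {
    "target": "prediction.spliceai",
    "is_array": True,
    "rules": [
        {"key": "DS_AG", "operator": "<", "value": 0.5, "type_str": "number"},
        {"key": "DS_AL", "operator": "<", "value": 0.5, "type_str": "number"},
    ],
    "mode": "all",
}
low = {"prediction": {"spliceai": [{"DS_AG": 0.1, "DS_AL": 0.2}]}}
high = {"prediction": {"spliceai": [{"DS_AG": 0.9, "DS_AL": 0.2}]}}
print(matches(low, config), matches(high, config))
```

With `"mode": "any"`, the second annotation would match as well, since its DS_AL score alone is below 0.5.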

# Operators

The following operators are available, along with their supported types:

Operator	Target type	Value type	Description
`is`/`is_not`	`None`, `bool`	`None`, `bool`	Identity comparison for `None` and `bool` values.
`==`/`!=`	`str`, `int`, `float`	`str`, `int`, `float`	Evaluates numerical or string equality.
`<`/`<=`/`>`/`>=`	`number`	`number`	Implements the `<`, `<=`, `>`, and `>=` inequality operators.
`in`/`not_in`	`str`, `int`, `float`	`list[str] or list[int] or list[float]`	Whether or not the target value is in the list of values.
`contains`/`not_contains`	`list[str] or list[int] or list[float]`	`list[str] or list[int] or list[float]`	Whether or not the target array contains the value array.
`str_contains`/`str_not_contains`	`str`	`str`	Whether the target string contains the value string.
`overlap`/`not_overlap`	`list[str] or list[int] or list[float]`	`list[str] or list[int] or list[float]`	Whether or not the target array and the value array share any elements.

Note that while the generic annotation filter fully supports nullable target values, it follows PostgreSQL's handling of None (i.e. NULL) values. This means that None target values will not match negated comparison operators (e.g. !=). To filter on None values, always use the is or is_not operators.
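The NULL behaviour can be mimicked in Python to show why `!=` does not catch None values. This is a sketch of SQL's three-valued logic only; the function names are hypothetical and do not exist in ELLA.

```python
def sql_not_equal(target, value):
    """Mimic SQL `target != value`: NULL on either side yields NULL (None)."""
    if target is None or value is None:
        return None
    return target != value


def sql_is_not(target, value):
    """Mimic an identity comparison, as done by the `is_not` operator."""
    return target is not value


def row_selected(condition):
    """A SQL WHERE clause keeps a row only when the condition is exactly TRUE."""
    return condition is True


# A None target is not matched by "!=", even though it differs from "x".
print(row_selected(sql_not_equal(None, "x")))
# The identity operator handles None explicitly.
print(row_selected(sql_is_not(None, None)), row_selected(sql_is_not("y", None)))
```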

# Pre-filter (before import)

For large data sets, it is advisable to apply a special pre-filter that removes variants that are certain to be benign (e.g. population frequency above 0.05) before importing into ELLA, to reduce the loading time when opening an analysis.

NOTE

Variants removed in this way (before import) will not be visible in FILTERED variants in ELLA. However, ELLA can be configured to include a VCF track in the VISUAL mode, where these variants would be included.

# Configuration

Unlike the regular filters above, the pre-filter is specified in the usergroups.json file, under deposit.analysis.prefilters; see also Deposit.

To combine options with boolean AND (all conditions must be met), list them like this: [["<option1>", "<option2>"]].
To instead combine options with boolean OR (any condition met), nest the options like this: [["<option1>"], ["<option2>"]].

The available options are:

Option	Key	Description
High population frequency	`hi_frequency`	Pre-filter variants with AF>0.05, AN>5000 in the total population of the gnomAD genomes dataset.
Low mapping quality	`low_mapping_quality`	Pre-filter variants with MQ<20. This can be useful for variants called with Dragen-GATK, which unlike GATK does not automatically exclude variants with a low MQ.
No classification	`no_classification`	Only pre-filter variants that do not have an existing classification in the internal database.
Position not nearby	`position_not_nearby`	Only pre-filter variants that have no other variant(s) within 3 bp. This prevents losing information that is needed to determine the correct codon change in coding sequence.
Not multi-allelic	`non_multiallelic`	Only pre-filter variants that are not multiallelic. A common way of handling multiallelic variants (ALT1/ALT2) in variant calling pipelines is to decompose them into two separate REF/ALT1 and REF/ALT2 variants. If this is the case, this option will make sure that information about the true genotype is not lost in case only one of the ALT1/ALT2 alleles matches the pre-filtering criteria.

# Example

This configuration (in usergroups.json) will enable all of the above options for analyses matching the <regex> pattern, considering either high population frequency with qualifiers (AND no nearby variants AND no classification AND not multi-allelic), or low MQ (with no exceptions):

"deposit": {
    "analysis": [
        {
            "pattern": "[<regex>]",
            "postprocess": [],
            "prefilters": [
                [
                    "hi_frequency",
                    "no_classification",
                    "position_not_nearby",
                    "non_multiallelic"
                ],
                [
                    "low_mapping_quality"
                ]
            ]
        }
    ]
}
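The nested `prefilters` list reads as an OR of AND-groups, which can be sketched as below. The function name and the `fulfilled` argument are hypothetical; how each individual condition is computed for a variant is outside the scope of this sketch.

```python
def prefilter_applies(prefilters, fulfilled):
    """Evaluate nested pre-filter options as an OR of AND-groups.

    prefilters: list of lists of option keys, as in the `prefilters` config.
    fulfilled: set of option keys whose condition holds for a given variant.
    """
    return any(all(option in fulfilled for option in group)
               for group in prefilters)


prefilters = [
    ["hi_frequency", "no_classification", "position_not_nearby", "non_multiallelic"],
    ["low_mapping_quality"],
]

# Low MQ alone is enough to pre-filter the variant.
print(prefilter_applies(prefilters, {"low_mapping_quality"}))
# High frequency alone is not: all qualifiers in its group must hold as well.
print(prefilter_applies(prefilters, {"hi_frequency", "no_classification"}))
```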

# Classification filter

The classification filter filters out or rescues alleles that have an existing classification in the internal database.

# Configuration

Option	Key	Description
Classes	`classes`	List classifications to consider. Must be a subset of the available classes in ELLA.
Exclude outdated	`exclude_outdated`	Optionally disregard variant classifications if they are past the corresponding validity period defined in the ELLA config (`true`/`false` (default)).

# Example

This configuration will filter out alleles previously classified as class 1 or 2, but only if they are still valid:

{
    "name": "classification",
    "config": {
        "classes": ["1", "2"],
        "exclude_outdated": true
    }
}

# Consequence filter

The consequence filter filters out or rescues alleles that are annotated with specific consequences from VEP.

# Configuration

Option	Key	Description
Consequences	`consequences`	List of consequences to use. Must be a subset of the available VEP consequences.
Gene panel only	`genepanel_only`	Specify if only consequences in genes within the current gene panel should be included (`true`/`false`).

TIP

Since any given variant can be annotated with many different consequences, this is typically also used as a filter exception on itself. This is done to avoid filtering out variants that are e.g. synonymous_variant in one transcript, but stop_gained in another.

# Example

This configuration will filter out variants that are annotated as either synonymous_variant, stop_retained_variant or start_retained_variant.

{
  "name": "consequence",
  "config": {
      "genepanel_only": false,
      "consequences": ["synonymous_variant", "stop_retained_variant", "start_retained_variant"]
  }
}
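The effect of using the consequence filter as an exception on itself can be sketched as follows. The sketch assumes that the filter matches when any transcript carries a listed consequence and ignores `genepanel_only`; the function names and the `RESCUE` list are hypothetical.

```python
FILTER_OUT = {"synonymous_variant", "stop_retained_variant", "start_retained_variant"}
RESCUE = {"stop_gained", "missense_variant"}  # hypothetical exception config


def consequence_hits(transcript_consequences, consequences):
    """True if any transcript carries one of the listed consequences."""
    return any(consequence in consequences
               for per_transcript in transcript_consequences
               for consequence in per_transcript)


def is_filtered(transcript_consequences):
    """Filter on FILTER_OUT unless the exception (RESCUE) also matches."""
    return (consequence_hits(transcript_consequences, FILTER_OUT)
            and not consequence_hits(transcript_consequences, RESCUE))


# Synonymous in every annotated transcript: filtered out.
print(is_filtered([["synonymous_variant"]]))
# Synonymous in one transcript but stop_gained in another: rescued.
print(is_filtered([["synonymous_variant"], ["stop_gained"]]))
```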

# External filter

The external filter filters out or rescues alleles based on annotation from HGMD and/or ClinVar.

# Configuration

Source	Option	Key	Description
ClinVar		`clinvar`
	Number of stars	`num_stars`	Specify a comparison operator (>/</=) and a number (1-4) corresponding to number of stars in ClinVar.
	Combinations	`combinations`	Specify combinations of criteria to compare. Given as source, operator and target.
	Inverse	`inverse`	Apply to alleles NOT fulfilling the given criteria (`true`/`false` (default)).
HGMD		`hgmd`
	Tags	`tags`
	Inverse	`inverse`	Apply to alleles NOT fulfilling the given criteria (`true`/`false` (default)).

TIP

If you only want to use one of the databases, just omit the other key (hgmd or clinvar).

# Example

This configuration is useful for exceptions, and will rescue alleles where all of these criteria are fulfilled:

The variant is in ClinVar and the number of stars is two or more.
The number of pathogenic or likely pathogenic submissions is greater than the number of benign or likely benign submissions.
The variant is in HGMD with the tag DM or DM?.

{
    "name": "external",
    "config": {
        "clinvar": {
            "num_stars": [">=", 2],
            "combinations": [["pathogenic", ">", "benign"]],
            "inverse": false
        },
        "hgmd": {
            "tags": ["DM", "DM?"],
            "inverse": false
        }
    }
}

# Frequency filter

The frequency filter filters out alleles based on their population allele frequency. You may use different frequency thresholds for different data set groups (data provider/sub-population) and/or inheritance mode. The filter also supports setting thresholds for the number of actual observations required in the data set.

NOTE

This filter cannot be used as an exception filter.

# Configuration

Option	Key	Description
Filter groups	`groups`	Categorize data set providers (e.g. gnomAD) and their sub-populations in groups, which can be referred to when specifying thresholds. A group `external` could for instance consist of gnomAD with sub-populations G, AMR and AFR. This can e.g. be used to separate internal in-house and external data sets.
Number threshold	`num_thresholds`	Set a threshold for the allele number (number of observed chromosomes at a given locus) for each sub-population. Sub-populations with fewer observations than the threshold at a given locus (e.g. due to poor coverage) will not be used for filtering at that locus.
Frequency threshold	`thresholds`	Set population frequency thresholds (0-1). They can be defined for two inheritance types, `AD` (autosomal dominant) and `default`. Thresholds set for `AD` apply to genes specified with AD inheritance mode only, i.e. not combinations such as AD/AR. `default` applies to everything not in `AD`. For each inheritance mode, you can also configure the thresholds for each filter group (see above) separately.

# Example

This configuration will filter out variants that are:

GnomAD genomes or exomes, total populations (external, G)
- More than 5000 observed chromosomes
- Higher than 0.005 frequency in AD genes
- Higher than 0.01 frequency for other inheritance modes
In-house database (internal, OUSWES)
- Higher than 0.05 frequency for any inheritance mode

{
    "name": "frequency",
    "config": {
        "groups": {
            "external": {
                "GNOMAD_GENOMES": ["G"],
                "GNOMAD_EXOMES": ["G"]
            },
            "internal": {
                "inDB": ["OUSWES"]
            }
        },
        "num_thresholds": {
            "GNOMAD_GENOMES": {
                "G": 5000
            },
            "GNOMAD_EXOMES": {
                "G": 5000
            }
        },
        "thresholds": {
            "AD": {
                "external": 0.005,
                "internal": 0.05
            },
            "default": {
                "external": 0.01,
                "internal": 0.05
            }
        }
    }
}
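How groups, number thresholds and frequency thresholds interact can be sketched as below. This is a simplified reading of the configuration, not ELLA's implementation: the `frequencies` data shape is hypothetical, and the sketch assumes a variant is filtered as soon as any sufficiently covered sub-population exceeds the threshold of its group. Behaviour exactly at a threshold is also an assumption.

```python
EXAMPLE_CONFIG = {
    "groups": {
        "external": {"GNOMAD_GENOMES": ["G"], "GNOMAD_EXOMES": ["G"]},
        "internal": {"inDB": ["OUSWES"]},
    },
    "num_thresholds": {"GNOMAD_GENOMES": {"G": 5000}, "GNOMAD_EXOMES": {"G": 5000}},
    "thresholds": {
        "AD": {"external": 0.005, "internal": 0.05},
        "default": {"external": 0.01, "internal": 0.05},
    },
}


def is_frequency_filtered(config, frequencies, inheritance="default"):
    """frequencies: {provider: {population: (freq, num)}} (hypothetical shape)."""
    thresholds = config["thresholds"].get(inheritance, config["thresholds"]["default"])
    num_thresholds = config.get("num_thresholds", {})
    for group, providers in config["groups"].items():
        for provider, populations in providers.items():
            for population in populations:
                entry = frequencies.get(provider, {}).get(population)
                if entry is None:
                    continue  # no frequency data for this sub-population
                freq, num = entry
                if num < num_thresholds.get(provider, {}).get(population, 0):
                    continue  # too few observed chromosomes at this locus
                if freq > thresholds[group]:
                    return True
    return False


gnomad = {"GNOMAD_GENOMES": {"G": (0.007, 20000)}}
print(is_frequency_filtered(EXAMPLE_CONFIG, gnomad, "AD"))
print(is_frequency_filtered(EXAMPLE_CONFIG, gnomad, "default"))
```

The same frequency of 0.007 is filtered in an AD gene but kept under the `default` thresholds, which is the purpose of configuring the two inheritance types separately.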

# Gene filter

The gene filter filters out or rescues alleles that are within a given gene.

NOTE

The annotation is matched on gene panel with transcript (excluding RefSeq versioning). This means that it will only take into account variants annotated with gene panel transcript(s).

# Configuration

Option	Key/Value	Description
Genes	`genes`	List of HGNC IDs to apply filter to.
Filter mode	`mode`	Either:
	`all`	(default) Variant must be annotated with genes specified in `genes` only. This is useful for filtering out variants.
	`one`	Variant must be annotated with at least one gene from `genes` (but could be annotated with other genes). This is useful for exceptions.
Inverse	`inverse`	Apply to alleles NOT fulfilling the given criteria (default `false`)

# Examples

This configuration will filter out all variants annotated with BRCA1 (1100) and/or BRCA2 (1101), but not if they are also annotated on gene panel transcripts for any other gene:

{
    "name": "gene",
    "config": {
        "genes": [1100, 1101],
        "mode": "all"
    }
}

This configuration will filter out all variants not annotated with either BRCA1 (1100) or BRCA2 (1101):

{
    "name": "gene",
    "config": {
        "genes": [1100, 1101],
        "mode": "one",
        "inverse": true
    }
}
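The `mode` and `inverse` options amount to simple set logic, sketched here. The function name is hypothetical, and `variant_genes` stands for the HGNC IDs of the gene panel transcripts a variant is annotated with.

```python
def gene_filter_matches(variant_genes, genes, mode="all", inverse=False):
    """Return True if the gene filter applies to the variant."""
    variant_genes = set(variant_genes)
    genes = set(genes)
    if mode == "all":
        # Annotated with the listed genes only.
        match = bool(variant_genes) and variant_genes <= genes
    elif mode == "one":
        # Annotated with at least one of the listed genes.
        match = bool(variant_genes & genes)
    else:
        raise ValueError(f"Unknown mode: {mode}")
    return match != inverse  # inverse flips the result


BRCA = [1100, 1101]
OTHER_GENE = 9999  # placeholder ID for any gene not in the list

# First example: filtered only if no other gene is involved.
print(gene_filter_matches({1100}, BRCA, mode="all"))
print(gene_filter_matches({1100, OTHER_GENE}, BRCA, mode="all"))
# Second example: filtered if neither BRCA1 nor BRCA2 is involved.
print(gene_filter_matches({OTHER_GENE}, BRCA, mode="one", inverse=True))
```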

# Inheritance model filter

The inheritance_model filter filters out or rescues alleles that are not consistent with the inheritance model for a gene given in the gene panel.

NOTE

This filter is intended for single samples only and does not use family information.

By design, only genes specified in the gene panel are checked. This means that if variant A is located in GENE1 and GENE2, and variant B is located in GENE2 and GENE3, but only GENE1 and GENE3 are in the gene panel, both variant A and B would be filtered out, even though they are compound heterozygous candidates for GENE2.

# Configuration

Option	Key/Value	Description
Filter mode	`filter_mode`	Either:
	`recessive_non_candidates`	Applies to variants in genes with autosomal recessive (AR) inheritance, where the variant is heterozygous and the only (non-filtered) variant in that gene. Typically used for filtering out variants.
	`recessive_candidates`	Applies to variants in genes that are NOT autosomal dominant (AD), where the variant is either homozygous or there is at least one other (non-filtered) variant in the same gene. Typically used for rescuing variants from another filter.

# Example

This configuration will filter out variants meeting the criteria for Recessive non-candidates.

{
    "name": "inheritance_model",
    "config": {
        "filter_mode": "recessive_non_candidates"
    }
}

# Polypyrimidine filter

The ppy filter filters out or rescues the following allele changes in the polypyrimidine tract:

C>T, T>C, delCC, delTT, delCT and delTC

NOTE

For transcripts on the positive genomic strand, deletions will not be filtered out if they are preceded by an A, as this might introduce a new AG splice site. Similarly, on reverse strand transcripts, deletions will not be filtered out if they are preceded (in genomic coordinates) by a C.

The filter does not check for a new splice site, since ELLA only has access to the base preceding a deletion from the imported data (VCF).

# Configuration

Option	Key	Description
Polypyrimidine tract region	`ppy_tract_region`	Set interval in number of bases to treat as polypyrimidine tract region, upstream of exon start.

# Example

This configuration will filter out the specified polypyrimidine changes in the region between 3 and 20 bases upstream of the exon start.

{
    "name": "ppy",
    "config": {
        "ppy_tract_region": [-20, -3]
    }
}

# Quality filter

The quality filter filters out alleles with a low quality, using the Filter status or Quality fields from the VCF, or the allele ratio. The latter is calculated in ELLA as alternative allele reads/total reads (presented as Ratio in the Quality card).

# Configuration

Option	Key	Description
Filter status	`filter_status`	Filter based on values in the VCF `FILTER` field.
	`pattern`	Regex pattern to look for (e.g. `PASS`, `.*VQSRTranche.*`).
	`filter_empty`	Filter out if value is empty (`true`/`false` (default)).
	`inverse`	Apply to alleles NOT fulfilling the given criteria (`true`/`false` (default)).
Quality	`qual`	Set threshold value (integer) for the VCF `QUAL` field, below which a variant should be filtered.
Allele ratio	`allele_ratio`	Set threshold value (0-1), below which a variant should be filtered.

NOTES

Due to filtering below a certain threshold, the qual or allele_ratio parameters are not suitable for use in exceptions.
The meaning of a particular qual value is dependent on the particular variant calling pipeline, and thresholds should therefore be used with caution if importing results from multiple setups.
Although a skewed allele_ratio is most often indicative of technical artifacts, it may also indicate somatic mosaicism. This option should therefore not be used in patients where mosaicism is suspected.

# Example

This configuration will filter out any variant with QUAL <100 AND allele ratio <0.25 AND FILTER is NOT PASS:

{
    "name": "quality",
    "config": {
        "qual": 100,
        "allele_ratio": 0.25,
        "filter_status": {
            "pattern": "PASS",
            "inverse": true
        }
    }
}
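The AND combination in this example can be sketched as follows. The function names are hypothetical, the regex is assumed to be applied as a search anywhere in the FILTER value, and `filter_empty` is left out of the sketch.

```python
import re


def allele_ratio(alt_reads, total_reads):
    """Alternative allele reads divided by total reads."""
    return alt_reads / total_reads


def is_quality_filtered(qual, ratio, vcf_filter, config):
    """True only if all three configured criteria are met."""
    status = config["filter_status"]
    pattern_found = re.search(status["pattern"], vcf_filter) is not None
    # With inverse set, the criterion applies when the pattern is NOT found.
    status_applies = pattern_found != status.get("inverse", False)
    return (qual < config["qual"]
            and ratio < config["allele_ratio"]
            and status_applies)


config = {
    "qual": 100,
    "allele_ratio": 0.25,
    "filter_status": {"pattern": "PASS", "inverse": True},
}

# Low QUAL, skewed ratio and a non-PASS status: filtered out.
print(is_quality_filtered(50, allele_ratio(2, 10), "LowQual", config))
# Same values, but FILTER is PASS: kept.
print(is_quality_filtered(50, allele_ratio(2, 10), "PASS", config))
```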

This configuration will filter out any variant that has empty FILTER status or VQSRTrancheSNP with any value (e.g. VQSRTrancheSNP99.00to99.90):

{
    "name": "quality",
    "config": {
        "filter_status": {
            "pattern": ".*VQSRTranche.*",
            "filter_empty": true
        }
    }
}

# Region filter

The region filter (region) filters out alleles that fall outside a specified splice or UTR region.

NOTE

Gene panels used in ELLA may be configured with custom regions that define regions of interest outside transcribed exons, e.g. regulatory regions such as promoters. These regions are always exempted from the region filter.

NOTE

This filter is not suitable as an exception filter.

The genomic regions from the transcript database are used as the basis for the filtering. However, if a variant filtered on genomic region is annotated with a cDNA position that places it within the region, the variant is not filtered. In other words, cDNA information can save a variant from being filtered.

# Configuration

Option	Key	Description
Splice region	`splice_region`	Sets number of bases to treat as splice region, upstream from exon start and downstream from exon end. Variants in the intron outside this region will be filtered.
UTR region	`utr_region`	Sets number of bases to treat as UTR region, upstream from coding start and downstream of coding end. Variants in the UTR outside this region will be filtered.

# Example

This configuration will filter out:

Intron variants
- Upstream (5') of 12 bases from exon start AND
- Downstream (3') of 10 bases from exon end
UTR variants
- Upstream (5') of 10 bases from the coding start OR
- Downstream (3') of 5 bases from the coding end

{
    "name": "region",
    "config": {
        "splice_region": [-12, 10],
        "utr_region": [10, 5]
    }
}

# Segregation filter

The segregation filter requires family data, and presents powerful options to filter out anything that is not de novo, homo-/hemizygous or compound heterozygous.

NOTE

This filter "rescues" variants (filters any variant where conditions are not met) and is not suitable as an exception filter.
For the purposes below, variants in the pseudo-autosomal X-chromosome regions PAR1 and PAR2 (X:60001-2699520 and X:154931044-155260560 on GRCh37) are treated as autosomal, not X-linked.

# Configuration

This filter works in reverse and considers criteria in an OR fashion, i.e. variants where conditions for any of the enabled (true) options are met will not be filtered. See the sections below for further explanation of each option.

Option	Key	Description
De novo variant	`denovo`	Rescue de novo variants.
	`gq_threshold`	Optionally define GQ threshold for de novo variants.
Compound heterozygous candidate	`compound_heterozygous`	Rescue variants with possibility for being compound heterozygous.
Homozygous recessive variant	`recessive_homozygous`	Rescue variants that are homo-/hemizygous in affecteds.
No coverage in parents	`no_coverage_parents`	Rescue variants with no coverage in parents.
Parental mosaicism	`parental_mosaicism`	Rescue variants with possibility for parental mosaicism.

# Example

This configuration will filter out any variant that does not match any of the possible criteria given above, except "Parental mosaicism", with a threshold of GQ>20 for all family members for de novo candidates:

{
    "name": "segregation",
    "config": {
        "denovo": {
            "enable": true,
            "gq_threshold": {
                "proband": 20,
                "mother": 20,
                "father": 20
            }
        },
        "compound_heterozygous": { "enable": true },
        "recessive_homozygous": { "enable": true },
        "no_coverage_parents": { "enable": true },
        "parental_mosaicism": { "enable": false }
    }
}

# De novo variant

Designating a variant as de novo is based on rules given in Vigeland et al. (2016). Genotype inheritance patterns that designate a variant allele "1" (reference = "0") as de novo in the child (father + mother = child) are:

For autosomal or pseudo-autosomal regions:
- 0/0 + 0/0 = 0/1
- 0/0 + 0/0 = 1/1
- 0/0 + 0/1 = 1/1
- 0/1 + 0/0 = 1/1
For X-linked regions, child is a boy:
- 0 + 0/0 = 1
For X-linked regions, child is a girl:
- 0 + 0/0 = 0/1
- 0 + 0/0 = 1/1
- 0 + 0/1 = 1/1
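The patterns listed above can be expressed as a lookup table. This sketch assumes unphased genotype strings normalised to the forms used in the list (e.g. `0/1`, never `1/0`), and the function and argument names are hypothetical.

```python
# (father, mother, child) genotype patterns that designate a de novo variant.
AUTOSOMAL = {
    ("0/0", "0/0", "0/1"),
    ("0/0", "0/0", "1/1"),
    ("0/0", "0/1", "1/1"),
    ("0/1", "0/0", "1/1"),
}
X_LINKED_BOY = {("0", "0/0", "1")}
X_LINKED_GIRL = {
    ("0", "0/0", "0/1"),
    ("0", "0/0", "1/1"),
    ("0", "0/1", "1/1"),
}


def is_denovo(father, mother, child, x_linked=False, child_is_male=False):
    """Check a trio genotype combination against the de novo patterns."""
    if not x_linked:
        patterns = AUTOSOMAL  # also used for pseudo-autosomal regions
    elif child_is_male:
        patterns = X_LINKED_BOY
    else:
        patterns = X_LINKED_GIRL
    return (father, mother, child) in patterns


print(is_denovo("0/0", "0/0", "0/1"))
print(is_denovo("0", "0/0", "1", x_linked=True, child_is_male=True))
# An inherited heterozygous variant is not de novo.
print(is_denovo("0/1", "0/0", "0/1"))
```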

NOTE

If a male trio member is reported as heterozygous for an X-linked variant, the variant will be filtered out.

# GQ threshold

If gq_threshold is defined, only de novo candidates where the genotype qualities (GQ) are above the given thresholds are considered. This can be useful for removing false de novo predictions. Note that thresholds must be defined for each of the proband, father and mother. Setting the threshold to 0 is equivalent to disabling the check.

# Compound heterozygous candidate

Variants are designated as compound heterozygous candidates based on the rule set from Kamphans et al. (2013):

A variant has to be in a heterozygous state in all affected individuals.
A variant must not occur in a homozygous state in any of the unaffected individuals.
A variant that is heterozygous in an affected child must be heterozygous in exactly one of the parents.
A gene must have two or more heterozygous variants in each of the affected individuals.
There must be at least one variant transmitted from the paternal side and one transmitted from the maternal side.
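For the simplest family structure, a single affected child with two unaffected parents, the five rules above can be sketched as below. The function name, the input shape and the normalised genotype strings are assumptions for illustration; families with several affected or unaffected members need the rules applied per individual.

```python
def compound_het_candidates(gene_variants):
    """Return compound heterozygous candidates within one gene.

    gene_variants: {variant_id: (child, father, mother)} genotype strings
    for all variants in the gene (hypothetical input shape).
    """
    paternal, maternal = [], []
    for variant_id, (child, father, mother) in gene_variants.items():
        if child != "0/1":
            continue  # rule 1: heterozygous in the affected child
        if "1/1" in (father, mother):
            continue  # rule 2: not homozygous in unaffected individuals
        if (father == "0/1") == (mother == "0/1"):
            continue  # rule 3: heterozygous in exactly one parent
        (paternal if father == "0/1" else maternal).append(variant_id)
    # Rules 4 and 5: at least one variant from each parental side.
    if paternal and maternal:
        return sorted(paternal + maternal)
    return []


gene = {
    "v1": ("0/1", "0/1", "0/0"),  # inherited from father
    "v2": ("0/1", "0/0", "0/1"),  # inherited from mother
    "v3": ("0/1", "0/1", "0/1"),  # heterozygous in both parents: excluded
}
print(compound_het_candidates(gene))
```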

NOTE

For the third rule, note this excerpt from the article:

"[This rule] is applicable only if we assume that no de novo mutations occurred. The number of de novo mutations is estimated to be below five per exome per generation, thus, the likelihood that an individual is compound heterozygous and at least one of these mutations arose de novo is low. If more than one family member is affected, de novo mutations are even orders of magnitudes less likely as a recessive disease cause. On the other hand, excluding these variants from the further analysis helps to remove many sequencing artifacts."

# Homozygous recessive variant

This rule set checks for homo-/hemizygous variants. The following conditions must be met, assuming parental genotypes are available:

For autosomal or pseudo-autosomal regions:
- Homozygous in the proband and any affected siblings.
- Heterozygous in both parents.
- Not homozygous in unaffected siblings.
For X-linked regions:
- Homo-/hemizygous in the proband and any affected siblings (note: for girls this requires a de novo, but still valid case).
- Heterozygous in mother.
- Not present in father.
- Not homo-/hemizygous in unaffected siblings.

When parental genotypes are not available, only the rule "Not homo-/hemizygous in unaffected siblings" is applicable.
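The autosomal conditions can be sketched as follows, for the case where parental genotypes are available. The function name and the genotype string format are assumptions; X-linked regions and missing parental genotypes are not covered.

```python
def is_recessive_homozygous_autosomal(proband, father, mother,
                                      affected_siblings=(),
                                      unaffected_siblings=()):
    """Check the autosomal homozygous recessive conditions for one variant."""
    # Homozygous in the proband and any affected siblings.
    if proband != "1/1":
        return False
    if any(genotype != "1/1" for genotype in affected_siblings):
        return False
    # Heterozygous in both parents.
    if father != "0/1" or mother != "0/1":
        return False
    # Not homozygous in unaffected siblings.
    return all(genotype != "1/1" for genotype in unaffected_siblings)


print(is_recessive_homozygous_autosomal("1/1", "0/1", "0/1"))
# A homozygous unaffected sibling contradicts the recessive model.
print(is_recessive_homozygous_autosomal("1/1", "0/1", "0/1",
                                        unaffected_siblings=["1/1"]))
```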

# No coverage in parents

If enabled, variants where either parental genotype is missing (no variant call/coverage) will not be filtered out. This is useful if you want to keep variants with uncertain de novo status, for instance.

# Parental mosaicism

This rule set checks whether a variant is inherited from a parent with possible allelic mosaicism (excluding cases where the other parent has a normal genotype). The following conditions must be met:

Proband:
- Has variant.
- Genotype is heterozygous or hemizygous (for X-linked regions).
Either parent has an allele_ratio between given thresholds:
- For autosomal or pseudo-autosomal regions: [0, 0.3]
- For X-linked regions:
  - Mother: [0, 0.3]
  - Father: [0, 0.8]
Other parent does not have an allele_ratio outside given thresholds (i.e., rule fails if the other parent has a normal genotype).
