The VCF files preprocessor is used for validating the datasets such that the inconsistencies can be easily identified. It can be used as a standalone validator to check the validity of the VCF files, or as a helper tool for VCF to BigQuery pipeline. The VCF to BigQuery loading process may fail for ill-defined datasets, which can be resolved by applying conflicts resolving strategies. This tool provides insights on what conflicts are there in the VCF files and the resolutions that will be applied when loading the data into BigQuery. Meanwhile, it also makes the manual corrections easier if the proposed resolutions are not satisfactory.
This tool generates a report that includes three types of inconsistencies:
- Conflicting header definitions: It is common to have different definitions for the same field in different VCF files. For this type of inconsistency, the report lists all conflicting definitions, the corresponding file paths (up to 5), and the suggested resolutions.
- Malformed Records (Optional): There can be malformed records that cannot be parsed correctly by the VCF parser. In this case, the file paths and the malformed variant records are included in the report for reference.
- Inferred headers (Optional): Occasionally, some fields may be used directly
without a definition in the VCF files, or though a definition is provided, the
field value does not match the field definition. This tool can infer the type
and num for such fields and provides them in the report. For the latter case,
two kinds of the mismatches are handled:
- The defined field type is
Integer
, but the provided value is float. Correct the type to beFloat
. - Defined num is
A
(i.e. one value for each alternate base), but the provided values do not have the same cardinality as the alternate bases. Infer the num to be unknown (i.e..
).
- The defined field type is
Similar to running the VCF to BigQuery pipeline, the preprocessor can also be run using docker or directly from the source.
Run the script below and replace the following parameters:
- Dataflow's required inputs:
GOOGLE_CLOUD_PROJECT
: This is your project ID where the job should run.GOOGLE_CLOUD_REGION
: You must choose a geographic region for Cloud Dataflow to process your data, for example:us-west1
. For more information please refer to Setting Regions.GOOGLE_CLOUD_LOCATION
: You may choose a geographic location for Cloud Life Sciences API to orchestrate job from. This is not where the data will be processed, but where some operation metadata will be stored. This can be the same or different from the region chosen for Cloud Dataflow. If this is not set, the metadata will be stored in us-central1. See the list of Currently Available Locations.TEMP_LOCATION
: This can be any folder in Google Cloud Storage that your project has write access to. It's used to store temporary files and logs from the pipeline.
INPUT_PATTERN
: A location in Google Cloud Storage where the VCF file are stored. You may specify a single file or provide a pattern to analyze multiple files at once.REPORT_PATH
: This can be any path in Google Cloud Storage that your project has write access to. It is used to store the report from the preprocessor tool.RESOLVED_HEADERS_PATH
: Optional. This can be any path in Google Cloud Storage that your project has write access to. It is used to store the resolved headers which can be further used as a representative header (via--representative_header_file
) for the VCF to BigQuery pipeline.
report_all_conflicts
is optional. By default, the preprocessor reports the
inconsistent header definitions across multiple VCF files. By setting this
parameter to true, it also checks the undefined headers and the malformed
records.
#!/bin/bash
# Parameters to replace:
GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
GOOGLE_CLOUD_REGION=GOOGLE_CLOUD_REGION
GOOGLE_CLOUD_LOCATION=GOOGLE_CLOUD_LOCATION
TEMP_LOCATION=gs://BUCKET/temp
INPUT_PATTERN=gs://BUCKET/*.vcf
REPORT_PATH=gs://BUCKET/report.tsv
RESOLVED_HEADERS_PATH=gs://BUCKET/resolved_headers.vcf
COMMAND="vcf_to_bq_preprocess \
--input_pattern ${INPUT_PATTERN} \
--report_path ${REPORT_PATH} \
--job_name vcf-to-bigquery-preprocess \
--resolved_headers_path ${RESOLVED_HEADERS_PATH} \
--report_all_conflicts true \
--runner DataflowRunner"
docker run -v ~/.config:/root/.config \
gcr.io/cloud-lifesciences/gcp-variant-transforms \
--project "${GOOGLE_CLOUD_PROJECT}" \
--location "${GOOGLE_CLOUD_LOCATION}" \
--region "${GOOGLE_CLOUD_REGION}" \
--temp_location "${TEMP_LOCATION}" \
"${COMMAND}"
In addition to using the docker image, you may run the pipeline directly from source.
Example command for DirectRunner:
python3 -m gcp_variant_transforms.vcf_to_bq_preprocess \
--input_pattern gcp_variant_transforms/testing/data/vcf/valid-4.0.vcf \
--report_path gs://BUCKET/report.tsv
--job_name vcf-to-bigquery-preprocess-direct-runner \
--resolved_headers_path gs://BUCKET/resolved_headers.vcf \
--report_all_conflicts true \
--temp_location "${TEMP_LOCATION}"
Example command for DataflowRunner:
python3 -m gcp_variant_transforms.vcf_to_bq_preprocess \
--input_pattern gs://BUCKET/*.vcf \
--report_path gs://BUCKET/report.tsv \
--job_name vcf-to-bigquery-preprocess \
--resolved_headers_path gs://BUCKET/resolved_headers.vcf \
--report_all_conflicts true \
--setup_file ./setup.py \
--runner DataflowRunner \
--project "${GOOGLE_CLOUD_PROJECT}" \
--region "${GOOGLE_CLOUD_REGION}" \
--temp_location "${TEMP_LOCATION}"
The following is the conflicts report generated by running the preprocessor on the data set 1000 Genomes.
Header Conflicts
ID | Category | Conflicts | File Paths | Proposed Resolution |
---|---|---|---|---|
GL | FORMAT | num=None type=Float | gs://genomics-public-data/1000-genomes/vcf/ALL.chr13.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf | num=None type=Float |
gs://genomics-public-data/1000-genomes/vcf/ALL.chr17.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf | ||||
gs://genomics-public-data/1000-genomes/vcf/ALL.chr21.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf | ||||
gs://genomics-public-data/1000-genomes/vcf/ALL.chr8.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf | ||||
gs://genomics-public-data/1000-genomes/vcf/ALL.chrX.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf | ||||
num=3 type=Float | gs://genomics-public-data/1000-genomes/vcf/ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf | |||
GQ | FORMAT | num=1 type=Float | gs://genomics-public-data/1000-genomes/vcf/ALL.chrY.phase1_samtools_si.20101123.snps.low_coverage.genotypes.vcf | num=1 type=Float |
num=1 type=Integer | gs://genomics-public-data/1000-genomes/vcf/ALL.chrY.phase1_samtools_si.20101123.snps.low_coverage.genotypes.vcf |
Inferred Headers
ID | Category | Proposed Resolution |
---|---|---|
FT | FORMAT | num=1 type=String |
No Malformed Records Found.