Taiwan Biobank’s Purpose
The purpose of the Taiwan Biobank is to establish a biological database of the Taiwanese population by combining information on lifestyle habits, environmental factors, and biological markers. The aim is to collect a vast number of biological specimens and health information for biomedical research, and provide researchers with access to this information.
Since the announcement of the first complete human genome sequence in 2003, significant progress has been made in biomedical research worldwide. The Taiwan Biobank hopes to contribute to this progress by establishing a database of genome-related information to help biomedical researchers identify clues to the onset, progression, and treatment of diseases, and to promote the future health of the Taiwanese people. Researchers may apply for access to the data and conduct analyses to establish preliminary study hypotheses, which may serve as the basis for further research. Any personally identifiable information (PII) is removed to protect the privacy of participants.
Variants
Genomic DNA was extracted from peripheral blood using the Chemagen Blood DNA isolation kit or the QIAsymphony DNA Maxi Kit at the Taiwan Biobank. Sequencing and genotyping were performed at Genomics Bioscience and Technology Co., Ltd. and the National Center for Genome Medicine (NCGM) at Academia Sinica, respectively.
Whole-genome sequence (WGS)
Libraries were constructed with the TruSeq DNA PCR-free High Throughput Library Prep Kit and sequenced on Illumina HiSeq 2500, 4000 or NovaSeq 6000 platforms (2 x 150 bp paired-end) at 30x sequencing depth. The workflows for data pre-processing and variant discovery adhered to GATK Best Practices. The paired-end reads were aligned to the GRCh38 human reference with BWA-MEM (v0.7.17). PCR duplicates were removed with samblaster (v0.1.24), followed by sorting with sambamba (v0.6.8) and base quality score recalibration (BQSR) with GATK (v4.1.0.0) BaseRecalibrator and ApplyBQSR. Single-sample genomic variant call format (gVCF) files were generated with GATK (v4.1.0.0) HaplotypeCaller in gVCF mode. Multi-sample joint calling was then performed using GATK (v4.1.0.0) GenomicsDBImport and GenotypeGVCFs, followed by variant quality score recalibration (VQSR) filtering with GATK (v4.1.0.0) to produce the final multi-sample call set. The final variant annotation was conducted with Ensembl VEP (v108). Allele frequencies of ~46.3M autosomal variants and ~2.3M variants on sex chromosomes were calculated from 1,492 unrelated samples (no relatives of second degree or closer).
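The allele frequency calculation in the last step can be illustrated with a minimal sketch. It assumes diploid genotype strings in VCF style (such as 0/1) at an autosomal site; the function name and the example genotypes are hypothetical, and this is not the pipeline actually used by the Taiwan Biobank.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return {allele: frequency} from diploid genotype strings such as '0/1'.

    Genotypes with a missing allele ('.') are skipped, so the denominator
    is the number of called alleles, not 2 x the number of samples.
    """
    counts = Counter()
    for gt in genotypes:
        # Treat phased ('|') and unphased ('/') genotypes the same way.
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            continue
        counts.update(alleles)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {allele: n / total for allele, n in counts.items()}


# Hypothetical genotypes for one biallelic autosomal site in five samples.
example = ["0/0", "0/1", "1/1", "./.", "0|1"]
freqs = allele_frequencies(example)
print(freqs)
```

Variants on sex chromosomes would need ploidy-aware counting (for example, hemizygous calls in males), which this sketch does not attempt.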
TWB1 array
The customized TWB1 array is based on the Thermo Fisher Axiom Genome-Wide CHB array. About 650,000 variants can be detected on the array, including variants related to cancer risk, drug response, or drug metabolism, and variants that are polymorphic in the Taiwanese population.
TWB2 array
Based on the genotyping data accrued from the TWB1 array and whole-genome sequence information from ~1000 Taiwan Biobank participants, the NCGM and Thermo Fisher Scientific jointly developed the TWB2 genotyping array specifically for the Taiwanese population. This array contains approximately 750,000 variants that can be immediately applied to clinical and precision medicine research and is enriched with rare coding variants. The TWB2 array contains ~100,000 variants which are shared with the TWB1 array.
Genotype imputation
The TWB1 and TWB2 array data were separately pre-phased with SHAPEIT4 (v4.1.2) and imputed with IMPUTE2 (v2.3.2) using a combined reference panel consisting of East Asian population data from the 1000 Genomes Project phase 3 (n=504) and Taiwan Biobank WGS data (n=1,451). Post-imputation quality control was conducted to remove variants with poor imputation quality (INFO score < 0.3) or low minor allele frequency (MAF < 0.01%). After variant removal, around 16.5M and 16.2M variants remained in the imputed data of TWB1 and TWB2, respectively. A combined dataset of ~9.8M variants was generated by including only genotypes that were consistent among all three of the WGS, TWB1-imputed and TWB2-imputed datasets.
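The post-imputation variant filter can be sketched as follows. The thresholds come from the text above; the record layout, the names, and the example values are assumptions made for illustration, and MAF is stored as a fraction, so 0.01% becomes 0.0001.

```python
from dataclasses import dataclass


@dataclass
class ImputedVariant:
    variant_id: str  # hypothetical identifier
    info: float      # imputation INFO score
    maf: float       # minor allele frequency as a fraction, not a percent


INFO_MIN = 0.3    # variants with INFO score < 0.3 are removed
MAF_MIN = 0.0001  # variants with MAF < 0.01% are removed


def post_imputation_qc(variants):
    """Keep variants that meet both the INFO and the MAF threshold."""
    return [v for v in variants if v.info >= INFO_MIN and v.maf >= MAF_MIN]


# Hypothetical records; the values are made up for illustration.
variants = [
    ImputedVariant("var_a", info=0.95, maf=0.12),
    ImputedVariant("var_b", info=0.25, maf=0.30),     # fails INFO
    ImputedVariant("var_c", info=0.80, maf=0.00005),  # fails MAF
    ImputedVariant("var_d", info=0.30, maf=0.0001),   # exactly on both thresholds
]
kept = [v.variant_id for v in post_imputation_qc(variants)]
print(kept)
```

Variants sitting exactly on a threshold are kept here, because the text describes removal only for values strictly below the cutoffs.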
Allele frequencies were calculated for the genotyped, imputed and combined datasets after removing samples which fulfilled any one of the following criteria: call rate < 0.98, extreme heterozygosity, relatedness up to second degree, or divergent ancestry.
PheWeb
The combined dataset from imputation of the TWB1 and TWB2 chips was used for the calculation of genome-wide associations. Quality control was performed to remove samples with call rate < 0.98, extreme heterozygosity, twins/duplicates, or divergent ancestry. Variants with INFO score < 0.8, call rate < 0.98, MAF < 0.5%, or HWE test p value < 10^-10 were filtered out as well. Associations were then calculated using SAIGE (v0.44.6.2) with adjustment for age, sex, BMI, genotyping array, and the top 5 principal components for a set of 242 phenotypes, including 111 quantitative traits as well as 131 binary traits with at least 100 cases. Genetic correlations were estimated across phenotypes with heritability z score > 2 using LDSC (v1.0.1) with East Asian LD scores from the 1000 Genomes Project phase 3.
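To make the Hardy-Weinberg equilibrium (HWE) filter concrete, here is a minimal sketch. The text does not say which HWE test was used, so the one-degree-of-freedom chi-square test below is an assumption, as are the function names and the example genotype counts; only the 10^-10 threshold comes from the text.

```python
import math

HWE_P_MIN = 1e-10  # variants with HWE p value below this are filtered out


def hwe_chi_square_p(n_aa, n_ab, n_bb):
    """P value of a 1-df chi-square HWE test from genotype counts.

    This is an assumed test choice; an exact test is often preferred
    for rare variants.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotype calls")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic site: no deviation can be tested
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_hwe(n_aa, n_ab, n_bb):
    return hwe_chi_square_p(n_aa, n_ab, n_bb) >= HWE_P_MIN


# Hypothetical counts: the first site is in exact HWE, the second has no
# heterozygotes at all despite both alleles being common.
print(passes_hwe(25, 50, 25))
print(passes_hwe(500, 0, 500))
```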
Methylation
DNA methylation data from blood cells was collected with the Illumina Infinium MethylationEPIC BeadChip at Health GeneTech Corp. and analyzed with GenomeStudio software (v2011.1). This chip is capable of directly detecting the methylation status of over 850,000 individual CpG sites.
HLA
The Human Leukocyte Antigen (HLA) complex, a group of genes located on the short arm of chromosome 6 (6p21.3), plays a critical role in regulating the immune system by facilitating antigen presentation and immune response. Class I (HLA-A, HLA-B, HLA-C) and class II (HLA-DPA1, HLA-DPB1, HLA-DQA1, HLA-DQB1, HLA-DRB1, HLA-DRB345) HLA loci were genotyped through next-generation sequencing with the NXType NGS HLA typing kit on an Ion S5 XL sequencer at Yourgene Bioscience. Raw sequence files were analyzed with the TypeStream software to produce HLA genotype calls. This technique yields high-resolution 4-field HLA typing data, providing detailed allele information for the HLA complex.
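As an illustration of what 4-field resolution means, the sketch below splits an allele name written in the standard HLA nomenclature into its fields: allele group, specific protein, synonymous coding changes, and non-coding differences. The parser, its names, and the example allele strings are assumptions for illustration and do not describe the output format of the TypeStream software.

```python
import re

# Gene name, '*', one to four colon-separated numeric fields, and an
# optional expression suffix letter (for example N for a null allele).
HLA_PATTERN = re.compile(
    r"^(?:HLA-)?(?P<gene>[A-Z0-9]+)\*"
    r"(?P<fields>\d+(?::\d+){0,3})"
    r"(?P<suffix>[NLSCAQ]?)$"
)


def parse_hla_allele(name):
    """Split an HLA allele name into gene, fields, resolution and suffix."""
    match = HLA_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognised HLA allele name: {name!r}")
    fields = match.group("fields").split(":")
    return {
        "gene": match.group("gene"),
        "fields": fields,
        "resolution": len(fields),  # 4 means full 4-field typing
        "suffix": match.group("suffix"),
    }


# Example allele names used only to show the format.
four_field = parse_hla_allele("HLA-A*02:01:01:01")
two_field = parse_hla_allele("DRB1*15:01")
print(four_field)
print(two_field)
```

Truncating the field list to its first two entries gives the 2-field (protein-level) name, which is a common way to compare typing results of different resolutions.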
Metabolites
Up to 150 metabolite concentrations and lipid-related measurements in blood plasma were acquired on the Bruker IVDr nuclear magnetic resonance (NMR) platform at the NMR Core Facility of the Biomedical Translational Research Center. Blood plasma was mixed with standard buffer prior to the NMR experiments, following the vendor's instructions. Eleven urine plasticizer and melamine concentrations were measured by liquid chromatography-tandem mass spectrometry (LC-MS/MS) at certified laboratories of Kaohsiung Medical University and National Institute of Environment Health Sciences.