Data & Software — Sebat Lab

Data Sources

Our lab has made genomic datasets publicly available to the scientific community. These data are available from the following sources.

Whole Genome Sequencing

As part of our NIH-funded genome sequencing studies of autism (the REACH Project), we have made the variant calls and whole genome sequence data available through the National Database for Autism Research (NDAR).

View Collections

Psychiatric Genomics Consortium

The PGC CNV resource is now publicly available through the PGC CNV browser. The rare CNV call set from the PGC schizophrenia CNV study can be obtained from the European Genome-Phenome Archive (Study accession #EGAS00001001960).

Open Browser

Genome Wide Estimates of Site Specific Mutability

(Michaelson et al., Cell 2012). Mutability index (MI) values are listed for each position in the hg18 genome. The values are bundled by chromosome, and each .Rdata file (an R workspace file) contains an Rle object called 's', which holds the MI values. The log10 MI has been multiplied by 100 so that it can be stored as an integer. After dividing by 100, a value of 0 indicates a predicted mutation rate approximately equal to the genome average, and values of 1 and -1 represent, respectively, a 10x increase and a 10x decrease in mutation rate relative to the genome average (the values are on the log10 scale).
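The scaling described above can be illustrated with a short sketch. This is only an illustration of the arithmetic, not code distributed with the MI files. The function names are hypothetical, and the input values are made-up examples rather than values read from the data. In R, the same division by 100 would apply to the values held in 's'.

```python
def stored_to_log10_mi(stored_value):
    """Undo the x100 integer scaling to recover the log10 MI."""
    return stored_value / 100.0


def stored_to_fold_change(stored_value):
    """Convert a stored integer to a fold change relative to the genome average.

    Because the MI is on the log10 scale, the fold change is 10 ** (log10 MI).
    """
    return 10.0 ** stored_to_log10_mi(stored_value)


if __name__ == "__main__":
    # Hypothetical stored integers, chosen only to show the scale.
    for stored in (-100, -30, 0, 30, 100):
        log10_mi = stored_to_log10_mi(stored)
        fold = stored_to_fold_change(stored)
        print(f"stored={stored:5d}  log10 MI={log10_mi:+.2f}  fold change={fold:.3f}")
```

Under these assumptions, a stored value of 0 corresponds to the genome average mutation rate (fold change 1), and stored values of 100 and -100 correspond to a 10x increase and a 10x decrease, respectively.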

Download MI Files (1.4GB)

Software

Our lab has developed bioinformatic tools for the analysis of whole genome sequence data. This software is available from the following sources.

Sebatlab GitHub

Most software applications developed in our laboratory are available on our GitHub page.

Open Browser

Determining Parent of Origin

Code for determining the parent of origin of de novo SNVs.

Download (7MB)

forestSV

A statistical learning approach, based on Random Forests, that integrates prior knowledge about the characteristics of structural variants and leads to improved discovery in high-throughput sequencing data. The implementation of this technique, forestSV, offers high sensitivity and specificity coupled with the flexibility of a data-driven approach.

Download (738KB)

forestDNM

An R package built around a classifier that was trained to predict true de novo germline mutations (DNMs), using features derived from family genotype data contained in a VCF. The classifier was trained on 10 families with monozygotic twins, whose putative DNMs had undergone extensive experimental validation (the classifier was trained to predict validation status). In an independent test set of held-out data from the 10 families, sensitivity was > 95

Download (558KB)