BioXpress DESeq step
Step 3 of the BioXpress pipeline.
General Flow of Scripts
run_per_study.py -> run_per_tissue.py -> run_per_case.py
Procedure
DESeq step 1: Run the script run_per_study.sh
Summary
The python script run_per_study.py provides arguments to the R script deseq.R. The count and category files generated from the Annotation step are used to calculate differential expression and statistical significance. The result is a series of files per tissue including the normalized reads (DESeq normalization method), the DE results and significance, and QC files such as the PCA plot.
Note: this step is time consuming (~2-3 hours of run time)
Method
Edit the hard-coded paths in the script run_per_tissue.py
Specify the
in_dir
to be the folder containing the final output files of the Annotation steps for per studySpecify the
out_dir
Ensure that the file
list_files/studies.csv
contains all of the tissues you wish to process - Note: the studies can be run separately (in the event that 2-3 hours cannot be dedicated to run the all studies at once) by creating separate dat files with specific tissues to run
Run the shell script sh run_per_study.sh
Note: the R libraries specified in deseq.R will need to be installed if running on a new server or system, as these installations are not included in the scripts
Output
A set of files:
log file
deSeq_reads_normalized.csv - Normalized read counts (DESeq normalization method applied)
results_significance.csv - log2fc differential expression results and statistical significance (t-test)
dispersion.png
distance_heatmap.png
pca.png - Principal component analysis plot, important for observing how well the Primary Tumor and Solid Tissue Normal group together
DESeq Step 2 : Run the script run_per_tissue.sh
Summary
The python script run_per_tissue.py provides arguments to the R script deseq.R. The count and category files generated from the Annotation step are used to calculate differential expression and statistical significance. The result is a series of files per study including the normalized reads (DESeq normalization method), the DE results and significance, and QC files such as the PCA plot.
Note: this step is time consuming (~2-3 hours of run time)
Method
Edit the hard-coded paths in the script run_per_tissue.py
Specify the
in_dir
to be the folder containing the final output files of the Annotation steps for per tissueSpecify the
out_dir
Ensure that the file
list_files/tissue.dat
contains all of the tissues you wish to process - Note: the tissues can be run separately (in the event that 2-3 hours cannot be dedicated to run the all tissues at once) by creating separate dat files with specific tissues to run
Run the shell script sh run_per_tissue.sh
Output
A set of files:
log file
deSeq_reads_normalized.csv - Normalized read counts (DESeq normalization method applied)
results_significance.csv - log2fc differential expression results and statistical significance (t-test)
dispersion.png
distance_heatmap.png
pca.png - Principal component analysis plot, important for observing how well the Primary Tumor and Solid Tissue Normal group together
DESeq Step 3 : Run the script run_per_case.sh
Summary
The python script run_per_case.py provides arguments to the R script deseq.R. The count and category files generated from the Annotation step are used to calculate differential expression and statistical significance. The result is a series of files per case including the normalized reads (DESeq normalization method), the DE results and significance, and QC files such as the PCA plot.
Note: this step is time consuming (~2-3 hours of run time)
Method
Edit the hard-coded paths in the script run_per_case.py
Specify the
in_dir
to be the folder containing the final output files of the Annotation step for per_caseSpecify the
out_dir
Ensure that the file
list_files/cases.csv
contains all of the cases you wish to process - Note: the cases can be run separately (in the event that 2-3 hours cannot be dedicated to run the all tissues at once) by creating separate dat files with specific cases to run
Run the shell script sh run_per_tissue.sh
Output
A set of files:
log file
deSeq_reads_normalized.csv - Normalized read counts (DESeq normalization method applied)
results_significance.csv - log2fc differential expression results and statistical significance (t-test)
dispersion.png
distance_heatmap.png
pca.png - Principal component analysis plot, important for observing how well the Primary Tumor and Solid Tissue Normal group together