Description of the data frames and metadata files available in this directory:

Drug_MoA_Information_Of_CTRP_GDSC_CCLE_gCSI_v2.txt: The mechanism of action (MoA) information of compounds/drugs included in the CTRP, GDSC, CCLE, and gCSI drug screening studies. The MoA information is curated from multiple sources and grouped into categories. Target genes are represented by both gene symbols and Entrez IDs. Drug IDs used by the Pilot 1 project are also included.

Processed_Drug_MoA_Information_From_Broad_Institute.txt: The MoA information of compounds collected from the Drug Repurposing Hub of the Broad Institute. The data have been further processed to include compound name, PubChem ID, Broad Institute ID, SMILES, MoA description, and target gene symbols.

combined_cancer_types: Cancer types for all cell lines and PDX models, generated using the GDC cancer type classifier.

combined_cl_metadata: ID-Name mapping and additional metadata for the cancer cell lines, PDX models, and GDC patient samples.

combined_dragon7_descriptors: The molecular descriptors for the drugs, generated using the Dragon 7.0 software package. Dragon 7.0 calculates 5,270 molecular descriptors, ranging from the simplest atom types, functional groups, and fragment counts through topological, geometrical, and three-dimensional descriptors to several property estimates (such as logP) and drug-like and lead-like alerts (such as the Lipinski alert). Dragon 7.0 is also used to generate path-based fingerprints (PFP) and extended connectivity fingerprints (ECFP) for the drugs.

combined_mordred_descriptors: The molecular descriptors for the drugs, generated using the Mordred software package, which calculates 1,826 molecular descriptors.

combined_rnaseq_data_combat: The gene expression datasets for cancer cell lines, generated using RNA-seq, were collected from the following sources: NCI-60, CCLE, and GDSC. The CTRP and gCSI drug response datasets were generated using cell lines from the CCLE dataset, so for those studies we used the gene expression data from the matching cell lines in CCLE. The gene expression values were represented as FPKM values. The genes were filtered to the 17,743 genes common to all datasets, and the FPKM values were transformed into TPM values (FPKM * 10^6 / sum of all FPKM values), log transformed, and normalized using the ComBat method to remove batch effects.

combined_rnaseq_data_lincs1000_combat: The gene expression datasets for cancer cell lines, generated using RNA-seq, were collected from the following sources: NCI-60, CCLE, and GDSC. The CTRP and gCSI drug response datasets were generated using cell lines from the CCLE dataset, so for those studies we used the gene expression data from the matching cell lines in CCLE. The gene expression values were represented as FPKM values. The genes were filtered to those in the Library of Integrated Network-Based Cellular Signatures (LINCS) 1000 gene set, and the FPKM values were transformed into TPM values (FPKM * 10^6 / sum of all FPKM values), log transformed, and normalized using the ComBat method to remove batch effects.

combined_single_response_agg: The drug response datasets for cancer cell lines were collected from the following sources: the NCI-60 Human Cancer Cell Line Screen, the Cancer Cell Line Encyclopedia (CCLE), the Cancer Therapeutics Response Portal (CTRP), the Genomics of Drug Sensitivity in Cancer (GDSC), and the Genentech Cell Line Screening Initiative (gCSI). Drug response was measured as percent growth inhibition at several concentrations. Dose response values outside the range [-100, 300] were considered outliers and removed. The remaining dose response values were used to calculate a dose-independent area under the curve (AUC) using a curve-fitting method over a fixed dose range, so that responses can be compared across studies.
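The FPKM-to-TPM conversion described for the RNA-seq files and the [-100, 300] outlier filter described for the drug response data can be sketched as follows. This is a minimal illustration, not the pipeline that produced these files: the function names, the samples-by-genes table layout, the per-sample sum over the genes present in the table, and the log2(TPM + 1) form of the log transform are all assumptions. ComBat normalization and the AUC curve fit are not shown.

```python
import numpy as np
import pandas as pd


def fpkm_to_log_tpm(fpkm: pd.DataFrame) -> pd.DataFrame:
    """Convert a samples-by-genes FPKM table to log-transformed TPM.

    Hypothetical helper: rows are samples, columns are genes.
    """
    # TPM = FPKM * 10^6 / (sum of all FPKM values), computed per sample.
    tpm = fpkm.div(fpkm.sum(axis=1), axis=0) * 1e6
    # Assumption: log base 2 with a pseudocount of 1; the README only
    # says the values were "log transformed".
    return np.log2(tpm + 1.0)


def drop_out_of_range(growth: pd.Series, low: float = -100.0,
                      high: float = 300.0) -> pd.Series:
    """Keep only dose response values inside [low, high] (inclusive)."""
    return growth[growth.between(low, high)]


# Toy data, invented for illustration only.
toy_fpkm = pd.DataFrame(
    {"GENE_A": [1.0, 2.0], "GENE_B": [3.0, 6.0]},
    index=["cell_1", "cell_2"],
)
toy_log_tpm = fpkm_to_log_tpm(toy_fpkm)

toy_growth = pd.Series([-150.0, -100.0, 0.0, 300.0, 301.0])
toy_kept = drop_out_of_range(toy_growth)
```

After the conversion, the TPM values of each sample sum to 10^6 regardless of that sample's total FPKM, which is what makes expression values comparable across samples before batch correction.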

TopN (Top6/Top21) Dataframes:

The TopN dataframes for the Pilot 1 component combine drug response data, gene expression data, and drug molecular descriptors into a single data frame to support building binary classification or regression machine learning models that predict drug response. These dataframes include the top N cancer types, ranked by the number of cell lines with both RNA-seq and drug response data available.

top_21.res_bin.cf_rnaseq.dd_dragon7.csv.gz: Dataframe that combines drug response data (binary), RNA-seq gene expression data, and Dragon7 drug descriptors for the top 21 cancer types to support building binary classification models for predicting drug response.

top_21.res_reg.cf_rnaseq.dd_dragon7.csv.gz: Dataframe that combines drug response data (AUC), RNA-seq gene expression data, and Dragon7 drug descriptors for the top 21 cancer types to support building regression models for predicting drug response.

top_6.res_bin.cf_rnaseq.dd_dragon7.csv.gz: Dataframe that combines drug response data (binary), RNA-seq gene expression data, and Dragon7 drug descriptors for the top six cancer types to support building binary classification models for predicting drug response.

top_6.res_reg.cf_rnaseq.dd_dragon7.csv.gz: Dataframe that combines drug response data (AUC), RNA-seq gene expression data, and Dragon7 drug descriptors for the top six cancer types to support building regression models for predicting drug response.

top_6.res_reg.cf_rnaseq.dd_dragon7.labled.csv.gz: Dataframe generated by taking the top_6.res_reg.cf_rnaseq.dd_dragon7.csv.gz dataframe and filtering out drug descriptors with more than 10% missing (NaN) values and drug responses with a poor quality of fit, as measured by R-squared.
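The missing-value filter described for the last file can be illustrated with a small pandas sketch. The helper name, the toy column names, and the choice to apply the threshold to every column are assumptions made for illustration. The README does not say how descriptor columns are identified, and the R-squared filter is not shown because its threshold is not stated.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame,
                        max_missing: float = 0.10) -> pd.DataFrame:
    """Drop columns whose fraction of NaN values exceeds max_missing.

    Hypothetical helper; in a real TopN dataframe one would likely
    restrict this to the drug descriptor columns.
    """
    missing_fraction = df.isna().mean()  # per-column share of NaN values
    return df.loc[:, missing_fraction <= max_missing]


# Toy frame, invented for illustration: 10 rows, one fully populated
# descriptor and one descriptor with 20% missing values.
toy = pd.DataFrame({
    "AUC": np.linspace(0.1, 1.0, 10),
    "desc_complete": np.arange(10, dtype=float),
    "desc_sparse": [np.nan, np.nan] + [1.0] * 8,
})
filtered = drop_sparse_columns(toy)
```

The gzip-compressed CSV files themselves can be read directly with pandas, for example `pd.read_csv("top_6.res_reg.cf_rnaseq.dd_dragon7.csv.gz")`, since `read_csv` infers the compression from the `.gz` extension.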