Description
Summary

PhD, major in Mathematics, minor in Computer Science and Statistics, with professional training by the National Center for Biotechnology Information (NCBI), National Institutes of Health (NIH). Over 20 years of experience as a data scientist/statistician in human-health-related areas and higher education. Expert in data acquisition, processing, analysis, interpretation and reporting. Familiar with high-throughput genomic data such as Next Generation Sequencing data, clinical data such as Electronic Health Records, and patient matching algorithms, the three key components of the Precision Medicine Initiative (PMI). Proficient in modern computer platforms, programming languages and software programs. Strong analytical, statistical, and programming skills. More than 20 publications in professional journals and three software programs for analyzing genetic data and Government administrative data.

Demonstrated Successful Experience

Unemployed/Independent Research 11/2016 - Present
Data Scientist/Statistician
* Database design, implementation and data acquisition for patients who need lab tests for biomarkers to determine their breast cancer risk. The database design will follow the HL7 standard and the Logical Observation Identifiers Names and Codes (LOINC) database coding system for EHR interoperability, ePrescript functionality and efficient patient matching algorithms.
* Predictive modeling for breast cancer risk based on the biomarkers and other health information in the database. The modeling process will be designed so that it connects to the database and updates automatically when the database changes.
* Design and implementation of an automatic reporting system that generates a breast cancer risk report from the established predictive model once a physician places a new lab order through ePrescript and the lab biomarker signature information becomes available in the database. The report can be saved as a PDF file.

National Institutes of Health (NIH) 02/2015 - 10/2016
Statistician
* Developed and implemented a new searching algorithm and methodology for a large Government database, the US Social Security Administration (SSA)'s Case Process and Management System (CPMS).
* Applied the resulting program, written in R and Python, to analyze one of the master tables, with 112,000,000 records, of the administrative dataset in the CPMS system.
* Used parallel computing technology such as multicore and cluster computing to deal with big data challenges.
* Used scripting languages such as R, Python and Shell for data segmentation and profiling.
* Used visualization tools such as R, SAS and Tableau to summarize and report the findings.
* Explained the findings and gave SSA recommendations based on the analysis results.
* Applied Machine Learning algorithms to analyze SSA data in supervised and unsupervised settings.
* Familiar with EHR data structures and database systems such as NIH's CRIS and BTRIS systems as well as the open source EHR system openEMR.
* Collaborative research in survey data analysis, Item Response Theory and Computerized Adaptive Testing with application to functional measures for disability.
* Collaborative research in NLP and text mining technology with application to SSA's disability application processing.
* Participated in Deep Learning training and workshops by Nvidia.
* Two publications

Unemployed/Independent Research 07/2014 - 01/2015
Statistician
* High Dimensional Predictive Modeling.
  * Developed a variable transformation method for high dimensional predictive modeling
  * Implemented the method in a program using R
  * Applied the method to Tox21 data by participating in the 2014 Tox21 Challenge Competition
* High Performance Computing.
  * Set up and tested a small network-connected cluster of laptop and desktop machines running Windows and Linux
  * Tested online statistical and computing services using the network cluster

National Human Genome Center at Howard University 01/2006 - 06/2014
Biostatistician/Data Scientist
* Collaborative research on breast cancer, from gene selection based on microarray data to Real Time PCR data validation and biomarker confirmation (four publications and two grants)
* Designed and implemented a program using R and SQLite to perform automated deep database analysis of NIH's GEO microarray database.
* Developed and implemented a computer program in C for genome-wide genetic data analysis (one publication)
* Collaborative research on methodology for DNA sequence analysis and gene expression analysis (three publications)
* Provided statistical services for other investigators
* Provided basic statistical training for medical students