BIG Biomedicine and the Foundations of BIG Data Analysis.ppt

BIG Biomedicine and the Foundations of BIG Data Analysis
Michael W. Mahoney
ICSI and Dept of Statistics, UC Berkeley
May 2014
(For more info, see: http://www.stat.berkeley.edu/mmahoney)

Insiders vs outsiders views (1 of 2)
Ques: Genetics vs molecular biology vs biochemistry vs biophysics: What's the difference?
Answer: Not much (if you are a "methods" person*):
- they are all biology
- you get data from any of those areas, ignoring important domain details, and evaluate your method qua method
- your reviewers evaluate the methods and don't care about the science.
*E.g., one who self-identifies as doing data analysis or machine learning or statistics or theory of algorithms or artificial intelligence or ...

Insiders vs outsiders views (2 of 2)
Ques: Data analysis vs machine learning vs statistics vs theory of algorithms vs artificial intelligence (vs scientific computing vs computational mathematics vs databases ...): What's the difference?
Answer: Not much (if you are a "science" person*):
- they are all just tools
- you get a tool from any of those areas and bury details in a methods section
- your reviewers evaluate the science and don't care about the methods.
*E.g., one who self-identifies as doing genetics or molecular biology or biochemistry or biophysics or ...

BIG data? MASSIVE data?
NYT, Feb 11, 2012: "The Age of Big Data"
"What is Big Data? A meme and a marketing term, for sure, but also shorthand for advancing trends in technology that open the door to a new approach to understanding the world and making decisions."
Why are big data big?
- Generate data at different places/times and different resolutions
- Factor of 10 more data is not just more data, but different data

Thinking about large-scale data
Data generation is the modern version of the microscope/telescope:
- See things we couldn't see before: e.g., fine-scale movement of people; fine-scale clicks and interests; fine-scale tracking of packages; fine-scale measurements of temperature, chemicals, etc.
- Those inventions ushered in new scientific eras, new understanding of the world, and new technologies to do stuff
Easy things become hard and hard things become easy:
- Easier to see the other side of the universe than the bottom of the ocean
- Means, sums, medians, correlations are easy with small data
Our ability to generate data far exceeds our ability to extract insight from data.

How do we view BIG data?
Algorithmic vs. Statistical Perspectives
Computer Scientists
- Data: are a record of everything that happened.
- Goal: process the data to find interesting patterns and associations.
- Methodology: Develop approximation algorithms under different models of data access, since the goal is typically computationally hard.
Statisticians (and Natural Scientists)
- Data: are a particular random instantiation of an underlying process describing unobserved patterns in the world.
- Goal: extract information about the world from noisy data.
- Methodology: Make inferences (perhaps about unseen events) by positing a model that describes the random variability of the data around the deterministic model.
Lambert (2000), Mahoney (2010)

SNPs
Single Nucleotide Polymorphisms: the most common type of genetic variation in the genome across different individuals.
They are known locations in the human genome where two alternate nucleotide bases (alleles) are observed (out of A, C, G, T).
Genotype data (rows: individuals; columns: SNPs):
AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG
GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA
GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG
GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG
GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA
GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA
GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG AA
Matrices including thousands of individuals and hundreds of thousands of SNPs are available, and more/bigger/better are coming soon.
This can be written as a "matrix"; assume it's been preprocessed properly, so let's call black box matrix algorithms.

Applications in: Human Genetics
Apply PCA/SVD:
[Figure; population labels: Africa, Middle East, S C Asia & Gujarati, Europe, Oceania, East Asia, America, Mexicans]
Not altogether satisfactory: the principal components are linear combinations of all SNPs, and of course cannot be assayed!
Can we find actual SNPs that capture the information in the singular vectors?
Formally: spanning the same subspace, optimizing variance, computationally efficient.
Paschou, et al. (2010) J Med Genet

Issues with eigen-analysis
Computing large SVDs: computational time
- On commodity hardware (e.g., a 4GB RAM, dual-core laptop), using MATLAB 7.0 (R14), the computation of the SVD of the dense 2,240-by-447,143 matrix A takes about 20 minutes.
- Computing this SVD is not a one-liner, since we cannot load the whole matrix in RAM (MATLAB runs out of memory). Instead, compute the SVD of AA^T.
- In a similar "experiment," compute 1,200 SVDs on matrices of dimensions (approx.) 1,200-by-450,000 (roughly, a full leave-one-out cross-validation experiment).
Selecting actual columns that "capture the structure" of the top PCs
- Combinatorial optimization problem; hard even for small matrices.
- Often called the Column Subset Selection Problem (CSSP).
- Not clear that such "good" columns even exist.
- Avoid the "reification" problem of "interpreting" singular vectors!

CUR matrix decompositions
Goal. Solve the following problem:
"While very efficient basis vectors, the (singular) vectors themselves are completely artificial and do not correspond to actual (DNA expression) profiles. . . . Thus, it would be interesting to try to find basis vectors for all experiment vectors, using actual experiment vectors and not artificial bases that offer little insight." Kuruvilla et al. (2002)
Theorem: Given an arbitrary matrix, call a black box that I won't describe. You get a small number of actual columns/rows that are only marginally worse than the truncated PCA/SVD. The black box runs faster than computing a truncated PCA/SVD for arbitrary input. It's very robust to heuristic modifications.
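The slides deliberately leave the "black box" undescribed. As a rough illustration only, the sketch below combines two ideas: the trick mentioned earlier of working with the small matrix AA^T when A does not fit in memory, and column selection by statistical leverage scores, which is the approach taken in the cited Mahoney and Drineas (PNAS, 2009) paper. It is not the authors' code. The function names, the block-wise layout, the toy data, and the deterministic "keep the c highest-scoring columns" rule are all assumptions made for this example (the paper samples columns at random with probabilities based on these scores).

```python
import numpy as np


def gram_matrix(blocks):
    """Accumulate A A^T from column blocks of A, one block at a time.

    `blocks` is a list of arrays that together form A (split by columns),
    so the full matrix never has to be held in memory at once.
    """
    gram = None
    for block in blocks:
        contrib = block @ block.T
        gram = contrib if gram is None else gram + contrib
    return gram


def top_k_left_factors(gram, k):
    """Top-k left singular vectors and singular values of A from A A^T."""
    evals, evecs = np.linalg.eigh(gram)          # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:k]            # indices of the k largest
    sigma = np.sqrt(np.clip(evals[top], 0.0, None))
    return evecs[:, top], sigma


def leverage_scores(blocks, U, sigma):
    """Normalized leverage score of every column of A, computed per block.

    For each block, V^T = Sigma^{-1} U^T A_block gives the matching columns
    of the top-k right singular vectors; the score of column j is
    (1/k) * sum_i V^T[i, j]^2. Assumes the top k singular values are nonzero.
    """
    k = len(sigma)
    parts = []
    for block in blocks:
        vt_block = (U.T @ block) / sigma[:, None]
        parts.append(np.sum(vt_block ** 2, axis=0) / k)
    return np.concatenate(parts)


def select_columns(blocks, k, c):
    """Return the indices of the c columns with the highest leverage scores.

    Simplification for illustration: a deterministic top-c rule instead of
    random sampling.
    """
    gram = gram_matrix(blocks)
    U, sigma = top_k_left_factors(gram, k)
    scores = leverage_scores(blocks, U, sigma)
    chosen = np.sort(np.argsort(scores)[::-1][:c])
    return chosen, scores


if __name__ == "__main__":
    # Toy stand-in for a preprocessed genotype matrix:
    # 20 individuals by 500 SNPs with entries in {-1, 0, +1}.
    rng = np.random.default_rng(0)
    A = rng.integers(-1, 2, size=(20, 500)).astype(float)
    blocks = np.array_split(A, 5, axis=1)
    chosen, scores = select_columns(blocks, k=3, c=10)
    print("selected SNP columns:", chosen)
```

Because the scores are squared entries of orthonormal right singular vectors divided by k, they sum to one over all columns, so they can also be read as a probability distribution if random sampling is preferred over the top-c rule.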

Corollary: We can use the same methods to approximate the PCA/SVD.
Mahoney and Drineas, "CUR Matrix Decompositions for Improved Data Analysis" (PNAS, 2009)

Selecting PCA SNPs for individual assignment to four continents (Africa, Europe, Asia, America)
[Figure; axes: SNPs by chromosomal order vs. PCA-scores; * marks the top 30 PCA-correlated SNPs; panels: Africa, Europe, Asia, America]
Mahoney and Drineas (2009) PNAS
Paschou et al (2007; 2008) PLoS Genetics
Paschou et al (2010) J Med Genet
Drineas et al (2010) PLoS One
Javed et al (2011) Annals Hum Genet
Data analysis and machine learning and statistics and theory of algorithms and scientific computing ... and genetics and astronomy and mass spectrometry and ... like this, but each for different reasons!
Good "hydrogen atom" for methods development!

Bioinformatics: a cautionary tale?
- How did/does bioinformatics relate to computer science, statistics, and applied mathematics, "technically" and "sociologically"?
- How did NIH choose to fund graduate students and postdocs in the budget expansion of the 90s?
- What effect did this have on the number of American/foreign going into biomedical research?
- How will the pay structure of biomedical researchers affect which cs/stats "data scientists" engage you in your efforts?
- What effect does med schools deciding not to do joint faculty hires with cs departments have on bioinformatics and big biomedical data?
- How is this Big Biomedical Data phenomenon similar to and different from the Bioinformatics experience?

Big changes in the past ... and future
Consider the creation of:
- Modern Physics
- Computer Science
- Molecular Biology
These were driven by new measurement techniques and technological advances, but they led to:
- big new (academic and applied) questions
- new perspectives on the world
- lots of downstream applications
We are in the middle of a similarly big shift!
- OR and Management Science
- Transistors and Microelectronics
- Biotechnology

MMDS Workshop on "Algorithms for Modern Massive Data Sets" (http://mmds-data.org)
at UC Berkeley, June 17-20, 2014
Objectives:
- Address algorithmic, statistical, and mathematical challenges in modern statistical data analysis.
- Explore novel techniques for modeling and analyzing massive, high-dimensional, and nonlinearly-structured data.
- Bring together computer scientists, statisticians, mathematicians, and data analysis practitioners to promote cross-fertilization of ideas.
Organizers: M. W. Mahoney, A. Shkolnik, P. Drineas, R. Zadeh, and F. Perez
Registration is available now!
