Data Science.ppt

Data Science

Topics
  - databases and data architectures
  - databases in the real world: scaling, data quality, distributed
  - machine learning/data mining/statistics
  - information retrieval

Data Science is currently a popular interest of employers
  - our Industrial Affiliates Partners say there is high demand for students trained in Data Science
    - databases, warehousing, data architectures
    - data analytics: statistics, machine learning
  - Big Data: gigabytes/day or more
    - Examples: Walmart, cable companies (ads linked to content, viewer trends), airlines/Orbitz, HMOs, call centers, Twitter (500M tweets/day), traffic surveillance cameras, detecting fraud, identity theft
  - supports “Business Intelligence”
    - quantitative decision-making and control
    - finance, inventory, pricing/marketing, advertising
    - need data for identifying risks, opportunities, conducting “what-if” analyses

Data Architectures
  - traditional databases (CSCE 310/608)
    - tables, fields
    - tuples = records or rows
    - key = field with unique values
      - can be used as a reference from one table into another
      - important for avoiding redundancy, which risks inconsistency (normalization)
    - join: combining 2 tables using a key
  - metadata: data about the data
    - names of the fields, types (string, int, real, mpeg...)
    - also things like source, date, size, completeness/sampling

Tables: Instructors, TeachingAssignments, Courses

SQL: Structured Query Language
    SELECT Name,HomeTown FROM Instructors WHERE PhD
    SELECT Course,Title FROM Courses ORDER BY Course;
      CSCE 121  Introduction to Computing in C++
      CSCE 206  Programming in C
      CSCE 314  Programming Languages
      CSCE 411  Design and Analysis of Algorithms
  - can also compute sums, counts, means, etc.
  - example of JOIN: find all courses taught by someone from CMU:
    SELECT TeachingAssignments.Course
    FROM Instructors JOIN TeachingAssignments
      ON Instructors.Name=TeachingAssignments.Name
    WHERE Instructors.PhD='Carnegie Mellon'
      CSCE 314
      CSCE 206
    because they were both taught by Bill Jones

SQL servers
  - centralized database, required for concurrent access by multiple users
  - ODBC: Open DataBase Connectivity
    - protocol to connect to servers and do queries, updates from languages like Java, C, Python
  - Oracle, IBM DB2 - industrial strength SQL databases

some efficiency issues with real databases
  - indexing
    - how to efficiently find all songs written by Paul Simon in a database with 10,000,000 entries?
    - data structures for representing sorted order on fields
  - disk management
    - databases are often too big to fit in RAM; leave most of the data on disk and swap in blocks of records as needed
    - could be slow
  - concurrency
    - transaction semantics: either all updates happen as a batch or none do (commit or rollback)
    - like delete one record and simultaneously add another, but guarantee not to leave the database in an inconsistent state
    - other users might be blocked till done
  - query optimization
    - the order in which you JOIN tables can drastically affect the size of the intermediate tables

Unstructured data
  - raw text documents, digital libraries
  - grep, substring indexing, regular expressions, like find all
    instances of “[aA]g+ies”, including “agggggies”
  - Information Retrieval (CSCE 470)
    - look for synonyms, similar words (like “car” and “auto”)
    - tf-idf (term frequency/inverse document frequency): weighting for important words
    - LSI (latent semantic indexing), e.g. “dogs” is similar to “canines” because they are used similarly (both near “bark” and “bite”)
  - Natural Language parsing
    - extracting requirements from job postings

Unstructured data
  - images, video (BLOBs = binary large objects)
    - how to extract features? index them? search them?
    - color histograms
    - convolutions/transforms for pattern matching
    - looking for ICBM missiles in aerial photos of Cuba
  - streams
    - sports ticker, radio, stock quotes...
  - XML files
    - with tags indicating field names
      CSCE 411
      Design and Analysis of Algorithms

Object databases
  [diagram of linked objects]
  - CHEM 102, Intro to Chemistry, TR 3:00-4:00, prereq: CHEM 101
  - Texas A&M, College Station, TX, Div 1A, 53,299 students
  - Dr. Frank Smith, 302 Miller St., PhD Cornell, 13 years experience
  - link labels: ClassOfferedAt, TaughtBy, Instructor/Employee
  In a database with millions of objects, how do you efficiently do queries (i.e. follow pointers) and retrieve information?

Real-world issues with databases
  - it's all about scaling up to many records (and many users)
  - data warehousing:
    - full database is stored in a secure, off-site location
    - slices, snapshots, or views are put on interactive query servers for fast user access (“staging”)
    - might be processed or summarized data
  - databases are often distributed
    - different parts of the data held in different sites
    - some queries are local, others are “corporate-wide”
    - how to do distributed queries?
    - how to keep the databases synchronized?
    - CSCE 438 Distributed Object Programming

OLAP: OnLine Analytical Processing
  [diagram labels: data warehouse: every transaction ever recorded; OLAP server; nightly updates and summaries]
  http:/ library/ms174587.aspx
  - multi-dimensional tables of aggregated sales in different regions in recent quarters, rather than “every transaction”
  - users can still look at seasonal or geographic trends in different product categories
  - project data onto 2D spreadsheets, graphs

data integrity
  - missing values
    - how to interpret? not available? 0? use the mean?
  - duplicated values
    - including partial matches (Jon Smith = John Smith?)
  - inconsistency:
    - multiple addresses for a person
    - out-of-date data
    - inconsistent usage: does “destination” mean that of the first leg or of the whole flight?
  - outliers:
    - salaries that are negative, or in the trillions
  - most databases allow “integrity constraints” to be defined that validate newly entered data

Interoperability
  - how can data from one database be compared or combined with another?
  - what if fields are not the same, or not present, or used differently?
    - think of medical or insurance records
  - translation/mapping of terms
  - standards
    - units like ft/s, or gallons, etc.
    - identifiers like SSN, UIN, ISBN
  - “federated” databases
    - queries that combine information across multiple servers

“Data cleansing”
  - filling in missing data (imputing values)
  - detecting and removing outliers
  - smoothing
    - removing noise by averaging values together
  - filtering, sampling
    - keeping only selected representative values
  - feature extraction
    - e.g. in a photo database, which people are wearing glasses? which photos have more than one person? which are outdoors?

Data Mining/Data Analytics
  - finding patterns in the data
  - statistics
  - machine learning (CSCE 633)

Numerical data
  - correlations
  - multivariate regression
  - fitting “models”
    - predictive equations that fit the data
    - from a real estate database of home sales, we get
      housing price = 100*SqFt - 6*DistanceToSchools + 0.1*AverageOfNeighborhood
  - ANOVA for testing differences between groups
  - R is one of the most commonly used software packages for doing statistical analysis
    - can load a data table, calculate means and correlations, fit distributions, estimate parameters, test hypotheses, generate graphs and histograms

clustering
  - similar photos, documents, cases
  - discovery of “structure” in the data
  - example:
    accident database
    - some clusters might be identified with “accidents involving a tractor trailer” or “accidents at night”
  - top-down vs. bottom-up clustering methods
  - granularity: how many clusters?

decision trees (classifiers)
  - what factors, decisions, or treatments led to different outcomes?
  - recursive partitioning algorithms
  - related methods
    - “discriminant” analysis
    - what factors lead to return of product?
  - extract “association rules”
    - Veterinary database - dogs treated for disease
    - boxer dogs tend to have congenital defects
    - covers 5% of patients with 80% confidence

other types of data
  - time series and forecasting:
    - model the price of gas using autoregression
    - a function of recent prices, demand, geopolitics...
    - de-trend: factor out seasonal trends
  - GIS (geographic information systems)
    - longitude/latitude coordinates in the database
    - objects: city/state boundaries, river locations, roads
    - find regions in B/CS with an excess of coffee shops
  [figure: Toy Sales, from: Basic Statistics for Business and Economics, Lind et al (2009), Ch 16.]
  [figure credit: Frank Curriero]
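
The JOIN query from the SQL slide can be tried out with Python's built-in sqlite3 module. This is only a sketch: the slide's table contents are not available, so every row below is an illustrative assumption except that Bill Jones (PhD from Carnegie Mellon) taught CSCE 314 and CSCE 206. The helper name `courses_taught_by_phd_from`, the instructor "Pat Lee", and the home towns are hypothetical.

```python
import sqlite3


def courses_taught_by_phd_from(school):
    """Return the courses taught by instructors whose PhD is from `school`."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Instructors (Name TEXT PRIMARY KEY, HomeTown TEXT, PhD TEXT);
        CREATE TABLE TeachingAssignments (Name TEXT, Course TEXT);
        -- Illustrative rows; only the Bill Jones assignments come from the slide.
        INSERT INTO Instructors VALUES ('Bill Jones', 'Pittsburgh', 'Carnegie Mellon');
        INSERT INTO Instructors VALUES ('Pat Lee', 'Austin', 'Cornell');
        INSERT INTO TeachingAssignments VALUES ('Bill Jones', 'CSCE 314');
        INSERT INTO TeachingAssignments VALUES ('Bill Jones', 'CSCE 206');
        INSERT INTO TeachingAssignments VALUES ('Pat Lee', 'CSCE 411');
    """)
    # Name is the key that links the two tables; the WHERE clause filters
    # on a field that exists only in Instructors.
    rows = conn.execute(
        """
        SELECT TeachingAssignments.Course
        FROM Instructors JOIN TeachingAssignments
          ON Instructors.Name = TeachingAssignments.Name
        WHERE Instructors.PhD = ?
        ORDER BY TeachingAssignments.Course
        """,
        (school,),
    ).fetchall()
    conn.close()
    return [course for (course,) in rows]
```

Calling `courses_taught_by_phd_from("Carnegie Mellon")` returns the two Bill Jones courses. The `ORDER BY` is added here only to make the result order predictable.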
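
The concurrency slide's "commit or rollback" idea can be sketched with sqlite3 as well. The `Songs` table, its rows, and the function names are hypothetical, chosen only for illustration; the point is that the delete and the insert either both take effect or neither does.

```python
import sqlite3


def make_db():
    """Create a small in-memory table with two illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Songs (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
    conn.execute("INSERT INTO Songs VALUES (1, 'Graceland'), (2, 'Kodachrome')")
    conn.commit()
    return conn


def replace_record(conn, old_id, new_id, new_title):
    """Delete one record and add another as one all-or-nothing transaction."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if an exception is raised inside it.
        with conn:
            conn.execute("DELETE FROM Songs WHERE id = ?", (old_id,))
            conn.execute(
                "INSERT INTO Songs (id, title) VALUES (?, ?)", (new_id, new_title)
            )
        return True
    except sqlite3.IntegrityError:
        # The insert violated a constraint, so the delete was undone too.
        return False


def song_ids(conn):
    return [row[0] for row in conn.execute("SELECT id FROM Songs ORDER BY id")]
```

If the insert fails, for example because `new_id` is already in use, the earlier delete is rolled back and the table is left as it was.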
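
The regular-expression example from the unstructured-data slide can be checked with Python's re module. The pattern `[aA]g+ies` is an assumption about what the slide intended (one "a" or "A", one or more "g", then "ies"); the function name `find_aggies` is hypothetical.

```python
import re

# Assumed pattern: "a" or "A", then one or more "g", then "ies".
AGGIES = re.compile(r"[aA]g+ies")


def find_aggies(text):
    """Return every substring of `text` that matches the pattern, in order."""
    return AGGIES.findall(text)
```

Because the pattern is not anchored to word boundaries, it also matches inside longer words; wrapping it in `\b...\b` would be one way to restrict it to whole words.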
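
The tf-idf weighting mentioned on the Information Retrieval slide can be sketched in a few lines. This uses one common variant (term frequency normalized by document length, times the natural log of N divided by document frequency); the slides do not say which variant is meant, so treat the exact formula as an assumption. Documents are assumed to be non-empty and are split on whitespace.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document in `docs`."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

A term that appears in every document gets weight 0, while a term that appears in only one document gets the largest weight, which is the sense in which tf-idf picks out "important" words.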
