The MCDC Data Archive.ppt

Uploaded by: terrorscript155 | Document ID: 373081 | Uploaded: 2018-10-04 | Format: PPT | Pages: 59 | Size: 710KB

The MCDC Data Archive
John Blodgett
Office of Social & Economic Data Analysis, University of Missouri
Rev. May 2007
http://mcdc.missouri.edu/tutorials/mcdc_data_archive.ppt

A Brief History of the Archive
Started by the Urban Information Center (UIC) at UM St. Louis (UMSL), circa 1981.
Accessing census data files ("STF"s, huge sequential summary files on tape) was very tedious and error-prone.
Idea was to standardize the data and make it easier, cheaper and more reliable to access.
The SAS software package was becoming the tool for accessing the data.

Brief History (cont)
Idea was to create an organized collection of datasets with certain standardization.
E.g. a FIPS county code field would always be converted to a SAS variable named County and would be stored as a 3-character field (NOT a numeric) with leading 0s.
STFs with thousands of records would be partitioned into smaller datasets based on geographic summary units (counties, tracts, places, etc.).

Brief History (cont)
Very informal "database" concept.
Users were 3 SAS programmers at UIC using MVS (IBM mainframe).
No web access and no end-user access to worry about.
A database designed for easy and efficient analysis and ad-hoc queries.
The data was almost entirely (decennial) Census data.
We developed SCADS (SAS Census Access and Display System). Sold 8 copies. Only ran on IBM mainframe systems (MVS) with SAS.

Brief History: 1988
In 1988 the UIC and OSEDA (UM-Extension at Columbia) team up to become data support for the Missouri Census Data Center.
OSEDA has a wider variety of data that is to be added to the collection (archive).
OSEDA has data analysts who are not SAS programmers. Lotus 1-2-3 is very big.
Storing metadata (documentation) in a Pendaflex-based system is no longer as viable as when it was just "us guys".

Brief History: 1991-1992
The 1990 Census results are flowing.
The UIC is converting all the files to SAS datasets, mostly on tape. Data on disk is very expensive on the MVS system.
The Census Bureau is releasing the data on CDs along with some extraction software. These are the DOS ages.
To access an STF3 table for Poplar Bluff requires mounting a tape and reading it sequentially to find the relevant data, paying for the tape I/Os required to get there. Slow, expensive, and hard to estimate the cost of a query.

Brief History: 1993
Breakthrough year.
COIN (Columbia Online Information Network) and Gopher become important elements of the MSCDC.
The UIC's standard extract reports based on STF3 are turned into very simple but very popular 1- or 2-page demographic profile reports, delivered via the Internet using the Gopher protocol.
This required copying the report files to a Unix system at OSEDA, but the data and most processing are still on the MVS mainframe.

Brief History: 1994-1996
Transition years.
(Most) archive data are copied to an AIX (IBM Unix) system. This was the Great Leap Forward for the archive.
The web takes off. Windows 95 appears.
Suddenly it seems like everybody has MS-Office with Excel.

First version of Uexplore debuts in 1996 with "sub-applications" xtract, hypercon and tabrgen. It allows users to explore the data archive and do extractions. Targeted for use by the state data center core group & affiliates.

Brief History: 2001-2003
Archive moves to a new hardware system with the storage and processing speed to handle the 2000 decennial census.
Dexter replaces the old xtract modules. Hypercon & tabrgen are retired.
Metadata system based on a "datasets dataset" is developed, with Datasets.html index pages.
Enhancements designed to make the archive more "self service" oriented.

Relevance of History to DA
It was not until the mid-90s that the data archive (DA) was made end-user-accessible via the web. Even then it was for a more sophisticated user, not a casual 1-time user.
The advent of the WWW resulted in much more emphasis on making datasets easier to use and on creating metadata.
The widespread use of Excel led us to concentrate on creating extracts that could be easily loaded into spreadsheets.
There are still "filetypes" in the archive that pre-date the web, and these are generally not as accessible as those created after we started worrying about web-access issues.

What Is the Data Archive?

A loosely organized collection of data files (data sets, data tables, SAS data sets: these are all terms for the same thing).
Related supporting files in html, pdf, csv, xls and other standard web formats. Such files may contain metadata, extracts, raw input data, reports, etc.
A reasonably rigorous set of naming and organizational conventions that make accessing the data easier.
A network of MCDC people who will assist you with accessing the data.

Data Archive Directories
The archive is really just a very large Unix directory. It is named /pub/data.
The 1st-level subdirectories represent data categories that we call "filetypes".
All filetypes have a subdirectory named Tools where we keep the SAS programs that created the data sets in the filetype directory.
Occasionally we have subdirectories of filetype directories that contain data files. We do this to avoid having too many data sets in 1 directory.

Uexplore and Directories
The Uexplore navigation utility displays the contents of a single directory. It lists subdirectories, data files and other files.
Subdirectories (identified via folder icons) are listed before most files (special files like Datasets.html & Readme.html are the only ones that appear before subdirectories).
Clicking on a subdirectory invokes Uexplore to display the contents of that subdirectory.

Files and Data Files
The directories are simply containers for organizing the content of the DA, which is composed of files.
"Data Files" is the term we use to reference the special files that can be accessed via the Dexter extraction utility. AKA "data sets" & "SAS data sets".
Uexplore displays a listing of all the files within a directory in alphabetical order, with the filenames serving as hyperlinks.
In Unix, case matters and uppercase letters sort before lowercase.

File Naming Conventions
File extensions determine what happens when you select (click on) a file on the Uexplore-generated web page.
Extensions sas7bdat and sas7bvew indicate data files. Clicking invokes Dexter to extract from that data set.
Extension sas indicates a SAS code file. It will display as a text file in your browser.
Most other extensions (html, pdf, csv, txt, etc.) will be displayed as usual by your browser. E.g. for most users clicking on a file with a ".csv" extension will cause Excel to be invoked.

File Naming Conventions
Many data sets pertain to a specific geographic universe. In these cases we commonly use a filename that identifies this universe, such as "mo" (for Missouri) or "us" (for United States).
A file name that ends with 2 digits usually indicates data pertaining to a year. So file mocom06.sas7bdat contains data for 2006.

File Naming Conventions (cont)
We sometimes use geographic levels as part of file names to indicate the level(s) of geography being summarized on the set.
E.g. mostcnty is a file containing summaries for Missouri state and counties. uszips04 would indicate ZIP code level summaries for the entire U.S. for 2004.

Datasets.html
This is a special file that occurs in most (but not yet all) filetype directories.
Uexplore displays it at the top of the page in bold and uses the Description field to tell you to "Use this custom data directory page to access the database files (only) with greatly enhanced descriptions and metadata."
The MCDC goes to considerable trouble to create these files in order to make it easier to access our data. Take advantage of them.

SeeAlso.html
This filename is used in several of our filetype directories, and we hope to create these files for many more.
They provide links to other web sites with related data or information regarding this data directory.
They are usually very short pages with no fancy formatting.

Tools and Queries
These are two specially named subdirectories.
Tools we have already discussed: it's where we store the code for creating the data files, as well as (sometimes) examples of SAS programs for accessing them.
Queries contains saved Dexter queries. We have not fully implemented these yet, but the idea is that users can select these saved queries and re-run them just by clicking on the .txt files in these special subdirectories.

Structure of Data Files
The Data Files in the archive are stored as SAS data sets. (If you do not know, or want to know, anything about SAS, that is OK. Dexter lets you access these without needing to know anything about SAS.)
They are rectangular data tables with rows and columns, aka observations and variables.
The rows represent the entities being described or summarized.
The columns contain the attributes or the statistics summarizing the entity.

Finding Out About Data Files
The key to using the data archive is understanding what kinds of information about what kinds of entities are stored in the data files.
Within a filetype directory the best place to start trying to figure out what we have is a Datasets.html page (if available).
Each row of the table displayed on a Datasets.html page tells you about a data file. Not all about it, but some basic stuff.

The Uexplore/Dexter Home Page

The Archive Directory (on the Uexplore/Dexter home page)
The teal box contains links to 9 major data categories (2000 Census thru Compendia).
The rest of the page consists mostly of descriptions of, and hyperlinks to, the archive's data categories (which we refer to as filetypes).
Filetypes within the major categories are in order of what we think will be user interest.
Sf32000x has been our most popular filetype. Popests and acs2005 are gaining.

What's In the Archive?
Over 20,000 data tables ("datasets") organized into 60+ major categories.
Heavy emphasis on U.S. census data.
Not all filetypes are created equal. We spend 90% of our resources on maybe 10% of our data directories.
Filetypes in bold on the directory page are the MCDC "house specialties".

Uexplore & Dexter
Uexplore is the web tool that lets you browse the archive, displaying the contents of one directory at a time.
When Uexplore displays a special data table file it makes the name of the file a hyperlink to invoke Dexter for that table.
Dexter (which is really 2 modules) allows the user to do custom extractions from the data table files.

Facts Worth Repeating
The data tables (the things Dexter accesses) are in the same directories with other related files (SeeAlso.html files, spreadsheets, csv files, Readme files, etc.).
Each filetype directory has a special Tools subdirectory where we keep program code and other tool modules related to the data.
Subdirectories & files starting with capital letters are listed first and are usually worth looking at.
Dexter-accessible table files ("SAS datasets") have extensions of sas7bdat or sas7bvew.

Exercise
The Bureau of Economic Analysis disseminates its REIS data with key economic indicators for US geography down to the county level. On the Uexplore home page locate the filetype corresponding to this data collection (what's the major category?) and navigate to the directory page.

Uexplore Page for beareis (cropped)
What you see when you click on the beareis link on the Uexplore home page.
It displays a list of files within the directory. The "File" column entries are hyperlinks.
With a few exceptions the files are displayed in alphabetical order.
Datasets.html is a special file providing enhanced navigation of the data files in this directory. It displays just the data-table files, but in a more logical order and with additional metadata.

Datasets.html page

Datasets.html Columns
The Name column is also a link to uex2dex / dexter.
Label is a short description of the dataset.
#Rows (# of observations) and #Cols (# of columns/variables) are taken from the datasets metadata set, as are the Geographic Universe and Units.
The Details link provides access to more detailed metadata.

Universe and Units
The majority of datasets in the archive contain summary data for geographic areas.
For example, a dataset in the popests directory might contain the latest estimates for all counties in the state of Missouri. The geographic universe is Missouri, and the units are counties.
When we have many datasets in a directory it's usually because we have many different combinations of universe and units.

Common Universes
Missouri (the state of) is by far the most common universe for the MCDC archive.
United States is second; we have quite a number of national datasets.
Illinois and Kansas are also very common since we routinely download and convert census files for these key neighbor states.
A common sort order for files on Datasets.html pages is Missouri files first, then US, then IL/KS and then other states.

Rows & Columns
The rows of the data tables typically represent (i.e. contain data about) geographic entities: states, counties, cities (places), etc.
Most of the columns in the data tables are summary stats for the entity: e.g. the 2000 pop count, the latest estimated pop, the change and percent change, etc.
Other columns ("variables") are identifiers with names such as sumlev, geocode and areaname.

A Details Metadata Page
We get here by clicking on the Details link on a Datasets.html page.
Lots of info here, but it varies. Key variables is often very useful when doing filters.
Note the direct link to Dexter under "Access the dataset" near the bottom.

Increase Text Size to Read Fine Print

Exercise: Navigate to Dataset
The filetype mig2000 has data regarding migration from 1995 to 2000 as captured in the 2000 census.
Go to the Uexplore home page and navigate to this filetype.
Use the Datasets.html page to display the datasets within the directory.
Find the row for the usccflows data table and click on the Details link for this table.
From the Details page click on the keyvals link for the variable State.

Key Variables Report: State
Tells you that the variable State has a value of 01 (for "Alabama") in 22137 rows of this dataset.
This can be very helpful when doing a data filter in Dexter.

General Information About Archive Data Sets and Data Set Variables (Columns)

Dataset Naming Conventions
All filetype names are 8 characters or less. Dataset names were limited to 8 characters by the software until recently.
The first characters of the dataset name often correspond to the universe, e.g. "mo", "il", "us".

The geo units are often part of the ds-name, e.g. "motracts", "uszips".
For time series data the name usually ends with a time indicator, e.g. "uscom05" contains data thru 2005.
The names are cryptic on purpose.

Variable Naming Conventions
Not as rigorously applied as we might like, esp. for older datasets (conventions used for 1980 datasets differ a little from those for 2000 and 1990 sets, for example).
Certain names appear on many datasets and are consistent. These are mostly identifier variables, the ones used in creating filters and as keys for merging data from different files.
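
The dataset naming conventions described in these slides (a universe prefix, an optional geographic-units part, and an optional two-digit time suffix) can be illustrated with a small Python sketch. This is not MCDC code. The function name parse_dataset_name, the DatasetName tuple and the KNOWN_UNIVERSES set are hypothetical, and the prefix list holds only the universes the slides mention. Because the slides say these conventions are "often" or "usually" followed, the result should be treated as a guess about a name, not a guarantee.

```python
import re
from typing import NamedTuple, Optional

# Universe prefixes named in the slides; the real archive may well use more.
KNOWN_UNIVERSES = {"mo", "il", "ks", "us"}

# Extensions the slides identify as Dexter-accessible data files.
DATA_EXTENSIONS = (".sas7bdat", ".sas7bvew")


class DatasetName(NamedTuple):
    universe: Optional[str]     # e.g. "mo"; None if the prefix is not recognized
    middle: str                 # text between prefix and suffix, e.g. "zips"
    time_suffix: Optional[str]  # trailing two digits, e.g. "06"; None if absent


def parse_dataset_name(filename: str) -> DatasetName:
    """Split a dataset name into universe prefix, middle part and time suffix."""
    name = filename.lower()
    for ext in DATA_EXTENSIONS:
        if name.endswith(ext):
            name = name[: -len(ext)]
            break

    # Match the stem lazily so a trailing two-digit group, if any, is captured.
    match = re.fullmatch(r"(.*?)(\d{2})?", name, flags=re.DOTALL)
    stem, suffix = match.group(1), match.group(2)

    if stem[:2] in KNOWN_UNIVERSES:
        return DatasetName(stem[:2], stem[2:], suffix)
    return DatasetName(None, stem, suffix)


if __name__ == "__main__":
    for example in ("mocom06.sas7bdat", "uszips04", "mostcnty", "motracts"):
        print(example, parse_dataset_name(example))
```

The time suffix is returned as a two-character string instead of being expanded to a four-digit year, since the slides do not say how a suffix maps to a century. Any such rule would be an extra assumption.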
