Collection of SANS standards in electronic format (PDF)

1. Copyright

This standard is available to staff members of companies that have subscribed to the complete collection of SANS standards in accordance with a formal copyright agreement. This document may reside on a CENTRAL FILE SERVER or INTRANET SYSTEM only. Unless specific permission has been granted, this document MAY NOT be sent or given to staff members from other companies or organizations. Doing so would constitute a VIOLATION of SABS copyright rules.

2. Indemnity

The South African Bureau of Standards accepts no liability for any
damage whatsoever that may result from the use of this material or the information contained therein, irrespective of the cause and quantum thereof.

ISBN 978-0-626-23546-8
SANS 28500:2009, Edition 1
ISO 28500:2009, Edition 1

SOUTH AFRICAN NATIONAL STANDARD

Information and documentation - WARC file format

This
national standard is the identical implementation of ISO 28500:2009, and is adopted with the permission of the International Organization for Standardization.

Published by SABS Standards Division
1 Dr Lategan Road, Groenkloof
Private Bag X191, Pretoria 0001
Tel: +27 12 428 7911
Fax: +27 12 344 1568
www.sabs.co.za

Table of changes

Change No.    Date    Scope

National foreword

This South African standard was approved by National Committee SABS SC 46C, Information and documentation - Identification and description, in accordance with procedures of the
SABS Standards Division, in compliance with annex 3 of the WTO/TBT agreement.

This SANS document was published in December 2009.

Reference number: ISO 28500:2009(E)
© ISO 2009

INTERNATIONAL STANDARD ISO 28500
First edition, 2009-05-15

Information and documentation - WARC file format
Information et documentation - Format de fichier WARC

SANS 28500:2009. This standard may only be used and printed by approved subscription and freemailing clients of the SABS.

PDF disclaimer

This PDF file may contain embedded typefaces. In accordance with Adobe's licensing policy, this file may be printed or viewed but shall not be edited unless the typefaces which are embedded are licensed to and installed on the computer performing the editing. In downloading this file, parties accept therein the responsibility of not infringing Adobe's licensing policy. The ISO Central Secretariat accepts no liability in
this area.

Adobe is a trademark of Adobe Systems Incorporated.

Details of the software products used to create this PDF file can be found in the General Info relative to the file; the PDF-creation parameters were optimized for printing. Every care has been taken to ensure that the file is suitable for use by ISO member bodies. In the unlikely event that a problem relating to it is found, please inform the Central Secretariat at the address given below.

COPYRIGHT PROTECTED DOCUMENT

© ISO 2009

All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or ISO's member body in the country of the requester.

ISO copyright office
Case postale 56, CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org

Published in Switzerland

Contents

Foreword
Introduction
1 Scope
2 Normative references
3 Terms, definitions and abbreviated terms
  3.1 Terms and definitions
  3.2 Abbreviated terms
4 File and record model
5 Named fields
  5.1 General
  5.2 WARC-Record-ID (mandatory)
  5.3 Content-Length (mandatory)
  5.4 WARC-Date (mandatory)
  5.5 WARC-Type (mandatory)
  5.6 Content-Type
  5.7 WARC-Concurrent-To
  5.8 WARC-Block-Digest
  5.9 WARC-Payload-Digest
  5.10 WARC-IP-Address
  5.11 WARC-Refers-To
  5.12 WARC-Target-URI
  5.13 WARC-Truncated
  5.14 WARC-Warcinfo-ID
  5.15 WARC-Filename
  5.16 WARC-Profile
  5.17 WARC-Identified-Payload-Type
  5.18 WARC-Segment-Number
  5.19 WARC-Segment-Origin-ID
  5.20 WARC-Segment-Total-Length
6 WARC record types
  6.1 General
  6.2 warcinfo
  6.3 response
  6.4 resource
  6.5 request
  6.6 metadata
  6.7 revisit
  6.8 conversion
  6.9 continuation
7 Record segmentation
8 Registration of MIME media types application/warc and application/warc-fields
  8.1 General
  8.2 application/warc
  8.3 application/warc-fields
9 WARC file name, size and compression
Annex A (informative) Use cases for writing WARC records
Annex B (informative) Examples of WARC records
Annex C (informative) WARC file size and name recommendations
Annex D (informative) Compression recommendations
Bibliography

Foreword

ISO (the International Organization for Standardization)
is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been established has the right to be represented on that committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.

International Standards are
drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.

The main task of technical committees is to prepare International Standards. Draft International Standards adopted by the technical committees are circulated to the member bodies for voting. Publication as an International
Standard requires approval by at least 75 % of the member bodies casting a vote.

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. ISO shall not be held responsible for identifying any or all such patent rights.

ISO 28500 was prepared by Technical Committee ISO/TC 46, Information and documentation, Subcommittee SC 4, Technical interoperability.

Introduction

Websites and web pages emerge and disappear from the World Wide Web every day. For the past ten years, memory storage organizations have tried to find the most appropriate ways to collect and keep track of this vast quantity of important material using web-scale tools such as web crawlers. A web crawler is a program that browses the web in an automated manner according to a set of policies; starting with a list of URLs, it saves each page identified by a URL, finds all the hyperlinks in the page (e.g. links to other pages, images, videos, scripting or style instructions), and adds them to
the list of URLs to visit recursively. Storing and managing the billions of saved web page objects itself presents a challenge.

At the same time, those same organizations have a rising need to archive large numbers of digital files not necessarily captured from the web (e.g. entire series of electronic journals, or data generated by environmental sensing equipment). A general requirement that appears to be emerging is for a container format that permits one file simply and safely to carry a very large number of constituent data objects for the purpose of storage, management, and exchange. Those data objects (or resources) need to be of unrestricted type (including many binary types for audio, CAD, compressed files, etc.), but fortunately the container needs only minimal knowledge of the nature of the objects.

The WARC (Web ARChive) file format offers a convention for concatenating multiple resource records (data objects), each consisting of a set of simple text headers and an arbitrary data block, into one long file. The WARC format is an extension of the ARC file format (ARC) that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. Each capture in an ARC file is preceded by a one-line header that very briefly describes the harvested content and its length. This is directly followed by the retrieval protocol response messages and content. The original ARC format file has been used by the Internet Archive (IA) since 1996 for managing billions of objects, and by several national libraries.

The motivation to extend the ARC format arose from the discussion and experiences of the International Internet Preservation Consortium (IIPC), whose members include the national libraries of Australia, Canada, Denmark, Finland, France, Iceland, Italy, Norway, Sweden, The British Library (UK), The Library of Congress (USA), and the Internet Archive (IA). The California Digital Library and the Los Alamos National Laboratory also provided input on extending and generalizing the format.

The WARC format is expected to be
a standard way to structure, manage and store billions of resources collected from the web and elsewhere. It will be used to build applications for harvesting (such as the open source Heritrix web crawler), managing, accessing, and exchanging content. The way WARC files will be created and resources
stored and rendered will depend on software and application implementations.

Besides the primary content recorded in ARCs, the extended WARC format accommodates related secondary content, such as assigned metadata, abbreviated duplicate detection events, later-date transformations, and segmentation of large resources. The extension may also be useful for more general applications than web archiving.

To aid the development of tools that are backwards compatible, WARC content is clearly distinguishable from pre-revision ARC content. The WARC file format is made sufficiently different from the
legacy ARC format files so that software tools can unambiguously detect and correctly process both WARC and ARC records; given the large amount of existing archival data in the previous ARC format, it is important that access and use of this legacy not be interrupted when transitioning to the WARC format.

After the Internet Engineering Steering Group (IESG: http://www.ietf.org/iesg.html) approval, IANA (Internet Assigned Numbers Authority: http://www.iana.org/) is expected to register the WARC type "application/warc" using the application provided in this International Standard and following procedures defined in RFC2048.

INTERNATIONAL STANDARD ISO 28500:2009(E)

Information and documentation - WARC file format

1 Scope

This International Standard specifies the WARC file format:

- to store both the payload content and control information from mainstream Internet application layer protocols, such as HTTP, DNS, and FTP;
- to store arbitrary metadata linked to other stored data (e.g. subject classifier, discovered language, encoding);
- to support data compression and maintain data record integrity;
- to store all control information from the harvesting protocol (e.g. request headers), not just response information;
- to store the results of data transformations linked to other stored data;
- to store a duplicate detection event linked to other
  stored data (to reduce storage in the presence of identical or substantially similar resources);
- to be extended without disruption to existing functionality;
- to support handling of overly long records by truncation or segmentation, where desired.

2 Normative references

The following referenced documents are indispensable for the application of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.

ISO 8601, Data elements and interchange formats - Information interchange - Representation of dates and times

RFC1035 Mockapetris, P. Domain names - Implementation and specification. STD 13, November 1987. Available at: http://www.faqs.org/rfcs/rfc1035.html

RFC1884 Hinden, R. and Deering, S. IP Version 6 Addressing Architecture. December 1995. Available at: http://www.faqs.org/rfcs/rfc1884.html

RFC2045 Freed, N. and Borenstein, N. Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies. November 1996. Available at: http://www.faqs.org/rfcs/rfc2045

RFC2540 Eastlake, D. Detached Domain Name System (DNS) Information. March 1999. Available at: http://www.faqs.org/rfcs/rfc2540.html

RFC2616 Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P. and Berners-Lee, T. Hypertext Transfer Protocol - HTTP/1.1. June 1999 (TXT, PS, PDF, HTML, XML). Available at: http://www.faqs.org/rfcs/rfc2616.html

RFC2822 Resnick, P. (ed.) Internet Message Format. April 2001. Available at: http://www.faqs.org/rfcs/rfc2822

RFC3629 Yergeau, F. UTF-8, a transformation format of ISO 10646. STD 63, November 2003. Available at: http://www.faqs.org/rfcs/rfc3629.html

RFC3986 Berners-Lee, T., Fielding, R. and Masinter, L. Uniform Resource Identifier (URI): Generic Syntax. STD 66, January 2005 (TXT, HTML, XML). Available at: http://www.faqs.org/rfcs/rfc3986.html

RFC4027 Josefsson, S. Domain Name System Media Types. April 2005. Available at: http://www.faqs.org/rfcs/rfc4027.html

W3CDTF Date and Time Formats: note submitted to the W3C. 15 September 1997 (W3C profile of ISO 8601). Available at: http://www.w3.org/TR/NOTE-datetime

3 Terms, definitions and abbreviated terms

3.1 Terms and definitions

For the purposes of this document, the following terms and definitions apply.

3.1.1
WARC record
basic constituent of a WARC file, which consists of a sequence of WARC records

3.1.2
WARC record content block
part (zero or more octets) of a WARC record that follows the header and that forms the main body of a WARC record

3.1.3
WARC record payload
data object referred to or contained by a WARC record as a meaningful subset of the content block

3.1.4
WARC record header
begi