A Vector API for Java
Ian Graves

Legal Disclaimers

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A “Mission Critical Application” is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked “reserved” or “undefined”. Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http:/

Intel, the Intel logo, Intel Xeon, and Xeon logos are trademarks of Intel Corporation in the U.S. and/or other countries. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families: Go to: Learn About Intel Processor Numbers http:/

*Other names and brands may be claimed as the property of others.

Copyright 2015 Intel Corporation. All rights reserved.

Legal Disclaimers Continued

Some results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.

Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that correlates with the performance improvements reported.

SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjbb, SPECompG, SPEC MPI, and SPECjEnterprise* are trademarks of the Standard Performance Evaluation Corporation. See http://www.spec.org for more information.

TPC Benchmark, TPC-C, TPC-H, and TPC-E are trademarks of the Transaction Processing Council. See http://www.tpc.org for more information.

Intel Advanced Vector Extensions (Intel AVX)* are designed to achieve higher throughput to certain integer and floating point operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration and you should consult your system manufacturer for more information.

Intel Advanced Vector Extensions refers to Intel AVX, Intel AVX2 or Intel AVX-512. For more information on Intel Turbo Boost Technology
2.0, visit http:/

In this Presentation
- Is still a rough prototype! Subject to change!
- Part of the OpenJDK Project Panama
- Licensed under GPLv2 with Classpath Exception
- Get the code here! http:/
- CodeSnippets
- Vector API Design
- Wrap Up

Introduction: Vector API Project Team
- Oracle: Vladimir Ivanov, John Rose, Paul Sandoz
- Intel: Michael Berg, Steve Dohrmann, Ian Graves, Shravya Rukmannagari, Sandhya Viswanathan

Terminology
- Code Snippets: encoding instructions as data in Java; binding to MethodHandle.
- Vector API: an API encompassing operations with vector instruction support, implemented on top of Code Snippets.
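To make "binding to MethodHandle" concrete, here is a minimal sketch that uses only the standard java.lang.invoke API. It is not the prototype's code: the class SnippetBindingSketch and the method addScalar are hypothetical, and a plain Java method stands in for a machine-code snippet. What it shows is the general shape: describe the signature as a MethodType, bind once to a MethodHandle, and expose a small type-safe wrapper around invokeExact.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

final class SnippetBindingSketch {
    // Hypothetical stand-in for a machine-code snippet: an ordinary Java method.
    static float addScalar(float a, float b) {
        return a + b;
    }

    // The signature expressed as a MethodType: (float, float) -> float.
    static final MethodType MT_BINARY =
            MethodType.methodType(float.class, float.class, float.class);

    // Bind the stand-in to a MethodHandle once, during class initialization.
    static final MethodHandle MH_ADD;
    static {
        try {
            MH_ADD = MethodHandles.lookup()
                    .findStatic(SnippetBindingSketch.class, "addScalar", MT_BINARY);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Type-safe invocation point; invokeExact requires the call-site types
    // (float, float) -> float to match the handle's MethodType exactly.
    static float add(float a, float b) {
        try {
            return (float) MH_ADD.invokeExact(a, b);
        } catch (Throwable t) {
            throw new Error(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(add(1.5f, 2.0f)); // prints 3.5
    }
}
```

Keeping the handle in a static final field is a deliberate choice in this sketch: the JIT can treat such a handle as a constant, which is what makes the invocation cheap enough to inline.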

Motivation
- Many popular applications benefit from data-parallel computations.
- Architectural support remains opaque to the JVM developer.
- Looking to expose "pure Java" performant solutions that map well to the architecture:
  - No JNI interfacing; single-language solutions
  - Minimized boilerplate
  - Generated code is of good quality

Project Goals
- Expose data-parallel vector operations for developer use in Java
- Portability and performance
- Scalability
- Idiomatic

Code Snippets

CodeSnippets as a Substrate
- A portable API for expressing primitives
- More flexible than HotSpot intrinsics
- Less technical debt with Graal on the horizon
- ISAs can use the same API
- In prototype phase, but good performance observed
  - Value objects map to registers
  - MethodHandle invocation achieves good code quality

Implementing a Primitive
- Primitives bind to MethodHandle
  - Invoked via MethodHandle methods
  - The MethodHandles library has additional combinators
- Types of CodeSnippets are represented as MethodType objects
- Vectors are represented by Long2/4/8 objects
  - Wrappers for 128-, 256-, and 512-bit values
  - Wrappers are elided in the best case; values are registerized
  - Escape analysis is a work in progress

Binding to Machine Instruction

    static final MethodType MT_L4_BINARY
            = MethodType.methodType(Long4.class, Long4.class, Long4.class);  // MethodHandle type

    private static final MethodHandle MHm256_vaddps = MachineCodeSnippet.make(
            "mm256_vaddps", MT_L4_BINARY,
            // Feature-checking predicate
            requires(AVX),
            // Desired register masks
            new Register[][]{xmmRegistersSSE, xmmRegistersSSE, xmmRegistersSSE},
            // Registers via JVMCI
            (Register[] regs) -> {
                Register out = regs[0];
                Register in1 = regs[1];
                Register in2 = regs[2];
                // Macro-ized x86 encoding
                int vex = vex_prefix(rBit(out), X_LOW, bBit(in2), M_0F, W_LOW, in1, L_256, PP_NONE);
                return vex_emit(vex, 0x58, modRM(out, in2));
            });

Checked Invocation

    // Pure Java equivalent function.
    private static Long4 vaddps_naive(Long4 a, Long4 b) {
        float[] res = new float[8];
        for (int i = 0; i < 8; i++) {
            res[i] = getFloat(a, i) + getFloat(b, i);
        }
        return long4FromFloatArray(res, 0);
    }

    // Type-safe invocation point.
    public static Long4 vaddps(Long4 a, Long4 b) {
        try {
            Long4 res = (Long4) MHm256_vaddps.invokeExact(a, b);
            assert assertEquals(res, vaddps_naive(a, b));
            return res;
        } catch (Throwable e) {
            throw new Error(e);
        }
    }

A Small Example

    public static float[] proc(float[] left, float[] right, float[] res) {
        if (left.length != right.length) {
            throw new UnsupportedOperationException("Arrays unequal.");
        } else if (left.length % 8 != 0) {
            throw new UnsupportedOperationException("Length must be n*8");
        }
        // Loop kernel
        for (int i = 0; i < left.length; i += 8) {
            addArrays(left, right, res, i);
        }
        return res; // Convenience
    }

Small Example (contd)

    // Isolated for code quality purposes in prototype
    public static void addArrays(float[] left, float[] right, float[] res, int i) {
        // Scaled load: VMOVDQU ymmX, YMMWORD PTR
        Long4 l = PatchableVecUtils.long4FromFloatArray(left, i);
        // vaddps reg, YMMWORD PTR
        Long4 rr = PatchableVecUtils.vaddps(l, right, i);
        // Scaled store: VMOVDQU YMMWORD PTR, ymmX
        PatchableVecUtils.long4ToFloatArray(res, i, rr);
    }

Generating C2 Code

    java -XaddExports:java.base/jdk.internal.misc=ALL-UNNAMED \
         -XaddExports:java.base/jdk.internal.vm.annotation=ALL-UNNAMED \
         -XX:+UnlockDiagnosticVMOptions \
         -XX:-UseSuperWord \
         -XX:LoopMaxUnroll=1 \
         -XX:PrintAssemblyOptions=intel \
         -XX:CompileCommand=option,*AddArraysLong4PS::addArrays,PrintAssembly \
         -cp build AddArraysLong4PS

Generated Code
(Assembly listing not reproduced; the slide's callout on the listing reads "Snippets!")

Performance of This Example
- Compared to a scalar implementation, with SuperWord and loop unrolling disabled.
- We see a 40% reduction in clock cycles spent in the loop kernel with the vectorized version.
- This workload is a prototype PoC; we need more advanced workloads that better leverage
  vectorization.
  - Bigger, more intensive workloads to come
- Wall-clock time indicates overhead coming from outside of the loop kernel vs. the scalar version: more work to do!

The Vector API

Java Needs an Abstraction for Vectors
- Vector ISA extensions are powerful, expressive, and deep.
  - Most instructions have many different forms and support differing operand sizes
  - NxM problems abound for API writers
- Needs to capture the essence of vectorization in the spirit of Java
  - Platform independence
  - Snippets are too low level
  - Meaningful static checking
  - Familiar patterns to abstract operational complexity

Vector API
- Intended API to encompass the CodeSnippets implementation
- Proposed by John Rose*. Work continues within the Panama Project
- interface Vector<E, S>
  - S: the Shape type describes the size of the Vector
  - E: the element type of the Vector
- Broadest support for Float, Integer, Double
- Draft implementations checked into Project Panama
* http:/

... of the API
(Class hierarchy diagram)

    Vector
      FloatVector
        FloatVector128
        FloatVector256
        FloatVectorXYZ

Callouts: Factory-constructed classes. Factory methods here.

Basic Vector-Vector Functionality

    interface Vector<E, S> {
        Vector<E, S> add(Vector<E, S> v2);
        Vector<E, S> mul(Vector<E, S> v2);
        Vector<E, S> and(Vector<E, S> v2);
    }

Immutability!

More Advanced

    interface Vector<E, S> {
        // Scalar/vector interfacing
        E getElement(int i);
        Vector<E, S> putElement(int i, E elem);
        // Horizontal reductions. Multiple snippets.
        E sumAll();
        // Loading and storing to arrays
        E[] toArray();
        Vector<E, S> fromArray(E[] ary, int offset);
    }

Fully Realized Expressiveness

    interface Vector<E, S> {
        Vector<E, S> map(UnaryOperator<E> op);
        Vector<E, S> mapWhere(Mask<S> mask, UnaryOperator<E> op);
        Vector<E, S> map(BinaryOperator<E> op, Vector<E, S> v2);
        Vector<E, S> mapWhere(Mask<S> mask, BinaryOperator<E> op, Vector<E, S> this2);
    }

Kernel with Vector API

    public static void addArrays(float[] left, float[] right, float[] res, int i) {
        FloatVector l  = float256FromArray(left, i),
                    r  = float256FromArray(right, i),
                    lr = l.add(r);
        lr.intoArray(res, i);
    }

Higher Order Components
- Highly desirable, modern part of this API
  - A programmer specifies a loop body
  - Minimal thought given to vectorization
  - Using regular arithmetic and logical syntactic operators
- Requires a way to "crack" or inspect lambdas at runtime
- Ways forward
  - We need better control of our higher-order components
  - Factories for constructing primitive arithmetic operations
  - Need to be composable

Kernel Construction
- We can construct our "higher order" operations from existing parts.
- We can constrain our support to operations that are vectorizable.
  - Arity-one or arity-two (maybe three) operations
  - Restricting to arithmetic and logical operations that are broadly supported
- Our existing work on CodeSnippets can form the base!
- MethodHandles are highly composable, even with snippets

    f = (x, y) -> (x + y) * y;

    MethodType mt = MethodType.methodType(Long4.class, Long4.class, Long4.class);
    MethodHandle MHm256_vaddps = CodeSnippet.make(..., mt, ...),
                 MHm256_vmulps = CodeSnippet.make(..., mt, ...);
    // f_pre(a, b, c) = vmulps(vaddps(a, b), c)
    MethodHandle f_pre = MethodHandles.collectArguments(MHm256_vmulps, 0, MHm256_vaddps);
    // f(x, y) = f_pre(x, y, y)
    MethodHandle f = MethodHandles.permuteArguments(f_pre, mt, 0, 1, 1);

Statically Typed Wrappers
- A layer over MethodHandles that encapsulates the lower-level details and makes them type safe will coincide with the existing API spec.
- One method proposed is VectorOp
  - Proposed on Project Panama*
  - Vector operations are explicit and exposed to the user to compose and use as kernels.
- Another approach is to use a lightweight syntax tree
  - Hand off to a Vector object for interpretation/conversion to an equivalent MethodHandle structure for execution.
  - Vector objects visit the tree to compose the corresponding MethodHandles.
  - The same syntax trees could be handed off to different Vector types.
- Still very much in the works!
* http:/

... Thoughts.
- Most Vector operations are simple expressions
- Expressions are (basically) trees
- MethodHandles can be combined together in a tree-like fashion
  - permuteArguments()
  - collectArguments()
  - filterArguments()
  - filterReturnValue()
- Method Handles have added benefits (high level
  models matter!)
- We've already observed good code with Method Handles, so let's try it!
- Coding this way can elide the need to box Long2/4/8

Expressions Bind to Method Handles.
(Diagram: the lambda "(x,y) ->" drawn as an expression tree with "*" at the root over "+" and y, and "+" over x and y; an AST visitor walks the tree.)

There's more!
(Diagram: the same expression tree handed to a 256_visitor, a 128_visitor, and an XYZ_visitor.)

Baby's First EDSL

    interface Expression {
        default Expression add(Expression right) { return new AddExpression(this, right); }
        default Expression mul(Expression right) { return new MulExpression(this, right); }
        default Expression not() { return new NotExpression(this); }
        default Expression trace(Consumer f) { return new TraceExpression(this, f); }
        default Expression fromFloat(Float f) { return new ConstExpression(f); }
        <R> R evaluate(ExpressionEvaluator<R> e);
    }

Careful!

    BinaryOperation expr = (l, r) -> {
        Expression e1 = l.add(r);
        return e1.mul(r);
    };

    // To populate leaf nodes. Symbol is non-public.
    expr.apply(Symbol.LEFT, Symbol.RIGHT);

    MethodHandle binaryReduction(float[] left, float[] right, float[] dst, BinaryOperator<Expression> op);

    MethodHandle br = binaryReduction(left, right, dst, (l, r) -> {
        Expression e1 = l.add(r);
        return e1.mul(r);
    });

    // Execute the entire computation
    br.invokeExact();

    // Making it hot for inspection
    for (int i = 0; i < BIGNUMBER; i++) {
        br.invokeExact();
    }
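
The syntax-tree approach can be tried out without the prototype. What follows is a sketch under stated assumptions, not the Panama implementation: scalar float add and mul methods stand in for the vector snippets (float replaces Long4), and the names ExpressionSketch, Binary, compile(), eval, and demo are hypothetical. Each tree node compiles itself to a MethodHandle of type (float, float) -> float using the standard combinators collectArguments, permuteArguments, and dropArguments, and Symbol.LEFT and Symbol.RIGHT populate the leaves as in the slides.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.function.BinaryOperator;

final class ExpressionSketch {
    // Scalar stand-ins for vector snippets (assumption: float replaces Long4).
    static float add(float a, float b) { return a + b; }
    static float mul(float a, float b) { return a * b; }

    static final MethodType BINARY =
            MethodType.methodType(float.class, float.class, float.class);
    static final MethodHandle MH_ADD = find("add");
    static final MethodHandle MH_MUL = find("mul");

    static MethodHandle find(String name) {
        try {
            return MethodHandles.lookup().findStatic(ExpressionSketch.class, name, BINARY);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    interface Expression {
        default Expression add(Expression right) { return new Binary(MH_ADD, this, right); }
        default Expression mul(Expression right) { return new Binary(MH_MUL, this, right); }
        // Compile this tree to a handle of type (float left, float right) -> float.
        MethodHandle compile();
    }

    // Leaf nodes: select the left or the right kernel argument.
    enum Symbol implements Expression {
        LEFT, RIGHT;

        @Override
        public MethodHandle compile() {
            MethodHandle id = MethodHandles.identity(float.class);
            // LEFT ignores argument 1, RIGHT ignores argument 0.
            return MethodHandles.dropArguments(id, this == LEFT ? 1 : 0, float.class);
        }
    }

    static final class Binary implements Expression {
        private final MethodHandle op;
        private final Expression left;
        private final Expression right;

        Binary(MethodHandle op, Expression left, Expression right) {
            this.op = op;
            this.left = left;
            this.right = right;
        }

        @Override
        public MethodHandle compile() {
            // h(a, b, c) = op(left(a, b), c)
            MethodHandle h = MethodHandles.collectArguments(op, 0, left.compile());
            // h(a, b, c, d) = op(left(a, b), right(c, d))
            h = MethodHandles.collectArguments(h, 2, right.compile());
            // result(l, r) = h(l, r, l, r)
            return MethodHandles.permuteArguments(h, BINARY, 0, 1, 0, 1);
        }
    }

    // Build the tree from a lambda over symbols, compile it, and run it once.
    static float eval(BinaryOperator<Expression> body, float l, float r) {
        try {
            MethodHandle mh = body.apply(Symbol.LEFT, Symbol.RIGHT).compile();
            return (float) mh.invokeExact(l, r);
        } catch (Throwable t) {
            throw new Error(t);
        }
    }

    // The deck's running example: f = (x, y) -> (x + y) * y.
    static float demo(float x, float y) {
        return eval((l, r) -> l.add(r).mul(r), x, y);
    }

    public static void main(String[] args) {
        System.out.println(demo(2f, 3f)); // prints 15.0
    }
}
```

In this sketch the tree compiles itself, which keeps the example short. A visitor per vector shape, as the diagrams suggest, would move the compile() logic out of the nodes so that the same tree could be lowered to 128-bit, 256-bit, or other handles.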