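As spring turns to summer, we are holding the first YOCSEF academic seminar of 2015. We have invited Professor Xian-He Sun (孙贤和) of the Illinois Institute of Technology, USA; Researcher Yunquan Zhang (张云泉) of the Institute of Computing Technology, Chinese Academy of Sciences; and Houjun Tang (唐厚君), a PhD student at North Carolina State University, USA, to give a series of talks on the theme "Frontier Advances in High Performance Computing". The seminar will take place on Tuesday, May 5, 2015, 14:30-18:00, in Lecture Hall 701 of the Science and Technology Building at Shenzhen University. Everyone is welcome to register.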
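Organizers: China Computer Federation Young Computer Scientists & Engineers Forum, Shenzhen Branch (CCF YOCSEF Shenzhen); Guangdong Province Key Laboratory of Popular High Performance Computers, Shenzhen University; Shenzhen Computer Society (深圳市电脑学会)
Executive Chair: Rui Mao (毛睿), Vice Chair of CCF YOCSEF Shenzhen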
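Agenda:
14:00 Sign-in
14:30 Seminar begins
Opening remarks by the organizer, CCF YOCSEF Shenzhen
14:40 Invited speaker: Xian-He Sun (孙贤和), Professor, Illinois Institute of Technology, USA
Talk title: Parallelism for High Performance Data Processing: a rethinking
15:40 Invited speaker: Yunquan Zhang (张云泉), Researcher, Institute of Computing Technology, Chinese Academy of Sciences
Talk title: yaSpMV: Yet Another SpMV Framework on GPUs
16:40 Invited speaker: Houjun Tang (唐厚君), PhD student, North Carolina State University, USA
Talk title: Usage Pattern-Driven Dynamic Data Layout Reorganization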
Talk 1: Parallelism for High Performance Data Processing: a rethinking. Invited speaker: Xian-He Sun (孙贤和)
Abstract: Scalable data management for big data applications is a challenging task. It puts even more pressure on the long-standing memory-wall problem, which makes data access the prominent performance bottleneck for high performance computing (HPC), and has shifted the interest of HPC toward HPDP (High Performance Data Processing). HPC is known for its massively parallel architectures. A natural way to achieve HPDP is to increase and utilize memory concurrency to a level commensurate with that of HPC. We argue that substantial memory concurrency exists at each layer of current memory systems, but it has not been fully utilized. In this talk we reevaluate memory systems and introduce the novel C-AMAT model for system design analysis of concurrent data accesses. C-AMAT is a paradigm shift to support sustained data access from a data-centric view. The power of C-AMAT is that it has opened new directions to reduce data access delay. In an ideal parallel memory system, the system will explicitly express and utilize parallel data accesses. This awareness is largely missing from current memory systems and from current architecture and algorithm design. We will review the concurrency available in modern memory systems, present the concept of C-AMAT, and discuss the considerations and possibility of optimizing parallel data access for big data applications. We will also present some of our recent results which quantify and utilize parallel I/O following the parallel memory concept for HPDP.
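The abstract names the C-AMAT model but does not state it. As a rough illustration only, the sketch below compares the classic average memory access time (AMAT) with a concurrency-aware variant in the form it is usually given in the published C-AMAT work, as recalled here. The formula, the function names, and every numeric value are assumptions for illustration and are not taken from this announcement.

```python
def amat(hit_time, miss_rate, avg_miss_penalty):
    """Classic AMAT: every access is assumed to be served one at a time."""
    return hit_time + miss_rate * avg_miss_penalty


def c_amat(hit_time, hit_concurrency,
           pure_miss_rate, pure_avg_miss_penalty, miss_concurrency):
    """Concurrency-aware AMAT (assumed form, not quoted from the talk).

    Hit time is divided by the number of overlapping hits, and only
    "pure" misses (miss cycles not hidden behind concurrent hits) are
    charged, divided by the number of overlapping pure misses.
    """
    return (hit_time / hit_concurrency
            + pure_miss_rate * pure_avg_miss_penalty / miss_concurrency)


# Illustrative numbers only (cycles and rates are made up).
sequential = amat(hit_time=2, miss_rate=0.1, avg_miss_penalty=100)
concurrent = c_amat(hit_time=2, hit_concurrency=4,
                    pure_miss_rate=0.05, pure_avg_miss_penalty=100,
                    miss_concurrency=2)
print(sequential, concurrent)
```

Under these assumptions, setting both concurrency terms to 1 and treating every miss as a pure miss makes the second function collapse to the first, which is one way to read the abstract's claim that unused concurrency is where the remaining gains lie.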
Talk 2: yaSpMV: Yet Another SpMV Framework on GPUs. Invited speaker: Yunquan Zhang (张云泉)
Abstract: Sparse matrix-vector multiplication (SpMV) is a key linear algebra algorithm and has been widely used in many important application domains. As a result, numerous attempts have been made to optimize SpMV on GPUs to leverage their massive computational throughput. Although previous work has shown impressive progress, load imbalance and high memory bandwidth demand remain the critical performance bottlenecks for SpMV. In this talk, we present our novel solutions to these problems. First, we propose a new SpMV format, called blocked compressed common coordinate (BCCOO), which uses bit flags to store the row indices in a blocked common coordinate (COO) format so as to alleviate the bandwidth problem. We further improve this format by partitioning the matrix into vertical slices to enhance the cache hit rates when accessing the vector to be multiplied. Second, we revisit the segmented scan approach for SpMV to address the load imbalance problem. We propose a highly efficient matrix-based segmented sum/scan for SpMV and further improve it by eliminating global synchronization. Then, we introduce an auto-tuning framework to choose optimization parameters based on the characteristics of input sparse matrices and target hardware platforms. Our experimental results on GTX680 GPUs and GTX480 GPUs show that our proposed framework achieves significant performance improvement over the vendor-tuned CUSPARSE V5.0 (up to 229% and 65% on average on GTX680 GPUs, up to 150% and 42% on average on GTX480 GPUs) and some of the most recently proposed schemes (e.g., up to 195% and 70% on average over clSpMV on GTX680 GPUs, up to 162% and 40% on average over clSpMV on GTX480 GPUs).
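To make the bit-flag idea concrete, here is a minimal sequential sketch in Python. It is not the yaSpMV implementation: it is unblocked, runs on the CPU, and uses a flag convention chosen here for readability (True marks the last nonzero of a row). The helper names `coo_to_bitflags` and `spmv_bitflags`, and the small `row_ids` list used to skip empty rows, are hypothetical.

```python
def coo_to_bitflags(rows, cols, vals):
    """Sort COO triples by (row, col) and replace the per-entry row
    indices with one end-of-row flag per nonzero."""
    order = sorted(range(len(rows)), key=lambda i: (rows[i], cols[i]))
    r = [rows[i] for i in order]
    c = [cols[i] for i in order]
    v = [vals[i] for i in order]
    # flags[i] is True when entry i is the last nonzero of its row.
    flags = [i == len(r) - 1 or r[i + 1] != r[i] for i in range(len(r))]
    # Row id of each segment, so that empty rows can be skipped on output.
    row_ids = [r[i] for i in range(len(r)) if flags[i]]
    return flags, row_ids, c, v


def spmv_bitflags(n_rows, flags, row_ids, cols, vals, x):
    """y = A @ x as a sequential segmented sum driven by the flags."""
    y = [0.0] * n_rows
    acc = 0.0
    segment = 0
    for flag, col, val in zip(flags, cols, vals):
        acc += val * x[col]
        if flag:  # end of a row segment: write the sum and reset
            y[row_ids[segment]] = acc
            acc = 0.0
            segment += 1
    return y


# A = [[1, 0, 2],
#      [0, 0, 0],
#      [0, 3, 0]]
flags, row_ids, cols, vals = coo_to_bitflags([0, 2, 0], [0, 1, 2], [1, 3, 2])
y = spmv_bitflags(3, flags, row_ids, cols, vals, [1, 2, 3])
print(y)
```

The point of the flags is storage: one bit per nonzero replaces a full integer row index, which is where the bandwidth saving described in the abstract comes from. A GPU version would replace the sequential loop with a parallel segmented sum or scan.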
Talk 3: Usage Pattern-Driven Dynamic Data Layout Reorganization. Invited speaker: Houjun Tang (唐厚君)
Abstract: As scientific simulations move towards exascale and generate increasingly huge amounts of data, the data access performance for analytic applications becomes crucial. A mismatch often happens between the write and read patterns of data accesses, typically resulting in poor read performance. Data layout reorganization has been used to improve the locality of data accesses. However, current data reorganizations are static and focus on generating a single optimized layout (or a set of them) that relies on prior knowledge of exact future access patterns. We propose a framework that recognizes the data usage pattern, replicates the data of interest in multiple reorganized layouts that would benefit common read patterns, and makes runtime decisions on selecting a favorable layout for the read pattern. This framework supports reading individual elements and chunks of a multi-dimensional array of variables. Our pattern-driven layout selection strategy achieves multi-fold speedups compared to reading from the original dataset.
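The runtime layout selection described above can be sketched for a 2-D array stored in two replicas, one row-major and one column-major. The cost model used here (counting how many contiguous ranges a read touches) and the names `contiguous_runs` and `select_layout` are assumptions for illustration; the abstract does not say which criterion the actual framework uses.

```python
def contiguous_runs(shape, count, order):
    """Number of contiguous ranges needed to read a 2-D sub-block of
    `count` = (rows, cols) elements from an array of `shape`.
    The block's starting offset does not change the run count."""
    n_rows, n_cols = shape
    nr, nc = count
    if order == "row_major":
        # One run per selected row, unless full rows merge into one run.
        return 1 if nc == n_cols else nr
    if order == "col_major":
        # One run per selected column, unless full columns merge.
        return 1 if nr == n_rows else nc
    raise ValueError(f"unknown layout: {order}")


def select_layout(shape, count, layouts=("row_major", "col_major")):
    """Pick the replica whose layout needs the fewest contiguous reads."""
    return min(layouts, key=lambda order: contiguous_runs(shape, count, order))


shape = (1000, 1000)
column_read = select_layout(shape, count=(1000, 1))  # one full column
row_read = select_layout(shape, count=(1, 1000))     # one full row
print(column_read, row_read)
```

With these assumptions a full-column read is served from the column-major replica (1 run instead of 1000) and a full-row read from the row-major one, which mirrors the write/read pattern mismatch the abstract describes.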