GTC 2012 Program Guide - GPU Technology Conference

PRESENTED BY PLATINUM SPONSORS 

MAY 14-17, 2012 | SAN JOSE, CA 

PROGRAM 

GUIDE

The Power 

to do More 

HP GPU computing ranging from personal 

supercomputing Z-workstations to the 

world’s most self-sufficient GPU enabled 

servers. Come and talk with the HP 

GPU experts about the performance, 


efficiency and agility you get with HP. 

Visit HP Booth #47 for 

more information. 

www.hp.com/go/zworkstations 

www.hp.com/go/accelerators

WELCOME 

TO GTC 

Dear GTC Attendees, 

Back in 2009 we had an idea to bring together the wide 

variety of people who use GPUs in their work. 

Disciplines from quantum chemistry to computational 

fluid dynamics and astrophysics. People from every 

corner of the world. We hosted our first GPU 

Technology Conference. 

We were proud of its success. More than 90 percent of 

the presentations were made by people outside 

NVIDIA. Meetings spilled into the hallways. The place 

buzzed with energy. We realized that GPU computing 

was bigger than NVIDIA. And that GTC was a conduit 

into the collective power of brilliant scientists, 

technologists and thought leaders. It was an honor to 

host it on your behalf. 

In 2010, we doubled down, with more than 280 sessions, 

and 2,000 attendees. The reach of GPU computing was 

growing. And its impact was truly breathtaking. 

Researchers from Adobe showcased their work in 

computational photography, which one day will redefine 

the field. A surgeon described how GPUs were vital in 

performing surgery on a beating heart. 

And GTC 2012 promises to be better still. 

You can choose from among hundreds of sessions. 

Among them are talks by Oak Ridge National 

Laboratory on using GPUs to build Titan, the world’s 

largest supercomputer; Tokyo Institute of Technology, 

winner of last year’s Gordon Bell Prize, on 

stereoscopic 3D visualization; and Beijing’s BGI on 

using GPUs for bioinformatics research. A variety of 

entrepreneurs will speak at the Emerging Companies 

Summit about how their startups use GPUs. 

GTC will also play host for the first time to two other 

events. Los Alamos National Laboratory will hold its 

Accelerated HPC Symposium, bringing together world 

leaders in supercomputing. InPar will provide a 

first-tier academic venue for peer-reviewed, archival 

publications in the emerging fields of parallel 

computing. 

And NVIDIA will discuss Kepler, our first new 

architecture in two years, and its impact on computing. 

Our first Kepler-based graphics card recently launched 

to fantastic reviews. We can’t wait to share how these 

powerful, super energy-efficient GPUs open up new 

horizons in high performance computing and scientific 

discovery. 

We will also be talking much more about GPU and 

cloud computing, as well as our Maximus technology, 

which creates a workstation so powerful that it 

simulates the physics of a design while it is being 

created. 

It should make for the best GTC yet. 

Enjoy the conference! 

Sincerely, 

The NVIDIA GTC Team 

CONFERENCE GUIDE

“0.1 of a second can be the difference 

between winning and losing in Formula One. 

Data analysis doesn’t get much 

more critical than that.” 


See how Dell helped Caterham F1 Team deploy an 

enterprise-class IT system able to do real-time analysis of 

data sent from the car, while withstanding the intense heat 

and vibration of the Formula 1™ 

trackside environment. 

Learn more at Dell.com/EfficientIT. 

Mark Smith 

Technical Director 

Caterham F1 Team 

Join our breakout session on Wednesday, May 16th at 2pm, room M, where Dr. Jeff Layton, 

HPC Enterprise Technologist, will discuss compelling new technology advancements in GPU Computing.

IMPORTANT INFORMATION 

If there is anything we can do to make your conference experience better, please stop by the 

info desk and let us know. 

REGISTRATION / INFORMATION DESK HOURS 

SUNDAY, MAY 13 

16:00 to 18:00 

MONDAY, MAY 14 

08:00 to 18:00 

TUESDAY, MAY 15 

07:00 to 19:00 

WEDNESDAY, MAY 16 

08:00 to 18:00 

THURSDAY, MAY 17 

08:00 to 16:00 

EXHIBIT AND MEAL HOURS 




12:00 to 14:00 Lunch / Exhibits Open 

18:00 to 20:00 Reception / Exhibits Open 


18:00 to 20:00 Reception / Exhibits Open 


ENROLL IN YOUR SESSIONS Go to https://registration.gputechconf.com/schedule and log in to 

start adding sessions to your personal schedule. Priority access 

into each session will be given to those who enroll. Enrolling in 

sessions also helps us schedule the most popular sessions in the 

largest rooms. 

WIRELESS INTERNET ACCESS Free wireless internet access can be found under GTC2012 and is 

available in most session rooms, keynote hall, exhibit hall and 

throughout the concourse. 

DOWNLOAD THE MOBILE APP Keep up-to-date with the latest news and information at the 

conference through the GTC 2012 Mobile App. Download it from the 

Android market at https://play.google.com/store. You can also 

access news and announcements from the home page of 

www.gputechconf.com. 

BUSINESS CENTER / SHIPPING The Marriott Hotel and the Hilton Hotel both have business centers 

located on the first floor, near their respective front lobbies. 

Alternatively, there is a FedEx Office Print & Ship Center at 93 E. San 

Carlos Street, near 3rd Street (3 blocks from the Convention Center, 

call 408-295-4336 for hours). 

GO GREEN! Take part in the shared goal of minimizing our collective impact on 

the environment. Please take only the conference materials you 

need, and recycle and reuse whenever possible throughout the 

week. Please turn in your badges for recycling at the conclusion of 

the event. 

BAG AND COAT CHECK Bag check is available at the bell desk of the Marriott and Hilton 

hotels, connected to the Convention Center. It is also available on the 

concourse of the Convention Center. 

LOST AND FOUND Please check the information desk should you lose or find an article. 

FIRST AID / EMERGENCY Should there be a medical emergency, please dial 911 and alert the 

nearest conference personnel.

Lenovo® recommends Windows® 7 Professional. 

MONTHS OF PLANNING | A FUTURISTIC MOVIE SET | AN IDEA TURNED TO REALITY. 


DREAM. 

CREATE. 

INTRODUCING THE LENOVO® THINKSTATION® 30 SERIES, 

FEATURING THE D30 FOR HIGH-END GRAPHICS AND PROCESSING 

POWER. 

The Lenovo ThinkStation® 30 series was designed for those who push technology to the limits and depend 

on professional applications and platforms to get there. The ThinkStation® 30 Series is certified to run the 

applications you need most from Adobe, Autodesk, Dassault Systemes, PTC and Siemens. Designed to tackle 

the biggest challenges, the D30 delivers the ultimate in performance and expandability. And now armed with 

the latest generation of Intel® Xeon® processors, Genuine Windows® 7 Professional and supporting discrete 

Quadro and Tesla graphics technology from NVIDIA® - you can defy expectations like never before. 

Energy-efficient • Quiet Acoustics • Scalable Storage • ISV-certified 

www.lenovo.com/thinkstation 

Lenovo, the Lenovo logo, For Those Who Do and ThinkStation are trademarks or registered trademarks of Lenovo. Microsoft and Windows are registered trademarks of Microsoft Corporation in 

the U.S. and other countries. Intel and Intel Xeon are registered trademarks of Intel Corporation in the U.S. and other countries. NVIDIA is a registered trademark of NVIDIA Corporation in the 

U.S. and other countries. 

© Lenovo 2012. All rights reserved.

TABLE OF CONTENTS 

Welcome Letter 

Important Information 

Conference Highlights - Don’t Miss These Events! 

Emerging Companies Summit 

Los Alamos National Laboratory Accelerated High 

Performance Computing Symposium 

Sessions Listing - Monday 

Sessions Listing - Tuesday 

Sessions Listing - Wednesday 

Sessions Listing - Thursday 

Research Posters Listing 

Speakers and Panelists Listing 

Sponsors and Exhibitors 

Stay Connected!

CONFERENCE 

HIGHLIGHTS – 

DON’T MISS 

THESE EVENTS! 

NVIDIA ® Nsight Lab 

The lab will be open daily for product discussions, testing of your application with 

the latest version of Nsight, or a place to simply hang out and relax with the Nsight 

development team. The lab is located on the first floor. 

C++ AMP LOUNGE, by Microsoft 

While attending GTC, come learn from the experts at the C++ AMP Lounge by 

Microsoft, a casual environment for hands-on learning and instruction. Experts 

will be available each day to answer questions and provide instruction. The 

lounge is located on the concourse. 

Ask the CUDA Expert 

Stop by Ask the CUDA Expert on the main concourse for a quick consultation with 

NVIDIA software engineers and developer technology experts. Experts on CUDA 

C, Fortran, OpenACC, GPU-Accelerated Libraries and more will be on hand to 

answer your questions. No question is too challenging or too easy for this crew! 

Ask the CUDA Expert will be open as follows: 

Monday 10:00 to 16:00 

Tuesday 12:00 to 19:00 

Wednesday 10:00 to 11:00, 12:00 to 19:00 

Thursday 10:00 to 11:00, 12:00 to 16:00 

DigitalGuru: Where Smart People Get Smarter 

DigitalGuru Technical Bookshop of Cupertino, California is pleased to 

participate in GTC 2012. Please visit our table during the conference for a wide 

and relevant selection of books on parallel programming, computer science, 

application tools and more. Books sold at GTC are available at 20% off list 

price. For more info visit www.digitalguru.com. 

Dinner with Strangers 

Over a meal in some of the best restaurants in Silicon Valley, engage in lively 

conversation and share your best ideas. Pre-reserved tables for small groups 

will be made available to GTC attendees to mix and mingle with fellow attendees. 

Dinner with Strangers is open to all, but space is limited and available on a first-come, 

first-served basis. Stop by the sign-up board located on the concourse. Dinner 

with Strangers happens on Monday and Tuesday night with reservations at 20:00.

SUNDAY, MAY 13 

08:30 to 17:35 InPar 2012, Foundations & Applications of GPU, Manycore, and 

Heterogeneous Systems (Room J) 

MONDAY, MAY 14 

08:40 to 17:00 InPar 2012, Foundations & Applications of GPU, Manycore, and 

Heterogeneous Systems (Room J) 

09:00 to 15:50 Pre-Conference Tutorials 

16:00 to 18:00 Research Poster Showcase and Reception 

TUESDAY, MAY 15 

10:30 to 11:50 Opening Keynote with Jen-Hsun Huang, NVIDIA CEO and Co-Founder 

(Keynote Hall, Hall 1) 

12:00 to 14:00 Exhibits Open / Networking Lunch (Exhibit Hall) 

14:00 to 18:00 GPU-accelerated Science on Titan: Tapping into the World’s 

Preeminent GPU Supercomputer to Achieve Better Science, Jack 

Wells, Director of Science, Oak Ridge Leadership Computing Facility, Oak 

Ridge National Laboratory (Room A2) 

16:00 to 16:50 CUDA 5 and Beyond, Mark Harris, Chief Technologist, GPU Computing, 

NVIDIA (Hall 1) 

18:00 to 20:00 Exhibits Open / Networking Reception (Exhibit Hall) 

WEDNESDAY, MAY 16 

9:00 to 9:30 Emerging Companies Summit Opening Address with Jeff Herbst, VP 

Business Development, NVIDIA (Marriott Hotel, Ballroom 4) 

09:00 to 10:20 Exascaling Your Apps, moderated by Mike Bernhardt, Publisher, The 

Exascale Report (Room C) 

11:00 to 11:50 Day 2 Keynote with Dr. Iain Couzin, Professor, Princeton University 

(Keynote Hall, Hall 1) 

12:00 to 14:00 Exhibits Open / Networking Lunch (Exhibit Hall) 

14:00 to 14:50 Emerging Companies Summit Fireside Chat with Jen-Hsun Huang, 

NVIDIA CEO and Co-Founder (Marriott Hotel, Ballroom 4) 

14:00 to 15:20 Inside Kepler, Stephen Jones, CUDA Developer, NVIDIA, Lars Nyland, 

Senior Architect, NVIDIA (Hall 1) 

14:00 to 17:55 Los Alamos National Laboratory Accelerated High Performance 

Computing Symposium (Room J1) 

18:00 to 20:00 Exhibits Open / Networking Reception (Exhibit Hall) 

20:00 to 23:00 GTC Party (Civic Auditorium) 

During a week of rigorous learning, it’s important to cut loose and 

celebrate with fellow members of the GPU community. Come party and 

enjoy the comedic and juggling talents of The Passing Zone and try your 

luck in the casino. And don’t forget to raise a glass to your success! 

THURSDAY, MAY 17 

09:00 to 15:50 Los Alamos National Laboratory Accelerated High Performance 

Computing Symposium (Room J1) 

11:00 to 11:50 Day 3 Keynote with Robert Boehme, CEO & Team Lead, Part-Time 

Scientists, and Wes Faler, Head of Software Development, Part-Time 

Scientists (Keynote Hall, Hall 1) 

12:00 to 14:00 Exhibits Open / Networking Lunch (Exhibit Hall)

OPEN GENOMICS ENGINE 

Accelerating the DNA-analysis pipeline 

for cancer research 

Visit the Open Genomics Engine booth (#118) in the 

GTC exhibit hall to learn more. 


An NVIDIA Foundation Initiative

Welcome to 

NVIDIA’s Emerging 

Companies 

Summit (ECS) 2012! 

We are thrilled to once again showcase promising 

startups that are using the massive computing power 

of GPU technology to transform existing industries and 

create new ones. 

From gesture-recognition technology and interactive 

video to virtualization and cloud computing, the dozens 

of companies from around the world participating in 

ECS 2012 are at the cutting-edge of technology. GPUs 

have recently stormed the handheld computing 

market, so you’ll also find a large number of mobile 

companies participating in this year’s summit. 

ECS itself has become something of a growth industry. 

In addition to this being our fourth event in Silicon 

Valley, we have recently held successful summits in 

Israel and China, with more planned in the near future. 

The conference has proven to be a great venue for 

startups, analysts, executives and industry experts to 

exchange information and understand where 

technology is heading. 

As a key part of the GPU Technology Conference, ECS 

2012 will be host to hundreds of participants – 

including panelists, presenters, analysts, industry 

execs and others in our growing audience. Awaiting 

them is our best program yet. 

This year sees the return of our hugely popular “CEO 

on Stage” format, where a select group of CEOs 

present their companies to a distinguished panel of 

experienced investors, analysts and technology 

leaders, who in turn respond with insightful feedback. 

NVIDIA CEO and founder Jen-Hsun Huang will also sit 

down for another thoughtful and entertaining fireside 

chat, this year with Tim Bajarin, president of Creative 

Strategies Inc., a leading Silicon Valley industry 

analysis and market intelligence firm. 

New this year are special events like Startup 

University, where presenting and exhibiting companies 

will hold workshops on topics such as “Protecting Your 

IP Assets in a Global Marketplace” and “Best Practices 

for Building Valuable Relationships with Technology 

Industry Analysts.” In addition, the exhibit halls will be 

filled with the innovative work of companies in a 

diverse array of fields. And this year a jury will select 

the most promising companies with the “One to 

Watch” awards, announced Wednesday evening in the 

Hilton ballroom. 

The GPU computing ecosystem is growing rapidly – 

and you, as an ECS attendee, are a key part of its 

success. I encourage you to participate in as many 

sessions as possible and thank you for joining us at 

what promises to be another superb event. 

In closing, I’d like to express gratitude to our sponsors 

who are helping to make this event possible, including 

Cooley LLP, Morgan Stanley, Silicon Valley Bank, 

Deloitte, mergermarket, and Dow Jones Private Equity 

& Venture Capital. 

Jeff Herbst 

Vice President of Business Development, NVIDIA

AGENDA 

WEDNESDAY, MAY 16, 2012 

MARRIOTT SAN JOSE BALLROOM 4 

9:00 to 9:50 S2000 Emerging Companies Summit Opening with Jeff Herbst (VP of 

Business Development, NVIDIA), followed by CEO on Stage featuring 

• Rocketick (Tomer Ben-David, VP R&D) 

• Cortexica (Iain McCready, CEO) 

Panelists: 

• Jon Peddie, President, Jon Peddie Research 

• Neil Sequeira, Managing Director, General Catalyst Partners 

• Savitha Srinivasan, Partner, IBM Venture Capital Group 

• Jeff Herbst, VP of Business Development, NVIDIA 

10:00 to 10:50 S2001 Emerging Companies Summit: CEO on Stage featuring 

• Unity Technologies (David Helgason, CEO) 

• MirriAd (Mark Popkiewicz, CEO) 

• BioDigital (Aaron Oliker, Partner/Director of 3D Technology and Frank Sculli, 

Co-Founder/Informatics Director) 

Panelists: 






• eyeSight Mobile (Gideon Shmuel, CEO) 

• Numira Biosciences (David Weinstein, CTO) 

• Ubitus (Wesley Kuo, CEO) 

Panelists: 





12:00 to 13:50 Networking Lunch and Exhibits (Hall 2 – San Jose Convention Center)

14:00 to 14:50 S2003 Emerging Companies Summit Fireside Chat with Jen-Hsun Huang 

(CEO, President and Co-Founder, NVIDIA) and Tim Bajarin (President of 

Creative Strategies) 


• GAIKAI (David Perry, CEO and Co-Founder) 

• Immersive Media (Myles M. McGovern, CEO) 

• Numecent (Osman Kent, Co-Founder & CEO) 

Panelists: 

• Tom Furlong, Managing Director, Granite Ventures 

• Rob Enderle, Principal Analyst, Enderle Group 

• Flip Gianos, General Partner, Interwest Partners 



• RealView Imaging (Shaul Gelman, Co-Founder and VP of R&D) 

• Elemental Technologies (Sam Blackman, CEO and Co-Founder) 

• Mersive (Robert Balgley, CEO) 

Panelists: 






• Raytrix (Christian Perwass, CEO) 

• Playcast (Guy De Beer, CEO) 

• Universal Robotics (David Peters, CEO) 

Panelists: 





18:00 to 19:50 Networking Reception (Hall 2 - San Jose Convention Center)

Cooley is a proud Platinum Sponsor of 

the 2012 NVIDIA GTC Conference 

Emerging Company Summit. 

Cooley attorneys have served as counselors, strategists 

and advocates to technology entrepreneurs and 

investment funds since 1959. 

Cooley, a global law firm for the converging worlds of high 

technology, high finance and high-stakes litigation. 

For more information, visit us at www.cooley.com 

Experienced Guides 

PALO ALTO | NEW YORK | SAN DIEGO | SAN FRANCISCO | RESTON, VA | BROOMFIELD, CO | WASHINGTON, DC | BOSTON | SEATTLE | SHANGHAI 

© 2012 Cooley LLP, 101 California Street, 5th Floor, San Francisco, CA 94111. 415/693-2000.

CEO ON STAGE LISTING 

BIODIGITAL 

BioDigital is the leading developer of state-of-the-art biomedical visualization. 

BioDigital recently launched The BioDigital Human - a 3D visualization platform 

with a revolutionary approach for communicating health and medical information 

with interactive tools for exploring human anatomy, physiology and conditions. 

www.biodigital.com 

Speakers Aaron Oliker, Partner/Director of 3D Technology and 

Frank Sculli, Co-Founder/Informatics Director 

Session Time Wednesday, May 16 at 10:45 

CORTEXICA VISION SYSTEMS 

Cortexica Vision Systems are the award winning creators of a bio-inspired vision 

system enabling intelligent image recognition using principles derived from the 

human visual cortex. Cortexica provides a patented platform for radically new 

Visual Search products that deliver exciting new experiences and value for 

consumers and businesses. 

www.cortexica.com 

Speaker Iain McCready, CEO 


ELEMENTAL TECHNOLOGIES 

Elemental Technologies is a leading supplier of video solutions for multiscreen 

content delivery. Founded in 2006 and headquartered in Portland, Oregon, the 

company pioneered the use of graphics processors to power adaptive video 

streaming over IP networks. Top media and entertainment companies around the 

world rely on solutions from Elemental to drive next-generation video services. 

www.elementaltechnologies.com 

Speaker Sam Blackman, CEO and Co-Founder 



EYESIGHT MOBILE TECHNOLOGIES 

eyeSight Mobile Technologies Ltd. presents innovative gesture recognition 

technology that powers Touch Free UI solutions, creating an enhanced user 

experience when interacting with a variety of digital devices. The technology is 

entirely software based, requiring only a standard 2D camera, while operating on 

the full range of operating systems. 

www.eyesight-tech.com 

Speaker Gideon Shmuel, CEO 


GAIKAI 

GAIKAI offers a fully managed cloud platform that is optimized to deliver 

high-end video games and applications within seconds to all leading web 

browsers, operating systems, and devices, even in Facebook. 

www.gaikai.com 

Speaker David Perry, CEO and Co-Founder 


IMMERSIVE MEDIA COMPANY 

Immersive Media is the pioneer and leading world provider of 360º, full motion, 

interactive video. Our immersive 360º video content is delivered via internet to 

PC, iPad or mobile device. Immersive Media provides the enabling technologies 

for interactive videos to record, process, live stream and deliver images from 

our own or other wide-field cameras, with a patent portfolio covering key discoveries 

and capabilities of interactive and immersive video. 

www.immersivemedia.com 

Speaker Myles M. McGovern, President/CEO 

Session Time Wednesday, May 16 at 15:30

MERSIVE 

Since it was founded in 2006, Mersive has revolutionized high performance 

display setup and maintenance enabling a new class of displays. Mersive’s Sol 

software automatically aligns multiple commodity projectors into one seamless 

image of extraordinary quality and resolution without the expense of specialized 

hardware and services. 

www.mersive.com 

Speaker Robert Balgley, CEO 


MIRRIAD 

MirriAd is an end-to-end marketing solution that can be implemented quickly, 

easily and cost-effectively using our online campaign management system. We 

provide a new and innovative way for advertisers to reach their target audiences, 

and for content owners to generate additional revenue. We have an ever-expanding 

library of content, from films and TV series to corporate training 

videos and user-generated material – and we’re always on the lookout for new 

and exciting content owners to work with. 

www.mirriad.com 

Speaker Mark Popkiewicz, CEO 


NUMECENT 

Numecent is a start-up which came out of stealth with a bang in March 2012 and 

is the inventor of ‘cloudpaging’. This patented technology enables friction-free 

digital delivery of native software and other non-linear assets through 

virtualization. One of the benefits of cloudpaging is that it can reduce the 

network footprint of digital downloads between 20x and 100x and execute them 

natively, at full speed, without actually requiring installation. Once cloudpaged, 

applications can even run off-line and always under license control. 

www.numecent.com 

Speaker Osman Kent, Co-Founder and CEO 




NUMIRA BIOSCIENCES 

Numira Biosciences is a leading provider of specialty contract research services 

for preclinical drug and device development. Numira’s customers include the top 

biopharmaceutical companies and academic research institutions. Through its 

next-generation study portal, Numira provides its customers with interactive tools 

for accessing, exploring, and communicating about their preclinical study data. 

www.numirabio.com 

Speaker David Weinstein, CTO 


PLAYCAST MEDIA SYSTEM 

Playcast Media System brings video games to the world’s largest media 

distribution platform – Pay TV networks. The Company’s solution delivers 

off-the-shelf next generation video games to existing cable, IPTV and hybrid 

satellite platforms. We bring cloud gaming to the world’s hundreds of millions of 

paying TV subscribers. 

www.playcast-media.com 

Speaker Guy De Beer, CEO 


RAYTRIX 

Raytrix develops and markets single-lens 3D video cameras based on their 

patented high resolution light field technology, offering solutions for Particle 

Image Velocimetry (PIV), optical inspection, face capturing, microscopy – as well 

as IP for consumer products (mobile phones). 

www.raytrix.de 

Speaker Christian Perwass, CEO 


REALVIEW IMAGING LTD. 

RealView Imaging Ltd. is developing a revolutionary 3D holographic display and 

interface system, initially for medical imaging applications. RealView’s 

proprietary technology projects high-res., full color, dynamic, real-time 3D 

holograms “floating in open air” allowing direct and precise interaction with and 

within the “in air” image by literally touching the image. 

www.realview.co.il 

Speaker Shaul Gelman, Co-Founder and VP of R&D 


ROCKETICK 

Rocketick is a leading provider of software simulation acceleration, enabling 

acceleration of 10x or more for Verilog simulations. The company’s flagship 

product, RocketSim, helps semiconductor companies reduce the overall 

time to market of new chip designs by up to 30%, allowing development teams to 

tape-out with greater confidence. 

www.rocketick.com 

Speaker Tomer Ben-David, Co-Founder and VP of R&D 


UBITUS 

Ubitus Inc., the technology leader in deploying Cloud-enabled rich media 

services, offers innovative cloud computing solutions for device manufacturers, 

wired/wireless communication service providers, telecommunication operators 

and digital content developers. Founded in 2007 and headquartered in Taipei, 

Taiwan, the company now has 150 employees and 4 offices in Tokyo, Beijing, 

Guangzhou and Seoul. 

www.ubitus.com 

Speaker Wesley Kuo, CEO 




UNITY TECHNOLOGIES 

Unity Technologies is revolutionizing the game industry with Unity, its award-winning 

breakthrough development platform. Unity Technologies has more than 

450,000 registered users worldwide — including Bigpoint, Cartoon Network, 

Coca-Cola, Disney, Electronic Arts, LEGO, Microsoft, NASA, Nickelodeon, 

Ubisoft, Warner Bros., large and small studios, indies, students and hobbyists 

— all using Unity to create games and interactive 3D on the web, mobile, 

consoles and beyond. Unity Technologies is aggressively innovating to expand 

usability, power and platform reach along with its Asset Store digital content 

marketplace and Union distribution service. 

www.unity3d.com 

Speaker David Helgason, CEO 


UNIVERSAL ROBOTICS 

Universal Robotics is a software company which has brought to market a new 

form of artificial intelligence that uses sensor information to learn. Called 

Neocortex, it discovers patterns in chaotic environments which are relevant to an 

assigned task. It then analyzes those patterns to understand complexity, 

improving process. The company has targeted the materials handling industry 

as its first market, increasing the flexibility in automated machines. Among 

various accolades, Universal won an “Emerging Company to Watch” award from 

NVIDIA in 2010. 

www.universalrobotics.com 

Speaker David Peters, CEO 



WEDNESDAY, MAY 16 & THURSDAY, MAY 17, 2012 

ROOM J 

Los Alamos National Laboratory, a leading U.S. national security research 

institution, co-locates the Accelerated HPC Symposium at GTC 2012 and brings 

together world leaders in supercomputing to share knowledge and help solve 

the world’s most crucial technology challenges. 

Symposium highlights include: 

• Learning how accelerator technologies can be leveraged in innovative ways to 

advance the state of the art for simulations on large-scale systems 

• Establishing hardware and software requirements that can meet the 

power, scalability and fault-tolerance demands of the next 

generation of HPC 

• Understanding how legacy codes can be adapted to make use of modern 

computing architectures 

• Providing a forum for feedback to the vendor community to aid in the adoption 

of accelerator technologies

AGENDA 

WEDNESDAY, MAY 16 


Plenary Session I 

14:00–14:45 Opening Keynote with Bill Barth of TACC 

14:50–15:15 A New GPU Appliance Sorin Faibish (EMC) 

15:20–15:45 Accelerator Architectures for HPC Justin Tripp (LANL) 

Plenary Session II 

16:00–16:25 Adaptive Heterogeneous Computing with OpenCL Simon McIntosh-Smith (University of Bristol) 

16:30–16:55 Accelerating Iterative Linear Solvers Hui Liu (University of Calgary) 

17:00–17:25 Efficient AMG on Hybrid GPU Clusters Thomas Brandes (SCAI) 

17:30–17:55 PISTON: Visualization Portability and Performance Christopher Sewell (LANL) 


THURSDAY, MAY 17 

Scalability: Hardware and Software 

9:00–9:10 Introduction: Justin Tripp (Chair) 

9:10–9:20 The FPGA: Another Piece of the Puzzle Justin Tripp (LANL) 

9:20–9:30 Increasing Efficiency with Kepler Stephen Jones (NVIDIA) 

9:30–9:50 Discussion 

9:50–10:00 Break 

10:00–10:10 Can You Keep All of the Astronomers Happy All of the Time? Christopher Fluke (Swinburne University of Technology) 

10:10–10:20 In situ Image Analysis for Large Scale Visualization Christopher Sewell (LANL) 

10:20–10:40 GPU Acceleration of MapReduce Miao Xin (Junnan University) 


Applications – Methods and Programming Models, Part 1 

9:00–9:10 Introduction: Guillaume Colin de Verdiere (Chair) 

9:10–9:20 Preconditioning for Large-Scale Linear Solvers Dimitar Lukarski (Karlsruhe Institute of Technology) 

9:20–9:30 Changing Data Structures for a Changing World Hui Liu (University of Calgary) 

9:30–9:40 Leveraging Roadrunner Experiences Jamaludin Mohd-Yusof (LANL) 


9:50–10:00 Break 

10:00–10:30 Taming Laser Plasma Interactions: PIConGPU Michael Bussmann (Helmholtz-Zentrum Dresden-Rossendorf) 

Applications – Methods and Programming Models, Part 2 

14:00–14:10 The Portability Wall: How hard can it really be? John Stone (Urbana Champaign) 

14:10–14:20 Accelerating NAMD James Phillips (University of Illinois) 

14:20–14:30 Refitting Legacy Software for the New Reality John Humphrey (EM Photonics) 

14:30–14:40 Unstructured Data Structures: An Achilles Heel? Raphael Poncet (CEA) 


14:50–15:00 Break 

15:00–15:10 Power: The New Metric Simon McIntosh-Smith (University of Bristol) 

15:10–15:20 It’s About Concurrency, Stupid! Stanley Tzeng (UC Davis) 


*Please note: Session details can be found within the daily sessions pages that follow. 


SPONSORED BY: 

SYNNEX 

GTC NETWORK 

Please visit these Tesla Preferred Partners exhibits and be 

entered into a daily drawing to win a free NVIDIA Tesla C2075! 

ACE Computers AMAX Appro Aspen Systems 

Colfax International Creative Consultants Exxact Technologies Microway 

Penguin Computing, Inc Seneca Data Themis

SESSION INFORMATION – 

PRE-CONFERENCE TUTORIALS – 


MONDAY, MAY 14, 09:00 (80 MINUTES) 

PRE-CONFERENCE TUTORIAL - ROOM A5 

S0005 Languages, APIs and Development Tools for 

GPU Computing 

Get a head start on the conference with this first-day introduction 

to key technologies for GPU Computing. This 90-minute tutorial 

session will cover the key features and differences between the 

major programming languages, APIs and development tools 

available today. Attendees will also learn several high level design 

patterns for consumer, professional and HPC applications, with 

practical programming considerations for each. 

Speaker(s): Will Ramey (Sr. Product Manager, GPU 

Computing, NVIDIA) 

Topic(s): General Interest, Development Tools & Libraries, Application 

Design & Porting Techniques (Beginner) 



S0023 NVIDIA OpenGL for 2012 

Attend this session to get the most out of OpenGL on NVIDIA 

Quadro and GeForce GPUs. Topics covered include the latest 

advances available for Cg 3.1, the OpenGL Shading Language 

(GLSL); programmable tessellation; improved support for 

Direct3D conventions; integration with Direct3D and CUDA 

resources; bindless graphics; and more. When you utilize the 

latest OpenGL innovations from NVIDIA in your graphics 

applications, you benefit from NVIDIA’s leadership driving OpenGL 

as a cross-platform, open industry standard. 

Speaker(s): Mark Kilgard (Principal Software Engineer, NVIDIA) 

Topic(s): Computer Graphics, Development Tools & Libraries, 

Visualization, Audio, Image and Video Processing (Intermediate) 


PRE-CONFERENCE TUTORIAL - ROOM C 

S0614 Part 1: Introduction to GPU Programming 

(Presented by Acceleware) 

Join us for an informative introduction to GPU Programming. The 

session will begin with a brief overview of CUDA and data parallelism 

before focusing on the GPU programming model. We 

will explore the fundamentals of GPU kernels, host and device 

responsibilities, CUDA syntax and thread hierarchy. A 

programming demonstration of a simple CUDA kernel will 

be provided. 

Introduction to GPU Programming 

• GPU kernels 

• Host vs. device responsibilities 

• CUDA syntax 

• Thread hierarchy 

Speaker(s): Chris Mason (Product Manager, Acceleware) 

Topic(s): Parallel Programming Languages and Compilers, 

Development Tools & Libraries (Beginner) 



S0341 See the Big Picture Scalable Visualization 

Solutions for System Integrators 

NVIDIA Quadro Scalable Visualizations Solutions provide many 

features for System Integrators who are building large-scale 

displays. Come join us in this tutorial session on how to configure 

multi-projector systems, stereoscopic and immersive displays. 

Speaker(s): Doug Traill (Senior Solutions Architect, NVIDIA) 

Topic(s): Visualization (Beginner) 


PRE-CONFERENCE TUTORIAL - ROOM B 

S0517A Programming GPUs with OpenACC (Part 1 of 3) 

OpenACC is a programming standard for parallel computing on 

accelerators (including GPUs) using directives. It is designed to 

harness the transformative power of heterogeneous computing 

systems easily and quickly. In this tutorial you will learn how to 

add simple compiler hints to your code to expose parallelism to 

the compiler, allowing it to map computation onto an accelerator. 

OpenACC directives allow developers to make simple and 

portable code changes, enabling an easier migration to 

accelerated computing. 

This is part 1 of a 3-part tutorial that will take you from an 

overview through how to optimize your code. The tutorial starts 

with an overview of OpenACC programming in which you will learn 

about applying basic OpenACC directives to your code, with 

examples. You will also learn more about how GPUs execute 

parallel programs, and apply this understanding to optimizing 

more advanced OpenACC examples to gain larger speedups and 

accelerate applications with various types of parallelism. 

Lastly, you will see how to use NVIDIA profiling tools to target 

your optimizations. 

Speaker(s): Mark Harris (Chief Technologist, GPU Computing, NVIDIA), 

Duncan Poole (Senior Manager, HPC, NVIDIA), Cliff Woolley (CUDA 

Developer Technology Engineer, NVIDIA) 

Topic(s): Parallel Programming Languages & Compilers (Beginner) 



S0603 GPU Ray Tracing 

Learn the latest approaches in leveraging GPUs for the fastest 

possible ray tracing results from experts developing and 

leveraging the NVIDIA OptiX ray tracing engine, the team behind 

NVIDIA iray, and those making custom renderers. Multiple 

rendering techniques, GPU programming languages, out-of-core 

rendering, and optimal hardware configurations will be covered in 

this cutting-edge discussion. 

Speaker(s): Phillip Miller (Director, Workstation Software Product 

Management, NVIDIA) 

Topic(s): Ray Tracing (Beginner) 



S0615 Part 2: Introduction to the GPU Architecture and 

Memory Model (Presented by Acceleware) 

Explore the memory model of the GPU. The first part of the 

session covers task parallelism and thread cooperation in GPU 

computing. The second part focuses on the different memory 

types available on the GPU. We will define shared, constant and 

global memory and discuss the best locations to store your 


application data for optimized performance. A programming 

demonstration of shared memory will be delivered. 

Introduction to the GPU Architecture and Memory Model 

• Shared memory 

• Constant memory 

• Global memory 






S0624 Introduction to CUDA C 

Starting with a background in C or C++, learn everything you need 

to know in order to start programming in CUDA C. Beginning with 

a “Hello, World” CUDA C program, explore parallel programming 

with CUDA through a number of hands-on code examples. 

Examine more deeply the various APIs available to CUDA 

applications and learn the best (and worst) ways in which to 

employ them in applications. 

Speaker(s): Justin Luitjens (Devtech Engineer, NVIDIA) 

Topic(s): Programming Languages & Techniques (Beginner) 



S0517B Programming GPUs with OpenACC (Part 2 of 3) 





In this tutorial you will learn how to add simple compiler hints 

to your code to expose parallelism to the compiler, allowing it to 

map computation onto an accelerator. 




This is part 2 of a 3-part tutorial that will take you from an 

overview through how to optimize your code. The tutorial starts 

with an overview of OpenACC programming in which you will learn 

about applying basic OpenACC directives to your code, with 

examples. You will also learn more about how GPUs execute 

parallel programs, and apply this understanding to optimizing 

more advanced OpenACC examples to gain larger speedups and 

accelerate applications with various types of parallelism. 

Lastly, you will see how to use NVIDIA profiling tools to target 

your optimizations. 







S0530 Multi-Display Roundtable 

Join NVIDIA product manager and application engineers for 

multi-display systems for an interactive discussion on the 

current trends in video walls, blended multi-projector systems 

and their deployment. 

Speaker(s): Andrew Page (Senior Product Manager, NVIDIA), Shalini 

Venkataraman (Senior Applied Engineer, NVIDIA), Ian Williams (NVIDIA) 

Topic(s): Visualization (Beginner) 



S0604 NVIDIA Advanced Rendering Solutions 

The full range of advanced rendering solutions and frameworks 

from NVIDIA will be explored in this insightful product and 

technology discussion and demonstration. Come learn about the 

latest possibilities involving advanced rendering techniques and 

how they integrate within commercial products – from production 

ray tracing to volumetric and distributed rendering. 

Speaker(s): Phillip Miller (Director, Workstation Software Product 

Management, NVIDIA) 

Topic(s): Ray Tracing (Advanced) 



S0616 Part 3: Debugging GPU Programs (Presented 

by Acceleware) 

Get the lowdown on debugging your GPU program. This session 

includes discussion on debugging techniques and tools to help 

you identify issues in your kernels. The latest debugging tools 

provided in CUDA 4.1, including Parallel Nsight, cuda-gdb and 

cuda-memcheck, will be discussed. A programming 

demonstration of Parallel Nsight will be provided. 







S0629 CUDA Accelerated Compute Libraries 

The libraries distributed in the CUDA SDK and offered by third 

parties provide a wealth of functions commonly encountered in a 

GPU acceleration project. Using these libraries can often 

significantly shorten the development time of a GPU project while 

leading to high-performance, high-quality software. In this 

tutorial, we will provide an overview of the libraries in the CUDA 

SDK, including cuBLAS, cuRAND, NPP and Thrust, and introduce 

common use cases. The audience will not only learn about the 

strengths of the individual libraries, but also learn about the 

decision making process to select the best suited library for 

their project. 

Speaker(s): Peter Messmer (NVIDIA) 




S0630 Part 1 of 2: Programming Heterogeneous Manycores 

Using Directives (Presented by CAPS) 

Directive-based programming is a very promising technology to 

deal with Many-Core. In this context, HPC users can rely on 

emerging standards such as OpenACC and OpenHMPP. CAPS will 

introduce OpenACC and HMPP directive-based programming

models with companion tools (e.g. for tracing, tuning, debugging): 

HMPP Wizard, CULA, ArrayFire, Vampir, Paraver, DDT, 

CodeletFinder, etc. The speakers will provide insights on how GPU 

/ CPU can be exploited in a unified manner and how code tuning 

issues can be minimized. The discussion will also cover the use of 

libraries which is essential when addressing Many-Core 

Programming. Pathscale will present its product supporting 

the OpenHMPP programming model. 

Speaker(s): Francois Bodin (CAPS), Christopher Bergström (Pathscale) 

Topic Area(s): Parallel Programming Languages & Compilers; 




S0027A All-In-One Debugging Experience with CUDA- 

GDB and CUDA-MEMCHECK 

CUDA Debugger tools CUDA-GDB and CUDA-MEMCHECK provide 

a whole new feature set to help improve your CUDA application 

development cycle. This session is a detailed walk-through of the 

key new features and advanced techniques on using CUDA-GDB 

and CUDA-MEMCHECK together to improve overall code 

productivity. This tutorial will also include live demos. 

This session will repeat on Wednesday at 14:00. 

Speaker(s): Geoff Gerfin (Technical Manager and Senior Engineer, 

NVIDIA), Vyas Venkataraman (Software Engineer, NVIDIA) 

Topic(s): Development Tools & Libraries (Intermediate) 



S0517C Programming GPUs with OpenACC (Part 3 of 3) 





In this tutorial you will learn how to add simple compiler hints 

to your code to expose parallelism to the compiler, allowing it to 

map computation onto an accelerator. 




This is part 3 of a 3-part tutorial that will take you from an overview 

through how to optimize your code. The tutorial starts with an 

overview of OpenACC programming in which you will learn about 

applying basic OpenACC directives to your code, with examples. You 

will also learn more about how GPUs execute parallel programs, 

and apply this understanding to optimizing more advanced 

OpenACC examples to gain larger speedups and accelerate 

applications with various types of parallelism. Lastly, you will see 

how to use NVIDIA profiling tools to target your optimizations. 







S0522 Introduction to CUDA Fortran 

This tutorial will cover various aspects of writing code in CUDA 

Fortran, which is the Fortran interface to the CUDA architecture. 

Topics covered will include a basic introduction to parallel 

programming concepts using CUDA, performance measurements 

and metrics, optimization, and multi-GPU programming via CUDA 

4.0’s peer-to-peer capability and MPI. Several case studies will be 

presented as well. 

Speaker(s): Massimiliano Fatica (Manager, NVIDIA), Gregory Ruetsch 

(Applied Engineer, NVIDIA) 




S0601 GPU-Based Video Processing Round Table 

Have questions, concerns or thoughts about the direction of 

GPU-based video and image processing? Join NVIDIA engineers 

and product managers for a lively discussion of such topics as 

application design, multi-GPU architecture, data movement, 

threading, APIs, and color management as they apply to Video and 

Image processing applications. 

Speaker(s): Alina Alt (Applied Engineer, NVIDIA), Andrew Page (Senior 

Product Manager, NVIDIA), Thomas True (Senior Applied Engineer, 

NVIDIA), Ian Williams (Director of Applied Engineering, NVIDIA), Eric 

Young (Manager of Applied Research, NVIDIA) 

Topic(s): Audio, Image and Video Processing (Beginner) 



S0617 Part 4: Introduction to Optimizations and Profiling 

(Presented by Acceleware) 

Learn how to optimize and profile your algorithms for the GPU. 

This session will cover the essentials of code optimization and will 

include: arithmetic optimizations, warps, branching efficiency, 

memory latency/occupancy and memory performance 

optimizations. Real life commercial examples will be discussed to 

highlight the critical aspects of GPU optimization techniques. A 

programming demonstration using the NVIDIA Visual Profiler will 

be included. 







S0631 Part 2: Programming Heterogeneous Many-cores 

Using Directives (Presented by CAPS) 

Directive-based programming is a very promising technology to 

deal with Many-Core. In this context, HPC users can rely on 

emerging standards such as OpenACC and OpenHMPP. CAPS will 

introduce OpenACC and HMPP directive-based programming 

models with companion tools (e.g. for tracing, tuning, debugging): 

HMPP Wizard, CULA, ArrayFire, Vampir, Paraver, DDT, 

CodeletFinder, etc. The speakers will provide insights on how GPU 

/ CPU can be exploited in a unified manner and how code tuning 

issues can be minimized. The discussion will also cover the use of 

libraries which is essential when addressing Many-Core 

Programming. Pathscale will present its product supporting 

the OpenHMPP programming model. 

Speaker(s): Francois Bodin (CAPS), Christopher Bergström (Pathscale) 

Topic Area(s): Parallel Programming Languages & Compilers; 



SESSION INFORMATION 


TUESDAY, MAY 15, 09:00 (25 MINUTES) 

ROOM J1 

S0102 Flame On: Real-Time Fire Simulation for 

Video Games 

Fire and explosions are common elements in video games and 

other virtual environments. We present a real-time fire simulator 

inspired by the paper “Directable, High-Resolution Simulation of 

Fire on the GPU” [Horvath and Geiger 2009], but this time 

implemented entirely in CUDA and targeted at adding interactive 

fire to video games. This talk will describe both the tricks necessary 

to implement an efficient fluid simulator in CUDA, and techniques 

for rendering the results to achieve realistic looking fire. 

Speaker(s): Simon Green (Senior Software Engineer, NVIDIA), 

Christopher Horvath (Global Technology Technical Director, Pixar) 

Topic(s): Computer Graphics, Computational Fluid Dynamics (Intermediate) 


MARRIOTT BALLROOM 3 

S0248 Excitements, Challenges, and Rewards In 

Optimizing GPGPU Kernels 

Learn about the excitements and challenges in optimizing CUDA 

kernels for the last two generations of NVIDIA GPGPUs. 

Autotuning, although crucially important, is not a silver bullet 

for porting code from one generation of GPU to another. The process 

required many steps: (a) architecture specific algorithms, (b) 

tuning algorithms, (c) finding innovative tricks to handle generic 

cases, (d) tweaking GPU’s internal scheduling to handle partition 

camping, and (e) above all, the dedication of many enthusiastic 

programmers. We will share our experiences and discoveries 

through the development of MAGMABLAS - a subset of CUDA 

BLAS, highly optimized for NVIDIA GPGPUs. 

Speaker(s): Rajib Nath (Student, University of California San Diego), 

Stanimire Tomov (Research Director, University of Tennessee, Knoxville) 

Topic(s): Algorithms & Numerical Techniques, Application Design & 

Porting Techniques, Supercomputing (Intermediate) 


ROOM A8 

S0268 Virtual Process Engineering - Realtime 

Simulation of Multiphase Systems 

Realtime simulation and virtual reality with quantitatively correct 

physics for industrial processes with multi-scale and multiphase 

systems was once a remote dream for process engineering, but is 

now becoming reality with CPU-GPU hybrid supercomputing. 

Numerical and visualization methods for such simulations on 

thousands of GPUs will be reported with applications in chemical 

and energy industries. 

Speaker(s): Wei Ge (Professor, Institute of Process Engineering, 

Chinese Academy of Sciences) 

Topic(s): Computational Fluid Dynamics, Molecular Dynamics, 

Computational Physics, Algorithms & Numerical Techniques (Advanced) 


ROOM A7 

S0296 A GPU-Enabled SPH Method for Micro and 

Nanofluidic Simulations 

With SPH methods, multi-phase flows within complex geometries 

can be efficiently investigated. Physical effects present in micro- 

and nanofluidic applications can also be described with little effort 

using the SPH methodology. In order to investigate microfluidic 

applications relevant to industry, large domains and high spatial 

resolutions are required. Therefore, an SPH method for accelerated 

computations on GPUs is currently being developed. The code features 

dynamic casting of computational data into blocks of appropriate 

size to fit the GPU memory layout. Also tree-like data structures 

for efficient manipulation of particle distributions help to obtain 

significant performance gains on GPU hardware. 

Speaker(s): Daniel Gaudlitz (Research Associate, Technische 

Universität München) 

Topic(s): Computational Fluid Dynamics, Algorithms & Numerical 

Techniques (Intermediate) 


ROOM J3 

S0317 Compiling a Parallel Domain Specific Language 

to GPUs 

Discuss techniques for compiling Parallel DSLs to GPUs. Verilog 

is a Domain Specific Language for Hardware Description. Verilog 

users express parallelism with guarded processes similar to 

Occam’s guarded commands. Review Verilog semantics, and 

different approaches to compiling Verilog to parallel architectures 

and to GPUs. Discuss challenges with (a) a Verilog description’s 

runtime behavior and (b) managing process dependencies. Discuss 

approaches and challenges in compiling a parallel DSL to CUDA C. 

Speaker(s): Ramesh Narayanaswamy (Principal Engineer, Synopsys Inc.) 

Topic(s): Electronic Design Automation, Application Design & Porting 



ROOM K 

S0337 High-Throughput Epistasis Screening Using GPUs 

Epistasis is the interaction of two or more genes in coding for a 

biological property. Epistasis is believed to be an important factor 

in an individual’s susceptibility to disease, and the search for 

epistasis is a major component in the development of 

personalized approaches to genomic medicine. Statistical tests 

for epistasis are typically confounded by the multiple-testing 

problem, that is, the aggregated loss of precision incurred through 

repeated hypothesis testing. One way to circumvent this problem 

is to simulate a false-discovery rate via resampling. We report 

success in using GPUs to accelerate these highly compute-intensive 

resampling techniques. 

Speaker(s): Mark Seligman (Senior Scientist, Insilicos LLC) 

Topic(s): Bioinformatics, Life Sciences, Supercomputing, 

Cloud Computing (Intermediate) 


ROOM A2 

S0395 GPU Enablement in Adobe Photoshop 

Photoshop is one of the most popular products in history. It 

attempts to delight the customers with an immersive experience. 

Since CS4, Adobe has been tapping into the horsepower of the 

GPU to create a compelling playground for the imaginations of 

creative pros. Please join us to review the latest developments on 

how GPUs have been an enabling force. 

Speaker(s): Jeff Chien (Adobe Systems), Jerry Harris (Senior Computer 

Scientist II, Adobe Systems) 

Topic(s): Digital Content Creation & Film, Audio, Image and Video 

Processing (Beginner) 



ROOM C 

S0419A Optimizing Application Performance with CUDA 

Profiling Tools 

NVIDIA provides two powerful profiling tools that you can use to 

maximize your application’s performance. The NVIDIA Visual Profiler 

helps you understand your application’s behavior with a detailed 

timeline and data from GPU performance counters. The Visual 

Profiler also provides an automatic, data-driven analysis engine that 

provides suggestions on potential optimization strategies for your 

application. Nvprof is a command-line profiler that provides 

gprof-like functionality for the GPU. Nvprof provides summary 

information about where your application is spending the most time, 

so that you can focus your optimization efforts. This session will 

provide a step-by-step walk through of both of these profiling tools, 

showing how you can use these tools to identify optimization 

opportunities at the application, kernel, and source-line levels. 

This session will repeat Wednesday at 14:00 (S0419B). 

Speaker(s): David Goodwin (Software Engineer, NVIDIA) 

Topic(s): Development Tools & Libraries (Beginner) 


ROOM J2 

S0527 GPUs and the Next-Generation Aerial Surveillance 

Graphics processors are already used for computationally 

intensive video tasks in many ISR (Intelligence, Surveillance, 

Reconnaissance) applications; a GPU-based system for video 

enhancement and analytics outperforms a similarly priced 

CPU-based system 5-to-1 at HD resolutions. Our initial tests on 64 

megapixel Wide Area Aerial Surveillance (WAAS) data show at least 

10x speedup with tasks such as super-resolution or moving target 

indication. In this talk, we’ll discuss unique design and 

implementation challenges of real-time processing of very large 

video data sets. We will demonstrate our existing GPU-based 

software, IKENA ISR, and discuss its video-processing pipeline and 

innovative processing solutions that promise to dramatically 

expand the capabilities of emerging aerial surveillance platforms. 

Speaker(s): Nikola Bozinovic (CTO, MotionDSP) 

Topic(s): General Interest (Beginner) 


ROOM A1 

S0607 High Performance 3D Perception 

The path to general purpose graphics programming was driven by 

computer graphics: the process of rendering 3D models into 2D 

viewpoints. With the advent of flexible programming of GPGPU 

processing, this process can be reversed. 3D perception is the 

problem of inferring structure and motion of the physical world 

from 2D and 3D measurements. In this talk, we will demonstrate 

the role GPGPU plays in a diverse set of applications in high speed 

3D perception and discuss optimization of these techniques for the 

GPGPU. We also demonstrate several capabilities of future 

systems which are enabled by GPGPU technologies. 

Speaker(s): Chris Slaughter (President, University of Texas Perception, 

Lynx Labs) 

Topic(s): Computer Vision (Beginner) 


ROOM J2 

S0040 Introducing CUDA in KBE Applications for Digital 

Vehicle Development Programs 

Get the latest developments in next-generation Knowledge Based 

Engineering (KBE) software, which provides real results over the 

traditional design approach. Today there are numerous KBE 

applications in the fields of vehicle ergonomics, suspension, NVH, 

safety, regulations, etc., which involve a huge number of iterations 

and mathematical algorithms. With GPU computing and CUDA, the 

KBE kernel is restructured to incorporate a parallel programming 

model, which helps the applications run faster and reduces run 

times from hours to seconds. The KBE geometry kernel also 

benefits from enabling CUDA in topology-based operations, which 

take a lot of time when performed on the CPU. 

Speaker(s): Avijit Santra (Project Manager, Knowledge Based 

Engineering, Tata Motors Limited) 

Topic(s): General Interest (Intermediate) 


ROOM K 

S0083 Swift: A GPU-based Smith-Waterman Sequence 

Alignment Program 

This session describes Swift, a GPU-based Smith-Waterman 

implementation for aligning short DNA sequences to large 

genomes. Swift has been designed to reduce computation time 

and lower hardware cost. Also, unlike other leading GPU-based 

Smith-Waterman sequence alignment programs like CUDASW++ 

and SWCUDA which focus on protein sequence alignment, Swift 

has been developed for DNA sequence alignment. Swift performs 

200x faster than CUDASW++ using a test data set containing 1000 

reads (100 bases each) and 1000 references (1000 bases each), 

and it performs 11x faster than the CPU-based implementation of 

Smith-Waterman using 24 million reads (100 bases each) and 

human chromosome 1. 

Speaker(s): Pankaj Gupta (Bioinformatics Application Developer, St 

Jude Children’s Research Hospital) 

Topic(s): Bioinformatics (Beginner) 


ROOM A7 

S0258 Sailfish: Lattice Boltzmann Fluid Simulations with 

GPUs and Python 

Learn how Run-Time Code Generation (RTCG) techniques allowed 

for fast development of a lattice Boltzmann (LB) fluid dynamics 

solver called Sailfish. Sailfish is completely open source, supports 

a wide variety of LB models (single and multiple relaxation times, 

the entropic model; single and binary fluids) and can take 

advantage of multiple GPUs. Even though the project is written 

predominantly in Python, no performance compromises are made. 

This talk will introduce the basic design principles of Sailfish and 

illustrate how RTCG makes it possible to exploit the power of GPUs with 

minimal programmer effort. 

Speaker(s): Michal Januszewski (PhD Student/Software Engineer, 

University of Silesia in Katowice/Google Switzerland) 

Topic(s): Computational Fluid Dynamics, Computational Physics, 

Development Tools & Libraries (Intermediate) 


ROOM J3 

S0329 Using GPUs to Speedup Computational Lithography 

In this paper we show how GPUs can be used to significantly 

speed up computational lithography, which is heavily used in the 

Electronic Design Automation (EDA) industry. In particular, we 

demonstrate a noticeable performance increase in several basic 

optical lithography algorithms as well as the speedup of the 

full-chip verification software, crucial parts of which were ported

to NVIDIA’s GPUs. We summarize the advantages, disadvantages 

and challenges of using GPUs and compare this approach with more traditional 

multithreading and distributed computing alternatives for the 

same applications. 

Speaker(s): Constantin Chuyeshov (Algorithm Engineer, Cadence 

Design Systems) 

Topic(s): Electronic Design Automation (Intermediate) 


ROOM A1 

S0404 Computer Vision Libraries with GPUs 

Learn how Computer Vision libraries can take advantage of GPUs. 

Computer Vision algorithms are extremely well suited for GPU 

architectures because they demand large computational power 

that GPUs offer over CPUs. This talk provides an overview of the 

different GPU libraries (such as OpenCV, GPUCV, PCL, and NPP) 

and online resources (GPU4Vision and OpeNVIDIA) 

available for developers today. Examples and demonstrations of 

practical applications making use of these libraries will also be 

shown throughout the talk. 

Speaker(s): Eric Young (Manager of Developer Technology Professional 

and Consumer Applications, NVIDIA) 

Topic(s): Computer Vision, Audio, Image and Video Processing (Beginner) 


ROOM B 

S0430 Developing Next-Generation CUDA Acceleration 

in Wolfram’s Mathematica with Parallel Nsight 

Since version 8, Mathematica offers advanced support for GPU 

acceleration with optimized CUDA functions and a built-in 

framework for developing scientific CUDA kernel code. In this 

session, the Wolfram development team will share their 

experience developing their next-generation CUDA support in 

Mathematica. From the unique ability of Parallel Nsight to attach 

its CUDA debugger to a running process, the new parallel Warp 

Watch for warp-wide variable views and expression evaluation, to 

the latest runtime CUDA profiling experiments; they will 

demonstrate how they were able to take advantage of Parallel 

Nsight to get the most out of CUDA and the GPU. 

Speaker(s): Abdul Dakkak (Kernel Developer, Wolfram), Sebastien Domine 

(Sr. Director, Software Engineering, Developer Tools, NVIDIA), Ulises 

Cervantes-Pimentel (Senior Kernel Developer, Wolfram) 



ROOM M 

S0618 Best Practices of an 800TFlop Hybrid 

Supercomputer Implementation (Presented by Appro) 

Learn about the “Frontier Computing System”, deployed by Appro 

for the University of Tsukuba Center for Computational Sciences in 

Japan containing over half a million GPU cores. Learn how 

reliability, availability, manageability and compatibility were 

essential for this successful 800TF hybrid supercomputing 

implementation. Explore new techniques in how HA-PACS is 

accelerating large scale parallel code by combining CPU/GPU 

processing cluster configurations for scientific research, such as 

astrophysics and climate modeling. Learn how to improve data I/O 

performance and overcome memory size limitations in hybrid systems 

configured with the Lustre File System, offering the best 

performance per dollar and excellent memory capacity per FLOP. 

Speaker(s): Taisuke Boku (Deputy Director of Center for Computational 

Sciences at University of Tsukuba), Steve Lyness (VP of HPC Solutions 

Engineering, Appro) 

Topic(s): Supercomputing, Astronomy & Astrophysics (Intermediate) 


ROOM NVIDIA NSIGHT LAB 

S0800 NVIDIA Nsight Lounge 

Come to the NVIDIA Nsight Lounge to meet the Nsight 

development team! Whether you would like a private meeting to 

discuss specific product features or test out your application with 

the latest version of Nsight, or you just want to hang out with the 

team after attending one of the exciting training sessions, the lab is 

a great place to learn everything you ever wanted to know about the 

tool. 

Speaker(s): NVIDIA Developer Tools Team 



ROOM J2 

S0013 GPUs for Fast Triggering in NA62 Experiment 

We discuss an approach for using commercial graphics processors 

(GPUs) at the earliest trigger stages in high-energy physics 

experiments, and study its implementation on a real trigger 

system in preparation. In particular we focus on the possibility of 

reconstructing rings in a Cherenkov detector as a building block of a 

selective trigger condition for rare decay search. Latency and 

processing rate measurements on several state-of-the-art 

devices are presented, and the potential issues related to 

processing time jitter and data transfer throughput are discussed. 

Speaker(s): Gianluca Lamanna (Researcher, CERN), Marco Sozzi 

(Associate Professor, Physics Department of Pisa) 



ROOM A8 

S0031 Unstructured Grid Numbering Schemes for GPU 

Coalescing Requirements 

Learn how to achieve high performance for computational fluid 

dynamics (CFD) solvers over unstructured grids using numbering 

schemes tailored for GPU coalescing requirements. Using these 

techniques, unstructured grid CFD solvers can make more 

effective use of memory bandwidth, which is an otherwise 

significant performance bottleneck that has so far led to relatively 

limited performance gains on GPUs in comparison to structured 

grid CFD solvers. Performance benchmarks will be shown using 

the Jet Engine Noise Reduction (JENRE) code. 

Speaker(s): Andrew Corrigan (Research Mathematician, Naval 

Research Laboratory), Johann Dahm (University of Michigan) 


Techniques, Computational Physics (Advanced) 


ROOM A7 

S0251 RANS CFD Solver on Fermi 

SJTU-NS3D is an in-house CFD code co-developed by SJTU and 

COMAC for large civil aircraft. It solves the 3D Reynolds-Averaged 

Navier-Stokes (RANS) equations on structured grids using the finite 

volume method and can be used in designing wing models. In 

this talk, we will present the design and further optimization of 

the CUDA version of SJTU-NS3D, which achieves a 20-fold speedup for 

the standard M6 wing model and a 37-fold speedup for a wing model 

candidate from COMAC on a single Fermi C2050. 



Speaker(s): James Lin (Assistant Professor, Shanghai Jiao 

Tong University) 

Topic(s): Computational Fluid Dynamics (Intermediate) 



S0255 Telecom Systems Simulations Acceleration via 

CPU/GPU Co-Processing: Turbo Codes Case Study 

Learn how the effort to accelerate simulations of a Serially 

Concatenated turbo code (SCCC) led to new techniques applicable 

to a broad range of physical layer telecommunication system 

simulations that are not natively parallel. The architectural 

features of CUDA inspired new parallelization techniques involving 

algorithm engineering, and the simulation acceleration attained 

for the iterative SCCC decoder is an example of the efficiency of 

heterogeneous GPU-CPU co-processing. Attendees will take a deep 

dive into data set and task organization strategies, as well as 

the results and insights gained. 

Speaker(s): Paolo Spallaccini (System Engineer, Ericsson) 

Topic(s): Algorithms & Numerical Techniques, Audio, Image and Video 

Processing, Supercomputing (Intermediate) 


ROOM A2 

S0300 Jet: A Domain-Specific Approach to Parallelism 

for Film Fluid Simulation 

Discover how a domain-specific language can provide not only fast 

parallel performance but also a simpler user experience in an 

environment that highly values flexibility. This talk will present the 

Jet language and heterogeneous compiler built on the LLVM 

compiler framework that enables efficient generation of X86 

machine code or NVIDIA PTX for stencil computation on 

structured grids. We show that moving target-specific 

optimizations upstream into the compiler can greatly improve the 

ability to manipulate the logic of the solver and thus lower the 

barrier-to-entry for artists and developers without compromising 

on performance. 

Speaker(s): Dan Bailey (R&D, Double Negative) 

Topic(s): Parallel Programming Languages & Compilers, Digital 

Content Creation & Film, Computational Fluid Dynamics (Intermediate) 


ROOM L 

S0343 A Quantum Chemistry Domain-Specific Language 

For Heterogeneous Clusters 

This talk discusses the development of a Domain-Specific Language 

(DSL), the tools and the related runtime for efficiently generating 

Tensor Contractions (generalized matrix multiplications), an 

important part of many quantum chemistry methods (e.g. Coupled 

Cluster Theory). Starting from a high level description of the 

computation, the tool analyses it and generates optimized C, 

OpenCL or CUDA implementations. The runtime, supporting a 

task based computation model, is then able to execute the 

generated code on GPU-accelerated heterogeneous large scale 

clusters, maximizing the utilization of the processing elements 

and minimizing communication costs. 

Speaker(s): Antonino Tumeo (Research Scientist, Pacific Northwest 

National Laboratory), Oreste Villa (Research Scientist, Pacific 

Northwest National Laboratory) 

Topic(s): Quantum Chemistry, Supercomputing (Intermediate) 


ROOM K 

S0376 Dynamic Programming on CUDA: Finding the Most 

Similar DNA Sequence 

Learn a couple of techniques to speed up compute-heavy Dynamic 

Programming algorithms on the GPU. Our particular problem 

regarded DNA sequences: given a reference sequence, how to find 

the one most similar to it among a large database? The sequences 

are millions of characters long, and their similarity is calculated with 

a (quadratic) DP algorithm, which makes the problem very tough 

even for the GPUs. We speed up both the theoretical and practical 

side: we present programming techniques that enable Dynamic 

Programming to be performed at the hardware speed, and 

improvements to the algorithm itself that drastically lower the 

execution time. 

Speaker(s): Grzegorz Kokosinski (Software Engineer, IBM Poland), 

Krzysztof Zarzycki (Senior Software Developer, IBM Poland) 

Topic(s): Bioinformatics, Algorithms & Numerical Techniques 

(Intermediate) 


ROOM J3 

S0520 Using GPUs to Speedup Chip Verification 

As VLSI designs become more complex, the process of verifying 

them becomes increasingly expensive and time consuming. 

Verification of such designs has become quite taxing as they take 

simulators to the edge in terms of both runtime demands and 

host memory requirements. In order to reduce verification time, 

different verification methodologies have been adopted including 

the use of emulators. However, emulators’ price point is high and 

so is the engineering time to set them up. Rocketick develops a 

Verilog co-simulator that uses GPUs as an acceleration platform. 

Rocketick’s product, RocketSim®, is now part of NVIDIA’s design 

flow and it is being used to accelerate simulations by 10X-30X 

compared to the standard simulator and to reduce the memory 

footprint by 5X. In this session RocketSim® will be presented using 

some real-world examples of verification flows. 

Speaker(s): Tomer Ben-David (Co-Founder and Vice President, 

R&D, Rocketick) 

Topic(s): Electronic Design Automation (Beginner) 


KEYNOTE – HALL 1 

S3000 Opening Keynote 

Do not miss this opening keynote, featuring Jen-Hsun Huang, CEO 

and Co-Founder of NVIDIA. Hear about what’s next in computing 

and graphics, and preview disruptive technologies and exciting 

demonstrations from across industries. Jen-Hsun co-founded 

NVIDIA in 1993 and has served since its inception as president, 

chief executive officer and a member of the board of directors. 

Speaker(s): Jen-Hsun Huang (CEO & Co-Founder, NVIDIA) 

Topic(s): General Interest (All Levels) 


ROOM A3 

S0024 GPU-Accelerated Path Rendering 

Standards such as Scalable Vector Graphics (SVG), PostScript, 

TrueType outline fonts, and immersive web content such as Flash 

depend on a resolution-independent 2D rendering paradigm that 

GPUs have not traditionally accelerated. This session explains a 

new opportunity to greatly accelerate vector graphics, path 

rendering, and immersive web standards using the GPU. By 



attending, you will learn how to write OpenGL applications that 

accelerate the full range of path rendering functionality. Not only 

will you learn how to render sophisticated 2D graphics with 

OpenGL, you will learn to mix such resolution-independent 

2D rendering with 3D rendering and do so at dynamic, 

real-time rates. 

Speaker(s): Mark Kilgard (Principal Software Engineer, NVIDIA) 

Topic(s): Computer Graphics, GPU Accelerated Internet, Digital 

Content Creation & Film, Visualization (Beginner) 


ROOM J3 

S0069 GPU Computing Advances in 3D 

Electromagnetic Simulation 

Learn about the latest developments in GPU acceleration for 3D 

Full Wave Electromagnetic simulation. The latest version of CST 

Studio Suite supports the full range of Tesla products on both 

Windows and Linux operating systems. Using GPU, multi-GPU and 

MPI-GPU Computing drastically reduces the simulation times for 

CST customers. We will provide a status of current and future GPU 

developments at CST and share detailed simulation results. 

Speaker(s): Andreas Buhr (Department Manager - Performance 

Optimization, CST AG), Fabrizio Zanella (Systems Manager, CST 

of America) 

Topic(s): Electronic Design Automation (Intermediate) 


ROOM C 

S0088 Point Cloud Library (PCL) on CUDA 

The Point Cloud Library (PCL - http://pointclouds.org) is a large 

scale, open project for 3D point cloud processing. The PCL 

framework contains numerous state-of-the-art algorithms 

including filtering, feature estimation, surface reconstruction, 

registration, model fitting and segmentation. Due to the massively 

parallel nature of many of the above algorithms, GPGPU 

acceleration holds great potential for achieving real-time 

performance in numerous applications. In this work we 

demonstrate some of the recent advances in GPGPU 

programming for 3D point cloud processing, and outline plans for 

future development. 

Speaker(s): Michael Dixon (Research Engineer, Willow Garage, Inc), 

Radu Rusu (Research Scientist, Willow Garage, Inc) 

Topic(s): Computer Vision, Algorithms & Numerical Techniques, 

Stereoscopic 3D, Machine Vision (Intermediate) 


ROOM A5 

S0254 Graphics in the Cloud - How NVIDIA is Enabling 

Cloud Visualization 

Engineers, artists, scientists, and gamers are the most 

demanding visual thinkers on the planet, and as such have not 

been willing to move their computing environments to the 

infamous “cloud”. These remotely accessed systems are seen as 

slow and not up to the visual experience that users expect when 

dealing with these types of applications. NVIDIA aims to change 

that perception with the NVIDIA Virtual Graphics Platform. In this 

session you will hear about the technologies behind accelerating 

graphics in the cloud, and some of the industry partnerships that 

are enabling it. 

Speaker(s): Will Wade (Manager, Quadro Advanced Technologies, NVIDIA) 

Topic(s): Cloud Computing, Visualization, Computer Graphics 




S0313 Understanding and using Atomic 

Memory Operations 

Atomic memory operations provide powerful communication and 

coordination capabilities for parallel programs, including the 

well-known operations compare-and-swap and fetch-and-add. The 

atomic operations enable the creation of parallel algorithms and 

data structures that would otherwise be very difficult (or 

impossible) to express without them - for example: shared parallel 

data structures, parallel data aggregation, and control primitives 

such as semaphores and mutexes. In this talk we will use examples 

to describe atomic operations, explain how they work, and discuss 

performance considerations and pitfalls when using them. 

Speaker(s): Stephen Jones (CUDA Developer, NVIDIA), Lars Nyland 

(Compute Architect, NVIDIA) 

Topic(s): Algorithms & Numerical Techniques, Parallel Programming 

Languages & Compilers (Advanced) 


ROOM N 

S0319 Advanced Driver Assistance System Testing 

using OptiX 

Learn in this session how AUDI AG and its partners make use 

of OptiX as a unified platform for the simulation of perception 

sensors utilizing different physical measurement principles, e.g. 

Video Camera, LIDAR, Ultra Sonic, etc. The aim is to generate 

synthetic sensor data with realistic measurement errors for 

testing Advanced Driver Assistance Systems. Get details about the 

challenges they faced during the implementation of the necessary 

tools for validating the sensor models and join the discussion 

when they describe the upcoming challenges related to real-time 

Ray Tracing and advanced material descriptions, when multiple 

sensors are simulated simultaneously. 

Speaker(s): Erwin Roth (Researcher, Technische Universitaet 

Muenchen), Tugkan Calapoglu (Lead Graphics Software Developer, 

VIRES Simulationstechnologie GmbH) 

Topic(s): Ray Tracing, Machine Vision (Intermediate) 


ROOM A8 

S0321 GPU-Based Monte Carlo Ray Tracing Simulation 

for Solar Power Plants 

Learn about real time simulations of Concentrating Thermal Solar 

Power using GPU technology to enable performance optimization 

of these utility-scale plants. By leveraging the power of GPUs and 

the inherent parallelism of a field of thousands of sun-tracking 

mirrors, we have cut the computation time by orders of magnitude 

compared with the minutes or hours of runtime previously 

required. We will present an overview of the problem 

domain and describe how we used the GPU to derive a Monte 

Carlo physics ray tracing method to simulate the flux reflected by 

the mirrors onto the solar receiver. 

Speaker(s): Michel Izygon (Tietronix Software), Claus Nilsson 

(Programmer, Tietronix Software) 

Topic(s): Energy Exploration, Computational Physics, Ray Tracing 

(Beginner) 


ROOM J2 

S0328 Best Practices in GPU-Based Video Processing 

The combination of the GPU’s massively parallel compute engine

with extremely high memory bandwidth and new programming 

paradigms such as CUDA and OpenCL has made the GPU well 

suited for image and video processing applications. This session 

will explore best practices and techniques for the development of 

efficient GPU-based video and image processing applications. 

Topics to be discussed include image segmentation and threading 

models for efficient parallelism, optimal memory usage strategies 

to reduce expensive data movement as well as multi-GPU 

considerations. Case studies and examples specific to video and 

image processing will be presented. 

Speaker(s): Thomas True (Applied Engineer, NVIDIA) 

Topic(s): Audio, Image and Video Processing, Digital Content Creation & 

Film, Computer Vision, Medical Imaging & Visualization (Intermediate) 


ROOM J1 

S0364 Interacting with Huge Particle Simulations in 

Maya with the GPU 

We present a plug-in for Maya which enables an artist to simulate 

huge particle counts in real-time by leveraging the NVIDIA GPU. 

Being able to interact with the simulation opens up new 

possibilities for modifying the workflow. We will demonstrate the 

plug-in, and provide insight into the algorithms used. 

Speaker(s): Wil Braithwaite (Senior Applied Engineer, NVIDIA) 

Topic(s): Digital Content Creation & Film, Computational Fluid 

Dynamics, Visualization (Beginner) 


ROOM A1 

S0412 A 2-Petaflops Stencil Application with 

Stereoscopic 3D Visualization - Gordon Bell Prize 2011 

Most stencil applications such as CFD and structure analysis are 

memory-bound problems. GPUs offer high performance in both 

computation and memory bandwidth, which makes them well suited 

to such problems. The TSUBAME 2.0 supercomputer, with 4,224 GPUs, 

has been in operation since November 2010. We study dendritic 

solidification in metals by solving the phase-field model. A 

performance of 2.0 petaflops was achieved for a 

4,096 x 6,500 x 10,400 mesh on 4,000 GPUs, and we 

received the ACM Gordon Bell Prize in 2011. We also demonstrated 

several large-scale stencil applications (Lattice Boltzmann, 

weather prediction and so on) with stereoscopic 3D visualization. 

Speaker(s): Takayuki Aoki (Professor, Tokyo Institute of Technology) 

Topic(s): Supercomputing, Computational Fluid Dynamics, Climate & 

Weather Modeling, Stereoscopic 3D (Intermediate) 


ROOM L 

S0418 High Productivity Computational Finance on GPUs 

Learn how Aon Benfield helps clients use GPUs to develop and 

accelerate Monte Carlo derivatives pricing models. We will 

present our PathWise software tools used by actuaries and quants 

in order to rapidly develop and deploy production quality, GPU grid 

enabled, Monte Carlo models, using only high-level languages and 

tools without requiring any knowledge of CUDA or C/C++. We will 

describe our approach of using Code Generation, Visual 

Programming, Domain Specific Languages and scripting 

languages to create a High Productivity Computing software stack 

for financial services applications. 

Speaker(s): Aamir Mohammad (Associate Director, Aon Benfield 

Securities), Peter Phillips (SVP, Aon Benfield Securities) 

Topic(s): Finance, Application Design & Porting Techniques, Parallel 

Programming Languages & Compilers (Beginner) 


ROOM A7 

S0434 Schlumberger LiveQuest: Application Delivery 

and Collaboration Solution 

The LiveQuest application delivery and collaboration solution 

allows petro-technical professionals to securely access and share 

exploration and production (E&P) applications and data, including 

3D visualization applications, anytime, anywhere. By utilizing web 

and thin-client technologies, LiveQuest provides platform-independent 

and application-agnostic real-time collaboration. In 

this session, Mario Dean will provide an introduction to the needs 

of O&G exploration from an application and large data 3D 

visualization perspective. He will discuss the LiveQuest solution 

stack, with specific focus on the 3D remote visualization 

technology, and share customer deployment examples and overall 

ROI considerations. 

Speaker(s): Mario Dean (Schlumberger) 

Topic(s): Energy Exploration (Beginner) 


HALL 1 

S0515 Multi-GPU Programming 

CUDA releases starting with 4.0 include a number of features that 

facilitate multi-GPU programming and computing. In this session 

we will review the features useful for programming for multiple 

GPUs, both within a single node and across a network. We will cover 

peer-to-peer GPU communication, communication patterns for 

various GPU topologies, as well as streams in the context of 

multiple GPUs. Concepts will be illustrated with a case study of 3D 

forward wave modeling, common in seismic computing. 

Speaker(s): Paulius Micikevicius (Developer Technology 

Engineer, NVIDIA) 

Topic(s): Parallel Programming Languages & Compilers (Advanced) 


ROOM K 

S0519 GPU Accelerated Bioinformatics Research at BGI 

Once the DNA double helix has been digitized by sequencing, 

computation is the key link between raw sequences and life science 

discoveries. As massive amounts of data are generated, processing, 

analyzing and storing them efficiently becomes a major 

challenge. By developing GPU-accelerated bioinformatics tools 

and integrating them into pipelines, BGI researchers now run 

analysis pipelines in several hours instead of several days. These 

tools include the SOAP3 aligner, SNP calling and a tool for 

population genomics. The speedup is generally around 10-50x 

compared with traditional counterparts. 

Speaker(s): BingQiang Wang (Head of High Performance 

Computing, BGI) 

Topic(s): Bioinformatics, Life Sciences, Algorithms & Numerical 

Techniques, Supercomputing (Intermediate) 


ROOM A2 

S0606 GPU-accelerated Science on Titan: Tapping into the 

World’s Preeminent GPU Supercomputer to Achieve 

Better Science 

This year, the leadership-class computing facility at Oak Ridge 

National Laboratory is upgrading its largest supercomputer for open 

science, “Jaguar”, to employ high-performance, power-efficient 

GPUs. Once the transition is complete, the machine will be known 

as “Titan”. In this extended GTC session, we will feature a range of 



presenters showcasing research codes that will run 

computational science on the GPU at scale. Through these 

selected presentations, we will investigate the progress and 

anticipated results of GPU-acceleration of these significant codes. 

In this session, we will also explain how research scientists 

interested in tapping into the immense capabilities of Titan can do 

so, through programs such as the INCITE program sponsored by 

the US Department of Energy. The presenters include: 

• National Laboratories) 

“Direct Numerical Simulation of Turbulence-Chemistry Interactions: Fundamental Insights Towards Predictive Models” 

• “S3D Direct Numerical Simulation - Preparations for the 10-100PF Era” 

• Princeton Plasma Physics Laboratory (PPPL), Princeton) 

“Fusion Energy Sciences & Computing at the Extreme Scale” 

• “Computer Simulation of Lignocellulosic Biomass” 

• Science, Princeton) 

“Toward Global Seismic Imaging based on Spectral-Element and Adjoint Methods” 

Speaker(s): Jack Wells, Ph.D. (Director of Science, Oak Ridge 

Leadership Computing Facility, Oak Ridge National Laboratory) 

Topic(s): Supercomputing (Intermediate) 


ROOM B 

S0609 Computational Graphics: An Overview of Graphics 

Research at NVIDIA 

The future of computer graphics presents many challenges. The 

worlds we render will be vastly more complex in geometry and 

artistic “texture”. Real-time rendering will use global illumination 

to achieve a far richer appearance, robustly. And content creation, 

which has grown to be the dominant cost of producing both games 

and film, must get simpler and less expensive. The NVIDIA 

Graphics Research group addresses these challenges with a focus 

on “Computational Graphics”: using general-purpose computation 

to enhance and extend the traditional pipelines and capabilities of 

real-time rendering. In this talk David Luebke, who leads graphics 

research, will give an overview of recent and ongoing work in 

computational graphics at NVIDIA Research. 

Speaker(s): David Luebke (Senior Director of Graphics 

Research, NVIDIA) 

Topic(s): Computer Graphics (Intermediate) 


ROOM M 

S0632 Learn how Adobe After Effects CS6 takes 

advantage of NVIDIA OptiX technology for 3D Ray Tracing 

(Presented by Adobe) 

Adobe After Effects CS6 unveils an amazing new 3D ray-traced 

rendering engine based on NVIDIA OptiX technology, with GPU 

acceleration up to 50x faster than a CPU alone. This enables 

simple and quick designs of realistic geometric text and shapes in 

3D space. Motion graphics artists can now create more physically 

accurate scenes with beautiful results such as reflections, 

transparency, soft shadows, and depth-of-field blur directly in 

After Effects. GPU-accelerated ray tracing drastically improves 

the workflow by enabling motion graphics artists to develop these 

3D effects entirely within After Effects. 

Speaker(s): Steve Forde (Senior Product Manager, After Effects) 

Topic(s): Digital Content Creation (Beginner) 



S0801 CUDA Debugger Training on Windows 

Nsight offers a powerful CUDA debugging feature set that 

enables developers to quickly spot bugs. With tools ranging from 

the memory checker to advanced breakpoints and the variable warp 

watch panel, a developer can quickly isolate memory access 

errors, narrow thousands of threads down to a specific thread and 

spot abnormal variable value ranges. Through a set of comprehensive 

exercises, the attendee will be able to utilize these features to 

become fully proficient at developing CUDA code. 




ROOM J3 

S0046 Application of the GPU to a Two-Part 

Computational Electromagnetic Algorithm 

The shooting and bouncing ray (SBR) method is one way to 

simulate electromagnetic field radiation. Like all methods, there 

are certain problems where it does not yield accurate results. In 

this presentation, we will explain one such case that consists of an 

antenna resonating between two metal plates. We will discuss 

how we used the graphics processing unit (GPU) to separate the 

problem into two parts. Each part is simulated individually with 

SBR producing an improved result. Such a GPU-accelerated, 

two-part approach can be applied to other more general 

hybrid simulations. 

Speaker(s): Eric Dunn (Electromagnetic Research Scientist, SAIC) 

Topic(s): Computational Physics, Algorithms & Numerical Techniques, 

Ray Tracing (Beginner) 


ROOM A1 

S0351 Strong Scaling for Molecular Dynamics 

Applications 

In this session we will talk about how to improve strong scaling for 

molecular dynamics applications. Using the NAMD molecular 

dynamics code as our primary case study, we will discuss the 

types of issues that can impede scaling, how to use already 

available and custom tools to discover such issues, and how to 

build a model to help analyze and predict scaling performance. 

Although this session is primarily focused on molecular dynamics 

applications, most of the lessons can be applied equally well to 

many other areas and applications. 

Speaker(s): Sarah Tariq (Software Engineer, NVIDIA) 

Topic(s): Molecular Dynamics, Cluster Management, Life Sciences 



ROOM A8 

S0379 GPU-based High-Performance Simulations 

for Spintronics 

The joint utilization of the electron’s charge and spin in 

“spintronics” represents a promising technology for data 

processing and storage in nanostructures. The complex quantum 

effects like the spin-Hall effect in these devices require 

demanding numerical simulations providing a convenient link 

between idealized analytical models and the often very complex results 



from measurements. The simulations involving multiplications 

and inversions of large matrices provide an ideal showcase for 

performance gain by employing GPGPUs in the execution of the 

algebraic routines on these matrices in computing environments 

with shared execution of algorithms on multiple nodes with 

multiple GPGPUs and CPU cores. 

Speaker(s): Jan Jacob (Postdoctoral Researcher, University of Hamburg) 

Topic(s): General Interest, Computational Physics, Application Design 

& Porting Techniques (Intermediate) 


ROOM K 

S0516 The Advantage of GPU Computation for Analyzing 

Complex Traits 

Most important agricultural traits and human diseases are complex 

traits controlled by gene networks with gene-by-gene 

interaction (epistasis) and gene-by-environment interaction (GE). 

New statistical methods and software have been developed for 

analyzing the genetic architecture of complex traits based on 

genome-wide association studies (GWAS). When dealing with large 

mapping populations and huge amounts of molecular information, 

GPU computation has an advantage over CPU computation. We will 

demonstrate the newly developed GPU-based software 

QTLNetwork V3.0 and GWAS-GMDR for mapping genes with 

epistasis and GE interaction for complex traits of humans, crops, 

and mice. 

Speaker(s): Jun Zhu (Professor, Zhejiang University) 

Topic(s): Bioinformatics, Life Sciences (Intermediate) 


ROOM B 

S0610 Octree-Based Sparse Voxelization For Real-Time 

Global Illumination 

Discrete voxel representations are generating growing interest in 

a wide range of applications in computational sciences and 

particularly in computer graphics. A new real-time usage of 

dynamic voxelization inside a sparse voxel octree is to compute 

voxel-based global illumination. When used in real-time contexts, 

it becomes critical to achieve fast 3D scan conversion (also called 

voxelization) of traditional triangle-based surface representations. 

This talk describes a new surface voxelization algorithm that 

produces a sparse voxel representation of a triangle mesh scene 

in the form of an octree structure using the GPU hardware 

rasterizer. In order to scale to very large scenes, our approach 

avoids relying on an intermediate full regular grid to build the 

structure and constructs the octree directly. 

Speaker(s): Cyril Crassin (Postdoctoral Research Scientist, NVIDIA) 



ROOM A2 

S0655 Direct Numerical Simulation of Turbulence- 

Chemistry Interactions: Fundamental Insights Towards 

Predictive Models 

Recent petascale direct numerical simulations (DNS) of turbulent 

combustion have transformed our ability to interrogate fine-grained 

‘turbulence-chemistry’ interactions in canonical 

laboratory configurations. In particular, three-dimensional DNS, 

at moderate Reynolds numbers and with complex chemistry, is 

providing unprecedented levels of detail to understand 

fundamental coupling between turbulence, mixing and reaction. 

This information is leading to new physical insight and is providing 

unique validation data for assessing model assumptions in 

coarse-grained engineering CFD approaches used to design 

modern combustors. The role of petascale DNS is illustrated 

through selected examples relevant to controlling ignition and 

combustion rates in homogeneous charge compression ignition 

engines and to fuel injection processes in stationary gas turbines 

for power generation. Petascale simulations presently generate 

upwards of a petabyte of complex, multi-scale, time-varying data 

used by combustion modelers to validate subfilter combustion and 

mixing models in large-eddy simulation. With the advent of 10-20 

petaflop hybrid architectures with accelerators like Titan at Oak 

Ridge National Laboratory, it will be possible to dramatically 

increase the chemical complexity of DNS. This will help accelerate 

the development of predictive subprocess models which will be 

used by engine developers to better understand and tailor the 

combustion of gasoline and new, more complex types of fuels in 

advanced engines. With Titan, simulations will move beyond 

today’s studies of simple fuels—hydrogen, syngas and methane— 

to more complex, larger-molecule hydrocarbon fuels like 

isooctane (a surrogate for gasoline), commercially important 

oxygenated alcohols (for example, ethanol and butanol), and 

biofuel surrogates. 

Speaker(s): Jacqueline H. Chen (Combustion Research Facility, Sandia 

National Laboratories) 



ROOM L 

S0034 Real-Time Risk Simulation: The GPU Revolution In 

Profit Margin Analysis 

Discover how ICHEC helped a world-leading company in its sector 

dramatically speed up and improve the quality of its real-time 

risk management tool chain. In this session, we present the 

method used for porting the core part of the simulation engines 

to GPUs using CUDA. The port covered two very 

different simulation algorithms and resulted in speed-ups of 2 to 

3 orders of magnitude, allowing much greater accuracy of the 

results in a real-time environment. 

Speaker(s): Gilles Civario (Senior Software Architect, ICHEC), Renato 

Miceli (Computational Scientist, ICHEC) 

Topic(s): Finance, Application Design & Porting Techniques, Algorithms 

& Numerical Techniques (Intermediate) 


ROOM C 

S0036 Multiparticle Collision Dynamics on GPUs 

See how we employ GPUs to simulate the interaction of millions of 

solvent and solute particles of a fluid system. Often the domain of 

large cluster systems, the most time-consuming part of our 

simulations can now be done on desktop PCs in reasonable time. 

This contribution shows how GPUs can effectively be used to 

accelerate existing programs and how techniques like streaming 

and increased data locality significantly enhance calculation 

throughput. It also shows how a GPU-optimized program 

structure yields usually expensive additional functionality “almost 

free”. Furthermore, a well-scaling single-node/multi-GPU 

implementation of the program is presented. 

Speaker(s): Elmar Westphal (Software Developer, 

Forschungszentrum Juelich) 

Topic(s): Computational Physics, Computational Fluid Dynamics, 

Molecular Dynamics (Intermediate)


ROOM J2 

S0049 Using the GPU Direct for Video API 

This tutorial will demonstrate how video I/O devices can take 

advantage of the GPU Direct for Video API to optimize the data 

transfer performance for digital video, film and broadcast 

applications and computer vision applications. The GPU Direct for 

Video API is a technology that permits the DMA transfer of data 

buffers between video I/O devices and the GPU through the use of 

a shared system memory buffer for immediate processing by 

OpenGL, DirectX, CUDA and OpenCL. This direct transfer can 

improve synchronization and eliminate latency between video 

capture, GPU processing and video output. 

Speaker(s): Alina Alt (Applied Engineer, NVIDIA), Thomas True (Applied Engineer, NVIDIA) 


Topic(s): Audio, Image and Video Processing, Development Tools & 

Libraries, Digital Content Creation & Film, Machine Vision (Advanced) 


ROOM A8 

S0067 PIConGPU - Bringing large-scale Laser Plasma 

Simulations to GPU Supercomputing 

With powerful lasers breaking the Petawatt barrier, applications 

for laser-accelerated particle beams are gaining more interest 

than ever. Ion beams accelerated by intense laser pulses foster 

new ways of treating cancer and make them available to more 

people than ever before. Laser-generated electron beams can 

drive new compact x-ray sources to create snapshots of ultrafast 

processes in materials. With PIConGPU laser-driven particle 

acceleration can be computed in hours compared to weeks on 

standard CPU clusters. We present the techniques behind 

PIConGPU, detailed performance analysis and the benefits of 

PIConGPU for real-world physics cases. 

Speaker(s): Michael Bussmann (Junior Group Leader Computational 

Radiation Physics, Helmholtz-Zentrum Dresden-Rossendorf), Guido 

Juckeland (System Engineer (HPC), Technical University Dresden) 

Topic(s): Computational Physics, Algorithms & Numerical Techniques, 

Application Design & Porting Techniques, Supercomputing (Advanced) 


ROOM A1 

S0075 Oculus Real-Time Modular Cognitive Vision System 

This session will explore ways to integrate GPU processing into a 

real-time computer vision architecture. While there has been a 

rapid push to move vision algorithms onto GPUs, integration into 

an efficient vision system architecture remains elusive. We will 

discuss our development of a modular vision system architecture 

that enables rapid prototyping of complex pipelines using multiple 

GPUs. The system incorporates modules for segmentation, 

disparity mapping, optical flow and particle filter tracking on the 

GPU. Our talk will explore the various difficulties associated with 

developing such a system and will give a hands-on demonstration 

of Oculus, our vision platform. 

Speaker(s): Jeremie Papon (PhD Student, University of Gottingen), 

Alexey Abramov (PhD Student, University of Gottingen) 

Topic(s): Computer Vision, Audio, Image and Video Processing, 

Application Design & Porting Techniques, Machine Vision (Intermediate) 


ROOM N 

S0223 Rapid Training of Acoustic Models Using GPUs 

Learn how to realize robust and accurate speech recognition 

systems by training acoustic models on GPUs. For common 

languages, state-of-the-art systems are now trained on 

thousands of hours of speech data, which can take weeks even 

with a large cluster of machines. To overcome this development 

bottleneck, we propose a new framework for rapid training of 

acoustic models using highly parallel GPUs. With a single NVIDIA 

GTX580 GPU, our proposed approach is shown to be 51x faster 

than a sequential CPU implementation, enabling a moderately 

sized acoustic model to be trained on 1000-hour speech data in 

just over 9 hours. 

Speaker(s): Jike Chong (Co-Director of CUDA Research Center, 

Carnegie Mellon University), Ian Lane (Assistant Research Professor, 

Carnegie Mellon University) 

Topic(s): Audio, Image and Video Processing, Machine Learning & AI 




S0308 Recent Trends in Hierarchical N-body Methods 

on GPUs 

See the newest developments in the area of hierarchical N-body 

methods for GPU computing. Hierarchical N-body methods have 

O(N) complexity, are compute bound, and require very little 

synchronization, which makes them a favorable algorithm on 

next-generation supercomputers. In this session we will cover 

topics such as hybridization of treecodes and fast multipole 

methods, auto-tuning kernels for heterogeneous systems, fast tree 

construction based on prefix sums, fast load balancing of global 

trees, and more. Examples will be given using ExaFMM, an open 

source hierarchical N-body library for heterogeneous systems 

developed by the speaker and released at SC11. 

Speaker(s): Rio Yokota (Research Scientist, King Abdullah University of 

Science and Technology) 

Topic(s): Algorithms & Numerical Techniques, Supercomputing, 

Development Tools & Libraries (Intermediate) 


ROOM J3 

S0349 Tree Accumulation on the GPU 

Learn how to map irregular tree structured computations to the 

GPU efficiently. See how extremely irregular data-dependent 

computations can be implemented by composing them out of 

regular data-parallel primitives. In particular we focus on the 

problem of tree accumulation, a generalization of the scan primitive 

to arbitrary tree data structures. We first show how tree orderings 

and properties can be computed using the Euler tour technique and 

standard scan primitives. Using these orderings we then develop 

our new approach to computing tree accumulations in parallel. 

Speaker(s): Scott Rostrup (Software Engineer, Synopsys Inc) 

Topic(s): Algorithms & Numerical Techniques, Application Design & 

Porting Techniques (Advanced) 


ROOM J1 

S0403 NURBS Tessellation with CUDA 

NURBS, or Non-Uniform Rational B-Splines, are a curved surface 

representation commonly used in computer aided design and 

digital content creation. This recursive representation gives a great 

deal of flexibility, allowing arbitrary surface order and knot vectors, 

enabling a single NURBS surface to contain many contiguous 

patches. However, this recursive representation is also expensive to 

compute, so a NURBS surface is often converted into multiple 

Bezier patches before being tessellated. In this implementation, we 


present an efficient method for directly tessellating NURBS 

surfaces using the NVIDIA CUDA computing API. 

Speaker(s): Brent Oster (Applied Engineer, NVIDIA) 

Topic(s): Computer Graphics (Advanced) 


ROOM A3 

S0407 A High Level Programming Environment for 

Accelerated Computing 

One of the critical hurdles for the widespread adoption of accelerated 

computing in HPC is programming difficulty. Users need a simple 

programming model that is portable and is not significantly different 

from the approaches used on current multi-core x86 processors. In 

this talk I will present Cray’s strategy for accelerator programming, 

which is based on a high level programming environment with tightly 

coupled compilers, libraries, and tools. Ease of use comes from 

compilers that let users write applications in Fortran, C, and C++, 

tools that help users port and optimize for accelerators, and 

auto-tuned scientific libraries. 

Speaker(s): Luiz DeRose (Director of Programming Environment, 

Cray Inc.) 

Topic(s): Development Tools & Libraries, Parallel Programming 

Languages & Compilers (Intermediate) 


ROOM A5 

S0413 Delivering 3D Professional Graphics from the 

Cloud with Citrix XenDesktop 

Recent technological advances have made it practical to deliver 

3D professional graphics applications from the Cloud (private or 

public) with a high quality user experience and at an attractive 

cost. Organizations can keep their intellectual property safe in the 

data center since only fully-rendered screen images are sent over 

the network. Users in remote locations no longer have to wait for 

large file transfers. And they can access 3D models from a wide 

variety of devices, including iPads and Android tablets. Learn how 

Citrix XenDesktop, XenServer and Receiver technologies have 

made all of this a reality for many organizations today. 

Speaker(s): Derek Thorslund (Director of Product Management, Citrix 

Systems, Inc.) 

Topic(s): Cloud Computing, Computer Graphics, Visualization (Beginner) 


ROOM A7 

S0436 Integrated GPU Acceleration With Real Time 

Visualization Of Terabyte Data 

Computation and visualization don’t necessarily have to act as 

two separate entities. This talk explains the integration of real-time 

compute with real-time visualization. Industry and academia have 

provided attractive solutions for compiler-directive optimized code 

for computations. To support cases that involve massive yet ad-hoc 

data I/O and computation with interactive visualization, Hue 

developed a different model which bridges the gap between 

“complete system rewrite” and “compiler directive optimized code”. 

The talk explains how highly optimized data I/O mechanisms 

coupled with predefined input and output definitions for kernels 

provide excellent scalability and interactivity during runtime. 

Speaker(s): Kelly Walker (Senior Software Developer, Hue) 

Topic(s): Visualization, Energy Exploration (Beginner) 


ROOM B 

S0611 Edge-Aware Shaders for Real-Time 

Computer Graphics 

The most common approach in rendering is to define behavior at a 

point in terms of material properties and incident illumination. 

That approach works well when the geometry and material 

properties are well-known, and the light physics are simulated 

accurately. We present a technique to help in situations where the 

model and/or physics is incomplete. This technique augments 

shaders with information about nearby edges, such as corners 

and boundaries between materials, and makes it natural to add 

richness procedurally near these visually critical regions. 

Speaker(s): Peter-Pike Sloan (Principal Research Scientist, NVIDIA) 



ROOM M 

S0620 VSIPL++: A High-Level Programming Model 

for Productivity and Performance (Presented by 

Mentor Graphics) 

Learn how VSIPL++ can improve your productivity and provide 

software portability, without sacrificing performance. We will 

describe how VSIPL++’s open-standard high-level programming 

model addresses the challenges of writing high-performance 

embedded software on GP-GPUs and other heterogeneous 

hardware, using advanced C++ techniques and data abstraction – 

and how we make this work in the real world. We will also present 

a comparison of performance results from various configurations 

of CPU and GP-GPU processing engines for a signal processing 

application developed using VSIPL++. 

Speaker(s): Brooks Moses, Ph.D. (Sourcerer, Mentor 

Graphics Corporation) 

Topic(s): Supercomputing (Beginner) 


ROOM A2 

S0625 S3D Direct Numerical Simulation - Preparations 

for the 10-100PF Era 

The evolution of supercomputing into the mid-petaflop era has 

been typified by heterogeneous compute nodes with the majority of 

the compute capability delivered by a large number of lightweight 

cores. In order to prepare for the extension of this trend, the DNS 

code S3D has been retooled in anticipation of a target architecture 

offering 10s of thousands of heterogeneous nodes containing many 

X86 cores as well as GPU derived accelerators. Movement of outer 

loops to the highest level in the code facilitates hybrid MPI-OpenMP 

performance and an elegant path to accelerated kernels using 

OpenACC. It is anticipated that relevant scientific simulations at this 

scale will have a per-node footprint that can be contained entirely 

on the accelerator, so provision is made to maintain primary 

solution variables in accelerator memory with specific regions 

moved to the CPU for inter-node communication and workload 

balancing. With the current performance it is estimated that the 

new code will make it possible to meet early science goals with the 

full build-out of the anticipated Titan system as well as provide a 

platform to transition into the exascale software research space. 

Speaker(s): Ray Grout (National Renewable Energy Laboratory) 

Topic(s): Supercomputing (Beginner)



S0802 CUDA Profiler Training on Windows 

Nsight offers a comprehensive set of performance analysis tools. 

From tracing complete system multi-core CPU and 

multi-GPU activity to profiling CUDA kernels with precise profiling 

experiments, developers can identify system level optimization 

opportunities as well as expensive and inefficient CUDA kernels 

requiring in-depth analysis with the CUDA profiler. Through a set 

of comprehensive exercises, the attendee will be able to utilize 

these features to become fully proficient at optimizing complex 

CUDA applications. 




ROOM K 

S0152 Accurate Sequence Alignment using Distributed 

Filtering on GPU Clusters 

Learn how GPUs enable new ways to rethink a complex 

bioinformatics problem: Accurate sequence alignment. What was 

once prohibitive to compute can become the basic block of novel 

GPU-based algorithms. Modern DNA sequencing machines 

generate enormous amounts of short sequences within minutes, 

and they should be aligned to a reference genome in real time. 

Most solutions only find a few locations that match a short 

sequence. We introduce a new technique to find all matching 

locations inside a reference sequence for a given number of 

mismatches. Our technique is based on a distributed filtering 

scheme and GPU based processing. 

Speaker(s): Reza Farivar (PhD Student, University of Illinois at Urbana- 

Champaign), Shivaram Venkataraman (PhD Student, UC Berkeley) 




ROOM J3 

S0316 Using GPUs to Accelerate Synthetic Aperture 

Sonar Imaging via Backpropagation 

This presentation describes our development of a GPU-accelerated 

backpropagation implementation for Synthetic 

Aperture Sonar systems that supports multiple nodes via MPI and 

multi-GPU nodes. This implementation can form a complex-valued 

gigapixel image in one hour on a single C2050. We further 

scale this implementation to the Keeneland system where we can 

form the same gigapixel image in 21 seconds on 48 nodes with 

144 C2070 Tesla GPUs. Our talk will discuss the details of our 

implementation, including our optimizations and scaling results 

for various node and GPU configurations, as well as the 

applicability to other domains, including Synthetic Aperture Radar. 

Speaker(s): Thomas Benson (Research Engineer II, Georgia Tech 

Research Institute) 

Topic(s): Application Design & Porting Techniques (Intermediate) 


ROOM J1 

S0366 OptiX Out-of-Core and CPU Rendering 

OptiX has broken some major barriers recently by enabling 

out-of-GPU-core memory rendering and by adding a CPU 

rendering back-end when an OptiX-capable GPU is not present in 

the system. OptiX users and CUDA developers will be interested in 

how we accomplished these feats within the existing GPU 

architecture. This talk will provide a brief introduction to OptiX and 

then dive into what the new features provide. We will then go 

under the covers and show how we pulled it off. 

Speaker(s): David McAllister (OptiX Manager, NVIDIA, OptiX group) 

Topic(s): Ray Tracing, Computer Graphics (Intermediate) 


ROOM B 

S0409 Stochastic Rasterization 

Learn how to render transparency, motion blur, and depth of field 

effects in real time using random sampling. These effects 

combine multiple objects in each pixel, making them expensive to 

compute directly. But recent research shows that, with stratified 

sampling and clever reconstruction, good image quality can be 

achieved with surprisingly small numbers of samples per pixel. 

We will explain how to do this on the GPU, and explore trade-offs 

of performance, quality, accuracy, and noise. 

Speaker(s): Eric Enderton (Research Scientist, NVIDIA), Morgan 

McGuire (Visiting Professor, NVIDIA and Williams College) 

Topic(s): Computer Graphics, Digital Content Creation & Film 



ROOM A7 

S0444 Explore New Techniques in Volume Rendering/ 

Segmentation with Open Inventor 

The goal of this session is to show the improvements in quality, 

performance and flexibility of the volume rendering implementation 

of Open Inventor. The latest GPU techniques, such as virtual 

textures and ray casting, have been combined into a flexible shader 

API and applied on out of core data. The techniques of volume 

rendering, sugarcube rendering, basic and complex clipping, 

sculpting, editing and segmentation will be demonstrated using 

examples from a geobody extraction workflow. The great ease and 

flexibility of the shader pipeline API will be illustrated, and we will 

discuss the broad future perspectives of that technology. 

Speaker(s): Mike Heck (Technology Advisor, VSG) 

Topic(s): Computer Graphics (Advanced) 


ROOM A2 

S0654 Fusion Energy Sciences & Computing at the 

Extreme Scale 

The fusion energy sciences community has made excellent progress 

in developing advanced codes for which computer run-time and 

problem size scale well with the number of processors on massively 

parallel supercomputers. A good example is the effective usage of 

the full power of modern leadership class computational platforms 

from the terascale to the petascale and beyond to produce nonlinear 

particle-in-cell simulations which have accelerated progress in 

understanding the nature of plasma turbulence in magnetically confined 

high temperature plasmas. Illustrative results provide great 

encouragement for being able to include increasingly realistic 

dynamics in extreme-scale computing campaigns to enable 

predictive simulations with unprecedented physics fidelity. 

Speaker(s): William Tang (Fusion Simulation Program at the Princeton 

Plasma Physics Laboratory (PPPL), Princeton) 

Topic(s): Supercomputing (Intermediate) 



TUESDAY, MAY 15, 16:00 (50 MINUTES) 

ROOM K 

S0008 Algorithms and Tools for Bioinformatics on GPUs 

Learn how to use GPUs to accelerate compute- and data-intensive 

applications and algorithms in bioinformatics. High-throughput 

techniques for DNA sequencing and gene expression analysis with 

microarrays have led to a rapid growth in the amount of digital 

biological data, e.g. the NCBI Sequence Read Archive (SRA) houses 

raw sequence data generated by next-generation sequencing (NGS) 

technologies, which exceeds 25 trillion base pairs. Therefore, 

modern bioinformatics tools need to be scalable; i.e. they need to 

deal with an ever growing amount of data. GPUs and CUDA provide 

the opportunity to significantly reduce the runtime of many 

biological algorithms on inexpensive hardware. 

Speaker(s): Bertil Schmidt (Nanyang Technological University) 

Topic(s): Bioinformatics, Life Sciences (Intermediate) 


ROOM J3 

S0050 High Performance Logic Simulation with GPUs 

Verification has become the bottleneck of the IC design process due to 

its fast-increasing complexity. The fundamental means of verifying 

digital circuits is logic simulation, which can be performed at both 

register-transfer level (RTL) and gate level. In this work, we 

developed GPU based logic simulation solutions. We implemented 

a Chandy-Misra-Bryant parallel simulation protocol on GPUs for 

sufficient parallelism. A dynamic GPU memory allocator was 

introduced to efficiently manage GPU memory resources. RTL 

simulation is performed in a compiled-code scheme by translating 

Verilog code into equivalent CUDA code. Experimental results 

proved that the GPU simulators significantly outperform their 

CPU counterparts. 

Speaker(s): Yangdong Deng (Associate Professor, Tsinghua University) 

Topic(s): General Interest, Algorithms & Numerical Techniques 

(Advanced) 


ROOM A1 

S0062 Inverse 3D Vision: Detection and Tracking of 

NVIDIA Glasses 

Computer Vision is becoming increasingly popular and important 

nowadays. With the advent of powerful mobile devices and 

increasing power of desktop PCs, it is important to improve user 

experience by tackling the hardest problems of real-time 

interaction with the user. These include body part tracking and face 

and gesture recognition. This talk discusses techniques behind an 

interaction pattern between a user and a 3D visualization system, in 

which the system tracks the position of NVIDIA 3D Vision Glasses, 

and accounts for this information during rendering. The mentioned 

techniques include Histograms of Oriented Gradients and Template 

Matching. The system implementation is discussed too. 

Speaker(s): Anton Obukhov (Engineering Consultant, Ubiquiti Networks) 

Topic(s): Computer Vision, Machine Vision, Development Tools & 

Libraries (Advanced) 


ROOM A3 

S0089 Accelerator Directives, OpenACC and OpenMP4ACC 

Rather than require the programmer to rewrite code for 

accelerators, several directive sets have been created and 

proposed to support non-cache coherent and cache coherent 

accelerators. This talk will present the OpenACC specification and 

its implementation for Cray developers, as well as touch on a 

similar proposal being evaluated by the OpenMP language 

committee. The presentation will start by discussing the Memory 

and Execution model needed to allow a programmer to write 

codes that will run effectively on both distinct memory systems 

and unified memory systems. Once a proper background has been 

set, the directives will be examined via usage examples. 

Speaker(s): James Beyer (Software Engineer, Cray Inc), David Oehmke 

(Cray Inc.) 

Topic(s): Parallel Programming Languages & Compilers, 

Supercomputing (Intermediate) 


ROOM C 

S0108 An Innovative Massively Parallelized Molecular 

Dynamic Software 

In this paper, we present how we improved the speedup of the 

electronic structure calculator VASP by more than an order of 

magnitude. Recent research at IFP Energies Nouvelles has shown 

that by coupling traditional clusters or 

High Performance Computing (HPC) machines with accelerators 

based on graphical processor units (GPUs), by recoding the most 

time-consuming parts of the codes (with programming languages 

like CUDA, OpenCL) and offloading them to the graphics chips, it is 

possible to reduce the computing time to ensure a speedup of a 

factor of 5 to 15. 

Speaker(s): Thomas Guignon (Research Engineer, IFPEN), Ani Anciaux 

Sedrakian (IFP Energies Nouvelles) 

Topic(s): Molecular Dynamics, Supercomputing, Application Design & 

Porting Techniques (Intermediate) 



S0221 1024 Bit Parallel Rational Arithmetic Operators for 

the GPU 

Learn how to create a set of rational arithmetic operators that 

manipulate 1024 bit operands on a Tesla C2050. These operators 

are used to create a numerically stable implementation for Bessel 

functions. Naive implementations of the Bessel functions produce 

unreliable results when they are used to solve Maxwell’s 

equations by way of Mie theory. Maxwell’s equations are used to 

model the scattering of light by small particles. Light scatter is 

used in Particle Characterization to measure the quality of 

materials like cocoa, cement and pharmaceuticals. 

Speaker(s): Robert Zigon (Sr. Staff Development Engineer, 

Beckman Coulter) 

Topic(s): Algorithms & Numerical Techniques, Computational Physics 



ROOM A8 

S0245 Porting Legacy Plasma Codes to GPU 

Learn how to port legacy Fortran plasma codes to GPU. Many legacy 

plasma codes are written in Fortran and have many lines of code. 

We will discuss techniques for porting such legacy codes easily and 

efficiently to CUDA C/C++. Performance analysis of major algorithmic 

patterns in plasma codes will be discussed. The discussion will use 

the GTC and GeFi plasma codes as realistic examples. 

Speaker(s): Peng Wang (Devtech Engineer, NVIDIA) 

Topic(s): Computational Physics (Intermediate) 



ROOM A5 

S0261 Scalable GPU Computing Service Architecture 

In this session we describe our GPU accelerated computing 

service which supports several internal business processes in a 

large scale company setup. The service supports diverse 

computational needs such as on-demand rendering, mesh 

optimization, a Massive Multiplayer Online Game (MMO), product 

visualizations and other demanding computational tasks. We 

present the architectural considerations for a service-oriented 

computational framework and the practical lessons and 

opportunities encountered during development of an enterprise 

system using NVIDIA technologies such as CUDA, OptiX, OpenGL 

and OpenCL. Our aim is to share knowledge and present LEGO’s 

vision for a GPU accelerated computational platform as a 

business-driven technology. 

Speaker(s): Henrik Høj Madsen (Solution Architect, LEGO), Michael 

Schøler (Senior Consultant, LEGO) 

Topic(s): Cloud Computing, Computer Graphics, Ray Tracing 



ROOM A7 

S0336 GPU Acceleration for Seismic 

Interpretation Algorithms 

The oil and gas industry is already leveraging GPUs for seismic 

data processing, but what about 3D seismic interpretation? This 

session will cover how the GPU is being used by TerraSpark 

Geosciences to dramatically decrease the runtime of algorithms 

for enhancing faults, computing horizon orientation, and 

calculating volumetric curvature. We will share our experiences in 

porting these techniques to the GPU, the challenges encountered, 

the solutions found, and, of course, the benefits to execution time. 

Speaker(s): Jonathan Marbach (Director, Software Architecture and 

Engineering, TerraSpark Geosciences) 



ROOM J2 

S0356 Optimized Texture Transfers 

Many real world graphics applications need to transfer textures 

efficiently in and out of GPU memory in the form of 2D images, 

2.5D terrains or 3D volumes as well as their time-varying 

counterparts. The first part of this talk covers technical pointers 

on how to optimize your OpenGL application to overlap transfers 

with rendering using the NVIDIA Copy Engines. The second part 

demonstrates the integration and performance of this feature 

within a real-world latency-sensitive broadcast graphics 

application from VizRT. 

Speaker(s): Shalini Venkataraman (Senior Applied Engineer, NVIDIA), 

Gerhard Lang (Chief Engineering Officer, VizRT) 

Topic(s): Computer Graphics, Visualization 


ROOM L 

S0435 Leveraging GPGPU Technology for Valuation of 

Complex Insurance Products 

We share our experiences moving a mature, large scale insurance 

application from a CPU to GPU environment. This session explores 

the nuances of porting a C++ application when ‘blank sheet’ 

re-architecture is not an option. This session will cover: insurance 

differences from other financial products (and the implications for 

the GPU); considerations when moving an existing, fully featured 

C++ system to a GPGPU platform; supporting CPU and GPU 

implementations from a single code base; supporting user-defined 

code extensions on the GPU; CUDA 4.0 C++ extensions (experiences, 

challenges and limitations); and a performance case study. 

Speaker(s): Chris Stiefeling (Oliver Wyman Financial Services) 

Topic(s): Finance (Intermediate) 


ROOM N 

S0526 Tools for Mobile Computational Photography 

This session will talk about advances in Mobile Computational 

Photography and the tools that NVIDIA is putting together to 

enable these on Tegra powered devices. It will demonstrate the 

use of FCam, an Application Programming Interface (API) that 

allows for easy and precise control of the camera system. In 

addition, the FCam API can enable the application developer to 

replace basic camera routines such as metering, which are 

typically hidden inside black boxes in traditional camera 

programming models. 

Speaker(s): Alejandro Troccoli (Mobile Imaging Researcher, NVIDIA) 

Topic(s): Computational Photography (Intermediate) 


ROOM M 

S0638 Lenovo ThinkStation Accelerates Medical 

Research with Beckman Coulter (Presented by Lenovo) 

Lenovo ThinkStations utilize NVIDIA Maximus technology to 

accelerate mission critical applications across multiple industries, 

including manufacturing, media & entertainment, and Life 

Sciences. Discover how GPUs are used to accelerate medical 

research from product experts with Lenovo and Beckman Coulter. 

Beckman Coulter has utilized NVIDIA GPUs to reduce software 

development and test cycles by 50% with their Kaluza software. 

Kaluza is a revolutionary flow cytometry analysis software solution 

that provides visualization tools, speed and an innovative 

simplicity to the flow community. See how Kaluza allows users to 

analyze 10 million cells in real time. Session attendees will 

receive a drawing entry to win a brand new ThinkPad Tablet. 

Speaker(s): Scott Ruppert (ThinkStation Technical Solutions Manager, 

Lenovo), Tanmay Dharmadhikari (Senior Software Development 

Engineer, Beckman-Coulter) 

Topic(s): Computer Graphics, Life Sciences (Beginner) 


HALL 1 

S0641 CUDA 5 and Beyond 

CUDA, NVIDIA’s platform for parallel computing, has grown 

rapidly in the past 5 years. The performance and efficiency of 

software built on CUDA, combined with a thriving ecosystem of 

programming languages, libraries, tools, training, and service 

providers, have helped make GPU computing a leading HPC 

technology. CUDA 5 and the Kepler GPU architecture don’t just 

increase application performance; they enable a more powerful 

parallel programming model that expands the possibilities of GPU 

computing, and language features that improve programmer 

productivity. In this talk you’ll hear about these revolutionary 

features and get insight into the philosophy driving the 

development of new CUDA hardware and software. You will learn 

about NVIDIA’s vision for CUDA and the challenges for the future 

of parallel software development.

Speaker(s): Mark Harris (Chief Technologist, GPU Computing, NVIDIA) 









team after attending one of the exciting training sessions, the 

lounge is a great place to learn everything you ever wanted to know 

about the tool. 




ROOM J1 

S0021 OptiX for DirectX Programmers - EVE Online’s 

GPU-Raytraced Portraits 

By integrating NVIDIA’s OptiX system for real-time GPU raytracing 

into a DirectX9 based engine, CCP Games enables high-quality 

raytraced player portraits for the single shard MMO EVE Online, 

reusing the game’s assets and pipeline. We selectively add 

stochastic effects while closely maintaining the look of the 

DX9-based renderer that Art Direction aimed for. In this talk we 

approach OptiX from the point of view of a programmer familiar 

with DirectX, discuss integrating these two systems, and show 

how we reproduced some DirectX-based effects like transparency 

and subsurface scattering within OptiX. 

Speaker(s): Bert Peers (Senior Graphics Programmer, CCP Games) 

Topic(s): Ray Tracing, Computer Graphics, Application Design & 



ROOM A1 

S0104 GPU Implementation of Deep Learning for 

Intelligent Computer Vision 

Learn how to use GPU supercomputing for intelligent computer 

vision, via deep learning algorithms. We will focus on a case study 

of visual object and event recognition in a humanoid robotics 

context, involving a port to CUDA of the DeSTIN “compositional 

spatiotemporal deep learning network” vision processing 

algorithm (originally implemented at the University of Tennessee 

in Knoxville for conventional serial computers). The audience will 

learn how to use the open-source DeSTIN CUDA code, and also 

how to port other deep learning algorithms to CUDA. 

Speaker(s): Ben Goertzel (CEO, Novamente LLC) 

Topic(s): Computer Vision, Algorithms & Numerical Techniques 

(Advanced) 


ROOM C 

S0314 Efficient k-Nearest Neighbor Search Algorithms 


Come see how to select the k smallest elements from an unsorted 

list. We present a selection and combination of different 

algorithms that perform exact k-nearest neighbors search 

(k-NNS) on GPUs and outperform the competition. In this session 

we present four different selection algorithms designed to exploit 

differently the parallelization of the GPU according to the relative 

size of the corpus data set, the size of the query set and the 

number of neighbors sought. We show the application of Logo 

Retrieval with SIFT vector matching on two different GPUs, the 

Tesla C1060 and the Fermi GTX480. 

Speaker(s): Nikos Pitsianis (Assistant Professor, Aristotle University, 

Greece), Xiaobai Sun (Professor, Duke University) 

Topic(s): Machine Learning & AI, Databases, Data Mining, Business 

Intelligence, Algorithms & Numerical Techniques (Beginner) 


ROOM A7 

S0628 GPUs in Energy & Exploration: Software 

Development and Production 

This session will feature expert panelists that will share their 

experience adopting GPUs in their respective environments. Since 

2009, these production systems have been boosting throughput 

and shortening cycle times while delivering enhanced images using 

NVIDIA technologies. Featured panelists will include: Hess, 

Schlumberger, Petrobras, Chevron and more. 

Speaker(s): Paulius Micikevicius (Developer Technology Engineer, NVIDIA), 

Alexander Loddoch (Chevron), Dave Nichols (Schlumberger), Paulo Souza 

(Petrobras), Mauricio Araya (Repsol) 



ROOM A2 

S0659 Computer Simulation of Lignocellulosic Biomass 

Biomass from terrestrial plants offers the potential of an 

abundant source of cellulosic ethanol. However, technical 

problems arising from the recalcitrance of biomass to hydrolysis 

still hinder the cost-effective conversion of biomass to ethanol. 

Here, computer simulation of biomass is employed to understand 

the physical origins of biomass recalcitrance. The temperature-dependent 

structure and dynamics of lignin polymers in aqueous 

solution are examined using extensive molecular dynamics 

simulations. Neutron scattering experiments and molecular 

dynamics simulations reveal the structure of lignin aggregates. 

Finally, the interaction of lignin with cellulose is examined and 

differential binding to crystalline and amorphous cellulose 

explained thermodynamically. 

Speaker(s): Loukas Petridis (Staff Scientist, Oak Ridge National 

Laboratory) 

Topic(s): Supercomputing (Intermediate) 


ROOM K 

S0037 SeqNFind: Application Of CUDA GPU 

Technologies To Sequence Alignment Techniques 

Explosive growth in the amount of genomic data has created a 

need for faster systems that align and compare nucleotide 

sequences. With the development of tools for leveraging the 

massively parallel architecture of NVIDIA GPUs it is a logical next 

step to construct algorithms for genomic analysis on GPU clouds/ 

clusters. Although a seemingly simple task, there are a number of 

challenges to deploying the current algorithms. Every algorithm 

from Smith-Waterman to BLAST has its own unique set of 

barriers. Presented here are some of the lessons learned and how 

ongoing genomic research projects have benefitted from the 

increased speed and accuracy. 

Speaker(s): D. Andrew Carr (Director of Bioinformatics, Accelerated 

Technology Laboratories) 


(Advanced) 



HALL 1 

S0156 Towards Computing the Cure for Cancer 

Attend this session to learn about how to create “designer” 

genomic analysis pipelines as part of the “Compute the Cure” for 

cancer initiative from NVIDIA Foundation. It will offer an overview 

of an open-source framework that enables the creation of 

customized genomic analysis pipelines. It will discuss how 

different plug-ins from the “mapping/realignment/discovery” 

repositories, respectively, can be composed to form a genomic 

analysis pipeline. Attendees will learn to use next-generation 

sequencing data to characterize previously undetectable genetic 

changes between normal and malignant cells and ways to 

contribute to the “Compute the Cure” cause. 

Speaker(s): Wu Feng (Professor, Virginia Tech), Heshan Lin (Research 

Scientist, Virginia Tech) 

Topic(s): Bioinformatics, Life Sciences, Supercomputing, Algorithms & 

Numerical Techniques (Intermediate) 


ROOM C 

S0219 Efficient Top-Down Planning in 

Business Intelligence 

In business intelligence, tasks like corporate planning or what-if 

analysis complement traditional reporting and analysis. One main 

difference is that while the latter only read data, the former 

require the change of possibly large numbers of existing and 

creation of new data records in the business model, preferably in 

real time. In this session, we describe the extension of an existing 

BI tool, Jedox OLAP, by GPU-based parallel algorithms for 

interactive planning scenarios. Compared to sequential in-memory 

algorithms, our CUDA approach yields tremendous 

speedups and can also cope with large amounts of data by using 

multiple GPUs. 

Speaker(s): Tobias Lauer (Senior Researcher, Jedox AG), Alexander 

Haberstroh (Software Developer, Jedox AG) 

Topic(s): Databases, Data Mining, Business Intelligence, Finance, 

Algorithms & Numerical Techniques (Intermediate) 



S0247 3D ADI Method for Fluid Simulation on 

Multiple GPUs 

Find out about a multiple GPU implementation of the Alternating 

Direction Implicit method for large 3D domains. The ADI technique 

is applied towards direct numerical fluid simulation. Modeling 

complex flows demands extremely large grids and a distributed 

computation is required for sharing the memory among multiple 

GPUs. In this session a novel distributed tridiagonal solver as well 

as parallelization and load balancing strategies will be covered in 

detail. Finally, a comprehensive performance analysis and scaling 

studies for different input geometries and possible future 

improvements will be discussed. 

Speaker(s): Nikolay Markovskiy (HPC DevTech Engineer, NVIDIA), 

Nikolai Sakharnykh (Developer Technology Engineer, NVIDIA) 

Topic(s): Algorithms & Numerical Techniques, Computational 

Fluid Dynamics (Intermediate) 
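Each implicit sweep of an ADI scheme reduces to many independent tridiagonal systems along one grid axis. As a point of reference for the S0247 abstract, the sketch below shows the classic sequential Thomas algorithm for one such system. It is not the distributed solver presented in the session; the function name and argument layout are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Solves one tridiagonal system with the sequential Thomas algorithm.
// a: sub-diagonal (a[0] unused), b: main diagonal,
// c: super-diagonal (c[n-1] unused), d: right-hand side.
// Assumes a diagonally dominant system, so no pivoting is performed.
std::vector<double> solve_tridiagonal(const std::vector<double>& a,
                                      const std::vector<double>& b,
                                      const std::vector<double>& c,
                                      const std::vector<double>& d) {
    const std::size_t n = b.size();
    assert(n > 0 && a.size() == n && c.size() == n && d.size() == n);
    std::vector<double> cp(n), dp(n), x(n);
    // Forward elimination: remove the sub-diagonal.
    cp[0] = c[0] / b[0];
    dp[0] = d[0] / b[0];
    for (std::size_t i = 1; i < n; ++i) {
        const double denom = b[i] - a[i] * cp[i - 1];
        cp[i] = c[i] / denom;
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom;
    }
    // Back substitution from the last unknown to the first.
    x[n - 1] = dp[n - 1];
    for (std::size_t i = n - 1; i-- > 0;) {
        x[i] = dp[i] - cp[i] * x[i + 1];
    }
    return x;
}
```

On a GPU, one thread or thread block would typically handle each grid line, which is why the memory layout of the lines matters for performance.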


ROOM J2 

S0267A Mixing Graphics and Compute with Multiple GPUs 

In this session we will cover all the different aspects of interaction 

between graphics and compute. The first part of the session will 

focus on compute API interoperability with OpenGL (using CUDA 

and OpenCL APIs), while the second part of the session will delve 

into interoperability at a system level. In particular we will go 

through the challenges and benefits of dedicating one GPU for 

compute and another for graphics, how different system 

configurations affect data transfer between two GPUs, and how it 

translates into application design decisions helping to enable an 

efficient, cross-GPU interoperability between compute and 

graphics contexts. 

This session is repeated on Thursday at 15:30 (S0267B). 

Speaker(s): Alina Alt (Applied Engineer, NVIDIA) 

Topic(s): Visualization, Application Design & Porting Techniques 

(Beginner) 


ROOM A5 

S0359 VMware and NVIDIA: Delivering 3D Workstations 

from the Cloud 

This session will detail the delivery of the most demanding 

Workstation class workloads from the private cloud using 

technologies from NVIDIA and VMware. We will cover the 

configuration and performance metrics of the combined VMware, 

NVIDIA direct pass through hardware accelerated graphics 

solution. Using sample workloads, we will demonstrate how 

customers can realize the operational and security benefits of 

cloud based personal computing without sacrificing performance. 

Speaker(s): Aaron Blasius (Sr. Product Manager, VMware), Warren 

Ponder (Director, Product Management, VMware) 

Topic(s): Visualization, Cloud Computing (Advanced) 


ROOM L 

S0427 Intra-Day Risk-Management with Parallelized 

Algorithms on GPUs 

The challenge with intra-day risk management is that a very large 

number of calculations are required to be performed in a very 

short amount of time. Typically, we may be interested in 

calculating VaR for 100 to 1000 securities per second based on 

100 million potential scenarios. The magnitude of these 

calculations is not Utopian but it reflects the reality of modern 

financial institutions and exchanges. In this presentation, we 

outline how the complex problem of intra-day risk management 

can be solved using parallelized algorithms on GPUs. The 

methodology has been proven in a proof of concept (POC) at two financial institutions. 

Speaker(s): Partha Sen (CEO, Fuzzy Logix) 

Topic(s): Databases, Data Mining, Business Intelligence, Finance, 

Algorithms & Numerical Techniques, Supercomputing (Advanced) 
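To make the scale of the S0427 problem concrete, the sketch below computes value at risk (VaR) for one security from a vector of simulated profit-and-loss scenarios. It is a sequential illustration only, not the Fuzzy Logix method; the function name is hypothetical, and the quantile convention (truncating the tail index) is an assumption, since several conventions are in use.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: VaR at the given confidence level, reported as a
// positive loss. `pnl` holds one simulated profit-and-loss value per scenario.
double value_at_risk(std::vector<double> pnl, double confidence) {
    assert(!pnl.empty() && confidence > 0.0 && confidence < 1.0);
    // Assumed convention: index of the (1 - confidence) quantile, truncated.
    std::size_t idx = static_cast<std::size_t>(
        (1.0 - confidence) * static_cast<double>(pnl.size()));
    if (idx >= pnl.size()) {
        idx = pnl.size() - 1;
    }
    // Selection rather than a full sort: only the idx-th smallest
    // value is needed.
    std::nth_element(pnl.begin(),
                     pnl.begin() + static_cast<std::ptrdiff_t>(idx),
                     pnl.end());
    return -pnl[idx];
}
```

With 100 million scenarios per security, both the scenario generation and this selection step are natural candidates for parallel execution.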


ROOM A3 

S0602 An Introduction to the Thrust Parallel 

Algorithms Library 

Thrust is a parallel algorithms library which resembles the C++ 

Standard Template Library (STL). Thrust’s high-level interface 

greatly enhances developer productivity while enabling performance 

portability between GPUs and multicore CPUs. Interoperability with 

established technologies (such as CUDA, TBB and OpenMP) 

facilitates integration with existing software. In this talk we’ll walk 

through the library’s main features and explain how developers can 

build high-performance applications rapidly with Thrust.

Speaker(s): Nathan Bell (Senior Research Scientist, NVIDIA), Julien 

Demouth (Developer Technology Engineer, NVIDIA) 


Development Tools and Libraries (Beginner) 
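Because Thrust resembles the STL, the programming style described in the S0602 abstract can be previewed with plain standard-library code. The sketch below uses only the C++ standard library so that it compiles without the CUDA toolkit; the two helper names are hypothetical. In Thrust, the analogous calls would be `thrust::transform`, `thrust::reduce` and `thrust::inner_product`, operating on `thrust::device_vector` containers.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Transform-then-reduce: square every element, then sum the results.
double sum_of_squares(const std::vector<double>& v) {
    std::vector<double> squares(v.size());
    std::transform(v.begin(), v.end(), squares.begin(),
                   [](double x) { return x * x; });
    return std::accumulate(squares.begin(), squares.end(), 0.0);
}

// Dot product of two equally sized vectors.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    return std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
}
```

The appeal of this style is that the algorithm calls stay the same while the container type decides where the data lives and where the work runs.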

TUESDAY, MAY 15, 17:00 (25 MINUTES) 

ROOM A2 

S0608 Toward Global Seismic Imaging based on 

Spectral-Element and Adjoint Methods 

Precise information about the structure of the solid Earth comes 

from seismograms recorded at the surface of a highly 

heterogeneous lithosphere. Seismic imaging based on spectral-element 

and adjoint methods can assimilate this information into 

three-dimensional models of elastic and anelastic structure. 

These methods fully account for the physics of wave excitation, 

propagation, and interaction by numerically solving the 

inhomogeneous equations of motion for a heterogeneous 

anelastic solid. Such methods require the execution of complex 

computational procedures that challenge the most advanced 

high-performance computing systems. Current research is 

petascale; future research will require exascale capabilities. We 

illustrate the current state-of-the-art based on an inversion for 

European upper-mantle structure. Our ultimate goal is to move 

toward “adjoint tomography” of the entire planet. 

Speaker(s): Jeroen Tromp (Director, Princeton Institute for 

Computational Science, Princeton) 

Topic(s): Supercomputing (Intermediate) 


ROOM M 

S0643 Hybrid Architectures for Advanced Seismic 

Imaging: Recent Experiences at Bull (Presented by Bull) 

The two-part presentation describes Bull’s system architecture 

for accelerated seismic applications using GPUs, together with 

the parallel programming aspects involved and some examples of 

recent work. The first part covers hybrid system architectures, 

basic principles of Reverse Time Migration and the numerical 

methods used to implement it in various forms, together with the 

architectural features needed, depending on the specific 

algorithms used. The second part examines CUDA programming 

aspects and the use of compiler-based directives and libraries to 

convert existing codes for maximum performance and scalability 

on GPU architectures. 

Speaker(s): Mathieu Dubois (Senior HPC Consultant, Bull), Guy Gueritz 

(Oil & Gas Business Development Director, Bull) 

Topic(s): Energy Exploration, High Performance Computing 



ROOM A8 

S0646 Massively Parallel Code Development on Stelletto 

CDA (Presented by Creative Consultants) 

Come participate in the global launch of Stelletto – a multi-Node, 

office based, GPU accelerated conSTELLAtion compute platform. 

Join Rob Farber (author/scientist), Denis Gerrer (CAPS 

Enterprise), and Greg Scantlen (Creative Consultants) to learn 

how to create and leverage massively parallel applications. 

Whether you are porting legacy code or developing new code from 

scratch, the Stelletto Code Development Appliance offers a 

cost-effective methodology for producing scalable apps. In 50 

minutes you will learn the essentials of assembling a complete 

hardware and software solution for scalable Many-Core and GPU 

accelerated code development from plug-in Stelletto to massively 

parallel executable code. 

Speaker(s): Rob Farber (BlackDog Endeavors, LLC), Denis Gerrer 

(CAPS enterprise), Greg Scantlen (CreativeC.com) 

Topic(s): Parallel Programming Languages & Compilers (Beginner) 
















ROOM C 

S0043 30x Faster Regular Expressions on a GPU 

We present a regular expression (regex) engine on a GPU. We 

utilize the highly parallel architecture of GPUs to accelerate such 

searches. We believe that previous attempts to utilize the GPU for 

this task did not fully tap its potential. Regex present imbalanced 

compute workloads which are very different from common GPU 

applications (CFD, CG and image processing). Hence, they can 

teach us general lessons on how to utilize GPUs for more general 

workloads. Our initial results show a 30x improvement in running 

time relative to single threaded commercial regex engines. 

Speaker(s): David Lehavi (Senior Research Scientist, HP) 

Topic(s): Databases, Data Mining, Business Intelligence (Advanced) 


ROOM K 

S0287 Jacket for Multidimensional Scaling in Genomics 

In this tutorial, we will present AccelerEyes’ Jacket software 

which enables GPU computing in MATLAB through a user case 

study entitled “Multidimensional Scaling for Genomics”. We show 

how Jacket enables developers to write and run code on the GPU 

in the native M-Language used in MATLAB. By simply casting data 

to Jacket’s GPU data structure, MATLAB functions are 

transformed into GPU functions. Additionally, we will also include 

demos of running MATLAB code on the GPU for image and signal 

processing, life science, finance, and other applications. A Q/A 

session will enable audience members to ask specific questions 

about Jacket. 

Speaker(s): Chris McClanahan (Software Engineer, AccelerEyes) 



ROOM A2 

S0657 Applying for INCITE Program, Conclusions, Q&A 

This session offers a wrap-up of “GPU-accelerated Science on 

Titan: Tapping into the World’s Preeminent GPU Supercomputer to 

Achieve Better Science” with Jack Wells. 

Speaker(s): Jack Wells, Ph.D. (Director of Science, Oak Ridge 

Leadership Computing Facility, Oak Ridge National Laboratory ) 



C++ Accelerated Massive Parallelism (C++ AMP) 

What is C++ AMP, how can it help me, and where can I get it? 

C++ AMP is a key new C++ language feature plus an STL-like library. It's designed to help you increase the performance of […] 

What platforms and hardware does C++ AMP support? 

What new language feature does C++ AMP introduce? 

Microsoft added the restrict(amp) […] function can be executed on a C++ AMP accelerator. The restrict keyword instructs the compiler to statically check that the […] 

void myFunc() restrict(amp) {…} 

[…] for purposes that are unrelated to C++ AMP. 

What new classes (APIs) does C++ AMP introduce? 

What does C++ AMP code look like? 

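#include <amp.h>

// Element-wise sum of two n x m integer arrays on a C++ AMP accelerator.
void AddArrays(int n, int m, int * pA, int * pB, int * pSum) {
    // Wrap the host buffers in 2-D views over the same data.
    concurrency::array_view<int, 2> a(n, m, pA), b(n, m, pB), sum(n, m, pSum);
    // Run the lambda once for every index in the 2-D extent of sum.
    concurrency::parallel_for_each(sum.extent, [=](concurrency::index<2> i) restrict(amp)
    {
        sum[i] = a[i] + b[i];
    });
}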




WEDNESDAY, MAY 16, 09:00 (50 MINUTES) 

ROOM N 

S0010 Towards Routine Microsecond Molecular 

Dynamics Simulations on Commodity Hardware 

The original AMBER 11 provided performance on one GPU 

equivalent to an 8 node cluster and almost 60ns/day for 8 GPUs 

running the JAC production benchmark without additional 

approximations, outstripping the performance of all conventional 

supercomputers. Here we describe further optimization of the 

code, coupled with hardware and software advances on the part of 

NVIDIA, that provides performance of >50ns/day on a single GPU 

with multiple GPUs providing simulation rates on systems the size 

of DHFR approaching a microsecond per day. This brings 

performance levels on desktops and commodity hybrid clusters to 

levels previously only considered possible using custom silicon. 

Speaker(s): Ross Walker (Assistant Professor, University of California 

San Diego) 

Topic(s): Molecular Dynamics, Life Sciences (Advanced) 


ROOM A8 

S0017 4D Medical Image Processing with CUDA 

Learn how to do 4D image processing with CUDA, especially for 

medical imaging applications. In this session we will give a couple 

of examples of how 4D image processing can take advantage of 

the computational power of the GPU. We will present how to use 

the GPU for functional magnetic resonance imaging (fMRI) 

analysis and true 4D image denoising. Most of our examples use 

the GPU both to speed up the analysis and to visualize the results. 

Speaker(s): Anders Eklund (PhD Student, Linköping University) 

Topic(s): Medical Imaging & Visualization, Audio, Image and Video 

Processing, Neuroscience, Visualization (Advanced) 


ROOM B 

S0072 GPU-Enabled Spatiotemporal Model of Stochastic 

Cardiac Calcium Dynamics and Arrhythmias 

Calcium ions play a central role in controlling the contraction of the 

heart to pump blood. This requires tight regulation of cellular 

calcium dynamics which depends upon over 1,000,000 calcium 

channels that open and close stochastically and have a very specific 

spatial arrangement. In the School of Systems Biology at George 

Mason University, CUDA technology coupled to novel algorithms for 

Monte Carlo simulation has made possible this computationally 

expensive spatiotemporal model of calcium dynamics in the heart 

muscle cell to study the regulation of calcium dynamics and which 

aberrations lead to cardiac arrhythmia. 

Speaker(s): Mohsin Jafri (Professor and Chair, George Mason University), 

Hoang-Tron Minh Tuan (PhD Student, George Mason University) 

Topic(s): Life Sciences, Bioinformatics (Beginner) 


ROOM A7 

S0171 Numerical Modeling Of 3D Anisotropic Seismic 

Wave Propagation On MultiGPU Platforms 

We present an efficient and accurate numerical algorithm for the 

simulation of seismic experiments. The basis of the approach is a 

heterogeneous spectral element method implemented on 

MultiGPU applied to the anisotropic elastic wave equation. The 

approach was designed to simulate wave propagation in 3D 

arbitrary anisotropic elastic media. Due to the use of an 

unstructured grid, the spectral element algorithm enables 

handling complicated geometries of the layers. We discuss results 

and computational efforts of simulation on a MultiGPU platform. 

Several aspects of the code implementation are considered: 

optimal domain decomposition, data transfers between GPUs by 

means of P2P and UVA, etc. 

Speaker(s): Denis Sabitov (Schlumberger) 

Topic(s): Energy Exploration, Algorithms & Numerical Techniques, 

Supercomputing, Molecular Dynamics (Intermediate) 


ROOM M 

S0253 Sensor Processing with Rugged Kepler GPUs 

(Presented by GE Intelligent Platforms) 

Swimming in sensors and drowning in data? Turn the tide on 

high-bandwidth sensors with rugged next-generation Kepler GPUs 

from NVIDIA. See how we deploy Kepler into the most extreme of 

environments, providing GPGPU capabilities onboard platforms 

where SWaP and GFLOPS/watt are key. Dig into four realtime CUDA 

sensor processing applications - Hyperspectral Imaging, Wide-Area 

Surveillance, 360° Situational Awareness, and GSM Cellular SIGINT. 

Discuss the CUDA algorithms, interconnects, and rugged platforms 

behind each. Learn how we utilize GPUDirect and realtime Linux for 

improved latency and determinism. 

Speaker(s): Dustin Franklin (GPGPU Applications Engineer, GE 

Intelligent Platforms) 

Topic(s): Audio, Image and Video Processing, General Interest, Machine 

Vision, Computer Vision (Intermediate) 



S0289 Fine-Grained Parallel Preconditioners for Fast 

GPU-based Solvers 

Leverage the power of GPUs for efficient parallel solution of large 

sparse linear systems of equations by means of fine-grained and 

scalable parallel preconditioners. In this session we describe 

parallel preconditioners for GPUs based on multicolor re-ordering 

for Gauss-Seidel-type and ILU-type preconditioners as well as 

approximate inverse (FSAI) preconditioners. With the power(q)-pattern 

method we detail a novel method for controlling the fill-in 

pattern of ILU(p) factorizations that introduces a high degree of 

parallelism in the preconditioning phase. We demonstrate 

significant improvements with respect to solver time for various 

problem scenarios and different Krylov-type solvers. 

Speaker(s): Dimitar Lukarski (Research Associate, Karlsruhe Institute 

of Technology (KIT)), Jan-Philipp Weiss (Junior Professor, Karlsruhe 

Institute of Technology) 

Topic(s): Algorithms & Numerical Techniques (Advanced) 
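The multicolor re-ordering mentioned in the S0289 abstract can be illustrated in its simplest two-color (red-black) form. The sketch below applies it to a 1D model problem and is an assumption-laden illustration, not the preconditioners from the session; the function name and the model system are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One red-black Gauss-Seidel sweep for the 1D model system
//   -x[i-1] + 2*x[i] - x[i+1] = rhs[i],  with zero values outside the range.
void red_black_sweep(std::vector<double>& x, const std::vector<double>& rhs) {
    assert(x.size() == rhs.size());
    const std::size_t n = x.size();
    for (std::size_t color = 0; color < 2; ++color) {
        // Unknowns of one color depend only on unknowns of the other
        // color, so every iteration of this inner loop is independent
        // and could be assigned to its own GPU thread.
        for (std::size_t i = color; i < n; i += 2) {
            const double left = (i > 0) ? x[i - 1] : 0.0;
            const double right = (i + 1 < n) ? x[i + 1] : 0.0;
            x[i] = 0.5 * (rhs[i] + left + right);
        }
    }
}
```

For general sparse matrices, a graph coloring of the matrix plays the role of the even and odd index sets used here.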


ROOM A1 

S0353 Programming Multi-GPU’s for Scalable Rendering 

Multi-GPU configurations are becoming common, affordable 

options for OpenGL applications to scale performance, data size, 

display size and image quality. We show how to structure your 

application for multi-GPU rendering by using multiple threads and 

OpenGL contexts and how to handle the synchronization and data 

transfer. We conclude with a discussion of how to implement 

common parallel rendering approaches such as sort-first, 

sort-last and hybrid techniques. 

Speaker(s): Shalini Venkataraman (Senior Applied Engineer, NVIDIA) 

Topic(s): Visualization (Advanced) 


ROOM L 

S0383 Speedup Derivatives and Structured Products 

Pricing, Reduce TCO Using GPUs 

Numerix will share its experience using GPU to significantly 

reduce its customers’ Total Cost of Ownership (TCO) and 

accelerate forward Monte Carlo pricing methods and hybrid 

models of complex financial structured products and variable 

annuities. Numerix will describe how it combines complex 

financial and actuarial modeling with user scripting to drive GPU 

execution from a script interpreted at run time. This architecture 

is well suited to financial services firms with portfolios of many 

different types of structured products where deals are 

represented independently from the models used to price them. 

Speaker(s): Steve Karmesin (Senior Developer, Numerix) 

Topic(s): Finance, Algorithms & Numerical Techniques (Intermediate) 


ROOM A5 

S0420 NSight IDE for Linux and Mac 

NSight IDE for Linux and Mac is an all-in-one development 

environment that lets you develop, debug and optimize CUDA code in 

an integrated UI environment. If you were waiting for an IDE on Linux 

and Mac then this session is for you. This session provides a detailed 

usage walk-through of a fully CUDA aware source editor, build 

integration of the CUDA toolchain, graphical debugger for both CPU 

and GPU, and graphical profiler to enable performance optimization. 

Speaker(s): David Goodwin (Software Engineer, NVIDIA), Eugene 

Ostroukhov (Tools Developer, NVIDIA) 



ROOM K 

S0431 Evolving Use of GPU for Dassault Systems 

Simulation Products 

SIMULIA, the Dassault Systems brand for simulation, has been 

working with NVIDIA GPGPU cards to accelerate the computation 

required in doing large-scale structural finite-element 

simulations with the widely used Abaqus product line. SIMULIA’s 

initial efforts with GPGPUs have been focused on accelerating 

particularly costly parts of the code when running both on 

workstations and clusters. We will look at success in these areas 

with existing products. Further, SIMULIA is now looking at how 

evolving programming models like OpenACC open the door to 

using GPUs as a compute platform rather than as acceleration for 

limited parts of an application. 

Speaker(s): Luis Crivelli (Dassault Systemes, SIMULIA) 

Topic(s): Computational Structural Mechanics, Parallel Programming 

Languages & Compilers (Intermediate) 


ROOM C 

S0531 Exascaling Your Apps 

In the global exascale race, hardware often takes center stage. 

But the race might ultimately be won or lost based on how well 

the industry optimizes new and existing applications for extreme 

parallelism. Today’s apps will not just run on tomorrow’s systems, 

so we must think strategically and creatively about how to design 

applications that take maximum advantage of the first power- 

efficient, accelerator-driven exascale systems. This panel of HPC, 

software and computer science experts will discuss what we can 

and should be doing, including a review of new scientific and 

commercial HPC requirements, programming model options and 

how to best align architecture and software design processes. 

Speaker(s): Mike Bernhardt (The Exascale Report), Olav Lindtjorn 

(Schlumberger), Satoshi Matsuoka (Titech), Steve Scott (CTO, Tesla 

Business, NVIDIA), Jeff Vetter (Oak Ridge National Laboratory) 

Topic(s): Supercomputing (Beginner) 













Speaker(s): NVIDIA Developer Tools Team 




S2000 Emerging Companies Summit Opening Address, 

Followed by CEO on Stage featuring Rocketick and Cortexica 

The Emerging Companies Summit is a unique forum for startup 

companies to showcase innovative applications that leverage the 

GPU to solve visual and compute-intensive problems. The opening 

address includes an overview of NVIDIA’s GPU ecosystem 

development activities. ECS is a great opportunity to discover new 

players in the GPU ecosystem, find great investments, explore 

partnership and customer/vendor opportunities, network/build 

relationships, and discuss the future of an industry that is 

reshaping computing. Immediately following the opening address is 

the ECS CEO on Stage session featuring two startups who will each 

have 15 minutes to introduce their companies and interact with a 

panel of leading venture capitalists, technology executives, and 

industry analysts. 

Speaker(s): Jeff Herbst (Vice President of Business Development, NVIDIA), 

Tomer Ben-David (VP R&D, Rocketick), Iain McCready (CEO, Cortexica) 

Topic(s): General Interest 


ROOM K 

S0225 Speedup Altair RADIOSS Solvers Using NVIDIA GPU 

Solvers are the heart of Altair’s HyperWorks computer aided 

engineering simulation software. In this session, you will learn how 

GPUs can improve their performance. The direct solver is widely used in 

structural analysis and sensitivity calculations. By offloading the 

intensive matrix computation to the GPU and using heterogeneous 

computing, you will discover how its speed can be increased 

compared to a multi-core approach. The iterative solver is particularly 

suited to solving large problems with millions of degrees of freedom. 

An innovative hybrid parallelization using multiple GPUs and MPI, 

allowing dramatic solution time reduction, will be presented. 

Speaker(s): Eric Lequiniou (Director, High Performance Computing, 

Altair), Hongwei Zhou (Senior Software Development Engineer, Altair) 

Topic(s): Computational Structural Mechanics (Beginner)



S0415 An Accelerated Weeks Method for Numerical 

Laplace Transform Inversion 

Mathematical methods based on the use of the Laplace transform 

are a standard component of undergraduate education. Real-world 

problems, however, often yield Laplace space solutions which are 

too complex to be analytically inverted to expressions in physically 

meaningful variables. A robust numerical inversion approach is 

thus desirable. In this talk, I present one of the approaches to 

compute an approximate inverse, the Weeks method. I will also 

discuss the difficulties in performing numerical inversion. Finally, 

I will show how we have been able to utilize Jacket from 

AccelerEyes in MATLAB to more efficiently and robustly 

implement the Weeks method. 

Speaker(s): Patrick Kano (Co-Owner, Acunum Algorithms and 

Simulations, LLC) 

Topic(s): Algorithms & Numerical Techniques (Beginner) 


ROOM A2 

S0016 NVIDIA Grad Fellowship Fast Forward 

We invite you to a special presentation from our 2011-2012 

Graduate Fellowship recipients to learn “what’s next” in the world 

of research and academia. The NVIDIA Graduate Fellowship 

recipients were selected from 200 applications in 27 countries. 

Sponsored projects involve a variety of technical challenges, 

including computer architecture, computer vision, programmability 

and optimization for heterogeneous systems, automotive computing 

and much more. We believe that these minds lead the future in our 

industry and we are proud to support the 2011-2012 NVIDIA 

Graduate Fellows. For more information on the 2011-2012 NVIDIA 

Graduate Fellows, please visit www.NVIDIA.com/fellowship. 

Speaker(s): David Luebke (Director, NVIDIA Research) 



ROOM N 

S0058 Advancing GPU Molecular Dynamics: Rigid Bodies 

in HOOMD-blue 

Learn how rigid body dynamics are implemented in HOOMD-blue. 

Previous releases were capable of executing classical molecular 

dynamics -- where free particles interact via smooth potentials and 

their motion through time is computed using Newton’s laws. The 

latest version allows particles to be grouped into bodies that move 

as rigid units. Users can now simulate materials made of cubes, 

rods, bent rods, jacks, plates, patchy particles, bucky balls, or any 

other arbitrary shapes. This talk covers how these algorithms are 

implemented on the GPU, tuned to perform well for bodies of any 

size, and discusses several use-cases relevant to research. 

Speaker(s): Joshua Anderson (Research Area Specialist, University of 

Michigan), Trung Dac Nguyen (University of Michigan) 

Topic(s): Molecular Dynamics, Computational Physics (Intermediate) 


ROOM K 

S0066 Particleworks: Particle-based CAE Software 

Fully Ported on Multi-GPU 

Get the latest information on “Particleworks”, a commercial CAE 

software package from Japan that combines particle-based fluid 

simulation with multi-GPU computing. In this session, we cover 

(1) particle simulation trends in CAE, (2) particle simulation 

development in Japanese industry, (3) the implementation and 

performance of the full GPU port, and (4) multi-GPU scaling in 

several client cases. 

Speaker(s): Yoshiaki Hanada (CEO, Prometech Software), Issei Masaie 

(Chief Engineer, Prometech Software) 

Topic(s): Computational Fluid Dynamics (Intermediate) 


ROOM A7 

S0125 Memory Efficient Reverse Time Migration in 3D 

Learn how we can image the interior of the Earth in three dimensions 

using Reverse Time Migration. We discuss how GPUs accelerate this 

method using parallel wave propagation kernels, texture memories 

and minimal device to host transfers. Further, we discuss how the 

progression to 3D presents a multitude of new problems, particularly 

memory based - causing the system to be IO limited. By manipulating 

boundary positions and values to a pseudo-random form we show 

how many of these memory restrictions can be diminished and how 

detailed subsurface images can be fully constructed using GPUs. 

Speaker(s): Chris Leader (Research Assistant, Stanford 

Exploration Project) 

Topic(s): Energy Exploration, Computational Physics (Intermediate) 


ROOM A5 

S0235 Compiling CUDA and Other Languages for GPUs 

This talk gives an overview of the technology behind NVIDIA’s 

CUDA C and OpenCL C compilers, as well as the GPU architecture 

as seen from a compiler’s perspective. Similarities and 

differences with compiling to a CPU are also discussed. We 

provide insights into how compiler optimizations affect performance 

and how other languages could be targeted to GPUs. 

Speaker(s): Vinod Grover (Senior Manager, NVIDIA), Yuan Lin (Senior 




ROOM L 

S0250 From GPU Computing Toward Full HPC In Finance 

with GPUs 

At the previous GTC, Murex showed how the company had adapted its 

generic Monte Carlo and PDE codes to be compatible with a payoff 

language. With one more year of experience with GPUs and OpenCL, 

Murex will show how the company has broadened its use of GPUs to 

other subjects, such as vanilla screening and model calibration, 

and will focus on its new challenge: using as many GPUs as 

possible for a single computation. 

Speaker(s): Pierre Spatz (Head of Quantitative Research, Murex SAS) 


WEDNESDAY, MAY 16, 10:00 (25 MINUTES) 

ROOM B 

S0262 GPU-Accelerated Model-Based Drug Development 

Explore how GPUs can be used to improve the efficiency of drug 

development. Drug development is a very time-consuming, 

complex and expensive process that has a low success rate. A 

model-based drug development paradigm has been proposed as a 

possible solution to overcome these problems. A key challenge is 

to develop computationally intensive drug and disease-specific 

models from a large quantity of highly complicated preclinical and 

clinical data. This session will describe how GPUs can and will 


play a key role in shortening the model development times and 

improving the efficiency of model-based drug development. 

Speaker(s): Chee Ng (Research Assistant Professor of Pediatrics, 

Children Hospital of Philadelphia/University of Pennsylvania) 

Topic(s): Life Sciences, Algorithms & Numerical Techniques, 

Bioinformatics (Beginner) 


ROOM A8 

S0312 GPU Implementation for Rapid Iterative Image 

Reconstruction in Nuclear Medicine 

GPU implementation can greatly accelerate iterative techniques of 

3D image reconstruction in nuclear medicine imaging. Single 

Photon Emission Computed Tomography (SPECT) is a functional 

imaging modality widely used in clinical diagnosis. To obtain high 

quality images within reduced scanning times, high-sensitivity 

collimators need to be used and their response function modeled 

in the reconstruction. This is in general very computationally 

intensive and unfeasible with CPU and algorithm 

implementations. Our software is able to perform the 

reconstruction of patient data within clinically acceptable times 

using relatively low cost and widely available hardware. 

Speaker(s): Jakub Pietrzak (Software Engineer, University of Warsaw) 

Topic(s): Medical Imaging & Visualization, Computational Physics, 

Computer Graphics (Intermediate) 


ROOM A1 

S0322 Warping & Blending for Multi-Display Systems 

This talk will describe how to scale up from one to many displays for 

high end visualization. You will learn about NVIDIA’s new Warp and 

Blend capability that allows you to create a truly seamless logical 

display comprised of many individual display outputs. With this new 

capability you can project your graphics onto curved surfaces and 

apply the correct display transformation entirely on the GPU, without 

any external hardware. 

Speaker(s): Shalini Venkataraman (Senior Applied Engineer, NVIDIA) 

Topic(s): Visualization, Computer Graphics (Beginner) 


ROOM A3 

S0325 ArrayFire Graphics: A Tutorial 

Learn how to use the graphics primitives for GPU computing 

available in ArrayFire, a new C and C++ library for GPU computing 

in both CUDA and OpenCL. In this session, we will cover the 

capabilities of ArrayFire’s graphics primitives and show how to 

build fast, visual computing applications. The tutorial centers 

around the construction of an application for the computation of 

optical flow on the GPU and will illustrate how to couple graphics 

with compute using ArrayFire’s graphics primitives. We will also 

show how the graphics primitives can be composed to result in 

scalable, fast graphics that complement GPU applications. 

Speaker(s): Chris McClanahan (Software Engineer, AccelerEyes) 



ROOM M 

S0633 Learn about new Hewlett-Packard GPU 

Systems, Solutions, and Applications! (Presented by 

Hewlett-Packard) 

Learn how to shorten time to discovery, gain faster insight, and 

beat the barriers to innovation, with performance, efficiency and 

agility! Hear the latest on how you can do this and more with HP’s 

purpose-built SL server line. Servers are specifically designed for 

GPUs with HP ProActive Insight Architecture. Discover what a new 

generation of workstation desktop GPU computing technology 

from HP and NVIDIA can do for you! HP will compare and contrast 

GPU compute performance on the PCI Express Gen2 architecture 

available in HP’s Z800 Workstation to the PCI Express Gen3 

architecture in HP’s latest Z820 Workstation. 

Speaker(s): David Korf (Senior Marketing Manager, Hewlett-Packard), 

John Brown (Principal Engineer, Hewlett-Packard) 





Come to the NVIDIA Nsight Lounge to meet the Nsight development 

team! Whether you would like a private meeting to discuss specific 

product features or test out your application with the latest version 

of Nsight, or you just want to hang out with the team after attending 

one of the exciting training sessions, the lounge is a great place to 

learn everything you ever wanted to know about the tool. 





S2001 Emerging Companies Summit: CEO on Stage 

Featuring Unity Technologies, MirriAd, and BioDigital 

See the hottest new technologies from startups that are 

transforming computing. In a lively and fast-paced exchange, the 

Emerging Companies Summit CEO on Stage sessions will feature 

CEOs from three startups who will each have 15 minutes to 

introduce their companies and interact with a panel of leading 

venture capitalists, technology executives, and industry analysts. 

Speaker(s): David Helgason (CEO, Unity Technologies), Mark 

Popkiewicz (CEO, MirriAd), Aaron Oliker (Partner/Director of 3D 

Technology, BioDigital), and Frank Sculli (Co-Founder/Informatics 

Director, BioDigital) 

Panelist(s): Jon Peddie (President, Jon Peddie Research), Neil 

Sequeira (Managing Director, General Catalyst Partners), Savitha 

Srinivasan (Partner, IBM Venture Capital Group) 


WEDNESDAY, MAY 16, 10:30 (25 MINUTES) 


S0115 Specialized Sparse Matrix Formats and SpMV 

Kernel Tuning for GPUs 

This session is focused on optimizing sparse matrix-vector product 

for NVIDIA GPUs. This is a frequently studied kernel that appears in 

applications employing iterative methods for solving systems of 

linear equations. In the majority of cases the computation is 

memory bandwidth bound. Our study focuses on developing 

specialized sparse matrix storage formats and corresponding 

CUDA SpMV implementations that achieve high performance at the 

cost of additional start-up time required for conversion and tuning. 

The proposed storage formats reduce the required memory 

bandwidth by providing compact coding for locations of some 

frequently observed patterns of non-zero elements. 

Speaker(s): Arutyun Avetisyan (Deputy Director, ISP, Russian Academy 

of Sciences), Alexander Monakov (Researcher, ISP, Russian Academy 

of Sciences) 

Topic(s): Algorithms & Numerical Techniques (Intermediate)
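
For readers new to the kernel, the sketch below shows a sparse matrix-vector product over the common CSR (compressed sparse row) layout. It is plain Python for illustration only, not the speakers' specialized formats or their CUDA code; the function name `spmv_csr` and the sample matrix are assumptions made for this example.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        # The nonzeros of this row occupy positions row_ptr[row] .. row_ptr[row + 1] - 1.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y


# Illustrative 3x3 matrix: [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
print(spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0]))
```

Each nonzero costs one value load, one column-index load, and one gather from `x`, which is why the kernel tends to be bound by memory bandwidth and why more compact index encodings can help.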


ROOM A3 

S0209 Performance of 3-D FFT Using Multiple GPUs with 

CUDA 4 

Get the latest information on performance of 3-D fast Fourier 

transform using multiple GPU devices. CUDA 4.0 enables efficient 

data transfer between GPUs. This is especially important in FFT computation 

since it requires a large amount of all-to-all data exchange between 

GPUs. The peer-to-peer communication feature of GPUDirect V2 

improves the communication between the devices on the same node. 

GPUDirect also accelerates the communication between GPUs on 

different nodes. We will present the latest performance results on a 

four-GPU system and up to 128 compute nodes of TSUBAME 2.0. 

Speaker(s): Akira Nukada (Researcher, Tokyo Institute of Technology) 

Topic(s): Algorithms & Numerical Techniques, Development Tools 

& Libraries (Advanced) 


ROOM B 

S0272 GPU GWAS - CUDA Based Genome Wide 

Association Studies 

We have developed a CUDA based GWAS analyzer that has 

achieved a 10x analysis speed-up per GPU. Genome-wide 

association studies scan through millions of SNP markers across 

the human genome seeking the genetic basis of life-threatening 

diseases such as coronary artery disease and prostate cancer. The 

prospect of the $1,000 genome heralds a potential new scale of 

GWAS involving hundreds of thousands of patients. We will 

discuss how we utilized the Python, R, and C languages to produce 

a robust GWAS algorithm that can be extended to multiple GPUs 

and GPU clusters. 

Speaker(s): Tim Bi (Graduate Research Analyst, Johns Hopkins 

University / George Mason University) 

Topic(s): Life Sciences, Bioinformatics (Intermediate) 


ROOM K 

S0304 Large Scale Computational Fluid Dynamics 

Simulations on Hybrid Supercomputers 

Learn how to approach the all-too-common problem of trying to 

retrofit a major application for speed in the modern era of the 

hybrid supercomputer. In this talk, we will focus on computational 

fluid dynamics (CFD) codes that are run on Top500 

Supercomputers. Many of these applications have existed for 20 or 

more years, so the process of adding the GPU and getting 

wall-clock improvements in performance can be very challenging! 

Our talk will discuss how to properly target your effort, the impact 

of directives-based coding, and how to maintain efficiency across 

a hybrid cluster. 

Speaker(s): John Humphrey (Engineering Director, EM Photonics), Eric 

Kelmelis (CEO, EM Photonics) 

Topic(s): Computational Fluid Dynamics, Supercomputing 



ROOM A8 

S0348 GPUs Open New Avenues in Medical MRI 

See how GPUs enable exciting new developments in medical 

Magnetic Resonance Imaging (MRI). Their computational power 

makes practical new MRI techniques that can bring shorter 

imaging sessions, better images, and more insight into human 

physiology. Learn about the characteristics of the general 

computational approach for obtaining the final image, and how it 

can be implemented using an iterative conjugate gradient 

algorithm. The algorithm exhibits massive parallelism and fits 

the GPU architecture well. Learn about its CUDA implementation 

details and MATLAB integration. See throughput measurements of 

Tesla GPUs compared to top-of-the-line many-core and large RAM 

CPU systems. 

Speaker(s): Chris A. Cocosco (Scientist, University Medical Center 

Freiburg, Dept. of Radiology, Medical Physics) 

Topic(s): Medical Imaging & Visualization (Beginner) 
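
As background for the iterative conjugate gradient algorithm mentioned above, here is a minimal sketch of the textbook method for a small symmetric positive definite system. It is not the speaker's MRI reconstruction code: the dense list-of-lists matrix, the helper names, and the 2x2 example are assumptions chosen to keep the illustration short.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive definite A, starting from x = 0."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x, which equals b when x = 0
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)                      # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        # New direction: current residual plus a multiple of the old direction.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


print(conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))
```

The matrix-vector product and the dot products dominate each iteration, and both are data-parallel, which is the property that makes the method a natural fit for GPUs.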


ROOM A7 

S0352 GPU-Accelerated Parallel Computing for 

Simulation of Seismic Wave Propagation 

We adopted GPUs to accelerate large-scale, parallel 

finite-difference (FDTD) simulation of seismic wave propagation. 

Effective parallel implementation is needed because the size of 

the memory of a single GPU is too small for real applications. 

We therefore describe the memory optimization, the three-dimensional 

domain decomposition, and the overlapping of 

communication and computation adopted in our program. So far we 

have achieved single-precision performance of about 61 

TFlops using 1,200 GPUs of TSUBAME-2.0, the GPU 

supercomputer at Tokyo Institute of Technology, Japan. As an 

important application, we show the results of the simulation of the 

2011 Tohoku-Oki mega-quake. 

Speaker(s): Taro Okamoto (Assistant Professor, Tokyo Institute 

of Technology) 

Topic(s): Energy Exploration, Computational Physics, General Interest 

(Advanced) 


ROOM A1 

S0355 Seamless Scalable Displays - Using NVIDIA Warp + 

Intensity API 

In this talk we will discuss how we use the NVIDIA Warp and 

Intensity API to create seamless displays made up of 

multiple projectors based on our camera feedback systems. We will 

show and discuss case studies in production including a 25 

megapixel touch wall, military dome simulation systems, VR 

Walls, VR Caves, and immersive conference rooms that are made 

affordable and enabled by this technology. 

Speaker(s): Rajeev Surati (President, Scalable Display Technologies) 

Topic(s): Visualization, Audio, Image and Video Processing, Computer 

Vision, Computer Graphics (Beginner) 


HALL 1 

S3001 Day 2 Keynote: From Democratic Consensus to 

Cannibalistic Hordes: GPU Computing Reveals the 

Principles of Collective Behavior 

Collective behavior is one of the most pervasive features of the 

natural world. Our brains are composed of billions of 

interconnected cells communicating with chemical and electrical 

signals. We are integrated in our own human society. Elsewhere in 

the natural world a fish school convulses, as if one entity, when 

being attacked by a predator. How does individual behavior 

produce dynamic group-level properties? Do animal groups (or 

even cells in a tumor) function as some form of ‘collective mind’? 

How does socially contagious behavior spread through natural 

human crowds? In his keynote address, Prof. Iain D. Couzin will 

demonstrate how GPU computing has been pivotal in the study of 


collective behavior, helping reveal how collective action emerges 

in a wide range of groups from plague locusts to human crowds, 

and the critical role that uninformed, or weakly-opinionated, 

individuals play in democratic consensus decision-making. 

Speaker(s): Iain Couzin (Assistant Professor, Princeton University) 





Featuring eyeSight Mobile, Numira Biosciences, and Ubitus 







Speaker(s): Gideon Shmuel (CEO, eyeSight Mobile), David Weinstein 

(CTO, Numira Biosciences), Wesley Kuo (CEO, Ubitus) 

Panelist(s): Jon Peddie (President, Jon Peddie Research), Neil 

Sequeira (Managing Director, General Catalyst Partners), Savitha 

Srinivasan (Partner, IBM Venture Capital Group) 



ROOM C 

S0027B All-In-One Debugging Experience with CUDA-GDB 

and CUDA-MEMCHECK 

The CUDA debugger tools CUDA-GDB and CUDA-MEMCHECK provide 

a whole new feature set to help improve your CUDA application 

development cycle. This session is a detailed walk-through of the 

key debugger features and advanced techniques for using printf, 

CUDA-GDB and MEMCHECK together to improve overall code 

productivity on Linux and MacOS platforms. This tutorial will also 

include live demos. 

Speaker(s): Geoff Gerfin (Technical Manager / Senior Engineer, 

NVIDIA), Vyas Venkataraman (Software Engineer, NVIDIA) 




S0029 Leveraging Matrix Block Structure In Sparse 

Matrix-Vector Multiplication 

The commonly occurring block structure of sparse matrices can 

be effectively leveraged to improve the performance of Sparse 

Matrix-Vector multiplication (SpMV) on GPUs. This session will 

present one such algorithm and discuss both its design and its 

performance relative to other SpMV algorithms. In particular, 

aspects of GPU floating point performance, GPU memory use, and 

data structure translation effort will be detailed. 

Speaker(s): Steve Rennich (HPC Developer Technology Engineer, NVIDIA) 

Topic(s): Algorithms & Numerical Techniques (Intermediate) 


ROOM K 

S0064 MSC Nastran Sparse Direct Solvers for Tesla GPUs 

The current implementation of MSC Nastran’s MSCLDL and 

MSCLU sparse direct solvers for multiple Tesla GPUs is 

presented. The matrix is first statically decomposed into a 

prescribed number of domains. The Schur complements are then 

calculated with CPUs and GPUs, and the residual structure is 

solved afterward. Back-substitution is used to find the solution at 

every grid point. Merits of this method are discussed and 

performance comparisons are made. 

Speaker(s): Cheng Liao (Development Manager, MSC Software) 

Topic(s): Computational Structural Mechanics (Beginner) 


ROOM A7 

S0140 Accelerating Reservoir Simulation and Algebraic 

Multigrid with GPUs 

Given a model of a reservoir’s rock and well properties, a 

reservoir simulator solves the PDEs for the multiphase flow 

through porous rock to predict well production. Over the past 

several decades, simulation has progressed from coarse 2D 

models to detailed 3D models, providing strong fidelity to 

empirical production rates. By reformulating the Marathon Oil 

Corporation’s Multiscale Flow Simulator to use GPUs, we improve 

the overall execution speed by a factor of over 100, allowing fast 

turnaround on a GPU workstation. We also introduce GAMPACK, a 

fully-accelerated GPU algebraic multigrid solver, and demonstrate 

its performance relative to CPU solvers. 

Speaker(s): Kenneth Esler (Computational Physicist, Stone Ridge 

Technology), Vincent Natoli (Founder & CEO, Stone Ridge Technology) 

Topic(s): Energy Exploration (Intermediate) 


ROOM N 

S0142 VMD: High Performance Molecular Visualization 

and Analysis on GPUs 

This talk will present recent successes in the use of GPUs to 

accelerate interactive molecular visualization and analysis tasks 

on desktop computers, and batch-mode simulation and analysis 

jobs on GPU-accelerated HPC clusters. We’ll present Fermi-specific 

algorithms and optimizations and compare with those for 

other devices. We’ll also present performance and performance/watt 

results for VMD analysis calculations on GPU clusters, and 

conclude with a discussion of ongoing work and future 

opportunities for GPU acceleration, particularly as applied to the 

analysis of petascale simulations of large biomolecular complexes 

and long simulation timescales. 

Speaker(s): John Stone (Senior Research Programmer, University of 

Illinois at Urbana-Champaign) 

Topic(s): Molecular Dynamics, Algorithms & Numerical Techniques, 



ROOM A3 

S0307 New Advances in GPU Linear Algebra 

Hear product experts explain how we have created two of the most 

widely used libraries in the GPU computing ecosystem. The CULA 

library for dense linear algebra has been expanding to multi-GPU 

and out-of-core applications, meaning that users are no longer 

limited by the onboard GPU memory for their work. In this field, 

effectively using multiple GPUs is significantly more challenging than 

a single GPU! The brand new CULA Sparse library tackles the tough 

world of sparse linear algebra and achieves 10x speedups. Learn 

more about what makes these two libraries work in this session. 

Speaker(s): John Humphrey (Engineering Director, EM Photonics), Kyle 

Spagnoli (Research Engineer, EM Photonics) 




ROOM B 

S0327 Large and Sparse - Mass Spectrometry Data 

Processing in the GPU 

Learn how the GPU helps identify millions of ions in datasets of 

several billion points of four-dimensional sparse data. The data is 

first reduced to 3D to locate regions of dense data, and then only 

those regions are processed in 4D. Processing involves combining 

several steps of convolution filters in three axes, finding local 

maximums in volumes of data, and extracting information from 

the data around each local maximum. 

Speaker(s): Jose de Corral (Principal Consulting Engineer, 

Waters Corporation) 



ROOM A1 

S0335 Live 3D-Video with a Lightfield Camera 

In this session you will learn what a lightfield camera is, how it 

works and what you can do with it. In addition to the theoretical 

presentation, we will give a live demo of the camera system developed 

by our company, Raytrix, which gives you live 3D video from a single 

camera through a single lens, currently at up to 10 fps with a 

maximum effective resolution of 3 megapixels synthesized from 

an 11 megapixel sensor using CUDA algorithms on a GTX 580. 

Post-production features include pixel-wise focusing, depth zoom, 

variable stereo base-line and base-line rotation. 

Speaker(s): Christian Perwass (CEO, Raytrix GmbH) 

Topic(s): Computational Photography, Audio, Image and Video 

Processing, Stereoscopic 3D, Computer Vision (Beginner) 


ROOM A8 

S0342 Volumetric Processing and Visualization on 

Heterogeneous Architecture 

Volumetric data is typically very large and involves intensive 

computation for processing and visualization. We have developed an 

OpenCL-based framework that can utilize all available resources in 

a system or a cluster of systems. The framework manages one or 

more OpenCL devices. A large volume is partitioned into bricks. 

Each OpenCL device is associated with a set of brick producers that 

generates the contents of bricks while optionally utilizing other 

bricks as input. The framework is also composed of a scheduler 

that distributes brick workloads to different devices and chooses an 

optimized processing order aiming at certain criteria. 

Speaker(s): Wei Li (Research Scientist, Siemens Corporation) 

Topic(s): Visualization, Supercomputing (Advanced) 


ROOM L 

S0369 Running Risk On GPUs 

A key component of Basel III is the Credit Value Adjustment (CVA) 

which is in essence the value of counter-party credit risk. 

Quantifying the CVA on simple products already poses 

considerable computational challenges and considering many 

banks have hundreds of thousands of positions it becomes clear 

that the computational challenges of CVA are massive. Calculating 

CVA sensitivities for hedging only adds to this burden. In this talk 

we will discuss real world applications of GPUs in risk 

management and show how, using CUDA, GPU computing is an 

enabling technology to address the computational challenges of 

an evolving regulatory environment. 

Speaker(s): Norbert Hari (Trading Quantitative Analyst, ING Bank nv), Tim 

Wood (Quantitative Analyst, ING Bank nv) 



ROOM A5 

S0419B Optimizing Application Performance with CUDA 

Profiling Tools 

NVIDIA provides two powerful profiling tools that you can use to 

maximize your application’s performance. The NVIDIA Visual 

Profiler helps you understand your application’s behavior with a 

detailed timeline and data from GPU performance counters. The 

Visual Profiler also provides an automatic, data-driven analysis 

engine that provides suggestions on potential optimization 

strategies for your application. Nvprof is a command-line profiler 

that provides gprof-like functionality for the GPU. Nvprof provides 

summary information about where your application is spending 

the most time, so that you can focus your optimization efforts. 

This session will provide a step-by-step walk-through of both of 

these profiling tools, showing how you can use these tools to 

identify optimization opportunities at the application, kernel, and 

source-line levels. 

Speaker(s): David Goodwin (Software Engineer, NVIDIA) 



ROOM A2 

S0600 Scalable GPU Graph Traversal 

Breadth-first search (BFS) is a core primitive for graph traversal 

and a basis for many higher-level graph analysis algorithms. It is 

also representative of a class of parallel computations whose 

memory accesses and work distribution are both irregular and 

data-dependent. Recent work has demonstrated the plausibility of 

GPU sparse graph traversal, but has tended to focus on 

asymptotically inefficient algorithms that perform poorly on 

graphs with non-trivial diameter. We present a BFS parallelization 

focused on fine-grained task management constructed from 

an efficient prefix sum that achieves an asymptotically optimal 

O(|V|+|E|) work complexity. Our implementation delivers excellent 

performance on diverse graphs, achieving traversal rates in 

excess of 3.3 billion and 8.3 billion traversed edges per second 

using single and quad-GPU configurations, respectively. This level 

of performance is several times faster than state-of-the-art 

implementations on both CPU and GPU platforms. 

Speaker(s): Duane Merrill (Research Scientist, NVIDIA) 

Topic(s): Algorithms and Numerical Techniques (Beginner) 
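
To illustrate how a prefix sum can organize frontier expansion in a level-by-level BFS, the sketch below gives a sequential Python analogue. It is not the speaker's GPU implementation: the CSR graph layout, the function name `bfs_depths`, and the sample graph are assumptions, and the inner loops stand in for work that a GPU would spread across threads.

```python
from itertools import accumulate


def bfs_depths(row_ptr, col_idx, source):
    """Return the BFS depth of every vertex of a CSR graph (-1 if unreachable)."""
    depth = [-1] * (len(row_ptr) - 1)
    depth[source] = 0
    frontier = [source]
    level = 0
    while frontier:
        # An exclusive prefix sum over the frontier's degrees gives each vertex
        # a private, non-overlapping slice of the edge frontier to write into.
        degrees = [row_ptr[v + 1] - row_ptr[v] for v in frontier]
        offsets = [0] + list(accumulate(degrees))
        edge_frontier = [0] * offsets[-1]
        for i, v in enumerate(frontier):
            for j in range(degrees[i]):
                edge_frontier[offsets[i] + j] = col_idx[row_ptr[v] + j]
        # Keep only neighbors that have not been visited yet.
        level += 1
        next_frontier = []
        for u in edge_frontier:
            if depth[u] == -1:
                depth[u] = level
                next_frontier.append(u)
        frontier = next_frontier
    return depth


# Illustrative undirected graph with edges 0-1, 0-2, 1-3, 2-3, 3-4.
row_ptr = [0, 2, 4, 6, 9, 10]
col_idx = [1, 2, 0, 3, 0, 3, 1, 2, 4, 3]
print(bfs_depths(row_ptr, col_idx, 0))
```

Every vertex enters the frontier at most once and every adjacency entry is read at most once, so the total work stays proportional to |V| + |E|.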


ROOM M 

S0637 Analyzing performance and power of applications 

with GPUs on Dell 12G platforms (Presented by Dell) 

In this talk, both performance and power aspects of running 

various applications on NVIDIA GPUs on Dell 12G platforms will be 

presented. These platforms utilize the latest PCIe Gen 3 slots and 

processors in conjunction with varying numbers of NVIDIA GPUs 

and are tested with several applications both from a performance 

perspective and a power perspective. 

Speaker(s): Dr. Jeff Layton (HPC Enterprise Technologist, Dell) 

Topic(s): Supercomputing, Visualization (Intermediate)


HALL 1 

S0642 Inside Kepler 

In this talk, individuals from the GPU architecture and CUDA 

software groups will dive into the features of the compute 

architecture for “Kepler” – NVIDIA’s new GPU. From the 

reorganized processing cores with new instructions and 

processing capabilities, to an improved memory system with 

faster atomic processing and low-overhead ECC, we will explore 

how the Kepler GPU achieves world leading performance and 

efficiency, and how it enables wholly new types of parallel 

problems to be solved. 

Speaker(s): Stephen Jones (CUDA Developer, NVIDIA), Lars Nyland 

(Senior Architect, NVIDIA) 



ROOM J1 

S0700 Stampede System Architecture and Early 

Accelerator Programming Experiences 

We present a description of the design of the Stampede system to 

be deployed at TACC over the course of 2012. Stampede comprises 

a 2PF Intel Sandy Bridge cluster with FDR InfiniBand, augmented with 

8PF of Intel MIC Architecture co-processors. We will describe the 

design of the system, the datacenter that houses it, and expected 

programming models and usage modes. In support of this, we will 

present early experiences programming for the Intel MIC 

Architecture using the Knights Ferry Software Development 

Platform. Key to this will be the presentation of several different 

programming models and the scalability of the resulting codes. 

Speaker(s): Bill Barth (Director of High Performance Computing, Texas 

Advanced Computing Center, University of Texas at Austin) 

















S2003 Emerging Companies Summit: Fireside Chat with 

Jen-Hsun Huang (CEO and Co-Founder, NVIDIA) and Tim 

Bajarin (President, Creative Strategies) 

NVIDIA CEO and co-founder Jen-Hsun Huang will take part in a 

fireside chat with Tim Bajarin, one of the IT world’s pre-eminent 

analysts and president of Creative Strategies. They will discuss 

trends in mobile, visual and parallel computing, and the 

transformational changes ahead for the industry. 

Speaker(s): Jen-Hsun Huang (CEO, President and Co-Founder, 

NVIDIA), Tim Bajarin (President, Creative Strategies) 



ROOM A3 

S0085 Floating Point and IEEE 754 Compliance for 

NVIDIA GPUs: Precision & Performance 

As a result of continuing improvements, NVIDIA offers GPU-accelerated 

floating-point performance in compliance with IEEE 

754. It is our experience that a number of issues related to floating 

point accuracy and compliance are a frequent source of confusion 

both on CPUs and GPUs. The purpose of this talk is to discuss the 

most common ones related to NVIDIA GPUs and to supplement 

the documentation in the CUDA C Programming Guide. 

Speaker(s): Alex Fit-Florea (Senior Engineer, NVIDIA) 


& Libraries (Intermediate) 
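
One common source of the confusion described above is that IEEE 754 addition is not associative, so a parallel reduction that adds values in a different order than a sequential loop can legitimately return a slightly different result. The short Python sketch below shows the effect with double-precision values; it is an illustration chosen for this guide, not material from the talk.

```python
# The same three values summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)   # the two results differ in the last bit
print(abs(left_to_right - right_to_left))

# A small term can be lost entirely when it is added to a large one first.
print((1e16 + 1.0) - 1e16)
print((1e16 - 1e16) + 1.0)
```

Both orderings are correct under the standard, since each individual operation is correctly rounded; the difference comes only from where the rounding happens.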


ROOM A8 

S0105 Hardware Acceleration for Vessel 

Visualization Tasks 

To analyze datasets visually, systems with fast feedback loops on 

user interaction are beneficial. In this session rendering and 

preprocessing techniques for medical volume data will be 

presented using OpenGL and CUDA. In the context of coronary 

artery disease, the analysis of individual vessel branches is 

important. We show how local transfer function application and 

generation by means of histogram analysis can help in navigating 

and finding details in the datasets. Furthermore, domain-specific 

acceleration and illustration techniques for volume rendering are 

also applied to datasets from brain aneurysms. 

Speaker(s): Christoph Kubisch (Developer Technology Engineer, NVIDIA) 

Topic(s): Medical Imaging & Visualization, Computer Graphics (Beginner) 


ROOM K 

S0143 Fluid-Structure-Interaction Using SPH and 

GPGPU Technology 

There are two goals when developing engineering analysis 

software: accuracy and speed. In the area of 

Fluid-Structure Interaction (FSI) computational time has always 

been the major impediment to solving large realistic engineering 

problems. In our implementation the fluid/structural dynamics 

solver uses a combination of GPU/CPU processing. The added 

benefit of using a powerful GPU workstation is that it is roughly 10 

times less expensive than a regular CPU cluster. In this paper, we 

present the use of GPU Technology as implemented in the explicit 

dynamic finite element software IMPETUS Afea Solver®. 

Speaker(s): Jean Luc Lacome (IMPETUS Afea SAS), Jerome Limido 

(IMPETUS Afea SAS) 

Topic(s): Computational Structural Mechanics, Algorithms & 

Numerical Techniques, Computational Fluid Dynamics (Intermediate) 


ROOM A7 

S0190 Large-Scale Reservoir Simulation on GPU 

We develop a highly parallel GPU-based GMRES solver and several 

preconditioners, and couple them with an in-house reservoir 

simulator to speed up large-scale reservoir simulation with over 

one million grid blocks. For the preconditioners, we develop 

highly parallelized ILU(k), ILUT, block ILU(k), and block ILUT, with 

matrix partitioning by METIS, on the GPU. The excellent speedup and 

accurate results demonstrate the promise of 

GPUs for parallel reservoir simulation. 


Speaker(s): Song Yu (Chemical & Petroleum Department, University 

of Calgary) 

Topic(s): Application Design & Porting Techniques, Algorithms & 




S0271 Fast Adaptive Sampling Technique for 

Multi-Dimensional Integral Estimation Using GPUs 

Evaluating multi-dimensional integrals is a commonly encountered 

problem in many areas of science, including physics and volume 

estimation of convex bodies. One of the widely used techniques for 

integral evaluation in large dimensions is the Monte Carlo method. 

Vanilla Monte Carlo methods of integral estimation use uniform 

sampling techniques. The error of such uniform sampling decreases 

as 1/√(sample size), which is too slow for most real-life applications. 

In this study, we discuss an adaptive sampling technique 

called VEGAS, which reduces the variance at a much faster rate than 

uniform sampling. We present a new parallel implementation of 

VEGAS based on CUDA that can significantly reduce the 

computation time of multi-dimensional integrals. We show that our 

GPU-based implementation of VEGAS achieves up to a 45x speedup 

over an equivalent CPU-based implementation. 

Speaker(s): Srinivasa Prasanna (Professor, International Institute of 

Information Technology Bangalore), Pradeep Rao (Technology 

Architect, Infosys Technologies Ltd) 

Topic(s): Algorithms & Numerical Techniques, Finance (Intermediate) 
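
The slow 1/√N convergence of uniform sampling can be seen in a few lines of Python. The sketch below is the plain uniform-sampling baseline, not VEGAS and not the speakers' CUDA code; the integrand x² on [0, 1] (exact value 1/3), the function name `mc_integrate`, and the fixed seed are assumptions made for illustration.

```python
import math
import random


def mc_integrate(f, n, rng):
    """Estimate the integral of f over [0, 1] from n uniform samples.

    Returns the estimate and its estimated standard error.
    """
    total = 0.0
    total_sq = 0.0
    for _ in range(n):
        y = f(rng.random())
        total += y
        total_sq += y * y
    mean = total / n
    variance = max(total_sq / n - mean * mean, 0.0)   # sample variance of f
    return mean, math.sqrt(variance / n)


rng = random.Random(0)
for n in (1_000, 100_000):
    estimate, std_err = mc_integrate(lambda x: x * x, n, rng)
    print(n, estimate, std_err)
```

Raising the sample count by a factor of 100 shrinks the standard error by only about a factor of 10, which is the motivation for adaptive schemes that concentrate samples where the integrand contributes most.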



S0035 GPU Parallelization of Gibbs Sampling: 

Abstractions, Results, and Lessons Learned 

Markov chain Monte Carlo (MCMC) estimation of Hierarchical 

Bayesian (HB) models is not only time-consuming, but also 

difficult to parallelize due to its sequential (Markovian) nature. We 

present an abstraction of a widely-used MCMC algorithm, called 

Gibbs sampling. We define a taxonomy of variable blocks, and for 

each type of variable block we offer suitable parallelization 

strategies, along with their corresponding CUDA implementations. 

For large problems where model estimation may take several 

hours or days using single-threaded software, we see speedups 

in the 30x-100x range, thereby reducing estimation time to a few 

hours. In addition to lower computation cost relative to MPI-based 

parallelization, the reduction in estimation time allows for a more 

interactive modeling experience. We offer an extensive discussion 

of lessons learned for the broader scientific computing field, 

including an analysis of tradeoffs between computation costs and 

development costs, implications of our tradeoff analysis for 

optimal software development and parallelization, and some 

practical tips and gotchas for rookie GPU programmers. 

Speaker(s): Alireza Mahani (Quantitative Modeler, Sentrana) 

Topic(s): Algorithms & Numerical Techniques, Databases, Data Mining, 

Business Intelligence (Intermediate) 


ROOM A3 

S0042 Solving Challenging Numerical Linear Algebra 

Algorithms using Multiple GPU Accelerators 

See the newest features integrated in MAGMA (Matrix Algebra on 

GPU and Multicore Architectures) to target multiple-GPU-based 

systems for numerical linear algebra. In this talk, we describe how 

we leveraged MAGMA to solve existing and new challenging 

numerical problems on multiple hardware accelerators. Using a 

hybridization methodology, the new multiGPU-enabled MAGMA is 

characterized by a representation of linear algebra algorithms as 

directed acyclic graphs, where nodes correspond to tasks and edges 

to data dependencies among them, and a dynamic runtime system 

environment StarPU used to schedule various computational kernels 

over hybrid architectures of GPUs and homogeneous multicores. 

Speaker(s): Hatem Ltaief (Computational Scientist, KAUST 

Supercomputing Laboratory), Stanimire Tomov (University of Tennessee) 


& Libraries (Intermediate) 


ROOM A5 

S0099 Debugging GPU Applications For Correctness 

and Performance 

This session reveals how debugging CUDA applications is made 

straightforward with the powerful Allinea DDT debugger. New 

features enabling greater understanding of performance 

optimizations will be explored, showing how they can be used to 

produce better, faster CUDA code. Coupled with newly released 

support for multiple languages and compilers we will also show 

how Allinea DDT is enabling developers on desktops and the 

largest supercomputers to achieve both correct and efficient 

GPU applications. 

Speaker(s): David Lecomber (CTO, Allinea Software) 



ROOM N 

S0127 Petascale Molecular Dynamics Simulations on 

GPU-Accelerated Supercomputers 

The highly parallel molecular dynamics code NAMD was chosen in 

2006 as a target application for the NSF petascale supercomputer 

now known as Blue Waters. NAMD was also one of the first codes 

to run on a GPU cluster when G80 and CUDA were introduced in 

2007. How do the Cray XK6 and modern GPU clusters compare to 

300,000 CPU cores for a hundred-million-atom Blue Waters 

acceptance test? Come learn the opportunities and pitfalls of 

taking GPU computing to the petascale and the importance of 

CUDA 4.0 features in combining multicore host processors and 

GPUs in a legacy message-driven application. 

Speaker(s): James Phillips (Senior Research Programmer, University 

of Illinois) 

Topic(s): Molecular Dynamics, Application Design & Porting 

Techniques, Parallel Programming Languages & Compilers, 



ROOM K 

S0214 GPU Based Stacking Sequence Optimization For 

Composite Skins Using GA 

The goal of this session is to showcase how GPUs can be used to 

achieve high performance in a Genetic algorithm based optimization. 

The particular domain applied is stacking sequence optimization of 

Aircraft wing skins. The concepts illustrated use CUDA but are 

generic to any other GPU language. It is assumed that the 

registrants have exposure to optimization in engineering domain. 

Speaker(s): Sathya Narayana K. (Principal Consultant, Infosys Ltd.), 

Ravikumar G.V.V. (Infosys Ltd, Bangalore) 


Numerical Techniques, Parallel Programming Languages & 

Compilers, Algorithms & Numerical Techniques (Advanced)


ROOM A8 

S0259 A High Performance Platform for Real-Time 

X-Ray Imaging 

We will share our experience developing a GPU-based 

platform for synchrotron-based X-ray imaging aimed at the analysis 

of dynamic processes. The complete data flow from the camera to 

the data storage will be discussed with a special focus on I/O 

issues, hardware platform, and ways to utilize the available 

system resources. An efficient GPU-implementation of filtered 

back projection will be presented highlighting differences of 

implementations for GT200, Fermi, and AMD Cypress 

architectures. We will introduce our software platform used to 

abstract the current configuration of the imaging station and to 

simplify the development of parallel image processing algorithms. 

Speaker(s): Suren Chilingaryan (Researcher, Karlsruhe Institute 


Topic(s): General Interest, Supercomputing, Audio, Image and Video 

Processing, Algorithms & Numerical Techniques (Intermediate) 


ROOM A1 

S0281 Accelerate a Fully Functional Photo Editing 

Software with GPU 

This session introduces how to design a fully functional GPU-based photo 

editing software package, which provides features like layering and 

selecting, and integrates various adjustment tools and image filters. 

The design contains a fast layer rendering engine and an image filter 

framework that manages different filters and supports visual 

feedback for filter parameter adjustment. We will also introduce 

how to design an undo system for GPU-based image processing 

software. Specifically, a CUDA-accelerated HDR tool will be 

presented in detail. 

Speaker(s): Kaiyong Zhao (PhD Student, Hong Kong Baptist University) 

Topic(s): Computational Photography, Computer Graphics (Beginner) 


ROOM C 

S0365 Delite: A Framework for Implementing 

Heterogeneous Parallel DSLs 

Domain-specific languages can be a solution for heterogeneous 

parallel computing since they provide higher productivity and 

performance. To lower the barrier for DSL development, we 

implemented the Delite compiler framework and runtime. DSL 

developers can easily extend the framework to build a new DSL. 

The framework provides various optimization facilities and 

automatically generates code for heterogeneous hardware 

including GPU. The runtime executes the generated code in 

parallel by scheduling the kernels on target devices and managing 

the memory allocations and data transfers. This talk will cover the 

details of Delite with examples from OptiML, a machine learning 

DSL implemented with the framework. 

Speaker(s): HyoukJoong Lee (PhD Student, Stanford University), Kevin 

J. Brown (Research Assistant, Stanford University) 

Topic(s): Parallel Programming Languages & Compilers (Intermediate) 


ROOM L 

S0405 New Generation GPU Accelerated Financial 

Quant Libraries 

Learn from industry experts how new generation GPU accelerated 

solutions for derivative pricing, hedging, and risk management 

can be built more efficiently with modern technology and 

functional programming languages like F# on .NET or Scala on 

the Java VM. As a concrete example we report from a large 

derivative pricing project developed in F# on .NET. We will 

introduce the key design concepts and parallelization strategies, 

which lead to an efficient and transparent GPU acceleration. 

Several examples will illustrate the benefit of the functional as 

compared to the classical object oriented approach. 

Speaker(s): Daniel Egloff (Managing Partner, QuantAlea GmbH) 

Topic(s): Finance, Application Design & Porting Techniques, Algorithms 

& Numerical Techniques, Cloud Computing (Advanced) 


ROOM A7 

S0432 New Ideas for Massively Parallel Preconditioners 

Linear Solvers on serial machines tend to be highly recursive, but 

that’s not an option on GPUs. In this paper we describe a new 

preconditioner for GMRES and similar Krylov subspace linear 

solvers that is highly parallel, but also provides effective 

mechanisms to reconcile remote driving forces in a spatially 

discretized system. We will present results, taken from some 

real-world studies using a commercial oil reservoir simulator, 

showing how it compares with a state of the art serial solver, and 

showing how performance scales in a domain decomposition 

formulation run on a multiple CPU+GPU cluster. 

Speaker(s): John Appleyard (Managing Director, Polyhedron Software 

Ltd), Jeremy Appleyard (Analyst, Polyhedron Software Ltd) 

Topic(s): Algorithms & Numerical Techniques, Computational Fluid 

Dynamics, Energy Exploration (Advanced) 


ROOM M 

S0635 How to Bake Portable Many-Core Programs 

(Presented by CAPS enterprise) 

A legacy code, a cool many-core accelerator and a directive-based 

programming environment are the main ingredients of the recipe to 

transform your legacy code into a portable many-core one. This 

presentation shows by example how to exploit accelerators in 

legacy code without sacrificing portability. We describe a 

methodology and the use of directives, such as HMPP and OpenACC, 

to exploit the massive parallelism provided by many-core devices. 

During the presentation we show, using numerous examples, 

how to analyze performance, tune accelerator code, reduce data 

transfers, deal with libraries, exploit multiple accelerators, etc. 

Speaker(s): François Bodin (Chief Technology Officer, CAPS enterprise) 



ROOM J1 

S0701 New GPU Appliance for Co-processing 

In the Petascale era, supercomputers were used both for 

simulation and for in-situ graphical visualization of the results. At 

Exascale, compute resources will be more precious than 

before, and using them for co-processing tasks will not be 

efficient. We are designing a new appliance that will take over the 

processing required for graphical visualization, allowing 

visualization to run as co-processing alongside the 

simulation. We showcased the appliance at SC11. Running a 

pipeline of computational simulation and visualization, we show 

that our prototype system reduces total time to simulation 

completion by up to 30%. 


WEDNESDAY 

Speaker(s): Sorin Faibish (EMC Corporation) 

Panelist(s): Tom Furlong (Managing Director, Granite Ventures), Rob 

Enderle (Principal Analyst, Enderle Group), Flip GIanos (General Partner, 

InterWest Partners), Jeff Herbst (VP of Business Development, NVIDIA) 

Topic(s): HW/SW Architectures for Co-processing (Intermediate) 
















Featuring GAIKAI, Immersive Media, and Numecent 







Speaker(s): David Perry (CEO and Co-Founder, GAIKAI), Mark 

McGovern (CEO, Immersive Media), Osman Kent (CEO, Numecent) 



ROOM A1 

S0073 Cost-effective GPU Acceleration of a Video 

Restoration and Archiving Workflow 

The goal of this session is to present a complex GPU-accelerated 

video restoration and archiving workflow. The workflow consists of 

many different processing steps and a final review application. 

Fast and cost-effective processing and real-time display of the 

processed video material is a key requirement. It will be shown in 

detail how a GPU based acceleration can be achieved for many 

different processing steps and the review application based on the 

use of OpenCV, OpenCL, and OpenGL. Furthermore, an object 

oriented software architecture supporting the acceleration of 

several different processing tasks on the same graphics adapter 

will be presented. 

Speaker(s): Klaus Gaedke (Lab Manager, Technicolor) 

Topic(s): Audio, Image and Video Processing (Intermediate) 


ROOM B 

S0103 Accelerating Protein Sequences and Classification 

using GPU-HMMER Search 

In this paper we present the results of parallelizing HMMer, which 

is a widely used tool for protein sequence homology detection, as 

well as functional annotation of homologous protein sequences, 

and protein family classification. The HMMer program is based 

upon a Viterbi algorithm coded in C, and is quite time consuming. 

We modify the logic of the Viterbi algorithm to port it to GPGPU. We 

test multiple enhancements in our GPU kernels in order to 

demonstrate the effectiveness of each strategy. Our 

implementation, cuda_hmmsearch, achieves an overall speedup of up to 30x 

over a single-core Intel CPU. 

Speaker(s): Mahesh Khadtare (PhD Student - Scientist ESP, I2IT, 

Pune University) 
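
For readers unfamiliar with the Viterbi algorithm mentioned in this abstract, the following is a minimal log-space sketch for a generic discrete hidden Markov model. It is not HMMer's profile-HMM variant or the speaker's GPU kernel; the function name `viterbi`, the dictionary-based model layout, and the toy two-state model are all assumptions made for illustration.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_path) for a discrete HMM in log space."""
    # score[s] = best log-probability of any path ending in state s
    score = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # Each state's update reads only the previous column, so this
            # loop over s is the part that could be computed in parallel.
            prev = max(states, key=lambda p: score[p] + log_trans[p][s])
            new_score[s] = score[prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path


# Toy two-state model (hypothetical numbers, for illustration only).
log = math.log
states = ("H", "F")
log_start = {"H": log(0.6), "F": log(0.4)}
log_trans = {"H": {"H": log(0.7), "F": log(0.3)},
             "F": {"H": log(0.4), "F": log(0.6)}}
log_emit = {"H": {"normal": log(0.5), "cold": log(0.4), "dizzy": log(0.1)},
            "F": {"normal": log(0.1), "cold": log(0.3), "dizzy": log(0.6)}}
best_log_prob, best_path = viterbi(("normal", "cold", "dizzy"),
                                   states, log_start, log_trans, log_emit)
```

The recurrence is sequential along the observation sequence but independent across states within a step, which is one reason the algorithm maps reasonably well onto many-threaded hardware.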



ROOM A8 

S0141 GPU-Accelerated Optical Coherence 

Tomography Imaging 

We developed a series of GPU-based technologies to accelerate 

the imaging reconstruction and visualization for optical coherence 

tomography (OCT). Several GPU-based algorithms such as 

non-uniform fast Fourier transform, numerical dispersion 

compensation, simultaneous phase modulation and multi-GPU 

implementation were developed to achieve improved impulse 

response, better SNR, doubled imaging range and higher system 

stability. The GPU-accelerated 4D-OCT system was validated by 

imaging both in vivo and ex vivo biological tissues. This technology 

overcomes the imaging reconstruction and visualization 

bottlenecks that widely exist in current ultrahigh speed OCT 

systems and opens the way to interventional OCT imaging for 

applications in guided microsurgery. 

Speaker(s): Kang Zhang (Research Scientist, GE Global Research) 

Topic(s): Medical Imaging & Visualization (Beginner) 


ROOM N 

S0207 GPU Enabled Macromolecular Simulation: 

Challenges and Opportunities 

GPU-enabled simulation of fully atomistic macromolecular 

systems is rapidly gaining momentum, enabled by massive 

parallelism and by the parallelizability of various components of 

the underlying algorithms and methodologies. The massive 

parallelism, on the order of several hundred to a few thousand 

cores, presents opportunities as well as implementation 

challenges. In this talk, dive deep into the key aspects of 

simulation methodologies for macromolecular systems 

specifically adapted to GPUs. Learn some of the underlying 

challenges and get the latest solutions devised to tackle them in 

the FEN ZI code for fully atomistic macromolecular simulations. 

Speaker(s): Michela Taufer (Assistant Professor, University of 

Delaware), Sandeep Patel (University of Delaware) 

Topic(s): Molecular Dynamics, Algorithms & Numerical Techniques 

(Advanced) 


ROOM K 

S0293 Culises – A Library for Accelerated CFD on Hybrid 

GPU-CPU Systems 

The vast majority of CFD simulations rely on the solution of 

large-scale systems of linear equations (SLE), where the solution of 

a system can consume most of the total CPU time. We have 

developed a library (Culises) for state-of-the-art solution of SLE that 

is targeted at hybrid GPU-CPU platforms. Culises can be connected 

to MPI-parallelized CFD codes (e.g. OpenFOAM) via an application-specific 

interface. In this talk, we focus on efficient implementation 

of preconditioned Krylov subspace methods. Using the computing 

power of GPUs, Culises can significantly accelerate pure CPU 

computations for a multitude of industrial CFD applications.

Speaker(s): Bjoern Landmann (Development Engineer, FluiDyna GmbH) 




ROOM A5 

S0340 Debug Multi-GPU Applications on CUDA- 

Accelerated Clusters with TotalView 

Learn how TotalView can help you develop CUDA applications on 

single servers, multi-GPU servers, and HPC-style clusters. For 

more than 20 years the TotalView debugger has set the standard 

for parallel and multi-core debugging on Linux, HPC clusters and 

custom supercomputers such as the Cray XT/XE/XK series. CUDA 

developers deal with the same types of complexity and can realize 

the same productivity benefits. This talk will introduce TotalView 

for CUDA and show how you can program more easily with CUDA 

3.2, 4.0 and 4.1. 

Speaker(s): Chris Gottbrath (Principal Product Manager, Rogue 

Wave Software) 

Topic(s): Development Tools & Libraries, Supercomputing 



ROOM A7 

S0433 Accelerated FDTD Technique for Marine 

Controlled Source Electromagnetic Imaging 

Find out about the newest method for Marine Hydrocarbon 

Exploration. In this session we will profile the use of Finite 

Difference Time Domain (FDTD) technique in combination with 

Mittet’s method and GPUs to produce faster, cheaper, more 

accurate forward modeling for electromagnetic imaging 

(Controlled Source Electromagnetic or CSEM). Unlike many 

frequency domain CSEM techniques this accelerated method does 

not require simplifying assumptions to reduce the memory and 

computational burden and has excellent scaling properties 

(essentially linear) across clusters of GPU accelerated nodes. 

CSEM is used in the industry to enhance confidence in 

hydrocarbon reservoir discoveries. 

Speaker(s): Geoff Clark (CEO, Acceleware Ltd.), Michal Okoniewski 

(Director of Marketing, Acceleware Ltd.) 

Topic(s): Energy Exploration (Intermediate) 


HALL 1 

S0514 GPU Performance Analysis and Optimization 

This session will present the fundamental performance-optimization 

concepts and illustrate their practical application in 

the context of programming for Fermi and Kepler GPUs. The goal 

is twofold: make the optimization process a methodical sequence 

of steps, and facilitate making performance-aware algorithmic 

decisions before coding even starts. In order to maximize GPU 

performance, a code should have sufficient parallelism, access 

memory in a coalesced pattern, and be amenable to vector 

execution within warps (groups of 32 threads). We will show how 

to quantify these requirements for a specific GPU in order to 

determine performance limiters and their importance for a given 

code. To address the limiters, we will review hardware operation 

specifics and related optimization techniques. The optimization 

process will be illustrated using NVIDIA profiling tools and kernel 

case studies. 

Speaker(s): Paulius Micikevicius (Developer Technology Engineer, NVIDIA) 



ROOM J1 

S0702 The Architecture of Acceleration in HPC 

High Performance Computing applications push the envelope of 

what can be computed today. Acceleration technologies play a 

critical role in extending and enhancing capability. Balancing the 

impact of acceleration within hardware and software is a difficult 

art, where critical decisions can have dramatic impacts. We 

present the role of acceleration in tightly and loosely coupled 

settings, as well as data structures and execution model. 

Speaker(s): Justin Tripp (Technical Staff Member, Los Alamos National 

Laboratory), Zack Baker (Los Alamos National Laboratory) 



ROOM K 

S0055 Particle Dynamics with MBD and FEA using CUDA 

Systems of many spherical particles are solved with the DEM (Discrete Element 

Method) and simulated with GPU technology. A fast algorithm is 

applied to calculate Hertzian contact forces between many spherical 

particles (from 100,000 to 1,000,000), and NVIDIA’s CUDA is used to 

accelerate the calculation. The particles, together with MBD and 

FEA entities, are simulated within the commercial software RecurDyn. 

Several models are built and simulated: a forklift with sand, 

oil in an oil tank, an oil-filled engine system, and a water-filled 

washing machine. All models are simulated on NVIDIA 

GPUs and the results are shown. 

Speaker(s): Graham Sanborn (Lead Software Developer, FunctionBay) 

Topic(s): Computational Structural Mechanics, Computational Physics, 

Computational Fluid Dynamics (Intermediate) 


ROOM B 

S0109 SOAP3: GPU-based Compressed Indexing and 

Ultra-fast Parallel Alignment of Short Reads 

We give the first implementation of a compressed index 

(Burrows-Wheeler Transform) on the GPU, supporting very 

efficient parallel alignment of short patterns (reads) onto the 

human genome. The new alignment software SOAP3 is tens of 

times faster than existing ones and can keep up with the throughput 

(Giga to Tera bp) of next-generation DNA sequencers. It takes 2.4 

seconds to perform exact matching for one million length-100 

reads (tens of seconds for small-error approximate matching). 

Technically, we show how to minimize memory accesses to the 

index from individual threads and to control the branching and 

divergence of the threads. 

Speaker(s): BingQiang Wang (BGI) 

Topic(s): Bioinformatics (Advanced) 
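
To make the idea of exact matching against a Burrows-Wheeler index concrete, here is a small CPU-only sketch of the standard FM-index backward search. It is not SOAP3's code or its GPU data layout; the function names `build_fm_index` and `count_exact` are assumptions for this example, and the naive suffix sort and uncompressed occurrence tables are only suitable for tiny inputs.

```python
def build_fm_index(text):
    """Build a naive FM-index of text; '$' is appended as a terminator."""
    text += "$"
    # Suffix array by direct sorting: fine for a sketch, far too slow for a genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT character = the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # first[c] = number of characters in text strictly smaller than c
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, first, occ


def count_exact(pattern, index):
    """Count exact occurrences of pattern using backward search."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # current suffix-array interval [lo, hi)
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

Each pattern character costs two lookups into the occurrence table, so the number of index accesses per read grows with read length rather than genome length. That is why limiting and organizing those accesses matters when many threads search at once.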


ROOM A8 

S0131 Multi-GPU Real-Time Ptychographic X-ray 

Image Reconstruction 

Learn how a new imaging technique, combined with the 

computational power of GPUs and the brightness of modern X-ray 

synchrotrons can quickly and easily produce images with 

nanometer level resolution. Ptychography is a recent X-ray 

imaging technique in which overlapping regions of a sample are 

exposed in quick succession and the resulting scattering is used 

to reconstruct a high resolution image of the sample. Discover 

why GPUs can substitute for the lack of X-ray lenses and how they 


enabled a dramatic reduction in the feedback time for users of the 

technique from days to seconds. 

Speaker(s): Filipe Maia (Postdoctoral Fellow, Lawrence Berkeley 

National Laboratory) 

Topic(s): Audio, Image and Video Processing, Algorithms & 



ROOM A3 

S0149 On the Parallel Solution of Sparse Triangular 

Linear Systems 

A parallel algorithm for solving a sparse triangular linear system on 

the GPU is proposed. It implements the solution of the triangular 

system in two phases. The analysis phase builds a dependency graph 

based on the matrix sparsity pattern and groups the independent 

rows into levels. The solve phase obtains the full solution by iterating 

sequentially across the constructed levels. The solution elements 

corresponding to each level are obtained in parallel. The numerical 

experiments are presented and it is shown that the incomplete-LU 

and Cholesky preconditioned iterative methods can achieve a 2x 

speedup on the GPU over their CPU implementation. 

Speaker(s): Maxim Naumov (Software Engineer, NVIDIA) 

Topic(s): Algorithms & Numerical Techniques, Development Tools & 

Libraries (Intermediate) 
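
The two phases described in this abstract can be sketched on the CPU as follows. This is only an illustration of level scheduling for a lower-triangular system, not the speaker's GPU implementation; the names `build_levels` and `solve_lower` and the dict-per-row sparse format are assumptions chosen for brevity.

```python
def build_levels(rows):
    """Analysis phase: group rows of a lower-triangular matrix into levels.

    rows[i] is a dict {column: value} for row i, including the diagonal.
    """
    level_of = [0] * len(rows)
    levels = []
    for i, row in enumerate(rows):
        # A row must wait for every earlier row whose unknown it references.
        deps = [level_of[j] for j in row if j < i]
        level_of[i] = 1 + max(deps) if deps else 0
        if level_of[i] == len(levels):
            levels.append([])
        levels[level_of[i]].append(i)
    return levels


def solve_lower(rows, b, levels):
    """Solve phase: visit levels in order; rows within a level are independent."""
    x = [0.0] * len(rows)
    for level in levels:
        for i in level:  # on a GPU, this inner loop would run in parallel
            acc = b[i] - sum(v * x[j] for j, v in rows[i].items() if j != i)
            x[i] = acc / rows[i][i]
    return x
```

The analysis phase depends only on the sparsity pattern, so its result can be reused whenever the same pattern is solved repeatedly, as happens inside a preconditioned iterative method.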


ROOM L 

S0206 Monte-Carlo Pricing Under a Hybrid Local 

Volatility Model 

This session shows how to calculate the prices of several financial 

products, vanilla and exotic, under Dupire’s Local Volatility model. 

We start with vanilla options on the foreign exchange rate and 

explain how to rescale the Local Volatility matrix in order to take 

advantage of the fast texture memory interpolation. We then extend 

this framework to two factors by including stochastic interest rates 

following the Hull-White model, and show how to price Power-Reverse 

Dual Coupon swaps with an exotic TARN feature. We provide details 

of the algorithms and compare accuracy and speed with typical 

performances of single-core production implementations. 

Speaker(s): Sebastien Gurrieri (Quantitative Analyst, Mizuho 

International) 

Topic(s): Finance, Algorithms & Numerical Techniques (Intermediate) 


ROOM A1 

S0273 Fast JPEG Coding on the GPU 

The goal of this session is to demonstrate how high speed JPEG 

compression and decompression can be efficiently implemented 

on the GPU using CUDA. In this session we will present: detailed 

analysis of Baseline JPEG compression and decompression 

processes and its constituent parts (such as Huffman Coding, 

RLE, Differential Coding, Quantization, Discrete Cosine Transform) 

and their suitability for the GPU architecture, analysis of achieved 

results and comparison with existing implementations, 

applications to high-speed imaging. 

Speaker(s): Fyodor Serzhenko (SEO, Fastvideo), Victor Podlozhnyuk 

(NVIDIA) 

Topic(s): Audio, Image and Video Processing, Algorithms & 

Numerical Techniques (Advanced) 


ROOM A2 

S0286 Scaling Applications to a Thousand GPUs 

and Beyond 

Discover how to scale scientific applications to thousands of GPUs 

in parallel. We will demonstrate our techniques using two codes 

representative of a wide spectrum of programming methods. The 

Ludwig lattice Boltzmann package, capable of simulating 

extremely complex fluid dynamics models, combines C, MPI and 

CUDA. The Himeno three-dimensional Poisson equation solver 

benchmark combines Fortran (using the new coarray feature for 

communication) with prototype OpenMP accelerator directives (a 

promising new high-productivity GPU programming method). We 

will present performance results using the cutting-edge 

massively-parallel Cray XK6 hybrid supercomputer featuring the 

latest NVIDIA Tesla 2090 GPUs. 

Speaker(s): Alan Gray (HPC Architect, The University of Edinburgh), 

Roberto Ansaloni (Cray Italy) 

Topic(s): Supercomputing, Computational Fluid Dynamics, Parallel 

Programming Languages & Compilers, Application Design & 



ROOM C 

S0299 Exploiting Fault Tolerant Heterogeneous 

Parallelism with SPM.Python 

In this session, we shall review how SPM.Python enables the 

exploitation of parallelism across servers, cores and GPUs in a 

fault tolerant manner. We will start off by describing how, what, 

and why SPM.Python augments traditional (serial) Python 

with parallel concepts like parallel task managers and 

communication primitives. Specifically, the context for and 

solutions to three formally open technical problems will be 

described. We will conclude by reviewing examples of how 

SPM.Python can be used to exploit both coarse and fine grain 

parallelism using GPUs within and across servers in a fault 

tolerant manner. 

Speaker(s): Minesh B Amin (Founder / CEO, MBA Sciences) 




S0332 Efficient Graph Matching and Coloring on the GPU 

The goal of this session is to compare the performance of graph 

matching and graph coloring algorithms on massively parallel 

devices such as GPUs. We present novel algorithms, which produce 

superior results for certain graphs and also discuss the techniques 

used to efficiently implement these algorithms on the GPU. 

Speaker(s): Patrice Castonguay (Emerging Applications Intern, 

NVIDIA), Jonathan Cohen (Emerging Applications, NVIDIA) 



ROOM N 

S0363 Efficient Molecular Dynamics on Heterogeneous 

GPU Architectures in GROMACS 

Molecular Dynamics is an important application for GPU 

acceleration, but many algorithmic optimizations and features still 

rely on code that prefers traditional CPUs. It is only with the latest 

hardware and software we have been able to realize a 

heterogeneous GPU/CPU implementation and reach performance 


significantly beyond the state-of-the-art of hand-tuned CPU code 

in our GROMACS program. The sub-millisecond iteration time 

poses challenges on all levels of parallelization. Come and learn 

about our new atom-cluster pair interaction approach for 

non-bonded force evaluation that achieves 60% work-efficiency 

and other innovative solutions for heterogeneous GPU systems. 

Speaker(s): Berk Hess (PhD Student, KTH Royal Institute of Technology), 

Szilárd Páll (PhD Student, KTH Royal Institute of Technology) 

Topic(s): Molecular Dynamics, Computational Physics, Life Sciences 



ROOM A7 

S0507 Interactive and Scalable Subsurface Data 

Visualization Framework 

The goal is to present an interactive visualization framework for 

large geo-spatial data. This framework has been developed by 

NVIDIA Advanced Rendering Center for the oil and gas 

(hydrocarbon) industry. The CUDA-based application runs in 

the cloud at interactive frame rates. The visualization is delivered 

remotely to clients in a browser, including tablets. The scalable 

visualization framework can handle terabytes of. 

Speaker(s): Tom-Michael Thamm (Director, Software Product 

Management, NVIDIA ARC), Marc Nienhaus (NVIDIA ARC) 

Topic(s): Visualization, Cloud Computing (Intermediate) 


ROOM M 

S0636 Supermicro: Worldwide leader in GP/GPU Servers 

and Workstation Platforms (Presented by Supermicro) 

Discover the measurable advantages that make Supermicro the 

time-to-market leader in GPU platform enablement. See how 

Supermicro’s innovative Application-Optimized designs enable 

partners to both scale-up and scale-out for maximum return on 

investment. Review actual case studies that highlight Supermicro’s 

leadership in Compute Density, Peak Performance, Scalability, 

Power Efficiency, Manageability, Reliability and Cost Effectiveness. 

Speaker(s): Don Clegg (VP, Supermicro) 



ROOM J1 

S0703 Adaptive Heterogeneous Computing with OpenCL: 

A Molecular Docking Case Study 

Modern computer systems routinely include multiple types of fully 

programmable computing resource, such as multi-core CPUs and 

many-core GPUs. Most research into accelerator-based 

computing tends to focus on just one part of the system, typically 

the GPU. In our work we have developed methods to harness all of 

the available computing resources in a system simultaneously, 

including CPUs and GPUs, using OpenCL as the underpinning 

cross-platform layer. In this paper we shall include results from a 

molecular docking program, which has been shown to scale 

across hundreds of hybrid CPU/GPU systems, yielding significant 

increases in performance and energy efficiency. 

Speaker(s): Simon McIntosh-Smith (University of Bristol) 



















Featuring RealView Imaging, Elemental Technologies, 

and Mersive 







Speaker(s): Shaul Geldman (Co-Founder and VP of R&D, RealView 

Imaging), Sam Blackman (CEO and Co-Founder, Elemental 

Technologies), Robert Balgley (CEO, Mersive) 


Enderle (Principal Analyst, Enderle Group), Flip GIanos (General 

Partner, InterWest Partners), Jeff Herbst (VP of Business Development, 

NVIDIA) 



ROOM A1 

S0052 Fast High Quality Image and Video Background 

Removal with CUDA 

A tool to efficiently and easily cut out objects from a taken picture 

has great practical value. In this session we present aspects on how 

to efficiently implement such a tool with CUDA and the NPP library 

based on the GrabCut approach by Rother et al. Through GPU 

acceleration both runtime and accuracy are improved compared to 

CPU based implementations such as the one in MS Word 2011. 

Further we show how to extend our GPU implementation to enable 

live background removal in a webcam video stream. 

Speaker(s): Timo Stich (Developer Technology Engineer, NVIDIA) 

Topic(s): Audio, Image and Video Processing, Machine Learning & AI 



ROOM K 

S0070 Large-Scale Matrix-Free Topology Optimization 

on the GPU 

Popular topology optimization methods today are based on the 

SIMP concept. Unfortunately, SIMP leads to ill-conditioned 

stiffness matrices that are difficult to solve on GPU architectures. 

In this talk, I will present a new topology optimization method 

called PareTO that relies on the concepts of topological sensitivity 

and pareto-tracing. The resulting stiffness matrices are 

well-conditioned, and one can now fully exploit GPU architectures for 

fast matrix-free implementation of the finite element method. 

Numerical experiments demonstrate the efficacy of PareTO. 

Speaker(s): Krishnan Suresh (Associate Professor, University 

of Wisconsin) 




ROOM B 

S0084 CUMACH - A Fast GPU-based Genotype 

Imputation Tool 

The goal of this session is to introduce a GPU-implemented tool in 

bioinformatics. Genotype imputation is a method that extrapolates 

genetic correlations from a densely characterized reference panel 

to a sparsely typed study sample. Many CPU-based tools already 

exist, but they all take a long time on large data sets. 

In this session, we implement a GPU-based imputation tool 

that delivers relatively good results at high speed. There will be 

three main parts for the session: 1) Introduce the background and 

its HMM based algorithm, 2) GPU implementation and 

optimization, 3) Results. 

Speaker(s): Agatha Hu (NVIDIA) 

Topic(s): Bioinformatics (Intermediate) 


ROOM N 

S0121 Software Architecture to Facilitate CUDA 

Development 

We describe a workflow architecture and its use in developing 

Schrödinger’s core-hopping application. The application supplies 

the stages as callbacks. A stage may have multiple 

implementations; for example, CUDA and CPU. An implementation 

can be assigned a maximum number of simultaneous threads. 

When any stage completes, a scheduling algorithm determines 

which implementation of which stage will be launched next. The 

application may detect “special” environments, such as CUDA, and 

set up its stages accordingly, or it may allow specification of which 

implementation of each stage to run. This makes it easy to develop 

and debug CUDA stages flexibly and incrementally. 

Speaker(s): Peter Shenkin (Vice President, Schrodinger), K. Patrick 

Lorton (Principal Developer, Schrodinger) 

Topic(s): Development Tools & Libraries, Life Sciences (Intermediate) 


ROOM A8 

S0202 Terascale Volume Visualization in Neuroscience 

Learn how to create a scalable volume visualization system for 

interactive rendering of terascale EM data. We will describe the 

major design principles, how we can avoid the standard approach 

of pre-computing a 3D multi-resolution hierarchy such as an 

octree, and how to handle continuous streaming of newly acquired 

data. For rendering we build upon a visibility-driven approach and 

3D virtual texturing, and perform interactive volume rendering of 

a “virtual” volume, where the corresponding physical storage is 

only represented and populated in a sparse manner with 2D 

instead of 3D image data on the fly during rendering. 

Speaker(s): Johanna Beyer (Postdoctoral Fellow, King Abdullah 

University of Science and Technology), Markus Hadwiger (Assistant 

Professor, KAUST) 

Topic(s): Visualization, Neuroscience (Intermediate) 



S0241 Large Graphs on Multi-GPUs 

The goal of this session is to propose new paradigms to explore 

large graphs on GPUs. Graphs with billions of edges don’t fit 

within the memory of a single GPU. A possible solution is to resort 

to multiple GPUs. Most common graph algorithms show low

arithmetic intensity and irregular access patterns. These features

lead to poor load balance among threads and uncoalesced

memory access. We show how to balance the load so that all

threads are used as fully as possible, and then how to use fast

algorithms, such as radix sort and scan, to rearrange data before processing it.
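
The abstract does not spell out the load-balancing scheme, so the following is only a minimal sketch of one common way a scan can be used for it: an exclusive prefix sum over vertex degrees gives every edge a global index, and edges (rather than vertices) are then dealt out evenly to threads. The function names `edge_offsets`, `owner_of_edge`, and `assign_edges` are hypothetical, and the serial Python loop merely stands in for what would be parallel GPU threads.

```python
from bisect import bisect_right
from itertools import accumulate


def edge_offsets(degrees):
    """Exclusive prefix sum (scan) of the vertex degrees.

    offsets[v] is the global index of the first edge of vertex v,
    and offsets[-1] is the total number of edges.
    """
    return [0] + list(accumulate(degrees))


def owner_of_edge(offsets, edge_id):
    """Binary-search the scan result for the vertex that owns edge_id."""
    return bisect_right(offsets, edge_id) - 1


def assign_edges(degrees, num_threads):
    """Deal edges out round-robin so each thread gets a near-equal share.

    Returns one list per thread of (vertex, local_edge_index) pairs.
    """
    offsets = edge_offsets(degrees)
    total_edges = offsets[-1]
    work = [[] for _ in range(num_threads)]
    for edge_id in range(total_edges):
        vertex = owner_of_edge(offsets, edge_id)
        work[edge_id % num_threads].append((vertex, edge_id - offsets[vertex]))
    return work
```

With degrees `[3, 0, 2]` and two threads, the five edges are split three and two, whereas a one-vertex-per-thread mapping would leave the thread holding the zero-degree vertex idle.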

Speaker(s): Enrico Mastrostefano (PhD Student, Sapienza Università 

di Roma) 



ROOM A5 

S0257 Trace Based Performance Analysis For GPU 

Accelerated Multi-Hybrid Applications 

Get in contact with performance tuning experts for multi-hybrid 

applications and see first hand how VampirTrace/Vampir can 

significantly speed up application porting and development. 

Speaker(s): Guido Juckeland (System Engineer (HPC), Leader 

Hardware Accelerator Group, TU Dresden - ZIH) 



ROOM C 

S0367 Physis: An Implicitly Parallel Framework for 

Stencil Computations 

This session presents how to implement finite difference methods 

in a concise, readable, and portable way, yet achieving good 

scalability over hundreds of GPUs, using the Physis high-level 

application framework. Physis extends the standard C language 

with a small set of custom declarative constructs for expressing 

stencil computations with multidimensional structured grids, 

which are automatically translated to CUDA for GPU acceleration 

and MPI for node-level parallelization with automatic domain-specific

optimizations such as overlapped boundary exchanges. 

We demonstrate the programmability improvement and 

performance of Physis using hundreds of GPUs on TSUBAME2.0. 

Speaker(s): Naoya Maruyama (Assistant Professor, Tokyo Institute of Technology)


Topic(s): Parallel Programming Languages & Compilers, 

Supercomputing, Development Tools & Libraries, Computational Fluid 

Dynamics (Intermediate) 


ROOM L 

S0377 C++ Data Marshalling Best Practices 

When integrating CUDA C++ kernels into existing C++ applications, 

it is at times desirable to migrate a C++ object instance from the 

host to the device or vice versa. Given variations among host 

compilers regarding structure layout, accomplishing this data 

marshalling in a manner that is reliable, simple, and efficient is a 

complex issue. cudaMemcpy is our primary means to transfer 

data to the GPU, but memcpy-style operations are more readily 

amenable to C-style structures and arrays than to C++ objects or 

collections of objects. In this session, we will cover the caveats 

and best practices for marshalling C++ data. 


Speaker(s): Cliff Woolley (CUDA Developer Technology Engineer, NVIDIA) 

Topic(s): Finance, Application Design & Porting Techniques (Intermediate) 


ROOM A7 

S0511 3D Helmholtz Solver with a Shifted Laplace 

Multigrid on Multi-GPUs 

Learn about an iterative solver for the 3D Helmholtz equation on

multiple GPUs using CUDA. The Helmholtz equation, discretized by

second-order finite differences, is solved with Bi-CGSTAB

preconditioned by a shifted Laplace multigrid method. Two 

multi-GPU approaches are considered: data parallelism and 

algorithm-split. Their implementations on a multi-GPU architecture

are compared to multi-threaded CPU and single-GPU

implementations. The results show that the data-parallel

implementation suffers from communication between GPUs

and CPU, but is still several times faster than many-core CPUs.

The algorithm-split across GPUs limits communication and 

delivers speedups comparable to a single GPU implementation. 

Speaker(s): Kees Lemmens (Delft University of Technology) 

Topic(s): Energy Exploration, Algorithms & Numerical Techniques 



ROOM A3 

S0525 Copperhead: Data Parallel Python 

Copperhead is a data parallel language suitable for GPU 

programming, embedded in Python, which aims to provide both a 

productive programming environment as well as excellent 

computational efficiency. Copperhead programs are written in a 

small, restricted subset of the Python language, using standard 

constructs like map and reduce, along with traditional data 

parallel primitives like scan and sort. Copperhead programs 

interoperate with existing Python numerical and visualization 

libraries such as NumPy, SciPy, and Matplotlib. In this talk, we will 

discuss the Copperhead language, the open-source Copperhead 

runtime, and selected example programs. 

Speaker(s): Bryan Catanzaro (Research Scientist, NVIDIA) 



ROOM J1 

S0704 Accelerating Iterative Linear Solvers on GPUs 

In this talk, we present our work on solving sparse linear systems 

on NVIDIA Tesla GPUs. We develop a new matrix format for GPUs,

HEC (Hybrid of ELL and CSR). The corresponding sparse matrix 

vector multiplication kernel and other related BLAS 1/2 

subroutines are developed. Based on these subroutines, seven 

Krylov subspace solvers and two algebraic multigrid solvers 

(AMG) are implemented. Several commonly used preconditioners, 

such as Neumann polynomial, approximate inverse, ILU(k), ILUT, 

block ILU(k), block ILUT, domain decomposition (DDM) and AMG 

preconditioners, are also developed. In addition, a new parallel

triangular solver for GPU is designed. With this solver, a unified 

framework for ILU-related preconditioners is implemented. 
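
The abstract only names HEC as a hybrid of ELL and CSR and does not describe its layout. The sketch below is therefore an assumption, not the speaker's format: it stores the first `ell_width` entries of every row in a fixed-width, padded ELL block and any remaining entries in a CSR block, then multiplies by a vector. The names `to_hybrid` and `spmv` are hypothetical.

```python
def to_hybrid(rows, ell_width):
    """Split a sparse matrix into an ELL part and a CSR part.

    rows: list of rows, each a list of (column, value) pairs.
    The first ell_width entries of each row go into padded ELL arrays;
    the overflow goes into CSR arrays. (Assumed split rule.)
    """
    ell_cols, ell_vals = [], []
    csr_ptr, csr_cols, csr_vals = [0], [], []
    for entries in rows:
        head, tail = entries[:ell_width], entries[ell_width:]
        pad = ell_width - len(head)
        # Padding uses column 0 with value 0.0, which contributes nothing.
        ell_cols.append([c for c, _ in head] + [0] * pad)
        ell_vals.append([v for _, v in head] + [0.0] * pad)
        csr_cols.extend(c for c, _ in tail)
        csr_vals.extend(v for _, v in tail)
        csr_ptr.append(len(csr_cols))
    return (ell_cols, ell_vals), (csr_ptr, csr_cols, csr_vals)


def spmv(hybrid, x):
    """Compute y = A @ x for a matrix stored by to_hybrid."""
    (ell_cols, ell_vals), (csr_ptr, csr_cols, csr_vals) = hybrid
    y = []
    for i in range(len(ell_cols)):
        # Regular part: every row does exactly ell_width multiply-adds.
        acc = sum(v * x[c] for c, v in zip(ell_cols[i], ell_vals[i]))
        # Irregular part: only rows longer than ell_width have entries here.
        for k in range(csr_ptr[i], csr_ptr[i + 1]):
            acc += csr_vals[k] * x[csr_cols[k]]
        y.append(acc)
    return y
```

The usual motivation for such a split is that the fixed-width part gives regular memory access, while the overflow part avoids padding every row to the length of the longest one.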

Speaker(s): Hui Liu (University of Calgary) 



ROOM L 

S0100 Mathematica as a Practical Platform for GPU- 

Accelerated Finance 

With the introduction of GPU support in version 8, Mathematica 

has become an excellent environment for integrating CUDA with 

high level code for interpretation or visualization. In this 

presentation, we will show the usefulness of Mathematica in the

field of computational finance. In addition to demonstrating the

GPU-accelerated financial computations which can be readily 

performed within Mathematica, we will show that these 

calculations can easily be integrated with third-party data sources 

including Microsoft Excel and databases. Furthermore, we will 

cover the UnRisk Mathematica package written by MathConsult, 

which seamlessly adds GPU-accelerated complex model 

calibration algorithms to Mathematica’s repertoire. 

Speaker(s): Abdul Dakkak (Kernel Developer, Wolfram Research), 

Dylan Roeh (Kernel Developer, Wolfram Research) 

Topic(s): Finance, Development Tools & Libraries (Intermediate) 


ROOM A1 

S0128 V:Screen: A Real-Time Augmented Video Method 

We present an image-editing tool that replaces a region of any

image or video with another image or video. This application is

useful for advertisements, commercials, music videos, movies, etc.

We call our development “Virtual Screen”, or just VScreen. The

main difference between editing (augmenting) videos and fixed

images is that occlusions need to be managed: moving objects in

the foreground may occlude the augmented region in the

background. We therefore use a procedure for

foreground-background video segmentation, implemented

on NVIDIA video cards to meet the real-time requirement.

Speaker(s): Francisco J. Hernandez-Lopez (PhD Student, CIMAT A.C.), 

Mariano Rivera (Researcher-Professor, CIMAT A.C.) 

Topic(s): Computer Vision (Beginner) 


ROOM N 

S0139 GPU-Based Molecular Dynamics Simulations of 

Protein and RNA Assembly 

Protein and RNA biomolecular folding and assembly problems 

have important applications because misfolding is associated with 

diseases like Alzheimer’s and Parkinson’s. However, simulating 

complex biomolecules on the same timescales as experiments is 

an extraordinary challenge due to a bottleneck in the force 

calculations. To overcome these hurdles, we perform coarse-grained

molecular dynamics simulations where biomolecules are

reduced into simpler components. Furthermore, our GPU-based

simulations show a significant performance improvement over

CPU-based simulations, which are limited to systems of 50-150

residues/nucleotides. The GPU-based code can simulate protein/ 

RNA systems of 400-10,000+ residues/nucleotides, and we 

present ribosome assembly simulations. 

Speaker(s): Samuel Cho (Assistant Professor, Wake Forest University) 

Topic(s): Molecular Dynamics, Computational Physics (Intermediate) 


ROOM A3 

S0242 Harnessing GPU Compute with C++ AMP (Part 1 of 2) 

C++ AMP is an open specification for taking advantage of 

accelerators like the GPU. In this session we will explore the C++

AMP implementation in Microsoft Visual Studio 11. After a quick

overview of the technology, covering its goals and how it

differs from other approaches, we will dive into

the programming model and its modern C++ API. This is a

code-heavy, interactive, two-part session, where every part of the

library will be explained. Demos will include showing off the 

richest parallel and GPU debugging story on the market, in the 

upcoming Visual Studio release. 

Speaker(s): Daniel Moth (Principal Program Manager, Microsoft) 

Topic(s): Parallel Programming Languages & Compilers, Development 

Tools & Libraries (Intermediate) 


ROOM A5 

S0298 Performance Tools for GPU-Powered Scalable 

Heterogeneous Systems 

Discover the latest parallel performance tool technology for 

understanding and optimizing parallel computations on scalable 

heterogeneous platforms. The session will present the TAU 

performance system and its support of measurement and analysis 

of heterogeneous platforms composed of clusters of shared-memory

nodes with GPUs. In particular, TAU’s integration of the 

CUPTI 4.1+ technology will be described and demonstrated 

through CUDA SDK examples and the SHOC benchmarks. 

Attendees will be provided LiveDVDs containing the TAU tool suite

and many pre-installed parallel tool packages. They will also include

the latest CUDA driver, runtime library, and CUPTI.

Speaker(s): Allen Malony (Professor, University of Oregon) 

Topic(s): Development Tools & Libraries, Parallel Programming 

Languages & Compilers, Application Design & Porting Techniques 



ROOM B 

S0361 Lossless Data Compression on GPUs 

In this talk, we will discuss common data compression algorithms 

used in the bzip2 implementation. We will also discuss our efforts 

towards parallelizing the Burrows-Wheeler Transform, Move-to- 

Front Transform, and Huffman encoding. The Burrows-Wheeler 

Transform is an algorithm used in both lossless data compression 

and bioinformatics. We’ll explain how it was computed using a 

parallel string-sorting algorithm. We will also show performance 

comparisons to serial implementations of each algorithm. 
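
As a point of reference for the first two stages named above, here is a minimal serial sketch of the Burrows-Wheeler Transform followed by the Move-to-Front Transform. It sorts all rotations naively and is not the parallel string-sorting approach the session describes; the sentinel character `$` and the function names are assumptions made for illustration.

```python
def bwt(text, end="$"):
    """Naive Burrows-Wheeler Transform: sort all rotations, keep last column.

    `end` is a sentinel that must not occur in `text`; it makes the
    transform invertible.
    """
    if end in text:
        raise ValueError("sentinel character occurs in the input")
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def move_to_front(data, alphabet):
    """Move-to-Front Transform: emit each symbol's current table index,
    then move that symbol to the front of the table."""
    table = list(alphabet)
    output = []
    for symbol in data:
        index = table.index(symbol)
        output.append(index)
        table.insert(0, table.pop(index))
    return output
```

For example, `bwt("banana")` returns `"annb$aa"`, which groups equal characters together; Move-to-Front then maps each repeated character to index 0, producing the skewed symbol distribution that Huffman encoding exploits.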

Speaker(s): Jason Mak (Graduate Student, UC Davis), Ritesh Patel 

(Student, University of California Davis) 

Topic(s): Algorithms & Numerical Techniques, Bioinformatics 




S0410 Computing Hausdorff Distances Between 

Freeforms on the GPU 

We present new GPU algorithms for computing the directed 

Hausdorff distance between freeform surfaces, with applications in 

shape matching, mesh simplification, and geometric approximation 

and optimization. Our algorithms run in real-time with very small 

error bounds for parametric models defined by complex NURBS 

surfaces and can be used to interactively compute the Hausdorff 

distance for models made of dynamic deformable surfaces. We 

discuss implementation decisions and tradeoffs between OpenGL, 

CUDA, and Thrust, and the advantages and disadvantages of parallel

hierarchical culling methods for this application. 
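
For readers unfamiliar with the quantity being computed, the directed Hausdorff distance from A to B is the largest distance from any point of A to its nearest point of B. The sketch below is a brute-force version over finite point samples, assuming the surfaces have already been sampled into point lists; it does not reproduce the NURBS-based algorithms or the hierarchical culling discussed in the session.

```python
import math


def directed_hausdorff(points_a, points_b):
    """h(A, B) = max over p in A of (min over q in B of |p - q|).

    Brute force: O(len(A) * len(B)) distance evaluations.
    Each inner minimum is independent, so the outer loop is the part
    that maps naturally onto parallel threads.
    """
    return max(
        min(math.dist(p, q) for q in points_b)
        for p in points_a
    )
```

The measure is not symmetric: with `a = [(0, 0), (1, 0)]` and `b = [(0, 0)]`, `directed_hausdorff(a, b)` is `1.0` while `directed_hausdorff(b, a)` is `0.0`.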

Speaker(s): Sara McMains (Professor, UC Berkeley), Adarsh 

Krishnamurthy (Post-doctoral Researcher, UC San Diego) 

Topic(s): Algorithms & Numerical Techniques, Computer Graphics, 

Computer Vision (Intermediate) 


ROOM K 

S0518 GPU Computing: From Sand to Tank Dynamics 

This talk explores the use of heterogeneous CPU/GPU computing, 

as enabled by an in-house developed Heterogeneous Computing 

Template (HCT), for physics-based simulations of mechanical 

systems. HCT draws on five components: advanced modeling 

techniques (formulating the governing equations); algorithmic 

support (solving these equations); proximity computation; domain 

decomposition/data exchange (for multi-node distributed CPU/GPU 

computing); and post-processing/visualization. These five 

components provide the foundation of a computational framework 

used to analyze mechanical systems with millions of interacting 

elements. Example applications will include granular terrain 

simulation, tracked and wheeled vehicle mobility studies (tanks, 

rovers), fluid-solid interaction and nonlinear finite element analysis. 

Speaker(s): Dan Negrut (Associate Professor, University of 

Wisconsin-Madison) 

Topic(s): Computational Structural Mechanics, Computational 

Fluid Dynamics (Advanced) 


ROOM C 

S0605 cudaDMA: Emulating DMA engines on GPUs for 

Performance and Programmability 

The CudaDMA library is a collection of DMA objects that support 

efficient movement of data between off-chip global memory and 

on-chip shared memory in CUDA kernels. CudaDMA objects 

support many different data transfer patterns including sequential, 

strided, gather, scatter, and halo patterns. The library encapsulates 

efficient synchronization and data transfer implementations to 

achieve high memory bandwidth utilization. Programmer 

productivity is achieved by avoiding the need for thread array 

shapes to match data layout. Using CudaDMA, speedups of up to 

1.37x on synthetic micro-benchmarks and 1.15x-3.2x on kernels 

from scientific applications have been demonstrated. 

Speaker(s): Brucek Khailany (Senior Research Scientist, NVIDIA) 

Topic(s): Development Tools and Libraries (Intermediate) 

WEDNESDAY, MAY 16, 17:00 TBD (25 MINUTES) 

ROOM A8 

S0623 Visualizing Heterogeneous Performance Tested 

on MPI+CUDA Gigapixel Panorama Stitching 

This session consists of two technical parts. In the first part, we 

explain the use and implementation of a hybrid Poisson solver for 

gradient domain processing of massive images. Specifically, we 

provide a parallel out-of-core method for the seamless stitching 

of gigapixel panoramas in a parallel CUDA + MPI environment. In 

the second part, we shall cover the ongoing work of using novel 

visualizing techniques to understand performance data of 

heterogeneous computing clusters. The Poisson solver application 

shall be taken up as an example to demonstrate various features 

of this performance visualization tool. 

Speaker(s): Valerio Pascucci (Director of the Center for Extreme Data 

Management, Analysis and Visualization, University of Utah) 

Topic(s): Supercomputing, Visualization, Development Tools and 

Libraries (Beginner) 



ROOM J1 

S0705 Efficient AMG on Hybrid GPU Clusters 

This talk presents the implementation of an AMG solver for a hybrid 

cluster that exploits distributed and shared memory parallelization 

and uses the available GPU accelerators on each node. This solver 

has been written by using LAMA (Library for Accelerated Math 

Applications). This library not only provides an easy-to-use

framework for solvers that might run on different devices with 

different matrix formats, but also comes with features to optimize 

and hide communication and memory transfers between CPUs and 

GPUs. These features are explained and their impact on the 

efficiency of the AMG solver is shown. The benchmark results 

demonstrate that an efficient use of hybrid clusters is even possible 

for multi-level methods like AMG where fast solutions are needed 

on all levels for multiple problem sizes.

Speaker(s): Thomas Brandes (Senior Scientist, Fraunhofer Institute for 

Algorithms and Scientific Computing SCAI), Jiri Krau (Fraunhofer

Institute for Algorithms and Scientific Computing SCAI) 





Featuring Raytrix, Playcast and Universal Robotics







Speaker(s): Christian Perwass (CEO, Raytrix), Guy De Beer (CEO,

Playcast), David Peters (CEO, Universal Robotics) 


Enderle (Principal Analyst, Enderle Group), Flip Gianos (General Partner,

InterWest Partners), Jeff Herbst (VP of Business Development, NVIDIA) 



ROOM M 

S0639 Presented by Penguin 

Description unavailable at press time. 

Topic(s): General (Beginner) 


ROOM A7 

S0647 Effective HPC Architecture - Design, Develop, 

Implement (Presented by ELEKS) 

An effective HPC system is so much more than just GPGPU. Real-world

applications often need to stream large amounts of data from

across system boundaries to dozens of worker nodes in the most

scalable and efficient way. They usually require storing huge 

amounts of data, scheduling of computation jobs, monitoring of 

system health and results visualization. Having first-hand 

experience in design, development and implementation of end-to-end

HPC solutions, our engineers will share their experience on 

some of the pitfalls to avoid and things to consider when planning 

your next HPC system that works. 

Speaker(s): Oleh Khoma (Head of HPC Unit, ELEKS) 

Topic(s): Supercomputing; Application Design & Porting Techniques 

Intermediate (Beginner) 




Come to the NVIDIA Nsight Lounge to meet the Nsight development 

team! Whether you would like a private meeting to discuss specific 

product features or test out your application with the latest version 

of Nsight, or you just want to hang out with the team after attending 

one of the exciting training sessions, the lounge is a great place to

learn everything you ever wanted to know about the tool. 





S0096 Summed Area Ripmaps 

In this presentation, we show how ripmaps can replace Summed 

Area Tables (SATs) for the purpose of computing a large number 

of spatially varying box filter kernels throughout the input data, 

providing both higher accuracy and higher speed for typical use 

cases. For this purpose, we demonstrate an implementation of 

ripmap generation in CUDA C (accelerated by shared memory 

usage), and a texture-cache based box filter for spatially varying 

kernel sizes, which can be implemented in both CUDA C and 

graphics-based APIs (e.g. OpenGL and DirectX). 
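
For context, the sketch below shows the baseline technique that the session proposes to replace: a Summed Area Table, which lets the sum over any axis-aligned box be read with four lookups regardless of box size. It is a plain serial illustration with hypothetical function names and does not implement ripmaps.

```python
def summed_area_table(image):
    """Build a SAT with one extra row and column of zeros.

    sat[y][x] holds the sum of image[j][i] for all j < y and i < x.
    """
    height, width = len(image), len(image[0])
    sat = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        for x in range(width):
            sat[y + 1][x + 1] = (
                image[y][x] + sat[y][x + 1] + sat[y + 1][x] - sat[y][x]
            )
    return sat


def box_sum(sat, x0, y0, x1, y1):
    """Sum of image[y][x] over x0 <= x < x1 and y0 <= y < y1."""
    return sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]
```

Dividing `box_sum` by the box area gives the box filter response, so every pixel can use a different kernel size at constant cost. Because entries toward the bottom-right accumulate the whole image, SAT values grow large, which is one reason accuracy can suffer with limited-precision storage.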

Speaker(s): Gernot Ziegler (Compute Developer Technology, NVIDIA) 

Topic(s): Algorithms & Numerical Techniques, Computer Vision, 



ROOM K 

S0217 Efficient Implementation of CFD Algorithms on 

GPU Accelerated Supercomputers 

The goal of this session is to introduce the concepts necessary to 

perform large computational fluid dynamic (CFD) problems on 

collections of many GPUs. Communication and computation 

overlapping schemes become even more critical when using fast 

compute engines such as GPUs that are connected via a relatively 

slow interconnect (such as MPI on InfiniBand). The algorithms 

presented are validated on unsteady CFD simulations of 

turbulence using 192 graphics processors to update half-a-billion 

unknowns per computational timestep. The performance results 

from three different GPU accelerated supercomputers (Lincoln, 

Forge, and Keeneland) are compared with a large CPU based 

supercomputer (Ranger). 

Speaker(s): Ali Khajeh Saeed (PhD Candidate, University of 

Massachusetts, Amherst), Blair Perot (University of Massachusetts, 

Amherst) 

Topic(s): Computational Fluid Dynamics, Computational Physics, 

Supercomputing, Application Design & Porting Techniques (Intermediate) 


ROOM C 

S0311 Teaching Applied Parallel Computing with GPUs 

Learn how the next generation of HPC developers are learning 

hands-on skills with GPUs, and how GPU computing is being 

incorporated into Computer Science courses. We will discuss how 

GPUs are being used to enhance student learning of parallel 

computing concepts through a cross-teaching approach, where 

students with different domain expertise are grouped into teams 

and tasked with parallelizing an application such as ray tracing. 

We’ll show that student projects that emphasize optimization of 

architectural resources and performance tuning allow students

with no prior experience to parallelize a large-scale application with 

significant performance improvement in as little as six weeks. 

Speaker(s): Chris Lupo (Assistant Professor, California Polytechnic 

State University) 

Topic(s): General Interest, Ray Tracing (Intermediate) 


ROOM N 

S0346 GPGPU Accelerated Protein Similarity Measures 

Identifying Biological Relevant Structure 

Atomic structure similarity measures for proteins help in de novo 

protein structure prediction. For a large set of computationally 

generated protein structures (~20k) all pairwise similarities have to 

be calculated to cluster structures. Common similarity measures 

are root mean square deviation (RMSD) and global distance test 

total score (GDT_TS). Although GDT_TS has advantages over RMSD, 

it is not used because it is time-consuming to calculate. The

aforementioned and other similarity measures are ported for parallel

execution on GPGPUs to make them amenable to clustering de

novo generated structural models to find the largest cluster

representing the biologically relevant protein conformations.
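
To make the cost concrete: with roughly 20,000 structures there are about 200 million pairs to compare. The sketch below computes RMSD for every pair of models, assuming each model is a list of 3D atom coordinates in the same atom order and already superimposed. Real pipelines usually perform an optimal superposition first, which is omitted here, and the function names are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally sized coordinate lists.

    Assumes atoms correspond by index and that the structures are
    already superimposed (no alignment is performed).
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and equal length")
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(coords_a))


def pairwise_rmsd(models):
    """RMSD for every unordered pair of models, keyed by (i, j) with i < j.

    Each pair is independent of all others, which is what makes the
    computation a natural fit for parallel execution.
    """
    count = len(models)
    return {
        (i, j): rmsd(models[i], models[j])
        for i in range(count)
        for j in range(i + 1, count)
    }
```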

Speaker(s): Edward Lowe (Research Assistant Professor, Vanderbilt 

University), Nils Woetzel (Research Assistant, Vanderbilt University) 

Topic(s): Bioinformatics, Application Design & Porting Techniques 



ROOM A1 

S0425 File Sharing Plus Real Time Media and 

Document Collaboration 

Studiopass is a cloud based file sharing and visual collaboration 

tool which allows participants to collaborate on Microsoft 

documents and media files including 1080p video. It is

graphics-intensive and requires the best GPU performance to push

playback of heavy files. This session will discuss how NVIDIA

Tegra-powered devices deliver the graphics and video

performance needed for efficient collaboration and how the

new Tegra 3 Quad Core plus 1 will bring more acceleration.

Studiopass collaboration is not only accelerated by Tegra

devices but also leverages NVIDIA Tesla accelerated transcoding 

running on Amazon Web Services. 

Speaker(s): Kevin Jackson (Founder / CEO, Viewpartners) 

Topic(s): Mobile Applications & Interfaces, Cloud Computing (Beginner) 


ROOM L 

S0656 kdb+ and GPUs for Market Data Analytics 

and Trading 

Market data volumes increase year-on-year with the occasional 

extraordinary capacity-breaking peak. We must capture, store and 

process these data to gain insights for quantitative and 

algorithmic trading using a variety of market data analytics and 

techniques. kdb+ from KX Systems is a memory-based column 

database, written in the vector-functional language q, often used 

in finance for these analyses. In this session we demonstrate a 

method for the enhanced performance of general programs 

written in q and kdb+ by executing them on the GPU. 

Speaker(s): Philip A. Beasley-Harling (Bank of America Merrill Lynch) 

Topic(s): Finance (Beginner)


ROOM J1 

S0706 PISTON: Portability and Performance for Data- 

Parallel Visualization and Analysis Operators 

Due to the wide variety of current and next-generation 

supercomputing architectures, the development of high-performance

parallel visualization and analysis operators 

frequently requires re-writing the underlying algorithms for many 

different platforms. In order to facilitate portability, we have 

devised a framework for creating such operators that employs the 

data-parallel programming model. 

Speaker(s): Christopher Sewell (Los Alamos National Laboratory), 

Li-Ta Lo (Los Alamos National Laboratory) 

Topic(s): GPU/Hybrid Computing, Data Science and Visualization 



ROOM L 

S0653 C++ and CUDA Birds-of-a-Feather 

This birds-of-a-feather will provide an opportunity for C++ and 

GPU users to learn about how the powerful C++ language can be 

used on the CUDA platform. NVIDIA and guest speakers will 

present details of the latest C++ features in CUDA and the Thrust 

open source template library, as well as discuss some goals and 

directions for C++ on the CUDA platform. It will also provide 

attendees a valuable opportunity to network with other attendees 

and NVIDIA engineers who share their interest in C++. 

Speaker(s): Mark Harris (Chief Technologist, GPU Computing, NVIDIA) 





THURSDAY, MAY 17, 09:00 (25 MINUTES) 


S0057 GPU-Accelerated Molecular Dynamics Simulation 

of Solid Covalent Crystals 

An efficient and highly scalable algorithm for molecular dynamics 

(MD) simulation (using sophisticated many-body potentials) of solid 

covalent crystals is presented. Its effective memory throughput on a 

single C2050 GPU board reached 102 GB/s (81% of the peak), the 

instruction throughput reached 412 Ginstr/s (80% of the peak), and 

27% of the peak flops of a single GPU was obtained. Parallel 

efficiency of the algorithm can be as high as 95% on all 7168 GPUs 

of Tianhe-1A, reaching possibly a record in high performance of MD 

simulations, 1.87Pflops in single precision. 

Speaker(s): Wei Ge (Professor, Institute of Process Engineering, 


Topic(s): Molecular Dynamics, Algorithms & Numerical Techniques, 



ROOM A8 

S0129 A Monte Carlo Thermal Radiation Solver in GPU/ 

CPU Hybrid Architecture 

A Monte Carlo ray-tracing code is developed to predict radiative 

heat transfer behaviours in CFD simulation of combustion 

phenomena. Using the emission-reciprocal method, the random rays

cast from each node can be traced independently, which allows

parallel computation. The code is efficiently implemented on

hybrid GPU/CPU HPC resources using a dedicated dynamic

load-balancing strategy. Linear speedup scaling on hybrid HPC

resources is demonstrated for the calculation of

radiative heat transfer in a helicopter engine’s combustion

chamber, where adding one GPU to the HPC resource pool is

equivalent to adding nine CPU cores.

Speaker(s): Oliver Gicquel (Professor, Laboratoire E.M2.C, Ecole 

Centrale Paris), Gaofeng Wang (Postdoc Fellow, Laboratoire E.M2.C, 

Ecole Centrale Paris) 

Topic(s): Computational Fluid Dynamics, Computational Physics,

Ray Tracing (Intermediate)


ROOM A3 

S0133 Improving Mars Rover Image Compression Via 

GPUs And Genetic Algorithms 

Learn how to use Jacket to accelerate genetic algorithm (GA) 

image compression. Our research uses a GA to optimize lossy 

compression transforms that outperform state-of-the-art 

wavelet-based approaches for a variety of image classes,

including fingerprint, satellite, and medical images, and images transmitted

from the Mars Exploration Rovers. A typical training run evolves a

population of transforms over many generations; since each 

transform must be applied to each image from the training set, 

each run entails thousands of independent, parallelizable fitness 

evaluations. By using MATLAB and Jacket to perform 2D

convolution on the GPU, we have greatly reduced the total 

computation time needed. 

Speaker(s): Brendan Babb (Student/Research Technician, University of 

Alaska Anchorage) 

Topic(s): Machine Learning & AI, Audio, Image and Video Processing, 



ROOM N 

S0256 A Stencil Library for the New Dynamic Core 

of COSMO 

We will present a stencil library used in the heart of the COSMO 

numerical weather prediction model. During the talk we’ll show

how we implemented an abstraction that allows easy development 

of new stencils and solvers on top of a framework allowing 

execution on both CPU and GPU. The library makes efficient use 

of GPU resources and we will show how to structure memory 

accesses and computation optimally. Developers involved in 

porting or writing fully-featured C++ libraries for CUDA will also 

be interested in attending. 

Speaker(s): Tobias Gysi (Supercomputing Systems AG), 

Paul Messner (NVIDIA) 

Topic(s): Climate & Weather Modeling, Development Tools & 

Libraries (Advanced) 



S0302 Accelerating miniFE: A Finite Element 

Mini-application 

The Mantevo performance project is a collection of self-contained 

proxy applications that illustrate the main performance 

characteristics of important algorithms. miniFE is intended to be

an approximation to an unstructured implicit finite element or

finite volume application. Our work investigated algorithms for 

assembling a matrix on the GPU. Parallelization algorithms using 

both 1 thread and 8 threads per element were investigated. Using

these approaches, we achieved a significant speedup (over 60x for double

precision) compared to the serial algorithm.

Speaker(s): Justin Luitjens (Developer Technology, Compute, NVIDIA) 



ROOM C 

S0303 GPU Acceleration for Threshold Based Region 

Growth Algorithms 

Come learn how the massively parallel computing power of 

modern GPUs helps to create faster and more accurate volume

rendered images for the medical imaging community. Attendees 

of this session will gain insight into how GPUs can accelerate 

region growth algorithms and how these algorithms can be 

optimized for the latest generation of NVIDIA hardware. Topics 

covered will include fundamentals of region growth, GPU

implementations, and practical examples of vessel tracking 

algorithms based on GPU accelerated algorithms. 

Speaker(s): Supratik Moulik (Cardiovascular Imaging Fellow, University 

of Pennsylvania), Jason Walsh (University of Pennsylvania 3D Lab) 

Topic(s): Medical Imaging & Visualization, Bioinformatics (Beginner) 


ROOM A1 

S0326 Next Generation InfoWall 

Learn how you can use a multiple display configuration to render 

video content captured from multiple sources, utilizing the power 

of GPUs to achieve unprecedented performance. 

Speaker(s): Alina Alt (Applied Engineer, NVIDIA), Andrew Page (Sr. 

Product Manager, NVIDIA), Shalini Venkataraman (Senior Applied 

Engineer, NVIDIA), Ian Williams (NVIDIA) 

Topic(s): Visualization, Computer Graphics (Intermediate) 


ROOM B 

S0333 GMAC-2: Easy and Efficient Programming for 

CUDA-Based Systems 

In this talk we introduce GMAC-2, a framework that eases the 

development of CUDA applications and tools while achieving 

similar or better performance than hand-tuned code. The new 

features implemented in GMAC-2 allow programmers to further 

fine-tune their code and remove some limitations found in the 

original GMAC library. For example, memory objects can now be

arbitrarily mapped on several devices without restrictions and a 

host thread can launch kernels on any GPU in the system. 

Moreover, GMAC-2 transparently takes advantage of new

hardware features such as GPUDirect 2 peer-to-peer

communication.

Speaker(s): Javier Cabezas (PhD Student, Barcelona Supercomputing 

Center), Isaac Gelado (Senior Researcher, Barcelona 

Supercomputing Center) 



ROOM M 

S0347 Accelerating Radio Astronomy Cross-Correlation 

beyond 1 Tflops using Fermi 

Radio astronomy is a signal processing application that requires 

extreme supercomputing. While today’s radio telescopes require 

10-100 Tflops of computational power, by the end of the decade 

this will increase to 1 Exaflops. The most compute intensive part 

of this problem is the so-called cross-correlation algorithm, which 

is a linear-algebra problem. In this session we demonstrate that 

the Fermi architecture is ideally suited to this problem, and 

through exploiting the Fermi memory hierarchy it is possible to 

achieve close to 80% of peak performance in a real application. 

Speaker(s): Michael Clark (Compute DevTech Engineer, NVIDIA) 

Topic(s): Astronomy & Astrophysics, Supercomputing (Intermediate) 


HALL 1 

S0362 Maximizing Performance on Multi-GPU Systems 

Are 512 CUDA Cores not enough? This session is for power users 

that are looking to scale applications to multi-GPU systems. We 

will take a holistic approach towards optimization. Rather than 

just focusing on CUDA programming, this session will cover 

techniques for reducing pressure on the PCIe bus, using CUDA 

Streams to improve load balance, dealing with NUMA impacts, 

and taking advantage of CPU threads. This talk will also cover 

strategies for developing applications that run on clusters with 

100 or more GPUs. 

Speaker(s): Kenneth Czechowski (Student, Georgia Tech) 

Topic(s): Supercomputing (Advanced) 


ROOM L 

S0619 Hate to Wait? Flash Memory for Full-Throttle 

GPU Acceleration (Presented by Fusion-io) 

Are you guilty of ever not trying out an idea because of the time it 

would take to process the effect? With flash memory throttling your 

system like jet fuel for your GPU, you can finally make sluggish 

application performance a bad memory. This session will couple a 

technical overview of the latest in PCIe-attached flash memory 

technology for accelerating graphics processing with developer best 

practices and tuning for GPU applications using flash memory for 

image compositing, editing, video playback, 3D content creation, 

video capture and many other data-intensive tasks. 

Speaker(s): Vincent Brisebois (Visual Computing Product Manager, 

Fusion-io), Robert Wipfel (Fellow, Fusion-io) 

Topic(s): Digital Content Creation & Film, Computer Graphics 



ROOM A2 

S0648 Presented by ASUS 

Description unavailable at press time. 

Topic(s): General 


ROOM J2 

S0707 Accelerated HPC Symposium: Scalability: 

Hardware and Software (Presented by LANL) 

This session will feature an introduction by Justin Tripp, followed 

by a short talk on “The FPGA: Another Piece of the Puzzle,”

followed by a talk on “Increasing Efficiency with Kepler.” After a

short discussion and break, we’ll end this session with three short

talks, “Image Analysis for Terascale Radio Astronomy,” “In situ

Image Analysis for Large Scale Visualization,” and “GPU

Acceleration of MapReduce.”

Speaker(s): Justin Tripp (LANL), Stephen Jones (NVIDIA), Christopher

Fluke (Swinburne University of Technology), Christopher Sewel (LANL),

Miao Xin (Junnan University) 



ROOM J3 

S0708 Accelerated HPC Symposium: Applications - 

Methods and Programming Models, Part 1 (Presented 

by LANL) 

This session will feature an introduction by Guillaume Colin de 

Verdiere, followed by a short talk on “Precondition for Large-Scale 

Linear Solvers.” Following this segment are two short talks on

“Changing Data Structures for a Changing World” and

“Leveraging Roadrunner Experiences.” After a short discussion

and break, we will end Part 1 of 2 with the talk “Taming

Laser Plasma Interactions: PIConGPU”. 

Speaker(s): Dimitar Lukarski (Karlsruhe Institute of Technology), Hui 

Liu (University of Calgary) and Michael Bussmann (Helmholtz-Zentrum 

Dresden-Rossendorf), Jamal Mohd-Yusof (Los Alamos National 

Laboratory) 














Topic(s): Development Tools & Libraries (Beginner)


ROOM A3 

S0081 Parallel Computing In Mobile Robotics for RISE 

RISE (Risky Intervention and Surveillance Environment) is a very

demanding task. This presentation discusses three areas of

research: 3D data registration, robot navigation, and 3D point

cloud processing. We show an approach based on robust k-nearest

neighbor (KNN) search applied to improve the ICP algorithm, and a

parallel path planning approach based on the wave propagation

method. Online segmentation of 3D point clouds based on normal

vector computation is also presented. The proposed algorithms

were tested on an NVIDIA GF 580 GPU with CUDA, with

satisfactory results.

Speaker(s): Janusz Bedkowski (Researcher) 

Topic(s): Machine Vision (Beginner) 


ROOM K 

S0238 Tesla Cluster Monitoring & Management APIs 

Learn more about cluster management and monitoring of Tesla 

and Quadro products. This includes a detailed description of the 

NVIDIA Management Library (NVML) and user facing third party 

software. Additionally, a brief summary of our out-of-band 

capabilities will be provided. 

Speaker(s): Robert Alexander (CUDA Tools Software Engineer, NVIDIA) 

Topic(s): Cluster Management (Beginner) 


ROOM A8 

S0264 CU++: An Object-Oriented Framework for 

Computational Fluid Dynamics (CFD) Applications 

In this session, I will elucidate the power of blending C++ 

expression templates and CUDA, which has resulted in a smart

framework, CU++, for solving Computational Fluid Dynamics

problems on structured and unstructured meshes. Briefly, CU++ 

allows a code developer with just C/C++ knowledge to write 

computer programs that will execute on the GPU with minimal 

knowledge of specific programming techniques in CUDA. It allows 

the user to reuse existing C/C++ CFD codes with minimal 

changes. Codes written in CU++ can also be compiled in serial 

mode to be executed on a CPU using the tool ugc. 

Speaker(s): Dominic Chandar (Postdoctoral Research Associate, 

University of Wyoming) 

Topic(s): Computational Fluid Dynamics, Algorithms & 




S0290 Algorithm Acceleration for Geospatial Analysis 

Learn how the power of GPU computing is being leveraged to 

accelerate algorithms in the field of geospatial image analysis. 

The data volume and computation requirements associated with 

geospatial imagery are rapidly expanding as a result of the 

increasing number of satellite and airborne sensors, greater data 

accessibility, and expanded utilization of data intensive 

technologies. This equates to a growing need for high-performance

computing in this field. We demonstrate the capacity 

for GPU computing to meet this need by accelerating a complex 

non-linear optimization algorithm used for the mapping and 

assessment of coral reef ecosystems. 

Speaker(s): James Goodman (President/CEO, HySpeed Computing 

LLC), Matthew Sellitto (Northeastern University) 

Topic(s): Algorithms & Numerical Techniques, General Interest 




S0354 Bcl::ChemInfo Suite Enables Machine Learning- 

Based Drug Discovery Using GPUs 

High-throughput screening data allows the training of machine 

learning quantitative structure activity relationship models which 

can be used for in silico drug discovery screening. Here, we present 

a GPU-accelerated suite for descriptor generation, model training,

feature selection, and data set similarity analysis, bcl::ChemInfo. 

The suite provides functionality for the analysis of constructed 

models as well as for screening external libraries of compounds. 

We examine case studies illustrating how this workflow can now be 

completed in a single day on a Tesla equipped workstation with 

speedups reaching 300x, providing a complete GPU-accelerated

cheminformatics framework for drug discovery. 

Speaker(s): Edward Lowe (Research Assistant Professor, Vanderbilt 

University), Nils Woetzel (PhD Candidate, Vanderbilt University) 



HALL 1 

S0360 Set GPUs Free: Integrating a File System with 

CUDA Programs 

This session seeks the answer to the question: “Can we simplify 

and speed up CUDA programs by allowing them to access files 

residing on a host?” To prove our affirmative answer, we 

demonstrate how the concept of a file system enables programs 

with non-trivial CPU-GPU and GPU-GPU interactions to be 

efficiently and easily implemented on top of a new GPU file-system 

layer. We also show that such a file system enables implementation 

of fully stand-alone GPU programs without any CPU wrapper code. 

Finally we outline the details of the file system design which 

contributed to scalability, data consistency and performance. 

Speaker(s): Mark Silberstein (Post-doctoral Researcher, UT Austin), 

Emmett Witchel (University of Texas, Austin)



ROOM A5 

S0621 NVIDIA OpenACC 

OpenACC is a directives-based programming standard for parallel 

computing on accelerators (including GPUs). It is designed to 


systems easily and quickly. Adding simple compiler hints to your 

code to express parallelism allows the compiler to map

computation onto an accelerator. OpenACC directives allow 

developers to make simple and portable code changes, enabling an 

easier migration to accelerated computing. This talk discusses the 

merits of this model and provides an overview of, and guidance on, the

tools available to the developer from the OpenACC members. 

Speaker(s): Duncan Poole (Senior Manager, HPC, NVIDIA) 




S0039 Data-Driven GPGPU Ideology Extension 

In this session we will demonstrate how the GPGPU ideology can

be extended so that it can be used at the scale of an InfiniBand hybrid

system. The approach that we are presenting combines delayed

execution, scheduling techniques and, most importantly, maps

the CPU multi-core ideology onto that of the streaming

multiprocessor, enforcing a full-fledged “GPGPU as a coprocessor”

way of programming for large-scale MPI hybrid

applications. While staying compatible with modern CPU/GPGPU

libraries, it provides more than fine-grained control over

resources (more than you wanted, that is).

Speaker(s): Bela Bauer (Postdoc, Microsoft Research), Alexandr 

Kosenkov (Software Engineer, University of Geneva) 

Topic(s): Application Design & Porting Techniques, Computational 

Physics, Parallel Programming Languages & Compilers, Development 

Tools & Libraries (Advanced) 


ROOM N 

S0053 Real Time GPU-Based Marine Scenes Simulation 

Marine survey, carried out by sea or by air, is of major concern for 

current defense and security applications. Essential surveillance/ 

observation/ identification systems involve electro-optics (visible 

and infra-red) and radar. Optimizing their performance requires

large amounts of expensive observational data spanning the wide

variability of the marine environment. Computer simulation

provides a valuable, flexible and inexpensive alternative. Since

2007, ALYOTECH, in partnership with the IFREMER (French 

Research Institute for Exploration of the Sea), has been developing 

a GPU-based real-time ocean scene simulator for visible, infrared 

and radar sensors, in order to meet the challenging requirements 

arising from marine survey issues. 

Speaker(s): Jérôme Graindorge (Project Manager, ALYOTECH), Julien 

Houssay (Software Engineer, ALYOTECH) 

Topic(s): Climate & Weather Modeling, Visualization (Intermediate) 


ROOM B 

S0078 Panoptes: A Binary Instrumentation Framework 

for CUDA 

Traditional CPU-based computing environments offer a variety of 

binary instrumentation frameworks, while the instrumentation 

and analysis tools available to date for GPU environments have 

been more limited. Here we present Panoptes, a binary 

instrumentation framework for CUDA that targets the GPU. By 

exploiting the GPU to run modified kernels, Panoptes allows 

computationally intensive programs to be run at the native 

parallelism of the device during analysis. To demonstrate the 

instrumentation capabilities of Panoptes, we will present our work 

on a memory addressability and validity checker that targets 

CUDA programs. 

Speaker(s): Christopher Kennelly (Research Scientist, 

D. E. Shaw Research) 

Topic(s): Development Tools & Libraries (Advanced) 


ROOM M 

S0124 Signal Processing on GPUs for Radio Telescopes 

This session will present GPU implementations of four highly 

compute-intensive algorithms used by radio telescopes. 

Speaker(s): John Romein (Senior Researcher, ASTRON) 

Topic(s): Astronomy & Astrophysics (Intermediate) 


ROOM C 

S0244 Harnessing GPU Compute with C++ AMP (Part 2 of 2) 

C++ AMP is an open specification for taking advantage of 

accelerators like the GPU. In this session we will explore the C++ 

AMP implementation in Microsoft Visual Studio 11. After a quick 

overview of the technology, its goals and its

differentiation compared with other approaches, we will dive into

the programming model and its modern C++ API. This is a code-heavy,

interactive, two-part session, where every part of the

library will be explained. Demos will include showing off the 

richest parallel and GPU debugging story on the market, in the 

upcoming Visual Studio release. 

Speaker(s): Daniel Moth (Principal Program Manager, Microsoft) 

Topic(s): Parallel Programming Languages & Compilers, Development 

Tools & Libraries (Intermediate) 


ROOM A8 

S0305 Classical Algebraic Multigrid for CFD with CUDA 

Classical algebraic multigrid (AMG) is one of the most popular 

algorithms used in engineering, and the engine in many 

successful commercial packages. Among sparse linear solvers, it 

is known for being fast, parallel and scalable, yet it maps to GPU 

architecture with some considerable difficulty. We have tackled 

these difficulties and currently have a full CUDA implementation 

of classical AMG, which has been validated against the gold standard,

Hypre. Significant effort was dedicated to reducing 

thread divergence and optimizing memory access, and we 

continue to work on performance improvements. We are aiming 

for a competitive AMG code for fluid dynamics applications. 

Speaker(s): Simon Layton (PhD Candidate, Boston University) 





S0315 Probing Bio-Nano Interface Structure from 

Microsecond Molecular Dynamics on GPUs 

Using the latest algorithmic developments in molecular dynamics

on multiple GPUs over MPI, and technologies like GPUDirect, it is

now possible to address problems of interaction at the bio-nano

interface via large-scale atomistic simulations. This talk will

discuss aspects of DNA-nanotube interactions and SWCNT-induced

conformational changes in DNA nucleosome structure.

We will also address technical challenges in porting and tuning

the AMBER 11 code on the Condor GPU cluster at AFRL.

Speaker(s): Olexandr Isayev (Research Scientist, Case Western 

Reserve University) 

Topic(s): Molecular Dynamics, Life Sciences (Advanced) 


ROOM A1 

S0324 Content Generation and Real-Time Hologram 

Computation for Holographic 3D-Displays 

This session will introduce SeeReal’s sub-hologram technology to 

massively reduce hologram computation effort in comparison to 

classic holography and how SeeReal implemented those still 

compute intensive algorithms to execute on the GPU to enable 

viewing of interactive, rich 3D-content on holographic 3D-displays 

using off-the-shelf graphics hardware. In contrast, you will

explore why classic holography is not well suited to interactive

applications. Furthermore, guidelines to create appropriate

3D-content are presented, including aspects regarding

transparency in holograms. Finally, the specification and some

impressions of SeeReal’s 20” holographic prototype will be 

presented, which allows viewing of live computed holograms 

showing 3D-content and 3D-video. 

Speaker(s): Enrico Zschau (Lead Software Architect, SeeReal 

Technologies GmbH) 

Topic(s): Visualization, Stereoscopic 3D, Algorithms & Numerical 

Techniques, Audio, Image and Video Processing (Beginner) 


HALL 1 

S0338 New Features In the CUDA Programming Model 

The continuing evolution of the GPU brings with it new hardware 

capabilities and new functionality. Simultaneously, ongoing 

development of CUDA and its tools, libraries and ecosystem 

brings new features to the software stack as well. Come and learn 

from one of CUDA’s programming model architects about what’s

new in the GPU, what’s coming in the next release of CUDA, how it 

works, and how it all fits together. 

Speaker(s): Stephen Jones (CUDA Developer, NVIDIA) 



ROOM A2 

S0508 Faster Finite Elements for Wave Propagation Codes 

Learn how to develop faster and better finite-element codes for 

wave propagation using GPUs and MPI combined with overlapping 

techniques to hide the cost of communications and of host/device 

memory copies. Different options based on mesh coloring or on 

atomic operations will be presented. The difficulty of defining

speedup will also be discussed (speedup versus what? Using what 

definition of “cost”?). Examples will be given using SPECFEM3D, a 

highly optimized spectral finite-element code that has won the 

Gordon Bell SuperComputing award and the BULL Joseph Fourier 

award, and that can run on CPU or GPU clusters. 

Speaker(s): Max Rietmann (PhD Student, Institute for Computational 

Science / USI Lugano, Switzerland) 

Topic(s): Algorithms & Numerical Techniques, Computational 

Physics (Intermediate) 


ROOM A3 

S0521 Desktop Supercomputing in the Soft-Matter 

Physics Laboratory 

While many GPGPU applications reside on large clusters, in many 

laboratories the time to move data to an external cluster would 

exceed the time to analyze it upon arrival. By bringing high-throughput

computational power to the data in the laboratory, 

GPUs offer new capabilities in doing science. This session offers a 

number of ways in which GPUs are making a significant impact on 

our research in experimental physics, biology and chemistry, from 

designing and building apparatus (Quadro and Tesla), to collecting 

data on portable devices (Tegra), to high-throughput analysis of 

large data sets (Tesla). It also presents results from studies 

investigating the motion of diffusing and aggregating colloidal 

particles and swimming bacteria, observing liquid-gas phase 

separation onboard the International Space Station, applying high 

dynamic-range techniques to optical tomography, and using 

low-cost devices to detect chemical and microbial contamination 

in the third world. 

Speaker(s): Peter Lu (Post-Doctoral Research Fellow, Harvard University) 



ROOM A5 

S0622 The PGI Fortran and C99 OpenACC Compilers 

Experienced GPU programmers will learn about the latest PGI 

OpenACC Fortran and C compilers. This session discusses how 

and where to apply the Parallel and Kernels constructs and the 

differences between the two. It includes a review of the latest PGI 

release and a comparison of the OpenACC standard to the PGI 

Accelerator Model. A live component demonstrates how to interpret

compiler feedback and how to use it to enable better performance 

and how to inter-operate with lower-level explicit GPU languages 

like CUDA and OpenCL. The presentation wraps up with a look at 

planned future enhancements. 

Speaker(s): Brent Leback (Portland Group) 



ROOM L 

S0644 Molecule Dynamics, GPUs, and EC2 (Presented by 

Amazon Web Services) 

GPUs have made molecular dynamics simulations faster, better, 

and cheaper, achieving supercomputer performance from a single 

GPU without sacrificing stability or accuracy. In this talk we 

demonstrate how the GPU refactoring of AMBER 12 Molecular 

Dynamics has led to an implementation that produces results that 

are indistinguishable from the original CPU code. In addition, we 

describe the GPU compute instances available on the Amazon EC2 

platform to show how anyone can run any number of AMBER 12 

simulations, anytime from anywhere. 

Speaker(s): Scott Le Grand (Principal Engineer, Amazon Web Services) 

Topic(s): Molecular Dynamics; Computational Fluid Dynamics 















ROOM A2 

S0079 Warped Parallel Nearest Neighbor Searches 

Using KD-Trees 

We propose a nearest neighbor search algorithm for a set of 

closely located query points that utilizes GPU parallelism and is 

optimized for a single CUDA warp. Instead of each query point 

traversing its own distinct path, a combined non-divergent path 

suitable for the entire query set can be constructed. Therefore, for a

single warp a single stack can be maintained for the entire set of 

query points, allowing for efficient utilization of the shared 

memory and a number of simultaneous queries equal to the 

number of threads in a warp. 


Speaker(s): Roman Sokolov (Director of System Architecture, D4D 

Technologies), Andrei Tchouprakov (Director of System Architecture, 

D4D Technologies) 



ROOM N 

S0107 Acceleration of Long-Wave Rapid Radiative

Transfer Model on GPGPU 

The WRF model is a next-generation mesoscale numerical 

weather prediction system designed to serve both operational 

forecasting and atmospheric research communities. WRF offers 

multiple physics options, one of which is the Long-Wave Rapid 

Radiative Transfer Model. We found porting the rtrn() subroutine to

CUDA challenging. It has a couple of recursive loops, for which

GPGPUs are not well suited. We developed a new technique

called loop inversion, which helped us achieve a 7.7x speedup for

the individual rtrn() subroutine without memory transfer and, in

turn, a 10x speedup for the overall RRTM module including initialization

and memory transfer.

Speaker(s): Mahesh Khadtare (PhD Student - Scientist ESP, I2IT, Pune 

University), Prakalp Somawanshi (CRL India) 

Topic(s): Climate & Weather Modeling, Application Design & Porting

Techniques (Intermediate)



S0122 Computational Screening of Novel Carbon 

Capture Materials 

Discover how GPUs are used to identify optimal framework 

structures for carbon dioxide separation with the goal of reducing 

carbon emission. We describe the algorithm behind our GPU 

software tool that iterates through a database of hypothetical 

zeolites and computes the selectivity of each of the structures. 

The code can be easily extended to simulate other adsorbent 

structures such as ZIFs (zeolitic imidazolate frameworks) and 

provide valuable insights to both theorists and experimentalists 

who have interest in carbon capture research. 

Speaker(s): Jihan Kim (Postdoctoral Researcher, Berkeley Lab), 

Berend Smit (UC Berkeley/Berkeley Lab) 

Topic(s): Molecular Dynamics (Intermediate) 


ROOM A1 

S0252 Building Real-Time Professional Visualization 

Solutions with OpenCL 

Professional visualization solutions, like high-quality high-resolution

medical displays or very large screens for surveillance 

or entertainment, benefit from the GPU’s image and graphics

compute capabilities to achieve real-time performance, but add 

specific constraints, like low-latency, multiple HD streams and 

strict synchronization. This talk first motivates the industrial 

relevance of development in OpenCL on heterogeneous devices. It 

then explains the techniques currently explored to meet the 

specific design constraints, with a main focus on parallel data 

transfer and compute. The lessons learned are illustrated with a 

real-life example. 

Speaker(s): Kristof Denolf (Research Engineer, Barco), Ronny Dewaele 

(Director Technology Center, Barco) 

Topic(s): Audio, Image and Video Processing, Visualization (Intermediate) 



S0291 LAtoolbox: A Multi-platform Sparse Linear 

Algebra Toolbox 

Find out about an easy way for building sparse linear solvers for 

GPUs and multi-/many-core platforms. Based on data abstraction 

and virtualization of the hardware, the LAtoolbox supports several 

platforms such as GPUs, multi-core CPUs, and accelerators. The 

various backends (CUDA, OpenCL, OpenMP, ...) utilize optimized and 

platform-specific routines and allow seamless integration of GPUs 

into scientific applications. By means of unified interfaces across all 

platforms the library enables you to build generic linear solvers and 

preconditioners on a single code base without specific information of 

your hardware. We demonstrate portability and flexibility of our 

open-source approach on heterogeneous platforms. 

Speaker(s): Dimitar Lukarski (Research Associate, Karlsruhe Institute 

of Technology (KIT)), Jan-Philipp Weiss (Junior Professor, Karlsruhe 


Topic(s): Application Design & Porting Techniques (Intermediate) 


ROOM K 

S0309 Dynamically Allocating GPGPU to Host 

Nodes (Servers) 

Learn how to remotely change the mapping of GPUs to hosts 

based on application needs. The audience will then be presented with

example scripts and a demo illustrating how this can be 

implemented to improve system resource utilization. 

Speaker(s): Alaa Yousif (Software Solution Architect, Dell), Saeed Iqbal 

(Senior Systems Engineer, Dell) 

Topic(s): Cluster Management (Beginner) 


KEYNOTE HALL 1 

S3002 Day 3 Keynote: Not Your Grandfather’s Moon 

Landing 

Do not miss the day 3 keynote, featuring Part-Time Scientists 

Robert Boehme and Wes Faler. Boehme and Faler are part of a 

team of international scientists and engineers who want to send a 

rover to the moon before the end of the year 2013. In this 

presentation, they will discuss their goals, recent 

accomplishments and milestones, and how GPUs have helped in

unexpected ways. 

Speaker(s): Robert Boehme (CEO & Team Lead, Part-Time Scientists), 

Wes Faler (Head of Software Development, Part-Time Scientists) 



ROOM N 

S0044 A Massively Parallel Two-Phase Solver for 

Incompressible Fluids on Multi-GPU Clusters 

Join our presentation of a multi-GPU fluid solver for high 

performance GPU compute clusters. We use high-order scientific 

techniques to simulate the interaction of two fluids like air and 

water. Scientists, engineers and even the computer animation 

industry will profit from the enormous compute power of tens or 

hundreds of GPUs. A major focus in this talk will be on the applied 

GPU implementation techniques and the performance results 

including performance per Watt and performance per dollar 

results. We also highlight the lessons we learned from porting the 

complex CPU CFD code NaSt3DGPF to the GPU. 



Speaker(s): Peter Zaspel (Research Assistant, University of Bonn) 

Topic(s): Computational Fluid Dynamics, Supercomputing, Algorithms & 

Numerical Techniques, Digital Content Creation & Film (Intermediate) 


ROOM C 

S0054 PFAC Library: GPU-Based String Matching Algorithm 

In this session, we first propose an exact string matching

algorithm, called the Parallel-Failureless Aho-Corasick (PFAC)

algorithm, which is used to match input texts against a set of

string patterns on GPUs. The string patterns are compiled into a

finite state machine similar to that of the well-known Aho-Corasick

algorithm. Furthermore, to accommodate a large number of

patterns, we present two kinds of hash functions which are 

adopted to compress the state transition table. The experimental 

results show that the PFAC library achieves significant 

performance on NVIDIA GPUs. Finally, the PFAC library has been 

released on Google code (http://code.google.com/p/pfac/). 

Speaker(s): Cheng-Hung Lin (Associate Professor, National Taiwan 

Normal University) 

Topic(s): Development Tools & Libraries, Algorithms & Numerical 

Techniques (Beginner) 


ROOM K 

S0119 Best Practices for Architecting and Managing 

High-Performance GPU Clusters 

An overview of designing, deploying, and managing GPU clusters 

for HPC. Learn to build and operate top500-class GPU computing 

resources that provide users with the latest CUDA features. 

Speaker(s): Dale Southard (Senior Solution Architect, NVIDIA) 

Topic(s): Cluster Management, Supercomputing (Intermediate) 


ROOM M 

S0187 GPUs for Radio Imaging 

With the advent of a new breed of telescopes like the Low

Frequency Array (LOFAR), which rely on software to process

the large data sets that they generate, there is a need for

that software to run as fast as possible so that the data can be

processed in a reasonable time. In this session we

describe how we have used the computing power of GPUs to

improve the performance of standard radio imaging

techniques, as well as how this computational power is useful for

creating a new generation of radio imaging algorithms.

Speaker(s): Vamsi Krishna Veligatla (GPU Programmer, University 

of Groningen) 

Topic(s): Astronomy & Astrophysics (Intermediate) 


ROOM L 

S0285 Optimization of a Sparse Matrix-Matrix 

Multiplication on the GPU 

The goal of this session is to present advanced techniques to 

optimize CUDA code on the GPU. In particular, we will 

demonstrate the use of advanced CUDA instructions (inline PTX, 

warp instructions, “extended” syncthreads) and load-balancing 

strategies to improve the performance of a sparse matrix-matrix 

multiplication on the GPU. 

Speaker(s): Julien Demouth (Developer Technology Engineer, NVIDIA) 

Topic(s): Algorithms & Numerical Techniques (Advanced) 


ROOM B 

S0320 PTask: OS Support for GPU Dataflow Programming 

This session considers the PTask API, a set of OS-level abstractions that

support GPUs as first-class computing resources and a

dataflow programming model. With PTask, the programmer 

specifies where data goes, rather than how and when it should get 

there, allowing the system to provide fairness and isolation 

guarantees, streamline data movement in ways that currently 

require direct programmer involvement, and enable code 

portability across diverse GPU-based platforms. Our experience

building the PTask APIs shows that PTask can provide important 

system-wide guarantees and can enable significant performance 

benefits, for example improving the throughput of hand-tuned 

CUDA programs by up to 2x. 

Speaker(s): Jon Currey (Microsoft Research Silicon Valley), Christopher 

Rossbach (Researcher, Microsoft Research Silicon Valley) 

Topic(s): Development Tools & Libraries, General Interest, Parallel 

Programming Languages & Compilers (Advanced) 



S0378 VASP Accelerated with GPUs 

This session will detail the performance and capabilities of 

GPU-accelerated VASP, explain design decisions made in porting 

VASP to CUDA, and present a roadmap for GPU accelerated 

VASP development. We’ve achieved performance improvements 

up to around 20x on systems of around 100 ions and have 

implemented exact-exchange. We are working on ports of more 

conventional functionality. 

Speaker(s): Maxwell Hutchinson (PhD Student, University of Chicago) 

Topic(s): Quantum Chemistry, Application Design & Porting 

Techniques, Computational Physics (Intermediate) 


ROOM J1 

S0709 Accelerated HPC Symposium: Applications - 

Methods and Programming Models: Part 2 (Presented 

by LANL) 

This session is Part 2 of Applications - Methods and Programming

Models and will feature short talks on “The Portability Wall: How

hard can it really be?,” followed by a talk on “Accelerating NAMD”

as well as “Refitting Legacy Software for the New Reality” and

“Unstructured Data Structures: An Achilles Heel?” After a

discussion and break, the session will end with short talks on

“Power: The New Metric” and “It’s about Concurrency, Stupid!” 

Speaker(s): John Stone (Urbana Champaign), James Phillips 

(University of Illinois), John Humphrey (EM Photonics), Raphael Poncet 

(CEA), Simon MacIntosh-Smith (University of Bristol), Stanley Tzeng 

(UC Davis) 











of comprehensive exercises, the attendee will be able to utilize






ROOM M 

S0022 Scalable Frameworks and Algorithms for 

Terascale Radio Astronomy Images 

Learn how the oldest science is using the newest processors to 

solve a critical problem: how to accomplish traditional image 

analysis and visualization tasks when the images are terabytes in 

size? Simple, standard operations such as displaying 2-d slices, 

evaluating image statistics, and applying histogram equalization 

become manifestly challenging when images dramatically exceed 

single-node memory capacity. We will explain how our hybrid 

CPU-GPU cluster framework – which can volume render a 200GB 

image at >50fps! – will support traditional radio astronomy tasks 

for the colossal images that the Square Kilometre Array and its 

precursor, the Australian SKA Pathfinder, will generate. 

Speaker(s): Christopher Fluke (Senior Lecturer, Swinburne University of 

Technology - Centre for Astrophysics and Supercomputing) 

Topic(s): Astronomy & Astrophysics, Visualization (Intermediate) 


ROOM C 

S0032 Teraflop GPU Acceleration Of Large Matrix Algebra 

Learn how Multipath’s Fast Matrix Solver (FMS) is setting 

performance records using multiple GPUs to solve large matrices 

in production applications. By (1) leveraging NVIDIA’s CUBLAS 

library, (2) operating multiple GPUs in parallel and (3) overlapping 

data transfers with computation, FMS averages over 2 teraflops of 

performance, even on jobs lasting for days. The presentation also 

includes a description of what problems FMS solves and how it is 

incorporated into applications programs. 

Speaker(s): Ronald Young (President, Multipath Corporation) 

Topic(s): Development Tools & Libraries, General Interest (Beginner) 


ROOM L 

S0106 GPU Based Numerical Methods in Mathematica 

A fast way of developing, prototyping and deploying numerical 

algorithms that can take advantage of CUDA capable systems is 

available in Mathematica 8. Over the past year, educators, 

scientists, and business users have taken advantage of the 

benefits of GPU programming support in Mathematica. By 

integrating and implementing CUDA/OpenCL in their programs, 

users make use of a hybrid approach, combining the speed-up 

that GPUs offer and a powerful numerical development system. In 

this presentation, we show several examples of numerical 

applications, ranging from deconvolution of MRI images and linear 

solvers for FEM to systems of ODEs and line integral convolution 

visualization. 

Speaker(s): Ulises Cervantes-Pimentel (Senior Kernel Developer, 

Wolfram Research), Abdul Dakkak (Kernel Developer, Wolfram Research) 

Topic(s): Algorithms & Numerical Techniques, Visualization, 

Application Design & Porting Techniques, Development Tools & 

Libraries (Intermediate) 



S0231 Levenberg-Marquardt Using Block Sparse Matrices 

on CUDA 

This session describes the experiences of constructing GPU based 

matrix-vector functions for block sparse matrices having multiple 

block sizes and a domain-specific numerical Jacobian generation 

function. The bundle adjustment algorithm is an optimization 

procedure which attempts to refine the relative camera pose, and 

3D structure location variables, estimated from multiple sets of 

images. The Conjugate Gradient algorithm is used to solve the 

normal equations which appear in the inner loop to the non-linear 

least squares problem. 

Speaker(s): Tetsuo Tawara (Software Engineer, Koozyt) 




ROOM C 

S0071 The High-Level Linear Algebra Library ViennaCL 

and Its Applications 

Get to know ViennaCL, a high-level OpenCL linear algebra 

library that delivers the speed of GPU computing at the 

convenience level of the C++ Boost libraries. Decrease the 

development and execution time of applications by utilizing our 

well-tested and widely used library, instead of spending days on 

learning details of GPU architectures and debugging. We provide 

examples that demonstrate not only how quickly existing 

applications are ported efficiently from single-threaded execution 

to fully utilizing multi-threaded environments, but also how to 

utilize the rich set of functionalities ranging from common BLAS 

routines to iterative solvers. 

Speaker(s): Karl Rupp (Project Assistant, TU Wien) 

Topic(s): Development Tools & Libraries, Algorithms & Numerical 

Techniques, Computational Physics (Intermediate) 


ROOM M 

S0087 GPU Acceleration of Dense Stellar 

Clusters Simulation 

Computing the interactions between stars within dense stellar 

clusters is a problem of fundamental importance in theoretical 

astrophysics. This paper presents the parallelization of a Monte 

Carlo algorithm for simulating stellar cluster evolution using 

programmable Graphics Processing Units. The kernels of this 

algorithm exhibit high levels of data dependent decision making 

and unavoidable non-contiguous memory accesses. However, we 

adopt various parallelization strategies and utilize the high 

computing power of the GPU to obtain substantial near-linear 

speedups which cannot be easily achieved on a CPU-based 

system. This acceleration allows us to explore physical regimes 

that were out of reach of existing simulations. 

Speaker(s): Bharath Pattabiraman (PhD Student, Northwestern University), 

Stefan Umbreit (Postdoctoral Associate, Northwestern University) 

Topic(s): Astronomy & Astrophysics, Computational Physics, Algorithms 

& Numerical Techniques (Intermediate) 


ROOM N 

S0091 Sustainable Hybrid Parallelization of an 

Unstructured Hydrodynamic Code 

The goal of this presentation is to share our methodology for 



porting a numerical code to hybrid supercomputing architectures 

using MPI coupled with directive-based languages (OpenMP for 

multicore CPUs, and HMPP for GPUs). Our code, VOLNA, is an 

unstructured partial differential equation hydrodynamic solver 

developed for the simulation of tsunamis. Our results 

demonstrate that using directive-based languages such as HMPP 

for GPU programming, one can retain good performance (e.g. 

speedup of 15 compared to 1 CPU core, 3 compared to 8 CPU 

cores) with minimal modifications of the original CPU source code 

(about 30 lines of directives in our case). 

Speaker(s): Raphaël Poncet (Research Scientist, Commissariat à 

l’Energie Atomique et aux Energies Alternatives) 


Numerical Techniques, Computational Fluid Dynamics, 

Computational Physics (Advanced) 


ROOM B 

S0157 A Study of Persistent Threads Style Programming 

Model for GPU Computing 

We present the usefulness of a new style of GPU programming 

called Persistent Threads, known to be useful on irregular 

workloads. First, we will begin by formally defining the PT model. 

We will then categorize use of PT into four “use cases”, and 

present micro-benchmark analyses of when this model is useful 

over traditional kernel formulations. Third, we will show a full 

speech recognition application that uses all four PT use cases. 

Finally, we will conclude our talk by suggesting appropriate 

modifications to GPU hardware, software, and APIs that make PT 

kernels both easier to implement and more efficient. 

Speaker(s): Kshitij Gupta (Graduate Student Researcher, UC Davis), 

Jeff Stuart (PhD Student, UC Davis) 

Topic(s): Parallel Programming Languages & Compilers, Audio, Image 

and Video Processing (Advanced) 



S0334 The Fast Multipole Method on CPU and GPU 

Processors 

The fast multipole method (FMM) is a widely used numerical 

algorithm in computational engineering. Accelerating the FMM on 

CUDA-enabled GPUs is challenging because the FMM has a 

complicated data access pattern, mostly during the so-called 

multipole-to-local (M2L) operation. We have created several 

schemes to optimize the M2L and have attained a performance of 

over 350 (resp. 160) Gflop/s for single (resp. double) precision 

arithmetic. The optimal algorithm was incorporated into a 

complete FMM code, which can accept any smooth kernel as 

specified by the user, making it very flexible. We have also 

developed a highly efficient CPU version. 

Speaker(s): Eric Darve (Professor, Stanford) 

Topic(s): Computational Physics, Molecular Dynamics, Algorithms & 

Numerical Techniques (Advanced) 


ROOM K 

S0368 Unraveling the Mysteries of Quarks with 

Hundreds of GPUs 

Dive into the world of quarks and gluons, and hear how GPU 

computing is revolutionizing the way many calculations in lattice 

quantum chromodynamics (lattice QCD) are performed. The main 

computational challenge in such calculations is to repeatedly 

solve large systems of linear equations arising from a four-dimensional 

finite-difference problem. In this session, we’ll 

discuss strategies for parallelizing such a solver across hundreds 

of GPUs. These include techniques and algorithms for reducing 

memory traffic and inter-GPU communication. The net result is an 

implementation that achieves better than 20 Tflops on 256 GPUs, 

realized in the open-source “QUDA” library. 

Speaker(s): Ronald Babich (Research Scientist, NVIDIA) 

Topic(s): Computational Physics, Application Design & Porting 

Techniques, Algorithms & Numerical Techniques, Supercomputing 




S0429 Quantum Chemistry: Automated Code Generation 

and Optimization for GPU Kernels 

In this session we discuss the challenges encountered in 

development of quantum chemistry software for GPUs from 

scratch and optimization of the kernels for the best performance. 

We attempt to create a unified framework for automatic 

generation of efficient quantum chemistry codes tailored 

individually for various GPU (NVIDIA, ATI) and CPU architectures 

and programming (CUDA, OpenCL, C/C++) languages using a 

meta-programming approach based on a computer algebra 

system. We demonstrate its utility by generating highly optimized 

GPU and CPU kernels dealing with various integrals over 

Gaussian basis functions implemented in the TeraChem quantum 

chemistry package. 

Speaker(s): Alexey Titov (Engineering Research Associate, Stanford), 

Ivan Ufimtsev (Postdoc, Stanford) 

Topic(s): Quantum Chemistry (Advanced) 














ROOM M 

S0111 An Efficient CUDA Implementation of a Tree-Based 

N-Body Algorithm 

This session presents a complete CUDA implementation of the 

irregular Barnes-Hut n-body algorithm. This algorithm repeatedly 

builds and traverses unbalanced trees, making it difficult to map 

to GPUs. We explain in detail how our code exploits the 

architectural features of GPUs, including lockstep operation and 

thread divergence, both of which are commonly viewed as hurdles 

to achieving high performance, especially for irregular codes. On 

a five million body simulation running on a Tesla C2050, our CUDA 

implementation is 30 times faster than a parallel pthreads version 

running on a high-end 6-core Xeon. 

Speaker(s): Martin Burtscher (Associate Professor, Texas State University) 

Topic(s): Application Design & Porting Techniques, Astronomy & 

Astrophysics, Molecular Dynamics, Supercomputing (Advanced)



S0138 GPU Task-Parallelism: Primitives and Applications 

We explore how a task-parallel model can be implemented on the 

GPU and address concerns and programming techniques for 

doing so. We discuss the primitives for building a task-parallel 

system on the GPU. This includes novel ideas for mapping tasking 

systems onto the GPU including task granularity, load balancing, 

memory management, and dependency resolution. We also 

present several applications which demonstrate how a task-parallel 

model is more suitable than the regular data parallel 

model. These applications include a Reyes renderer, tiled deferred 

lighting renderer, and a video encoding demo. 

Speaker(s): Anjul Patney (PhD Candidate, UC Davis), Stanley Tzeng 

(Graduate Student, UC Davis) 

Topic(s): Application Design & Porting Techniques, Development Tools 

& Libraries, Computer Graphics (Intermediate) 


ROOM L 

S0267B Mixing Graphics and Compute with Multiple GPUs 

In this session we will cover all the different aspects of interaction 

between graphics and compute. The first part of the session will 

focus on compute API interoperability with OpenGL (using CUDA 

and OpenCL APIs), while the second part of the session will delve 

into interoperability at a system level. In particular we will go 

through the challenges and benefits of dedicating one GPU for 

compute and another for graphics, how different system 

configurations affect data transfer between two GPUs, and how it 

translates into application design decisions helping to enable an 

efficient, cross-GPU interoperability between compute and 

graphics contexts. 

Speaker(s): Alina Alt (Applied Engineer, NVIDIA) 

Topic(s): Visualization, Application Design & Porting Techniques 

(Beginner) 



S0392 Large-Scale First Principle Pseudopotential DFT 

Calculations on GPU Clusters 

In this session, we will present a body of work on density 

functional theory (DFT) plane wave pseudopotential (PWP) 

calculations on GPU clusters. The GPU version is based 

on a CPU DFT-PWP code, PEtot, which can handle ~1000 atoms 

on thousands of processors. Our tests indicate that the GPU 

version achieves a ~20x speedup over the CPU code. A detailed 

analysis of the speedup and of the scaling with the number of 

CPUs/GPUs (up to 256) will be presented. As far as we know, this is the 

first GPU DFT-PWP code that scales to a large number of CPUs/GPUs. 

Speaker(s): WeiLe Jia (Postgraduate Student, Supercomputing Center of 

CNIC, Chinese Academy of Sciences), Long Wang (Associate Professor, 

Supercomputing Center of CNIC, Chinese Academy of Sciences) 

Topic(s): Quantum Chemistry, General Interest (Advanced) 



S0038 Designing Killer CUDA Applications for X86, 

multiGPU, and CPU+GPU 

CUDA redefined software development with 10 to 1000-times 

faster GPU applications. Now a single CUDA source tree can 

support the x86 mass market (no GPU required) and 1/3 billion 

CUDA-enabled GPUs. MultiGPU and CPU+GPU apps utilize all 

system resources. GPUdirect, UVA, caches, prefetching, ILP 

(Instruction level Parallelism), automated analysis tools and more 

offer ease, capability, and performance. The overall impact on 

software investment, scalability, balance metrics, programming 

API, and lifecycle will be considered. Working real-time video and 

other examples from my book, “CUDA Application Design and 

Development,” provide practical insight to enable augmented 

reality and your killer apps. 

Speaker(s): Robert Farber (Chief Scientist, BlackDog Endeavors, LLC) 

Topic(s): Machine Learning & AI, Supercomputing, Databases, Data 

Mining, Business Intelligence, Computer Vision (Intermediate) 


ROOM N 

S0063 Robust Preconditioned Conjugate Gradient for the 

GPU and Parallel Implementations 

Get a closer look at how a parallel conjugate gradient (CG) method 

can gain an edge over its optimized CPU implementation. We have 

developed preconditioning techniques for CG which are suited to 

the GPU and match Block-IC in terms of numerical performance. 

We present our results for two-level preconditioned CG on the GPU 

and also compare it with multi-CPU implementations. Our results 

show that for large problem sizes (1 million unknowns and above) 

it is possible to achieve speedups of an order of magnitude or 

more for the two-level preconditioned CG method. 

Speaker(s): Rohit Gupta (PhD Student, Delft University of Technology) 
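
For readers unfamiliar with the method, here is a minimal Python sketch of preconditioned CG. It is not the presenters' code: it swaps in a simple Jacobi (diagonal) preconditioner in place of the two-level and Block-IC preconditioners named in the abstract, it runs on the CPU, and the name `jacobi_pcg` and the 3x3 test matrix are assumptions chosen for illustration.

```python
def matvec(A, x):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def jacobi_pcg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A.

    Assumption: the preconditioner is M = diag(A) (Jacobi), used here only
    because it is the simplest choice, not because the talk uses it.
    """
    inv_diag = [1.0 / A[i][i] for i in range(len(b))]
    x = [0.0] * len(b)
    r = list(b)                                   # residual for x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]    # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = jacobi_pcg(A, b)
```

Each iteration is dominated by one matrix-vector product and a few dot products, which is why the choice of preconditioner, rather than the CG loop itself, tends to decide how well the method maps to a GPU.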




ROOM K 

S0282 Leveraging NVIDIA GPUDirect on APEnet+ 3D 

Torus Cluster Interconnect 

APEnet+ is a novel cluster interconnect, based on a custom PCI 

card which features a PCI Express Gen2 X8 link and a reconfigurable 

HW component (FPGA). It supports a 3D Torus 

topology and has special acceleration features specifically 

developed for NVIDIA Fermi GPUs. An introduction to the basic 

features and the programming model of APEnet+ will be followed 

by a description of its performance on some numerical 

simulations, e.g. High Energy Physics simulations. 

Speaker(s): Davide Rossetti (Researcher, Italian National Institute for 

Nuclear Physics) 

Topic(s): Supercomputing, Computational Physics (Intermediate) 


ROOM B 

S0428 Panini: A GPU Aware Array Class 

We present a new templated C++ class library, PANINI, for use in 

the development of large-scale scientific simulations in a 

heterogeneous computing environment. The key feature of this new 

library is a generic parallel array class built on advanced generic 

programming methodologies, where the details of parallelization are 

hidden inside the array class itself. This library will be used for 

Poisson solvers, advection-diffusion, and other equations. 

Speaker(s): Priyanka Sah (Compute DevTech Engineer, NVIDIA), 

Santosh Ansumali (Faculty Fellow, Engineering Mechanics Unit, 

JNCASR, Bangalore) 


















ROOM M 

S0065 Satellite HUB Communication System GPU Based 

In the last few years, increasing GPU computational power has 

opened new perspectives in telecommunications through the SDR 

(software defined radio) approach. Some tasks, such as the one 

we had to deal with, leave no margin on execution speed 

because the radio signal must be analyzed in real time. We 

implemented the lowest layer of the protocol 

stack for a land mobile satellite communication system, and we 

were able to deliver a product with a shorter time to market than 

the traditional FPGA approach. 

Speaker(s): Gaetano Mendola (Principal Engineer, MBI srl), Francesco 

Basile (Software Engineer, MBI srl) 



ROOM B 

S0218 ASI Parallel Fortran: A General-Purpose Fortran 

to GPU Translator 

Over the last 3 years we have developed a general-purpose 

Fortran-to-GPU translator, ASI Parallel Fortran. The talk will 

detail its purpose, design layout and capabilities, and show how it 

is used and implemented. The use of ASI Parallel Fortran will be 

shown for large-scale CFD/CEM codes as well as other general 

purpose Fortran codes. 

Speaker(s): Rainald Lohner (Professor, George Mason University) 

Topic(s): Development Tools & Libraries, Computational Fluid 

Dynamics, Computational Physics, Parallel Programming Languages 

& Compilers (Advanced) 



S0220 Enabling Faster Material Science Modeling Using 

the Accelerated Quantum ESPRESSO 

The goal of this session is to present the advantages of mixing 

CUDA libraries and CUDA kernels to deliver a robust community 

package for material science modeling that fully exploits multicore 

systems equipped with GPUs. The Plane-Wave Self- 

Consistent Field (PWscf) code of the Quantum ESPRESSO suite is 

the focus of this work. During the session the main computation-dependent 

components, that also represent fundamental building 

blocks for many other quantum chemistry codes, will be 

discussed and analyzed. Subsequently an in-depth performance 

assessment of several realistic scientific cases will be presented, 

starting from single workstations to large clusters equipped with 

hundreds of GPUs. 

Speaker(s): Filippo Spiga (Computational Scientist, Irish Centre for 

High-End Computing) 

Topic(s): Quantum Chemistry, Supercomputing, Application Design & 



ROOM L 

S0411 Artifact-Free Cloud-Based CAD Rendering 

Cloud computing for mechanical CAD provides centrally stored and 

synchronized models for concurrent engineering. For compactness, 

trimmed parametric NURBS surface representations are optimal 

for data transfer to client devices, which must evaluate and render 

models locally. Direct GPU rendering without pre-tessellation is an 

attractive solution in this context, both for speed and to preserve 

fidelity to the original geometry. However, existing data-parallel 

direct rendering approaches for NURBS suffer from rendering 

artifacts at trim boundaries. This talk proposes a solution to 

address these rendering artifacts that are still preventing wide-scale 

adoption of all such direct rendering algorithms for trimmed 

parametric models. 

Speaker(s): Sara McMains (Professor, UC Berkeley), Sushrut 

Pavanaskar (PhD Candidate, UC Berkeley) 

Topic(s): Algorithms & Numerical Techniques, Computer Graphics, 

Cloud Computing, Visualization (Beginner) 


ROOM L 

S0074 Techniques for Designing GPGPU Games 

Learn how to develop faster and better games with 

GPGPU through the use of game GPU tricks. Normally, games 

process most of their tasks on the CPU, using the GPU only for 

graphics processing. This session shows some techniques on how 

to better use the GPGPU power to process all the game logic, 

achieving speedups when compared to CPU, and traditional GPU 

models. This session also shows some examples of this technique 

in practice. 

Speaker(s): Mark E S Joselli (Researcher, UFF), Esteban Clua 

(Professor, UFF) 















ROOM M 

S0134 On the Integration of OpenCL into a Software 

Defined Radio 

We will present a software defined radio system that allows for 

heterogeneous processing using a host computer’s CPUs and 

GPUs, via dynamic runtime resource allocation provided by our 

Surfer framework and extensions to it using OpenCL. This system

collects runtime statistics including samples / second throughput 

for each signal processing block, data transfer latency between 

different processors, and the host CPU cores’ loads. Using this 

information, a supervisor can move computations between 

processors during runtime, without interrupting data processing. 

We will demonstrate an OFDM transmitter, graphing the system 

throughput and CPU loads while selecting where processing 

occurs for each block. 

Speaker(s): Michael Dickens (Graduate Student, University of Notre Dame) 



GPU Consolidation 

and Virtualization for 

Application Acceleration 

and Data Visualization 

www.nextio.com

ALGORITHMS & NUMERICAL TECHNIQUES 

AN01 - A Novel Parallel Realisation of the 

Element-by-Element FEM Technique 

The element-by-element (EbE) finite element 

method (FEM) is a long known technique, by which 

a conjugate gradient (CG) type iterative solution 

scheme can be entirely decomposed into 

computations on the element level, i.e., without 

assembling the global system matrix. In our 

implementation a CUDA capable GPU is utilized to 

perform the required element-wise computations in 

parallel. Since element matrices need not be stored, 

the memory requirement can be kept extremely low. 

This low-storage but computation intensive 

technique is better suited for GPUs than those 

requiring the massive manipulation of large data 

sets, enabling handling of millions of tetrahedrons. 

Contact: Zsolt Badics (Tensor Research, LLC) 
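
To make the matrix-free idea concrete, here is a minimal Python sketch of an element-by-element matrix-vector product. It is an illustration under stated assumptions, not the poster's implementation: it uses a 1D Laplacian with two-node linear elements on a uniform mesh instead of tetrahedra, it runs sequentially on the CPU, and the name `ebe_matvec` is hypothetical.

```python
def ebe_matvec(x, h):
    """Compute y = K x for a 1D Laplacian without assembling K.

    Assumption: uniform mesh of spacing h with linear elements, so every
    element matrix is (1/h) * [[1, -1], [-1, 1]] and never needs storing.
    """
    y = [0.0] * len(x)
    k = 1.0 / h
    for e in range(len(x) - 1):        # element e joins nodes e and e + 1
        xa, xb = x[e], x[e + 1]        # gather the element's nodal values
        y[e] += k * (xa - xb)          # local 2x2 product, scattered back
        y[e + 1] += k * (xb - xa)
    return y


x = [0.0, 1.0, 4.0, 9.0, 16.0]         # samples of t**2 at t = 0..4
y = ebe_matvec(x, h=1.0)
```

The interior entries of `y` equal the negated second difference of `x`, which is the same result the assembled tridiagonal matrix would give. In a parallel version each element could be handled by its own thread, with the scatter step needing atomic additions or element coloring to avoid write conflicts.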

AN02 - ExaFMM: An Open Source Library for 

Fast Multipole Methods 

The fast multipole method (FMM) is a numerical 

engine used in many applications, including acoustics, 

electrostatics, fluid simulations, wave scattering 

and more. Despite its importance, there is a lack of 

open community code, which arguably has 

affected its wider adoption. It is also a difficult 

algorithm to understand and to program, making 

availability of open-source implementations even 

more desirable. We developed a novel treecode- 

FMM hybrid algorithm with auto-tuning 

capabilities. It is highly parallel and GPU-capable. 

Its usage in the simulation of homogeneous 

isotropic turbulence achieved 0.5 petaflop/s on 

2048 GPUs of the Tsubame system. 

Contact: Lorena Barba (Boston University) 

AN03 - Collatz-Type Conjectures on GPU 

We verify two types of Collatz conjectures: on the 

set of rational numbers and the set of matrices 

modulo p, where p is prime. In both cases, the 

number of pairs of rational numbers and matrices 

grows exponentially. However, our algorithm 

exhibits simple parallel patterns which exploit 

GPUs in an efficient way. The preliminary results 

show that the conjecture holds for both cases for 

large sets. 

Contact: Peter Yoon (Trinity College) 

AN04 - CUDA Implementation of Recurrence 

Equation Solvers Using P-scheme approach 

The recurrence equation solver is used in many 

numerical applications and other general-purpose 

applications, but it is inherently a sequential 

algorithm, so it is difficult to implement the 

parallel program for it. We implement a parallel 

and scalable algorithm for solving recurrence 

equations on GPUs by using CUDA and evaluate 

its effectiveness. The algorithm was originally 

implemented for MIMD parallel computers by the 

authors, and we adapt the algorithm to 

the GPGPU system by rearranging the array 

layout. We also show how to determine 

the optimal number of threads in a thread block and 

evaluate its validity. 

Contact: Akiyoshi Wakatani (Konan University) 
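
One common way to expose parallelism in a recurrence is sketched below in Python. It does not reproduce the P-scheme named in the title; it substitutes a generic Hillis-Steele scan over affine maps for the first-order linear recurrence x[i] = a[i]*x[i-1] + b[i]. The function names and the sample coefficients are assumptions for illustration.

```python
def compose(f, g):
    """Return the affine map 'apply f, then g' for maps x -> a*x + b."""
    a1, b1 = f
    a2, b2 = g
    return (a2 * a1, a2 * b1 + b2)


def scan_recurrence(a, b, x0):
    """Solve x[i] = a[i]*x[i-1] + b[i] with an inclusive scan."""
    maps = list(zip(a, b))
    n = len(maps)
    step = 1
    while step < n:
        # All i in one sweep are independent of each other, so a GPU
        # could assign one thread per i; here they run sequentially.
        maps = [compose(maps[i - step], maps[i]) if i >= step else maps[i]
                for i in range(n)]
        step *= 2
    return [m[0] * x0 + m[1] for m in maps]


def sequential(a, b, x0):
    """Reference loop that follows the recurrence directly."""
    out, x = [], x0
    for ai, bi in zip(a, b):
        x = ai * x + bi
        out.append(x)
    return out


a = [2, 3, 1, 2, 5]
b = [1, 0, 4, 1, 2]
parallel_result = scan_recurrence(a, b, 1)
```

Because composing affine maps is associative, the scan needs only about log2(n) sweeps instead of n dependent steps, at the cost of more total arithmetic.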

AN05 - Accelerating Symmetric Matrix-Vector 

Product on Fermi GPU 

In the work presented here, we describe an 

optimized numerical kernel computing the 

symmetric matrix-vector product (Level 2 BLAS) 

on the latest NVIDIA Tesla GPU family, codenamed 

Fermi (C2070). Due to its inherent memory-bound 

nature, this kernel represents one of the most 

critical operations in computing the tridiagonal 

form of a symmetric dense matrix, which is the 

preprocessing step toward calculating the 

eigenpairs. Using a novel design to address the 

irregular memory accesses by hiding latency and 

increasing bandwidth, our preliminary asymptotic 

results show up to 3.5 fold speedups over existing 

numerical libraries. 

Contact: Hatem Ltaief (KAUST Supercomputing 

Laboratory) 

AN06 - Rapid Matrix Construction for Wavelet- 

Galerkin Schemes 

The wavelet Galerkin scheme is an efficient 

numerical method used to improve Boundary 

Element Methods and Finite Element Methods for 

solving partial differential equations given 

resulting matrix features like sparseness and 

conditionality. Using CUDA C/C++ we have 

implemented the open-source C++ Library of 

Adaptive Wavelet Applications (LAWA) on the GPU 

and achieve significant performance gain for 

matrix construction. 

Contact: Yuri Nesterenko (Dantec Dynamics A/S) 

AN07 - Big Number Modulo Exponentiations For 

Zero-Knowledge Protocols on GPUs 

In this work we implement parallel big number 

exponentiations having a fixed base on the GPU. 

For this task we develop a new implementation of 

the Montgomery multiplication algorithm. Although 

big number exponentiations benefit from large 

caches like those on a CPU, we show that their absence on the GPU can be 

compensated by a high level of parallelization and 

an adaptation of the algorithms. 

Contact: Tobias Jeske (TU Hamburg-Harburg) 
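
For context, the following Python sketch shows textbook Montgomery multiplication driving a square-and-multiply exponentiation. It is not the poster's new implementation and omits both the fixed-base precomputation and any GPU parallelization; the function names are hypothetical, and `pow(n, -1, r)` assumes Python 3.8 or later.

```python
def montgomery_setup(n, bits):
    """Precompute constants for an odd modulus n with R = 2**bits > n."""
    r = 1 << bits
    n_prime = (-pow(n, -1, r)) % r      # n * n_prime == -1 (mod R)
    return r, n_prime


def mont_mul(a, b, n, bits, n_prime):
    """Return a * b * R^-1 mod n without ever dividing by n."""
    mask = (1 << bits) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask   # makes t + m*n divisible by R
    u = (t + m * n) >> bits
    return u - n if u >= n else u


def mont_pow(base, exp, n):
    """Compute base**exp mod n for odd n using Montgomery arithmetic."""
    bits = n.bit_length()
    r, n_prime = montgomery_setup(n, bits)
    x = (base * r) % n                  # base in Montgomery form
    acc = r % n                         # Montgomery form of 1
    while exp:
        if exp & 1:
            acc = mont_mul(acc, x, n, bits, n_prime)
        x = mont_mul(x, x, n, bits, n_prime)
        exp >>= 1
    return mont_mul(acc, 1, n, bits, n_prime)   # leave Montgomery form


result = mont_pow(3, 65537, 2**61 - 1)
```

The reduction replaces division by the modulus with shifts and masks against a power of two, which is the property that makes the algorithm attractive on hardware where wide division is expensive.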

AN08 - Tuning a Finite Difference Stencil 

Several ways of tuning a finite difference stencil 

computation are discussed. The combination of 

vectorization and a modified data layout, a cache 

aware algorithm, loop unrolling, parallelization 

and parameter tuning lead to optimized 

implementations at a level of up to 90% peak 

performance of the floating point pipelines on 

NVIDIA Fermi GPUs and on CPUs. 

Contact: Gerhard Zumbusch (University Jena) 


POSTER LISTINGS 

AN09 - Parallel Programming on CPU-GPU for 

Solving Population Balance Equation 

The population balance equation (PBE) is one of 

those. The Dual Quadrature Method of Generalized 

Moments (DuQMoGeM) is a promising method for 

solving the PBE. The drawback of this methodology 

is the large computational cost associated with the 

adaptive numerical integration. Therefore, the 

adaptive cubature algorithm was implemented in 

hybrid architecture (MPI-CUDA) to accelerate the 

DuQMoGeM. The maximum speedup was about 

48x using 4 GPUs on 4 nodes and 

about 40x using 2 GPUs on 1 node. 

Contact: Fabio Pereira dos Santos (Institute for 

Medical Physics) 

AN10 - GPU Enabled Comparison Between 

Stochastic Decomposition Methods 

The scale of engineering problems has sharply 

increased over the last twenty years. The ability to 

learn the coupling (inter-dependence) structure of 

a problem during the solution process could lead 

to large reductions in the time to analyze complex 

problems. Such decomposition methods could 

also provide engineering insight on the 

fundamental physics driving problem solution. 

This work advances the current state of the art in 

engineering decomposition through the 

application of techniques originally developed 

within computer science and information theory. 

CUDA enabled a detailed comparison between the 

current practice of using Genetic Algorithms and a 

newly introduced method called MIMIC. 

Contact: Richard Otero (Los Alamos National Lab) 

AN11 - GPU-Accelerated 3-D Electromagnetic 

Particle-in-Cell Implementations in VORPAL 

We present recent developments in implementing 

3D GPU-accelerated electromagnetic particle-in-cell 

particle updates in the plasma physics 

framework VORPAL. The primary challenge in PIC 

methods on GPUs is thread contention during the 

current deposition stage: we resolve these thread 

contentions by sorting particles into ‘tiles’ of many 

cells each time step. Multiple thread blocks may 

be assigned to each tile, and each block 

accumulates the contribution from a moderate 

number of particles via an unsegmented 

Esirkepov 1st-order scheme. We achieve update 

times of 50 ns per-particle per-timestep for a 

variety of realistic self-consistent double-precision 

EM simulations. 

Contact: Keegan Amyx (Tech-X Corporation) 

AN12 - LU Factorization for 10,000s of Small 

Dense Matrices 

LU factorization is a “high-level” algebraic 

description for Gaussian elimination and is a 

fundamental operation performed in linear 

algebra. By implementing a register heavy 

mapping in CUDA specifically for small matrices, 

speed-up factors of more than 10 are achieved vs. 

an OpenMP parallelized Intel MKL implementation 

running on a high-end quad-core CPU. 

Contact: Ian Wainwright (High Performance 

Consulting) 
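
As a reference point for what is being factored, here is a minimal Python sketch of LU factorization applied to a batch of small dense matrices. It is an illustration only: it uses plain Doolittle elimination without pivoting (an assumption, so a zero pivot would fail), it does not model the register-heavy CUDA mapping described above, and the names `lu_nopivot` and `batch` are hypothetical.

```python
def lu_nopivot(a):
    """Return (L, U) with L unit lower triangular and L U equal to a.

    Assumption: no pivoting, so every pivot u[k][k] must be nonzero.
    """
    n = len(a)
    u = [row[:] for row in a]
    l = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(k + 1, n):
            l[i][k] = u[i][k] / u[k][k]       # elimination multiplier
            for j in range(k, n):
                u[i][j] -= l[i][k] * u[k][j]  # Gaussian elimination step
    return l, u


def matmul(x, y):
    """Dense product of two square matrices stored as nested lists."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Each matrix is independent, so a GPU could factor one per thread.
batch = [[[4.0, 3.0], [6.0, 3.0]],
         [[2.0, 1.0], [4.0, 5.0]]]
factors = [lu_nopivot(m) for m in batch]
```

With tens of thousands of such matrices, the parallelism comes from the batch rather than from any single factorization, which is the situation the poster title describes.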

AN13 - GPU Implementation of a Streaming 

Broadband RF Receiver 

An experimental radio broadcasting system 

spreads the signal with a PN code. To reconstruct 

the original signal, the receiver correlates the PN 

code with the signal received, numerically 

sampled, requiring a direct as well as an inverse 

Fast Fourier Transform, plus other conditioning 

and filtering operations. Since the target speed of 

the system is 625 megasamples per second, 

processed in segments of one megasample 

each, performing this computation on a standard 

CPU system is prohibitive and GPU processing is 

an attractive option. This project describes an 

initial CUDA implementation that performs almost 

at target speed. 

Contact: Andrea Di Blas (University of California, 

Santa Cruz) 

AN14 - Efficient Algebraic Multigrid Methods 


Algebraic multigrid methods for large, sparse 

linear systems are a necessity in many 

computational simulations, yet parallel algorithms 

for such solvers are generally decomposed into 

coarse-grained tasks suitable for distributed 

computers with traditional processing cores. We 

develop a parallel algebraic multigrid method 

which exposes substantial fine-grained 

parallelism in both the construction of the 

multigrid hierarchy as well as the cycling or solve 

stage. The resulting solver achieves an average 

speedup of 1.8x in the setup phase and 5.7x in the 

cycling phase when compared to a representative 

CPU implementation. 

Contact: Steven Dalton (University of Illinois at 

Urbana-Champaign) 

APPLICATION DESIGN & PORTING 

TECHNIQUES 

AP01 - Debugging Floating Point 

Implementations on GPUs 

To debug GPU code it is important to understand 

differences between CPU and GPU 

implementations. The differences arise due to 

floating point (FP) differences and casting from 

floating point to fixed point. FP differences arise 

due to the lack of associativity of FP, differences in 

instruction implementation, and choices made by 

the compiler. We analyzed medical image 

reconstruction code for breast reconstruction and 

showed that GPU and CPU code could be made to 

produce identical results. We also analyze the 

performance implications of choosing different 

implementation options on the GPU and CPU to 

make the codes match. 

Contact: Miriam Leeser (Northeastern University)
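
The lack of associativity mentioned above can be seen in a few lines of Python. This is a generic IEEE 754 double-precision illustration, not code from the poster; the four sample values and the two summation orders are chosen for the example.

```python
import math

values = [1e16, 1.0, -1e16, 1.0]

# Serial order, as a simple CPU loop would sum the list.
left_to_right = ((values[0] + values[1]) + values[2]) + values[3]

# A different grouping, as a parallel reduction might produce.
regrouped = (values[0] + values[2]) + (values[1] + values[3])

# Correctly rounded sum, for reference.
exact = math.fsum(values)
```

Here `left_to_right` is 1.0 while `regrouped` and `exact` are 2.0, because adding 1.0 to 1e16 is lost to rounding in the first ordering. A GPU reduction that sums in a different order from the CPU loop can therefore differ in the last bits, or more, without either version being buggy.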

AP02 - KILO Transactional Memory for GPU 

GPUs are designed to efficiently execute 1000s 

of concurrent threads on multiple SIMT cores to 

hide long latency operations. Currently, threads in 

different CUDA blocks can only communicate via 

global memory accesses, and programmers have 

to consider data-races. Although fine-grained 

locks can be constructed using 32-/64-bit word 

atomic operations in recent GPUs, operations 

involving multiple locks can have deadlocks. We 

propose to solve these problems by extending 

GPUs to support transactional memory. Some of 

the major challenges are to support 1000s of 

concurrent transactions, to commit nonconflicting 

transactions in parallel, and to 

integrate with stack-based SIMT execution. 

Contact: Wilson Wai Lun Fung (University of 

British Columbia) 

AP03 - CUDA-Based GPU Computing Framework 

for GNU Octave 

This poster presents the design of a CUDA-GPU 

based parallel processing framework for GNU 

Octave. Octave is a high-level interpreted 

language, primarily intended for numerical 

computations. GNU Octave, being an open source 

alternative to MATLAB, is widely used in academic 

and research institutes. The GPU framework 

allows Octave users to accelerate their software 

written in Octave high-level ‘M’ language on GPUs 

with minimal code modifications. To my 

knowledge, this is the first attempt to build a GPU 

framework for Octave, in contrast to previous 

attempts to provide GPU variants for a set of 

Octave functions. 

Contact: John Melonakos (AccelerEyes) 

ASTRONOMY & ASTROPHYSICS 

AA01 - Adaptive Beam-Forming for Radio 

Astronomy on GPUs 

With the advent of a new breed of telescopes like 

the Low Frequency Array (LOFAR), which rely on 

software to process the large data-sets 

that they generate, there is a need for that 

software to run as fast as possible in order to 

process the data in a reasonable time. 

In this session we describe how we have used the 

computing power of GPUs to improve the 

performance of the standard radio imaging 

techniques as well as how this computational 

power is useful for creating a new generation of 

radio imaging algorithms. 

Contact: Vamsi Krishna Veligatla (University 

of Groningen) 

AA02 - Accelerating Real-Time Processing of the 

ATST Adaptive Optics System 

The real-time processing of the four-meter 

Advanced Technology Solar Telescope (ATST) 

adaptive optics (AO) system with approximately 

1750 sub-apertures and 1900 actuators requires 

massive parallel processing to complete the task. 

The parallel processing is harnessed with the 

addition of hardware accelerators such as 

Graphics Processing Units (GPUs). We investigate 

the hybrid data processing architecture of the 

Shack-Hartmann correlation and wavefront 

reconstruction using FPGAs and GPUs. The ATST 

AO algorithm is implemented and benchmarked on 

the FPGA-GPU system and compared with the 

existing legacy Digital Signal Processing (DSP) 

based hardware system. 

Contact: Vivek Venugopal (United Technologies 

Research Center) 

AA03 - Cosmological Calculations on the GPU 

Cosmological measurements often involve the 

calculation of non-trivial quantities over 

increasingly large datasets. The next generation of 

survey telescopes will yield information for billions 

of galaxies. The scale of the datasets, and the type 

of calculations involved, are ideal models for use 

of the GPU. We present two cosmological 

measurements, and describe the implementation 

and improvements found with the GPU. 

Contact: Deborah Bard (SLAC National Accelerator 

Laboratory) 

AA04 - Fast Cross-Matching of Astronomical 

Catalogs on GPUs 

We present a method of cross-matching objects of 

large astronomical catalogs, over 150 million 

objects, in under 4 minutes. We utilize up to 6 

NVIDIA C2050 GPUs and have achieved a speedup of 

over 40x versus conventional methods. 

Contact: Matthias Lee (Johns Hopkins University) 

AUDIO, IMAGE & VIDEO PROCESSING 

AV01 - Rapid Training of Acoustic Models Using 

GPUs 

Robust and accurate speech recognition systems 

can only be realized with adequately trained 

acoustic models. For common languages, 

state-of-the-art systems are now trained on 

thousands of hours of speech data, which can take 

weeks even with a large cluster of machines. To 

overcome this development bottleneck, we 

propose a new framework for rapid training of 

acoustic models using highly parallel GPUs. With 

a single NVIDIA GTX580 GPU, our proposed 

approach is shown to be 51x faster than a 

sequential CPU implementation, enabling a 

moderately sized acoustic model to be trained on 

1000-hour speech data in just over 9 hours. 

Contact: Jike Chong (Carnegie Mellon University) 

AV02 - 2 Million Pixel Experiment 

This experimental application has been created as 

a piece of computational art using visual computing 

technologies. It maps a high definition video source 

(1080p) into 3D space. The pixel transformation is 

accelerated by a CUDA kernel to achieve realtime 


accuracy. Besides producing visual effects in the 

arts, this method may be used for video quality 

checking at the pixel level. 

Contact: Philipp Drieger (Noumentalia.de - Digital 

Arts / KU Eichstätt-Ingolstadt) 

AV03 - Speeding Up Camera Sabotage Detection 

on CUDA 

Camera Sabotage Detection (CSD) algorithms, 

namely Camera Moved Detection, Camera Out of 

Focus Detection and Camera Covered Detection, 

are used to detect tampering attempts on 

surveillance cameras. CSD algorithms are required 

to be run on a high number of cameras in realtime, 

bringing high computational load to the video 

analytics systems. In this work, the CSD algorithms 

are accelerated by using CUDA. The overall system 

test results show that parallelization on the GPU makes 

the system 18 times faster than its CPU 

counterpart and up to 400 cameras can be 

supported in real time on a GTX 470. 

Contact: Alptekin Temizel (Middle East Technical 

University) 

AV04 - Remote Sensing on GPU: A Case Study 

Satellite images have become widely available; as 

a result, there is an increasing number of 

commercial applications utilizing these images. 

Satellites provide data in different wavelengths 

and they have higher resolution and larger data 

size compared to typical images. Running complex 

algorithms on satellite images for large data 

volumes is highly time consuming using CPUs and 

can be sped up using GPUs. In this paper, the 

performance of shadow detection and vegetation 

detection algorithms is investigated and 

compared on GPU and CPU. 

Results show that a speedup of up to 10.2 times 

can be achieved using a GPU. 


University) 

AV05 - Finite Difference-Based Sound Synthesis 

Using GPUs 

Finite Difference (FD) methods can be the basis 

for physics-based music instrument models that 

generate realistic audio output. However, such 

methods are compute-intensive; large simulations 

cannot run in real time on current CPUs. In this 

poster, we describe the current state of our 

implementation of a real-time sound synthesizer 

using an FD-based simulation of a two-dimensional 

membrane executed on GPUs. We 

demonstrate that it is possible to use this method 

to create a usable real-time audio synthesizer. 

Contact: Marc Sosnick (San Francisco State 

University) 
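
To illustrate the kind of update such a synthesizer repeats for every audio sample, here is a minimal sketch in Python, not the authors' implementation. It assumes the standard explicit (leapfrog) finite difference scheme for the 2D wave equation on a square membrane clamped at its edges; the names `step_membrane` and `synthesize`, the grid size, the pickup position and the value of `courant_sq` (the squared Courant number, which must not exceed 0.5 for stability in 2D) are all illustrative assumptions.

```python
def step_membrane(u_prev, u_curr, courant_sq):
    """Advance a clamped 2D membrane by one time step (explicit leapfrog scheme)."""
    n = len(u_curr)
    m = len(u_curr[0])
    u_next = [[0.0] * m for _ in range(n)]  # edges stay at zero (clamped boundary)
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            # Five-point discrete Laplacian of the current displacement.
            laplacian = (u_curr[i + 1][j] + u_curr[i - 1][j]
                         + u_curr[i][j + 1] + u_curr[i][j - 1]
                         - 4.0 * u_curr[i][j])
            u_next[i][j] = 2.0 * u_curr[i][j] - u_prev[i][j] + courant_sq * laplacian
    return u_next

def synthesize(grid_size=16, num_samples=64, courant_sq=0.25):
    """Displace the membrane centre and record one output sample per time step."""
    u_prev = [[0.0] * grid_size for _ in range(grid_size)]
    u_curr = [[0.0] * grid_size for _ in range(grid_size)]
    centre = grid_size // 2
    u_curr[centre][centre] = 1.0   # initial displacement (the "strike")
    u_prev[centre][centre] = 1.0   # same value one step earlier: starts at rest
    pickup = (grid_size // 4, grid_size // 4)  # hypothetical listening point
    samples = []
    for _ in range(num_samples):
        u_next = step_membrane(u_prev, u_curr, courant_sq)
        u_prev, u_curr = u_curr, u_next
        samples.append(u_curr[pickup[0]][pickup[1]])
    return samples

samples = synthesize()
```

Every interior grid point is updated independently from the two previous time levels, which is why the method maps naturally onto one GPU thread per point. The real-time constraint comes from having to finish one full grid update per output sample.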

AV06 - Parallelization of Hough Transform for 

Circles Using CUDA 

Hough Transform (HT) is a well-known technique 

used for detection of parametric shapes in image 

processing. However, various optimizations are 

necessary in its implementation due to large 

memory and computational requirements. In this 

paper, we consider the case of parallelization of 

Hough Transform for circles. A number of different 

implementation approaches to the algorithm are 

compared in CUDA. Results show that a speedup of up 

to 360 times can be achieved compared to the 

CPU version, enabling real-time applications. 


University) 

AV07 - Accelerating an Imaging Spectroscopy 

Algorithm Using GPUs 

Graphics Processing Units (GPUs) have proven to 

be effective at accelerating a range of scientific 

applications. As data needs increase, and more 

complex data analysis methods are used, the 

processing requirements for solving scientific 

problems also increase. The parallel processing 

power of GPUs can be harnessed and used 

alongside multi-core CPUs to address this. As an 

example, many problems require solving 

optimization problems of multiple variables across 

large arrays of data. By utilizing modern 

optimization techniques and combining them with 

the computational throughput of a CPU-GPU 

computing platform, we can greatly decrease the 

processing time required to solve these problems. 

Contact: Matthew Sellitto (LLC IntroVision) 

AV08 - CUVILib - GPU Accelerated Vision & 

Imaging Library 

Image Processing algorithms are used in a variety 

of different domains, from surveillance to medicine 

to industry. CUVI (CUDA Vision and Imaging Library) 

provides GPU accelerated Vision and Imaging 

functionality with plug-and-play ease of use, simple 

yet powerful interface and support for both NVIDIA 

and AMD GPUs. With over 1000 users of the Beta 

version, CUVI has quickly grown into a mature solution 

of choice when it comes to delivering real-time 

performance for your Imaging/Vision applications 

and software-frameworks. 

Contact: Salman Ul Haq (TunaCode) 

AV09 - Implementation of Raptor Code on GPU 

Raptor Code comes as an improvement to 

LT-Code, which performs as close as possible to 

Shannon’s channel limit and provides linear 

encoding and decoding time. It has been chosen 

for the forward error correction (FEC) scheme in 

3GPP and DVB-H standards. We implement 

Raptor Codes on GPU for the purpose of 

processing large block size and symbol size 

effectively and efficiently. Our GPU decoding 

achieves up to a 40x speedup over the sequential 

CPU decoding. 

Contact: Linjia Hu (Michigan Technological 

University) 

AV10 - Real-Time Wind Velocity Estimation from 

Aerosol Lidar Data Using GPUs 

The REAL is an atmospheric light detection and 

ranging (LIDAR) system. It produces near-horizontal 

and vertical cross-sectional images of 

the lower atmosphere. The images reveal the 

spatial distribution of atmospheric aerosol 

(particulate matter). By applying motion 

estimation algorithms to image sequences, 

two-dimensional vector wind fields can be 

determined. We will explore the use of GPU 

computing in the real-time computation of wind 

vector fields. 

Contact: Chris Mauzey (Johns Hopkins University, 

Applied Physics Laboratory) 

AV11 - GPU Based Feature Extraction 

Implementation 

In this poster, we introduce an efficient parallel 

implementation of Mel-frequency Cepstral 

Coefficient (MFCC)-based feature extraction and 

describe the optimizations required for effective 

throughput on many-core Graphics Processing 

Unit (GPU) processors. We demonstrate that the 

feature extraction process in automatic speech 

recognition is well suited for GPUs and a 

substantial reduction in computation time can be 

obtained by performing feature extraction on 

these platforms. Using a single NVIDIA GTX460 

GPU, our proposed approach is shown to be 

approximately 25x faster than a sequential CPU 

implementation, enabling feature extraction to be 

performed in real-time. 

Contact: Haofeng Kou (SCU) 

BIOINFORMATICS 

BI01 - Acceleration of Complex Network Analysis 

Complex networks now play an important scientific 

role. Their universal characteristics can be applied 

across scientific fields, for example in network 

pharmacology. The algorithms involved need to be 

accelerated so that their execution time decreases 

substantially. The breakthrough is the use of GPUs 

and parallel computing to accelerate the whole 

process. Transforming common algorithms such as 

matrix multiplication to a parallel model has 

shown large acceleration, which is promising 

for the field of network analysis. 

Contact: Athanasios Grivas (Newcastle University) 

BI02 - GHOSTM: A GPU-Accelerated Homology 

Search Tool for Metagenomics 

A vast number of sensitive homology searches are 

required for mapping sequence data to known 

protein sequence databases in metagenomic 

analysis. However, fast search tools such as BLAT 

do not have enough search sensitivity for 

metagenomic analysis. Thus a sensitive and 

efficient homology search tool is needed. 

We develop a GPU-optimized algorithm for 

performing sensitive sequence homology 

searches. Implemented as the GPU-Accelerated 

Homology Search Tool for Metagenomics (GHOSTM), 

it achieves faster calculation speeds and higher 

search accuracy than the 

BLAT program. Our results indicate that GHOSTM 

offers a potentially cost-efficient solution to the 

increasingly difficult computational analysis of 

metagenomic data. 

Contact: Shuji Suzuki (Tokyo Institute of Technology) 

CLIMATE & WEATHER MODELING 

CW01 - CUDA/JAVA Model for Gas Line-by-Line 

Absorption of Atmospheric Radiation 

The potential of graphics processing units (GPU) to 

speed up the calculation of radiative energy 

absorption by atmospheric gases is presented. Gas 

absorption calculations are needed at millions of 

electromagnetic waves to have an accurate 

depiction of the Earth’s incoming and outgoing 

radiative energies. The CUDA/GPU portion obtains 

the gases’ Voigt lineshapes, whereas the Java/CPU 

portion performs efficient I/O tasks on the large 

HITRAN database of molecular gas parameters. A 

modular combination of the lower-level CUDA 

algorithms and the higher-level Java language 

results in an accessible interface for end users 

who are not GPU experts. 

Contact: William Godoy (NASA Langley Research 

Center) 

CW02 - Heat Transfer Ray Tracing with OptiX 

QUIC Radiant is part of a suite of GPU-assisted 

tools developed by our research group that aim to 

increase knowledge of how environment and 

urban form interact. Our hypothesis is that urban 

structures exist that can minimize energy use 

while also minimizing air pollution exposure. Our 

efforts investigate the complex interactions of 

various types of urban structures by developing 

design strategies for optimizing urban form under 

a variety of constraints. 

Contact: Scot Halverson (University of 

Minnesota Duluth) 

COMPUTATIONAL FLUID DYNAMICS 

CD01 - Coalesced Simulation of Incompressible 

Navier-Stokes Equations Over Airfoil Using GPU 

This work presents a GPU-based implementation of 

Finite-Difference Time-Domain (FDTD) methods 

for solving unsteady incompressible viscous flow 

over an airfoil using the stream function-vorticity 

formulation on a structured grid. For 

large-scale simulations, FDTD methods can be 

computationally expensive and require a 

considerable amount of time to solve on CPUs. In 

contrast, modern GPGPUs are designed to 

accelerate many independent calculations thanks to 

their parallel architecture. Our 

FDTD implementation achieves efficient global 

memory coalescing with 66.67% occupancy. 


The GPU-based version of the flow solver is over 28 times 

faster than a sequential CPU version. 

Contact: Iman Gohari (University of Tehran) 

CD02 - Parallel Computations on GPU in 3D 

Vortex Particle Method 

This poster briefly presents the Vortex in Cell (VIC) 

method for solving the fluid equations in 3D and its 

implementation for parallel computation on the 

multicore architecture of graphics cards. 

One of the most important 

components of the VIC algorithm is the solution of 

the Poisson equation. Multigrid and Full Multigrid 

methods were chosen for this. A 12 times 

speed-up was obtained compared to the 

direct fast solution algorithm on a single processor. 

The VIC method was fully implemented on the GPU 

and a 46 times speed-up was obtained. Tests of 

the method are also shown. 

Contact: Andrzej Kosior (Wroclaw University 


CD03 - Reynolds Equation Solver on GPGPU for 

Gas Film Lubrication Problem 

In the present study, we implemented a Reynolds 

equation solver on GPGPU for the gas film lubrication 

problem. By using a Red-Black Gauss-Seidel 

iteration scheme, we achieved a 106x speedup for 

the core calculation part and an overall 12x speedup 

(double precision), relative to 1 core of AMD Llano 

A8-3850. A small serial part becomes a critical 

bottleneck and degrades overall speedup as the 

problem size gets bigger and GPU efficiency 

increases. Future work will include the 

development of a general gas film analysis solver 

and of a parallelization scheme for the 

remaining serial parts, such as integration, error 

checking, and so on. 

Contact: Ji-Hoon Kang (KISTI) 
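
The red-black ordering named in CD03 can be sketched as follows. This is an illustrative Python example rather than the poster's solver: it applies the scheme to a plain Poisson problem on a square grid as a stand-in for the Reynolds equation, and the function name `red_black_gauss_seidel` and its parameters are assumptions made for the example.

```python
def red_black_gauss_seidel(u, f, h, sweeps):
    """Solve -laplace(u) = f on a square grid; boundary values of u stay fixed."""
    n = len(u)
    for _ in range(sweeps):
        for colour in (0, 1):  # 0 = "red" points, 1 = "black" points
            for i in range(1, n - 1):
                for j in range(1, n - 1):
                    if (i + j) % 2 == colour:
                        # A point of one colour reads only neighbours of the
                        # other colour, so a half-sweep has no dependencies.
                        u[i][j] = 0.25 * (u[i + 1][j] + u[i - 1][j]
                                          + u[i][j + 1] + u[i][j - 1]
                                          + h * h * f[i][j])
    return u

# Small check: zero source term and boundary value 1 everywhere, so the
# solution should approach 1 at every interior point.
n = 9
u = [[1.0 if i in (0, n - 1) or j in (0, n - 1) else 0.0 for j in range(n)]
     for i in range(n)]
f = [[0.0] * n for _ in range(n)]
red_black_gauss_seidel(u, f, 1.0 / (n - 1), 200)
```

Because all points of one colour can be updated at the same time, each half-sweep could be assigned to one GPU kernel launch with one thread per point. The loops here are serial only to keep the sketch short.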

CD04 - Digital Core Analysis with GPU 

Application 

The use of computed 

tomography (CT) for the calculation of core 

characteristics is one of the fast-growing markets 

in oilfield services. A multi-GPU system 

processes raw data from a CT scanner in a 

cheaper and more efficient way than CPU 

clusters. Key parameters of the core, 

such as porosity, absolute permeability and 

acoustic properties, were calculated using MPI and 

CUDA technologies. Special attention was paid to 

optimizing memory usage and computational 

algorithms. The algorithms were tested on the 

“Lomonosov” supercomputer and showed a close to 

linear increase in computation speed 

with the number of GPU devices in use. 

Contact: Dmitry Senin (University of Illinois at 


CD05 - Immersed Boundary Turbulent Flow 

Simulations on GPU Clusters 

A survey of recent literature reveals that GPU 

speedup factors are generally much higher for 

structured Cartesian mesh methods than 

unstructured mesh methods. However, Cartesian 

mesh methods do not readily extend to complex 

geometries. To this end, immersed boundary (IB) 

methods extend Cartesian methods to complex 

geometry flow problems by imposing the boundary 

conditions on the equations as a forcing term. In 

this study we extend our multi-GPU 

parallel flow solver, GIN3D, to complex geometry 

turbulent flow problems by implementing the IB 

method along with the Lagrangian dynamic 

large-eddy simulation (LES) technique, which is 

suitable for arbitrarily complex shapes. 

Contact: Rey DeLeon (Boise State University) 

CD06 - Framework for Advanced Plasma 

Simulations on GPU HPC Clusters 

We present a fluid code called WARPM utilizing 

modern many-core computing devices – namely 

GPUs. WARPM is designed to both minimize data 

movement and maximize data-parallel 

computation. The code is a hybrid combination of 

OpenCL for parallel computation, MPI for 

communication between nodes, and threads for 

task-parallelism. The OpenCL standard is central 

to the code. GPUs and/or multi-core CPUs are 

utilized simultaneously to compute updates to the 

system of fluid equations using patch sequencing. 

We believe this new framework is representative of 

the future of high-performance fluid simulations 

and can be useful now to others in the community. 

Contact: Noah Reddell (University of Utah) 

COMPUTATIONAL PHYSICS 

CP01 - High Performance Beam Dynamics 

Simulator for the LANSCE Linear Accelerator 

The LANSCE accelerator complex located at the 

Los Alamos National Laboratory is a multi-beam 

facility that provides high-intensity H+ and H- 

particle beams for a variety of user programs. At 

the heart of the facility is a ½-mile long linear 

accelerator (linac). During beam operations, linac 

parameters are adjusted to maintain minimal 

beam spill, but without detailed knowledge of the 

beam distribution. We are presently developing a 

high performance multiparticle beam dynamics 

simulator using GPU that will provide fast and 

valuable information about the beam distribution 

in pseudo real-time during accelerator operations. 

Contact: Xiaoying Pang (The University of Plymouth) 

CP02 - Accelerating Atomic Collisions 

Calculations with CUDA: Atomic Basis Overlaps 

Atomic collisions calculations are relevant in many 

areas of science, from research in new materials 

to atmospheric studies, and even radiation therapy 

treatments. Accurate atomic computations are 

difficult and time consuming, so computer codes in 

those areas rely mainly on approximate models. 

The high performance computing power of GPUs 


will make it possible to include precise computations in those 

codes. We started our research using simple ways 

to accelerate basic atomic collisions calculations 

using CUDA, and found excellent speed ups. 

Contact: Flavio Colavecchia (Div. Colisiones 

Atómicas/Instituto Balseiro) 

CP03 - Fast Discrete Element Simulations Using 

GPUs in the Million-Particle-Range 

The Discrete Element Method (DEM) was introduced 

as early as 1979. For a long time, limited 

computational power made it a challenge to 

run a simulation of granular assemblies of a few 

hundred disks in two dimensions. 

Today, three-dimensional simulations in the 

range of 10,000 to 200,000 particles are standard 

and can be achieved on workstations and clusters, 

enabling simulated process times of up to several 

minutes in the latter case. Smart implementations 

that take the specific architecture of a GPU into account 

allow for millions of particles on a single 

GPU under your desk. 

Contact: Charles Radeke (University of Washington) 

CP04 - Discontinuous Galerkin Time-Domain 

Simulations of Plasmonic Nanostructures on 

NVIDIA GPUs 

The discontinuous Galerkin time-domain (DGTD) 

method is a powerful method to explore the 

electromagnetic properties of nano-scale 

plasmonic and dielectric systems. Here, we present 

the method’s advantages and disadvantages when 

implemented to run on graphic processing units 

(GPUs). The GPU’s superior performance is 

demonstrated for realistic nanophotonic setups 

characterized by both optical spectroscopy and 

electron energy loss spectroscopy. Compared to 

modern CPU hardware, GPU-based DGTD reduces 

computational time by up to two orders of 

magnitude. 

Contact: Richard Diehl (Karlsruhe Institute 


CP05 - Inversion of a Sequence of Matrices 

Differing in Diagonal Elements 

We propose a GPU implementation of an 

algorithm for the inversion of a special set of matrices. 

Each matrix in the set differs from the others only in 

its diagonal elements. The algorithm uses a direct 

product procedure for the matrix inversion. The 

ability to use massive parallelization for the 

calculation of the direct product makes it possible to 

use GPU calculations effectively, which speeds up the solution 

of this problem. We implement and study the 

properties of this algorithm for complex-valued 

matrices. Using the GPU algorithm for simulation 

of disordered 2D-lattice systems 

achieves a significant speedup in calculations. 

Contact: Alexey Osipov (Jet Propulsion Laboratory) 

CP06 - Accelerating Particle Simulations with 


RandomWalk is a program designed to model 

particle dispersion for a city-scale environment. It 

is used to model airborne hazards in urban 

environments. We reimplemented RandomWalk in 

CUDA to achieve significantly faster results. 

Contact: Scot Halverson (University of 

Minnesota Duluth) 

CP07 - Accelerating Particle-Tracking Based 

Beam Dynamics Simulations with GPUs 

Efficient implementation of general-purpose 

particle tracking on GPUs can result in significant 

performance benefits to large-scale particle 

tracking and tracking-based accelerator 

optimization simulations. We present our work on 

CUDA kernels for transfer maps of single-particle-dynamics 

and collective-effects beamline elements, 

to be incorporated into a GPU-accelerated version 

of the Argonne National Lab’s accelerator code 

ELEGANT. In particular, we discuss techniques for 

efficient utilization of the device shared, cache, and 

local memory in the design of single-particle and 

collective-effects kernels. We also discuss the use 

of data-parallel and hardware-assisted approaches 

for resolving memory contention issues in collective 

effects kernels. 

Contact: Keegan Amyx (Tech-X Corporation) 

COMPUTER GRAPHICS 

CG01 - CUDA-Based Interactive Design of Urban 

Ecosystems 

We address the problem of interactive design of 

urban spaces by integrating plants in urban 

environments. We have developed an interactive 

simulation and procedural system for 3D urban 

models. Using our CUDA-based interactive system 

we can simulate spatial distribution of a large 

ecosystem embedded in a city. We have achieved a 

performance of 50M-70M collision tests per 

second, allowing 250,000 plants to be 

simulated at 5-6 fps on a Tesla C2050. 

Contact: Michel Abdul Massih (Purdue University) 

CG02 - Robust GPU Algorithm for Exact 3D 

Minkowski Sum Computation 

We present a robust GPU algorithm to compute 

the exact 3D Minkowski sum of two polyhedral 

objects. While the Minkowski sum is of great 

importance in mathematics, geometric modeling, 

and robotics, it is hard to compute efficiently and 

robustly. The proposed algorithm achieves high 

performance by mainly running on the GPU, while 

filtering out unsafe predicates caused by 

degenerate cases by using interval arithmetic. The 

filtered unsafe predicates are passed to the CPU, 

where they are robustly evaluated by using 

extended arithmetic (MPFR). The performance 

results show a speedup of one order of magnitude 

versus a pure CPU algorithm. 

Contact: Min-Ho Kyung (Ajou University)
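
The filter-then-fallback idea in CG02 can be illustrated on a much smaller predicate. The sketch below is an assumption-laden Python example, not the authors' algorithm. It evaluates a 2D orientation test with simple interval arithmetic, returns the sign when the interval excludes zero, and otherwise falls back to exact rational arithmetic (Python's `fractions.Fraction` stands in for MPFR). The helper names are hypothetical, inputs are assumed to be floats, and `math.nextafter` requires Python 3.9 or later.

```python
import math
from fractions import Fraction

def _outward(lo, hi):
    """Widen a computed interval by one float on each side to absorb rounding."""
    return math.nextafter(lo, -math.inf), math.nextafter(hi, math.inf)

def _sub(a, b):
    """Interval subtraction a - b."""
    return _outward(a[0] - b[1], a[1] - b[0])

def _mul(a, b):
    """Interval multiplication a * b."""
    products = (a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1])
    return _outward(min(products), max(products))

def orientation(a, b, c, stats=None):
    """Return +1, -1 or 0 for the turn a -> b -> c, using a filtered predicate."""
    ax, ay, bx, by, cx, cy = [(v, v) for v in (*a, *b, *c)]
    det = _sub(_mul(_sub(bx, ax), _sub(cy, ay)),
               _mul(_sub(by, ay), _sub(cx, ax)))
    if det[0] > 0:
        return 1    # interval is strictly positive: the sign is safe
    if det[1] < 0:
        return -1   # interval is strictly negative: the sign is safe
    # Unsafe case: the interval contains zero, so evaluate exactly instead.
    if stats is not None:
        stats["exact_fallbacks"] = stats.get("exact_fallbacks", 0) + 1
    fa, fb, fc = [tuple(Fraction(v) for v in p) for p in (a, b, c)]
    exact = (fb[0] - fa[0]) * (fc[1] - fa[1]) - (fb[1] - fa[1]) * (fc[0] - fa[0])
    return (exact > 0) - (exact < 0)

stats = {}
left_turn = orientation((0.0, 0.0), (1.0, 0.0), (0.0, 1.0), stats)
right_turn = orientation((0.0, 0.0), (0.0, 1.0), (1.0, 0.0), stats)
collinear = orientation((0.0, 0.0), (1.0, 1.0), (2.0, 2.0), stats)
```

Only the degenerate, collinear input triggers the slow exact path in this example. That is the property the abstract relies on: most predicates are decided cheaply, and only the few unsafe ones pay for extended arithmetic.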

CG03 - Real-Time Mixed Water Simulation and 

Rendering Techniques for Visual Effects 

The synthesis of realistic scenes is an important 

research area for applications in games and 

visual effects. Research groups have developed 

techniques for realistic water rendering, but there 

is no research work that describes these techniques 

and makes a comparative analysis of them. The 

present work analyses the most 

important techniques for water simulation and 

visualization, compares their performance, 

and creates a system designed for artists. The system 

can choose between algorithms and combine 

them using layers to achieve the desired result. 

Finally, it can use a virtual camera to output the 

final render in multiple passes for post production. 

Contact: Rodrigo Marques (California State 

University, Chico) 

COMPUTER VISION 

CV01 - Efficient Dense Stereo Matching Using 

CUDA 

The proposed work demonstrates the general 

strategy for parallelization of dense matching 

methods on GPUs, shows the potential capability 

of common graphics cards for general 

computation, and compares the implementations 

between local and global methods with the 

example of Sum of Absolute Differences (SAD) and 

Semi-Global Matching (SGM). 

Contact: Ke Zhu (Technische Universität München) 
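
As a concrete picture of the local method named in CV01, the following minimal Python sketch performs SAD block matching on a rectified image pair stored as nested lists. It is not the authors' implementation; the function name `sad_disparity`, the window `radius` and the synthetic test images are assumptions made for illustration.

```python
def sad_disparity(left, right, max_disparity, radius=1):
    """Per-pixel disparity by minimising the Sum of Absolute Differences (SAD)."""
    rows, cols = len(left), len(left[0])
    disparity = [[0] * cols for _ in range(rows)]
    for y in range(radius, rows - radius):
        for x in range(radius, cols - radius):
            best_cost, best_d = None, 0
            # Limit d so that the shifted window stays inside the right image.
            for d in range(0, min(max_disparity, x - radius) + 1):
                cost = 0
                for dy in range(-radius, radius + 1):
                    for dx in range(-radius, radius + 1):
                        cost += abs(left[y + dy][x + dx]
                                    - right[y + dy][x + dx - d])
                if best_cost is None or cost < best_cost:
                    best_cost, best_d = cost, d
            disparity[y][x] = best_d
    return disparity

# Synthetic pair: the left image is the right image shifted by two pixels.
rows, cols, shift = 5, 12, 2
right_img = [[x * x + 10 * y for x in range(cols)] for y in range(rows)]
left_img = [[right_img[y][x - shift] if x >= shift else 0 for x in range(cols)]
            for y in range(rows)]
disp = sad_disparity(left_img, right_img, max_disparity=4)
```

Each output pixel depends only on a small window of the two input images, so the outer two loops are independent and correspond to one GPU thread per pixel. Global methods such as SGM add dependencies between neighbouring pixels and are harder to parallelize in this direct way.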

CV02 - Scalable Local Feature Extraction with 

Orientation Maps and GPU Computing 

This paper presents scalable computational 

techniques for extracting local invariant features. 

Although several investigators have developed 

efficient algorithms and implementations for 

feature extraction, scalability in terms of the 

number of extracted features remains an 

issue. We introduce a data structure called 

orientation maps and GPU computing to improve 

the scalability of feature extraction. Experimental 

results demonstrate that using orientation maps 

and a GPU enables us to improve the scalability as 

well as the efficiency of computation compared to 

a CPU. 

Contact: Naoyuki Ichimura (National Institute of 

Advanced Industrial Science and Technology (AIST)) 

CV03 - GPU-Accelerated Detection of Severe 

Video Distortions 

We show how to port a previously proposed 

algorithm for detection of severe analog and digital 

video distortions (termed ‘video breakup’), efficiently 

to Fermi Architecture GPUs with CUDA. By porting 

to a GPU, the runtime of the CPU implementations 

can be reduced by an order of magnitude. Thus our 

GPU algorithm is capable of analyzing up to ten Full 

HD (1920 x 1080) video streams in real-time. The 

GPU implementation is integrated in the AV- 

Inspector application, which allows the user to get 

an automatic assessment of the quality of video and 

film material in very short time. 

Contact: Hannes Fassold (JOANNEUM RESEARCH) 

CV04 - VScreen: A Real-Time Augmented 

Video Method 

We present a tool for image editing that allows us 

to replace a region of any image or video with another 

image or video. This application is useful for 

advertisements, commercials, music videos, 

movies, etc. The main difference between editing 

(augmenting) videos and fixed images is that the 

occlusions need to be managed. Moving objects in the 

foreground may occlude the augmented region in the 

background. We therefore use a procedure for 

Foreground/Background (FgBg) video 

segmentation, which is implemented on NVIDIA 

video cards to fulfill the real-time requirement. 

Contact: Francisco J. Hernandez-Lopez (CIMAT A.C.) 

CV05 - Accelerated Multiple Region Evaluation 

for Human Motion Tracking 

In this work we present a study about different 

NVIDIA CUDA approaches to the problem of the 

evaluation of region of interest (ROI) pixels in 

an image. This problem is usually integrated as 

part of other higher level methods, such as image 

retargeting, completion, video summarization, 

object detection, visual tracking, etc. Because 

these problems evaluate millions of ROIs, in 

many cases performance is usually far from 

being interactive. 

Contact: David Concha Gomez (Universidad Rey 

Juan Carlos) 

CV06 - Efficient Segmentation Trees on the GPU 

There are numerous computer vision tasks which 

demand a high performance algorithm for 

building segmentation trees. Unfortunately, 

current state-of-the-art methods aimed at the 

CPU are far too slow. The present work describes an 

efficient GPU implementation of a popular 

algorithm. Performance evaluations show that 

unlike its CPU counterpart the proposed method 

is suitable for real-time applications. 

Contact: Yaroslav Ganin (NVIDIA) 

CV07 - GPU Vision: OpenCV’s GPU Module 

Accelerates Computer Vision 

OpenCV is the world’s most used library for 

computer vision with over 3 million downloads 

worldwide. Using the power of CUDA and the 

NVPP library, the most computationally 

demanding of OpenCV’s more than 500 functions 

have been ported for an average speedup of 33X 

over the already highly optimized CPU code. 

Several application work flows have been 

dramatically improved, including HOG pedestrian 

detection, face detection, stereo correspondence, 

and feature detection and matching. 

Contact: Colin Tracey (NVIDIA) 


CV08 - Orientation Flows: GPU Implementation 

Clarifies Cortical Computation 

Orientation flows play an important role in shape 

inference. We have developed a model of 

orientation flow extraction that explains the 

statistics of neurophysiologically observed 

connection structure through second order (mean 

and variance). Our GPU-based implementation of 

this model realizes dramatic performance 

improvements over the original C implementation, 

enabling us to pursue formerly prohibitively 

time-consuming studies. 

Contact: Daniel Holtmann-Rice (Yale University) 

CV09 - Michigan Visual Sonification System: 

Driving Efficient Mobile Vision Designs 

Visual Sonification is the process of converting 

visual properties of objects into audio. The 

Michigan Visual Sonification System (MVSS) 

utilizes this process to assist the visually impaired 

in distinguishing objects in their surroundings. 

MVSS uses computer vision to analyze scenes and 

create a dynamic audio representation of each 

object which is presented to the user using 3D 

audio. The performance of MVSS on mobile 

processors exposed a need for improved mobile 

vision performance. Our benchmark suite, 

MEVBench, was used to further analyze the 

computational characteristics of mobile vision. 

The EFFEX architecture was developed for 

efficient feature extraction in mobile vision. 

Contact: Jason Clemons (University of Michigan) 

DATABASES, DATA MINING, BUSINESS 

INTELLIGENCE 

DB01 - Parallel Data Mining Techniques on 

Graphics Processing Unit with CUDA 

Data mining is widely used in various domains and 

has significant applications. However, current data 

mining tools cannot meet the requirement of 

applications with large-scale databases in terms 

of speed. We propose three techniques to

accelerate fundamental kernels in data mining

algorithms on the CUDA platform: a scalable thread

scheduling scheme for irregular patterns, a parallel

distributed top-k scheme, and a parallel

high-dimension reduction scheme. They play a key role

in our GUCAS_CU-Miner, which includes three

representative data mining algorithms: CU-Apriori,

CU-KNN and CU-K-means. The

experiments have shown that GPU + CUDA 

parallel architecture is feasible and promising for 

data mining applications. 

Contact: Ying Liu (Graduate University of Chinese 

Academy of Sciences) 
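The abstract does not spell out the parallel distributed top-k scheme. As a rough CPU-side illustration of the general idea only (each partition finds its own top-k, then the small candidate lists are merged), one might sketch it as below. The names `partitioned_top_k` and `num_parts` are hypothetical and are not taken from GUCAS_CU-Miner.

```python
import heapq


def partitioned_top_k(values, k, num_parts=4):
    """Return the k largest values, in descending order.

    Hypothetical sketch: each partition stands in for the work one thread
    block might do on a GPU; this is not the GUCAS_CU-Miner code.
    """
    if k <= 0:
        return []
    # Split the input into roughly equal, strided partitions.
    parts = [values[i::num_parts] for i in range(num_parts)]
    # Stage 1: every partition independently keeps only its own top-k.
    local = [heapq.nlargest(k, part) for part in parts]
    # Stage 2: merge the (at most k * num_parts) candidates into the final top-k.
    candidates = [v for part in local for v in part]
    return heapq.nlargest(k, candidates)
```

Because any element of the global top-k must also be in the top-k of its own partition, the merge stage only has to look at a small candidate set, which is what makes the first stage worth running in parallel.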

DB02 - Parallel Spectral Graph Partitioning 

on CUDA 

Spectral graph partitioning is a widely used 

technique in many fields such as image 

processing, scientific computing and machine 

learning. In this study, we analyze the subroutines 

of spectral graph partitioning algorithm on CUDA. 

Each step is analyzed using several

techniques to reach a conclusion about the suitability of

the step for GPU implementation. Two different

GPU configurations are implemented and their 

results are compared against the CPU version. 


University) 

DB03 - Red Fox: Accelerating Data Warehousing 

Applications Using GPGPUs 

Red Fox is a compiler optimization framework for 

accelerating large scale data warehousing 

applications on cloud architectures augmented 

with GPUs. Currently, the framework is structured 

around the program transformations based on the 

concepts of kernel fusion and fission, drawing 

upon the analogy with classical loop fusion and 

fission transformations. These transformations 

seek to improve GPU utilization and optimize data 

movement throughout the CPU/GPU memory 

hierarchy. Coupled with the Ocelot dynamic 

compiler, this framework can optimize the 

execution of applications across the CPU and 

GPU. The initial application domain includes 

relational operators and arithmetic functions 

found in data warehousing applications. 

Contact: Haicheng Wu (Georgia Institute of 

Technology) 
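To make the kernel fusion idea concrete, here is a minimal CPU-side sketch, assuming a toy SELECT followed by an arithmetic step. The function names are hypothetical and this is not Red Fox or Ocelot code; it only contrasts two passes with an intermediate buffer against one fused pass.

```python
def select_then_scale_unfused(rows, threshold, factor):
    """Two separate passes: the SELECT result is materialized in between."""
    selected = [r for r in rows if r > threshold]  # pass 1: relational SELECT
    return [r * factor for r in selected]          # pass 2: arithmetic


def select_then_scale_fused(rows, threshold, factor):
    """One fused pass: no intermediate buffer is written or re-read."""
    out = []
    for r in rows:
        if r > threshold:            # SELECT predicate
            out.append(r * factor)   # arithmetic applied immediately
    return out
```

Both versions return the same result; on a GPU the fused form would avoid writing the intermediate result to memory and reading it back, which is the data-movement saving the abstract refers to.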

DEVELOPMENT TOOLS & LIBRARIES 

DL01 - AutoTune: Automatic Online Code Tuning 

Performance analysis and tuning is an important 

step in programming multicore and manycore 

architectures. There are several tools to help 

developers analyze application performance; still, 

no tool provides recommendations about how to 

tune the code. AutoTune will extend Periscope, an 

automatic online and distributed performance 

analysis tool developed by Technische Universität 

München, with plugins for performance and 

energy efficiency tuning. The resulting Periscope 

Tuning Framework will be able to tune serial and 

parallel codes with and without GPU kernels; in 

addition, it will return tuning recommendations 

that can be integrated into the production version 

of the code. 

Contact: Renato Miceli (Aon Benfield Securities) 

DL02 - Interactive Linked Visualizations for 

Performance Analysis Of Heterogeneous 

Computing Clusters 

Performance analysis is a vital step in identifying 

execution bottlenecks to help target optimizations. 

This analysis is derived from observations of 

performance data collected from the computing 

hardware. Data obtained from computing clusters 

is necessarily complicated because its collection 

involves multiple interacting nodes as opposed to 

just a single serial execution. Further, 

heterogeneous clusters, having CPUs working

together with several GPUs, add further layers

of complexity. These characteristics pose a 

serious challenge to the analysis and 

improvement of application performance. We 

present a tool that assists performance analysis 

by visualizing performance data with the help of 

various interactive linked views. 

Contact: Aaditya Landge (Scientific Computing and 

Imaging Institute, University of Utah) 

DL03 - High-Performance Pedestrian Multi- 

Simulation Using GPU Cluster 

We have created a tool that could potentially help 

with decision support and planning of large-scale 

emergency pedestrian evacuations. Through the 

use of our simulation software distributed over a 

GPU cluster, many evacuation scenarios can be 

simultaneously simulated at faster than real-time 

speeds and compared for their effectiveness. 

Contact: Twin Karmakharm (University of Sheffield) 

DL04 - ttgLib - Middleware for Dynamic Software 

Adaptation to Heterogeneous Architectures 

We present ttgLib, a middleware that efficiently 

distributes computational tasks between CPUs 

and GPUs and provides load balancing between 

them on the fly. This enables an application to use 

all available processing units of a heterogeneous

HPC system simultaneously. ttgLib performs

several dynamic optimization procedures that 

significantly facilitate the development of new 

applications for and porting of existing software to 

heterogeneous platforms. ttgLib can be

considered an extension of widely used parallel

programming tools that can be easily integrated

into the software development process. This

middleware efficiently solves the most tedious

problems of ‘heterogeneous coding’ that

developers usually encounter.

Contact: Sergey Grizan (Moscow State University 

and Siberian Federal University, ttgLabs) 

DL05 - Efficient Formal Verification of CUDA 

SIMD and Atomics 

Detecting and debugging assertion failures and

runtime errors in CUDA programs is usually hard.

Typical multithreaded program verification

methods are not effective for verifying the large-scale,

fine-grained concurrency of CUDA. Our novel

contribution is a technique to handle CUDA SIMD 

plus Atomics using concolic execution methods. 

Contact: Wei-Fan Chiang (School of Computing, 

University of Utah) 

DL06 - Performance Optimizations And 

Modeling For Large-Scale Heterogeneous 

Computing Systems 

This poster proposes to address the following at 

every level of parallelism in heterogeneous 

computing systems: 1) performance optimizations 

of applications, and 2) performance modeling 

and prediction. 

Contact: Ashwin Aji (Virginia Tech) 

ELECTRONIC DESIGN AUTOMATION 

EA01 - Parallel VLSI CAD Algorithms for Energy 

Efficient Heterogeneous Computing Platforms 

In the past decade, parallel VLSI CAD tools have 

been successfully developed by major EDA 

vendors to leverage multi-core/distributed parallel 

computing powers. However, for recent energy 

efficient heterogeneous computing platforms that 

integrate multi-core CPUs and many-core GPUs, 

very limited progress has been made in the VLSI CAD

research community. Developing efficient CAD

algorithms for such heterogeneous platforms can 

be extremely challenging, requiring strong 

domain-specific CAD algorithm knowledge as well 

as thorough understanding of the latest hardware 

properties. In this abstract, we show our latest 

research progress on large scale circuit electrical 

and thermal modeling and simulation methods. 

Contact: Zhuo Feng (Michigan Technological 

University) 

EA02 - Ultra-Low Power Transceivers for 

High-Bandwidth Interconnects 

A low-power transceiver for highly parallel 

chip-to-chip data communication is presented. 

The receiver is implemented in a 45nm SOI 

technology. High data rate and low power

dissipation are achieved using a switched-capacitor

S/H/summer front-end which enables FEXT 

cancellation with 33µW/Gbps power overhead. It 

operates up to 15Gb/s and dissipates 7.5mW from 

a 1.2V supply. The 15Gb/s transmitter employs an 

analog filtering pre-emphasis equalization 

technique and dissipates 10mW from a 1.2V supply 

while occupying 0.01 mm². It was fabricated in

65nm CMOS technology and compensates for 

channel losses up to 20dB at Nyquist-rate. 

Contact: Meisam Honarvar Nazari (California 


ENERGY EXPLORATION 

EE01 - The Maven Vector-Thread Architecture 

We present a taxonomy and modular 

implementation approach for data-parallel 

accelerators, including the MIMD, vector-SIMD, 

subword-SIMD, SIMT, and vector-thread (VT)

architectural design patterns. We have developed 

a new VT microarchitecture, Maven, based on the 

traditional vector-SIMD microarchitecture that is 

simpler to implement and easier to program than 

previous VT designs. Using an extensive design-space

exploration of full VLSI implementations of 

many accelerator design points, we evaluate the 

varying tradeoffs between programmability and 

implementation efficiency among the different 

architectural patterns. Our results suggest that 

the Maven VT microarchitecture is superior to the 

vector-SIMD architecture, providing both greater 

efficiency and easier programmability. 

Contact: Yunsup Lee (UC Berkeley) 




FINANCE 

FA01 - PathWise High Productivity 

Computing Platform 

PathWise High Productivity Computing (HPC) 

platform is a financial modeling environment for 

targeting GPU grids. 

Contact: Aamir Mohammad (Tsinghua University) 

GENERAL INTEREST 

GI01 - High Throughput MIMO-OFDM Detection 

with Graphics Processing Units 

A novel strategy is proposed to implement 

a reconfigurable MMSE-based detector for 

multiple-input multiple-output (MIMO) wireless 

communication systems with orthogonal 

frequency-division multiplexing (OFDM). The key 

component of the strategy is a massively parallel 

implementation of the scalable matrix inversion 

on GPUs. A series of optimization methods 

including multi-threaded matrix inversion with 

multiple data frames, maximizing the utilization 

of the fast on-chip memories, and overlapping 

kernel execution with data transfer, are proposed. 

Experiments demonstrate that the throughputs 

for a 4×4 64QAM MIMO-OFDM system can 

achieve over 100 Mbit/s, satisfying 4G wireless 

communication standards like LTE/LTE-Advanced. 

Contact: Dan Sui (Wireless & Mobile 

Communication R&D Center, Tsinghua University) 

GI02 - A Fast Irregular LDPC Decoder on NVIDIA 

Fermi 

Low-Density Parity-Check (LDPC) codes are 

widely used in many wireless communication 

systems. The decoding algorithms are often 

time-consuming. The Graphics Processing Unit (GPU)

is an attractive co-processor to the CPU for

massively parallel computing. A GPU-based

LDPC decoder is studied, especially for irregular

LDPC codes. Optimization techniques for the GPU are

considered. Experimental results demonstrate

that the GPU can achieve a speedup of more

than 80 times over the CPU.

Contact: Dan Sui (Wireless & Mobile 

Communication R&D Center, Tsinghua University) 

GI03 - Actual Power Consumption in Pattern 

Matching on CUDA GPUs 

For many embedded applications, e.g. in the

aerospace/defense industry, power efficiency is

very important as both cooling and power are 

often difficult to supply. We show that the specified 

max power of a CUDA GPU is not a good measure 

of actual power consumption under a CUDA load, 

and that writing efficient code which reaches high 

utilization is of the essence when it comes to 

power efficiency. 

Contact: Ian Wainwright (High Performance 

Consulting) 

GI04 - GPU-Accelerated Fingerprint Matching 

As biometric databases approach hundreds of 

millions of identities in size, it becomes more 

costly and time-consuming to search these 

databases. Using a GPU-accelerated coarse 

filtering algorithm, we demonstrate that a large 

fingerprint database can be searched very quickly 

for a matching individual by isolating a small list 

of potential matches using GPUs, such that only 

these few records will be given further scrutiny by 

the matching system. 

Contact: Scott Bai (The MITRE Corporation) 

GI05 - Towards Task-Pipelined General Purpose 

Computing on GPUs 

Many real-world applications, especially those 

following a stream processing pattern, feature 

interleaved task-pipelined and data parallelisms. 

Current GPUs are ill-equipped for such 

applications due to the insufficient usage of 

computing resources and/or the excessive off-chip 

memory traffic. This paper focuses on architectural 

enhancements to enable task-pipelined execution 

of data-parallel kernels on GPUs. We propose an 

efficient adaptive dynamic scheduling mechanism 

and a moderately modified L2 cache structure to 

orchestrate both task-pipelined and data 

parallelisms. Simulation results show that the 

proposed GPU architecture improves IPC by 18% 

and reduces the overall access to off-chip GPU 

memory by 11% on average. 

Contact: Shuai Mu (ABB Corporate Research) 

GI06 - High Performance Computing in 

Volumetric Velocimetry 

Since the advent of Particle Image Velocimetry (PIV) 

in experimental fluids measurements there has 

been a steady and sustained increase in the

throughput capability and resolution of hardware

devices (i.e., CMOS cameras) needed to acquire and

transfer the copious amounts of image data. With 

the introduction of tomographic measurement 

techniques the amount of data suddenly increased 

by an order of magnitude. While the development of

hardware keeps pace reasonably well with the

acquisition demand placed by current experiments, 

the ability of computers and current algorithms to 

process and further reduce the data within a 

reasonable period has fallen dramatically behind. 

Contact: Thomas Nonn (Moscow State University, 

Physics Department) 

GI07 - Conformal Transformations of 3D Meshes 

in Parallel 

Arbitrary deformations applied on 3D meshes 

pose significant restrictions in many design 

applications. Conformal transformations, those

that preserve oriented angles for a given 3D mesh 

parametrization, however, offer the right balance 

between flexibility of the geometric form and 

structural preservation. The advantage of using 

such transformations is two-fold: one can 

maintain flexibility of the design process, and 

preserve texture and emblematic features of the 




mesh. To this end, we investigate efficient and 

scalable implementations of the methodology 

introduced by Crane et al. in GPU architectures. 

Contact: Nikolaos Yiotis (London College of 

Fashion/ University of Arts London) 

GPU ACCELERATED INTERNET 

GA01 - Accelerating Greater Than-Strong 

Conditional Oblivious Transfer Multiparty 

Protocol Using GPU 

Greater Than-Strong Conditional Oblivious 

Transfer (GT-SCOT) is a protocol used for sharing 

data between two parties without revealing any 

private information. Due to the large number of 

iterative operations and the increasing size of the 

input, the algorithm is computationally intensive, 

and hence cannot be used for large credentials or 

secure database mining. This work presents an 

implementation of GT-SCOT using GPU in order to 

accelerate the operations and handle large 

messages. Results show that GPU 

implementation achieved a speedup of 7x for 

messages with size of 1024 bits using 64 bits of 

encryption for each bit of the message. 

Contact: Axel Rivera (The University of Tokyo) 

LIFE SCIENCES 

LS01 - GPU-Enabled Stochastic Spatiotemporal 

Model of Rat Ventricular Myocyte Calcium 

Dynamics 

Some cardiac arrhythmias are thought to result

from Ca2+ waves under the spark-induced spark

phenomenon: calcium sparks, local elevations

of calcium, may recruit sparks at

neighboring sites. However, the study of such

calcium dynamics in a detailed whole-cell model is

computationally prohibitive. We introduce a novel

Markov chain Monte Carlo simulation. The time

step is in the range of 10 ns to 1 µs, so the

simulation can capture the dynamics of

individual ion channel kinetics. We

present an ongoing effort to study calcium

dynamics that, for the first time, incorporates the detailed

structure of rat ventricular myocytes.

Contact: Tuan Hoang-Trong (George Mason 

University) 

LS02 - GPU Accelerated Signal Processing in Ion 

Torrent Analysis Pipeline 

We have adopted solutions to provide fast analysis 

results to our customers by accelerating our 

signal processing pipeline using Tesla C2050 GPU. 

This poster presents a high level view of GPU 

application to our processing pipeline. 

Contact: Mohit Gupta (Life Technologies) 

LS03 - A Fast CUDA Compatible Short Read 

Aligner to Large Genomes 

We present CUSHAW, a parallelized short read 

aligner that exploits CUDA-compatible graphics 

hardware as accelerators to achieve fast speed. It 

employs a quality-aware bounded search 

approach based on the Burrows-Wheeler 

transform (BWT) and the FM-index to reduce the 

search space and achieve high alignment quality. 

Performance evaluation reveals that CUSHAW 

running on one or two GPUs achieves significant 

speedups in terms of execution time, while 

yielding comparable or even better alignment 

quality for paired-end alignments compared to 

three popular BWT-based aligners: Bowtie, BWA

and SOAP2 (availability:

http://cushaw.sourceforge.net).

Contact: Bertil Schmidt (Johannes Gutenberg 

University Mainz) 
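For readers unfamiliar with how a BWT and FM-index narrow the search space, the following toy sketch counts exact occurrences of a pattern by backward search. It is an assumption-laden illustration, not CUSHAW itself: it handles exact matches only (no quality-aware bounded search), builds the index naively, and uses the hypothetical names `bwt_index` and `count_occurrences`.

```python
def bwt_index(text):
    """Build the BWT of text plus the table needed for backward search.

    Naive construction by sorting all suffixes; fine for a toy example only.
    """
    text += "$"  # unique end marker, assumed to sort before all other symbols
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    # first[c] = number of symbols in text that sort strictly before c
    first, total = {}, 0
    for c in sorted(set(text)):
        first[c] = total
        total += text.count(c)
    return bwt, first


def count_occurrences(pattern, bwt, first):
    """Count exact occurrences of a non-empty pattern via backward search."""
    lo, hi = 0, len(bwt)  # current half-open suffix-array interval
    for c in reversed(pattern):
        if c not in first:
            return 0
        # occ(c, i) is computed by a plain scan here; a real index stores it.
        lo = first[c] + bwt[:lo].count(c)
        hi = first[c] + bwt[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo
```

Each pattern character shrinks the suffix-array interval, so the cost of a query depends on the read length rather than on the genome length, which is why this family of indexes suits short read alignment.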

MACHINE LEARNING & AI 

ML01 - Accelerating Parallel Monte Carlo Tree 

Search Using CUDA 

The poster presents a parallel implementation of 

Monte Carlo Tree Search algorithm on GPU using 

CUDA. It is run on the Tesla-equipped TSUBAME

supercomputer and the results show that in a 

2-player game such as Reversi, the GPU version is 

much stronger than the CPU one. Additionally, it 

can be easily scaled to thousands of GPU cores. 

The scalability factors are presented. 

Contact: Kamil Rocki (KPIT Cummins Infosystems Ltd.) 

ML02 - Message Passing Parallelism for Belief 

Propagation in Junction Trees 

Belief propagation over junction trees is known to be

computationally intensive in the general case. One 

way of addressing this computational challenge is 

to use parallel computing on GPU. In this paper, we 

develop a two dimensional parallel computing 

model for node level message passing. Based on 

this approach, we further develop a novel clique 

merging technique that leverages the two 

dimensions of parallelism to adapt various

Bayesian networks to the parallel computing platform.

We implement our approach on an NVIDIA GPU and 

test it using BNs from several applications. 

Contact: Lu Zheng (Carnegie Mellon) 

ML03 - Parallel Memetic Algorithm 

Implementation on CUDA 

In this poster, a parallel memetic algorithm 

implementation for CUDA platform is described. 

The conventional genetic operators are adapted to 

the GPU considering the GPU architecture. In this 

population-based optimization technique, there are

one or more islands, and each island consists of a

constant number of individuals. Each CUDA thread

is responsible for evolution of one individual, and 

islands are mapped as CUDA blocks to benefit from 

the shared memory. The results show up to 38x 

speedup compared to the CPU implementation. 


University)

ML04 - GPU-Accelerated Action Acquisition 

Through Multiple Time Scales Recurrent 

Neural Network 

This poster presents novel results of complex action 

learning experiments based on the use of extended 

multiple timescales recurrent neural

networks (MTRNN). The experiments were carried

out with the iCub humanoid robot, as a model of the 

developmental learning of motor primitives as the 

basis of sensorimotor and linguistic 

compositionality. The model was implemented 

through the GPU-accelerated Aquila cognitive 

robotics toolkit. The results presented herein show 

that the model was able to learn and successfully 

reproduce multiple actions in an object manipulation 

task scenario using large-scale MTRNNs. This 

forms the basis of ongoing experiments on action

and language compositionality. 

Contact: Martin Peniak (Federal University of Rio 

de Janeiro) 

MACHINE VISION 

MV01 - GPU Based Fast Block Matching Using 

Orthogonal Thread Transformation 

The block matching (BM) technique is extensively used

in object tracking and defect detection problems. 

BM has moderate accuracy for defect detection but 

it suffers from heavy performance drawbacks. 

Modifications in BM with compromised accuracy 

and increased performance have been reported in 

the literature. The technique is an exhaustive search,

but it is also highly data

parallel in nature. We present an implementation of

the BM algorithm in CUDA using a novel orthogonal

thread transformation technique to maintain the

data parallelism throughout the processing. We

have achieved a 350x speedup over the CPU and 2.3x

over other GPU implementations.

Contact: Sudhakar Sah (University of Oregon) 
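As a plain reference for what exhaustive block matching computes, here is a minimal sketch assuming a sum-of-absolute-differences (SAD) cost, which the abstract does not specify. It does not attempt to reproduce the orthogonal thread transformation, and the names `sad` and `best_match` are hypothetical.

```python
def sad(frame, block, top, left):
    """Sum of absolute differences between block and the frame window at (top, left)."""
    return sum(
        abs(frame[top + r][left + c] - block[r][c])
        for r in range(len(block))
        for c in range(len(block[0]))
    )


def best_match(frame, block):
    """Exhaustively test every window position; return (row, col, cost) of the best."""
    rows = len(frame) - len(block) + 1
    cols = len(frame[0]) - len(block[0]) + 1
    # Each candidate position is independent of the others, which is what
    # makes the search data parallel (e.g. one GPU thread per position).
    candidates = (
        (sad(frame, block, r, c), r, c) for r in range(rows) for c in range(cols)
    )
    cost, r, c = min(candidates)
    return r, c, cost
```

The number of candidate positions grows with the frame area and each one costs a full block comparison, which is why the sequential version is slow and the independent candidates map naturally onto GPU threads.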

MV02 - Integrating Machine Vision and 

Kinematics for a Robotic EV Charger 

This poster explains the use of the Tegra II ULP

GeForce GPU for Integrated Machine Vision and 

Inverse Kinematics (IK) on a Robotic (SCARA) 

Electric Vehicle Charging System. The 

convergence of wireless tech, mobile chip sets 

and powerful software environments enabled the 

smartphone revolution. Applying these economies 

with the addition of powerful imaging and GPGPU 

capabilities enables a low-cost, high-performance, 

easily-engineered, embedded machine-to-machine

(M2M) solution to an emergent problem 

in vehicle transportation. The result: 

PowerHydrant® eliminates electric vehicle

charging inconvenience. Robotic conductive

chargers beat wireless inductive chargers on 

efficiency, charger-time and constraint-free use. 

Contact: Kevin Leary (PowerHydrant) 

MEDICAL IMAGING & VISUALIZATION 

MI01 - Optimal Speed Gain for CUDA 

Implementation of SPECT Image Reconstruction 

GPU implementation can greatly accelerate 

iterative techniques of 3D image reconstruction in 

nuclear medicine imaging. To obtain high quality 

images in Single Photon Emission Computed 

Tomography (SPECT) within reduced scanning 

times, high sensitivity collimators need to be used 

and their response function modeled in the 

reconstruction. This is in general very 

computationally intensive and unfeasible with 

conventional PCs and algorithm implementations. 

Our software is able to perform the reconstruction 

of patient data within clinically acceptable times 

(18 s vs 17 min on CPU) using relatively low cost 

and widely available hardware. 

Contact: Jakub Pietrzak (RCPE-TU GRAZ) 

MI02 - Accelerating Mutual Information 

Computation for Nonrigid Registration on the GPU 

Nonrigid registration is a technique for defining a 

geometric relation between each point in images. 

Although this technique helps medical doctors in 

detecting cancers by monitoring changes in size, 

some registration algorithms cannot be efficiently 

implemented due to small shared memory. The 

main objective of this poster is to show how such a

capacity issue can be tackled for intra-operative

registration. As an example, we present a CUDA-based

method capable of rapidly computing joint 

histograms using shared memory. Our method 

achieved a three-fold speedup by exploiting the 

sparse structure of joint histograms, with 

successful registration of liver CT datasets. 

Contact: Kei Ikeda (Osaka University) 
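The quantity being accelerated can be illustrated with a small CPU sketch that computes mutual information from a joint histogram, storing only the non-empty bins. This is an assumption-based illustration of the general definition, not the authors' CUDA method; `mutual_information` is a hypothetical name and the images are assumed to be flattened lists of discrete intensities.

```python
import math
from collections import Counter


def mutual_information(image_a, image_b):
    """Mutual information (in bits) of two equally sized, flattened images.

    Only the non-empty bins of the joint histogram are stored, loosely
    echoing the sparsity the abstract mentions; this is a CPU sketch.
    """
    if len(image_a) != len(image_b) or not image_a:
        raise ValueError("images must be non-empty and of equal size")
    n = len(image_a)
    joint = Counter(zip(image_a, image_b))  # sparse joint histogram
    hist_a = Counter(image_a)               # marginal histograms
    hist_b = Counter(image_b)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b) simplifies to count * n / (hist_a[a] * hist_b[b])
        mi += p_ab * math.log2(count * n / (hist_a[a] * hist_b[b]))
    return mi
```

A full joint histogram for two 8-bit images has 65,536 bins, but well-aligned images populate only a small fraction of them, which hints at why exploiting sparsity helps when shared memory is small.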

MI03 - CUDA Accelerated Real Time Steered 

Spatial Compounding in Diagnostic Ultrasound 

Spatial compounding is a real time transmit and 

receive beam steering technique which acquires 

images from multiple lines of sight to increase the 

information content in medical ultrasound 

images. This function is implemented in the latest 

release of the ACUSON SC2000 platform for high 

frequency vascular imaging using CUDA texture 

lookups for geometric transformation to a 

common view. CUDA and the Quadro 2000 enable 

a substantial increase in processing performance 

(>8×) over conventional CPU based processing. 

Contact: Ismayil Guracar (Siemens Healthcare, 

Ultrasound Business Unit) 

MI04 - Ultrafast Multipinhole SPECT 

Iterative Reconstruction Using CUDA-Based 


We have developed an ultrafast SIR method for 

multipinhole SPECT programmed in CUDA and 

tested using a high performance graphic 

processing unit. We show significant performance 

improvement in reconstruction using both 

computer-generated and experimental 

sinograms, demonstrating an up-to fifty-fold 




speed enhancement with virtually the same 

accuracy as the CPU-based SIR (with 0.15% 

normalized root mean square error). 

Contact: Fares Alhassen (University of California, 

San Francisco) 

MOBILE APPLICATIONS & INTERFACES 

MA01 - Accelerating Computer Vision with 

Tegra GPU 

The mobile platform is quickly becoming a serious 

computing device, capable of tackling complex 

computer vision tasks. The ability to share memory 

space between GPU and CPU on Tegra 3 offers a 

unique opportunity to utilize the GPU without 

expensive memory copies. We have demonstrated 

that acceleration of just a handful of compute-intensive

CV operations on the Tegra 3 GPU can 

free some common bottlenecks and achieve real 

time performance on Video Stabilization and 

Panoramic Stitching applications. 

Contact: Colin Tracey (NVIDIA) 

MOLECULAR DYNAMICS 

MD01 - GPU-Based Molecular Dynamic 

Simulations Optimized with CUDPP and CURAND 

Libraries 

Computer simulations are indispensable tools for

deciphering how biomolecular structures and 

folding correspond to functions. These simulations 

benefit greatly from advances in parallel 

computations (e.g., GPUs) because the calculated 

forces are inherently independent computations. 

However, a major limitation of GPUs is that the 

transfer of data between the CPU and GPU must be 

minimized. We introduce a new algorithm for 

calculating neighbor lists and transferring them to 

GPUs with minimal memory transfer. This 

algorithm is readily implemented with CUDPP and 

CURAND libraries. Using simulations of the 

ribosome, we observe a significant improvement in 

the performance, which is system size dependent. 

Contact: Tyson Lipscomb (Wake Forest University) 
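The abstract does not describe the new neighbor list algorithm itself. For orientation only, a brute-force reference of what a neighbor list contains might look like the sketch below, assuming a simple distance cutoff and no periodic boundaries; `neighbor_list` is a hypothetical name and this is not the CUDPP-based method.

```python
def neighbor_list(positions, cutoff):
    """Return, for each particle, the indices of particles closer than cutoff.

    Brute-force O(N^2) reference; positions is a list of (x, y, z) tuples.
    """
    cutoff_sq = cutoff * cutoff  # compare squared distances to avoid sqrt
    neighbors = [[] for _ in positions]
    for i, (xi, yi, zi) in enumerate(positions):
        for j in range(i + 1, len(positions)):
            xj, yj, zj = positions[j]
            dist_sq = (xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2
            if dist_sq < cutoff_sq:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors
```

Force calculations then only visit the listed pairs, so building the list on the GPU rather than copying it from the CPU each time is one way the CPU-GPU transfer named in the abstract could be reduced.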

MD02 - Plane Wave Pseudopotential Density 

Functional Theory Calculations on GPU Clusters 

In this poster, we present our implementation of the 

density functional theory (DFT) plane wave pseudopotential 

(PWP) calculation on GPU clusters. This 

GPU version is developed based on a CPU DFT-PWP 

code: PEtot. Our test indicates that the GPU version 

can have a ~10 times speed-up over the CPU version 

and is about 5 times faster than the legendary VASP 

code. An analysis of the speed-up and the scaling with

the number of CPU/GPU computing units (up to 256)

is presented. The success of our speed-up relies

on a hybrid reciprocal-space and band-index 

parallelization scheme. 

Contact: WeiLe Jia (Supercomputing Center of 

CNIC, Chinese Academy of Sciences) 

MD03 - Single vs. Double Precision MD 

Simulations: Correlation is Length-Scale 

Dependent 

This poster evaluates how single vs. double 

precision operations affect Molecular Dynamics 

simulations, using GPU-optimized MD simulation

software to perform coarse-grained MD

simulations of many biologically relevant systems of

various sizes. Three different measures of structural

similarity are used to analyze the structure of

trajectories and to determine when single precision

calculations are appropriate and when they are

not. The conclusion is that single-precision

implementations of MD simulations gain

performance with no significant loss in

the accuracy and precision of the simulations if the

system size is sufficiently large.

Contact: Anqi Zou (Wake Forest University) 

MD04 - GPU-Based Monte Carlo Simulations for 

Canonical and Gibbs Ensembles 

Markov Chain Monte Carlo (MCMC) simulation of 

chemical systems allows examination of 

nanoscopic thermodynamics and associated 

behavior at small time scales. These simulations 

tend to be computationally expensive, requiring 

days or more of CPU time to collect data. 

Optimization work is essential in order to remedy 

the inherent time complexity of these simulations. 

To date, there is no multi-ensemble molecular 

MCMC engine for the simulation of chemical 

systems that leverages GPUs. Speedups of 6.3

and 14.4 times were achieved for a problem size of 

131072 particles for the canonical and Gibbs 

ensemble implementations, respectively. 

Contact: Loren Schwiebert (Northeastern University) 

MD05 - Simultaneous Evolution of Multiple 

Molecular Dynamics Simulations 

The need to generate statistically significant data 

from time intensive molecular dynamics (MD) 

simulations drives the search for algorithms that 

can take advantage of inherent parallelism in 

computer architectures. CUDA is an ideal platform 

for performing multiple MD simulations for 

ensemble averaging. We demonstrate a proof of 

concept highlighting the potential of CUDA in 

performing multiple MD simulations with different 

initial conditions. Compared to the traditional 

implementation, CUDA is able to deliver the output 

ten times faster. Work is in progress for improving 

the performance through memory optimization. 

Contact: Cory Slep (NC State University) 
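The ensemble idea can be sketched on the CPU with a toy system. The example below assumes a 1D harmonic oscillator integrated with velocity Verlet, which the abstract does not mention; the function names and parameters are hypothetical. Each replica starts from its own initial condition and never interacts with the others, so on a GPU each could be assigned to its own thread or block.

```python
def velocity_verlet(x, v, steps, dt=0.01, k=1.0, m=1.0):
    """Integrate one 1D harmonic oscillator; return the final (x, v)."""
    a = -k * x / m
    for _ in range(steps):
        x += v * dt + 0.5 * a * dt * dt
        a_new = -k * x / m
        v += 0.5 * (a + a_new) * dt
        a = a_new
    return x, v


def ensemble_average_energy(initial_conditions, steps, k=1.0, m=1.0):
    """Run one independent simulation per initial condition and average the energy."""
    energies = []
    for x0, v0 in initial_conditions:  # replicas never interact
        x, v = velocity_verlet(x0, v0, steps, k=k, m=m)
        energies.append(0.5 * m * v * v + 0.5 * k * x * x)
    return sum(energies) / len(energies)
```

Since the loop over replicas has no dependencies between iterations, the averaging step is the only point where results are combined.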

MD06 - GPU Accelerated Molecular Dynamics 

Enabling Transformative Drug Development 

One powerful computational technique for the 

science of drug development has been the use of 

molecular dynamics (MD) simulations. MD 

simulations can explore the interactions between 

small molecule drugs and membrane-bound 

proteins on an atomic level. It is now possible to 

understand the biological function of drug targets 

through their structural motions. GPU computing

is revolutionizing the field of MD, with GPU 

accelerated MD code competing with national 

supercomputers. Our research goal is to use GPU 

technology to not only improve MD performance, 

but to improve MD development and workflow for 

drug development. 

Contact: Benjamin Madej (University of California 

San Diego, San Diego Supercomputer Center) 

NEUROSCIENCE 

NS01 - Realtime Cerebellum: Realtime 

Simulation of a Realistic Cerebellar Model 

Realtime computing is a natural requirement for

realtime signal processing and control. The

cerebellum plays an essential role in motor 

learning and control. Once we build a cerebellar 

model running in realtime, the model could be 

used as a neural controller of hardware such as 

robots. We built a large-scale spiking network 

model of the cerebellum composed of more than 

100,000 neurons that runs in realtime. We 

succeeded in controlling a humanoid robot to hit a 

ball thrown by a pitching machine through online 

learning of the proper timing to swing a bat. 

Contact: Tadashi Yamazaki (RIKEN Brain 

Science Institute) 

NS02 - Computational Modeling of Human Head 

Electromagnetics Using GPUs 

This poster presents a computational environment 

ACSON that leverages GPU technology to 

accelerate the solution of the EEG forward problem, 

which is necessary to solve the neuroimaging 

inverse problem. Two finite difference algorithms, 

ADI and VAI, for solving the Poisson equation are 

presented. The ADI algorithm can only handle 

isotropic conductivities of the head tissue while VAI 

can handle anisotropic conductivities as well. Their 

performance on different GPUs is evaluated and 

compared with an OpenMP implementation. 

Contact: Allen D. Malony (University of Chicago) 

PARALLEL PROGRAMMING LANGUAGES 

& COMPILERS 

PC01 - Automatic Mapping of Shared Memory 

Programs to GPU-Based Heterogeneous Systems 

Realizing the potential of GPU-based 

heterogeneous systems is challenging due to the 

complexity of programming. We have developed a 

compiler-based approach to automatically generate 

optimised OpenCL code from shared memory 

OpenMP programs. A key feature of our scheme is 

that it leverages existing transformations, especially 

data transformations, to improve performance on 

GPU architectures. As not all programs are suitable 

for GPU execution, it uses predictive modeling to 

automatically determine if it is worthwhile running 

the OpenCL code on the GPU or OpenMP code on 

the multi-core host. 

Contact: Dominik Grewe (University of Edinburgh) 

PC02 - GKLEE: Practical Concolic Verification 

and Test Generation for GPUs 

We provide a new framework called GKLEE that can 

analyze C++ GPU programs, locating the important 

correctness and performance bugs. For these 

programs, GKLEE can also automatically generate 

tests that provide high coverage, and these tests 

can later be run on the hardware to cross-check 

results. It helps pin-point memory accesses and 

execution steps that cause performance 

degradation. It also provides a versatile user 

interface. GKLEE has detected bugs and issues in 

many CUDA SDK kernels, and also has been able to 

handle non-trivial multi-kernel examples. 

Contact: Peng Li (School of Computing, University 

of Utah) 

PC03 - GPU Ocelot: Dynamic Compilation for PTX 

GPU Ocelot is an open-source dynamic JIT 

compilation framework for GPU compute 

applications targeting a range of GPU and non-GPU 

execution targets. Ocelot supports CUDA 

applications and provides an implementation of the 

CUDA Runtime API enabling seamless integration 

with existing CUDA applications. Its JIT compiler 

supports four backend execution targets - (1) an 

emulator that implements NVIDIA’s Parallel Thread 

Execution (PTX) instruction set architecture, (2) 

NVIDIA GPUs, (3) AMD GPUs, and (4) a translator to 

LLVM for efficient parallel execution of GPU kernels 

on multicore CPUs. Existing CUDA applications are 

seamlessly supported. 

Contact: Andrew Kerr (Georgia Institute 


PC04 - Legion: Expressing Locality and 

Independence with Logical Regions 

Modern parallel architectures have both 

heterogeneous processors and deep, complex 

memory hierarchies. We present Legion, a 

programming model and runtime system for 

programming these machines. Legion is 

organized around logical regions, which express 

both locality and independence of program data. 

Legion also enables explicit, programmer 

controlled movement of data through the memory 

hierarchy and placement of tasks based on locality 

information via a novel mapping interface. 

Running on a 4 node cluster with 8 total GPUs and 

4 levels of memory hierarchy, our implementation 

of Legion achieves a 5.9X speedup over a single 

CPU-GPU node on real-world applications. 

Contact: Michael Bauer (Stanford University) 

PC05 - Compilation Techniques for Demand- 

Driven Execution on Heterogeneous 

Architectures 

In order to leverage massive parallelism, there has 

been a resurgence of demand-driven programming 

models. The goal of this work is to develop 

compilation techniques and language extensions 

for existing imperative parallel programming 

languages that will then be mapped onto 

heterogeneous parallel architectures. In particular, 


this work addresses the following topics: automatic 

generation of task-graphs from explicitly parallel 

loops, programming language extensions to 

provide the ordering constraints between sections 

of code, and the mapping of data and computation 

onto massively parallel architectures. 

Contact: Albert Sidelnik (Globo Network) 

PC06 - DL: A Data Layout Transformation System 

for Heterogeneous Computing 

DL combines two techniques. First, it introduces a novel approach to laying 

out arrays of aggregate types across GPU and CPU 

architectures to further improve memory 

parallelism and kernel performance beyond what 

is achieved by human programmers using discrete 

arrays today. Our proposed new layout can be 

derived in situ from the traditional Array of 

Structures, Structure of Arrays, and adjacent 

Discrete Arrays layouts used by programmers. 

Second, DL has novel in-place layout conversion 

algorithms implemented as part of a run-time 

library for OpenCL that transparently converts 

data to accommodate application components 

that have different data layout requirements. 

Contact: I-Jui Sung (University of Illinois at 


RAY TRACING 

RT01 - Searching for Cold Trapped Resources in 

the Lunar Regolith 

Our poster describes a ray tracing technique 

applied to the latest digital elevation models of the 

Moon in an effort to find permanent shadows 

where water ice may be cold trapped. Some of the 

shadows we found are characterized with surface 

temperature measurements from the Diviner 

mid-infrared radiometer on the Lunar 

Reconnaissance Orbiter. 

Contact: Andy McGovern (Irish Centre for High-End 

Computing (ICHEC)) 

SUPERCOMPUTING 

SC01 - Multi-GPU Computing 

Our poster details several projects that make 

multi-GPU computing easy. It presents our work on 

a callback method for GPUs (presented at 

UCHPC 2010), a message-passing interface for GPUs 

(IPDPS 2009), a heterogeneous computational resource 

scheduler (EG 2009), and a multi-GPU 

MapReduce implementation (IPDPS 2011). 

Contact: Jeffery Stuart (UC Davis) 

SC02 - Automatic Generation of FFT Libraries 

for GPUs 

In this poster we present an extension of the 

Spiral code generation system to GPUs. We 

address the key problems of GPU memory 

hierarchy and parallelism, and we introduce a 

variety of FFT algorithms which avoid shared 

memory bank conflicts without wasting space 

on padding and optimize global memory 

bandwidth transfer with minimum register 

allocation even at low occupancy. We demonstrate 

high performance results against cuFFT 1-D and 

2-D DFTs for single precision. This research is still 

in progress, but at the moment we are able to 

match and beat the cuFFT library on the sizes for 

which we have generated optimized code. 

Contact: Christos Angelopoulos (Carnegie 

Mellon University) 

SC03 - Computational and Simulation Sciences: 

Applications of Heterogeneous Computing 

As the size and complexity of scientific problems 

grow, scientists from a broad range of discipline 

areas are relying more on computational methods 

and simulations to help solve their problems. This 

work presents a summary of heterogeneous 

algorithms and applications that have been 

developed by CSIRO for solving practical and 

challenging science problems faster than is 

possible with conventional multi-core CPUs alone. 

The problem domains include: CFD, imaging and 

visualization, advanced materials modeling, 

computational biology, geosciences and climate 

research. The algorithms utilize NVIDIA GPUs 

and multi-core CPUs on a scale ranging from 

single workstation installations through to large 

GPU clusters. 

Contact: Tomasz Bednarz (CSIRO) 

SC04 - 75-Round SHA-1 Collision Search Using 

GPU Clusters 

SHA-1 is one of the most widely used 

cryptographic hash functions. We ported the method of 

characteristics for collision search for SHA-1 to 

GPU clusters. Using it, we found a collision for a 

75-round version of SHA-1, which is currently the 

world record. 

Contact: Andrew Adinetz (Lomonosov Moscow 


SC05 - GPU Clusters for Large-Scale Analysis of 

X-Ray Scattering Data 

X-ray scattering is a valuable tool for measuring 

the structural properties of materials used in the 

design and fabrication of energy-relevant 

nanodevices. A primary challenge here is in the 

analysis of data due to its generation rate and 

size. We are developing novel HPC algorithms and 

codes for such analyses. Here we present two 

advances using GPUs. First, a flexible Grazing Incidence 

Small Angle Scattering simulation code. This code 

can compute the scattered light intensity from any 

given sample in all directions of space. Second, an 

efficient inverse modeling code for structural 

fitting problems using the Reverse Monte Carlo (RMC) 

simulation algorithm. 

Contact: Abhinav Sarje (Wayne State University)

VISUALIZATION 

VZ01 - CNC Tool Path Planning and Machining 

Simulation on GPU 

Today, a major part of low-volume manufacturing 

cost involving CNC machining is the cost of tool 

path planning performed by an engineer. The goal 

of this research is to develop an automatic CNC 

machine tool path planning and simulation 

system. In order to achieve a reasonable 

performance, we are using a GPGPU approach for 

geometry processing and propose to develop a 

new solid geometry representation especially 

designed for parallel processing and GPGPU 

which will become a base for a new automatic tool 

path planning system and will also significantly 

increase speed and accuracy of a machining 

process simulation. 

Contact: Dmytro Konobrytskyi (Clemson University) 

VZ02 - GPU-Accelerated Power System 

State Visualization 

Modern energy management systems aim to 

provide situational awareness to grid operators 

using a variety of tools. Advances in technology 

such as high-frequency data from phasor 

measurement units distributed across the system 

support the display and analysis of the dynamic 

state of the power grid. Scattered data interpolation 

is a computationally intensive problem that benefits 

massively from parallel implementations on GPUs. 

This poster presents a highly optimized network 

state visualization system that fully exploits 

programmable graphics hardware and delivers 

three orders of magnitude performance 

improvements while offering extra features 

compared to a traditional, CPU-based approach. 

Contact: Martin Naef (NVIDIA) 

VZ03 - Image Treatment Implementing Extended 

Depth of Field with NVIDIA CUDA 

Extended depth of field (EDF) is a specific method 

used to analyze and treat specific image zones in 

optical research. Due to the complexity of the EDF 

and the large volume of data processed in optics 

problems, EDF is a good candidate to process in 

parallel architectures. This work is an 

implementation of parallel-extended depth of field 

using NVIDIA CUDA. We propose a solution 

algorithm aimed at a multicomputer cluster and 

shared memory, represented by a hybrid parallel 

machine based on NVIDIA GPUs. Moreover, a 

performance evaluation in terms of execution 

time is presented, followed by a discussion of 

this approach. 

Contact: Mónica Liliana Hernández Ariza 

(Universidad Industrial de Santander) 

VZ04 - Diderot: A Parallel DSL for Image 

Analysis and Visualization 

The analysis of structure in three-dimensional 

images is increasingly important for biomedical 

research and computational science. In this 

poster, we outline ongoing work developing 

Diderot, a parallel domain-specific language for 

three-dimensional image visualization and 

analysis algorithms, such as volume rendering, 

fiber tractography, and particle systems. Diderot 

supports a high-level mathematical computation 

model coupled with a batch-synchronous 

parallelism model. The poster further describes 

Diderot’s GPU implementation and its high 

performance measurements on GPUs versus 

other sequential and parallel platforms. 

Contact: Lamont Samuels (Lawrence Berkeley 

National Laboratory) 




GTC 2012 

SPEAKERS & PANELISTS 

Alexey Abramov 

PhD Student (University of Gottingen) 

Alexey Abramov received the M.Sc. degree in Computer 

Science from the Moscow Engineering and Physics 

Institute (State University), Moscow, Russia. Currently he 

is a PhD student at the Georg-August University, 

Goettingen, Germany. His research interests include 

image processing, image segmentation and object 

tracking, stereo image processing and real-time 

computer vision with high-performance computing on 

parallel hardware. 

Session(s): S0075 - Oculus Real-Time Modular 

Cognitive Vision System (Tuesday, 15:00, Room: A1) 

Robert Alexander 

CUDA Tools Software Engineer (NVIDIA) 

Robert Alexander is a software engineer on the NVIDIA 

Tesla Platform Software team. His focus is on 

management, monitoring and diagnostics of GPUs in a 

cluster environment. His work includes the NVIDIA 

Management Library (NVML), the NVIDIA System 

Management Interface (nvidia-smi), and he is 

responsible for the Perl and Python NVML bindings. 

Robert has a BS in Computer Science from the 

Rochester Institute of Technology. 

Session(s): S0238 - Tesla Cluster Monitoring & 

Management APIs (Thursday, 09:30, Room: K) 

Alina Alt 

Applied Engineer (NVIDIA) 

Alina Alt is an Applied Engineer at NVIDIA where her 

responsibilities include helping users incorporate 

NVIDIA’s GPUs, video products and video related driver 

features into their solutions and applications. Her past 

experience includes developing augmented reality 

applications for live sports telecasts and developing a 

scalable, CPU-based cluster graphics driver. 

Session(s): S0601 - GPU-Based Video Processing 

Round Table (Monday, 14:30, Room: A2) 

S0049 - Using the GPU Direct for Video API 

(Tuesday, 15:00, Room: J2) 

S0267A - Mixing Graphics and Compute with 

Multiple GPUs (Tuesday, 17:00, Room: J2) 

S0326 - Next Generation InfoWall 

(Thursday, 09:00, Room: A1) 

S0267B - Mixing Graphics and Compute 

with Multiple GPUs (Thursday, 15:30, Room: L) 

Minesh B. Amin 

Founder / CEO (MBA Sciences) 

Dr. Minesh B. Amin is Founder & CEO of MBA Sciences, 

Inc. MBA Sciences enables engineers and scientists to 

rapidly prototype, analyze and deploy robust parallel 

solutions across heterogeneous computing resources 

spanning servers, cores and GPUs from either data 

centers or public clouds. Previously he worked at 

Synopsys, Inc., where he helped prototype, implement 

and deploy several parallel versions of existing serial 

products including the TetraMax TenX ATPG product and 

PrimeTime DMSA. Dr. Amin received his PhD from the 

University of Minnesota. 

Session(s): S0299 - Exploiting Fault Tolerant 

Heterogeneous Parallelism with SPM.Python 

(Wednesday, 16:00, Room: C) 

Joshua Anderson 

Research Area Specialist (University of Michigan) 

Joshua Anderson is a Research Area Specialist in the 

Laboratory for Computational Nanoscience & Soft 

Matter Simulation at the University of Michigan. Dr. 

Anderson holds a Ph.D. degree in Condensed Matter 

Physics from Iowa State University and is the lead 

developer of HOOMD-blue, a high performance particle 

simulation tool. His current research interests include 

GPU computing, polymer physics, and nanoparticle 

self-assembly. 

Session(s): S0058 - Advancing GPU Molecular 

Dynamics: Rigid Bodies in HOOMD-blue 

(Wednesday, 10:00, Room: N) 

Roberto Ansaloni 

(Cray Italy) 

Biography unavailable at press time. 

Session(s): S0286 – Scaling Applications to a 

Thousand GPUs and Beyond 

(Wednesday, 16:00, Room: A2) 

Santosh Ansumali 

(Faculty Fellow, Engineering Mechanics Unit, JNCASR, 

Bangalore) 

Dr. Ansumali is a faculty member at EMU, JNCASR, and has 

held a Ramanujan Fellowship from DST India since July 

2009. Prior to this, he was an assistant professor at NTU, 

Singapore, from August 2005. He received his PhD from 

ETH Zurich (Switzerland) on mesoscale simulation 

methods. His research area is mesoscale simulation 

methods and high performance computing based on 

kinetic theory. 

Session(s): S0428 – Panini: A GPU Aware Array 

Class (Thursday, 16:00, Room: B) 

Takayuki Aoki 

Professor (Tokyo Institute of Technology) 

Takayuki Aoki received a Dr. Sci (1989) from Tokyo 

Institute of Technology, was a visiting researcher in the 

Max-Planck Institute in Germany for one year, and has been 

a professor at Tokyo Institute of Technology since 2001. 

He has received the Computational Mechanics 

Achievement Award from Japan Society of Mechanical 

Engineers and many awards and honors in visualization. 

He is also the vice president of the Japan Association for 

Computational Mechanics. He authored the first 

book in the Japanese language on CUDA 

programming and applications. His research covers 

numerical schemes for CFD, numerical weather models, 

HPC applications on graphics processors, multi-phase 

flows, and simulation of natural disasters. 

Session(s): S0412 - A 2-Petaflops Stencil 

Application with Stereoscopic 3D Visualization - 

Gordon Bell Prize 2011 (Tuesday, 14:00, Room: A1) 

Jeremy Appleyard 

Analyst (Polyhedron Software Ltd) 


Session(s): S0432 – New Ideas for Massively 

Parallel Preconditioners 


John Appleyard 

Managing Director (Polyhedron Software Ltd) 

BA, MA and PhD from Cambridge University. One of the 

original developers of the Eclipse oil reservoir 

Simulator and an MD of Polyhedron Software Ltd. 

Session(s): S0432 - New Ideas for Massively 

Parallel Preconditioners 



Arutyun Avetisyan 

Deputy Director (Institute for System Programming, 

Russian Academy of Sciences) 

Arutyun Avetisyan is Deputy Director of the Institute for 

System Programming of the Russian Academy of 

Sciences (ISP RAS). His research interests are in the 

areas of compiler technologies, HPC and Cloud 

computing. He leads several projects, including 

research on compiler support for heterogeneous 

systems. He represents RAS on the Steering Committee of the 

Open Cirrus Community – the global cloud computing 

testbed for research projects. He is PI of the National 

“University Cluster” program, including in particular the 

technology platform (unihub.ru), which makes it possible 

to create a wide range of services within a single 

infrastructure, e.g. subject-specific web-labs. 

Session(s): S0115 – Specialized Sparse Matrix 

Formats and SpMV Kernel Tuning for GPUs 

(Wednesday, 10:30, Marriott Ballroom 3) 

Brendan Babb 

Student/Research Technician (University of 

Alaska Anchorage) 

Brendan Babb has over 20 years of experience as a 

software programmer and analyst in the engineering 

and telecommunications industries with a background in 

Mathematics. He holds three patents in error detection 

and correction and his current interests are in 

Evolutionary Computation, Biomimicry, GPGPU and their 

collective application to optimizing renewable energy 

solutions. Since 2005 he has used evolutionary 

computation to evolve wavelet-like transforms that 

improve image compression for photo, fingerprint, 

satellite, CT scans, Ultrasound and Mars Rover images. 

Session(s): S0133 - Improving Mars Rover Image 

Compression Via GPUs And Genetic Algorithms 


Ronald Babich 

Research Scientist (NVIDIA) 

Ron Babich is a Research Scientist at NVIDIA, where he 

works at the intersection of algorithms and architecture, 

with a particular focus on high-performance computing. He 

was previously a postdoctoral fellow in Boston University’s 

Center for Computational Science and received his PhD in 

Physics from Boston University in 2009. 

Session(s): S0368 - Unraveling the Mysteries of 

Quarks with Hundreds of GPUs 

(Thursday, 15:00, Room: K) 

Philip A. Beasley-Harling 

(Bank of America Merrill Lynch) 


Session(s): S0656 - kdb+ and GPUs for Market Data 

Analytics and Trading (Wednesday, 17:30, Room: L) 

Dan Bailey 

R&D (Double Negative) 

Dan Bailey is working in Research and Development at 

Double Negative, where he is driving the adoption of the 

GPU and increased parallelism in general. His primary 

focus is the proprietary fluid solver, where a strong 

educational background in Computer Science has 

complemented an interest in fluid simulation. His 

research concentrates on languages and parallel 

compiler technology, but with a strong leaning towards 

its use in production. 

Session(s): S0300 - Jet: A Domain-Specific 

Approach to Parallelism for Film Fluid Simulation 

(Tuesday, 10:00, Room: A2) 

Tim Bajarin 

President (Creative Strategies) 

Tim Bajarin is recognized as one of the leading industry 

consultants, analysts and futurists, covering the field of 

personal computers and consumer technology. Mr. 

Bajarin has been with Creative Strategies since 1981 and 

has served as a consultant to most of the leading 

hardware and software vendors in the industry including 

IBM, Apple, Xerox, Hewlett Packard/Compaq, Dell, AT&T, 

Microsoft, Polaroid, Lotus, Epson, Toshiba and 

numerous others. His articles and/or analysis have 

appeared in USA Today, Wall Street Journal, The New 

York Times, Time and Newsweek magazines, 

BusinessWeek and most of the leading business and 

trade publications. He has appeared as a business 

analyst commenting on the computer industry on all of 

the major television networks and was a frequent guest 

on PBS’ The Computer Chronicles. Mr. Bajarin has been 

a columnist for US computer industry publications such 

as PC Week and Computer Reseller News and wrote for 

ABCNEWS.COM for two years and Mobile Computing for 

10 years. His columns currently appear in Asia 

Computer Weekly, Personal Computer World (UK), and 

Microscope (UK) as well as Mobile Enterprise Magazine. 

His various columns and analyses are syndicated in over 

30 countries. 

Session(s): S2003 – Emerging Companies Summit 

Fireside Chat with Jen-Hsun Huang (CEO, 

President and Co-Founder, NVIDIA) and Tim 

Bajarin (President, Creative Strategies) 


Zack Baker 

(Los Alamos National Laboratory) 


Session(s): S0702 - Los Alamos AHPC Symposium, 

The Architecture of Acceleration in HPC 

(Wednesday, 15:30, Room: J1) 

Robert Balgley 

CEO (Mersive) 

Over the past 20 years Balgley has worked as CEO of 

several category-defining companies funded by some of 

the most successful venture capital firms in the world. 

Prior to Mersive, he was CEO of SkyeTek, the worldwide 

market share leader in embedded RFID readers and 

technology. Prior to that, Balgley was CEO of Jabber, the 

pioneer and leader in enterprise instant messaging, 

which was later acquired by Cisco Systems. Before 

Jabber, he was CEO of Mobile Logic, an early market 

leader of mobile data networking software which was 

acquired in 2000. Earlier in his career, Balgley held 

executive positions in sales and marketing at GE, 3Com, 

Hughes Aircraft and Case Communications. 

Session(s): S2005 – Emerging Companies Summit: 

CEO on Stage Featuring RealView Imaging, 

Elemental Technologies, and Mersive 


Bill Barth 

Director of High Performance Computing (Texas Advanced 

Computing Center, University of Texas at Austin) 

Bill Barth is the Director of High Performance 

Computing at the Texas Advanced Computing Center 

where he oversees the use of TACC’s large-scale HPC 

resources by a diverse international community of 

scientists and researchers. Dr. Barth received his PhD 

from the Aerospace Engineering Department of The 

University of Texas in 2004 where he worked on finite 

element methods for incompressible flow and transport 

problems. His current interests include network topology 

aware job scheduling and MPI communication, 

physics-based flow visualization, software tools for

large-scale clusters, and the design and deployment of 

leadership-class supercomputers. 

Session(s): Los Alamos AHPC Symposium, 

Stampede System Architecture and Early 

Accelerator Programming Experiences 


Francesco Basile 

Software Engineer (MBI srl) 

Basile obtained his joint PhD in Mathematical Physics at the 

University of Pisa / Brunel University London in 2008. 

Since 2008 he has devoted his strong mathematical 

background to the analysis of digital radio signal processing. 

Session(s): S0065 – Satellite HUB Communication 

System GPU Based (Thursday, 16:30, Room: M) 

Bela Bauer 

Postdoc (Microsoft Research) 


Session(s): S0039 – Data-Driven GPGPU Ideology 

Extension (Thursday, 10:00, Marriott Ballroom 3) 

Janusz Bedkowski 

Researcher 

Janusz has been a researcher in the area of mobile robotics 

(navigation, 3D modeling and simulation) since 2006. He 

works in cooperation with the following institutions: 

Warsaw University of Technology, Faculty of Mechatronics 

(education), Industrial Research Institute for 

Automation and Robotics (researcher, mobile robot 

design and programming), Institute of Mathematical 

Machines (researcher, simulation and modeling using 

parallel computing). 

Session(s): S0081 - Parallel Computing In Mobile 

Robotics for RISE (Thursday, 09:30, Room: A3) 

Nathan Bell 

Senior Research Scientist (NVIDIA) 

Nathan Bell joined NVIDIA Research in August 2008. His 

current research interests include sparse linear algebra 

and programming models for parallel computing. 

Nathan contributes to several open source projects 

including Thrust, a high-level parallel template library, 

Cusp, a library for sparse linear algebra and graph 

algorithms, and PyAMG, a library of algebraic multigrid 

methods in Python. Nathan received a bachelor’s degree 

in Computer Science from Georgia Tech and a Ph.D. in 

Computer Science from the University of Illinois at 

Urbana-Champaign (UIUC). 

Session(s): S0602 - An Introduction to the 

Thrust Parallel Algorithms Library 


Tomer Ben-David 

Co-Founder and Vice President, R&D (Rocketick) 

Tomer co-founded Rocketick in 2008 and has since 

served the company as VP of R&D. Tomer 

brings 15 years of experience in management and 

engineering of software and hardware products. He 

previously worked at Intel Corporation, Siliquent 

(acquired by Broadcom) and Mellanox. Tomer holds a 

B.Sc. (Cum Laude) in Computer Engineering from the 

Technion – Israel Institute of Technology, and an 

Executive MBA from Recanati School of Business, 

Tel-Aviv University. 

Session(s): S0520 - Using GPUs to Speedup Chip 

Verification (Tuesday, 10:00, Room: J3) 

S2004 – Emerging Companies Summit: CEO on 

Stage Featuring Raytrix, Rocketick, and Ubitus 


Thomas Benson 

Research Engineer II (Georgia Tech Research Institute) 

Thomas Benson is a Research Engineer with Georgia 

Tech Research Institute, where his research focus and 

interests include high-performance computing, 

high-performance embedded computing, radar signal 

processing and medical imaging, heterogeneous 

computing, and programming models related to such 

systems. He holds a Ph.D. in Computer Science from the 

University of Tennessee, Knoxville, and has nearly five 

years of post-graduate industrial research experience 

with GE Global Research in the field of medical imaging, 

specifically image reconstruction and related algorithms 

for X-ray computed tomography (CT). His experience 

includes developing large-scale real-time processing 

implementations for several of his fields of research. 

Session(s): S0316 - Using GPUs to Accelerate 

Synthetic Aperture Sonar Imaging via 

Backpropagation (Tuesday, 15:30, Room: J3) 

Mike Bernhardt 

(The Exascale Report) 

Mike Bernhardt is a well-respected strategic marketing, 

communications, media relations and electronic 

publishing consultant with 25 years of experience 

serving the HPC community. Bernhardt founded The 

Exascale Report in 2010 to serve as the voice of the 

emerging exascale community. Today, the subscription-based 

Exascale Report is a widely read publication from 

which articles and extracts have been presented to 

numerous governmental bodies to help drive funding 

and political commitment discussions on a global scale. 

As an independent consultant, Bernhardt has worked 

with dozens of companies throughout the global HPC 

ecosystem on branding, marketing, strategic 

communications and public speaking programs. 

Bernhardt is a former Intel marketing executive and 

currently serves as a consultant or Board-level advisor 

to a number of privately held organizations. 

Session(s): S0531 - Exascaling Your Apps 


James Beyer 

Software Engineer (Cray Inc) 

James Beyer received his Ph.D. from the University of 

Minnesota. He has been a member of the Cray 

Programming Environment Optimization team for more 

than 12 years. He has represented Cray on the OpenMP 

language committee and ARB since Cray rejoined the 

organization. He led the effort to redesign the Cray 

OpenMP implementation to improve optimizer 

integration. He authored the original OpenMP for 

Accelerators, OpenMP4ACC, proposal and co-chairs the 

OpenMP language subcommittee on Accelerators. 

James was the primary Cray representative during the 

design of the OpenACC specification. He is currently 

actively involved in the Cray implementations of OpenMP, 

OpenACC and OpenMP4ACC. 

Session(s): S0089 - Accelerator Directives, 

OpenACC and OpenMP4ACC 


Johanna Beyer 

Postdoctoral Fellow (King Abdullah University of Science 

and Technology) 

Johanna Beyer is a postdoctoral fellow at the Geometric 

Modeling and Scientific Visualization Center at King 

Abdullah University of Science and Technology (KAUST), 

Saudi Arabia. She holds an M.Sc. in medical software 

engineering (2004, University of Applied Sciences 

Hagenberg, Austria) and a Ph.D. in computer science 

(2009, University of Technology Vienna, Austria). Her 

research focuses on GPU-based volume rendering 

techniques for medical and neuroscience applications, 



with emphasis on visualization of large and multi-modal 

data. She regularly publishes at IEEE TVCG/IEEE 

Visualization. 

Session(s): S0202 - Terascale Volume Visualization 

in Neuroscience (Wednesday, 16:30, Room: A8) 

Tim Bi 

Graduate Research Analyst (Johns Hopkins University / 

George Mason University) 

Tim Bi is a Bioinformatics Ph.D. candidate at George 

Mason University currently working as a GRA for Dr. 

Saleet Jafri and contributing to the efforts of improving 

the GPU program for Calcium Induced Calcium Release. 

He is also working for Dr. Diane Becker at the Johns 

Hopkins School of Medicine contributing to the GWAS 

studies being conducted at the GeneSTAR lab. 

Session(s): S0272 - GPU GWAS - CUDA Based 

Genome Wide Association Studies 

(Wednesday, 10:30, Room: B) 

James Bigler 

Sr. Software Engineer (NVIDIA) 

James Bigler is currently working for NVIDIA as a Sr. 

Software Engineer developing OptiX, a GPU accelerated 

ray tracing framework. His work with ray tracing dates 

back to 2000 at the University of Utah where he worked 

under Dr. Steven Parker researching and developing 

parallel ray tracing applications for rendering and 

scientific visualization. Since coming to NVIDIA in 2008, 

James has strived to bring more ray tracing 

awesomeness to everyone through OptiX. James holds a 

B.S. and M.S. in Computer Science from the University of 

Utah. 

h Session(s): S0366 - OptiX Out-of-Core and CPU 

Rendering (Tuesday, 15:30, Room: J1) 

Sam Blackman 

CEO and Co-Founder (Elemental Technologies) 

Sam Blackman co-founded Elemental Technologies in 

2006 and has grown the company into a leading supplier 

of video solutions for multiscreen content delivery. Prior 

to co-founding Elemental, Sam designed integrated 

circuit products for Pixelworks. He has also held 

engineering positions at Silicon Graphics and Intel 

Corporation. Sam holds an M.B.A. from the University of 

Oregon, an M.S. in electrical engineering from the University 

of California at Berkeley and a B.S. in electrical 

engineering from Brown University. 





Aaron Blasius 

Sr. Product Manager (VMware) 


h Session(s): S0359 - VMware and NVIDIA: Delivering 

3D Workstations from the Cloud 


François Bodin 

Chief Technology Officer (CTO) (CAPS enterprise) 

As chief scientist, François Bodin plans, advises and 

advocates the research and development projects which 

led to the creation of innovative software tools. François 

carries on with his research activities at the IRISA lab, 

which focus on code optimization and compiler 

technologies for high performance computers and 

embedded systems. François is a member of HIPEAC, the 

European Network of Excellence on High-Performance 

Embedded Architecture and Compilation. François has 

degrees in computer science from the University of 

Rennes I. François Bodin is also Chairman of IRISA 

Rennes, a research unit in the forefront of information 

and communication science and technology. 

h Session(s): S0630 Part 1 of 2: Programming 

Heterogeneous Many-cores Using Directives 

(Presented by CAPS) (Monday, 13:00, Room: A8) 

h S0631 Part 2 of 2: Programming Heterogeneous 

Many-cores Using Directives (Presented by CAPS) 

(Monday, 14:30, Room: A8) 

h S0635 - How to Bake Portable Many-Core 

Programs (Wednesday, 15:00, Room: M) 

Robert Boehme 

Team Lead & CEO (Part-Time Scientists) 

Robert Boehme is Team Lead and CEO of Part-Time 

Scientists. The Part-Time Scientists Team consists of 

100 international engineers and scientists working in 

their free time on the first private mission to the moon. 

Over the past two years they managed to get the full 

technical development kick-started with a lot of 

prototypes and technology taken from the industry back 

into space. With five prototype lines, 50 business 

partnerships, several cooperations and many hours of 

testing, the team is amongst the leading competitors for 

the 30 million dollar Google Lunar X-PRIZE competition. 

h Session(s): S3002 – Day 3 Keynote: Not Your 

Grandfather’s Moon Landing 

(Thursday, 11:00, Keynote Hall) 

Taisuke Boku 

Deputy Director of Center for Computational Sciences at 

University of Tsukuba (University of Tsukuba) 


h Session(s): S0618 – Best Practices of a 800TFlop 

Hybrid Supercomputer Implementation 

(Tuesday, 09:30, Room: M) 

Nikola Bozinovic 

CTO (MotionDSP) 

Nikola Bozinovic is Chief Technology Officer at 

MotionDSP where he leads all technical efforts and 

oversees product development. As the company’s key 

technologist, he leverages his expertise in signal 

processing, image and video analysis, and video 

compression to provide people and organizations around 

the world with groundbreaking video technology. Prior to 

establishing MotionDSP’s engineering department, 

Nikola was a senior software engineer at Veodia, a 

video streaming and distribution company, and a 

research scientist at Microsoft. Nikola holds M.S. and 

Ph.D. degrees from Boston University, where he was a 

Dean’s Fellow. 

h Session(s): S0527 - GPUs and the Next-Generation 

Aerial Surveillance (Tuesday, 09:00, Room: J2) 

Wil Braithwaite 

Senior Applied Engineer (NVIDIA) 

Wil Braithwaite has worked for 15 years in VisualFX at 

studios in London and Los Angeles, including 

FrameStore, MPC, and the Jim Henson Company. 

His positions have included technical direction, compositing, 

CG supervision, and mocap supervision. He has 

pioneered the use of graphics hardware in the VFX 

workflow, which led to his role at NVIDIA as a Senior 

Applied Engineer for VFX, where he specializes in 

consulting, training and assisting development for studio 

projects utilizing NVIDIA technologies. 

h Session(s): S0364 - Interacting with Huge 

Particle Simulations in Maya with the GPU 

(Tuesday, 14:00, Room: J1)

Thomas Brandes 

Senior Scientist (Fraunhofer Scientific Computing 

Institute (FhG-SCAI)) 

Thomas Brandes received his PhD in Applied 

Mathematics in 1988 from the University in Marburg. He 

joined Fraunhofer’s Scientific Computing Institute 

(FhG-SCAI) in 1989. He is working as a senior scientist 

on the design, parallelization and optimization of 

scientific applications for all kinds of parallel 

architectures. His research interests are centered 

around parallelization tools, cache optimization, GPU 

programming and object-oriented design of parallel 

software. 


Efficient AMG on Hybrid GPU Clusters 


Vincent Brisebois 

Visual Computing Product Manager (Fusion-io) 

As Visual Computing Product Manager at Fusion-io, 

Vincent Brisebois works closely with entertainment 

production studios on implementing solutions that 

facilitate new levels of creativity, productivity and 

worldwide collaboration. Vincent has designed 

technology solutions for 2D and 3D production in the 

visual effects, video game and design industries for over 

15 years. 

h Session(s): S0619 - Hate to Wait? Flash Memory 

for Full-Throttle GPU Acceleration 

(Thursday, 09:00, Room: L) 

John Brown 

Principal Engineer (Hewlett-Packard) 

John is a Principal Engineer in Hewlett-Packard’s 

Workstation Graphics Research and Development, 

engineering graphics and workstation solutions since 

1984. He has contributed to a wide variety of HP 

products and projects for 24 years, ranging from HP’s 

SRX graphics processor, to HP’s SV6 Scalable 

Visualization solution, to HP’s latest family of high-end 

workstation platforms. 

h Session(s): S0633 – Learn about new Hewlett- 

Packard GPU Systems, Solutions, and Applications! 

(Wednesday, 10:00, Room: M) 

Kevin J. Brown 

Research Assistant (Stanford University) 


h Session(s): S0365 – Delite: A Framework for 

Implementing Heterogeneous Parallel DSLs 


Andreas Buhr 

Department Manager - Performance Optimization 

(CST AG) 

Andreas Buhr has worked on performance optimization at 

CST AG since 2009. He holds a bachelor’s degree in 

physics and a master’s degree in applied physics from 

the Technical University Darmstadt. He has been working with 

CUDA since its version 0.9. 

h Session(s): S0069 – GPU Computing Advances 

in 3D Electromagnetic Simulation 


Martin Burtscher 

Associate Professor (Texas State University) 

Martin Burtscher is Associate Professor in the 

Department of Computer Science at Texas State 

University. He received the combined BS/MS degree in 

computer science from the Swiss Federal Institute of 

Technology (ETH) Zurich in 1996 and the Ph.D. degree in 

computer science from the University of Colorado at 

Boulder in 2000. Martin’s research interests include 

efficient parallelization of programs for GPUs as well as 

automatic performance assessment and optimization of 

HPC applications. He is a senior member of the IEEE, its 

Computer Society, and the ACM. Martin has co-authored 

over 60 peer-reviewed publications, including a GPU 

Computing Gems chapter. 

h Session(s): S0111 - An Efficient CUDA 

Implementation of a Tree-Based N-Body Algorithm 

(Thursday, 15:30, Room: M) 

Michael Bussmann 

Junior Group Leader Computational Radiation Physics 

(Helmholtz-Zentrum Dresden-Rossendorf) 

Michael Bussmann is a member of the Laser Particle 

Acceleration Group at the Helmholtz-Zentrum Dresden- 

Rossendorf (HZDR). He leads the Junior Group on 

Computational Radiation Physics, looking for ways to 

create and optimize new sources of radiation using 

high-intensity lasers. His goal is to create low-cost, 

compact, laser-driven sources of ion, electron and X-ray 

beams that can be used to understand the properties of 

matter on the atomic scale. Besides his interest in 

fundamental physics Michael helps to make laser-driven 

ion beams available to cancer patients for ion beam 

treatment of tumors. With GPUs he has been able to 

simulate the generation of laser-driven particle beams 

in a new, much faster way. Since then, Michael has been used to 

thinking of computation speed in frames per second. 

h Session(s): S0067 - PIConGPU - Bringing Large-scale 

Laser Plasma Simulations to GPU 

Supercomputing (Tuesday, 15:00, Room: A8) 

h S0708- Los Alamos AHPC Symposium, 

Accelerated HPC Symposium: Applications - 

Methods and Programming Models, Part 1 

(Thursday, 9:00, Room: J3) 

Javier Cabezas 

PhD Student (Barcelona Supercomputing Center) 

Javier Cabezas received a bachelor’s degree in Computer 

Science and a master’s degree in Computer Architecture 

from Universitat Politècnica de Catalunya (UPC). Since 

2008, he has been a PhD student in the Computer Architecture 

Department at UPC. He has also worked at the Barcelona 

Supercomputing Center as a resident student since 2009. 

He has contributed to projects done in collaboration with 

companies like Hewlett-Packard, NXP and Repsol. His 

research is focused on operating system and run-time 

support for heterogeneous massively-parallel computing 

systems and massively-parallel accelerators. 

h Session(s): S0333 - GMAC-2: Easy and Efficient 

Programming for CUDA-Based Systems 

(Thursday, 09:00, Room: B) 

Tugkan Calapoglu 

Lead Graphics Software Developer (VIRES 

Simulationstechnologie GmbH) 

Tugkan Calapoglu is the lead graphics software 

developer at Vires GmbH, Germany, with more than 10 

years of experience in the visual simulation industry. He 

works on the design and development of 3D rendering 

software for real-time hardware-in-the-loop and 

human-in-the-loop simulation applications. 

h Session(s): S0319 – Advanced Driver 

Assistance System Testing using OptiX 

(Tuesday, 14:00, Room: N) 

D. Andrew Carr 

Director of Bioinformatics (Accelerated Technology 

Laboratories, Inc.) 

D. Andrew Carr, Ph.D. is the Director of Bioinformatics for 

Accelerated Technology Laboratories where he oversees 

the design and development of new high-throughput 


computational and database tools for use in human 

genomic scale analysis projects. Andrew received his 

Ph.D. in Computational Science 

Bioinformatics from George Mason University in 2006. 

After spending a year as a research assistant professor in 

Computational Materials Science Center and 

Nanotechnology at GMU, he took a postdoctoral position at 

University of North Carolina Charlotte, where he worked 

developing algorithms, databases and visualization 

tools for genomic microarray and sequence analysis. 

h Session(s): S0037 - SeqNFind: Application Of 

CUDA GPU Technologies To Sequence Alignment 

Techniques (Tuesday, 17:00, Room: K) 

Patrice Castonguay 

Emerging Applications Intern (NVIDIA) 

Patrice Castonguay is completing his Ph.D. in the 

Aeronautics and Astronautics department at Stanford 

University working under the supervision of Professor 

Antony Jameson at the Aerospace Computing Lab. His 

research focuses on unstructured high-order methods 

for fluid flow simulations and on the use of GPUs for 

algorithm developments in high performance 

computing. Recently, he worked in the Emerging 

Applications group at NVIDIA on the development of 

algebraic multigrid methods. 

h Session(s): S0332 - Efficient Graph Matching 

and Coloring on the GPU 


Bryan Catanzaro 


Bryan recently received his PhD from the University of 

California at Berkeley, where he researched compilation 

techniques for embedded data parallel languages. He 

then joined NVIDIA Research, where he focuses on 

developing the Copperhead runtime and compiler. 

h Session(s): S0525 - Copperhead: Data Parallel 

Python (Wednesday, 16:30, Room: A3) 

Ulises Cervantes-Pimentel 

Senior Kernel Developer (Wolfram Research) 

Ulises Cervantes-Pimentel has been Wolfram Research’s lead 

kernel developer in visualization, computational geometry 

and GPU development since 2001. Ulises is a graduate 

of the University of Illinois at Urbana-Champaign. 

h Session(s): S0430 – Developing Next-Generation 

CUDA Acceleration in Wolfram’s Mathematica with 

Parallel Nsight (Tuesday, 09:30, Room: B) 

h S0106 - GPU Based Numerical Methods in 

Mathematica (Thursday, 14:30, Room: L) 

Dominic Chandar 

Postdoctoral Research Associate (University of Wyoming) 

Dominic is a Postdoc at the University of Wyoming, and 

works on GPU acceleration for CFD codes. He has a PhD 

in Mechanical and Aerospace Engineering from Nanyang 

Technological University, Singapore, and a Masters in 

Aerospace Engineering from Indian Institute of Science, 

India. He has also held the position of a Scientist in the 

Defense Research and Development Organization, India. 

h Session(s): S0264 - CU++: An Object-Oriented 

Framework for Computational Fluid Dynamics 

(CFD) Applications (Thursday, 09:30, Room: A8) 

Jacqueline H. Chen 

Combustion Research Facility, Sandia National Laboratories 

Jacqueline H. Chen is a Distinguished Member of 

Technical Staff at the Combustion Research Facility at 

Sandia National Laboratories. She has contributed 

broadly to research in petascale direct numerical 

simulations (DNS) of turbulent combustion focusing on 

fundamental turbulence-chemistry interactions. These 

benchmark simulations provide fundamental insight into 

combustion processes and are used by the combustion 

modeling community to develop and validate turbulent 

combustion models for engineering CFD simulations. She 

is the Director of the Center for Exascale Simulation of 

Combustion in Turbulence (ExaCT), which, in collaboration 

with computer scientists and applied mathematicians, 

co-designs exascale DNS algorithms together with 

exascale computer architectures, including in-situ 

data mining and visualization. 

h Session(s): S0655 Direct Numerical Simulation of 

Turbulence-Chemistry Interactions: Fundamental 

Insights Towards Predictive Models 


Jeff Chien 

Principal Scientist (Adobe Systems) 


h Session(s): S0395 – GPU Enablement in Adobe 

Photoshop (Tuesday, 09:00, Room: A2) 

Suren Chilingaryan 

Researcher (Karlsruhe Institute of Technology) 

Suren Chilingaryan is a data processing expert at 

Institute for Data Processing and Electronics at 

Karlsruhe Institute of Technology. He graduated in 

mathematics from Moscow State University and was 

awarded a Ph.D. degree in Computer Science from 

Armenian National Academy of Sciences. He works on 

data acquisition and slow control systems for 

long-running scientific experiments. His current research 

focus is high performance data processing. 

h Session(s): S0259 - A High Performance 

Platform for Real-Time X-Ray Imaging 


Samuel Cho 

Assistant Professor (Wake Forest University) 

Sam graduated from the University of Maryland, Baltimore 

County with B.S. degrees in Biochemistry and Computer 

Science. He went on to receive a Ph.D. in Physical 

Chemistry at the University of California, San Diego. Since 

then, he performed post-doctoral research at the 

University of Maryland, College Park, where he was 

awarded the NIH (NRSA) Post-doctoral Fellowship. He has 

published his interdisciplinary computational biophysics 

research in protein and RNA dynamics, folding and 

assembly in over 15 papers in peer-reviewed journals, 

including four as first author in the high impact factor 

journal, Proceedings of the National Academy of Sciences. 

h Session(s): S0139 - GPU-Based Molecular 

Dynamics Simulations of Protein and RNA 

Assembly (Wednesday, 17:00, Room: N) 

Jike Chong 

Co-Director of CUDA Research Center 

(Carnegie Mellon University) 

Jike Chong is an adjunct professor at Carnegie Mellon 

Silicon Valley and directs the CUDA Teaching Center and 

the CUDA Research Center there. For the past 10 years, 

he has been working on multicore, manycore and 

parallel computing technologies at Carnegie Mellon 

University, Intel Research Labs, Sun Microsystems 

and the University of California, Berkeley. His research 

interests include speech recognition and analytics, 

quantitative financial analytics, and design patterns for 

parallel programming. Jike earned his Ph.D. from UC 

Berkeley and his M.S. and B.S. from Carnegie Mellon University. 

h Session(s): S0223 - Rapid Training of Acoustic 

Models Using GPUs (Tuesday, 15:00, Room: N)

Constantin Chuyeshov 

Algorithm Engineer (Cadence Design Systems) 

Constantin Chuyeshov is an Algorithm Engineer with 

Computational Lithography Solutions Group at Cadence 

Design Systems. He is focusing on computational 

lithography, image processing and high-performance 

computing. Constantin was born in 1979 in Kharkov, 

Ukraine. He got his BS degree in Mathematical Physics 

and Applied Mathematics from Karazin Kharkov National 

University (Ukraine) and MSc degree in Computational 

Mathematics from Stanford University. 

h Session(s): S0329 - Using GPUs to Speedup 

Computational Lithography 


Gilles Civario 

Senior Software Architect (ICHEC) 

Gilles Civario is a GPU software architect at ICHEC, PI of 

ICHEC’s NVIDIA CUDA Research Center, and an NVIDIA 

Certified CUDA Programmer. Gilles is involved directly or 

indirectly in all of ICHEC’s GPU-related projects. His 

involvement ranges from software and hardware 

architectural advice to code development and tuning, 

debugging and implementation. Gilles also regularly 

presents talks to explain GPU computing and its benefits, 

and runs NVIDIA certified CUDA training courses. His 

unique expertise in both hardware and software allows 

him to design and propose tailored solutions to address 

each user’s particular needs. Gilles is particularly involved 

in ICHEC’s technology transfer activities. 

h Session(s): S0034 - Real-Time Risk Simulation: 

The GPU Revolution In Profit Margin Analysis 

(Tuesday, 15:00, Room: L) 

Geoff Clark 

CEO (Acceleware Ltd.) 

Before joining Acceleware, Geoff was CFO of SQFive, a 

private oil and gas technology company, and of TSX-listed 

Guest-Tek Interactive Entertainment Ltd. While with 

Guest-Tek, Geoff was instrumental in completing two 

major acquisitions, a share buyback, and several private 

placements of debt and equity. Geoff was a co-founder of 

Revolve Magnetic Bearings Inc., a supplier of magnetic 

levitation systems. Geoff secured several rounds of 

financing for Revolve and was instrumental in Revolve’s 

eventual sale to Sweden’s SKF. Geoff holds an MBA 

degree from the University of Western Ontario, and a 

BSc in Electrical Engineering from the University of 

Calgary. 

h Session(s): S0433 - Accelerated FDTD Technique 

for Marine Controlled Source Electromagnetic 

Imaging (Wednesday, 15:30, Room: A7) 

Michael Clark 

Compute DevTech Engineer (NVIDIA) 

Dr. Clark’s background is in high energy physics, having 

completed his doctoral research in Monte Carlo 

algorithms for lattice QCD in 2005, graduating from the 

University of Edinburgh. He subsequently moved to 

Boston University, developing adaptive multi-grid 

algorithms and symplectic integrators. There, he initiated 

research into harnessing GPUs for lattice QCD 

computation. Dr. Clark spent 2009-2011 at Harvard 

University, where he continued to work on algorithms for 

GPUs and many-core processors, with focus on signal 

processing and multigrid. Dr. Clark moved to NVIDIA in 

2011, where his present work lies at the interface between 

applications, algorithms and parallel computation. 

h Session(s): S0347 - Accelerating Radio Astronomy 

Cross-Correlation beyond 1 Tflops using Fermi 


Don Clegg 

VP (Supermicro) 


h Session(s): S0636 - Supermicro: Worldwide leader 

in GP/GPU Servers and Workstation Platforms 


Esteban Clua 

Professor (Computer Science Department of 

Universidade Federal Fluminense, Rio de Janeiro, Brazil) 

Esteban is an associate professor at Universidade Federal 

Fluminense, Rio de Janeiro, and director of UFF 

Medialab. He is one of the founders of SBGames - 

Brazilian Symposium of Digital Entertainment and Video 

Games, director of Academia of IGDA-Rio, and president of 

the Brazilian Computing Society Game. In 2007 he received 

an award for contributing to the growth of the video 

game industry in Brazil and in 2009 he received the prize of 

Young Scientist of the State of Rio de Janeiro. Esteban is 

coordinator of the first Latin America CUDA NVIDIA 

Research Center, at UFF Medialab. 

h Session(s): S0074 – Techniques for Designing 

GPGPU Games (Thursday, 17:00, Room: L) 

Jonathan Cohen 

Emerging Applications (NVIDIA) 

Jonathan Cohen leads the Emerging Applications group 

as part of NVIDIA’s Content and Technology organization. 

Emerging Applications seeks to develop enabling 

technologies that will allow end-users to access the 

power of GPU computing in a wide variety of application 

areas. Previously, he spent three years as a senior 

research scientist with NVIDIA Research developing 

scientific computing and real-time physical simulation 

applications on NVIDIA’s massively parallel GPUs. Cohen 

was awarded an Academy Award (Technical Achievement 

Award) in 2007 from the Academy of Motion Picture 

Arts and Sciences for his work on fluid simulation and 

volumetric modeling for visual effects. He received an 

undergraduate degree from Brown in Mathematics and 

Computer Science. 

h Session(s): S0332 – Efficient Graph Matching 

and Coloring on the GPU 


Chris A. Cocosco 

Scientist (University Medical Center Freiburg, Dept. of 

Radiology, Medical Physics.) 

Chris A. Cocosco has spent over 15 years in research & 

development at the intersection of medical imaging, 

electrical engineering, computer science, and high 

performance computing, in both academic/clinical and 

industrial/commercial environments. 

h Session(s): S0348 - GPUs Open New Avenues in 

Medical MRI (Wednesday, 10:30, Room: A8) 

Andrew Corrigan 

Research Mathematician (Naval Research Laboratory) 

Andrew Corrigan has been a scientist at the Laboratory 

for Computational Physics and Fluid Dynamics at the US 

Naval Research Laboratory since 2010, where he is 

developing the Jet Engine Noise Reduction (JENRE) 

code. His research interests are in supersonic jet noise 

reduction and algorithms for high performance CFD 

solvers. He received his Ph.D. in 2009 from George 

Mason University, where he also worked as a 

postdoctoral researcher in the GMU CFD Center, porting 

the unstructured grid CFD code FEFLO to run on GPUs. 

h Session(s): S0031 - Unstructured Grid Numbering 

Schemes for GPU Coalescing Requirements 



Iain Couzin 

Professor, Department of Ecology and Evolutionary 

Biology (Princeton University) 

Iain Couzin joined the Princeton faculty in late 2007. 

Prior to joining the faculty there, he was a Royal Society 

University Research Fellow in the Department of 

Zoology, University of Oxford, and a Junior Research 

Fellow in the Sciences at Balliol College, Oxford. His 

work aims to reveal the fundamental principles that 

underlie evolved collective behavior, and consequently 

his research includes the study of a wide range of 

biological systems, from brain tumors to insect swarms, 

fish schools and human crowds. Couzin is a member of 

the Faculty of 1000 Biology and in recognition of his 

research he was a recipient of the Searle Scholar Award 

in 2008, the Mohammed Dahleh Award in 2009 and 

Popular Science magazine’s “Brilliant 10” award in 2010. 

Couzin holds a PhD in Biology from the University of 

Bath, UK. 

h Session(s): S3001: Day 2 Keynote: From Democratic 

Consensus to Cannibalistic Hordes: GPU Computing 

Reveals the Principles of Collective Behavior 

(Wednesday, 11:00, Keynote Hall) 

Cyril Crassin 

Postdoctoral Research Scientist (NVIDIA) 

Cyril Crassin joined NVIDIA Research in 2011 as a 

postdoctoral research scientist. Cyril obtained his Ph.D. 

degree from Grenoble University at INRIA in France in 

2011. His research interests include realistic rendering, 

voxel-based representations, global illumination, 

real-time ray-tracing and out-of-core data management. 

During his Ph.D., he developed the GigaVoxels approach 

that proposed the use of pre-filtered voxel representations 

for real-time rendering of large detailed scenes, complex 

objects, as well as global illumination effects. 

h Session(s): S0610 - Octree-Based Sparse 

Voxelization For Real-Time Global Illumination 

(Tuesday, 14:30, Room: B) 

Luis Crivelli 

Director of Solver Development (Dassault Systemes, 

SIMULIA) 


h Session(s): S0431 - Evolving Use of GPU for 

Dassault Systems Simulation Products 

(Wednesday, 09:00, Room: K) 

Jon Currey 

(Microsoft Research Silicon Valley) 

Jon Currey joined Microsoft Research in 2007, initially 

working on the Dryad and DryadLINQ cluster computing 

projects. His current research focus is systems support for 

GPU-accelerated computation. Jon previously worked for 

Apple, Oracle, Nortel and some startups. He holds a BA 

and MA in philosophy from the University of Cambridge. 

h Session(s): S0320 – PTask: OS Support for GPU 

Dataflow Programming (Thursday, 14:00, Room: B) 

Kenneth Czechowski 

Student (Georgia Tech) 

Kenneth Czechowski is a PhD student in the School of 

Computational Science and Engineering at the Georgia 

Institute of Technology. His research interests include 

algorithm-architecture codesign, performance modeling 

for GPU/manycore architectures, and parallel and 

distributed algorithms. Czechowski holds a master’s in 

computer science from the Georgia Institute of Technology. 

h Session(s): S0362 - Maximizing Performance on 

Multi-GPU Systems (Thursday, 09:00, Hall 1) 

Johann Dahm 

(University of Michigan) 


h Session(s): S0031 – Unstructured Grid Numbering 

Schemes for GPU Coalescing Requirements 


Abdul Dakkak 

(Wolfram Research) 


h Session(s): S0100 – Mathematica as a Practical 

Platform for GPU-Accelerated Finance 

(Wednesday, 17:00, Room: L) 

h S0106 – GPU Based Numerical Methods in 

Mathematica (Thursday, 14:30, Room: L) 

Eric Darve 

Professor (Stanford) 

Prof. Darve received his PhD in Applied Mathematics from 

Pierre et Marie Curie University, Paris, France (1999), 

while working in the Jacques-Louis Lions Numerical 

Analysis Laboratory under the supervision of Prof. Olivier 

Pironneau. He was a postdoctoral fellow at Stanford in the 

Center for Turbulence Research, under the supervision of 

Prof. Parviz Moin and Dr. Andrew Pohorille (NASA Ames 

Research Center). He became an assistant professor of 

Mechanical Engineering at Stanford University in 2001 

and was promoted to Associate Professor in 2010. He is a 

member of the Institute for Computational and 

Mathematical Engineering, a CUDA Center of Excellence. 

This work is in collaboration with Dr. Toru Takahashi 

(Nagoya University) and Dr. Cris Cecka (Harvard). 

h Session(s): S0334 - The Fast Multipole Method 

on CPU and GPU Processors 

(Thursday, 15:00, Marriott Ballroom 3) 

Guy De Beer 

CEO (Playcast Media System) 

Guy founded Playcast Media System. During his 16 years 

in the digital media communications industry, he led the 

successful development and commercialization of 

dozens of digital media communications products and 

services. Prior to founding Playcast, Guy managed 

Harmonic’s (NASDAQ: HLIT) Broadcast and VoD edge 

product lines. Before joining Harmonic, he held several 

product marketing and business development 

management positions with the MRV group (NASDAQ: 

MRVC). Guy holds a BA in Media from the University of 

Bar-Ilan in Israel and an MA in Philosophy of Digital 

Culture from the University of Tel Aviv. 

h Session(s): S2006 - Emerging Companies Summit: 

CEO on Stage Featuring Raytrix, Playcast and 

Universal Robotics 


Jose de Corral 

Principal Consulting Engineer (Waters Corporation) 

Jose is currently Principal Consulting Engineer at Waters 

Corporation. Jose de Corral received his B.S. in Electrical 

Engineering from Universidad Politécnica de Madrid, and 

his M.S. in Software Engineering from Harvard University. 

Jose has a long career at Waters, where he started in 

1983. He has been involved in many R&D design projects, 

specializing in analog electronic design, feedback control 

systems, and embedded software development. Jose’s 

preferences evolved toward the design of complex 

algorithms for data processing and instrument control. 

Since 2007, his main focus has been on Computer 

Graphics and GPU Computing. 

h Session(s): S0327 - Large and Sparse – Mass 

Spectrometry Data Processing in the GPU 

(Wednesday, 14:00, Room: B)

Mario Dean 

Schlumberger 

Mario Dean’s current role is remote application delivery 

product champion at Schlumberger Information 

Solutions. 

h Session(s): S0434 - Schlumberger LiveQuest: Application 

Delivery and Collaboration Solution 


Julien Demouth 

Developer Technology Engineer (NVIDIA) 

Julien Demouth is a Developer Technology Engineer at 

NVIDIA where he works mainly on CUDA for high 

performance computing. Julien obtained his Ph.D. 

degree in Computational Geometry from Nancy 

University at INRIA in France. 

h Session(s): S0602 – An Introduction to the 

Thrust Parallel Algorithms Library 


h S0285 - Optimization of a Sparse Matrix-Matrix 

Multiplication on the GPU 


Yangdong Deng 

Associate Professor (Tsinghua University) 

Yangdong Deng received his Ph.D. degree in Electrical 

and Computer Engineering from Carnegie Mellon 

University, Pittsburgh, PA, in 2006. He received his MS 

and BE degrees from the Electronic Department of Tsinghua 

University, Beijing, in 1998 and 1995, respectively. He has 

been an associate professor at the Institute of 

Microelectronics, Tsinghua University, since 2008. He 

also leads the systems modeling team of the Tsinghua- 

Intel Center of Advanced Mobile Computing Technology. 

His research interests include VLSI verification, parallel 

microarchitecture, and parallel algorithms. He is the 

author or co-author of three books and over 30 papers. 

h Session(s): S0050 - High Performance Logic 

Simulation with GPUs (Tuesday, 16:00, Room: J3) 

Kristof Denolf 

Research Engineer (Barco) 

Kristof Denolf received the M.Eng. degree in electronics 

from the KHBO (Belgium) in 1998, the M.Sc. degree in 

electronic system design from LMU (U.K.) in 2000 and a 

PhD from the Technische Universiteit Eindhoven in 2007. 

He joined IMEC in August 1998 as a research engineer 

focusing on optimized, low power video implementations. 

During 2008, he spent six months as a visiting 

researcher at Xilinx research labs to work with high-level 

synthesis tools. In 2010, he was a SW architect at 

Philips. Recently he joined Barco’s technology center, 

working on cost efficient design of advanced video 

processing systems. 

h Session(s): S0252 - Building Real-Time 

Professional Visualization Solutions with OpenCL 


Luiz DeRose 

Director of Programming Environment (Cray Inc.) 

Dr. Luiz DeRose is a Senior Principal Engineer and the 

Programming Environments Director at Cray Inc, where 

he is responsible for the programming environment 

strategy for all Cray systems. Dr. DeRose has a Ph.D. in 

Computer Science from the University of Illinois at 

Urbana-Champaign. With more than 20 years of high 

performance computing experience and a deep knowledge 

of its programming environments, he has published more 

than 50 peer-reviewed articles in scientific journals, 

conferences, and book chapters, primarily on the topics of 

compilers and tools for high performance computing. 

h Session(s): S0407 - A High Level Programming 

Environment for Accelerated Computing 


Ronny Dewaele 

Director Technology Center (Barco) 


h Session(s): S0252 – Building Real-Time 

Professional Visualization Solutions with OpenCL 

Tanmay Dharmadhikari 

Senior Software Development Engineer (Beckman-Coulter) 


h Session(s): S0638 – Lenovo ThinkStation 

Accelerates Medical Research with Beckman 

Coulter (Presented by Lenovo) 


Michael Dickens 

Graduate Student (University of Notre Dame) 

Michael L. Dickens is a Ph.D. candidate in Electrical 

Engineering at the University of Notre Dame. He 

received a B.S. from MIT in 1991, and an M.S. degree from 

the University of Notre Dame in 2001. He has more than 

10 years of industry experience, having worked at the 

Oak Ridge National Labs (Oak Ridge, TN), Bolt Beranek 

and Newman (“BBN”, Cambridge, MA), and most 

recently the MITRE Corporation (Bedford, MA). His 

current research interests span all aspects of 

programming for software-defined radios -- from 

system boot codes to kernels, signal-processing 

algorithm implementations to user interfaces. 

h Session(s): S0134 - On the Integration of 

OpenCL into a Software Defined Radio 


Michael Dixon 

Research Engineer (Willow Garage, Inc) 


h Session(s): S0088 – Point Cloud Library (PCL) on 

CUDA (Tuesday, 14:00, Room: C) 

Sebastien Domine 

Sr. Director, Software Engineering, Developer 

Tools (NVIDIA) 

Sébastien is the Sr. Director of Developer Technology 

Tools at NVIDIA. He runs various software engineering 

teams and oversees the development of software 

products dedicated to easing the developer’s life and to 

foster the creation of more applications that can take 

advantage of the GPU. Prior to NVIDIA, he worked on PC 

games at GameFX/THQ and 3D digital content creation 

tools at Katrix and Nichimen Graphics. He holds a 

Diplôme d’Ingénieur in Computer Science from EPITA, 

Paris, France. 

h Session(s): S0430 - Developing Next-Generation 

CUDA Acceleration in Wolfram’s Mathematica with 

Parallel Nsight (Tuesday, 09:30, Room: B) 

Mathieu Dubois 

(Bull) 

Mathieu joined Bull in 2009 as a GPU and hardware 

accelerator expert. After an engineering degree in 

electronics and a PhD in theoretical physics and 

nano-sciences, he started porting electronic transport 

applications to Graphical Processing Units in 2007, as 

part of a postdoctoral project for the simulation of new 

materials for nano-electronics. Now a member of the 

Bull Applications & Performance Team based in 

Grenoble, France, his main GPU activities are 

benchmarking, CUDA and OpenCL training, Proofs Of 


Concept and new technology evaluations. In 2011, he 

was heavily involved in the deployment of the three 

largest GPU clusters in Europe, at CEA, GENCI and the 

Barcelona Supercomputing Centre. 

h Session(s): S0643 - Hybrid Architectures for 

Advanced Seismic Imaging: Recent Experiences at 

Bull (Presented by Bull) (Tuesday, 17:00, Room: M) 

Eric Dunn 

Electromagnetic Research Scientist (SAIC) 

Dr. Dunn has been a research scientist at SAIC since 

2005 responsible for planning and executing a diverse 

range of solutions to problems that employ 

computational electromagnetics. His current 

responsibilities involve serving as a principal investigator 

to research high frequency asymptotic methods and 

hybrid techniques. His research interests involve 

studying hardware and software acceleration for 

high-performance scientific computing. He has been 

involved with product development and training for many 

SAIC software tools as well as outreach to Universities 

for collaboration and research mentoring. BSEE/ 

UMCP/1999, MS/UIUC/2000, PhD/UIUC/2005. 

h Session(s): S0046 - Application of the GPU to a 

Two-Part Computational Electromagnetic 

Algorithm (Tuesday, 14:30, Room: J3) 

Daniel Egloff 

Managing Partner (QuantAlea GmbH) 

Dr. Daniel Egloff studied mathematics, theoretical 

physics, and computer science at the University of 

Zurich and the ETH Zurich. He has been working for the 

last 17 years in the financial industry, mainly in risk 

management, credit risk, and derivative pricing. Since 

2007 he has been actively working with GPUs to accelerate 

quantitative financial calculations. In 2010 he founded 

QuantAlea, a niche consulting firm providing specialized 

project services in the area of derivative modeling, 

statistical arbitrage strategies and risk management 

paired with first class software engineering. 

h Session(s): S0405 - New Generation GPU 

Accelerated Financial Quant Libraries 


Anders Eklund 

PhD Student (Linköping University) 

Anders Eklund is a Ph.D. student at Linköping University, 

Sweden, with an M.Sc. in applied physics and electrical 

engineering. He is focused on medical image analysis, 

especially functional magnetic resonance imaging 

(fMRI). His current work involves using GPUs for 

non-parametric fMRI analysis (e.g. random permutation 

tests), real-time fMRI analysis (e.g. brain computer 

interfaces), interactive functional connectivity analysis 

and general medical image processing in 4D (e.g. 

denoising of large computed tomography (CT) datasets, 

512 x 512 x 450 x 20). 

h Session(s): S0017 - 4D Medical Image Processing 

with CUDA (Wednesday, 09:00, Room: A8) 

Rob Enderle 

Principal Analyst (Enderle Group) 

Rob is President and Principal Analyst of the Enderle 

Group, a forward looking emerging technology advisory 

firm. With over 25 years of experience with emerging 

technologies he has provided regional and global 

companies with guidance on how to be successful in this 

changing world. Before founding the Enderle Group Rob 

was the Senior Research Fellow for Forrester Research 

and the Giga Information Group. While there he worked 

for and with companies like Microsoft, TI, HP, IBM, Dell, 

Toshiba, Gateway, Sony, USAA, Texas Instruments, AMD, 

Intel, Credit Suisse First Boston, GM, Ford, ROLM, and 

Siemens. Prior to that he worked for IBM and held 

positions in Internal Audit, Competitive Analysis, 

Marketing, Finance, and Security. Currently Rob writes 

on Emerging Personal Technology, Security, and Linux 

for a wide variety of publications including 

TechNewsWorld, CIO, Forbes, TGdaily, TMCNET, 

Datamation, and IT Business Edge and international 

news organizations like CNBC, CNN, Bloomberg, and 

NPR. Rob also does a semi weekly radio spot for Wall 

Street Journal radio on consumer technology. Rob sits 

on the advisory councils for a variety of technology 

companies. 

h Session(s): Emerging Companies Summit 

(Wednesday all day, Marriott Ballroom 4) 

Eric Enderton 


Eric Enderton is a research scientist at NVIDIA, focusing 

on transparency, shadows, and film rendering. He was a 

principal engineer on NVIDIA Gelato, the first GPU-accelerated 

film rendering software. Previously, Eric 

developed rendering and animation software at 

Lucasfilm’s Industrial Light & Magic and at other major 

film studios. His film credits include “Terminator 2”, 

“Jurassic Park”, and “Star Wars Episode I”. Eric has a 

master’s degree in computer science from the University 

of California at Berkeley. 

h Session(s): S0409 - Stochastic Rasterization 


Kenneth Esler 

Computational Physicist (Stone Ridge Technology) 

Dr. Esler is a computational physicist at Stone Ridge 

Technology in Bel Air, Maryland. He received his 

bachelor’s degree in physics from MIT in 1999. He 

completed his Ph.D. in computational condensed matter 

physics at the University of Illinois at Urbana-Champaign 

in 2006, developing methods for quantum-level 

simulation of matter at finite temperature. He accepted 

postdoctoral appointments at the Carnegie Institution of 

Washington and the National Center for Supercomputing 

Applications. His professional interests include 

computational methods development, algorithm 

optimization, and heterogeneous computing platforms. 

h Session(s): S0140 - Accelerating Reservoir 

Simulation and Algebraic Multigrid with GPUs 


Sorin Faibish 

(EMC Corporation) 

Sorin Faibish designed and built innovative shared High 

Performance storage solutions including architecture 

design of NFS clusters, and architected the performance 

strategy of the Celerra file system. Sorin is a technology 

consultant and evangelist for pNFS as well as a member 

of the IETF and a contributor to the pNFS protocol, and has 

promoted pNFS in research forums. Sorin’s wider 

expertise includes: Clustered File systems, Storage 

systems, High Performance Computing, Robotic 

architectures, Complex systems design and Artificial 

Intelligence. Sorin holds a Master’s degree from Technion, 

Israel in EE, and is a member of IEEE, ACM, USENIX, 

IETF and SNIA and has 50 papers and 36 patents. 


New GPU Appliance for Co-processing 

(Wednesday, 15:00, Room: J) 

Wes Faler 

Head of Software Development (Part-Time Scientists) 

Wesley Faler is Head of Software Development at 

Part-Time Scientists. He is also a software engineer with 

25 years of broad experience. Unusual skills include 

GPU-based simulations, genetic programming, FPGAs,

high voltage electronics, ion engines, and sending a 

rover to the moon with the Part-Time Scientists for the 

Google Lunar X Prize. 

h Session(s): S3002 – Day 3 Keynote: Not Your 

Grandfather’s Moon Landing 

(Thursday, 11:00, Keynote Hall) 

Robert Farber 

Chief Scientist (BlackDog Endeavors, LLC) 

Rob is recognized for his work in High Performance 

Computing (HPC), machine learning, complex dynamical 

systems and high energy physics. Lately, he has been 

focused on advancing the state of the art through his 

publications and computational research including his 

book CUDA Application Design and Development, online 

venues Doctor Dobb’s Journal and The Code Project, 

peer-reviewed journals, conferences, and magazines such 

as Scientific Computing. Rob has co-founded two 

companies that achieved liquidity events, and has worked as a theoretical 

division scientist at Los Alamos and on staff at SFI, Berkeley 

and PNNL. Currently, he is working with and teaching at 

research and educational organizations around the world. 

h Session(s): S0038 - Designing Killer CUDA 

Applications for X86, multiGPU, and CPU+GPU 

(Thursday, 16:00, Marriott Ballroom 3), 

h S0646 - Massively Parallel Code Development on 

Stelletto CDA (Presented by Creative Consultants) 


Reza Farivar 

PhD Student (University of Illinois at Urbana-Champaign) 

Reza Farivar received his B.S. degree in electrical 

engineering in 2003, and his M.S. degree in computer 

engineering in 2005. He is currently finishing his PhD in 

Electrical and Computer Engineering at the University of 

Illinois at Urbana-Champaign. His major research 

interests include parallel cloud computing programming 

models, heterogeneous computing algorithms 

(specifically with GPUs) and combining GPUs and cloud 

computing paradigms. He has also worked on reliability 

and security as well as ubiquitous computing. 

h Session(s): S0152 - Accurate Sequence Alignment 

using Distributed Filtering on GPU Clusters 

(Tuesday, 15:30, Room: K) 

Massimiliano Fatica 

Manager (NVIDIA) 

Massimiliano Fatica is a manager of the Tesla 

Performance Group at NVIDIA where he works in the 

area of GPU computing (high-performance computing 

and clusters). He holds a laurea in Aeronautical 

Engineering and a PhD in Theoretical and Applied 

Mechanics from the University of Rome “La Sapienza”. 

Prior to joining NVIDIA, he was a research staff member 

at Stanford University where he worked at the Center for 

Turbulence Research and Center for Integrated 

Turbulent Simulations on applications for the Stanford 

Streaming Supercomputer. 

h Session(s): S0522 – Introduction to CUDA Fortran 


Wu Feng 

Professor (Virginia Tech) 

Wu Feng holds dual appointments in Computer Science 

and Electrical & Computer Engineering at Virginia Tech 

(VT) and an adjunct professorship in Cancer Biology and 

Translational Science Institute at Wake Forest University. 

He is an internationally recognized expert in high-performance 

computing (HPC), as evidenced by his 

presence on HPCwire’s People to Watch List in 2011. His 

lab works at the synergistic intersection of HPC and the 

domain sciences. He is an ACM Distinguished Scientist 

and an IEEE Senior Member. 

h Session(s): S0156 - Towards Computing the Cure 

for Cancer (Tuesday, 17:00, Hall 1) 

Alex Fit-Florea 

Senior Engineer (NVIDIA) 

Alex Fit-Florea currently works for NVIDIA as the CUDA 

software manager in charge of core mathematical 

functionality, random number generators, and FFT 

algorithms. His main professional and research 

interests revolve around computer arithmetic and 

numerical methods. He served as a member of the 

IEEE 754-2008 Standard for Floating Point Arithmetic 

Review Committee. Alex holds B.S. and M.S. degrees 

from UB-B, and a PhD from SMU. 

h Session(s): S0085 - Floating Point and IEEE 754 

Compliance for NVIDIA GPUs: Precision & 

Performance (Wednesday, 14:30, Room: A3) 

Christopher Fluke 

Senior Lecturer (Swinburne University of Technology - 

Centre for Astrophysics and Supercomputing) 

Dr. Christopher Fluke is a Senior Lecturer at the Centre 

for Astrophysics and Supercomputing, Swinburne 

University of Technology. His main research interests are 

in gravitational lensing, astronomy visualization, and 

advanced computation, with an emphasis on the adoption 

of GPUs to accelerate the rate of astronomical discovery. 

His GPU work has included advancements in gravitational 

microlensing computations (teraflop/s rates achieved on 

the desktop), real-time terascale visualization and data 

analysis on GPU-clusters (for next generation radio 

telescopes), and strategies for adoption of GPUs by 

astronomers. He is the Principal Investigator of the 

NVIDIA CUDA Research Centre at Swinburne University. 

h Session(s): S0707 - Los Alamos AHPC Symposium, 

Accelerated HPC Symposium: Scalability: 

Hardware and Software (Thursday, 9:00, Room: J2) 

h S0022 - Scalable Frameworks and Algorithms 

for Terascale Radio Astronomy Images 


Steve Forde 

Senior Product Manager (Adobe) 

Steve Forde joined Adobe in 2011 as senior product 

manager for After Effects, the industry-leading software 

for creating sophisticated motion graphics and cinematic 

visual effects. In this role, Forde oversees extending 

After Effects into new markets and workflows. Forde is 

an experienced executive and co-founder of multiple 

businesses within media and emerging technology. He 

joined Adobe from Gridiron Software where he was 

co-founder/CEO and CTO. Gridiron develops 

complementary technologies for After Effects, and 

software for managing overall workflow in the creative 

enterprise. Forde grew the company from venture 

funding to a global operation and from a perpetual 

license revenue base to a SaaS model. Forde was 

co-founder/CEO of Creative Shack Inc. and oversaw an 

acquisition by Mitel Networks. Forde sits on the board of 

Black Cherry Digital Media. 

h Session(s): S0632 - Learn how Adobe After Effects 

CS6 takes advantage of NVIDIA OptiX technology 

for 3D Ray Tracing (Presented by Adobe) 


Dustin Franklin 

GPGPU Applications Engineer (GE Intelligent Platforms) 

Dustin is a GPU expert in the defense & aerospace 

industry. Originally a 3D rendering architect for games 

and simulations, he changed focus in 2005 to GPGPU. 

Dustin has years of experience in deploying high-performance 

CUDA applications onto rugged platforms 

like tanks, humvees, and UAVs. Currently, he works for 


GE as a GPGPU Applications Engineer and lives near 

Washington DC. 

h Session(s): S0253 - Sensor Processing with 

Rugged Kepler GPUs (Wednesday, 09:00, Room: M) 

Tom Furlong 

Managing Director (Granite Ventures LLC) 

Tom joined Granite Ventures in 2000, after a successful 

career in Silicon Valley that included stints as a vice 

president at Zhone Technologies, a communications 

equipment provider, and as a partner with a leading 

valley law firm, where he spent 13 years counseling 

technology companies, venture capitalists and 

investment banks. Tom currently serves on the Boards 

of Directors for Aspen Avionics, GoingOn Networks, 

Indicee, Mixamo and Skytide. Prior investments include 

Biz360 (acquired by Attensity), Digital Fountain (acquired 

by Qualcomm), Five Across (acquired by Cisco), Kinecta 

(acquired by Stellent), and TuVox (acquired by West 

Interactive). 



Ravikumar G.V.V. 

(Infosys Ltd, Bangalore) 


h Session(s): S0214 – GPU Based Stacking Sequence 

Optimization For Composite Skins Using GA 


Klaus Gaedke 

Lab Manager (Technicolor) 

Klaus Gaedke studied Electrical and Electronic 

Engineering at the University of Hannover, Germany, and 

received his Dipl.-Ing. and PhD degree from this 

institution. In 1996 he started to work for Technicolor 

Research and Innovation. Currently, he is responsible for 

Technicolor’s Image Processing Lab. His research 

interests include parallel programming, parallel real-time 

processing architectures and real-time implementation 

of image processing algorithms. 

h Session(s): S0073 - Cost-effective GPU 

Acceleration of a Video Restoration and Archiving 

Workflow (Wednesday, 15:30, Room: A1) 

Daniel Gaudlitz 

Research Associate (Technische Universität München) 

As a research associate at Technische Universität 

München, Daniel Gaudlitz works on complex multiphase 

flows and their numerical modelling. Efficient 

methods for HPC in academia and industry are another major 

research focus. Daniel Gaudlitz also leads R&D activities at 

the engineering company FluiDyna GmbH. After graduating 

with a master’s degree from TU Dresden in 2003, he joined 

TU München and received a PhD in 2008 for his research 

on numerical simulations of multiphase flows. 

h Session(s): S0296 - A GPU-Enabled SPH Method 

for Micro and Nanofluidic Simulations 


Wei Ge 

Professor (Institute of Process Engineering, Chinese 

Academy of Sciences) 

Prof. Ge received his PhD degree from Harbin Institute of 

Technology in 1998 and has been a professor of chemical 

engineering at the Institute of Process Engineering, Chinese 

Academy of Sciences since 2006. He is mainly engaged 

in multi-scale simulation of particle-fluid two-phase 

systems. He proposed the so-called “pseudo-particle” 

model which enables simulation of macro-scale flow 

phenomena from microscopic physics through large-scale 

parallel computation. As project leader, he has 

been working on the multi-scale software and hardware 

systems to bridge the simulation of molecular details to 

reactor performance. 

h Session(s): S0268 - Virtual Process Engineering 

- Realtime Simulation of Multiphase Systems 


h S0057 - GPU-Accelerated Molecular Dynamics 

Simulation of Solid Covalent Crystals 


Isaac Gelado 

Senior Researcher (Barcelona Supercomputing Center) 

Isaac Gelado is a Senior Researcher at the Barcelona 

Supercomputing Center and a Visiting Scholar at the 

Coordinated Science Laboratory at the University of 

Illinois. At BSC, Isaac is working in the Mont-Blanc 

project and the NVIDIA CUDA Center of Excellence. Isaac 

holds a Master’s degree in Telecommunications 

Engineering from the Universidad de Valladolid, and a 

PhD degree from The Department of Computer 

Architecture in the Universitat Politecnica de Catalunya, 

where he also held a teaching position in the Computer 

Architecture Department. 

h Session(s): S0333 – GMAC-2: Easy and Efficient 

Programming for CUDA-Based Systems 


Shaul Gelman 

Co-Founder and VP of R&D (RealView Imaging Ltd.) 

Mr. Gelman is an experienced R&D executive with over 

twelve years of hands-on experience in cutting edge 

projects in the field of multidisciplinary display 

technologies. Mr. Gelman co-founded RealView Imaging 

in 2008 and has been leading all the company’s R&D 

activities since inception. Prior to that, Shaul worked for 

Elbit Systems (NASDAQ: ESLT), one of Israel’s largest 

defense companies, leading the development of 

high-end helmet-mounted display systems for aviation/ 

pilot applications. Mr. Gelman earned his Executive MBA 

from the Haifa University, and a B.Sc. in Industrial 

Engineering & Management from the Technion, Israel 

Institute of Technology. 





Geoff Gerfin 

Sr. System Software Engineer and Technical Manager 

(NVIDIA) 

Geoff Gerfin is currently a Sr. System Software Engineer 

and Technical Manager in the CUDA Tools Group at 

NVIDIA, where he develops and manages tools for 

next-generation GPU architectures. Geoff has worked in 

the HPC community since receiving his degree in 

Computer Engineering from the University of Delaware 

in 2005. 

h Session(s): S0027A - All-In-One Debugging 

Experience with CUDA-GDB and CUDA-MEMCHECK 


h S0027B - All-In-One Debugging Experience with 

CUDA-GDB and CUDA-MEMCHECK 


Denis Gerrer 

Denis Gerrer has 20 years of experience in HPC, 

previously working for SGI and Altair Engineering. As 

CAPS VP and General Manager Americas, he is now in 

charge of relations with CAPS Enterprise partners. 

h Session(s): S0646 - Massively Parallel Code 

Development on Stelletto CDA (Presented by 

Creative Consultants) (Tuesday, 17:00, Room: A8)

Flip Gianos 

General Partner (InterWest Partners) 

Philip “Flip” Gianos has been part of InterWest’s IT team 

since 1982. With a background in engineering, he has 

invested in multiple areas of information technology, 

including semiconductors, computing and networking 

equipment, and infrastructure and applications software. 

He is chairman of the board of Xilinx (XLNX), a publicly 

held company, and is also a board member of several 

privately held companies, including: Bivio Networks, 

Brand.net, Convey Computer, and SpectraLinear. Gianos 

also serves on the advisory board of Storm Ventures II, 

and is a past president of the Western Association of 

Venture Capitalists. 



Oliver Gicquel 

Professor (Laboratoire E.M2.C, Ecole Centrale Paris) 


h Session(s): S0129 – A Monte Carlo Thermal 

Radiation Solver in GPU/CPU Hybrid Architecture 


Ben Goertzel 

CEO (Novamente LLC) 


h Session(s): S0104 - GPU Implementation of Deep 

Learning for Intelligent Computer Vision 


James Goodman 

President/CEO (HySpeed Computing LLC) 

Dr. Goodman is founder and President/CEO of HySpeed 

Computing, a technology company specializing in 

developing advanced algorithms and analytic tools for 

the geospatial community. His expertise includes remote 

sensing, image analysis, mathematical modeling, and 

high performance computing. Dr. Goodman maintains 

academic affiliations with the University of Puerto Rico 

at Mayaguez and the University of Miami, where 

research is focused on remote sensing of coastal 

ecosystems. He has been awarded grants from NASA, 

NSF and NOAA, and collaborated with investigators from 

around the world. He is also active in the scientific 

community, publishing research and leading sessions at 

international conferences. 

h Session(s): S0290 - Algorithm Acceleration for 

Geospatial Analysis (Thursday, 09:30, Marriott 

Ballroom 3) 

David Goodwin 

Software Engineer (NVIDIA) 

David is technical lead for the CUDA Visual Profiler 

at NVIDIA. 

h Session(s): S0419A - Optimizing Application 

Performance with CUDA Profiling Tools 

(Tuesday, 09:00, Room: C) 

h S0420 - NSight IDE for Linux and Mac 


h S0419B - Optimizing Application Performance with 

CUDA Profiling Tools (Wednesday, 14:00, Room: A5) 

Chris Gottbrath 

Principal Product Manager (Rogue Wave Software) 

Chris Gottbrath is Principal Product Manager for 

TotalView, MemoryScape, ReplayEngine and 

ThreadSpotter at Rogue Wave Software. He’s worked 

with the TotalView debugger for more than a decade in a 

range of technical and marketing roles. Prior to that he 

wrote his fair share of bugs in Linux-based numerical 

simulations of galaxy dynamics and large scale structure 

as a graduate student in Tucson, AZ. He has a Master 

of Science in Astronomy and Astrophysics from the 

University of Arizona. 

h Session(s): S0340 - Debug Multi-GPU Applications 

on CUDA-Accelerated Clusters with TotalView 


Jérôme Graindorge 

Project Manager (ALYOTECH) 

Graindorge has been working for six years for ALYOTECH 

(a software services company) first as a software 

engineer, and most recently as a project manager 

specially dedicated to HPC and particularly GPU-based 

scientific applications. 

h Session(s): S0053 - Real Time GPU-Based Marine 

Scenes Simulation (Thursday, 10:00, Room: N) 

Alan Gray 

HPC Architect (The University of Edinburgh) 

Dr. Alan Gray was awarded a Ph.D. at The University of 

Glasgow in Theoretical Particle Physics in 2003, winning 

the 2004 Ogden Prize for the best UK thesis in particle 

physics phenomenology. He furthered this work under a 

fellowship at The Ohio State University, and since joining 

EPCC in 2005 he has been involved with a wide range of 

HPC-related projects: lately his research has focused on 

the role GPUs will play in future generations of 

supercomputers, including participation in the OpenMP 

language committee exploring adoption of accelerators. 

He has authored a large number of refereed and 

highly-cited publications. 

h Session(s): S0286 - Scaling Applications to a 

Thousand GPUs and Beyond 


Simon Green 

Senior Software Engineer (NVIDIA) 

Simon Green is a senior member of the Developer 

Technology group at NVIDIA, specializing in real-time 

compute, rendering and physical simulation. He started 

graphics programming on the Sinclair ZX-81, which had 

1 kB of RAM and a screen resolution of 64 by 48 pixels, 

and has been trying to improve the quality of real-time 

graphics ever since. 

h Session(s): S0102 - Flame On: Real-Time Fire 

Simulation for Video Games 


Ray Grout 

(National Renewable Energy Laboratory) 

Dr. Grout’s research interests as part of the 

Computational Science Center at the National 

Renewable Energy Laboratory include algorithmic 

advances to facilitate integrating partial differential 

equations (PDEs) numerically on future architectures 

and development of future computational fluid dynamics 

(CFD) capabilities with particular emphasis on reacting 

flows. Dr. Grout has expertise in development of 

turbulent combustion submodels and has a wealth of 

experience developing several combustion codes at 

different institutions. His recent work has focused on the 

development of DNS (direct numerical simulation) 

databases for jets in cross flow from peta-scale, 

high-fidelity simulations in collaboration with the gas 

turbine industry. A key outcome of this work has been 

insight into the importance of low-velocity recirculation 

zones and stratified combustion in the stabilization of 

flames above a jet in cross flow. Earlier work involved 

using DNS to probe fundamental understanding of 

stratified combustion, to investigate appropriate flame 

markers (progress variables, tracers), and to propose 

new models for the combined effects of flame 

propagation and mixing. Dr. Grout also has experience 

deploying models for gaseous auto-ignition using 

commercial CFD codes. 


h Session(s): S0625 - S3D Direct Numerical 

Simulation - Preparations for the 10-100PF Era 


Vinod Grover 

Senior Manager (NVIDIA) 

Vinod Grover manages the compiler team at NVIDIA and is 

responsible for compilation of CUDA and OpenCL to PTX 

ISA. Vinod has been with NVIDIA for 4 years and at 

Microsoft and Sun Microsystems before that. He 

holds a Master’s degree in computer science from 

Syracuse University. 

h Session(s): S0235 - Compiling CUDA and Other 

Languages for GPUs (Wednesday, 10:00, Room: A5) 

Guy Gueritz 

(Bull) 

Guy Gueritz joined Bull in 2008 to develop Bull’s HPC 

business in the upstream oil and gas industry, with 

particular focus on GPU-accelerated hybrid systems for 

advanced seismic imaging applications such as Reverse 

Time Migration. He has over twenty years’ experience in 

HPC and visualization applied to the geosciences, with 

previous roles in Hewlett-Packard, Linux Networx and 

SGI. His worldwide responsibilities include working with 

oil companies, seismic contractors, independent 

software vendors and technology partners to deploy 

advanced imaging capabilities on scalable HPC systems. 

He regularly participates in oil industry seminars and 

conferences and is a member of SEG and EAGE. 

h Session(s): S0643 - Hybrid Architectures for 

Advanced Seismic Imaging: Recent Experiences at 

Bull (Presented by Bull) (Tuesday, 17:00, Room: M) 

Thomas Guignon 

Research Engineer (IFPEN) 


h Session(s): S0108 - An Innovative Massively 

Parallelized Molecular Dynamic Software 


Kshitij Gupta 

Graduate Student Researcher (UC Davis) 

Kshitij Gupta is a Ph.D. candidate in the Department of 

Electrical & Computer Engineering at UC Davis. He is 

interested in a variety of application domains like audio, 

image, and video. His primary interests are in exploring 

novel ways of transforming today’s high-performance 

algorithms onto emerging low-end, low-power, hybrid 

(CPU/GPU/DSP/ASIP) processors targeted towards 

mobile and automotive platforms. In his spare time, he 

likes procrastinating about novel user-interfaces, and 

hopes to work more actively on it some day. Kshitij 

received his Masters in EE from University of Pittsburgh 

(PA, USA), and his Bachelors in ECE from Osmania 

University (Hyderabad, India). 

h Session(s): S0157 - A Study of Persistent Threads 

Style Programming Model for GPU Computing 


Pankaj Gupta 

Bioinformatics Application Developer (St Jude Children’s 

Research Hospital) 

Pankaj is working as a Bioinformatics Application 

Developer at St. Jude Children’s Research Hospital in 

Memphis, TN. He received his bachelor’s degree in 

Computer Science from Rutgers University and his 

master’s degree in Computational Bioscience from 

Arizona State University. He likes working with open-source 

technologies whenever possible. 

h Session(s): S0083 - Swift: A GPU-based Smith- 

Waterman Sequence Alignment Program 


Rohit Gupta 

PhD Student (Delft University of Technology) 

Rohit completed his master’s at the Delft University of 

Technology in computer engineering. During his 

master’s thesis he worked on implementing a 

preliminary version of a preconditioned conjugate 

gradient solver on the GPU. He continued at the Delft 

Institute of Applied Mathematics as a PhD student after 

graduating. His primary focus is to find new 

preconditioning methods that are suited to the GPU and at 

the same time are on par with established parallelizable 

preconditioning techniques like Block Incomplete 

Cholesky in terms of achievable precision and 

mathematical stability. 

h Session(s): S0063 - Robust Preconditioned Conjugate Gradient 

for the GPU and Parallel Implementations 

(Thursday, 16:00, Room: N) 

Sebastien Gurrieri 

Quantitative Analyst (Mizuho International) 

With a background of research in Theoretical Physics 

(String Theory), Gurrieri switched to finance 4 years ago. 

He is now working in the London branch of a Japanese 

investment bank and specializes in Risk Management of 

Fixed Income and Equity products. Until now he has 

been mostly interested in calibration and Monte-Carlo 

simulation issues, although he has also done some work 

on Finite Difference methods. 

h Session(s): S0206 - Monte-Carlo Pricing 

Under a Hybrid Local Volatility Model 


Tobias Gysi 

(Supercomputing Systems AG) 

Tobias Gysi graduated in 2005 in computer science from 

ETH Zurich, Switzerland. He joined the R&D service 

provider Supercomputing Systems AG (SCS), working on 

advanced topics such as cryptography, image 

processing, speech recognition, and Monte-Carlo 

pricing. Tobias’ work has a strong focus on performance 

optimizations - developing more efficient 

implementation strategies and algorithms, and 

employing accelerators such as GPUs or FPGAs. 

Currently Tobias is dealing with a community code 

project where software maintainability and 

(performance) portability are key issues. 

h Session(s): S0256 – A Stencil Library for the New 

Dynamic Core of COSMO (Thursday, 09:00, Room: N) 

Alexander Haberstroh 

Software Developer (Jedox AG) 

Alexander Haberstroh studied computer science with a 

focus on image processing at the University of Freiburg, 

Germany, where he obtained his Master’s degree in 

2010. Between 2008 and 2010, he was also working at 

the Fraunhofer Institute for Solar Energy Systems. 

During his studies he worked on his first CUDA project, 

developing algorithms for comparing depth maps which 

are used in mobile robot mapping. Since 2011, he has 

been working at Jedox, concentrating on GPU 

algorithms for multidimensional databases in the area 

of Business Intelligence. 

h Session(s): S0219 – Efficient Top-Down Planning in 

Business Intelligence (Tuesday, 17:00, Room: C) 

Markus Hadwiger 

Assistant Professor (KAUST) 

Markus Hadwiger is an assistant professor of computer 

science at King Abdullah University of Science and 

Technology (KAUST) in Saudi Arabia. His research 

interests are petascale visual computing and scientific 

visualization, volume rendering, and GPU algorithms in 

general. He is currently teaching classes on scientific

visualization, and GPU and GPGPU programming. He 

obtained a PhD in computer science from the Vienna 

University of Technology. He has taught a series of 

courses on various aspects of visualization and volume 

rendering at ACM SIGGRAPH, IEEE Visualization, and 

Eurographics, and is a coauthor of the book Real-Time 

Volume Graphics (A.K. Peters, 2006). 

h Session(s): S0202 – Terascale Volume Visualization 

in Neuroscience (Wednesday, 16:30, Room: A8) 

Yoshiaki Hanada 

CEO (Prometech Software, Inc.) 

Yoshiaki Hanada is CEO of Prometech Software and 

works on promoting a particle simulation technology 

from Japan to the world. In his former job, he worked at 

Accenture Japan as a management consultant. In 2006 

he received a master’s degree from the Department of 

Advanced Energy, Graduate School of Frontier Sciences, 

The University of Tokyo. 

h Session(s): S0066 - Particleworks: Particle-based 

CAE Software Fully Ported on Multi-GPU 


Jerry Harris 

Senior Computer Scientist II (Adobe Systems) 

For 25+ years, Jerry has focused on deploying engaging 

commercial imaging applications. First as part of a 

startup that delivered the first commercial color paint 

program to the Macintosh, later at Apple, and for the 

past 15 years at Adobe working on Photoshop. He has been 

an engineer on the Photoshop team since version 

5.0, responsible for Layer Effects, Painting, Warping, 

and GPU acceleration. His current focus is on GPU 

enablement, and the delivery of joy of use via immersive 

fluid workflows. 

h Session(s): S0395 - GPU Enablement in Adobe 

Photoshop (Tuesday, 09:00, Room: A2) 

Mark Harris 

Chief Technologist, GPU Computing (NVIDIA) 

Mark Harris is Chief Technologist for GPU Computing at 

NVIDIA, where he works as a developer advocate and 

helps drive NVIDIA’s GPU computing software strategy. 

His research interests include parallel computing, 

general-purpose computation on GPUs, physically based 

simulation, and real-time rendering. Mark founded 

www.GPGPU.org while he was earning his PhD in computer 

science from the University of North Carolina at Chapel 

Hill. Mark brews his own beer and cures his own bacon 

in Brisbane, Australia, where he lives with his wife and 

daughter. 

h Session(s): S0517A - Programming GPUs with 

OpenACC (Part 1 of 3) (Monday, 10:30, Room: B) 

h S0517B - Programming GPUs with OpenACC (Part 

2 of 3) (Monday, 13:00, Room: B) 

h S0517C - Programming GPUs with OpenACC (Part 3 

of 3) (Monday, 14:30, Room: B) 

h S0641 - CUDA 5 and Beyond (Tuesday, 16:00, Hall 1) 

h S0653 - C++ and CUDA Birds-of-a-Feather 


Mike Heck 

Technology Advisor (VSG) 


h Session(s): S0444 - Explore New Techniques in 

Volume Rendering/Segmentation with Open 

Inventor (Tuesday, 15:30, Room: A7) 

Francisco J. Hernandez-Lopez 

PhD Student (CIMAT A.C.) 

Francisco received a bachelor’s degree in computer 

systems engineering from the San Luis Potosi Institute 

of Technology, Mexico in 2005. He received the MSc 

degree in Computer Science from the Center for 

Research in Mathematics (CIMAT) in 2009. Since then, he 

has been a doctoral student at CIMAT, where he has been 

granted a CONACYT scholarship. His main interests are 

in the area of computer vision and in particular the 

development of efficient parallel algorithms for video 

processing and analysis. 

h Session(s): S0128 - V:Screen: A Real-Time 

Augmented Video Method 


David Helgason 

CEO (Unity Technologies) 

David Helgason, an entrepreneur, visionary and 

ex-programmer, has served as the CEO of Unity 

Technologies since co-founding it in 2003. The vision is 

to democratize game development and develop 

technology for the next generation of the industry. David 

founded and participated in startups in fields such as 

news and community integration, music distribution and 

consulting. He serves on the boards of several games 

and technology startups. 


CEO on Stage Featuring Unity Technologies, 

MirriAd and BioDigital 


Jeff Herbst 

Vice President of Business Development (NVIDIA) 

Jeff is the Vice President of Business Development at 

NVIDIA Corporation, the world leader in visual 

computing technologies (and inventor of the GPU). In 

this role, which he has held since 2001, Jeff leads 

NVIDIA’s worldwide business development efforts, 

including overall ecosystem development, mergers and 

acquisitions strategy, investments, partnerships and 

other strategic business relationships and transactions. 

Prior to NVIDIA, Jeff was the worldwide head of 

corporate and business development at AltaVista, and 

also served as general manager for a start-up focused 

on content delivery infrastructure for wireless networks. 

Earlier in his career, Jeff was a partner with the law firm 

of Wilson Sonsini where he specialized in corporate 

finance, joint ventures, mergers and acquisitions and 

other strategic business and intellectual property-related 

transactions. Jeff holds a B.S. degree in 

Computer Science from Brown University (where he 

studied computer graphics), and a law degree from 

Stanford Law School. 



Berk Hess 

PhD Student (KTH Royal Institute of Technology) 


h Session(s): S0363 – Efficient Molecular Dynamics 

on Heterogeneous GPU Architectures in GROMACS 


Christopher Horvath 

Global Technology Technical Director (Pixar) 


h Session(s): S0102 – Flame On: Real-Time Fire 

Simulation for Video Games 


Julien Houssay 

Software Engineer (ALYOTECH) 

Julien is a software engineer at ALYOTECH, specialized 

in GPU computing in scientific applications. He is 

currently working on a marine scene simulator mixing 

electro-optics and radar, using GPU for both general 


purpose computing (CUDA and/or OpenCL) and 

rendering (OpenGL). 

h Session(s): S0053 – Real Time GPU-Based Marine 

Scenes Simulation (Thursday, 10:00, Room: N) 

Agatha Hu 

Developer Technology Engineer (NVIDIA) 

Agatha Hu is a Developer Technology Engineer at NVIDIA 

Corporation. She received a master’s degree in Biomedical 

Engineering from Shanghai Jiaotong University. Her work 

includes developing data parallel algorithms on GPU for 

bioinformatics as well as image processing. 

h Session(s): S0084 - CUMACH - A Fast GPU-based 

Genotype Imputation Tool 


Jen-Hsun Huang 

Co-Founder, President and CEO (NVIDIA) 

Jen-Hsun Huang co-founded NVIDIA in 1993 and has 

served since its inception as president, chief executive 

officer and a member of the board of directors. Under 

his leadership, NVIDIA invented the graphics processing 

unit (GPU) in 1999. Since then, it has consistently set 

new standards in visual computing with breathtaking, 

interactive graphics available on devices ranging from 

tablets and portable media players to notebooks and 

workstations. NVIDIA’s expertise in programmable GPUs 

has led to breakthroughs in parallel processing which 

make supercomputing inexpensive and widely 

accessible. The company holds more than 1,100 U.S. 

patents, including ones covering designs and insights 

fundamental to modern computing. 

h Session(s): S3000: Opening Keynote 

(Tuesday, 10:30, Keynote Hall) 

h S2003: Emerging Companies Summit Fireside Chat 


John Humphrey 

Engineering Director (EM Photonics) 

John received his MSEE from the University of Delaware 

in 2004 and has been working in the field of accelerated 

computing for 10 years. The past six years have focused 

primarily on GPU applications, in areas ranging from 

computational electromagnetics to computational fluid 

dynamics and linear algebra libraries. 

h Session(s): S0304 - Large Scale Computational 

Fluid Dynamics Simulations on Hybrid 

Supercomputers (Wednesday, 10:30, Room: K) 

h S0307 - New Advances in GPU Linear Algebra 






Maxwell Hutchinson 

PhD Student (University of Chicago) 

Maxwell is currently a physics PhD student at the 

University of Chicago, funded by a Department of Energy 

Computational Science Graduate Fellowship. He has 

been working with GPGPUs since 2008, applying them to 

problems in electronic structure, Ising models, error 

correction in radio systems, and post-processing for 

particle detectors. 

h Session(s): S0378 - VASP Accelerated with GPUs 


Saeed Iqbal 

Senior Systems Engineer (Dell) 

Saeed Iqbal is a Senior Systems Engineer in the Global 

Solutions Engineering Group at Dell. Currently, he is the 

lead engineer on integration and performance analysis 

of GPUs in the Dell HPC solutions. He is also the lead 

engineer of the HPC advisor online tool at Dell.com/hpc. 

This tool is used by HPC customers to configure GPU 

enabled HPC clusters and associated high performance 

parallel storage clusters. 

h Session(s): S0309 – Dynamically Allocating GPGPU 

to Host Nodes (Servers) (Thursday, 10:30, Room: K) 

Olexan Isayev 

Research Scientist (Case Western Reserve University) 

Olexan Isayev was born in Ukraine and earned his Ph.D. 

in Theoretical Chemistry under the supervision of Jerzy 

Leszczynski at Jackson State University. He is currently 

a joint Postdoctoral Fellow at Case Western Reserve 

University and US Army Engineering Research and 

Development Center (ERDC). Dr. Isayev’s research 

interests focus on structure and dynamics at bio-nano 

interfaces, first principles and hybrid QM/MM simulations 

and high performance computing. 

h Session(s): S0315 - Probing Bio-Nano Interface 

Structure from Microsecond Molecular Dynamics 

on GPUs (Thursday, 10:00, Marriott Ballroom 4) 

Michel Izygon 

CTO (Tietronix Software, Inc.) 

Dr. Izygon has been involved in Solar Energy Projects 

since 1982, when he became the Principal Investigator 

on a French-Israeli research project to build and assess 

the performance of different solar energy concentrating 

systems. Since 1999, Dr. Izygon has been the co-founder 

and CTO of Tietronix Software, a company specializing 

in custom software development for customers such 

as NASA. 

h Session(s): S0321 – GPU-Based Monte Carlo Ray 

Tracing Simulation for Solar Power Plants 


Kevin Jackson 

Founder / CEO (Viewpartners) 

Kevin Jackson is founder and CEO of Viewpartners. He 

has 20+ years of visual media experience as one of the 

first in L.A.’s special effects market. He has worked with 

the biggest names in the film and advertising industry – 

Sony, Disney, BBDO, JWT, and others. 

h Session(s): S0425 - File Sharing Plus Real Time 

Media and Document Collaboration 


Jan Jacob 

Postdoctoral Researcher (University of Hamburg) 

Dr. Jan Jacob is a postdoctoral researcher at the 

Institute of Applied Physics of the University of Hamburg, 

Germany. He studied physics in Hamburg and graduated 

in 2007 with his diploma thesis “Preparation and 

Characterization of Spin Filters based on InAs Quantum- 

Point Contacts”. Two years later in 2009 he received his 

Ph.D. from the University of Hamburg for his thesis 

“All-electrical InAs Spin Filters”. Since then he has expanded 

his research from low-temperature magnetotransport 

measurements to numerical high-performance 

computing simulations of spin and charge transport in 

mesoscopic systems to model spintronic devices. 

h Session(s): S0379 - GPU-based High-Performance 

Simulations for Spintronics 


M. Saleet Jafri 

Professor and Chair (George Mason University) 

M. Saleet Jafri is a Professor in the School of Systems 

Biology at George Mason University. His current research 

uses detailed multi-scale models consisting of the 

subcellular, cellular, and tissue components to

understand the mechanisms that give rise to complex 

diseases in the heart such as cardiac arrhythmia, 

ischemic heart disease, and heart failure. GPU computing 

plays a central role in these studies. He received his PhD 

from Mount Sinai School of Medicine/CUNY in the 

Biomathematical Sciences, MS in Mathematics from the 

Courant Institute of Mathematical Sciences at NYU, and his 

BS in mathematics from Duke University. 

h Session(s): S0072 - GPU-Enabled Spatiotemporal 

Model of Stochastic Cardiac Calcium Dynamics and 

Arrhythmias (Wednesday, 09:00, Room: B) 

Michal Januszewski 

PhD Student and Software Engineer (University of Silesia 

in Katowice; Google Switzerland) 

Michał Januszewski is a Software Engineer at Google 

Switzerland and a PhD student at the University of 

Silesia in Katowice under the supervision of Prof. Marcin 

Kostur. His current research is centered around applying 

mesoscale hydrodynamics simulation methods to 

biologically relevant flows. Michał is also the leader of 

the Sailfish project, an open source effort to build a 

highly scalable lattice Boltzmann fluid dynamics solver 

for GPUs. 

h Session(s): S0258 - Sailfish: Lattice Boltzmann 

Fluid Simulations with GPUs and Python 


WeiLe Jia 

Postgraduate Student (Supercomputing Center of CNIC, 


Weile Jia is a postgraduate student at the 

Supercomputing Center of the Chinese Academy of Sciences. 

h Session(s): S0392 - Large-Scale First Principle 

Pseudopotential DFT Calculations on GPU Clusters 


Stephen Jones 

CUDA Developer (NVIDIA) 

Stephen Jones is a member of CUDA’s parallel 

algorithms group. Having first worked on the CUFFT 

library, he moved on to architect the parallel system 

software framework which enables system I/O from GPU 

kernels, and wrote the first parallel system calls. He has 

made a particular study of thread execution on the GPU, 

and now works on future GPU architectures and 

development of the CUDA programming model. 

h Session(s): S0313 – Understanding and using 

Atomic Memory Operations 

(Tuesday, 14:00, Marriott Ballroom 3) 

h S0642 - Inside Kepler (Wednesday, 14:00, Hall 1) 

h S0338 - New Features In the CUDA Programming 

Model (Thursday, 10:00, Hall 1) 

h S0707 - Los Alamos AHPC Symposium, Accelerated 

HPC Symposium: Scalability: Hardware and 

Software (Thursday, 09:00, Room: J2) 

Mark E S Joselli 

Researcher (UFF) 

Mark graduated as an Industrial Engineer with an 

emphasis in Electrical and Electronic Engineering from the 

Federal Center for Technological Education Celso Suckow da 

Fonseca (CEFET-RJ, 2005) and received an MSc in Computer 

Science from Federal Fluminense University (2007). He has 

experience in Computer Science with an emphasis on 

Computer Methods and Techniques, working on the following 

topics: games, simulation, and GPGPU. 

h Session(s): S0074 - Techniques for Designing 

GPGPU Games (Thursday, 17:00, Room: L) 

Guido Juckeland 

System Engineer (HPC), Leader Hardware Accelerator 

Group (TU Dresden - ZIH) 

Guido is a computer engineer at Technische Universität 

Dresden where he is responsible for the design, setup and 

operation of the HPC resources for the state of Saxony. He 

is also working on a Ph.D. thesis titled “Trace Based 

Performance Analysis for Hardware Accelerators”. 

h Session(s): S0067 – PIConGPU - Bringing large-scale 

Laser Plasma Simulations to GPU 

Supercomputing (Tuesday, 15:00, Room: A8) 

h S0257 - Trace Based Performance Analysis For 

GPU Accelerated Multi-Hybrid Applications 


Patrick Kano 

Co-Owner (Acunum Algorithms and Simulations, LLC) 

Patrick Kano’s background lies in algorithm and 

simulation development and physics based modeling. In 

addition to being a co-owner of Acunum, he is a 

consultant with PsiNapse Technology in the San 

Francisco Bay Area. He studied at the University of 

Arizona (2001-2005) and was a student at the Arizona 

Center for Mathematical Sciences. He received a 

Diplom-Physik from the Dresden University of 

Technology in 2000 and a BS in physics from the 

University of Nevada, Reno in 1998. From 1998 to 2000, 

he was a research assistant at the Max Planck Institute 

for the Physics of Complex Systems. 

h Session(s): S0415 - An Accelerated Weeks Method 

for Numerical Laplace Transform Inversion 


Steve Karmesin 

Senior Developer (Numerix) 

Dr. Steve Karmesin is a senior developer at Numerix LLC 

working with many aspects of the CrossAsset derivatives 

pricing and analytics software, from software 

architecture to numerical modeling to GPU 

development. His GPU work rests on his background in 

supercomputing at the Los Alamos Advanced Computing 

Laboratory where he worked on numerous massively 

parallel projects including leading the POOMA (Parallel 

Object Oriented Methods and Algorithms) team for 

applying advanced C++ techniques to large scale 

scientific codes. 

h Session(s): S0383 - Speedup Derivatives and 

Structured Products Pricing, Reduce TCO Using 

GPUs (Wednesday, 09:00, Room: L) 

Eric Kelmelis 

CEO (EM Photonics) 


h Session(s): S0304 – Large Scale Computational 

Fluid Dynamics Simulations on Hybrid 

Supercomputers (Wednesday, 10:30, Room: K) 

Christopher Kennelly 

Research Scientist (D. E. Shaw Research) 

Chris Kennelly received his B.S. in computer science 

from Caltech. During his time as an Amgen Scholar at 

Caltech, he developed algorithms for simulating DNA 

self-assembly. Since then, Chris has been employed at 

D.E. Shaw Research developing algorithms and software 

for Desmond. 

h Session(s): S0078 - Panoptes: A Binary 

Instrumentation Framework for CUDA 



Osman Kent 

Co-Founder & CEO (Numecent) 

Osman Kent is a serial technology and media 

entrepreneur. He is best known as the co-founder & CEO 

of 3Dlabs – at one time a $1B company on NASDAQ and 

one of the fathers of the GPU and of OpenGL on the PC. 

He has a First Class double-major in Computer Science 

and Electronics from University of Birmingham (UK), is a 

fellow of the Royal Society (RSA) and was recently given 

the Freedom of London for lifetime contributions to the 

IT industry. He is the inventor of numerous patents in 

computing and graphics. In his spare time, Osman 

incubates musicians through his record label 

Songphonic, recites live poetry while improvising on the 

piano and produces music for films. 

h Session(s): S2003 – Emerging Companies 

Summit: CEO on Stage Featuring GAIKAI, 

Immersive Media and Numecent 


Mahesh Khadtare 

PhD Student - Scientist ESP (I2IT, Pune University) 


h Session(s): S0103 - Accelerating Protein 

Sequences and Classification using GPU-HMMER 

Search (Wednesday, 15:30, Room: B) 

h S0107 - Acceleration of Long-Wave Rapid 

Radioactive Transfer Model on GPGPU 


Brucek Khailany 

Senior Research Scientist (NVIDIA) 

Brucek Khailany joined NVIDIA in December 2009 as a 

member of the Computer Architecture Research Group. 

Previously, Dr. Khailany was a Co-Founder and Principal 

Architect at Stream Processors, Inc. (SPI) where he led 

research and development activities related to highly parallel 

programmable processor architectures. He 

received his Ph.D. and Masters in Electrical Engineering 

from Stanford University and received B.S.E. degrees in 

Electrical Engineering and Computer Engineering from 

the University of Michigan. 

h Session(s): S0605 - cudaDMA: Emulating DMA 

engines on GPUs for Performance and 

Programmability (Wednesday, 17:00, Room: C) 

Ali Khajeh-Saeed 

PhD Candidate (University of Massachusetts, Amherst) 

Ali Khajeh-Saeed obtained his Ph.D. in Mechanical 

Engineering and in Computer Science at the University 

of Massachusetts Amherst in November 2011. Ali was 

awarded bachelor’s and master’s degrees in Aerospace 

Engineering from Sharif University of Technology, Iran in 

2008. His main research interests are Computational 

Fluid Dynamics (CFD), parallel computation and 

General-Purpose computation on Graphics Processing 

Units (GPGPU). He is currently working as a software 

engineer at CD-Adapco. 

h Session(s): S0217 - Efficient Implementation of 

CFD Algorithms on GPU Accelerated 


Oleh Khoma 

Head of HPC Unit (ELEKS) 

With a background in Applied Mathematics and more than 

12 years of experience in Software Engineering, Oleh is 

Head of the HPC Unit at ELEKS and leads the company’s 

efforts in the realm of High Performance Computing. 

During the last couple of years Oleh and his team have 

successfully completed several complex bespoke HPC 

solutions utilizing the power of NVIDIA GPGPU cards. 

Passionate about engineering, his largest affection is his 

team. When you have the right people, no problem is 

challenging enough. 

h Session(s): S6047 - Effective HPC Architecture - 

Design, Develop, Implement (Presented by ELEKS) 


Mark Kilgard 

Principal Software Engineer (NVIDIA) 

Mark J. Kilgard is a Principal System Software Engineer 

and an NVIDIA Distinguished Inventor based in Austin, 

Texas. Mark works on OpenGL, programmable shading 

languages, and GPU-rendering algorithms. Mark wrote 

numerous important OpenGL extension specifications 

and implemented the popular OpenGL Utility Toolkit 

(GLUT) for developing portable OpenGL examples and 

demos. Mark co-authored the book The Cg Tutorial: the 

definitive guide to programmable real-time graphics. 

Mark’s Karaoke rendition of Dolly Parton’s “9 to 5” can’t 

be beat. 

h Session(s): S0023 - NVIDIA OpenGL for 2012 


h S0024 - GPU-Accelerated Path Rendering 


Jihan Kim 

Postdoctoral Researcher (Berkeley Lab) 

Jihan Kim began his new postdoctoral researcher position 

at NERSC in August 2009, after earning his doctorate 

degree in electrical engineering at the University of Illinois 

Urbana-Champaign. For his dissertation, Kim wrote a 

quantum Monte Carlo code in C used to conduct 

simulations of quantum dots. He also worked on the 

device simulator Charon, during a summer internship at 

the Sandia National Laboratory. Currently, he is 

collaborating with Prof. Berend Smit from UC Berkeley on a 

carbon capture and separation project. 

h Session(s): S0122 - Computational Screening 

of Novel Carbon Capture Materials 


Grzegorz Kokosiński 

Software Engineer (IBM Poland) 

Grzegorz Kokosiński received his MSc in Computer Science 

from Warsaw University of Technology in Poland in 2010, 

with a thesis on a ray tracing implementation on CUDA. 

Since February 2011, he has been a Software Engineer at IBM 

Netezza R&D Department in Warsaw, Poland. He has 

been involved in an HPC appliance project as a CUDA team 

member, where he contributed to many proofs of 

concept, including advanced analytics, bioinformatics 

and geospatial algorithm implementations on CUDA. 

h Session(s): S0376 - Dynamic Programming on 

CUDA: Finding the Most Similar DNA Sequence 


David Korf 

Senior Marketing Manager (Hewlett-Packard) 

Mr. Korf has 16 years of engineering experience with the 

last 25 years in various senior marketing, product 

management and partner management positions. 

Accelerators, partner relationships and competitive 

analysis are currently some of his focus areas. 

h Session(s): S0633 - Learn about new Hewlett- 

Packard GPU Systems, Solutions, and Applications! 


Alexandr Kosenkov 

Software Engineer (University of Geneva) 

Highly qualified software engineer in the field of HPC 

and distributed applications under Linux with over five 

years of experience. Possesses strong understanding

of hardware architecture designs down to the physics 

level and high-level technologies/programming 

languages. Open-minded leader, delivering useable, 

well-designed products. 

h Session(s): S0039 - Data-Driven GPGPU Ideology 

Extension (Thursday, 10:00, Marriott Ballroom 3) 

Jiri Kraus 

(Fraunhofer Institute for Algorithms and Scientific 

Computing (FhG-SCAI)) 



Efficient AMG on Hybrid GPU Clusters 

(Wednesday, 17:00, Room: J) 

Adarsh Krishnamurthy 

Post-Doctoral Researcher (UC San Diego) 

Adarsh Krishnamurthy is a post-doctoral researcher in 

the department of bioengineering at UC San Diego. His 

research interests include computer-aided design (CAD), 

geometric modeling, parallel GPU algorithms, 

biomechanics, and heart modeling. He received his Ph.D. 

in mechanical engineering from UC Berkeley specializing 

on parallel GPU algorithms for CAD. He received his 

bachelors and masters in mechanical engineering from 

Indian Institute of Technology, Madras, India. 

h Session(s): S0410 - Computing Hausdorff 

Distances between Freeforms on the GPU 


Christoph Kubisch 


Prior to joining NVIDIA as a Developer Technology Engineer 

(Professional Solutions), Christoph was a Ph.D. student 

on hardware accelerated visualization techniques for 

medical datasets at the Otto-von-Guericke University of 

Magdeburg. During his studies he co-authored 

luxinia, a scriptable 3D game engine for games and 

research projects. Furthermore, he has worked for the 

games industry as a technical artist doing game art, 

shader and 3ds Max plugin development. 

h Session(s): S0105 - Hardware Acceleration 

for Vessel Visualization Tasks 


Wesley Kuo 

CEO (Ubitus) 

Wesley Kuo founded Ubitus Inc. in 2007. Ubitus 

specializes in providing cutting-edge cloud computing 

technology for multimedia applications and has won 

recognition from leading carriers and hand-held device 

manufacturers around the world including NTT, NTT 

Docomo and Samsung Electronics. Wesley is a 

successful entrepreneur who founded i@Solution Inc. in 

2000 which was later merged with Aplix Corporation in 

2004 where he was a board member and held several 

managerial positions in the field of international sales, 

marketing and OEM business. Wesley holds a Bachelor’s 

degree in Computer Science and Information 

Engineering from National Taiwan University and has 

dedicated his career to cloud computing, distributed 

computing and embedded solutions. 


CEO on Stage Featuring eyesight Mobile, 

Numira Biosciences, and Ubitus 

(Wednesday, 11:00, Marriott Ballroom 4) 

Jean Luc Lacome 

CEO (IMPETUS Afea SAS) 

Jean Luc Lacome has a background in Applied 

Mathematics and has been working for the past 10 years 

on the development of Smoothed Particle 

Hydrodynamics. Jean Luc has interests in fluid-structure 

interaction and defense applications. Jean-Luc is CEO of 

IMPETUS Afea France. 

h Session(s): S0143 - Fluid-Structure-Interaction 

Using SPH and GPGPU Technology 


Gianluca Lamanna 

Researcher (CERN) 

Gianluca is a physicist working at CERN, the European 

Laboratory for Particle Physics. At the 

moment, he is involved in building the trigger system and 

the data acquisition system for an experiment searching 

for very rare processes. He obtained his PhD in physics 

in 2006 from the University of Pisa with a thesis on data 

analysis in the search for possible violations of the 

particle physics Standard Model. After the PhD he spent 

a few years acquiring skills in electronics design and 

FPGA programming, which are very useful in this field for 

building detectors and acquisition systems. 

h Session(s): S0013 - GPUs for Fast Triggering in 

NA62 Experiment (Tuesday, 10:00, Room: J2) 

Bjoern Landmann 

Development Engineer (FluiDyna GmbH) 

Landmann has been a development engineer at FluiDyna GmbH, 

Munich, Germany since 2011. His research interests 

include: computational multiphysics; high-performance 

computing; and turbulence and aeroacoustics. 

h Session(s): S0293 - Culises – A Library for 

Accelerated CFD on Hybrid GPU-CPU Systems 


Ian Lane 

Assistant Research Professor (Carnegie Mellon 

University) 


h Session(s): S0223 – Rapid Training of Acoustic 

Models Using GPUs (Tuesday, 15:00, Room: N) 

Gerhard Lang 

Chief Engineering Officer (VizRT) 


h Session(s): S0356 - Optimizing Texture Transfers 


Tobias Lauer 

Senior Researcher (Jedox AG) 

Tobias Lauer received his PhD in computer science from the 

University of Freiburg (Germany) in 2007. From 2008 to 

2011, he did research on parallel algorithms for OLAP 

applications in a project sponsored by the German 

Research Foundation (DFG). He is now a Senior 

Researcher at Jedox AG, a software company specialized 

in Business Intelligence. 

h Session(s): S0219 - Efficient Top-Down Planning in 

Business Intelligence (Tuesday, 17:00, Room: C) 

Jeff Layton 

Enterprise Technologist for HPC (Dell) 

Dr. Jeffrey Layton is the Enterprise Technologist for HPC 

within Dell. Dr. Layton’s Ph.D. is from Purdue in 

Aeronautical and Astronautical Engineering. In his 25+ 

years of experience with Supercomputing technologies, 

Dr. Layton has served in roles as a Professor, Engineer 

and Scientist at Boeing, Lockheed Martin, NASA, and 

Clarkson University, and has led technical efforts for High 

Performance Computing companies such as Linux 

Networx, Panasas, and Dell. In these roles he has been a 

cluster builder, a cluster user and code writer, a cluster 

administrator, as well as a systems engineer, manager, 

and benchmark engineer for HPC vendors. He is also an 


active contributor to multiple open source projects and 

contributes to technical publications, including 

magazines, books, and websites. 

h Session(s): S0637 - Analyzing performance 

and power of applications with GPUs on 

Dell 12G platforms (Presented by Dell) 


Simon Layton 

PhD Candidate (Boston University) 

Simon Layton obtained his Masters in Mechanical 

Engineering from Boston University in 2011, and a 

Bachelor’s in mathematics and computer science from 

the University of Bristol in 2008. He is a PhD candidate 

under the supervision of Professor Barba at Boston 

University. During his postgraduate studies, he has 

worked on GPU-based projects, including the Fast Gauss 

transform and a CUDA based implementation of the 

immersed boundary method in fluid dynamics. Currently 

he is working on a GPU accelerated classical algebraic 

multigrid, work begun while interning at NVIDIA in 

Jonathan Cohen’s emerging applications group during 

the Summer of 2011. 

h Session(s): S0305 - Classical Algebraic Multigrid 

for CFD with CUDA (Thursday, 10:00, Room: A8) 

Scott Le Grand 

Principal Engineer (Amazon Web Services) 

Scott Le Grand is currently a principal engineer at 

Amazon Web Services. He developed the first molecular 

modeling system for home computers, Genesis, in 1987, 

Folderol, the distributed computing project targeted at 

the protein folding problem in 2000, and BattleSphere, a 

networkable 3D space shooter for the Atari Jaguar the 

same year. Surprisingly, all three of these efforts shared 

a common codebase. More recently, he ported the 

Folding@Home codebase to CUDA, achieving a 5x 

speedup over previous efforts; that port currently 

accounts for ~2.6 petaFLOPs of the project’s 

computational firepower. He is best known for his work 

porting the AMBER molecular dynamics package to 

CUDA, attaining record-breaking performance in the 

process. In a previous life, Scott picked up a B.S. in 

biology from Siena College and a Ph.D. in biochemistry 

from the Pennsylvania State University. In the current 

life, he is developing life science services on Amazon’s 

Elastic Compute Cloud (EC2). 

h Session(s): S0644 - Molecular Dynamics, GPUs, and 

EC2 (Presented by Amazon Web Services) 


Chris Leader 

Research Assistant (Stanford Exploration Project) 

Chris Leader is currently working towards a PhD in 

Geophysics with the Stanford Exploration Project, under 

the supervision of Biondo Biondi and Jon Claerbout. He 

received an MSc in Geophysics from Imperial College 

London whilst working on imaging 3D land seismic data 

and a BA in Physics from The University of Oxford whilst 

working on astrophysics and atmospheric phenomena. 

His interests include imaging blended seismic data, 

geophysical algorithm acceleration using advanced 

computing architectures and using micro-seismic data 

for imaging purposes. 

h Session(s): S0125 - Memory Efficient Reverse Time 

Migration in 3D (Wednesday, 10:00, Room: A7) 

Brent Leback 

Engineering Manager (Portland Group) 

Brent Leback is an Engineering Manager for PGI. He has 

worked in various positions over the last 26 years in HPC 

customer support, math library development, 

applications engineering and consulting at QTC, Axian, 

PGI and STMicroelectronics. 

h Session(s): S0622 - The Portland Group OpenACC 


David Lecomber 

CTO (Allinea Software) 

Dr. David Lecomber is a founder of Allinea and leads the 

research, development and support teams behind its 

software products. David’s history in High Performance 

Computing began with the Oxford BSP group in 1993, 

working on alternatives for parallel programming to the 

emerging complex MPI standard. He obtained a DPhil in 

Parallel Computing, on the simulation of shared memory and 

formal semantics for distributed-memory 

clusters, continuing to research parallel libraries and 

languages afterwards. After two years developing 

software for online services on clusters, he returned to 

HPC at Allinea, building the development tools needed 

for parallel and multithreaded software. 

h Session(s): S0099 - Debugging GPU Applications 

For Correctness and Performance 


HyoukJoong Lee 

PhD Student (Stanford University) 

HyoukJoong Lee is a PhD candidate in electrical 

engineering at Stanford University. His research 

interests include parallel computer architecture and 

general-purpose GPU computing with their 

programming models. He has an MS in electrical 

engineering from Stanford University. 

h Session(s): S0365 - Delite: A Framework for 

Implementing Heterogeneous Parallel DSLs 


David Lehavi 

Senior Research Scientist (HP) 

David Lehavi is a senior research scientist with HP Labs 

Israel. He received his Ph.D. in algebraic geometry from the 

Hebrew University of Jerusalem in 2002. He has done 

research in algebraic geometry, bioinformatics, 

computerized proofs, communication networks, and 

semantics. He is currently interested in machine 

learning, and in various models for execution of general 

purpose algorithms on GPUs. 

h Session(s): S0043 - 30x Faster Regular 

Expressions on a GPU (Tuesday, 17:30, Room: C) 

Eric Lequiniou 

Director, High Performance Computing (Altair) 

Eric Lequiniou is director of High Performance Computing at 

Altair. An expert in software optimization and parallelization 

on clusters and multi-core architectures, he developed 

the Hybrid MPP parallel version of RADIOSS finite 

element software. After earning an MSc degree in computer 

science, Eric started his career in 1994 at CNRS, in 

Laboratoire Informatique du Parallélisme. He joined 

Mecalog company in 1994 and worked for the French 

company until it was acquired by Altair in 2006. He also 

holds an Executive MBA from HEC French business 

school obtained in 2007. 

h Session(s): S0225 - Speedup Altair RADIOSS 

Solvers Using NVIDIA GPU 


Wei Li 

Research Scientist (Siemens Corporation) 

Wei Li is a research scientist at Siemens Corporation, 

Corporate Research & Technology, where he is responsible 

for GPU-related innovations for Siemens’ 

products. He is the creator of the volume renderer that

is widely deployed in Syngo.via, the medical imaging 

platform of Siemens Healthcare. Wei Li received a PhD 

in computer science from Stony Brook University. His 

research interests include visualization, medical 

imaging, and GPU acceleration for graphics and non-graphics 

applications. He has published 20+ papers in 

prestigious journals and conferences, and has produced 

10+ approved and pending patents. 

h Session(s): S0342 - Volumetric Processing and 

Visualization on Heterogeneous Architecture 


Cheng Liao 

Development Manager (MSC Software) 

Cheng Liao received a PhD degree from Georgia Tech, and 

is a development manager with MSC Software. His 

professional interests include high performance matrix 

computing, I/O, and other FEA related technologies. Prior 

to MSC, Cheng spent many years with SGI and Convex. 

h Session(s): S0064 - MD.Nastran Sparse Direct 

Solvers for Tesla GPUs 


Jerome Limido 

Research & Development (IMPETUS Afea SAS) 

Jérôme Limido has experience in research and 

advanced engineering for aerospace applications. 

His work has focused mainly on processes 

involving large deformations, both experimentally and 

numerically. Jérôme has special interests in advanced 

numerical methods and fatigue of materials. He is 

responsible for R&D at IMPETUS Afea France and teaches 

Advanced Computational Mechanics and Numerical 

Methods at ISAE. 

h Session(s): S0143 – Fluid-Structure-Interaction 

Using SPH and GPGPU Technology 


Cheng-Hung Lin 

Associate Professor (National Taiwan Normal University) 

Cheng-Hung Lin received the Ph.D. degree in computer 

science from the National Tsing Hua University in 2008. He 

is currently an associate professor with National Taiwan 

Normal University. His current research interests include 

multicore programming and parallel algorithm design. 

h Session(s): S0054 - PFAC Library: GPU-Based 

String Matching Algorithm 

(Thursday, 14:00, Room: C) 

Heshan Lin 

Research Scientist (Virginia Tech) 

Heshan Lin is a Research Scientist in the Department of 

Computer Science at Virginia Tech. His current research 

focuses on the intersection of High Performance 

Computing and Bioinformatics. Specifically, his research 

aims at massively accelerating biological discoveries 

with emergent computational techniques including 

graphics processing units (GPU) and cloud computing. 

He is the author of the latest version of mpiBLAST, a 

popular parallel sequence-search software that has 

received thousands of downloads worldwide. He received 

a Ph.D. degree in Computer Science from North Carolina 

State University in 2009. 

h Session(s): S0156 - Towards Computing the Cure 

for Cancer (Tuesday, 17:00, Hall 1) 

James Lin 

Technical Director, High Performance Computing Center 

(Shanghai Jiao Tong University) 

James Lin is technical director of the High Performance 

Computing Center at Shanghai Jiao Tong University and 

co-founder of the HMPP Competence Center for AP & Japan. 

His major research area is parallel programming, 

especially applying CUDA to CFD. He received an NVIDIA 

Academic Partnership Program award in 2010 and 

serves on the reviewer committee for the CUDA Campus Contest. 

h Session(s): S0251 - RANS CFD Solver on Fermi 


Yuan Lin 

Senior Engineer (NVIDIA) 

Yuan Lin is a senior engineer and manages the compute 

compiler code generation team at NVIDIA. His team’s 

responsibilities include PTX code generation, tools and 

platform support. Yuan has been at NVIDIA for 3 years. 

He was at Sun Microsystems and Motorola before that. 

He holds a doctorate in computer science from 

University of Illinois at Urbana-Champaign. 

h Session(s): S0235 – Compiling CUDA and Other 

Languages for GPUs (Wednesday, 10:00, Room: A5) 

Olav Lindtjorn 

(Schlumberger) 




Hui Liu 

(University of Calgary) 

Hui Liu is working for the reservoir simulation group at 

the University of Calgary. He is leading the development of 

GPU-based parallel iterative solvers. He has successfully 

designed/implemented a sparse BLAS library, four Krylov 

subspace solvers, two algebraic multigrid solvers, parallel 

triangular solvers and several preconditioners. He 

received his PhD degree in Computational Mathematics 

and Parallel Computing from the Chinese Academy of 

Sciences in 2010, and his BSc. degree in Computational 

Mathematics from the University of Science and 

Technology of China (USTC) in 2005. 


Accelerating Iterative Linear Solvers on GPUs 


h S0708 - Los Alamos AHPC Symposium, 




Li-Ta Lo 




PISTON: Portability and Performance for Data- 



Alex Loddoch 

Sr. Research Scientist (Chevron) 

Alex Loddoch is a Senior Research Scientist in Chevron’s 

Technical Computing group. His work includes the 

evaluation of emerging High Performance Computing 

technologies and their application to algorithms in 

Seismic Imaging and Processing and Reservoir 

Simulation. Before joining Chevron he was a Research 

Assistant at the University of Muenster, Germany where 

he worked on topics such as Computational Fluid 

Dynamics, Visualization and Data Compression. Alex 

received a M.Sc. in Physics and a Ph.D. in Geophysics 

from University of Muenster, studying the internal 

dynamics of terrestrial planets. 

h Session(s): S0628 - Panel Session: Learn from 

Experts in the Oil & Gas Industry 



Rainald Lohner 

Professor (George Mason University) 


h Session(s): S0218 - ASI Parallel Fortran: A 

General-Purpose Fortran to GPU Translator 


K. Patrick Lorton 

Principal Developer (Schrödinger) 

Patrick Lorton is a Principal Developer and the Technical 

Lead for the Core Hopping and Combiglide products at 

Schrödinger. He received bachelor’s degrees in Computer 

Science, Mathematics and Chemistry from Indiana 

University, where he published in the fields of Parallel 

Computing and Computational Chemistry. He has 

worked with Schrödinger since graduation. 

h Session(s): S0121 – Software Architecture to 

Facilitate CUDA Development 


Edward Lowe 

Research Assistant Professor (Vanderbilt University) 

Dr. Lowe is a research assistant professor at Vanderbilt 

University developing novel computational methods for 

drug discovery. His interests include GPU acceleration, 

algorithmic techniques in massively parallel 

programming, machine learning, computational 

chemistry, and enzyme mechanisms. He currently leads 

a cheminformatics core in the laboratory of Professor 

Jens Meiler as a member of the Vanderbilt Center for 

Structural Biology and Institute of Chemical Biology. 

h Session(s): S0346 - GPGPU Accelerated Protein 

Similarity Measures Identifying Biological 

Relevant Structure (Wednesday, 17:30, Room: N) 

h S0354 - Bcl::ChemInfo Suite Enables Machine 

Learning-Based Drug Discovery Using GPUs 


Hatem Ltaief 

Computational Scientist (KAUST 

Supercomputing Laboratory) 

Dr. Hatem Ltaief received the MSc degree from ISITIL, a 

school of engineering at the University of Claude 

Bernard Lyon I, France, the MSc in applied mathematics 

at the University of Houston and the PhD degree in 

computer science from the University of Houston. He 

was a Research Scientist II in the Innovative Computing 

Laboratory in the Department of Electrical Engineering 

and Computer Science at the University of Tennessee, 

Knoxville. He is currently a Computational Scientist at 

KAUST Supercomputing Laboratory, Saudi Arabia. 

h Session(s): S0042 - Solving Challenging Numerical 

Linear Algebra Algorithms using Multiple GPU 

Accelerators (Wednesday, 15:00, Room: A3) 

Peter Lu 

Post-Doctoral Research Fellow (Harvard University) 

Peter J. Lu received his AB summa cum laude in physics 

(2000) from Princeton University, and AM (2002) and PhD 

(2008) in physics from Harvard University. He is presently 

a post-doctoral research fellow in the Department of 

Physics and SEAS at Harvard University; his main focus 

is on the physics of attractive colloids and the integration 

of high-performance imaging and analysis techniques. 

He conducts experiments aboard the International Space 

Station, examining phase separation of colloid mixtures 

in the absence of gravity. He has published his 

discoveries of modern quasicrystal geometry in medieval 

Islamic architectural tilings; the first precision 

compound machines, from ancient China; the first use of 

diamond, in prehistoric China; and the first 

quasicrystalline mineral found in nature. 

h Session(s): S0521 - Desktop Supercomputing 

in the Soft-Matter Physics Laboratory 


David Luebke 

Senior Director of Graphics Research (NVIDIA) 

David Luebke helped found NVIDIA Research in 2006 

after eight years on the faculty of the University of 

Virginia. Luebke received his Ph.D. under Fred Brooks at 

the University of North Carolina in 1998. His principal 

research interests are GPU computing and real-time 

computer graphics. Luebke’s honors include the NVIDIA 

Distinguished Inventor award, the NSF CAREER and DOE 

Early Career PI awards, and the ACM Symposium on 

Interactive 3D Graphics “Test of Time Award”. Dr. Luebke 

has co-authored a book, a SIGGRAPH Electronic 

Theater piece, a major museum exhibit visited by over 

110,000 people, and dozens of papers, articles, chapters, 

and patents. 

h Session(s): S0609 - Computational Graphics: An 

Overview of Graphics Research at NVIDIA 


h S0016 - NVIDIA Grad Fellowship Fast Forward 


Justin Luitjens 

Devtech Engineer (NVIDIA) 

Justin Luitjens is a Devtech Engineer at NVIDIA and 

works with applications engineers to optimize and port 

their applications to CUDA. He joined NVIDIA after 

receiving his Ph.D. in Scientific Computing from the 

University of Utah in 2011. 

h Session(s): S0624 - Introduction to CUDA C 


h S0302 - Accelerating miniFE: 

A Finite Element Mini-application 


Dimitar Lukarski 

Research Associate (Karlsruhe Institute of 

Technology (KIT)) 

Dimitar Lukarski holds a bachelor’s degree from Technical 

University of Sofia, Bulgaria and a master’s degree from 

Technical University of Karlsruhe, Germany. Currently, he 

is working at the Engineering Mathematics and Computing 

Lab (EMCL) at Karlsruhe Institute of Technology (KIT) on 

interdisciplinary topics in the area of parallel numerical 

methods and emerging hardware such as GPUs and 

multi-core CPUs. His focus is on robust and fine-grained 

parallel preconditioners with implementations on 

stream-based platforms such as CUDA. 

h Session(s): S0289 - Fine-Grained Parallel 

Preconditioners for Fast GPU-based Solvers 






h S0291 - LAtoolbox: A Multi-platform Sparse 

Linear Algebra Toolbox 


Chris Lupo 

Assistant Professor (California Polytechnic State University) 


Chris Lupo is an Assistant Professor of Computer 

Science and Computer Engineering at California 

Polytechnic State University in San Luis Obispo. His 

teaching and research interests include parallel 

computing, computer architecture, embedded system 

design and code generation. Chris earned his PhD in 

Computer Engineering from UC Davis in 2008.

h Session(s): S0311 - Teaching Applied Parallel 

Computing with GPUs (Wednesday, 17:30, Room: C) 

Steve Lyness 

VP of HPC Solutions Engineering (Appro) 

In November of 2007, Steve Lyness joined Appro as Vice 

President of HPC Solutions Engineering. Steve is 

responsible for the success of Appro’s closed-loop 

solution management, up-front consulting and 

pre-integration of Appro’s HPC solutions across a wide 

range of HPC applications. Steve also acts as a key 

member of the management team for project 

management, planning and coordinating of worldwide 

pre-sales and post-sales customer solution programs. 

Before joining Appro, Steve was Director of Sales 

Engineering for NetEffects, a provider of 10 GigE adapter 

technologies for HPC and Enterprise customers. Steve 

graduated from Drexel University with a Bachelor’s 

degree in Electrical Engineering with an emphasis on 

radar and signal processing technologies. 

h Session(s): S0618 - Best Practices of a 800TFlop 

Hybrid Supercomputer Implementation 


Henrik Høj Madsen 

Solution Architect (LEGO) 

Henrik holds a Master’s degree in 

Computer Science and Engineering from the Technical 

University of Denmark, where he designed and 

implemented a real-time ray tracing architecture on FPGA 

hardware. Henrik was CEO and lead game developer at 

DogOnFire Interactive, a small game development 

company dedicated to producing core MMO technologies 

for indie market developers. He is currently a 

Solution Architect at LEGO, where he is the architect 

of the 3D rendering backend technologies for “LEGO 

Universe”, LEGO’s Massively Multiplayer Online Game for 

LEGO fans worldwide. 

h Session(s): S0261 - Scalable GPU Computing 

Service Architecture (Tuesday, 16:00, Room: A5) 

Alireza Mahani 

Quantitative Modeler (Sentrana) 

Dr. Alireza S Mahani works as a computational scientist 

at Sentrana Inc., a quantitative marketing company in 

Washington, DC. His recent work has been focused on 

building high-performance software (using CUDA/ 

OpenMP/MPI) for Markov Chain Monte Carlo (MCMC) 

sampling of high-dimensional conditional posterior 

distributions arising in Gibbs sampling of Hierarchical 

Bayesian models. Prior to joining Sentrana, Dr. Mahani 

worked as a management consultant at McKinsey & Co. 

He holds a Ph.D. in Physics from Washington University 

in St. Louis, where his research on statistical modeling 

of neuronal motion processing in the avian brain 

resulted in six articles in peer-reviewed journals. 

h Session(s): S0035 - GPU Parallelization of Gibbs 

Sampling: Abstractions, Results, and Lessons 

Learned (Wednesday, 15:00, Marriott Ballroom 3) 

Filipe Maia 

Fellow (Lawrence Berkeley National Laboratory) 

Filipe Maia graduated in biochemistry from Oporto 

University, Portugal, in 2004 and completed his PhD in 

Physics at Uppsala University, Sweden. He is currently a 

Petascale Postdoctoral Fellow at NERSC, Lawrence 

Berkeley National Laboratory. His main research 

interests, besides GPU computing, are diffraction 

imaging, image reconstruction and compressive sensing. 

h Session(s): S0131 - Multi-GPU Real-Time 

Ptychographic X-ray Image Reconstruction 


Jason Mak 

Graduate Student (UC Davis) 

Jason is a computer science Ph.D. student at U.C. Davis. 

He received his B.S. in computer science from California 

Polytechnic State University. His research interests 

include GPU computing, parallel algorithms and 

architectures, and scientific computing. 

h Session(s): S0361 – Lossless Data Compression on 

GPUs (Wednesday, 17:00, Room: B) 

Allen Malony 

Professor (University of Oregon) 

Allen D. Malony is a Professor in the Department of 

Computer and Information Science at the University of 

Oregon where he directs the TAU parallel performance 

system project. His research interests are in parallel 

computing, performance tools, and computational 

science. Malony was awarded the NSF National Young 

Investigator award, was a Fulbright Research Scholar to 

The Netherlands and Austria, and received the 

prestigious Alexander von Humboldt Research Award for 

Senior U.S. Scientists from the Alexander von Humboldt 

Foundation. He also received a Professor Partnership 

award from NVIDIA Corporation. Malony is CEO of 

ParaTools, Inc., founded in 2005. 

h Session(s): S0298 - Performance Tools for 

GPU-Powered Scalable Heterogeneous Systems 


Jonathan Marbach 

Director, Software Architecture and Engineering 

(TerraSpark Geosciences) 

Jonathan Marbach is Director of Software Architecture 

and Engineering at TerraSpark Geosciences, makers of 

the 3D Seismic Interpretation package Insight Earth. He 

received his PhD from the University of Colorado and 

specializes in 3D graphics, virtual reality, and 

visualization. He presented at GTC 2010 on GPU 

accelerated stereographic rendering. 

h Session(s): S0336 - GPU Acceleration for Seismic 

Interpretation Algorithms (Tuesday, 16:00, Room: A7) 

Nikolay Markovskiy 

HPC DevTech Engineer (NVIDIA) 

Nikolay Markovskiy is a developer technology engineer 

at NVIDIA and specializes in high performance 

computing using CUDA. He has a background in 

computational condensed matter physics and earned his 

PhD in multi-level Monte Carlo algorithms at the University 

of Southern California. 

h Session(s): S0247 – 3D ADI Method for Fluid 

Simulation on Multiple GPUs 


Samuel Maroy 

Software Engineer (Barco) 

Samuel Maroy received the M.Sc. degree in computer 

science from the Universiteit Gent in 2008. He joined 

Barco, in August 2008, as a software engineer working on 

the development of a networked visualization system. 

Since 2011, Samuel has focused on the use of GPUs to 

power the video streaming and video processing in 

Barco’s next generation visualization platform. Outside 

of work, Samuel is interested in graphics rendering and 

hopes someday to build his own game. Furthermore, he 

enjoys cycling, soccer, racing and spending time with 

friends. 

h Session(s): S0252 - Building Real-Time 




Naoya Maruyama 

Assistant Professor (Tokyo Institute of Technology) 

Naoya Maruyama received his Ph.D. degree in Computer 

Science from Tokyo Institute of Technology in 2008, and 

is an Assistant Professor at Global Scientific Information 

and Computing Center, Tokyo Institute of Technology. He 

has been working on research topics related to 

large-scale high performance computing, including fault 

tolerance, low power computing, and programming 

models for heterogeneous systems. 

h Session(s): S0367 - Physis: An Implicitly Parallel 

Framework for Stencil Computations 


Issei Masaie 

Chief Engineer (Prometech Software, Inc.) 

Issei Masaie is a chief engineer of Prometech Software 

and works on developing physics simulation and 

acceleration technology on GPU / Cell / multicore 

hardware for particle-based CAE software. In 2005 he 

received a master’s degree from the Department of 

Quantum Engineering and System Science, at the 

Graduate School of Engineering, The University of Tokyo. 

h Session(s): S0066 – Particleworks: Particle-based 

CAE Software Fully Ported on Multi-GPU 


Chris Mason 

Product Manager (Acceleware) 

Chris is the Product Manager for Acceleware’s GPU 

accelerated electromagnetic product line. He is 

responsible for the successful development and launch of 

Acceleware products used by companies world-wide. 

Chris has seven years of experience in developing 

commercial applications for the GPU and has delivered 

over 20 CUDA courses to students in a diverse range of 

industries. His previous experience also includes 

parallelization of algorithms on digital signal processors 

(DSPs) for cellular phones and base stations. Chris has a 

Masters in Electrical Engineering from Stanford University. 

h Session(s): S0614 - Part 1: Introduction to GPU 

Programming (Monday, 09:00, Room: C) 

h S0615 - Part 2: Introduction to the GPU 

Architecture and Memory Model 

(Monday, 10:30, Room: C) 

h S0616 - Part 3: Debugging GPU Programs 

(Monday, 13:00, Room: C) 

h S0617 - Part 4: Introduction to Optimizations and 

Profiling (Monday, 14:30, Room: C) 

Enrico Mastrostefano 

PhD Student (Sapienza Università di Roma) 

Enrico is a PhD student at Sapienza University of Rome. 

h Session(s): S0241 - Large Graphs on Multi-GPUs 


Satoshi Matsuoka 

Titech 




David McAllister 

OptiX Manager (NVIDIA, OptiX group) 

Bio unavailable at press time. 

h Session(s): S0366 - OptiX Out-of-Core and CPU 

Rendering (Tuesday, 15:30, Room: J1) 

Chris McClanahan 

Software Engineer (AccelerEyes) 

Chris McClanahan is a software engineer at 

AccelerEyes. He has a Master’s Degree in Computer 

Science from the Georgia Institute of Technology, with a 

focus on computer vision and computational 

photography. 

h Session(s): S0287 - Jacket for Multidimensional 

Scaling in Genomics (Tuesday, 17:30, Room: K) 

h S0325 - ArrayFire Graphics: A Tutorial 


Iain McCready 

CEO (Cortexica) 

Iain has over 25 years of experience within the world’s 

Telecommunications and IT Industries. Until recently he 

was the CEO of NeoMedia Inc., a public US based 

software business that is the world leader in state-of-the-art 

barcode creation, capture, delivery and reading 

technology. Prior to that Iain was CEO of Mobiqa Limited, 

an Edinburgh based business where he led the company 

from a start-up to the world leader in mobile ticketing, 

mobile boarding pass and couponing solutions based on 

the creation, optimisation, delivery and redemption of 

barcodes to mobile phones. He was also Chairman of 

Scolocate Limited, a co-location and managed services 

business specialising in IT architecture, design and 

planning, project management and implementation 

services. Prior to that he was Chief Operating Officer of 

KSCL, Scotland’s largest software house and a leading 

supplier of customer care and billing applications to the 

world’s mobile phone operators. 

h Sessions: S2000 – Emerging Companies Summit 

Opening with Jeff Herbst (VP of Business 

Development, NVIDIA), Followed by CEO on 

Stage Featuring, Rocketick and Cortexica 


Myles M. McGovern 

President/CEO (Immersive Media) 

Myles McGovern has served as the President and CEO of 

Immersive Media since 2004. Under Myles’ direction, IMC 

has pioneered and become the world’s leading provider of 

360° interactive video experiences. Prior to 

joining IMC Myles was the Founder, President and CEO 

of Centrinity/MC2 where he spearheaded the company’s 

rapid growth in 55 countries and was twice nominated 

for Canadian Entrepreneur of the Year. After his post-secondary 

education at Simon Fraser University, Myles 

gained valuable technology experience during his 10 

years at Xerox culminating in product management for 

their digital product integration strategy. 

h Session(s): SS2004 – Emerging Companies 


Immersive Media, and Numecent 


Morgan McGuire 

Visiting Professor (NVIDIA and Williams College) 

Morgan McGuire is a visiting professor in the NVIDIA 

Research Graphics Group, where he works on real-time 

special effects and future GPUs, and an assistant 

professor of Computer Science at Williams College 

where he teaches computer graphics and game design. 

He is also the editor in chief of the Journal of Graphics 

Tools. Dr. McGuire contributed to many commercial 

products including the E-Ink display for the Amazon 

Kindle, the PeakStream high-performance computing 

infrastructure acquired by Google, the Titan Quest role-playing 

game, and the Marvel Ultimate Alliance 2 video 

game for Xbox 360. 

h Session(s): S0409 – Stochastic Rasterization 

(Tuesday, 15:30, Room: B)

Simon McIntosh-Smith 

(The University of Bristol) 

Simon McIntosh-Smith has spent most of his life 

designing and programming multi-core and many-core 

systems. He began his career as a microprocessor 

architect at Inmos and STMicroelectronics, before 

co-designing the world’s first fully programmable GPU 

at Pixelfusion in 2000. In 2002 he co-founded ClearSpeed 

where, as Director of Architecture and Applications, he 

led the development of the first modern many-core HPC 

accelerators. In 2003 he designed the first accelerated 

BLAS/LAPACK and FFT libraries, leading to the first 

modern accelerated Top500 system, TSUBAME 1.0 at 

Tokyo Tech in 2006. He now leads the Microelectronics 

Research Group at the University of Bristol, UK. 


Adaptive Heterogeneous Computing with 

OpenCL: A Molecular Docking Case Study 






Sara McMains 

Professor (UC Berkeley) 

Dr. McMains is an Associate Professor of Mechanical 

Engineering at Berkeley. Her research interests include 

geometric solid modeling, CAD/CAM, GPU algorithms, 

geometric Design for Manufacturing feedback, computer 

aided process planning, layered manufacturing, 

computer graphics, visualization, and virtual prototyping. 

Applications of her research include haptic design 

environments, accessibility analysis for manufacturing, 

design for cleanability, layered manufacturing, and 

machining. She received her AB from Harvard and her 

MS and PhD from Berkeley, all in Computer Science. She 

is the recipient of Best Paper Awards from Usenix, ASME 

and the ACM Solid and Physical Modeling Symposium, 

and the NSF CAREER Award. 

h Session(s): S0410 - Computing Hausdorff 

Distances between Freeforms on the GPU 


h S0411 - Artifact-Free Cloud-Based CAD Rendering 


Gaetano Mendola 

Principal Engineer (MBI srl) 

Gaetano Mendola is a Principal Software Engineer at MBI srl, which develops 

exclusive mission-critical solutions. He graduated in 

computer engineering from the University of Pisa. His interests are 

related to low-latency systems. Since 2008, using the 

Software Defined Radio approach, he has led the 

building of real demodulators implemented completely in software, 

offloading to the GPU what others normally do with FPGAs. 

h Session(s): S0065 - Satellite HUB Communication 

System GPU Based (Thursday, 16:30, Room: M) 

Duane Merrill 


Duane Merrill joined NVIDIA Research after completing 

his Ph.D. in Computer Science at the University of 

Virginia. His research interests include algorithmic 

primitives, design idioms, and programming models with 

a particular focus on dynamic, irregular, and cooperative 

parallelism. He contributes to the B40C and Thrust open 

source libraries of GPU computing primitives. 

h Session(s): S0600 - Scalable GPU Graph Traversal 


Peter Messmer 

Compute Devtech Engineer (NVIDIA) 

Peter Messmer has been developing and optimizing 

parallel scientific software for over 15 years. After 

completing his PhD in solar plasma-physics at ETH Zurich 

in 2001, Peter joined Tech-X Corp in Boulder, CO, where he 

was leading a group of scientists solving space-related 

simulation and data analysis problems. As part of a NASA 

project, he became an early adopter of GPU computing 

and the lead developer of GPULib, a library for accelerating 

data analysis tasks with GPUs. Since joining NVIDIA in 

2011, he has been working with clients to optimize their 

massively parallel GPU applications. 

h Session(s): S0629 - CUDA Accelerated Compute 

Libraries (Monday, 13:00, Room: A5) 

h S0256 - A Stencil Library for the New Dynamic 

Core of COSMO (Thursday, 09:00, Room: N) 

Renato Miceli 

Computational Scientist (ICHEC) 

Renato Miceli is a Computational Scientist and GPU 

Developer at the Irish Centre for High-End Computing. 

He has a BSc in Computer Science (hons) from 

Universidade Federal de Campina Grande, Brazil, where 

he focused on Software Engineering and Distributed 

Systems, especially Grid and Cloud Computing for HPC. 

At ICHEC, Renato works primarily on analyzing, 

developing, optimizing and porting applications to 

many-core architectures; his past projects involved 

cryptography, financial simulation, geophysical analysis 

and molecular dynamics. Renato also works on the 

European FP7 projects PRACE, in enabling scientific 

computing on GPUs; and AutoTune, for automatic tuning 

of GPU codes. 

h Session(s): S0034 – Real-Time Risk Simulation: 

The GPU Revolution In Profit Margin Analysis 


Paulius Micikevicius 


Paulius Micikevicius is a Developer Technology Engineer 

at NVIDIA with a focus on parallel computation and 

performance analysis. He has been involved in the 

analysis and optimization of both industrial and scientific 

codes over several generations of GPUs starting with 

G80, the first CUDA-capable architecture. Prior to joining 

NVIDIA, Paulius was an assistant professor of Computer 

Science at Armstrong Atlantic State University as well as 

a research associate at the Media Convergence 

Laboratory at UCF. Paulius holds a PhD in Computer 

Science from the University of Central Florida and a B.S. 

in Computer Science from Midwestern State University. 

h Session(s): S0515 - Multi-GPU Programming 

(Tuesday, 14:00, Room: Hall 1) 

h S0628 - Panel Session: Learn from Experts in the 

Oil & Gas Industry (Tuesday, 16:30, Room: A7) 

h S0514 - GPU Performance Analysis and 

Optimization (Wednesday, 15:30, Hall 1) 

Phillip Miller 

Director, Workstation Software Product 

Management (NVIDIA) 

Bio unavailable at press time. 

h Session(s): S0603 - GPU Ray Tracing 


h S0604 - NVIDIA Advanced Rendering Solutions 


Aamir Mohammad 

Associate Director (Aon Benfield Securities) 

Aamir leads the development of High Productivity 

Computing solutions for Variable Annuity derivatives 



models, at Aon Benfield Securities. Prior to joining Aon, 

Aamir worked in US Variable Annuity Hedging at a global 

insurance company, and began his career in quantitative 

finance at a hedge fund in Toronto. Aamir has over five 

years of experience in computational finance, trading 

and software development. Aamir holds an Honors B.Sc. 

in Applied Mathematics & Statistics from the University 

of Toronto. 

h Session(s): S0418 - High Productivity 

Computational Finance on GPUs 


Jamal Mohd-Yusof 


Jamal Mohd-Yusof is a member of the Collaborative 

Programming team in the Applied Computer Science group 

at LANL. He was part of the team that worked on Open 

Science programming for Roadrunner, where he was 

responsible for refactoring and porting of the CFDNS-RR 

fluid dynamics code, including development of a novel 

low-communication tridiagonal solver. He has been 

working with advanced architectures for several years, 

and teaches OpenCL courses at LANL. He is currently 

developing and profiling physics algorithms for a variety 

of advanced architectures. Prior to coming to LANL he 

worked at the Center for Turbulence Research at 

Stanford University. He received his MS and PhD from 

Cornell University in fluid mechanics, where he 

developed novel computational techniques for multiphase 

flow simulation. 

h Session(s): S0708 - Accelerated HPC Symposium: 

Applications - Methods and Programming Models, 

Part 1 (Thursday, 09:00, Room: J3) 

Alexander Monakov 

Researcher (ISP RAS) 

Alexander Monakov is a PhD candidate at Moscow State 

University and a researcher at Institute for System 

Programming, specializing in program optimization and 

compiler technology. He has provided improvements to 

the GCC compiler, including contributions to Graphite- 

OpenCL, an automatic translation pass that generates 

OpenCL code from parallel loops. 

h Session(s): S0115 - Specialized Sparse Matrix 

Formats and SpMV Kernel Tuning for GPUs 


Brooks Moses 

Ph.D., Sourcerer (Mentor Graphics Corporation) 

Dr. Moses leads the High Performance Computing 

Solutions team in Mentor Graphics’ Embedded Software 

Division. He also participates directly in the development 

of the Sourcery VSIPL++ library and other high-performance 

library products. Dr. Moses worked 

extensively on the Cell/B.E and NVIDIA CUDA ports of 

Sourcery VSIPL++. Dr. Moses holds a Ph.D. in 

Mechanical Engineering from Stanford University where 

he conducted advanced research into algorithms for 

computational fluid dynamics simulation. 

h Session(s): S0620 - VSIPL++: A High-Level 

Programming Model for Productivity and 

Performance (Tuesday, 15:00, Room: M) 

Daniel Moth 

Principal Program Manager (Microsoft) 

As a Principal Program Manager in the Developer 

Division, Daniel Moth is responsible for parallel runtimes 

and tools that ship with Visual Studio. He has been with 

Microsoft for over five years; before that he worked in 

the UK as a consultant for Avanade, and before that as a 

developer for a Honeywell company for seven years. In 

his free time you can find him on FICS playing chess or 

near a beach SCUBA diving with his wife. 

h Session(s): S0242 - Harnessing GPU Compute 

with C++ AMP (Part 1 of 2) 


h S0244 - Harnessing GPU Compute with C++ AMP 

(Part 2 of 2) (Thursday, 10:00, Room: C) 

Supratik Moulik 

Cardiovascular Imaging Fellow (University of Pennsylvania) 


h Session(s): S0303 - GPU Acceleration for 

Threshold Based Region Growth Algorithms 


Sathya Narayana K. 

Principal Consultant (Infosys Ltd.) 

Sathya Narayana K. is a Principal Consultant with the 

Advanced Engineering Group (AEG) of Infosys. He has 

more than twenty years of experience in the areas of 

high performance scientific computing (HPC), Computer 

Graphics (CG), Mathematical Modeling & Simulation and 

Engineering Software Development. His research 

interests include Mathematical Modeling, Simulation, 

Optimization and Operations Research in the Aerospace, 

Gaming, and Oil and Gas industries. He holds a Master of Science 

degree in structural engineering (1993) and information 

technology. He has published 5 papers in national and 

international conferences. 

h Session(s): S0214 - GPU Based Stacking Sequence 

Optimization For Composite Skins Using GA 


Ramesh Narayanaswamy 

Principal Engineer (Synopsys Inc.) 

Ramesh works on Optimizing Compilers and Special 

Purpose Supercomputers for Hardware Description 

Language execution. Notable architectures from past 

projects include a 96 core Heterogeneous Computer with 

MIPS Core + ASIC Coprocessor, a 1024 core HDL 

Processor, and a Multicore CPU + Array of FPGAs. These 

architectures provide orders of magnitude performance 

improvement. Ramesh has been granted seven patents. 

h Session(s): S0317 - Compiling a Parallel 

Domain Specific Language to GPUs 


Rajib Nath 

Student (University of California San Diego) 


h Session(s): S0248 - Excitements, Challenges, and 

Rewards In Optimizing GPGPU Kernels 


Vincent Natoli 

Founder & CEO (Stone Ridge Technology) 

Dr. Vincent Natoli is the founder and CEO of Stone Ridge 

Technology. Stone Ridge is an NVIDIA partner that 

develops, optimizes and ports complex scientific and 

engineering codes to GPU and multi-core platforms. The 

company focusses on work in the energy industry and has 

experience with seismic, reservoir simulation and other 

industry applications. Dr. Natoli has a BS and MS from MIT, 

a PhD in Physics from the University of Illinois Urbana- 

Champaign and an MS in technology management from 

the University of Pennsylvania and Wharton School. He 

worked for 10 years with ExxonMobil Corporate research 

before starting Stone Ridge Technology. 

h Session(s): S0140 – Accelerating Reservoir 

Simulation and Algebraic Multigrid with GPUs 

(Wednesday, 14:00, Room: A7)

Maxim Naumov 


Maxim Naumov’s expertise is in the area of parallel 

numerical linear algebra. In particular, he has worked 

on parallel iterative linear systems and eigenvalue 

solvers. He received his Ph.D. in Computer Science (with 

specialization in Computational Science and 

Engineering) in 2009 and his B.Sc. in Computer Science 

and Mathematics in 2003, all from Purdue University 

– West Lafayette. He currently works on the NVIDIA CUDA 

Platform team developing parallel numerical algorithms 

for Graphics Processing Units (GPUs). He has previously 

worked in the Intel Corporation Microprocessor 

Technology Lab and Computational Software Lab, and 

received a 2008-09 Intel Foundation Ph.D. Fellowship. 

h Session(s): S0149 - On the Parallel Solution of 

Sparse Triangular Linear Systems 


Dan Negrut 

Associate Professor (University of Wisconsin-Madison) 

Dan Negrut received his Mechanical Engineering Ph.D. 

in 1998 from the University of Iowa after which he spent 

six years in the CAE industry. In 2004 he served as 

Adjunct Assistant Professor in the Department of 

Mathematics at the University of Michigan. He spent 

2005 as a Visiting Scientist at Argonne National 

Laboratory in the Mathematics and Computer Science 

Division. At the end of 2005 Dan joined the Mechanical 

Engineering faculty at the University of Wisconsin- 

Madison. His interests are in Computational Science and 

he leads the Simulation-Based Engineering Lab 

(http://sbel.wisc.edu) and the Wisconsin Applied Computing Center. 

h Session(s): S0518 - GPU Computing: From Sand to 

Tank Dynamics (Wednesday, 17:00, Room: K) 

Chee Ng 

Research Assistant Professor of Pediatrics (Children’s 

Hospital of Philadelphia/University of Pennsylvania) 

Dr. Chee M. Ng, PharmD, PhD, FCP, is a Research 

Assistant Professor of Pediatrics at the University of 

Pennsylvania and an investigator in the Laboratory for 

Applied Pharmacokinetics/Pharmacodynamics in the 

Division of Clinical Pharmacology and Therapeutics at 

the Children’s Hospital of Philadelphia (CHOP). He is 

also an investigator in the Kinetic Modeling and Simulation 

(KMAS) core of the University of Pennsylvania. He 

received his B.S. from the State University of New York at 

Buffalo, a Doctor of Pharmacy with High Honor from the 

University of Illinois, and a PhD in pharmaceutics from the 

University of North Carolina at Chapel Hill. 

h Session(s): S0262 - GPU-Accelerated Model-Based 

Drug Development (Wednesday, 10:00, Room: B) 

Trung Dac Nguyen 

(University of Michigan) 


h Session(s): S0058 – Advancing GPU Molecular 

Dynamics: Rigid Bodies in HOOMD-blue 


Dave Nichols 






Marc Nienhaus 

(NVIDIA ARC) 


h Session(s): S0507 – Interactive and Scalable 

Subsurface Data Visualization Framework 


Claus Nilsson 

Programmer (Tietronix Software, Inc.) 


h Session(s): S0321 - GPU-Based Monte Carlo Ray 

Tracing Simulation for Solar Power Plants 


Lars Nyland 

Senior Architect (NVIDIA) 

Lars Nyland has been a Senior Architect in the Compute- 

Architecture Group at NVIDIA for over 6 years. Among his 

concerns is memory performance for GPU computing, 

and one of the more interesting sub-problems has been 

the implementation and performance evaluation of atomic 

memory operations on Tesla, Fermi and Kepler GPUs. 

Prior to joining NVIDIA, Lars was a professor of Computer 

Science at the University of North Carolina and the 

Colorado School of Mines. Lars earned his Ph.D. studying 

parallel programming at Duke University in 1991. 

h Session(s): S0313 - Understanding and Using 

Atomic Memory Operations 


h Session(s): S0642 – Inside Kepler 

(Wednesday, 14:00, Hall 1) 

Akira Nukada 

Researcher (Tokyo Institute of Technology) 

Akira Nukada is a researcher at the Global Scientific 

Information and Computing Center, Tokyo Institute of 

Technology, Japan. His research interests include high 

performance computing, especially fast Fourier 

transforms and GPU computing. He has developed the 

FFTSS library and NukadaFFT library, which are for 

superscalar processor systems and for NVIDIA CUDA 

GPUs, respectively. Both have an 

auto-tuning mechanism, and their performance is often 

competitive with vendors’ libraries. 

h Session(s): S0209 - Performance of 3-D FFT 

Using Multiple GPUs with CUDA 4 


Anton Obukhov 

Engineering Consultant (Ubiquiti Networks) 

Anton Obukhov’s specialization lies in the field of 

computer vision, multimedia processing, and systems 

design. Prior to joining Ubiquiti Networks, he was an 

engineer at NVIDIA in the Developer Technology group 

for four years. He graduated from Moscow State 

University with a master’s degree in Computer Science 

from the Computational Mathematics and Cybernetics 

department in Russia. Before joining NVIDIA, he 

conducted research and development in the Graphics 

and Multimedia Lab at Moscow State University while 

also working at YUVsoft Corporation. 

h Session(s): S0062 - Histograms of Oriented 

Gradients with CUDA: Performance Analysis and 

Optimization Tips (Tuesday, 16:00, Room: A1) 

David Oehmke 

(Cray Inc.) 


h Session(s): S0089 – Accelerator Directives, OpenACC 

and OpenMP4ACC (Tuesday, 16:00, Room: A3) 



Taro Okamoto 

Assistant Professor (Department of Earth and Planetary 

Sciences, Tokyo Institute of Technology) 

Taro Okamoto’s major research field is 

geophysics, in particular seismology: simulating and 

analyzing seismic waves to study the structure of the 

Earth and other planets, and the physics of earthquake 

sources. 

h Session(s): S0352 - GPU-Accelerated Parallel 

Computing for Simulation of Seismic Wave 

Propagation (Wednesday, 10:30, Room: A7) 

Michal Okoniewski 

Director of Marketing (Acceleware Ltd.) 


h Session(s): S0433 – Accelerated FDTD Technique 

for Marine Controlled Source Electromagnetic 

Imaging (Wednesday, 15:30, Room: A7) 

Aaron Oliker 

Partner/Director of 3D Technology (BioDigital) 

Aaron is a partner and Director of 3D Technology at 

BioDigital. Aaron is an expert in the field of 3D computer 

based medical simulation and his work has created a 

new paradigm in medical education. Aaron is also a 

Research Assistant Professor of Educational Informatics 

at New York University School of Medicine. He has taught 

3D programming and medical visualization at the 

undergraduate and graduate level at NYU and SVA for 

the past 12 years. Prior to BioDigital, Aaron founded 

CyberFiber, Inc. and was the Director of Animation and 

Programming at the New York University School of 

Medicine Virtual Surgery Research Laboratory. 


Summit: CEO on Stage Featuring Unity 

Technologies, MirriAd, and BioDigital 


Brent Oster 


Brent Oster is an applied engineer at NVIDIA with 17 

years of experience in computer graphics and simulation. 

He has worked with Bioware, LucasFilm and Electronic Arts, 

and holds a degree in Aerospace Engineering with 

graduate studies in scientific computing. 

h Session(s): S0403 - NURBS Tessellation with CUDA 


Eugene Ostroukhov 

Tools Developer (NVIDIA) 

Eugene Ostroukhov is currently a part of the NVIDIA 

CUDA developer tools team, developing NVIDIA Nsight 

for Linux and Mac platforms. He believes in visual tools 

as an important way to combat ever-increasing software 

complexity and spent almost a decade working on visual 

tools and popular integrated development environments 

for Java, web and mobile application developers. He 

holds B.S. and M.S. degrees from KNEU. 

h Session(s): S0420 – NSight IDE for Linux and Mac 


Andrew Page 

Senior Product Manager (NVIDIA) 

Andrew Page is the Senior Product Manager for 

multi-display and broadcast video products in NVIDIA’s 

Quadro product line. Over his 15 years in hardware and 

software industries he has held engineering and 

marketing roles in professional photo imaging, color 

management and high performance 3D graphics toolkits. 

h S0530 - Multi-Display Roundtable 




Szilárd Páll 

PhD Student (KTH Royal Institute of Technology) 

Szilárd is a PhD student at KTH Royal Institute of 

Technology, working on parallel algorithms for Molecular 

Dynamics; developer of the GROMACS MD package. 

h Session(s): S0363 - Efficient Molecular Dynamics 

on Heterogeneous GPU Architectures in GROMACS 


Jeremie Papon 

PhD Student (University of Gottingen) 


h Session(s): S0075 - Oculus Real-Time Modular 

Cognitive Vision System (Tuesday, 15:00, Room: A1) 

Valerio Pascucci 

(University of Utah) 

Dr. Valerio Pascucci is the Director of the Center for 

Extreme Data Management, Analysis and Visualization 

(CEDMAV.COM) of the University of Utah, established in 

collaboration with the Pacific Northwest National 

Laboratory (PNNL). Valerio is a Professor in the School of 

Computing, Associate Director of the Scientific 

Computing and Imaging (SCI) Institute, and a Laboratory 

Fellow at the PNNL. Before joining SCI, Dr. Pascucci 

served as a Group Leader and a Project Leader at the 

Lawrence Livermore National Laboratory, Center for 

Applied Scientific Computing and as Adjunct Professor 

at the Computer Science Department of University of 

California Davis. 

h Session(s): S0623 - Visualizing Heterogeneous 

Performance Tested on MPI+CUDA Gigapixel 

Panorama Stitching (Wednesday, 17:00, Room: A8) 

Ritesh Patel 

Student (University of California Davis) 

Ritesh is a graduate student pursuing his M.S. degree in 

Electrical and Computer Engineering at the University of 

California, Davis. His interests are in the area of GPGPU 

applications. 

h Session(s): S0361 - Lossless Data Compression on 

GPUs (Wednesday, 17:00, Room: B) 

Sandeep Patel 

Assistant Professor (University of Delaware) 

Sandeep Patel is an Assistant Professor in the 

Department of Chemistry and Biochemistry at the 

University of Delaware. He earned his Ph.D. in Chemical 

Engineering from the Massachusetts Institute of 

Technology (MIT). His research interests include the 

broad areas to which simulation techniques of 

biophysical systems and development of advanced 

molecular modeling technologies are applied. 

h Session(s): S0207 – GPU Enabled Macromolecular 

Simulation: Challenges and Opportunities 


Anjul Patney 

PhD Candidate (University of California, Davis) 

Anjul is a fifth year PhD student in the Department of 

Electrical and Computer Engineering at University of 

California, Davis. He works under the guidance of Prof. 

John Owens in the area of graphics and computer 

architecture. In his research, he is interested in pursuing 

hardware and software challenges in the design of 

programmable rendering architectures. 

h Session(s): S0138 – GPU Task-Parallelism: 

Primitives and Applications 

(Thursday, 15:30, Marriott Ballroom 3)

Bharath Pattabiraman 

PhD Student (Northwestern University) 


h Session(s): S0087 - GPU Acceleration of 

Dense Stellar Clusters Simulation 


Sushrut Pavanaskar 

PhD Candidate (UC Berkeley) 

Sushrut Pavanaskar is a PhD candidate in Mechanical 

Engineering at UC Berkeley. His research interests 

include CAD/CAM, geometric modeling, GPU algorithms, 

computer graphics, and manufacturing. Applications of 

his research include solid model rendering, toolpath 

planning, and methods to improve efficiency in 

manufacturing. He received his BE in Mechanical 

Engineering from Pune University and his M. Tech. from 

IIT Bombay in Manufacturing. Currently at Berkeley, he 

works in the computer-aided design and manufacturing 

laboratory, advised by Prof. Sara McMains. He recently 

won the Audi Production Award 2011 for his concept on 

applying advanced geometric algorithms in automobile 

manufacturing for resource efficiency. 

h Session(s): S0411 – Artifact-Free Cloud-Based CAD 

Rendering (Thursday, 16:30, Room: L) 

Jon Peddie 

President (Jon Peddie Research) 

Jon Peddie is one of the pioneers of the graphics 

industry, starting his career in computer graphics in 

1962. After the successful launch of several graphics 

manufacturing companies, Peddie began JPA in 1984 to 

provide comprehensive data, information and 

management expertise to the computer graphics 

industry. Peddie lectures at numerous conferences on 

topics pertaining to graphics technology and the 

emerging trends in digital media technology. Recently 

named one of the most influential analysts, he is 

frequently quoted in trade and business publications, 

and contributes articles to numerous publications, 

as well as appearing on CNN and TechTV. 



Bert Peers 

Senior Graphics Programmer (CCP Games) 

Bert Peers is a senior graphics programmer with 

Iceland-based CCP Games, the company behind the 

single-shard space MMO Eve Online. After working in the 

games industry as a freelancer for over a decade, as 

well as a few years in the field of medical imaging and 

rapid prototyping, he joined CCP to focus on high fidelity 

avatar customization, rendering, and all things 

characters. 

h Session(s): S0021 - OptiX for DirectX Programmers 

- EVE Online’s GPU-Raytraced Portraits 


Blair Perot 

Professor (University of Massachusetts, Amherst) 

Prof. Perot is the Director of the Theoretical and 

Computational Fluid Dynamics Laboratory at the 

University of Massachusetts, Amherst. He obtained his 

Ph.D. and M.S. degrees in Mechanical Engineering and 

in Computer Science from Stanford University and a 

B.S.E in Engineering Physics with highest honors from 

Princeton University in 1987. Research in the Theoretical 

and Computational Fluid Dynamics Laboratory focuses 

on high performance computing, the computer 

simulation of fluid flow, and the study of fluid turbulence. 

The Laboratory is funded, in part, by the Office of Naval 

Research, the Air Force Office of Scientific Research, the 

DOE and the NSF. 

h Session(s): S0217 – Efficient Implementation of 

CFD Algorithms on GPU Accelerated 


David Perry 

CEO and Co-Founder (GAIKAI) 

David Perry was the founder & president of Shiny 

Entertainment, Inc. (bought by Atari) for over 12 years; 

he’s one of the best-known video game industry 

veterans. Over 29 years, Perry has developed or 

programmed over 100 games across 29 video game 

platforms. All told, Perry’s games (including #1 hits like 

The Terminator, Teenage Mutant Ninja Turtles, Disney’s 

Aladdin & Warner’s Matrix projects) have totaled over a 

billion dollars in retail sales. Perry sits on the advisory 

board of the Game Developers Conference, Indiecade, 

VGEXPO, and has spoken at TED, E3, Hollywood and 

Games Summit, CGDC, MIT, USC, UCI, UCLA, QUB, 

Montreal Game Summit, Digital Hollywood, What Teens 

Want, etc. In his last position Perry was the co-founder 

& chief creative officer of Acclaim.com, directing 

multiple MMORPG games, Social Network Games & 

Casual Titles. All games used the ‘free-to-play’ model, 

supported by in-game advertising, subscriptions or 

micro-transactions. Now Perry is the CEO and co-founder 

of Gaikai.com, a company that’s developed a 

cutting-edge video game streaming technology that 

allows any Windows game or application to run in any 

browser with just one click. Perry also recently launched 

a book for students called David Perry on Game Design 

- GameDesignBook.org (the largest non-profit book on 

Game Design ever written). 



Immersive Media, and Numecent 


Christian Perwass 

CEO (Raytrix GmbH) 

Dr. Christian Perwass received an MSci degree in Physics 

from the University of London, UK, in 1996, and a Ph.D. 

in engineering from Cambridge University, UK, in 1999. 

He then held a post-doctoral position at the University of 

Kiel, Germany, until 2006, where he worked on image 

processing, machine learning and camera models. From 

2006 until 2009 he worked at Robert Bosch GmbH, 

Germany, where he developed image processing 

software for automated optical inspection machines. In 

2009 he co-founded Raytrix GmbH to develop and build 

lightfield cameras. 

h Session(s): S0335 - Live 3D-Video with a Lightfield 

Camera (Wednesday, 14:00, Room: A1) 

h S2006 - Emerging Companies Summit: CEO on 

Stage Featuring Raytrix, Playcast and Universal 

Robotics (Wednesday, 17:00, Marriott Ballroom 4) 

David Peters 

CEO (Universal Robotics) 

David launched Universal Robotics in April of 2008, 

having raised private equity to capitalize operations. He 

is the Chairman of the Board. Before founding Universal, 

he was an entrepreneur in the motion picture industry, 

working as a producer for 17 years. He is a seasoned 

operations executive and fund raiser. David is a member 

of the Directors Guild of America and the Robotics and 

Smart Device Committee of the World Economic Forum 

Network of Global Agenda Councils. He has a Bachelor 

of Fine Arts from the Cleveland Institute of Art. 

h S2006 - Emerging Companies Summit: CEO on 

Stage Featuring Raytrix, Playcast and Universal 

Robotics (Wednesday, 17:00, Marriott Ballroom 4) 



Loukas Petridis 

Staff Scientist (Oak Ridge National Laboratory) 

Loukas Petridis obtained his PhD in theoretical physics 

from Cambridge University in 2006. He was a postdoctoral 

fellow at Oak Ridge National Laboratory from 2007 to 

2009, where he is currently a Staff Scientist. 

h Session(s): S0659 - Computer Simulation of 

Lignocellulosic Biomass (Tuesday, 16:30, Room: A2) 

James Phillips 

Senior Research Programmer (University of Illinois) 

James Phillips is a Senior Research Programmer in the 

Theoretical and Computational Biophysics Group at the 

Beckman Institute for Advanced Science and Technology 

at the University of Illinois at Urbana-Champaign. He 

has a Ph.D. in Physics from the University of Illinois. 

Since 1999, James has been the lead developer of the 

highly scalable parallel molecular dynamics program 

NAMD, for which he received a Gordon Bell Award in 

2002. His research interests include improving the 

performance and accuracy of biomolecular simulations 

through parallelization, optimization, hardware 

acceleration, better algorithms, and new methods. 

h Session(s): S0127 - Petascale Molecular Dynamics 

Simulations on GPU-Accelerated Supercomputers 






Peter Phillips 

SVP (Aon Benfield Securities) 


h Session(s): S0418 – High Productivity 

Computational Finance on GPUs 


Jakub Pietrzak 

Software Engineer (University of Warsaw) 

Jakub Pietrzak is a member of the research team in the 

Department of Medical Physics, Maria Skłodowska- 

Curie Memorial Cancer Centre - Institute of Oncology. 

He is an experienced C++ developer interested in image 

processing and analysis techniques and their 

applications in medical imaging. He has also worked as a 

software engineer for a video postproduction company. 

Jakub is a final-year student in the Inter-faculty 

Individual Studies In Mathematics and Natural Sciences 

program at the University of Warsaw, where he 

simultaneously studies physics (specialization in nuclear 

medicine) and mathematics. 

h Session(s): S0312 - GPU Implementation for Rapid 

Iterative Image Reconstruction in Nuclear 

Medicine (Wednesday, 10:00, Room: A8) 

Nikos Pitsianis 

Assistant Professor (Aristotle University, Greece) 

Nikos Pitsianis is an assistant professor at the 

Department of Electrical and Computer Engineering, 

Aristotle University of Thessaloniki, Greece, and an 

adjunct professor with the Departments of Computer 

Science and Electrical and Computer Engineering of 

Duke University, Durham, North Carolina. His research 

interests include high-performance algorithms and 

architectures for signal and image processing. 

h Session(s): S0314 - Efficient k-Nearest 

Neighbor Search Algorithms on GPUs 


Victor Podlozhnyuk 


Victor Podlozhnyuk is a performance optimization expert 

currently working on the NVIDIA FFT library. In his spare time 

he is investigating various opportunities for putting to use 

the tremendous amount of horsepower modern 

GPU-based systems have. In his previous role as a devtech 

engineer at NVIDIA he authored a number of sample 

algorithm implementations in CUDA and OpenCL for the 

NVIDIA GPU Computing SDK. Victor holds a Master’s and 

a Bachelor’s degree in Electrical Engineering from 

Moscow Institute of Physics and Technology. 

h Session(s): S0273 - Fast JPEG Coding on the GPU 


Raphaël Poncet 

Research Scientist (Commissariat à l’Energie Atomique 

et aux Energies Alternatives) 

Raphael Poncet is a research scientist at CEA (the French 

Alternative Energies and Atomic Energy Commission), a 

French government-funded technological research 

institution, where he works on a high performance 

industrial multi-physics multi-material hydrodynamic code. 

h Session(s): S0091 - Sustainable Hybrid 

Parallelization of an Unstructured Hydrodynamic 

Code (Thursday, 15:00, Room: N) 

Warren Ponder 

Director, Product Management (VMware) 


h Session(s): S0359 – VMware and NVIDIA: 

Delivering 3D Workstations from the Cloud 


Duncan Poole 

Senior Manager, HPC (NVIDIA) 

Duncan Poole is the CEO of the OpenACC organization 

and Senior Manager in HPC for NVIDIA, where he 

works with 3rd party tools providers to deliver GPU-enabled 

capabilities. Duncan’s interests include 

fostering strong academic research relationships, most 

recently in the area of computational chemistry. Duncan 

is a graduate in Electrical Engineering from the 

University of Toronto. 

h Session(s): S0517A – Programming GPUs with 


h S0517B – Programming GPUs with OpenACC (Part 


h S0517C – Programming GPUs with OpenACC (Part 


h S0621 - NVIDIA OpenACC 


Mark Popkiewicz 

CEO (MirriAd) 

Mark has extensive executive experience in high-growth 

companies and has grown businesses from small to 

large and from local to global market leadership, 

having set up 30 operations around the world. Following 

executive roles at companies 

such as Eicon Network, SDX Business Systems, Lucent 

Technologies, Mobile Media, BBC Ventures Group 

and BBC Vecta, he is now CEO of MirriAd. Mark has a 

thorough understanding of contemporary technology and 

business models in digital media, on-line advertising as 

well as telecoms and mobile. 


Summit: CEO on Stage Featuring Unity 

Technologies, MirriAd, BioDigital 

(Wednesday, 10:00, Marriott Ballroom 4)

Srinivasa Prasanna 

Professor (International Institute of Information 

Technology Bangalore) 


h Session(s): S0271 – Fast Adaptive Sampling 

Technique for Multi-Dimensional Integral Estimation 

Using GPUs (Wednesday, 14:30, Marriott Ballroom 3) 

Will Ramey 

Sr. Product Manager, GPU Computing (NVIDIA) 

As NVIDIA’s Senior Product Manager for GPU 

Computing, Will helps define and promote platforms, 

libraries and developer tools for CUDA architecture 

GPUs. Prior to joining NVIDIA in 2003, he managed an 

independent game studio and developed advanced 

technology for the entertainment industry as a product 

manager and software engineer. He holds a BA in 

Computer Science from Willamette University and 

completed the Japan Studies Program at the Tokyo 

International University. Outside of work, Will learns 

something new every day, usually from his two kids. He 

enjoys hiking, camping, swimming, spending time with 

his wonderful wife, and playing The Game. 

h Session(s): S0005 - Languages, APIs and 

Development Tools for GPU Computing 


Pradeep Rao 

Technology Architect (Infosys Technologies Ltd) 

Pradeep is a Technology Architect at Infosys Limited, 

Bangalore, India. He has nine years of experience in the 

IT industry. His core focus area has been building 

solutions and applied research in the field of High 

Performance Computing (HPC). He has experience in 

many HPC technologies such as CUDA, OpenCL and 

multi-core technologies such as Microsoft HPC Server. 

As part of the HPC team at Infosys, his responsibilities 

include providing consulting services to Fortune 500 

clients for their HPC needs and building solutions 

leveraging suitable HPC technology. He has also worked 

on various Microsoft platforms including .Net 

technologies and Sql Server. 

h Session(s): S0271 - Fast Adaptive Sampling 

Technique for Multi-Dimensional Integral 

Estimation Using GPUs 


Steve Rennich 

HPC Developer Technology Engineer (NVIDIA) 

Steve Rennich is a CUDA Developer Technology Engineer 

at NVIDIA where he supports the use of GPUs by the 

computational structural mechanics community. Steve 

holds a PhD in Aeronautics and Astronautics from 

Stanford University where he studied computational fluid 

mechanics and vortex system instabilities. Prior to 

joining NVIDIA, Steve spent 10 years developing structural 

analysis codes. 

h Session(s): S0029 - Leveraging Matrix Block 

Structure In Sparse Matrix-Vector Multiplication 


Max Rietmann 

PhD Student (Institute for Computational Science / USI 

Lugano, Switzerland) 

Max Rietmann is a PhD Student in computer science at 

the Institute for Computational Science at the USI 

Lugano in Switzerland. As a developer for the GPU 

version of the seismology code SPECFEM3D, his research is 

focused on both computational and algorithmic 

challenges associated with numerical wave propagation. 

h Session(s): S0508 - Faster Finite Elements for Wave 

Propagation Codes (Thursday, 10:00, Room: A2) 

Mariano Rivera 

(Researcher-Professor, CIMAT A.C.) 


h Session(s): S0128 - V:Screen: A Real-Time 

Augmented Video Method 


Dylan Roeh 

Kernel Developer (Wolfram Research Inc) 

Dylan Roeh is a Kernel Developer for Wolfram Research 

Inc., the company that makes Mathematica and 

Wolfram|Alpha. He is one of the developers responsible 

for the recently-added CUDA and OpenCL support. 

h Session(s): S0100 - Mathematica as a Practical 

Platform for GPU-Accelerated Finance 


John Romein 

Senior Researcher (ASTRON) 

John W. Romein is a senior system researcher in 

high-performance computing at ASTRON, where he is 

responsible for the central, real-time data processing of 

LOFAR telescope data. He obtained his Ph.D. degree on 

distributed search algorithms for board-game playing at 

Vrije Universiteit, Amsterdam. As a postdoctoral 

researcher, he solved the game of Awari using a large 

computer cluster and did research on parallel 

algorithms for bioinformatics. His research interests 

include high-performance computing, parallel 

algorithms, networks, programming languages, and 

compiler construction. 

h Session(s): S0124 - Signal Processing on GPUs for 

Radio Telescopes (Thursday, 10:00, Room: M) 

Christopher Rossbach 

Researcher (Microsoft Research Silicon Valley) 

Chris Rossbach is a Researcher with Microsoft Research 

Silicon Valley. 

h Session(s): S0320 - PTask: OS Support for GPU 

Dataflow Programming (Thursday, 14:00, Room: B) 

Davide Rossetti 

Researcher (Italian National Institute for Nuclear Physics) 

Davide Rossetti has a degree in Theoretical Physics and 

is currently a staff researcher at Italian National Institute 

for Nuclear Physics (INFN). He has been a member of the 

Array Processor Experiment (APE) research group for 

more than 15 years. His interests range from numerical 

simulations and HPC to processor architectures, 

compilers, and computer graphics. He spent the last 10 

years working on the development of software and 

hardware for high performance interconnection 

networks on clusters. 

h Session(s): S0282 - Leveraging NVIDIA GPUDirect 

on APEnet+ 3D Torus Cluster Interconnect 


Scott Rostrup 

Software Engineer (Synopsys Inc) 

After completing a Masters Thesis at the University of 

Waterloo on developing fluid simulation algorithms for 

the Cell and GPU architectures, Scott joined Synopsys’s 

GPU computing effort. Since joining Synopsys, Scott has 

become interested in developing GPU algorithms for 

applications not typically thought suitable for 

acceleration such as sparse linear algebra, graph 

algorithms, and circuit simulation. 

h Session(s): S0349 - Tree Accumulation on the GPU 




Erwin Roth 

Researcher (Technische Universitaet Muenchen) 

Erwin graduated from Technische Universität München 

in 2008 with a Master of Science (Dipl.-Ing.) degree in 

Mechanical Engineering with a solid background in 

computer vision and model based tracking. He is 

currently working as a PhD candidate for the Ingolstadt 

Institute of the Technische Universität München, a 

scientific research center founded by the AUDI AG and 

the Technische Universität München in the field of 

sensor data simulation for the computer-based testing 

of Advanced Driver Assistance Systems. 

h Session(s): S0319 - Advanced Driver 

Assistance System Testing using OptiX 

(Tuesday, 14:00, Room: N) 

Gregory Ruetsch 


Greg Ruetsch is an applications engineer in GPU 

Computing at NVIDIA. Prior to this he held positions at 

Clearspeed Technologies and at Sun Microsystems. He 

received his Bachelor’s degree in mechanical and 

aerospace engineering from Rutgers University and a 

Ph.D. in applied mathematics from Brown University, 

after which he was a postdoctoral fellow in the 

Aerospace Engineering Department at the University of 

Southern California and in the Center for Turbulence 

Research at Stanford University. 

h Session(s): S0522 - Introduction to CUDA Fortran 


Karl Rupp 

Project Assistant (TU Wien) 

Karl Rupp received the BSc degree in electrical 

engineering from the Technische Universität Wien in 

2006, the MSc in computational mathematics from 

Brunel University in 2007, and the degree of 

Diplomingenieur in microelectronics and in technical 

mathematics from the Technische Universität Wien in 

2009. He completed his doctoral degree on deterministic 

numerical solutions of the Boltzmann transport 

equation in 2011. His scientific interests are in the field 

of semiconductor device simulation and include generic 

programming, advanced discretization schemes for 

partial differential equations and parallel computing. 

h Session(s): S0071 - The High-Level Linear 

Algebra Library ViennaCL And Its Applications 


Scott Ruppert 

ThinkStation Technical Solutions Manager (Lenovo) 

Scott Ruppert is Technical Solutions Manager for the 

worldwide ThinkStation business unit at Lenovo. 

h Session(s): S0638 - Lenovo ThinkStation 

Accelerates Medical Research with Beckman 

Coulter (Presented by Lenovo) 


Radu Rusu 

Research Scientist (Willow Garage, Inc) 

Radu B. Rusu is a Research Scientist at Willow Garage 

and a Visiting Lecturer at Stanford University. Dr. Rusu 

received his Ph.D. in Computer Science from the 

Technische Universitaet Muenchen, Germany. During the 

last few years, Dr. Rusu has been on the board of many 

workshops and scientific events held at prestigious 

conferences, such as RSS, ICRA, IROS, AAAI, etc. He has 

authored over 50 scientific publications, including 1 book 

and 1 best paper award at ICAR 2009. Dr. Rusu’s current 

research interests include realtime perception and 3D 

semantic mapping. He is currently a maintainer of the 

PCL project. 

h Session(s): S0088 - Point Cloud Library (PCL) on 

CUDA (Tuesday, 14:00, Room: C) 

Denis Sabitov 



h Session(s): S0171 - Numerical Modeling Of 3D 

Anisotropic Seismic Wave Propagation On 

MultiGPU Platforms (Wednesday, 09:00, Room: A7) 

Priyanka Sah 

Compute DevTech Engineer (NVIDIA) 

Having spent two years with the Indian Space Research 

Organization, developing and implementing parallel 

image processing algorithms for satellite imagery, 

Priyanka Sah went on to attain her masters in Computer 

Science and Engineering at IIT Delhi. Priyanka 

subsequently worked on life science and weather 

simulation codes as a CUDA consultant, before joining 

NVIDIA in their Developer Technology group. At NVIDIA, 

Priyanka works in a number of HPC application 

domains, helping customers develop with the GPU and 

working at the leading edge of HPC performance. 

h Session(s): S0428 - Panini: A GPU Aware Array 

Class (Thursday, 16:00, Room: B) 

Nikolai Sakharnykh 


Nikolai Sakharnykh is a developer technology engineer 

at NVIDIA. He has been working with game developers 

and HPC CUDA customers providing support for 

graphics technology and GPU compute. Currently he is 

working on CFD and linear algebra related projects for 

current and future GPU hardware. His interests include 

computational fluid dynamics, sparse matrix solvers and 

visualization techniques. Nikolai graduated with honours 

from Moscow State University, the department of 

Computational Mathematics and Cybernetics as a 

specialist in applied mathematics and informatics. 

Currently he’s also working on his PhD at MSU. 

h Session(s): S0247 - 3D ADI Method for 

Fluid Simulation on Multiple GPUs 


Graham Sanborn 

Lead Software Developer (FunctionBay) 

Graham Sanborn is a research engineer at FunctionBay, 

Inc. He is a member of the multi-flexible-body dynamics 

(MFBD) development team, where his research and 

development focus is finite element technologies for 

nonlinear dynamics, the integration of these technologies 

with multi-body formulations for system-level analysis of 

dynamic systems, and the numerical methods 

appropriate for these systems. He has a bachelor’s 

degree in computer science and a PhD in mechanical 

engineering. He received his PhD in 2008 from the 

University of Illinois at Chicago, where he studied 

computational rigid and flexible body system dynamics. 

h Session(s): S0055 - Particle Dynamics with MBD 

and FEA using CUDA (Wednesday, 16:00, Room: K) 

Avijit Santra 

Project Manager - Knowledge Based Engineering (Tata 

Motors Limited) 

Avijit Santra received his Masters in Mechanical 

Engineering from IIT Kharagpur in 2001. He then joined Tata 

Technologies Ltd in 2001 and was deputed to Tata Motors Ltd 

Engineering Research Center. Having 10 years of 

experience in Knowledge Based Engineering Kernel and 

Application development, he is also involved in various 

initiatives in Tata Motors Digital Vehicle Development 

Program which includes PLM, 3D for All etc.

h Session(s): S0040 - Introducing CUDA in KBE 

Applications for Digital Vehicle Development 

Programs (Tuesday, 09:30, Room: J2) 

Greg Scantlen 

Greg Scantlen is CEO of CreativeC.com, a supplier of 

high-performance computing machines and expertise to 

scientists and researchers at academic institutions and 

US national laboratories, such as Los Alamos National 

Laboratory and Sandia National Laboratory. 

h Session(s): S0646 - Massively Parallel Code 

Development on Stelletto CDA (Presented by 

Creative Consultants) (Tuesday, 17:00, Room: A8) 

Bertil Schmidt 

(Nanyang Technological University) 


h Session(s): S0008 - Algorithms and Tools for 

Bioinformatics on GPUs (Tuesday, 16:00, Room: K) 

Michael Schøler 

Senior Consultant (LEGO) 

Michael has a Masters Degree in Computer Science 

from Aalborg University within the fields of Computer 

Vision and Artificial Intelligence systems. As a Senior 

Consultant and CEO in Hinnerup Net A/S, Michael has 

participated in a number of projects for LEGO. One of 

these projects is LEGO 3DServices which is a service 

oriented distributed HPC framework (GPU/CPU) that this 

session will focus on. Michael has worked on numerous 

other projects, ranging from simple websites to cutting 

edge technology development. The most recent primary 

customers for Hinnerup Net A/S are: Vestas, TrygVesta, 

The Danish Road-Directory and LEGO. 

h Session(s): S0261 – Scalable GPU Computing 

Service Architecture (Tuesday, 16:00, Room: A5) 

Steve Scott 

CTO, Tesla Business (NVIDIA) 

Dr. Steve Scott is Chief Technology Officer of the Tesla 

business unit at NVIDIA, where he is responsible for the 

evolution of NVIDIA’s GPU computing roadmap. Prior to 

joining NVIDIA in August 2011, Steve spent 19 years at 

Cray, where he was CTO since 2004. He was the Chief 

Architect of multiple systems at Cray, architected the 

routers for the Cray XT, XE and Cascade systems, and 

led the Cray Cascade project funded by the DARPA High 

Productivity Computing Systems program. Steve holds 

twenty-eight US patents, and has served on numerous 

advisory boards and program committees. He was the 

recipient of the 2005 ACM Maurice Wilkes Award and the 

2005 IEEE Seymour Cray Computer Engineering Award. 

He received his PhD in computer architecture in 1992 

from the University of Wisconsin at Madison, where he 

was a Wisconsin Alumni Research Foundation and Hertz 

Foundation Fellow. 



Frank Sculli 

Co-Founder/Informatics Director (BioDigital) 

Frank cofounded BioDigital on the premise that 

advancements in 3D and information technology will 

revolutionize the understanding of health and this vision 

continues to drive innovation. With extensive experience 

in health informatics, Frank has consulted to numerous 

prestigious medical institutions. Most notably, Frank led 

the development of the Caisis cancer data management 

project which is used globally by leading cancer 

hospitals. Prior to cofounding BioDigital, Frank worked 

at Honeywell, and later as a consultant to major 

organizations such as the Bank of New York, Pfizer and 

the Pennsylvania Treasury Department. Frank received 

his MS in Engineering from Columbia University. 


CEO on Stage Featuring Unity Technologies, 

Numecent, and BioDigital 


Ani Anciaux Sedrakian 

(IFP Energie Nouvelles) 

biography unavailable at press time. 

h Session(s): S0108 – An Innovative Massively 

Parallelized Molecular Dynamic Software 


Mark Seligman 

Senior Scientist (Insilicos LLC) 

Mark was a compiler developer for supercomputer 

vendors for many years. In recent years, he became 

more interested in the interplay of algorithms with 

hardware and now prefers to work directly with other 

researchers. His original training was in pure math, but 

nowadays he tends to focus on bioinformatics, 

computational statistics and optimization. 

h Session(s): S0337 - High-Throughput Epistasis 

Screening Using GPUs (Tuesday, 09:00, Room: K) 

Matthew Sellitto 

(Northeastern University) 


h Session(s): S0290 – Algorithm Acceleration 

for Geospatial Analysis 


Partha Sen 

CEO (Fuzzy Logix) 

Partha Sen is the Co-founder and CEO of Fuzzy Logix. 

He has a passion for solving complex business problems 

using quantitative methods, data mining and pattern 

recognition. Since 1995, Partha has pursued this passion 

and has developed numerous high-performance 

quantitative algorithms. Today, these algorithms and 

models are the basis for the products being brought to 

market by Fuzzy Logix. Before founding Fuzzy Logix, 

Partha worked at Bank of America where he held senior 

management positions in the commercial and 

investment bank and in the portfolio strategies group. 

Previously Partha held managerial positions at Ernst 

and Young and Tata. He has an Engineering degree from 

the Indian Institute of Technology and an MBA from 

Wake Forest University. 

h Session(s): S0427 - Intra-Day Risk-Management 

with Parallelized Algorithms on GPUs 


Neil Sequeira 

Managing Director (General Catalyst Partners) 

As a Managing Director of General Catalyst Partners, 

Neil invests in both new and existing technology 

businesses. His areas of special interest include: 

Internet and new media; software; consumer services; 

and network infrastructure. He is based in the firm’s Palo Alto 

office. Before joining General Catalyst Partners, Neil 

held positions at Time Warner where he was most 

recently Managing Director, Technology for Time Warner 

Investments (formerly AOL Time Warner Ventures), the 

early stage private investment vehicle for the world’s 

largest media company. During his four years at Time 

Warner, Neil worked closely with various operating 

groups including AOL, HBO, Time Inc., Time Warner 

Cable, Turner and Warner Brothers to identify 

investment opportunities. Neil sourced, led and was a 

board director or observer for several of the companies 

within the Time Warner Investments portfolio including: 



Arroyo Video Solutions (CSCO), BigBand Networks 

(BBND), Entropic (ENTR), Goldpocket Interactive (ERIC), 

N2Broadband (ERIC) and Waterfront Media. 



Fyodor Serzhenko 

CEO (Fastvideo) 

Fyodor Serzhenko is CEO of Fastvideo. His research 

interests include high speed cameras, software for 

high speed imaging, and high performance computing. 

He graduated from the Moscow Institute of Physics and 

Technology in 1989 and received his PhD in the physics 

of semiconductors in 1993. 

h Session(s): S0273 - Fast JPEG Coding on the GPU 


Christopher Sewell 




PISTON: Portability and Performance for Data- 



h S0707 - Los Alamos AHPC Symposium, Accelerated 



Peter Shenkin 

Vice President (Schrodinger) 

Peter S. Shenkin, Vice President, joined Schrödinger in 

1999. Previously, he was the lead developer of the 

MacroModel molecular-modeling package at Columbia 

University. He received his Ph.D. in Chemistry from 

Princeton University in 1979. After working for Owens- 

Corning Fiberglass Corporation for four years, he taught 

and carried out research at Columbia University and 

Barnard College prior to joining the MacroModel group 

in 1992. He has published in the areas of biosequence 

diversity analysis, protein structure determination, 

implicit solvation models for molecular mechanics and 

fast methods for determining solvent-accessible surface 

areas for atoms in molecules. 

h Session(s): S0121 - Software Architecture to 

Facilitate CUDA Development 


Gideon Shmuel 

CEO (eyeSight Mobile Technologies, Ltd.) 

Gideon joined eyeSight with 20 years of experience in 

the Telecoms and Enterprise Software markets. 

Gideon has been involved in growing technology 

organizations and running and establishing the business 

activities and operations of several companies across 

international markets. Most recently Gideon performed 

the role of VP Sales at cVidya Networks. Prior to that 

Gideon held executive roles in a number of 

countries at Olista, Top Image Systems, LCR Telecom 

and Esprit Telecom. 

h Session(s): S2002 - Emerging Companies Summit: 

CEO on Stage Featuring eyeSight Mobile, 



Mark Silberstein 

Post-doctoral Researcher (UT Austin) 

Mark Silberstein is a Post-doctoral fellow at the 

University of Texas at Austin, with Prof. Emmett Witchel. 

He earned his PhD from the Technion, Israel Institute of 

Technology. His current research focuses on improving 

the integration of GPUs with operating systems, as 

well as optimized execution of hybrid applications 

involving both GPUs and CPUs. He can be reached at 

marks@cs.utexas.edu. 

h Session(s): S0360 - Set GPUs Free: Integrating a 

File System with CUDA Programs 

(Thursday, 09:30, Hall 1) 

Chris Slaughter 

President (University of Texas Perception, Lynx Labs) 

Chris Slaughter is the President of Lynx Laboratories 

and a member of the Perception Laboratory at the 

University of Texas at Austin. Along with a team of 

engineers and researchers, he investigates theoretical 

problems in Computer Vision with an emphasis on high 

performance. His current research direction is focused 

on compressive motion analysis, real-time data 

clustering, and statistical localization on large maps. As 

the President of Lynx Labs, he also oversees the 

development of high performance algorithms for 

tracking, dense reconstruction, and SLAM as well as the 

commercialization of these technologies. 

h Session(s): S0607 - High Performance 3D 

Perception (Tuesday, 09:00, Room: A1) 

Peter-Pike Sloan 

Principal Research Scientist (NVIDIA) 

Peter-Pike Sloan recently moved to NVIDIA Research. 

Prior to that he was part of a research group for Disney 

Interactive Studios and also spent nearly 10 years at 

Microsoft, where he worked in the graphics research 

group, DirectX and on the many-core incubation team. 

He is interested in all areas of computer graphics, 

particularly interactive rendering techniques. 

h Session(s): S0611 - Edge-Aware Shaders 

for Real-Time Computer Graphics 


Berend Smit 

(UC Berkeley/Berkeley Lab) 


h Session(s): S0122 – Computational Screening 

of Novel Carbon Capture Materials 


Roman Sokolov 

Director of System Architecture (D4D Technologies) 

Roman Sokolov received his Ph.D. in Physics from UCSD in 

2005. He has been working at D4D Technologies since 2007 

as a software engineer. His main interests include applied 

mathematics, numerical methods and image processing. 

h Session(s): S0079 – Warped Parallel Nearest 

Neighbor Searches using KD-Trees 


Prakalp Somawanshi 

(CRL India) 


h Session(s): S0107 – Acceleration of Long-Wave 

Rapid Radiative Transfer Model on GPGPU 


Paulo Souza 

HPC Consultant / Software Engineer (Petrobras) 

Paulo Souza has spent 9+ years working with E&P 

production geophysics software, seismic imaging on 

HPC clusters, RTM, One Way Wave Equation, Kirchhoff, 

multiple architecture optimization (GPGPU, x86, Power, 

Cell) and cluster deployment. He has been working with 

GPGPU since 2006 porting seismic imaging applications 

to CUDA with gains up to 10X in performance/price and 

performance/watt over a traditional multi-million dollar 

x86 cluster.




Dale Southard 

Senior Solution Architect (NVIDIA) 

Dale Southard is a senior solution architect with NVIDIA. 

In the past he was a HW architect in the LLNL systems 

group designing the vis/post-processing solutions and 

on-call for capability systems. 

h Session(s): S0119 - Best Practices for Architecting 

and Managing High-Performance GPU Clusters 


Marco Sozzi 

Associate Professor (Physics Department of Pisa) 

Marco Sozzi is associate professor of physics at the 

University of Pisa, working in particle physics and 

focusing on discrete symmetry violations in Nature. His 

areas of interest include high-performance triggering 

and event selection, and he coordinates the Trigger and 

Data Acquisition project for the NA62 experiment in 

preparation at CERN, for which a pilot project using 

GPUs is foreseen. 

h Session(s): S0013 – GPUs for Fast Triggering in 

NA62 Experiment (Tuesday, 10:00, Room: J2) 

Kyle Spagnoli 

Research Engineer (EM Photonics) 

Kyle has been working in GPU accelerated algorithms and 

applications since the pre-CUDA era. At the University of 

Delaware, he received his Master’s degree in electrical 

engineering with a focus in parallel computing 

architectures. Since then, as a research engineer at EM 

Photonics, he has worked on a number of GPGPU projects 

including: accelerated physical optics simulations, 

computational fluid dynamics, biomedical processing, 

advanced image processing, and computational linear 

algebra. Currently, he is researching new algorithms and 

techniques for large scale sparse linear algebra solvers. 

h Session(s): S0307 – New Advances in GPU Linear 

Algebra (Wednesday, 14:00, Room: A3) 

Paolo Spallaccini 

System Engineer (Ericsson) 

Paolo Spallaccini is working at Ericsson R&D Italy, 

Microwave Department, as a system engineer. His 

research interests lie in diverse digital signal processing 

areas, with focus on source and channel coding, as well 

as in software engineering and in algorithm 

engineering, with focus on parallel computing. His 

working experiences ranged from joining/leading 

technical development groups for signal processing 

systems to pioneering long-term innovative 

and strategic projects for telecommunication networks 

backbone and mobile backhaul systems. He received a 

master’s degree in Electronic Engineering from the University 

of Perugia in 1999. He is an IEEE Member. 

h Session(s): S0255 - Telecom Systems Simulations 

Acceleration via CPU/GPU Co-Processing: Turbo 

Codes Case Study 


Pierre Spatz 

Head of Quantitative Research (Murex SAS) 

Pierre joined Murex in 1989 and holds a master’s 

degree in computer science and applied mathematics 

from ENSIMAG. After various leading positions in the 

Murex software development team, Pierre launched 

the Murex Analytics initiative in 2002. 

h Session(s): S0250 - From GPU Computing 

Toward Full HPC In Finance with GPUs 


Filippo Spiga 

Computational Scientist (Irish Centre for High-End 

Computing) 

Filippo joined ICHEC in January 2011 as a 

Computational Scientist after six months at the IBM T.J. 

Watson Research Center as Research Engineer. His 

main interests include general GP-GPU programming, 

numerical algorithms for GP-GPU, development of 

mixed multi-core CPU and GPU code and scientific 

application porting. Inside ICHEC Filippo is directly 

involved in the GP-GPU porting of the PWSCF package 

(QUANTUM ESPRESSO suite), enabling the package for 

efficient and highly scalable serial and parallel 

calculations on large GPU clusters. 

h Session(s): S0220 - Enabling faster material 

science modeling using the accelerated Quantum 

ESPRESSO (Thursday, 16:30, Marriott Ballroom 4) 

Savitha Srinivasan 

Partner (IBM Venture Capital Group) 

Savitha Srinivasan is a Partner in IBM’s Venture Capital 

Group in Corporate Strategy where she develops 

strategic relationships with venture capitalists and their 

portfolio companies to leverage external innovation for 

mutual strategic advantage. She has over 20 years of 

experience at IBM in leadership roles addressing the 

strategic priorities of IBM’s Services businesses and 

leads the development of IBM’s Services venture 

ecosystem, with each of the Global Technology Services 

business units – Strategic Outsourcing, Integrated 

Technology Services, Managed Business Process 

Services and Industry Analytics with early identification 

of companies, fostering pilots, partnerships and M&A 

insights. She is currently engaged in driving IBM 

Watson’s content partnership strategy. 



Timo Stich 


Timo Stich is a Developer Technology Engineer for 

NVIDIA Corporation. His focus is on image processing 

and general purpose compute applications of GPUs. 

Prior to joining NVIDIA he was research staff at the 

Graphics, Optics and Vision Group at the Max-Planck- 

Institute for Computer Science, Saarbruecken and the 

Computer Graphics Lab at Brunswick University. He 

received a diploma degree in Computer Science from 

Mannheim University, Germany and a Ph.D. degree from 

the Brunswick University, Germany. 

h Session(s): S0052 - Fast High Quality Image and 

Video Background Removal with CUDA 


Chris Stiefeling 

(Oliver Wyman Financial Services) 

Chris has more than 15 years of experience in designing 

and implementing software solutions for the Financial 

Sector. He has an in-depth knowledge of spreadsheet, 

database and automation technologies and has 

developed expertise in many different programming 

languages and technologies. He has developed a 

significant amount of experience in the areas of 

economic scenario generation as well as pricing and 

valuation of derivatives and insurance products using 

Monte Carlo simulation techniques. Chris has expertise 

in implementing HPC solutions including large scale 

cloud computing implementations, programming on 

general purpose GPU cards and distributed computing 

frameworks such as Windows HPC. 

h Session(s): S0435 - Leveraging GPGPU Technology 

for Valuation of Complex Insurance Products 




John Stone 

Senior Research Programmer (University of Illinois at 


John Stone is a Senior Research Programmer in the 

Theoretical and Computational Biophysics Group, and 

Associate Director of the NVIDIA CUDA Center of 

Excellence at the University of Illinois. Stone is the lead 

developer of VMD, a high performance molecular 

visualization tool used by researchers all over the world. 

His research interests include molecular visualization, 

GPU computing, parallel processing, ray tracing, haptics, 

and virtual environments. Mr. Stone was named an 

NVIDIA CUDA Fellow in 2010. Stone provides consulting 

services for projects involving computer graphics and 

GPU computing. 

h Session(s): S0142 - VMD: High Performance 

Molecular Visualization and Analysis on GPUs 






Jeff Stuart 

PhD Student (UC Davis) 


h Session(s): S0157 – A Study of Persistent Threads 

Style Programming Model for GPU Computing 


Xiaobai Sun 

Professor (Duke University) 

Xiaobai Sun is a professor of computer science at Duke 

University. Her research interests and efforts focus on 

numerical algorithm design and analysis, especially, in 

bridging and blending mathematical models and 

computer architectures for scientific simulation and 

signal processing. 

h Session(s): S0314 – Efficient k-Nearest 

Neighbor Search Algorithms on GPUs 


Rajeev Surati 

President (Scalable Display Technologies) 


h Session(s): S0355 - Seamless Scalable Displays- 

using NVIDIA Warp + Intensity API 


Krishnan Suresh 

Associate Professor (University of Wisconsin) 

Krishnan Suresh is currently an Associate Professor in 

the Department of Mechanical Engineering, 

University of Wisconsin, Madison. He graduated in 1998 

from Cornell with a Ph.D. in Mechanical Engineering. He 

later served as an Engineering Manager at Kulicke and 

Soffa Industries, Philadelphia from 1998 through 2002. 

His research interests are in representational and 

computational challenges underlying computational and 

bio-mechanics. 

h Session(s): S0070 - GPU-Friendly 

Preconditioners for Thin Structure Analysis 


William Tang 

Director of Fusion Simulation Program at the Princeton 

Plasma Physics Laboratory (Princeton) 

William Tang is the Director of the Fusion Simulation 

Program at the Princeton Plasma Physics Laboratory 

(PPPL) and Lecturer with Rank & Title of Professor in 

the Department of Astrophysical Sciences at Princeton 

University. He is a Fellow of the American Physical 

Society and received the 2005 Chinese Institute of 

Engineers-USA (CIE-USA) Distinguished Achievement 

Award “for his outstanding leadership in fusion research 

and contributions to fundamentals of plasma science.” 

He is internationally recognized for his theoretical 

contributions as well as associated HPC applications 

dealing with electromagnetic kinetic plasma behavior in 

complex geometries. He has over 200 publications – with 

more than 140 peer-reviewed papers and an “h-index” 

or “impact factor” of 42 on the Web of Science, including 

over 5400 total citations. He is currently the U.S. PI for 

the G8 Exascale Project in Fusion Energy -- an 

international HPC collaboration involving the US, UK, 

France, Germany, Japan, and Russia. 

h Session(s): S0654 - Fusion Energy Sciences & 

Computing at the Extreme Scale 


Sarah Tariq 


Sarah is a senior engineer in NVIDIA’s Developer 

Technology team focusing on High Performance GPU 

Computing in the Life Sciences domain. As part of her job 

she works collaboratively with external developers to 

research and develop GPU computing algorithms and 

ensure the best performance of GPU computing 

applications on current and next-generation architectures. 

h Session(s): S0351 - Strong Scaling for Molecular 

Dynamics Applications (Tuesday, 14:30, Room: A1) 

Michela Taufer 

Assistant Professor (University of Delaware) 

Michela Taufer is an Assistant Professor in Computer 

and Information Sciences at the University of Delaware. 

She earned her MS in Computer Engineering from the 

University of Padova and her Ph.D. in Computer Science 

from ETH. She was a post-doc at UC San Diego and The 

Scripps Research Institute. Michela has a long history of 

interdisciplinary work with computational biophysics 

groups. Her research interests include software 

applications and their advanced programmability in 

heterogeneous computing (i.e., multi-core platforms and 

GPUs); cloud computing and volunteer computing; and 

performance analysis, modeling and optimization of 

multi-scale applications. 

h Session(s): S0207 - GPU Enabled Macromolecular 

Simulation: Challenges and Opportunities 


Tetsuo Tawara 

Software Engineer (Koozyt) 

Tetsuo Tawara is currently a software engineer at Koozyt 

where he works on augmented reality and data mining 

projects. He received a Master’s degree in Mechanical 

Engineering from Aoyama Gakuin University. 

h Session(s): S0231 - Levenberg-Marquardt using 

Block Sparse Matrices on CUDA 


Andrei Tchouprakov 

Director of System Architecture (D4D Technologies) 

Andrei Tchouprakov is a Director of System Architecture 

at D4D Technologies where he is currently working on 

developing a 3D dental scanner. His background is in 3D 

data acquisition, point cloud processing, surface 

generation, image processing and parallel computing. 

He received his MS degree in Mathematics in 1998 from 

Irkutsk State University, Russia. 

h Session(s): S0079 - Warped Parallel Nearest 

Neighbor Searches using KD-Trees 

(Thursday, 10:30, Room: A2)

Tom-Michael Thamm 

Director, Software Product Management (NVIDIA ARC) 

Tom-Michael Thamm is the Director for Software 

Product Management at NVIDIA ARC and is responsible 

for all products, such as iray, mental ray and the 

geo-spatial library. He is managing direct customer 

support as well. Thamm has worked for mental images 

and NVIDIA ARC for over 20 years. He has led several 

key projects such as integration of mental ray into many 

of the major CAD systems. He has studied Mathematics 

and has developed various 3D file formats, such as 

extended OBJ, and free-form surface algorithms. 

h Session(s): S0507 - Interactive and Scalable 

Subsurface Data Visualization Framework 


Derek Thorslund 

Director of Product Management (Citrix Systems, Inc.) 

Derek Thorslund drives Citrix’s product strategy for HDX 

(high definition experience) multimedia virtualization 

technologies and leads the company’s HDX Product 

Management group across XenDesktop, XenApp, 

VDI-in-a-Box, Citrix Receiver and CloudGateway. Upon 

joining Citrix in 2003, he played a key role in introducing 

the Citrix Access Suite, forerunner to XenDesktop 

Platinum Edition. Thorslund has had an extensive career 

in the high-tech industry as Director of Product 

Management at Avotus and Manager of New Business 

Applications at Bell-Northern Research. 

h Session(s): S0413 - Delivering 3D Professional 

Graphics from the Cloud with Citrix XenDesktop 


Alexey Titov 

Engineering Research Associate (Stanford) 

Dr. Alexey Titov is an Engineering Research Associate in 

the Martinez Group at Stanford University. His research 

efforts are focused on exploring, implementing and 

optimizing computational chemistry algorithms for novel 

architectures. He is one of the developers of TeraChem, 

quantum chemistry software created from scratch for 

GPUs. Alexey Titov’s research interests also include 

parallel algorithms and various applications of symbolic 

algebra systems in optimization of performance-critical 

computational routines for novel architectures. 

h Session(s): S0429 - Quantum Chemistry: Automated 

Code Generation and Optimization for GPU Kernels 


Stanimire Tomov 

Research Director (University of Tennessee, Knoxville) 


h Session(s): S0248 – Excitements, Challenges, 

and Rewards In Optimizing GPGPU Kernels 


h S0042 – Solving Challenging Numerical Linear 

Algebra Algorithms using Multiple GPU 

Accelerators (Wednesday, 15:00, Room: A3) 

Doug Traill 

Senior Solutions Architect (NVIDIA) 

Doug Traill is a Senior Solutions Architect at NVIDIA for 

scalable visualization solutions. He has over 15 years’ 

experience in designing and building some of the world’s 

most complex visualization systems. 

h Session(s): S0341 - See the Big Picture Scalable 

Visualization Solutions for System Integrators 


Justin Tripp 

Technical Staff Member (Los Alamos National Laboratory) 

Dr. Justin L. Tripp is a Technical Staff Member on the 

Advanced Architectures team at Los Alamos National 

Laboratory. Dr. Tripp works on tools and methodologies 

for creating high-performance computing systems, 

which have been applied to systems from 

supercomputers to satellites and airborne video 

surveillance. Dr. Tripp received an R&D100 Award for his 

work on the Trident C-to-FPGA Compiler. Dr. Tripp 

received his PhD in Electrical Engineering from Brigham 

Young University in 2004 and has nineteen publications 

relating to FPGAs and high-performance computing, and 

more than 15 years of experience with FPGAs, high-performance 

computing, advanced architectures, and 

system-level design and analysis tools. 


The Architecture of Acceleration in HPC 


h S0707- Los Alamos AHPC Symposium, Accelerated 



Alejandro Troccoli 

Mobile Imaging Researcher (NVIDIA) 

Alejandro has been with NVIDIA since 2006 and joined 

NVIDIA Research in March 2011 to work in mobile 

computer vision and applications. As a 3D Systems 

Software Engineer he led the development of NVIDIA’s 

Optimus technology, contributed to NVIDIA’s hybrid 

technology and did development work for the Direct3D 

Driver. Alejandro received a Licenciatura en Ciencias de 

la Computacion from the Universidad de Buenos Aires, 

Argentina, in 2001. He did his graduate work at Columbia 

University in the City of New York, where he received a 

Ph.D. in 2006. 

h Session(s): S0526 - Tools for Mobile Computational 

Photography (Tuesday, 16:00, Room: N) 

Jeroen Tromp 

Director, Princeton Institute for Computational 

Science (Princeton) 

Seismologist Jeroen Tromp, Blair Professor of Geology, 

Professor of Applied & Computational Mathematics, and 

Director of the Princeton Institute for Computational 

Science joined the Princeton faculty in 2008. Tromp’s 

main research interests are in theoretical & 

computational seismology, including simulations of 

acoustic, (an)elastic, and poroelastic seismic wave 

propagation on local, regional and global scales. The 

current focus of his research involves imaging Earth’s 

interior based on spectral-element and adjoint methods. 

He received the Macelwane Medal of the American 

Geophysical Union in 1999 and a Gordon Bell Award in 

2003. He is a corresponding member of the Royal 

Netherlands Academy of Sciences. 

h Session(s): S0608 - Toward Global Seismic Imaging 

based on Spectral-Element and Adjoint Methods 


Thomas True 


Tom is a Senior Applied Engineer in NVIDIA’s 

Professional Solutions Group where he focuses on the 

use of GPUs in broadcast, video and film applications 

ranging from pre-visualization to post production and 

live to air. Prior to joining NVIDIA, Tom was an 

Applications Engineer at SGI. Thomas has an M.S. degree 

in Computer Science from the Graphics Lab at Brown 

University and a B.S. degree from the Rochester 

Institute of Technology. 



h S0328 - Best Practices in GPU-Based Video 

Processing (Tuesday, 14:00, Room: J2) 

h S0049 - Using the GPU Direct for Video API 




Hoang-Tron Minh Tuan 

PhD Student (George Mason University) 

Tuan is currently a PhD student at George Mason 

University, School of System Biology. His research has 

focused on calcium dynamics, cardiac cell 

modeling, and high performance computing. Currently, 

he’s working on developing a computational model of 

cardiac cells at a microscale level using GPU technology 

to study the underlying mechanisms of calcium-entrained 

arrhythmias. 

h Session(s): S0072 – GPU-Enabled Spatiotemporal 

Model of Stochastic Cardiac Calcium Dynamics and 

Arrhythmias (Wednesday, 09:00, Room: B) 

Antonino Tumeo 

Research Scientist (Pacific Northwest National 

Laboratory) 

Dr. Antonino Tumeo received the M.S. degree in 

Informatic Engineering, in 2005, and the Ph.D. degree in 

Computer Engineering, in 2009, from Politecnico di 

Milano in Italy. Since February 2011, he has been a 

research scientist in PNNL’s High Performance 

Computing group. He joined PNNL in 2009 as a post 

doctoral research associate. Previously, he was a post 

doctoral researcher at Politecnico di Milano. His 

research interests are modeling and simulation of high 

performance architectures, hardware-software 

codesign, FPGA prototyping and GPGPU computing. 

h Session(s): S0343 - A Quantum Chemistry 

Domain-Specific Language For Heterogeneous 

Clusters (Tuesday, 10:00, Room: L) 

Stanley Tzeng 

Graduate Student (University of California, Davis) 

Stanley Tzeng is a graduate student at the University of 

California, Davis. His main research is on task-parallel 

systems on the GPU and their applications. 

h Session(s): S0138 - GPU Task-Parallelism: 

Primitives and Applications 






Ivan Ufimtsev 

Postdoc (Stanford) 


h Session(s): S0429 – Quantum Chemistry: Automated 

Code Generation and Optimization for GPU Kernels 


Stefan Umbreit 

Postdoctoral Associate (Northwestern University) 


h Session(s): S0087 – GPU Acceleration of 

Dense Stellar Clusters Simulation 


Vamsi Krishna Veligatla 

GPU Programmer (University Of Groningen) 

Vamsi Krishna Veligatla received his Masters in 

Computer Science (IIIT Hyderabad 2006) and BTech in 

Computer Science (IIIT Hyderabad 2004). His 

professional experience includes work as a Software Developer at 

NVIDIA (Pune, India) and later as a Software 

Developer at AMD (Hyderabad, India). Most recently he 

has been working as a GPU Programmer at the Kapteyn 

Astronomical Institute, University of Groningen 

(Groningen, The Netherlands). 

h Session(s): S0187 - GPUs for Radio Imaging 


Shalini Venkataraman 

Senior Applied Engineer (NVIDIA) 

Shalini Venkataraman is a Senior Applied Engineer 

at NVIDIA. 



h Session(s): S0356 - Optimized Texture Transfers 


h S0353 - Programming Multi-GPU’s for Scalable 

Rendering (Wednesday, 09:00, Room: A1) 

h S0322 - Warping & Blending for Multi-Display 

Systems (Wednesday, 10:00, Room: A1) 



Shivaram Venkataraman 

PhD Student (UC Berkeley) 

Shivaram Venkataraman is a PhD student at the 

University of California, Berkeley and is a part of the 

AMP Lab. He completed his M.S. at the University of 

Illinois in 2011 and his B.E. at the Birla Institute of 

Technology and Science, Pilani, India. His research 

interests are in design of storage systems and analytics 

platforms for big-data applications. 

h Session(s): S0152 – Accurate Sequence Alignment 

using Distributed Filtering on GPU Clusters 


Vyas Venkataraman 


Vyas Venkataraman is a software engineer in the CUDA 

developer tools group at NVIDIA. He is primarily 

responsible for CUDA-MEMCHECK, and contributes to 

the CUDA Driver and backend code shared by clients of 

the debug API. He joined NVIDIA in 2010 from Boston 

University where he was doing research on abstractions 

for high level modeling of synthesizable communicating 

systems. Vyas received his Doctor of Philosophy from the 

College of Engineering at Boston University. 

h Session(s): S0027A – All-In-One Debugging 

Experience with CUDA-GDB and CUDA-MEMCHECK 


h S0027B – All-In-One Debugging Experience 

with CUDA-GDB and CUDA-MEMCHECK 


Jeff Vetter 

(Oak Ridge National Laboratory) 




Oreste Villa 

Research Scientist (Pacific Northwest National Laboratory) 


h Session(s): S0343 – A Quantum Chemistry 

Domain-Specific Language For Heterogeneous 

Clusters (Tuesday, 10:00, Room: L) 

Will Wade 

Manager, Quadro Advanced Technologies (NVIDIA) 

Will Wade manages the Quadro Advanced Technologies 

Team at NVIDIA, responsible for some of the most 

demanding visual computing solutions on the planet. 

This team creates technologies for virtual reality caves, 

3D stereoscopic professional visualization, real-time 

broadcast graphics, and remote and virtualized 

interactive graphics. Will has been a leader in the field 

for over 15 years, with work at both NVIDIA and HP. 

h Session(s): S0254 - Graphics in the Cloud - 

How NVIDIA is Enabling Cloud Visualization 

(Tuesday, 14:00, Room: A5)

Kelly Walker 

Senior Software Developer (Hue) 


h Session(s): S0436 - Integrated GPU Acceleration 

With Real Time Visualization Of Terabyte Data 


Ross Walker 

Assistant Professor (University of California San Diego) 

Ross Walker is an Assistant Research Professor at the 

San Diego Supercomputer Center, an Adjunct Assistant 

Professor in the Department of Chemistry and 

Biochemistry at the University of California, San Diego 

and an NVIDIA Fellow. He runs the Walker Molecular 

Dynamics Lab where he leads a team developing 

advanced techniques for Molecular Dynamics Simulations 

supporting work improving drug and biocatalyst design. 

His work includes improved Quantum Mechanical/ 

Molecular Mechanical models, development of force 

fields for simulation of lipid membranes, simulations of 

cellulase enzymes for improved cellulosic bioethanol 

production and the development of GPU accelerated 

versions of the AMBER Molecular Dynamics engine. 

h Session(s): S0010 - Towards Routine Microsecond 

Molecular Dynamics Simulations on Commodity 

Hardware (Wednesday, 09:00, Room: N) 

Jason Walsh 

(University of Pennsylvania 3D Lab) 


h Session(s): S0303 – GPU Acceleration for 

Threshold Based Region Growth Algorithms 


BingQiang Wang 

Head of High Performance Computing (BGI) 

BingQiang Wang completed his doctorate in 

computational chemistry at East China University of 

Science and Technology (ECUST) in 2006. From March 

2005, he was a research scientist at Shanghai 

Supercomputer Center, dedicated to enabling high performance 

computing in computational chemistry and life 

science research. In March 2010 he joined BGI as group 

head of high performance computing, to develop 

solutions for challenging life science problems. 

h Session(s): S0519 - GPU Accelerated 

Bioinformatics Research at BGI 


h S0109 - SOAP3: GPU-based Compressed Indexing 

and Ultra-fast Parallel Alignment of Short Reads 


Gaofeng Wang 

Postdoc Fellow (Laboratoire E.M2.C, Ecole Centrale Paris) 

Dr. Gaofeng WANG is a postdoc fellow in Laboratory EM2C, 

CNRS UPR288, Ecole Centrale Paris. His research 

interests are in the area of turbulent combustion modeling 

and high fidelity CFD. 

h Session(s): S0129 - A Monte Carlo Thermal 

Radiation Solver in GPU/CPU Hybrid Architecture 


Long Wang 

Associate Professor (Supercomputing Center of CNIC, 



h Session(s): S0392 – Large-Scale First Principle 

Pseudopotential DFT Calculations on GPU Clusters 


Peng Wang 

Devtech Engineer (NVIDIA) 

Peng Wang is currently the manager of HPC developer 

technology in NVIDIA China, where he works with HPC 

developers in porting and optimizing HPC codes on GPU. 

Previously he worked at NVIDIA US as an HPC developer 

technology engineer, where he mainly worked on CAE 

solvers on GPU and molecular dynamics. He received a Ph.D. 

in computational physics from Stanford, where he 

worked on developing massively parallel adaptive mesh 

fluid simulation code and applying it to astrophysical 

turbulence simulations. He also received an MS in Physics and 

a BS in Scientific Computing from Nankai University. 

h Session(s): S0245 - Porting Legacy Plasma Codes 

to GPU (Tuesday, 16:00, Room: A8) 

David Weinstein 

CTO (Numira Biosciences) 

Dr. David Weinstein is the Chief Technology Officer and 

Senior Director of Salt Lake Operations for Numira 

Biosciences. As a PhD student at the University of Utah 

in the early 90’s, David was a founding member of the 

Scientific Computing and Imaging (SCI) Institute. In 2004, 

he co-founded Visual Influence (VI), a SCI startup 

focused on custom visualization and analysis software 

for the medical imaging industry. In 2007, VI was 

acquired by Numira Biosciences, where David and his 

team now develop high-throughput processing and 

Cloud-based interactive visual analysis tools for 

preclinical imaging. David has co-authored over 40 

peer-reviewed scientific publications. 


CEO on Stage Featuring eyeSight Mobile, 



Jack Wells, Ph.D. 

Director of Science, Oak Ridge Leadership Computing 

Facility (Oak Ridge National Laboratory) 

Jack Wells is the director of science for the National 

Center for Computational Sciences (NCCS) at Oak Ridge 

National Laboratory (ORNL). He is responsible for 

devising a strategy to ensure cost-effective, state-of-the-art 

scientific computing at the NCCS, which houses the 

Department of Energy’s Oak Ridge Leadership 

Computing Facility (OLCF). In ORNL’s Computing and 

Computational Sciences Directorate, Wells has worked 

as group leader of both the Computational Materials 

Sciences group in the Computer Science and 

Mathematics Division and the Nanomaterials Theory 

Institute in the Center for Nanophase Materials 

Sciences. During a sabbatical, he served as a legislative 

fellow for Senator Lamar Alexander, providing 

information about high-performance computing, energy 

technology, and science, technology, engineering, and 

mathematics education issues. Wells began his ORNL 

career in 1990 for resident research on his Ph.D. in 

Physics from Vanderbilt University. Following a 

three-year postdoctoral fellowship at Harvard University, 

he returned to ORNL as a staff scientist in 1997 as a 

Wigner postdoctoral fellow. Jack is an accomplished 

practitioner of computational physics and has been 

supported by the Department of Energy’s Office of Basic 

Energy Sciences. Jack has authored or co-authored over 

70 scientific papers and edited one book, spanning 

nanoscience, materials science and engineering, 

nuclear and atomic physics, computational science, and 

applied mathematics. 

h Session(s): S0606 - GPU-accelerated Science on 

Titan: Tapping into the World’s Preeminent GPU 

Supercomputer to Achieve Better Science 


h S0657 - Applying for INCITE Program, Conclusions, 

Q&A (Tuesday, 17:30, Room: A2) 



Elmar Westphal 

Software Developer (Forschungszentrum Juelich) 

Elmar Westphal has been working at Forschungszentrum 

Juelich for 15 years in the group that is now PGI/JCNS-TA 

Scientific IT-Systems. His main tasks include planning the 

institute’s compute clusters and writing/porting scientific 

software for multi-core and GPU environments. His latest 

projects include the CUDA-port of the micromagnetic 

simulation software TetraMag and the creation of a 

framework of accelerator routines for GPU-assisted 

molecular dynamics simulations. 

h Session(s): S0036 - Multiparticle Collision 

Dynamics on GPUs (Tuesday, 15:00, Room: C) 

Jan-Philipp Weiss 

Junior Professor (Karlsruhe Institute of Technology) 

Jan-Philipp Weiss is a junior professor at the Karlsruhe 

Institute of Technology (KIT), Germany. He is heading the 

Computing Lab Hardware-Aware Numerics at the 

Engineering Mathematics and Computing Labs (EMCL). 

From 2008 to 2012 he was heading a Shared Research 

Group on multicore and coprocessor technologies at KIT in 

joint collaboration with the company Hewlett-Packard. 

Research of his group addresses parallel numerical 

methods and programming techniques for emerging 

multi- and manycore technologies in numerical simulation 

and scientific computing. He received a Ph.D. from 

University Karlsruhe (TH) in applied mathematics in 2006. 

h Session(s): S0289 – Fine-Grained Parallel 

Preconditioners for Fast GPU-based Solvers 


h S0291 – LAtoolbox: A Multi-platform Sparse 

Linear Algebra Toolbox 


Ian Williams 

Director of Applied Engineering (NVIDIA) 

Ian Williams is currently Director of Applied Engineering 

within NVIDIA’s Professional Solutions Group. Within the 

Applied Engineering team he has been closely involved 

in the design and development of many of NVIDIA’s 

industry-focused professional solutions and key 

technologies. In addition, the Applied Engineering team 

helps customers and partners integrate these 

technologies into their solutions. Prior to NVIDIA he 

worked for 8 years at Silicon Graphics in various 

technical roles within Application Engineering and the 

Desktop Product Group. Prior to Silicon Graphics, he 

worked at Rolls-Royce Commercial Aerospace 

developing applications to numerically simulate 

manufacturing processes. He holds a Bachelor of 

Science degree in Engineering Science and Technology 

from Loughborough University (UK) as well as a Master 

of Business Administration from Pepperdine University 

(CA, USA). He is a Chartered Mechanical Engineer with 

the Institution of Mechanical Engineers (UK) and 

throughout his career has been awarded several 

patents. For the past 10 years he has been Chairman of the 

SPEC/GPC committee, which is part of the Standard 

Performance Evaluation Corporation and is responsible for 

developing the industry-wide SPECViewperf benchmark. 







Robert Wipfel 

Fellow (Fusion-io) 

Robert Wipfel is a Fellow at Fusion-io. Prior to that, at 

Novell, Robert was an architect or engineering lead for 

various Data Center products that integrated clustering, 

virtualization, and shared storage. Robert also helped 

Unisys and Intel jointly enter the commercial parallel 

processing market. Robert is co-author of Novell’s Guide 

to Storage Area Networks and Novell Cluster Services 

and frequently speaks at Novell’s Brainshare and other 

technology conferences. Robert earned a BSc (Hons) in 

Computer Systems Engineering from the University of 

Kent at Canterbury, U.K. He holds ten patents on parallel 

processing, clustering, server and storage virtualization. 

h Session(s): S0619 – Hate to Wait? Flash Memory 

for Full-Throttle GPU Acceleration 


Emmett Witchel 

(University of Texas, Austin) 


h Session(s): S0360 – Set GPUs Free: Integrating 

a File System with CUDA Programs 

(Thursday, 09:30, Hall 1) 

Nils Woetzel 

PhD Candidate (Vanderbilt University) 

Nils Woetzel, a native German, was exposed to the Basic 

programming language in the second grade. In his 

senior year of high school, he wrote a Delphi program, 

“TitraCom”, that aided in chemical analysis experiments, 

and entered it in the German “Jugend forscht” 

high school science competition in 2001. After studying 

Chemistry at the University of Leipzig, Germany, he 

started his PhD in computational structural biology at 

Vanderbilt University in Nashville in 2005, where he 

could combine his computational and chemical skills to 

develop a novel protein structure prediction algorithm. 

h Session(s): S0346 – GPGPU Accelerated Protein 

Similarity Measures Identifying Biological 

Relevant Structure (Wednesday, 17:30, Room: N) 

h S0354 – Bcl::ChemInfo Suite Enables Machine 

Learning-Based Drug Discovery Using GPUs 


Tim Wood 

Quantitative Analyst (ING Bank nv) 

Tim Wood is a Quantitative Analyst and Developer at ING 

Bank in the Netherlands. Tim joined ING after studying 

Computational Science and Computational Finance at 

the University of Amsterdam. Since joining ING in 2009, 

Tim has played a key role in the development and 

deployment of computationally demanding risk analytics 

systems leveraging massively parallel architectures 

within the bank. 

h Session(s): S0369 - Running Risk On GPUs 


Cliff Woolley 

CUDA Developer Technology Engineer (NVIDIA) 

Cliff Woolley is a CUDA Developer Technology Engineer 

with NVIDIA Corporation. He received his Master’s degree 

in Computer Science from the University of Virginia in 

2003. He was among the earliest academic researchers to 

investigate the use of graphics processors for general 

purpose computation, having applied these early GPGPU 

ideas both to non-traditional graphics rendering 

techniques as well as to non-graphical algorithms such 

as a multigrid solver for PDEs. 

h Session(s): S0517A - Programming GPUs with 


h S0517B - Programming GPUs with OpenACC (Part 


h S0517C - Programming GPUs with OpenACC (Part 


h S0377 - C++ Data Marshalling Best Practices 

(Wednesday, 16:30, Room: L)

Rio Yokota 

Research Scientist (King Abdullah University of Science 

and Technology) 

Rio Yokota obtained his PhD in Mechanical Engineering 

from Keio University, Japan, in 2009, and was a 

postdoctoral researcher at the Department of 

Mathematics at the University of Bristol from 2009-2010, 

and also at the Mechanical Engineering Department at 

Boston University from 2010-2011. During his PhD, he 

worked on the implementation of fast multipole methods 

on special purpose machines such as MDGRAPE-3, and 

then on GPUs after CUDA was released. During his 

post-doc he has continued to work on fast multipole 

methods, and was part of the team that won the Gordon 

Bell prize for price/performance in 2009 using 760 GPUs. 

h Session(s): S0308 - Recent Trends in 

Hierarchical N-body Methods on GPUs 


Eric Young 

Manager of Developer Technology Professional and 

Consumer Applications (NVIDIA) 

Eric Young is a developer technology engineer 

working at NVIDIA, supporting developers with 

professional graphics and computer vision. 



h S0404 - Computer Vision Libraries with GPUs 


Ronald Young 

President (Multipath Corporation) 

Dr. Young received his PhD in Engineering and 

Numerical Analysis from UC Berkeley in 1972. His 

career has focused on designing matrix algebra 

algorithms which exploit all hardware features for 

achieving the highest performance possible. In 1989 Dr. 

Young founded Multipath Corporation, which develops the 

Fast Matrix Solver (FMS) software. FMS is an out-of-core 

matrix algebra package used to solve extremely large 

problems in production applications. 

h Session(s): S0032 - Teraflop GPU Acceleration Of 

Large Matrix Algebra (Thursday, 14:30, Room: C) 

Alaa Yousif 

Software Solution Architect (Dell) 

Alaa Yousif is a Principal Engineer at Dell and has spent 

the last 12 years in the area of Dell Remote Management 

Products. He is currently responsible for integrating Hadoop 

(Big Data) with HPC clusters. Alaa was also a lead 

engineer in custom solutions engineering, leading 12 

engineers in the Austin and Bangalore design centers. 

h Session(s): S0309 - Dynamically Allocating GPGPU 

to Host Nodes (servers) (Thursday, 10:30, Room: K) 

Song Yu 

(Chemical & Petroleum Department, University of Calgary) 

Song Yu is a petroleum engineering M.Sc. student who 

joined the Department of Chemical and Petroleum 

Engineering at the University of Calgary in January 2010. 

He holds a B.Sc. degree in software engineering (ISS) 

from Wuhan University (WHU) in China and an M.Sc. degree 

in computer software and theory from the State Key 

Laboratory of Software Engineering (SKLSE) of Wuhan 

University (WHU) in China. His research topic is parallel 

reservoir simulation using GPU computing: developing 

a parallel sparse linear solver package for the GPU parallel 

computing environment and integrating it into 

reservoir simulation to enhance performance for 

large-scale simulation problems. 

h Session(s): S0190 - Large-Scale Reservoir 

Simulation on GPU (Wednesday, 14:30, Room: A7) 

Fabrizio Zanella 

Systems Manager (CST of America) 

Fabrizio Zanella has been at CST of America, a 

worldwide provider of full wave electromagnetic 

software, for 6 years. His current role consists of IT 

management for North America, and customer support 

for topics including hardware, licensing and high 

performance computing solutions. Prior to joining CST, 

Fabrizio had 15 years of experience performing Signal 

Integrity characterization of high-speed digital systems. 

He has worked at various companies including EMC 

Corporation and Teradyne. 

h Session(s): S0069 - GPU Computing Advances 

in 3D Electromagnetic Simulation 


Krzysztof Zarzycki 

Senior Software Developer (IBM Poland) 

Krzysztof Zarzycki is a Senior Software Developer at IBM 

Poland, Netezza R&D Department, where he serves as 

technical lead of the CUDA Development team. His 

research covers using GPUs to accelerate various 

methods - from AI, data mining & analytics, through 

data warehouse operations, finally to solving 

bioinformatics problems. He was educated at Warsaw 

University in Poland, where he received a Master’s degree in 

Computer Science. 

h Session(s): S0376 – Dynamic Programming on 

CUDA: Finding the Most Similar DNA Sequence 


Peter Zaspel 

Research Assistant (University of Bonn) 

Peter Zaspel is a research assistant at the Institute for 

Numerical Simulation of the University of Bonn, 

Germany. He studied Computer Science and is now 

working on his PhD. His research topics are 

computational fluid dynamics, general-purpose 

computations on graphics hardware and visualization. 

h Session(s): S0044 - A Massively Parallel Two- 

Phase Solver for Incompressible Fluids on 

Multi-GPU Clusters (Thursday, 14:00, Room: N) 

Kang Zhang 

Research Scientist (GE Global Research) 

Kang Zhang is currently a research scientist at GE 

Global Research Center, New York. He obtained the Ph.D. 

and M.S.E. degrees in Electrical and Computer 

Engineering from Johns Hopkins University, in 2011 and 

2009 respectively, and the B.S. degree in physics from 

Nankai University, China, in 2007. His research interests 

include GPGPU applications, high data throughput 

imaging platforms, real-time imaging systems, and optical 

sensing & imaging. From 2009 to 2010, Kang worked as 

an ORISE Research Fellow for the U.S. Food and Drug 

Administration (FDA), where he developed optical 

metrology methods for medical device evaluation. 

h Session(s): S0141 - GPU-Accelerated Optical 

Coherence Tomography Imaging 


Kaiyong Zhao 

PhD Student (Hong Kong Baptist University) 

Kaiyong received his B.Eng. degree in Aircraft Design 

and Technology from Beijing Institute of Technology (BIT), 

Beijing, P. R. China, in 2005. After that he worked at CCUR 

for two years, then received his master’s degree at HKBU. He 

is currently a PhD student in the Department of 

Computer Science, Hong Kong Baptist University. 

h Session(s): S0281 - Accelerate a Fully Functional 

Photo Editing Software with GPU 




Hongwei Zhou 

Senior Software Development Engineer (Altair) 

Hongwei Zhou is a senior software developer. He has 

experience with sparse direct solvers and with Lanczos and 

automatic multilevel-substructuring eigenvalue solvers 

at Altair Engineering. He received a B.S. degree in 2003 

and an M.S. degree in 2006 from the Department of Mechanics, 

Peking University, China. 

h Session(s): S0225 – Speedup Altair RADIOSS 

Solvers Using NVIDIA GPU 


Jun Zhu 

Professor (Zhejiang University) 

Jun Zhu is currently the Director and a Professor within 

the Institute of Bioinformatics at Zhejiang University. 

Previously, he was Vice President at Zhejiang University 

(2005-2009). Before that, Zhu was the Dean of the 

College of Agriculture and Biotechnology at Zhejiang 

University (1999-2005). His education 

includes a Ph.D. in Statistics and Genetics from NC State, 

USA (1989). 

h Session(s): S0516 - The Advantage of GPU 

Computation for Analyzing Complex Traits 


Gernot Ziegler 

Compute Developer Technology (NVIDIA) 

Gernot Ziegler (MSc/civ.ing.) is an Austrian engineer with 

an MSc degree in Computer Science and Engineering 

from Linköping University, Sweden. He pursued his PhD 

studies at the Max-Planck-Institute for Informatics in 

Saarbrücken, Germany, where he specialized in GPU 

algorithms for computer vision and data-parallel 

algorithms for spatial data structures. As a member of 

NVIDIA’s DevTech-Compute team, Gernot now consults 

in high performance computing on graphics hardware. 

h Session(s): S0096 - Summed Area Ripmaps 


Robert Zigon 

Sr Staff Development Engineer (Beckman Coulter) 

Bob Zigon is a Sr. Staff Research Engineer and has 

worked at Beckman Coulter for 10 years. He has 

degrees in Computer Science and Mathematics from 

Purdue University. He was the architect of Kaluza, an 

NVIDIA Tesla powered analysis application for flow 

cytometry. He’s now working in particle characterization 

and analytical ultracentrifugation. His interests include 

high performance computing, numerical analysis and 

information retrieval theory. 

Session(s): S0221 - 1024 Bit Parallel Rational 

Arithmetic Operators for the GPU 


Enrico Zschau 

Lead Software Architect (SeeReal Technologies GmbH) 

Enrico Zschau received the diploma in computer science 

from Technical University Dresden, Germany, in 2004. 

Since 2000 he has been working as an assistant with the 

3D-group at Technical University Dresden. In 2002 he 

joined Dresden 3D GmbH, a spin-off from the TU 

Dresden 3D-group, which became SeeReal Technologies 

shortly after. Mr. Zschau’s activities focus on research 

and development of software solutions in the fields of 

image-processing and GPGPU-based algorithms for 

holography. He holds the position of Lead Software 

Architect and is responsible for a variety of software solutions, 

especially eye-tracking on PCs and DSPs and 

real-time holography on GPUs and FPGAs. 

Session(s): S0324 - Content Generation and 

Real-Time Hologram Computation for Holographic 

3D-Displays (Thursday, 10:00, Room: A1)

PLATINUM SPONSORS 

ASUS 

BULL 

CAPS 

Cooley LLP 

Dell 

ASUS comes from the last four letters of Pegasus, the winged horse in 

Greek mythology that represents the inspiration of art and learning. ASUS 

embodies the strength, creative spirit and purity symbolized by this regal 

and agile mythical creature, soaring to new heights of quality and innovation 

with each product it introduces to the market. 

Bull, the premier European-based global IT supplier, has made Extreme 

Computing one of its key strategic priorities. In only a few years, Bull has 

won over 150 customers in 15 countries across 3 continents. Bull has a 

proven track record of building Extreme Computing systems for prestigious 

academic and industry customers, most notably in France, Germany, UK, 

Spain, Netherlands and Brazil. Bull’s Extreme Computing solutions are 

based on bullx, a range of innovative systems designed for uncompromised 

performance, which has gained worldwide recognition. For more information 

visit: http://www.bull.com/extreme-computing 

CAPS is a major supplier of solutions dedicated to application migration and 

deployment on manycore processors. CAPS global solution for manycore 

leads the developer to performance by providing top-of-the-range 

technology (HMPP hybrid compiler and wizard), code porting methodology 

and ecosystem (third-party software tools, expertise, training…). Its directive-based 

& multi-target HMPP compiler enables developers to safely move to 

a hybrid CPU / GPU model and quickly get performance by leveraging the 

computing power of stream processors without the pain associated with GPU 

programming. HMPP is offered within CAPS DevDeck package: an 

ALL-IN-ONE multi-level suite for manycore application definition, porting 

and optimization with tools (HMPP compiler, development tools such as 

HMPP Wizard, debugging & profiling software and scientific libraries), 

methodology and resources (tutorials, use cases…). 

Cooley LLP is a global law firm for the converging worlds of high technology, 

high finance and high-stakes litigation. We are counselors, strategists and 

advocates for the foremost private and public companies and investors in all 

major technology fields. Our Emerging Companies practice has a long 

tradition of representing emerging and high-growth companies worldwide. 

The GPU space is an exciting growth area in the technology arena, and 

Cooley has been at the forefront, advising both established and start-up 

companies on the issues facing businesses in this industry. Our attorneys’ 

extensive experience in intellectual property protection and business 

counseling along with the Firm’s deep roots in the technology sector give us 

a unique perspective on the issues facing our clients. Cooley’s team consists 

of experienced counselors and litigators that are equally skilled at 

representing and advising clients on the protection and commercialization of 

their intellectual property in a wide range of areas, including copyright, 

trademark, patent, technology licensing, privacy, electronic security and 

electronic commerce. We are dedicated to offering comprehensive and 

creative legal support, utilizing the full resources of the Firm. 

For more than 26 years, Dell has played a critical role in transforming 

computing, enabling more affordable and more pervasive access to technology 

around the world. The company’s technology solutions improve customers’ 

productivity, enhance their lives and meet their distinct needs. 

Headquartered in Round Rock, Texas, Dell serves customers ranging from the 

world’s largest and most demanding businesses and public-sector 

organizations, to small and medium businesses, and consumers worldwide. 

Dell is recognized for its ability to provide customers personalized, built-to-order 

technology through direct, online and retail channels. Nearly 80 percent of 

Dell’s $53 billion in revenue last year was driven by enterprise products, 

services and solutions it delivers to businesses and organizations. Dell’s nearly 

100,000 team members worldwide are deeply committed to corporate 

responsibility. The company ranks among Working Mother Magazine’s 100 Best 

Companies and first among Newsweek’s Greenest Companies in America. 

At Dell, we promote an environment that thrives on innovation. To deliver 

effective solutions that meet customer challenges, Dell employs an open, 

standards-based approach to technology innovation. Each year, Dell honors 

the outstanding inventors among its employees. 

SPONSORS AND EXHIBITORS 

PLATINUM SPONSORS, continued 

HP 

IBM 

Lenovo 

Los Alamos National Laboratory 

Microsoft Corporation 

HP creates new possibilities for technology to have a meaningful impact on 

people, businesses, governments and society. The world’s largest technology 

company, HP brings together a portfolio that spans printing, personal 

computing, software, services and IT infrastructure to solve customer problems. 

More information about HP (NYSE: HPQ) is available at http://www.hp.com. 

IBM is involved in more than 150 smart grid engagements around the world, 

in both mature and emerging markets. IBM is the founding member of the 

Global Intelligent Utility Network Coalition, a unique collaboration of utilities 

from around the globe who are working to accelerate the use of smart grid 

technologies and move the industry forward through its most challenging 

transformation. More about IBM’s vision to bring a new level of intelligence 

to how the world works—how every person, business, organization, 

government, natural system, and man-made system interacts, can be found 

here: http://www.ibm.com/smarterplanet. 

Lenovo is one of the world’s largest makers of personal computers and 

makes the world’s most innovative PCs, including the renowned ThinkPad® 

notebook as well as products carrying the ThinkCentre®, ThinkStation®, 

ThinkServer®, IdeaCentre®, and IdeaPad® sub-brands. 

Today, Lenovo is a global corporation with significant operations on six 

continents, operating in more than 60 countries and selling products in 

160. Everyone at Lenovo takes great pride in our ability to attract top talent 

from diverse backgrounds and from around the world. We view our 

differences and diversity as a source of strength in building a collaborative 

culture that helps us achieve our goals. We have no world headquarters and, 

instead, have put in place a distributed management structure that places 

operational hubs in centers of excellence around the world integrating this 

talented, diverse group into a cohesive Next Generation company. 

Los Alamos National Laboratory, a multidisciplinary research institution 

engaged in strategic science on behalf of national security, is operated by 

Los Alamos National Security, LLC, a team composed of Bechtel National, 

the University of California, The Babcock & Wilcox Company, and URS for the 

Department of Energy’s National Nuclear Security Administration. 

Los Alamos enhances national security by ensuring the safety and reliability 

of the U.S. nuclear stockpile, developing technologies to reduce threats from 

weapons of mass destruction, and solving problems related to energy, 

environment, infrastructure, health, and global security concerns. 

Microsoft Visual Studio® development system is an integrated environment 

that helps simplify the entire development process from design to 

deployment. Customers can unleash their creativity with powerful 

prototyping, modeling, and design tools that bring a vision to life. Work 

within a personalized environment, and target a growing number of 

platforms. With integrated testing and debugging tools that enable delivery 

of high-quality solutions, developers and testers can work more efficiently.

PNY 

Supermicro 

SYNNEX Corporation 

TSMC 

Established in 1985, PNY Technologies®, Inc. is the authorized NVIDIA® 

Quadro® channel partner for North America, Latin America and Europe. PNY 

provides unsurpassed service and commitment to its professional graphics 

customers, offering a 3-year warranty, pre- and post-sales support, dedicated 

Quadro Field Application engineers and direct tech support hot lines. PNY 

recently introduced a new line of high performance Solid State Drives, the Prevail 

Series SSD, designed specifically for the professional and enterprise 

markets. The company also offers a full line of commercial and consumer 

graphics cards, computer memory upgrade modules, flash memory cards, 

USB flash drives, and HDMI cables. Headquartered in Parsippany, NJ, PNY 

maintains facilities in North America, Europe, Asia, and Latin America. For 

more information, please visit http://www.pny.com. 

Supermicro, the leader in server technology innovation and green computing, 

provides customers around the world with application-optimized server, 

workstation, blade, storage and GPU systems. Based on its advanced Server 

Building Block Solutions, Supermicro offers the most optimized selection for IT, 

datacenter and HPC deployments. The company’s system architecture 

innovations include Twin server, double-sided storage and SuperBlade® product 

families. Offering the most comprehensive product lines in the industry, 

Supermicro delivers energy-efficient solutions with unmatched performance 

and value. Founded in 1993, Supermicro is headquartered in Silicon Valley with 

worldwide operations and manufacturing centers in Europe and Asia. For more 

information, visit www.supermicro.com. 

SYNNEX Corporation, a Fortune 300 corporation, is a leading business 

process services company, partnering with resellers and original equipment 

manufacturers in multiple regions around the world. The Company provides 

services in IT distribution, supply chain management, contract assembly and 

global business services. Founded in 1980, SYNNEX employs more than 

10,000 associates worldwide and operates in the United States, Canada, 

China, Japan, Mexico, the Philippines and the United Kingdom. Our value-added 

service model streamlines business processes to help customers 

across the globe lower their costs and create greater efficiencies. We 

provide a variety of professional and marketing services, including: demand 

generation; education and training; pre- and post-sale technical support; 

end-user enablement; server assessment; design and integration; recycling 

and trade-in; contract design and assembly; and IT resource planning. 

TSMC is the world’s largest dedicated semiconductor foundry, providing the 

industry’s leading process technology and the foundry segment’s largest 

portfolio of process-proven libraries, IPs, design tools and reference flows. 

The Company’s managed capacity in 2011 totaled 13.22 million (8-inch 

equivalent) wafers, including capacity from three advanced 12-inch 

GIGAFAB facilities, four eight-inch fabs, one six-inch fab, as well as 

TSMC’s wholly owned subsidiaries, WaferTech and TSMC China, and its joint 

venture fab, SSMC. TSMC is the first foundry to provide 28nm production 

capabilities. Its corporate headquarters are in Hsinchu, Taiwan. For more 

information about TSMC please visit http://www.tsmc.com. 



GOLD SPONSORS 

Amazon Web Services 

Fusion-io 

NextIO 

SGI 

SILVER SPONSORS 

Acceleware Corporation 

Adobe 

Appro International, Inc. 

Built upon the same world-class technology that powers Amazon.com, 

Amazon Web Services (AWS) provides businesses with a secure, reliable, 

easy-to-scale, low-cost computing platform “in the cloud.” Companies of all 

sizes, from all around the globe use AWS to build applications, store data, 

manage business processes, and more. Learn more: http://aws.amazon.com 

The Fusion-io storage memory platform significantly improves processing 

capabilities within a data center by moving active data closer to the CPU 

where it is processed. Called shared data decentralization, this reduces 

latency while increasing data center efficiency. Fusion’s software and 

hardware solutions leverage non-volatile memory for enterprise-grade 

performance, reliability and manageability. 

NextIO was founded based upon the vision of creating shared server I/O 

resource pools. Today, NextIO simplifies complex server I/O and enables 

any-to-any connectivity among a wide variety of data center resources. With 

the NextIO architecture, server I/O is consolidated at the top of the rack and may 

be shared and dynamically allocated across servers within the rack. NextIO 

currently offers a complete portfolio of I/O consolidation and I/O 

virtualization products that are easily managed, highly flexible, and provide 

customers with greater operational efficiencies that reduce CapEx and OpEx 

costs, and deliver the utmost in data center flexibility and business agility, 

which drives productivity and economic efficiencies. 

SGI is the trusted leader in technical computing. The company develops, 

markets and sells a broad line of mid-range and high-end scale-out and 

scale-up servers plus data storage solutions and differentiating software. 

SGI solutions are used by the scientific, technical and business communities 

to solve challenging, data-intensive compute and data management 

problems requiring large amounts of computing power and fast, efficient 

data movement both within the computing system and to and from large-scale 

data storage installations. 

Acceleware delivers industry leading CUDA training and HPC consulting 

services to organisations looking to unlock the parallel processing potential of 

the GPU. Acceleware’s software solutions include GPU accelerated Seismic 

Migration libraries for the Oil & Gas industry and Electromagnetic solvers for 

CAE markets. At Acceleware the goal is always the same – Go Faster. 

Whether it’s a smartphone or tablet app, a game, a video, a digital magazine, 

a website, or an online experience, chances are that it was touched by Adobe 

technology. Our tools and services enable our customers to create 

groundbreaking digital content, deploy it across media and devices, and then 

continually measure and optimize it based on user data. By providing 

complete solutions that combine digital media creation with data-driven 

marketing, we help businesses improve their communications, strengthen 

their brands, and ultimately achieve greater business success. 

Appro is a leading developer of innovative supercomputing solutions and is 

positioned to support High Performance Computing markets. Appro 

accelerates technical applications and business results through outstanding 

price/performance, power efficiency and fast time-to-market solutions 

based on the latest open standards technologies. Appro enables scientists 

and engineers to use data-intensive, capacity, capability and hybrid 

computing for scientific research, data modeling, engineering simulations, 

and seismic visualization. To learn more, visit www.appro.com

Deloitte 

ELEKS 

GE Intelligent Platforms 

Morgan Stanley 

SK Hynix 

SVB 

In the United States, Deloitte LLP and its subsidiaries have 45,000 

professionals with a single focus: serving our clients and helping them solve 

their toughest problems. We work in four key business areas — audit, 

financial advisory, tax and consulting — but our real strength comes from 

combining the talents of those groups to address clients’ needs. Fortune and 

BusinessWeek consistently rank our organization among the best places to 

work, which is good news for our talent and our clients alike. When the best 

people tackle the most compelling challenges, everyone wins. 

Multi-year expertise in building complex science-intensive solutions 

including HPC has determined our value proposition of delivering 

sophisticated custom computing systems for power, finance, automation, 

entertainment and other industries. ELEKS’ engineering culture, combined 

with aspiration for technological excellence and solid project management 

skills, ensures superior business value we deliver to our highly valued 

customers. For more information about ELEKS’ software development, 

localization and testing services go to www.eleks.com. 

GE Intelligent Platforms is a leading manufacturer of rugged COTS computer 

boards and systems for military programs. As a partner to NVIDIA for 

Embedded Applications, GE brings GPGPU technology into a wide range of 

defense-related programs; it can now be used in ground tanks, fighter 

aircraft, military helicopters, and UAVs for Radar, ISR, DSP, Sensor 

Processing, Imaging and many other military applications. 

Morgan Stanley is a leading global financial services firm providing a wide 

range of investment banking, securities, investment management and 

wealth management services. The Firm’s employees serve clients worldwide 

including corporations, governments, institutions and individuals from more 

than 1,300 offices in 43 countries. For further information about Morgan 

Stanley, please visit www.morganstanley.com. 

SK Hynix designs, manufactures and markets a wide variety of DRAM and 

NAND Flash memories and CMOS Image Sensors. 

SK Hynix is the new corporate name of Hynix Semiconductor Inc. following 

the merger with SK Telecom on February 14, 2012. In synergy with SK 

Telecom, SK Hynix expects to enhance its competitiveness in 

semiconductors, and expand into new global markets. 

Silicon Valley Bank is the premier commercial bank for companies in the 

technology, life science, cleantech, venture capital, private equity and 

premium wine industries. SVB provides a comprehensive suite of financing 

solutions, treasury management, corporate investment and international 

banking services to its clients worldwide. Through its focus on specialized 

markets and extensive knowledge of the people and business issues driving 

them, Silicon Valley Bank provides a level of service and partnership that 

measurably impacts its clients’ success. Founded in 1983 and headquartered 

in Santa Clara, Calif., the company serves clients around the world through 

26 U.S. offices and international operations in China, India, Israel and the 

United Kingdom. Silicon Valley Bank is a member of global financial services 

firm SVB Financial Group (Nasdaq: SIVB), with SVB Analytics, SVB Capital 

and SVB Private Bank. More information on the company can be found at 

www.svb.com. 



PLATINUM MEDIA PARTNERS 

Dow Jones & Company 

Dr. Dobb’s 

HPCwire 

insideHPC 

mergermarket 

GOLD MEDIA PARTNERS 

HPC in the Cloud 

Dow Jones Private Equity & Venture Capital is a division of Dow Jones & Co., 

a News Corporation company. Dow Jones Private Equity & Venture Capital 

offers integrated content solutions for deal-sourcing, due diligence and 

fundraising needs of today’s venture capital and private equity investors, 

corporate investors, advisors, and portfolio companies. Core products 

include the deal database VentureSource and the fundraising database LP 

Source, as well as the highly-respected publications Private Equity Analyst, 

VentureWire, Daily Bankruptcy Review and LBO Wire. 

Dr. Dobb’s is the most respected development-focused brand helping 

application and software development professionals make the right 

decisions for their businesses. Dr. Dobb’s provides deep content that 

challenges developers to think of new and dynamic ways to create business-focused 

applications while balancing “what can be developed” with practical, 

real-world analysis. http://drdobbs.com 

HPCwire is the leading publication for news and information on high 

performance and data-intensive computing for business and technology 

professionals. HPCwire is the #1 resource selected by academic, government, 

industrial and vendor communities who are interested in computationally intensive 

computing, including systems, software, applications, middleware, 

networking and storage. Subscribe at: www.hpcwire.com. 

insideHPC is the web’s premier high performance computing (HPC) short 

format news site. insideHPC distills news and events, and presents them in 

bite-sized nuggets of helpfulness as a resource for supercomputing 

professionals. insideHPC, along with its sister publication, inside-BigData, 

pumps out more than 1.2 million monthly page views to a growing 

community of readers that now exceeds 61,000 unique monthly visitors. 

mergermarket, part of The Mergermarket Group, is an unparalleled, 

independent M&A intelligence tool used by the world’s foremost financial 

institutions to originate deals. It provides proprietary intelligence on 

potential deal flow, potential mandates and valuations via the world’s largest 

group of M&A journalists and analysts who have direct access to the most 

senior decision-makers and corporates. 

HPC in the Cloud is dedicated to covering data-intensive cloud computing 

in science, industry and the data center. The publication informs 

technology decision-makers and stakeholders in the high performance 

computing industry about developments at the point where high 

performance and cloud computing intersect. Subscribe now at: 

http://www.hpcinthecloud.com/xs/register.

EXHIBITING COMPANIES 

3dmx 

AccelerEyes LLC 

ACE Computers 

Advantest 

Allinea Software 

AMAX 

Aspen Systems 

BioDigital 

BOXX Technologies, Inc. 


Since 2003, 3dmx has been creating extraordinary 3D animation, 

stereoscopic 3D, visual effects, visualizations, live action, stop motion 

and video games for the medical, technology and entertainment 

industries. Whether you need to present a groundbreaking invention, 

provide user tutorials for specialized machinery and processes, produce 

training material or architectural walkthroughs, or prepare an 

appealing set of art for marketing campaigns, 3dmx is able to do it 

for you, on time and within budget. 

AccelerEyes develops and markets fast, simple GPU software 

libraries. Today, AccelerEyes delivers products which are used to 

accelerate C, C++, Fortran, Python, and MATLAB® codes on CUDA 

and OpenCL GPUs. 

Founded in 1983, Ace Computers is a respected systems integrator 

focused on custom requirements and regularly works with major 

Universities, Federal Labs, and Corporate clients. We hold WSCA 

and GSA Prime contracts in addition to multiple GWACs. Ace is 

ISO9001:2008 Certified and is well associated with NVIDIA, Intel 

and AMD. 

A world-class technology company, Advantest is the leading 

producer of automatic test equipment (ATE) for the semiconductor 

industry and a premier manufacturer of measuring instruments. Its 

leading-edge products are integrated into the most advanced 

semiconductor production lines in the world. Founded in Tokyo in 

1954, Advantest now operates in 21 countries worldwide. 

www.advantest.co.jp 

We’re recognized as the leading vendor of tools for parallel software 

development and High Performance Computing (HPC). One of the 

fastest growing companies in HPC, we were recently honored as a Red 

Herring Top 100 company. We have offices in the US and the UK, as 

well as a network of resellers and partners in most parts of the world. 

AMAX, pioneer of the Personal Supercomputer, is a leading 

technology provider with over 30 years of solidified partnerships 

with technology innovators such as NVIDIA. AMAX excels at 

delivering unique and customized HPC cluster, server and storage 

solutions that continually push the limits of innovation with 

maximum performance and exceptional efficiency. 

Aspen Systems, founded in 1982, is an established, privately-held, 

two time Inc. 500 corporation that designs, manufactures, and 

services computing products including high-performance compute 

clusters, systems software, storage/file systems, and visualization. 

Aspen Systems places its highest priority on first class technical 

support and the creation of fully customized products that always 

incorporate the latest technologies. This allows our customers to 

enjoy the highest performing solutions at very competitive prices. 

BioDigital is the leading developer of state of the art biomedical 

visualization. BioDigital recently launched The BioDigital Human 

- a 3D visualization platform with a revolutionary approach for 

communicating health and medical information with interactive 

tools for exploring human anatomy, physiology and conditions. 

BOXX is the leading innovator of high-performance workstations 

and rendering systems for product design, engineering, visual 

effects, animation, architectural visualization, and more. For over 

15 years, we’ve combined record-setting performance, speed, and 

reliability with unparalleled industry knowledge to become the 

trusted choice for creative professionals worldwide. 



Bright Computing 

Cirrascale 

Colfax International 

Concurrent 

Creative Consultants 

Cyberpower 

Digital Storm 


Bright Computing, a leader in integrated cluster management 

software, provides seamless management of NVIDIA GPU and 

hybrid clusters. Bright is a single solution for provisioning, 

scheduling, monitoring and managing clusters. Every Bright-managed 

cluster is also cloud-ready, enabling users to extend their 

system into AWS EC2 for access to additional CPUs and NVIDIA 

GPUs, with a few mouse clicks. All of this capability is accessed via 

its intuitive GUI or using Bright’s powerful cluster management 

shell. Bright Computing is headquartered in San Jose, CA. 

http://www.brightcomputing.com 

Cirrascale Corporation is a premier provider of advanced GP/GPU 

blade-based workstation and server solutions for conventional and 

containerized data centers that are scalable, reliable and offer best 

price/performance value in the industry. Cirrascale leverages its 

patented Vertical Cooling Technology to provide the industry’s most 

energy-efficient standards-based platforms with the lowest possible 

total cost of ownership in the densest form factor. To learn more 

about Cirrascale and its unique GP/GPU solutions, please visit 

http://www.cirrascale.com or call (888) 942-3800. 

Buy it from a trusted expert. Colfax provides the most comprehensive 

range of innovative, cutting-edge and highly customized GPU 

solutions. With outstanding price/performance and technical 

support, Colfax is a leading choice of scientists and engineers for 

GPU-accelerated data modeling, simulation and real-time 

visualization solutions. Visit www.colfax-intl.com for more details. 

Concurrent Computer Corporation (NASDAQ:CCUR) is a worldwide 

leader in real-time Linux® computing technology including real-time 

operating systems; advanced debugging and analysis tools; 

simulation tools; and fully-integrated multiprocessing/GPU computer 

platforms. Concurrent focuses on hardware-in-the-loop and 

man-in-the-loop simulation, data acquisition and industrial systems. 

For more information, please visit www.real-time.ccur.com. 

Creative Consultants demonstrates a Multi-Projector 

Semi-Immersive Virtual Reality (VR) environment with GPU enabled 

warping and blending. Our parallel code development appliance 

Stelletto computes hundreds of thousands of threads, in real time, 

driving the VR display; thus creating an interactive HPC 

demonstration with live scaling of calculations for 250,000 particles. 

CyberPower, Inc. is one of the nation’s leading computer system 

manufacturers. As published in the Los Angeles Business Journal 

in 2003, we were the fastest growing private company in Los 

Angeles. With vision, commitment, and steadfast determination, we 

manufacture and distribute various customized high-end gaming 

machines, notebook systems and high performance workstations 

to meet the unique needs for gamers, businesses, government 

agencies, educational institutions and other end-users. 

Founded in 2002, Digital Storm has rapidly emerged as the 

predominant name in system integration. With expertise in 

workstation computers, Digital Storm’s mission is to deliver its 

customers bleeding edge technology with direct support. As a 

validation of Digital Storm’s success, its systems have received the 

industry’s most prestigious awards.

EM Photonics 

Exxact Corporation 

eyeSight Mobile Technologies Ltd. 

Fuzzy Logix 

GraphStream Incorporated 

Green Revolution Cooling 

Immersive Media 

JMR Electronics, Inc. 

MathWorks 



EM Photonics’ core competency lies in its strength with using GPUs, 

FPGAs, and other parallel computing platforms to accelerate extremely 

complex computational applications. We have developed products in the 

areas of image processing, linear algebra, and scientific computing and 

worked with clients in fields from finance to defense to life sciences. 

Founded in 1992, Exxact Corporation is both a value-added 

distributor of professional workstation graphics cards and a 

manufacturer of solutions for visualization and compute-intensive 

applications. In addition, Exxact offers software and services to 

develop, port, maintain, and deploy applications for GPU computing. 

eyeSight’s Touch Free technology provides an enhanced user 

experience, allowing users to easily and intuitively control a variety of devices 

using simple hand gestures. eyeSight’s Natural User Interface 

solution utilizes the device’s standard 2D camera, along with advanced 

real-time image processing and machine vision algorithms, to track 

the user’s hand gestures and convert them into actions. 

Fuzzy Logix is the leading provider of in-database analytics software 

and GPU-based analytics solutions. Our GPU Appliance, TANAY, 

makes accessing the power of GPU technology easy and includes a 

library of over 300 analytic functions that can be invoked from DLLs 

or Shared Objects. Additional Information: http://www.fuzzl.com 

GraphStream is a supplier of advanced scalable systems for data 

networking, processing, and storage. These systems are custom-configured 

to meet specific application requirements with superior 

simplicity, reliability, scalability, and efficiency. Since 2003, 

GraphStream has worked together with PNY and NVIDIA to deliver 

some of the world’s most powerful GPU-accelerated systems. 

Green Revolution Cooling (GRC) provides the highest performance, 

lowest cost-per-Watt cooling system available today for data centers. 

The CarnotJet system submerges fanless OEM servers into a 

managed dielectric fluid environment, reducing cooling energy by 

95% while providing powerful and continuous heat removal for even 

the highest density servers. 

Immersive Media is the pioneer and leading world provider of 360°,

full motion, interactive video. Our immersive 360° video content is

delivered via internet to PC, iPad or mobile device. Immersive Media

provides the enabling technologies for interactive video to record,

process, live stream and deliver images from our own or other wide

field cameras, with a patent portfolio covering key discoveries and 

capabilities of interactive and immersive video. 

JMR ELECTRONICS INC. is a 30-year established ISO 9001 certified 

design, development and manufacturing resource for high 

performance computing and storage systems based in Chatsworth, 

CA. JMR’s award-winning BlueStor and SilverStor systems are 

widely used in broadcast, digital intermediate, geophysical survey, 

post-production and scientific applications. 

Over one million people around the world use MATLAB for technical 

computing. They rely on MATLAB to help them develop cancer 

therapies, search for new sources of energy, make our cars safer 

and more fuel efficient, and explore outer space. By combining a 

powerful numeric engine and technical programming environment 

with interactive exploration and visualization tools, MATLAB has 

become the language of technical computing. For more 

information, visit www.mathworks.com 



SPONSORS AND 

EXHIBITORS 

MBA Sciences 

Mellanox Technologies 

Mentor Graphics Corp. 

Mersive 

Microway Inc. 

migenius 

Morgan Kaufmann 

MulticoreWare Inc. 

Deliver on the promise of Data and Graph Analytics. MBA Sciences 

enables engineers and scientists to rapidly prototype, analyze and 

deploy robust parallel solutions across heterogeneous computing 

resources spanning servers, cores and GPUs from either data 

centers or public clouds. 

Mellanox Technologies (NASDAQ: MLNX, TASE: MLNX) is a leading 

supplier of end-to-end InfiniBand and Ethernet connectivity 

solutions and services for servers and storage. Mellanox products 

optimize data center performance and deliver industry-leading 

bandwidth, scalability, power conservation and cost-effectiveness 

while converging multiple legacy network technologies into one 

future-proof architecture. www.mellanox.com 

The Mentor Graphics® Embedded Software Division comprises the

Mentor® Embedded family of products and services, including

embedded software intellectual property (IP), tools, and professional 

consultant services to help embedded developers and silicon 

partners optimize their products for design and cost efficiency. The 

Mentor Embedded team continues to lead the industry with 

involvement in the open source community, with Inflexion® 2D and 3D

UI development, Sourcery open source tools, and Nucleus® RTOS

solutions. More information on Mentor Embedded products and 

services can be found at www.mentor.com/embedded 

Since it was founded in 2006, Mersive has revolutionized high 

performance display setup and maintenance enabling a new class of 

displays. Mersive’s Sol software automatically aligns multiple 

commodity projectors into one seamless image of extraordinary 

quality and resolution without the expense of specialized hardware 

and services. For more information, visit www.mersive.com 

Since 1982, Microway has earned an international reputation for 

building screaming fast HPC clusters, servers, and 

WhisperStations. Since 2007, these have included Tesla GPUs. 

Utilizing multi-core CPUs, high-efficiency power, robust designs 

and excellent cooling, Microway’s GPU clusters deliver more 

TFLOPs with fewer watts. Our unique Tesla systems offer full PCI-E 

Gen3 support and optional FDR InfiniBand. 

The migenius mission is to bring software and web services to the 

market that enable ‘live 3D for all’ for better and much faster 

decision making in design and marketing. Leveraging the power of 

the cloud, GPU and NVIDIA iray, migenius provides platforms and 

applications to make this a reality. 

Morgan Kaufmann delivers the knowledge of experts to the 

computing community. Through superior print and digital content, 

our authors aim to educate our readers and inspire innovation. 

MulticoreWare, Inc. develops tools and software solutions for 

homogeneous and heterogeneous architectures for profiling, 

optimization and portability. With significant expertise in GPU and 

multicore CPU programming models and in OpenCL, the company 

has delivered tools and software solutions in architectures such as 

OpenMP and CUDA to high-performance applications including 

video and image processing.

NeST/SFO Technologies 

Numecent 

Numira Biosciences 

Patriot Technologies 

PEER 1 Hosting 

Penguin Computing 

PGI 

SFO Technologies, a NeST Group company, offers end-to-end 

engineering solutions to OEMs in Healthcare, Industrial, 

Communications and Transportation verticals. Services include 

hardware and software design, embedded product engineering, 

application development, prototyping, testing and manufacturing. 

An early adopter of GPGPU, and a CUDA Design Partner of NVIDIA, 

NeST specializes in GPU computing and 3D Graphics solutions, 

leveraging a highly skilled team and a streamlined process to 

deliver industry leading speedup and optimization. 

Numecent (www.numecent.com) is a start-up which came out of 

stealth with a bang in March 2012 and is the inventor of 

‘cloudpaging’. This patented technology enables friction-free digital 

delivery of native software and other non-linear assets through 

virtualization. One of the benefits of cloudpaging is that it can 

reduce the network footprint of digital downloads between 20x and 

100x and execute them natively, at full speed, without actually 

requiring installation. Once cloudpaged, applications can even run 

off-line and always under license control. 

Numira Biosciences is a leading provider of specialty contract 

research services for preclinical drug and device development. 

Numira’s customers include the top biopharmaceutical companies 

and academic research institutions. Through its next-generation 

study portal, Numira provides its customers with interactive tools 

for accessing, exploring, and communicating about their preclinical 

study data. 

Patriot’s Manufacturing and Logistics Services enables software 

developers, application users and solution providers to optimize their 

software applications on a reliable, branded and customized hardware 

platform. By choosing Patriot, customers can leverage an appliance-based

model with minimal investment and realize the benefits of 

faster time-to-market, increased profitability and business growth. 

Two obsessions – Ping & People – have made us one of the world’s 

leading hosting providers. Our proprietary 10Gbps FastFiber Network 

and 18 datacenters connect our customers to the world. And our 

FirstCall Promise supports over 10,000 businesses 24x7x365. The first 

large-scale GPU Cloud is just one of our hosting innovations. 

For well over a decade Penguin Computing has been delivering 

integrated, Linux based solutions for the enterprise and HPC space. 

With Linux expertise that is unmatched in the industry Penguin 

Computing offers an end-to-end portfolio of products that range 

from Linux servers and workstations to integrated, turn-key HPC 

clusters and cluster management software. 

The Portland Group® is a premier supplier of software compilers

and development tools for parallel computing. PGI® offers high

performance scalar and parallel Fortran, C and C++ compilers and 

tools for systems based on 64-bit x86 processors from Intel and 

AMD, and NVIDIA CUDA-enabled GPUs running under Linux, 

MacOS and Windows operating systems. 



Polywell 

PQ Labs, Inc 

Prefixa 

Ramtron International Corporation

Raytrix GmbH 

Reservoir Labs 

RTT 


Scalable Display Technology 

Polywell, established in 1987, is a manufacturer of high quality 

computer products. Its lineup ranges from industrial embedded 

PCs and storage solutions to high-performance workstations and 

high-end servers. Polywell has been serving the needs of various 

commercial and government entities with systems for CAD/CAM, 

animation, content creation, and for data centers. Polywell also 

offers OEM/ODM services for various vertical markets, such as 

Digital Signage, Kiosk, POS, Surveillance, IPTV, entertainment, 

gaming, medical equipment, network appliance and IP Phone. 

Established in Silicon Valley, PQ Labs, Inc. is a leading worldwide

provider of multi-touch solutions, offering revolutionary hardware

and software that eliminate the need for a keyboard and mouse in

future computers. PQ Labs’ Multi-Touch G³ enables people to

interact with computers directly using just fingers and gestures.

The company’s key technology improvement is enabling a next

generation of natural user interfaces to be widely adopted in the

computer industry. 

Prefixa develops 3D solutions for data capture, modeling and

rendering, accelerated with NVIDIA technology. Our core technology is a

3D Photorealistic Render Engine natively implemented in NVIDIA

CUDA, and scalable to multi-GPU, multi-CPU nodes. We are

looking for key partners to scale our solution to the cloud, and build 

business around our platform. 

Ramtron International Corporation, headquartered in Colorado 

Springs, Colorado, is a fabless semiconductor company that 

designs, develops and markets specialized semiconductor memory 

and integrated semiconductor solutions used in a wide range of 

product applications and markets worldwide. For more information, 

visit www.ramtron.com. 

Raytrix develops and markets single-lens 3D video cameras based 

on their patented high resolution light field technology, offering 

solutions for Particle Image Velocimetry (PIV), optical inspection, 

face capturing, microscopy – as well as IP for consumer products 

(mobile phones). 

Privately owned and in business since 1990, Reservoir Labs 

specializes in advanced compiler, network and reasoning 

technologies with an emphasis on mapping innovative algorithms 

to high performance and embedded architectures. We deliver 

cutting-edge technology products, customized solutions and 

advanced R&D services to our commercial and government clients. 

RTT stands for creative and fascinating 3D visualization solutions, 

which bring products to life in realtime and portray them in a 

natural and realistic environment. Our RTT Virtual Prototyping and 

RTT Virtual Marketing products and services combine software, 

support and customized strategic solutions, allowing us to turn 

dreams into reality. 

Scalable Display Technologies is a global leader providing software 

tools to construct and manage ultra high-resolution displays. 

Scalable’s software is used by the military and Global 1000 

accounts to enhance productivity through higher resolution and 

increased visual realism of displays. Scalable’s products are 

spawning a new class of displays called “multi-megapixel displays”.

SECO 

Seneca 

Splashtop Inc 

Terascala, Inc. 

Themis Computer 

TunaCode 

TYAN 

Seco, an international leader in embedded electronic solutions,

has shown over its 30 years the capability to adapt its

know-how to meet challenging new customer needs, guiding

customers to its most innovative solutions. Collaborations with

leading scientific universities and partnerships with leading companies

worldwide have helped transform Seco into an

international company that has established itself in the market by

taking on the new challenges of every day.

Seneca is a premier U.S.-based custom system manufacturer and 

value-added technology distributor with over 30 years of experience. 

As a designer and manufacturer of High Performance Computing 

Clusters, Seneca supports academic, lab, government, and defense 

researchers across the nation. Our HPC practice includes solutions for 

compute clusters, NVIDIA GPGPU platforms, technical computing 

workstations, storage systems, and management software. 

Splashtop aspires to touch people’s lives by delivering the best-in-class

remote desktop experience - bridging tablets, phones, 

computers and TVs. Splashtop technology empowers consumer 

and business users with high-performance, secure, interactive 

access to their favorite applications, media content and files 

anytime, anywhere. Splashtop is headquartered in San Jose with 

offices in Beijing, Hangzhou, Shanghai, Taipei and Tokyo. For more 

information, visit http://www.splashtop.com. 

Terascala’s high throughput storage solutions make big data fast. 

With Terascala, organizations transition from storing and sifting their 

data to leveraging that data to drive applications. Combining a

parallel file system with extensive analysis and optimization, Terascala appliances

enable rapid analysis of big data sets using large server installations.

Themis combines industry leadership, high-performance 

computing, and advanced thermal and mechanical design 

techniques to deliver reliable, rugged standards-based and custom 

embedded computing solutions. From small form factor computers 

to large blade servers, Themis is committed to building products 

that achieve a superior balance between standard commercial

technology and ruggedness to keep mission-critical applications 

available in the most demanding environments. Our diverse product 

portfolio includes: board-level computers, rack mounted servers, 

bladed server systems, mission and payload systems, small form 

factors, and storage appliances. 

TunaCode delivers accelerated computing solutions making 

innovative use of multi-core and manycore processors. We develop 

and market CUVILib which offers GPU-accelerated Vision and 

Imaging functionality with plug-and-play ease of use resulting in 

instant speedups of 10X. With over 1000 active users and 

commercial deployments in Medical Imaging, Industrial/Defense 

Imaging and Entertainment domains, CUVILib offers a cost-effective

way to achieve real-time performance in Imaging applications. 

PNY and TYAN have established a new EMEA partnership to offer a 

wide range of NVIDIA GPU-based computing platforms designed for 

High Performance Computing (HPC) and massive parallel computing 

environments. As a companion processor to the CPU in a server,

NVIDIA TESLA GPUs accelerate HPC applications by up to 10x. 



Ubitus Inc. 

USEFULPROGRESS 

WILD Systems (HPC Project) 

Wolfram Research, Inc. 

Wurth Electronics Midcom 

Zoobe 


Ubitus Inc., the technology leader in deploying Cloud-enabled rich 

media services, offers innovative cloud computing solutions for 

device manufacturers, wired/wireless communication service 

providers, telecommunication operators and digital content 

developers. Founded in 2007 and headquartered in Taipei, Taiwan, 

the company now has 150 employees and 4 offices in Tokyo, Beijing, 

Guangzhou and Seoul. 

Developments in computer graphics enable huge progress in the

knowledge of Life and Matter. In medical science, CT scanners

allow the whole body to be investigated with transparency. A very

important step in data analysis consists of converting signals (X-ray, MR,

US) into digital data that can be processed by computers.

UsefulProgress develops new software strategies based on

computer graphics for high-performance visualisation.

Wild Systems is a recognized expert in software performance 

optimization. At Wild Systems, we combine know how and tools for 

automatic code parallelization. This allows users to run their

optimized applications on hybrid architecture appliances. Connected

to the network, these appliances, each fully dedicated to a given

optimized application, boost its execution performance.

Wolfram Research is the company where “computation meets knowledge.”

A powerhouse in technical innovation, the company is the developer of 

Mathematica, the ultimate computation platform, and Wolfram|Alpha, 

the computational knowledge engine. Wolfram also sponsors the 

world’s largest free network of technical information websites, 

including MathWorld and the Wolfram Demonstrations Project. 

Würth Elektronik is one of the world’s leading manufacturers of 

passive and electromechanical components. Our product range 

contains EMC ferrites, filter chokes, common mode chokes, circuit

protection, EMI shielding material, power inductors, power

transformers, LAN and telecom transformers, RF inductors, 

LTCC components, connectors, switches, assembly technique and 

power elements. 

Zoobe is a messaging service that allows you to voice an animated 

character. From your voice or text message and your chosen 

character we generate a personal animation clip within seconds 

which you can send to your friends or post on your wall.

GTC WORLDWIDE 

EVENTS 

SAVE THE DATES 

GTC JAPAN 2012 

July 26 

Tokyo Midtown Hall 

www.gputechconf.jp 

GTC U.S. 2013 

March 19–22 

San Jose McEnery Convention Center 

www.gputechconf.com

STAY EDUCATED! 

GTC is comprised of year-round international 

conferences, workshops and online events. It is an 

essential resource for the scientists, engineers, 

researchers, and developers who rely on GPUs to tackle 

enormous computational challenges. GTC On-Demand 

gives you archival access to the world-class education 

delivered at GTC, as well as the latest research and 

insights presented by NVIDIA staff at other important 

industry events. Explore and learn from the best and 

brightest minds working in High Performance 

Computing today. Visit www.gputechconf.com 

Blog - http://blogs.nvidia.com/category/supercomputing/ 

Facebook - https://www.facebook.com/gputechnologyconference 

Twitter - http://twitter.com/#!/gpucomputing 

LinkedIn - http://www.linkedin.com/groups?about=&gid=2159196 

Flickr - http://www.flickr.com/photos/nvidia/collections/ 

YouTube - http://youtube.com/user/nvidiatesla 

Meetup - http://hpc.meetup.com/ 

STAY CONNECTED! 

GTC attendees are talented. No doubt you’ve had firsthand 

experience of this here at GTC 2012. Attendees 

work in major industry verticals such as Finance, 

Government, Life Sciences, Energy, Computer Software 

Development, Manufacturing, as well as Academia. GTC 

provides invaluable opportunities for peer-to-peer 

learning and connection within and across industries all 

year long. Build on the relationships you made this week. 

Stay connected!

VISUALIZE A GREEN EVENT 

Place compostables and recyclables in proper bins 

Use public transportation during the show 

In hotel, decline new sheets and towels 

Also, unplug phone and laptop chargers 

Offset your travel at www.cool-it.us 

Take only collateral/giveaways you will use 

What We’re Doing 

> 100% of convention center’s greenhouse gas is offset 

> Extensive composting and recycling 

> Producers and vendors agree to green guidelines 

> Minimizing printed materials 

> Using recycled and biodegradable paper/non-toxic inks 

> Monitoring lighting and A/C usage 

> Local-based food options when available 

> Non-toxic cleaning materials 


[Venue map: FIRST FLOOR and SECOND FLOOR. Labeled locations include the Main Entrance, Registration, Parking, Keynote Hall, Exhibit Hall, Hall 1, Hall 2, Concourse, Posters, Lab, Check-In, Sales Office, Stores, Think Tank, Silicon Valley Board Room, NVIDIA Meeting Room, VIP Meeting Room, Speaker Ready Room, Speaker & Sponsor Lounge, Press Lounge, Microsoft Lounge, Staff & Show Management, rooms A1, A2, A3, A5, A7, A8, B, C, D, E, F1, F2, G, H, J1, J2, J3, K, L, M and N (stairs down to rooms K, L, M, N), Marriott, Hilton, Guadalupe, San Carlos, Willow Glen, Almaden, Ballrooms 3 and 4, stairs down to the Lab, an elevator to the 2nd floor, an elevator to Blossom Hill, Almaden and the 3rd floor meeting rooms, and the St. Claire Hotel ballrooms across the street.]


© 2012 NVIDIA CORPORATION. ALL RIGHTS RESERVED.
