High Performance Python on Apache Spark
- 1. High Performance Python on Apache Spark
Wes McKinney (@wesmckinn)
Spark Summit West, June 7, 2016
© Cloudera, Inc. All rights reserved.
- 2. Me
• Data Science Tools at Cloudera
• Serial creator of structured data tools / user interfaces
• Wrote bestseller Python for Data Analysis (2012)
  • Working on an expanded and revised 2nd edition, coming 2017
• Open source projects
  • Python {pandas, Ibis, statsmodels}
  • Apache {Arrow, Parquet, Kudu (incubating)}
• Focused on C++, Python, and hybrid projects
- 3. Agenda
• Why care about Python?
• What does "high performance Python" even mean?
• A modern approach to Python data software
• Spark and Python: performance analysis and development directions
- 4. Why care about (C)Python?
• Accessible, "swiss army knife" programming language
• Highly productive for software engineering and data science alike
• Has excelled as the agile "orchestration" or "glue" layer for application business logic
• Easy to interface with C / C++ / Fortran code. Well-designed Python C API
- 5. Defining "High Performance Python"
• The end-user workflow involves primarily Python programming; programs can be invoked with "python app_entry_point.py ..."
• The software uses system resources within an acceptable factor of an equivalent program developed completely in Java or C++
  • Preferably 1-5x slower, not 20-50x
• The software is suitable for interactive / exploratory computing on modestly large data sets (gigabytes) on a single node
- 6. Building fast Python software means embracing certain limitations
- 7. Having a healthy relationship with the interpreter
• The Python interpreter itself is "slow" compared with hand-coded C or Java
• Each line of Python code may feature multiple internal C API calls, temporary data structures, etc.
• Python built-in data structures (numbers, strings, tuples, lists, dicts, etc.) have significant memory and performance overhead
• Threads performing concurrent CPU or IO work must take care not to block other threads
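The data-structure overhead in the third bullet is easy to observe with only the standard library. A minimal sketch (sys.getsizeof gives shallow sizes and exact byte counts vary by CPython version, but the ratio holds):

```python
import sys
from array import array

n = 10_000
boxed = list(range(n))         # list of pointers to individual PyLong objects
packed = array("q", range(n))  # one contiguous buffer of signed 64-bit ints

# Count the list's pointer array plus every boxed int object it references
list_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)
array_bytes = sys.getsizeof(packed)

# The boxed representation is several times larger than the packed buffer
assert array_bytes < list_bytes
```

The same gap shows up in CPU cost: every boxed object is a pointer chase plus refcount traffic, which is the "work" the next slide's first mantra asks about.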
- 8. Mantras for great success
• Key question 1: Am I making the Python interpreter do a lot of work?
• Key question 2: Am I blocking other interpreted code from executing?
• Key question 3: Am I handling data (memory) in a "good" way?
- 9. Toy example: interpreted vs. compiled code
(code screenshot in original slides)
- 10. Toy example: interpreted vs. compiled code
Cython: 78x faster than interpreted
- 11. Toy example: interpreted vs. compiled code (NumPy)
Creating a full 80 MB temporary array + PyArray_Sum is only 35% slower than a fully inlined Cython (C) function.
Interesting: ndarray.sum by itself is almost 2x faster than the hand-coded Cython function...
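The benchmark code itself is only a screenshot in the original slides. A minimal stand-in, assuming the toy example sums a large sequence of floats; the built-in sum plays the role of the compiled Cython/NumPy path, since its loop runs in C:

```python
def py_sum(values):
    """Interpreted loop: each iteration costs a bytecode dispatch plus a
    boxed-float addition through the C API."""
    total = 0.0
    for v in values:
        total += v
    return total

values = [float(i) for i in range(100_000)]

# Both run the same left-to-right summation, so the results match exactly;
# timing these (e.g. with timeit) shows the C-level loop winning by a wide margin
interpreted = py_sum(values)
compiled_stand_in = sum(values)
assert interpreted == compiled_stand_in
```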
- 12. Submarines and Icebergs: metaphors for fast Python software
- 13. Suboptimal control flow
(diagram: data arrives from elsewhere, is deserialized into Python data structures, flows through successive pure-Python computation steps, and is serialized back out. "Time for a coffee break...")
- 14. Better control flow
(diagram: Python app logic calls C functions in extension code (C / C++) that operate directly on native data; users only see the Python layer. "Zoom zoom!" -- if the extension code is good)
- 15. But it's much easier to write 100% Python!
• Building hybrid C/C++ and Python systems adds a lot of complexity to the engineering process
  • (but it's often worth it)
• See: Cython, SWIG, Boost.Python, Pybind11, and other "hybrid" software creation tools
• BONUS: Python programs can orchestrate multi-threaded / concurrent systems written in C/C++ (no Python C API needed)
• The GIL only comes in when you need to "bubble up" data or control flow (e.g. Python callbacks) into the Python interpreter
- 16. A story of reading a CSV file

f = get_stream(...)
df = pandas.read_csv(f, **csv_options)

Internally (pseudocode):

while more_data():
    buffer = f.read()
    parse_bytes(buffer)
df = type_infer_columns()

Concerns:
• Uses PyString_FromStringAndSize, must hold GIL for this
• Synchronous or asynchronous with IO?
• Type infer in parallel?
• Data structures used?
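To make the pseudocode concrete, here is a toy pure-Python version of the same shape (read and parse in chunks, infer column types at the end). This is only a sketch of what pandas.read_csv does in C; get_stream, parse_bytes, and type_infer_columns are the slide's hypothetical names, and read_csv_typed below is likewise a made-up helper:

```python
import csv
import io

def read_csv_typed(stream, chunk_rows=1024):
    """Toy chunked CSV reader mirroring the slide's pseudocode."""
    reader = csv.reader(stream)
    header = next(reader)
    columns = {name: [] for name in header}
    while True:
        # "buffer = f.read(); parse_bytes(buffer)" -- csv.reader parses here,
        # but every cell still becomes a Python str object, which is why the
        # real C parser must hold the GIL at this step
        chunk = [row for _, row in zip(range(chunk_rows), reader)]
        if not chunk:
            break
        for row in chunk:
            for name, value in zip(header, row):
                columns[name].append(value)
    # "df = type_infer_columns()": promote a column to float if every cell parses
    for name, values in columns.items():
        try:
            columns[name] = [float(v) for v in values]
        except ValueError:
            pass
    return columns

cols = read_csv_typed(io.StringIO("a,b\n1,x\n2,y\n3,z\n"))
```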
- 17. It's All About the Benjamins (Data Structures)
• The hard currency of data software is: in-memory data structures
  • How costly are they to send and receive?
  • How costly to manipulate and munge in-memory?
  • How difficult is it to add new proprietary computation logic?
• In Python: NumPy established a gold standard for interoperable array data
  • pandas is built on NumPy, and made it easy to "plug in" to the ecosystem
  • (but there are plenty of warts still)
- 18. What's this have to do with Spark?
• Some known performance issues in PySpark
  • IO throughput
    • Python to Spark
    • Spark to Python (or Python extension code)
  • Running interpreted Python code on RDDs / Spark DataFrames
    • Lambda mappers / reducers (rdd.map(...))
    • Spark SQL UDFs (registerFunction(...))
- 19. Spark IO throughput to/from Python
Spark 1.6.1 running on localhost, 76 MB pandas.DataFrame:
• 1.15 MB/s in
• 9.82 MB/s out
- 20. Spark IO throughput to/from Python
Unofficial improved toPandas: 25.6 MB/s out
- 21. Compared with HiveServer2 Thrift RPC fetch
Impala 2.5 + Parquet file on localhost:
• ibis + impyla: 41.46 MB/s read
• hs2client (C++ / Python): 90.8 MB/s
Task benchmarked: Thrift TFetchResultsReq + deserialization + conversion to pandas.DataFrame
- 22. Back-of-envelope comparison with file formats
• Feather: 1105 MB/s write, 2414 MB/s read
• CSV (pandas): 6.2 MB/s write, 51.9 MB/s read
Disclaimer: warm NVMe / OS file cache
- 23. Aside: CSVs can be fast
See: https://github.com/wiseio/paratext
- 24. How Python lambdas work in PySpark
(diagram: the Spark RDD sends a data stream plus a pickled PyFunction to a pool of Python task workers)
See: spark/api/python/PythonRDD.scala and python/pyspark/worker.py
The inner loop of RDD.map: map(f, iterator)
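A stripped-down sketch of that worker loop. The real one in python/pyspark/worker.py adds stream framing, serializers, and error handling, and real user lambdas are shipped with cloudpickle; a picklable built-in stands in here, and python_worker is a hypothetical name:

```python
import pickle

def python_worker(pickled_func, record_stream):
    """Toy PySpark worker: unpickle the shipped function, then run
    map(f, iterator). Every record crosses back into the interpreter,
    which is the per-record cost measured on the next slides."""
    f = pickle.loads(pickled_func)
    return map(f, record_stream)

# Driver side: serialize the function before streaming it to the worker pool
payload = pickle.dumps(len)
results = list(python_worker(payload, iter(["spark", "arrow", "python"])))
```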
- 25. How Python lambdas perform
NumPy array-oriented operations are about 100x faster... but that's not the whole story.
Disclaimer: this isn't a remotely "fair" comparison, but it helps illustrate the real pitfalls associated with introducing serialization and RPC/IPC into a computational process.
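One of those pitfalls is easy to quantify: serializing Python objects one at a time carries per-value framing overhead that a packed binary buffer avoids. A stdlib-only illustration (pickle stands in for the Python-to-JVM stream; the exact ratio depends on the pickle protocol):

```python
import pickle
from array import array

values = [float(i) for i in range(10_000)]

pickled = pickle.dumps(values)         # opcode + payload per element, plus list framing
packed = array("d", values).tobytes()  # raw 8-byte doubles, no per-value framing

assert len(packed) == 8 * len(values)
assert len(pickled) > len(packed)      # per-object encoding always costs more
```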
- 26. How Python lambdas perform
(benchmark chart: 8 cores vs. 1 core)
Lessons learned: Python data analytics should not be based around scalar object iteration.
- 27. Asides / counterpoints
• Spark <-> Python IO may not be important -- can leave all of the data remote
• Spark DataFrame operations have reduced the need for many types of lambda functions
• Can use binary file formats as an alternate IO interface
  • Parquet (Python support soon via apache/parquet-cpp)
  • Avro (see cavro, fastavro, pyavroc)
  • ORC (needs a Python champion)
  • ...
- 28. Apache Arrow
http://arrow.apache.org
Some slides from Strata-HW talk w/ Jacques Nadeau
- 29. Apache Arrow in a Slide
• New top-level Apache Software Foundation project
  • http://arrow.apache.org
• Focused on columnar in-memory analytics
  1. 10-100x speedup on many workloads
  2. Common data layer enables companies to choose best-of-breed systems
  3. Designed to work with any programming language
  4. Support for both relational and complex data as-is
• Oriented at collaboration amongst other OSS projects: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
- 30. High Performance Sharing & Interchange
Today:
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow:
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (eg, Parquet-to-Arrow reader)
- 31. Arrow and PySpark
• Build a C API level data protocol to move data between Spark and Python
• Either
  • (Fast) Convert Arrow to/from pandas.DataFrame
  • (Faster) Perform native analytics on Arrow data in-memory
• Use Arrow
  • For efficiently handling nested Spark SQL data in-memory
  • IO: pandas/NumPy data push/pull
  • Lambda/UDF evaluation
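The fast/faster options above both rest on columnar layout. Arrow's real format adds validity bitmaps, a precise memory specification, and metadata; the hypothetical helper below shows only the row-to-column pivot that makes near-zero-copy hand-off to pandas/NumPy possible:

```python
from array import array

def rows_to_columns(rows, names):
    """Pivot row tuples (how Spark SQL hands rows to Python today) into
    contiguous per-column buffers (how Arrow lays data out in memory)."""
    cols = zip(*rows)
    return {name: array("d", col) for name, col in zip(names, cols)}

batch = rows_to_columns([(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)], ["x", "y"])
# Each column is one contiguous buffer; memoryview(batch["x"]) exposes it
# to a consumer without copying values out one Python object at a time.
```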
- 32. Arrow in action: Feather File Format for Python and R
• Problem: fast, language-agnostic binary data frame file format
• Creators: Wes McKinney (Python) and Hadley Wickham (R)
• Read speeds close to disk IO performance
- 33. More on Feather
(diagram: a Feather file holds arrays 0 through n-1 plus a metadata block; the libfeather C++ library is wrapped via Rcpp for R data.frame and via Cython for pandas DataFrame)
- 34. Summary
• It's essential to improve Spark's low-level data interoperability with the Python data ecosystem
• I'm personally excited to work with the Spark + Arrow + PyData + other communities to help make this a reality
- 35. Thank you
Wes McKinney (@wesmckinn)
Views are my own