High Performance Python on Apache Spark
- 1. High Performance Python on Apache Spark
Wes McKinney (@wesmckinn)
Spark Summit West, June 7, 2016
© Cloudera, Inc. All rights reserved.
- 2. Me
• Data Science Tools at Cloudera
• Serial creator of structured data tools / user interfaces
• Wrote bestseller Python for Data Analysis (2012)
  • Working on an expanded and revised 2nd edition, coming 2017
• Open source projects
  • Python {pandas, Ibis, statsmodels}
  • Apache {Arrow, Parquet, Kudu (incubating)}
• Focused on C++, Python, and hybrid projects
- 3. Agenda
• Why care about Python?
• What does "high performance Python" even mean?
• A modern approach to Python data software
• Spark and Python: performance analysis and development directions
- 4. Why care about (C)Python?
• Accessible, "swiss army knife" programming language
• Highly productive for software engineering and data science alike
• Has excelled as the agile "orchestration" or "glue" layer for application business logic
• Easy to interface with C / C++ / Fortran code. Well-designed Python C API
- 5. Defining "High Performance Python"
• The end-user workflow involves primarily Python programming; programs can be invoked with "python app_entry_point.py ..."
• The software uses system resources within an acceptable factor of an equivalent program developed completely in Java or C++
  • Preferably 1-5x slower, not 20-50x
• The software is suitable for interactive / exploratory computing on modestly large data sets (gigabytes) on a single node
- 6. Building fast Python software means embracing certain limitations
- 7. Having a healthy relationship with the interpreter
• The Python interpreter itself is "slow" compared with hand-coded C or Java
• Each line of Python code may feature multiple internal C API calls, temporary data structures, etc.
• Python built-in data structures (numbers, strings, tuples, lists, dicts, etc.) have significant memory and performance overhead
• Threads performing concurrent CPU or IO work must take care not to block other threads
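The data-structure overhead in the third bullet is easy to observe with only the standard library. A minimal sketch (sys.getsizeof gives shallow sizes and exact byte counts vary by CPython version, but the ratio holds):

```python
import sys
from array import array

n = 10_000
boxed = list(range(n))         # list of pointers to individual PyLong objects
packed = array("q", range(n))  # one contiguous buffer of signed 64-bit ints

# Count the list's pointer array plus every boxed int object it references
list_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)
array_bytes = sys.getsizeof(packed)

# The boxed representation is several times larger than the packed buffer
assert array_bytes < list_bytes
```

The same gap shows up in CPU cost: every boxed object is a pointer chase plus refcount traffic, which is the "work" the next slide's first mantra asks about.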
- 8. Mantras for great success
• Key question 1: Am I making the Python interpreter do a lot of work?
• Key question 2: Am I blocking other interpreted code from executing?
• Key question 3: Am I handling data (memory) in a "good" way?
- 9. Toy example: interpreted vs. compiled code
(code screenshot in original slides)
- 10. Toy example: interpreted vs. compiled code
Cython: 78x faster than interpreted
- 11. Toy example: interpreted vs. compiled code (NumPy)
Creating a full 80 MB temporary array + PyArray_Sum is only 35% slower than a fully inlined Cython (C) function.
Interesting: ndarray.sum by itself is almost 2x faster than the hand-coded Cython function...
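The benchmark code itself is only a screenshot in the original slides. A minimal stand-in, assuming the toy example sums a large sequence of floats; the built-in sum plays the role of the compiled Cython/NumPy path, since its loop runs in C:

```python
def py_sum(values):
    """Interpreted loop: each iteration costs a bytecode dispatch plus a
    boxed-float addition through the C API."""
    total = 0.0
    for v in values:
        total += v
    return total

values = [float(i) for i in range(100_000)]

# Both run the same left-to-right summation, so the results match exactly;
# timing these (e.g. with timeit) shows the C-level loop winning by a wide margin
interpreted = py_sum(values)
compiled_stand_in = sum(values)
assert interpreted == compiled_stand_in
```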
- 12. Submarines and Icebergs: metaphors for fast Python software
- 13. Suboptimal control flow
(diagram: data arrives from elsewhere, is deserialized into Python data structures, flows through successive pure-Python computation steps, and is serialized back out. "Time for a coffee break...")
- 14. Better control flow
(diagram: Python app logic calls C functions in extension code (C / C++) that operate directly on native data; users only see the Python layer. "Zoom zoom!" -- if the extension code is good)
- 15. But it's much easier to write 100% Python!
• Building hybrid C/C++ and Python systems adds a lot of complexity to the engineering process
  • (but it's often worth it)
• See: Cython, SWIG, Boost.Python, Pybind11, and other "hybrid" software creation tools
• BONUS: Python programs can orchestrate multi-threaded / concurrent systems written in C/C++ (no Python C API needed)
• The GIL only comes in when you need to "bubble up" data or control flow (e.g. Python callbacks) into the Python interpreter
- 16. A story of reading a CSV file

f = get_stream(...)
df = pandas.read_csv(f, **csv_options)

Internally (pseudocode):

while more_data():
    buffer = f.read()
    parse_bytes(buffer)
df = type_infer_columns()

Concerns:
• Uses PyString_FromStringAndSize, must hold GIL for this
• Synchronous or asynchronous with IO?
• Type infer in parallel?
• Data structures used?
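To make the pseudocode concrete, here is a toy pure-Python version of the same shape (read and parse in chunks, infer column types at the end). This is only a sketch of what pandas.read_csv does in C; get_stream, parse_bytes, and type_infer_columns are the slide's hypothetical names, and read_csv_typed below is likewise a made-up helper:

```python
import csv
import io

def read_csv_typed(stream, chunk_rows=1024):
    """Toy chunked CSV reader mirroring the slide's pseudocode."""
    reader = csv.reader(stream)
    header = next(reader)
    columns = {name: [] for name in header}
    while True:
        # "buffer = f.read(); parse_bytes(buffer)" -- csv.reader parses here,
        # but every cell still becomes a Python str object, which is why the
        # real C parser must hold the GIL at this step
        chunk = [row for _, row in zip(range(chunk_rows), reader)]
        if not chunk:
            break
        for row in chunk:
            for name, value in zip(header, row):
                columns[name].append(value)
    # "df = type_infer_columns()": promote a column to float if every cell parses
    for name, values in columns.items():
        try:
            columns[name] = [float(v) for v in values]
        except ValueError:
            pass
    return columns

cols = read_csv_typed(io.StringIO("a,b\n1,x\n2,y\n3,z\n"))
```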
- 17. It's All About the Benjamins (Data Structures)
• The hard currency of data software is: in-memory data structures
  • How costly are they to send and receive?
  • How costly to manipulate and munge in-memory?
  • How difficult is it to add new proprietary computation logic?
• In Python: NumPy established a gold standard for interoperable array data
  • pandas is built on NumPy, and made it easy to "plug in" to the ecosystem
  • (but there are plenty of warts still)
- 18. What's this have to do with Spark?
• Some known performance issues in PySpark
  • IO throughput
    • Python to Spark
    • Spark to Python (or Python extension code)
  • Running interpreted Python code on RDDs / Spark DataFrames
    • Lambda mappers / reducers (rdd.map(...))
    • Spark SQL UDFs (registerFunction(...))
- 19. Spark IO throughput to/from Python
Spark 1.6.1 running on localhost, 76 MB pandas.DataFrame:
• 1.15 MB/s in
• 9.82 MB/s out
- 20. Spark IO throughput to/from Python
Unofficial improved toPandas: 25.6 MB/s out
- 21. Compared with HiveServer2 Thrift RPC fetch
Impala 2.5 + Parquet file on localhost:
• ibis + impyla: 41.46 MB/s read
• hs2client (C++ / Python): 90.8 MB/s
Task benchmarked: Thrift TFetchResultsReq + deserialization + conversion to pandas.DataFrame
- 22. Back-of-envelope comparison with file formats
• Feather: 1105 MB/s write, 2414 MB/s read
• CSV (pandas): 6.2 MB/s write, 51.9 MB/s read
Disclaimer: warm NVMe / OS file cache
- 23. Aside: CSVs can be fast
See: https://github.com/wiseio/paratext
- 24. How Python lambdas work in PySpark
(diagram: the Spark RDD sends a data stream plus a pickled PyFunction to a pool of Python task workers)
See: spark/api/python/PythonRDD.scala and python/pyspark/worker.py
The inner loop of RDD.map: map(f, iterator)
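A stripped-down sketch of that worker loop. The real one in python/pyspark/worker.py adds stream framing, serializers, and error handling, and real user lambdas are shipped with cloudpickle; a picklable built-in stands in here, and python_worker is a hypothetical name:

```python
import pickle

def python_worker(pickled_func, record_stream):
    """Toy PySpark worker: unpickle the shipped function, then run
    map(f, iterator). Every record crosses back into the interpreter,
    which is the per-record cost measured on the next slides."""
    f = pickle.loads(pickled_func)
    return map(f, record_stream)

# Driver side: serialize the function before streaming it to the worker pool
payload = pickle.dumps(len)
results = list(python_worker(payload, iter(["spark", "arrow", "python"])))
```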
- 25. How Python lambdas perform
NumPy array-oriented operations are about 100x faster... but that's not the whole story.
Disclaimer: this isn't a remotely "fair" comparison, but it helps illustrate the real pitfalls associated with introducing serialization and RPC/IPC into a computational process.
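One of those pitfalls is easy to quantify: serializing Python objects one at a time carries per-value framing overhead that a packed binary buffer avoids. A stdlib-only illustration (pickle stands in for the Python-to-JVM stream; the exact ratio depends on the pickle protocol):

```python
import pickle
from array import array

values = [float(i) for i in range(10_000)]

pickled = pickle.dumps(values)         # opcode + payload per element, plus list framing
packed = array("d", values).tobytes()  # raw 8-byte doubles, no per-value framing

assert len(packed) == 8 * len(values)
assert len(pickled) > len(packed)      # per-object encoding always costs more
```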
- 26. How Python lambdas perform
(benchmark chart: 8 cores vs. 1 core)
Lessons learned: Python data analytics should not be based around scalar object iteration.
- 27. Asides / counterpoints
• Spark <-> Python IO may not be important -- can leave all of the data remote
• Spark DataFrame operations have reduced the need for many types of lambda functions
• Can use binary file formats as an alternate IO interface
  • Parquet (Python support soon via apache/parquet-cpp)
  • Avro (see cavro, fastavro, pyavroc)
  • ORC (needs a Python champion)
  • ...
- 28. Apache Arrow
http://arrow.apache.org
Some slides from Strata-HW talk w/ Jacques Nadeau
- 29. Apache Arrow in a Slide
• New top-level Apache Software Foundation project
  • http://arrow.apache.org
• Focused on columnar in-memory analytics
  1. 10-100x speedup on many workloads
  2. Common data layer enables companies to choose best-of-breed systems
  3. Designed to work with any programming language
  4. Support for both relational and complex data as-is
• Oriented at collaboration amongst other OSS projects: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
- 30. High Performance Sharing & Interchange
Today:
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow:
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (eg, Parquet-to-Arrow reader)
- 31. Arrow and PySpark
• Build a C API level data protocol to move data between Spark and Python
• Either
  • (Fast) Convert Arrow to/from pandas.DataFrame
  • (Faster) Perform native analytics on Arrow data in-memory
• Use Arrow
  • For efficiently handling nested Spark SQL data in-memory
  • IO: pandas/NumPy data push/pull
  • Lambda/UDF evaluation
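The fast/faster options above both rest on columnar layout. Arrow's real format adds validity bitmaps, a precise memory specification, and metadata; the hypothetical helper below shows only the row-to-column pivot that makes near-zero-copy hand-off to pandas/NumPy possible:

```python
from array import array

def rows_to_columns(rows, names):
    """Pivot row tuples (how Spark SQL hands rows to Python today) into
    contiguous per-column buffers (how Arrow lays data out in memory)."""
    cols = zip(*rows)
    return {name: array("d", col) for name, col in zip(names, cols)}

batch = rows_to_columns([(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)], ["x", "y"])
# Each column is one contiguous buffer; memoryview(batch["x"]) exposes it
# to a consumer without copying values out one Python object at a time.
```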
- 32. Arrow in action: Feather File Format for Python and R
• Problem: fast, language-agnostic binary data frame file format
• Creators: Wes McKinney (Python) and Hadley Wickham (R)
• Read speeds close to disk IO performance
- 33. More on Feather
(diagram: a Feather file holds arrays 0 through n-1 plus a metadata block; the libfeather C++ library is wrapped via Rcpp for R data.frame and via Cython for pandas DataFrame)
- 34. Summary
• It's essential to improve Spark's low-level data interoperability with the Python data ecosystem
• I'm personally excited to work with the Spark + Arrow + PyData + other communities to help make this a reality
- 35. Thank you
Wes McKinney (@wesmckinn)
Views are my own