DataFrame demo
import os
import findspark
# set spark_home to point to spark on your system
spark_home = os.path.join(os.path.expanduser('~'), 'src/spark')
findspark.init(spark_home=spark_home)
import pyspark
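If SPARK_HOME is already set in your environment, findspark can locate Spark without an explicit path; a minimal sketch, assuming the variable is exported in your shell:
# alternative sketch: let findspark pick up the SPARK_HOME environment variable
# (assumes SPARK_HOME is already exported; don't also call init(spark_home=...) above)
import findspark
findspark.init()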
Initialize the SparkSession, which allows us to use the RDD and DataFrame APIs:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .master("local[2]") \
    .appName("DataFrame demo") \
    .getOrCreate()
sc = spark.sparkContext
file_path = os.path.join(spark_home, 'examples/src/main/resources/people.txt')
people_rdd = (sc.textFile('file://{0}'.format(file_path))
              .map(lambda line: line.split(',')))
people_rdd.first()
['Michael', ' 29']
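Note the leading space in ' 29': Python's int() tolerates surrounding whitespace, but you can also strip each field when splitting; a small sketch:
# sketch: strip whitespace from every field while splitting
people_clean_rdd = people_rdd.map(lambda fields: [f.strip() for f in fields])
people_clean_rdd.first()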
Now we can use this data to create Row objects and convert the RDD into a DataFrame:
from pyspark.sql import Row
row_rdd = people_rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
row_rdd.first()
df = spark.createDataFrame(row_rdd)
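As a side note, the same DataFrame could also be built directly from an in-memory list of Row objects, with no RDD step; a sketch using the values from the sample file:
# sketch: build a DataFrame from a plain Python list of Rows
people = [Row(name='Michael', age=29), Row(name='Andy', age=30), Row(name='Justin', age=19)]
df_from_list = spark.createDataFrame(people)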
When the DataFrame is constructed, the data type for each column is inferred:
df.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
df.first()
Row(age=29, name='Michael')
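If you prefer not to rely on inference, a schema can be passed explicitly when creating the DataFrame; a sketch using pyspark.sql.types:
from pyspark.sql.types import StructType, StructField, StringType, LongType

# sketch: supply an explicit schema instead of letting Spark infer it
schema = StructType([
    StructField('name', StringType(), nullable=True),
    StructField('age', LongType(), nullable=True),
])
df_explicit = spark.createDataFrame(people_rdd.map(lambda x: (x[0], int(x[1]))), schema)
df_explicit.printSchema()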
There are some convenient methods for pretty-printing the DataFrame contents:
df.show()
+---+-------+
|age|   name|
+---+-------+
| 29|Michael|
| 30|   Andy|
| 19| Justin|
+---+-------+
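show() also takes arguments for the number of rows and for truncation, and describe() gives quick summary statistics; a small sketch:
# show only the first two rows, without truncating long cell values
df.show(2, truncate=False)
# summary statistics (count, mean, stddev, min, max) for the age column
df.describe('age').show()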
Let's compare RDD methods with the DataFrame API -- we want to get all the people older than 20:
# using the usual RDD methods
people_rdd.filter(lambda x: int(x[1])>20).collect()
[['Michael', ' 29'], ['Andy', ' 30']]
# using the DataFrame
df.filter(df.age > 20).take(20)
[Row(age=29, name='Michael'), Row(age=30, name='Andy')]
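The same filter can also be written in SQL by registering the DataFrame as a temporary view; a sketch:
# sketch: register a temporary view and run the filter as a SQL query
df.createOrReplaceTempView('people')
spark.sql('SELECT name, age FROM people WHERE age > 20').collect()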
No need to write maps if you can express the operation with the built-in functions. You refer to columns via the DataFrame object:
# this is a column that you can use in arithmetic expressions
df.age
Column<b'age'>
df.select(df.age, (df.age*2).alias('times two')).show()
+---+---------+
|age|times two|
+---+---------+
| 29|       58|
| 30|       60|
| 19|       38|
+---+---------+
# equivalent RDD method
people_rdd.map(lambda x: int(x[1])*2).collect()
[58, 60, 38]
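The built-in functions in pyspark.sql.functions also cover aggregations, so this kind of computation can stay in the DataFrame API; a sketch computing the average age:
from pyspark.sql import functions as F

# sketch: aggregate with a built-in function instead of an RDD computation
df.agg(F.avg(df.age).alias('avg age')).show()
# equivalent RDD-style computation
people_rdd.map(lambda x: int(x[1])).mean()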