Introduction to PySpark
What is Spark?
- Spark is a platform for cluster computing.
- Spread data and computations over clusters with multiple nodes
- Both data processing and computation are performed in parallel over the nodes in the cluster
When should you consider using Spark?
- Is my data too big to work with on a single machine?
- Can my calculations be easily parallelized?
Using Spark in Python
- The first step in using Spark is connecting to a cluster
- A cluster has one master node (which manages splitting up the data and the computations); the remaining nodes are workers (slaves).
- Creating a connection is as simple as creating an instance of the SparkContext class.
- An object holding all these attributes can be created with the SparkConf() constructor
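A minimal sketch of this connection setup (not from the original notes; the master URL "local[*]" and the app name are placeholder values):
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local[*]").setAppName("intro-to-pyspark")  # holds the cluster attributes
sc = SparkContext(conf=conf)  # the connection to the cluster
print(sc.version)  # check the connection by printing the Spark version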
Using DataFrames:
- Spark's core data structure is the Resilient Distributed Dataset (RDD) ---> its purpose: splitting data across the many nodes in the cluster.
- However, RDDs are hard to work with directly
===> therefore, we use the Spark DataFrame, which is built on top of RDDs.
- DataFrames are also more optimized for complicated operations than RDDs.
- To work with DataFrames:
- you need to create a SparkSession object from the SparkContext.
** you can think of the SparkContext as the connection to the cluster and the SparkSession as the interface to that connection.
- Creating a SparkSession
Creating multiple SparkSessions and SparkContexts can cause issues, so it's best practice to use the SparkSession.builder.getOrCreate() method. This returns an existing SparkSession if there's one in the environment, or creates a new one if necessary!
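A small sketch of that pattern (assuming pyspark is installed):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()  # returns the existing SparkSession, or creates a new one
print(spark)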
- Viewing tables
SparkSession has an attribute called catalog which lists all the data inside the cluster.
- SparkSession --> attribute (catalog) --> a few methods for extracting info
- The most useful is .listTables(): it returns the names of all the tables in the cluster as a list.
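For example (assuming `spark` is the SparkSession created above):
print(spark.catalog.listTables())  # empty list if nothing has been registered yet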
- Putting a pandas DataFrame into the Spark cluster:
- The .createDataFrame() method takes a pandas DataFrame and returns a Spark DataFrame
the output is a local Spark DataFrame, not a table in the catalog.
-> This means that you can use all the Spark DataFrame methods on it, but you can't access the data in other contexts.
-> to access the data in other contexts, you have to save it as a temporary table.
-> using the .createTempView()
-> or using .createOrReplaceTempView() (safer: it creates the temp table if it doesn't exist and replaces it if it does).
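A hedged end-to-end sketch of this flow; the pandas DataFrame contents and the table name "temp_table" are invented for illustration:
import pandas as pd
pd_temp = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})
spark_temp = spark.createDataFrame(pd_temp)       # local Spark DataFrame, not yet in the catalog
spark_temp.createOrReplaceTempView("temp_table")  # register it so it is visible in the catalog
print(spark.catalog.listTables())                 # "temp_table" should now appear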
Reading a file directly into Spark:
airports = spark.read.csv(file_path, header=True)
airports.show()
Creating a column:
- df = df.withColumn("newCol", df.oldCol + 1)
GROUP BY command.
- This command breaks your data into groups and applies a function from your SELECT statement to each group.
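A hedged sketch, assuming the flights DataFrame used below has been registered as a temp view named "flights" and has an "origin" column (both assumptions for illustration):
flights.createOrReplaceTempView("flights")
query = "SELECT origin, COUNT(*) AS n_flights FROM flights GROUP BY origin"
spark.sql(query).show()
The DataFrame equivalent is flights.groupBy("origin").count().show().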
Filtering Data:
- flights.filter(flights.air_time > 120).show()
- flights.filter("air_time > 120").show()
--> same result
JOIN:
- PySpark, joins are performed using the DataFrame method .join().
- This method takes three arguments: the other DataFrame, on= (the name of the key column(s)), and how= (the kind of join to perform).
table1.join(table2, on=, how=)
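A minimal sketch using the flights and airports DataFrames from above; the key column name "dest" and the join type are assumptions for illustration:
flights_with_airports = flights.join(airports, on="dest", how="leftouter")  # keep every flight, matched or not
flights_with_airports.show()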