Incremental Data Loading and Querying in PySpark without Restarting the Spark Job
Hi everyone, I want to do incremental data querying.
df = spark.read.csv('csvFile', header=True)  # 1000 rows
df.persist()  # assume this takes 5 min
df.registerTempTable('data_table')  # or createOrReplaceTempView
result = spark.sql('select * from data_table where column1 > 10')  # 100 rows

df_incremental = spark.read.csv('incremental.csv')  # 200 rows
df_combined = df.unionAll(df_incremental)
df_combined.persist()  # takes more than 5 min again; I want to avoid this,
                       # because other queries might be running at the same time
df_combined.registerTempTable("data_table")
result = spark.sql('select * from data_table where column1 > 10')  # 105 rows
Read the csv/mysql table data into a Spark dataframe.
Keep that dataframe in memory only (reason: I need performance).
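For concreteness, the memory-only persist described above might look like the following sketch. The explicit StorageLevel.MEMORY_ONLY, the count() call used to force materialization, and the session setup are my additions, not part of the question:

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("incremental-demo").getOrCreate()

# same hypothetical input file as in the question's code
df = spark.read.csv('csvFile', header=True)

# MEMORY_ONLY keeps the cached partitions in RAM, with no spill to disk
df.persist(StorageLevel.MEMORY_ONLY)
df.count()  # an action, so the cache is actually materialized now
df.createOrReplaceTempView('data_table')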
Trying streaming will be faster, because the session is already running and the stream is triggered every time you drop something into the folder:
df_incremental = spark \
    .readStream \
    .option("sep", ",") \
    .schema(input_schema) \
    .csv(input_path)

df_incremental.where("column1 > 10") \
    .writeStream \
    .queryName("data_table") \
    .format("memory") \
    .start()

spark.sql("SELECT * FROM data_table").show()
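As a usage note: start() returns a StreamingQuery handle. Keeping that handle (my addition, not shown in the answer above) lets you block until a freshly dropped file has been ingested before re-querying the in-memory table — a sketch:

query = df_incremental.where("column1 > 10") \
    .writeStream \
    .queryName("data_table") \
    .format("memory") \
    .start()

# after dropping a new CSV into input_path:
query.processAllAvailable()  # blocks until all currently available input is processed
spark.sql("SELECT * FROM data_table").show()  # now includes the new file's rows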