Welcome back to our ongoing series where we evaluate the performance of various large language models (LLMs) in generating PySpark code for specific data transformation tasks. Following our analysis of GPT-4 and Llama-3-70B, this post delves into how the Claude 3 Sonnet model handles the same challenge.
Introduction to the Series
This series provides a platform for comparing AI-generated PySpark code across different models, offering valuable insights into their coding efficiency, readability, and innovation. For this post, we employed Claude 3 Sonnet, utilizing the free services offered, to generate code for processing ERCOT’s “2-Day Ancillary Services Reports.”
Preselected Data and Use Case
We continue with the same dataset as in our previous posts:
- NP3-959-EX: 48-hour Aggregate AS Offers
- NP3-960-EX: 48-hour Self-Arranged AS
- NP3-961-EX: 48-hour Cleared DAM AS
The task is to ingest and transform this data using Azure Synapse Analytics and PySpark to derive meaningful insights.
Prompt Used
We have a folder with ERCOT’s “2-Day Ancillary Services Reports.” This report contains all 48-hour disclosure data related to the DAM. The following individual files are included in the report: NP3-959-EX 48-hour Aggregate AS Offers; NP3-960-EX 48-hour Self-Arranged AS; NP3-961-EX 48-hour Cleared DAM AS (previously named 48 Hour Ancillary Services Reports).
We gathered the data daily as a zipped file containing multiple CSVs for the month of April. I have added all the zipped files to a master folder, so there is now a master folder with 30 daily zipped files. Each zipped file contains multiple CSVs with the data listed above.
Can you give me step by step procedure to do the following:
1) Ingest the CSV files within the zipped folders using PySpark
2) Filter out NP3-960-EX 48-hour Self-Arranged AS and NP3-961-EX 48-hour Cleared DAM AS
3) Ingest the data into the SQL database that I recently set up on Azure.
PySpark Code Generation by Claude 3 Sonnet
Ingest the CSV files within zipped folders using PySpark
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name
# Create a SparkSession
spark = SparkSession.builder.appName("IngestZippedCSVs").getOrCreate()
# Define the path to the master folder containing the zipped files
master_folder_path = "/path/to/master/folder"
# Read all CSV files within the zipped folders
df = spark.read.format("csv") \
    .option("header", "true") \
    .load(f"{master_folder_path}/*.zip")
# Add a column to identify the original file name
df = df.withColumn("file_name", input_file_name())
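Before moving on, it can help to sanity-check what was ingested. The snippet below is a small optional check of our own (not part of the generated code) that prints the inferred schema and a per-file row count using the file_name column added above.
# Optional sanity check: inspect the inferred schema
df.printSchema()
# Optional sanity check: row counts per source file
df.groupBy("file_name").count().show(truncate=False)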
Filter out NP3-960-EX 48-hour Self-Arranged AS and NP3-961-EX 48-hour Cleared DAM AS
# Filter out the unwanted files
filtered_df = df.filter(~df.file_name.contains("NP3-960-EX") & ~df.file_name.contains("NP3-961-EX"))
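To confirm the filter did what was intended, one small optional check is to list the distinct file names that remain; only NP3-959-EX files should be left.
# Optional check: only NP3-959-EX source files should remain after filtering
filtered_df.select("file_name").distinct().show(truncate=False)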
Ingest the data into an SQL database on Microsoft Azure:
# Replace the placeholders with your actual credentials and database information
server_name = "<your_server_name>.database.windows.net"
database_name = "<your_database_name>"
username = "<your_username>"
password = "<your_password>"
# Define the JDBC URL
jdbc_url = f"jdbc:sqlserver://{server_name};database={database_name}"
# Write the filtered DataFrame to the SQL database
filtered_df.write \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "<your_table_name>") \
    .option("user", username) \
    .option("password", password) \
    .mode("append") \
    .save()
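As an optional follow-up of our own, the table can be read back over the same JDBC connection to confirm that the row counts line up. The sketch below reuses the same placeholders and credentials.
# Optional: read the target table back and compare row counts
written_df = spark.read \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "<your_table_name>") \
    .option("user", username) \
    .option("password", password) \
    .load()
print("rows written:", filtered_df.count(), "rows in table:", written_df.count())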
Please note that you’ll need to replace the placeholders (<your_server_name>, <your_database_name>, <your_username>, <your_password>, and <your_table_name>) with your actual credentials and database information.
Additionally, make sure you have the necessary JDBC driver installed and included in your Spark cluster’s classpath. For Microsoft SQL Server, you can download the JDBC driver from the official Microsoft website: https://docs.microsoft.com/en-us/sql/connect/jdbc/download-microsoft-jdbc-driver-for-sql-server
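For reference, if the driver is not already on the cluster, one option is to request it through spark.jars.packages when the SparkSession is first created. The Maven coordinate below is the Microsoft mssql-jdbc artifact, but the version shown is illustrative and should be matched to your Spark and Java environment.
# One way to pull the SQL Server JDBC driver at session start;
# the version is illustrative -- match it to your environment
spark = SparkSession.builder \
    .appName("IngestZippedCSVs") \
    .config("spark.jars.packages", "com.microsoft.sqlserver:mssql-jdbc:12.4.2.jre11") \
    .getOrCreate()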
After completing these steps, your PySpark application will ingest the relevant CSV files from the zipped folders, filter out the unwanted files, and ingest the remaining data into your SQL database on Microsoft Azure.
Implementation Results
- No issues with the first step, ingesting the CSV files within the zipped folders using PySpark.
- No issues with the filtering step.
- I was able to load the data into the Azure SQL database without any issues.
Conclusion
The Claude 3 Sonnet model, accessed through its free service, successfully generated PySpark code that was both efficient and easy to integrate into our data workflow. Notably, I was able to run all three steps without any issues. One caveat worth mentioning: I did not vet the data quality at any of the three steps.
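For readers who want a starting point, a minimal first pass at vetting the data could be a per-column null count on the filtered DataFrame; the sketch below assumes the filtered_df from the generated code.
from pyspark.sql import functions as F
# First-pass data quality check: count nulls in every column of filtered_df
null_counts = filtered_df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in filtered_df.columns]
)
null_counts.show(truncate=False)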
Next in this series
We will explore Google Gemini’s performance on the same data transformation task.