AI-Based Data Transformation: Evaluating Llama-3-70B's PySpark Code Generation
Published May 10, 2024

Welcome back to our comparative series where we dive into the performance of various large language models (LLMs) in generating PySpark code for data transformations. After reviewing GPT-4's capabilities in our last post, we now turn our attention to Llama-3-70B, a model developed by Meta. Instead of deploying the model on our own servers, we are using a pre-deployed model on the Groq LPU Inference Engine.
Introduction
For this installment, we focus on Llama-3-70B. Notably, because running Llama-3 ourselves would mean deploying it on our own servers, and Meta's AI chat has a training data cutoff of December 2022, we used Llama-3-70B running on the Groq LPU Inference Engine to ensure top performance and responsiveness.
Preselected Data and Use Case
As with our previous analysis, we continue to work with ERCOT's 2-Day Ancillary Services Reports. This dataset includes:
- NP3-959-EX: 48-hour Aggregate AS Offers
- NP3-960-EX: 48-hour Self-Arranged AS
- NP3-961-EX: 48-hour Cleared DAM AS
Our task involves processing these data files and loading them into an MS SQL database on Azure.
Prompt Used
We have a folder with Ercot's "2-Day Ancillary Services Reports" This report contains all 48 Hour disclosure data related to DAM. The following individual files are included in the report: NP3-959-EX 48-hour Aggregate AS Offers; NP3-960-EX 48-hour Self-Arranged AS; NP3-961-EX 48-hour Cleared DAM AS (previously named 48 Hour Ancillary Services Reports).
We gathered daily data as a zipped file containing multiple CSVs inside the Zipped file for the month of April. I have added all the zipped files in a master folder. So now there is a Master folder with a 30 days zipped file. Each zipped file has multiple CSV with data listed above.
Can you give me step by step procedure to do the following:
- Ingest the CSV file within zipped folders using PySpark
- Filter out NP3-960-EX 48-hour Self-Arranged AS and NP3-961-EX 48-hour Cleared DAM AS
- Ingest the file in the SQL database that I recently set up on my Azure.
Llama-3-70B's PySpark Answer
Here is a step-by-step procedure to achieve the tasks you mentioned:
Step 1: Ingest the CSV files using PySpark
Install PySpark and its dependencies using pip:
pip install pyspark
Then, follow these steps:
from pyspark.sql import SparkSession
import zipfile
# Create a SparkSession
spark = SparkSession.builder.appName("Ercot Report Ingestion").getOrCreate()
# Define the master folder path
master_folder_path = "/path/to/master/folder"
# Iterate over each zipped file in the master folder
for file in os.listdir(master_folder_path):
    if file.endswith(".zip"):
        zip_file_path = os.path.join(master_folder_path, file)
        # Extract the zipped file
        with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
            zip_ref.extractall(path=master_folder_path)
        # Iterate over each CSV file extracted from the zipped file
        for csv_file in os.listdir(master_folder_path):
            if csv_file.endswith(".csv"):
                csv_file_path = os.path.join(master_folder_path, csv_file)
                # Read the CSV file using PySpark
                df = spark.read.csv(csv_file_path, header=True, inferSchema=True)
Step 2: Filter out NP3-960-EX 48-hour Self-Arranged AS and NP3-961-EX 48-hour Cleared DAM AS
Assuming the CSV files have a column named "File_Name" or "File_Type" that contains the file names, you can filter out the unwanted files using PySpark's filter method:
filtered_df = df.filter(
    (df.File_Name != 'NP3-960-EX 48-hour Self-Arranged AS') &
    (df.File_Name != 'NP3-961-EX 48-hour Cleared DAM AS')
)
Step 3: Ingest the file in the SQL database on Azure
To ingest the filtered data into an Azure SQL database, you'll need to use the spark-sql package and the jdbc connector. First, add the necessary dependencies to your PySpark script:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
# Create a SparkSession
spark = SparkSession.builder.appName("Ercot Report Ingestion").getOrCreate()
# Define the Azure SQL database connection properties
jdbc_url = "jdbc:sqlserver://your_server_name.database.windows.net:1433;database=your_database_name"
username = "your_username"
password = "your_password"
driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
# Write the filtered data to the Azure SQL database
filtered_df.write.format("jdbc") \
    .option("url", jdbc_url) \
    .option("query", "your_table_name") \
    .option("user", username) \
    .option("password", password) \
    .option("driver", driver) \
    .save()
Replace the placeholders with your actual Azure SQL database credentials and table name.
Note: Make sure to install the spark-sql package and the mssql-jdbc driver in your PySpark environment. You can do this using pip install spark-sql mssql-jdbc or by adding the dependencies to your spark-defaults.conf file.
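One caveat for anyone reproducing this step: spark-sql and mssql-jdbc are not pip packages. Spark SQL already ships inside pyspark, and Microsoft's JDBC driver is a Java artifact that is typically handed to Spark as a JAR, for example via spark.jars.packages. A minimal sketch of that approach (the driver version shown is only illustrative):

from pyspark.sql import SparkSession

# Pull the SQL Server JDBC driver from Maven when the session starts.
# Pin whichever mssql-jdbc version matches your Spark/Java setup.
spark = (
    SparkSession.builder
    .appName("Ercot Report Ingestion")
    .config("spark.jars.packages", "com.microsoft.sqlserver:mssql-jdbc:12.4.2.jre11")
    .getOrCreate()
)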
Implementation Results
Issue on Step 1: NameError: name 'os' is not defined
- Fix: Added import os to the imports
Issue on Step 1: Multiple nested loops caused very long processing time
- Impact: Ingestion took almost 10 minutes to run (vs GPT-4's 1 minute)
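For comparison, here is a minimal sketch of a flatter approach under the same folder-layout assumptions: extract every daily archive once up front, then hand Spark a single glob so it plans one read across all the extracted CSVs instead of one read per file. The paths and the extracted subfolder name are placeholders, and in practice you would likely use one glob per report type, since the three reports have different layouts.

import os
import zipfile

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Ercot Report Ingestion").getOrCreate()

master_folder_path = "/path/to/master/folder"  # placeholder
extract_path = os.path.join(master_folder_path, "extracted")  # placeholder

# Unzip every daily archive once, up front.
for file in os.listdir(master_folder_path):
    if file.endswith(".zip"):
        with zipfile.ZipFile(os.path.join(master_folder_path, file), "r") as zip_ref:
            zip_ref.extractall(path=extract_path)

# Read all extracted CSVs in a single pass using a glob pattern.
df = spark.read.csv(os.path.join(extract_path, "*.csv"), header=True, inferSchema=True)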
Issue on Step 2: AttributeError: 'DataFrame' object has no attribute 'File_Name'
- Resolution: After multiple prompting iterations (3 attempts), received corrected filtering code
- Note: Llama-3 initially provided several solutions that also raised errors
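The underlying problem is that the extracted CSVs carry no File_Name column; the report code only appears in the file name itself. One hypothetical workaround (a sketch, not the corrected code the model eventually produced, and assuming the NP3-960/NP3-961 codes appear in the extracted file names) is to tag each row with its source path via input_file_name() and filter on that:

from pyspark.sql.functions import col, input_file_name

# Record which file each row came from, then drop the two unwanted reports.
tagged_df = df.withColumn("source_file", input_file_name())
filtered_df = tagged_df.filter(
    (~col("source_file").contains("NP3-960")) &
    (~col("source_file").contains("NP3-961"))
)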
Database Ingestion: Successfully loaded data to SQL Server without issues
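For readers reproducing the final step, here is a minimal sketch of the JDBC write with placeholder connection details; note that Spark's JDBC data source takes the target table through the dbtable option when writing:

# Placeholder Azure SQL connection details.
jdbc_url = "jdbc:sqlserver://your_server_name.database.windows.net:1433;database=your_database_name"

filtered_df.write.format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "your_table_name") \
    .option("user", "your_username") \
    .option("password", "your_password") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .mode("append") \
    .save()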
Conclusion on Llama-3-70B's Performance
Llama-3-70B generated code faster. However, it fell short of GPT-4 when it came to the file ingestion script: GPT-4 read the zipped files as binary files, which made processing very fast.
Llama-3-70B, by contrast, generated code with multiple nested loops, which is not good practice when ingesting large files in big data projects. As a result, ingestion took almost 10 minutes to run, compared to GPT-4's ingestion, which completed seamlessly within 1 minute.
Moreover, GPT-4 had no issues with the file filtering step, whereas Llama-3-70B's code raised an AttributeError that I could only resolve after multiple intermediate prompts. Those intermediate attempts raised further errors before I arrived at working filtering code, as noted in the Implementation Results section above.
Next in this Series
We will explore Claude's performance on this data transformation project next in this series.