Abstract:
In the field of education, analyzing academic performance is vital for
understanding student learning behaviors, identifying areas needing enhancement, and
developing targeted interventions to improve educational outcomes. Traditional
assessment methods typically depend on simple metrics like grades or standardized
test scores, which often fail to capture the complexities of student proficiency and
behavior. To overcome these limitations, educational researchers have increasingly
adopted advanced data mining techniques and machine learning algorithms for a more
granular and comprehensive analysis of academic performance data. This research
proposes an Enhanced Dirichlet Process Means (EDP-Means) clustering algorithm
combined with Educational Extract, Transform, Load (Edu-ETL) processes to
evaluate academic performance across various educational levels. The proposed
approach aims to offer greater assurance and clarity in evaluating and supporting
student achievement throughout the educational journey. The integration of Edu-ETL
processes ensures data quality and consistency, preparing educational datasets
for thorough analysis. The architecture of the proposed system utilizes the EDP-
Means clustering algorithm, an improvement over the original DP-Means, for
enhanced clustering performance. While both algorithms assign data points to clusters
based on distance and threshold, EDP-Means introduces iterative optimization steps
for improved accuracy and stability. In the original DP-Means algorithm, the threshold
parameter, and with it the resulting number of clusters, was typically fixed in advance
or chosen heuristically. In EDP-Means, these settings are adjusted dynamically based on
the data characteristics and clustering quality, leading to more accurate and reliable
clustering results. This study demonstrates that EDP-Means performs as well as or
better than the traditional K-Means and original DP-Means algorithms in clustering
educational data. To validate this performance, EDP-Means was further tested on
datasets from other fields. Furthermore, analysis in the PySpark environment shows that
PySpark enhances the scalability and efficiency of EDP-Means,
particularly in processing large-scale datasets.
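
For orientation, the distance-and-threshold assignment shared by DP-Means and EDP-Means can be sketched as follows. This is a minimal illustration of the baseline DP-Means procedure only, not the EDP-Means algorithm proposed here: the function name dp_means, the parameter lam (taken as a squared Euclidean distance threshold), and the toy data are assumptions made for the example, and the dynamic parameter adjustment of EDP-Means is not shown.

```python
import numpy as np

def dp_means(X, lam, max_iter=100):
    """Baseline DP-Means sketch (hypothetical helper, not EDP-Means).

    X   : array of shape (n_samples, n_features)
    lam : squared-distance threshold; a point farther than this from
          every current center opens a new cluster (assumed convention)
    """
    X = np.asarray(X, dtype=float)
    centers = [X.mean(axis=0)]            # start from one global cluster
    labels = np.zeros(len(X), dtype=int)

    for _ in range(max_iter):
        old_labels = labels.copy()
        old_k = len(centers)

        # Assignment step: nearest center, or a new cluster if too far.
        for i, x in enumerate(X):
            d2 = np.array([np.sum((x - c) ** 2) for c in centers])
            j = int(np.argmin(d2))
            if d2[j] > lam:
                centers.append(x.copy())
                j = len(centers) - 1
            labels[i] = j

        # Update step: recompute means and drop clusters left empty.
        used = np.unique(labels)
        centers = [X[labels == k].mean(axis=0) for k in used]
        labels = np.searchsorted(used, labels)   # relabel to 0..k-1

        # Stop when neither the assignments nor the cluster count change.
        if len(centers) == old_k and np.array_equal(labels, old_labels):
            break

    return np.array(centers), labels

# Toy data (assumed): two well-separated groups of three points each.
X_demo = np.array([[0, 0], [0, 1], [1, 0],
                   [10, 10], [10, 11], [11, 10]], dtype=float)
centers_demo, labels_demo = dp_means(X_demo, lam=4.0)
```

In this sketch the number of clusters is not supplied directly; it follows from lam, so a smaller threshold yields more clusters and a larger one yields fewer. How EDP-Means tunes such a threshold from the data is described in the body of the work and is deliberately left out here.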