Fall 2022 Graduate Student Master Thesis Defense

Title: Time Series Analysis of US Gasoline Prices Using Time Series Model

Tuesday, November 28, 2022 2:00 PM

Location: WH 284

Speaker: Moussa Abdoulaye

Abstract: Natural gas demand has increased significantly on a global scale, and businesses are keen to obtain natural gas price forecasts. Such predictions are expected to meet the needs of the various producers, suppliers, traders, bankers, and end users who are involved in the exploration, production, transportation, and trading of natural gas. The goal for both the supplier and the trader is to conduct business while satisfying demand. Researchers have used a variety of approaches to predict the price of natural gas. In this study, we examine how well time series analysis can forecast US gasoline prices. We found evidence that the time series models accurately predicted the decrease in US gasoline prices one year in advance. This is crucial for proper planning for all parties involved in the exploration, production, transportation, and trading of natural gas. According to our results, the ARIMA approach to time series data provides reliable predictions of US gasoline prices for the following year.

Thesis Advisor: Dr. Mezbahur Rahman


Title: A comparative Study of Ridge, LASSO and Principal components Regression

Wednesday, November 16, 2022 3:00 PM

Location: https://minnstate.zoom.us/j/96409490562

Speaker: Franck Olilo

Abstract: Linear regression is one of the most frequently employed statistical techniques, with applications in all aspects of daily life. In regression, the goal is to relate the variation in one or more response variables to proportional changes in one or more explanatory variables in order to explain the variation in the response. The explanatory variables are deemed orthogonal if there is no linear relationship among them. If the variables are not orthogonal, several of them will fluctuate in quite comparable ways. This issue, known as multicollinearity, arises frequently in regression analysis. When two or more explanatory variables are highly (but not perfectly) correlated with one another, it becomes challenging to interpret the strength of each variable's effect, because in the presence of multicollinearity the OLS coefficients are not estimated precisely.

In the first part of this paper, we discuss the multicollinearity problem in the linear regression model, present techniques to identify the problem, and examine its causes and consequences. After that, we explore ways to handle multicollinearity, such as Ridge Regression, Lasso Regression, and Principal Components Regression, and discuss the theory behind them.

In addition, we carried out a case study applying those methods and compared OLS, RR, LAS, and PCR to determine which should be preferred when fitting a model with multicollinearity. With MSE, RMSE, and R-squared as the comparison criteria, the results showed that RR, LAS, and PCR have a lower mean squared error than OLS, while RR and LASSO perform better than PCR.
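As a rough illustration of why ridge regression helps under multicollinearity, the sketch below fits OLS and closed-form ridge to simulated, nearly collinear predictors. The data, the penalty value `lam=1.0`, and the helper names `fit_ols` and `fit_ridge` are assumptions made for this example and are not taken from the thesis.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients (no intercept; data assumed centered)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def fit_ridge(X, y, lam):
    """Closed-form ridge solution (X'X + lam * I)^(-1) X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Simulated data: x2 is almost a copy of x1, so the design is nearly collinear.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)          # center predictors so no intercept is needed
y = 3.0 * x1 + rng.normal(size=n)
y = y - y.mean()                # center the response as well

beta_ols = fit_ols(X, y)
beta_ridge = fit_ridge(X, y, lam=1.0)

print("OLS coefficients:  ", beta_ols)
print("Ridge coefficients:", beta_ridge)
```

The ridge penalty shrinks the coefficient vector, so its norm is never larger than that of the OLS solution; with nearly collinear columns the OLS coefficients can be large and unstable while the ridge coefficients stay moderate.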

Thesis Advisor: Dr. Iresha Premarathna

 

 

Summer 2022 Graduate Student Master Thesis Defense

Title: Mitigating Class Imbalance in Machine Learning for a Binary Classification Using Resampling Techniques

Thursday, July 14, 2022 12:00 PM

Location: https://minnstate.zoom.us/j/99988495768

Speaker: Sujin Kim

Abstract: Class imbalance is a problem we often face when building a classification model using machine learning (ML) algorithms. ML algorithms are likely to create a model that classifies all observations into the majority class, because they generally focus on the overall accuracy of the model and the minority class contributes less to that accuracy than the majority class. In this paper, we intend to mitigate the problem of class imbalance in sepsis clinical data using a data-level approach, namely resampling techniques, which are intuitive, simple, and applicable to any ML algorithm. A resampling technique resamples original data with a higher class imbalance to create new data with a lower class imbalance. NearMiss under-sampling, Tomek Link under-sampling, and SMOTE over-sampling methods are used. The sepsis clinical data is a dataset containing information about the survival of patients with sepsis. The dataset is divided into a train set for building a model and a test set for validation, and several ML algorithms are applied to the train set for this binary classification problem: logistic regression, support vector machine, and random forest. The performance of the resampling techniques with each ML algorithm is evaluated by scores derived from a confusion matrix.
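To make the over-sampling idea concrete, here is a minimal SMOTE-style sketch written from scratch with NumPy. It is a simplified illustration, not the implementation used in the thesis: the function name `smote_like`, its parameters, and the toy minority points are all assumptions. Each synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbors.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between
    a randomly chosen minority point and one of its k nearest minority
    neighbors. Requires at least two minority samples."""
    rng = np.random.default_rng(seed)
    n = X_min.shape[0]
    k = min(k, n - 1)
    # Pairwise squared distances among minority points.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)              # a point is not its own neighbor
    neighbors = np.argsort(d2, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)     # which minority point to start from
    nb = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))              # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[nb] - X_min[base])

# Toy minority class: four points on the unit square.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(X_minority, n_new=6, k=2)
print(synthetic)
```

Because every synthetic point is a convex combination of two real minority points, the new samples stay inside the region already occupied by the minority class.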

Thesis Advisor: Dr. Deepak Sanjel


Title: An Investigation of Markowitz and Robust Portfolio Optimization

Wednesday, July 13, 2022 11:00 AM

Location: WH 291

Speaker: Shangyi Bi

Abstract: The purpose of this paper is to explain the Markowitz portfolio theory and to reduce its sensitivity to parameter estimation errors using a robust method. The Markowitz portfolio theory is a mathematical framework for selecting a portfolio of assets such that the expected return is maximized for a given level of risk. Robust formulations systematically combat the sensitivity of the optimal portfolio to statistical and modeling errors in the estimates. We introduce a box uncertainty set for the mean and variance, which makes the overall return more stable. This method is important for investors because financial products are easily affected by unexpected incidents, and a robust formulation further extends diversification in investing.

Thesis Advisor: Dr. Hyekyung Min


Title: Predicting types of chest pain using Logistic regression

Monday, June 8, 2022 4:30 PM

Location: https://minnstate.zoom.us/j/7386964275

Speaker: Eric Adu

Abstract: Logistic regression analysis is a statistical technique to evaluate the relationship between various predictor variables (either categorical or continuous) and an outcome which is binary (dichotomous). In this paper, we discuss binary logistic regression analysis and the binary logistic regression and the error generated as a result of using binary logistic regression instead of binary logistic regression.

Thesis Advisor: Dr. Mezbahur Rahman


Spring 2022 Graduate Student Master Thesis Defense

Solving a Combinatorial Timetabling Optimization Via Random Local Search With Simulated Annealing

Friday, April 29, 2022, 3-3:50pm

Location: WH 284A

Speaker: Jason Motzko

Abstract: An overview is given for combinatorial optimization, random local search and simulated annealing methods. These methods are then used to develop an algorithm for locating the best solution to a timetabling problem at a university. Various criteria of desirable attributes in the timetable are evaluated, with a penalty assigned for violations of the criteria. The sum of these penalties is the value for which a desired minimum is sought. Due to the discrete nature of the optimization problem, random methods are utilized to seek a global minimum, with no guarantee of convergence to a global minimum. The results of these randomized methods vary between implementations of the algorithm. The effectiveness of randomizing initial conditions as a method of avoiding entrapment in a local minimum neighborhood is explored. Employing a cohort of randomly selected initial solutions, a significantly greater reduction in the penalty was achieved during implementation of the algorithm than by employment of a single initial solution.
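The penalty-minimization idea described above can be sketched on a toy problem. The example below is an assumption-laden illustration, not the algorithm from the thesis: events are assigned to time slots, the only penalty counted is a pair of conflicting events sharing a slot, and the names `penalty`, `anneal`, and `anneal_cohort`, the cooling schedule, and the sample conflicts are all hypothetical.

```python
import math
import random

def penalty(assignment, conflicts):
    """Count pairs of conflicting events placed in the same slot."""
    return sum(1 for a, b in conflicts if assignment[a] == assignment[b])

def anneal(n_events, n_slots, conflicts, steps=5000, t0=2.0, cooling=0.999, seed=0):
    """Random local search with simulated annealing from one random start."""
    rng = random.Random(seed)
    current = [rng.randrange(n_slots) for _ in range(n_events)]
    cur_pen = penalty(current, conflicts)
    best, best_pen = current[:], cur_pen
    t = t0
    for _ in range(steps):
        event = rng.randrange(n_events)
        old_slot = current[event]
        current[event] = rng.randrange(n_slots)      # random neighboring timetable
        new_pen = penalty(current, conflicts)
        delta = new_pen - cur_pen
        # Always accept improvements; accept worse moves with probability exp(-delta / t).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            cur_pen = new_pen
            if cur_pen < best_pen:
                best, best_pen = current[:], cur_pen
        else:
            current[event] = old_slot                # reject the move
        t *= cooling                                 # geometric cooling
    return best, best_pen

def anneal_cohort(n_starts, **kwargs):
    """Run the annealer from several random starts and keep the best result."""
    runs = [anneal(seed=s, **kwargs) for s in range(n_starts)]
    return min(runs, key=lambda run: run[1])

# Hypothetical conflicts: pairs of events that share students or instructors.
conflicts = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (1, 5)]
single = anneal(n_events=6, n_slots=3, conflicts=conflicts)
cohort = anneal_cohort(5, n_events=6, n_slots=3, conflicts=conflicts)
print("single start penalty:", single[1])
print("cohort best penalty: ", cohort[1])
```

Since the cohort includes the same seed as the single run, its best penalty can never be worse than the single-start result; whether it is strictly better depends on the problem instance.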

Thesis Advisor: Dr. Nicholas Fisher


Forecasting the Closing price of Bitcoin Cryptocurrency using ARIMA, Prophet and LSTM models

Wednesday, April 20, 4-4:50pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/97782055828

Speaker: Abimbola Kolebaje

Abstract: Due to the difficulty in assessing the exact nature of a time series, it is often considerably challenging to generate appropriate forecasts. Over the years, various forecasting models have been developed in the literature, but they have produced minimal accuracy in forecasting financial trends. In recent years, the advent of Deep Learning has revolutionized the business of forecasting financial trends. This study involves time series forecasting of bitcoin closing prices with improved efficiency using long short-term memory (LSTM) techniques and compares its predictability with the traditional method (ARIMA). Additionally, we will implement the forecast of bitcoin price with the Facebook Prophet model and forecast future prices. The Mean Absolute Percentage Error (MAPE) of all three models will be compared to ascertain which model has the highest accuracy in forecasting bitcoin prices. In our case, the LSTM model outperforms the ARIMA and Prophet machine learning algorithms.
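For readers unfamiliar with the comparison metric, the following is a minimal sketch of how MAPE can be computed. The function name `mape` and the sample prices are made up for illustration and are not results from the thesis.

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent.
    Assumes equal-length, non-empty inputs with no zero actual values."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)

# Hypothetical closing prices and forecasts.
actual_prices = [100.0, 200.0, 400.0]
forecast_prices = [110.0, 180.0, 400.0]
print(f"MAPE: {mape(actual_prices, forecast_prices):.2f}%")
```

A lower MAPE indicates forecasts that are closer to the observed values in relative terms, which is why it is convenient for comparing models on the same price series.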

Thesis Advisor: Dr. Deepak Sanjel


Prediction of Abnormal Vaginal Discharge using Machine learning techniques among women living in rural and urban area of Tangail district, Bangladesh

Friday, April 8, 3-4pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/8314634593
Meeting ID: 831 463 4593
Passcode: 9682

Speaker: Aninda Roy

Abstract: In recent times, early detection of disease has become a crucial problem due to the rapid growth of the population worldwide. Women face many health complications that begin during their reproductive life. Abnormal vaginal discharge (AVD) is a prevalent problem among women. If it is not treated appropriately, it may lead to severe complications such as pelvic inflammatory disease and cervical cancer. In Bangladesh, women suffer from abnormal vaginal discharge due to a lack of proper hygiene. More importantly, their hesitation to talk about this problem leads to further complications in their health. This paper presents a qualitative study of women’s socio-demographic profile, personal hygienic practices, previous medical history, associated symptoms, characteristics of discharge and health-seeking pathways, and factors that influence abnormal vaginal discharge. Data was collected from Tangail district, Bangladesh using a predesigned survey questionnaire that includes questions designed to fulfill the study objective. The dataset had 280 total observations, of which 180 women (64.3%) were positive for AVD and the others (35.7%) were negative at the study time. The association of daily hygiene practices and associated symptoms with AVD was determined using the Chi-square test, where a p-value of less than 0.05 was considered statistically significant. The prime objective of this paper is to create a model for predicting abnormal vaginal discharge using four machine learning classification algorithms: K Nearest Neighbor (KNN), Support Vector Machine (SVM), Random Forest (RF), and Logistic Regression (LR). The performance of the classifiers is measured by their accuracy, sensitivity, specificity, positive predictive value, and negative predictive value. Additionally, these techniques were appraised on the area under the receiver operating characteristic (ROC) curve.
The results reveal that the LR model obtained the highest accuracy, sensitivity, and positive predictive value with the lowest specificity and negative predictive value of 81.4%, 91%, 82%, 64%, and 80%, respectively.

Thesis Advisor: Dr. Mezbahur Rahman


Chinese Remainder Theorem and its application on RSA (Rivest-Shamir-Adleman) cryptography

Wednesday, March 30, 4-5:30pm

Location: Wissink Hall 288 (WH 288)

Speaker: Ammishaddai Ogyiri

Abstract: The security of data has been an issue across the globe due to the potential threat to the confidentiality and integrity of data posed by third parties obtaining unauthorized access to protected data. Cryptography has come a long way to help maintain the security of data. The Symmetric-Key (Secret-Key Algorithm) and the Asymmetric-Key (Public-Key Algorithm) have been the two common classes of Cryptography that help make data extremely difficult to access without the authorized key. In this research paper, we delve into the Asymmetric-Key Algorithm and focus on the Rivest-Shamir-Adleman (RSA) algorithm. The more secure a key must be, the longer it takes to encrypt and decrypt the data. We compare the speed of encrypting and decrypting data with the ordinary RSA algorithm and RSA-CRT (Chinese Remainder Theorem). Moduli of 1024 bits and 4096 bits have been used for this comparison. We also discuss the effectiveness of the CRT in RSA cryptography in its security and the speed of the decryption process.
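The difference between ordinary RSA decryption and RSA-CRT can be sketched with deliberately tiny numbers. This is a textbook toy, assuming small primes `p = 61`, `q = 53` and exponent `e = 17` chosen only for readability; it has no padding, is not secure, and is not the 1024-bit or 4096-bit setup measured in the thesis. The helper names are hypothetical. It relies on Python 3.8+ for `pow(x, -1, m)`.

```python
def rsa_keygen(p, q, e):
    """Return (n, d) for primes p, q and public exponent e."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)          # modular inverse of e (Python 3.8+)
    return n, d

def decrypt_plain(c, d, n):
    """Ordinary RSA decryption: one exponentiation modulo n."""
    return pow(c, d, n)

def decrypt_crt(c, d, p, q):
    """RSA-CRT decryption: two smaller exponentiations modulo p and q,
    recombined with Garner's formula."""
    dp = d % (p - 1)
    dq = d % (q - 1)
    q_inv = pow(q, -1, p)
    m1 = pow(c, dp, p)
    m2 = pow(c, dq, q)
    h = (q_inv * (m1 - m2)) % p
    return m2 + h * q

p, q, e = 61, 53, 17             # toy parameters, far too small for real use
n, d = rsa_keygen(p, q, e)
message = 65
cipher = pow(message, e, n)

print(decrypt_plain(cipher, d, n), decrypt_crt(cipher, d, p, q))
```

Both routines recover the same message. The CRT variant works with exponents and moduli of roughly half the bit length, which is the source of the decryption speed-up for large keys.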

Thesis Advisor: Dr. In-Jae Kim


Spring 2021 Graduate Student Master Thesis Defense

Comparison of Classification Algorithms in machine learning

Wednesday, May 5, 3-3:50pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/918 4173 4732

Speaker: Dong Young Park

Abstract: Classification in data science is the process of predicting the class of given data points. Classes are sometimes called targets, labels, or categories. Classification predictive modeling is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y). Classification belongs to the category of supervised learning, where the targets are also provided with the input data. We use several classification algorithms to classify two different kinds of datasets. The algorithms we used are decision tree, support vector machine, logistic regression, and neural networks. The datasets we used are the MNIST handwritten digits dataset and a wine quality dataset. MNIST is image data, whereas the wine quality dataset is numerical.

Thesis Advisor: Dr. Namyong Lee


Conformal Deformation of Surfaces by the Extrinsic Dirac Operator

Wednesday, April 28, 2-2:50pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/96717822990

Speaker: Katelyn LaPorte

Abstract: The purpose of the APP is to survey the methods used by Crane and others to create conformal deformations of surfaces in 3-dimensional Euclidean space. His goal was to utilize this for applications in image processing. Here we will go into more detail of the mathematical theory behind his method including the not so familiar Quaternion-Valued Extrinsic Dirac Operator. We will also explain the integrability conditions of the conformal deformation problem, which can be reduced to an eigenvalue problem related to this Dirac operator. As it is a first order linear operator, it has high efficiency in discretization and surface curvature editing.

Thesis Advisor: Dr. Ke Zhu


Comparing Various Robust Estimation Techniques in Regression Analysis

Friday, April 23, 4-5pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/97782055828

Speaker: Tracy Sharon Morrison

Abstract: In regression analysis, the use of the ordinary least squares (OLS) method is inadvisable when dealing with outliers or extreme observations. As a result, we require a method of robust estimation in which the estimate is not significantly affected by outliers or extreme observations. Four methods of estimation will be compared in this paper in order to determine the best one: the M-estimation method, the Least Trimmed Squares estimator, the S-estimation method, and the MM-estimation method in robust regression. In this study, we find that the best method is the MM-estimation method. The M-estimation method is an extension of the maximum likelihood method, whereas the MM-estimation method is a development of the M-estimation method, and the S-estimation method is related to the M-estimation method due to the use of the M-estimation residual scale. While robust regression methods can significantly improve estimation precision, they should not be used in place of more traditional methods.

Thesis Advisor: Dr. Mezbahur Rahman


Count regression models for Covid-19 related deaths and overall deaths

Tuesday, Apr 20, 4-5pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/94358910903

Speaker: Manori Ampe Mohottige Dona

Abstract: With the start of the ongoing Covid-19 pandemic, the number of deaths worldwide has increased considerably. Confirmed coronavirus cases in the United States increased significantly in the third week of March 2020 as testing was made more rapid and overtook China’s on the 26th of March 2020, making the US the world’s most affected country by the coronavirus.

This study aims to determine the relationship of overall death counts and Covid-19 related death counts of five main states in the United States to the different age groups and gender over the period of one year. The data were collected from the government data repository, data.gov.

Poisson Regression analysis and Negative Binomial Regression analysis were used for model building purposes and total death count prediction. The k-fold cross-validation and leave-one-out cross-validation were used to identify the best model.

The Negative Binomial regression model was identified as the best model compared to the Poisson regression model. According to the model, the most significant factor for total deaths and Covid-19 deaths is gender. Texas has the highest significant contribution to the Covid-19 model and the most significant age group is 84 years or over.

Thesis Advisor: Dr. Iresha Premarathna


Correlational Study: An Application of Factor Analysis on a Life Expectancy Data Set

Monday, April 19, 4:15-5:15pm

Speaker: Afrah Alhamad

Abstract: Many statistical techniques focus on analyzing the association between two variables. However, these techniques are not very useful when the interest centers on analyzing the mutual associations across all the variables with no distinctions made between them. Factor analysis is one of the multivariate statistical methods commonly used for this purpose. This paper applies and explains the exploratory factor analysis procedure using a data set. Additionally, the theoretical aspects of factor analysis are briefly discussed from a practical, applied perspective. Particularly, the objective of the paper is to explore the factorial structure of a life expectancy data set by means of exploratory factor analysis and to identify the factor scores.

Thesis Advisor: Dr. Mezbahur Rahman


Prediction of Heart Disease Using Bayesian Logistic Regression by Polya-Gamma Data Augmentation

Friday, April 16, 3-4pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/939 3476 3132

Speaker: Zhenhan Fang

Abstract: Heart disease is one of the most common diseases nowadays, due to a number of contributing factors, such as high blood pressure, high blood cholesterol, and smoking. About half of Americans (47%) have at least one of these three risk factors. To reduce the risk of heart disease, healthcare industries generate enormous amounts of data and have been seeking an early diagnosis of such disease for many years. Many data analytics tools have also been applied to help health care providers identify some of the early signs of heart disease. Many tests can be performed on potential patients so that extra precautionary measures can be taken to reduce the effect of having such a disease, and reliable methods to predict the early stages of heart disease are needed. In this study, Logistic Regression and Bayesian Logistic Regression are used to establish models to predict heart disease. We apply the Polya-Gamma data augmentation to our Bayesian Logistic model. We found that the Bayesian Logistic model can provide better performance, although it is more expensive than the general Logistic model.

Thesis Advisor: Dr. Han Wu


Classification of Chess Games: An exploration of classifiers for anomaly detection in chess

Friday, April 2, 4-5pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/5074676277

Speaker: Masudul Hoque

Abstract: Chess is a strategy board game with its inception dating back to the 15th century. The Covid-19 pandemic has led to a chess boom online with 95,853,038 chess games being played in January 2021 on one online chess site (lichess.com) alone. Along with the chess boom, instances of cheating have also become more rampant. Classification has been used for anomaly detection in fields such as network security and online games, and thus it is a natural idea to develop classifiers to detect cheating. However, there are no prior examples of this, and it is difficult to obtain data where cheating has occurred. So in this paper, we develop 4 machine learning classifiers, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Multinomial Logistic Regression, and K Nearest Neighbour classifiers, to predict chess game results and explore predictors that produce the best accuracy performance. We use Confusion Matrix, K Fold Cross Validation, and Leave One Out Cross Validation methods to find the accuracy metrics.

There are three phases of analysis. In Phase I, we train classifiers using 1.94 million over-the-board games as training data and 20 thousand online games as testing data and obtain accuracy metrics. In Phase II, we select a smaller pool of 212 games, pick 8 additional predictor variables from chess engine evaluation of the moves played in those games, and check whether the inclusion of the variables improves performance. Finally, in Phase III, we investigate patterns in misclassified cases to define anomalous values.

From Phase I, the models are not performing at a utilizable level of accuracy (44-63%). For all classifiers, it is no better than deciding the class with a coin toss. K Nearest Neighbour with K = 7 was the best model. In Phase II, adding the new predictors improved the performance of all the classifiers significantly across all validation methods. In fact, using only significant variables as predictors produced highly accurate classifiers. Finally, from Phase III, we could not find any patterns or significant differences between the predictors for both correct classifications and misclassifications.

In conclusion, machine learning classification is only one useful tool to spot instances that indicate anomalies. However, we cannot judge whether games are anomalous using only one method.

Thesis Advisor: Dr. Iresha Premarathna


Sequential Probability Ratio Test and Experiment

Tuesday, March 16, 4-5pm

Location: AH 102

Speaker: Brianna Klapoetke

Abstract: The Sequential Probability Ratio Test (SPRT) is a method of testing simple hypotheses where the sample size is not determined in advance. In this talk I will describe the general process of using the SPRT, overview the theory that supports it, and describe how I applied it to data I collected to determine what alpha values people used to make their decisions in a simple game I designed.
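As a concrete picture of the general process, here is a minimal sketch of Wald's SPRT for Bernoulli data with two simple hypotheses. The function name `sprt_bernoulli`, the values `p0 = 0.5` and `p1 = 0.8`, and the sample data are assumptions for illustration and are unrelated to the game experiment described in the talk. The stopping thresholds use Wald's standard approximations based on the error rates alpha and beta.

```python
import math

def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.05):
    """Wald's SPRT for H0: p = p0 versus H1: p = p1 on 0/1 data.
    Returns (decision, number of observations used)."""
    upper = math.log((1 - beta) / alpha)    # cross this: accept H1
    lower = math.log(beta / (1 - alpha))    # cross this: accept H0
    llr = 0.0                               # cumulative log-likelihood ratio
    for n, x in enumerate(observations, start=1):
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", n
        if llr <= lower:
            return "accept H0", n
    return "continue sampling", len(observations)

# Hypothetical stream of successes (1) and failures (0).
data = [1, 1, 0, 1, 1, 1, 1, 1, 1, 1]
print(sprt_bernoulli(data, p0=0.5, p1=0.8))
```

The sample size is not fixed in advance: the test stops as soon as the accumulated evidence crosses either threshold, and otherwise reports that more data are needed.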

Thesis Advisor: Dr. Mezbahur Rahman


Fall 2020 Graduate Student Master Thesis Defense

Grobner Bases and Systems of Polynomial Equations

Monday, November 23, 4-5pm

Speaker: Rachel Holmes

Abstract: The goal of this paper is to explore the use and construction of Grobner bases through Buchberger’s algorithm. Specifically, applications of such bases for solving systems of polynomial equations will be discussed. Furthermore, we relate many concepts in commutative algebra to ideas in computational algebraic geometry.

Thesis Advisor: Dr. Wook Kim


Improvement in Regression Analysis through Optimal Clustering Algorithms with Machine Learning

Wednesday, November 18, 3-4pm

Speaker: Taeyoung Choi

Abstract: The primary purpose of the project is to enhance the quality of data analysis by adapting various clustering systems with machine learning and to apply the advanced clustering techniques to a regression model in order to improve the efficiency of the analysis. First and foremost, this research aims to expand the knowledge of data analysis through diverse clustering algorithms, including Hierarchical, K-Means, Partitioning Around Medoids (PAM), Clustering Large Applications (CLARA), and Clustering Large Applications based upon Randomized Search (CLARANS). The clustering algorithms assist in high-quality data analysis by constructing particular groups within the given data. The clustering techniques are easily applicable to multiple fields, including the clinical, manufacturing, and business sectors. For example, large type II diabetes patient data sets with numerous variables could be grouped by relevant personal medical histories, physical activity level, response to a certain treatment, or diet habits through the appropriate cluster analysis.

Thesis Advisor: Dr. Mezbahur Rahman


Summer 2020 Graduate Student Master Thesis Defense

Multiple Regression Analysis with Continuous and Binary Response Variable

Friday, August 14, 11am-12pm

Speaker: Eunhye Lee

Abstract: This alternate plan paper analyzes student data with different regression models in order to find the best-fitting model among them. Inferential statistics can provide more information than descriptive statistics by answering questions about the data, testing hypotheses, and fitting a proper model, not only to describe the relationships in a data set but also to predict a target. Regression is a statistical method that can be used to reveal the relationships between variables in numerous fields, including finance, marketing, biology, investment, health, and even psychology. The main question in this paper is which variables affect the final grade the most. The goal is to properly fit a multiple linear regression model and a multiple logistic regression model, and to detect the most relevant and effective variables in the fitted models to help understand the final mathematics grade. I will first cover the linear regression model, one of the basic types of regression, which describes the simultaneous associations of observed variables with a continuous dependent variable. To obtain a valid linear regression model, the assumptions of residual normality, linearity, independence of residual terms, zero mean of residuals, and homogeneity of residual variance are checked. Secondly, logistic regression is used to study binary outcomes regardless of how the other regressors are measured. The logistic model is based on the logit function and is interpreted in terms of a probability rather than a value. The assumptions for logistic regression are that the response variable is ordinal, the error terms are independent, multicollinearity is absent, the independent variables are linearly related to the log odds, and the sample size is large.

Thesis Advisor: Dr. Mezbahur Rahman


Soybean Price Prediction Using Time Series Forecasting with Google Trend

Friday, August 7, 10-11am

Speaker: Zhuoning Li

Abstract: We use time series methods to analyze the trend and predict prices in the U.S. soybean commodity market, and to find the impact on the soybean price of the "trade war" between China and the U.S. We use autoregressive integrated moving average and autoregressive conditional heteroskedasticity models to predict the soybean price using U.S. soybean daily price data, and we also use vector autoregression (VAR) and long short-term memory models to predict the soybean price using the previous data and Google Trends data. Comparing these methods, we get the best prediction from the VAR model.

Thesis Advisor: Dr. Deepak Sanjel


An Application for Bank Loan Default Prediction Analysis using Logistic Regression and Support Vector Machine

Friday, July 31, 10-11am

Speaker: Shuk Ping Wong

Abstract: Risk Management is one of the most crucial areas for banks. To maintain a sustainable and profitable business, banks are constantly working on effective models to estimate the likelihood that a customer will default. Although credit scoring is a common indicator for bankers, some financial datasets simply do not come with this variable. This study built a logistic regression model and a support vector machine (SVM) model to predict whether the loan borrower will default based on different categorical variables. The performance of the models is compared based on accuracy and efficiency. We found that a logistic regression model generally provides more depth in analysis of the variables and is better in terms of interpretability. Although SVM has a higher accuracy rate, the method took too much time for the computer to run and it suffers from a lack of interpretability. The logistic regression model performs better in general.

Thesis Advisor: Dr. Mezbahur Rahman


Number Construction

Monday, July 20, 11am-12pm

Speaker: Brian Bertness

Abstract: This paper describes how numbers are constructed via sets and equivalence relations. The necessary Zermelo-Fraenkel set theory axioms are used to define basic sets, relations, and functions. Employing the Axiom of Infinity, the natural numbers are then constructed in terms of sets with an ordering that also conforms to the Peano axioms. Using the set of natural numbers and an equivalence relation, the set of integers with an ordering is created, followed in turn by the set of rational numbers. Lastly, Cauchy sequences are introduced and, using an equivalence relation, these are turned into the set of real numbers, which are shown to have an ordering and the completeness property.
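One common way to build the integers from the natural numbers is to treat a pair (a, b) as standing for the difference a - b, with (a, b) equivalent to (c, d) when a + d = b + c. The sketch below illustrates that standard construction; it is not necessarily the exact formulation used in the paper, and the function names are hypothetical.

```python
def equivalent(x, y):
    """(a, b) ~ (c, d) if and only if a + d == b + c, using only natural-number addition."""
    (a, b), (c, d) = x, y
    return a + d == b + c

def add(x, y):
    """(a, b) + (c, d) = (a + c, b + d)."""
    return (x[0] + y[0], x[1] + y[1])

def mul(x, y):
    """(a, b) * (c, d) = (ac + bd, ad + bc)."""
    (a, b), (c, d) = x, y
    return (a * c + b * d, a * d + b * c)

def canonical(x):
    """Pick the representative of the class with at least one zero entry."""
    a, b = x
    m = min(a, b)
    return (a - m, b - m)

minus_three = (2, 5)      # one representative of the integer -3
minus_two = (0, 2)        # one representative of the integer -2

print(equivalent(minus_three, (0, 3)))        # same class as (0, 3)
print(canonical(mul(minus_three, minus_two))) # representative of the product
print(canonical(add(minus_three, (4, 1))))    # -3 + 3
```

The operations are defined only with natural-number addition and multiplication, and they respect the equivalence relation, so they are well defined on equivalence classes rather than on individual pairs.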

Thesis Advisor: Dr. Wook Kim


Spring 2020 Graduate Student Master Thesis Defense

The Roots of Root Finding

Wednesday, May 6, 2-3pm

Speaker: Kurt Grunzke

Abstract: One of the biggest challenges facing teachers is convincing students that their intuition about a concept is incorrect. In particular, our current social climate fuels an intuition that mathematicians are “nerds,” “geeks,” or other terms that generally refer to a boring person who lacks social skills. The goal of this paper is to demolish that stereotype by demonstrating that mathematicians are independent, argumentative, and vibrant individuals, whose energy is fueled by the social climate of their time. In order to demonstrate these characteristics, we will consider the question of solving polynomial equations, and not just one of them, but all of them. The answer to our question will span thousands of years, cross through multiple civilizations and continents, and introduce us to some lively mathematicians. Furthermore, this investigation will provide an approachable access point to concepts in higher mathematics.

Thesis Advisor: Dr. Namyong Lee


Discrete Morse Theory by Vector Fields: A Survey and New Directions

Tuesday, May 5, 4-5pm

Speaker: Matthew Nemitz

Abstract: We synthesize some of the main tools in discrete Morse theory from various sources. We do this with regard to abstract simplicial complexes, with an emphasis on vector fields, and use this as a building block to achieve our main result, which is to investigate the relationship between simplicial maps and homotopy. We use the discrete vector field as a catalyst to build a chain homotopy between chain maps induced by simplicial maps.

Thesis Advisor: Dr. Brandon Rowekamp


Heat Kernel Voting with Geometric Invariants

Friday, May 1, 4-5pm

Speaker: Alexander Harr

Abstract: Here we provide a method for comparing geometric objects. Two objects of interest are embedded into an infinite-dimensional Hilbert space using their Laplacian eigenvectors and eigenfunctions, then truncated to a finite-dimensional Euclidean space, where correspondences between the objects are found and voted on. To simplify correspondence finding, we propose using several geometric invariants to reduce the necessary computations. This method improves on voting methods by identifying isometric regions in shapes of dimension greater than 3, and genus greater than 0, as well as almost retaining isometry. The voting approach evaluates local correspondences while at the same time respecting the global structure.

Thesis Advisor: Dr. Ke Zhu

A Mathematical Model for Malaria with Age-Heterogeneous Biting Rate

Wednesday, April 22, 3-4pm

Speaker: Sho Kawakami

Abstract: We propose a mathematical model for malaria with age-heterogeneous biting rate. The existence of the model and the local behaviour of the disease-free equilibrium are explored. Furthermore, the model is extended to an optimal control problem and the corresponding adjoint equations and optimality conditions are derived. Age-dependent parameter values are estimated and numerical simulations are carried out for the model. The new model better accounts for differences in biting rates between age groups and brings improvements in stability to the explicit algorithm. The optimal control is also shown to depend on the age distribution of the biting rate.

Thesis Advisor: Dr. Ruijun Zhao


Apply logistic regression procedures to datasets with a binary and a nominal response variable

Wednesday, April 15, 2-3pm

Speaker: Duaa Alsubhi

Abstract: A major emphasis of this paper is on applying a binomial and a multinomial regression. In binomial regression, we used a heart disease dataset to illustrate how to build a modeling strategy by using a purposeful selection variable to determine the model with the best fit. In multinomial regression, we used an Adolescent Placement Study dataset to compare the logistic regression model with and without insignificant independent variables. In addition, we are interested in the impact of the insignificant predictor variable, which is explained in terms of an odds ratio.

Thesis Advisor: Dr. Mezbahur Rahman


Theory of Principal Components for Applications in Exploratory Crime Analysis and Clustering

Thursday, April 9, 3-4pm

Speaker: Daniel Silva

Abstract: The purpose of this paper is to develop the theory of principal components analysis succinctly from the fundamentals of matrix algebra and multivariate statistics. Principal components analysis is sometimes used as a descriptive technique to explain the variance-covariance or correlation structure of a dataset. However, most often, it is used as a dimensionality reduction technique to visualize a high dimensional dataset in a lower dimensional space. Principal components analysis accomplishes this by using the first few principal components, provided that they account for a substantial proportion of variation in the original dataset. In the same way, the first few principal components can be used as inputs into a cluster analysis in order to combat the curse of dimensionality and optimize the runtime for large datasets. The application portion of this paper will apply these methods to a US Crime 2018 dataset extracted from the Uniform Crime Reports on the FBI’s website.

Thesis Advisor: Dr. Iresha Premarathna


Spring 2022 Graduate Student Master Thesis Defense

Forecasting the Closing price of Bitcoin Cryptocurrency using ARIMA, Prophet and LSTM models

Wednesday, April 20, 4:00-4:50 PM

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/97782055828

Speaker: Abimbola Kolebaje

Abstract: Due to the difficulty in assessing the exact nature of a time series, it is often considerably challenging to generate appropriate forecasts. Over the years, various forecasting models have been developed in the literature, but they have produced minimal accuracy in forecasting financial trends. In recent years, the advent of Deep Learning has revolutionized the business of forecasting financial trends. This study involves the time series forecasting of bitcoin closing prices with improved efficiency using long short-term memory (LSTM) techniques and compares its predictability with the traditional method (ARIMA). Additionally, we will implement the forecast of bitcoin price with the Facebook Prophet model and forecast future prices. The Mean Absolute Percentage Error (MAPE) of all three models will be compared to ascertain which model has the highest accuracy in forecasting bitcoin prices. In our case, the LSTM model outperforms the ARIMA and Prophet models.
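
As a rough illustration of the comparison metric named in the abstract above, MAPE can be computed as in the sketch below. The function name and the sample prices are hypothetical and are not taken from the thesis.

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent.

    Assumes both sequences have the same length and that no actual
    value is zero (the metric is undefined in that case).
    """
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


# Hypothetical closing prices and forecasts, for illustration only.
actual_prices = [100.0, 200.0]
model_forecast = [110.0, 180.0]
error_percent = mape(actual_prices, model_forecast)
```

A lower MAPE means forecasts that are closer to the observed prices on average, which is the sense in which one model "outperforms" another here.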

Thesis Advisor: Dr. Deepak Sanjel


Prediction of Abnormal Vaginal Discharge using Machine learning techniques among women living in rural and urban area of Tangail district, Bangladesh

Friday, April 8, 3-4pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/8314634593
Meeting ID: 831 463 4593
Passcode: 9682

Speaker: Aninda Roy

Abstract: In recent times, early detection of disease has become a crucial problem due to the rapid growth of the population worldwide. Women face many health complications that start during their reproductive life. Abnormal vaginal discharge (AVD) is a prevalent problem among women. If it is not treated appropriately, it may lead to severe complications such as pelvic inflammatory disease and cervical cancer. In Bangladesh, women suffer from abnormal vaginal discharge due to a lack of proper hygiene. More importantly, their hesitation to talk about this problem leads to further complications in their health. This paper presents a qualitative study of women’s socio-demographic profile, personal hygienic practices, previous medical history, associated symptoms, characteristics of discharge and health-seeking pathways, and factors that influence abnormal vaginal discharge. Data was collected from Tangail district, Bangladesh using a predesigned survey questionnaire that includes questions designed to fulfill the study objective. The dataset had 280 total observations, of which 180 women (64.3%) were positive for AVD and the others (35.7%) were negative at the study time. Associations of daily hygiene practices and associated symptoms with AVD were determined using the Chi-square test, where a p-value of less than 0.05 was considered statistically significant. The prime objective of this paper is to create a model for predicting abnormal vaginal discharge using four machine learning classification algorithms: K Nearest Neighbor (KNN), Support Vector Machine (SVM), Random Forest (RF), and Logistic Regression (LR). The performance of the classifiers is measured by their accuracy, sensitivity, specificity, positive predictive value, and negative predictive value. Additionally, these techniques were appraised on the area under the receiver operating characteristic (ROC) curve. The results reveal that the LR model obtained the highest accuracy, sensitivity, and positive predictive value with the lowest specificity and negative predictive value of 81.4%, 91%, 82%, 64%, and 80%, respectively.
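
The five performance measures listed in the abstract above all follow from the four cells of a binary confusion matrix. The sketch below shows the standard definitions; the function name and the counts are hypothetical and do not reproduce the study's data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp, fp, tn, fn are the numbers of true positives, false positives,
    true negatives, and false negatives. Assumes no denominator is zero.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```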

Thesis Advisor: Dr. Mezbahur Rahman


Chinese Remainder Theorem and its application on RSA (Rivest-Shamir-Adleman) cryptography

Wednesday, March 30, 4-5:30pm

Location: Wissink Hall 288 (WH 288)

Speaker: Ammishaddai Ogyiri

Abstract: The security of data has been an issue across the globe due to the potential threat to the confidentiality and integrity of data posed by third parties obtaining unauthorized access to protected data. Cryptography has come a long way to help maintain the security of data. The Symmetric-Key (Secret-Key Algorithm) and the Asymmetric-Key (Public-Key Algorithm) have been the two common classes of Cryptography that help make data extremely difficult to access without the authorized key. In this research paper, we delve into the Asymmetric-Key Algorithm and focus on the Rivest-Shamir-Adleman (RSA) algorithm. The more secure a key must be, the longer it takes to encrypt and decrypt the data. We compare the speed of encrypting and decrypting data with the ordinary RSA algorithm and RSA-CRT (Chinese Remainder Theorem). Moduli of 1024 bits and 4096 bits have been used for this comparison. We also discuss the effectiveness of the CRT in RSA cryptography in its security and the speed of the decryption process.
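
To make the RSA-CRT idea in the abstract above concrete, the sketch below decrypts the same ciphertext with ordinary RSA and with the standard CRT recombination. The toy primes, exponent, and function names are assumptions chosen for readability; they are not the 1024-bit or 4096-bit moduli used in the paper and are far too small for real use.

```python
# Toy RSA parameters (hypothetical; far too small to be secure).
p, q = 61, 53
n = p * q
e = 17
d = pow(e, -1, (p - 1) * (q - 1))  # modular inverse (Python 3.8+)

# Values precomputed once for CRT decryption.
d_p = d % (p - 1)
d_q = d % (q - 1)
q_inv = pow(q, -1, p)


def decrypt_plain(c):
    """Ordinary RSA decryption: one exponentiation modulo n."""
    return pow(c, d, n)


def decrypt_crt(c):
    """RSA-CRT decryption: two smaller exponentiations, then recombination."""
    m_p = pow(c, d_p, p)
    m_q = pow(c, d_q, q)
    h = (q_inv * (m_p - m_q)) % p
    return m_q + h * q


message = 65
cipher = pow(message, e, n)
assert decrypt_plain(cipher) == decrypt_crt(cipher) == message
```

The CRT variant works with exponents and moduli of roughly half the bit length, which is the source of the decryption speed-up that the comparison examines.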

Thesis Advisor: Dr. In-Jae Kim


Spring 2021 Graduate Student Master Thesis Defense

Comparison of Classification Algorithms in machine learning

Wednesday, May 5, 3-3:50pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/918 4173 4732

Speaker: Dong Young Park

Abstract: Classification in data science is the process of predicting the class of given data points. Classes are sometimes called targets, labels, or categories. Classification predictive modeling is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y). Classification belongs to the category of supervised learning, where the targets are also provided with the input data. We use several classification algorithms to classify two different kinds of datasets. The algorithms we used are decision tree, support vector machine, logistic regression, and neural networks. The datasets we used are the MNIST handwritten digits dataset and a wine quality dataset. MNIST consists of image data, whereas the wine quality dataset is numerical.

Thesis Advisor: Dr. Namyong Lee


Conformal Deformation of Surfaces by the Extrinsic Dirac Operator

Wednesday, April 28, 2-2:50pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/96717822990

Speaker: Katelyn LaPorte

Abstract: The purpose of the APP is to survey the methods used by Crane and others to create conformal deformations of surfaces in 3-dimensional Euclidean space. His goal was to utilize this for applications in image processing. Here we will go into more detail of the mathematical theory behind his method including the not so familiar Quaternion-Valued Extrinsic Dirac Operator. We will also explain the integrability conditions of the conformal deformation problem, which can be reduced to an eigenvalue problem related to this Dirac operator. As it is a first order linear operator, it has high efficiency in discretization and surface curvature editing.

Thesis Advisor: Dr. Ke Zhu


Comparing Various Robust Estimation Techniques in Regression Analysis

Friday, April 23, 4-5pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/97782055828

Speaker: Tracy Sharon Morrison

Abstract: In regression analysis, the use of the ordinary least squares (OLS) method is inadvisable when dealing with outliers or extreme observations. As a result, we require a method of robust estimation in which the estimated value is not significantly affected by outliers or extreme observations. Four methods of estimation will be compared in this paper in order to determine the best estimation: the M-estimation method, the Least Trimmed Squares estimator, the S-estimation method, and the MM-estimation method in robust regression. We discover that the best method is the MM-estimation method in this study. The M-estimation method is an extension of the maximum likelihood method, whereas the MM-estimation method is a development of the M-estimation method, and the S-estimation method is related to the M-estimation method due to the use of the M-estimation residual scale. While robust regression methods can significantly improve estimation precision, they should not be used in place of more traditional methods.

Thesis Advisor: Dr. Mezbahur Rahman


Count regression models for Covid-19 related deaths and overall deaths

Tuesday, Apr 20, 4-5pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/94358910903

Speaker: Manori Ampe Mohottige Dona

Abstract: With the start of the ongoing Covid-19 pandemic, the number of deaths worldwide has increased considerably. Confirmed coronavirus cases in the United States increased significantly in the third week of March 2020 as testing was made more rapid, and overtook China’s on the 26th of March 2020, making the US the world’s most affected country by the coronavirus.

This study aims to determine the relationship of overall death counts and Covid-19 related death counts of five main states in the United States to the different age groups and gender over the period of one year. The data were collected from the government data repository, data.gov.

Poisson Regression analysis and Negative Binomial Regression analysis were used for model building purposes and total death count prediction. The k fold cross-validation and leave-one-out cross-validation were used to identify the best model.

The Negative Binomial regression model was identified as the best model compared to the Poisson regression model. According to the model, the most significant factor for total deaths and covid-19 deaths is gender. Texas has the highest significant contribution to the Covid-19 model and the most significant age group is 84 years or over.

Thesis Advisor: Dr. Iresha Premarathna


Correlational Study: An Application of Factor Analysis on a Life Expectancy Data Set

Monday, April 19, 4:15-5:15pm

Speaker: Afrah Alhamad

Abstract: Many statistical techniques focus on analyzing the association between two variables. However, these techniques are not very useful when the interest centers on analyzing the mutual associations across all the variables with no distinctions made between them. Factor analysis is one of the multivariate statistical methods commonly used for this purpose. This paper applies and explains the exploratory factor analysis procedure using a data set. Additionally, the theoretical aspects of factor analysis are briefly discussed from a practical, applied perspective. Particularly, the objective of the paper is to explore the factorial structure of a life expectancy data set by means of exploratory factor analysis and to identify the factor scores.

Thesis Advisor: Dr. Mezbahur Rahman


Prediction of Heart Disease Using Bayesian Logistic Regression by Polya-Gamma Data Augmentation

Friday, April 16, 3-4pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/939 3476 3132

Speaker: Zhenhan Fang

Abstract: Heart disease is one of the most common diseases nowadays, due to a number of contributing factors, such as high blood pressure, high blood cholesterol, and smoking. About half of Americans (47%) have at least one of these three risk factors. To reduce the risk of heart disease, healthcare industries generate an enormous amount of data, and have been seeking an early diagnosis of such disease for many years. Many data analytics tools have also been applied to help health care providers identify some of the early signs of heart disease. Many tests can be performed on potential patients so that extra precautionary measures can be taken to reduce the effect of such a disease, and reliable methods are needed to predict the early stages of heart disease. In this study, Logistic Regression and Bayesian Logistic Regression are used to establish models to predict heart disease. We apply the Polya-Gamma data augmentation to our Bayesian Logistic model. We found that the Bayesian Logistic model can provide better performance, although it is more expensive than the general Logistic model.

Thesis Advisor: Dr. Han Wu


Classification of Chess Games: An exploration of classifiers for anomaly detection in chess

Friday, April 2, 4-5pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/5074676277

Speaker: Masudul Hoque

Abstract: Chess is a strategy board game with its inception dating back to the 15th century. The Covid-19 pandemic has led to a chess boom online, with 95,853,038 chess games being played in January 2021 on one online chess site (lichess.com) alone. Along with the chess boom, instances of cheating have also become more rampant. Classifiers have been used for anomaly detection in fields such as network security and online games, and thus it is a natural idea to develop classifiers to detect cheating. However, there are no prior examples of this, and it is difficult to obtain data where cheating has occurred. So in this paper, we develop 4 machine learning classifiers, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Multinomial Logistic Regression, and K Nearest Neighbour, to predict chess game results and explore predictors that produce the best accuracy performance. We use Confusion Matrix, K Fold Cross Validation, and Leave One Out Cross Validation methods to find the accuracy metrics.

There are three phases of analysis. In Phase I, we train classifiers using 1.94 million over-the-board games as training data and 20 thousand online games as testing data and obtain accuracy metrics. In Phase II, we select a smaller pool of 212 games, pick 8 additional predictor variables from chess engine evaluation of the moves played in those games, and check whether the inclusion of the variables improves performance. Finally, in Phase III, we investigate patterns in misclassified cases to define anomalous values.

From Phase I, the models are not performing at a utilizable level of accuracy (44-63%). For all classifiers, it is no better than deciding the class with a coin toss. K Nearest Neighbour with K = 7 was the best model. In Phase II, adding the new predictors improved the performance of all the classifiers significantly across all validation methods. In fact, using only significant variables as predictors produced highly accurate classifiers. Finally, from Phase III, we could not find any patterns or significant differences between the predictors for both correct classifications and misclassifications.

In conclusion, machine learning classification is only one useful tool to spot instances that indicate anomalies. However, we cannot simply judge games to be anomalous using only one method.

Thesis Advisor: Dr. Iresha Premarathna


Sequential Probability Ratio Test and Experiment

Tuesday, March 16, 4-5pm

Location: AH 102

Speaker: Brianna Klapoetke

Abstract: The Sequential Probability Ratio Test (SPRT) is a method of testing simple hypotheses where the sample size is not determined in advance. In this talk I will describe the general process of using the SPRT, overview the theory that supports it, and describe how I applied it to data I collected to determine what alpha values people used to make their decisions in a simple game I designed.
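
As a minimal sketch of the general SPRT process mentioned in the abstract above, the code below applies Wald's test to Bernoulli (success/failure) observations. The hypotheses p0 and p1, the error rates alpha and beta, and the function name are illustrative assumptions; they are not the setup of the speaker's experiment.

```python
import math


def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.05):
    """Wald's SPRT for H0: p = p0 against H1: p = p1 on 0/1 observations.

    Returns (decision, number_of_observations_used). The sample size is
    not fixed in advance: sampling stops as soon as the cumulative
    log-likelihood ratio crosses one of the two thresholds.
    """
    upper = math.log((1 - beta) / alpha)  # crossing it accepts H1
    lower = math.log(beta / (1 - alpha))  # crossing it accepts H0
    llr = 0.0
    used = 0
    for x in observations:
        used += 1
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", used
        if llr <= lower:
            return "accept H0", used
    return "continue sampling", used


# Hypothetical data: a run of successes when testing p0 = 0.5 vs p1 = 0.8.
decision, n_used = sprt_bernoulli([1] * 10, p0=0.5, p1=0.8)
```

The thresholds use Wald's usual approximations, so the realized error rates are only approximately alpha and beta.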

Thesis Advisor: Dr. Mezbahur Rahman


Fall 2020 Graduate Student Master Thesis Defense

Gröbner Bases and Systems of Polynomial Equations

Monday, November 23, 4-5pm

Speaker: Rachel Holmes

Abstract: The goal of this paper is to explore the use and construction of Gröbner bases through Buchberger’s algorithm. Specifically, applications of such bases for solving systems of polynomial equations will be discussed. Furthermore, we relate many concepts in commutative algebra to ideas in computational algebraic geometry.

Thesis Advisor: Dr. Wook Kim


Improvement in Regression Analysis through Optimal Clustering Algorithms with Machine Learning

Wednesday, November 18, 3-4pm

Speaker: Taeyoung Choi

Abstract: The primary purpose of the project is to enhance the quality of data analysis by adapting various clustering systems with machine learning and to apply the advanced clustering techniques to regression models in order to improve the efficiency of the analysis. First and foremost, this research aims to expand the knowledge of data analysis through diverse clustering algorithms, including Hierarchical, K-Means, Partitioning Around Medoids (PAM), Clustering Large Applications (CLARA), and Clustering Large Applications based upon Randomized Search (CLARANS). The clustering algorithms assist in high-quality data analysis by constructing particular groups within the given data. The clustering techniques are readily applicable to multiple fields, including clinical, manufacturing, or business sectors. For example, large type II diabetes patient information data sets with numerous variables could be classified by relevant personal medical histories, physical activity level, response to a certain treatment, or diet habits through the appropriate cluster analysis.

Thesis Advisor: Dr. Mezbahur Rahman


Summer 2020 Graduate Student Master Thesis Defense

Multiple Regression Analysis with Continuous and Binary Response Variable

Friday, August 14, 11am-12pm

Speaker: Eunhye Lee

Abstract: This alternate plan paper analyzes student data with different regression models in order to find the best-fitting model among them. Inferential statistics can provide more information than descriptive statistics by answering questions about the data, testing hypotheses, and fitting a proper model, not only to describe the relationships in a data set but also to predict a target. Regression is a statistical method that can be used to reveal the relationship between variables in numerous fields, including finance, marketing, biology, investment, health, and even psychology. The main question in this paper is which variables affect the final grade the most. The goal is to properly fit a multiple linear regression model and a multiple logistic regression model, and to detect the most relevant and effective variables in the fitted models in order to better understand the final mathematics grade. I will first cover the linear regression model, one of the basic types of regression, which describes the simultaneous associations of observed variables with a continuous dependent variable. To obtain a valid linear regression model, the assumptions of residual normality, linearity, independence of residual terms, zero mean of residuals, and homogeneity of residual variance are checked. Secondly, the logistic regression model is used to study binary outcomes regardless of how the other regressors are measured. The logistic model is based on the logit function and is interpreted in terms of a probability rather than a value. The assumptions for logistic regression are that the response variable is ordinal, the error terms are independent, multicollinearity is absent, and the independent variables are linearly related to the log odds, with a large sample size.

Thesis Advisor: Dr. Mezbahur Rahman


Soybean Price Prediction Using Time Series Forecasting with Google Trend

Friday, August 7, 10-11am

Speaker: Zhuoning Li

Abstract: We use time series methods to analyze the trend and predict the price in the U.S. soybean commodity market, and to find the impact on the soybean price of the "trade war" between China and the U.S. We use autoregressive integrated moving average and autoregressive conditional heteroskedasticity models to predict the soybean price from U.S. soybean daily price data, and we also use vector autoregression (VAR) and long short-term memory models to predict the soybean price from the previous data and Google Trends data. By comparing these methods, we get the best prediction from the VAR model.

Thesis Advisor: Dr. Deepak Sanjel


An Application for Bank Loan Default Prediction Analysis using Logistic Regression and Support Vector Machine

Friday, July 31, 10-11am

Speaker: Shuk Ping Wong

Abstract: Risk Management is one of the most crucial areas for banks. Banks are constantly working on effective models to estimate the likelihood of whether a customer could default to maintain a sustainable and profitable business. Although credit scoring is a common indicator for bankers, some financial datasets simply do not come with this variable. This study built a logistic regression model and a support vector machine (SVM) model to predict whether the loan borrower will default based on different categorical variables. The performance of the models is compared based on accuracy and efficiency. We found that a logistic regression model generally provides more depth in analysis of the variables and is better in terms of interpretability. Although SVM has a higher accuracy rate, the method took too much time for the computer to run and it suffers from a lack of interpretability. Logistic regression model has a better performance in general.

Thesis Advisor: Dr. Mezbahur Rahman


Number Construction

Monday, July 20, 11am-12pm

Speaker: Brian Bertness

Abstract: This paper describes how numbers are constructed via sets and equivalence relations. The necessary Zermelo-Fraenkel set theory axioms are used to define basic sets, relations, and functions. Employing the Axiom of Infinity, the natural numbers are then constructed in terms of sets with an ordering that also conforms to the Peano axioms. Using the set of natural numbers and an equivalence relation, the set of integers with an ordering is created, followed, in turn, by the set of rational numbers. Lastly, Cauchy sequences are introduced and, using an equivalence relation, these are turned into the set of real numbers, which are shown to have an ordering and the completeness property.
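
One common way to carry out the step from natural numbers to integers described in the abstract above is to treat an integer as an equivalence class of pairs (a, b) of naturals, read as "a minus b", with (a, b) equivalent to (c, d) when a + d = b + c. The sketch below assumes that particular construction, which the abstract does not spell out, and all function names are hypothetical.

```python
def equivalent(x, y):
    """(a, b) ~ (c, d) exactly when a + d == b + c, using only naturals."""
    (a, b), (c, d) = x, y
    return a + d == b + c


def normalize(x):
    """Canonical representative of a class: at least one component is 0."""
    a, b = x
    smaller = min(a, b)
    return (a - smaller, b - smaller)


def add(x, y):
    """(a, b) + (c, d) = (a + c, b + d)."""
    (a, b), (c, d) = x, y
    return normalize((a + c, b + d))


def multiply(x, y):
    """(a, b) * (c, d) = (ac + bd, ad + bc)."""
    (a, b), (c, d) = x, y
    return normalize((a * c + b * d, a * d + b * c))


def less_than(x, y):
    """Ordering on classes: (a, b) < (c, d) exactly when a + d < b + c."""
    (a, b), (c, d) = x, y
    return a + d < b + c


minus_three = (0, 3)  # the class of all pairs (k, k + 3)
minus_two = (2, 4)    # another representative of the class of (0, 2)
product = multiply(minus_three, minus_two)  # the class representing 6
```

Every operation is defined purely in terms of addition and multiplication of naturals, which is the point of the construction.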

Thesis Advisor: Dr. Wook Kim


Spring 2020 Graduate Student Master Thesis Defense

The Roots of Root Finding

Wednesday, May 6, 2-3pm

Speaker: Kurt Grunzke

Abstract: One of the biggest challenges facing teachers is convincing students that their intuition about a concept is incorrect. In particular, our current social climate fuels an intuition that mathematicians are “nerds,” “geeks,” or other terms that generally refer to a boring person who lacks social skills. The goal of this paper is to demolish that stereotype by demonstrating that mathematicians are independent, argumentative, and vibrant individuals, whose energy is fueled by the social climate of their time. In order to demonstrate these characteristics, we will consider the question of solving polynomial equations, and not just one of them, but all of them. The answer to our question will span thousands of years, cross through multiple civilizations and continents, and introduce us to some lively mathematicians. Furthermore, this investigation will provide an approachable access point to concepts in higher mathematics.

Thesis Advisor: Dr. Namyong Lee

