Spring 2024 Graduate Student Master Thesis Defense

Title: Time Series Forecasting Ozone (O3) AQI in Minnesota

Monday, April 22, 2024 3:00 PM

Location: WH 286

Speaker: Tatiana Quinonez

Abstract: The purpose of this paper is to forecast ozone (O3) in the state of Minnesota. Several forecasting methods will be used to achieve this goal: Seasonal Autoregressive Integrated Moving Average (SARIMA), Exponential Smoothing (ETS), Seasonal Naive (SNAIVE), and Prophet. Different tests and plots will be used to fit the best forecast model for each method. The forecast values from each model will then be compared with the actual values from the dataset, using the Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE), to select the model that performs best.

Thesis Advisor: Dr. Deepak Sanjel
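
The three error measures named in the ozone forecasting abstract above can be sketched in a few lines of Python. This is only an illustration of the standard definitions, not the author's code; the function names and the sample values are made up for the example.

```python
import math

def rmse(actual, forecast):
    """Root Mean Square Error: square root of the mean squared error."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def mae(actual, forecast):
    """Mean Absolute Error: mean of the absolute errors."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent.

    Undefined when any actual value is zero.
    """
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

# Hypothetical held-out observations and forecasts (not data from the thesis).
actual = [40.0, 50.0, 60.0, 80.0]
forecast = [44.0, 45.0, 66.0, 72.0]

scores = {
    "RMSE": rmse(actual, forecast),
    "MAE": mae(actual, forecast),
    "MAPE": mape(actual, forecast),
}
print(scores)
```

Lower values are better for all three measures. RMSE penalizes large errors more heavily than MAE, and MAPE is scale-free, so the three can rank models differently.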

Title: Multi-variable Lung Cancer Risk Models: Application of Tree-based Machine Learning Classification Algorithms and Prediction Analysis

Friday, April 19, 2024 3:00 PM

Location: WH 286

Speaker: Mingyu Huang

Abstract: This project utilizes tree-based machine learning classification algorithms and resampling techniques for predicting both the risk of lung cancer diagnosis and the survival status of patients. For survival prediction, the study conducted comparisons among various models, including logistic regression, decision tree, random forest, Gradient Boosting (GBM), and XGBoost. The results revealed that the XGBoost model outperformed the other models, with a precision rate of $93.6\%$. With the help of learning curve analysis, the XGBoost model also showed the lowest bias and variance, making it more accurate for predicting lung cancer risk and survival status. The study identified crucial risk factors, including tumor status, age, tumor stage, and treatment outcomes, among others. Additionally, the study employed Markov Chain Monte Carlo (MCMC) to balance data structure, and the Random Forest model highlighted symptomatic and lifestyle factors such as fatigue, smoking, and anxiety as crucial in lung cancer diagnosis risk, as well as predicting the disease risk with precision index $AUC=0.963$. These findings offer significant insights for healthcare professionals to develop tailored treatment plans, apply more precise and personalized care to improve patient outcomes, and reduce the mortality rate associated with lung cancer.

Thesis Advisor: Dr. Deepak Sanjel

Title: Assessing Homogeneity: A Comparative Study for Robust Statistical Analysis

Monday, April 15, 2024 3:00 PM

Location: WH 286

Speaker: Chrisantus Bongbeebina

Abstract: Homogeneity of Variance (HOV) is a fundamental assumption in statistical analysis, particularly in parametric tests such as Analysis of Variance (ANOVA) and paired t-tests. However, variance homogeneity can often be violated in real-world data. Ensuring that the variances among groups or conditions being compared are approximately equal is essential to avoid Type I and Type II errors and to maintain the validity of research findings.


While several tests for assessing homogeneity of variance exist in the literature, there is a lack of consensus on their robustness, especially in scenarios where the assumption of normality is not met. The purpose of this simulation study is to evaluate the performance of nine tests for the homogeneity of variance assumption in one-way ANOVA models in terms of Type I error (robustness) and statistical power. These tests are analyzed under three differently shaped distributions: normal, logistic, and exponential. The results suggest that the three O'Brien tests and the nonparametric Fligner-Killeen test demonstrate superior Type I error control across all conditions, indicating better performance than the other tests.

Thesis Advisor: Dr. Mezbahur Rahman

Title: Utilizing a Bayesian Approach to Forecast Time Series

Monday, April 8, 2024 3:00 PM

Location: WH 286

Speaker: Robert Shields

Abstract: Time series analysis gives an analyst a range of different models that can be used on time-dependent data. One of the most popular models is the ARIMA model. While ARIMA is an effective tool, it has some drawbacks. By utilizing a Bayesian approach, analysts can more elegantly handle uncertainty and do not need as much data to get effective forecasts. Thus, this can help overcome some of the drawbacks found in ARIMA. The model that will be compared to ARIMA is called Bayesian Structural Time Series (BSTS). The objective of this project is to analyze the benefits and pitfalls of each model.

Thesis Advisor: Dr. Mezbahur Rahman

Title: Predicting Heart Disease Based on Patient Characteristics

Friday, April 5, 2024 2:00 PM

Location: WH 286

Speaker: Lizzy Eccles

Abstract: Heart disease is a major concern worldwide. Everyone must know the importance of early detection and intervention when it comes to this life-threatening disease. In this research paper, multiple unsupervised and supervised learning techniques are employed to predict the likelihood of heart disease based on many factors. These factors range from clinical features to demographic characteristics. A logistic regression algorithm, a random forest algorithm, and a decision tree are applied to a data set to construct the most applicable predictive model. The overall analysis resulted in a random forest model with 98.88% accuracy along with many other interesting findings.

Thesis Advisor: Dr. Mezbahur Rahman

Title: Optimizing Salary Cap Allocation in the NFL

Monday, April 1, 2024 2:00 PM

Location: WH 288A

Speaker: Lucas Bukowski

Abstract: Every season, each NFL team must decide how to allocate the funds available to them within the salary cap. The goal of this paper is to use regression techniques to model team success based on team spending, and then use those models to determine optimal spending under the cap. Inefficient spending at one position can leave teams lacking at others, causing a domino effect that can drastically harm a team’s success. This project will look at using beta regression, Poisson regression, and logistic regression to fit predictive models, and then optimize them using the Improved Stochastic Ranking Evolution Strategy (ISRES) and Constrained Optimization By Linear Approximations (COBYLA). These regression techniques and optimization methods will be introduced and a brief overview given. The optimal spending allocations found from modeling different forms of success in the NFL will then be compared to see which positions are worth spending on to win the most games and provide the highest probability of making the playoffs. Finally, this study will conclude with some discussion of how to interpret the results and potential further research.

Thesis Advisor: Dr. Namyong Lee

Title: Diabetes Health Indicator Using Machine Learning Techniques

Friday, March 28, 2024 2:00 PM

Location: WH 286

Speaker: Xeng Yang

Abstract: The purpose of this paper is to address how machine learning techniques can indicate a patient’s health condition outcome based on the information provided. Statistical machine learning techniques such as the logistic model, decision trees, random forest, and the generalized boosted regression model are used to predict a patient’s health condition. To determine which model can predict whether a patient is diagnosed with diabetes or not, the models’ accuracies will be compared to each other using confusion matrices and ROC curves. This paper discusses which model is adequate to predict whether a person could be diabetic or not and elaborates on the model interpretation to help prevent or improve one’s health condition. K-fold cross-validation is performed to ensure the model is not overfit.

Thesis Advisor: Dr. Deepak Sanjel

Title: Comparative Analysis of Multiple Comparison Procedures on Median Household Income Across US Census Bureau’s Nine Divisions

Friday, March 15, 2024 3:30 PM

Location: WH 288A

Speaker: Kuukua E. Abraham

Abstract: In the Analysis of Variance (ANOVA), the main goal is to analyze the differences between the means of more than two treatments. It only determines whether any overall statistically significant difference exists between the means of these treatments. To obtain a detailed conclusion about group differences in means, researchers conduct tests on the differences between particular pairs of experimental and control groups. Tests conducted on subgroups of data tested previously in an analysis are called post-hoc tests. A category of post-hoc tests that provides this type of detailed conclusion for ANOVA results is referred to as “multiple comparison analysis” tests.

Multiple comparison tests scrutinize the differences between specific pairs of means, or linear combinations of means, among the groups. This provides information that is more relevant to the researcher than the conclusion drawn from the ANOVA test alone. This paper systematically reviews and analyzes a diverse set of multiple comparison procedures, such as the Bonferroni correction, Tukey’s Honestly Significant Difference (HSD), Scheffe’s method, and Fisher’s Least Significant Difference. Each procedure is reviewed individually, highlighting its key features, capabilities, and limitations. The dataset was obtained from the U.S. Census Bureau and covers 1984 to 2022; it consists of median household income by state. The states are categorized into nine regions based on their geographical location. With the aid of R statistical software, the various procedures will be applied to obtain the mean differences in median household income among these states. Furthermore, the question of which criteria should be considered in selecting the most appropriate test among the various multiple comparison tests will be resolved. Subsequently, a systematic comparison of the different procedures in terms of their critical values and their power will be conducted.

Thesis Advisor: Dr. Mezbahur Rahman
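
As a small illustration of the Bonferroni correction mentioned in the abstract above: with the nine divisions named in the title, all pairwise comparisons give 9 choose 2 = 36 tests, and the correction divides the family-wise significance level by that count. The sketch below is in Python rather than R, assumes a family-wise level of 0.05, and uses made-up p-values; none of these come from the thesis.

```python
from math import comb

def bonferroni_threshold(family_alpha, n_groups):
    """Return the number of pairwise tests and the per-test significance level."""
    n_tests = comb(n_groups, 2)          # all unordered pairs of groups
    return n_tests, family_alpha / n_tests

# Nine Census divisions; the 0.05 family-wise level is an assumption.
n_tests, per_test_alpha = bonferroni_threshold(0.05, 9)

# Hypothetical p-values for three of the pairwise comparisons.
p_values = {"pair A": 0.0004, "pair B": 0.0100, "pair C": 0.0300}
rejected = [name for name, p in p_values.items() if p <= per_test_alpha]

print(n_tests, per_test_alpha, rejected)
```

Here only the first comparison would be declared significant, even though all three p-values fall below the unadjusted 0.05 level. This conservativeness is the usual trade-off of the Bonferroni correction.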

Fall 2023 Graduate Student Master Thesis Defense

Title: An Optimal Control Problem in Wireless Charging of Electric Vehicles

Wednesday, December 6, 2023 1:30 PM

Location: WH 284A

Speaker: Kevin Bischoff

Abstract: The goal of this paper is to use the theory of Calculus of Variations to solve an optimal control problem regarding wireless power transfer in the context of electric vehicles. In particular, if a human drove an electric vehicle with wireless power transfer capability along a roadway embedded with wireless chargers, it would be reasonable to desire appreciable energy transfer while reaching the end of the route as fast as possible. These along with some additional requirements can be written in the format of an optimal control problem, for which a solution will be attempted here by the aforementioned Calculus of Variations.

Thesis Advisor: Dr. Ruijun Zhao

Title: Persistent Homology and its Application in Image Analysis

Tuesday, December 5, 2023 3:00 PM

Location: WH 286

Speaker: Jacob McCoy

Abstract: Persistent Homology is a new data analysis tool that has emerged over the last decade and allows us to quantify topological features of data. Here it is applied to images of handwritten numerical digits in order to identify their topological features and categorize the images based on their homology class. Rather than the popular simplicial complexes, cubical complexes will be used to analyze the images.

This paper aims to provide the required background to a reader who has not been exposed to Topological Data Analysis but has some familiarity with Topology and Abstract Algebra. An introduction to homology, graphs, simplicial complexes, cubical complexes, Betti numbers, Bar Charts, and persistence diagrams is provided.

Thesis Advisor: Dr. Namyong Lee

Title: Modeling Malaria: A Mathematical and Statistical Approach

Thursday, November 16, 2023 3:00 PM

Location: WH 286

Speaker: Hope Enright

Abstract: In this paper, a mathematical and statistical model of malaria will be explored to determine how to estimate real-world parameters, find the reproduction number for malaria, and forecast the disease. To achieve these goals, several techniques will be explored: the next generation matrix, partial rank correlation coefficient global sensitivity analysis, model fitting through minimization of least squares, and the autoregressive integrated moving average (ARIMA) model. Data from the World Bank for Uganda will be utilized to construct and fit the models in this paper. Through the use of a mathematical SIS model fitted to data obtained for Uganda, parameters that are otherwise difficult to quantify are estimated (biting rate of mosquitoes, human disease-induced death rate, disease transmission probability from infectious mosquitoes to susceptible humans, disease transmission probability from infectious humans to susceptible mosquitoes, and rate at which infectious humans become susceptible again) and utilized to find that the reproduction number of malaria in Uganda is less than one. A statistical ARIMA model and a neural network autoregressive model are then constructed to forecast malaria for the next few years in Uganda. Ultimately, this paper will explore the relevant parameters impacting malaria in Uganda and create a model to predict the behavior of malaria.

Thesis Advisor: Dr. Ruijun Zhao

Summer 2023 Graduate Student Master Thesis Defense

Title: Statistical Analytics of Sentinel Lakes of Minnesota  

Tuesday, June 20, 2023 3:00 PM

Location: 

Speaker: Md Raihatul Jannat Saimon

Abstract: The Minnesota Pollution Control Agency (MPCA) is a state agency that monitors environmental quality, offers technical and financial assistance, and enforces environmental regulations for the State of Minnesota. As part of its continuing efforts to ensure good water quality in Minnesota's lakes, MPCA wants to get a better understanding of the lakes by analyzing data over the years. The key quantitative observation is water transparency, with qualitative categorical observations on physical appearance and recreational suitability. Analysis includes trend analysis, especially of the categorical data, and an in-depth look at the relationships between the quantitative and categorical observations.

Thesis Advisor: Dr. Iresha Premarathna

Spring 2023 Graduate Student Master Thesis Defense

Title: Haar Measure on Lie Groups

Wednesday, April 26, 2023 3:00 PM

Location: WH 286A

Speaker: Jenna Stitt

Abstract: The last section is what we have been building up to. We will define a Haar measure. Then we will give a sketch of the proof of the existence and uniqueness of the Haar measure on locally compact groups. Every Lie group is a locally compact group, so indirectly we show that the Haar measure always exists on a Lie group. We will then show how the Haar measure leads to integration. Then we will concretely show what the specific Haar measure is on various groups, including the Lie group $GL_n(\mathbb{R})$. Lastly, we discuss the relationship between the left and right Haar measures and show that for compact groups, these agree.

Thesis Advisor: Dr. Wook Kim

Title: The Minkowski Metric

Thursday, April 27, 2023 9:00 AM

Location: WH 288

Speaker: Katherine Fennema

Abstract: This paper provides evidence for the metric in Minkowski spacetime. Along the way we will derive the Lorentz transformation. During this process we will find an interesting outcome about simultaneity between observers in spacetime. Then we will give the mathematical background needed to derive the Christoffel symbols for the pseudo-Riemannian metric and to describe the spacetime manifold. To this end we will be explaining the differential geometry and manifold theory needed to reach this goal.

Thesis Advisor: Dr. Brandon Rowekamp

Fall 2022 Graduate Student Master Thesis Defense

Title: Time Series Analysis of US Gasoline Prices Using Time Series Model

Tuesday, November 28, 2022 2:00 PM

Location: WH 284

Speaker: Moussa Abdoulaye

Abstract: Natural gas demand has increased significantly on a global scale, and businesses are keen to obtain natural gas price forecasts. The prediction is expected to meet the needs of various producers, suppliers, traders, bankers, and end users who are involved in the exploration, production, transportation, and trading of natural gas. The goal for both the supplier and the trader is to conduct business while satisfying demand. Researchers have used a variety of approaches to predict the price of natural gas. In this study, we examine how well time series analysis can forecast US gasoline prices. We found evidence that the time series models accurately predicted the decrease of US gasoline prices one year in advance. This is crucial for proper planning for all parties involved in the exploration, production, transportation, and trading of natural gas. According to the research results, the ARIMA model approach for time series data provides reliable predictions of US gasoline prices for the following year.

Thesis Advisor: Dr. Mezbahur Rahman


Title: A Comparative Study of Ridge, LASSO and Principal Components Regression

Wednesday, November 16, 2022 3:00 PM

Location: https://minnstate.zoom.us/j/96409490562

Speaker: Franck Olilo

Abstract: Linear regression is one of the most frequently employed statistical techniques, with applications in all aspects of daily life. In regression, the goal is to relate the variation in one or more response variables to proportional changes in one or more explanatory variables in order to explain the variation in the response variables. The explanatory variables are deemed orthogonal if there is no linear relationship among them. If the variables are not orthogonal, several of the explanatory variables will fluctuate in quite comparable ways. This issue, known as multicollinearity, frequently arises in regression analysis. When two or more explanatory variables are highly (but not perfectly) correlated with one another, it becomes challenging to interpret the strength of each variable's effect, because in the presence of multicollinearity the OLS estimates are imprecise.

In the first part of this paper, we discuss the multicollinearity problem in the linear regression model, present techniques to identify the problem, and examine its causes and consequences. After that, we explore ways to handle multicollinearity, such as Ridge Regression, Lasso Regression, and Principal Components Regression, and discuss the theory behind them.

In addition, we carried out a case study applying those methods and compared which among OLS, Ridge Regression (RR), LASSO, and Principal Components Regression (PCR) should be preferred when fitting a model with multicollinearity. With MSE, RMSE, and R squared as the comparison criteria, the results showed that RR, LASSO, and PCR have a lower mean square error than OLS, while RR and LASSO perform better than PCR.

Thesis Advisor: Dr. Iresha Premarathna

Summer 2022 Graduate Student Master Thesis Defense

Title: Mitigating Class Imbalance in Machine Learning for a Binary Classification Using Resampling Techniques

Thursday, July 14, 2022 12:00 PM

Location: https://minnstate.zoom.us/j/99988495768

Speaker: Sujin Kim

Abstract: Class imbalance is one of the problems that we often face when building a classification model using machine learning (ML) algorithms. ML algorithms are likely to create a model that classifies all observations into the majority class, because they focus on the overall accuracy of the model and the minority class contributes less to the accuracy than the majority class. In this paper, we intend to mitigate the problem of class imbalance in sepsis clinical data using a data-level approach: resampling techniques, which are intuitive, simple, and applicable to any ML algorithm. A resampling technique resamples the original data, which has a higher class imbalance, to create new data with a lower class imbalance. NearMiss under-sampling, Tomek Link under-sampling, and SMOTE over-sampling methods are used. The sepsis clinical data is a dataset containing information about the survival of patients with sepsis. The dataset is divided into a train set for building a model and a test set for validation, and several ML algorithms are used on the train set for this binary classification problem. Logistic regression, support vector machine, and random forest are applied. The performance of the resampling techniques with each ML algorithm is evaluated by scores from a confusion matrix.

Thesis Advisor: Dr. Deepak Sanjel


Title: An Investigation of Markowitz and Robust Portfolio Optimization

Wednesday, July 13, 2022 11:00 AM

Location: WH 291

Speaker: Shangyi Bi

Abstract: The purpose of this paper is to explain the Markowitz portfolio theory and to reduce its sensitivity to parameter estimation errors using a robust method. The Markowitz portfolio theory is a mathematical framework for selecting a portfolio of assets such that the expected return is maximized for a given level of risk. The robust formulations systematically combat the sensitivity of the optimal portfolio to statistical and modeling errors in the estimates. We introduce a box uncertainty set for the mean and variance, which makes the overall return more stable. This method is important for investors because financial products are easily affected by unexpected incidents, and a robust formulation will further extend the diversification in investing.

Thesis Advisor: Dr. Hyekyung Min


Title: Predicting Types of Chest Pain Using Logistic Regression

Monday, June 8, 2022 4:30 PM

Location: https://minnstate.zoom.us/j/7386964275

Speaker: Eric Adu

Abstract: Logistic regression analysis is a statistical technique to evaluate the relationship between various predictor variables (either categorical or continuous) and an outcome which is binary (dichotomous). In this paper, we discuss binary logistic regression analysis and the binary logistic regression and the error generated as a result of using binary logistic regression instead of binary logistic regression.

Thesis Advisor: Dr. Mezbahur Rahman


Spring 2022 Graduate Student Master Thesis Defense

Solving a Combinatorial Timetabling Optimization Via Random Local Search With Simulated Annealing

Friday, April 29, 2022, 3-3:50pm

Location: WH 284A

Speaker: Jason Motzko

Abstract: An overview is given of combinatorial optimization, random local search, and simulated annealing methods. These methods are then used to develop an algorithm for locating the best solution to a timetabling problem at a university. Various criteria for desirable attributes in the timetable are evaluated, with a penalty assigned for violations of the criteria. The sum of these penalties is the value for which a desired minimum is sought. Due to the discrete nature of the optimization problem, random methods are utilized to seek a global minimum, with no guarantee of convergence to a global minimum. The results of these randomized methods vary between implementations of the algorithm. The effectiveness of randomizing the initial conditions as a method of avoiding entrapment in the neighborhood of a local minimum is explored. Employing a cohort of randomly selected initial solutions, a significantly greater reduction in the penalty was achieved during implementation of the algorithm than by employing a single initial solution.

Thesis Advisor: Dr. Nicholas Fisher
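
The general idea in the timetabling abstract above (a summed penalty minimized by random local search with simulated annealing, started from a cohort of random initial solutions) can be sketched as follows. This is a toy illustration, not the thesis algorithm: the penalty (counting conflicting event pairs scheduled in the same slot), the geometric cooling schedule, and every name and parameter value are assumptions chosen for the example.

```python
import math
import random

def penalty(assignment, conflicts):
    """Toy penalty: number of conflicting event pairs placed in the same slot."""
    return sum(1 for a, b in conflicts if assignment[a] == assignment[b])

def anneal(n_events, n_slots, conflicts, steps=5000, t0=2.0, cooling=0.999, seed=0):
    """Simulated annealing from one random initial timetable."""
    rng = random.Random(seed)
    current = [rng.randrange(n_slots) for _ in range(n_events)]
    cur_cost = penalty(current, conflicts)
    best, best_cost = current[:], cur_cost
    temp = t0
    for _ in range(steps):
        # Random local move: reassign one event to a random slot.
        event = rng.randrange(n_events)
        old_slot = current[event]
        current[event] = rng.randrange(n_slots)
        new_cost = penalty(current, conflicts)
        delta = new_cost - cur_cost
        # Always accept improvements; accept worse moves with probability exp(-delta/temp).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cur_cost = new_cost
            if cur_cost < best_cost:
                best, best_cost = current[:], cur_cost
        else:
            current[event] = old_slot    # reject the move
        temp *= cooling                  # geometric cooling (an assumed schedule)
    return best, best_cost

def anneal_cohort(n_starts, n_events, n_slots, conflicts):
    """Run annealing from several random starts and keep the lowest-penalty result."""
    runs = [anneal(n_events, n_slots, conflicts, seed=s) for s in range(n_starts)]
    return min(runs, key=lambda run: run[1])

# Hypothetical instance: 6 events, 3 slots, pairs that should not share a slot.
conflicts = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (0, 3)]
single_best, single_cost = anneal(6, 3, conflicts, seed=0)
cohort_best, cohort_cost = anneal_cohort(5, 6, 3, conflicts)
print(single_cost, cohort_cost)
```

Because the cohort includes the single run's seed, its result can never be worse than that run. There is still no guarantee that either reaches the global minimum.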


Forecasting the Closing price of Bitcoin Cryptocurrency using ARIMA, Prophet and LSTM models

Wednesday, April 20, 4-4:50pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/97782055828

Speaker: Abimbola Kolebaje

Abstract: Due to the difficulty in assessing the exact nature of a time series, it is often considerably challenging to generate appropriate forecasts. Over the years, various forecasting models have been developed in the literature, but they have produced limited accuracy in forecasting financial trends. In recent years, the advent of Deep Learning has revolutionized the business of forecasting financial trends. This study involves the time series forecasting of bitcoin closing prices with improved efficiency using long short-term memory (LSTM) techniques and compares its predictability with the traditional method (ARIMA). Additionally, we will implement the forecast of the bitcoin price with the Facebook Prophet model and forecast future prices. The Mean Absolute Percentage Error (MAPE) of all three models will be compared to ascertain which model has the highest accuracy in forecasting bitcoin prices. In our case, the LSTM model outperforms the ARIMA and Prophet models.

Thesis Advisor: Dr. Deepak Sanjel


Prediction of Abnormal Vaginal Discharge using Machine learning techniques among women living in rural and urban area of Tangail district, Bangladesh

Friday, April 8, 3-4pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/8314634593
Meeting ID: 831 463 4593
Passcode: 9682

Speaker: Aninda Roy

Abstract: In recent times, early detection of disease has become a crucial problem due to the rapid growth of the population worldwide. When it comes to women’s health, many complications start during their reproductive life. Abnormal vaginal discharge (AVD) is a prevalent problem among women. If it is not treated appropriately, it may lead to severe complications such as pelvic inflammatory disease and cervical cancer. In Bangladesh, women suffer from abnormal vaginal discharge due to a lack of proper hygiene. More importantly, their hesitation to share this problem leads to further complications in their health. This paper presents a qualitative study of women’s socio-demographic profile, personal hygienic practices, previous medical history, associated symptoms, characteristics of discharge and health-seeking pathways, and factors that influence abnormal vaginal discharge. Data was collected from Tangail district, Bangladesh, using a predesigned survey questionnaire that includes questions designed to fulfill the study objective. The dataset had 280 total observations, of which 180 women (64.3%) were positive for AVD and the remaining 100 (35.7%) were negative at the time of the study. The association of daily hygiene practices and associated symptoms with AVD was determined using the Chi-square test, where a p-value of less than 0.05 was considered statistically significant. The prime objective of this paper is to create a model for predicting abnormal vaginal discharge using four machine learning classification algorithms: K Nearest Neighbor (KNN), Support Vector Machine (SVM), Random Forest (RF), and Logistic Regression (LR). The performance of the different classifiers is measured by their accuracy, sensitivity, specificity, positive predictive value, and negative predictive value. Additionally, these techniques were appraised on the area under the receiver operating characteristic (ROC) curve.
The results reveal that the LR model obtained the highest accuracy, sensitivity, and positive predictive value with the lowest specificity and negative predictive value of 81.4%, 91%, 82%, 64%, and 80%, respectively.

Thesis Advisor: Dr. Mezbahur Rahman


Chinese Remainder Theorem and its application on RSA (Rivest-Shamir-Adleman) cryptography

Wednesday, March 30, 4-5:30pm

Location: Wissink Hall 288 (WH 288)

Speaker: Ammishaddai Ogyiri

Abstract: The security of data has been an issue across the globe due to the potential threat to the confidentiality and integrity of data posed by third parties obtaining unauthorized access to protected data. Cryptography has come a long way in helping to maintain the security of data. Symmetric-key (secret-key) and asymmetric-key (public-key) algorithms are the two common classes of cryptography that make data extremely difficult to access without the authorized key. In this research paper, we delve into asymmetric-key algorithms and focus on the Rivest-Shamir-Adleman (RSA) algorithm. The more secure a key must be, the longer it takes to encrypt and decrypt the data. We compare the speed of encrypting and decrypting data with the ordinary RSA algorithm and RSA-CRT (Chinese Remainder Theorem). Moduli of 1024 bits and 4096 bits have been used for this comparison. We also discuss the effectiveness of the CRT in RSA cryptography in terms of its security and the speed of the decryption process.

Thesis Advisor: Dr. In-Jae Kim
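
To make the RSA-CRT idea in the abstract above concrete, here is a minimal sketch comparing ordinary RSA decryption with CRT-based decryption (Garner's recombination). It uses tiny textbook primes for readability, so it is in no way secure and says nothing about the 1024-bit and 4096-bit timings studied in the thesis; the function names are hypothetical. The modular inverse via `pow(x, -1, m)` requires Python 3.8 or later.

```python
def make_private_exponent(p, q, e):
    """Return the modulus n and private exponent d for primes p, q and public exponent e."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)                  # modular inverse of e mod phi
    return n, d

def decrypt_plain(c, d, n):
    """Ordinary RSA decryption: one exponentiation modulo n."""
    return pow(c, d, n)

def decrypt_crt(c, d, p, q):
    """RSA-CRT decryption: two smaller exponentiations, then recombination."""
    dp = d % (p - 1)                     # reduced exponent for the mod-p part
    dq = d % (q - 1)                     # reduced exponent for the mod-q part
    q_inv = pow(q, -1, p)                # inverse of q modulo p
    m1 = pow(c, dp, p)
    m2 = pow(c, dq, q)
    h = (q_inv * (m1 - m2)) % p          # Garner's formula
    return m2 + h * q

# Toy parameters (far too small for real use).
p, q, e = 61, 53, 17
n, d = make_private_exponent(p, q, e)
message = 65
ciphertext = pow(message, e, n)

recovered_plain = decrypt_plain(ciphertext, d, n)
recovered_crt = decrypt_crt(ciphertext, d, p, q)
print(n, d, ciphertext, recovered_plain, recovered_crt)
```

The speed-up comes from working with exponents and moduli of roughly half the bit length. The price is that the private key holder must keep p and q (or the derived values) available, not just d.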


Spring 2021 Graduate Student Master Thesis Defense

Comparison of Classification Algorithms in machine learning

Wednesday, May 5, 3-3:50pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/918 4173 4732

Speaker: Dong Young Park

Abstract: Classification in data science is the process of predicting the class of given data points. Classes are sometimes called targets, labels, or categories. Classification predictive modeling is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y). Classification belongs to the category of supervised learning, where the targets are provided along with the input data. We use several classification algorithms to classify two different kinds of datasets. The algorithms used are decision tree, support vector machine, logistic regression, and neural networks. The datasets used are the MNIST handwritten digits dataset and the wine quality dataset. MNIST consists of image data, whereas the wine quality dataset is numerical.

Thesis Advisor: Dr. Namyong Lee


Conformal Deformation of Surfaces by the Extrinsic Dirac Operator

Wednesday, April 28, 2-2:50pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/96717822990

Speaker: Katelyn LaPorte

Abstract: The purpose of the APP is to survey the methods used by Crane and others to create conformal deformations of surfaces in 3-dimensional Euclidean space. His goal was to utilize this for applications in image processing. Here we will go into more detail of the mathematical theory behind his method including the not so familiar Quaternion-Valued Extrinsic Dirac Operator. We will also explain the integrability conditions of the conformal deformation problem, which can be reduced to an eigenvalue problem related to this Dirac operator. As it is a first order linear operator, it has high efficiency in discretization and surface curvature editing.

Thesis Advisor: Dr. Ke Zhu


Comparing Various Robust Estimation Techniques in Regression Analysis

Friday, April 23, 4-5pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/97782055828

Speaker: Tracy Sharon Morrison

Abstract: In regression analysis, the use of the ordinary least squares (OLS) method is inadvisable when dealing with outliers or extreme observations. As a result, we require a method of robust estimation in which the estimate is not significantly affected by outliers or extreme observations. Four methods of robust regression estimation will be compared in this paper in order to determine the best one: the M-estimation method, the Least Trimmed Squares estimator, the S-estimation method, and the MM-estimation method. In this study, we find that the best method is the MM-estimation method. The M-estimation method is an extension of the maximum likelihood method, whereas the MM-estimation method is a development of the M-estimation method, and the S-estimation method is related to the M-estimation method due to the use of the M-estimation residual scale. While robust regression methods can significantly improve estimation precision, they should not be used in place of more traditional methods.

Thesis Advisor: Dr. Mezbahur Rahman


Count regression models for Covid-19 related deaths and overall deaths

Tuesday, Apr 20, 4-5pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/94358910903

Speaker: Manori Ampe Mohottige Dona

Abstract: Since the start of the ongoing Covid-19 pandemic, the number of deaths worldwide has increased considerably. Confirmed coronavirus cases in the United States rose sharply in the third week of March 2020 as testing became more rapid, and overtook China’s on March 26, 2020, making the US the country most affected by the coronavirus.

This study aims to determine how overall death counts and Covid-19-related death counts in five major US states relate to age group and gender over a period of one year. The data were collected from the government data repository, data.gov.

Poisson regression and Negative Binomial regression were used for model building and for predicting total death counts. K-fold cross-validation and leave-one-out cross-validation were used to identify the best model.

The Negative Binomial regression model was identified as a better model than the Poisson regression model. According to the model, the most significant factor for both total deaths and Covid-19 deaths is gender. Texas makes the most significant contribution to the Covid-19 model, and the most significant age group is 84 years or over.

Thesis Advisor: Dr. Iresha Premarathna


Correlational Study: An Application of Factor Analysis on a Life Expectancy Data Set

Monday, April 19, 4:15-5:15pm

Speaker: Afrah Alhamad

Abstract: Many statistical techniques focus on analyzing the association between two variables. However, these techniques are not very useful when the interest centers on the mutual associations across all the variables, with no distinctions made between them. Factor analysis is one of the multivariate statistical methods commonly used for this purpose. This paper applies and explains the exploratory factor analysis procedure using a data set. Additionally, the theoretical aspects of factor analysis are briefly discussed from a practical, applied perspective. In particular, the objective of the paper is to explore the factorial structure of a life expectancy data set by means of exploratory factor analysis and to identify the factor scores.

Thesis Advisor: Dr. Mezbahur Rahman


Prediction of Heart Disease Using Bayesian Logistic Regression by Polya-Gamma Data Augmentation

Friday, April 16, 3-4pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/939 3476 3132

Speaker: Zhenhan Fang

Abstract: Heart disease is one of the most common diseases today, owing to a number of contributing factors such as high blood pressure, high blood cholesterol, and smoking. About half of Americans (47%) have at least one of these three risk factors. To reduce the risk of heart disease, the healthcare industry generates enormous amounts of data and has been seeking ways to diagnose the disease early for many years. Many data analytics tools have also been applied to help health care providers identify some of the early signs of heart disease. Many tests can be performed on potential patients so that extra precautionary measures can be taken to reduce the effect of the disease, and reliable methods are needed to predict its early stages. In this study, Logistic Regression and Bayesian Logistic Regression are used to build models that predict heart disease. We apply Polya-Gamma data augmentation to our Bayesian logistic model. We found that the Bayesian logistic model can provide better performance, although it is more computationally expensive than the standard logistic model.

Thesis Advisor: Dr. Han Wu


Classification of Chess Games: An exploration of classifiers for anomaly detection in chess

Friday, April 2, 4-5pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/5074676277

Speaker: Masudul Hoque

Abstract: Chess is a strategy board game with its inception dating back to the 15th century. The Covid-19 pandemic has led to an online chess boom, with 95,853,038 chess games played in January 2021 on one online chess site (lichess.com) alone. Along with the chess boom, instances of cheating have also become more rampant. Classifiers have been used for anomaly detection in fields such as network security and online games, so it is a natural idea to develop classifiers to detect cheating. However, there are no prior examples of this, and it is difficult to obtain data where cheating has occurred. In this paper, we therefore develop four machine learning classifiers (Linear Discriminant Analysis, Quadratic Discriminant Analysis, Multinomial Logistic Regression, and K Nearest Neighbour) to predict chess game results and explore which predictors produce the best accuracy. We use the Confusion Matrix, K-Fold Cross-Validation, and Leave-One-Out Cross-Validation methods to compute the accuracy metrics.

There are three phases of analysis. In Phase I, we train the classifiers using 1.94 million over-the-board games as training data and 20 thousand online games as testing data, and obtain accuracy metrics. In Phase II, we select a smaller pool of 212 games, pick 8 additional predictor variables from chess engine evaluations of the moves played in those games, and check whether including these variables improves performance. Finally, in Phase III, we look for patterns in the misclassified cases in order to define anomalous values.

In Phase I, the models do not reach a usable level of accuracy (44-63%). For all classifiers, it is no better than deciding the class with a coin toss. K Nearest Neighbour with K = 7 was the best model. In Phase II, adding the new predictors improved the performance of all the classifiers significantly across all validation methods. In fact, using only the significant variables as predictors produced highly accurate classifiers. Finally, in Phase III, we could not find any patterns or significant differences in the predictors between correct classifications and misclassifications.

In conclusion, machine learning classification is only one useful tool for spotting instances that indicate anomalies. We cannot judge whether a game is anomalous using one method alone.

Thesis Advisor: Dr. Iresha Premarathna


Sequential Probability Ratio Test and Experiment

Tuesday, March 16, 4-5pm

Location: AH 102

Speaker: Brianna Klapoetke

Abstract: The Sequential Probability Ratio Test (SPRT) is a method of testing simple hypotheses where the sample size is not determined in advance. In this talk I will describe the general process of using the SPRT, give an overview of the theory that supports it, and describe how I applied it to data I collected to determine what alpha values people used to make their decisions in a simple game I designed.
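
To make the idea of an undetermined sample size concrete, here is a minimal sketch of Wald's SPRT for two simple hypotheses about a Bernoulli success probability. The function name `sprt_bernoulli`, its parameters, and the use of Wald's approximate boundaries are illustrative assumptions; this is not the procedure or data from the talk.

```python
import math


def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.05):
    """Wald's SPRT for H0: p = p0 versus H1: p = p1 on 0/1 observations.

    Returns (decision, n_used), where decision is "accept H0",
    "accept H1", or "continue" if the data ran out before a boundary was hit.
    """
    # Wald's approximate stopping boundaries for the log-likelihood ratio.
    upper = math.log((1 - beta) / alpha)
    lower = math.log(beta / (1 - alpha))

    llr = 0.0  # running log-likelihood ratio of H1 against H0
    n = 0
    for n, x in enumerate(observations, start=1):
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", n
        if llr <= lower:
            return "accept H0", n
    return "continue", n
```

The test stops as soon as the accumulated evidence crosses either boundary, so the number of observations used depends on the data. Here `alpha` and `beta` are the target type I and type II error rates that set the boundaries.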

Thesis Advisor: Dr. Mezbahur Rahman


Fall 2020 Graduate Student Master Thesis Defense

Grobner Bases and Systems of Polynomial Equations

Monday, November 23, 4-5pm

Speaker: Rachel Holmes

Abstract: The goal of this paper is to explore the use and construction of Grobner bases through Buchberger’s algorithm. Specifically, applications of such bases for solving systems of polynomial equations will be discussed. Furthermore, we relate many concepts in commutative algebra to ideas in computational algebraic geometry.

Thesis Advisor: Dr. Wook Kim


Improvement in Regression Analysis through Optimal Clustering Algorithms with Machine Learning

Wednesday, November 18, 3-4pm

Speaker: Taeyoung Choi

Abstract: The primary purpose of the project is to enhance the quality of data analysis by adapting various machine learning clustering methods and applying these advanced clustering techniques to regression models in order to improve the efficiency of the analysis. First and foremost, this research aims to expand the knowledge of data analysis through diverse clustering algorithms, including Hierarchical, K-Means, Partitioning Around Medoids (PAM), Clustering Large Applications (CLARA), and Clustering Large Applications based upon Randomized Search (CLARANS). Clustering algorithms support high-quality data analysis by constructing distinct groups within the given data. The clustering techniques are readily applicable to multiple fields, including the clinical, manufacturing, and business sectors. For example, large type II diabetes patient data sets with numerous variables could be grouped by relevant personal medical history, physical activity level, response to a certain treatment, or dietary habits through an appropriate cluster analysis.

Thesis Advisor: Dr. Mezbahur Rahman


Summer 2020 Graduate Student Master Thesis Defense

Multiple Regression Analysis with Continuous and Binary Response Variable

Friday, August 14, 11am-12pm

Speaker: Eunhye Lee

Abstract: This alternate plan paper analyzes student data with different regression models in order to find the best-fitting model among them. Inferential statistics can provide more information than descriptive statistics by answering questions about the data, testing hypotheses, and fitting a proper model, not only to describe the relationships in a data set but also to predict a target. Regression, a statistical method, is used in numerous fields, including finance, marketing, biology, investment, health, and even psychology, to reveal the relationships between variables. The main question in this paper is which variables affect the final grade the most. The goal is to properly fit a multiple linear regression model and a multiple logistic regression model, and to detect the most relevant and effective variables in the fitted models in order to help explain the final mathematics grade. I will first cover the linear regression model, one of the basic types of regression, which describes the simultaneous associations of observed variables with a continuous dependent variable. To obtain a valid linear regression model, the assumptions of residual normality, linearity, independence of the residual terms, zero mean of the residuals, and homogeneity of the residual variance are checked. Second, logistic regression is used to study binary outcomes, regardless of how the other regressors are measured. The logistic model is based on the logit function and is interpreted in terms of a probability rather than a value. The assumptions for logistic regression are that the response variable is ordinal, the error terms are independent, there is no multicollinearity, the independent variables are linearly related to the log odds, and the sample size is large.

Thesis Advisor: Dr. Mezbahur Rahman


Soybean Price Prediction Using Time Series Forecasting with Google Trend

Friday, August 7, 10-11am

Speaker: Zhuoning Li

Abstract: We use time series methods to analyze the trend and predict prices in the U.S. soybean commodity market, and to assess the impact of the "trade war" between China and the U.S. on soybean prices. We use autoregressive integrated moving average (ARIMA) and autoregressive conditional heteroskedasticity (ARCH) models to predict the soybean price from U.S. daily soybean price data, and we also use vector autoregression (VAR) and long short-term memory (LSTM) models to predict the soybean price from the previous data together with Google Trends data. Comparing these methods, we obtain the best prediction from the VAR model.

Thesis Advisor: Dr. Deepak Sanjel


An Application for Bank Loan Default Prediction Analysis using Logistic Regression and Support Vector Machine

Friday, July 31, 10-11am

Speaker: Shuk Ping Wong

Abstract: Risk management is one of the most crucial areas for banks. To maintain a sustainable and profitable business, banks are constantly working on effective models to estimate the likelihood that a customer will default. Although credit scoring is a common indicator for bankers, some financial datasets simply do not include this variable. This study built a logistic regression model and a support vector machine (SVM) model to predict whether a loan borrower will default based on different categorical variables. The performance of the models is compared based on accuracy and efficiency. We found that the logistic regression model generally provides more depth in the analysis of the variables and is easier to interpret. Although SVM has a higher accuracy rate, the method took too much computing time and it suffers from a lack of interpretability. Overall, the logistic regression model performs better.

Thesis Advisor: Dr. Mezbahur Rahman


Number Construction

Monday, July 20, 11am-12pm

Speaker: Brian Bertness

Abstract: This paper describes how numbers are constructed via sets and equivalence relations. The necessary Zermelo-Fraenkel set theory axioms are used to define basic sets, relations, and functions. Employing the Axiom of Infinity, the natural numbers are then constructed in terms of sets with an ordering that also conforms to the Peano axioms. Using the set of natural numbers and an equivalence relation, the set of integers with an ordering is created, followed in turn by the set of rational numbers. Lastly, Cauchy sequences are introduced and, using an equivalence relation, these are turned into the set of real numbers, which is shown to have an ordering and the completeness property.

Thesis Advisor: Dr. Wook Kim


Spring 2020 Graduate Student Master Thesis Defense

The Roots of Root Finding

Wednesday, May 6, 2-3pm

Speaker: Kurt Grunzke

Abstract: One of the biggest challenges facing teachers is convincing students that their intuition about a concept is incorrect. In particular, our current social climate fuels an intuition that mathematicians are “nerds,” “geeks,” or other terms that generally refer to a boring person who lacks social skills. The goal of this paper is to demolish that stereotype by demonstrating that mathematicians are independent, argumentative, and vibrant individuals, whose energy is fueled by the social climate of their time. In order to demonstrate these characteristics, we will consider the question of solving polynomial equations, and not just one of them, but all of them. The answer to our question will span thousands of years, cross through multiple civilizations and continents, and introduce us to some lively mathematicians. Furthermore, this investigation will provide an approachable access point to concepts in higher mathematics.

Thesis Advisor: Dr. Namyong Lee


Discrete Morse Theory by Vector Fields: A Survey and New Directions

Tuesday, May 5, 4-5pm

Speaker: Matthew Nemitz

Abstract: We synthesize some of the main tools in discrete Morse theory from various sources. We do this for abstract simplicial complexes, with an emphasis on vector fields, and use this as a building block for our main result, which investigates the relationship between simplicial maps and homotopy. We use the discrete vector field as a catalyst to build a chain homotopy between chain maps induced by simplicial maps.

Thesis Advisor: Dr. Brandon Rowekamp


Heat Kernel Voting with Geometric Invariants

Friday, May 1, 4-5pm

Speaker: Alexander Harr

Abstract: Here we provide a method for comparing geometric objects. Two objects of interest are embedded into an infinite-dimensional Hilbert space using their Laplacian eigenvectors and eigenfunctions, and the embedding is then truncated to a finite-dimensional Euclidean space, where correspondences between the objects are found and voted on. To simplify correspondence finding, we propose using several geometric invariants to reduce the necessary computations. This method improves on voting methods by identifying isometric regions in shapes of dimension greater than 3, and genus greater than 0, as well as almost retaining isometry. The voting approach evaluates local correspondences while at the same time respecting the global structure.

Thesis Advisor: Dr. Ke Zhu

A Mathematical Model for Malaria with Age-Heterogeneous Biting Rate

Wednesday, April 22, 3-4pm

Speaker: Sho Kawakami

Abstract: We propose a mathematical model for malaria with an age-heterogeneous biting rate. The existence of solutions of the model and the local behaviour of the disease-free equilibrium are explored. Furthermore, the model is extended to an optimal control problem, and the corresponding adjoint equations and optimality conditions are derived. Age-dependent parameter values are estimated and numerical simulations are carried out for the model. The new model better accounts for differences in biting rates between age groups and improves the stability of the explicit algorithm. The optimal control is also shown to depend on the age distribution of the biting rate.

Thesis Advisor: Dr. Ruijun Zhao


Apply logistic regression procedures to datasets with a binary and a nominal response variable

Wednesday, April 15, 2-3pm

Speaker: Duaa Alsubhi

Abstract: A major emphasis of this paper is on applying binomial and multinomial logistic regression. For binomial regression, we used a heart disease dataset to illustrate how to build a modeling strategy that uses purposeful selection of variables to determine the model with the best fit. For multinomial regression, we used an Adolescent Placement Study dataset to compare the logistic regression model with and without insignificant independent variables. In addition, we are interested in the impact of the insignificant predictor variable, which is explained in terms of an odds ratio.

Thesis Advisor: Dr. Mezbahur Rahman


Theory of Principal Components for Applications in Exploratory Crime Analysis and Clustering

Thursday, April 9, 3-4pm

Speaker: Daniel Silva

Abstract: The purpose of this paper is to develop the theory of principal components analysis succinctly from the fundamentals of matrix algebra and multivariate statistics. Principal components analysis is sometimes used as a descriptive technique to explain the variance-covariance or correlation structure of a dataset. However, most often, it is used as a dimensionality reduction technique to visualize a high dimensional dataset in a lower dimensional space. Principal components analysis accomplishes this by using the first few principal components, provided that they account for a substantial proportion of variation in the original dataset. In the same way, the first few principal components can be used as inputs into a cluster analysis in order to combat the curse of dimensionality and optimize the runtime for large datasets. The application portion of this paper will apply these methods to a US Crime 2018 dataset extracted from the Uniform Crime Reports on the FBI’s website.

Thesis Advisor: Dr. Iresha Premarathna


Spring 2022 Graduate Student Master Thesis Defense

Forecasting the Closing price of Bitcoin Cryptocurrency using ARIMA, Prophet and LSTM models

Wednesday, April 20, 4:00-4:50 PM

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/97782055828

Speaker: Abimbola Kolebaje

Abstract: Due to the difficulty in assessing the exact nature of a time series, it is often considerably challenging to generate appropriate forecasts. Over the years, various forecasting models have been developed in the literature, but they have achieved only limited accuracy in forecasting financial trends. In recent years, the advent of Deep Learning has revolutionized the business of forecasting financial trends. This study forecasts the time series of bitcoin closing prices with improved efficiency using long short-term memory (LSTM) techniques and compares its predictive performance with the traditional method (ARIMA). Additionally, we forecast future bitcoin prices with the Facebook Prophet model. The Mean Absolute Percentage Error (MAPE) of all three models is compared to ascertain which model has the highest accuracy in forecasting bitcoin prices. In our case, the LSTM model outperforms the ARIMA and Prophet models.
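
For readers unfamiliar with the comparison metric, the following minimal sketch shows how MAPE is conventionally computed from actual and forecast values. The function name `mape` and the sample numbers in the usage note are illustrative assumptions, not results from the thesis.

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error, returned as a percentage."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    if any(a == 0 for a in actual):
        # The percentage error is undefined when an actual value is zero.
        raise ValueError("actual values must be non-zero")

    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)
```

For example, `mape([100, 200], [110, 180])` averages two 10% errors and gives a value of about 10. A lower MAPE indicates a more accurate forecast.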

Thesis Advisor: Dr. Deepak Sanjel


Prediction of Abnormal Vaginal Discharge using Machine learning techniques among women living in rural and urban area of Tangail district, Bangladesh

Friday, April 8, 3-4pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/8314634593
Meeting ID: 831 463 4593
Passcode: 9682

Speaker: Aninda Roy

Abstract: In recent times, early detection of disease has become a crucial problem due to the rapid growth of the population worldwide. Many complications in women’s health start during their reproductive life. Abnormal vaginal discharge (AVD) is a prevalent problem among women. If it is not treated appropriately, it may lead to severe complications such as pelvic inflammatory disease and cervical cancer. In Bangladesh, women suffer from abnormal vaginal discharge due to a lack of proper hygiene. More importantly, their hesitation to talk about this problem leads to further health complications. This paper presents a qualitative study of women’s socio-demographic profile, personal hygiene practices, previous medical history, associated symptoms, characteristics of discharge, health-seeking pathways, and factors that influence abnormal vaginal discharge. Data were collected from Tangail district, Bangladesh, using a predesigned survey questionnaire with questions designed to fulfill the study objective. The dataset had 280 observations in total, of which 180 women (64.3%) were positive for AVD at the time of the study and the remaining 35.7% were negative. The associations of daily hygiene practices and associated symptoms with AVD were assessed using the Chi-square test, where a p-value of less than 0.05 was considered statistically significant. The prime objective of this paper is to create a model for predicting abnormal vaginal discharge using four machine learning classification algorithms: K Nearest Neighbor (KNN), Support Vector Machine (SVM), Random Forest (RF), and Logistic Regression (LR). The performance of the classifiers is measured by their accuracy, sensitivity, specificity, positive predictive value, and negative predictive value. Additionally, these techniques were appraised on the area under the receiver operating characteristic (ROC) curve.
The results reveal that the LR model obtained the highest accuracy (81.4%), sensitivity (91%), and positive predictive value (82%), with the lowest specificity (64%) and negative predictive value (80%).

Thesis Advisor: Dr. Mezbahur Rahman


Chinese Remainder Theorem and its application on RSA (Rivest-Shamir-Adleman) cryptography

Wednesday, March 30, 4-5:30pm

Location: Wissink Hall 288 (WH 288)

Speaker: Ammishaddai Ogyiri

Abstract: The security of data has been an issue across the globe due to the potential threat to the confidentiality and integrity of data posed by third parties obtaining unauthorized access to protected data. Cryptography has come a long way in helping to maintain the security of data. Symmetric-key (secret-key) algorithms and asymmetric-key (public-key) algorithms are the two common classes of cryptography that make data extremely difficult to access without the authorized key. In this research paper, we delve into asymmetric-key algorithms and focus on the Rivest-Shamir-Adleman (RSA) algorithm. The more secure a key must be, the longer it takes to encrypt and decrypt the data. We compare the speed of encrypting and decrypting data with the ordinary RSA algorithm and with RSA-CRT (Chinese Remainder Theorem). Moduli of 1024 bits and 4096 bits have been used for this comparison. We also discuss the effectiveness of the CRT in RSA cryptography in terms of security and the speed of the decryption process.
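
The difference between ordinary RSA decryption and RSA-CRT can be sketched as follows. This is a textbook illustration with a toy key, not the implementation benchmarked in the thesis. The function names and the small primes are assumptions chosen for readability, and no padding scheme is included.

```python
def rsa_decrypt_plain(c, d, n):
    """Textbook RSA decryption: one exponentiation modulo n."""
    return pow(c, d, n)


def rsa_decrypt_crt(c, d, p, q):
    """RSA decryption via the Chinese Remainder Theorem (Garner's form)."""
    # The reduced exponents and the inverse of q modulo p depend only on
    # the key, so in practice they are precomputed once.
    d_p = d % (p - 1)
    d_q = d % (q - 1)
    q_inv = pow(q, -1, p)  # modular inverse; requires Python 3.8+

    # Two exponentiations with half-size moduli replace one full-size one.
    m_p = pow(c, d_p, p)
    m_q = pow(c, d_q, q)

    # Recombine the two residues into the unique message modulo p * q.
    h = (q_inv * (m_p - m_q)) % p
    return m_q + h * q


# Toy key for illustration only; real moduli are 1024 bits or more.
P, Q, E = 61, 53, 17
N = P * Q
D = pow(E, -1, (P - 1) * (Q - 1))

message = 65
cipher = pow(message, E, N)
assert rsa_decrypt_plain(cipher, D, N) == message
assert rsa_decrypt_crt(cipher, D, P, Q) == message
```

Both functions recover the same message. The CRT variant works with numbers about half the bit length of the modulus, which is where the decryption speed-up comes from.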

Thesis Advisor: Dr. In-Jae Kim


Spring 2021 Graduate Student Master Thesis Defense

Comparison of Classification Algorithms in machine learning

Wednesday, May 5, 3-3:50pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/918 4173 4732

Speaker: Dong Young Park

Abstract: Classification in data science is the process of predicting the class of given data points. Classes are sometimes called targets, labels, or categories. Classification predictive modeling is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y). Classification belongs to the category of supervised learning, where the targets are provided along with the input data. We use several classification algorithms to classify two different kinds of datasets. The algorithms are decision tree, support vector machine, logistic regression, and neural networks. The datasets are the MNIST handwritten digits dataset and a wine quality dataset. MNIST is image data, whereas the wine quality dataset is numerical.

Thesis Advisor: Dr. Namyong Lee


Conformal Deformation of Surfaces by the Extrinsic Dirac Operator

Wednesday, April 28, 2-2:50pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/96717822990

Speaker: Katelyn LaPorte

Abstract: The purpose of this alternate plan paper (APP) is to survey the methods used by Crane and others to create conformal deformations of surfaces in 3-dimensional Euclidean space. Crane's goal was to use these deformations for applications in image processing. Here we go into more detail on the mathematical theory behind his method, including the less familiar quaternion-valued extrinsic Dirac operator. We also explain the integrability conditions of the conformal deformation problem, which can be reduced to an eigenvalue problem related to this Dirac operator. Because it is a first-order linear operator, it is highly efficient in discretization and surface curvature editing.

Thesis Advisor: Dr. Ke Zhu


Comparing Various Robust Estimation Techniques in Regression Analysis

Friday, April 23, 4-5pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/97782055828

Speaker: Tracy Sharon Morrison

Abstract: In regression analysis, the use of the ordinary least squares (OLS) method is inadvisable when dealing with outlier or extreme observations. As a result, we require a method of robust estimation in which the estimation value is not significantly affected by outlier or extreme observations. Four methods of estimation will be compared in this paper in order to determine the best estimation: the M estimation method, the Least Trimmed Square Estimator, the S-estimation method, and the MM estimation method in robust regression. We discover that the best method is the MM-estimation method in this study. The M-estimation method is an extension of the maximum likelihood method, whereas the MM estimation method is a development of the M-estimation method, and the S- estimation method is related to the M-estimation method due to the use of the M-estimation residual scale. While robust regression methods can significantly improve estimation precision, they should not be used in place of more traditional methods.

Thesis Advisor: Dr. Mezbahur Rahman


Count regression models for Covid-19 related deaths and overall deaths

Tuesday, Apr 20, 4-5pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/94358910903

Speaker: Manori Ampe Mohottige Dona

Abstract: With the start of the ongoing Covid-19 pandemic, the number of deaths worldwide has increased in a considerable amount. Confirmed coronavirus cases in the United States increased significantly in the third week of March in 2020 as testing was made more rapid and overtook China’s on the 26th of March 2020, making the US the world’s most affected country by the coronavirus.

This study aims to determine the relationship of overall death counts and Covid-19 related death counts of five main states in the United States to the different age groups and gender over the period of one year. The data were collected from the government data repository, data.gov.

Poisson Regression analysis and Negative Binomial Regression analysis were used for model building purposes and total death count prediction. The k fold cross-validation and leave-one-out cross-validation were used to identify the best model.

The Negative Binomial regression model was identified as the best model compared to the Poisson regression model. According to the model, the most significant factor for total deaths and covid-19 deaths is gender. Texas has the highest significant contribution to the Covid-19 model and the most significant age group is 84 years or over.

Thesis Advisor: Dr. Iresha Premarathna


Correlational Study: An Application of Factor Analysis on a Life Expectancy Data Set

Monday, April 19, 4:15-5:15pm

Speaker: Afrah Alhamad

Abstract: Many statistical techniques focus on analyzing the association between two variables. However, these techniques are not very useful when the interest centers on analyzing the mutual associations across all the variables with no distinctions made between them. Factor analysis is one of the multivariate statistical methods commonly used for this purpose. This paper applies and explains the exploratory factor analysis procedure using a data set. Additionally, the theoretical aspects of factor analysis are briefly discussed from a practical, applied perspective. Particularly, the objective of the paper is to explore the factorial structure of a life expectancy data set by means of exploratory factor analysis and to identify the factor scores.

Thesis Advisor: Dr. Mezbahur Rahman


Prediction of Heart Disease Using Bayesian Logistic Regression by Polya-Gamma Data Augmentation

Friday, April 16, 3-4pm

Location: On-line

Zoom Inforamtion: https://minnstate.zoom.us/j/939 3476 3132

Speaker: Zhenhan Fang

Abstract: Heart disease is one of the most common diseases nowadays, due to number of contributing factors, such as high blood pressure, high blood cholesterol, and smoking. About half of Americans (47%) have at least one of these three risk factors. To reduce the risk of heart disease, healthcare industries generate enormous amount of data, and have been seeking an early diagnosis of such disease for many years. Many data analytics tools have also been applied to help health care providers to identify some of the early signs of heart disease. Many tests can be performed on potential patients to take the extra precautions measures to reduce the effect of having such a disease, and reliable methods to predict early stages of heart disease. In this study, Logistic Regression and Bayesian Logistic Regression are used to establish models to predict heart disease. We apply the Polya-Gamma data augmentation to our Bayesian Logistic model. We found that Bayesian Logistic model can provide a better performance, although it is more expensive than general Logistic model.

Thesis Advisor: Dr. Han Wu


Classification of Chess Games: An exploration of classifiers for anomaly detection in chess

Friday, April 2, 4-5pm

Location: On-line

Zoom Information: https://minnstate.zoom.us/j/5074676277

Speaker: Masudul Hoque

Abstract: Chess is a strategy board game with its inception dating back to the 15th century. The Covid-19 pandemic has led to a chess boom online with 95,853,038 chess games being played on January 2021 on one online chess site (lichess.com) alone. Along with the chess boom, instances of cheating have also become more rampant. Classifications have been used for anomaly detection in fields such as network security and online games and thus it is a natural idea to develop classifiers to detect cheating. However, there are no such prior examples of this, and it is difficult to obtain data where cheating has occurred. So in this paper, we develop 4 machine learning classifiers, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Multinomial Logistic Regression, and K Nearest Neighbour classifiers to predict chess game results and explore predictors that produce the best accuracy performance. We use Confusion Matrix, K Fold Cross Validation, and Leave One Out Cross Validation methods to find the accuracy metrics.

There are three phases of analysis. In Phase I, we train classifiers using 1.94 million over-the-board games as training data and 20 thousand online games as testing data, and obtain accuracy metrics. In Phase II, we select a smaller pool of 212 games, pick 8 additional predictor variables from chess engine evaluations of the moves played in those games, and check whether including these variables improves performance. Finally, in Phase III, we investigate patterns in misclassified cases to define anomalous values.

From Phase I, the models are not performing at a utilizable level of accuracy (44-63%). For all classifiers, it is no better than deciding the class with a coin toss. K Nearest Neighbour with K = 7 was the best model. In Phase II, adding the new predictors improved the performance of all the classifiers significantly across all validation methods. In fact, using only significant variables as predictors produced highly accurate classifiers. Finally, from Phase III, we could not find any patterns or significant differences between the predictors for both correct classifications and misclassifications.

In conclusion, machine learning classification is only one useful tool for spotting instances that indicate anomalies. However, we cannot judge whether a game is anomalous using only one method.

Thesis Advisor: Dr. Iresha Premarathna


Sequential Probability Ratio Test and Experiment

Tuesday, March 16, 4-5pm

Location: AH 102

Speaker: Brianna Klapoetke

Abstract: The Sequential Probability Ratio Test (SPRT) is a method of testing simple hypotheses where the sample size is not determined in advance. In this talk I will describe the general process of using the SPRT, overview the theory that supports it, and describe how I applied it to data I collected to determine what alpha values people used to make their decisions in a simple game I designed.
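As a rough illustration of the general process mentioned in this abstract, the following is a minimal sketch of Wald's SPRT for two simple hypotheses about a Bernoulli success probability. It is not the speaker's implementation: the function name `sprt_bernoulli`, the parameters `p0`, `p1`, `alpha`, and `beta`, and the choice of a Bernoulli model are assumptions made for this example. The stopping thresholds are Wald's standard approximations.

```python
import math


def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.05):
    """Sketch of Wald's SPRT for H0: p = p0 versus H1: p = p1 on 0/1 data.

    All names and defaults here are hypothetical, chosen for illustration.
    Returns a (decision, number_of_observations_used) pair.
    """
    upper = math.log((1 - beta) / alpha)   # crossing above: accept H1
    lower = math.log(beta / (1 - alpha))   # crossing below: accept H0
    llr = 0.0                              # running log-likelihood ratio
    for n, x in enumerate(observations, start=1):
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", n
        if llr <= lower:
            return "accept H0", n
    # Neither boundary was reached: the sample size is not fixed in advance,
    # so the test simply asks for more data.
    return "continue sampling", len(observations)
```

The sample size is an output of the procedure rather than an input: the loop stops at the first observation where the accumulated evidence crosses either boundary.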

Thesis Advisor: Dr. Mezbahur Rahman


Fall 2020 Graduate Student Master Thesis Defense

Gröbner Bases and Systems of Polynomial Equations

Monday, November 23, 4-5pm

Speaker: Rachel Holmes

Abstract: The goal of this paper is to explore the use and construction of Gröbner bases through Buchberger’s algorithm. Specifically, applications of such bases for solving systems of polynomial equations will be discussed. Furthermore, we relate many concepts in commutative algebra to ideas in computational algebraic geometry.

Thesis Advisor: Dr. Wook Kim


Improvement in Regression Analysis through Optimal Clustering Algorithms with Machine Learning

Wednesday, November 18, 3-4pm

Speaker: Taeyoung Choi

Abstract: The primary purpose of the project is to enhance the quality of data analysis by adapting various clustering systems with machine learning, and to apply advanced clustering techniques to regression models in order to improve the efficiency of the analysis. First and foremost, this research aims to expand the knowledge of data analysis through diverse clustering algorithms, including Hierarchical, K-Means, Partitioning Around Medoids (PAM), Clustering Large Applications (CLARA), and Clustering Large Applications based upon Randomized Search (CLARANS). The clustering algorithms assist in high-quality data analysis by constructing particular groups within the given data. The clustering techniques can be readily applied in multiple fields, including the clinical, manufacturing, and business sectors. For example, large type II diabetes patient data sets with numerous variables could be grouped by relevant personal medical history, physical activity level, response to a certain treatment, or diet habits through the appropriate cluster analysis.

Thesis Advisor: Dr. Mezbahur Rahman


Summer 2020 Graduate Student Master Thesis Defense

Multiple Regression Analysis with Continuous and Binary Response Variable

Friday, August 14, 11am-12pm

Speaker: Eunhye Lee

Abstract: This alternate plan paper analyzes student data with several types of regression models in order to find the best-fitting model. Inferential statistics can provide more information than descriptive statistics by answering questions about the data, testing hypotheses, and fitting a proper model, not only to describe the relationships in a data set but also to predict a target. Regression is a statistical method that can be used to reveal relationships between variables in numerous fields, including finance, marketing, biology, investment, health, and even psychology. The main question in this paper is which variables affect the final grade the most. The goal is to fit a multiple linear regression model and a multiple logistic regression model properly, and to detect the most relevant and influential variables in the fitted models to help explain the final mathematics grade. I will first cover the linear regression model, one of the basic types of regression, which describes the simultaneous associations of observed variables with a continuous dependent variable. To obtain a valid linear regression model, the assumptions of residual normality, linearity, independence of residual terms, zero mean of residuals, and homogeneity of residual variance are checked. Secondly, logistic regression is used to study binary outcomes regardless of how the other regressors are measured. The logistic model is based on the logit function and is interpreted in terms of a probability rather than a value. The assumptions for logistic regression are that the response variable is ordinal, the error terms are independent, multicollinearity is absent, the independent variables are linearly related to the log odds, and the sample size is large.

Thesis Advisor: Dr. Mezbahur Rahman


Soybean Price Prediction Using Time Series Forecasting with Google Trends

Friday, August 7, 10-11am

Speaker: Zhuoning Li

Abstract: We use time series methods to analyze the trend and predict prices in the U.S. soybean commodity market, and to find the impact of the "trade war" between China and the U.S. on the soybean price. We use autoregressive integrated moving average and autoregressive conditional heteroskedasticity models to predict the soybean price from U.S. soybean daily price data, and we also use vector autoregression (VAR) and long short-term memory models to predict the soybean price from the previous data together with Google Trends data. Comparing these methods, we get the best prediction from the VAR model.

Thesis Advisor: Dr. Deepak Sanjel


An Application for Bank Loan Default Prediction Analysis using Logistic Regression and Support Vector Machine

Friday, July 31, 10-11am

Speaker: Shuk Ping Wong

Abstract: Risk management is one of the most crucial areas for banks. To maintain a sustainable and profitable business, banks are constantly working on effective models to estimate the likelihood that a customer will default. Although credit scoring is a common indicator for bankers, some financial datasets simply do not come with this variable. This study built a logistic regression model and a support vector machine (SVM) model to predict whether a loan borrower will default based on different categorical variables. The performance of the models is compared based on accuracy and efficiency. We found that the logistic regression model generally provides more depth in the analysis of the variables and is better in terms of interpretability. Although SVM has a higher accuracy rate, the method took too much time to run and suffers from a lack of interpretability. The logistic regression model has better performance in general.

Thesis Advisor: Dr. Mezbahur Rahman


Number Construction

Monday, July 20, 11am-12pm

Speaker: Brian Bertness

Abstract: This paper describes how numbers are constructed via sets and equivalence relations. The necessary Zermelo-Fraenkel set theory axioms are used to define basic sets, relations, and functions. Employing the Axiom of Infinity, the natural numbers are then constructed in terms of sets with an ordering that also conforms to the Peano axioms. Using the set of natural numbers and an equivalence relation, the set of integers with an ordering is created, followed in turn by the set of rational numbers. Lastly, Cauchy sequences are introduced and, using an equivalence relation, these are turned into the set of real numbers, which is shown to have an ordering and the completeness property.
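To make the first step of this construction concrete, here is a small sketch of natural numbers built purely from sets. It assumes the standard von Neumann encoding (0 is the empty set and n + 1 is n ∪ {n}); the abstract does not state which encoding the paper uses, and the helper names `zero`, `successor`, `natural`, and `less_than` are hypothetical.

```python
def zero():
    # 0 is encoded as the empty set (assumed von Neumann encoding).
    return frozenset()


def successor(n):
    # n + 1 is encoded as n ∪ {n}.
    return n | frozenset([n])


def natural(k):
    # Build the set that encodes k by applying successor k times to zero.
    n = zero()
    for _ in range(k):
        n = successor(n)
    return n


def less_than(m, n):
    # Under this encoding, m < n exactly when m is an element of n.
    return m in n
```

Under this encoding the set that represents k has exactly k elements, namely the encodings of 0 through k - 1, so the ordering reduces to set membership.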

Thesis Advisor: Dr. Wook Kim


Spring 2020 Graduate Student Master Thesis Defense

The Roots of Root Finding

Wednesday, May 6, 2-3pm

Speaker: Kurt Grunzke

Abstract: One of the biggest challenges facing teachers is convincing students that their intuition about a concept is incorrect. In particular, our current social climate fuels an intuition that mathematicians are “nerds,” “geeks,” or other terms that generally refer to a boring person who lacks social skills. The goal of this paper is to demolish that stereotype by demonstrating that mathematicians are independent, argumentative, and vibrant individuals, whose energy is fueled by the social climate of their time. In order to demonstrate these characteristics, we will consider the question of solving polynomial equations, and not just one of them, but all of them. The answer to our question will span thousands of years, cross through multiple civilizations and continents, and introduce us to some lively mathematicians. Furthermore, this investigation will provide an approachable access point to concepts in higher mathematics.

Thesis Advisor: Dr. Namyong Lee


Discrete Morse Theory by Vector Fields: A Survey and New Directions

Tuesday, May 5, 4-5pm

Speaker: Matthew Nemitz

Abstract: We synthesize some of the main tools in discrete Morse theory from various sources. We do this for abstract simplicial complexes, with an emphasis on vector fields, and use this as a building block for our main result, which investigates the relationship between simplicial maps and homotopy. We use the discrete vector field as a catalyst to build a chain homotopy between chain maps induced by simplicial maps.

Thesis Advisor: Dr. Brandon Rowekamp


Heat Kernel Voting with Geometric Invariants

Friday, May 1, 4-5pm

Speaker: Alexander Harr

Abstract: Here we provide a method for comparing geometric objects. Two objects of interest are embedded into an infinite-dimensional Hilbert space using their Laplacian eigenvectors and eigenfunctions, then truncated to a finite-dimensional Euclidean space, where correspondences between the objects are found and voted on. To simplify correspondence finding, we propose using several geometric invariants to reduce the necessary computations. This method improves on voting methods by identifying isometric regions in shapes of dimension greater than 3 and genus greater than 0, as well as almost retaining isometry. The voting approach evaluates local correspondences while at the same time respecting the global structure.

Thesis Advisor: Dr. Ke Zhu

A Mathematical Model for Malaria with Age-Heterogeneous Biting Rate

Wednesday, April 22, 3-4pm

Speaker: Sho Kawakami

Abstract: We propose a mathematical model for malaria with age-heterogeneous biting rate. The existence of solutions to the model and the local behaviour of the disease-free equilibrium are explored. Furthermore, the model is extended to an optimal control problem, and the corresponding adjoint equations and optimality conditions are derived. Age-dependent parameter values are estimated and numerical simulations are carried out for the model. The new model better accounts for differences in biting rates between age groups and brings improvements in stability to the explicit algorithm. The optimal control is also shown to depend on the age distribution of the biting rate.

Thesis Advisor: Dr. Ruijun Zhao


Apply logistic regression procedures to datasets with a binary and a nominal response variable

Wednesday, April 15, 2-3pm

Speaker: Duaa Alsubhi

Abstract: A major emphasis of this paper is on applying binomial and multinomial logistic regression. For binomial regression, we used a heart disease dataset to illustrate how to build a modeling strategy that uses purposeful selection of variables to determine the model with the best fit. For multinomial regression, we used an Adolescent Placement Study dataset to compare the logistic regression model with and without insignificant independent variables. In addition, we are interested in the impact of the insignificant predictor variable, which is explained in terms of an odds ratio.

Thesis Advisor: Dr. Mezbahur Rahman


Theory of Principal Components for Applications in Exploratory Crime Analysis and Clustering

Thursday, April 9, 3-4pm

Speaker: Daniel Silva

Abstract: The purpose of this paper is to develop the theory of principal components analysis succinctly from the fundamentals of matrix algebra and multivariate statistics. Principal components analysis is sometimes used as a descriptive technique to explain the variance-covariance or correlation structure of a dataset. However, most often, it is used as a dimensionality reduction technique to visualize a high dimensional dataset in a lower dimensional space. Principal components analysis accomplishes this by using the first few principal components, provided that they account for a substantial proportion of variation in the original dataset. In the same way, the first few principal components can be used as inputs into a cluster analysis in order to combat the curse of dimensionality and optimize the runtime for large datasets. The application portion of this paper will apply these methods to a US Crime 2018 dataset extracted from the Uniform Crime Reports on the FBI’s website.

Thesis Advisor: Dr. Iresha Premarathna