2nd Annual Data Science Day

April 4th, 2023 | 2:10PM - 5:25PM

Ogden College Hall & Snell Hall

Schedule Overview

2:10PM

Welcome!

Ogden Hall Auditorium

Dr. Cate Webb, Director, ARTP

Dr. Richard Schugart, Director, Applied Center for Data Science (ACDS)

2:15PM - 3:10PM

Plenary Talk

Ogden College Hall 1006

Renée Cummings, Assistant Professor of the Practice in Data Science, University of Virginia

Title: Critical Data Science and the Interdisciplinary Imagination

Abstract: How data activism stretches the imagination of data science and sparks interdisciplinary collaborations out of which rigorous and robust ethical guardrails can be created to minimize the risks and maximize the rewards of data science.

3:25PM - 3:45PM

Short Industrial Talk - Canceled

Snell Hall 1108

Cameron White, Dollar General Corp, WKU alum

Title: Data Goldilocks: What to Do When Your Data are Too Big, Too Small, and Just Right

Abstract: Often in academia, the size of datasets under study is “just right”: large enough to glean many insights from, yet small enough to be worked on with a single desktop PC. In this talk, experienced data scientist Cameron White will discuss approaches to handling situations in which the datasets are not just right. At some firms, the datasets aren’t large or comprehensive enough, which limits the questions that can be asked. At others, there is too much data, which can be very difficult to wrangle on a single machine.

3:50PM - 5:25PM

General Session I

Snell Hall 1108 (see below for specific times of talks, titles, presenters, and abstracts)

3:50PM - 5:25PM

General Session II

Snell Hall 1102 (see below for specific times of talks, titles, presenters, and abstracts)

3:50PM - 5:25PM

General Session III

Snell Hall 1101 (see below for specific times of talks, titles, presenters, and abstracts)

General Session I

Location: Snell Hall 1108

3:50-4:05 PM | Molecular Evolution of Plant Volatiles: A Case Study with Eugenol, Allicin, & Limonene

Presenters: Vivian Rivera, WKU Undergraduate Researcher; Mentor: Dr. Chandrakanth Emani, Department of Biology

Abstract: Plant volatiles that also have the properties of pharmaceutically important phytochemicals have the potential to interact with clinically measured markers within their protein domain architecture. Studying their molecular evolution and deciphering their phylogenies allow for a better understanding of their ecology, genetics, and biochemistry. The present study is a bioinformatics-based analysis of three such compounds, namely, Eugenol from Ocimum sanctum (basil), Allicin from Allium sativum (garlic), and Limonene from Mentha spicata (mint).

4:10-4:25 PM | Masculinity, Femininity, and Dignity: Examining the Relationship Between Gender Expression and Respect

Presenters: Ashton Lyvers, WKU Undergraduate Researcher; Mentor: Dr. Amy Brausch, Department of Psychological Sciences

Abstract: Gender expectations can create complications in various aspects of our lives, including feeling respected in our relationships, education, and career. This study aims to explore the relationship between gender expression and respect. It is hypothesized that participants are more likely to associate a negative connotation with stereotypically feminine characteristics and that higher levels of conformity to femininity will be associated with lower levels of respect received from peers.

Participants were 294 college students recruited from Western Kentucky University. Participants completed the following measures: two versions of the Bem Sex Role Inventory (BSRI), a Feelings of Respect Inventory (FRI), the Conformity to Feminine Norms Inventory (CFNI), and the Conformity to Masculine Norms Inventory (CMNI). The BSRI was modified to measure the connotation participants associate with stereotypical gender traits. Data collection is ongoing and preliminary analyses are reported here.

One-way ANOVA results indicated that feminine traits are more likely to have a negative connotation than masculine traits. Non-cisgender individuals reported feeling significantly less respect, while also conforming to feminine traits more than males and masculine traits more than cisgender individuals. Future research should examine any recent changes in gender roles. A larger sample of males and non-cisgender individuals could reveal more patterns.

4:30-4:45 PM | A Framework for Improving the Explanatory Value of Predictive Analytics

Presenters: Ellie Chitwood, WKU Undergraduate Researcher; Mentor: Dr. Thad Crews, Department of Analytics and Information Systems

Abstract: The Analytics Life Cycle involves three levels of understanding. At the Problem Level, the focus is on identifying a problem or question that needs data-driven insights. This level includes understanding the business context, goals, and objectives, as well as defining the problem or question to be answered. Moving to the Conceptual Modeling Level, the focus shifts to the models and techniques used to analyze and interpret data. This level includes statistical techniques, machine learning algorithms, and data visualization approaches, as well as methods of collecting, cleaning, and transforming data into a usable format for analysis. At the Tool Level, the emphasis is on the practical application of relevant software tools, programming languages, and technical skills (such as Python, R, SAS, Tableau, Power BI, SQL, etc.) to effectively and efficiently achieve the desired results.

This study explores the value of the “Triangle Analytics” model as a cross-disciplinary strategic framework for understanding and communicating predictive analytics. The model draws on insights from cognitive psychology, biology, and neuroscience to help analysts communicate better with clients and better manage complexity. This presentation illustrates the value of the Triangle Analytics model in explaining the results of a specific predictive model.

4:50-5:05 PM | Creating Corpora for an EEO Law Question Generator

Presenters: Logan Stewart, WKU Undergraduate Researcher; Mentor: Dr. Xiaowen Chen, Department of Psychological Sciences

Abstract: This project is part of a larger, long-term effort to develop a question generator for Equal Employment Opportunity (EEO) laws and regulations. Question generation is an AI technology, and a few models have been developed to generate questions from texts; however, those models are usually generic and focus on reading comprehension for general news or documents. In-depth and complex applications are the bottleneck in the development of this technology. The first step in developing an EEO law question generator is to build EEO law-specific corpora, which is the objective of this project. Two corpora have been developed: a large dataset of news, blog posts, and other text related to EEO law, and a set of question-answer pairs. The corpora are mostly complete, with around 500MB of raw text and a few thousand question-answer pairs. The data for these corpora were gathered by web scraping public-facing websites, including government sites containing court records or EEO laws, regulations, policies, and guidelines.

5:10-5:25 PM | Exploring the Semantic Connections in the Equal Employment Opportunity Laws Q&A Corpus

Presenters: Parmeshvar Prakash, Gatton Academy Researcher; Mentor: Dr. Xiaowen Chen, Department of Psychological Sciences

Abstract: This research is part of a larger, long-term project to develop an interactive learning application for Equal Employment Opportunity (EEO) laws and regulations. It aims to explore the semantic connections between questions and answers regarding EEO laws and regulations with machine learning techniques. The connection can be semantic similarity, structure, logic, and relations, and/or based on certain word usage in the questions and answers. Establishing the connection between the answers and questions can help to build a more accurate question generator. My pilot study has found connections in terms of semantic similarity, ranging from 40% to 90%. The high similarity values indicate that semantic similarity is a method to generate answers. The low similarity values indicate that there exist other semantic structures, relations, or logic. For the lower similarities, most of the words in the answers are not present in the questions, yet the similarity values remain moderate.

Future research will be conducted on finding a relationship between the answers and questions using machine learning and deep learning.

General Session II

Location: Snell Hall 1102

3:50-4:05 PM | Use topic modeling to identify the job requirements of commercial airline pilots

Presenters: George Nguyen, Gatton Academy Researcher; Om Patel, Gatton Academy Researcher; Mentor: Dr. Xiaowen Chen, Department of Psychological Sciences

Abstract: This study aimed to discover the hidden “topics” in over 600 job descriptions of commercial airline pilots. In Python, the Latent Dirichlet Allocation model and a coherence model were used to find the most accurate set of topics to represent the text-based data. Looking into these topics, we can understand the demands of the aviation industry and what skills and qualifications are necessary for commercial airline pilots. The topics revealed a range of competencies and qualifications, including pilot training, technical proficiency, and the job benefits provided by American airlines. In the future, we want to apply structural topic modeling, which uses the metadata of the documents to improve the assignment of words to topics.

4:05-4:20 PM | What are the Most Needed Skills for a Successful Management Career?

Presenters: Pranav Gangumolu, Gatton Academy Researcher; Mentor: Dr. Xiaowen Chen, Department of Psychological Sciences

Abstract: With modernization in the United States, managerial occupations are growing more rapidly than the average for all jobs; over 880,000 new manager jobs are estimated to emerge by 2031 (US Bureau of Labor Statistics). Managerial jobs are critical, as the job holders carry many responsibilities in leading a team of workers. Corporations expect applicants to have a variety of skills and knowledge to fit a managerial position. My research questions are: (1) what skills are required for managerial positions, and (2) what skills are the most popular across managerial positions? Job advertisements are a good resource for understanding the skills required for managerial jobs. I used a web scraper to collect 15,259 managerial job advertisements from the website. To identify and describe the skills, I will use Latent Dirichlet Allocation (LDA) and Structural Topic Modeling (STM) to analyze the job descriptions in the advertisements. My research will investigate the skills needed to be a manager and experiment with machine learning techniques, more specifically topic modeling, for conducting a thematic analysis of a large amount of text-based data. Overall, my research will demonstrate the use of ML techniques in social science research.

4:30-4:45 PM | GaussianToolKit: A Mathematica package for working with electronic structure data

Presenters: Dr. Jeremy Maddox, Department of Chemistry

Abstract: In this presentation, I will discuss the development of a Mathematica package for reading and processing data generated with the Gaussian electronic structure software. These files contain a variety of computational data for modeling the electronic and vibrational properties using first principles of physics. I envision that the package will be a useful tool for applications in computational chemistry research project management, related code development, and instructing/training students. A broad overview of Gaussian’s data structures will be provided along with some live demonstrations of GaussianToolKit’s functionality.

4:50-5:05 PM | Understanding Secondary Organic Aerosol Formation Through Chemical Kinetics Models

Presenters: Nathan Gillispie, WKU Undergraduate Researcher; Mentor: Dr. Matt Nee, Department of Chemistry

Abstract: Secondary organic aerosols (SOAs) are carbon-based solid or liquid particles dispersed in air that arise from oxidation reactions. Sulfur-containing SOAs are likely to impact human and environmental health, as well as affect the earth’s albedo, yet their formation is not well understood. The concentrations of SOA precursors can be tracked over time as reactions progress in a reaction chamber. These concentrations could be perfectly described by a set of differential equations, but only when the exact reaction mechanism is known. These mechanisms can have hundreds of steps, and each step requires a rate constant, the parameter that sets how fast that individual step proceeds. Through a process analogous to hyperparameter optimization in machine learning, this research tests models of the formation of SOAs against data from experiments to determine the validity of a proposed reaction mechanism. A better understanding of such a process will help determine the most important conditions/factors that promote SOA formation.
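
The idea of testing a kinetics model against chamber data can be sketched in a few lines. This is only an illustration, not the presenters' method: it assumes a single first-order step A -> B (a real mechanism may have hundreds of steps), uses made-up synthetic "measurements", and replaces any real optimizer with a plain grid search. All function names are hypothetical.

```python
import math

def simulate_first_order(k, a0, times, dt=0.001):
    """Integrate d[A]/dt = -k*[A] with forward Euler.

    `times` must be sorted ascending; returns [A] at each requested time.
    """
    out = []
    a, t = a0, 0.0
    for target in times:
        while t < target - 1e-12:
            step = min(dt, target - t)
            a += -k * a * step
            t += step
        out.append(a)
    return out

def sum_squared_error(predicted, observed):
    """Misfit between model output and measurements."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))

def fit_rate_constant(times, observed, a0, candidates):
    """Grid search: return the candidate k with the smallest misfit."""
    return min(
        candidates,
        key=lambda k: sum_squared_error(
            simulate_first_order(k, a0, times), observed
        ),
    )

# Synthetic, noise-free "measurements" generated with k = 0.7 (invented data).
times = [0.0, 0.5, 1.0, 1.5, 2.0]
observed = [math.exp(-0.7 * t) for t in times]

# Candidate rate constants 0.1, 0.2, ..., 2.0 (arbitrary grid).
candidates = [i / 10 for i in range(1, 21)]
best_k = fit_rate_constant(times, observed, 1.0, candidates)
print("best-fitting rate constant:", best_k)
```

A large residual misfit at the best-fitting rate constant would suggest that the assumed mechanism itself, and not just its parameters, is wrong.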

5:10-5:25 PM | A Predictive Model for Urban Karst Groundwater Systems

Presenters: Trayson Lawler, WKU Graduate Student Researcher; Mentor: Dr. Jason Polk, Director, Center for Human GeoEnvironmental Studies, Department of Earth, Environmental, and Atmospheric Sciences

Abstract: Urban karst environments are often plagued by groundwater flooding, a type of flooding where water rises from the subsurface to the surface through the underlying caves and karst features. The heterogeneity and duality of karst systems make them very unpredictable, especially during intense storm events, and residents in such areas are frequently disturbed and financially burdened by the effects of karst groundwater flooding. The City of Bowling Green, Kentucky, experiences frequent, unpredictable groundwater flooding, making it the ideal study area for this project. This project attempts to aid the flooding problem in Bowling Green through the creation of a predictive flood model for the Lost River Basin, a 150 km² groundwater basin that contains most of the city. The machine-learning model will be trained using precipitation and antecedent moisture conditions to predict fluctuations of the potentiometric surface. High-resolution monitoring at 1-minute intervals has been employed at 44 water level monitoring sites and 15 precipitation sites to ensure accuracy of the model. As a result, this study will give advanced warning for flood events, offer additional information on the storage and response times of the aquifer, and create a strong methodology for other flood-prone, urban karst areas.

General Session III

Location: Snell Hall 1101

3:50-4:05 PM | Deep Learning Application of LSTM to predict the risk factors of etiology cardiovascular disease

Presenters: Shake Ibna Abir, WKU Graduate Student Researcher; Mentor: J.F. Guo, Yanshan University, China

Abstract: Cardiovascular disease (CVD) is presently one of the leading causes of death, with an estimated 24.1 million people expected to be affected by 2025. The health care industry therefore aims to gather a vast amount of data on cardiovascular disease and use Deep Learning (DL) algorithms to analyze it, assisting doctors in the early detection and identification of potential risk factors for CVD. DL algorithms can help to discover potential patterns of diseases and symptoms based on this structured and unstructured case information. In epidemiology, this is the first prospective study on cardiovascular disease in the community free movement population, and the related risk factors can be recognized. A prediction method for cardiovascular disease based on LSTM is proposed and implemented, and the connection between LSTM and unit state is tried to ensure the correct data acquisition during operation. The method is verified experimentally on the original medical data of 4,434 participants with 11,628 observations. The algorithm has an accuracy of nearly 94% and a 0.96 Matthews correlation coefficient (MCC) score.

4:10-4:25 PM | A Comparison of Computational Perfusion Imaging Techniques

Presenters: Shaharina Shoha, WKU Graduate Student Researcher; Mentor: Dr. Richard Schugart, Department of Mathematics; Mentor: Dr. Mark Robinson, Department of Mathematics

Abstract: Perfusion imaging is valuable because it is used to help grade tumors; differentiate between tumor types; differentiate tumors from non-neoplastic lesions; guide intraoperative sampling; and, most importantly, determine the efficacy of treatment. Computational techniques combined with the imaged data can help identify important biological parameters. For example, key parameters for stroke patients include cerebral blood flow (CBF), cerebral blood volume (CBV), and mean transit time (MTT). These parameters can help distinguish between the likely salvageable tissue and the irreversibly damaged infarct core. The parameters are calculated by de-convolving contrast-time curves with the arterial inlet input function. A common approach employed with the de-convolution method is a singular value decomposition (SVD). However, these algorithms are very sensitive to noise and artifacts in the source image, which may introduce additional distortions in the output parameters. For this reason, we employ Tikhonov regularization and truncated SVD as ways to reduce the sensitivity to noise. In the future, we would like to compare parameter estimates measured from Tikhonov regularization and truncated SVD to measurements using machine-learning algorithms.
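
The two regularization strategies named in the abstract can be illustrated with a small NumPy sketch. It is not the presenters' code: the input curve, the "true" residue function, the noise level, the time grid, and both regularization strengths are invented for the example, and the function names are hypothetical. Both solvers share one SVD-based form and differ only in how they filter the singular values.

```python
import numpy as np

def convolution_matrix(aif, dt):
    """Lower-triangular Toeplitz matrix A so that A @ r approximates
    the discrete convolution of the input function with r."""
    n = len(aif)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1):
            A[i, j] = aif[i - j] * dt
    return A

def truncated_svd_solve(A, c, rel_threshold):
    """Discard singular values at or below rel_threshold * largest."""
    U, s, Vt = np.linalg.svd(A)
    keep = s > rel_threshold * s[0]
    inv_s = np.zeros_like(s)
    inv_s[keep] = 1.0 / s[keep]
    return Vt.T @ (inv_s * (U.T @ c))

def tikhonov_solve(A, c, lam):
    """Damp each singular value with the filter s / (s^2 + lam^2)."""
    U, s, Vt = np.linalg.svd(A)
    filt = s / (s ** 2 + lam ** 2)
    return Vt.T @ (filt * (U.T @ c))

# Toy data (all values are assumptions made for illustration).
dt = 1.0
t = np.arange(0.0, 40.0, dt)
aif = t ** 3 * np.exp(-t / 1.5)      # bolus-shaped input curve
aif /= aif.max()
r_true = 0.6 * np.exp(-t / 4.0)      # flow-scaled residue function
A = convolution_matrix(aif, dt)

rng = np.random.default_rng(0)
c_noisy = A @ r_true + rng.normal(0.0, 0.02, size=t.size)

s_max = np.linalg.svd(A, compute_uv=False)[0]
r_tsvd = truncated_svd_solve(A, c_noisy, rel_threshold=0.1)
r_tik = tikhonov_solve(A, c_noisy, lam=0.1 * s_max)

print("true peak:          ", r_true.max())
print("truncated SVD peak: ", r_tsvd.max())
print("Tikhonov peak:      ", r_tik.max())
```

Truncation applies a hard cutoff to small singular values, while Tikhonov damps them smoothly; with zero regularization on a well-conditioned matrix, both reduce to the ordinary inverse.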

4:30-4:45 PM | Statistical Analysis of Kentucky Mesonet Big Data: Factors Affecting Soil Moisture and Plant Growth in Kentucky

Presenters: Ariti Gani, Gatton Academy Researcher; Mentor: Dr. Ngoc Nguyen, Department of Mathematics; Mentor: Dr. Jerry Brotzge, Director, Kentucky Climate Center and Kentucky Mesonet, Department of Earth, Environmental, and Atmospheric Sciences

Abstract: Although Kentucky’s climate falls broadly in the humid subtropical category, different parts of the state can experience very different environments. The Kentucky Mesonet has an infrastructure to monitor weather and climate statewide. Over the past 15 years, automated stations have been collecting near-surface environmental and weather conditions in about 86 locations across the state. These big data provide an opportunity for analysis with robust statistical methods. Our ongoing research focuses on different parameters of the weather, their relationship with soil moisture, and how that can affect plant growth.

Initially for this project, we have selected soil moisture and other related data from the GAMA station in Monroe County. We are analyzing the data mainly through R, a versatile programming language for interpreting large amounts of data. Doing this will allow us to define the relationship between soil moisture and other variables, such as air temperature, precipitation, humidity, and pressure. The results of this research can be significant in understanding how plants are influenced by soil moisture, and the optimal conditions under which they can grow. It is important to understand the relationship of living organisms with their environment, and how the climatic and environmental conditions are changing in Kentucky over time.

4:50-5:05 PM | A Regression Study of Meteorological Impacts on Air Pollutants in Kentucky

Presenters: Sarah Hartman, WKU Undergraduate Researcher; Mentor: Dr. Melanie Autin, Department of Mathematics; Mentor: Dr. Ritchie Taylor, Department of Public Health

Abstract: Air pollution is an important part of global environmental health, and there are many things that can affect pollution levels. This study investigates the role of multiple meteorological factors on Nitrogen Dioxide (NO2), Sulfur Dioxide (SO2), and Ozone (O3) air pollutant levels using multiple linear regression models. Six monitoring sites in various Kentucky counties were used for this study, and the meteorological factors include temperature, humidity, average wind speed, precipitation, prevailing wind direction, and incoming solar radiation.

5:10-5:25 PM | Authorship Attribution via Occupancy-problem-type Indices

Presenters: Dr. Lukun Zheng, Department of Mathematics

Abstract: In this presentation, we propose a new methodology for authorship attribution based on a profile of indices related to the occupancy problem, called occupancy-problem indices. The occupancy problem has a long history and is an important example in standard textbooks like Feller (1971). We base our methodology on function words. We establish a testing procedure by constructing a confidence band of the occupancy-problem indices using the sampling distribution of the number of distinct function words. We validate our proposed methodology using controlled and constructed writing samples whose authorship is known. We then apply this methodology to explore the question of who wrote the 15th Oz book, whose authorship is disputed between Lyman Frank Baum (1856–1919) and his successor on the Oz series, Ruth Plumly Thompson (1891–1976).
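
As a rough illustration of the occupancy problem behind this talk (not the presenter's indices or testing procedure), the sketch below computes the expected number of distinct function words in a sample of n function-word tokens, and counts the distinct function words actually observed in a text. It assumes tokens are drawn independently with fixed probabilities. The function names, the toy probabilities, and the toy word list are hypothetical.

```python
def expected_distinct(probs, n):
    """Expected number of distinct categories seen in n independent draws.

    Category i is missed in all n draws with probability (1 - p_i)^n,
    so it is seen at least once with probability 1 - (1 - p_i)^n.
    """
    return sum(1.0 - (1.0 - p) ** n for p in probs)

def count_distinct(tokens, function_words):
    """Number of distinct function words that occur in a token list."""
    return len({tok for tok in tokens if tok in function_words})

# Toy usage frequencies for four function words (invented numbers).
probs = [0.4, 0.3, 0.2, 0.1]
for n in (1, 5, 20):
    print(n, expected_distinct(probs, n))

# Toy text and toy function-word list (both invented).
sample = "the cat and the dog sat on the mat".split()
print(count_distinct(sample, {"the", "and", "of", "on"}))
```

Comparing an observed count with the count expected under a candidate author's usage frequencies conveys the flavor of the approach; the confidence band described in the abstract is not reproduced here.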
