2nd Annual Data Science Day
April 4th, 2023 | 2:10PM - 5:25PM
Ogden College Hall & Snell Hall
Ogden Hall Auditorium
Dr. Cate Webb, Director, ARTP
Dr. Richard Schugart, Director, Applied Center for Data Science (ACDS)
2:15PM - 3:10PM
Ogden College Hall 1006
Renée Cummings, Assistant Professor of the Practice in Data Science, University of Virginia
Title: Critical Data Science and the Interdisciplinary Imagination
Abstract: How data activism stretches the imagination of data science and sparks interdisciplinary collaborations out of which rigorous and robust ethical guardrails can be created to minimize the risks and maximize the rewards of data science.
3:25PM - 3:45PM
Snell Hall 1108
Cameron White, Dollar General Corp, WKU alum
Title: Data Goldilocks: What to Do When Your Data are Too Big, Too Small, and Just Right
Abstract: Often in academia, the size of datasets under study is “just right”: large enough to glean many insights from, yet small enough to be worked on with a single desktop PC. In this talk, experienced data scientist Cameron White will discuss various approaches on how to handle those situations when the datasets are not just right. At some firms, the datasets aren’t really that large and comprehensive, thus hampering some of the questions that can be asked. At others, there is too much data, which can be very difficult to wrangle on a single machine.
3:50PM - 5:25PM
Snell Hall 1108 (see below for specific times of talks, titles, presenters, and abstracts)
3:50PM - 5:25PM
Snell Hall 1102 (see below for specific times of talks, titles, presenters, and abstracts)
3:50PM - 5:25PM
Snell Hall 1101 (see below for specific times of talks, titles, presenters, and abstracts)
General Session I
Location: Snell Hall 1108
Presenters: Vivian Rivera, WKU Undergraduate Researcher; Mentor: Dr. Chandrakanth Emani, Department of Biology
Abstract: Plant volatiles that also have the properties of pharmaceutically important phytochemicals have the potential to interact with clinically measured markers within their protein domain architecture. Studying their molecular evolution and deciphering their phylogenies allow for a better understanding of their ecology, genetics, and biochemistry. The present study is a bioinformatics-based analysis of three such compounds, namely, Eugenol from Ocimum sanctum (basil), Allicin from Allium sativum (garlic), and Limonene from Mentha spicata (mint).
Presenters: Ashton Lyvers, WKU Undergraduate Researcher; Mentor: Dr. Amy Brausch, Department of Psychological Sciences
Abstract: Gender expectations can create complications in various aspects of our lives, including feeling respected in our relationships, education, and career. This study aims to explore the relationship between gender expression and respect. It is hypothesized that participants are more likely to associate a negative connotation with stereotypically feminine characteristics and that higher levels of conformity to femininity will be associated with lower levels of respect received from peers.
Participants were 294 college students recruited from Western Kentucky University. Participants completed the following measures: two versions of the Bem Sex Role Inventory (BSRI), a Feelings of Respect Inventory (FRI), the Conformity to Feminine Norms Inventory (CFNI), and the Conformity to Masculine Norms Inventory (CMNI). The BSRI was modified to measure the connotation participants associate with stereotypical gender traits. Data collection is ongoing and preliminary analyses are reported here.
One-way ANOVA results indicated that feminine traits are more likely to have a negative connotation than masculine traits. Non-cisgender individuals reported feeling significantly less respect, while also conforming to feminine traits more than males and masculine traits more than cisgender individuals. Future research should examine any recent changes in gender roles. A larger sample of males and non-cisgender individuals could reveal more patterns.
Presenters: Ellie Chitwood, WKU Undergraduate Researcher; Mentor: Dr. Thad Crews, Department of Analytics and Information Systems
Abstract: The Analytics Life Cycle involves three levels of understanding. At the Problem Level, the focus is on identifying a problem or question that needs data-driven insights. This level includes understanding the business context, goals, and objectives, as well as defining the problem or question to be answered. Moving to the Conceptual Modeling Level, the focus shifts to the models and techniques used to analyze and interpret data. This level includes statistical techniques, machine learning algorithms, and data visualization approaches, as well as methods of collecting, cleaning, and transforming data into a usable format for analysis. At the Tool Level, the emphasis is on the practical application of relevant software tools, programming languages, and technical skills (such as Python, R, SAS, Tableau, Power BI, SQL, etc.) to effectively and efficiently achieve the desired results.
This study explores the value of the “Triangle Analytics” model as a cross-disciplinary strategic framework for understanding and communicating predictive analytics. The model draws on insights from cognitive psychology, biology, and neuroscience to help analysts communicate better with clients and better manage complexity. This presentation illustrates the value of the Triangle Analytics model in explaining the results of a specific predictive model.
Presenters: Logan Stewart, WKU Undergraduate Researcher; Mentor: Dr. Xiaowen Chen, Department of Psychological Sciences
Abstract: This project is part of a larger, long-term effort to develop a question generator for Equal Employment Opportunity (EEO) laws and regulations. Question generation is an AI technology, and a few models have been developed to generate questions from texts; however, those models are usually generic and focus on reading comprehension for general news or documents. In-depth and complex applications are the bottleneck in developing this technology. The first step toward an EEO law question generator is to build EEO-law-specific corpora, which is the objective of this project. Two corpora have been developed: one is a large dataset of news, blog posts, and other text related to EEO law, and the other consists of question-answer pairs. The corpora are mostly complete, with around 500MB of raw text and a few thousand question-answer pairs. The data were gathered by web scraping public-facing websites, including government sites containing court records or EEO laws, regulations, policies, and guidelines.
Presenters: Parmeshvar Prakash, Gatton Academy Researcher; Mentor: Dr. Xiaowen Chen, Department of Psychological Sciences
Abstract: This research is part of a larger, long-term project to develop an interactive learning application for Equal Employment Opportunity (EEO) laws and regulations. It aims to explore the semantic connections between questions and answers regarding EEO laws and regulations using machine learning techniques. The connection can be semantic similarity, structure, logic, or relations, and/or can be based on certain word usage in the questions and answers. Establishing the connection between the answers and questions can help to build a more accurate question generator. My pilot study found connections in terms of semantic similarity, with values ranging from 40% to 90%. The high similarity values indicate that semantic similarity is a method to generate answers. The low similarity values indicate that other semantic structures, relations, or logic exist. For the pairs with lower similarity, most of the words in the answers are not present in the questions, yet the similarity values are still moderate.
Future research will be conducted on finding a relationship between the answers and questions using machine learning and deep learning.
General Session II
Location: Snell Hall 1102
Presenters: George Nguyen, Gatton Academy Researcher; Om Patel, Gatton Academy Researcher; Mentor: Dr. Xiaowen Chen, Department of Psychological Sciences
Abstract: This study aimed to discover the hidden “topics” in over 600 job descriptions of commercial airline pilots. In Python, the Latent Dirichlet Allocation model and a coherence model were used to find the most accurate set of topics to represent the text-based data. Looking into these topics, we can understand the demands of the aviation industry and what skills and qualifications are necessary for commercial airline pilots. The topics revealed a range of competencies and qualifications, including pilot training, technical proficiency, and the job benefits provided by American airlines. In the future, we want to apply structural topic modeling, which uses the metadata of the documents to improve the assignment of words to topics.
Presenters: Pranav Gangumolu, Gatton Academy Researcher; Mentor: Dr. Xiaowen Chen, Department of Psychological Sciences
Abstract: With modernization in the United States, managerial occupations are growing more rapidly than the average for all jobs; over 880,000 new manager jobs are estimated to emerge by 2031 (US Bureau of Labor Statistics). Managerial jobs are critical because the job holders carry many responsibilities in leading a team of workers. Corporations expect applicants to have a variety of skills and knowledge to fit a managerial position. My research questions are: (1) what skills are required for managerial positions, and (2) what skills are the most popular across managerial positions? Job advertisements are a good resource for understanding the skills required for managerial jobs. I used a web scraper to collect 15,259 managerial job advertisements from the website. To identify and describe the skills, I will use Latent Dirichlet Allocation (LDA) and Structural Topic Modeling (STM) to analyze the job descriptions in the advertisements. My research will investigate the skills needed to be a manager and experiment with machine learning techniques, specifically topic modeling, for conducting a thematic analysis of a large amount of text-based data. Overall, my research will demonstrate the use of ML techniques in social science research.
Presenters: Dr. Jeremy Maddox, Department of Chemistry
Abstract: In this presentation, I will discuss the development of a Mathematica package for reading and processing data generated with the Gaussian electronic structure software. These files contain a variety of computational data for modeling the electronic and vibrational properties using first principles of physics. I envision that the package will be a useful tool for applications in computational chemistry research project management, related code development, and instructing/training students. A broad overview of Gaussian’s data structures will be provided along with some live demonstrations of GaussianToolKit’s functionality.
Presenters: Nathan Gillispie, WKU Undergraduate Researcher; Mentor: Dr. Matt Nee, Department of Chemistry
Abstract: Secondary organic aerosols (SOAs) are carbon-based solid or liquid particles dispersed in air that arise from oxidation reactions. Sulfur-containing SOAs are likely to impact human and environmental health, as well as affect the Earth’s albedo, yet their formation is not well understood. The concentrations of SOA precursors can be tracked over time as reactions progress in a reaction chamber. These concentrations could be perfectly described by a set of differential equations, but only when the exact reaction mechanism is known. These mechanisms can have hundreds of steps and also require rate constants – the speed at which each individual step occurs. Through a process analogous to hyperparameter optimization in machine learning, this research tests models of the formation of SOAs against data from experiments to determine the validity of a proposed reaction mechanism. A better understanding of such a process will help determine the most important conditions/factors that promote SOA formation.
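The idea of testing a kinetic model against measured concentrations can be sketched in a few lines. The following is only an illustration, not the presenter's method: it assumes a single hypothetical first-order step A -> B with d[A]/dt = -k[A], uses synthetic "measurements", integrates the equation with forward Euler, and picks the rate constant from a simple grid of candidates (the names `simulate`, `sum_squared_error`, and `fit_rate_constant` are made up for this example).

```python
import math

def simulate(k, a0, times, steps_per_interval=1000):
    """Forward-Euler solution of d[A]/dt = -k*[A], sampled at `times`.

    Assumes times[0] is the initial time, at which [A] = a0.
    """
    a = a0
    previous = times[0]
    samples = [a0]
    for t in times[1:]:
        dt = (t - previous) / steps_per_interval
        for _ in range(steps_per_interval):
            a += -k * a * dt
        samples.append(a)
        previous = t
    return samples

def sum_squared_error(k, a0, times, observed):
    """Misfit between the simulated and the observed concentrations."""
    predicted = simulate(k, a0, times)
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))

def fit_rate_constant(a0, times, observed, candidates):
    """Grid search: return the candidate k with the smallest misfit."""
    return min(candidates,
               key=lambda k: sum_squared_error(k, a0, times, observed))

# Synthetic data from the exact solution with k = 0.5 (not experimental data).
times = [0.0, 1.0, 2.0, 3.0, 4.0]
observed = [1.0 * math.exp(-0.5 * t) for t in times]
candidates = [i / 100 for i in range(1, 201)]  # 0.01, 0.02, ..., 2.00

best_k = fit_rate_constant(1.0, times, observed, candidates)
print(best_k)
```

A real mechanism would replace the single equation with a coupled system for many species and search over many rate constants at once, which is where the analogy to hyperparameter optimization becomes useful.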
Presenters: Trayson Lawler, WKU Graduate Student Researcher; Mentor: Dr. Jason Polk, Director, Center for Human GeoEnvironmental Studies, Department of Earth, Environmental, and Atmospheric Sciences
Abstract: Urban karst environments are often plagued by groundwater flooding, a type of flooding where water rises from the subsurface to the surface through the underlying caves and karst features. The heterogeneity and duality of karst systems make them very unpredictable, especially during intense storm events, and residents in such areas are frequently disturbed and financially burdened by the effects of karst groundwater flooding. The City of Bowling Green, Kentucky, experiences frequent, unpredictable groundwater flooding, making it the ideal study area for this project. This project attempts to aid the flooding problem in Bowling Green through the creation of a predictive flood model for the Lost River Basin - a 150 km² groundwater basin that contains most of the city. The machine-learning model will be trained using precipitation and antecedent moisture conditions to predict fluctuations of the potentiometric surface. High-resolution monitoring at 1-minute intervals has been employed at 44 water level monitoring sites and 15 precipitation sites to ensure accuracy of the model. As a result, this study will give advanced warning for flood events, offer additional information on the storage and response times of the aquifer, and create a strong methodology for other flood-prone, urban karst areas.
General Session III
Location: Snell Hall 1101
Presenters: Shake Ibna Abir, WKU Graduate Student Researcher; Mentor: J.F. Guo, Yanshan University, China
Abstract: Cardiovascular disease (CVD) is presently one of the leading causes of death, with an estimated 24.1 million people expected to be affected by 2025. The health care industry therefore aims to gather a vast amount of data on cardiovascular disease and to use Deep Learning (DL) algorithms to analyze it, assisting doctors in the early detection and identification of potential risk factors for CVD. DL algorithms can help to discover potential patterns of diseases and symptoms based on this structured and unstructured case information. In epidemiology, this is the first prospective study on cardiovascular disease in the community free movement population, and the related risk factors can be recognized. A prediction method for cardiovascular disease based on LSTM is proposed and implemented; the connection between the LSTM and the unit state is tried to ensure correct data acquisition during operation. The method is verified experimentally on the original medical data of 4434 participants in a data set with 11628 observations. The algorithm has an accuracy of nearly 94% and a Matthews correlation coefficient (MCC) score of 0.96.
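For readers unfamiliar with the two reported metrics, the sketch below shows how accuracy and the Matthews correlation coefficient are computed from a binary confusion matrix. The counts are hypothetical and chosen only for illustration; they are not the study's results, and the function names are made up for this example.

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Ranges from -1 (total disagreement) to +1 (perfect prediction).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0

# Hypothetical counts (true/false positives and negatives).
example = dict(tp=90, tn=85, fp=15, fn=10)
print(accuracy(**example))
print(matthews_corrcoef(**example))
```

Unlike accuracy, the MCC uses all four cells of the confusion matrix, so it remains informative when the two classes are of very different sizes.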
Presenters: Shaharina Shoha, WKU Graduate Student Researcher; Mentor: Dr. Richard Schugart, Department of Mathematics; Mentor: Dr. Mark Robinson, Department of Mathematics
Abstract: Perfusion imaging is valuable because it is used to help grade tumors; differentiate between tumor types; differentiate tumors from non-neoplastic lesions; guide intraoperative sampling; and, most importantly, determine the efficacy of treatment. Computational techniques combined with the imaged data can help identify important biological parameters. For example, key parameters for stroke patients include cerebral blood flow (CBF), cerebral blood volume (CBV), and mean transit time (MTT). These parameters can help distinguish between the likely salvageable tissue and the irreversibly damaged infarct core. The parameters are calculated by de-convolving contrast-time curves with the arterial inlet input function. A common approach employed with the de-convolution method is a singular value decomposition (SVD). However, these algorithms are very sensitive to noise and artifacts in the source image, which may introduce additional distortions in the output parameters. For this reason, we employ Tikhonov regularization and truncated SVD as ways to reduce the sensitivity to noise. In the future, we would like to compare parameter estimates measured from Tikhonov regularization and truncated SVD to measurements using machine-learning algorithms.
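The two regularization strategies named above can be written compactly in terms of the SVD. The notation here is an assumption made for illustration and does not come from the abstract: A is the discretized convolution matrix built from the arterial input function, c is the measured contrast-time curve, r is the unknown to be recovered, \tau is a truncation threshold, and \lambda is the Tikhonov parameter.

```latex
\begin{aligned}
A &= U \Sigma V^{\top} = \sum_{i=1}^{n} \sigma_i\, u_i v_i^{\top}, \\
r_{\mathrm{TSVD}} &= \sum_{i \,:\, \sigma_i > \tau} \frac{u_i^{\top} c}{\sigma_i}\, v_i, \\
r_{\lambda} &= \arg\min_{r} \left( \lVert A r - c \rVert_2^2 + \lambda^2 \lVert r \rVert_2^2 \right)
             = \sum_{i=1}^{n} \frac{\sigma_i}{\sigma_i^2 + \lambda^2}\, \bigl(u_i^{\top} c\bigr)\, v_i .
\end{aligned}
```

Both forms limit the contribution of small singular values, which are the ones that amplify noise: truncated SVD discards them outright, while Tikhonov regularization damps them smoothly through the factor \sigma_i / (\sigma_i^2 + \lambda^2).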
Presenters: Ariti Gani, Gatton Academy Researcher; Mentor: Dr. Ngoc Nguyen, Department of Mathematics; Mentor: Dr. Jerry Brotzge, Director, Kentucky Climate Center and Kentucky Mesonet, Department of Earth, Environmental, and Atmospheric Sciences
Abstract: Although Kentucky’s climate falls broadly in the humid subtropical category, different parts of the state can experience very different environments. The Kentucky Mesonet has an infrastructure to monitor weather and climate statewide. Over the past 15 years, automated stations have been collecting near-surface environmental and weather conditions in about 86 locations across the state. These big data provide an opportunity for analysis with robust statistical methods. Our ongoing research focuses on different parameters of the weather, their relationship with soil moisture, and how that can affect plant growth.
Initially for this project, we have selected soil moisture and other related data from the GAMA station in Monroe County. We are analyzing the data mainly through R, a versatile programming language for interpreting large amounts of data. Doing this will allow us to define the relationship between soil moisture and other variables, such as air temperature, precipitation, humidity, and pressure. The results of this research can be significant in understanding how plants are influenced by soil moisture, and the optimal conditions under which they can grow. It is important to understand the relationship of living organisms with their environment, and how the climatic and environmental conditions are changing in Kentucky over time.
Presenters: Sarah Hartman, WKU Undergraduate Researcher; Mentor: Dr. Melanie Autin, Department of Mathematics; Mentor: Dr. Ritchie Taylor, Department of Public Health
Abstract: Air pollution is an important part of global environmental health, and there are many things that can affect pollution levels. This study investigates the role of multiple meteorological factors on Nitrogen Dioxide (NO2), Sulfur Dioxide (SO2), and Ozone (O3) air pollutant levels using multiple linear regression models. Six monitoring sites in various Kentucky counties were used for this study, and the meteorological factors include temperature, humidity, average wind speed, precipitation, prevailing wind direction, and incoming solar radiation.
Presenters: Dr. Lukun Zheng, Department of Mathematics
Abstract: In this presentation, we propose a new methodology for authorship attribution based on a profile of indices related to the occupancy problem, called occupancy-problem indices. The occupancy problem has a long history and is an important example in standard textbooks like Feller (1971). We base our methodology on function words. We establish a testing procedure by constructing a confidence band of the occupancy-problem indices using the sampling distribution of the number of distinct function words. We validate our proposed methodology using controlled and constructed writing samples whose authorship is known. We then apply this methodology to explore the question of who wrote the 15th Oz book, whose authorship is disputed between Lyman Frank Baum (1856–1919) and his successor on the Oz series, Ruth Plumly Thompson (1891–1976).
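The classical occupancy problem behind this approach asks how many distinct cells are occupied after n items are placed at random. The sketch below illustrates only that underlying quantity, not the presenter's occupancy-problem indices, which the abstract does not define. It assumes, hypothetically, that each function-word token is drawn independently with fixed probabilities; the function names and the small word list are made up for this example.

```python
def expected_distinct(probabilities, n):
    """Expected number of distinct words seen in n independent draws.

    Word j appears at least once with probability 1 - (1 - p_j)**n,
    and the expectation is the sum of these probabilities.
    """
    return sum(1.0 - (1.0 - p) ** n for p in probabilities)

def count_distinct(tokens, vocabulary):
    """Observed number of distinct vocabulary words in a token list."""
    return len({token for token in tokens if token in vocabulary})

# Hypothetical function-word list and writing sample.
function_words = {"the", "and", "of", "to"}
sample = "the cat and the dog of the house".split()

print(count_distinct(sample, function_words))
print(expected_distinct([0.25] * 4, 2))
```

Comparing an observed count of distinct function words with what a given author's word probabilities would predict is the kind of comparison a confidence band makes precise.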