Data Science Half Day 2022
April 26, 2022 | 2:25PM - 5:00PM
College High Hall
Dr. Bud Fischer, Provost
Dr. Cate Webb, Director, Applied Research & Technology Program
Dr. Richard Schugart, Director, Applied Center for Data Science
2:30 - 4:53pm
COHH 3117 (see below for specific times of talks, titles, presenters, and abstracts)
2:30 - 4:23pm
COHH 3123 (see below for specific times of talks, titles, presenters, and abstracts)
General Session I
Location: COHH 3117
Presenter: Dr. Ranjit Koodali, Associate Provost for Research & Graduate Education
Abstract: The development of advanced materials is an important aspect of modern life. However,
the discovery of novel materials involves searching the vast chemical space to find
materials with desired properties. Recent developments in the applications of Machine
Learning (ML) in materials chemistry show promise to accelerate the material discovery
process. In this perspective article, we highlight the importance of ML in materials
chemistry. We discuss a few examples of ML applications in synthesis, characterization,
and predicting activities of materials. Finally, we discuss the challenges in this
field and how the progress in ML in chemistry is leveraged together with advanced
robotics to perform automated optimization of material discovery.
Presenters: Hannah Laney, Gatton Academy; Jenna Waltrip, Gatton Academy
Abstract: Manual electroencephalogram (EEG) analysis requires extensive training and is a subjective process. A computerized method to classify types of seizures from EEG readings will make seizure diagnoses less subjective, more precise, and less time-consuming. Our project uses simulated EEG data combined with convolutional neural networks composed of different layers to classify three types of seizures (absence, tonic, and myoclonic) based on images of their graphs.
Presenter: Dr. Eric Rappin, Kentucky Mesonet
Abstract: In this talk we will discuss one of the original big data questions: how to utilize vast amounts of weather observations to improve weather forecasting. We will discuss the history of weather forecasting, from the “simple” models of World War II to the data-hungry machine learning algorithms being developed today. Furthermore, we will discuss how the data is generated, be it from observations or from models that span spatiotemporal scales from local to global and from diurnal to decadal. The focus will be on how weather models have adapted to the growth in computing power and on the transition from a fluid dynamical basis to a machine learning focus. We will also discuss the overall ability to forecast a nonlinear problem.
Presenters: Nathan Hogg, Gatton Academy; Armaan Rai, Gatton Academy
Abstract: Satellites are vital to the functionality of our ever-evolving technological world. They are the go-between for nearly all Earth-based communications, Global Positioning Systems, and many other technical applications. Currently, there are few accessible, user-friendly programs that can calculate the launch conditions of these satellites. Our project aims to bridge this gap by providing the user with streamlined input methods and clear outputs of critical data.
Presenter: Dr. Lance Hahn, Department of Psychological Sciences
Abstract: Psychological science and computer science have an important history of intellectual exchange. Image recognition, learning, language comprehension, and intelligence are among the topics that have rich literatures in both fields. Marr’s interdisciplinary “Levels of Analysis” is useful for defining a focus and constructing a problem space. Example data and problems will demonstrate the value of an interdisciplinary approach with a focus ranging from pixels to Rogerian talk therapy.
Presenters: Desmond Harris, Gatton Academy; Reagan Phelps, Gatton Academy
Abstract: Stress and performance theories have been poorly represented in education. A random forest model is developed to help athletes assess their stress and how it impacts performance. The model was developed in part by identifying critical variables outlined in different psychology textbooks that describe performance theory.
Presenter: Dr. Xiaowen Chen, Department of Psychological Sciences
Abstract: This study aimed to identify the leadership competencies sought by employers and to compare those competencies with the desired leadership traits described in the leadership literature. To achieve these two purposes, we generated two massive text datasets: one consists of job descriptions of leadership positions available on a job-search engine, and the other consists of leadership book chapters. Topic modeling and other ML techniques were used for automatic competency modeling.
Presenters: Brody Johnson, Gatton Academy; Diego Moreno, Gatton Academy
Abstract: Using MATLAB’s Text Analytics Toolbox, we implemented the Valence Aware Dictionary and sEntiment Reasoner (VADER) algorithm to perform a subtype of text analysis known as sentiment analysis, in which the positive and negative sentiments expressed in a text are evaluated, analyzed, and given a score. The model used is polar, meaning all data exists on a one-dimensional scale, in this case the interval from -1 to 1. Such a program is useful for analyzing writings in a large variety of fields, from the academically prestigious (research papers, economic journals) to the more commonplace (social media posts, customer reviews).
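The polar scale described above can be illustrated with a minimal Python sketch. It assumes the normalization used in the open-source reference implementation of VADER, where a raw summed valence x is mapped to x / sqrt(x^2 + alpha) with alpha = 15. The function name and the sample raw scores are hypothetical, and the MATLAB toolbox used by the presenters may compute its score differently.

```python
import math

def normalize_to_polar(raw_score, alpha=15.0):
    """Squash an unbounded raw valence sum into the open interval (-1, 1).

    Assumption: alpha = 15 follows the open-source VADER reference code;
    it is not taken from the abstract.
    """
    return raw_score / math.sqrt(raw_score * raw_score + alpha)

# Hypothetical raw valence sums for three texts: negative, neutral, positive.
for raw in (-4.2, 0.0, 3.1):
    print(f"raw={raw:+.1f} -> polar={normalize_to_polar(raw):+.3f}")
```

The mapping is odd-symmetric and monotone, so the sign and ordering of raw scores are preserved while extreme values stay strictly inside -1 and 1.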
Presenter: Dr. Lukun Zheng, Department of Mathematics
Abstract: Video game genre classification based on cover art and textual description would be highly beneficial to many modern identification, collocation, and retrieval systems. At the same time, it is an extremely challenging task for the following reasons. First, there exists a wide variety of video game genres, many of which are not concretely defined. Second, video game covers vary in many ways, such as colors, styles, and textual information, even for games of the same genre. Third, cover designs and textual descriptions may vary due to many external factors such as country, culture, and target audience. With growing competitiveness in the video game industry, cover designers and typographers push cover designs to their limits in the hope of attracting sales. Computer-based automatic video game genre classification has become a particularly exciting research topic in recent years. In this paper, we propose a multi-modal deep learning framework to solve this problem. The contribution of this paper is four-fold. First, we compile a large dataset of 50,000 video games from 21 genres, consisting of cover images, description text, title text, and genre information. Second, state-of-the-art image-based and text-based models are evaluated thoroughly for the task of genre classification for video games. Third, we develop an efficient and scalable multi-modal framework based on both images and text. Fourth, a thorough analysis of the experimental results is given, and future work to improve performance is suggested. The results show that the multi-modal framework outperforms the current state-of-the-art image-based or text-based models. Several challenges are outlined for this task. More effort and resources are needed for this classification task to reach a satisfactory level.
General Session II
Location: COHH 3123
Presenter: Shaharina Shoha, Department of Mathematics
Abstract: Over the past several years, air pollution has amplified risks to human health
and crop production. Every year, a large number of people are diagnosed with asthma
and other breathing-related problems. A main cause is the high concentration of
life-threatening PM2.5 particles suspended in the atmosphere. In our project, we propose
ML models, namely random forest tree (RFT), gradient boosting tree (GBT), and neural
network (NN) models, to forecast the concentration level of these particles and to
analyze the contributing factors. These forecasts may help people prepare careful
prevention measures and effective strategies against a major risk factor in human
diseases including chronic obstructive pulmonary disease, reduced lung function,
pneumonia, cardiovascular diseases, premature death, and leukaemia. The dataset contains
71149 instances of hourly averaged responses from an array of 6 metal oxide chemical
sensors embedded in an air quality chemical multisensor device. To compare the ML
models, we used MAE, RMSE, and ROC as performance metrics.
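Two of the metrics named above, MAE and RMSE, can be sketched in a few lines of Python. This is an illustration only: the function names and the PM2.5 readings are hypothetical and are not taken from the project's dataset or code.

```python
import math

def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between observations and forecasts."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference; penalizes large misses more."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))

# Hypothetical hourly PM2.5 concentrations (observed vs. forecast).
observed = [10.0, 20.0, 30.0, 40.0]
forecast = [12.0, 18.0, 33.0, 37.0]

print("MAE: ", mean_absolute_error(observed, forecast))
print("RMSE:", root_mean_squared_error(observed, forecast))
```

RMSE is never smaller than MAE on the same data, and the gap between them grows when a few forecasts miss by a wide margin.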
Presenters: Gracie Davis, Gatton Academy; Aubrey Morse, Gatton Academy; Maria Pfeifer, Gatton
Abstract: Menstrual cycle predictions carry uncertainty from several sources: some is inherent to the physical process, while some comes from the data collection itself. This project looks at identifying and minimizing uncertainty, an idea paramount to data science in the health field, through the lens of menstrual cycle predictions.
Presenter: Dr. Chandrakanth Emani, Department of Biology
Abstract: The presence of non-neuronal acetylcholine in plants and animals is implicated in the regulation of cell differentiation, phytochrome-mediated processes, cytoskeletal organization, immune function, and ion transport. The clinical significance of non-neuronal acetylcholine lies in its role in the pathogenesis of diseases such as acute and chronic inflammation, local and systemic infection, dementia, atherosclerosis, and cancer. Tracing the molecular evolution of the acetylcholine pathway and subsequent bioinformatics analysis may uncover its transition from a non-neuronal role to a neuronal role. My talk will focus on how the bioinformatics analysis of specific acetylcholine-related enzymes provides us with valuable model systems to explore the molecular basis of diseases.
Presenters: Addison Hoskins, Gatton Academy; Tasha Otieno, Gatton Academy
Abstract: Using synthetic data from the user input, our project works to analyze population growth trends and predict future population changes through differential equations, like the Verhulst Model. Our program will prompt the user to select a biome, select an animal, and input values for the current and past populations of the species. Using the logistic growth model, our program will predict future population changes at small increments of time and curve fit the data to generate a function matching the predicted population sizes. This information will then be displayed onscreen, along with a brief overview of the ideal environmental conditions.
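The logistic (Verhulst) model mentioned above, dP/dt = r P (1 - P/K), can be sketched in Python as follows. The sketch steps the equation forward at small time increments and compares the result with the closed-form solution. The function names and the parameter values (starting population, growth rate r, carrying capacity K) are hypothetical, and the presenters' own program may be structured differently.

```python
import math

def logistic_closed_form(t, p0, r, k):
    """Exact solution of dP/dt = r * P * (1 - P / K) with P(0) = p0."""
    return k / (1.0 + ((k - p0) / p0) * math.exp(-r * t))

def logistic_euler(p0, r, k, dt, steps):
    """Advance the population in small time increments (forward Euler)."""
    populations = [p0]
    for _ in range(steps):
        p = populations[-1]
        populations.append(p + dt * r * p * (1.0 - p / k))
    return populations

# Hypothetical inputs: 50 animals, growth rate 0.4 per year, capacity 1000.
p0, r, k = 50.0, 0.4, 1000.0
trajectory = logistic_euler(p0, r, k, dt=0.01, steps=2000)  # 20 years

print("Euler estimate at t = 20:", round(trajectory[-1], 2))
print("Closed form at t = 20:   ", round(logistic_closed_form(20.0, p0, r, k), 2))
```

The population rises quickly at first and then levels off as it approaches the carrying capacity K, which is the behavior a curve fit to the predicted values would be expected to reproduce.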
Presenter: Dr. John Erickson, Department of Information Systems
Abstract: Open-ended questions in mathematics are commonly used by teachers to monitor and assess students’ deeper conceptual understanding of content. Student answers to these types of questions often exhibit a combination of language, drawn diagrams and tables, and mathematical formulas and expressions that supply teachers with insight into the processes and strategies adopted by students in formulating their responses. While these student responses help to inform teachers on their students’ progress and understanding, the amount of variation in these responses can make it difficult and time-consuming for teachers to manually read, assess, and provide feedback to student work. For this reason, there has been a growing body of research in developing AI-powered tools to support teachers in this task. This work seeks to build upon this prior research by introducing a model that is designed to help automate the assessment of student responses to open-ended questions in mathematics through sentence-level semantic representations. We find that this model outperforms previously published benchmarks across three different metrics. With this model, we conduct an error analysis to examine characteristics of student responses that may be considered to further improve the method.
Presenters: Holly McClure, Gatton Academy; Sophie Wielawski, Gatton Academy
Abstract: Our project will use existing government data for U.S. domestic flights, as well as MATLAB analysis and prediction features, to estimate the cheapest destination options personalized to the user. The user will be prompted to enter their current state, desired vacation region (urban, coastal, mountainous, etc.), and the season in which they plan to vacation. The code will compare the user inputs to the predicted data and display information on the most affordable travel destinations.
Presenters: Jake Boils, Department of Physics; Lars Hebenstiel, Department of Physics; Dr. Gordon Emslie, Department of Physics; Dr. Ivan Novikov, Department of Physics
Abstract: Solar flares are extremely intense releases of mass and energy and can hurl charged particles towards Earth, damaging electronics. It is important to us to be able to predict such occurrences. In March of 2024, NASA plans to launch two instruments to study solar flares. Using various solar data sources, on the order of 50 petabytes, we plan to construct a convolutional neural network (CNN) to predict solar flares for this NASA mission. So far, we have studied our data sets and have constructed a small four-dimensional datacube in (t,λ,x,y), and have trained a small CNN to classify flares.