Projects
Research Initiation Award: RECREATE: A Robust and Efficient Cooperative Resource Allocation and Scheduling System for High Resource Utilization and Throughput in Clouds (supported by NSF (Award Number: 2400459)) Investigators: Jinwei Liu (PI at Florida A&M University)
An efficient resource allocation and scheduling system improves resource utilization and throughput while ensuring Service Level Objective (SLO) availability, which is crucial to cloud providers for high profit. The goal of this project is to develop a robust and efficient cooperative resource allocation and scheduling system that helps cloud providers achieve high profit by improving resource utilization and throughput while ensuring SLO availability and robustness. The project breaks new ground by incorporating machine learning into cloud computing and by designing and implementing a novel resource allocation and scheduling system that achieves high resource utilization and throughput while improving SLO availability and robustness in clouds. It also identifies the root cause of low resource utilization and enables a new understanding of machine learning in optimizing resource allocation and scheduling in clouds. This research produces innovation in resource allocation and scheduling, and its results help to advance theories, concepts, and methods in cloud computing. The specific aims of this project are to: 1) develop an efficient machine-learning-based resource allocation system, consisting of cooperative demand-based resource allocation and cooperative opportunistic-based resource allocation, for high resource utilization while ensuring SLO availability in clouds; and 2) develop a robust machine-learning-based scheduling system for improving throughput and failure resilience.
RCN-UBE: HBCU Faculty Collaborative Network for the Integration of Artificial Intelligence and Machine Learning into Teaching General Biology (supported by NSF (Award Number: 2417643)) Investigators: Clement Yedjou (PI at Florida A&M University), Waneene Dorsey (Co-PI at Grambling State University), Jinwei Liu (Co-PI at Florida A&M University), Felicite Noubissi (Co-PI at Jackson State University), Jameka Grigsby (Co-PI at Alcorn State University)
The goal of this project is to equip biology faculty with training in artificial intelligence (AI) and machine learning (ML), enabling them to integrate these technologies into their biology courses, thereby enhancing student success.
Data Management and Optimization for Fault-tolerance in Distributed Storage
Cloud-based applications have become increasingly complex and involve a huge number of computers, and they face all sorts of failures, ranging from power outages and software bugs to malicious attacks. These failures can typically be categorized as correlated (machine) failures and non-correlated (machine) failures. Datacenter storage systems, e.g., Hadoop Distributed File System (HDFS), Google File System (GFS), and Windows Azure, are an important component of cloud datacenters, especially for data-intensive services in the big data era. Data availability and data durability are critical for cloud storage providers to provide high QoS and to reduce Service Level Agreement (SLA) violations and the associated penalties. Both are usually affected by data loss, which is typically caused by correlated and non-correlated machine failures. Replication is a common approach to enhancing data availability and data durability in distributed/cloud storage systems. However, reducing both the probability of data loss and the cost of replication (i.e., storage, consistency maintenance, and bandwidth cost) is a challenge. Motivated by this challenge, my work designs a low-cost multi-failure resilient replication scheme (MRR). MRR takes into account the data popularity observed in current distributed/cloud storage systems and builds a nonlinear integer programming (NLIP) model to derive the replication degree of each data object. It then uses a Balanced Incomplete Block Design (BIBD)-based method to create fault-tolerant sets (FTSs) and stores the replicas of each data chunk in one FTS, which reduces the data loss probability under both correlated and non-correlated machine failures and therefore enhances the expected data availability. MRR also leverages data popularity to reduce the cost of replication: it uses fewer replicas and chooses cheaper storage media for unpopular data objects.
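The fault-tolerant-set idea can be illustrated with a minimal Python sketch. This is not the MRR implementation: it assumes a toy cluster of 7 servers with 3 replicas per chunk, builds the FTSs from the (7, 3, 1) BIBD (the Fano plane) by a cyclic construction, and counts how many 3-server correlated failures would destroy every replica of some chunk. The function names, the round-robin chunk placement, and the "all triples" baseline are illustrative assumptions, and the NLIP step that derives replication degrees is omitted.

```python
from itertools import combinations

def cyclic_bibd(v, base_block):
    """Shift a base block modulo v to obtain v blocks (one per shift)."""
    return [frozenset((x + s) % v for x in base_block) for s in range(v)]

def is_bibd(v, blocks, lam):
    """True if every pair of the v points occurs together in exactly lam blocks."""
    return all(
        sum(1 for b in blocks if {p, q} <= b) == lam
        for p, q in combinations(range(v), 2)
    )

def place_chunks(num_chunks, replica_sets):
    """Assumed policy: assign chunks to replica sets round-robin."""
    return {c: replica_sets[c % len(replica_sets)] for c in range(num_chunks)}

def fatal_failure_fraction(num_servers, r, placement):
    """Fraction of simultaneous r-server failures that wipe out
    all replicas of at least one chunk."""
    used = set(placement.values())
    failures = [frozenset(f) for f in combinations(range(num_servers), r)]
    fatal = sum(any(rs <= f for rs in used) for f in failures)
    return fatal / len(failures)

SERVERS, REPLICAS, CHUNKS = 7, 3, 700  # toy sizes, chosen for illustration

# FTSs from the Fano plane: {0, 1, 3} is a perfect difference set modulo 7.
fts = cyclic_bibd(SERVERS, (0, 1, 3))
assert is_bibd(SERVERS, fts, lam=1)

# Baseline: replicas scattered over every possible triple of servers.
all_triples = [frozenset(t) for t in combinations(range(SERVERS), REPLICAS)]

fts_placement = place_chunks(CHUNKS, fts)
scattered_placement = place_chunks(CHUNKS, all_triples)

# Only the 7 FTS triples (out of 35 possible) are fatal with FTS placement.
print(fatal_failure_fraction(SERVERS, REPLICAS, fts_placement))
# With scattered placement, every one of the 35 triples holds some chunk.
print(fatal_failure_fraction(SERVERS, REPLICAS, scattered_placement))
```

In this toy setting, restricting replicas to FTSs shrinks the number of server combinations whose simultaneous failure loses data, which is the intuition behind the lower data loss probability under correlated failures.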
My work also designs a popularity-aware, multi-failure resilient, and cost-effective replication scheme (PMCR). PMCR splits the cloud storage system into a primary tier and a backup tier, and stores the three replicas of the same data in one FTS formed by two servers in the primary tier and one server in the backup tier, which reduces data loss caused by both correlated and non-correlated failures and thus improves data durability. PMCR also classifies data into hot, warm, and cold data based on data popularity, and uses data compression/deduplication (Similar Compression and Delta Compression) to eliminate data redundancy and thus reduce storage and bandwidth cost. Finally, it chooses the storage media for data objects based on data popularity to further reduce storage cost.
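The tiered placement described above can be sketched as follows. This is a simplified illustration, not the PMCR implementation: the popularity thresholds, the mapping from popularity class to storage medium, the server names, and the round-robin choice of FTS are all hypothetical, and the compression step is left out.

```python
from dataclasses import dataclass

# Hypothetical thresholds (accesses per day) and class-to-medium mapping.
HOT_THRESHOLD = 1000
WARM_THRESHOLD = 50
MEDIUM_FOR_CLASS = {"hot": "ssd", "warm": "hdd", "cold": "archive"}

@dataclass(frozen=True)
class FaultTolerantSet:
    primary: tuple  # two servers in the primary tier
    backup: str     # one server in the backup tier

    def servers(self):
        """The three servers that hold the three replicas."""
        return (*self.primary, self.backup)

def classify(accesses_per_day):
    """Label a data object as hot, warm, or cold by its popularity."""
    if accesses_per_day >= HOT_THRESHOLD:
        return "hot"
    if accesses_per_day >= WARM_THRESHOLD:
        return "warm"
    return "cold"

def build_fts(primary_servers, backup_servers):
    """Form each FTS from two primary-tier servers and one backup-tier server."""
    if len(primary_servers) != 2 * len(backup_servers):
        raise ValueError("need exactly two primary servers per backup server")
    return [
        FaultTolerantSet((primary_servers[2 * i], primary_servers[2 * i + 1]), b)
        for i, b in enumerate(backup_servers)
    ]

def place(objects, fts_list):
    """Map object name -> (popularity class, storage medium, replica servers).
    Assumed policy: objects are spread over FTSs round-robin by name order."""
    plan = {}
    for i, (name, accesses) in enumerate(sorted(objects.items())):
        popularity = classify(accesses)
        fts = fts_list[i % len(fts_list)]
        plan[name] = (popularity, MEDIUM_FOR_CLASS[popularity], fts.servers())
    return plan

fts_list = build_fts(["p0", "p1", "p2", "p3"], ["b0", "b1"])
objects = {"index.db": 5000, "report.csv": 120, "logs.tar": 3}
for name, decision in place(objects, fts_list).items():
    print(name, decision)
```

Keeping all three replicas of an object inside one FTS that spans both tiers means a correlated failure confined to the primary tier still leaves the backup replica intact.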
Scheduling using Machine Learning in Clouds and Clusters
The rapid increase in data and cloud-based applications has led to the growth of datacenters at an unprecedented pace. Resource scheduling has become one of the most important problems in clouds and clusters, and handling the uncertainty and constraints of jobs/tasks (e.g., task dependency, job response time, resource efficiency) is a challenge in designing efficient scheduling algorithms. To address this challenge, I have designed dependency-aware and resource-efficient scheduling (DRS). DRS considers task dependency and assigns tasks that are independent of each other to different workers so that they can run in parallel and the response time of jobs can be reduced. It also introduces scheduler domains and, to reduce communication overhead, uses the Gossip protocol for communication between the scheduler managers in different domains based on a Geometric Random Graph (GRG). In addition, DRS uses mutual reinforcement learning and a bipartite graph to estimate tasks' waiting time in the queues of workers, and it takes this waiting time into account when assigning tasks to workers to reduce jobs' response time. I have also designed a dependency-aware scheduling and preemption system for data-parallel clusters.
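The dependency-aware part of this idea can be sketched in a few lines of Python. This is only an illustration, not the DRS algorithm: it groups tasks into waves of mutually independent tasks using the standard-library `graphlib.TopologicalSorter` (Python 3.9+) and spreads each wave over workers round-robin. The task names, worker names, and the round-robin rule are assumptions, and the scheduler domains, Gossip communication, and waiting-time estimation are omitted.

```python
from graphlib import TopologicalSorter

def schedule_waves(dependencies, workers):
    """dependencies maps each task to the tasks it depends on.
    Returns a list of waves; each wave maps task -> worker.
    Tasks within one wave have no dependencies on each other."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    plan = []
    while sorter.is_active():
        # All tasks whose predecessors have finished are ready together.
        ready = sorted(sorter.get_ready())
        # Assumed policy: spread independent tasks over workers round-robin.
        wave = {task: workers[i % len(workers)] for i, task in enumerate(ready)}
        plan.append(wave)
        sorter.done(*ready)  # mark the wave finished to unlock successors
    return plan

# Hypothetical job: "clean" and "stats" both need "load"; "report" needs both.
dependencies = {
    "load": [],
    "clean": ["load"],
    "stats": ["load"],
    "report": ["clean", "stats"],
}
plan = schedule_waves(dependencies, ["w0", "w1"])
for number, wave in enumerate(plan):
    print(number, wave)
```

Here "clean" and "stats" land on different workers in the same wave, so they can run concurrently instead of queuing behind each other. A waiting-time-aware variant would replace the round-robin rule with a choice of the worker that has the smallest estimated queue delay.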