Description

Prepare a PPT on the topic below (a sample project document is attached below; please check). Add speaker notes in the PPT, 8-10 slides, and 4 pages of documentation.
Topic: Sentiment Analysis of Social Media Posts for Improved Brand Reputation Management
Running Head: SENTIMENT ANALYSIS
Sentiment Analysis
Name
Instructor
Institution
Course Code
Date
Sentiment Analysis of Social Media Posts for Improved Brand Reputation Management
Project Proposal
With the increasing use of social media for brand promotion and customer
engagement, understanding and monitoring customer sentiment towards a brand has become
crucial for companies. Sentiment analysis of social media posts can provide valuable insights
into customer opinions and help companies make informed decisions to improve brand
reputation. This project aims to develop a sentiment analysis system for social media posts.
The objectives of this project include the following:
1. To collect and pre-process a large dataset of social media posts related to a specific
brand.
2. To train a machine learning model to classify social media posts into positive,
negative, and neutral categories.
3. To evaluate the performance of the sentiment analysis model and identify areas for
improvement.
4. To develop a web-based interface to visualize the sentiment analysis results and
provide actionable insights to brand managers.
A large dataset of social media posts related to a specific brand will be collected from
various platforms, such as Twitter, Facebook, and Instagram. The collected data will be
pre-processed to remove irrelevant information, such as URLs and hashtags, and to correct
spelling and grammar errors. The pre-processed data will be used to train a machine-learning
model for sentiment analysis. The model will be fine-tuned to improve performance. The trained model
will be integrated into a web-based interface to visualize the sentiment analysis results and
provide actionable insights to brand managers. The performance of the sentiment analysis
model will be evaluated through manual annotation of a sample of the collected data.
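The pre-processing step described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the function name and the exact cleaning rules are assumptions.

```python
import re

def preprocess_post(text: str) -> str:
    """Clean a raw social media post for sentiment analysis.

    Removes URLs, hashtags, and user mentions, then normalizes
    whitespace. (A hypothetical helper; a real pipeline would also
    handle emojis, spelling correction, etc.)
    """
    text = re.sub(r"https?://\S+", " ", text)   # strip URLs
    text = re.sub(r"[#@]\w+", " ", text)        # strip hashtags and mentions
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text
```

For example, `preprocess_post("Love the new phone! #BrandX https://t.co/abc")` yields `"Love the new phone!"`, which can then be fed to the classifier.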
The expected deliverable by the end of the project is a web-based interface for
sentiment analysis of social media posts. Another deliverable is a report detailing the
performance of the sentiment analysis model, including evaluation results and areas for
improvement. In addition, recommendations for future development will be provided based
on the evaluation results. The budget for this project will include the cost of collecting and
pre-processing the data, model training and development, and system implementation.
Sentiment analysis of social media posts is a valuable tool for brand reputation management.
This project aims to develop a sentiment analysis system for social media posts that can
provide actionable insights to brand managers. The success of this project will be evaluated
through manual annotation of a sample of the collected data, and the results will be used to
make recommendations for future development. The development of a sentiment analysis
system for social media posts has the potential to provide valuable insights into customer
opinions and help companies make informed decisions to improve brand reputation.
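The manual-annotation evaluation described above could be scored with a simple accuracy and per-class breakdown. The sketch below is illustrative only; the label names and return structure are assumptions, not part of the proposal.

```python
from collections import Counter

def evaluate(predicted, annotated):
    """Compare model labels against manual annotations.

    Returns overall accuracy and a per-class count of correct
    predictions. Labels are assumed to be 'positive', 'negative',
    or 'neutral'.
    """
    assert len(predicted) == len(annotated)
    correct_per_class = Counter()
    correct = 0
    for p, a in zip(predicted, annotated):
        if p == a:
            correct += 1
            correct_per_class[a] += 1
    return correct / len(annotated), dict(correct_per_class)
```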
Real-Time Prediction of Docker Container Resource Load
Based on a Hybrid Model of
ARIMA and Triple Exponential Smoothing
Overview
1. Introduction
2. Key Findings of the Study
3. Hybrid Model Based On ARIMA And Triple Exponential Smoothing
4. Advantages of Hybrid Model
5. System Design and Implementation
6. Experimental Evaluation
7. Average prediction time comparison between the two models
8. Conclusion
Introduction
Proper resource allocation is crucial for smooth
application performance in virtual machines or
Docker containers. Over-provisioning leads to
resource waste, while under-provisioning can cause
competition for resources during heavy workloads.
Cloud computing uses prediction algorithms to
optimize resource allocation and improve utilization
and service quality.
Key Findings of the Study
Hybrid ARIMA and triple exponential smoothing prediction model for
improved Docker container load prediction accuracy.
Docker container load prediction system that predicts multidimensional
resource load and optimizes CPU and memory resource usage.
Good prediction accuracy and low time overhead demonstrated through
simulations and real cloud environments.
Hybrid Model Based On ARIMA And Triple Exponential Smoothing
Design of Hybrid Model
1. Update container resource utilization data and assess ARIMA/triple
exponential smoothing prediction accuracy to refine hybrid model.
2. Predict updated time series using ARIMA and triple exponential
smoothing models
3. Combine ARIMA and triple exponential smoothing predictions using
weight coefficients to get hybrid model’s prediction result.
The execution process of the hybrid model algorithm. The numbers on the
line indicate the order of execution.
Advantages of Hybrid Model
1. ARIMA and triple exponential smoothing are combined for improved workload analysis, capable of handling diverse workload characteristics.
2. ARIMA uncovers relationships between data items, while triple exponential smoothing analyzes change trends, characterizing the data more accurately.
3. The hybrid model has improved anti-interference capability: ARIMA reduces random fluctuations, while triple exponential smoothing smooths the data by weighting historical values.
System Design and Implementation
System architecture
Experimental Evaluation
Categories of Tests (with configuration information for each):
1. Memcached heavy load
2. Memcached light load
3. Postmark load
4. Stress load
Prediction Results
Each results slide compares ARIMA, Triple ES, Hybrid Model, and ANN+SaDE predictions under the Postmark, Stress, Memcached heavy, and Memcached light loads in three environments:
1. Simulated cloud environment
2. t2.micro instance environment
3. t2.2xlarge instance environment
Average prediction time comparison between the two models.
The hybrid model takes slightly
longer to predict than ARIMA, with
differences ranging from 0.59% to
5.4% across the four time series. The
additional time overhead is minimal
due to the small overhead of the
triple exponential smoothing
component.
Conclusion
The paper presents a hybrid model that combines ARIMA and triple
exponential smoothing to improve the accuracy of predicting resource usage
in dynamic container workload environments in cloud computing platforms.
The proposed system, a Docker container resource prediction system, has
been designed and implemented for efficient information collection,
storage, prediction, and scheduling. The hybrid model outperforms other
models such as ARIMA, triple exponential smoothing, and ANN+SaDE by
improving prediction accuracy by 52.64, 20.15, and 203.72 percent on
average with minimal time overhead. The proposed system can be used to
improve resource utilization in container-based cloud platforms or as a
reference for implementing a resource prediction system for other
platforms.
References
1. P. Mell and T. Grance, “The NIST definition of cloud computing,” Commun. ACM, vol. 53, no. 6,
pp. 50–50, 2011.
2. K.-T. Seo, H.-S. Hwang, I.-Y. Moon, O.-Y. Kwon, and B.-J. Kim, “Performance comparison analysis of
Linux container and virtual machine for building cloud,” Adv. Sci. Technol. Lett., vol. 66, no. 2, pp.
105–111, 2014.
3. C. Anderson, “Docker [software engineering],” IEEE Softw., vol. 32, no. 3, pp. 102–c3,
May/Jun. 2015.
4. A. S. Shanmugam, “Docker container reactive scalability and prediction of CPU utilization based
on proactive modelling,” Master's thesis, National College of Ireland, Dublin, 2017. [Online].
Available: http://trap.ncirl.ie/2884/1 aravindsamyshanmugam.pdf
5. N. Roy, A. Dubey, and A. S. Gokhale, “Efficient autoscaling in the cloud using predictive models
for workload forecasting,” in Proc. IEEE Int. Conf. Cloud Comput., 2011, pp. 500–507.
Thank you
IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 10, NO. 2, APRIL-JUNE 2022
Real-Time Prediction of Docker Container
Resource Load Based on a Hybrid Model of
ARIMA and Triple Exponential Smoothing
Yulai Xie , Member, IEEE, Minpeng Jin, Zhuping Zou , Gongming Xu ,
Dan Feng, Member, IEEE, Wenmao Liu, and Darrell Long , Fellow, IEEE
Abstract—More and more enterprises are beginning to use Docker containers to build cloud platforms. Predicting the resource usage
of container workload has been an important and challenging problem to improve the performance of cloud computing platform. The
existing prediction models either incur large time overhead or have insufficient accuracy. This article proposes a hybrid model of the
ARIMA and triple exponential smoothing. It can accurately predict both linear and nonlinear relationships in the container resource load
sequence. To deal with the dynamic Docker container resource load, the weighting values of the two single models in the hybrid model
are chosen according to the sum of squares of their predicted errors for a period of time. We also design and implement a real-time
prediction system that consists of the collection, storage, prediction of Docker container resource load data and scheduling optimization
of CPU and memory resource usage based on predicted values. The experimental results show that the prediction
accuracy of the hybrid model improves by 52.64, 20.15, and 203.72 percent on average compared to the ARIMA model,
the triple exponential smoothing model and the ANN+SaDE model respectively, with a small time overhead.
Index Terms—Docker container, prediction, hybrid model
1 INTRODUCTION
WITH the development and popularization of cloud computing platforms, most enterprises have their
own data centers. By providing users with various virtual resources, such as computing resources,
storage resources and network resources, users can get high quality, strong security and highly
scalable infrastructure services at relatively low cost [1]. However, with the continuous expansion
of the cloud computing platform, the virtual machine [2] has problems such as low running efficiency
and slow startup. To alleviate these problems, the Docker container [3] has emerged as a new
virtualization technology.
Whether it is in the virtual machine or the Docker container, we should allocate sufficient resources
for an application to run smoothly. However, in most cases, the application is not running at its
heaviest load, and the pre-provisioned resources are idle most of the time. This causes a waste of
resources. In addition, when the workload is heavy and has to compete with other applications
simultaneously, the pre-allocated resources may not be enough. In order to solve such problems, a
specific prediction algorithm is usually used in cloud computing to predict resource requirements,
and resource allocation optimization is performed in advance to improve resource utilization and
service quality.

Y. Xie is with Hubei Engineering Research Center on Big Data Security, School of Cyber Science and
Engineering, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and
Technology, Wuhan 430074, P.R. China. E-mail: ylxie@hust.edu.cn.
M. Jin, Z. Zou, G. Xu, and D. Feng are with the School of Computer, Wuhan National Laboratory for
Optoelectronics, Huazhong University of Science and Technology, Wuhan 430074, P.R. China.
E-mail: {jinminpeng0510, zouzhup, xugongming38}@gmail.com, dfeng@hust.edu.cn.
W. Liu is with NSFOCUS Inc., Haidian District, Beijing 100089, China. E-mail: liuwenmao@nsfocus.com.
D. Long is with the Jack Baskin School of Engineering, University of California, Santa Cruz, CA
95064 USA. E-mail: darrell@ucsc.edu.
Manuscript received 25 Sept. 2019; revised 23 Feb. 2020; accepted 18 Apr. 2020. Date of publication
22 Apr. 2020; date of current version 7 June 2022. (Corresponding author: Yulai Xie.) Recommended
for acceptance by B. Schulze. Digital Object Identifier no. 10.1109/TCC.2020.2989631
At present, there are few related studies on resource load
prediction of Docker containers. Shanmugam et al. [4] predicted the CPU usage of the container through ARIMA
model and then distributed the load to the container’s web
service using a loop-based algorithm. Roy et al. [5], [6] proposed a cloud computing load prediction model based on
ARIMA model, which smooths the time series first. Huang
et al. [7] proposed a resource prediction model based on quadratic exponential smoothing to predict the cloud resources
that customers need to subscribe to. It not only considers the
current resource status but also considers historical resource
records and obtains higher prediction accuracy.
The above several types of resource load predictions are based on the ARIMA model or the quadratic
exponential smoothing model. This is because the resource load sequence is a time series, and the
two models are common prediction models for time series prediction. But whether it is the ARIMA
model or the quadratic exponential smoothing model, they are in essence linear models. However, the
time series generated by different resource loads in the Docker container are
not only linear but also non-linear, as shown in Fig. 1. These
two models do not have a very good prediction accuracy
on predicting the nonlinear relationship in the container
resource load sequence.
Fig. 1. The linear and non-linear relationships in the Docker container resource load. The experiment data are acquired by using the stress (version
1.0.4) tool to test the resource usage of a memcached (version 1.5.6) container on the Ubuntu operating system.
However, though ARIMA can be used in both the container and the virtual machine, the resource
usage sequence jitters more in a container environment than in a virtual machine, so ARIMA is
more suitable for the container environment to eliminate the random fluctuation of the container
resource load sequence. In addition, as each container usually starts and closes in a short
period, it is often impossible to collect a large amount of container resource usage data. In the
case of insufficient historical data, some machine learning prediction models (such as neural
networks and linear regression prediction methods [8]) that have been used in virtual machines
are not suitable for use in containers.
It can be seen that the existing models either cannot predict both the linear and non-linear workloads or are not suitable for use in the container environment. To address the
above problems, this paper proposes a hybrid model of the
ARIMA and triple exponential smoothing model to predict
both the linear and non-linear relationships in the Docker
container resource load. To deal with the dynamic Docker
container resource load, the weight values of the two models
in the hybrid model are chosen according to the sum of
squares of their respective predicted errors for a period of
time. In the process of prediction, ARIMA model is used to
mine the linear relationship and eliminate the random fluctuation of the container resource load sequence, while triple
exponential smoothing is used to mine the nonlinear relationship and smooth the container resource load sequence.
We also design and implement a real-time prediction system
for Docker container workload. The system implements the
collection, storage, prediction of Docker container resource
load data and scheduling optimization of CPU and memory
resource usage based on predicted values.
The contributions of this paper are as follows:
- We propose a hybrid prediction model of ARIMA and triple exponential smoothing that can predict
both linear and nonlinear relationships in the container load sequence and significantly improve
the Docker container load prediction accuracy.
- We design a Docker container load prediction system that can predict multidimensional resource
load, and automatically make scheduling optimization of CPU and memory resource usage based on
predicted values.
- We evaluate the hybrid model on a variety of Docker containers in both simulated and real cloud
environments. The experimental results demonstrate that the hybrid model has a good prediction
accuracy with a small time overhead.
2 BACKGROUND AND MOTIVATION
We first introduce the Docker technology, then we describe
the ARIMA and triple exponential smoothing model and
then motivate our research.
2.1 Docker Technology
Docker is an open source application container engine that
packages the application and its runtime environment into a
lightweight, portable container [9], [10]. Docker can be built
on top of LXC [11] or libcontainer [12]. Libcontainer controls
containers by managing namespaces, cgroups, capabilities,
and file systems. In order to ensure that the processes among
the different containers do not interfere and affect each other,
libcontainer isolates the resources they use through namespaces. In order to solve the problem of
competing resources between containers, libcontainer also uses cgroups to limit and isolate
resource usage (CPU, memory, disk I/O, network, etc.). Compared with virtual machines, the Docker
container is more lightweight and enables quick creation and destruction [13].
With the popularity of Docker container technology, more
and more Internet companies are using large container clusters to serve as application runtime environments, such as
Amazon AWS, Microsoft Azure, and Alibaba Cloud, which
already support Docker containers and also provide Container as a Service (CaaS) [14]. In order to facilitate the management of Docker clusters, we can use Docker swarm or
Kubernetes to easily deploy container clusters of multiple
nodes [15], [16].
2.2 ARIMA Model
The ARIMA(p,d,q) model [17], the autoregressive integrated moving average model, predicts time
series using a linear combination of an AR model (autoregressive model) and an MA model (moving
average model). p and q are the orders of the autoregressive model and the moving-average model
respectively, and d is the number of differences required to make the original time series into a
stationary sequence.
When applying the ARIMA model to container resource load prediction, the process can be roughly
divided into the following steps:
1) Collect and obtain a time series of container resource usage.
2) The ADF test [18] (augmented Dickey-Fuller test) is used to determine whether the time series
is stationary. If it is not, the time series is differenced until it is stationary; the number of
differences is recorded as d.
3) Obtain the range of values of the order p of the autoregressive part and the order q of the
moving-average part of the stationary sequence. The different p and q values are substituted into
the model to get the AIC values [19], and the values of p and q with the smallest AIC value
determine the best ARIMA model for this time series.
4) The autocorrelation coefficient and partial autocorrelation coefficient are calculated by using
the Levinson algorithm [20].
5) Then the ARIMA model can be used to predict future resource use of the container.
The resource usage value y_t at each moment of the container is expressed as the linear function
of the container resource usage values of the previous p times, the prediction error values of the
container resource usage of the previous q times, and the error term at the current time. The
linear function formula is as follows:

y_t = \mu + \varepsilon_t + \sum_{i=1}^{p} \gamma_i y_{t-i} + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i}   (1)

\mu is the constant term. \varepsilon is the prediction error sequence of the model. p is the
autoregressive order. q is the moving-average order. \gamma_i is the autocorrelation coefficient,
that is, the respective weight of the container resource usage values of the previous p times.
\theta_i is the partial autocorrelation coefficient, that is, the respective weight of the
prediction error values of the container resource usages of the previous q times.
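Formula (1) can be checked numerically. The sketch below evaluates one y_t from given coefficients; the function name and all numeric values in the example are made up for illustration, not estimates from the paper.

```python
def arma_value(mu, eps, y_hist, eps_hist, gamma, theta):
    """Evaluate formula (1):
    y_t = mu + eps_t + sum_i gamma_i * y_{t-i} + sum_i theta_i * eps_{t-i}.

    y_hist[i-1] holds y_{t-i}; eps_hist[i-1] holds eps_{t-i}.
    The coefficients gamma and theta are assumed already estimated.
    """
    ar_part = sum(g * y for g, y in zip(gamma, y_hist))    # autoregressive terms
    ma_part = sum(th * e for th, e in zip(theta, eps_hist))  # moving-average terms
    return mu + eps + ar_part + ma_part
```

For instance, with mu = 1, no current error, AR weights (0.5, 0.25) on history (2, 1) and MA weight 0.2 on a past error 0.5, the value is 1 + 1.25 + 0.1 = 2.35.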
However, ARIMA model cannot mine and grasp the
nonlinear trend of a time series. When applying ARIMA
model to the prediction of container load resources, it will
not get great accuracy. The best performance of the algorithm for approximating the nonlinear trend in the sequence
is the artificial neural network algorithm. But for small-scale
cloud computing platform built with Docker containers, the
artificial neural network algorithm consumes too much
computing and storage resource. Therefore, artificial neural
networks are often used for large-scale cluster resource
usage prediction. Moreover, the training period of artificial
neural networks tends to be too long and requires a large
amount of data, which is an unacceptable overhead in the
real-time prediction of the container resource load. So we
choose triple exponential smoothing to mine the nonlinear
trend of time series.
2.3 Triple Exponential Smoothing
Exponential smoothing method [21] uses a special weighted
averaging method to achieve the smoothing of the time series
data samples. Its principle is to decompose the time series
into three parts: the overall mean, the overall trend, and the
seasonal trend. It assigns weights in a unique way, that is,
the more distant the time point of historical data is from the
current time point, the less weight is given to the true value
of the time point. The true value before the current time point
is given a weight that decreases exponentially from near to
far and gradually converges to zero. This not only ensures
the integrity of the time series information but also focuses
on the information at different points in time.
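The exponentially decreasing weights described above are a(1 − a)^k for the observation k steps back. A quick illustration (the choice a = 0.5 is arbitrary):

```python
def smoothing_weights(a, n):
    """Weights a*(1-a)^k assigned to the k-th most recent observation
    by simple exponential smoothing; they decay toward zero and their
    infinite sum converges to 1."""
    return [a * (1 - a) ** k for k in range(n)]
```

`smoothing_weights(0.5, 4)` gives `[0.5, 0.25, 0.125, 0.0625]`: the most recent point dominates and older points contribute exponentially less, as the text describes.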
Triple exponential smoothing model [22] applies exponential smoothing three times. There are
generally three steps: the first step is to calculate the smoothed values using the current value
of the current time point and the smoothed values of the previous time point; the second step is
to use the three smoothed values obtained in the first step to obtain the coefficients of the
prediction model; the third step is to use the obtained coefficients to establish a mathematical
model for prediction. The smoothed values of the triple exponential smoothing model are as follows:

S_t^{(1)} = a x_t + (1-a) S_{t-1}^{(1)}   (2)
S_t^{(2)} = a S_t^{(1)} + (1-a) S_{t-1}^{(2)}   (3)
S_t^{(3)} = a S_t^{(2)} + (1-a) S_{t-1}^{(3)}   (4)

a is the smoothing factor, S_t^{(1)} is the smoothed value of the single exponential smoothing
model, x_t is the observed value at the current time point, S_t^{(2)} is the smoothed value of the
quadratic exponential smoothing model, and S_t^{(3)} is the smoothed value of the triple
exponential smoothing model.
After obtaining the above three smoothed values, the coefficients of the triple exponential
smoothing model can be calculated as follows:

a_t = 3 S_t^{(1)} - 3 S_t^{(2)} + S_t^{(3)}   (5)
b_t = \frac{a}{2(1-a)^2} \left[ (6-5a) S_t^{(1)} - 2(5-4a) S_t^{(2)} + (4-3a) S_t^{(3)} \right]   (6)
c_t = \frac{a^2}{2(1-a)^2} \left[ S_t^{(1)} - 2 S_t^{(2)} + S_t^{(3)} \right]   (7)

Then the triple exponential smoothing model for prediction is obtained as Formula (8):

F_{t+m} = a_t + b_t m + c_t m^2   (8)
m is the number of predicted points starting from the
time point t. It can be seen that triple exponential smoothing
model can mine and predict the nonlinear trend of time
series. Each time a prediction is made, the resource usage of
the container at the next time point can be predicted by
using the data of each resource usage of the container at the
current time point and the smoothed values of the previous
time point.
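Formulas (2) through (8) can be sketched directly in code. This is an illustrative reimplementation, not the paper's code; in particular, the initialization of the smoothed values to the first observation is a common convention that the paper does not specify.

```python
def triple_es_forecast(series, a, m):
    """Forecast m steps ahead with triple exponential smoothing,
    following formulas (2)-(8). `a` is the smoothing factor."""
    s1 = s2 = s3 = series[0]  # assumed initialization (not specified in the paper)
    for x in series:
        s1 = a * x + (1 - a) * s1    # formula (2)
        s2 = a * s1 + (1 - a) * s2   # formula (3)
        s3 = a * s2 + (1 - a) * s3   # formula (4)
    at = 3 * s1 - 3 * s2 + s3                                # formula (5)
    bt = (a / (2 * (1 - a) ** 2)) * (
        (6 - 5 * a) * s1 - 2 * (5 - 4 * a) * s2 + (4 - 3 * a) * s3
    )                                                        # formula (6)
    ct = (a ** 2 / (2 * (1 - a) ** 2)) * (s1 - 2 * s2 + s3)  # formula (7)
    return at + bt * m + ct * m ** 2                         # formula (8)
```

As a sanity check, a constant series yields the constant itself for any horizon m, since b_t and c_t vanish when the three smoothed values coincide.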
However, triple exponential smoothing model is a nonlinear model, which cannot mine the linear relationship in
the resource load sequence. In addition, when the weighted
average smoothing of historical data is carried out by triple
exponential smoothing model, the information of some
other influencing factors contained in the data will be lost,
and random fluctuations in the resource load sequence are
not considered.
3 HYBRID MODEL BASED ON ARIMA AND TRIPLE EXPONENTIAL SMOOTHING
In this section, we first introduce the advantage of the hybrid
model and then we elaborate the design of the model.
Fig. 2. The execution process of the hybrid model algorithm. The numbers on the line indicate the
order of execution.

3.1 Advantage of the Hybrid Model
The ARIMA model and the triple exponential smoothing model have their own shortcomings when
applied to container resource load prediction. But their respective advantages can be
complementary as follows.
First, ARIMA is a model to mine the linear relationship between different data items, while the
triple exponential smoothing finds the nonlinear relationship buried in a large amount of data.
Their combination can cope with a wide variety of workloads with different characteristics.
Second, ARIMA mainly digs out the inherent relationship between data items, while the triple
exponential smoothing further exploits the change trend of the whole time series. Their
combination can describe the whole data characteristics more accurately.
Third, the hybrid model has a much stronger data anti-jamming ability. On one hand, the MA model
in ARIMA focuses on the accumulation of the predicted error terms, which can effectively eliminate
the random fluctuations in the load sequence of container resources. On the other hand, the triple
exponential smoothing model can assign different weights to historical data to smooth the data.
This gives the hybrid model a much stronger anti-interference ability.

3.2 Design of the Hybrid Model
How to determine the weight coefficients of the combined ARIMA and triple exponential smoothing is
the key to the design of the hybrid model. Commonly used methods are the equal-weighted average
coefficient method and the weight coefficient determination method based on the error-index. The
equal-weighted average coefficient method assigns the same weight to ARIMA and triple exponential
smoothing. However, these two models have different prediction accuracy, especially with the
dynamics of container resource load. That is, in a certain time period the ARIMA model may perform
better while the triple exponential smoothing model performs better in the next time period; or
ARIMA predicts better on one application workload, and triple exponential smoothing predicts well
on the other. In order to better adapt to the dynamics of Docker container resources, we choose
the weight coefficient determination method based on the error-index as shown in Algorithm 1.
Since the error in the process of continuous prediction is also continuously generated, in order
to make the weight coefficient keep up with the change of the time series of container resource
usage, the weight of each model in the next prediction can be obtained by using the sum of squares
of prediction errors. Assume the size of the time series window is size. We accumulate the latest
size prediction errors. If there are not yet size prediction errors, we use the sum of the
prediction errors that have been generated, and the initial weight is 0.5. When the number of
prediction errors generated is greater than size, every time the latest prediction error is
obtained, the oldest prediction error is removed. In this way, the weights of the two single
models can be adjusted with the change of the time series of container resource usage.

Algorithm 1. calculate_weight
Input: Err_arima_t, Err_es_t // the prediction errors of the ARIMA model and the triple
exponential smoothing model at the current time point.
Output: Weight_arima, Weight_es // the weights of the ARIMA model and the triple exponential
smoothing model for the next prediction.
1: if Err_arima.size = size and Err_es.size = size then
2:   Sum_arima <- Sum_arima - Err_arima[0]^2
3:   Sum_es <- Sum_es - Err_es[0]^2
4:   Remove the first element from array Err_arima and array Err_es
5: end if
6: Err_arima.push_back(Err_arima_t)
7: Err_es.push_back(Err_es_t)
8: Sum_arima <- Sum_arima + Err_arima_t^2
9: Sum_es <- Sum_es + Err_es_t^2
10: Weight_arima <- Sum_es / (Sum_arima + Sum_es)
11: Weight_es <- Sum_arima / (Sum_arima + Sum_es)
12: return Weight_arima, Weight_es

In Algorithm 1, Err_arima and Err_es are sliding windows, implemented with arrays, recording the
prediction errors of the ARIMA model and the triple exponential smoothing model respectively in
the previous period. Sum_arima and Sum_es are the sums of the squared prediction errors of the
two models respectively.
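The weight update and the weighted combination can be sketched together in a few lines of Python. This is an illustrative reimplementation of the error-index scheme, not the paper's code: the class and function names are invented, and the two next-step predictions are passed in as plain numbers standing in for fitted ARIMA and triple exponential smoothing models.

```python
from collections import deque

class HybridWeights:
    """Sliding-window weight update in the spirit of Algorithm 1:
    each model's weight is proportional to the *other* model's sum of
    squared errors, so the more accurate model gets the larger weight."""

    def __init__(self, size):
        # deque(maxlen=size) drops the oldest error automatically,
        # mirroring lines 1-5 of Algorithm 1.
        self.err_arima = deque(maxlen=size)
        self.err_es = deque(maxlen=size)

    def update(self, err_arima_t, err_es_t):
        self.err_arima.append(err_arima_t)
        self.err_es.append(err_es_t)
        sum_arima = sum(e * e for e in self.err_arima)
        sum_es = sum(e * e for e in self.err_es)
        total = sum_arima + sum_es
        if total == 0:            # both models exact: fall back to equal weights
            return 0.5, 0.5
        return sum_es / total, sum_arima / total  # (weight_arima, weight_es)

def hybrid_predict(y_t, pred_arima_t, pred_es_t, weights,
                   arima_next, es_next):
    """One prediction step: update weights from the current errors,
    then combine the two models' next-step predictions."""
    w_arima, w_es = weights.update(y_t - pred_arima_t, y_t - pred_es_t)
    return arima_next * w_arima + es_next * w_es
```

For example, if ARIMA's current error is 1.0 and the smoothing model's is 3.0, the squared-error sums are 1 and 9, so ARIMA receives weight 0.9 and the smoothing model 0.1.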
The execution process of the hybrid model algorithm can
be divided into the following steps and is shown in Fig. 2.
1) Every time container resource usage data is collected, the time series of container resource usage is updated. Meanwhile, the prediction errors of the ARIMA model and the triple exponential smoothing model are calculated to update the weight coefficients of the hybrid model.
2) The updated time series is predicted using the ARIMA model and the triple exponential smoothing model respectively.
3) According to the prediction results of the ARIMA model and the triple exponential smoothing model and the previously determined weight coefficients, the prediction result of the hybrid model is the sum of the prediction of the ARIMA model multiplied by its weight coefficient and the prediction of the triple exponential smoothing model multiplied by its weight coefficient.
4) After collecting the true value of the container resource usage data at the next time point, go back to step 1 to make the next prediction.

Algorithm 2. Hybrid Model Prediction Algorithm
Input: y_t // the current observation value.
Output: Pred_value // the predicted value of the hybrid model.
1: Err_arima_t ← y_t − Pred_arima_t
2: Err_es_t ← y_t − Pred_es_t
3: (Weight_es, Weight_arima) ← calculate_weight(Err_es_t, Err_arima_t)
4: Pred_arima_{t+1} ← ARIMA.predict(y_t)
5: Pred_es_{t+1} ← TripleExpSmoothing.predict(y_t)
6: Pred_value ← Pred_arima_{t+1} × Weight_arima + Pred_es_{t+1} × Weight_es
7: return Pred_value

The algorithm is shown in Algorithm 2. Pred_arima_t and Pred_es_t represent the predicted values of the ARIMA model and the triple exponential smoothing model respectively in the current period. Err_arima_t and Err_es_t represent the prediction errors of the two models. Pred_arima_{t+1} and Pred_es_{t+1} represent the predicted values of the two models for the next period. Weight_arima and Weight_es represent the weights of the ARIMA model and the triple exponential smoothing model respectively.

Assume the length of the workload sequence of a container resource is L. In the ARIMA model, let the order of the autoregressive part be p and the order of the moving-average part be q, where 0 ≤ p, q ≤ N and p and q cannot both be zero. The main computational cost of the ARIMA model algorithm is to determine these two orders and to calculate the autocorrelation coefficient and the partial autocorrelation coefficient. Its time complexity is O(Nq ln L + L(p + q) + qp²N²). The triple exponential smoothing algorithm runs in O(1) time.

1390 IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 10, NO. 2, APRIL-JUNE 2022

4 SYSTEM DESIGN AND IMPLEMENTATION
In this section, we first describe the overall design of the container resource load prediction system; then we elaborate on the individual modules in detail.

Fig. 3. System architecture.

4.1 System Architecture
Fig. 3 shows the overall design of the Docker container resource load real-time prediction system based on a hybrid model of ARIMA and triple exponential smoothing. The system consists of the following modules: container information collection, prediction container selection, container resource acquisition, container data storage, container resource prediction, and container resource scheduling. The main functions of these modules are shown below:
- The container information collection module collects the IDs and start/stop status of containers through the Docker API. If the container status is start, the container is added to the queue of the prediction container selection module; if the container status is stop, the container is removed from the queue.
- The prediction container selection module maintains a queue of containers to be predicted and scheduled, ordered by the next prediction scheduling time of each container. Each time the first container in the queue becomes due, its container ID is acquired, and the container resource acquisition module is called for data collection.
- The container resource acquisition module obtains the container ID sent by the prediction container selection module; collects the CPU usage, memory usage, disk read rate, disk write rate, network receiving rate, and network transmission rate of the container; and sends them to the container data storage module.
- The container data storage module stores the container resource data in the database, then organizes the data into a specified format and sends it to the container resource prediction module.
- The container resource prediction module employs the ARIMA-triple exponential smoothing hybrid model to predict the resource usage of the container, and sends the prediction result to the container resource scheduling module.
- The container resource scheduling module dynamically updates the resource allocation (CPU and memory) of the container according to the prediction result of the container resource prediction module.
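Before turning to the individual modules, the per-period update of Algorithm 2 can be sketched in Python. The weight rule below (inverse of the recent absolute error, normalized) is an illustrative stand-in for the update described in Section 3.2, not the paper's exact formula:

```python
# Sketch of one period of the hybrid prediction (Algorithm 2).
# The inverse-error weight rule is an assumption for illustration.

def calculate_weight(err_es, err_arima, eps=1e-9):
    """Give more weight to the model with the smaller recent error."""
    inv_es = 1.0 / (abs(err_es) + eps)
    inv_arima = 1.0 / (abs(err_arima) + eps)
    total = inv_es + inv_arima
    return inv_es / total, inv_arima / total  # (Weight_es, Weight_arima)

def hybrid_predict(y_t, pred_arima_t, pred_es_t, arima_next, es_next):
    """y_t: current observation; pred_*_t: the two models' predictions
    for the current period; *_next: their predictions for the next period."""
    err_arima = y_t - pred_arima_t                        # line 1
    err_es = y_t - pred_es_t                              # line 2
    w_es, w_arima = calculate_weight(err_es, err_arima)   # line 3
    return arima_next * w_arima + es_next * w_es          # line 6
```

With equal recent errors the weights collapse to 0.5 each, and the hybrid forecast is the plain average of the two model forecasts.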
4.2 Container Information Collection
The container information collection module collects the container start, stop, ID, image, and task information through the Docker API. This module maintains a state list, called ContainerInfoList, of all running containers. The elements of ContainerInfoList are shown in Table 1.
XIE ET AL.: REAL-TIME PREDICTION OF DOCKER CONTAINER RESOURCE LOAD BASED ON A HYBRID MODEL OF ARIMA AND…
TABLE 1
Element Definition of ContainerInfoList

Field name        Type of data    Description
ContainerImage    string          Image
ContainerTask     string          Task
ContainerID       string          Container ID

TABLE 2
The Elements of Each Container in the Queue

Field name      Type of data    Description
ContainerID     string          Container ID
predictCycle    int             Container prediction period
nextTime        time_t          Next container prediction time
We use the "docker events" command to get real-time events about containers on the host. When we find that a container has started, we use the "docker inspect containerID" command to get its image information and task information according to its container ID. Then we store the container's information in ContainerInfoList. When the container stops running, the related element is deleted from ContainerInfoList based on the container ID. In addition, we send the container startup and stop information to the prediction container selection module in the following format: { "container_id": ContainerID, "type": type }. The type has two values: start and stop.
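The bookkeeping above can be sketched as follows; the `inspect` and `notify` callables are hypothetical stand-ins for the `docker inspect` call and the message forwarded to the prediction container selection module:

```python
# Sketch of the container information collection bookkeeping.
# In the real module, events come from `docker events` and the image/task
# fields from `docker inspect <containerID>`; here they are injected so
# the logic is self-contained.

container_info_list = {}  # ContainerID -> ContainerInfoList element

def handle_event(event, inspect=None, notify=None):
    """event: {"container_id": ..., "type": "start" | "stop"}."""
    cid, etype = event["container_id"], event["type"]
    if etype == "start":
        image, task = inspect(cid) if inspect else ("", "")
        container_info_list[cid] = {"ContainerImage": image,
                                    "ContainerTask": task,
                                    "ContainerID": cid}
    elif etype == "stop":
        container_info_list.pop(cid, None)
    if notify:  # forward to the prediction container selection module
        notify({"container_id": cid, "type": etype})
```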
4.3 Prediction Container Selection
Algorithm 3. Prediction Container Selection
Input: Container_queue // a queue that stores the containers to be predicted and scheduled; its elements are shown in Table 2.
1: while true do
2:   if Container_queue is not empty then
3:     get the first container from Container_queue
4:     next_time ← first_container.nextTime
5:     if current_time < next_time then
6:       sleep_time ← next_time − current_time
7:       sleep(sleep_time)
8:     else
9:       ID ← first_container.ContainerID
10:      first_container.nextTime ← current_time + first_container.predictCycle
11:      send the ID to the container resource acquisition module
12:      update the Container_queue according to the nextTime
13:    end if
14:  end if
15: end while

The prediction container selection module maintains a queue of containers to be predicted and scheduled, sorted in increasing order of the next predicted scheduling time of each container. The elements for each container in this queue are shown in Table 2, and the process of prediction container selection is shown in Algorithm 3. According to the container startup and stop information sent by the container information collection module, containers are added to or deleted from the queue, and the order of the queue is updated accordingly. At the same time, a thread is started in the module to select the container to be predicted from the queue. The thread executes as follows: if the container queue is empty, it does nothing but wait. Otherwise, the container ID and the next predicted time of the first container in the queue are taken out, and the next predicted time is compared with the current time. If the current time has not reached the next predicted time of the container, the thread is blocked until that time. If the predicted time is reached, the next predicted time of the first container is updated to the current time plus the prediction period, and the order of the queue is updated according to the new prediction time. Finally, the container ID is sent to the container resource acquisition module to collect the data of the container.
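A minimal sketch of Algorithm 3, using a Python min-heap keyed on nextTime in place of the sorted queue (the `collect` callable stands in for the container resource acquisition module, and `rounds` bounds the otherwise infinite loop for illustration):

```python
import heapq
import time

def selection_loop(queue, collect, now=time.time, rounds=1):
    """queue: min-heap of (nextTime, ContainerID, predictCycle) tuples."""
    for _ in range(rounds):
        if not queue:
            continue  # empty queue: nothing to do this round
        next_time, cid, cycle = queue[0]       # peek at the first container
        if now() < next_time:
            time.sleep(next_time - now())      # block until it is due
        # reschedule: nextTime = current time + prediction period
        heapq.heapreplace(queue, (now() + cycle, cid, cycle))
        collect(cid)                           # hand off for data collection
```

`heapreplace` pops the due container and pushes its rescheduled entry in one step, which keeps the queue ordered by the next prediction time.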
4.4 Container Resource Acquisition
The resource usage of the container is acquired by reading the data recorded in the cgroups folder on the host machine and in the /proc virtual file system.

(1) CPU usage. To calculate the CPU usage of the container on the host, we first read the CPU time slices the container has used so far from the host's cgroup file (/sys/fs/cgroup/cpuacct/docker/containerID/cpuacct.usage) and record the value as cpu_use_time. Then we read the total CPU time slice from /proc/stat, denoting it as cpu_total_time. We divide the former by the latter to get the CPU usage of the container. If the host has multiple processors, we multiply the calculated CPU usage by the number of processor cores, as follows:

U_cpu = (cpu_use_time_{t2} − cpu_use_time_{t1}) / (cpu_total_time_{t2} − cpu_total_time_{t1}) × cores × 100%.

(2) Memory usage. After the container is started, the host allocates a certain amount of memory for it, and the value is stored in the host's cgroup file (/sys/fs/cgroup/memory/docker/containerID/memory.limit_in_bytes), recorded as memLimit. Another cgroup file (/sys/fs/cgroup/memory/docker/containerID/memory.usage_in_bytes) stores the amount of memory used by the container, recorded as memUsed. So the calculation formula for memory usage is

U_mem = memUsed / memLimit × 100%.

Using the same method, we can also calculate the disk read and write rates and the network receiving and sending rates.

4.5 Container Data Storage
This module receives the container resource data sent by the container resource acquisition module and stores it in the database. It then organizes the data into a specified format and sends it to the container resource prediction module. We design two tables in the database: one is the data storage table, and the other is the control table. The database chosen is InfluxDB [23], an open-source distributed time series, event, and metrics database.

We use the control table to implement the following functions. When the program starts, it launches a thread that scans the control table periodically and reads the timestamp of each container in the control table. If a container has not been updated beyond a specified period of time, the container is treated as a closed container, and we delete the corresponding rows in the control table and the data in the data storage table. This saves storage overhead in the database.

When receiving the container resource data sent by the container resource acquisition module, we search for the container ID in the control table. If the ID is found, we store the data in the data storage table and then update the control table: the timestamp field is updated and the field that stores the number of rows is incremented by one. If the ID is not found, we create a new row in both the data storage table and the control table.

After the container's new resource usage data is written, the number of rows of data in the control table is read. If it has not yet reached the number of rows required for the initial data of the container resource prediction module (we record this number as m, which is equal to the time series window size), the module does nothing but wait for the next database write. If it is exactly equal to m, we get the m rows of container resource usage data from the container's data storage table, put them into an array, and send the array to the container resource prediction module, which makes the next prediction of the container resource usage. If it is greater than m, we just send the latest row of the container's data storage table to the container resource prediction module. The latest row replaces the first row of the array to form a new time series window.
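The window hand-off described above can be sketched with a fixed-size buffer; `ContainerWindow` is an illustrative name, not a structure from the paper:

```python
from collections import deque

class ContainerWindow:
    """Sketch of the m-row window sent from container data storage to the
    prediction module: once m rows have accumulated, the full window is
    handed over; each later row replaces the oldest one."""

    def __init__(self, m):
        self.m = m                   # time series window size
        self.rows = deque(maxlen=m)  # the oldest row drops automatically

    def add(self, row):
        """Return the current window when it is full, else None."""
        self.rows.append(row)
        if len(self.rows) == self.m:
            return list(self.rows)   # sent to the prediction module
        return None
```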
4.6 Container Resource Prediction
This module uses the ARIMA-triple exponential smoothing combination model algorithm to predict the container resource usage. Before making a prediction, it performs redundancy deletion and missing value padding on the acquired data, because data may be lost or duplicated in the large amount of container data acquired; either way, this would affect the outcome of the prediction.

To remove redundant data, we refer to the identifier of the data in the data set, which here is the container ID together with the acquisition time of the data. If multiple data items have the same container ID and acquisition time, the data is duplicated and we delete the extra copies. If data is missing seriously, we discard the data; if the missing data is not serious, Lagrange interpolation can be used to complete the missing data. After the redundancy deletion and missing value padding operations, the acquired time series data set can be used for prediction. The process of prediction is as follows.

1) After each time series of container resource usage is updated, the weighting coefficients of the hybrid model are also updated, using the method described in Section 3.2.
2) After determining the weighting coefficients, the updated time series of container resource usage is predicted using the ARIMA model and the triple exponential smoothing model respectively.
3) The prediction result of the hybrid model is obtained from the prediction results of the ARIMA model and the triple exponential smoothing model, and is sent to the container resource scheduling module.
4) After collecting the true value of the current period sent by the container data storage module, return to step 1) for the next prediction and scheduling.

4.7 Container Resource Scheduling
This module currently implements dynamic updates of the CPU and memory resources that are allocated to the container.
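The missing-value padding mentioned in the preprocessing step of Section 4.6 can be done with plain Lagrange interpolation over nearby valid samples; this is an illustrative sketch, not the authors' implementation:

```python
def lagrange_interpolate(xs, ys, x):
    """Estimate the value at x from known samples (xs[i], ys[i]),
    e.g. the timestamps and values of the valid points around a gap."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)  # Lagrange basis polynomial
        total += term
    return total
```

With two points it reduces to linear interpolation; with more points it fits a higher-degree polynomial through all of them.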
After obtaining the predicted value of the container resource usage, the container resource limit cannot be directly updated with that value. First, the prediction itself may have errors; second, the data has random volatility. Both factors should be considered when redistributing the resources. To account for them, the maximum fluctuation of the time series of container resource usage is represented as the maximum value minus the minimum value in the time series used for prediction, so the new allocation value is calculated as follows:

NewLimit = Predict + 2 × (Max{y_t} − Min{y_t}),

where y_t is the time series used for prediction, Predict is the predicted value of the container resource usage, and NewLimit is the assigned value of the container resource usage.

For the CPU resource, after the NewLimit of CPU usage is obtained, we call the docker update --cpu-period=<value> --cpu-quota=<value> containerID directive to limit the CPU usage of the container, where --cpu-period is the scheduling period for the CPU usage of each container. The default value of --cpu-period is 100 ms. --cpu-quota is the maximum CPU time that the container can use in the period, and its value is the value of --cpu-period multiplied by NewLimit.

For the memory resource, at the creation of each container, cgroups limits the maximum memory that the container can occupy. We only need to adjust that value according to the predicted value.

5 EXPERIMENTAL EVALUATION
In this section, we first describe the experimental environment, then we perform a sensitivity analysis of the parameters of the hybrid model. At last, we compare the hybrid model with a series of workload prediction models in terms of prediction accuracy, prediction time, and computational cost.

5.1 Experimental Environment
We perform experiments in both simulated and real cloud environments, and the configuration is shown in Table 3.
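The reallocation rule of Section 4.7 and the derived --cpu-quota value can be sketched as follows; `new_limit` and `cpu_quota` are illustrative helper names, and the 2× margin follows the NewLimit formula in the text:

```python
def new_limit(predict, window):
    """NewLimit = Predict + 2 * (max(y_t) - min(y_t)) over the
    prediction window, to absorb prediction error and volatility."""
    return predict + 2 * (max(window) - min(window))

def cpu_quota(limit_fraction, cpu_period_us=100_000):
    """--cpu-quota = --cpu-period * NewLimit, with the limit expressed
    as a fraction of one core and the default 100 ms period."""
    return int(cpu_period_us * limit_fraction)
```

The resulting quota would then be applied with `docker update --cpu-period=<value> --cpu-quota=<value> containerID`, as described above.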
The simulated cloud environment has 8 CPU cores and 32 GB RAM and runs Ubuntu 16.04 and Docker 18.03.1-ce. For the real cloud environment, we adopt the Amazon EC2 cloud platform. We use two types of instances. One is t2.micro, which is free with limited use of 1 CPU core and 1 GB RAM. The other is t2.2xlarge, with 8 CPU cores and 32 GB RAM. Both platforms run Ubuntu 18.04 and Docker 18.09.7-ce.

TABLE 3
Configuration Information of the Cloud Environment

Instance      Hardware                                                   Software
simulated     Intel(R) Core(TM) CPU E5620 @3.40GHz, 8 cores, 32GB RAM    Ubuntu 16.04, Docker 18.03.1-ce, Memcached 1.5.6, Stress 1.0.4, Postmark 1.51
t2.micro      Intel(R) Xeon(R) CPU E5-2676 @2.40GHz, 1 core, 1GB RAM     Ubuntu 18.04, Docker 18.09.7-ce, Memcached 1.5.6, Stress 1.0.4, Postmark 1.51
t2.2xlarge    Intel(R) Xeon(R) CPU E5-2686 @2.30GHz, 8 cores, 32GB RAM   Ubuntu 18.04, Docker 18.09.7-ce, Memcached 1.5.6, Stress 1.0.4, Postmark 1.51

We use three typical application workloads to evaluate the predicted results of the hybrid model. The first workload is the Memcached [24] container. Memcached is a free, open-source, high-performance distributed memory object caching system that stores the results of database calls, API calls, or page renderings in memory in a key-value store. For this container, we use Mutilate [25] as the load generator, mainly to consume the CPU resources of the container. By adjusting the parameters of the load generator, the generated load can form time series with different fluctuations. When the parameter is small, the load of the Memcached container is lighter and more stable; when the parameter is larger, the load is heavier and the corresponding fluctuation is larger.

The second is the stress tool [26], a simple workload generator for POSIX systems.
It can impose configurable CPU, memory, I/O, and disk pressure on the system. It is written in C and is free software licensed under the GPLv2. We start an Ubuntu image container, install stress in the container, and execute the command stress --vm 1 --vm-bytes 100M --timeout 3600s, which adds a new memory allocation process.

The third is Postmark [27], a benchmark used to simulate the behavior of mail servers. It is divided into three phases. In the first phase, a file pool is created; in the second phase, four types of transactions are executed: create, delete, read, and append files; in the final phase, the file pool is deleted. Since we cannot set the test time for Postmark, we add a loop to the cli_run function of the Postmark source code, then compile and run it in the container. We also set the file size and the read and write concurrency parameters, collect the resource usage data of the container, and then make predictions of the resource usage. As each kind of container mainly consumes a different resource, we choose to predict CPU usage for Memcached and memory usage for Stress and Postmark.

In addition, we use the Google cluster-usage traces [28] to further evaluate the hybrid model. A Google cluster is a set of machines connected by a high-bandwidth network, with a cluster management system that allocates jobs to machines. A job consists of one or more tasks that are accompanied by a set of resource requirements.

Fig. 4. Mean square error variation of four time series of container resource usage in different window sizes.

5.2 Parameter Selection of the Hybrid Model
5.2.1 Time Series Window Size
For real-time prediction, analyzing too much historical data results in a long prediction time and requires a large amount of space to record the historical data. However, if too little historical data is used, the prediction accuracy will be low. Therefore, a reasonable selection of historical data is critical to the prediction system.
The following experiments calculate the mean square error after 50 predictions using the time series of the three workloads with different window sizes. The result is shown in Fig. 4. As can be seen from Fig. 4, when the window size is between 10 and 50, more historical data is used as the window size increases, and the prediction accuracy improves. With further increases of the window size, however, the prediction accuracy is not significantly improved, while the prediction time is bound to increase as more historical data is used. Considering the above, the window size of the ARIMA model, the triple exponential smoothing model, and the hybrid model is set to 50.

5.2.2 The Smoothing Factor
For the exponential smoothing method, whether the selection of the smoothing factor a is reasonable has a great impact on the prediction accuracy. The smoothing factor determines the sensitivity to the gap between the predicted value and the actual value: the closer the smoothing factor is to 0, the more slowly the influence of distant historical data on the current prediction declines; the closer the smoothing factor is to 1, the more rapidly that influence declines. In addition, the smoothing factor determines the ability of the model to smooth the random error generated during the prediction process: the larger the smoothing factor, the stronger the smoothing ability. Therefore, the selection of the smoothing factor is key to exponential smoothing prediction.

The experiment selects the optimal smoothing factor for the time series of the three application workloads. The mean square errors of the predictions of the exponential smoothing method are calculated after 50 predictions, and the optimal smoothing factor is selected according to the mean square error. As can be seen from Fig. 5, for the two strongly fluctuating series, the Memcached heavy load and the Stress load, the optimal smoothing factors are both 0.5. The optimal smoothing factor of the Memcached light load is 0.3, and that of Postmark is 0.1. This is because when the data fluctuates greatly, we generally choose a larger smoothing factor to increase the weight of the recent data while still smoothing the data; when the data fluctuation is small, the smoothing factor should be smaller. The smoothing factor of the Postmark load is smaller than that of the Memcached light load, indicating that the resource usage of Postmark is more stable.

Fig. 5. Mean square error variation of four time series of container resource usage under different smoothing factors.

5.3 Comparison of Prediction Accuracy
We compare the prediction accuracy of four prediction models: the ARIMA model, the triple exponential smoothing model, the hybrid model, and a workload prediction model using a neural network and a self-adaptive differential evolution algorithm [29], which we call the ANN+SaDE model. The ANN+SaDE model uses a genetic algorithm to train a neural network to make predictions of time series. The evaluation of prediction performance is based on the mean absolute percentage error (MAPE) and the mean squared error (MSE), which are widely used error metrics for evaluating the results of time-series prediction. Their formulas are shown below:

MAPE = (100% / n) × Σ_{i=1}^{n} |A_i − P_i| / A_i,   (9)

MSE = (1 / n) × Σ_{i=1}^{n} (A_i − P_i)²,   (10)

where A_i is the actual value and P_i is the predicted value.

5.3.1 Experiment in the Local Simulated Cloud Environment
We first compare the predicted results of the four models under different loads in the local simulated cloud environment. The predicted results are shown in Fig. 6 and Table 4.

Fig. 6. The prediction effect diagram of four models under different loads in the simulated cloud environment. (a1-a4) represents the prediction effect of the Memcached light load, (b1-b4) the Memcached heavy load, (c1-c4) the Stress load, and (d1-d4) the Postmark load. The blue line shows the real CPU or memory usage, and the red line shows the prediction result of the different models.

TABLE 4
Prediction Result in the Simulated Cloud Environment

Load type               Prediction model               MAPE      MSE
Memcached light load    ARIMA                          0.897%    1.238
                        triple exponential smoothing   1.062%    1.595
                        hybrid model                   0.805%    0.935
                        ANN+SaDE                       0.934%    1.242
Memcached heavy load    ARIMA                          3.126%    91.340
                        triple exponential smoothing   2.878%    70.330
                        hybrid model                   1.987%    43.262
                        ANN+SaDE                       2.433%    54.703
Stress load             ARIMA                          11.285%   3.447
                        triple exponential smoothing   13.932%   5.634
                        hybrid model                   10.276%   2.902
                        ANN+SaDE                       11.731%   3.484
Postmark load           ARIMA                          12.415%   153.841
                        triple exponential smoothing   7.252%    54.781
                        hybrid model                   7.216%    49.187
                        ANN+SaDE                       13.162%   102.542

In Fig. 6, the first 50 points of the ARIMA, triple exponential smoothing, and hybrid models have no prediction curve. This is because the initial window size is 50, and these 50 data points are the historical data used by the models for the first prediction. More points have no prediction curve for ANN+SaDE, because the neural network must be trained first, and thus ANN+SaDE needs more data for training than the other three models.

The prediction accuracy of the hybrid model is better than that of both single models. This is because the hybrid model gives more weight to the model with the smaller prediction deviation for different time series; in other words, the hybrid model is able to combine the advantages of the two models to a certain extent.
It mines more useful information from the time series, so the prediction accuracy is improved. Compared with the ANN+SaDE model, the prediction accuracy of the hybrid model is also better. This is because the resource time series generated by the container load is real-time, and the trend of the earlier training data for the ANN+SaDE model will not be the same as the trend of the prediction data.

5.3.2 Experiment in Amazon Elastic Compute Cloud
Considering the heterogeneity between the local physical machine and a real cloud environment, we conducted further experiments on the prediction accuracy of the four models using the Amazon EC2 cloud platform, as shown in Table 3. We use t2.micro and t2.2xlarge instances to measure the single-core and multi-core cases respectively.

Fig. 7. The prediction effect diagram of four models under different loads in the t2.micro instance. (a1-a4) represents the prediction effect of the Memcached light load, (b1-b4) the Memcached heavy load, (c1-c4) the Stress load, and (d1-d4) the Postmark load. The blue line shows the real CPU or memory usage, and the red line shows the prediction result of the different models.

TABLE 5
Prediction Result in the t2.micro Instance Environment

Load type               Prediction model               MAPE      MSE
Memcached light load    ARIMA                          10.241%   33.041
                        triple exponential smoothing   12.917%   52.693
                        hybrid model                   10.153%   32.346
                        ANN+SaDE                       14.372%   48.814
Memcached heavy load    ARIMA                          10.265%   122.575
                        triple exponential smoothing   11.954%   174.664
                        hybrid model                   10.039%   116.503
                        ANN+SaDE                       30.804%   866.069
Stress load             ARIMA                          14.026%   9.619
                        triple exponential smoothing   9.876%    13.528
                        hybrid model                   9.497%    7.998
                        ANN+SaDE                       24.604%   13.762
Postmark load           ARIMA                          2.078%    5.544
                        triple exponential smoothing   0.767%    1.120
                        hybrid model                   0.696%    0.992
                        ANN+SaDE                       10.920%   92.338
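The MAPE and MSE metrics of Eqs. (9) and (10), used for all the results in this section, can be computed directly:

```python
def mape(actual, predicted):
    """Mean absolute percentage error, Eq. (9), in percent."""
    n = len(actual)
    return 100.0 / n * sum(abs((a - p) / a) for a, p in zip(actual, predicted))

def mse(actual, predicted):
    """Mean squared error, Eq. (10)."""
    n = len(actual)
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
```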
The experimental results for the t2.micro instance are shown in Fig. 7 and Table 5. In order to simulate a sudden traffic request, we adjust the load generator's -T parameter to add more CPU overhead at a certain moment in the Memcached heavy load, and we adjust memory usage through the --vm-bytes parameter in the Stress load to make the trend of the resource usage time series go up and then down.

Under all loads, the prediction accuracy of the hybrid model is again the highest. The prediction accuracy of the ANN+SaDE model is much lower than that of the other three models. This is because the neural network is trained using the initial part of the historical data, so its prediction accuracy decreases greatly when the time series trend changes, and the trends of the resource time series change more in the Amazon cloud environment than in the simulated cloud environment.

We perform further experiments on the t2.2xlarge instance with 8 CPU cores to compare the prediction accuracy of the different models in a higher physical configuration environment. The experimental results are shown in Fig. 8 and Table 6.

Fig. 8. The prediction effect diagram of four models under different loads in the t2.2xlarge instance. (a1-a4) represents the prediction effect of the Memcached light load, (b1-b4) the Memcached heavy load, (c1-c4) the Stress load, and (d1-d4) the Postmark load. The blue line shows the real CPU or memory usage, and the red line shows the prediction result of the different models.
TABLE 6
Prediction Result in the t2.2xlarge Instance Environment

Load type               Prediction model               MAPE      MSE
Memcached light load    ARIMA                          12.469%   567.937
                        triple exponential smoothing   13.184%   540.428
                        hybrid model                   11.140%   365.493
                        ANN+SaDE                       17.719%   843.932
Memcached heavy load    ARIMA                          10.855%   1304.671
                        triple exponential smoothing   14.592%   2529.447
                        hybrid model                   10.233%   1180.015
                        ANN+SaDE                       16.672%   3086.117
Stress load             ARIMA                          3.819%    0.504
                        triple exponential smoothing   4.013%    0.512
                        hybrid model                   3.120%    0.323
                        ANN+SaDE                       5.329%    1.069
Postmark load           ARIMA                          12.720%   4.127
                        triple exponential smoothing   4.240%    0.538
                        hybrid model                   3.929%    0.499
                        ANN+SaDE                       15.253%   8.674

As in the t2.micro instance, the hybrid model has the highest prediction accuracy, and the ANN+SaDE model has the lowest prediction accuracy.

5.3.3 Experiment on the Google Cluster-Usage Traces
The trace we choose is the Google cluster-usage trace clusterdata-2011-2. We randomly select a long-duration job with job ID 3418309 and choose task index 0 and task index 1 in the job. The experimental results are shown in Fig. 9 and Table 7. From Fig. 9, we can see that the performance trend of the two traces fluctuates only slightly and the tasks consume very little CPU resource. Table 7 shows that the hybrid model has better prediction accuracy than the ANN+SaDE model. This is because the ANN+SaDE model is trained using the first 40 percent of the traces, which as a whole trend downward, resulting in its predicted values being greater than the actual values. The hybrid model uses recent historical data to make predictions, so it can detect changes in the data trend more quickly and thus obtains better prediction results.

5.4 Prediction Time
For a real-time prediction model, the prediction time should not be too long. A long prediction time may cause the current prediction result to come out only after the actual value of the next period has already been collected, so the prediction itself loses its meaning.
Fig. 9. The prediction effect diagram of four models on the Google cluster data. (a1) and (a2) represent the prediction effect of task index 0; (b1) and (b2) represent the prediction effect of task index 1. The blue line shows the real CPU usage, and the red line shows the prediction result of the different models.

TABLE 7
Prediction Result on the Google Cluster Data

Load type      Prediction model    MAPE      MSE
task index 0   hybrid model        6.857%    6.41e-09
               ANN+SaDE            20.581%   3.51e-08
task index 1   hybrid model        7.388%    5.914e-09
               ANN+SaDE            19.165%   2.23e-08

For the hybrid model based on the ARIMA model and the triple exponential smoothing model, the main prediction time consumption lies in the ARIMA model, because each prediction of the triple exponential smoothing model is based only on the previous prediction. Although that model considers all historical data, it does not need to record and use all historical data, so its prediction time is quite short and almost negligible. For the ARIMA model and the hybrid model, 50 predictions are performed under different loads, and their average prediction times are shown in Fig. 11. The prediction time of the hybrid model is slightly longer than that of the ARIMA model, by 0.59, 5.4, 3.17, and 1.3 percent in the four time series respectively. The extra time overhead is almost negligible because the time overhead of the triple exponential smoothing model is small.

5.5 CPU and Memory Overhead
We test the CPU and memory overhead of running the hybrid and ANN+SaDE models using the Google cluster data, as shown in Fig. 10. We remove the prediction interval from the programs and record the CPU and memory usage every 0.01 seconds. Because the execution time of the hybrid model is much less than that of the ANN+SaDE model, the curve of the hybrid model is much shorter than that of the ANN+SaDE model. The experimental results show that the CPU usage of the hybrid model is lower than that of the ANN+SaDE model.
This means the computational cost of the hybrid model is lower than that of the ANN+SaDE model. In addition, we can see from the figure that the memory usage of the hybrid model is also much lower than that of the ANN+SaDE model. This is because artificial neural network training consumes a lot of CPU and memory resources.

Fig. 10. The resource consumption of the host.

Fig. 11. Average prediction time comparison between the two models.

6 RELATED WORK
We first introduce general prediction methods and then describe the related work on Docker container resource load prediction.

6.1 Prediction Methods
Trend extrapolation is a technique of predicting with statistical methods in order to forecast future patterns of time series data [30]. It can be subdivided into two types: the moving average method and the exponential smoothing method. The moving average is extremely useful for forecasting long-term trends. When the value of the time series is affected by periodic variation and random interference, the fluctuation of the time series is large and its development trend cannot be clearly displayed; the moving average method can effectively eliminate the influence of the random factors and reveal the overall trend of the time series. The moving average method usually does not consider historical data from long ago, while the full moving average uses all the historical data of the time series equally (i.e., it gives all the historical data the same weight when calculating the average). The exponential smoothing method combines the advantages of the full moving average and the moving average: it uses all historical data, but assigns the historical data weights that decay from near to far, gradually converging to zero.

Grey prediction is a method of predicting grey systems. The process of grey prediction is generally to accumulate the data first to eliminate its randomness and volatility.
XIE ET AL.: REAL-TIME PREDICTION OF DOCKER CONTAINER RESOURCE LOAD BASED ON A HYBRID MODEL OF ARIMA AND...

Then a whitening differential equation is established, and the solution of the equation is the prediction result. Khalid et al. [31] used the GM(1,1) grey model to predict wind power over a short period of time. The grey prediction model is an exponential prediction model and is suitable for time series whose data trend is exponential [32]; for other trends, the prediction effect may not be as good.

In recent years, SVM [33] has been applied to regression problems because its excellent generalization ability allows high-precision fitting. However, the selection of model parameters is the main difficulty of SVM regression prediction, and so far there is no unified guiding theory, which hinders the use of SVM for data prediction.

The neural network model [34] is an information processing system that simulates the structure and function of human brain nerve cells. It is a complex network composed of a large number of neurons: each neuron represents an output function, and the connections between neurons represent weights. With different weights and output functions, the final output of the network differs. Neural networks have very powerful learning capabilities and can learn to approximate any nonlinear mapping relationship. They are therefore widely used in many fields of forecasting, such as stock price forecasting in the financial sector [35] and traffic flow forecasting in the transportation sector [36].

6.2 Docker Resource Load Prediction

At present, there are few studies on resource usage prediction for Docker containers. Shanmugam [4] predicted the CPU usage of a container with the ARIMA model and then distributed the load to the container's web service using a loop-based algorithm. There are, however, many studies on cloud computing resource load prediction.
The dynamic and real-time characteristics of Docker container resources are consistent with those of cloud computing loads, so research on cloud computing load prediction is a strong reference point for Docker container resource load prediction. Calheiros et al. [5], [6] proposed a cloud computing load prediction model based on the ARIMA model. First, the time series is differenced until it is stationary to determine the value of d, and then the values of p and q are determined using the autocorrelation function and the partial autocorrelation function; the historical load data then conforms to the determined values of p, d, and q. The ARIMA model is used to predict future load values and achieves an average accuracy of 91 percent. Huang et al. [7] proposed a resource prediction model based on quadratic exponential smoothing to predict the cloud resources that customers need to subscribe to. It considers not only the current resource status but also historical resource records, and thus obtains higher prediction accuracy. However, the quadratic exponential smoothing model is in essence a linear model and cannot mine the nonlinear relationships in a time series. Islam et al. [8] used a combination of neural networks and linear regression to predict the resources of managed applications in the cloud environment. The method uses traditional linear regression to approximate the linear relationships in a time series and then uses neural networks to approximate the nonlinear relationships. It takes more influencing factors of the time series into account and obtains better prediction results. However, enough samples are needed to train the neural network: when applied to containers, a large amount of time series data on container resource usage must be collected, which has large storage and computation overheads.
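The trend-extrapolation methods surveyed above can be made concrete with a short sketch. The code below is illustrative only; the function names, the smoothing factor alpha, and the sample values are our own assumptions, not the paper's implementation.

```python
# Illustrative sketch of the two trend-extrapolation methods discussed
# above. All names and parameter values here are assumptions for the
# example, not taken from the paper.

def moving_average(xs, n):
    """Equal weight 1/n over the last n observations."""
    window = xs[-n:]
    return sum(window) / len(window)

def exp_smoothing(xs, alpha=0.3):
    """Recursive form s_t = alpha * x_t + (1 - alpha) * s_{t-1}.

    Only a single value is carried between steps, so the cost of each
    new prediction is O(1) no matter how long the history is; this is
    why the smoothing model's prediction time is almost negligible.
    """
    s = xs[0]
    for x in xs[1:]:
        s = alpha * x + (1 - alpha) * s
    return s

# Unrolled, the recursion assigns weight alpha * (1 - alpha)**k to the
# observation k steps in the past: all history is used, but the weights
# decay geometrically toward zero, as described in Section 6.1.
weights = [0.3 * 0.7 ** k for k in range(5)]
```

Triple (Holt-Winters) exponential smoothing extends the same recursion with trend and seasonal components, but keeps the key property exploited by the hybrid model: a small fixed-size state instead of the full history.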
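The ARIMA order-identification steps described above (difference the series until it is stationary to choose d, then read the orders off the autocorrelation and partial autocorrelation functions) can be sketched in pure Python. This is a hedged illustration with hand-rolled helpers; a real system would use a statistics library for the full Box-Jenkins workflow.

```python
# Sketch of the first ARIMA identification steps mentioned above.
# Helper names are illustrative; stationarity tests and the rules for
# reading p and q off the ACF/PACF plots are omitted for brevity.

def difference(xs, d=1):
    """Apply first differencing d times (removes polynomial trends)."""
    for _ in range(d):
        xs = [b - a for a, b in zip(xs, xs[1:])]
    return xs

def acf(xs, max_lag=5):
    """Sample autocorrelation at lags 0..max_lag.

    Large spikes at low lags hint at the MA order q; the partial
    autocorrelation function (not shown) plays the same role for p.
    """
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cors = []
    for k in range(max_lag + 1):
        cov = sum((xs[t] - mean) * (xs[t + k] - mean) for t in range(n - k))
        cors.append(cov / var)
    return cors

# A series with a linear trend becomes constant after one difference,
# so d = 1 would be chosen for it.
trend = [2.0 * t for t in range(10)]
stationary = difference(trend, d=1)   # every element equals 2.0
```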
7 CONCLUSION

Predicting the resource usage of container workloads in dynamic container workload environments is a great challenge for improving the performance of cloud computing platforms. This paper proposes a hybrid model that combines ARIMA with triple exponential smoothing to accurately predict both the linear and non-linear relationships in the time series of the resource workload of Docker containers. In addition, to enable automatic prediction and alleviate the management burden, we design and implement a Docker container resource prediction system that provides efficient container information collection, storage, prediction, and scheduling. The hybrid model improves prediction accuracy by 52.64, 20.15 and 203.72 percent on average compared with ARIMA, the triple exponential smoothing model, and ANN+SaDE respectively, with a small time overhead. Users can directly use the system we designed to improve the resource utilization of a container-based cloud platform, or use the hybrid model to implement a resource prediction system adapted to their own platform.

ACKNOWLEDGMENTS

This work was supported in part by the National Science Foundation of China under Grants No. 61972449, U1705261, and 61821003, in part by the CCF-NSFOCUS Kun Peng research fund, in part by the Wuhan Application Basic Research Program under Grant No. 2017010201010104, in part by the Hubei Natural Science and Technology Foundation under Grant No. 2017CFB304, and in part by the Fundamental Research Funds for the Central Universities under Grant No. 2019kfyXKJC021.

REFERENCES

[1] P. Mell and T. Grance, "The NIST definition of cloud computing," Commun. ACM, vol. 53, no. 6, pp. 50–50, 2011.
[2] K.-T. Seo, H.-S. Hwang, I.-Y. Moon, O.-Y. Kwon, and B.-J. Kim, "Performance comparison analysis of Linux container and virtual machine for building cloud," Adv. Sci. Technol. Lett., vol. 66, no. 2, pp. 105–111, 2014.
[3] C. Anderson, "Docker [software engineering]," IEEE Softw., vol. 32, no. 3, pp. 102–c3, May/Jun. 2015.
[4] A. S. Shanmugam, "Docker container reactive scalability and prediction of CPU utilization based on proactive modelling," Master's thesis, Nat. College Ireland, Dublin, 2017. [Online]. Available: http://trap.ncirl.ie/2884/1/aravindsamyshanmugam.pdf
[5] N. Roy, A. Dubey, and A. S. Gokhale, "Efficient autoscaling in the cloud using predictive models for workload forecasting," in Proc. IEEE Int. Conf. Cloud Comput., 2011, pp. 500–507.
[6] V. G. Tran, V. Debusschere, and S. Bacha, "Hourly server workload forecasting up to 168 hours ahead using seasonal ARIMA model," in Proc. IEEE Int. Conf. Ind. Technol., 2012, pp. 1127–1131.
[7] J. Huang, C. Li, and Y. Jie, "Resource prediction based on double exponential smoothing in cloud computing," in Proc. 2nd Int. Conf. Consum. Electron. Commun. Netw., 2012, pp. 2056–2060.
[8] S. Islam, J. Keung, K. Lee, and A. Liu, "Empirical prediction models for adaptive resource provisioning in the cloud," Future Gener. Comput. Syst., vol. 28, no. 1, pp. 155–162, 2012.
[9] Z. Zou, Y. Xie, K. Huang, G. Xu, D. Feng, and D. Long, "A Docker container anomaly monitoring system based on optimized isolation forest," IEEE Trans. Cloud Comput., to be published, doi: 10.1109/TCC.2019.2935724.
[10] Z. Huang, S. Wu, S. Jiang, and H. Jin, "FastBuild: Accelerating Docker image building for efficient development and deployment of container," in Proc. 35th Symp. Mass Storage Syst. Technol., 2019, pp. 28–37.
[11] LXC. 2013. [Online]. Available: https://linuxcontainers.org/
[12] Libcontainer. 2013. [Online]. Available: https://github.com/docker/libcontainer
[13] D. Merkel, "Docker: Lightweight Linux containers for consistent development and deployment," Linux J., vol. 2014, no. 239, 2014, Art. no. 2.
[14] K. Kaur, T. Dhand, N. Kumar, and S. Zeadally, "Container-as-a-service at the edge: Trade-off between energy efficiency and service availability at fog nano data centers," IEEE Wireless Commun., vol. 24, no. 3, pp. 48–56, Jun. 2017.
[15] N. Ferry, A. Rossini, F. Chauvel, B. Morin, and A. Solberg, "Towards model-driven provisioning, deployment, monitoring, and adaptation of multi-cloud systems," in Proc. IEEE 6th Int. Conf. Cloud Comput., 2013, pp. 887–894.
[16] N. Naik, "Migrating from virtualization to dockerization in the cloud: Simulation and evaluation of distributed systems," in Proc. IEEE 10th Int. Symp. Maintenance Evol. Service-Oriented Cloud-Based Environ., 2016, pp. 1–8.
[17] Autoregressive Integrated Moving Average model. 2004. [Online]. Available: https://people.duke.edu/~rnau/411arim.htm
[18] B. Li, J. Zhang, Y. He, and Y. Wang, "Short-term load-forecasting method based on wavelet decomposition with second-order gray neural network model combined with ADF test," IEEE Access, vol. 5, pp. 16324–16331, 2017.
[19] K. Yamaoka, T. Nakagawa, and T. Uno, "Application of Akaike's information criterion (AIC) in the evaluation of linear pharmacokinetic equations," J. Pharmacokinetics Biopharmaceutics, vol. 6, no. 2, pp. 165–175, 1978.
[20] Levinson recursion. 2004. [Online]. Available: https://en.wikipedia.org/wiki/Levinson_recursion
[21] E. S. Gardner Jr., "Exponential smoothing: The state of the art," J. Forecasting, vol. 4, no. 1, pp. 1–28, 1985.
[22] P. S. Kalekar, "Time series forecasting using Holt-Winters exponential smoothing," Kanwal Rekhi School of Information Technology, 4329008, 2014. [Online]. Available: https://www.researchgate.net/publication/268340653_Time_series_Forecasting_using_HoltWinters_Exponential_Smoothing
[23] InfluxDB. 2014. [Online]. Available: https://www.infoq.com/fr/presentations/influx-db/
[24] What is Memcached. 2007. [Online]. Available: http://memcached.org/
[25] Leverich, Mutilate. 2018. [Online]. Available: https://github.com/leverich/mutilate
[26] Stress. 2017. [Online]. Available: https://www.archlinux.org/packages/community/x86_64/stress/
[27] Postmark. 2006. [Online]. Available: http://www.filesystems.org/docs/auto-pilot/Postmark.html
[28] C. Reiss, J. Wilkes, and J. L. Hellerstein, "Google cluster-usage traces: Format + schema," Google Inc., Mountain View, CA, USA, Tech. Rep., Nov. 2011, revised 2014-11-17 for version 2.1. [Online]. Available: https://github.com/google/cluster-data
[29] J. Kumar and A. K. Singh, "Workload prediction in cloud using artificial neural network and adaptive differential evolution," Future Gener. Comput. Syst., vol. 81, pp. 41–52, 2018.
[30] Trend extrapolation. 2017. [Online]. Available: https://thelawdictionary.org/trend-extrapolation/
[31] M. Khalid and A. V. Savkin, "A method for short-term wind power prediction with multiple observation points," IEEE Trans. Power Syst., vol. 27, no. 2, pp. 579–586, May 2012.
[32] E. Kayacan, B. Ulutas, and O. Kaynak, "Grey system theory-based models in time series prediction," Expert Syst. Appl., vol. 37, no. 2, pp. 1784–1789, 2010.
[33] Support Vector Machine. 2017. [Online]. Available: https://www.sciencedirect.com/topics/neuroscience/support-vector-machine
[34] Neural Networks Model. 2014. [Online]. Available: https://www.ibm.com/support/knowledgecenter/en/SS3RA7_15.0.0/com.ibm.spss.modeler.help/neuralnet_model.htm
[35] M. Qiu, Y. Song, and F. Akagi, "Application of artificial neural network for the prediction of stock market returns: The case of the Japanese stock market," Chaos Solitons Fractals, vol. 85, pp. 1–7, 2016.
[36] K. Kumar, M. Parida, and V. K. Katiyar, "Short term traffic flow prediction in heterogeneous condition using artificial neural network," Transport, vol. 30, no. 4, pp. 397–405, 2015.

Yulai Xie (Member, IEEE) received the BE and PhD degrees in computer science from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2007 and 2013, respectively.
He was a visiting scholar with the University of California, Santa Cruz, in 2010 and a visiting scholar with the Chinese University of Hong Kong in 2015. He is currently an associate professor with the School of Cyber Science and Engineering, HUST, China. His research interests mainly include cloud storage and virtualization, digital provenance, intrusion detection, machine learning, and computer architecture.

Minpeng Jin received the BE degree in computer science from Northeastern University, Shenyang, China, in 2019. He is currently working toward the master's degree at the Huazhong University of Science and Technology (HUST), Wuhan, China. His research interests include Docker containers and virtualization.

Zhuping Zou received the BE degree in computer science from the Central South University of Forestry and Technology, Changsha, China, in 2017, and the master's degree from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2019. His research interests include Docker containers and virtualization.

Gongming Xu received the BE degree in computer science from the Wuhan Institute of Technology, Wuhan, China, in 2018. He is currently working toward the master's degree at the Huazhong University of Science and Technology (HUST), Wuhan, China.

Dan Feng (Member, IEEE) received the BE, ME, and PhD degrees in computer science and technology from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 1991, 1994, and 1997, respectively. She is currently a professor and director of the Data Storage System Division, Wuhan National Lab for Optoelectronics, and dean of the School of Computer Science and Technology, HUST. Her research interests include computer architecture, massive storage systems, parallel file systems, disk arrays, and solid state disks.
She has more than 100 publications in journals and international conferences, including FAST, USENIX ATC, ICDCS, HPDC, SC, Information, Communication & Society, and IPDPS. She is a member of ACM.

Wenmao Liu received the PhD degree in information security from the Harbin Institute of Technology, Harbin, China, in 2013. He is the director of the Innovation Center of NSFOCUS. After completing his degree, he served as a researcher with NSFOCUS Inc.; during his first two years at NSFOCUS, he was also a postdoc at Tsinghua University. His interests focus on cloud security, IoT security, threat intelligence, and advanced security analytics. He has published the book Software-Defined Security on the next generation of security inspired by SDN/NFV technology, and has participated in national and industrial standards related to cloud security. He is now promoting the adoption of container security and DevSecOps.

Darrell Long (Fellow, IEEE) received the BS degree in computer science from San Diego State University, San Diego, California, and the MS and PhD degrees from the University of California, San Diego. He is a distinguished professor of computer engineering with the University of California, Santa Cruz, where he holds the Kumar Malavalli Endowed Chair of Storage Systems Research and is director of the Storage Systems Research Center. His current research interests in the storage systems area include high-performance storage systems, archival storage systems, and energy-efficient storage systems. His research also includes computer system reliability, video-on-demand, applied machine learning, mobile computing, and cyber security. He is a fellow of the American Association for the Advancement of Science (AAAS).
UML Diagrams in Software Engineering Research: A Systematic Literature Review
By Hatice Koç, Ali Mert Erdoğan, Yousef Barjakly and Serhat Peker
Presented by Murali Krishna Reddy Voruganti
CS699AO Professional Seminar, Prof. Vladimir Riabov

Agenda
01 Introduction
02 UML Usage
03 Methodology
04 Results
05 Conclusion and Improvements

UML
UML stands for Unified Modeling Language. It is a rich language for modeling software solutions, application structures, system behavior, and business processes.

Use
It helps to explain business functions.
It reduces the development effort.
It works as a communication channel between developers and functional users.
It improves productivity across the whole process.

Research
Goal: to systematically review the literature on UML diagram utilization in software engineering research.
RQ1. What is the distribution of the number of publications by year?
RQ2. What is the distribution of the number of publications by publishers and publishing types?
RQ3. What is the distribution of the publications according to the application areas?
RQ4. For which purposes are UML diagrams utilized in the publications?
RQ5. What are the most used UML diagrams in the publications?

Methodology
Total articles found: 247. Inclusion criteria: at least one UML diagram, published between 2000 and 2019, English only. Articles included: 128.

Results: RQ1
RQ1. What is the distribution of the number of publications by year?

Results: RQ2
By publisher. By publication type.

Results: RQ3 & RQ4
By application area: the fewest articles were published in finance and other application areas.
By usage: more than two-thirds of the publications used UML diagrams for design purposes.

Results: RQ5
RQ5. What are the most used UML diagrams in the publications?

Conclusions
Class diagrams lead, while sequence and state diagrams were used the least.
Most of the publications were either conference proceedings or journals.
The largest number of articles using UML diagrams was published by IEEE.
UML diagrams were mostly used for design and modeling purposes, in computer science and industry application fields.

Improvements
Improve the search strings used in the search criteria, e.g.: development, SDLC, testing, analysis.

References
About the Unified Modeling Language Specification Version 2.5.1. (2022). Object Management Group. https://www.omg.org/spec/UML/2.5.1/About-UML/
Koç, H., Erdoğan, A. M., Barjakly, Y., & Peker, S. (2021). UML Diagrams in Software Engineering Research: A Systematic Literature Review. Proceedings, 74, 13. https://doi.org/10.3390/proceedings2021074013

Thank you

Proceeding
UML Diagrams in Software Engineering Research: A Systematic Literature Review †
Hatice Koç *, Ali Mert Erdoğan, Yousef Barjakly and Serhat Peker
Department of Management Information Systems, Izmir Bakircay University, 35665 Menemen, Turkey; alimert.erdogan@bakircay.edu.tr (A.M.E.); ybarjakly@gmail.com (Y.B.); serhat.peker@bakircay.edu.tr (S.P.)
* Correspondence: hatcekoc@gmail.com
† Presented at the 7th International Management Information Systems Conference, Online, 9–11 December 2020.

Abstract: Software engineering is a discipline utilizing Unified Modelling Language (UML) diagrams, which are accepted as a standard for depicting object-oriented design models. UML diagrams make it easier to identify the requirements and scope of systems and applications by providing visual models. In this manner, this study aims to systematically review the literature on UML diagram utilization in software engineering research. A comprehensive review was conducted over the last two decades, spanning from 2000 to 2019. Among several papers, 128 were selected and examined. The main findings showed that UML diagrams were mostly used for the purpose of design and modeling, and class diagrams were the most commonly used ones.
Keywords: software engineering; UML diagrams; literature review; systematic mapping; classification

Citation: Koç, H.; Erdoğan, A.M.; Barjakly, Y.; Peker, S. UML Diagrams in Software Engineering Research: A Systematic Literature Review. Proceedings 2021, 74, 13. https://doi.org/10.3390/proceedings2021074013. Published: 10 March 2021.

Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

Software enables organizations to adopt competitive differentiation and competitive change, because they can design, enhance, and adapt their systems, products, and services for different market sectors, from manufacturing to art, and provide rapid and flexible supply chain management [1]. However, developing software requires determining every aspect of a system or application, so software development is complex [2]. Software engineering has therefore emerged as an engineering discipline that deals with a software product from the early stages of system specification to the maintenance of the system or application. It helps develop more reliable systems and decreases the cost of developing them [3].

The systematic literature review (SLR) is a research methodology that makes it easier to recognize, analyze, and interpret all existing studies [4]. Its objective is not only to find all evidence for the research questions but also to contribute to improving evidence-based guidelines [5]. It consists of three processes: planning, execution, and reporting. Although these processes can consist of many steps depending on the research target, they must include data retrieval, study selection, data extraction, and data synthesis [6].
The Unified Modeling Language (UML) is also used to develop systems in software engineering; it is a visual language to define and document a system. The requirements in scenarios that express how users use a system are shown with the UML, as are the constraints of a system [4]. Hence, many researchers who work as software engineers publish papers about how UML diagrams are utilized to develop a system and contribute to practice in order to advance the software engineering discipline. In our study, an SLR is used to understand which UML diagrams are popular, why they are used, and which application areas are the most popular [2].

The aim of this paper is to determine the current situation and the future of UML diagrams in the software engineering discipline. Thus, the research questions and keywords were identified, and then publications between 2000 and 2019 were investigated using Google Scholar. A total of 247 publications were found, and 128 of them included the following UML diagrams: class diagram, activity diagram, sequence/interaction diagram, state machine diagram, system sequence diagram, deployment diagram, collaboration/communication diagram, package diagram, object diagram, domain model diagram, and component diagram. These publications were classified in terms of distribution years, publishers, application areas, usage purposes, and the types of UML diagrams. A Microsoft Excel spreadsheet was used to store and analyze these data with bar graphs and pie charts. The rest of the paper is composed of three sections: Method, Results, and Conclusion.
In the Method section, the SLR process is described in detail, giving an outline of how the methodology is applied and how the data are collected; it consists of four subsections: Research Questions, Search Strategy, Inclusion and Exclusion Criteria, and Data Extraction. The Results section expresses the findings for the included papers and is composed of five subsections, corresponding to the answers to the research questions. The last section includes discussion and comments on the findings, the current situation, and the future of this work.

2. Method

This study was conducted with the SLR methodology in three phases, consisting of planning, execution, and reporting, based on Kitchenham's theoretical framework. In this framework, each of the phases can be broken down into many steps [6]. The planning phase consists of the following steps: research questions, search strategy, inclusion and exclusion criteria, and data extraction.

2.1. Research Questions

The objective of this paper is to investigate the use of various types of UML diagrams against various variables. Several research questions were formulated, based on the previous literature and on common sense. The following are the basic research questions:

RQ1. What is the distribution of the number of publications by year?
RQ2. What is the distribution of the number of publications by publishers and publishing types?
RQ3. What is the distribution of the publications according to the application areas?
RQ4. For which purposes are UML diagrams utilized in the publications?
RQ5. What are the most commonly used UML diagrams in the publications?

2.2. Search Strategy

This systematic literature review was performed through only the Google Scholar search engine, using a set of predefined keywords (shown in Table 1). The base keyword for the search strings was UML. This keyword was combined with the search strings listed in Table 1.
The years between 2000 and 2019 were determined to be the target period, and relevant articles were downloaded that met the general criterion of including at least one of the UML diagrams given in Table 2.

Table 1. Search strings.
System implementation; Software implementation; Application implementation; System design; Software design; Application design; Model for system; Model for software; Model for application; Framework for system; Framework for software; Framework for application; Architecture for system; Architecture for software; Architecture for application; System architecture; System model; System framework

Moreover, forward and backward snowballing was undertaken to extend the research in two stages: using the original papers and then using the additional papers that were found [7]. To do this, for each paper, the members of the team checked the references in the paper, looking at the titles as well as the abstracts.

Table 2. Types of Unified Modeling Language (UML) diagrams.
Use Case Diagram; Communication/Collaboration Diagram; System Sequence Diagram; Class Diagram; Domain Model (diagram); Component Diagram; Activity Diagram; Deployment Diagram; State Machine Diagram; Object Diagram; Sequence/Interaction Diagram; Package Diagram

2.3. Inclusion and Exclusion Criteria

After a general research strategy and criteria were defined, several relevant keywords were identified in terms of the research questions, the search was organized, and 247 publications were found in the databases. A set of detailed criteria was created in order to select the publications related to the research purpose. The inclusion and exclusion criteria were the following:
• The publications must be published in the English language;
• The publications must be published between 2000 and 2019;
• The publications must include at least one UML diagram.
Figure 1 displays the SLR process and the results of the inclusion and exclusion criteria: 52% of the downloaded publications, that is 128 publications, were included in the study out of a total of 247 papers.

[Figure 1. Systematic literature review diagram.]

2.4. Data Extraction

A data extraction process was conducted in order to address the research questions and discover patterns and trends. For this purpose, a Microsoft Excel spreadsheet was used to store and organize the data about the publications, covering the classification characteristics relevant to the research questions, such as type, publisher, usage purpose, and application area. Table 3 shows each classification characteristic and its categories.

Table 3. The classification characteristics for the publications.
Publication Type: journals, conferences, book chapters, and other academic publications
Publishers: IEEE, ACM, Elsevier, Springer, and others
Goals: design, testing, implementation, and others
Application: health, industry and business, finance, service, computer science, education, and others

3. Results

This section explains the results of our literature review analyses and includes the findings related to the research questions. It is organized into subsections by research question.

3.1. RQ1. What Is the Distribution of the Number of Publications by Year?

Figure 2 shows the distribution of the publications between 2000 and 2019 in four-year subperiods. The peak subperiod was 2012–2015 at 25%, whereas the 2000–2003 subperiod was 23%, the 2004–2007 subperiod was 20%, and the 2016–2019 subperiod was 17%.

[Figure 2. Distribution of papers based on four-year subperiods (2000–2003, 2004–2007, 2008–2011, 2012–2015, 2016–2019; bar values 32, 29, 25, 22 and 19 studies).]

3.2. RQ2.
What Is the Distribution of the Number of Publications by Publishers and Publishing Types?

Figure 3 illustrates the distribution of the types of publications. The number of conference proceedings was 60, which was 47% of all publications; journal papers had a rate of 44%; book chapters had the lowest share at 4%; and other publications accounted for 5%. Figure 4 shows the number of publications by publisher. A total of 44 publications were published by IEEE, while Elsevier and Springer had the same number of publications at 17. Moreover, 9 publications were published by ACM. Other publishers, such as Taylor & Francis and Wiley, had 41 publications.

[Figure 3. The number of articles by publication type.]
[Figure 4. Distribution of articles by publisher: IEEE 44, Others 41, Elsevier 17, Springer 17, ACM 9.]

3.3. RQ3. What Is the Distribution of the Publications According to the Application Areas?

Figure 5 expresses the distribution of publications for each application area. The greatest numbers of publications were in computer science and in industry and business applications, respectively, whereas the fewest articles were published in finance and other application areas.

[Figure 5. Distribution of publications by application area.]

3.4. RQ4. For Which Purposes Are UML Diagrams Utilized in the Publications?

More than two-thirds of the publications used UML diagrams for design purposes. Other purposes for utilizing UML diagrams included testing and implementation or development, with percentages of 18% and 13.3%, respectively. These can be seen in Figure 6 in detail.

[Figure 6. Distribution of articles by purpose of UML diagram usage.]

3.5. RQ5. What Are the Most Commonly Used UML Diagrams in the Publications?
The distribution of the number of each type of UML diagram is expressed in Figure 7. The least-used UML diagram was the component diagram, at a rate of 0.7%. The class diagram was the most commonly used one, appearing in 26.3% of all the articles.

[Figure 7. UML diagram usage in publications; counts per diagram type: 71, 44, 41, 34, 33, 12, 9, 7, 6, 6, 5 and 2.]

Table 4 gives information about the distribution of publications that had either only one UML diagram type or more than one. Half of the studies contained only one distinct diagram type; 18.8% of the publications included two or three different types of diagrams, and 13.2% of the publications included four different types of UML diagrams. Only one publication contained five different types of UML diagrams, and 3% of all the publications contained six different types of UML diagrams.

Table 4. Distribution of publications by UML diagram type usage.

Number of UML diagram types   Count   Percentage
1                              59     46.1%
2                              24     18.8%
3                              24     18.8%
4                              17     13.2%
5                               4      3.1%
Total                         128     100%

Apart from this table, when the diagrams under the category of Others were examined one by one, it was seen that single usages of the collaboration, component, and object diagrams totaled zero; that is, they were never used individually in any publication. Table 5 was formed to show the associations of diagrams used in the same publication. In other words, one can find the count of publications that included two specific diagrams by looking at the junction cell of the diagram names in the table. Additionally, the bold numbers in the middle of the table give the total counts of publications that included the related diagrams.

Table 5. The association matrix for the usage of UML diagram types.
                       Class  Activity  Use Case  Seq/Int  State Machine  Others
Class                    71      22        23       19         19           27
Activity                 22      44        16        9          8           16
Use Case                 23      16        41       13         13           25
Sequence/Interaction     19       9        13       34         12            9
State Machine            19       8        13       12         33           13
Others                   27      16        25        9         13           47

The five diagrams with high usage rates in Figure 7 appear directly by name in the table; the other six diagrams were grouped under the category of Others. Accordingly, the high associations correlate with the usage rates of the diagrams. When comparing the differences between the associations together with the total numbers of publications, there were no significant differences; however, while the class diagram had 27 associations with the Others diagrams in 71 total publications, the use case diagram had 25 associations with the Others diagrams in 41 total publications, which was significantly lower than for the class diagrams. The activity diagrams also had less...
