Data-duplication-problem

This repository contains research on data deduplication, addressing the problem of duplicate data in cloud storage.

License: Apache-2.0

MULTI-ENVIRONMENT BIG DATA DEDUPLICATION SOLUTION

A Multi-Environment Big Data Deduplication Solution for Cloud Storage Systems, by Erastus (karungokihara@gmail.com)

1. Summary

The main objective of this research is to understand the major issues caused by data duplication and how these challenges can be reduced for organisations using the benefits of big data and the cloud. The research aims to understand the problems of duplicate big data in a cloud environment and to list and suggest some of the best solutions for them. The deduplication process will be studied and tested for better understanding, and the results of the process will be used to draw conclusions and provide insights.

2. Introduction

With the rise of internet-based work and the immense volume of digital footprints it produces, the use of big data is increasing day by day. Big data is one of the most widely used technological terms and techniques in organisations because of the large amount of data their business processes collect on a daily basis. This growth in data brings various computational and business benefits to organisations, but it is also a main source of concern for them. The data collected daily is unstructured and mixes valuable information with unimportant information, so it needs to be handled carefully to make the best use of it. The method companies most commonly use to store this data and keep its collection process dynamic is cloud computing technology.

Using cloud computing to store big data is a very common practice for the everyday collection, filtering, processing, and storage of data by organisations. Generally, this data contains information in the form of text, images, audio, backup data, or video, collected with the permission of the user. When collected from various endpoints, this data may contain similar or duplicate records that only add to the data size while reducing its meaningfulness. Deduplication of big data is one of the solutions used by data handlers to remove this duplication and increase the reliability and quality of the data. This work discusses a solution to the duplicate-removal problem for big data in a cloud storage environment. As displayed in the image above, the result of deduplication is to reduce the repetition of similar data and enhance the overall efficiency of data processing.

2.1 Problem Statement

A problem statement section elaborates the problems the research is concerned with. The following are some of the issues caused by duplicate data in cloud storage:

  1. The increase in the amount of data in the cloud adds to the expense of handling and storing it. Cloud is a service chosen for its pay-as-you-use model, so an increasing amount of data requires larger storage space and therefore higher costs. Duplicate data unnecessarily adds to the storage and handling costs of the cloud for any organisation.

  2. The duplication of data also causes a major decrease in the overall quality of the data. Data collected from various business processes is used to gain insights for future business processes and decision making; because of duplication, the outputs generated from data processing are distorted, which in turn reduces the benefits of big data for the company.

Big data technology has evolved tremendously within the last decade. Almost every organisation in the world is leveraging big data analytics to improve business performance and make smarter, data-driven critical business decisions. The large volume of big data makes it difficult to access the data in a normal system environment, so companies use various big data analytics techniques on clouds such as AWS, Azure, and Google Cloud. The task here is to understand the process and technology that can be used for big data analytics in a Hadoop environment on the AWS cloud service.

2.2 Research questions

The research questions for this work are the following:

  1. How can we solve the problem of data duplication for cloud storage systems using an ML model?
  2. How can we use Hadoop, together with its related components, to solve the problem of data duplication in AWS?

2.3 Conclusion

This chapter has outlined the key areas we will look into in our research work. The main topics we will deal with include Hadoop and the development of a simple ML model that will be trained on our datasets to provide a solution to duplication.

3. Literature Review

3.1 Introduction

This chapter reviews similar work performed by other researchers on the problem of data duplication. We went through scholarly materials from the internet, considering the titles and abstracts, to carry out our literature review, and we point out clear problems that have been solved using different methods.

3.2 Data duplication

Data deduplication is defined as the technique used to solve data repetition problems. These techniques are used on cloud servers to minimise server space. To prevent unauthenticated access and the creation of duplicate data in the cloud, an encryption approach is applied to the data before it is saved to the server.

The image is from the process suggested by Ebinazer, Savarimuthu and Banu (2021), which describes the deduplication process named and abbreviated SDD-RT-BF. This is a secure data deduplication process built in the form of a radix trie using a Bloom filter, and it is completed in three stages: authorised deduplication of data, providing proof of ownership, and key role updating.

Cloud storage generally holds confidential business data and procedures, so strong security is the main way to retain trust between users and cloud services. To handle security threats, this work considers several types of cloud storage, where general data stores such as individual files and databases are split and saved across multiple cloud storage locations.

• Data deduplication is widely adopted as a data-redundancy removal technique. It identifies similar data, stores a single copy, and keeps references to that copy rather than storing complete duplicates.
• Deduplication is generally allowed for entire volumes on AFF systems, either per-aggregate or per-volume, and it can run as a background process or online to amplify the storage savings.
• The data deduplication technique involves a few critical procedures: chunking, hashing, and comparison of hashes to handle redundancy. The chunking procedure breaks a file into various small pieces, called chunks.
• One work secures the data deduplication system with the MECC algorithm in a collaborative cloud environment; convergent encryption is a suitable choice for implementing deduplication over encrypted data in the cloud domain (Shynu et al., 2020).
• Data deduplication removes extra copies of data and reduces data storage capacity requirements. It is a low-data-loss technique that is effective for online procedures and can also run as a background process to increase savings.
• Data deduplication offers various benefits such as protocol independence, minimum overhead, application independence, byte validation, low cost, collaborative storage efficiency, archive storage, and more. Deduplication compares every object with the data set and removes objects that are already present, keeping only the blocks that differ (Tarwani and Chug, 2020).

A typical block-level deduplication procedure works as follows (a code sketch follows below):

• Divide the input data into blocks.
• Evaluate a hash value for every block.
• Use the hash values to determine whether an identical block has already been stored.
• Replace duplicate data with references to the data already available in the database.

The procedure associated with data deduplication can be implemented in several ways: duplicate data can be identified by comparing files, and the appropriate action taken when required. The general methods for implementing data deduplication are:

• Sub-block hashing
• Sub-block versioning
• File-level versioning
• File-level hashing
• File-level comparison

The current state of data storage in companies leads to storage problems, and there have been many proposals for addressing the situations in which a company faces data storage issues (Anand et al., 2017). According to Xue Yang (Yang et al., 2017), secure data deduplication helps address these issues and can be used to reduce communication overhead on cloud storage devices.
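To make the block-level procedure above concrete, the following is a minimal, self-contained Python sketch. It is an illustration only, not the scheme used by any of the cited systems: it uses fixed-size blocks, SHA-256 hashes, and an in-memory dictionary in place of a real block store, and the input file name is a placeholder.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size in bytes


def deduplicate(path, store):
    """Split a file into fixed-size blocks and keep each unique block only once.

    `store` maps a block's SHA-256 digest to the block's bytes. The function
    returns the list of digests (a "recipe") needed to rebuild the file.
    """
    recipe = []
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            digest = hashlib.sha256(block).hexdigest()
            if digest not in store:      # new block: store one copy
                store[digest] = block
            recipe.append(digest)        # duplicates become mere references
    return recipe


def rebuild(recipe, store):
    """Reassemble the original byte stream from the stored unique blocks."""
    return b"".join(store[d] for d in recipe)


if __name__ == "__main__":
    store = {}
    recipe = deduplicate("example_backup.bin", store)  # placeholder input file
    print(f"{len(recipe)} blocks referenced, {len(store)} unique blocks stored")
```

If the same file, or another file sharing many blocks, is deduplicated into the same store, only the new blocks add to the storage footprint; everything else is kept as references.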
Xue et al. (2017) further argue that the common methods in the industry are vulnerable to brute-force attacks, where logins and personal data can be cracked, and that encryption standards should therefore be part of any secure deduplication scheme. According to Naresh et al. (2018), Hadoop-based methods have proved to be among the best for secure deduplication in cloud systems; Naresh explores content-defined chunking (CDC) and provides a CDC-based solution for the duplication problem in cloud computing. A number of approaches can be used to address duplication, one of them being the Two Thresholds Two Divisor (TTTD-P) CDC algorithm, which helps minimise computing operations in the common case. IoT devices integrated with cloud operations also face a major data duplication issue, and they are arguably the largest area where duplication needs to be addressed, because these devices operate in real time and can collect large datasets in a very short span of time. If the issue is not addressed, the storage can be used up very quickly, as stated by Yan et al. (2020). For example, a CCTV camera left to record continuously can consume space massively, and weather stations collect large amounts of data at a time, so both cases need to be handled carefully.
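Since the review mentions content-defined chunking, the sketch below illustrates the general CDC idea with a simplified Rabin-style rolling hash. It is not the TTTD-P algorithm from the cited work, and the window size, average chunk size, and chunk-size limits are illustrative assumptions.

```python
import hashlib


def cdc_chunks(data: bytes, window=48, avg=4096, min_chunk=2048, max_chunk=16384):
    """Yield variable-size chunks whose boundaries depend on content, so an edit
    early in the data does not shift every later chunk boundary."""
    base, mod = 257, (1 << 31) - 1
    base_w = pow(base, window - 1, mod)                # weight of the oldest windowed byte
    start, h = 0, 0
    for i, byte in enumerate(data):
        if i - start >= window:
            h = (h - data[i - window] * base_w) % mod  # drop the byte leaving the window
        h = (h * base + byte) % mod                    # add the byte entering the window
        length = i - start + 1
        if (length >= min_chunk and h % avg == 0) or length >= max_chunk:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]


def chunk_digests(data: bytes):
    """Hash every chunk; equal digests indicate duplicate chunks across files."""
    return [hashlib.sha256(c).hexdigest() for c in cdc_chunks(data)]


if __name__ == "__main__":
    payload = b"weather-station-reading;" * 5000   # synthetic repetitive data
    digests = chunk_digests(payload)
    print(f"{len(digests)} chunks, {len(set(digests))} unique")
```

Because boundaries are chosen from the content itself, duplicate regions in IoT-style streams (repeated sensor readings, recurring camera metadata, and so on) tend to produce identical chunks even when the surrounding data changes.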

4. Methodology and Solution for Deduplication

4.1 Introduction

Big data analytics is performed on data of high volume, variety, and velocity, so it requires a powerful computation environment and storage capability. Normal systems are not capable of handling such data even with a smart configuration, so different advanced techniques are used for big data analytics. One of these techniques is to perform the analytics in a Hadoop environment and to store and host the data on a cloud platform such as AWS. In our analysis we used data from online sources to help in our project: the data was downloaded from the data.gov website, and it has been used by a number of researchers with different approaches to cloud solutions. AWS is one of the most secure and efficient cloud services available, providing good storage and computation facilities, which is why many organisations prefer to use cloud platforms instead of building their own server rooms.

4.2 Hadoop in AWS

4.2.1 Introduction to Hadoop

Hadoop is a big data environment that makes big data access easy by splitting the data into chunks and processing them rather than dealing with everything at once, so the computations become easy and fast. The AWS cloud platform provides various features, and a Hadoop environment is one of them: the AWS Hadoop environment makes the work more efficient by distributing the computation across chunks, which makes data processing easier (How to analyse big data with Hadoop - Amazon Web Services (AWS), 2021). Besides, integrating a Hadoop environment with AWS is straightforward. You can bring up another Hadoop cluster dynamically and quickly, or add workers to your current Amazon EMR cluster, substantially reducing the time it takes to make resources available to your users and data scientists. Using Hadoop on the AWS platform can significantly increase your organisational agility by lowering the cost and time it takes to allocate resources for experimentation and development. Hadoop setup, networking, server installation, security configuration, and ongoing administrative maintenance can be a complicated and challenging activity; as a managed service, Amazon EMR takes care of your Hadoop infrastructure requirements so you can focus on your core business. You can easily integrate your Hadoop environment with other services such as Amazon S3, Amazon Kinesis, Amazon Redshift, and Amazon DynamoDB to enable data movement, workflows, and analytics across the many diverse services on the AWS platform. Moreover, you can use the AWS Glue Data Catalog as a managed metadata repository for Apache Hive and Apache Spark (How to create a Hadoop Cluster for free in AWS Cloud? 2021).
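As a concrete illustration of provisioning Hadoop through Amazon EMR, the following hedged sketch uses the boto3 SDK. The release label, instance types, counts, cluster name, and S3 log path are placeholder assumptions, not values from this project, and the call requires valid AWS credentials and the default EMR roles.

```python
import boto3

# Hypothetical parameters: adjust region, release, roles, and bucket to your account.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="dedup-hadoop-cluster",                # assumed cluster name
    ReleaseLabel="emr-6.4.0",                   # EMR 6.x ships Hadoop 3
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    LogUri="s3://my-dedup-logs/emr/",           # placeholder S3 bucket
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",          # default EC2 instance profile
    ServiceRole="EMR_DefaultRole",              # default EMR service role
    VisibleToAllUsers=True,
)

print("Started cluster:", response["JobFlowId"])
```

Equivalent clusters can be created from the console or the AWS CLI; the point of the sketch is simply that provisioning Hadoop on EMR is an API call rather than a manual server build.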

Many Hadoop jobs are spiky in nature. For example, an ETL job can run hourly, daily, or monthly, while modelling jobs for financial firms or genetic sequencing may happen only a few times a year. Using Hadoop on Amazon EMR allows you to spin up these workload clusters easily, save the results, and shut down your Hadoop resources when they are no longer required, avoiding unnecessary infrastructure costs. EMR 6.x supports Hadoop 3, which allows the YARN NodeManager to launch containers either directly on the EMR cluster or inside a Docker container; see the AWS documentation to find out more.

By using Hadoop on AWS EMR, you have the flexibility to launch your clusters in any number of Availability Zones in any AWS Region. A potential issue or risk in one Region or zone can easily be worked around by launching a cluster in another zone within minutes. Capacity planning before deploying a Hadoop environment can often result in expensive idle resources or resource constraints; with Amazon EMR, you can create clusters with the required capacity in minutes and use EMR Managed Scaling to dynamically scale nodes out and in (How To Create a Hadoop Cluster in AWS -- Virtualization Review, 2021). Hadoop with AWS is used by many organisations to perform big data analytics, which then helps businesses know their customers, users, and visitors better; the analytics give them great insights into their businesses and equip them with data-driven results. The AWS Hadoop environment is a great medium for big data analytics because it also provides several built-in analytics techniques, including visualisations, statistical analysis tools, and many others, so the integration makes the big data analytics process smooth and trouble-free. Besides, AWS provides machine learning algorithms and their respective libraries as part of the cloud service, which helps companies host their machine learning models and use them to make predictions on incoming live data. Since the cloud platform offers ample storage, the storage and mathematical computation for incoming data become easier, and live training on newly arriving data is also easier with the Hadoop AWS service: even if the incoming data has high volume and diversity, the cloud can handle it while a normal system cannot. A normal system has its own limitations, and so does a cloud platform, but the limitations of cloud services can be resolved within a short time simply by upgrading the plan. Cloud providers offer different kinds of plans with different features and facilities, and these plans vary between providers; among the plans and platforms available, AWS provides the most beneficial facilities for big data analytics, which is why it is the most preferred cloud platform for this work (Apache Hadoop Amazon Web Services support – Hadoop-AWS module: Integration with Amazon Web Services, 2021). Hadoop enables a series of things, such as easy analysis of large datasets: a large dataset needs to be well analysed before storage, and Apache Hadoop offers processing of big datasets such as those for a large customer base, where processing the information manually would be hard; Hadoop has proven effective for this.

4.2.2 Components of Hadoop

Apache Hadoop is an open-source project. Amazon EMR installs and configures Hadoop on the user's clusters, i.e. the clusters used for the user's datasets. As we know, Amazon EMR is basically used for the analysis of big data, so Hadoop helps in the process of analysing and filtering the datasets.

The figure above shows Hadoop on AWS. On many occasions it is used together with Amazon S3 for the processing of large data, and launching it requires only a few simple steps for the analysis. The components of Hadoop include the following:

• MapReduce, Tez, and YARN (Yet Another Resource Negotiator). The main role of these components is to process the workloads. The workload is processed by breaking the information received from the cloud into smaller pieces that can be analysed well. The design assumes that at some point a machine might fail, so there must be a way of resolving the issue: if, for any reason, a machine fails, Hadoop is alerted and the workload is rerun on another machine (Adamov, 2018). MapReduce and Tez jobs can be written in programming languages such as Java, so it is easy to implement big data processing that tackles the duplication problem; a minimal Python example using Hadoop Streaming is sketched after this list.
• Storage with Amazon S3 and EMRFS. Amazon S3 is cheap and simple to use for data processing. Hadoop can use S3 through the EMR File System (EMRFS): S3 handles the data encryption, and the EMR file system can read and write data from S3, so processing becomes easy once Hadoop is integrated (Persico et al., 2016).
• HDFS for on-cluster storage. The Hadoop Distributed File System (HDFS) stores data and information in blocks on the local disks of the cluster and handles block replication automatically, which makes duplicated data easier to track. In most cases people prefer to use HDFS together with Amazon S3 for data processing (Wiktorski, 2019), which is very important for solving the duplication problem.
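To make the MapReduce idea concrete, here is a hedged sketch of exact-duplicate removal with Hadoop Streaming, written in Python so that it could run on a cluster like the one described above. The record format (one record per line) and the file names mapper.py and reducer.py are assumptions for illustration.

```python
#!/usr/bin/env python3
# mapper.py -- emit "<hash>\t<record>" so identical records share the same key.
import hashlib
import sys

for line in sys.stdin:
    record = line.rstrip("\n")
    if record:
        key = hashlib.md5(record.encode("utf-8")).hexdigest()
        print(f"{key}\t{record}")
```

```python
#!/usr/bin/env python3
# reducer.py -- keys arrive sorted, so keep only the first record for each hash.
import sys

last_key = None
for line in sys.stdin:
    key, _, record = line.rstrip("\n").partition("\t")
    if key != last_key:          # first time we see this hash: keep the record
        print(record)
        last_key = key           # later lines with the same hash are duplicates
```

A job of this shape could be submitted with the standard Hadoop Streaming jar (the exact jar path depends on the installation), passing mapper.py and reducer.py as the mapper and reducer and pointing the input and output at HDFS or S3 paths.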

4.2.3 Starting Hadoop

Hadoop can be configured on AWS to help in the process of data filtering. Hadoop runs on EC2 instances, so we have to launch an EC2 instance and select the virtual machine we intend to use in the process. We use a single node for testing and configuring the Hadoop setup. To install or launch an instance we follow the general procedure for starting an EC2 instance. The following steps are used to launch Hadoop for our clusters.

Step 1: Launching the instance. Launching the instance involves visiting the AWS console and accessing the EC2 service, as shown below.

The figure above shows the starting of the EC2 instance. From the figure there are a number of virtual machines we can launch, but we selected Ubuntu Server 20.04, which is one of the latest images available. The good thing about AWS is that we can enjoy the use of the latest platforms and machines at a very low cost (Nguyen, 2017). For example, purchasing a physical Ubuntu 20.04 server would cost thousands of dollars, yet with AWS we can access one at a very low cost and work with it online.

The figure above shows the selection of the instance type. For our case we selected t2.micro, the type commonly used for this kind of analysis. The same launch can also be scripted, as sketched below.
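The console steps can be reproduced programmatically. The following is a minimal boto3 sketch under stated assumptions: the AMI ID, key-pair name, and security-group ID are placeholders that must be replaced with real values from your account.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch four t2.micro instances from a (placeholder) Ubuntu Server 20.04 AMI.
response = ec2.run_instances(
    ImageId="ami-xxxxxxxxxxxxxxxxx",       # placeholder Ubuntu 20.04 AMI for the region
    InstanceType="t2.micro",
    MinCount=4,
    MaxCount=4,
    KeyName="hadoop-key",                  # assumed key-pair name (the downloaded .pem)
    SecurityGroupIds=["sg-xxxxxxxx"],      # placeholder security group
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "datanode"}],
    }],
)

for inst in response["Instances"]:
    print("Launched:", inst["InstanceId"])
```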

The instance network is also secured with IPv6 rules.

The figure above shows the configuration details of the instance. We used the default settings, and the number of instances launched is four. The next figure shows the storage configuration, which is where the storage size of the machine being launched is specified; we used a size of 8 GB.

The figure shows the security group settings. We used general settings that allow all IP addresses to interact with the machine; as can be seen, we have allowed all traffic to the system.

Step 2: Connecting to the instance. Connecting is simply done by downloading the key file, as shown below.

The figure above shows downloading the key pair used to connect to the virtual machine. The instance connection can now be made with the downloaded key pair, and the key pair can be shared with other users of the system so that they can connect to the instance remotely.

The figure above shows the running instances among those already created. We can see that the four instances we created for the Ubuntu server are running well.

The figure shows the naming of the instances. We name them as data nodes since each instance will be used as a specific DataNode. The next figure shows the configuration file we need to add in order to use the virtual machines on AWS effectively; in it, the nodes are referenced as node, node1, node2, and node3. Once everything is in place we can configure HDFS on the VMs using the script shown in the figure, and after HDFS has been configured we also need to configure MapReduce on the nodes.

The figure above shows the starting of the Hadoop cluster. Now that the system is fully configured we can access it via the GUI. The Hadoop instance is easy to use and can work with real hosted datasets to perform analysis and solve the problem of multiple data duplication.

4.3 Modelling the solution

4.3.1 Introduction

Data duplication, as we have discussed, can be addressed with general storage techniques that ensure we have enough storage for the data, and a number of tools can help in modelling the solution. We concentrated on using a Jupyter notebook for the solution, since it allows AI methods to be applied to the common scenarios. We used the dataset we had collected and first performed filtering on it. In modelling the solution we need to come up with an algorithm that detects duplicated data in the cloud platform or dataset (Zhao et al., 2018); detection of the duplicated data is the key thing to work on first. For the case of a login system with many customers operating online, users might duplicate data and resubmit it by mistake, so the system should be able to detect the time span between resubmissions and check for similarity. To achieve this we employ a detection algorithm. A user inputs information into the system through the input dashboard of a typical MIS; the input could be a comment page, feedback, or any other information the user intends to share with the server. For the case of a file system this is harder, since the system has to compare the information it already stores against what is being uploaded and check, using some algorithm, that no files are direct copies. For example, it can check things like:

• Plagiarism of the uploaded document or text
• Document similarity, using online platforms such as draftable.com
• The headers of the documents
• The naming of the documents

When the algorithm has run these checks, it confirms whether there is any similarity with existing data from other users. The database can be designed around the algorithm so that when a user sends information similar to another person's, it is referred to with the same index; when we later need to retrieve the response or file for a client we can use an ID instead, which saves database space. A minimal sketch of this idea is shown below.
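The following is a minimal sketch of the index idea described above, assuming submissions arrive as plain text: each submission is hashed, and a repeated submission is mapped to the existing record ID instead of being stored again. The class and field names are hypothetical, and the in-memory dictionaries stand in for the real database.

```python
import hashlib
import time


class SubmissionStore:
    """Toy store that keeps one copy per distinct submission and returns an ID."""

    def __init__(self, resend_window_seconds=60):
        self.by_hash = {}      # content hash -> record ID
        self.records = {}      # record ID -> (content, first-seen timestamp)
        self.window = resend_window_seconds
        self._next_id = 1

    def submit(self, content: str):
        """Return (record_id, status) where status is 'new', 'duplicate', or 'quick-resend'."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        now = time.time()
        if digest in self.by_hash:                      # same content seen before
            rec_id = self.by_hash[digest]
            _, first_seen = self.records[rec_id]
            status = "quick-resend" if now - first_seen <= self.window else "duplicate"
            return rec_id, status
        rec_id = self._next_id                          # genuinely new submission
        self._next_id += 1
        self.by_hash[digest] = rec_id
        self.records[rec_id] = (content, now)
        return rec_id, "new"


store = SubmissionStore()
print(store.submit("great service, thanks"))   # (1, 'new')          -- stored once
print(store.submit("great service, thanks"))   # (1, 'quick-resend') -- same ID reused
```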

4.3.2 Algorithm design

The blocking algorithm checks the user inputs and filters them using common comparison methods against the database. When there is similarity, the information is referred to by an ID: the algorithm groups the similar objects and stores them in the database as one object that can be retrieved with that ID. Once the information leaves the blocking algorithm it goes to the encryption algorithm; the encryption is for the security of a common input in the system. The algorithm searches for similarity in both the local space and the global space so as to optimise storage, as represented by the diagram below:

Figure 4 shows the proposed solution for the duplication problem. As discussed earlier, the algorithm searches for similarity on both the local and the global servers. It first goes through the local space to check whether the information already exists; if it does, it records the duplicate and drops the incoming information, using a pointer as an ID (Posadas et al., 2020). If it does not exist, the model checks the global storage for any similarity; if a match is detected the same thing is done, and if not, the information is saved on the server with a new ID. In this way the duplication problem is well addressed. We can also use a flow chart to explain how the server is traversed to check for similarity; the flow chart is shown below, and a minimal code sketch of this two-level lookup follows the flow-chart discussion.

Figure 6 shows the steps the model uses to check the similarity of the user's input. From the figure we can see that the first step is the user input: if the information is already available it is recorded as available; otherwise, the model checks the global servers for similarity, and if it exists there the record is also marked as available, so duplication is avoided. Data clustering is important in this case and is well addressed, as can be seen in figure 6. In order to filter the similarities we use clusters: the algorithm can divide the dataset or the information on the server into clusters for ease of analysis and to increase the speed of operation. Here we need clusters to simplify the model and the traversal process.
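The local-then-global lookup from figures 4 and 6 can be sketched as follows. The two dictionaries stand in for the local cluster index and the shared global index, and the function and variable names are illustrative rather than taken from the actual model.

```python
import hashlib

local_index = {}    # per-cluster index: content hash -> record ID
global_index = {}   # shared index across clusters: content hash -> record ID
next_id = 0


def store_record(content: str) -> int:
    """Return the ID under which this content is stored, deduplicating on the way."""
    global next_id
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()

    if digest in local_index:            # 1) hit in the local space: reuse the pointer
        return local_index[digest]

    if digest in global_index:           # 2) hit in the global space: cache the pointer locally
        rec_id = global_index[digest]
        local_index[digest] = rec_id
        return rec_id

    rec_id = next_id                     # 3) genuinely new: store it under a fresh ID
    next_id += 1
    local_index[digest] = rec_id
    global_index[digest] = rec_id
    return rec_id


print(store_record("2021-05-01,Jane,Doe"))   # new record -> 0
print(store_record("2021-05-01,Jane,Doe"))   # duplicate  -> 0 again, nothing stored twice
```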

Figure 7 shows the architectural design of the model. The model uses two clusters to analyse the information: we first search the two local clusters before going to the common space allocated to the system at hand (Onsongo et al., 2020).

4.3.4 The model

The model uses unsupervised learning with the K-means clustering method. This is one of the simplest ways to implement the solution, and in our view it is reliable enough to solve this kind of problem.

Filtering

In order to start using the dataset we have to understand what it contains. To do that we look at the header and list the first ten rows of the dataset. This is easy to do with the language used in the notebook, as shown in the figure below.

Figure 7 shows the preview used to understand more about the dataset we used. From the figure we can see that the dataset has columns for the names, the ID, the date of birth, and the gender, and that some of this information is duplicated. For example, the rows with ID=3 and ID=4 have the same date of birth and the same information, and the same applies to ID=0: this is the same person duplicated in our dataset (Onsongo et al., 2020). Such duplicates take up more space, and if one person can appear in the system three times we will end up with a very large amount of storage being used.

Figure 8 shows the description of the dataset, where we check how many unique values each column has. The number of unique rows is 37, the number of unique dates of birth is 24, the gender has 2 unique values (male or female, as expected under normal circumstances), and the first name has 28 unique values. This information helps us understand the number of distinct users of the system.
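Since the notebook itself is only shown as figures, here is a hedged sketch of this inspection step with pandas. The file name people.csv and the column names (id, first_name, last_name, date_of_birth, gender) are assumptions about the dataset layout.

```python
import pandas as pd

# Hypothetical file and column names; adjust to the actual dataset.
df = pd.read_csv("people.csv")

print(df.head(10))       # first ten rows, to understand the columns and spot duplicates
print(df.nunique())      # unique values per column (names, dates of birth, gender, ...)

# Number of rows that repeat an earlier person exactly.
print(df.duplicated(subset=["first_name", "last_name", "date_of_birth", "gender"]).sum())
```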

Splitting dataset

In order to use the dataset we first split it. For our case we split the date of birth, on the assumption that two records sharing the same date of birth is unlikely unless the dataset was collected for a specific age group.

Figure 9 shows the splitting of the dates into day, month, and year, which is very clear from the image. We also encoded the gender as either male or female.
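A sketch of this splitting step, again under the assumed column names from the earlier snippet: the date of birth is split into day, month, and year columns, and the gender is mapped to a numeric flag.

```python
import pandas as pd

df = pd.read_csv("people.csv")  # hypothetical file from the earlier sketch

# Split date_of_birth (assumed to be in a parseable format) into numeric parts.
dob = pd.to_datetime(df["date_of_birth"], errors="coerce")
df["dob_day"] = dob.dt.day
df["dob_month"] = dob.dt.month
df["dob_year"] = dob.dt.year

# Encode gender as a simple 0/1 flag (male/female, as in the dataset description).
df["gender_code"] = df["gender"].str.strip().str.lower().map({"male": 0, "female": 1})

print(df[["dob_day", "dob_month", "dob_year", "gender_code"]].head())
```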

Figure 10 shows the vectorisation of the names. It is common for people to share a first or last name, but it is far less common for all of a person's names to match another's, and the probability of a true duplicate rises in such situations. With our model we therefore generalised and performed a vectorisation of the first and last names, as can be seen in figure 10.
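One common way to vectorise names, sketched here as an assumption about what figure 10 shows, is character n-gram TF-IDF, which makes near-identical spellings produce similar vectors.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv("people.csv")  # hypothetical file from the earlier sketches

# Character bigrams/trigrams are robust to small spelling differences in names.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 3))
first_name_vectors = vectorizer.fit_transform(df["first_name"].fillna(""))
last_name_vectors = TfidfVectorizer(analyzer="char", ngram_range=(2, 3)).fit_transform(
    df["last_name"].fillna("")
)

print(first_name_vectors.shape, last_name_vectors.shape)  # (rows, n-gram features)
```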

Manual feature design

It is important to give our model features that allow it to differentiate the names well. The features should let it check the names carefully and effectively distinguish one from another. The scikit-learn library (Popov et al., 2020) is very useful for building such features from the dataset, as we can see below.

Figure 11 shows the manual features defined in the model. From the figure we can see that we defined a name-length feature as the combined length of the first and last name, and created column labels for the first and last names. We then computed a similarity coefficient between a given first name and another first name, in short, how similar one first name is to the other, and did the same for the last names.
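A hedged sketch of such manual features follows, reusing the hypothetical people.csv layout: the total name length and a pairwise similarity coefficient between name vectors. Cosine similarity is used here as one reasonable choice; the actual notebook may use a different measure.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.read_csv("people.csv")  # hypothetical file from the earlier sketches

# Manual feature 1: combined length of first and last name.
df["name_length"] = df["first_name"].str.len() + df["last_name"].str.len()

# Manual feature 2: similarity coefficient between first names (likewise for last names).
vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 3))
first_vecs = vec.fit_transform(df["first_name"].fillna(""))
first_name_similarity = cosine_similarity(first_vecs)   # rows x rows matrix

print(df["name_length"].head())
print(first_name_similarity[3, 4])   # e.g. similarity of record 3's first name to record 4's
```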

4.3.5 The evaluation of the model

Once the splitting is done we can evaluate our model. One of the important measures used in the evaluation is the Silhouette Coefficient. This measure is computed from the intra-cluster distance and the nearest-cluster distance: taking the mean intra-cluster distance as a and the mean nearest-cluster distance as b, the coefficient is (b - a) / max(a, b). With the scikit-learn library we can call the method directly without having to apply the formula ourselves, as we would with the purely mathematical approach (Yang et al., 2020).

Figure 12 shows the use of the silhouette method in clustering and in evaluating the model. From the figure we can see that we have to import the silhouette function and then use it in our model to perform the evaluation, applying the clustering method to find the similarity, as can be seen in the model.
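Since the notebook code is only shown as figures, here is a hedged sketch of the silhouette evaluation under stated assumptions: the engineered features are assumed to be collected in a file people_features.csv with the columns built in the earlier sketches, and the range of k values is arbitrary.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("people_features.csv")  # hypothetical file with the engineered features
feature_cols = ["name_length", "dob_day", "dob_month", "dob_year", "gender_code"]  # assumed
X = StandardScaler().fit_transform(df[feature_cols].fillna(0))

for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    score = silhouette_score(X, labels)   # (b - a) / max(a, b), averaged over samples
    print(f"k={k}: silhouette={score:.3f}")
```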

Figure 13 shows the use of k-means clustering in the evaluation of the model. From the figure we can see that we used the clustering to perform our evaluation; once the evaluation has been done, the results are saved to an external file, which can be opened to view the results of the solved duplication problem.
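Continuing under the same assumptions, the following sketch fits the final clustering, flags duplicates within each cluster, and writes the results to an external file as the figure describes; the chosen k and the output file name are assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("people_features.csv")  # hypothetical engineered-feature file
feature_cols = ["name_length", "dob_day", "dob_month", "dob_year", "gender_code"]  # assumed
X = StandardScaler().fit_transform(df[feature_cols].fillna(0))

df["cluster"] = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X)  # assumed k=4

# Within each cluster, keep the first record and mark the rest as duplicates.
df["is_duplicate"] = df.duplicated(
    subset=["cluster", "first_name", "last_name", "dob_year", "gender_code"], keep="first"
)

deduplicated = df[~df["is_duplicate"]]
deduplicated.to_csv("deduplicated_people.csv", index=False)   # the external results file
print(f"kept {len(deduplicated)} of {len(df)} records")
```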

Figure 14 shows the output of the model once the evaluation has been done. Taking the first five elements of the results, we can see that each one is completely different from the others, so we can confirm we are good to go. Figure 14, the graph of the raw data before the duplication problem was solved, shows the dataset before deduplication; from that graph we can see that the last names were distributed very differently from the other fields.

Figure 15 shows the graph after the duplication problem has been solved on the dataset. From the figure we can see that the first and last names are now well laid out and fairly balanced compared with figure 14, a clear indication that the solution worked well on the dataset. Some values, such as gender, are not expected to change much, since gender is only male or female.

5. Conclusion

There are a number of ways to solve the problem of duplication, as discussed. For cloud-based systems we can consider using Hadoop, since it makes processing large datasets easy, and it should be used together with EMR or Amazon S3; this is the most effective combination for solving the data duplication problem. It is true that the problem of data duplication will never end, since people need to insert information and can resend it by mistake, so to keep enough space in the system it is important to update it regularly and process the datasets to remove duplicates. A company can consider the premium options for Hadoop in conjunction with Amazon S3, which is cost effective. In our research we have also gone deep into designing a model in Jupyter that can be used to process the dataset and solve the duplication problem: we analysed the datasets we collected, used them to build the model, tested it, and the results were all correct. Such a model can be used as a prototype on the way to solving the duplication problem. Some of the advantages of using Hadoop with EMR are as follows:

• Good speed of operation and agility. Development effort is reduced because AWS provides many services related to cloud computing and data analysis. Starting the servers is much faster and simpler, and the same applies to provisioning resources and configuration; at times the configuration can even be automated with infrastructure as code (IaC).
• Reduced management complexity. EMR handles all the other components related to configuration and the surrounding infrastructure, which allows the developer and the analyst to concentrate on the core of the business: analysing the datasets and removing duplication from the information collected by the server.
• Integration with additional services. To get more analytics and understand the data better, Hadoop can always be used with other services such as Amazon S3, Redshift, and Kinesis. These services are easy to use since they are on the same platform, so implementation gets easier because all the services are available.
• Disaster recovery. When data is lost it is hard to retrieve. Consider a scenario where we are performing the analysis and happen to lose the data we are using: that is where Hadoop on AWS comes in, since with Amazon we can easily launch instances in another location depending on the availability zone, which makes it easy to recover the information.

The assessment was completed within the given timeline and everything was done accordingly. However, I still feel that, given some additional time, I would have addressed the issue of duplication in more depth, as it is a major issue on servers; it does not only happen on servers but even on mobile phones, where we have to delete one file after another to resolve the duplication of files. The module assessment was well accomplished, and I believe we will deal with similar things in the classroom, such as how to integrate AI and ML models into cloud platforms. As I went through AWS I realised that big data analysis and the development of machine learning algorithms are highly supported, and I believe we will deal with something of the kind in the next module.

6. References

Amazon Web Services, Inc., 2021. How to analyze big data with Hadoop - Amazon Web Services (AWS). [online] Available at: https://aws.amazon.com/getting-started/hands-on/analyze-big-data/ [Accessed 20 August 2021].
Ebinazer, S.E. and Savarimuthu, N., 2020. An efficient secure data deduplication method using radix trie with bloom filter (SDD-RT-BF) in cloud environment. Peer-to-Peer Networking and Applications, pp.1-9.
Hadoop.apache.org, 2021. Apache Hadoop Amazon Web Services support – Hadoop-AWS module: Integration with Amazon Web Services. [online] Available at: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html [Accessed 20 August 2021].
Medium, 2021. How to create a Hadoop Cluster for free in AWS Cloud?. [online] Available at: https://medium.com/analytics-vidhya/how-to-create-a-hadoop-cluster-for-free-in-aws-cloud-a95154980b11 [Accessed 20 August 2021].
Shynu, P.G., Nadesh, R.K., Menon, V.G., Venu, P., Abbasi, M. and Khosravi, M.R., 2020. A secure data deduplication system for integrated cloud-edge networks. Journal of Cloud Computing, 9(1), pp.1-12.
Tarwani, S. and Chug, A., 2020. Assessment of optimum refactoring sequence to improve the software quality of object-oriented software. Journal of Information and Optimization Sciences, 41(6), pp.1433-1442.
Virtualization Review, 2021. How To Create a Hadoop Cluster in AWS. [online] Available at: https://virtualizationreview.com/articles/2019/05/14/how-to-create-a-hadoop-cluster-in-aws.aspx [Accessed 20 August 2021].
Bhalerao, A. and Pawar, A., 2017, May. A survey: On data deduplication for efficiently utilizing cloud storage for big data backups. In 2017 International Conference on Trends in Electronics and Informatics (ICEI) (pp. 933-938). IEEE.
Kumar, N. and Jain, S.C., 2019. Efficient data deduplication for big data storage systems. In Progress in Advanced Computing and Intelligent Engineering (pp. 351-371). Springer, Singapore.
Shin, Y., Koo, D. and Hur, J., 2017. A survey of secure data deduplication schemes for cloud storage systems. ACM Computing Surveys (CSUR), 49(4), pp.1-38.
Persico, V., Montieri, A. and Pescape, A., 2016, October. On the network performance of Amazon S3 cloud-storage service. In 2016 5th IEEE International Conference on Cloud Networking (Cloudnet) (pp. 113-118). IEEE.
Yang, X., Lu, R., Choo, K.K.R., Yin, F. and Tang, X., 2017. Achieving efficient and privacy-preserving cross-domain big data deduplication in cloud. IEEE Transactions on Big Data.
Aujla, G.S., Chaudhary, R., Kumar, N., Das, A.K. and Rodrigues, J.J., 2018. SecSVA: secure storage, verification, and auditing of big data in the cloud environment. IEEE Communications Magazine, 56(1), pp.78-85.
Yan, J., Wang, X., Gan, Q., Li, S. and Huang, D., 2020. Secure and efficient big data deduplication in fog computing. Soft Computing, 24(8), pp.5671-5682.
Popov, N.V., Razmochaeva, N.V. and Klionskiy, D.M., 2020, June. Investigation of algorithms for converting dimension of feature space in retail data analysis problems. In 2020 9th Mediterranean Conference on Embedded Computing (MECO) (pp. 1-4). IEEE.
Adamov, A., 2018, October. Large-scale data modelling in Hive and distributed query processing using MapReduce and Tez. In DiVAI 2018 - Distance Learning in Applied Informatics.
Yang, Y., Hao, X., Zhang, L. and Ren, L., 2020. Application of scikit and keras libraries for the classification of iron ore data acquired by laser-induced breakdown spectroscopy (LIBS). Sensors, 20(5), p.1393.
Posadas, H., Merino, J. and Villar, E., 2020, November. Data flow analysis from UML/MARTE models based on binary traces. In 2020 XXXV Conference on Design of Circuits and Integrated Systems (DCIS) (pp. 1-6). IEEE.
Zhao, S., Talasila, M., Jacobson, G., Borcea, C., Aftab, S.A. and Murray, J.F., 2018, December. Packaging and sharing machine learning models via the Acumos AI open platform. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA) (pp. 841-846). IEEE.
Onsongo, G., Lam, H.C., Bower, M. and Thyagarajan, B., 2020, September. Hadoop-CNV-RF: A scalable copy number variation detection tool for next-generation sequencing data. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (pp. 1-8).
Onsongo, G., Lam, H.C., Bower, M. and Thyagarajan, B., 2020. Hadoop-CNV-RF: a clinically validated and scalable copy number variation detection tool for next-generation sequencing data.
Wiktorski, T., 2019. Hadoop 101 and reference scenario. In Data-intensive Systems (pp. 19-30). Springer, Cham.
Nguyen, T.L., 2017. Setting up a Hadoop system in cloud: A lab activity for big data analytics. In Proceedings of the EDSIG Conference (Vol. 2473, p. 3857).