Author: Vidhya tharani/Monday, June 24, 2019/Categories: General
Big Data and Overcoming QA challenges in Big Data Testing
Big Data: we have all heard the term, and everyone has been talking about it for the last four to five years. But do you really know what Big Data is, how it affects our lives, and why organizations are hunting for professionals with Big Data skills?
The quantity of data on Earth is growing exponentially for many reasons. Our daily activities and various other sources generate lots of data, and almost everything we do leaves a digital trace on the web. Smart objects have increased the growth rate even further. Major sources of Big Data include social media sites, sensor networks, digital images and videos, cell phones, purchase transaction records, web logs, medical records, archives, military surveillance, eCommerce and complex scientific research.
All this information amounts to quintillions of bytes of data. By 2020, data volumes are expected to reach around 40 Zettabytes, a figure said to be equivalent to every single grain of sand on the planet multiplied by seventy-five.
What is Big Data?
Big Data is a term used for a collection of data sets so large and complex that they are difficult to store and process using available database management tools or traditional data processing applications. The challenges include capturing, organizing, storing, searching, sharing, transferring, analyzing and visualizing this data.
In the past, storing data on our own servers was fine because the volume of data was limited and the time needed to process it was acceptable. In today's technological world, however, data is growing too fast and people rely on it heavily. At the speed at which data is growing, it is becoming impossible to store it all on any single server.
5 V’s in Big Data:
Volume refers to the ‘amount of data’, which is growing day by day at a very fast pace. The size of the data generated by humans, machines and their interactions on social media alone is massive. Researchers have predicted that 40 Zettabytes (40,000 Exabytes) will be generated by 2020, which is an increase of 300 times from 2005.
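The figures in the Volume paragraph can be checked with a little arithmetic. The sketch below assumes decimal (SI) units, where one Exabyte is 10^18 bytes and one Zettabyte is 10^21 bytes; the 2005 value is only what the "300 times" claim implies, not a separately sourced number.

```python
# Decimal (SI) storage units, expressed in bytes (an assumption of this sketch).
EXABYTE = 10**18
ZETTABYTE = 10**21

projected_2020_bytes = 40 * ZETTABYTE

# 40 Zettabytes expressed in Exabytes.
projected_2020_eb = projected_2020_bytes // EXABYTE

# If 40 ZB is a 300-fold increase over 2005, this is the implied 2005 volume.
implied_2005_eb = projected_2020_bytes / 300 / EXABYTE

print(projected_2020_eb)          # 40000
print(round(implied_2005_eb, 1))  # 133.3
```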
Velocity is the pace at which different sources generate data every day. This flow of data is massive and continuous. There are 1.03 billion Daily Active Users (Facebook DAU) on mobile as of now, which is an increase of 22% year-over-year. This shows how fast the number of users is growing on social media and how fast data is being generated daily.
Variety refers to the different types of data being generated. Many sources contribute to Big Data, and the type of data each one generates is different: it can be structured, semi-structured or unstructured. Earlier, we used to get data from Excel and databases; now data arrives in the form of images, audio, video, sensor data and so on.
Veracity refers to doubt or uncertainty about the available data caused by inconsistency and incompleteness. Just how accurate is all this data? For example, think about all the Twitter posts with hashtags, abbreviations, typos and so on, and the reliability and accuracy of all that content. Collecting loads and loads of data is of no use if its quality or trustworthiness is poor.
After Volume, Velocity, Variety and Veracity, there is another V that should be considered when looking at Big Data: Value. Is Big Data adding to the benefits of the organizations that analyze it? Is the organization working on Big Data achieving a high ROI (Return On Investment)? Unless working on Big Data adds to their profits, it is useless.
Types of Big Data:
· Structured: Data that can be stored and processed in a fixed format is called structured data. Data stored in a relational database management system (RDBMS) is one example of ‘structured’ data. Structured data is easy to process because it has a fixed schema. Structured Query Language (SQL) is often used to manage this kind of data.
· Semi-Structured: Semi-structured data does not have the formal structure of a data model, i.e. a table definition in a relational DBMS, but it has some organizational properties, such as tags and other markers that separate semantic elements, which make it easier to analyze. XML files and JSON documents are examples of semi-structured data.
· Unstructured: Data that has an unknown form, cannot be stored in an RDBMS and cannot be analyzed unless it is transformed into a structured format is called unstructured data. Text files and multimedia content such as images, audio and video are examples of unstructured data. Unstructured data is growing faster than the other types; experts say that 80 percent of the data in an organization is unstructured.
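The three types above can be illustrated with a minimal sketch that uses only the Python standard library. The table name, field names and sample values are made up for illustration.

```python
import json
import sqlite3

# Structured: a fixed schema in a relational table, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.execute("INSERT INTO orders VALUES (1, 'alice', 19.99)")
structured_row = conn.execute(
    "SELECT customer, amount FROM orders WHERE id = 1"
).fetchone()

# Semi-structured: no table definition, but keys act as tags that mark each element.
semi_structured = json.loads('{"id": 2, "customer": "bob", "tags": ["gift", "express"]}')

# Unstructured: free text with no schema; it must be transformed before analysis.
unstructured = "Bob wrote: the parcel arrived late, but support was helpful."
word_count = len(unstructured.split())

print(structured_row)           # ('alice', 19.99)
print(semi_structured["tags"])  # ['gift', 'express']
print(word_count)               # 10
```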
Big Data testing:
Big Data testing plays a vital role in Big Data systems. If Big Data systems are not tested appropriately, the business will be affected. There are various types of testing in Big Data projects, such as database testing, infrastructure testing, performance testing and functional testing. Prime examples of Big Data systems are E-commerce sites such as Amazon, Flipkart and Snapdeal, which have millions of visitors and products.
Benefits of Big Data Testing:
1. Helps in implementing new strategies
2. Improves the cost effectiveness of storage
3. Helps meet client expectations across different large data sets
4. Helps in estimating business from structured and unstructured data
5. Helps in identifying errors instantly
6. Makes data readily available for decision making and reduces turnaround time
Top Challenges in Big Data Testing:
• Heterogeneity and Incompleteness of data
• High Scalability
• Test Data Management
Testing a huge volume of data is the biggest challenge. A decade ago, a data pool of 10 million records was considered gigantic. Today, businesses have to store petabytes or exabytes of data, extracted from various online and offline sources, to conduct their daily business. Testers are required to audit this voluminous data to ensure that it is fit for business purposes. Full-volume testing is impossible with data of this size.
How to overcome:
Parallelism is the most common approach to handling this. Databases can achieve parallelism in two ways.
In the first approach, the information is intelligently divided into segments, and different database operations are performed on them simultaneously. However, if two operations use the same information, they should run serially. This approach can be used to handle an overwhelming workload.
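The first approach can be sketched in a few lines. This is a minimal illustration, not a database implementation: Python threads stand in for parallel database operations, the records and the "negative amount" rule are hypothetical, and a lock stands in for serializing operations that touch the same information.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

# Hypothetical test data: 100 records, with one bad record planted for the demo.
records = [{"id": i, "amount": i * 1.5} for i in range(1, 101)]
records[9]["amount"] = -1.0

def split_into_segments(rows, segment_count):
    """Divide rows into roughly equal, non-overlapping segments."""
    size = -(-len(rows) // segment_count)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]

invalid_ids = []
invalid_lock = Lock()

def validate_segment(segment):
    """Check one segment; segments share no rows, so they can run concurrently."""
    bad = [row["id"] for row in segment if row["amount"] < 0]
    # The result list is shared, so updates to it are serialized with a lock.
    with invalid_lock:
        invalid_ids.extend(bad)
    return len(segment)

segments = split_into_segments(records, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    checked = sum(pool.map(validate_segment, segments))

print(checked)      # 100
print(invalid_ids)  # [10]
```

Because the segments do not overlap, only the shared result list needs protection; the validation work itself runs without coordination.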
In the second approach, the application information is intelligently segmented. NoSQL handles all kinds of data and can be used to store unstructured data. Indexing is another method for handling performance issues. In indexing, the records are sorted on multiple fields, and a separate database is required to hold the pointers to the records. NoSQL can be used to create this index.
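The indexing idea, a sorted structure kept apart from the records and holding pointers to them, can be shown with a small in-memory sketch. It assumes list positions play the role of pointers and uses made-up records; a real system would keep the index in a separate store, as the text describes.

```python
import bisect

# Records stay in arrival order; positions in this list act as "pointers".
records = [
    {"id": 3, "city": "Pune", "amount": 40.0},
    {"id": 1, "city": "Austin", "amount": 15.5},
    {"id": 2, "city": "Chennai", "amount": 72.0},
    {"id": 4, "city": "Austin", "amount": 8.0},
]

def build_index(rows, field):
    """Return sorted keys and, in the same order, the positions they point to."""
    pairs = sorted((row[field], pos) for pos, row in enumerate(rows))
    keys = [key for key, _ in pairs]
    positions = [pos for _, pos in pairs]
    return keys, positions

def lookup(rows, index, value):
    """Binary-search the sorted keys, then follow the positions to the records."""
    keys, positions = index
    start = bisect.bisect_left(keys, value)
    end = bisect.bisect_right(keys, value)
    return [rows[pos] for pos in positions[start:end]]

city_index = build_index(records, "city")
austin_ids = [row["id"] for row in lookup(records, city_index, "Austin")]
print(austin_ids)  # [1, 4]
```

One index is built per field, so sorting "on multiple fields" amounts to calling build_index once for each field that needs fast lookups.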
There are several techniques to handle the scalability issues in big data testing.
Clustering techniques: A large amount of data is distributed equally among all the nodes of a cluster. With this technique, large data files can easily be split into chunks and stored on different nodes of the cluster. The file chunks are replicated and stored on different nodes to reduce dependency on any single machine. Hadoop, a framework for distributed storage and processing, also works on clusters. Hadoop programs can easily be scaled up to larger hardware with few changes to the program.
Data Partitioning: In data partitioning, parallelism is carried out at the CPU level. It is less complex and easier to execute.
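The clustering technique described above, splitting a file into chunks and replicating each chunk on more than one node, can be sketched as follows. The node names, chunk size and round-robin placement rule are assumptions made for illustration; this is not Hadoop's actual placement policy.

```python
def split_into_chunks(data, chunk_size):
    """Cut a byte string into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def place_chunks(chunk_count, nodes, replicas):
    """Assign each chunk to `replicas` distinct nodes in round-robin order."""
    if replicas > len(nodes):
        raise ValueError("replicas cannot exceed the number of nodes")
    return {
        chunk_id: [nodes[(chunk_id + r) % len(nodes)] for r in range(replicas)]
        for chunk_id in range(chunk_count)
    }

def surviving_chunks(placement, failed_node):
    """List the chunks that still have at least one copy after a node fails."""
    return [
        chunk_id
        for chunk_id, holders in placement.items()
        if any(node != failed_node for node in holders)
    ]

data = bytes(range(10)) * 10                 # 100 bytes of sample data
chunks = split_into_chunks(data, 32)         # chunk sizes: 32, 32, 32, 4
nodes = ["node-a", "node-b", "node-c"]       # hypothetical cluster
placement = place_chunks(len(chunks), nodes, replicas=2)

print(len(chunks))                            # 4
print(placement[0])                           # ['node-a', 'node-b']
print(surviving_chunks(placement, "node-a"))  # [0, 1, 2, 3]
```

With two replicas per chunk, the loss of any single node leaves every chunk readable, which is the reduced machine dependency the text refers to.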
For the Big Data testing strategy to be effective, testers need to continuously monitor and validate the 5Vs (basic characteristics) of Data – Volume, Variety, Veracity, Velocity and Value. Understanding the data and its impact on the business is the real challenge faced by any Big Data tester. It is not easy to measure the testing efforts and strategy without proper knowledge of the nature of available data. Testers need to understand business rules and the relationship between different subsets of data. They also have to understand statistical correlation between different data sets and their benefits for business users.
Big Data testers need to understand the components of the Big Data system. Today, testers understand that they have to think beyond the regular parameters of automated testing and manual testing. Big Data, with its unexpected format, can cause problems that automated test cases fail to understand. Creating automated test cases for such a Big Data pool requires expertise and coordination between team members.
The testing team should coordinate with the development team and the marketing team to understand data extraction from different resources, data filtering, and pre- and post-processing algorithms. As there are a number of fully automated testing tools available in the market for Big Data validation, the tester necessarily has to possess the required skill set and make use of Big Data technologies like Hadoop. Organizations also need to be ready to invest in Big Data-specific training programs and to develop Big Data test automation solutions.
If the testing process is not standardized and strengthened for the reuse and optimization of test case sets, the test cycle or test suite will grow beyond what was intended, which in turn causes increased costs, maintenance issues and delivery slippages. With manual testing, test cycles might stretch into weeks or even longer. Hence, test cycles need to be accelerated through the adoption of validation tools, proper infrastructure and data processing methodologies.
The big data market is growing rapidly.
• In the future, there will be advanced strategies to generate large quantities of test data and reduce big data test set-up time and cost. Specialized test tools and creative exploratory approaches will be used for data error trapping and business rule implementation.
• Data quality will be a critical issue in the near future. There will be various benchmarks for evaluating data quality, such as conformity, accuracy, consistency, validity, duplication and completeness.
• Live data integration will be used to improve big data testing efficiency. It will be used to verify the data before it is moved into the database.
• There will be novel strategies to gauge the performance of big data applications. In the future, performance testing methodologies will integrate statistical analysis from different application layers.
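A few of the data quality benchmarks listed above (completeness, duplication and validity) can be expressed as simple checks that run before data is moved into a database. The sketch below is only one possible shape: the required fields, the sample rows and the "amount must not be negative" rule are all hypothetical.

```python
from collections import Counter

REQUIRED_FIELDS = ("id", "email", "amount")  # hypothetical schema

def quality_report(rows):
    """Measure completeness, duplicated ids and rule violations for a batch."""
    if not rows:
        raise ValueError("cannot assess an empty batch")
    # Completeness: every required field is present and non-empty.
    complete = [
        row for row in rows
        if all(row.get(field) not in (None, "") for field in REQUIRED_FIELDS)
    ]
    # Duplication: the same id appears on more than one row.
    id_counts = Counter(row["id"] for row in rows if row.get("id") is not None)
    return {
        "completeness": len(complete) / len(rows),
        "duplicated_ids": sorted(i for i, n in id_counts.items() if n > 1),
        # Validity: a hypothetical business rule that amounts are not negative.
        "invalid_ids": [row["id"] for row in complete if row["amount"] < 0],
    }

batch = [
    {"id": 1, "email": "a@example.com", "amount": 10.0},
    {"id": 2, "email": "", "amount": 5.0},               # incomplete: empty email
    {"id": 2, "email": "b@example.com", "amount": 7.5},  # duplicate id
    {"id": 3, "email": "c@example.com", "amount": -4.0}, # invalid amount
]

report = quality_report(batch)
print(report)
# {'completeness': 0.75, 'duplicated_ids': [2], 'invalid_ids': [3]}
```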
These are just some of the challenges testers face, and some of the solutions available, when dealing with the QA of a vast data pool. All in all, Big Data testing is of great significance for today’s businesses. If the right test strategies are embraced and best practices are followed, defects can be identified at an early stage and overall testing costs can be reduced, while achieving high Big Data quality at speed.