Big data 1: What is big data?

What is big data?

There are many definitions of the term ‘big data’ but most suggest something like the following:

'Extremely large collections of data (data sets) that may be analysed to reveal patterns, trends, and associations, especially relating to human behaviour and interactions.'

Many definitions also state that the data sets are so large that conventional methods of storing and processing the data will not work.

Sources of big data

Main sources of big data can be grouped under the headings of social (human), machine (sensor) and transactional.

Social (human) – this source is becoming more and more relevant to organisations. This source includes all social media posts, videos posted etc.

Machine (sensor) – this data is generated by equipment and sensors that take measurements, such as the temperature of a refrigerator or the vibration of a jet engine.

Transactional – this comes from the transactions which are undertaken by the organisation. This is perhaps the most traditional of the sources.

Characteristics of big data

The characteristics of big data, known as the 5Vs, are:

Volume
Variety
Velocity
Veracity
Value

These characteristics have been generally adopted as the essential qualities of big data.

Volume

The volume of big data held by large companies such as Walmart (supermarkets), Apple and eBay is measured in multiple petabytes. A typical disc on a personal computer (PC) holds a gigabyte, and a petabyte is a million gigabytes, so the big data depositories of these companies hold at least the data that could typically be held on 1 million PCs, perhaps even 10 to 20 million PCs.
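The arithmetic behind that comparison can be checked with a short sketch. It assumes decimal units (1 gigabyte = 10^9 bytes, 1 petabyte = 10^15 bytes) and takes the article's figure of one gigabyte per PC disc; the function name is illustrative only.

```python
# Decimal (SI) storage units, in bytes. These are assumptions for illustration.
GIGABYTE = 10**9
PETABYTE = 10**15


def pcs_needed(petabytes, pc_disc_bytes=GIGABYTE):
    """Return how many PC discs of the given size hold `petabytes` of data."""
    return petabytes * PETABYTE // pc_disc_bytes


for pb in (1, 10, 20):
    print(f"{pb} PB is roughly {pcs_needed(pb):,} PCs")
```

One petabyte works out at 1,000,000 such PCs, so 10 to 20 petabytes corresponds to the 10 to 20 million PCs mentioned above.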

The scale of this is difficult to comprehend. It is probably more useful to consider the types of data that large companies will typically store.

Retailers
Via loyalty cards being swiped at checkouts: details of all purchases you make, when, where, how you pay, use of coupons.

Via websites: every product you have ever looked at, every page you have visited, every product you have ever bought.

Social media (such as Facebook and Twitter)
Friends and contacts, postings made, your location when postings are made, photographs (that can be scanned for identification), any other data you might choose to reveal to the universe.

Mobile phone companies
Numbers you ring, texts you send (which can be automatically scanned for key words), every location your phone has ever been whilst switched on (to an accuracy of a few metres), your browsing habits and voice mails.

Internet providers and browser providers
Every site and every page you visit. Information about all downloads and all emails (again these are routinely scanned to provide insights into your interests). Search terms which you enter.

Banking systems
Every receipt, payment, credit card information (amount, date, retailer, location), location of ATMs used.

Variety

Some of the variety of information can be seen from the examples listed above. In particular, the following types of information are held:

Browsing activities: sites, pages visited, membership of sites, downloads, searches
Financial transactions
Interests
Buying habits
Reaction to advertisements on the internet or to advertising emails
Geographical information
Information about social and business contacts
Text
Numerical information
Graphical information (such as photographs)
Oral information (such as voice mails)
Technical information, such as jet engine vibration and temperature analysis

This data can be both structured and unstructured:

Structured data: this data is stored within defined fields (numerical, text, date etc) often with defined lengths, within a defined record, in a file of similar records. Structured data requires a model of the types and format of business data that will be recorded and how the data will be stored, processed and accessed. This is called a data model. Designing the model defines and limits the data which can be collected and stored, and the processing that can be performed on it.

An example of structured data is found in banking systems, which record the receipts and payments from your current account: date, amount, receipt/payment, short explanations such as payee or source of the money.

Structured data is easily accessible by well-established database structured query languages.
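As a minimal sketch of what this looks like in practice, the banking example can be held in a small relational table and queried with SQL. The table name, column names and sample records below are all hypothetical; Python's built-in sqlite3 module is used only because it needs no setup.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE transactions (
        txn_date TEXT NOT NULL,      -- ISO date, e.g. 2024-03-01
        amount   REAL NOT NULL,      -- always positive
        kind     TEXT NOT NULL CHECK (kind IN ('receipt', 'payment')),
        detail   TEXT                -- payee or source of the money
    )
    """
)

# Made-up sample records.
rows = [
    ("2024-03-01", 2500.00, "receipt", "Salary"),
    ("2024-03-03", 60.25, "payment", "Supermarket"),
    ("2024-03-10", 800.00, "payment", "Rent"),
    ("2024-03-15", 39.75, "payment", "Supermarket"),
]
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", rows)

# Because every record has the same fields, one query can total payments by payee.
payments_by_payee = conn.execute(
    """
    SELECT detail, SUM(amount)
    FROM transactions
    WHERE kind = 'payment'
    GROUP BY detail
    ORDER BY detail
    """
).fetchall()

print(payments_by_payee)
```

The data model (the CREATE TABLE statement) fixes in advance which fields exist, which is what makes the query simple and also what limits the data that can be stored.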

Unstructured data: refers to information that does not have a pre-defined data model. It comes in all shapes and sizes and it is this variety and irregularity which makes it difficult to store in a way that will allow it to be analysed, searched or otherwise used. An often quoted statistic is that 80% of business data is unstructured, residing in word processor documents, spreadsheets, PowerPoint files, audio, video, social media interactions and map data.

Here is an example of unstructured data and an example of its use in a retail environment:

You enter a large store and have your mobile phone with you. That allows your movement round the store to be tracked. The store might or might not know who you are (depending on whether it knows your mobile phone number). The store can record what departments you visit, and how long you spend in each. Security cameras in the ceiling match up your image with the phone, so now they know what you look like and would be able to recognise you on future visits. You pass near a particular product and previous records show that you had looked at that product before, so a text message can be sent perhaps reminding you about it, or advertising a 10% price reduction. Perhaps the store has a marketing campaign that states that it will never be undersold, so when you pass near products you might be making a price comparison and the store has to check prices on other stores' websites and message you with a new price. If you buy the product then the store might have further marketing opportunities for related products and consumables and this data has to be recorded also. You pay with an affinity credit card (a card with associations with another organisation, such as a charity or an airline), so now the store has some insight into your interests. Perhaps you buy several products and the store will want to discover if these items are generally bought together.

So just walking round a store can generate a vast quantity of data which will be very different in size and nature for every individual.
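The last question in the store example, whether items are generally bought together, can be illustrated with a simple pair count. This is only a sketch under stated assumptions: the baskets are made up, and the function name is hypothetical; real systems would use more sophisticated association measures on far larger data.

```python
from collections import Counter
from itertools import combinations

# Made-up baskets, one set of product names per customer visit.
baskets = [
    {"printer", "ink", "paper"},
    {"printer", "ink"},
    {"paper", "pens"},
    {"printer", "ink", "pens"},
]


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of products."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each pair one canonical order, e.g. ("ink", "printer").
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(baskets)
most_common_pair, times = counts.most_common(1)[0]
print(most_common_pair, times)
```

In this sample, ink and printer appear together in three of the four baskets, which is the kind of finding that could prompt a related-product offer.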

Velocity

Information must be provided quickly enough to be of use in decision-making and performance management. For example, in the above store scenario, there would be little use in obtaining the price-comparison information and texting customers once they had left the store. If facial recognition is going to be used by shops and hotels, it has to be more or less instant so that guests can be welcomed by name.

You will understand that volume and variety conspire against velocity, so methods have to be found to process huge quantities of non-uniform, awkward data in real time.

Veracity

Veracity means accuracy and truthfulness and relates to the quality of the data. In the context of big data, for any analysis to provide useful findings for decision making, the data collected must be true. To assess how true the data is, companies must consider not only how accurate or reliable a data set might be but also how trustworthy its source is. Companies must be able to trust the source of the data being collected and be confident that the data is reliable and accurate if they are to base important, and often costly, decisions on the findings of its analysis.

The difficulty that companies face here is that, by its very nature, the data collected comes from many different sources. Some will be more trustworthy than others. For example, machine and transactional sourced data would be seen as more reliable than human sourced data. Data from transactional and machine sources would be easier to verify and less easy to manipulate. Human data, for example from social media, can be more easily manipulated, and care must be taken when using this type of data, particularly given the recent increase in so-called ‘fake news’ and growing reports of deliberately manipulated customer reviews on retail sites.

Veracity also ties in to velocity. To be useful in decision making, data needs to be analysed as soon as possible. Velocity shows that the data being collected changes quickly. Analysing out of date data could lead to poor decision making.

Value

The last V of big data (although some models have added more) is Value. There is little point in going to the effort and expense of gathering and analysing the data if this does not ultimately result in adding value to the company. It is important for companies to consider the potential of big data analytics and the value it could create if gathered, analysed and used wisely.

An example of how data analysis was used by British supermarket group Tesco to add value:

Tesco has operations in several countries around the world. In Ireland, the company developed a system to analyse the temperature of its in-store refrigerators. Sensors were placed in the fridges that measured the temperature every three seconds and sent the information over the internet to a central data warehouse. Analysis of this data allowed the company to identify units that were operating at incorrect temperatures. The company discovered that a number of fridges were operating at temperatures below the -21°C to -23°C recommended. This was clearly costing the company in terms of wasted energy. Having this information allowed the company to correct the temperature of the fridges. Given that the company was spending €10 million per year on fridge cooling costs in Ireland, an expected 20% reduction in these costs was a significant saving.

The system also allowed the engineers to monitor the performance of the fridges remotely. When they identified that a particular unit was malfunctioning, they could analyse the problem then visit the store with the right parts and replace them. Previously the fridges would only be fixed when a problem had been discovered by the store manager, which would usually be when the problem had developed into something more major. The engineers would have to visit the store, identify the problem, and then make a second visit to the store with the required parts.
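A rough sketch of the kind of check described in the Tesco example is shown below. The recommended range and the cost figures come from the article; the unit names, the readings and the choice to compare each unit's average reading are assumptions made for illustration, not a description of Tesco's actual system.

```python
from statistics import mean

# Recommended range quoted in the article, in degrees Celsius.
WARMEST_OK = -21.0
COLDEST_OK = -23.0

# Made-up readings per fridge unit; unit names are hypothetical.
readings = {
    "unit-A": [-22.0, -22.5, -21.5],
    "unit-B": [-25.0, -26.0, -25.5],  # colder than needed: wastes energy
    "unit-C": [-19.0, -20.0, -19.5],  # too warm
}


def classify(temps):
    """Label a unit by comparing its average reading with the recommended range."""
    avg = mean(temps)
    if avg < COLDEST_OK:
        return "too cold"
    if avg > WARMEST_OK:
        return "too warm"
    return "ok"


status = {unit: classify(temps) for unit, temps in readings.items()}
print(status)

# Expected saving from the article's figures: 20% of EUR 10 million a year.
annual_cooling_cost = 10_000_000
expected_saving = annual_cooling_cost * 20 // 100
print(expected_saving)
```

On the article's figures, the expected saving is €2 million per year.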

Processing and analysing big data

The processing of big data is generally known as big data analytics and includes:

Data mining: analysing data to identify patterns and establish relationships such as associations (where several events are connected), sequences (where one event leads to another) and correlations.
Predictive analytics: a type of data mining which aims to predict future events. For example, the chance of someone being persuaded to upgrade a flight.
Text analytics: scanning text such as emails and word processing documents to extract useful information. It could simply be looking for key-words that indicate an interest in a product or place.
Voice analytics: as above but with audio.
Statistical analytics: used to identify trends, correlations and changes in behaviour.
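The simplest form of text analytics listed above, looking for key words, can be sketched in a few lines. The keyword list and the sample message are hypothetical, and this version matches exact words only (so "hotels" would not count as "hotel"); real text analytics tools go much further.

```python
import re
from collections import Counter

# Hypothetical keywords that might signal interest in a product or place.
KEYWORDS = {"holiday", "flight", "hotel"}


def keyword_hits(text, keywords=KEYWORDS):
    """Count occurrences of each keyword in the text, ignoring case."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word in keywords)


message = "Booked the flight! Now looking for a hotel. Any hotel tips for our holiday?"
hits = keyword_hits(message)
print(hits)
```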

Google provides website owners with Google Analytics, which will track many features of website traffic. For example, the website OpenTuition.com provides free ACCA study resources. Google Analytics reports statistics such as the following:

[Tables not reproduced: geographical distribution of users; type of browser used; age of user]

The final table is instructive. OpenTuition.com does not ask for users’ ages, so this data has been pieced together from other information available to Google. It has been able to do this for only about 58% of users.

These analytical findings can lead to:

Better marketing
Better customer service and relationship management
Increased customer loyalty
Increased competitive strength
Increased operational efficiency
The discovery of new sources of revenue.

The Big Data (DIKW) pyramid

The DIKW pyramid, also known as the knowledge pyramid, became well known in 1989 from the work of Ackoff. With the emergence of big data, the pyramid has also become known as the big data pyramid. The work of Jennifer Rowley in 2007 explained the relationships between data, information, knowledge and wisdom.

Rowley explained the pyramid: 'Typically information is defined in terms of data, knowledge in terms of information, and wisdom in terms of knowledge.'

Data: a range of data can be collected from various sources – this is raw data and not particularly useful in this form.

Information: The raw data can be analysed to look for trends or patterns. For example, it may appear that there is a link between the purchase of a particular product and a particular group of customers. This is information.

Knowledge: The information can be analysed further to establish how the identified links are connected. Knowing the details of exactly what types of customers buy a particular product or favour particular product features is knowledge.

Wisdom: The knowledge gathered can be used to make informed business decisions.

Example of how the pyramid could be used:
A soft drink manufacturer makes a range of fruity soft drinks in four different flavours (orange, apple, lime and pear). It has traditionally used plastic bottles but has recently run a trial whereby two flavours were also made available in glass bottles. It is making its plan for next year’s production and is considering if it should expand the use of glass bottles.

Data: The company has collected a range of data from previous purchases, customer questionnaires, social media posts etc.

Information: The raw data was analysed to look for trends or patterns. The company finds that there appears to be a link between the types of bottles purchased by different age groups.

Knowledge: Further analysis has shown that younger customers prefer the glass bottles while customers from the older age range prefer plastic bottles. Previous analysis also showed that lime flavour is almost exclusively purchased by older customers and pear is almost exclusively purchased by younger customers.

Wisdom: How can this knowledge be used? The company should only produce lime flavour in plastic bottles and only produce pear flavour in glass bottles. Here, the company is using the insights gained in order to make a decision and therefore this is classed as wisdom.
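The four levels of the soft drink example can be traced through a small sketch. The purchase records are invented to match the pattern described above, and the helper function and variable names are hypothetical; the point is only to show each level building on the one below.

```python
from collections import Counter

# Data: made-up raw purchase records (age_group, flavour, bottle).
purchases = [
    ("younger", "pear", "glass"),
    ("younger", "pear", "glass"),
    ("younger", "orange", "glass"),
    ("younger", "apple", "plastic"),
    ("older", "lime", "plastic"),
    ("older", "lime", "plastic"),
    ("older", "orange", "plastic"),
    ("older", "apple", "glass"),
]

# Information: summarise the raw data to expose a pattern.
bottle_by_age = Counter((age, bottle) for age, _, bottle in purchases)


def preferred(counter, key):
    """Return the most frequent second element among pairs starting with `key`."""
    options = {b: n for (a, b), n in counter.items() if a == key}
    return max(options, key=options.get)


# Knowledge: which bottle each age group prefers, and who buys each flavour.
bottle_preference = {age: preferred(bottle_by_age, age) for age in ("younger", "older")}
age_by_flavour = Counter((flavour, age) for age, flavour, _ in purchases)
main_buyer = {f: preferred(age_by_flavour, f) for f in ("lime", "pear")}

# Wisdom: a decision rule, bottle each flavour to suit its main buyers.
plan = {flavour: bottle_preference[age] for flavour, age in main_buyer.items()}
print(plan)
```

With this sample data the plan is lime in plastic and pear in glass, matching the decision reached in the example.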

Dangers/risks of big data

Despite the examples of the use of big data in commerce, particularly for marketing and customer relationship management, there are some potential dangers and drawbacks.

Cost: It is expensive to establish the hardware and analytical software needed, though these costs are continually falling.

Regulation: Some countries and cultures worry about the amount of information that is being collected and have passed laws governing its collection, storage and use. Breaking a law can have serious reputational and punitive consequences.

Loss and theft of data: Apart from the consequences arising from regulatory breaches as mentioned above, companies might find themselves open to civil legal action if data were stolen and individuals suffered as a consequence.

Incorrect data: If the data held is incorrect or out of date, incorrect conclusions are likely. Even if the data is correct, some correlations might be spurious, leading to false positive results.
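The risk of spurious correlation can be demonstrated with random numbers. The sketch below is an illustration under stated assumptions: it generates one target series and 200 unrelated candidate series of pure noise (the sizes and the seed are arbitrary choices), then searches for the candidate that correlates most strongly with the target.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
n_points, n_candidates = 10, 200

# A target series and many candidate series, all pure noise and unrelated.
target = [rng.gauss(0, 1) for _ in range(n_points)]
candidates = [[rng.gauss(0, 1) for _ in range(n_points)] for _ in range(n_candidates)]

# Search for the candidate that happens to track the target most closely.
strongest = max(abs(pearson(target, c)) for c in candidates)
print(f"Strongest correlation found by chance: {strongest:.2f}")
```

Because so many candidates are tested against so few data points, a strong-looking correlation is almost certain to turn up even though none of the series are related. The more variables a big data analysis compares, the more such false positives it should expect.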

Written by a member of the PM examining team