"BIG DATA" - Student Scientific Forum

X International Student Scientific Conference "Student Scientific Forum - 2018"

"BIG DATA"

Kislova E.I. 1
1 Vladimir State University named after Alexander and Nikolay Stoletovs
The term Big Data

The introduction of the term Big Data is usually attributed to Clifford Lynch, editor of the journal Nature, who on September 3, 2008 devoted a special issue of one of the most famous journals to the topic "what big data sets can mean for modern science". The word "big" here referred not so much to quantity as to a qualitative assessment.

Big data is a term for collections of datasets so large and complex that existing traditional database management tools and applications cannot process them. The difficulty lies in collecting, cleaning, storing, searching, accessing, transferring, analyzing, and visualizing such sets as an integral whole rather than as local fragments. [6]

The defining characteristics of big data are the "three Vs": Volume (physical volume), Velocity (in this context, both the growth rate of the data and the need for high-speed processing to obtain results), and Variety (the ability to process different types of structured and semi-structured data simultaneously). The leading characteristic is volume: the amount of digitized data generated today is enormous. [2]

The sources of this avalanche of data are numerous digital devices that collect the products of the human mind and direct them into the bottomless expanses of the Internet: tweets, posts on Facebook and Vkontakte, search engine queries, and so on. They also include data from the sensors and controllers of millions of devices that measure temperature and humidity, the condition of roads and air conditioners, and much else, today united under the term "smart devices". [1]

Other sources are video streams from surveillance cameras, digitized audio signals, GPS coordinates of mobile devices, and much else that machines generate on their own while equipment is operating. All this data is stored in different databases and storage systems, or is simply lost. Some of it is accessible via the Internet, and some only locally. The big data approach is intended to significantly increase the use of the available information and to present it in a form suitable for practical application. However, quantity (the Volume of data, the first "V") is not the only problem. As noted, the second "V", Velocity, is also important for big data. The results of big data processing must arrive within the time dictated by the problem they are meant to solve. The speed of data access and processing is therefore an important quality criterion for big data technologies. [3]

Finally, the third "V", Variety, means that big data should be handled effectively regardless of its structure. It is customary to distinguish three main types of data by degree of structure:

The first level is ordinary structured data, which can be represented by separable, pre-defined fields whose contents have different semantics.

The second level is semi-structured data. Such data has structure but cannot be represented as a table, because some records lack some attributes.

The third level is unstructured data. It includes texts written in different languages, sound recordings, still images, video files, e-mails, tweets, presentations, and other business information that does not come from databases. It is believed that 80 to 90 percent of all data in organizations is unstructured. The semi-structured data described above is often also counted as unstructured. [4]
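The difference between the three levels can be illustrated with a short Python sketch. The records, field names, and the helper `missing_attributes` are hypothetical and chosen only for illustration; they are not taken from the cited sources.

```python
import json

# Structured: every record has the same pre-defined fields (fits a table).
structured = [
    {"id": 1, "city": "Vladimir", "temperature_c": -5.0},
    {"id": 2, "city": "Moscow", "temperature_c": -7.5},
]

# Semi-structured: records name their own fields, and some attributes
# are absent from some records (shown here as JSON text).
semi_structured = json.loads("""
[
  {"id": 1, "user": "anna", "tags": ["flu", "health"]},
  {"id": 2, "user": "ivan", "location": "Vladimir"}
]
""")

# Unstructured: free text with no fields at all.
unstructured = "Feeling sick today, probably the flu. Staying home."


def missing_attributes(records):
    """For each record index, list attributes seen elsewhere but absent here."""
    all_keys = set()
    for record in records:
        all_keys.update(record)
    return {i: sorted(all_keys - set(record)) for i, record in enumerate(records)}


print(missing_attributes(structured))       # no gaps: the records fit a table
print(missing_attributes(semi_structured))  # each record lacks some attribute
```

The structured records have no gaps, while each semi-structured record lacks an attribute that another record has, which is why such data does not map directly onto a fixed table.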

Big data as a phenomenon already has a strong impact on business and on the social lives of many people. For example, when analyzing Internet search queries, researchers found a curious pattern: over several years, a surge in Google searches for terms such as "flu treatment" and "flu symptoms" preceded the rapid growth of a flu epidemic by a few weeks. This regularity is already used to predict flu epidemics in many states, for example to brief doctors and free up hospital beds. By contrast, the information used previously, from district doctors and emergency rooms, usually lags behind the real picture. [6]
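One simple way to look for such a lead is to correlate the search series with the case series shifted by different numbers of weeks. The sketch below is only an illustration of that idea: the weekly counts are synthetic (invented, not real measurements), the function names `pearson` and `best_lead` are hypothetical, and the method actually used by the researchers is not described in the source.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def best_lead(searches, cases, max_lag):
    """Return the lag (in weeks) at which searches best match later cases."""
    scores = {}
    for lag in range(max_lag + 1):
        # Pair searches in week t with cases in week t + lag.
        paired_searches = searches[: len(searches) - lag]
        paired_cases = cases[lag:]
        scores[lag] = pearson(paired_searches, paired_cases)
    return max(scores, key=scores.get), scores


# Synthetic weekly counts, made up for illustration only:
# the case curve repeats the search curve two weeks later.
searches = [10, 12, 15, 30, 60, 90, 70, 40, 20, 12, 10, 10]
cases = [10, 10] + searches[:-2]

lag, scores = best_lead(searches, cases, max_lag=4)
print(lag)  # 2 for this synthetic data
```

Real query and case data are noisy, so in practice the correlation peak would be less sharp than in this constructed example.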

Big data techniques

There are many methods and techniques for analyzing big data. The best known are:

  1. data mining methods: association rule learning, classification, cluster analysis, regression analysis;

  2. crowdsourcing: categorization of data by people engaged through a public offer, without entering into an employment relationship;

  3. data fusion and integration;

  4. machine learning (supervised and unsupervised learning);

  5. artificial neural networks, network analysis, optimization, genetic algorithms;

  6. pattern recognition;

  7. predictive analytics;

  8. simulation modeling;

  9. spatial analysis: a class of methods that use topological, geometric, and geographic information in data;

  10. statistical analysis;

  11. analytical data visualization: the presentation of information as drawings and diagrams, using interactive functions and animation, both to obtain results and to serve as source data for further analysis. [5]

Let us consider data visualization in more detail, because it is the key, final step in presenting information to the user.

Visualization is a powerful data mining technique. It is often used to view and verify data before a model is built and again after forecasts are generated. Visualization transforms numerical data into a visual image in order to simplify the perception of large amounts of information. The tools used for this are called visualizers; a visualizer can be a standalone application, a plugin, or part of another application. The following types of visualization are distinguished:

  • Visualization of texts

  • Cluster visualization

  • Visualization of association

  • Landscape visualization

  • Visualization of hypotheses

  • Visualization of decision trees

Visualization of texts:

The visualizer counts how often each word is mentioned and assigns the word a conditional weight based on that frequency. Words with different weights are marked up differently and therefore appear differently on the screen: some words stand out more than others. [6]
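The weighting step can be sketched as follows. This is a minimal illustration, assuming that the weight is a font size scaled linearly between a minimum and a maximum; the function name `word_weights`, the size range, and the sample sentence are hypothetical and not taken from any particular visualizer.

```python
import re
from collections import Counter


def word_weights(text, min_size=10, max_size=40):
    """Map each word to a font size that grows with its frequency."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    if not counts:
        return {}
    lowest = min(counts.values())
    highest = max(counts.values())
    span = highest - lowest or 1  # avoid division by zero if all counts are equal
    return {
        word: round(min_size + (count - lowest) / span * (max_size - min_size))
        for word, count in counts.items()
    }


sample = "big data needs big storage and big ideas about data"
sizes = word_weights(sample)
for word, size in sorted(sizes.items(), key=lambda item: -item[1]):
    print(f"{word:<8} {size}")
```

In this sample "big" occurs most often and gets the largest size, "data" falls in the middle, and words that occur once get the smallest size.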

Cluster visualization:

Cluster visualization is one of the most frequently used kinds. Clusters are groups of objects with similar properties. Most visualizers support clustering algorithms and can divide data into clusters. Typically, objects from different clusters are drawn in contrasting colors. [6]
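As an illustration, the sketch below divides a few two-dimensional points into clusters and assigns each cluster its own color. It assumes k-means as the clustering algorithm, which is only one of many possible choices; the function `k_means`, the palette, and the points are hypothetical.

```python
def k_means(points, k, iterations=10):
    """Plain k-means on 2-D points; returns a cluster index for each point."""
    centers = list(points[:k])  # simple deterministic start: the first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        for i, (x, y) in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2,
            )
        # Update step: move each center to the mean of its points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels


PALETTE = ["red", "blue", "green"]  # contrasting colors, one per cluster

# Invented sample points forming two visibly separate groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (7.9, 8.3), (0.9, 1.1), (8.2, 7.7)]
labels = k_means(points, k=2)
colors = [PALETTE[label] for label in labels]
print(list(zip(points, colors)))
```

A plotting library would then draw each point in its assigned color; the drawing step is left out here to keep the sketch self-contained.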

Visualization of association:

Association visualization shows how often items appear together in a dataset, which reveals how the data is organized.
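The quantity behind such a picture is the co-occurrence count of item pairs. The following sketch computes it for a handful of hypothetical purchase baskets; the data and the function name `pair_frequencies` are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_frequencies(transactions):
    """Count how often each unordered pair of items shares a transaction."""
    counts = Counter()
    for items in transactions:
        # sorted(set(...)) makes each pair canonical and ignores repeated items.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts


# Hypothetical purchase baskets, invented for illustration.
baskets = [
    ["bread", "milk"],
    ["bread", "milk", "butter"],
    ["milk", "butter"],
    ["bread", "butter", "milk"],
]

pairs = pair_frequencies(baskets)
for (a, b), count in pairs.most_common():
    support = count / len(baskets)
    print(f"{a} + {b}: {count} baskets (support {support:.2f})")
```

A visualizer could map these counts to, for example, line thickness between items or cell color in a matrix.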

Landscape visualization:

Landscape visualization presents data as a three-dimensional landscape of bar charts, each bar with its own height and color. A typical landscape visualizer allows analysts to monitor data deployment.

Visualization of hypotheses:

Hypothesis visualization shows the regularities that confirm a proposed hypothesis. How the information is presented varies between visualizers.

Visualization of decision trees:

Decision tree visualization presents hierarchically organized information as a landscape, showing the data as nodes and branches. The landscape can be either two-dimensional or three-dimensional. Quantitative and relational characteristics of the data become visible through hierarchically related nodes. Each node contains numbers or histograms whose height and color correspond to the data values. The lines connecting the nodes show the relationships. [6]
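A very reduced, text-only version of this idea is sketched below: each node is printed with its number and a bar whose length stands in for the histogram height, and indentation stands in for the connecting lines. The tree, its labels and counts, and the function `render_tree` are hypothetical.

```python
def render_tree(node, depth=0, lines=None):
    """Render a nested-dict tree as indented text with a bar per node."""
    if lines is None:
        lines = []
    bar = "#" * node["count"]  # bar length stands in for histogram height
    lines.append(f"{'  ' * depth}{node['label']} [{node['count']}] {bar}")
    for child in node.get("children", []):
        render_tree(child, depth + 1, lines)
    return lines


# Hypothetical decision tree; labels and counts are invented for illustration.
tree = {
    "label": "all customers",
    "count": 10,
    "children": [
        {
            "label": "age < 30",
            "count": 4,
            "children": [
                {"label": "buys: yes", "count": 3},
                {"label": "buys: no", "count": 1},
            ],
        },
        {"label": "age >= 30", "count": 6},
    ],
}

output = render_tree(tree)
print("\n".join(output))
```

Graphical visualizers replace the indentation with drawn branches and the text bars with colored two- or three-dimensional columns.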

All this suggests that a visual representation of data helps us see and understand what lies behind the numbers. The results of data analysis obtained with Big Data technologies make it possible to choose the right marketing strategy. Real-time data analysis makes it possible to take the right business decisions quickly. Information extracted from accumulated raw data reveals new directions for business development. Visual reports help detect hidden problems in time and avoid them.

References

1. Hellerstein, Joe. "Parallel Programming in the Age of Big Data". Gigaom Blog, 2008. – 223 p.

2. Segaran, Toby; Hammerbacher, Jeff. Beautiful Data: The Stories Behind Elegant Data Solutions, 2009. – 257 p.

3. Dedic, N.; Stanier, C. "Towards Differentiating Business Intelligence, Big Data, Data Analytics and Knowledge Discovery". Berlin; Heidelberg: Springer International Publishing, 2017. – 285 p.

4. Medri, Daniele. "Big Data & Business: An on-going revolution". Statistics Views, 21 October 2013.

5. Kalil, Tom. "Big Data is a Big Deal". White House, 26 September 2012.

6. Big Data and their applications in electroenergetic. Available at: https://docviewer.yandex.ru/view/330432739 (accessed 1.03.2018).
