Data Sources

Overview

Data sources are a fundamental part of a big data system. They generate the data that is collected to provide insights and business value to a company.

Types of sources

The most common data sources fall into four categories:

  • Transactions
  • Social media
  • Machines
  • Customers

Transactions

This category covers payments, orders, deliveries, and storage records. In most cases, this data is generated by financial, trading, and eCommerce applications. It is useful for analyzing market conditions, predicting stock prices, and optimizing costs.

Social media

This category refers to the activity on social platforms. Examples of this activity are likes, tweets and retweets, comments, posts, video uploads, and general media that are uploaded and shared. It is useful for insights into consumer preferences, changing trends, and marketing analytics.

Machines

This category covers data from industrial equipment, sensor measurements generated in real time, and the company’s server logs. It is a standard data source in IoT solutions. It is useful for analyzing system performance, tracking state in real time, sending notifications, predicting system behavior, and reacting to failure scenarios.

Customers

This category covers users of the company’s web or mobile client applications. Data can be generated directly by a user or indirectly by monitoring user behavior. It is useful for tailoring ads and recommendation engines.

Data format

Data from these sources can arrive in multiple formats. Format refers to the structure and organization of the data. It can be split into four categories, listed here from most to least structured:

  • Structured
  • Semi-structured
  • Quasi-structured
  • Unstructured

Most of the generated data in the big data context is unstructured or semi-structured.

Structured

Format with predefined data types organized in a fixed structure. Examples: tables in a traditional RDBMS, CSV files, spreadsheets.
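
As a minimal sketch of what a fixed structure means in practice, the Python snippet below reads a small CSV sample in which every row has the same columns and each column is converted to a predefined type. The column names (order_id, amount, currency), the sample rows, and the helper read_structured are hypothetical and chosen only for illustration.

```python
import csv
import io

# Hypothetical sample: every row has the same three columns.
SAMPLE_CSV = """order_id,amount,currency
1001,19.99,EUR
1002,5.00,USD
"""

# Predefined type for each column (an assumed schema for this illustration).
SCHEMA = {"order_id": int, "amount": float, "currency": str}


def read_structured(text):
    """Parse CSV text and convert each column to its predefined type."""
    reader = csv.DictReader(io.StringIO(text))
    return [{name: cast(row[name]) for name, cast in SCHEMA.items()} for row in reader]


orders = read_structured(SAMPLE_CSV)
```

Because the schema is known in advance, a value that does not fit its column type (for example, text in the amount column) fails at conversion time instead of silently entering the dataset.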

Semi-structured

Format with a self-describing structure that is not fixed. Technically speaking, it has semantic tags or markers that are used to group types of data and enforce hierarchies within the data. Examples: XML, JSON.
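
To illustrate "self-describing but not fixed", the sketch below parses two JSON events that carry their own field names and nesting, yet do not share the same set of fields. The event contents and the helper describe are hypothetical, made up for this example.

```python
import json

# Hypothetical click events: both are valid JSON, but their fields differ.
RAW_EVENTS = [
    '{"type": "click", "user": {"id": 7}, "target": "buy-button"}',
    '{"type": "click", "user": {"id": 8, "country": "HR"}}',
]


def describe(raw):
    """Return the top-level keys and the nested user keys found in one event."""
    event = json.loads(raw)
    return sorted(event), sorted(event.get("user", {}))


shapes = [describe(raw) for raw in RAW_EVENTS]
```

The keys act as the markers that group the data, and the nested "user" object shows the hierarchy. A consumer has to tolerate missing or extra fields because no schema is enforced up front.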

Quasi-structured

Format with inconsistent types and values that has an observable structural pattern but cannot be easily parsed. Examples: web pages, web click-streams, Google search results.
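
A click-stream is a typical case where a pattern is visible but extraction takes extra effort. The sketch below, which assumes the clicks are recorded as plain URLs, uses Python's standard urllib.parse module to recover path segments and query parameters. The URLs and the helper extract are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical click-stream entries: a pattern is visible, but fields vary.
CLICKS = [
    "https://shop.example.com/search?q=phone&page=2",
    "https://shop.example.com/product/42?ref=search",
    "https://shop.example.com/cart",
]


def extract(url):
    """Pull the path segments and query parameters out of one URL."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # parse_qs returns a list per key; keep the first value of each.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return segments, params


parsed = [extract(url) for url in CLICKS]
```

Every entry yields a different combination of segments and parameters, and all values arrive as strings, so later processing steps still have to decide what each piece means.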

Unstructured

Format without data types and with no inherent, identifiable structure. It cannot be processed and analyzed using conventional methods. Examples: text documents, email messages, PDF and Word files, images, videos.

Sources in our solution

Source                  Type          Format
Customer orders         Transactions  RDBMS
Clicks                  Customer      JSON
Profile data            Customer      RDBMS
Front-end service logs  Machine       JSON
Back-end service logs   Machine       JSON
Domain events           Machine       JSON
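
One possible way to keep this inventory in code is a small catalog of records that can be grouped by type or format, for example to decide which ingestion path each source needs. This is only a sketch: the Source class, its field names (kind, fmt), and the helper group_by are hypothetical, while the rows are copied from the table above.

```python
from collections import defaultdict
from typing import NamedTuple


class Source(NamedTuple):
    name: str
    kind: str  # the "Type" column
    fmt: str   # the "Format" column


# Rows copied from the table of sources.
SOURCES = [
    Source("Customer orders", "Transactions", "RDBMS"),
    Source("Clicks", "Customer", "JSON"),
    Source("Profile data", "Customer", "RDBMS"),
    Source("Front-end service logs", "Machine", "JSON"),
    Source("Back-end service logs", "Machine", "JSON"),
    Source("Domain events", "Machine", "JSON"),
]


def group_by(sources, field):
    """Group source names by the value of the given field."""
    groups = defaultdict(list)
    for source in sources:
        groups[getattr(source, field)].append(source.name)
    return dict(groups)


by_format = group_by(SOURCES, "fmt")
by_kind = group_by(SOURCES, "kind")
```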

Acronyms

  • IoT - Internet of Things
  • RDBMS - Relational Database Management System
  • CSV - Comma-Separated Values
  • XML - Extensible Markup Language
  • JSON - JavaScript Object Notation
  • PDF - Portable Document Format

Contributors
Petar Zečević

Petar is a Software Engineer with experience in Android, IoT, embedded programming, REST API and Big Data / Apache Spark.