Data Sources

Overview

Data sources are a fundamental part of a big data system. They generate the data that is collected to provide insights and business value to companies.

Types of sources

The most common data sources fall into four categories:

Transactions
Social media
Machines
Customers

Transactions

This category covers payments, orders, deliveries, and storage records. In most cases, this data is generated by financial, trading, and eCommerce applications. It is useful for analyzing market conditions, predicting stock prices, and optimizing costs.

Social media

This category refers to the activity on social platforms. Examples of this activity are likes, tweets and retweets, comments, posts, video uploads, and general media that are uploaded and shared. It is useful for insights into consumer preferences, changing trends, and marketing analytics.

Machines

This category covers industrial equipment, sensor measurements generated in real time, and the company’s server logs. It is a standard data source in IoT solutions. It is useful for analyzing system performance, tracking state in real time, sending notifications, predicting system behavior, and reacting to failures.

Customers

This category refers to users of the company’s web or mobile client applications. Data can be generated directly by a user or indirectly by monitoring user behavior. It is useful for tailoring ads and recommendation engines.

Data format

Data from sources can come in multiple formats. Format refers to the structure and organization of the data. Formats can be split into four categories, listed here from most to least structured:

Structured
Semi-structured
Quasi-structured
Unstructured

Most of the generated data in the big data context is unstructured or semi-structured.

Structured

A format with predefined data types organized in a fixed structure. Examples: tables in a traditional RDBMS, CSV files, spreadsheets.
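As a minimal sketch of what "fixed structure" means in practice, the Python snippet below parses a small CSV sample against a schema that is declared up front. The column names, values, and the `parse_orders` helper are hypothetical and serve only as an illustration.

```python
import csv
import io

# Hypothetical order records; the columns and values are illustrative only.
RAW_CSV = """order_id,customer_id,amount
1,17,19.99
2,42,5.50
"""

# The fixed schema: every row has the same columns with the same types.
SCHEMA = {"order_id": int, "customer_id": int, "amount": float}


def parse_orders(text):
    """Parse CSV text into typed rows according to SCHEMA."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {name: cast(row[name]) for name, cast in SCHEMA.items()}
        for row in reader
    ]


orders = parse_orders(RAW_CSV)
total = sum(order["amount"] for order in orders)
```

Because the schema is known before any data is read, a row that does not fit it (for example, a non-numeric amount) fails immediately with a `ValueError` instead of passing through unnoticed.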

Semi-structured

A format with a self-describing structure that is not fixed. Technically speaking, it has semantic tags or markers that are used to group types of data and enforce hierarchies within the data. Examples: XML, JSON.
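The following Python sketch shows the "self-describing but not fixed" property using JSON. The two event records are made up for the illustration. Each record names its own fields, and the set of fields differs between records.

```python
import json

# Hypothetical events; the field names and values are illustrative only.
RAW_EVENTS = """[
  {"type": "click", "user": {"id": 17}, "target": "buy-button"},
  {"type": "view", "user": {"id": 42, "country": "HR"}, "tags": ["promo", "mobile"]}
]"""

events = json.loads(RAW_EVENTS)

# Each record carries its own field names, so the structure is discovered
# from the data itself rather than from an external schema.
fields_per_event = [sorted(event.keys()) for event in events]

# Nested (hierarchical) fields may be missing, so read them defensively.
countries = [event["user"].get("country") for event in events]
```

Here `fields_per_event` differs between the two records, and `countries` contains `None` for the record that has no country. Consumers of semi-structured data therefore need to handle optional and nested fields explicitly.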

Quasi-structured

A format with inconsistent types and values that has an observable structural pattern but cannot be easily parsed. Examples: web pages, web click-streams, Google search results.
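To illustrate, the sketch below pulls a few fields out of click-stream URLs with Python's standard `urllib.parse` module. The URLs, the `extract` helper, and the chosen output fields are assumptions made for the example. The point is that a pattern is visible, yet every field has to be recovered with extra effort and may be absent.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical click-stream entries; the URLs are illustrative only.
CLICKS = [
    "https://shop.example.com/search?q=running+shoes&page=2",
    "https://shop.example.com/product/1234?ref=search",
    "https://shop.example.com/cart",
]


def extract(url):
    """Recover a site section and an optional search term from one URL."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    segments = [segment for segment in parts.path.split("/") if segment]
    return {
        "section": segments[0] if segments else None,
        # Not every click has a search term, so fall back to None.
        "search_term": query.get("q", [None])[0],
    }


parsed = [extract(url) for url in CLICKS]
```

Only the first URL yields a search term. The other two follow different patterns, which is typical of quasi-structured data.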

Unstructured

A format without data types and with no inherent, identifiable structure. It cannot be processed and analyzed using conventional methods. Examples: text documents, email messages, PDF and Word files, images, videos.

Sources in our solution

Source	Type	Format
Customer orders	Transactions	`RDBMS`
Clicks	Customers	`JSON`
Profile data	Customers	`RDBMS`
Front-end service logs	Machines	`JSON`
Back-end service logs	Machines	`JSON`
Domain events	Machines	`JSON`

Acronyms

IoT - Internet of Things
RDBMS - Relational database management system
CSV - Comma-separated values
XML - Extensible Markup Language
JSON - JavaScript Object Notation
PDF - Portable Document Format

Contributors

Petar Zečević

Petar is a Software Engineer with experience in Android, IoT, embedded programming, REST API and Big Data / Apache Spark.
