The Need For MapReduce and NoSQL

The Need for MapReduce

Relational Database Management Systems (RDBMSs) have been in use since the 1970s. They provide the SQL query language as their interface.

They are good at needle-in-a-haystack problems – finding small results in big datasets.

They provide a number of advantages:

  • A declarative query language
  • Schemas
  • Logical Data Independence
  • Database Indexing
  • Optimizations Through Use of Relational Algebra
  • Views
  • ACID properties (Atomicity, Consistency, Isolation and Durability)

They provide scalability in one sense: even if the data does not fit in main memory, the query will still finish efficiently.

However, they are not good at scalability in another sense: having multiple machines or multiple cores available will not reduce the time a query takes.

The need thus arose for a system that scales as more machines are added and is thus able to process huge datasets (in the 50 GB or larger range).

The NoSQL databases give up on at least one of the ACID properties to achieve better performance on parallel and/or distributed hardware.

 

Introduction to Hadoop

Apache Hadoop is a software framework for distributed processing of very large datasets.

It provides a distributed storage system, the Hadoop Distributed File System (HDFS), and a processing component, MapReduce.

The system is designed to recover from hardware failures in individual nodes that make up the distributed cluster.

Hadoop is itself written in Java. There are interfaces to use the framework from other languages.

Files on a Hadoop system are stored in a distributed manner: they are split into blocks, which are then stored on different nodes.

The map and reduce functions that a user of the Hadoop framework writes are packaged and sent to the different nodes, where they are executed in parallel.
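As a rough illustration of the programming model (a minimal sketch, not the framework’s own Java API), here is a word-count job written in the style of Hadoop Streaming, where the map and reduce steps are plain scripts that read from standard input and write tab-separated key–value pairs to standard output:

import sys

def mapper(stream=sys.stdin):
    # Map step: emit a (word, 1) pair for every word in the input split.
    for line in stream:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer(stream=sys.stdin):
    # Reduce step: the framework delivers the mapper output sorted by key,
    # so all counts for a given word arrive consecutively and can be summed.
    current_word, current_count = None, 0
    for line in stream:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

Such scripts would be submitted to the cluster through the Hadoop Streaming jar (roughly, hadoop jar hadoop-streaming.jar -mapper ... -reducer ... -input ... -output ...), with the framework handling the splitting, shuffling and sorting in between.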

Architecture

The Hadoop framework is composed of the following four modules:

  • Hadoop Common: the libraries needed by the other Hadoop modules.
  • Hadoop Distributed File System (HDFS): a distributed filesystem that stores data on commodity machines.
  • Hadoop YARN: a platform responsible for managing computing resources in clusters.
  • Hadoop MapReduce: a programming model for large-scale data processing.

HDFS

Files in HDFS are broken into blocks. The default block size is 64 MB (raised to 128 MB in later versions of Hadoop).

An HDFS cluster has two types of nodes – namenode (it works as the master node) and a number of datanodes (these work as the worker nodes).

The namenode manages the filesystem namespace: it maintains the filesystem tree and the metadata of each file.

The user does not have to interact with the namenode or datanodes directly; a client does that on the user’s behalf. The client also provides a POSIX-like filesystem interface.

MapReduce Engine

The MapReduce engine sits on top of the HDFS filesystem. It consists of one JobTracker and a number of TaskTrackers, one per worker node. The MapReduce jobs of the client applications are submitted to the JobTracker, which pushes the work out to the available TaskTracker nodes in the cluster.

The filesystem is designed such that the JobTracker knows which nodes contain the data and how far away they are, so that work can be scheduled close to the data.

Each TaskTracker sends a periodic heartbeat to the JobTracker to report that it is alive and to report the status of its tasks.

The JobTracker and TaskTrackers expose their status and other information through an embedded Jetty web server, which can be viewed from a web browser.

By default Hadoop uses FIFO scheduling. In modern versions the job scheduler is a separate, pluggable component, and it is possible to use an alternative scheduler (e.g. the Fair Scheduler or the Capacity Scheduler).

The Hadoop Ecosystem

The term “Hadoop” now refers not just to the 4 base modules above, but also to the ecosystem, or collection of additional software packages that can be installed on top of or alongside Hadoop.

Examples of these are Apache Pig, Apache Hive, Apache HBase, Apache Spark, Apache ZooKeeper, Impala, Apache Flume, Apache Sqoop, Apache Oozie and Apache Storm.

Quotients

1. Motivation

Consider the set {{\mathbb Q}} of rational numbers. A rational number is of the form {a/b} where {b \ne 0}. But all fractions of the form {ma/mb}, where {m \ne 0}, represent the same rational number. Thus, the elements of {{\mathbb Q}} are not really expressions of the form {a/b} with {b \ne 0}, but “equivalence classes” of such expressions, where two expressions {a/b} and {c/d} belong to the same equivalence class if {ad = bc}.

This method of treating two elements of a set as equal to create a new set can be used in other settings and on other structures, like groups and fields.

2. Using Quotients to Create a Field from a Ring

Consider the set {{\mathbb Q}[x]} of all polynomials in the variable {x} with coefficients in {{\mathbb Q}} (for example, {2x^3 + \frac{1}{2} x^2 - x - 1} is an element of this set). Any two polynomials in this set can be added, subtracted or multiplied together to give another polynomial from the same set. Thus {{\mathbb Q}[x]} is a commutative ring. But it is not a field, because dividing one polynomial by another does not, in general, give another polynomial.

We can however convert this to a field by using the concept of equivalence classes.

Let us choose a polynomial (say {x^3 - x - 1}) which has no roots in {{\mathbb Q}} and thus cannot be factored over it. We form an equivalence class from this polynomial: all multiples of this polynomial belong to the class, and this class will stand for zero in the new field that we are creating.

For any polynomial {P} which is not a multiple of {x^3 - x - 1}, there exist polynomials {Q} and {R} such that {PQ + R(x^3 - x - 1) = 1}, since {P} and {x^3 - x - 1} have no common factor. Then, since we regard {x^3 - x - 1} as zero, {Q} is the inverse of {P} in this new field. Thus every such {P} has an inverse, and we have turned the commutative ring into a field, called the quotient of {{\mathbb Q}[x]} by {x^3 - x - 1} and written {{\mathbb Q}[x]/(x^3 - x - 1)}.
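As a small worked example (added here for illustration, not part of the original argument), the inverse of the polynomial {x} in this quotient field can be computed directly:

\displaystyle  x \cdot (x^2 - 1) = x^3 - x = (x^3 - x - 1) + 1 \equiv 1,

since {x^3 - x - 1} is regarded as zero. Hence {x^{-1} = x^2 - 1} in {{\mathbb Q}[x]/(x^3 - x - 1)}.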

In creating quotient fields, the operations and the equivalence relation are defined so as to be compatible: if {P} and {P^\prime} belong to the same equivalence class, and so do {Q} and {Q^\prime}, then {P + Q} and {P^\prime + Q^\prime} belong to the same equivalence class, as do {PQ} and {P^\prime Q^\prime}. In other words, the operations are well defined on the equivalence classes.

3. Quotient Groups

We want to extend what we did while creating a field from the ring of polynomials to create a new group from two given groups.

Let {G} be a group and {H} be a subgroup of {G}. We define two elements {g_1} and {g_2} of {G} to be equivalent with respect to the subgroup {H} if {g_1^{-1}g_2 \in H}. The equivalence class of an element {g \in G} is then the set of all products {gh} with {h \in H}. This equivalence class is written as {gH} and is called a left coset of {H}.

We want to turn the set of left cosets into a group. The natural choice for the composition operation would be using the group multiplication already defined. Thus, {g_1 H \ast g_2 H = g_1 g_2 H}.

However, this does not always work: the operation is well defined only when {H} is a normal subgroup of {G}. Under this condition, choosing different representatives from the cosets {g_1 H} and {g_2 H} still produces the same product coset {g_1 g_2 H}.

If {H} is a normal subgroup, then the set of left cosets forms a group, written as {G/H} and called the quotient of {G} by {H}. Conversely, we can regard {G} as being built up from {H} and {G/H}; thus, if we understand {H} and {G/H}, we go a long way toward understanding {G}.
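A standard example (added here for illustration): take {G = {\mathbb Z}} under addition and {H = n{\mathbb Z}}, the set of multiples of a fixed integer {n}. Since {{\mathbb Z}} is abelian, every subgroup is normal, and the cosets are

\displaystyle  n{\mathbb Z},\; 1 + n{\mathbb Z},\; \dotsc,\; (n-1) + n{\mathbb Z},

so the quotient {{\mathbb Z}/n{\mathbb Z}} is the group of integers under addition modulo {n}.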

The groups whose only normal subgroups are the trivial subgroup and the group itself are called “simple groups”. They have a special role, in that they act the way prime numbers do in number theory.

Difference Equations

1. Difference Equations

1.1. Introduction

Time series analysis deals with sequences of random variables indexed by time.

1.2. First Order Difference Equations

We will study time indexed random variables {y_t}.

Let {y_t} be a linear function of {y_{t-1}} and {w_t}.

\displaystyle  y_t = \phi{y_{t-1}} + w_t  \ \ \ \ \ (1)

Equation 1 is a linear first-order difference equation. It is a first-order difference equation because {y_t} only depends on {y_{t-1}} and not on other previous {y_t}s.

In this chapter, we treat {w_t} as a deterministic number and later on we will analyse the effects of treating it as a random variable.

1.3. Solution by Recursive Substitution

The equations are:

\displaystyle  \begin{array}{rcl}  y_0 = \phi{y_{-1}} + w_0 \\ y_1 = \phi{y_0} + w_1 \\ \vdots \\ y_{t-1} = \phi{y_{t-2}} + w_{t-1} \\ y_t = \phi{y_{t-1}} + w_t \end{array}

By recursively substituting we obtain:

\displaystyle  y_t = \phi^{t+1}y_{-1} + \phi^tw_0 + \phi^{t-1}w_1 + \phi^{t-2}w_2 + \dotsc + \phi{w_{t-1}} + w_t  \ \ \ \ \ (2)

1.4. Dynamic Multipliers

We want to know the effect of an increase in {w_t} on {y_{t+j}}. This can be obtained by writing the solution in equation 2 for {y_{t+j}}, starting the recursion at time {t}, and differentiating with respect to {w_t}.

\displaystyle  \frac{\partial y_{t+j}}{\partial w_t} = \phi^{j} \ \ \ \ \ (3)
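As a quick numerical check (a sketch added here with assumed numbers, not part of the original derivation), the dynamic multiplier can be verified by simulating the recursion twice, once with {w_t} increased by one unit, and comparing the change in {y_{t+j}} with {\phi^j}:

import numpy as np

def simulate(phi, w, y_init=0.0):
    # Iterate y_t = phi * y_{t-1} + w_t starting from y_{-1} = y_init.
    y = np.empty(len(w))
    prev = y_init
    for t, w_t in enumerate(w):
        y[t] = phi * prev + w_t
        prev = y[t]
    return y

phi, t, j = 0.8, 5, 3
w = np.random.default_rng(0).normal(size=20)

baseline = simulate(phi, w)
w_perturbed = w.copy()
w_perturbed[t] += 1.0                      # increase w_t by one unit

# Because the system is linear, the difference equals phi**j exactly.
multiplier = simulate(phi, w_perturbed)[t + j] - baseline[t + j]
print(multiplier, phi ** j)                # both equal 0.512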

2. pth-Order Difference Equations

We generalize the above dynamic system to let the value of {y} depend on {p} of its own lags in addition to the current value of the input variable {w_t}.

\displaystyle  y_t = \phi_{1}y_{t-1} + \phi_{2}y_{t-2} + \dotsc + \phi_{p}y_{t-p} + w_t  \ \ \ \ \ (4)

We will rewrite the above p-th order equation to a vector first order equation.

We define,

\displaystyle  \xi_t= \begin{bmatrix} y_t \\ y_{t - 1} \\ \vdots \\ y_{t - p + 1} \end{bmatrix},\quad \ \ \ \ \ (5)

\displaystyle  F = \begin{bmatrix} \phi_{1} & \phi_{2} & \phi_{3} & \cdots & \phi_{p - 1} & \phi_{p} \\ 1 & 0 & 0 & \cdots & 0 & 0\\ 0 & 1 & 0 & \cdots & 0 & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 1 & 0 \end{bmatrix},\quad \ \ \ \ \ (6)

\displaystyle  v_t= \begin{bmatrix} w_t \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix},\quad \ \ \ \ \ (7)

Then,

\displaystyle  \xi_t = F\xi_{t-1} + v_t  \ \ \ \ \ (8)

or,

\displaystyle  \begin{bmatrix} y_t \\ y_{t - 1} \\ y_{t - 2} \\ \vdots \\ y_{t - p + 1} \end{bmatrix}\quad = \quad \begin{bmatrix} \phi_{1} & \phi_{2} & \phi_{3} & \cdots & \phi_{p - 1} & \phi_{p} \\ 1 & 0 & 0 & \cdots & 0 & 0\\ 0 & 1 & 0 & \cdots & 0 & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 1 & 0 \end{bmatrix}\quad \begin{bmatrix} y_{t - 1} \\ y_{t - 2} \\ y_{t - 3} \\ \vdots \\ y_{t - p} \end{bmatrix}\quad + \quad \begin{bmatrix} w_t \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}\quad \ \ \ \ \ (9)

Following the approach taken for solving the first-order difference equation and applying it to the vector equation, we get

\displaystyle  \xi_t = F^{t+1}\xi_{-1} + F^tv_0 + F^{t-1}v_1 + F^{t-2}v_2 + \dotsc + F{v_{t-1}} + v_t  \ \ \ \ \ (10)

3. General Solution of a p-th Order Difference Equation

If the eigenvalues of the matrix {F} are distinct, then we can write {F} as

\displaystyle  F = T\Lambda T^{-1} \ \ \ \ \ (11)

\displaystyle  \Lambda = \begin{bmatrix} \lambda_{1} & 0 & 0 & \cdots & 0 \\ 0 & \lambda_{2} & 0 & \cdots & 0 \\ 0 & 0 & \lambda_{3} & \cdots & 0 \\ \vdots & \vdots & \vdots & \cdots & \vdots \\ 0 & 0 & 0 & \cdots & \lambda_{p} \end{bmatrix},\quad \ \ \ \ \ (12)

where {T} is a non-singular matrix whose columns are the eigenvectors of {F}, and {\Lambda} is the diagonal matrix of the corresponding eigenvalues.

Thus,

\displaystyle  F^2 = T\Lambda^2 T^{-1} \ \ \ \ \ (13)

and

\displaystyle  \Lambda^2 = \begin{bmatrix} \lambda_{1}^2 & 0 & 0 & \cdots & 0 \\ 0 & \lambda_{2}^2 & 0 & \cdots & 0 \\ 0 & 0 & \lambda_{3}^2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \cdots & \vdots \\ 0 & 0 & 0 & \cdots & \lambda_{p}^2 \end{bmatrix},\quad \ \ \ \ \ (14)

In general,

\displaystyle  F^n = T\Lambda^n T^{-1} \ \ \ \ \ (15)

and

\displaystyle  \Lambda^n = \begin{bmatrix} \lambda_{1}^n & 0 & 0 & \cdots & 0 \\ 0 & \lambda_{2}^n & 0 & \cdots & 0 \\ 0 & 0 & \lambda_{3}^n & \cdots & 0 \\ \vdots & \vdots & \vdots & \cdots & \vdots \\ 0 & 0 & 0 & \cdots & \lambda_{p}^n \end{bmatrix},\quad \ \ \ \ \ (16)
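As a numerical check (a sketch with assumed coefficients, added here for illustration), we can build the companion matrix {F} for a small {p} and verify the identity {F^n = T\Lambda^n T^{-1}} directly:

import numpy as np

phi = np.array([0.6, 0.25, 0.1])        # assumed coefficients phi_1..phi_p, p = 3
p = len(phi)

# Companion matrix: first row holds the coefficients, ones on the subdiagonal.
F = np.zeros((p, p))
F[0, :] = phi
F[1:, :-1] = np.eye(p - 1)

# Diagonalize F (assumes the eigenvalues are distinct, as in the text).
eigenvalues, T = np.linalg.eig(F)
Lam = np.diag(eigenvalues)

n = 10
Fn_direct = np.linalg.matrix_power(F, n)
Fn_eig = T @ np.linalg.matrix_power(Lam, n) @ np.linalg.inv(T)

print(np.allclose(Fn_direct, Fn_eig))   # True: F^n = T Lambda^n T^{-1}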

Misconceptions About P Values and The History of Hypothesis Testing

1. Introduction

The misconceptions surrounding p values are an example where knowing the history of the field and the mathematical and philosophical principles behind it can greatly help in understanding.

The classical statistical testing of today is a hybrid of the approaches taken by R. A. Fisher on the one hand and Jerzy Neyman and Egon Pearson on the other.

The p-value is often confused with the Type I error rate, {\alpha}, of the Neyman-Pearson approach.

In statistics journals and research, the Neyman-Pearson approach replaced the significance testing paradigm 60 years ago, but in empirical work the Fisher approach is pervasive.

The statistical testing approach found in statistics textbooks is a hybrid of the two.

2. Fisher’s Significance Testing

Fisher held the belief that statistics could be used for inductive inference, that “it is possible to argue from consequences to causes, from observations to hypothesis” and that it is possible to draw inferences from the particular to the general.

Hence he rejected methods in which the probability of a hypothesis given the data, {\mathop{\mathbb P}(H \vert x)}, is used, in favour of ones which use the probability of the data given a particular hypothesis, {\mathop{\mathbb P}(x \vert H)}.

In his approach, the discrepancies in the data are used to reject the null hypothesis. This is done as follows:

The researcher sets up a null hypothesis, which is the status quo belief.

The sampling distribution under the null hypothesis is known.

If the observed data deviates from the mean of the sampling distribution by more than a specified level, called the level of significance, then we reject the null hypothesis. Otherwise, we “fail to reject” the null hypothesis.

The p-value in this approach is the probability of observing data at least as extreme as the current data set, calculated under the assumption that the null hypothesis is true.
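To make the definition concrete, here is a small simulation sketch (with made-up numbers, and assuming for simplicity that the population standard deviation is known); it approximates a two-sided p-value by generating the sampling distribution of the mean under the null hypothesis and measuring how often a statistic at least as extreme as the observed one occurs:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed sample; the null hypothesis is H0: mean = 0, sd = 1.
observed = rng.normal(loc=0.4, scale=1.0, size=30)
observed_stat = abs(observed.mean())

# Sampling distribution of the mean under H0.
n_sims = 100_000
null_means = rng.normal(loc=0.0, scale=1.0, size=(n_sims, 30)).mean(axis=1)

# Two-sided p-value: fraction of simulated means at least as extreme as observed.
p_value = np.mean(np.abs(null_means) >= observed_stat)
print(p_value)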

There is a common misconception that if {p = .05}, then the null hypothesis has only a 5% chance of being true.

This is clearly false, as can be seen from the definition: the p-value is calculated under the assumption that the null hypothesis is true, so it cannot be the probability that the null hypothesis is true (or false).

Conversely, a p value being high merely means that a null effect is statistically consistent with the observed results.

It does not mean that the null hypothesis is true; if we adopt the data-based inductive approach, we can only fail to reject it.

We need to consider the Type I and Type II error probabilities to draw such conclusions, which is done in the Neyman-Pearson approach.

3. Neyman-Pearson Theory

The main contribution of the Neyman-Pearson hypothesis testing framework (they named it so to distinguish it from the inductive approach of Fisher’s significance testing) is the introduction of the

  • probabilities of committing two kinds of errors, false rejection (Type I error), called {\alpha}, and false acceptance (Type II error), called {\beta}, of the null hypothesis.
  • power of a statistical test. It is defined as the probability of rejecting a false null hypothesis. It is equal to {1 - \beta}. (A simulation sketch estimating power follows this list.)
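The sketch below (an illustration with assumed numbers, not taken from the original text) estimates the power of a one-sided z-test by simulation: data are generated under a specific alternative, and the power is the fraction of simulated samples in which the null hypothesis is rejected at level {\alpha}. The closed-form value {1 - \beta} is computed for comparison.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

alpha, n = 0.05, 25
mu_null, mu_alt, sigma = 0.0, 0.5, 1.0      # assumed values for illustration

# Rejection threshold for a one-sided z-test of H0: mu = mu_null at level alpha.
z_crit = norm.ppf(1 - alpha)

# Simulate samples under the alternative and count how often H0 is rejected.
n_sims = 100_000
samples = rng.normal(loc=mu_alt, scale=sigma, size=(n_sims, n))
z_stats = (samples.mean(axis=1) - mu_null) / (sigma / np.sqrt(n))
power_simulated = np.mean(z_stats > z_crit)

# Closed-form power, 1 - beta, for the same test.
power_exact = 1 - norm.cdf(z_crit - (mu_alt - mu_null) / (sigma / np.sqrt(n)))
print(power_simulated, power_exact)         # both approximately 0.80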

Fisher’s theory relied on the rejection of the null hypothesis based on the data, assuming the null hypothesis to be true. In contrast, the Neyman-Pearson approach provides rules for deciding between the two hypotheses.

Neyman–Pearson theory, then, replaces the idea of inductive reasoning with that of what they called inductive behavior.

In Neyman’s own words, inductive behaviour was meant to imply: “The term ‘inductive behavior’ means simply the habit of humans and other animals (Pavlov’s dog, etc.) to adjust their actions to noticed frequencies of events, so as to avoid undesirable consequences”.

In this approach, the costs associated with Type I and Type II errors determine the decision to accept or reject. These costs vary from experiment to experiment, and this flexibility is the main advantage of the Neyman-Pearson approach over Fisher’s approach.

Thus while designing the experiment the researcher has to control the probabilities of Type I and Type II errors. The best test is the one that minimizes the Type II error given an upper bound on the Type I error.

What adds to the confusion is the fact that Neyman called the Type I error rate the level of significance, a term that Fisher used to denote p-values.

The Development of Mathematical Analysis

1. Reasons Behind The Creation of Analysis

Newton had approached his calculus with fluxions and fluents (flowing quantities), while Leibniz had approached his with differentials.

Both methods, however, had to deal with infinities and infinitely small quantities – specifically, to find methods that combine infinitely many infinitely small quantities into finite quantities.

The reasoning used to arrive at the methods by Newton and Leibniz was vague. For example, Leibniz correctly stated the result that:

\displaystyle  d(uv) = udv + vdu. \ \ \ \ \ (1)

The argument used to derive the result is as follows: Since,

\displaystyle  d(uv) = (u + du)(v + dv) - uv \\ = udv + vdu + dudv \ \ \ \ \ (2)

The term {du\thinspace dv} is a second-order differential and thus extremely small compared to the first-order differentials; it is therefore treated as 0.

So there is an inconsistency in the way that differentials are treated. The terms to ignore are arrived at not by reasoning but by working backwards from the correct answer.

Another problem was the treatment of divergent series.

For example, it was proven that

\displaystyle  1 - x + x^2 - x^3 + \dots = 1 / (1 + x) \ \ \ \ \ (3)

But it was not clear why this result breaks down when {x = 1}:

\displaystyle  1 - 1 + 1 - 1 + 1 - 1 + \dots \neq 1/2 \ \ \ \ \ (4)

The works of the leading mathematicians of that time (before the development of what we know today as analysis) – Euler, Daniel Bernoulli, Taylor, d’Alembert, etc. – included many arguments of suspect validity.

The mathematicians who solved these problems, and whose works led to the creation of analysis, were (in chronological order) Lagrange, Cauchy, Riemann and Weierstrass. Their contributions are discussed below.

2. Lagrange

The initial development of calculus had relied heavily on geometry and geometrical methods and visualization.

Lagrange brought about a shift in treatment of calculus from the geometrical approach to “algebraic analysis” of “analytic functions”.

“Analytic function” had initially meant any function given by a single expression, but Lagrange redefined it as a function which has a convergent Taylor series representation.

While he was right in his definition of analytic functions, the arguments he used turned out to be unextendable and thus untenable. This was corrected by Cauchy.

3. Cauchy

In the first decades of the nineteenth century, Cauchy revived the limit approach to analysis and gave the definitions of continuity and derivatives in terms of limits.

His greatest contribution to the field of analysis was that he gave clear definitions. For example, he defined the sum of an infinite series as the limit of the sequence of partial sums. This provided a unified approach for series of numbers as well as of functions.

Cauchy also gave the correct definition of continuity, defining it on an interval rather than at a single point.

His definition of the definite integral assumed continuity; this restriction was later removed by Riemann.

Abel found an error in Cauchy’s work, which led to the recognition that uniform convergence is distinct from pointwise convergence.

4. Riemann

As already noted, Riemann’s main contribution to analysis was the generalization of the definition of the definite integral to include discontinuous functions.

As an example, he exhibited a function that is discontinuous in every interval and yet has an integral.

Dirichlet had corrected an error in Cauchy’s work. While studying the conditions under which the Fourier series expansion of a function converges to the function, Dirichlet succeeded in proving such convergence for a function that has period {2\pi}, is integrable on an interval of that length, does not have infinitely many maxima and minima, and at jump discontinuities takes on the average value between the two limiting values on each side.

Riemann was able to give the generalized definition of the definite integral by extending Dirichlet’s method to more cases.

His definition dispelled the idea that integration is simply the inverse of differentiation, although the relationship still holds for continuous functions.

5. Weierstrass

The two main contributions by Weierstrass are the elimination of the idea of motion from limit processes and the representation of functions.

The new definition of limit without the idea of motion was based on what is now called the topology of the real line or complex plane.

He also developed different classes of functions by using their power series representations.

He also produced counterexamples to critique the work of other mathematicians and to point out errors.

One famous example is the nowhere differentiable but everywhere continuous function:

\displaystyle  f(x) = \sum_{n=0}^{\infty} b^n \cos(a^n x), \ \ \ \ \ (5)

where {0 < b < 1} and {a} is chosen so that {ab} is sufficiently large.

His approach of producing pathological examples substantially raised the precision of the hypotheses used in analysis.