We have implemented Google Analytics on our website (Pexitics.com) and have been tracking the number of visitors, their activities and their behavior patterns using various goals, filters and campaigns. Google Analytics is an important tool that provides a wide range of features for generating data on a website’s traffic. It can also review online campaigns by tracking landing page quality and conversions (goals). Goals might include sales, lead generation, viewing a specific page, or downloading a particular file.

Once Google Analytics starts generating data, the data needs to be analysed alongside the other activities that run in parallel with the website and can directly or indirectly influence its traffic. So, from two weeks of data, the number of page views per day was taken and analysed by conducting various statistical tests using Excel and its Data Analysis plug-in.

The number of visitors on the website and their activities were tracked over a period of time and analysed to draw insights. Hits on the pages are triggered by social media campaigns, by new posts or by other promotional events conducted outside the website. All these activities are recorded and analysed to compare the success of one event against another.

**Data Exploration**

We performed exploratory data analysis on the two weeks’ data and conducted some basic tests to derive fundamental information. The purpose of these tests is to find how similar the two data sets are and to identify the activities that have an indirect influence on them.

Fig 1: Two weeks’ data covering the number of page views:

A very brief analysis (Fig 2) of the above data can be used to draw some basic insights:

Fig 2: Preliminary analysis of two weeks’ data covering the number of page views

Hypothesis testing is an essential procedure in statistics, which evaluates two mutually exclusive statements about a population to determine which statement is best supported by the sample data. In hypothesis testing, a statistical sample is tested, with the goal of accepting or rejecting a null hypothesis (which is an assumption considered to be true). The tests tell us whether or not the assumed primary hypothesis is true. If it isn’t true, a new hypothesis to be tested is formulated, repeating the process until data reveals a true hypothesis.

**Hypothesis Testing**

It is done in 5 simple steps:

- Figure out and state the null hypothesis,
- State the alternate hypothesis,
- Choose what kind of test we need to perform,
- Define the significance level,
- Either reject or fail to reject the null hypothesis.

So, now we conduct a hypothesis test where:

Null Hypothesis (H_{0}): The mean number of visitors is the same for the two weeks.

Alternate Hypothesis (H_{a}): The mean number of visitors is different for the two weeks.

A p-value (the level of marginal significance) is used to make the determination. If the p-value is less than or equal to the significance level, a cut-off point defined in advance, then the null hypothesis can be rejected. Let’s use a significance level of 0.05: a p-value at or below it indicates strong evidence against the null hypothesis, which states that the two weeks’ data follow a similar pattern.

Firstly, the one-way ANOVA (Analysis of Variance) test is performed. It is a technique used to compare the means of two or more samples (using the F distribution). It applies only to a numerical response variable, the “Y”, usually one variable, and a numerical or (usually) categorical input variable, the “X”, always one variable, hence “one-way”. ANOVA is a relatively robust procedure with respect to violations of the normality assumption and is thus apt for our situation.
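As a rough illustration of what the ANOVA table contains, here is a minimal Python sketch that computes the sums of squares, degrees of freedom and F statistic for two groups. The helper name `one_way_anova` and the page-view numbers are invented for illustration; they are not the Pexitics.com data, and the analysis in this post was done in Excel.

```python
from statistics import mean


def one_way_anova(*groups):
    """Return (ss_between, ss_within, df_between, df_within, f_stat)."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each observation around its own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    # F is the ratio of the two mean squares
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_stat


# Hypothetical daily page views for two weeks (invented values)
week1 = [120, 135, 150, 110, 160, 145, 130]
week2 = [125, 128, 170, 105, 160, 150, 142]

ss_b, ss_w, df_b, df_w, f_stat = one_way_anova(week1, week2)
print(f"SS between={ss_b:.2f}, SS within={ss_w:.2f}, F({df_b}, {df_w})={f_stat:.3f}")
```

The p-value is the upper-tail probability of the F distribution with these degrees of freedom. Excel’s ANOVA output reports it directly, and a statistics library such as SciPy (`scipy.stats.f_oneway`) can supply it in Python.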

Performing the test produced the following result:

**ANOVA:**

Here, the p-value is much greater than the significance level (0.05), so on its own the ANOVA does not allow us to reject the null hypothesis. However, other measurements like the sum of squares and the mean squares of the two data sets differ a lot.

But, before coming to a conclusion about the null hypothesis, other results about these two data sets were calculated and analysed.

The following results are derived from the above calculations:

These results suggest rejecting the null hypothesis: the 3-sigma range around the mean of the data is very wide, and hence these two distributions cannot be treated as the same.

**Test for Normality:**

The ratios of mean to median of the two data sets vary significantly and deviate from 1 in two different directions.

Hence this is not a case of normal distribution, and therefore we will conduct other tests.
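The mean-to-median check above can be sketched in a few lines of Python. The function name `mean_median_ratio` and the numbers are hypothetical, chosen only to show two samples whose ratios fall on opposite sides of 1. This is a quick heuristic for skewness rather than a formal normality test.

```python
from statistics import mean, median


def mean_median_ratio(values):
    """Ratio of mean to median; close to 1 for a roughly symmetric sample."""
    return mean(values) / median(values)


# Hypothetical daily page views for two weeks (invented values)
week1 = [120, 135, 150, 110, 160, 145, 130]
week2 = [125, 128, 170, 105, 160, 150, 142]

ratio1 = mean_median_ratio(week1)  # slightly above 1: mean pulled up
ratio2 = mean_median_ratio(week2)  # slightly below 1: mean pulled down
print(f"week 1: {ratio1:.4f}, week 2: {ratio2:.4f}")
```

A ratio above 1 hints at a right-skewed sample and a ratio below 1 at a left-skewed one. With only seven points per week the heuristic is weak, which is one more reason to fall back on tests that do not assume normality.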

**Non-Parametric Tests:**

We perform __non-parametric statistical tests__ on the data to explore further:

Nonparametric tests are sometimes called distribution-free tests because they are based on fewer assumptions (e.g., they do not assume that the outcome is approximately normally distributed). Parametric tests involve specific probability distributions (e.g., the normal distribution) and the tests involve estimation of the key parameters of that distribution (e.g., the mean or difference in means) from the sample data. The cost of fewer assumptions is that nonparametric tests are generally less powerful than their parametric counterparts (i.e., when the alternative is true, they may be less likely to reject H_{0}).

Here, in this case the following assumptions are made:

Hypothesized difference (D): 0

Significance level (%): 5

p-value: Asymptotic p-value

Continuity correction: Yes

Now, we perform a series of tests on the given data set:

**I. Sign test / Two-tailed test:**

The p-value is computed using an exact method.

__Test Interpretation:__

H_{0}: The two samples follow the same distribution.

H_{a}: The distributions of the two samples are different.

As the computed p-value is greater than the significance level alpha=0.05, one cannot reject the null hypothesis H_{0}.

The risk of rejecting the null hypothesis H_{0} while it is true is 68.75%.
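For readers who want to see how an exact sign test works, here is a minimal Python sketch. It pairs the days of the two weeks, counts positive and negative differences (ties are dropped), and computes the two-tailed p-value from the binomial distribution with probability 0.5. The function name `sign_test` and the data are invented for illustration and are not the site’s actual page views.

```python
from math import comb


def sign_test(x, y):
    """Exact two-tailed sign test on paired samples.

    Returns (n_positive, n_negative, p_value). Zero differences are ignored.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    n_pos = sum(d > 0 for d in diffs)
    n_neg = n - n_pos
    k = min(n_pos, n_neg)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for the two-tailed test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return n_pos, n_neg, min(1.0, 2 * tail)


# Hypothetical daily page views for two weeks (invented values)
week1 = [120, 135, 150, 110, 160, 145, 130]
week2 = [125, 128, 170, 105, 160, 150, 142]

n_pos, n_neg, p_value = sign_test(week1, week2)
print(n_pos, n_neg, p_value)  # 2 positive, 4 negative, p = 0.6875
```

With these made-up values there are two positive differences, four negative ones and one tie, so the p-value is well above 0.05 and the null hypothesis is not rejected.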

**II. Wilcoxon signed-rank test / Two-tailed test:**

An approximate method has been used to compute the p-value.

__Test interpretation__:

H_{0}: The two samples follow the same distribution.

H_{a}: The distributions of the two samples are different.

As the computed p-value is greater than the significance level alpha=0.05, one cannot reject the null hypothesis H_{0}.

The risk of rejecting the null hypothesis H_{0} while it is true is 83.35%.

The continuity correction has been applied.

Ties have been detected in the data and the appropriate corrections have been applied.
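The ingredients mentioned above (normal approximation, continuity correction and tie correction) can be sketched in Python as follows. This is an illustrative implementation under those stated assumptions, not the exact routine of the tool used for this post. The name `wilcoxon_signed_rank` and the data are hypothetical.

```python
from collections import Counter
from math import erfc, sqrt


def wilcoxon_signed_rank(x, y):
    """Two-tailed Wilcoxon signed-rank test using the normal approximation.

    Applies a continuity correction and a tie correction to the variance.
    Returns (w_plus, z, p_value). Zero differences are dropped.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    abs_sorted = sorted(abs(d) for d in diffs)
    # Assign average ranks to tied absolute differences
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and abs_sorted[j] == abs_sorted[i]:
            j += 1
        rank_of[abs_sorted[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    w_plus = sum(rank_of[abs(d)] for d in diffs if d > 0)
    mean_w = n * (n + 1) / 4
    tie_term = sum(t ** 3 - t for t in Counter(abs_sorted).values()) / 48
    var_w = n * (n + 1) * (2 * n + 1) / 24 - tie_term
    # Continuity correction: shrink the distance from the mean by 0.5
    z = max(abs(w_plus - mean_w) - 0.5, 0.0) / sqrt(var_w)
    p_value = erfc(z / sqrt(2))  # equals 2 * (1 - Phi(z))
    return w_plus, z, p_value


# Hypothetical daily page views for two weeks (invented values)
week1 = [120, 135, 150, 110, 160, 145, 130]
week2 = [125, 128, 170, 105, 160, 150, 142]

w_plus, z, p_value = wilcoxon_signed_rank(week1, week2)
print(f"W+={w_plus}, z={z:.3f}, p={p_value:.3f}")
```

Unlike the sign test, this test also uses the size of each difference through its rank, so it is usually somewhat more sensitive. For small samples an exact p-value is preferable to the approximation when the software offers it.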

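Summary:

So, the non-parametric tests conducted give p-values of 68.8% and 83.3% respectively. Both are far above the 5% significance level, so neither test provides evidence that the two sets of data follow different distributions. Note that a p-value is not the probability that the two distributions are the same; it only tells us that differences this large could easily arise by chance.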

**III. Comparison of two distributions using the Kolmogorov–Smirnov test (K–S test or KS test):**

This is also a non-parametric test of the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution (one-sample K–S test), or to compare two samples (two-sample K–S test). The assumptions are the same:

Hypothesized difference (D): 0

Significance level (%): 5

p-value: Asymptotic p-value

The calculations of the test are as follows:

__Test Interpretation:__

An approximation has been used to compute the p-value.

H_{0}: The two samples follow the same distribution.

H_{a}: The distributions of the two samples are different.

As the computed p-value is greater than the significance level alpha=0.05, one cannot reject the null hypothesis H_{0}.

The risk of rejecting the null hypothesis H_{0} while it is true is 54.12%.

So, the p-value is 54.12%, which is well above the 5% significance level, and the K–S test also fails to show that the two distributions differ. This does not prove that they are the same: with only two weeks of data the tests have little power, so we need more sets of data before concluding that the two weeks follow the same pattern.
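To make the two-sample K–S test concrete, here is a minimal Python sketch. It computes the statistic D, the largest gap between the two empirical distribution functions, and an asymptotic p-value from the Kolmogorov series. The function name `ks_two_sample`, the cut-off used for very small statistics and the data are assumptions made for illustration.

```python
from bisect import bisect_right
from math import exp, sqrt


def ks_two_sample(x, y):
    """Two-sample Kolmogorov-Smirnov test with an asymptotic p-value.

    Returns (d, p_value), where d is the maximum distance between the
    empirical cumulative distribution functions of x and y.
    """
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    # Evaluate both ECDFs at every observed value and take the largest gap
    d = max(
        abs(bisect_right(xs, v) / n - bisect_right(ys, v) / m)
        for v in xs + ys
    )
    lam = d * sqrt(n * m / (n + m))
    if lam < 0.2:
        # The series converges slowly here, and the p-value is essentially 1
        return d, 1.0
    series = sum(
        (-1) ** (k - 1) * exp(-2 * k * k * lam * lam) for k in range(1, 101)
    )
    return d, min(max(2 * series, 0.0), 1.0)


# Hypothetical daily page views for two weeks (invented values)
week1 = [120, 135, 150, 110, 160, 145, 130]
week2 = [125, 128, 170, 105, 160, 150, 142]

d, p_value = ks_two_sample(week1, week2)
print(f"D={d:.4f}, asymptotic p={p_value:.4f}")
```

The asymptotic p-value is only an approximation for samples as small as seven points per week, which is another reason to treat a non-significant result here as inconclusive. SciPy’s `scipy.stats.ks_2samp` offers the same test in Python.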

## So, what do you think?