Pexitics Blog

How HR Assessment and Survey tools are Validated.

Case Study – Validity of reports

The validity of an assessment is one of the primary questions on the mind of Human Resource professionals when subscribing to the assessments and services we provide, as part of our community and our base of happily served clients across the globe. While many may or may not be familiar with the statistical concepts of validation, we feel it is our duty to explain these concepts and practices, so that HR professionals are aware of these measures irrespective of the platform and services they use. We hope this document helps clear up some common misconceptions.

Sources of some of the content have also been shared as a reference at the end, in case you seek further reading and understanding of the subject.


Internal consistency, reliability and homogeneity are not only important; they are also the terms most often read and heard about in an assessment validation report. Consider this: if you went to a restaurant and ordered a dish, you would expect the same standards.

Internal Consistency: The food should not be well cooked at one end and uncooked at the other. Consistency can also be of texture: equally hard or soft, depending on the dish, across all areas or layers of the dish.

Reliability: This is the factor that makes a promise a promise. A simple example: most olive oils sold are not 100% olive oil, and many ice creams are not made of cream. Reliability is when you get what you are promised. In a restaurant setting, it is that promise: the meat or vegetables are as they have been described, and nothing else.

Homogeneity: This is the easiest to explain. It is an equal portion of saltiness, spiciness or sweetness in every bite.

Now that we are done with the meals, let’s get into some serious stuff.

In a general sense, internal consistency and homogeneity seem the same; they are, however, quite different, at least in the world of assessments. Internal consistency means that the assessment's scoring methods are consistent and do not differentiate on factors that might be observed statistically. A good example here is the attribute of agreeableness.

If a platform applies a second level of score differentiation for agreeableness based on gender, then simply stating your gender as female will get you a lower score than a male would receive. This is a lack of internal consistency in the scoring or profiling platform. Homogeneity means that once an assessment is taken, the scores should be similar across assessees and all scores should lead to one conclusion. This is correct for an individual item: two persons with similar personality and IQ should score similarly on a given item or attribute. But many confuse this with the entire assessment, which might contain multiple attributes or factors being assessed.

The entire assessment could be measuring multiple factors, as in DiSC or OCEAN, or our 7AS, and hence homogeneity cannot be expected: humans vary, and are a complex bundle of intelligence and emotions. We thus strongly disagree with personality profiling; it boxes people into 'x' types, irrespective of the finer complexities they inherit and showcase at the workplace and in general life. Reliability means that the same person taking the same assessment should get very similar results.

However, one of the biggest factors that needs to be underscored here is time. A person taking an assessment now and taking the same assessment after 6 months may well get different results, due to maturity and knowledge gained. Secondly, in psychometric assessments, someone who wants to show the system is unreliable only needs to remember a few of their responses and give contrary answers in an immediate retake. To eliminate this factor, we do not allow testing more than once within a 6-month timeframe.

The final concept is thus validity. Validity can mean different things in different contexts; for the workplace, it means that the results meet the objectives. So what are the objectives of an assessment tool at work?

Internal consistency in our instrument comes from the simple fact that a single yardstick is used to measure everyone. We also do not differentiate our scoring for an item on any basis whatsoever. The same number of questions is posed to every user, albeit in randomized order. We have, however, measured surrogate questions and found a 93% consistency in responses (the missing 7% is due to items either not being attempted or, in some cases, being answered inconsistently). In the research available on this subject, internal consistency of 70% and above is considered good. This is based on over 2,000 individual assessments conducted up to the date of data measurement.
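As an illustration (not our production code), a surrogate-question consistency check of the kind described above could be sketched as follows; the item names, the 1-5 Likert scale and the tolerance are assumptions:

```python
# Minimal sketch: consistency across surrogate (paired) questions.
# Assumes responses on a 1-5 Likert scale; a pair is "consistent"
# when its two answers differ by at most `tolerance`.

def surrogate_consistency(responses, pairs, tolerance=1):
    """Return the fraction of surrogate pairs answered consistently.

    responses: dict mapping item id -> numeric answer (None/missing if skipped)
    pairs: list of (item_a, item_b) surrogate pairs
    """
    consistent = 0
    for a, b in pairs:
        ra, rb = responses.get(a), responses.get(b)
        if ra is None or rb is None:
            continue  # unattempted pairs are not counted as consistent
        if abs(ra - rb) <= tolerance:
            consistent += 1
    return consistent / len(pairs)

answers = {"q1": 4, "q1_surrogate": 5, "q2": 2, "q2_surrogate": 5}
print(surrogate_consistency(answers, [("q1", "q1_surrogate"),
                                      ("q2", "q2_surrogate")]))  # -> 0.5
```

Here the first pair agrees within tolerance while the second does not, giving 50% consistency for this (tiny, illustrative) response set.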

Reliability of our assessments comes from two factors. First, we do not predict personality profiles; this means there can be infinite variations of personality. Our reports provide specific item-wise scores for the same reason. These are internally consistent across all attempts, leading to high reliability. The second factor comes from the fact that we created the content from convergent practices and used psychologists to write the questions, further eliminating personal bias.

Homogeneity is not something we claim to offer; the reason has been explained above. Pursuing homogeneity of performance is, however, always advisable: we advise our clients to benchmark existing employees using the tool, so that the scores of performers become the base measure against which others are assessed.

PexiScore in-house benchmark validation 

When you use a single measurement yardstick for an item, you can measure your best and worst employees and create your own validation scores. This is the right way to validate any assessment, as it is both flexible and fair.

One natural question could be: what about validity factors? This gets answered automatically through the above. When you measure your best and worst employees on a single yardstick, you find where the good ones score and where the weak ones lie. Using statistical techniques, we can then help organizations find the ideal cut-off for defining both hiring and performance indicators, bringing about a larger change in the people management framework and its efficiency.
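A minimal sketch of such a cut-off search, assuming each benchmarked employee has an overall assessment score and a known high/low performance label (the scores, labels and search method below are illustrative, not our actual statistical procedure):

```python
# Minimal sketch: find the score threshold that best separates
# high performers (label 1) from low performers (label 0).

def best_cutoff(scores, labels):
    """Return (threshold, accuracy) for the best separating cut-off."""
    candidates = sorted(set(scores))
    best_t, best_acc = None, -1.0
    for t in candidates:
        # predict "high performer" when score >= t
        correct = sum((s >= t) == bool(y) for s, y in zip(scores, labels))
        acc = correct / len(scores)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

scores = [55, 60, 72, 78, 81, 90]   # benchmark assessment scores
labels = [0, 0, 0, 1, 1, 1]         # observed performance
print(best_cutoff(scores, labels))  # -> (78, 1.0)
```

With clean benchmark data like this, the search recovers a cut-off of 78 that separates the two groups perfectly; real data would of course yield an accuracy below 1.0.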

Further study, dividing the data into 5 clusters and running optimization statistics, provided the following pointer:

For the particular job role, we found that certain items had higher priority than others; this captures the interitem correlation between items within an assessment, in a manner that becomes meaningful for the Human Resources function.

The concept of “consistent scores” does not necessarily mean the scores are identical or similar across items in an assessment. In a psychometric assessment with two items, if all employees agreed with the statement “I like to have more fun at work” and disagreed with the statement “I do not like a serious environment at work”, it would indicate perfect internal consistency: the two questions are negatively correlated, and the magnitude of the correlation will be as high as 1. If more items similar to these two are added, the average interitem correlation could be close to zero, because positive interitem correlations can cancel out negative ones.

To overcome this problem, it is advisable either to reverse the polarity of negative items into positive, or to use the mean of the absolute values of the correlations between questions as the measure of internal consistency. “One disadvantage of the average interitem correlation is that this measure is influenced by the presence of extreme correlation values. In addition, it does not reflect the variability among the interitem correlations. Therefore, the use of the average interitem correlation is problematic when the correlations between items follow a skewed distribution, in particular when some extreme correlation values are observed.” ~1

On Cronbach's alpha as used in standard assessment validations, the literature notes:

“If one combines the second and fourth statements (“Alpha is the lower bound of reliability of a test.” and “Alpha is equal to reliability in conditions of essential tau-equivalence.”), then one sees that Cronbach’s alpha is a lower bound of reliability and that it approaches reliability as the measurements become essentially tau-equivalent. Measurements are essentially tau-equivalent if they are linearly related and differ only by a constant. Cronbach’s alpha is a lower bound of reliability because perfect essential tau-equivalence is seldom if ever achieved (standardized alpha is not a lower bound, but is a direct approximation of reliability given items with equal observed variance). When tests comprise equal portions of general and group factor variance (in Cronbach’s, 1947, model), then their items are essentially tau-equivalent, and Cronbach’s alpha equals reliability.” ~ 2.

(Notes: 1. Tang, W., Cui, Y., & Babenko, O. (2014). Internal Consistency: Do We Really Know What It Is and How to Assess It? 2. Cortina, J. M. (1993). What Is Coefficient Alpha? An Examination of Theory and Applications.)
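For reference, Cronbach's alpha itself is straightforward to compute from item-level scores as k/(k−1) · (1 − Σ item variances / variance of total scores); a minimal sketch with illustrative data:

```python
# Minimal sketch: Cronbach's alpha from item scores.
# alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))

from statistics import pvariance  # population variance

def cronbach_alpha(items):
    """items: list of columns, one list of per-assessee scores per item."""
    k = len(items)
    item_vars = sum(pvariance(col) for col in items)
    totals = [sum(row) for row in zip(*items)]  # per-assessee total score
    return k / (k - 1) * (1 - item_vars / pvariance(totals))

items = [
    [4, 5, 3, 2],  # item 1, four assessees
    [5, 5, 2, 1],  # item 2
    [4, 4, 3, 2],  # item 3
]
print(round(cronbach_alpha(items), 3))  # -> 0.931
```

The high alpha here simply reflects that these three illustrative items rank the four assessees in the same order; as the quote above cautions, alpha is a lower bound of reliability, not reliability itself.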

Consistently validating what works is in the primary interest of the organisation. However, instruments like PexiScore Assessments become most effective when the instrument and the organisation come together in a concerted effort. It is like one more restaurant example we can't wait to share: the passion put into cooking can only be experienced with an equal passion for eating. Instruments reach their full capability when the data emerging from them is used to drive further decisions, making the experience easier and the results more accurate, with feedback refining the scoring methodology. Without this, any instrument will only continue to provide the same data, or use feedback from other organisations to correct your benchmarks, which is not appropriate.

Evidence of the reliability of our assessments, in the form of user testimonials, can be found on our webpages; do visit, and connect with us so we can help you further. A reliable instrument is one used across all levels within the organization, so that not only is everyone measured with a single yardstick, but the ability to move up within the organization is based on measurements that have been applied at all levels, making it more reliable. Just like in sports: improvement is measured against the same yardsticks and rules of measurement. A 100-metre sprint and a 500-metre race use the same stopwatch; otherwise the measurements would be flawed.

However, most organisations make this mistake: they use different yardsticks to measure leadership positions while using simpler methods to hire and promote for junior positions. This would be acceptable if you promoted no one from within and always hired from outside. But we do promote people from mid-level positions into leadership; otherwise the organisation would be bereft of people at senior-mid positions.

Our assessments at PexiScore cover the entire range, from intern to senior leadership/management roles. We use the same yardstick to measure everyone; the only thing that differs is the weightage. Different weightages at different levels ensure that while the scoring yardstick is not compromised, the number of metres to run differs.
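A sketch of level-specific weightages over a common yardstick; the attribute names and weights below are illustrative assumptions, not PexiScore's actual model:

```python
# Minimal sketch: one scoring yardstick, level-specific weightages.
# Illustrative attributes and weights only.

WEIGHTS = {
    "intern":     {"cognitive": 0.5, "personality": 0.3, "leadership": 0.2},
    "leadership": {"cognitive": 0.2, "personality": 0.3, "leadership": 0.5},
}

def weighted_score(item_scores, level):
    """Combine the same item scores with the level's weightages."""
    w = WEIGHTS[level]
    return sum(item_scores[item] * w[item] for item in w)

scores = {"cognitive": 80, "personality": 70, "leadership": 60}
print(weighted_score(scores, "intern"))      # -> 73.0
print(weighted_score(scores, "leadership"))  # -> 67.0
```

The same candidate, measured on the same items, lands differently at the two levels because only the weights change, never the yardstick.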

Reach us to know more about People Assessments at Work. We want to help you do Smarter Assessments.

