Time after time!

Improving validity and reliability with regards to EAL learners in summative Maths assessments.

“A successful assessment system will enable the teacher to understand thoroughly what is expected to be mastered by pupils at any given stage of education, and assess the progress towards doing so in a meaningful and fair way” (Education Endowment Foundation 2017)

Since the removal of levels and the introduction of the 2014 National Curriculum, assessment has been a contentious and divisive issue amongst teachers (NASUWT 2015). As well as using commercially produced assessment schemes, such as Rising Stars or GL Assessment, some schools have opted to create their own assessment materials based on the Age Related Expectations for each year group. However, this is no easy task, and a number of important factors must be considered when devising an assessment scheme (Fig 1).

Flow chart

How, through a balance of formative and summative assessments, can teachers ensure robust assessments which:

  1. Have a clear purpose
  2. Ensure validity
  3. Produce reliable results?

To put this problem into context, my current Primary school follows the National Curriculum and consists of over 800 students, with 74% of children identified as having EAL needs. Approximately 60% of the children are UAE nationals, whose first language is Arabic, and over 50 other nationalities are represented throughout the school. In order to assess children effectively, we need to take into consideration factors such as reliability and validity; however, we also have to consider the implications of children studying and recording their assessments in a second language – English. As we currently use an assessment scheme commercially produced for English schools with lower proportions of EAL learners, this impacts on the underlying purpose of the assessments and therefore on the validity and reliability of the results.

Firstly, it is important to define the terms purpose, validity and reliability. 


The ‘Final report of the Commission on Assessment without Levels’ states:

The data generated by assessment can be an invaluable starting point to inform teaching, but it is important to be clear about the intended purpose of different assessments and the uses of the data that they generate. (John McIntosh 2015)

Any assessment, whether formative or summative, should have one of two main purposes: to assist with learning or to quantify what learning has taken place (Harlen 2006). The purpose for which an assessment is to be used should be clearly defined (EEF 2017). From Fig 1, we can see it is important to determine what the outcome of the assessment will be used for: for example, to measure the progress made by, or the attainment of, a student, or to identify students who require extra support.

The type of assessment will differ depending on its purpose (and vice versa), and the data gained will be used differently depending on the audience. Assessments can be divided into three broad categories (Final report of the Commission on Assessment without Levels 2015):

  1. Day to day formative assessment
  2. In-school summative assessment
  3. National standardised summative assessment

Each has its own purpose depending on the audience. All of the assessments above could be used by children, teachers, parents, school leaders or Ofsted, each for a differing purpose.

Day to day formative assessments are used by the class teacher to identify what pupils have learned against the learning objective and plan accordingly for ways to support or extend a child’s learning in future lessons. These types of day to day formative assessment can be used by teachers to inform pupils and parents of specific areas of strength and weakness within lessons and identify targets to help pupils improve.

In-school summative assessments are used by teachers and pupils to identify what learning has taken place over a longer period of time, for example, a specific unit of work. Pupils can use the outcomes from these assessments to reflect on their learning and identify areas of strength and weakness. These results can also be used by the teacher to reflect on their practice and plan future lessons to tackle any areas of weakness identified for that specific cohort. They can also be used by teachers to communicate with parents through the use of termly reports. School leaders use in-school summative assessments to monitor pupil progress and identify any areas which may require intervention.

National standardised summative assessments (SATs, GCSEs, A-levels) could be used by teachers to reflect upon their own teaching and assess their own performance in relation to national levels. In my current school, they allow us to compare ourselves with how schools are performing in England. Pupils and parents can use these assessments to compare themselves against children nationally, and parents can also use the results to hold the school accountable for poor performance. School leaders use the results from these assessments to measure their school against others, both locally and nationally. Government would also use the results from national tests to inform policy and hold education providers to account.

Ofsted would use all three of the above to form an overall picture of a school during an inspection. Through discussions with teachers and observations, Ofsted would expect to see day to day formative assessment taking place in class, through effective questioning and appropriate planning being in place. With regards to in-house and national summative assessments, Ofsted would expect to see these being used effectively, with appropriate measures in place to support pupil progress and attainment based on the results. All three aspects would inform Ofsted's overall judgement of a school.

Whilst some assessments can be used for more than one purpose, it is important for the audience to know and remember the primary aim of any assessment. The purpose of any assessment should be clearly communicated to parents and children, as it is not always clear to those outside the profession. This would help to reduce the possibility of misinterpreting the outcome. Whilst an in-school or national standardised assessment can be used to monitor pupil progress, school leaders could also use the results to monitor teacher performance and influence the outcomes of a teacher's performance management. In this instance, school leaders must be aware this is not the primary purpose of these assessments and be careful not to misinterpret results by applying the outcomes to a purpose for which the assessment was not intended.

From the outset, an assessment, whether formative or summative, must be clear about what it intends to assess, what purpose the information it produces will serve, and who its audience is. This should not be confused with validity, which will be explained below.


“The term ‘validity’ is one of the most important and one of the most debated concepts in educational measurement” (Goldstein 2015)

This is in part due to the fact that the definition of validity has continually changed over the last three decades (Shepard 1993), with the evolution of various sub-sections within the overarching term, such as construct validity, predictive validity, concurrent validity and content validity. Traditionally, a test is determined to be valid if it assesses what it claims to assess. However, this definition is limited, primarily because validity is not a property of an assessment, but a property of the conclusions drawn from it (Wiliam 2000).

Content validity is established by ensuring the questions in an assessment relate to the coursework covered in the curriculum. Therefore, as a Year 3 teacher, I must ensure the questions covered in an assessment will assess the content of the lessons I have taught. I would not expect to find questions relating to the area of compound shapes in a maths assessment, or human reproduction in an end of year science assessment. This can lead to the problem of 'teaching to the test' if teachers are aware of what questions will be asked in an assessment.

Construct validity does not refer to how a test was created or 'constructed'. Instead, construct validity refers to whether an assessment is testing what it claims to test, often in areas in which it is difficult to draw specific conclusions due to a lack of quantitative data. An example would be assessing whether a programme followed by a school designed to develop a growth-mindset actually worked. Construct validity, in this case, would require any assessment carried out to actually measure the development of a growth-mindset.

Predictive validity means it is possible to predict, with a high level of certainty, what a child will achieve in the future based on their current level of performance. The entrance assessment used in my current school, the CATs produced by GL Assessment, not only assesses a child's current ability but also claims to predict, based on large data sets, what a child is expected to achieve at GCSE and A-level.

Concurrent validity is the degree to which the results of one assessment correlate with the results of another assessment designed to test the same thing. If I were to test my maths class with a Rising Stars Year 3 fractions assessment, I would expect them to achieve similar results in another commercially produced assessment, such as one from White Rose Maths Hub. If they did achieve similar results, this would be evidence of concurrent validity.

As teachers, we use assessments in a number of different ways. In order for an assessment to be valid, it must test what it claims to be testing, i.e. be fit for purpose. A Primary Science assessment which requires a certain proficiency in reading is not testing a student's Science knowledge and skills alone, and therefore it is not solely testing what it claims to test. As a result, there is a reduction in the validity of any conclusions a teacher may draw from this assessment. This becomes particularly important in the context of EAL learners in my current school and will be discussed in greater detail, with specific examples, later.

Teachers also use assessments to predict outcomes for students in future tests. Wiliam (2000) uses the example of universities offering places to high school students. Without knowing exactly how a student will perform, universities use A-level results as an indicator of future attainment. It would be valid for a university to conclude that a student achieving an A* in A-level Maths will perform well in a maths-related subject, for example engineering. Conversely, Stobart (2006) states it would be invalid to conclude that a student who achieves an A* at A-level in Drama would perform equally well on an engineering course.

These examples have referred to assessments in general terms, but validity impacts differently on summative and formative assessments. Both the previous examples are forms of summative assessment, or a more traditional 'test' scenario. Due to the amount of knowledge and skills taught to students, it would be impossible to devise a test which covered all areas of the curriculum, even at Primary level. Whilst national assessments can improve their validity by using a representative sample of content, an assessment such as a SAT, GCSE or A-level cannot be completely valid, as it cannot be expected to cover the complete range of skills and knowledge taught in the curriculum.

As Class Teachers, we use formative assessment on a daily basis, whether through discussions with children, games, quizzes, whiteboard work, etc. We draw conclusions from these interactions with the children and plan accordingly to enable the children to progress. Although these day to day formative assessments used by teachers in class are 'low stakes', the validity of these conclusions is harder to monitor and relies on professional integrity and 'trustworthiness' (Stobart 2006). By ensuring comparative frameworks are in place, such as rubrics with similar expectations and rigorous moderation, we can improve the validity of these day to day judgements made using formative assessments.

Crooks et al (1996), focusing on summative assessments, describe a model of eight 'threats' to validity, represented as a chain:

  1. Administration
  2. Scoring
  3. Aggregation
  4. Generalisation
  5. Extrapolation
  6. Evaluation
  7. Decision
  8. Impact

They provide examples of specific threats for each stage of the assessment to be considered by teachers creating or administering tests. By considering each of the possible 'threats' associated with each stage of the assessment process, we can maximise the validity of any conclusions we draw from the assessments. An area which I have explored is the administration of summative assessments, through the use of Arabic-speaking TAs in class. We used the TAs to provide direct translations of maths questions with two classes, and not with two others, using the same assessment paper. We found those who had Arabic translations performed better in the assessments than those without, so we made this standard across the year group, as not to do so would produce invalid results.
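The comparison we made between the two administration conditions can be sketched as a simple difference in mean scores. The numbers below are invented placeholders, not the actual year-group data:

```python
# A minimal sketch of comparing the two administration conditions.
# All scores are invented for illustration, not the real class data.
from statistics import mean

# Hypothetical raw scores (out of 30) on the same assessment paper.
with_translation    = [24, 21, 27, 19, 25, 22, 26, 23]
without_translation = [18, 20, 22, 15, 21, 17, 23, 19]

# A positive gap suggests the translated administration removed a
# language barrier rather than changing the mathematics being tested.
gap = mean(with_translation) - mean(without_translation)
print(f"Mean advantage with Arabic translation: {gap:.1f} marks")
```

In practice a formal significance test and a larger sample would be needed before drawing firm conclusions, but even a simple mean comparison like this made the administration threat visible to us.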

The definition of validity is not static and has continued to evolve over the last 60 years. Whilst teachers draw conclusions from the assessments we administer to our students, we must ensure those conclusions are valid by ensuring the tests cover what they set out to test and that we are interpreting the results correctly.


Reliability is the degree to which the 'score' achieved by any one individual is an accurate reflection of that person's actual ability, or 'true score' (Shillingburg 2016). If an assessment is reliable, the participant would achieve a similar score if they retook the test at a different time, on a different day, in a different room, etc. Whilst they may not achieve the exact same score, the results should be within an acceptable range to conclude the test is reliable. The opposite would be if the results varied so widely whenever the test was taken that the 'true score' of the participant could not be determined.

However, by their very nature, no test can be 100% reliable, as external factors contribute to the outcome of the test. Shillingburg (2016) highlights the reduced reliability of national-level assessments compared with class-level assessments. National-level assessments must represent a generalised view, whereas a class-level assessment takes into account the entire population (a group, a class or a cohort) sitting a particular test. Therefore, the more students who are expected to take a test, the larger the population and the more generalised the findings, hence a reduction in the reliability of the assessment.

Christodoulou (2016) highlights three factors which result in unreliable outcomes of national-level assessments:

  1. Sampling
  2. Student unreliability
  3. Marker unreliability

When a cohort of students takes a test, for example GCSE Physics, it then becomes unusable as a method of assessment for following years. The questions in the test must be changed to ensure the next cohort of students does not know the questions in advance. Whilst the papers test the same skills and knowledge year on year, the fact that the wording/phrasing has been changed means they are not identical tests, and this impacts on reliability. Hence it is not uncommon to hear Secondary teachers remarking on the difficulty of a particular paper compared with previous years.

The second factor highlighted by Christodoulou is student unreliability, or the performance of the student on the day. This is perhaps the most unpredictable factor and one which cannot be legislated for. As humans, we all have good days and bad days. Students may underperform in a test for any number of reasons – illness, lack of sleep, hunger, nerves, stress, a falling out with friends – and these are outside our control. They can, however, contribute to a student achieving lower, or perhaps even higher, grades than anticipated.

Finally, the nature of the test and how it is marked also influences the overall reliability of the assessment. Multiple choice tests will have a higher marker reliability as the answers are not open to interpretation by the marker. However, exams which have more open-ended questions are subject to marker interpretation and the more markers there are, the less reliable the grades become. The reliability of open-ended questions can be improved through moderation and clear marking guidelines, however, there will always be an element of subjectivity and therefore, issues with marker reliability.
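One rough way to quantify marker reliability is simple percentage agreement: have two markers score the same set of open-ended answers independently and count how often they award the same mark. A hedged sketch, with invented marks for ten answers:

```python
# Illustrative sketch: exact-agreement rate between two markers scoring
# the same ten open-ended answers (0-2 marks each). The marks below are
# invented for demonstration, not real moderation data.
marker_a = [2, 1, 0, 2, 1, 1, 2, 0, 1, 2]
marker_b = [2, 1, 1, 2, 1, 0, 2, 0, 1, 2]

# Count the answers where both markers awarded the same mark.
agreed = sum(a == b for a, b in zip(marker_a, marker_b))
agreement = agreed / len(marker_a)
print(f"Exact agreement: {agreement:.0%}")
```

A multiple choice paper would be expected to score close to 100% on such a check, while open-ended papers typically fall short, which is exactly the marker-reliability gap moderation and clear mark schemes aim to close.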

As test creation has become more sophisticated, so too has the reporting of results. An example is the New Group Reading Test (NGRT) produced by GL Assessment, which we currently use in my school and which reports a 90% confidence band alongside its Standard Age Score (SAS). Whilst the SAS represents what that particular student achieved on the assessment on that day, the band represents the range they could be expected to achieve taking the same test on a different occasion. Whilst it is impossible to create a test which is completely reliable (Wiliam 2000; Christodoulou 2016), there are ways to make tests as reliable as possible.
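The confidence band itself is straightforward arithmetic once a standard error of measurement (SEM) is known. A minimal sketch, using an assumed SEM and score for illustration rather than GL Assessment's published figures:

```python
# Illustrative sketch of a 90% confidence band around a Standard Age
# Score. Both the SAS and the SEM below are assumed values for
# demonstration; real bands come from the test's published reliability.
z_90 = 1.645                  # z-value for a two-sided 90% interval
sas = 104                     # hypothetical Standard Age Score
sem = 4.0                     # assumed standard error of measurement

lower = sas - z_90 * sem
upper = sas + z_90 * sem
print(f"SAS {sas}, 90% band: {lower:.0f}-{upper:.0f}")
```

The band makes the test's imperfect reliability explicit to the reader of the report: the single observed score is only an estimate of the pupil's 'true score', which is likely (but not certain) to fall within the band.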

GL Example

In context

Now that the terms purpose, validity and reliability have been defined, I will examine how they impact on the assessments which currently take place in my own school. During my 10 years of experience teaching in international schools in Qatar, China and the United Arab Emirates, the same question has emerged when administering maths assessments: what are we actually testing? In all of the schools I have worked in, the majority of the children have been identified as EAL, with my current school classifying 74% of the roll as EAL. The assessments we currently use are purchased as part of the Rising Stars scheme, produced by Hachette UK. However, these assessments are produced for English schools, the majority of which have fewer EAL learners than my current school. As a result, there is an implicit understanding that the children will have a certain level of proficiency in English reading and writing, linked to their Age Related Expectations.

As a school, we have discussed the purpose of these tests, the validity of the conclusions regarding progress and attainment which can be made and the reliability of the tests. It was one particular question which led to much discussion in the last year, and which I will use to illustrate the difficulties of using UK-based, commercially produced assessments with EAL learners.


The question above was taken from a Year 3 end-of-term assessment from the Rising Stars scheme of work. The direct translation of the answer given is 'one and forty-five minutes'. However, the question we are now faced with is: should the child receive the mark? This leads us back to the question, 'What is the purpose of the assessment?'


When thinking about purpose, we must decide whether an assessment actually assesses what it sets out to assess. What is the purpose of the question above? Is it to assess whether a student can:

  1. Tell the time, while using Roman numerals, on an analogue clock
  2. Tell the time, while using Roman numerals, on an analogue clock and record their answer in English

This final caveat is extremely important to an EAL teacher. If the purpose of the question is 1), then the child is awarded the mark; however, if we are assessing 2), and the child is expected to record the answer in English, then they would not receive the mark.

As a teacher of EAL children in an international school following the National Curriculum, I am acutely aware the children will eventually sit iGCSEs and A-levels, which are of course in English. By then, there is no question about which language they must use to respond to each question; however, as a teacher of Primary-aged learners, I ask myself what is more important: that they understand the concept of telling the time, or that they can do it in English? The answer is of course both, but at what point in their education does the purpose of the assessment switch from Purpose 1 to Purpose 2?

As teachers of EAL children, we have to teach both skills – telling the time and all the associated vocabulary which goes with it (numbers, big hand, small hand, o'clock, to, past, quarter, half, clockwise, digital, analogue, time, minutes, hours, seconds). All of these words, and an understanding of how they are put into practice, are required by the learner in order to answer this question.

If, however, we cannot decide on the purpose of the above question, we then must question the validity of any conclusions we draw from it.


The validity of the conclusion I draw, and the subsequent actions taken, will depend on the purpose of the assessment being clearly defined. If we determine the purpose of the question is to tell the time using Roman numerals on an analogue clock, then 1 mark is awarded for the answer given. As the assessment seeks to ascertain whether a child can tell the time, from a purely mathematical standpoint this would be the valid conclusion to draw from the answer. However, because the assessment has been generated for, and is primarily intended to be used with, native English speakers, there is an unspoken expectation by the teachers and the school as a whole that the answer will be recorded in English. Given that the intended outcome of the question is to ascertain whether the child can tell the time, it would be invalid not to award the mark on the basis that the answer was communicated in Arabic. This has led to much discussion across the Primary and Secondary SLT, and we have identified this as an area we must address from next year. Currently, we do not explicitly instruct the children to answer in English. As a whole school, we must decide when to introduce this as a 'non-negotiable' outcome.

This is the problem facing teachers of EAL children. Whilst the purpose of the question is to check for understanding of time, there needs to be explicit teaching of time-related vocabulary, which may not be familiar to the student in their everyday life. Unlike EAL children living in the UK, the pupils at my current school are not immersed in the English language. The majority spend their time outside of school conversing in another language (predominantly Arabic) and do not use subject-specific vocabulary as part of their day-to-day language. As a class teacher, I would use formative assessment to make note of the use of Arabic to answer the question and put in place support for the child to improve their English skills with regard to subject-specific vocabulary. Examining the answer closely, the child has met the Age Related Expectations for reading, as he has understood the question, and under this purpose the lack of English in the answer would not result in the child 'losing' any marks.

If we conclude the purpose of the question is to assess the child's ability to tell the time using Roman numerals on an analogue clock and record the answer in English, then they would be awarded no marks, as the answer given clearly does not meet the criteria for a correct answer. The inability to answer in English would supersede the correct mathematical response to the question and therefore result in 0 marks being awarded.

As a teacher my next step would be the same as before, to put in place support for the child to improve their English skills in relation to telling the time. I would note the response and when I planned to teach another unit on Time, I would ensure the child focused on learning and using the correct language associated with time.

As stated previously, my Primary colleagues and I are aware the children in my current school will be expected to undertake iGCSE and A-level exams in Secondary school, and the majority of those exams will be completed in English. As a Primary school, we are expected to prepare the children for these exams by not only teaching the appropriate content upon which Secondary colleagues can build, but also the subject-specific vocabulary associated with each subject. In Secondary exams the use of English is non-negotiable; however, at what point does this become true for Primary students, and the purpose, in this example of the maths assessment, change from Purpose 1 to Purpose 2, as defined above?

There is no 'one size fits all' answer to this question. As a through school (EYFS – Year 13), we are in a fortunate position whereby we can collaborate closely with colleagues from across Primary and Secondary. This way we can identify pupils who require extra support and put in place interventions to assist them with language acquisition. Whilst the End of Key Stage 1 assessments may be too early to demand all children answer a maths assessment in English, the End of Key Stage 2 SATs would appear a more appropriate time to introduce this requirement. If we accept the commonly held statistic that it takes the average learner 4-7 years to acquire an academically proficient level of English (Hakuta et al 2000), this would give the school and the student adequate time to learn the required subject-specific vocabulary.


If reliability is the degree to which a test score reflects the true ability of the learner, then we are faced with the same problem as above. There is little doubt the child in question would answer the same way had they taken the test at another time; however, we must first determine the purpose of the question in order to conclude whether the test is reliable when administered to EAL learners.

If we agree the purpose of the question is to assess whether a student can tell the time, while using Roman numerals, on an analogue clock then the answer does show a true reflection of the student’s understanding. The same can be said if the purpose of the test comes with the added caveat of ‘and can record their answer in English’. If this is the purpose of the question then the student would not receive the mark and this would be an accurate reflection of their ability in relation to the purpose of the test.

However, if the purpose of the question is undetermined, then it becomes unreliable. If the question falls under Purpose 1, as stated above, and the student does not receive the mark, this does not accurately reflect their level of ability. Conversely, if the question falls under Purpose 2 and the student is awarded the mark as the result of marker unreliability, then the results of the test do not accurately reflect the student's ability, and the test suffers from decreased reliability.

It is vital the purpose of the assessment is clearly defined in order to make the outcome as reliable as possible. This is why it is important to be explicit with EAL learners about the language they use to communicate their answers. Once again, however, at what point in the education process does this become non-negotiable for the student and change from an explicit to an implicit expectation?


One of the proposed benefits of the removal of levels from the National Curriculum was that it would allow schools the freedom to develop assessments tailored to meet the needs of their pupils, thereby improving progress and attainment as well as promoting a 'higher quality of teaching, learning and assessment' (Final report of the Commission on Assessment without Levels 2015, p.12). However, due to the difficulties in producing a valid, reliable assessment with a clearly defined purpose, many schools in England have opted to use commercially produced assessments.

However, as teachers in an international school with a high proportion of EAL learners, we cannot simply adopt an assessment scheme designed for a predominantly native-English-speaking setting and transfer it to our own setting without making some adjustments. EAL teachers must make the implicit explicit and ensure the students are aware of the expectation to answer questions in English. In order to enable them to achieve this, we must make a conscious effort to teach the subject-specific vocabulary associated with maths and plan lessons and activities which allow children opportunities to practise these language skills.

Each school must set its own guidelines as to when answering all questions in English becomes non-negotiable. This will depend on a number of factors, including resources, allocation of teaching staff, intervention schemes and the level of collaboration between Primary and Secondary schools. However, schools must ensure these guidelines are adhered to, or risk children sitting high-stakes exams (iGCSEs and A-levels) without the language competencies to access the content or succeed.



Christodoulou, D., 2017, Making Good Progress?: The Future of Assessment for Learning, Oxford, Oxford University Press

Cronbach, L.J., and Meehl, P.E., 1955, Construct validity in psychological tests, Psychological Bulletin, Vol 52, 281-302

Crooks, T.J., Kane, M.T., and Cohen, A., 1996, Threats to the Valid Use of Assessments, Assessment in Education: Principles, Policies and Practice, Vol. 3, pp265-285

Department for Education, 2015, Final report of the Commission on Assessment without Levels, London, DfE

Education Endowment Foundation, 2017, Assessing and Monitoring Pupil Progress, London, Education Endowment Foundation. Available from:

https://educationendowmentfoundation.org.uk/tools/assessing-and-monitoring-pupil-progress/ [accessed 16th March 2018]

Goldstein, H., 2015, Validity, science and educational measurement, Assessment in Education: Principles, Policies and Practice, Vol. 22, No. 2, pp193-201

Harlen, W., 2006, On the relationship between Assessment for Formative and Summative Purposes, In: J. Gardner (Ed). Assessment and Learning, London, Sage Publications, pp61-81

NASUWT, 2015, Assessment without levels: Taking stock, Birmingham, NASUWT

Shepard, L.A., 1993, Evaluating Test Validity, In: L. Darling-Hammond (Ed). Review of Research in Education, Vol. 19, pp405-450

Shillingburg, W., 2016, Understanding Validity and Reliability in Classroom, School-Wide, or District-Wide Assessments to be used in Teacher/Principal Evaluations, Arizona, Arizona Department of Education

Stobart, G., 2006, The Validity of Formative Assessment, In: J. Gardner (Ed). Assessment and Learning, London, Sage Publications, pp133-146

Wiliam, D., 2000, The meaning and consequences of educational assessments, Critical Quarterly, Vol. 42, No. 1, pp105-127












