Dave Treder, Assessment Consultant at Genesee ISD, put this together to help people understand the large changes schools and districts experienced in their writing results this year. A few points on the “extreme” between-year swings noticed by districts and schools on the MEAP writing test:
The change in the percent of students passing across years is much larger for writing than for the other subjects. The average change in a school’s percent proficient from 2002 to 2004 is 36% larger in writing than in math, 54% larger than in reading, and 66% larger than in both science and social studies. (See graph on page 3.)
The year-to-year swings on the grade 5 MEAP writing test are extreme. The 5th grade MEAP writing test had characteristics similar to the current 4th grade MEAP writing test. The average school change between 2003 and 2004 on the 4th grade writing test was 12% (factoring out the “+” or “–” in the State’s change). The across-year changes on the 5th grade MEAP writing test were 11%, 12%, 7%, and 5%. The graph on page 4 shows the extent of the extreme swings.
Results from raters indicate difficulty in differentiating between “2’s” and “3’s” – which is approximately where the cut score is located. The graph on page 5 indicates that the scores raters assign most often are 2s and 3s. Further, more than 25% of the time (1,163 out of 4,302 papers), raters disagree on whether a paper deserves a 2 or a 3. And this is the point where the pass-fail decision is made: two 3’s will award a student a passing grade, while a 2 and a 3 will not.
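The pass-fail rule described above can be sketched as follows. The summed-score cut (two ratings totaling at least 6) is an assumption consistent with the 3-and-3 versus 2-and-3 example, not a documented MEAP rule; the disagreement figures are the ones given above (1,163 of 4,302 papers):

```python
# Sketch of the pass/fail decision: each paper receives two ratings,
# and two 3's pass while a 2 and a 3 do not.  The summed-score rule
# (total >= 6) is an assumption consistent with that example.

def passes(rating1, rating2, cut=6):
    return rating1 + rating2 >= cut

print(passes(3, 3))  # True: two 3's pass
print(passes(2, 3))  # False: a 2 and a 3 fall below the cut

# Share of papers where the two raters split between a 2 and a 3
# (figures from the memo: 1,163 of 4,302 papers) -- about 27%.
print(round(1163 / 4302, 3))
```

The arithmetic makes the problem concrete: the score range where raters disagree most often is exactly the one-point gap that separates passing from failing.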
The use of a “cut score” exaggerates the differences between schools, and it also heightens the changes in schools’ scores across years. As noted on the graph on page 6, the vast majority of districts received about 40–50% of the possible writing points (a 10 percentage point range), while the percent passing for these same districts ranged from 30% to 66% (a 36 percentage point range).
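The exaggerating effect of a cut score can be illustrated with two hypothetical schools: identical average percent correct, yet very different percent passing, because one school’s students cluster just above the cut and the other’s just below. The student scores and the cut of 6 out of 12 points are illustrative assumptions, not MEAP figures:

```python
# How dichotomizing at a cut score magnifies small differences.
# Scores are out of 12 possible points; all data hypothetical.

def pct_passing(scores, cut=6):
    """Percent of students at or above the cut score."""
    return 100 * sum(s >= cut for s in scores) / len(scores)

def pct_correct(scores, max_points=12):
    """Average percent of possible points earned."""
    return 100 * sum(scores) / (max_points * len(scores))

school_a = [5, 5, 5, 6, 6, 6, 6, 6, 7, 8]   # most students just clear the cut
school_b = [4, 5, 5, 5, 5, 5, 6, 7, 8, 10]  # most students just miss it

print(round(pct_correct(school_a), 1), pct_passing(school_a))  # 50.0 70.0
print(round(pct_correct(school_b), 1), pct_passing(school_b))  # 50.0 40.0
```

Both schools earn exactly 50% of the possible points, yet their “percent passing” differs by 30 points, the same pattern as the 10-point versus 36-point ranges noted above.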
The MEAP writing assessment consists of a single prompt, which is not sufficient for determining a student’s writing ability or a school’s ability to teach writing effectively. Students’ knowledge of and interest in a topic will vary from prompt to prompt. Providing only one prompt leaves much to chance as to whether the score a student receives actually represents the student’s true ability. As noted by Richard Shavelson, Dean of the School of Education at Stanford University, “You probably need between six and ten tasks to get a reliable measure of performance.” (Quoted in Education Week, May 17, 2000.)
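Shavelson’s six-to-ten-task estimate is consistent with the standard Spearman-Brown prophecy formula, which projects the reliability of a lengthened assessment from the reliability of a single task. The single-prompt reliability of 0.45 below is an illustrative assumption, not a published MEAP figure:

```python
# Spearman-Brown prophecy formula: expected reliability when an
# assessment is lengthened from one task to k comparable tasks.

def spearman_brown(r_single, k):
    """Projected reliability of k parallel tasks, given one task's reliability."""
    return k * r_single / (1 + (k - 1) * r_single)

r1 = 0.45  # assumed reliability of a single writing prompt (illustrative)
for k in (1, 2, 6, 10):
    print(k, round(spearman_brown(r1, k), 2))
```

Under this assumption, one prompt yields a reliability of .45, while six prompts project to about .83 and ten to about .89, in line with the six-to-ten-task range Shavelson describes.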
While the raters are appropriately qualified and trained, and the scoring process is well validated and properly monitored, inter-rater issues and the scoring process will invariably add unreliability to individual and school-level scores. The company responsible for hiring and training raters and conducting the scoring does 1) employ educators who are college graduates; 2) conduct rigorous, well-validated training; and 3) periodically check raters to ensure a continued connection to the rubric.
However, without getting into it too deeply, the method utilized by the MEAP office to report inter-rater agreement – the “percent of ratings within one” – is not considered an acceptable practice. A Kappa statistic or an intraclass correlation coefficient would be more appropriate. By both of these measures, the results (.43 and .81, respectively, for 2004 4th grade writing) would be considered marginally reliable at best.* And the inter-rater agreement issue is exacerbated by the issue discussed above: the point where raters show the most difficulty in differentiating between papers – between a score of 2 and a 3 – is the very point where pass and fail are differentiated.
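For readers unfamiliar with the statistic, a Kappa coefficient can be computed directly from the two raters’ confusion matrix: it is the observed agreement corrected for the agreement expected by chance. The rating counts below are hypothetical; the .43 reported above comes from the actual 2004 grade 4 writing ratings:

```python
# Cohen's kappa for two raters: agreement corrected for chance.
# The confusion matrix below is hypothetical, not MEAP data.

def cohens_kappa(matrix):
    """matrix[i][j] = number of papers rater 1 scored i and rater 2 scored j."""
    n = sum(sum(row) for row in matrix)
    # Observed agreement: share of papers on the diagonal.
    p_obs = sum(matrix[i][i] for i in range(len(matrix))) / n
    # Chance agreement from the marginal totals of each rater.
    row_tot = [sum(row) for row in matrix]
    col_tot = [sum(matrix[i][j] for i in range(len(matrix)))
               for j in range(len(matrix))]
    p_chance = sum(r * c for r, c in zip(row_tot, col_tot)) / n ** 2
    return (p_obs - p_chance) / (1 - p_chance)

# Hypothetical counts for ratings 1-4 (rows: rater 1, columns: rater 2).
# Most mass sits in the 2/3 cells, mirroring the disagreement pattern above.
table = [
    [ 50,  30,   5,   0],
    [ 30, 700, 500,  20],
    [  5, 500, 800,  60],
    [  0,  20,  60, 100],
]
print(round(cohens_kappa(table), 2))
```

Note how heavy off-diagonal traffic between the 2 and 3 cells drags Kappa well below the raw percent-agreement figure, which is precisely why “percent of ratings within one” paints too rosy a picture.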
Further, all the papers for a teacher group are rated by the same raters, to make it easier to identify “irregular administration practices.” It is not possible to compute the additional “error” that this adds to the scores because of differences between raters – but it seems relatively certain that the error is larger than it would be if the papers were randomly assigned to raters.
* The apparently disparate facts – that 1) raters are sufficiently skilled and trained, and 2) inter-rater agreement is only marginal – result, I would surmise, from the fine distinctions that raters are required to make (see the table on page 7, which delineates the differences raters must detect between a ‘2’ and a ‘3’ paper: differentiating between vocabulary that is limited vs. basic, and writing that is only occasionally clear and focused vs. somewhat clear and focused).
[Graph, page 3: Average School Change, 2003 to 2004, with State Change Factored Out (all Michigan schools with 10 or more students taking the test); percent change by subject.]
[Graph, page 5: Distribution of Ratings Received by Students (GISD), Gr. 4 MEAP Writing, 2004; raters’ scores by scale score (440–620), with the “cut” score marked.]
[Graph, page 4: Grade 5 Writing, Percent “Proficient,” 1998–2002 (every 10th school in Genesee ISD); percent proficient by school “number” (1–13).]
[Graph, page 6: District Mean Percent Correct and Percent “Passing,” Gr. 4 MEAP Writing, 2004; percent “passing” the MEAP writing test vs. average percent correct (out of 12 possible points).]
MEAP Writing Rubric: Differentiating Between a ‘2’ and a ‘3’ Paper (Table, page 7)

What classifies a paper as a ‘2’:
The writing is only occasionally clear and focused. Ideas and content are underdeveloped. There may be little evidence of organizational structure. Vocabulary may be limited. Limited control over writing conventions may make the writing difficult to understand.

What classifies a paper as a ‘3’:
The writing is somewhat clear and focused. Ideas and content are developed with limited or partially successful use of examples and details. There may be evidence of an organizational structure, but it may be artificial or ineffective. Incomplete mastery over writing conventions and language use may interfere with meaning some of the time. Vocabulary may be basic.
Highlighting the concepts/skills that are evaluated, and the degree to which they need to be evident/accomplished:

- Clarity and focus: “only occasionally clear and focused” (2) vs. “somewhat clear and focused” (3)
- Ideas and content: “underdeveloped” (2) vs. “developed with limited or partially successful use of examples and details” (3)
- Organization: “little evidence of organizational structure” (2) vs. “evidence of an organizational structure, but it may be artificial or ineffective” (3)
- Writing conventions: “limited control … may make the writing difficult to understand” (2) vs. “incomplete mastery … may interfere with meaning some of the time” (3)
- Vocabulary: “limited” (2) vs. “basic” (3)