Teaching Scientific Writing at Scale: Characterizing Student Writing in Undergraduate Biology

Dan Johnson
Teaching Professor
Wake Forest University

Scientific writing builds students’ thinking and communication skills, but teaching it in large classes is challenging. How can students’ writing development be evaluated at scale over time? Data science methods may provide novel insights into student writing across large cohorts that cannot be obtained by close reading of individual texts.

Guiding Questions:
We used a corpus of >4400 student reports from introductory biology labs to test whether fully machine-scorable features can be proxy metrics for students’ development as writers over time. Key questions were:
• What word- and sentence-level proxy measures are informative? What do they tell us about how student scientific writing changes over time?
• Can proxy measures summarize cohort-level changes in writing over a curriculum sequence?
• Can proxy measures predict human-assigned grades?

The corpus was divided into three cohorts based on relative writing experience: novice (first biology course, no prior experience), intermediate (one prior semester of writing experience), and advanced (two semesters of writing experience). Samples were independently sorted into four “score bins” based on overall GTA-assigned grade: Acceptable, Minor improvements needed, Major improvements needed, or Unacceptable/Flawed. The vocabulary range of each report was quantified as the number of unique words used and the type-token ratio. Word choices were classified using fixed vocabulary lists, and mean sentence complexity was quantified using readability indices. The ability of proxy metrics to predict human-assigned scores was tested using proportional odds ordinal logistic regression (POLR).
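The word-level and sentence-level measures described above can be illustrated with a minimal Python sketch. It is not the authors' code (their analyses used R scripts). The two vocabulary sets are hypothetical stand-ins for the study's fixed vocabulary lists, the function names are illustrative, and the Automated Readability Index appears only as one widely used readability formula, since the specific indices are not named here.

```python
import re

# Hypothetical stand-ins for the study's fixed vocabulary lists (assumption).
ACADEMIC_TERMS = {"hypothesis", "significant", "data", "analysis"}
SPECIALIZED_TERMS = {"enzyme", "photosynthesis", "mitosis"}


def tokenize(text):
    """Lowercase the text and return alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def lexical_metrics(text):
    """Count tokens, unique words (types), type-token ratio, and list hits."""
    tokens = tokenize(text)
    types = set(tokens)
    return {
        "tokens": len(tokens),
        "unique_words": len(types),
        "type_token_ratio": len(types) / len(tokens) if tokens else 0.0,
        # Number of distinct words that appear on each vocabulary list.
        "academic_types": len(types & ACADEMIC_TERMS),
        "specialized_types": len(types & SPECIALIZED_TERMS),
    }


def automated_readability_index(text):
    """ARI = 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43.

    Simplification: only alphabetic characters are counted, and sentences
    are split naively on '.', '!' and '?'.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = tokenize(text)
    if not sentences or not words:
        return 0.0
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


if __name__ == "__main__":
    sample = "The enzyme was active. The enzyme data support the hypothesis."
    print(lexical_metrics(sample))
    print(round(automated_readability_index(sample), 3))
```

Type-token ratio tends to fall as texts get longer, so a sketch like this would need some length control before reports of different sizes are compared.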

Multiple proxy metrics were significantly correlated with student development as writers over time but were not predictive of individual student grades. Total unique words used increased 25.1% (p<0.001) with more growth in academic and specialized terms (24.2-38.1%) than general terms (12.1-17.8%), suggesting students moved towards a more “formal” vocabulary. Lexical richness did not change significantly with experience, but word repetition declined 11.4-20.6%. Scores on 14/32 readability indices were significantly different over the course sequence (p<0.001) with medium relative association (ɸC > 0.2). Fit for single- and multi-factor POLR models was low overall (pseudo-R² = 0.187) with 59% average predictive error.
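For readers unfamiliar with POLR, the general form of a proportional odds model is sketched below. The symbols are assumptions introduced here for illustration, not notation from the study: Y is the ordered score bin (four levels, so three thresholds), x is the vector of proxy metrics for a report, θ_j are the threshold intercepts, and β is a single coefficient vector shared across all thresholds, which is the proportional odds assumption. The sign convention shown is one common choice and differs between software packages.

```latex
\[
  \operatorname{logit} P(Y \le j \mid \mathbf{x})
    = \theta_j - \boldsymbol{\beta}^{\top}\mathbf{x},
  \qquad j = 1, 2, 3
\]
```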

New materials produced include:
• The corpus of student reports plus metadata;
• An open-access Scientific Writing Resource Guide, bins-based scoring rubrics, and instructor training materials;
• An R Shiny web form for collecting validated, well-structured student reports;
• Structured vocabularies and R scripts used for analyses.

Broader Impacts:
Monitoring student writing systematically through “close reading” of individual samples is impractical in high-enrollment STEM courses. This study showed that student scientific writing can be informatively evaluated using proxy metrics (word counts, vocabulary density, readability indices, etc.) that describe changes in writing over a curriculum sequence. Proxy features are less subject to interpretation and harder to “game” because they reflect the entire document rather than a text sample. We found that proxy measures can also help writing instructors triangulate on more complex features of interest or value that they want students to develop over time.


A. Daniel Johnson, Sabrina D. Setaro, T. Ryan Price, Jerid C. Francom, Wake Forest University, Winston-Salem NC