Unpacking data on student attitudes about math

Robert Talbert
12 min read · Jul 28, 2023


Note: I’m a college mathematics professor who usually writes about higher education issues. But I’m also branching out into learning data science and how data science tools and techniques can be used to improve higher education and the student learning experience, and I am writing about what I’m learning as I go. My dual hope here is to connect with other newbies and share my learning processes, as well as to get feedback from more experienced folks that I can use to improve. It’s well known that one of the fastest ways to get a correct answer to a question on the internet is to post an incorrect answer and then open up the comments. So fire away, and remember I’m new to this.

What are college students’ attitudes and perceptions about mathematics as a subject — and how do those attitudes and perceptions change after being in an innovative, applications-focused entry level course for a semester? Those questions were the focus of a study that I helped pilot back in the first part of 2023 and which my colleagues and I will be continuing this fall. While the pilot data are what they are — pilot data, and therefore not terribly conclusive — the tools and methods we used to gather and analyze data on student attitudes and perceptions about mathematics hold a lot of promise for insights as we move into the full study later.

Revamping the intro college math experience

A lot of college math students either dislike math, or they like it for questionable reasons such as “It was always easy for me” or “There’s only one right answer”. Their attitudes and perceptions of math don’t align well with those of expert/professional math people. This is true even of math and math-adjacent majors, and even if they succeed grade-wise in their coursework.

The typical front door into the college math experience — Calculus — often does no favors to students in this regard, since college Calculus courses tend to be focused only on getting the right answers to mechanical computations that are fully disconnected from legitimate real-world applications. This reinforces all of the negative perceptions and attitudes about math that students have, and that first impression in my experience tends to persist with extraordinary stubbornness.

Moreover, calculus isn’t even the best “front door” for college math any more, in my view and a lot of other people’s views. That designation now fits better on the subject of linear algebra, which is at the heart of almost all major applications in the real world today, from AI and machine learning to population dynamics and web searches and beyond. (And engineering too!) Moreover, the barrier to entry in linear algebra is much lower than calculus; you can get started doing very interesting things with linear algebra with only the basic skills from middle- and high school algebra, without needing to know any esoterica about logarithms, trigonometry, and so forth.

For all these reasons, my department recently rolled out an innovative redesign of our linear algebra course. We lengthened the course from one semester to a two-semester sequence and lowered the mathematical prerequisites from Calculus 2 (totally unnecessary for the content) to Algebra 2. Students can now take Linear Algebra 1–2 in their first year and then Calculus 1–2 later, or vice versa, or both sequences simultaneously. Additionally, the content was refocused on applications (like the ones I linked above) and the use of computer tools like SageMath to automate calculations, making it possible for students to work directly with large data sets (which they do). My colleague David Austin wrote a free online textbook to go along with the courses, which will give you more details about what students do.

Measuring the impact

We are huge fans of this newly-envisioned linear algebra course and we hope it catches on elsewhere. But does it work? We have a lot of anecdotal evidence that this fresh approach to college math is having a big impact on our students in many ways — but it would also be nice to have real data.

One specific question we wanted to answer was whether being in the first semester of the new linear algebra course had any measurable impact on how students view and do math as a whole. Does this class take students from a place of hating or fearing math, or liking math for facile reasons, to a place of seeing and appreciating the power and utility of the subject — that is, seeing math as it “really is”? And does it help them become more effective and persistent problem solvers overall?

This sounds hard to measure unless you’re willing to do some difficult and expensive qualitative research. Fortunately, there is a great tool freely available for gathering quantitative data on students’ attitudes and perceptions about math. It’s called, appropriately, the Mathematics Attitudes and Perceptions Survey (MAPS). It’s introduced and explained in this paper:

Code, W., Merchant, S., Maciejewski, W., Thomas, M., & Lo, J. (2016). The Mathematics Attitudes and Perceptions Survey: An instrument to assess expert-like views and dispositions among undergraduate mathematics students. International Journal of Mathematical Education in Science and Technology, 47(6), 917–937. https://doi.org/10.1080/0020739X.2015.1133854

The MAPS instrument is a 32-item survey consisting of statements about mathematics as a subject, such as “There is usually only one correct approach to solving a math problem” and “I enjoy solving math problems”. Each statement asks for a response on a 5-point Likert scale, “Strongly disagree” to “Strongly agree”. There is a filter question among these used to screen out respondents who aren’t reading the questions, so 31 content items in all.

The MAPS instrument was given to a number of expert mathematicians, whose responses were used to find questions where the professionals’ answers were typically on the “agree” side of the scale or on the “disagree” side. For example, professionals tended to disagree with the statement “There is usually only one correct approach to solving a math problem” but tended to agree with “I enjoy solving math problems”. It’s therefore possible to compute a score for each respondent that measures how closely their attitudes and perceptions align with those of professionals: If the respondent responds with “agree” or “strongly agree” on an item where the group of professionals tended to agree, award 1 point; otherwise award 0 points. Do likewise if the respondent disagrees or strongly disagrees with a statement where the professionals did the same.
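
To make that rule concrete, here’s a tiny toy sketch in R using the two example statements above (made-up responses, not actual MAPS data):

# Toy illustration of the scoring rule (made-up responses, not real data)
# Experts tend to DISAGREE with "only one correct approach" and AGREE with "I enjoy solving math problems"
resp_item1 <- "Strongly disagree"   # matches the experts' direction -> 1 point
resp_item2 <- "Neutral"             # doesn't match the experts' direction -> 0 points
score <- (resp_item1 %in% c("Disagree", "Strongly disagree")) +
  (resp_item2 %in% c("Agree", "Strongly agree"))
score   # 1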

Collecting and cleaning the data

We administered the MAPS instrument to students in three sections of the first semester of linear algebra, twice: Once as a pre-test in the seventh week of the semester, and again as a post-test during the last week of classes. Our goal was to see if there were any statistically significant changes in either the means in the item responses or in the overall scores from pre-test to post-test.

I won’t go into detail on the results of our data analysis or our findings here, since it was a pilot study (whose results shouldn’t be taken seriously). Here I mainly just want to talk about the tools and methods we used, along with some very high-level comments on what came out.

Students took the survey online in Qualtrics, and I saved the spreadsheet of results as a CSV for importing into RStudio. There were 47 respondents to the pre-test and 49 to the post-test. I think that’s about a 50% response rate, which we need to improve on in the main study.

First, we import the data as two separate data frames:

# Load packages
library(readr)
library(tidyverse)
library(dplyr)

# Load data source
maps_pretest_initial <- read.csv("MTH 204 MAPS pilot pretest.csv") |> tail(-4)
maps_posttest_initial <- read.csv("MTH 204 MAPS pilot posttest.csv") |> tail(-2)

The tail commands are there to remove unnecessary rows: Each spreadsheet had a header given to it by Qualtrics that I didn’t want, and the pre-test data had two extra rows where I was testing to make sure the survey worked.

Next came several cleaning steps. First, I needed to remove blank responses (which Qualtrics indicates with a “Finished” variable that’s either True or False). Then, I needed to remove students who failed the filter question; this was Question 19, which reads: “We use this statement to discard the survey of people who are not reading the questions. Please select Agree (not Strongly Agree) for this question.” (Very clever.) Then, I needed to select only the columns of the spreadsheet that had the actual responses in them (not demographics questions and so forth). Finally, Qualtrics literally records “Agree”, “Strongly Agree”, etc., and I wanted to change those into numbers so I could run stats on them.

# Clean up data 

maps_pretest <- maps_pretest_initial |>
  filter(maps_pretest_initial$Finished == "True"
         & maps_pretest_initial$Q19 == "Agree") |>
  select(Q1:Q35, -Q19) |>
  mutate_at(c(1:31), funs(recode(.,
                                 "Strongly disagree" = 1,
                                 "Disagree" = 2,
                                 "Neutral" = 3,
                                 "Agree" = 4,
                                 "Strongly Agree" = 5)))

(There’s a similar code block for the post-test data.)

The select command grabs Questions 1 through 35, which contain the actual responses to the questionnaire (leaving out some extra columns, like the date/time completed that Qualtrics sticks in). The questions after Q32 were demographic data, not part of MAPS. Question 19 is the are-you-paying-attention question, and having filtered on it in the previous line I didn’t need it anymore.

It took me a while to stumble across the last bit where I recoded the Likert answers. The mutate_at function comes from dplyr and changes up a selection of columns in a dataframe. The funs command is a bit of functional programming — it takes the recode command and maps it across columns 1 through 31 (which are where the MAPS responses live). The result is a dataframe that has just the questionnaire responses minus Question 19, with numerical responses for all the Likert questions.
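
One caveat if you’re copying this: funs() has been deprecated in dplyr for a while now (since version 0.8.0), so on newer versions of the tidyverse the block above will at least throw a warning. Here’s a sketch of the same recoding written with across() and a lambda instead; I’ve also dropped the explicit maps_pretest_initial$ prefixes inside filter, which dplyr doesn’t need:

# Equivalent cleaning step for newer dplyr versions (a sketch, same result)
maps_pretest <- maps_pretest_initial |>
  filter(Finished == "True", Q19 == "Agree") |>
  select(Q1:Q35, -Q19) |>
  mutate(across(1:31, ~ recode(.x,
                               "Strongly disagree" = 1,
                               "Disagree" = 2,
                               "Neutral" = 3,
                               "Agree" = 4,
                               "Strongly Agree" = 5)))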

Looking for differences

Now on to the analysis. First up, were there statistically significant differences between the pre-test and post-test means on the MAPS items? For example, was there a significant increase in the average on the “I enjoy solving math problems” item between the pre- and post-test? Because if so, that would be a good sign.

First, I made two vectors, one that holds the means from the pre-test and one for the post-test. And I thought it would be helpful to look at a bar chart of the differences.

# Exploratory analysis 
pretest_means <- maps_pretest |> select(Q1:Q32) |> colMeans(na.rm = T)
posttest_means <- maps_posttest |> select(Q1:Q32) |> colMeans(na.rm = T)
barplot(posttest_means - pretest_means)

Here’s one place where I’ll share specific results:

Bar chart of differences in means

This is “post minus pre” so a bar that points upwards indicates a higher average at the end of the course than in week 7, and one that points down means the average at the end was smaller than in week 7. This was really helpful to visualize which items would be most likely to have significant changes: Items 3, 10, 17, 25, and 32 really stand out.

To see if any of these differences actually were significant, I had to hunt around for the right test. I am not a statistician and I’m still not sure I picked the right one, but I used independent-samples t-tests for this. My reasoning was that I was comparing means between two groups, and the two groups weren’t necessarily made up of the same students. In fact, students never gave any identifying information, nor did we issue them ID numbers, so we have no idea who took the pre- versus the post-test, and matching responses for a paired test wasn’t possible.

The specific implementation in R I used is t.test, which by default runs the Welch two-sample t-test. Welch’s version doesn’t require the two groups to have equal variances or equal sample sizes, which fit my situation: I knew for a fact that the variances and sample sizes were different. It does still assume that the responses in each group are roughly normally distributed, although it’s fairly robust when they aren’t. I suspected the distributions of responses weren’t normal (I’d guess most items on MAPS either skew heavily in one direction or the other, or are bimodal), so I ran Shapiro-Wilk tests on each item’s responses to check:

## Check each item's responses for normality
## Not terribly efficient because you have to look through each result
pretest_shapiro_p_values <- maps_pretest |>
  select(-Q34, -Q35) |>
  lapply(shapiro.test)
posttest_shapiro_p_values <- maps_posttest |>
  select(-Q34, -Q35) |>
  lapply(shapiro.test)

Questions 34 and 35 are stripped out because they’re demographic questions about gender and ethnicity. As noted in the comment, this isn’t super efficient because each shapiro.test returns not only the p-value but a bunch of other data as well, so I had to go through and manually check each result from these two lines for p values under 0.05. (This was quicker than figuring out how to display just the p value in R.)
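
For what it’s worth, here’s one way to pull just the p-values out of those result lists (a small sketch using sapply on the same objects; the same thing works on the post-test list):

# Extract just the p-value from each Shapiro-Wilk result
pretest_norm_p <- sapply(pretest_shapiro_p_values, function(res) res$p.value)
pretest_norm_p[pretest_norm_p < 0.05]   # items where normality is rejected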

As expected, the Shapiro-Wilk tests rejected normality pretty much across the board. Strictly speaking that’s a caveat for the t-tests rather than a green light (a nonparametric alternative like the Mann-Whitney U test is arguably a better fit for Likert items), but Welch’s test is reasonably robust at these sample sizes, so I went ahead with it. I had a sense from the bar chart of which items would show differences, but this sealed it:

# Numeric MAPS responses only (Q19 was already dropped; demographics excluded)
maps_pre_numeric  <- maps_pretest  |> select(Q1:Q32)
maps_post_numeric <- maps_posttest |> select(Q1:Q32)
# Welch two-sample t-test on each item, pre vs. post
for (i in 1:ncol(maps_post_numeric)) {
  p <- t.test(maps_pre_numeric[[i]], maps_post_numeric[[i]])$p.value
  print(p)
}

Yay! A for loop. I do miss the Pythonic way of doing things sometimes. All I needed to do was scan for p < 0.05 and, if there were any instances of this, trace back to find the question where it occurred.
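
If you’d rather skip the manual scanning, here’s a sketch of the same comparison with the item names kept alongside the p-values, so the significant ones are easy to spot:

# Collect each item's Welch t-test p-value next to its name (a sketch)
item_p_values <- data.frame(
  item    = names(maps_pre_numeric),
  p_value = sapply(seq_along(maps_pre_numeric), function(i) {
    t.test(maps_pre_numeric[[i]], maps_post_numeric[[i]])$p.value
  })
)
subset(item_p_values, p_value < 0.05)   # items with significant pre/post changes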

Three items in the pilot data had significant differences from pre- to post-test. You can probably guess what they are from the bar chart. The rest of the differences didn’t rise to the level of statistical significance — which is sort of a discussion topic in itself. What questions arise, for example, by noticing that average responses to Question 15 (“Reasoning skills used to understand math can be helpful to me in my everyday life”) went virtually unchanged? Again, it’s just a pilot with 41 respondents, so don’t read too much into it — but you can bet that when we do the main study this fall, we’ll have our eyes on that.

Differences in the scores

Next we wanted to look at the respondent scores, which as described earlier are numbers between 0 and 31 measuring the degree to which a student’s responses align with those of the experts, with 31 being perfect alignment.

First I gathered the questions where experts tended to agree, and the ones where they tended to disagree (these are spelled out in the original paper):

disagree_questions <- c(1:5, 7:10, 14, 16:18, 21:24, 26:32)
agree_questions <- c(6, 11:13, 15, 20, 25)
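
As a quick sanity check, these two vectors together should account for all 31 content items:

# 24 "disagree" items + 7 "agree" items = 31 content items
length(disagree_questions) + length(agree_questions)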

Next, I redid a bunch of earlier work importing data — it just turned out to be easier to re-create the data frames and convert the verbal responses into bits (0 or 1) than take the existing 1–5 numerical responses and do the same.

maps_pretest_expert <- maps_pretest_initial |>
  filter(maps_pretest_initial$Finished == "True"
         & maps_pretest_initial$Q19 == "Agree") |>
  select(Q1:Q35) |>
  mutate_at(disagree_questions, funs(recode(.,
                                            "Strongly disagree" = 1,
                                            "Disagree" = 1,
                                            "Neutral" = 0,
                                            "Agree" = 0,
                                            "Strongly Agree" = 0))) |>
  mutate_at(agree_questions, funs(recode(.,
                                         "Strongly disagree" = 0,
                                         "Disagree" = 0,
                                         "Neutral" = 0,
                                         "Agree" = 1,
                                         "Strongly Agree" = 1))) |>
  select(-Q19)

And there’s a similar code chunk for the post-test data. The result is a dataframe full of 0’s and 1’s, one for each response from each student, indicating whether the student aligned with the experts on that item (1) or not (0). This way, to get a student’s score I just had to sum up the rows:

pretest_scores <- maps_pretest_expert |>
  select(Q1:Q32) |>
  rowSums(na.rm = TRUE)
posttest_scores <- maps_posttest_expert |>
  select(Q1:Q32) |>
  rowSums(na.rm = TRUE)

Now it’s this easy to tell if the differences were significant:

t.test(pretest_scores, posttest_scores)

In the case of the pilot study, the differences in the scores between pre- and post-test were not statistically significant. Again, you can’t say much about pilot data, but if these were to happen in a full-on study, interesting questions arise: Why didn’t the scores change? Should they change, or is this too much to expect from a single course?

Conclusion

I will say again: This was just a pilot study, a shakedown cruise if you will, to iron out issues with the workflow and analysis. It had its issues: The response rate wasn’t the greatest, the sample was on the small side, and we waited too long (until almost mid-semester) to give the “pre”-test. The plan is to run a full study, with all of these issues accounted for, this fall.

It took me a while to figure out much of this analysis and I am quite likely still not doing it all right, or as efficiently as possible. But it’s nice because I can just keep all the same code and run the same analysis by plugging in different spreadsheets. Had I been trying to do this in, say, Excel where the data analysis is happening in the same place as the data themselves, I’d be redoing a lot of work unnecessarily.

I also gained a new appreciation for the “pipeline” approach in R: the native |> operator, along with the tidyverse’s %>% that popularized the idea. You see my use of pipelines all over the place here, and in some ways it’s exactly what I was thinking in my head: Take this dataframe AND THEN do this AND THEN do this. It fits naturally with the way human beings think, and I wish other tools like Python made it this easy.

As I mentioned at the top, I welcome comments and corrections on this. It’s been an interesting learning process, and I am hopeful this fall we’ll learn about the results of our curriculum efforts. In the meanwhile: If you have a comment, or see a mistake or a way to make this better, let me know in the comments.

