DataShop Data can be confusing.

The PSLC DataShop is a repository of educational data. It serves as the hub for all data collected from the Open Learning Initiative (OLI) and could serve as a great tool for educational researchers, since it provides a vast variety of analytic tools.

However, for instructional designers, it is too complicated to use, in that it is not easy to draw insights in just a few clicks and discover what can be improved.

Courses need to be examined on both micro and macro scale.
Learning Curve Analysis and KC Model Analysis are the commonly used and powerful tools for analysing the course on a relatively micro level. They involve student transaction data and student step data. However, a macro level analysis is also needed for the instructor to see a broader picture of the course and gain insights from it.

It is also extremely difficult to garner insights about the way students progress through a course as DataShop data is not easy to parse without code.

Different student approaches to online courses are not well understood.
Depending on prior domain knowledge, past experience with online learning environments, and other factors, students progress through courses in a number of different ways. Analyzing data on students' movements can help us understand learner profiles in more detail.

Can we help instructional designers make data driven decisions by visualizing learning pathways through online courses?

We broke down our primary objective into the following questions:

  • What are the common pathways students take in a course?
  • How are learning pathways correlated with students' achievements?
  • How does student behavior (in terms of completion) affect performance?

Further, we tried to answer more specific questions under each visualization:

  • Do high-achieving students take similar pathways?
  • Are there common pathways between all students?
  • What are some features of students' learning pathways?
  • Which problems do many students tend to make errors on?
  • Which students performed badly?
  • For those students who performed badly, which parts of the course did they perform particularly badly on?

The Data

Course: Introduction to Python, Summer 2016 @ CMU
The data is from the Introduction to Principles of Computing with Python course, a blended learning course that has both online and offline components. This course is primarily for undergraduate students at CMU, and the particular data that we are using is from the summer term of 2016. A total of 55 students participated in this course.

Source: Open Learning Initiative
The log data of the online course is auto-generated by the OLI platform and stored in DataShop. The data includes over 10,000 steps within 269 problems across 20 modules.

Interaction data for this course is in the standard DataShop/OLI format with all student-problem, student-step and student-transaction level interactions being recorded. Student performance data (beyond the pre and post tests) was not available directly on OLI but we were able to get it from the professor directly.

To further clean the data and improve its quality, we first eliminated all the columns in the student-problem dataset that were not needed. After that, we joined the two datasets on anonymous student id. Looking further into the data, we found a module on academic integrity and ethics with problems that are not relevant to the actual learning, so we also dropped those rows. Finally, we dropped all rows that contain null values.

Data Quality

Completeness: Is the data complete?

The data is complete in the sense that it meets our needs for answering the questions: it has columns for problem view, latency, first step, problem id, hint level, KC labels, and performance. However, within individual student interactions there are some missing values caused by OLI errors or student input errors. Because OLI only measures the student's interaction time, it is also difficult to gauge students' engagement with the instructional materials.

During exploratory data analysis, we also discovered that only 28 of the 55 students in the course had completed the post-test, so we decided to bring in other performance data to measure learning, such as the final exam, which was completed by almost all students. A large number of students did not complete either the pre-test or the post-test.

Accountability: Who has access to it?

The course data from OLI is stored on DataShop as "private", so it can only be accessed after seeking permission from the owner, Norman Bier.
The student-performance data is confidential; only the instructor and the administrators of the dataset have access to it. We had to ask for special permission to access this data, and it was anonymized before we received it.

Coherence: Is the data coherent?

The data was not very coherent at first, but it has been cleaned by our team members. First, we eliminated all the columns in the student-problem dataset that were not needed. We then read the student-problem dataset and the performance data into pandas DataFrames. After that, we joined the two datasets on anonymous student id. Looking further into the data, we found a module on academic integrity and ethics with problems that are not relevant to the actual learning, so we also dropped those rows. Finally, we dropped all rows that contain null values.

This processing has made the data cleaner and more coherent.
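
The cleaning steps described above can be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual code: the column names, the "Unused Column" field, the toy rows and the rule for spotting the ethics module (a case-insensitive match on the word "ethics") are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical column names; a real DataShop export uses its own headers.
ID = "Anon Student Id"
KEEP = [ID, "Problem Hierarchy", "Problem Name", "Problem View", "Incorrects", "Hints"]


def clean(problems: pd.DataFrame, scores: pd.DataFrame) -> pd.DataFrame:
    """Apply the four cleaning steps to a student-problem table."""
    # 1. Keep only the columns needed for the analysis.
    slim = problems[KEEP]
    # 2. Join with the performance data on the anonymous student id.
    merged = slim.merge(scores, on=ID, how="inner")
    # 3. Drop rows that belong to the academic integrity and ethics module
    #    (assumed here to be identifiable by the word "ethics").
    is_ethics = merged["Problem Hierarchy"].str.contains("ethics", case=False, na=False)
    merged = merged[~is_ethics]
    # 4. Drop rows that still contain null values.
    return merged.dropna().reset_index(drop=True)


# Made-up rows for illustration only.
problems = pd.DataFrame({
    ID: ["s1", "s1", "s2", "s2"],
    "Problem Hierarchy": ["For Loops", "Ethics", "For Loops", "Lists"],
    "Problem Name": ["p1", "e1", "p1", "p7"],
    "Problem View": [1, 1, 2, None],
    "Incorrects": [0, 0, 3, 1],
    "Hints": [0, 0, 1, 0],
    "Unused Column": ["x", "x", "x", "x"],
})
scores = pd.DataFrame({ID: ["s1", "s2"], "Total": [91.0, 72.0]})

cleaned = clean(problems, scores)
print(cleaned)
```

In the toy data, the ethics row and the row with a missing problem view are dropped, leaving one For Loops row per student with the final score attached.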

Correctness: Is the data correct?

The data is correct because all transactions and student behavior, including each step students take, how much time they take on each problem, and their hint-asking behavior, are recorded automatically by DataShop, minimizing the possibility of human error. By comparing OLI scores and completion rates with performance data, we were able to further check the validity of the data.

We verified this correctness by plotting a number of graphs and validating discrete statistics such as averages and ranges of scores, etc. A detailed discussion of this processing is available in the Exploratory Analysis section below.

Exploratory Analysis

Data Pipeline

Descriptive Statistics

Total number of problems attempted by students: 10880
Modules in course: 20 (including Ethics Module)
Number of problems: 269 (37 problems are from Ethics module)
Number of students: 55

Initial Graphing

There is a strong correlation (m = 0.4912) between completion rate (percentage of questions completed on OLI) and total score (final grade). This simple analysis validates the effectiveness of the course and highlights a number of students who have high scores but low completion rates. Further analysis showed that these students had high pre-test scores, which makes sense: a high pre-test score implies prior knowledge, which can explain a high score without a high completion rate.
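
As a rough illustration of this kind of check, the sketch below computes each student's completion rate and fits a least-squares line of final score against it. It assumes that m denotes the slope of a fitted line, which the text does not state outright; the student ids, grades and course size are made up for the example.

```python
from collections import defaultdict

TOTAL_PROBLEMS = 4  # hypothetical course size for the toy data below

# (student id, problem id) pairs, one per problem the student completed.
completed = [
    ("s1", "p1"), ("s1", "p2"), ("s1", "p3"), ("s1", "p4"),
    ("s2", "p1"), ("s2", "p2"),
    ("s3", "p1"), ("s3", "p1"),  # a repeated problem counts once
]
final_scores = {"s1": 90.0, "s2": 80.0, "s3": 70.0}  # made-up grades


def completion_rates(rows, total_problems):
    """Percentage of distinct problems each student completed."""
    seen = defaultdict(set)
    for student, problem in rows:
        seen[student].add(problem)
    return {s: 100.0 * len(p) / total_problems for s, p in seen.items()}


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


rates = completion_rates(completed, TOTAL_PROBLEMS)
students = sorted(rates)
slope, intercept = fit_line([rates[s] for s in students],
                            [final_scores[s] for s in students])
print(rates)
print(f"Total = {slope:.4f} * CompletionRate + {intercept:.2f}")
```

A positive slope means that students who complete more of the online material tend to finish with higher totals; students far above the line at low completion rates are the ones worth checking against pre-test scores.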

Bar Charts

We created a bar chart to analyze which modules were more commonly completed and to see if there was a variation in the number of students per module as we hypothesized:

Measuring Transfer

Finally, we plotted the OLI completion rate and performance against performance on the written assessment to verify if there was transfer between the two. We saw a very strong correlation between the performance on OLI against the performance on written assessments, with the few outliers again being students who had performed well on the pretest and did not have high completion rate due to this.

Data Analysis

Sankey Diagrams

How to Use:
The nodes in the diagram represent the different modules of the course. The links represent the students who move from one module to another. The solid links show forward movement, and the dashed links show students going back to a module after completing it the first time. The size of each node or link represents the number of students who take that path.

You can interact with the nodes and move them around to see links more clearly. Hovering over links will reveal their counts in tooltips and hovering over nodes will highlight all source links to that node. Clicking on a node will allow you to highlight all target links that go out from that node. Link highlights can be reset by clicking any node.
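
To make the link definition concrete, the following sketch derives link counts from per-student module sequences. It is only one possible reading of the diagram: the module order, the visit sequences, and the rule that a move to an earlier module in that order counts as a backward (dashed) link are assumptions for illustration, not the project's actual pipeline.

```python
from collections import Counter

# Hypothetical recommended module order; a real course defines its own.
MODULE_ORDER = ["Introduction", "For Loops", "Lists", "Making Decisions"]
RANK = {name: i for i, name in enumerate(MODULE_ORDER)}

# Made-up visit sequences, one chronological list of modules per student.
visits = {
    "s1": ["Introduction", "For Loops", "Lists", "Making Decisions", "Lists"],
    "s2": ["Introduction", "For Loops", "For Loops", "Lists"],
    "s3": ["Lists", "Making Decisions", "Lists"],
}


def sankey_links(sequences):
    """Count students per (source, target) move, split by direction."""
    forward, backward = Counter(), Counter()
    for modules in sequences.values():
        moves = set()  # each student contributes at most once per link
        for src, dst in zip(modules, modules[1:]):
            if src != dst:  # staying in the same module is not a link
                moves.add((src, dst))
        for src, dst in moves:
            bucket = forward if RANK[dst] > RANK[src] else backward
            bucket[(src, dst)] += 1
    return forward, backward


forward, backward = sankey_links(visits)
for (src, dst), count in sorted(forward.items()):
    print(f"solid  {src} -> {dst}: {count}")
for (src, dst), count in sorted(backward.items()):
    print(f"dashed {src} -> {dst}: {count}")
```

The resulting (source, target, count) triples are the kind of input a Sankey layout needs, with the link width proportional to the count.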

D3.js Sankey Diagram to visualize flow of students through Summer 2016 course at a macro level:


Following the general pattern, there is a clear group of students who follow the recommended path, as shown by the thick link. At the top left of the graph it is also easy to see a number of advanced students who skip the introductory lessons on loops and lists. Nodes that are small and pushed to the side, like Concurrency, Common Errors and Logic, are also key areas for intervention, as not enough students are attempting these modules.

Another interesting insight from this visualization is that Making Decisions is linked to several earlier modules and, being a complicated module, sends students back to other modules to recap, as shown by the large number of backward links. The recommended intervention in this case would be to examine the contents of the Making Decisions module to decide whether it requires more scaffolding or whether recap notes from other modules should be inserted directly into it.

We recreated the Sankey diagram for 2 other terms of the same course to verify how they have evolved.

D3.js Sankey Diagram to visualize flow of students through Spring 2016 course at a macro level:


In this version of the Sankey flow, we can see that there are far fewer modules and transactions in the spring than in the summer. However, we can see two clear splits in the paths and many back links into Iteration, meaning that module probably needs to be improved to promote far learning.
Student attrition is also very clearly visible in this Sankey.

D3.js Sankey Diagram to visualize flow of students through Fall 2016 course at a macro level:


The fall course was much larger than the other two terms (more than 4 times the number of students), hence the abundance of links. Once again we immediately see the large link of students returning to the Making Decisions module, highlighting that it is an area for improvement even across terms.
As in the summer, there is one clear path that most students follow, an early group that skips the introduction slides, and clear attrition.

We also observed that many students skipped the post-test and the mid-course modules like Recursion, Data Representation and Encryption, which show up as very small nodes.

As before, instructional designers should consider scaffolding the course so that novice learners and advanced learners can both take advantage of the online class.

Individual Student Pathways

Tableau visualization for each student's path at a micro level.

How to Use:
X axis: Week of problem start time
Y axis: Student ID and total score
Size: Problem view (the longer the bar, the more times the student viewed the problems)
Color: Each color represents a different problem hierarchy.
Interactivity: You can mouse over the graphic to see its details.

Questions addressed by this visualization:
- Do high achieving students take similar pathways?
- Are there common pathways between all students?
- What are some features of students learning pathways?

Because our data is relatively complex and has several dimensions, this graphic is not self-explanatory. With a little explanation, however, it is much more interpretable than a table of numbers.


1) Low-achieving students started struggling very early, on the first course module, For Loops (the third block on the graph, because the first two blocks are the introduction and the pretest). The brown blocks are much longer for low-achieving students, which means they visited the For Loops module many times. This early struggle may indicate that they will not do as well as others, so educators can take it as a sign and give timely help to students who are using more time than their peers.

2) High-achieving students (judging from the total final performance) tend to do more modules. In the upper part of the visual there are more colorful blocks, which means these students covered the modules more completely. The time they take is also more balanced. The pathways of low-achieving students, by contrast, contain more white space, which means they missed a lot of modules; this can lead to a drop in their final performance. Their time use is also less balanced than that of high-achieving students: there tend to be some really long blocks, which could mean that they are struggling.

3) However, there are also some anomalies, such as the one low-achieving student who scored 70 but did almost all modules. This could be an indicator of wheel-spinning.

Module Level Error Analysis

Tableau visualization to identify modules which are problem areas.

How to Use:
X axis: Total Student ID + Student final score
Y axis: Problem hierarchy
Size: Completion rate (the higher the completion rate, the bigger the squares)
Color: The greener it is, the lower the error rate, the redder it is, the higher the error rate.
Interactivity: You can mouse over the graphic to see its details and click on different elements to select them.
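
The two encoded measures can be computed per student and module along the following lines. This is a sketch under stated assumptions: error rate is taken to be incorrect attempts divided by total attempts, and completion rate is the share of a module's problems the student attempted. Neither definition is confirmed by the text, and the rows and module sizes are invented.

```python
from collections import defaultdict

# Hypothetical number of problems per module for the toy data below.
MODULE_SIZES = {"Lists": 4, "Encryption": 2}

# Made-up rows: (student id, module, problem id, incorrect attempts, total attempts)
rows = [
    ("AE", "Lists", "p1", 3, 4),
    ("AE", "Lists", "p2", 1, 2),
    ("AE", "Encryption", "p9", 2, 2),
    ("BK", "Lists", "p1", 0, 1),
]


def module_summary(rows, module_sizes):
    """Completion rate and error rate for every (student, module) cell."""
    problems = defaultdict(set)
    incorrect = defaultdict(int)
    attempts = defaultdict(int)
    for student, module, problem, wrong, total in rows:
        key = (student, module)
        problems[key].add(problem)
        incorrect[key] += wrong
        attempts[key] += total
    return {
        key: {
            # Assumed definition: distinct problems attempted / problems in module.
            "completion_rate": len(problems[key]) / module_sizes[key[1]],
            # Assumed definition: incorrect attempts / all attempts.
            "error_rate": incorrect[key] / attempts[key],
        }
        for key in problems
    }


summary = module_summary(rows, MODULE_SIZES)
for (student, module), cell in sorted(summary.items()):
    print(student, module, cell)
```

Each (student, module) cell then maps onto one square in the chart: completion rate drives the size and error rate drives the red-to-green color.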

Questions addressed by this visualization:
- Which problems do students tend to make errors on frequently?
- Which students performed badly?
- For those students who performed badly, which part of the course do they perform particularly badly on?

Evaluation and Insights:

This is not a transparent visualization, and probably also not too self-explanatory. But it has its advantage in that:

1) It is showing four variables at the same time:
     a) Student id
     b) normalized error
     c) final performance
     d) problem name.
Therefore it is a very comprehensive graphic which has a lot of information in it.

2) It visualizes everybody's performance in a very conspicuous way by using red and green. The darker the green, the better the student performed; the darker the red, the worse the student performed. This matches the common intuition that red signals warning and risk while green signals passing and safety. From the graphic we can therefore easily see which students are not performing well and which particular modules they are struggling with.

3) One example to understand how to use this visualization would be student with id AE. The large size of squares for this student shows the completion rate is high. However, we can see from the axis that the total score for this student is poor (around 72% which is well below class average).
Upon examining the visualization we can see that this is due to the fact that the student has a very high error rate and is attempting questions incorrectly at a high frequency. We can then use the visualization to pinpoint the areas that the student needs to improve in such as Time-Efficiency, Encryption and Lists.

4) Another way to use this visualization is to examine a particular module across students. For example: The Time-efficiency module is clearly very troublesome for low scoring students while high scoring students tend to not make errors on practice problems for this module. On the other hand the Encryption module seems to be hard for everyone and would require some re-evaluation to understand why.

Correlation - Evaluation of Course

Here, we create an evaluative visualization to verify that the course is working in the intended way. This visualization and the related statistical analysis show that students' completion rate is strongly positively correlated with their final performance, and that completion rate is more telling than time spent on problems, which was a surprising insight!

How to Use Chart 1: Completion Rate
X axis: Average completion rate
Y axis: Average Total (students' final performance)
Size: Time spent on problems
Color: The more orange it is the less time students spend, the more blue it is, the more time students spend.
Interactivity: You can mouse over the graphic to see its details.

How to Use Chart 2: Time Spent
X axis: Time spent on problems
Y axis: Average Total (students' final performance)
Size: Time spent on problems
Color: The more orange it is the less time students spend, the more blue it is, the more time students spend.
Interactivity: You can mouse over the graphic to see its details such as line equation.

These are observational studies that are conducted in an ex-post facto manner as the data is already available on DataShop.

Hypotheses to be tested:

Null Hypothesis 1: There is no statistically significant relationship between completion rate of problems and students’ final performance.
Independent Variable: Completion Rate (percentage of course materials completed)
Dependent Variable: Total Score (Final performance)

Null Hypothesis 2: There is no statistically significant relationship between students’ time spent and students’ final performance.
Independent Variable: Time spent (Number of problems * avg. time spent per problem)
Dependent Variable: Total Score (Final performance)

Statistical Evidence:
For hypothesis 1:
Equation: Avg(Total) = 0.142344 * Completion-Rate + 76.8668
Correlation Value: 0.142344
P value: 0.0029142 **
Evaluation: The correlation value becomes more prominent when some outliers are removed, as can be explored interactively on the graph above. The key reason behind the outliers is the high pre-test score some students have. Because of their extensive prior knowledge, they are not engaged in the course but still get a high total score.
Conclusion: Reject null hypothesis. Completion rate of problems is positively correlated with students’ final performance.

For hypothesis 2:
Equation: Avg(Total) = 0.00833 * Time-Spent + 84.4179
Correlation Value: 0.000833
P value: 0.782187
Evaluation: The time spent on questions is not a good indicator of performance in any way. This is interesting as one would expect time spent to be directly correlated with completion rate, which is not the case at all.
Conclusion: Fail to reject the null hypothesis. The data show no statistically significant relationship between students' time spent and their final performance.
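
For readers who want to run a comparable check on their own export, here is a hedged sketch of a significance test for each hypothesis. It swaps in a permutation test on the Pearson correlation, which is not necessarily how the p values reported above were computed, and all numbers in it are invented for illustration.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(xs, ys, trials=5000, seed=0):
    """Two-sided p value: share of shuffles with |r| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # break any real pairing between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (trials + 1)


# Made-up per-student values, not the course data.
completion = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # percent completed
time_spent = [30, 5, 22, 9, 28, 12, 3, 25, 16, 10]      # problems * avg. time
totals = [60, 64, 63, 70, 72, 75, 74, 80, 85, 88]       # final performance

p_completion = permutation_p_value(completion, totals)
p_time = permutation_p_value(time_spent, totals)
print(f"completion rate vs total: r={pearson_r(completion, totals):.3f}, p={p_completion:.4f}")
print(f"time spent vs total:      r={pearson_r(time_spent, totals):.3f}, p={p_time:.4f}")
```

A small p value is grounds to reject the corresponding null hypothesis. A large one only means the data give no evidence of a relationship, not that the absence of one has been shown.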


Observations and Insights

Completion rate is a better indicator than time spent for predicting a student's final performance in the course, as supported by the correlation value (slope) and the p value.

Low-achieving students tend to start struggling very early, on relatively easy modules such as "for loop" in this course.

In this course while most people did badly in the pre-test, the majority of them did better in their post-test.

Generalization Potential

These visualizations have been designed to be as easy as possible to create for someone who is not a data science practitioner. The aim of this part of the project is to create a simple set of steps that any instructional designer can use to create these visualizations from any DataShop dataset.

For the first visualization (Sankey diagram), the data needs to be cleaned using the data pipeline mentioned above. We plan to open source the code used in this project to clean the data into the required format (which uses Python and pandas). Currently the links in the Sankey diagram are weighted by student count alone, but student performance, time spent, and other metrics could also be used to weight the links, depending on the domain.

For the other visualizations, Tableau is a ready-to-use format, and our visualization workbooks are public and can be accessed by anyone. For the visualizations to work in a plug-and-play manner, DataShop data needs to be exported (at the Student-Problem level) and then combined with performance data. While we combined a number of performance parameters, both offline and online, in our analysis, the visualizations will still work if this combination is done simply by joining student final scores (out of 100) with their student IDs on the downloaded DataShop file.

This generalization potential can be seen above where we re-created the Sankey diagrams for two other semesters of the Introduction to Programming course, and each of these visualizations was created without writing any code, in less than 30 minutes.

Limitations and Future Work

This analysis covers only one course in a few specific time periods, with a limited number of students. The conclusions about learning pathways would benefit from a larger amount of data to yield clearer and more solid patterns.

It is hard to judge the generalizability of our conclusion three because it is largely specific to this course.

The Sankey diagram is capable of displaying a macro picture, but in its current form it is difficult to generalize intuition from the diagram.

For online courses with hundreds of students or more, such as MOOCs, further data aggregation is needed to avoid excessive complexity in the Sankey diagram.

In the future, it may be possible to create these visualizations for different courses across different domains to see whether common patterns in learning pathways exist. Depending on further analysis and user studies with instructional designers, the Sankey diagram can be modified to include color and other features to communicate more information at a glance.

We also plan to move these visualizations into the LearnSphere environment to allow direct integration with all DataShop datasets.