IDS Final Project

Data Quality

Completeness: Is the data complete?

The data is complete in the sense that it contains the columns we need to answer our questions: problem view, latency, first step, problem id, hint level, KC labels, and performance. However, within individual student interactions, some values are missing because of OLI errors or student input errors. Because OLI only measures interaction time with the student, it is also difficult to gauge students' engagement with the instructional materials.

During exploratory data analysis, we also discovered that only 28 of the 55 students in the course had completed the post-test, so we decided to bring in other performance data to measure learning, such as the final exam, which was completed by almost all students. A large number of students have not completed either the pre-test or the post-test to date.

aCcountability: Who has access to it?

The course data on OLI is on DataShop and stored as “private”, so it can only be accessed after seeking permission from the owner, Norman Bier.
The student-performance data is confidential; only the instructor and the administrators of the dataset have access to it. We had to ask for special permission to access this data, and it was anonymized before we received it.

Coherence: Is the data coherent?

The data was not very coherent at first, but it has been cleaned by our team members. First, we eliminated all the columns in the student-problem dataset that were not needed. We then read the student-problem dataset and the performance data into pandas DataFrames and joined the two datasets on the anonymous student id. Looking further into the data, we found a module on academic integrity and ethics whose problems are not relevant to the actual learning, so we dropped those rows as well. Finally, we dropped all rows that contained null values.

This processing has made the data cleaner and more coherent.
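
The cleaning steps described above can be sketched in pandas roughly as follows. This is only an illustration: the column names (`anon_student_id`, `problem_hierarchy`, `latency_sec`, `final_score`), the module label and the toy rows are assumptions made up for the example, not the actual DataShop export or our real data.

```python
import pandas as pd

# Toy stand-ins for the two inputs. All column names and values are hypothetical.
problems = pd.DataFrame({
    "anon_student_id": ["s1", "s1", "s2", "s3"],
    "problem_hierarchy": ["For Loops", "Academic Integrity", "For Loops", "Lists"],
    "problem_view": [1, 1, 2, 1],
    "latency_sec": [12.0, 5.0, None, 30.0],
    "unused_column": ["x", "y", "z", "w"],
})
performance = pd.DataFrame({
    "anon_student_id": ["s1", "s2", "s3"],
    "final_score": [88, 72, 95],
})


def clean(problems, performance, keep_cols, drop_modules):
    """Apply the four cleaning steps in the order described in the text."""
    # 1. Keep only the columns that are needed.
    trimmed = problems[keep_cols]
    # 2. Join the student-problem rows with the performance data.
    merged = trimmed.merge(performance, on="anon_student_id", how="inner")
    # 3. Drop rows from modules that are not relevant to the actual learning.
    relevant = merged[~merged["problem_hierarchy"].isin(drop_modules)]
    # 4. Drop every row that still contains a null value.
    return relevant.dropna().reset_index(drop=True)


cleaned = clean(
    problems,
    performance,
    keep_cols=["anon_student_id", "problem_hierarchy", "problem_view", "latency_sec"],
    drop_modules=["Academic Integrity"],
)
print(cleaned)
```

Dropping every row with a null value is the simplest option, but it also discards the other valid fields in those rows, so it is worth checking how many rows are lost at that step.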

Correctness: Is the data correct?

The data is correct because all transactions and student behaviors, including each step students take, how much time they take on each problem, and their hint-asking behavior, are recorded automatically by DataShop, minimizing the possibility of human error. By comparing OLI scores and completion rates with the performance data, we were further able to check the validity of the data.

We verified this correctness by plotting a number of graphs and checking descriptive statistics such as the averages and ranges of scores. A detailed discussion of this processing is available in the Exploratory Analysis section below.

Data Analysis

Sankey Diagrams

How to Use:
The nodes in the diagram represent the different modules of the course. The links represent the number of students who move from one module to another. The solid links show forward movement, and the dashed links show students going back to a module after completing it the first time. The size of each node and link represents the number of students who take that path.

You can interact with the nodes and move them around to see links more clearly. Hovering over links will reveal their counts in tooltips and hovering over nodes will highlight all source links to that node. Clicking on a node will allow you to highlight all target links that go out from that node. Link highlights can be reset by clicking any node.
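
As a rough sketch of how link counts like these could be derived, the snippet below counts, for each pair of modules, how many students made that move, and separates forward links from back links. The function name `sankey_links`, the rule that a back link is any move to a module the student has already visited, and the toy paths are all assumptions for illustration; this is not the code behind the diagrams.

```python
from collections import Counter


def sankey_links(paths):
    """Count students per module-to-module link, split into forward and back links.

    paths maps a student id to the list of modules in the order visited.
    A move counts as a back link when its target was already visited by
    that student (an assumed definition); otherwise it is a forward link.
    Each student contributes at most once to any given link.
    """
    forward, backward = Counter(), Counter()
    for visits in paths.values():
        seen = set()
        student_forward, student_backward = set(), set()
        for source, target in zip(visits, visits[1:]):
            seen.add(source)
            if source == target:
                continue  # staying in the same module is not a link
            if target in seen:
                student_backward.add((source, target))
            else:
                student_forward.add((source, target))
        forward.update(student_forward)
        backward.update(student_backward)
    return forward, backward


# Made-up paths for three hypothetical students.
paths = {
    "s1": ["Intro", "Loops", "Lists", "Decisions", "Loops"],
    "s2": ["Intro", "Loops", "Lists", "Decisions"],
    "s3": ["Lists", "Decisions", "Lists"],
}
forward, backward = sankey_links(paths)
print(forward[("Lists", "Decisions")])   # students moving forward on this link
print(backward[("Decisions", "Loops")])  # students going back on this link
```

The two counters map directly onto the solid and dashed links, and each count can be used as the link width.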

D3.js Sankey Diagram to visualize flow of students through Summer 2016 course at a macro level:

Insights:
Following the general pattern, there is a clear group of students who follow the recommended path, as shown by the thick link. At the top left of the graph, it is also easy to see a number of advanced students who skip the introductory lessons on loops and lists. Nodes that are small and pushed to the side, such as concurrency, common errors and logic, are also key areas for intervention because not enough students are attempting these modules.

Another interesting insight from this visualization is that the making decisions module is linked to several earlier modules: because it is a complicated module, students go back to other modules to recap, as shown by the large number of backward links. The recommended intervention in this case would be to examine the contents of the making decisions module to decide whether it requires more scaffolding or whether recap notes from other modules should be inserted directly into it.

We recreated the Sankey diagram for 2 other terms of the same course to verify how they have evolved.

D3.js Sankey Diagram to visualize flow of students through Spring 2016 course at a macro level:

Insights:
In this version of the Sankey flow, we can see that there are far fewer modules and transactions in the spring than in the summer. However, we can see two clear splits in the paths, and there are many back links into Iteration, meaning it probably needs to be improved to promote far learning.
Student attrition is also very clearly visible in this Sankey.

D3.js Sankey Diagram to visualize flow of students through Fall 2016 course at a macro level:

Insights:
The fall course was much larger than the other two terms (more than 4 times the number of students), hence the abundance of links. Once again we immediately see the large link of students returning to the making decisions module, highlighting that it is an area for improvement even across terms.
As we saw in the summer, there is one clear path that most students follow, an early group that skips the introduction slides, and clear attrition.

We also observed that many students skipped the post-test and the mid-course modules such as Recursion, Data Representation and Encryption, which show up as very small nodes.

As before, instructional designers should consider scaffolding the course so that novice learners and advanced learners can both take advantage of the online class.

Individual Student Pathways

Tableau visualization for each student's path at a micro level.

How to Use:
X axis: Week of problem start time
Y axis: Student ID and total score
Size: Problem view (the longer the bar, the more the student viewed the problem)
Color: Each color represents a different problem hierarchy.
Interactivity: You can mouse over the graphic to see its details.

Questions addressed by this visualization:
- Do high achieving students take similar pathways?
- Are there common pathways between all students?
- What are some features of students' learning pathways?

Evaluation:
Since our data is relatively complex and has several dimensions, this graphic is still not self-explanatory. But with a little explanation, it should be much more interpretable than a table of numbers.

Insights:

1) Low-achieving students started struggling very early, on their first course module, For Loops. (This is the third block on the graph, because the first two blocks are the introduction and the pretest.) The brown block is much longer for low-achieving students, which means they visited the For Loops module many times. This early struggle may indicate that they will not do as well as others, so educators can take it as a sign to give timely help to students who are using more time than their peers.

2) High-achieving students (judging by their total final performance) tend to do more modules. In the upper part of the visual there are more colorful blocks, which means these students visited the modules more completely. The time they take is also more balanced. In contrast, the pathways of low-achieving students contain more white space, which means they missed a lot of modules. This can lead to a drop in their final performance. The time use of low-achieving students is also less balanced than that of high-achieving ones: there tend to be some really long blocks, which could mean that they are struggling.

3) However, there are also some anomalies, such as the one low-achieving student who scored 70 but did almost all modules. This could be an indicator of wheel-spinning.

Module Level Error Analysis

Tableau visualization to identify modules which are problem areas.

How to Use:
X axis: Total Student ID + Student final score
Y axis: Problem hierarchy
Size: Completion rate (the higher the completion rate, the bigger the squares)
Color: The greener it is, the lower the error rate, the redder it is, the higher the error rate.
Interactivity: You can mouse over the graphic to see its details and click on different elements to select them.

Questions addressed by this visualization:
- Which problems do students tend to make errors on frequently?
- Which students performed badly?
- For those students who performed badly, which part of the course do they perform particularly badly on?

Evaluation and Insights:

This visualization is not transparent, and probably not fully self-explanatory either, but it has several advantages:

1) It is showing four variables at the same time:
     a) Student id
     b) normalized error
     c) final performance
     d) problem name.
It is therefore a comprehensive graphic that carries a lot of information.

2) It shows how everyone is performing in a very conspicuous way, by using red and green. The darker the green, the better the student performed; the darker the red, the worse the student performed. This matches the common intuition that red signals warning and risk while green signals passing and safety. From the graphic, we can therefore easily see which students are not performing well and which particular modules they are struggling with.

3) One example of how to use this visualization is the student with id AE. The large squares for this student show that the completion rate is high. However, we can see from the axis that the total score for this student is poor (around 72%, which is well below the class average).
Examining the visualization shows that this is because the student has a very high error rate and frequently answers questions incorrectly. We can then use the visualization to pinpoint the areas in which the student needs to improve, such as Time-Efficiency, Encryption and Lists.

4) Another way to use this visualization is to examine a particular module across students. For example, the Time-Efficiency module is clearly very troublesome for low-scoring students, while high-scoring students tend not to make errors on the practice problems for this module. On the other hand, the Encryption module seems to be hard for everyone and would require some re-evaluation to understand why.
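
To make the two encoded quantities concrete, here is a minimal sketch of how a completion rate (square size) and an error rate (square color) could be computed for each student and module. The definitions used here (completion rate as the share of a module's problems the student attempted, error rate as incorrect attempts over all attempts), the record layout and the values are assumptions for illustration, not the exact measures behind the Tableau workbook.

```python
from collections import defaultdict


def module_grid(records, problems_per_module):
    """Compute completion and error rates for every (student, module) cell.

    records: tuples of (student, module, problem, n_incorrect, n_attempts).
    problems_per_module: total number of problems in each module.
    Both rate definitions are assumptions made for this sketch.
    """
    attempted = defaultdict(set)
    incorrect = defaultdict(int)
    total = defaultdict(int)
    for student, module, problem, n_incorrect, n_attempts in records:
        key = (student, module)
        attempted[key].add(problem)
        incorrect[key] += n_incorrect
        total[key] += n_attempts

    grid = {}
    for key, problems in attempted.items():
        module = key[1]
        grid[key] = {
            "completion_rate": len(problems) / problems_per_module[module],
            "error_rate": incorrect[key] / total[key] if total[key] else 0.0,
        }
    return grid


# Made-up records for two hypothetical students.
records = [
    ("S1", "Encryption", "p1", 3, 4),
    ("S1", "Encryption", "p2", 1, 2),
    ("S2", "Encryption", "p1", 0, 1),
]
grid = module_grid(records, problems_per_module={"Encryption": 2})
for cell, rates in grid.items():
    print(cell, rates)
```

A cell with a high completion rate and a high error rate corresponds to a large red square, the pattern discussed in point 3.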

Correlation - Evaluation of Course

Here, we create an evaluative visualization to verify that the course is working in the intended way. This visualization and the related statistical analysis show that students' completion rate is strongly positively correlated with their final performance, and that completion rate is more telling than time spent on problems, which was a surprising insight!

How to Use Chart 1: Completion Rate
X axis: Average completion rate
Y axis: Average Total (students' final performance)
Size: Time spent on problems
Color: The more orange it is the less time students spend, the more blue it is, the more time students spend.
Interactivity: You can mouse over the graphic to see its details.

How to Use Chart 2: Time Spent
X axis: Time spent on problems
Y axis: Average Total (students' final performance)
Size: Time spent on problems
Color: The more orange it is the less time students spend, the more blue it is, the more time students spend.
Interactivity: You can mouse over the graphic to see its details such as line equation.

These are observational studies conducted ex post facto, as the data is already available on DataShop.

Hypotheses to be tested:

Null Hypothesis 1: There is no statistically significant relationship between completion rate of problems and students’ final performance.
Independent Variable: Completion Rate (percentage of course materials completed)
Dependent Variable: Total Score (Final performance)

Null Hypothesis 2: There is no statistically significant relationship between students’ time spent and students’ final performance.
Independent Variable: Time spent (Number of problems * avg. time spent per problem)
Dependent Variable: Total Score (Final performance)
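
As a small worked example of the two independent variables, the sketch below derives both from a list of per-problem times for one student. The function name `student_features`, the use of seconds, and the reading of completion rate as attempted problems over total problems are assumptions for illustration.

```python
def student_features(problem_times, total_problems):
    """Derive the two independent variables for one student.

    problem_times: time spent on each problem the student attempted
                   (seconds are assumed here).
    total_problems: number of problems in the course.
    """
    n_problems = len(problem_times)
    avg_time = sum(problem_times) / n_problems if n_problems else 0.0
    return {
        # Percentage of course problems the student attempted.
        "completion_rate": 100.0 * n_problems / total_problems,
        # Number of problems * avg. time spent per problem.
        "time_spent": n_problems * avg_time,
    }


# A hypothetical student who attempted 3 of 4 problems.
features = student_features([30.0, 60.0, 90.0], total_problems=4)
print(features)
```

Note that the number of problems multiplied by the average time per problem is simply the total time across the problems the student attempted.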

Statistical Evidence:
For hypothesis 1:
Equation: Avg(Total) = 0.142344 * Completion-Rate + 76.8668
Correlation Value: 0.142344
P value: 0.0029142 **
Evaluation: The correlation value is more prominent when some outliers are removed, as can be explored interactively on the graph above. The key reason behind the outliers is the high pre-test score that some students have. Because of their extensive prior knowledge, they are not engaged in the course but still get a high total score.
Conclusion: Reject the null hypothesis. Completion rate of problems is positively correlated with students’ final performance.

For hypothesis 2:
Equation: Avg(Total) = 0.00833 * Time-Spent + 84.4179
Correlation Value: 0.000833
P value: 0.782187
Evaluation: The time spent on questions is not a good indicator of performance. This is interesting, as one would expect time spent to be directly correlated with completion rate, which is not the case at all.
Conclusion: Fail to reject the null hypothesis. We found no statistically significant relationship between students’ time spent and students’ final performance.
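
For readers who want to run this kind of test themselves, the following is a minimal sketch that fits a least-squares line, computes the Pearson correlation, and estimates a two-sided p-value. The p-value here comes from a permutation test, which is an assumption made for this sketch and not necessarily the method that produced the p-values reported above. The data points are invented and are not our course data.

```python
import random


def fit_line(xs, ys):
    """Return (slope, intercept, pearson_r) of the least-squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


def permutation_p_value(xs, ys, trials=2000, seed=0):
    """Two-sided permutation p-value for the null hypothesis of no relationship.

    Shuffles ys repeatedly and counts how often the shuffled |r| is at
    least as large as the observed |r|.
    """
    rng = random.Random(seed)
    observed = abs(fit_line(xs, ys)[2])
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(fit_line(xs, shuffled)[2]) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)


# Invented data: completion rate (%) and total score for seven hypothetical students.
completion = [20, 40, 50, 60, 80, 90, 100]
total = [70, 74, 79, 80, 86, 90, 93]

slope, intercept, r = fit_line(completion, total)
p_value = permutation_p_value(completion, total)
print(f"Total = {slope:.4f} * CompletionRate + {intercept:.4f}, r = {r:.3f}, p = {p_value:.4f}")
```

A small p-value leads to rejecting the null hypothesis; a large one means the data give no evidence of a relationship, which is different from showing that no relationship exists.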

Conclusion

Observations, Insights, Limitations and Future Work

Observations and Insights

Completion rate is a better indicator than time spent for predicting a student’s final performance in the course, as supported by the correlation value (slope) and p-value.

Low-achieving students tend to start struggling very early, on some relatively easy modules such as “for loop” in this course.

In this course, while most people did badly in the pre-test, the majority of them did better in the post-test.

Generalization Potential

These visualizations have been designed to be as easy as possible to create for someone who is not a data science practitioner. The aim of this part of the project is to create a simple set of steps that any instructional designer can use to create these visualizations from any DataShop dataset.

For the first visualization (the Sankey diagram), the data needs to be cleaned using the data pipeline mentioned above. We plan to open-source the code used in this project to clean the data into the required format (it uses Python and pandas). Currently, the links in the Sankey diagram are weighted by student count alone, but student performance, time spent, and other metrics could also be used to weight the links, depending on the domain.

For the other visualizations, Tableau is a ready-to-use format, and our visualization workbooks are public and can be accessed by anyone. For the visualizations to work in a plug-and-play manner, DataShop data needs to be exported (at the Student-Problem level) and then combined with performance data. While we combined a number of performance parameters, both offline and online, in our analysis, the visualizations will still work if this combination is done simply by joining students' final scores (out of 100) with their student ids in the downloaded DataShop file.

This generalization potential can be seen above where we re-created the Sankey diagrams for two other semesters of the Introduction to Programming course, and each of these visualizations was created without writing any code, in less than 30 minutes.

Limitations and Future Work

This analysis covers only one course over a few specific terms, with a limited number of students. The conclusions about learning pathways would benefit from a larger amount of data to yield clearer and more robust patterns.

It is hard to judge the generalizability of our third conclusion because it is largely specific to this course.

The Sankey diagram is capable of displaying a macro picture, but in its current form it is difficult to generalize intuition from the diagram.

For online courses with hundreds of students or more, such as MOOCs, further data aggregation is needed to keep the Sankey diagram from becoming too complex.

In the future, it may be possible to create these visualizations for different courses across different domains to see whether common patterns exist in learning pathways. Depending on further analysis and user studies with instructional designers, the Sankey diagram can be modified to include color and other features to communicate more information at a glance.

We also plan to move these visualizations into the LearnSphere environment to allow direct integration with all DataShop datasets.

IDS - Fall '18

Visualizing student learning pathways to enable data-driven instructional redesign.

Motivation

Can we help instructional designers make data driven decisions by visualizing learning pathways through online courses?
