Will I successfully develop a cheating-detection system for my employer?
Resolved YES on Apr 30

I am a teacher. I recently offered to develop a cheating-detection system for my employer. To my surprise, they took me up on it. I have a pitch and demo tentatively scheduled for next week.

Immediately resolves as YES if I learn that my employer has authorized, approved, or assented to an action which is (a) intended to address cheating and (b) directly informed by the outputs of a system I developed. For example, if a teacher uses the results of the system (provided to them by my employer) to design a seating plan for test takers, this would resolve YES, regardless of whether any cases of cheating are ever decidedly confirmed. Does not immediately resolve YES if I am told that I can do whatever I want with my classes' data on my own time but that the company has no interest in the system.

Immediately resolves as NO if I communicate to my employer that I am unwilling or unable to develop such a system or if my employer communicates to me that they are no longer interested in such a system developed by me.

Some reasons to think I might succeed:

  • The bar is relatively low. I don't actually need to catch any cheaters, I just need to develop a method for systematically identifying students who are more likely than normal to have cheated in the past, which is sometimes blatantly obvious.

  • I have access to all of the data I would need to conduct such an analysis, and lots of it.

  • My employer seems enthusiastic about the project, and provided that I can develop a halfway decent prototype, I think it's likely that it sees some use, even if only as a test run.

Update: I have demo'd a prototype for my supervisors. They were impressed, but they are not authorized to give me all of the data required to perform a school-wide analysis; I will need to make do with only my own classes' answer data (I still have the overall answer frequency data that I need to perform the analysis; I just can't analyze students other than my own).

Some reasons to think I will fail:

  • I'm a teacher, not a software developer. I'm merely a programming hobbyist and I have never been financially compensated for any programming project I've worked on (nor do I necessarily expect to be compensated for the development component of this project; this is just to give you an idea of my level of experience).

  • If there were a significant incentive for my employer to implement the kind of system I'm proposing, they presumably would have attempted to do so by now.

  • I do not have a working prototype, and I do not even have a proof of concept. It's entirely possible that when I finally get everything running, I will discover that my proposed methodology is entirely unactionable in practice (e.g. it has abysmal specificity and cannot be deployed due to the high risk of falsely flagging honest test takers as likely cheaters).

Update: I now have a working prototype, and it works well.

I will only buy YES shares, and I will never sell them.


I just heard back from my supervisor, and we have some confirmed cases of students flagged by the system having been monitored on a subsequent test and caught engaging in prohibited practices (communicating via messaging apps, wearing a concealed Bluetooth headphone, etc.). As per the stated resolution criteria, I have resolved this YES.

My direct supervisor and I are planning to test the system on the next test two weeks from now. If that proceeds, I will resolve this as YES, as per the explicit example in the resolution criteria. It could still resolve NO if my supervisor reneges.

Fun development: I have my first confirmed case! During the last test, one of my former students came up flagged, so I just asked them if they cheated, and they confessed and confirmed that the identified collaborator was indeed the actual collaborator. There were around 3,000,000 student-pairs and somewhere around 150 flags, so there was only about a 0.00005 probability of that happening by chance, if I'm conceptualizing it properly.
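In case anyone wants to check that figure, the arithmetic is just the flag rate over all pairs; a quick sketch with the approximate numbers above:

```python
# Rough sanity check of the "by chance" figure above: with ~150 flags spread
# over ~3,000,000 student-pairs, the chance that one specific pair (the former
# student plus their identified collaborator) gets flagged at random is
# simply flags / pairs.
pairs = 3_000_000
flags = 150
print(flags / pairs)  # 5e-05, i.e. about 0.00005
```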

Everything is kind of in limbo right now. I have no idea if any higher-up is actively considering the proposal and looking into its feasibility, or if they just heard the pitch from my supervisor and now will never think about it ever again.

It's unfortunate, too, because I've been consistently making improvements and I'm very confident that the system works very well. If nothing ever comes of this, I hope to do something else with it, though I'm not sure exactly what. I might consider reaching out to my alma mater to see if they have any interest in it.

Or perhaps there's some large online testing platform that might be interested in it? It's somewhat counterintuitive to me that such a platform would not already have something like this in place, but I've done some Googling and can't find mention of any such system.

I heard back from my supervisor. He said that the higher-ups like the idea, but he didn't sound particularly confident that they would be willing to deal with the red tape required to bring me onto the team in some sort of role specific to cheating detection.

I'm still going to do my best to maneuver towards a YES resolution, but I'm less optimistic that it will be by way of me getting a fun new role.

My supervisor is bringing the proposal to the higher-ups today.

With regards to resolution criteria: if the company officially brings me onto the team in a role specifically to develop/maintain the system, I will probably regard that as a YES resolution, since their doing so would be intended to address cheating and would be influenced by my employer regarding the system's outputs as valuable.

If anyone regards this as objectionable, feel free to state your case here.

Proposal submitted!

Finally got to test the system on a full dataset, and the results are very encouraging. For example, the two most suspected student-pairs are:

  • Two students who registered for the program at exactly the same time

  • Two students who share the same (rare) surname

Additionally, the next most suspected student is implicated in a sizeable “cheating ring” that includes a student that I know to have participated in such a cheating ring.

The next step is to curate the results so that I can present them to the higher-ups and make a case for bringing me onto the team in some capacity related specifically to this endeavour.

I'm scheduled to meet with my supervisor tomorrow to apply the system to a full test's worth of student data. That will be the first concrete indication of whether or not the system is sufficiently powered to be of potential interest to my employer.

I have successfully applied the system to a population of 12 students (my own students). Given such a small sample size (including for the purpose of generating the global response frequency data), obviously no real conclusions can be drawn about its power, but at the very least, everything seems to be working properly. With any luck, I'm hoping that my supervisors will oversee the application of the system to a larger population (1500+) sometime soon, so that we can get a better idea of how well-powered it will be.

Currently just waiting for my supervisor to get back to me with some data I need to run my first real analysis.

In the meantime, I've been brainstorming a variety of improvements to the system. Notably:

- Currently, the system is unable to discern the direction of cheating within a student-pair (i.e. whether student A was copying from student B or vice versa); it simply outputs the likelihood that there is a cheater amongst the pair. However, I've realized that I can differentiate students based on their level, which should not only improve accuracy generally, but also let me distinguish between P(A copied from B) and P(B copied from A). It should also be possible to estimate the likelihood of back-and-forth collaboration (sometimes A copies from B and sometimes B copies from A). A rough sketch of this idea follows after this list.

- I've begun designing another program which can take a confirmed case of cheating and reverse engineer the key student variables that go into the model, which should improve the accuracy of the model over time.
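To make the direction-of-copying idea a bit more concrete, here is a deliberately simplified sketch. It is not my actual code: the way "level" enters the likelihood and the p_copy parameter are illustrative assumptions only.

```python
# Illustrative only: compare the two copying directions for one student-pair.
# A student's "level" (0-1) stands in for how likely they are to answer a
# question correctly on their own; p_copy is the assumed chance that a copier
# reproduces the target's answer on any given question.

def p_own_answer(is_correct: bool, level: float) -> float:
    """Crude P(student produces this answer independently)."""
    return level if is_correct else (1.0 - level)

def likelihood_a_copied_b(answers_a, answers_b, level_a, p_copy=0.5):
    """P(A's observed answers | A copied from B), up to a constant.

    answers_a / answers_b: lists of (chosen_option, is_correct) per question.
    """
    like = 1.0
    for (opt_a, corr_a), (opt_b, _corr_b) in zip(answers_a, answers_b):
        p_independent = p_own_answer(corr_a, level_a)
        if opt_a == opt_b:
            like *= p_copy + (1 - p_copy) * p_independent
        else:
            like *= (1 - p_copy) * p_independent
    return like

# Comparing likelihood_a_copied_b(a, b, level_a) against
# likelihood_a_copied_b(b, a, level_b) gives a rough sense of which
# direction (if either) the evidence favours.
```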

@NBAP I know I shorted you, but I like your dedication to this project. Here are some thoughts.

Is your system analyzing multiple choice assessments? Numeric answers? Open ended text? I've listed these in increasing order of "how much training data you're going to need".

Second, it's concerning to me that you only have a single confirmed case of cheating to train your model with. I'm not sure what statistical models you're working with, but if it's anything similar to what I've worked with, ideally you'd need hundreds, if not thousands of data points in order to get reliable data.

If the school isn't willing or able to provide better training data, I suggest you create some data at least. If you have kids, give them an already-completed test and ask them to copy it as if they're trying to not get caught copying. Or split your class into groups, give each group a different worksheet/topic of research, and get each student to write a report or something. Because each group is supposed to have collaborated within its own members, this is serviceable training data for correlated answers on an assessment.

If this cheating assessment system is for in-person classroom assessments, consider taking seating plan into account, and check for student proximity when determining likelihood of cheating.

Also, just in case you haven't already, ensure you have a training dataset and a completely separate test dataset to prevent overfitting. Also, there already exist many plagiarism checkers online, so you need to see where you're adding value.

Or you could use a different, or even a multilayered, approach. Use a student's past scores, including classroom attendance, regular assignment grades and homework grades, to predict a student's assessment grades. If a student consistently scores highly in assessments while doing poorly in assignments, you can flag the student for closer investigation by seating plan or plagiarism checker or so on. A linear model is good for this, BUT you'll need at LEAST fifty known cheating cases to train this model. Data cleaning is going to be a bitch. If you have mock tests and actual assessments to compare with, it'll be easier, but still difficult.
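To sketch what I mean by the residual-flagging part (toy numbers only, and a single combined coursework signal instead of separate attendance/assignment/homework columns, just to keep it tiny):

```python
# Toy sketch: predict assessment scores from a coursework signal with a
# least-squares linear model, then flag students who beat their prediction
# by a wide margin. All numbers here are invented for illustration.
import numpy as np

# One coursework signal per student (e.g. mean assignment/homework grade)
coursework = np.array([0.87, 0.55, 0.78, 0.70, 0.55, 0.43])
assessment = np.array([0.82, 0.48, 0.74, 0.66, 0.47, 0.81])  # last one looks "too good"

X = np.column_stack([np.ones_like(coursework), coursework])  # intercept + slope
coef, *_ = np.linalg.lstsq(X, assessment, rcond=None)

residuals = assessment - X @ coef
flagged = np.where(residuals > 0.15)[0]   # arbitrary margin
print(flagged)  # with these toy numbers, student 5 gets flagged for a closer look
```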

Note that there are different forms of cheating, which need different data to detect. Copying from peers is revealed by similarity checks and seating plans. Illegal reference materials are detected by surprise assessments. Electronic cheating like phones, headsets, etc. is countered by mobile phone jammers/EMPs. (Kidding.) Make sure you know what you're facing here, and make sure your detection method is suitable for the cheating meta at your school.

Also, be aware that for every 1 cheater caught, 9 escape. Just because they aren't flagged in the system doesn't mean it's 100% safe to use all students in your "innocent student" training dataset. Sometimes the outliers really are outliers. Sometimes there's a better explanation.

This is a huge project to take up in your spare time and I respect the effort. Good luck with your work. And if you somehow happen to be using a completely different approach to what I've laid out above, I'd be VERY interested to discuss further and see what exactly you have in mind, and brainstorm from there.

@pym "Is your system analyzing multiple choice assessments? Numeric answers? Open ended text? I've listed these in increasing order of "how much training data you're going to need"."

Multiple choice and fill in the blank questions only.

"Second, it's concerning to me that you only have a single confirmed case of cheating to train your model with. I'm not sure what statistical models you're working with, but if it's anything similar to what I've worked with, ideally you'd need hundreds, if not thousands of data points in order to get reliable data."

The system hasn't been used in practice with actual student data yet. When I mentioned a confirmed case of cheating above, it was in reference to a future program. With my current methodology, I am not training a model, but rather using Bayesian inference.

"If the school isn't willing or able to provide better training data, I suggest you create some data at least."

I've been using synthetic data to test the system. The main limitation of this is that I don't know the exact values behind my assumptions (like how often a cheating student will copy from their target), but I can generate data using a variety of assumptions (from realistic to unrealistic, in both directions), and see how the methodology fares. If it turns out that students are only copying each other's answers 5% of the time, my method almost certainly won't be sufficiently powered to be useful in practice, but I would regard that as a good problem to have.
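For illustration, here's a stripped-down version of the kind of generator I'm describing. Everything here is simplified: real students obviously don't answer uniformly at random, and the parameter names are just placeholders.

```python
# Simplified synthetic-data generator (illustrative only): every student
# answers a multiple-choice test independently, except that each planted
# "cheater" copies a designated target's answer with probability copy_rate.
import random

def synthesize_test(n_students=1500, n_questions=30, n_options=4,
                    cheater_rate=0.05, copy_rate=0.5, seed=0):
    rng = random.Random(seed)
    # Baseline: independent (here, uniformly random) answers per student.
    answers = [[rng.randrange(n_options) for _ in range(n_questions)]
               for _ in range(n_students)]
    planted_pairs = []
    for s in range(n_students):
        if rng.random() < cheater_rate:
            target = rng.randrange(n_students)
            if target == s:
                continue
            for q in range(n_questions):
                if rng.random() < copy_rate:
                    answers[s][q] = answers[target][q]
            planted_pairs.append((s, target))
    return answers, planted_pairs

answers, planted = synthesize_test()
print(len(planted), "cheating pairs planted")
```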

"If this cheating assessment system is for in-person classroom assessments, consider taking seating plan into account, and check for student proximity when determining likelihood of cheating."

Yeah, the idea would be to implement a seating plan in future tests that separates suspected collaborators and has their screens facing the teacher.

@NBAP Interesting, I think I understand the situation a little better now. Bayesian inference is a good approach and you sound like you know what you're doing there, so I'll direct my attention a bit more laterally.

Since it's a digital test, I wonder if it might be more practical to use more mundane methods of cheat prevention, like randomizing question order, or varying the given values (for a math/physics test). Or having questions with multi-step parts that require the student to demonstrate reasoning.

Also, assuming that questions aren't currently randomized, consider checking 'time at which Student X answered Question Y' and 'time taken for Student X to answer Question Y'. Your dataset almost certainly has this data, or otherwise it will be somehow calculable, and it will be much easier to discern the direction of any copying. If a student answered a question with the same answer, but answered it second, and/or took less time, there's a higher chance they're cheating. If one student consistently answers a few seconds after another student, it's another huge flag. I suspect this sort of checking might yield better results than checking for similarity of answers.
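A rough sketch of the kind of lag check I mean, assuming the platform logs per-question timestamps (the event format here is invented):

```python
# Toy sketch: count, for each ordered pair of students, how often the second
# student submits the same answer to a question shortly after the first.
# events: list of (student, question, answer, timestamp_seconds) - hypothetical format.
from collections import defaultdict

def follow_counts(events, window_seconds=30):
    by_question = defaultdict(list)
    for student, question, answer, t in events:
        by_question[question].append((t, student, answer))

    counts = defaultdict(int)  # (earlier_student, later_student) -> matching answers
    for rows in by_question.values():
        rows.sort()
        for i, (t_i, s_i, a_i) in enumerate(rows):
            for t_j, s_j, a_j in rows[i + 1:]:
                if t_j - t_i > window_seconds:
                    break
                if s_j != s_i and a_j == a_i:
                    counts[(s_i, s_j)] += 1
    return counts

# Pairs with an unusually high count (relative to chance agreement) are the
# ones worth a closer look.
```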

Also, what's an average test score? If students are regularly scoring 40-60% on a thirty-question test, then studying correlated answers will probably work very well. If they're scoring 90-100% regularly, then... probably not.

Are you at all planning to incorporate historical data on a per-student basis to project how well a student was expected to do on a particular test? I was also daydreaming about assigning each student an Elo score based on how they performed on tests, and monitoring for any sudden jumps. Though that's probably not the best way to project student performance. Would be very entertaining to use this to watch students grow from behind the scenes though, and it has implications for pedagogy beyond anti-cheat measures too.
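For what it's worth, the daydream version is only a few lines; the starting ratings, K-factor, and the idea of treating each question as an "opponent" are all arbitrary choices here:

```python
# Toy Elo-style update: treat each question as an opponent with its own
# difficulty rating and nudge the student's rating after every question.
# Sudden between-test jumps in a student's rating would then be worth a look.
def elo_update(student_rating, question_rating, answered_correctly, k=16):
    expected = 1 / (1 + 10 ** ((question_rating - student_rating) / 400))
    score = 1.0 if answered_correctly else 0.0
    return student_rating + k * (score - expected)

# e.g. new_rating = elo_update(1200, 1300, answered_correctly=True)
```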

@pym "Since it's a digital test, I wonder if it might be more practical to use more mundane methods of cheat prevention, like randomizing question order, or varying the given values (for a math/physics test). Or having questions with multi-step parts that require the student to demonstrate reasoning."

In an ideal world, I think this would be best. However, in this particular case, we are limited in how we can modify the tests, because the tests must be authorized by a third-party certification board. So I'm just trying to do the best I can to come up with a cheating detection system for an administration that is unable to make, or uninterested in making, significant changes to the lineup of existing tests, except when it becomes clear that there has been a large-scale leak.

"Also, assuming that questions aren't currently randomized, consider checking 'time at which Student X answered Question Y' and 'time taken for Student X to answer Question Y'."

Very interesting idea! Unfortunately, I think we only record the time at which each test is submitted, not each question, but it's a very clever idea and I'll definitely keep it in mind as a suggestion if we look to move to a new testing system (which I believe the admins have been considering for a while).

"Also, what's an average test score?"

It varies by student level (students of various levels take the same test and are expected to meet certain benchmarks depending on their level). It's unusual for students to score above 90%, and the average is closer to 60%.

"Are you at all planning to incorporate historical data on a per-student basis to project how well a student was expected to do on a particular test?"

That would be part of the follow-up investigation (along with other things, like whether or not two potentially collaborating students in different classes previously had a class together, for example).

"I was also daydreaming about assigning each student an Elo score based on how they performed on tests, and monitoring for any sudden jumps. Though that's probably not the best way to project student performance. Would be very entertaining to use this to watch students grow from behind the scenes though, and it has implications for pedagogy beyond anti-cheat measures too."

That's a fun idea! Maybe an idea for a future project; perhaps I can include it in my pitch to have my employer take me on in a role specifically dedicated to this sort of thing.

Pitched and demo’d the system to two of my supervisors today, and they seemed very impressed. The system clearly works reasonably well in theory, but there are still some potential roadblocks:

  • My immediate supervisors are not authorized to give me the data necessary to apply the system to any class other than my own. I will not resolve this as YES if they merely give me the thumbs up to do with my own classes’ data as I see fit.

  • In order to apply the system to the entire student body, I will need to get it working with their data structures directly out of the box, and I don’t know if this will be feasible for me. It’s certainly not my preferred option.

  • I will need to recalibrate the system every several weeks to account for new tests. It is unclear if the higher-ups have any interest in compensating me for my time, and I obviously will not be putting in all of this extra time for free indefinitely.

  • Failing another resolution, I may ask the higher-ups to take me on in another part-time role to maintain the system, which would address all of the above. I have no idea if they would entertain the idea.

I’m still optimistic about a YES resolution, but I certainly understand people betting the other way. If nothing else, I’ve really enjoyed the project and I’m proud of the system.

how do you think about balancing the false negative rate (cheating student goes undetected) vs false positive rate (innocent student is flagged as cheater)?

what do you estimate are the base rates for cheating in your institution?

by straightforward Bayesian analysis, what percentage of students flagged by your system do you estimate will actually be guilty?

@pyrylium "how do you think about balancing the false negative rate (cheating student goes undetected) vs false positive rate (innocent student is flagged as cheater)?"

It's just a matter of setting a reasonably high "accusation threshold". In my testing using (what I consider to be) relatively pessimistic estimates about cheater prevalence and cheating predilection (the probability that a cheater cheats on a given question), I can generally still get some true positives even if I set my accusation threshold insanely high. This results in 0 false positives and a considerable number of false negatives, which is generally preferred when it comes to cheating detection (better to have a cheater go undetected than an innocent student accused unfairly).

However, given that the intention is not to use the system to retroactively accuse cheaters, but rather to proactively monitor suspected cheaters in subsequent tests, false positives are not that big of an issue.

It should also be noted that, when I use more optimistic assumptions (especially about cheating predilection), the methodology is very well-powered, and false positives and false negatives are both trivially dropped to 0. It remains to be seen where actual test conditions will fall between my pessimistic and optimistic assumptions.
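To show what I mean by picking the accusation threshold on synthetic runs, here's a simplified sweep. The pair scores would come from the Bayesian step, and the representation of pairs here is just a placeholder:

```python
# Illustrative threshold sweep over synthetic results. `scores` maps each
# student-pair (e.g. a frozenset of two IDs) to its posterior P(collaboration);
# `planted` is the set of pairs that were actually synthesized as cheating.
def sweep_thresholds(scores, planted, thresholds):
    for thr in thresholds:
        flagged = {pair for pair, p in scores.items() if p >= thr}
        print(f"threshold={thr:.2f}  flagged={len(flagged)}  "
              f"false_pos={len(flagged - planted)}  "
              f"false_neg={len(planted - flagged)}")

# e.g. sweep_thresholds(scores, planted, [0.50, 0.90, 0.99])
```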

"what do you estimate are the base rates for cheating in your institution?"

I'm playing around with various probabilities in testing, specifically 0.01, 0.05, 0.1, and 0.2. I'm not too worried about it, firstly because I'll get a better and better estimate over time, and secondly because it doesn't move the needle nearly as much as cheating predilection. And the nice thing about the latter is that I can just make a reasonably low assumption, and the cheaters who push the envelope significantly further will only be making themselves easier to detect. And I can also reverse engineer a better estimate of cheating predilection once I have some concrete cases of cheating to examine.

"by straightforward Bayesian analysis, what percentage of students flagged by your system do you estimate will actually be guilty?"

If by "straightforward Bayesian analysis", you mean an accusation threshold of 0.5, I won't know for sure until I plug in a full actual dataset, which I won't be able to do until after my employer has approved the project. To refer back to my testing on synthetic data, when I used optimistic values for cheating predilection (0.75, for example), the PPV was 1.00, and there was really no risk of it ever dropping lower. Of course, I don't expect the actual average cheating predilection to be 0.75, but I also don't intend to use an accusation threshold of 0.5, and so I'm confident that I will be able to keep a PPV of 1.0 by letting some cheaters go unflagged.

@NBAP this is really great analysis, thanks! by "straightforward Bayesian analysis" I mean a baserate / false positive analysis (direct analogy to disease testing rates: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3153801/#:~:text=Just%201%2F1000th%20or%2010,%2F510%2C%20or%202%25.)

as an example (not saying this is representative of your system!), if you have 100% accuracy in flagging cheaters and 99% accuracy in flagging innocents, but only 0.1% of students are cheaters, then only 10% of students flagged as cheaters by your system would actually be guilty.
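Spelling out that arithmetic (these are just the numbers from my example, not estimates of your system):

```python
# Base-rate check: sensitivity 100%, specificity 99%, cheater prevalence 0.1%.
prevalence = 0.001
sensitivity = 1.00
specificity = 0.99

p_flagged = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
ppv = sensitivity * prevalence / p_flagged   # P(actually a cheater | flagged)
print(round(ppv, 3))  # ~0.091, i.e. roughly 1 in 10 flagged students is guilty
```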

given you've stated a goal of minimizing false positives over false negatives (I would argue, correctly), I don't think you'll have this problem, but always good to run a sanity check! your assumptions about cheating prevalence may lead to some unintuitive conclusions.

@pyrylium If I'm understanding you correctly, you're referring to the positive predictive value (the probability that someone with a positive test result actually has the condition).

Using 0.1% of students as cheaters and a copying frequency of 0.5, I synthesized a test with 1500 students and got the following results with a 0.5 accusation threshold:

True Positives: 1

False Positives: 0

True Negatives: 1,124,247

False Negatives: 2

Sensitivity: 0.33

Specificity: 1.00

Positive Predictive Value: 1.00

Negative Predictive Value: 1.00

The reason that there are so many true negatives is that I'm analyzing student-pairs, not students.

The student-pair which got caught had a P(Collaboration) of approximately 0.90, and the cheating student-pairs which slipped through were at approximately 0.44 and 0.03. The first false positive would have happened at around 0.25, so the test could have gotten a sensitivity of 0.67 while maintaining perfect specificity, but of course there would be no way for me to know where exactly to set the threshold if I were blind to the identity of the cheaters.
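For completeness, the metrics above come straight from the confusion counts over student-pairs:

```python
# Reproducing the reported metrics from the raw counts (all over student-pairs).
tp, fp, tn, fn = 1, 0, 1_124_247, 2

sensitivity = tp / (tp + fn)   # 1/3  ≈ 0.33
specificity = tn / (tn + fp)   # 1.00
ppv = tp / (tp + fp)           # 1.00 (positive predictive value)
npv = tn / (tn + fn)           # ≈ 1.00 (negative predictive value)
print(sensitivity, specificity, ppv, npv)
```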

@NBAP brilliant stuff, well done

What kind of organization is your employer? School? Public?

Has there been any talk of a budget for this, and how much?

Does your employer currently have any way of identifying cheaters?

Why does your employer want to find cheaters?

@voodoo My employer is a private academy.

I'm not sure what the budget is, nor do I really know what the company's revenue is, but if I had to make a very rough guess, I would say the latter is probably in the neighbourhood of $100M.

The current method of identifying potential cheaters is very simplistic, essentially just comparing two students and looking at what percentage of their answers were the same.

Presumably, it's not great for any academic institution to have cheaters operating unchecked.


@NBAP It’s presumably not great, but where the rubber hits the road for investment decisions is the ROI: the clearer the line from catching cheaters to important business goals or revenue, the more likely it is to get funded.

@voodoo Very fair point. Worth considering, perhaps, is that the ROI is potentially very favourable. I’m developing the system as a passion project, and if my employer wants to use it, I would probably just ask them to throw me an extra couple of hundred dollars a week to maintain the system and parse the test data (which is currently not very well-formatted for this purpose).

Within the past year, my employer has stressed the importance of ensuring that our graduating students are meeting certain standards to maintain the school’s reputation. For example, teachers had to undergo benchmarking to ensure we were not grading assignments too leniently.
