Software to detect plagiarism: Copyfind

This program examines a collection of document files. It extracts the text portions of those documents and looks through them for matching words in phrases of a specified minimum length. When it finds two files that share enough words in those phrases, copyfind generates html report files. These reports contain the document text with the matching phrases underlined.

What copyfind can do: It can find documents that share large amounts of text. This result may indicate that one file is a copy or partial copy of the other, or that they are both copies or partial copies of a third document.

What copyfind cannot do: It cannot search for text that was copied from any external source, unless you include that external source in the documents you give to copyfind. It works only on local data; it cannot search the web or internet to find matching documents. If you suspect that a particular outside source has been copied, you must create a local document containing that outside material and include this document in the collection of documents that you give to copyfind.

Notes: Compiles as-is under Microsoft Visual C++ and Linux, but may need slight editing for other compilers. I have commented out the line: #include "stdafx.h", but you may need to restore it if you compile in certain ways under Microsoft Visual C++. To compile under Linux, enter the command: “g++ -o copyfind -O copyfind.cpp”. The GNU C++ compiler will then produce an optimized executable called “copyfind”.

Notes: The executable was compiled virus-free with full optimization on 3/7/2002. If you don’t trust executables that you don’t compile yourself, just load the source and compile that instead. I’m simply trying to make things easy for people who don’t have a C++ compiler or the skills to operate it.

How to Use:

Preparation: Create a directory or folder containing: (1) the copyfind executable, (2) all the documents you want to compare, and (3) a doclist.txt file that contains a list of all the document files. The doclist.txt file must be a text file, not a word-processor file. It should have one document file name per line, with a hard return at the end of each line (including the last line). In DOS/Windows, that hard return is a carriage return and line feed combination. In Linux, it’s just a line feed. When you look at it with Notepad (Windows) or less (Linux), you should see something like:

document1.doc
document2.doc
document3.doc

The simplest way to make the doclist.txt file is with the dir (Windows/DOS) or ls (Linux) commands. In Windows/DOS, enter the command prompt and use “cd” to get to the directory containing all your documents. Then execute “dir *.doc /b > doclist.txt” to create the list of documents directly into the doclist.txt file. You can then edit this list with notepad. In Linux, use “cd” to get to the directory containing all your documents. Then execute “ls *.doc > doclist.txt” to create the list of documents directly into the doclist.txt file. You can then use emacs or any other text editor to edit this list.

Copyfind should be able to find the text pieces within those document files, even if they are word processor files. However, if you discover that copyfind is picking out too much garbage from within the document files you gave it or not finding the text, you can simplify the task for copyfind by saving those document files in text only format. Copyfind can read text only files effortlessly, but must work hard to sift through non-printing characters in word processor files. I find that it does a pretty good job with most word processor documents, but I haven’t tried all possible word processors.
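
The sifting described above can be pictured as scanning the raw bytes for runs of printable characters and discarding runs that are too short to be real text. The following is a minimal sketch under that assumption, not the copyfind source. The names isTextByte, extractText, and minRun are hypothetical; minRun plays the role of the “minimum number of characters to consider as text” value that copyfind asks for when it runs.

```cpp
#include <cstddef>
#include <string>

// Bytes treated as text in this sketch: printable ASCII 0x21-0x7E plus
// blank, tab, line feed, and carriage return.
bool isTextByte(unsigned char c) {
    return (c >= 0x21 && c <= 0x7E) ||
           c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

// Keep only runs of at least minRun consecutive text bytes. Shorter runs
// are treated as accidental "characters" inside binary data and dropped.
// Kept runs are joined with a single blank.
std::string extractText(const std::string& raw, std::size_t minRun) {
    std::string out;
    std::size_t i = 0;
    while (i < raw.size()) {
        if (!isTextByte(static_cast<unsigned char>(raw[i]))) {
            ++i;
            continue;
        }
        const std::size_t start = i;
        while (i < raw.size() &&
               isTextByte(static_cast<unsigned char>(raw[i]))) {
            ++i;
        }
        if (i - start >= minRun) {
            if (!out.empty()) out += ' ';
            out.append(raw, start, i - start);
        }
    }
    return out;
}
```

A filter this simple still passes embedded strings that happen to be long enough, which is why saving documents in text only format remains the more reliable option.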

Execution: Open the directory containing all the documents and programs, and run the copyfind executable. It will prompt you for 4 parameters:

  1. Minimum number of words in a phrase – this value is the shortest identical portion of text that copyfind should consider a true match. I find that 6 words works nicely. Shorter than 6 is likely to give spurious matches, while longer than 6 may cause copyfind to miss valid matches.
  2. Minimum number of characters to consider as text – this value is the shortest string of printable characters (including blanks, linefeeds, tabs, and carriage returns) that copyfind should consider to be text. Word processor documents are filled with non-printing bytes that aren’t really characters at all and copyfind must sift through them for the real text portion of the document. Often it will find one or two bytes that correspond to printable characters in the middle of numeric garbage. Clearly, it should ignore those accidental “characters.” To make it ignore them, set this value to 10 or more. That way, copyfind will only consider what it finds to be real text if there are 10 or more printable characters in a row.
  3. Minimum number of matching words to report – Copyfind counts the number of matching words that it finds in a pair of documents. If its count exceeds the threshold that you select, it generates the html report files. I use a value of 500 matching words as the threshold above which I expect a pair of 1500-word documents to contain clear plagiarism. While copying is evident at levels well below this 33% rate, I have chosen not to pursue such cases.
  4. First new document number – When you are comparing a collection of documents for the first time, enter 1 as the value here. Copyfind will then compare every possible pair of documents. But if you’ve already checked the first 1000 documents against one another and have just added another 500 documents, enter 1001 here. That way, copyfind won’t check the first 1000 documents against one another again and will focus its attention only on the 500 new documents. It will compare those 500 new documents against the 1000 old documents and also against one another.
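
The effect of the fourth parameter can be sketched as a rule for choosing which pairs to compare: a pair is checked only if at least one of its documents is new. The code below is an illustration of that rule, not the copyfind source; pairsToCompare, docCount, and firstNew are hypothetical names, and documents are numbered from 1 as in the description above.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// List the document pairs to compare, using 1-based document numbers.
// A pair (i, j) with i < j is listed only if j is a new document, that is,
// j >= firstNew. With firstNew == 1 every possible pair is listed.
std::vector<std::pair<std::size_t, std::size_t>>
pairsToCompare(std::size_t docCount, std::size_t firstNew) {
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    if (firstNew < 1) firstNew = 1;  // treat 0 like 1 (an assumption)
    for (std::size_t j = firstNew; j <= docCount; ++j) {
        for (std::size_t i = 1; i < j; ++i) {
            pairs.emplace_back(i, j);
        }
    }
    return pairs;
}
```

For 1500 documents with firstNew set to 1001, this lists 624,750 pairs: 500 x 1000 new-against-old pairs plus 500 x 499 / 2 new-against-new pairs. Starting from 1 instead would list all 1,124,250 pairs.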

Results: A new file “comparisons.txt” should appear in the current directory, along with html files for any pairs of files that contain above-threshold matching. The comparisons.txt file lists the number of matched words in every pair of documents checked. It lists two numbers for each pair: matches found in the left file and matches found in the right file while doing the comparison. If copyfind is working properly, those numbers should be the same. The html files should appear in pairs and should show the contents of one file, with the words it shares with the second file underlined. The names of the two files and the number of words that match appear in the title.

Limitations: Copyfind is a simple program and has a primitive (DOS) interface. If someone wants to write a pretty, windowing front end for it, I’ll be happy to post it here. I just don’t have time to spruce it up right now and I’m pleased to have it live in a single, relatively portable C++ source file. Also, I have restricted the characters that copyfind considers as text to the bytes 0x21 through 0x7E. Thus copyfind considers the foreign ASCII characters that occupy portions of the range 0x80 through 0xFF as non-printing. I haven’t tried including them in the list of printing characters, though it might work perfectly. Since I wrote copyfind for papers written in English, I never tried to accept those extended ASCII characters. Apart from this extended character set issue, copyfind should work for any language. It does not understand what it “reads”; it merely looks for matching words.

Release Notes:

  1. Copyfind V1.0: This release incorporates hash-coding and presorting in its techniques for matching pairs of documents. Compared to version V0.0 that I first used to identify similarities in 1,500-word term papers at the University of Virginia in April, 2001, this improved version runs about 350 times as fast and uses about 80% less memory. It compares about 2,500 pairs of term papers per second on an 800 MHz Linux box (dual-processor, but with only one processor running copyfind). While comparing 1850 term papers originally took about 70 hours with V0.0, it now takes about 12 minutes with V1.0. This release can also handle raw word processor documents, so that the task of saving each document in text only format is no longer necessary.
  2. Copyfind V1.1: This release fixed a bug that prevented copyfind from detecting some matches in documents that contained repetitive content (the same words appearing many times).
  3. Copyfind V1.2: This release handles non-English characters much better than the previous versions. It should handle most European languages.
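
The hash-coding mentioned in the V1.0 notes can be pictured as reducing each word to an integer, so that comparing words means comparing numbers. The sketch below illustrates that idea together with the minimum phrase length; it is not the copyfind source. It uses a brute-force comparison rather than presorting, the FNV-1a hash is an arbitrary choice, and the names hashWord, hashWords, countMatchedWords, and minPhrase are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Reduce a word to a 32-bit FNV-1a hash (an illustrative choice).
// Distinct words can collide, so a hash match is only very likely a match.
std::uint32_t hashWord(const std::string& w) {
    std::uint32_t h = 2166136261u;
    for (unsigned char c : w) {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

// Split text on whitespace and hash each word.
std::vector<std::uint32_t> hashWords(const std::string& text) {
    std::vector<std::uint32_t> out;
    std::istringstream in(text);
    std::string w;
    while (in >> w) out.push_back(hashWord(w));
    return out;
}

// Count the words of `a` that lie inside some run of at least minPhrase
// consecutive words that also appears in `b`. Brute force, for clarity.
std::size_t countMatchedWords(const std::vector<std::uint32_t>& a,
                              const std::vector<std::uint32_t>& b,
                              std::size_t minPhrase) {
    std::vector<bool> matched(a.size(), false);
    for (std::size_t i = 0; i < a.size(); ++i) {
        for (std::size_t j = 0; j < b.size(); ++j) {
            std::size_t n = 0;  // length of the common run starting at (i, j)
            while (i + n < a.size() && j + n < b.size() &&
                   a[i + n] == b[j + n]) {
                ++n;
            }
            if (n >= minPhrase) {
                for (std::size_t k = 0; k < n; ++k) matched[i + k] = true;
            }
        }
    }
    std::size_t count = 0;
    for (bool m : matched) {
        if (m) ++count;
    }
    return count;
}
```

This brute-force loop is far too slow for thousands of documents; it is meant only to show what counts as a match.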

Wish List of Possible Improvements to Copyfind:

Foreign Language Support: The current version of copyfind does not handle 2-byte characters well. It cannot handle Japanese or other languages that use 2-byte characters.

Better Word Processor Document Filtering: The current version of copyfind does its best to ignore the non-text portions of word processor documents by skipping over sections of data that contain non-printing characters. However, it still finds text-containing garbage such as “adobe photoshop image…” and “Microsoft Word 8.0 Document” embedded in the files. Sometimes copyfind gets so confused that it doesn’t find all of the real text portion of the document either. Ideally, copyfind would use an initial filter on each file it loaded, a filter that would strip out non-text material and leave only the pure, complete text portion of the word processing document.

Word Substitution Detection: The current version of copyfind can be fooled by word substitutions because it looks only for perfect matching. However, adding a table-lookup that mapped the hash-coded values of similar words onto a single hash-coded value would allow copyfind to see through word substitution games.
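
As a rough illustration of the table-lookup idea, the sketch below maps the hash of every word in a group of interchangeable words onto the hash of the group's first word. Nothing like this exists in copyfind today; the FNV-1a hash and the names fnv1a, SubstitutionTable, buildTable, and canonicalHash are all assumptions made for the example.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Maps the hash of a word onto the hash that represents its whole group.
using SubstitutionTable = std::unordered_map<std::uint32_t, std::uint32_t>;

// 32-bit FNV-1a hash of a word (an illustrative choice).
std::uint32_t fnv1a(const std::string& w) {
    std::uint32_t h = 2166136261u;
    for (unsigned char c : w) {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

// Build the lookup table from groups of interchangeable words. Every word
// in a group is mapped to the hash of the group's first word.
SubstitutionTable buildTable(
        const std::vector<std::vector<std::string>>& groups) {
    SubstitutionTable table;
    for (const auto& group : groups) {
        if (group.empty()) continue;
        const std::uint32_t canonical = fnv1a(group.front());
        for (const auto& word : group) table[fnv1a(word)] = canonical;
    }
    return table;
}

// Hash a word, then replace the hash by its group's value if it has one.
std::uint32_t canonicalHash(const std::string& w,
                            const SubstitutionTable& table) {
    const std::uint32_t h = fnv1a(w);
    const auto it = table.find(h);
    return it == table.end() ? h : it->second;
}
```

With a group such as "big", "large", "huge", all three words produce the same value from canonicalHash, so a phrase comparison based on those values would treat them as the same word. The hard part, which this sketch does not address, is assembling a good table of similar words.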

Sentence-Level Comparisons: Plagiarism is not just about reusing words; it’s also about reusing ideas. Even though a sentence has been reworded and rearranged, it can still be plagiarism. I know that sophisticated sentence analysis tools exist, and if these could be incorporated into copyfind, they could allow it to ferret out more subtle forms of plagiarism.

Web-Search Capabilities: The current version of copyfind cannot surf the web for documents. However, it should not be too hard to write an interface between it and some of the web search engines to allow copyfind to download documents that are similar to the contents of the local documents. Copyfind could then do the full comparison.

Versions for Other Operating Systems/Platforms: I do not know if copyfind works on Macs or on other common computers. A Mac version would be very helpful to other people. I don’t have the wherewithal to make one, so I need help. If you can get it working on a Mac, I’ll post it.

Example Set of Documents Containing Plagiarism: To test out new versions of copyfind, we need sets of documents that contain plagiarism. To make them available here, these documents would have to be in the public domain, perhaps collections of old book text that is in the public domain. I would like to post one or more sets of test documents, but they must not violate the copyright laws or student rights to privacy or ownership of their own works.

My Challenge to You Experts Out There: If you are able to implement any of these ideas and thus improve copyfind, please let me know and I will consider posting your version of copyfind here for others to use. As always, these versions will be distributed free of charge under the GNU General Public License.


Copyright 1997-2006 © Louis A. Bloomfield, All Rights Reserved
Page Last Updated: May 9, 2002 11:20 AM