The team behind OnlineTools4Free — building free, private browser tools.
Published May 5, 2026 · 13 min read · Reviewed by OnlineTools4Free
Text Similarity vs Plagiarism Detection: What the Tools Actually Do
The Honest Answer: Most Free "Plagiarism Checkers" Are Similarity Calculators
Type "free plagiarism checker" into a search engine and you will find hundreds of web tools. Paste two blocks of text, click a button, get a percentage. Many of these tools brand themselves as plagiarism detectors. They are not. They are text-similarity calculators that compare exactly the two strings you pasted in. They have no access to the web, no academic database, no archive of student papers, and no embedding model that can detect paraphrase. They cannot tell you whether a text is plagiarized — they can only tell you how similar two specific texts are to each other.
That distinction matters. A plagiarism checker, in the meaningful sense, asks: "Has this text appeared somewhere before?" Answering that requires a corpus to compare against — billions of indexed web pages, millions of academic papers, the contents of every paper ever submitted by a student to the same service, and increasingly an embedding model that can spot paraphrase even when no exact words match. Building and maintaining that corpus is the entire business of Turnitin, iThenticate, and Copyleaks. It is expensive, contractual, and beyond what a free single-page tool can offer.
A similarity calculator, by contrast, asks: "How alike are these two specific texts?" That question can be answered with a few lines of JavaScript running in your browser, with no external data and no server. The answer is mathematically meaningful and useful for several real tasks — but those tasks are not "detect plagiarism in this student essay against the open web".
Vendors who blur this distinction are either marketing aggressively or genuinely confused about what their product does. The result is users who think they have caught a plagiarist when their tool simply found the same text in two clipboards. Or worse: students who paste their essay into a "free plagiarism checker", get a 2% match score against random text, and conclude their essay is original — when in reality the tool never compared against anything outside their session.
This article separates the two categories, explains what each actually measures, and points to the real tools for each use case.
What Text-Similarity Actually Measures
The math behind a similarity calculator is straightforward. Several algorithms are in common use, and each captures a slightly different notion of "alike".
Jaccard similarity. Treat each text as a set of unique tokens (words, or character n-grams). The Jaccard score is the size of the intersection divided by the size of the union: the number of tokens both texts share, normalized by the total number of distinct tokens between them. Jaccard ignores word order and repetition entirely, so "the cat sat on the mat" and "the mat sat on the cat" contain the same token set and score a perfect 1.0. That makes it useful for detecting bag-of-words overlap and useless for detecting structural copying.
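As a rough illustration, word-level Jaccard fits in a few lines of Python. This is a minimal sketch, not any particular tool's implementation: the tokenizer regex and the helper names tokens and jaccard are assumptions chosen for the example.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of unique word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Size of the token intersection divided by the size of the token union."""
    set_a, set_b = tokens(a), tokens(b)
    if not set_a and not set_b:
        return 1.0  # assumption: two empty texts count as identical
    return len(set_a & set_b) / len(set_a | set_b)

# Same words in a different order: the token sets are equal.
print(jaccard("the cat sat on the mat", "the mat sat on the cat"))  # 1.0
# Two of four distinct words shared.
print(jaccard("the cat sat", "the dog sat"))  # 0.5
```

Because the texts are reduced to sets, repeating a word or shuffling a sentence changes nothing in the score.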
Cosine similarity on TF-IDF vectors. Convert each text to a vector where each dimension is a word and the value is term frequency multiplied by inverse document frequency (rarer words count for more). The inverse document frequency has to be estimated from some reference collection of documents; with only two texts it carries little information. The cosine of the angle between the two vectors is the similarity score, ranging from 0 (orthogonal, no shared meaningful terms) to 1 (identical direction). Like Jaccard, cosine similarity ignores word order; unlike Jaccard, it weights rare terms more heavily, so two essays both about "epistemological pluralism" will score higher than two essays both about "the dog".
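A small standard-library sketch can make the weighting concrete. The smoothed IDF formula, the three-document toy corpus, and the function names below are assumptions made for illustration; real libraries offer several TF-IDF variants.

```python
import math
import re
from collections import Counter

def words(text: str) -> list[str]:
    """Lowercase word tokens, duplicates preserved."""
    return re.findall(r"[a-z0-9']+", text.lower())

def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [words(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF (one common variant, assumed here): rarer terms weigh more.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: count * idf[t] for t, count in Counter(toks).items()}
            for toks in tokenized]

def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy corpus, used only to estimate document frequencies.
corpus = [
    "epistemological pluralism in the social sciences",
    "a defence of epistemological pluralism in the humanities",
    "the dog chased the ball in the park",
]
vecs = tfidf_vectors(corpus)
print(cosine(vecs[0], vecs[1]))  # shares rare terms, so the score is higher
print(cosine(vecs[0], vecs[2]))  # shares only common words, so it is lower
```

The first pair scores higher because the shared terms are rare in the corpus, while the second pair overlaps only on words that appear everywhere.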
Levenshtein edit distance. The minimum number of single-character insertions, deletions, or substitutions needed to transform one text into the other. It is sensitive to order and exact characters. A distance of 0 means the texts are identical; the distance can never exceed the length of the longer text, and a distance equal to that length means the texts are completely different. To get a 0-1 similarity score, divide the distance by the length of the longer text and subtract the result from 1. Useful for detecting near-duplicates with small variations (typos corrected, a few words rephrased) but expensive on long texts, because the standard algorithm takes time proportional to the product of the two lengths.
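For readers who want to see the mechanics, here is a sketch of the classic dynamic-programming approach, keeping only two rows of the table in memory. The function names and the choice to score two empty strings as 1.0 are assumptions for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

def edit_similarity(a: str, b: str) -> float:
    """1 minus the distance divided by the length of the longer text."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty texts count as identical
    return 1.0 - levenshtein(a, b) / longest

print(levenshtein("kitten", "sitting"))  # 3
print(edit_similarity("abc", "xyz"))     # 0.0
```

The nested loop is where the quadratic cost comes from: every character of one text is compared against every character of the other.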
N-gram overlap. Slice each text into overlapping sequences of N words (commonly N=3, "trigrams"). Count the n-grams that appear in both texts and normalize by the total number of n-grams (for example, the number of distinct n-grams across both texts). This captures local word-order overlap: "the quick brown fox" and "the brown quick fox" share no trigrams at all even though they share all four words. N-gram overlap is the basis of most "highlight matching phrases" features in real plagiarism tools, scaled to N=5 or N=7 for practical detection.
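A short sketch shows how word n-grams bring order back into the comparison. Normalizing by the union of distinct n-grams is one possible convention and an assumption here, as are the helper names.

```python
import re

def word_ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Set of overlapping n-word sequences in the text."""
    toks = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def ngram_overlap(a: str, b: str, n: int = 3) -> float:
    """Shared n-grams divided by distinct n-grams across both texts."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0  # a text shorter than n words has no n-grams
    return len(grams_a & grams_b) / len(grams_a | grams_b)

# One word changed in the middle: 4 of 10 distinct trigrams survive.
print(ngram_overlap(
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox leaps over the lazy dog",
))  # 0.4

# Same words, different order: only "sat on the" is shared (1 of 7).
shared = word_ngrams("the cat sat on the mat") & word_ngrams("the mat sat on the cat")
print(shared)  # {('sat', 'on', 'the')}
```

The second pair has identical word sets, yet only one trigram in common, which is exactly the order sensitivity a pure bag-of-words score lacks.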
SimHash and MinHash. Locality-sensitive hashing techniques that produce a compact fingerprint of each document. Similar documents produce similar hashes; comparing two hashes is constant-time regardless of document length. This is how systems like Google detect near-duplicate web pages at scale — the actual page comparison would be too expensive on billions of documents, but hash comparison is fast.
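To give a feel for the fingerprinting idea, here is a toy MinHash sketch built on the standard library. It is an illustration under stated assumptions, not how any search engine implements it: salted BLAKE2b stands in for a family of hash functions, and the signature length of 128 and the function names are arbitrary choices.

```python
import hashlib
import re

def shingles(text: str, n: int = 3) -> set[str]:
    """Set of overlapping n-word shingles."""
    toks = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def minhash_signature(items: set[str], num_hashes: int = 128) -> list[int]:
    """For each salted hash function, keep the smallest hash over all items."""
    if not items:
        raise ValueError("cannot fingerprint an empty set")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions that agree; approximates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the old mill"
sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
print(estimated_jaccard(sig_a, sig_b))  # an estimate of the true shingle overlap
```

Once the signatures exist, comparing two documents costs 128 integer comparisons no matter how long the documents were, which is the property that makes the approach workable at web scale.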
None of these algorithms detects paraphrase. If a text rewrites every sentence in different words while preserving meaning, all five score the rewrite as low-similarity. Detecting paraphrase requires moving from lexical overlap to semantic overlap, which means embedding models — sentence transformers like SBERT or general-purpose embeddings like OpenAI's text-embedding-3 family. Computing semantic similarity is what real plagiarism detection has been quietly migrating to over the past few years.
What Real Plagiarism Detection Requires
For a tool to honestly claim "plagiarism detection", it needs four components that no free single-page tool can provide.
A web crawl. A continuously updated index of public web pages. The plagiarism tool fetches the user's text, breaks it into n-grams or embeddings, and queries the index for matches. Even a partial index of the open web is multi-terabyte storage and requires a pipeline that re-crawls frequently. Tools that maintain their own crawl include Turnitin, Copyleaks, and Plagscan. Some smaller services use the Bing or Google search APIs as a cheaper proxy — submitting suspicious phrases as search queries and parsing the results — but search APIs are rate-limited and expensive at any meaningful scale.
An academic corpus. A licensed collection of published academic papers, theses, conference proceedings, and other scholarly content. Most of this content is behind paywalls (Elsevier, Springer, Wiley, JSTOR) and requires institutional licensing agreements. Without it, a "plagiarism checker" cannot detect copy-pasting from research literature, which is the most common kind of student plagiarism in higher education. Turnitin and iThenticate (the same company, different products) have the largest licensed academic corpora and that licensing is a substantial portion of their value.
A submission archive. Every essay submitted to the service becomes a future comparison source. If a student submitted essay A in 2019 and another student submits a copied version of A in 2026, the system catches it. Without an archive, services miss the most common student-to-student copying. Building this archive is what locks institutions into Turnitin specifically — switching providers means losing access to a decade of past submissions to compare against.
A paraphrase detection layer. Modern plagiarism is rarely word-for-word copying. Students paraphrase, swap synonyms, or run text through paraphrasing tools (Quillbot and similar). Lexical n-gram matching misses all of this. Real detection now layers semantic similarity on top — embedding the suspicious text and the corpus chunks, comparing in embedding space, flagging high cosine similarity even when no n-grams overlap. This requires running embedding models at corpus scale, which is computationally expensive enough that only well-funded services do it.
If a tool does not have all four, it is not plagiarism detection. It might be a useful similarity tool, a useful self-check tool, or a useful deduplication tool — but the marketing label of "plagiarism checker" is misleading.
The Big Four Real Services and Their Access Models
Four services dominate genuine plagiarism detection. Each has a different pricing model and target audience.
Turnitin. The market leader for higher education. Sold to institutions on annual contracts; individual access is not available. Turnitin's corpus includes a proprietary web crawl, a substantial licensed academic corpus, and crucially the largest archive of student submissions in the world (reportedly well over a billion papers). The "Similarity Report" UI shows highlighted matched passages with their source URLs. iThenticate is Turnitin's sibling product targeted at researchers and journals: same backend, different UI, different pricing. You access Turnitin through your university's LMS (Canvas, Blackboard, Moodle); there is no consumer-facing checkout. Visit turnitin.com for institutional inquiries.
iThenticate. Same parent (Turnitin LLC), targeted at academic publishing. Used by journal editors before peer review to screen submitted manuscripts for prior publication or self-plagiarism. Pricing is per-document or per-seat for institutional accounts. Available at ithenticate.com. The corpus emphasizes published research over student essays, which is the opposite weighting from Turnitin proper.
Copyleaks. The most accessible commercial option. Offers a free tier (limited monthly word count), a consumer subscription, and an API for programmatic checking. Their corpus combines a web crawl with academic content and increasingly a focus on AI-generated text detection. The API is well-documented and used by content marketing teams to verify outsourced articles, by code-hosting platforms to check submitted snippets, and by smaller institutions that cannot justify Turnitin's pricing. See copyleaks.com.
Plagscan. Plagscan offered both institutional and individual checking. It became part of Ouriginal, which Turnitin acquired in 2021, and the brand has since been largely absorbed into Turnitin's product family. If you encounter a "Plagscan" service today, it is most often a legacy URL routing to current Turnitin products.
The pricing patterns across these services share a structure: free or trial access for very small text volumes, a consumer subscription tier (typically $10 to $30 per month for individual writers), institutional contracts for university-wide use, and an API for programmatic integration. None of them offers a meaningful free tier for production use; the cost of maintaining the corpus rules that out.
Open-Source and Smaller Alternatives
For specific use cases — limited budget, narrow corpus, programmatic integration — several smaller tools are worth knowing.
Plagium. A web-search-based plagiarism checker that submits suspicious phrases as Google or Bing queries and aggregates the results. Lower coverage than Turnitin (limited to publicly indexed web, no academic paywalled corpus, no submission archive) but accessible without an institutional contract. Free for occasional use; paid for higher volumes. Visit plagium.com.
OUVoiceCheck and similar academic tools. Several universities have built internal tools combining basic similarity matching with their own institutional submission archive. These are usually not available outside the institution but are mentioned because they show that "good enough" plagiarism detection is achievable when the corpus is bounded to your own students.
Open-source Python libraries. With scikit-learn (imported as sklearn), TfidfVectorizer plus cosine_similarity can form the core of a custom similarity checker against your own corpus in about 20 lines of code. Combined with sentence-transformers for embedding-based semantic similarity, you can build genuinely sophisticated detection, provided you supply the corpus. This is how some companies build internal "originality checkers" against their own document archive without paying for Turnitin.
Grammarly's plagiarism checker. Bundled with Grammarly Premium ($30/month). Compares against ProQuest's academic corpus and the open web. Less comprehensive than Turnitin but more accessible for individual writers. Available at grammarly.com.
When Simple Text-Similarity Is Exactly What You Want
The frustration of misnamed plagiarism tools obscures the genuine usefulness of similarity calculators. Several real-world tasks are best served by exactly the bag-of-words, n-gram, or edit-distance comparison those tools provide.
Self-deduplication on your own writing. You are writing a long-form report and want to confirm you have not accidentally repeated a paragraph from chapter 2 in chapter 5. A similarity tool comparing the two paragraphs — your own to your own — gives you the answer immediately. No corpus needed; the comparison is between two strings you control.
Originality QA on your own published work. You publish blog posts and want to confirm none of them duplicate each other. A similarity comparison across all your posts (an n^2 problem, but tractable for a few hundred articles) flags pairs that are suspiciously close. No external comparison needed.
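The all-pairs check is easy to sketch. The example below assumes word-level Jaccard as the score and a threshold of 0.6; both are illustrative choices, and the post slugs and texts are hypothetical.

```python
import re
from itertools import combinations

def token_set(text: str) -> set[str]:
    """Unique lowercase word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def flag_near_duplicates(posts: dict[str, str], threshold: float = 0.6):
    """Compare every pair of posts and return those at or above the threshold."""
    sets = {slug: token_set(body) for slug, body in posts.items()}
    flagged = []
    # combinations() yields each unordered pair once: n * (n - 1) / 2 comparisons.
    for slug_a, slug_b in combinations(sets, 2):
        union = sets[slug_a] | sets[slug_b]
        score = len(sets[slug_a] & sets[slug_b]) / len(union) if union else 1.0
        if score >= threshold:
            flagged.append((slug_a, slug_b, score))
    return flagged

# Hypothetical posts, keyed by slug.
posts = {
    "intro-to-css": "flexbox makes page layout simple and predictable",
    "css-basics": "flexbox makes page layout simple and very predictable",
    "baking": "knead the dough then let it rest overnight",
}
print(flag_near_duplicates(posts))  # [('intro-to-css', 'css-basics', 0.875)]
```

A few hundred posts means tens of thousands of pairs, which a loop like this handles comfortably; for much larger archives, a fingerprinting scheme would be the more sensible route.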
Detecting AI rewrites of your own content. You suspect a freelancer ran your draft through a paraphrasing tool. The similarity score between your version and theirs, especially with character-level edit distance, tells you whether the structure was preserved (high similarity) or genuinely rewritten (low similarity).
Comparing translations. Two human translations of the same source text should have moderate similarity at the word level (different word choices) but high similarity at the sentence-structure level. Comparing similarity scores across translation pairs gives a quick measure of consistency.
Tracking edit volume. A teacher receives a revised essay and wants to know how much was actually rewritten. Compare similarity between the previous version and the new version — high similarity means superficial edits, low similarity means substantial rewriting.
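For version-to-version comparison, Python's standard difflib module offers a ready-made ratio. Note that this is a different measure from the algorithms described earlier (it is based on matching blocks rather than Jaccard or Levenshtein), and the function name and whitespace tokenization are assumptions for this sketch.

```python
import re
from difflib import SequenceMatcher

def revision_similarity(old: str, new: str) -> float:
    """Word-level similarity between two versions: 1.0 unchanged, 0.0 nothing kept."""
    old_words = re.findall(r"\S+", old)
    new_words = re.findall(r"\S+", new)
    # ratio() is 2 * matched_words / (len(old_words) + len(new_words)).
    return SequenceMatcher(None, old_words, new_words, autojunk=False).ratio()

draft = "the war began in 1914"
revised = "the war began in 1914 after a long crisis"
print(revision_similarity(draft, draft))    # 1.0
print(revision_similarity(draft, revised))  # 5 matched words out of 5 + 9
```

A score close to 1.0 suggests the revision only touched a few words; a low score suggests the student rewrote most of the text.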
Quick verification when you already know the source. A reader sends you a passage they think was lifted from your book. A similarity comparison between the suspect passage and your original tells you immediately whether the wording matches. You did not need a corpus — you already had the candidate source.
For these tasks, a free in-browser similarity tool is the right answer. It is fast, private (nothing leaves your browser), free, and produces a mathematically meaningful score for the question you actually asked.
Try our Text Similarity Checker for honest two-document comparison using Jaccard and n-gram analysis. If you need full plagiarism detection against the indexed web, use Copyleaks, Grammarly, or your institution's Turnitin access. For comparing two versions of the same document, our text comparison guide walks through diff-based comparison which is often what you actually want when revising a draft.
The OnlineTools4Free Team
We are a small team of developers and designers building free, privacy-first browser tools. Every tool on this platform runs entirely in your browser — your files never leave your device.
