PATENTCORPUS5000.zip - size is 43,045,961 bytes.
Example of F-measure computation
Patent Corpus

This data corpus can be downloaded from the link at the top. It contains 5000 files with technical descriptions of patents belonging to 50 different classes, together with queries that are the class definitions. Each class has exactly 100 files, located in a subfolder whose name matches the class number. Each document is named according to its class number and patent number, e.g. C101P7024994.txt, where 101 is the class number and 7024994 is the patent number. Each document contains only the technical description of the patent; it does not include claims, classes, or the names of attorneys or inventors. All files are in ASCII format. The corpus can be used to validate semantic search software and various document clustering technologies. After the documents are clustered, the result can be matched against the file names and the accuracy easily estimated. It is suitable for both kinds of algorithms: those that use a training set and those that do not. Besides clustering, it can be used to verify query algorithms. Each class has a definition provided by a USPTO expert, which is a short text describing the technical area of prospective inventions. For example, class 131 TOBACCO:
Products containing tobacco or tobacco substitutes intended for personal use for smoking ...
and so on. This query can be executed on the documents of the PATENT CORPUS, and the result can easily be checked for accuracy using the file names. This is, no doubt, a natural language processing test. The query, which defines a technical field and is written by an expert, is used to identify documents that may belong to that technical field.
To estimate accuracy we can use the so-called F-measure. (If anyone knows why it is called the F-measure, please let me know and I'll publish it here.) The F-measure is a formula with two parameters: recall and precision. Recall is the proportion of returned relevant documents out of all relevant documents in the corpus. For example, if our query returned 125 documents in total, 75 of them relevant, but there are 100 relevant documents in the corpus, then recall is R = 75/100 = 0.75. Obviously, we can make the query return all documents and recall will always be 1.0, so recall alone is not enough for a judgment. The second parameter is precision. Precision is the proportion of relevant documents among those retrieved; in this example P = 75/125 = 0.6. Having recall and precision, we can calculate the F-measure as follows:
F = [2 * R * P] / [P + R] = [2 * 0.75 * 0.6] / [0.75 + 0.6] ≈ 0.67
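The computation above can be sketched in a few lines of Python (the function name `f_measure` is my own choice, not from the article):

```python
def f_measure(relevant_returned, total_returned, total_relevant):
    """Harmonic mean of recall and precision for a single query."""
    recall = relevant_returned / total_relevant      # R = 75/100 = 0.75
    precision = relevant_returned / total_returned   # P = 75/125 = 0.6
    return 2 * recall * precision / (recall + precision)

# Example from the text: 125 documents returned, 75 of them relevant,
# 100 relevant documents in the corpus
print(round(f_measure(75, 125, 100), 2))  # prints 0.67
```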
Computing the F-measure for document clustering is slightly different. We have to decide which class to assign to each cluster. Each constructed cluster may contain several actual classes, which are known because we use test data. We assign classes to clusters one by one, choosing the best possible match at each step. I can show this on the example data below:
Presume the table above contains experimental data. The program assigned 39 documents to cluster 1, 43 documents to cluster 2 and 25 documents to cluster 3. Since we use test data, the class of each document is known, and the table shows the actual classes in its columns. If we divide every element by its column sum we obtain the recall for each element.

If we divide each element by its row sum we obtain the precision for each element.

Having recall and precision, we can compute the F-measure for each element.
Now we find the maximum element and assign the corresponding class column to the corresponding cluster row, which is cluster 2 and class 1. Then we remove the second row and the first column from consideration and apply the same logic to the remaining matrix. We can see that cluster 3 = class 2 and cluster 1 = class 3. After each cluster is assigned a class, we can compute the average F-measure, which is 0.51.
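The whole greedy procedure can be sketched as follows. The 3x3 contingency table below is hypothetical (the article's actual counts are not reproduced here), though its row sums match the cluster sizes mentioned above:

```python
# Hypothetical contingency table: rows = clusters, columns = actual classes.
# These counts are illustrative, not the article's experimental data.
counts = [
    [ 5,  4, 30],   # cluster 1 (39 documents)
    [35,  3,  5],   # cluster 2 (43 documents)
    [ 2, 20,  3],   # cluster 3 (25 documents)
]

n_rows, n_cols = len(counts), len(counts[0])
row_sums = [sum(row) for row in counts]
col_sums = [sum(counts[i][j] for i in range(n_rows)) for j in range(n_cols)]

# Per-element recall (divide by column sum), precision (divide by row sum)
# and F-measure
f = [[0.0] * n_cols for _ in range(n_rows)]
for i in range(n_rows):
    for j in range(n_cols):
        r = counts[i][j] / col_sums[j]
        p = counts[i][j] / row_sums[i]
        f[i][j] = 2 * r * p / (r + p) if r + p else 0.0

# Greedy assignment: repeatedly take the largest remaining F element,
# then drop its row and column from consideration
assignment = {}  # cluster row -> class column (0-based)
free_rows, free_cols = set(range(n_rows)), set(range(n_cols))
while free_rows:
    i, j = max(((i, j) for i in free_rows for j in free_cols),
               key=lambda ij: f[ij[0]][ij[1]])
    assignment[i] = j
    free_rows.remove(i)
    free_cols.remove(j)

avg_f = sum(f[i][j] for i, j in assignment.items()) / n_rows
print(assignment)          # with these counts: cluster 2 -> class 1, etc.
print(round(avg_f, 2))
```

With these illustrative counts the greedy pass picks cluster 2 = class 1 first, then cluster 1 = class 3 and cluster 3 = class 2, mirroring the order described in the text; the average F-measure differs from 0.51 only because the counts are made up.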
Since the recognized classes are reassigned labels, we can rearrange the original table accordingly, placing the columns in the same order as the rows associated with them.
Now we can clearly see numbers of correctly recognized documents on the main diagonal.
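The column rearrangement itself is a one-line permutation. Here is a minimal sketch using a hypothetical table and a hypothetical cluster-to-class assignment (neither taken from the article's actual data):

```python
# Hypothetical contingency table: rows = clusters, columns = actual classes
counts = [
    [ 5,  4, 30],   # cluster 1
    [35,  3,  5],   # cluster 2
    [ 2, 20,  3],   # cluster 3
]

# Hypothetical assignment (0-based column indices): cluster 1 -> class 3,
# cluster 2 -> class 1, cluster 3 -> class 2
order = [2, 0, 1]

# Permute the columns so each cluster's assigned class sits on the diagonal
rearranged = [[row[j] for j in order] for row in counts]
for row in rearranged:
    print(row)
# The main diagonal (30, 35, 20) now holds the correctly recognized counts
```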
When testing document clustering software, we may want to know what performance and accuracy are good enough. This depends more on the number of clusters than on the size of the collection. From my experience I can provide some approximate expected figures: for 5 clusters accuracy should be near 95%, for 10 clusters 85%, for 20 clusters 70%, and for 50 clusters about 50%. When the number of clusters exceeds 50, the whole concept of clustering documents stops making sense; the approach to semantic search should then be completely different from clustering. Regarding performance: we may expect processing of about 100 documents per second. A reasonable document size is 2 to 20 pages. Larger documents are complex and not dedicated to one or two topics; they must be processed by parts (chapters or paragraphs).