Friday, July 12, 2013

How To Calculate Tf-Idf and Cosine Similarity using JAVA.

NOTE: Lucene 4.x users, please refer to the post "Calculate Cosine Similarity Using Lucene".

Beginners working on a text mining project are often confused by terms like:
  • TF-IDF
  • COSINE SIMILARITY
  • CLUSTERING
  • DOCUMENT VECTORS
In an earlier post I explained what cosine similarity is. I will not go over it again here; instead I will show a nice little piece of Java code that calculates it.

Many of you must already be familiar with Tf-Idf (Term Frequency-Inverse Document Frequency).
For those who are not, I will explain it in brief.

Term Frequency:
Suppose a document, "Tf-Idf Brief Introduction", contains 60000 words in total, and the word "Term-Frequency" occurs in it 60 times.
Then, mathematically, its term frequency is TF = 60/60000 = 0.001.

Inverse Document Frequency:
Suppose someone bought the whole Harry Potter series. There are 7 books in the series, and the word "AbraKaDabra" appears in 2 of them.
Then, mathematically, its inverse document frequency is IDF = 1 + log(7/2) ≈ 2.2528, taking log as the natural logarithm, which is what Math.log computes in the code below. (Check it yourselves, guys; don't be lazy. I am the lazy one, not you.)

And finally, TFIDF = TF * IDF.

With the math in hand, I assume the intuition is clear too: a term scores high when it is frequent within one document but rare across the collection.
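The arithmetic above can be checked with a few lines of Java. This is only a sketch: the class name TfIdfWorkedExample and its two helper methods are made up for illustration, and multiplying the two results combines two unrelated examples (the 60000-word document and the 7-book series) purely to show the TFIDF step.

```java
public class TfIdfWorkedExample {

    // TF = occurrences of the term / total number of words in the document
    static double tf(int occurrences, int totalWords) {
        return (double) occurrences / totalWords;
    }

    // IDF = 1 + ln(total documents / documents containing the term)
    static double idf(int totalDocs, int docsWithTerm) {
        return 1 + Math.log((double) totalDocs / docsWithTerm);
    }

    public static void main(String[] args) {
        double termFrequency = tf(60, 60000);      // 60 occurrences in 60000 words
        double inverseDocFrequency = idf(7, 2);    // term found in 2 of 7 books
        System.out.println("TF    = " + termFrequency);
        System.out.println("IDF   = " + inverseDocFrequency);
        System.out.println("TFIDF = " + termFrequency * inverseDocFrequency);
    }
}
```

The cast to double matters: without it, 60 / 60000 would be integer division and give 0.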

Document Vector:
There are various ways to calculate document vectors; I am just giving one example. Suppose I calculate the TF-IDF score of every term for document A (in the code below, every unique term across all the documents) and store the scores in an array (or a list, a matrix, any ordered structure; you guys are geniuses, you know how to create a vector). What I get is the document vector of TF-IDF scores for document A.
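As a rough illustration of that idea, the sketch below builds one vector per document over a shared vocabulary, for two tiny in-memory documents. The class DocumentVectorSketch, its method names and the sample words are assumptions made for this example; unlike the classes in this post, it compares terms case-sensitively to stay short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DocumentVectorSketch {

    // Collects every distinct term across all documents, in first-seen order.
    static List<String> vocabulary(List<String[]> docs) {
        List<String> vocab = new ArrayList<>();
        for (String[] doc : docs) {
            for (String term : doc) {
                if (!vocab.contains(term)) {
                    vocab.add(term);
                }
            }
        }
        return vocab;
    }

    // Builds the TF-IDF vector of doc: one slot per vocabulary term.
    static double[] vector(String[] doc, List<String[]> docs, List<String> vocab) {
        double[] v = new double[vocab.size()];
        for (int i = 0; i < v.length; i++) {
            String term = vocab.get(i);
            double inDoc = 0;          // occurrences of term in this document
            for (String t : doc) {
                if (t.equals(term)) inDoc++;
            }
            double docsWithTerm = 0;   // documents that contain term at least once
            for (String[] d : docs) {
                if (Arrays.asList(d).contains(term)) docsWithTerm++;
            }
            double tf = inDoc / doc.length;
            double idf = 1 + Math.log(docs.size() / docsWithTerm);
            v[i] = tf * idf;
        }
        return v;
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
                new String[] {"the", "cat", "sat"},
                new String[] {"the", "dog", "barked", "loudly"});
        List<String> vocab = vocabulary(docs);
        System.out.println(vocab);
        for (String[] doc : docs) {
            System.out.println(Arrays.toString(vector(doc, docs, vocab)));
        }
    }
}
```

Both vectors have six slots, one per distinct word, even though the documents have three and four words. A term that a document does not contain simply gets 0.0 in its slot.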

The class shown below calculates the Term Frequency(TF) and Inverse Document Frequency(IDF).

//TfIdf.java
package com.computergodzilla.tfidf;

import java.util.List;

/**
 * Class to calculate TfIdf of term.
 * @author Mubin Shrestha
 */
public class TfIdf {
    
    /**
     * Calculates the tf of term termToCheck
     * @param totalterms : Array of all the words under processing document
     * @param termToCheck : term of which tf is to be calculated.
     * @return tf(term frequency) of term termToCheck
     */
    public double tfCalculator(String[] totalterms, String termToCheck) {
        double count = 0;  //to count the overall occurrence of the term termToCheck
        for (String s : totalterms) {
            if (s.equalsIgnoreCase(termToCheck)) {
                count++;
            }
        }
        return count / totalterms.length;
    }

    /**
     * Calculates idf of term termToCheck
     * @param allTerms : terms of all the documents, one String[] per document
     * @param termToCheck : term of which idf is to be calculated.
     * @return idf(inverse document frequency) score
     */
    public double idfCalculator(List<String[]> allTerms, String termToCheck) {
        double count = 0;
        for (String[] ss : allTerms) {
            for (String s : ss) {
                if (s.equalsIgnoreCase(termToCheck)) {
                    count++;
                    break;
                }
            }
        }
        return 1 + Math.log(allTerms.size() / count);
    }
}


The class shown below parses the text documents and splits them into tokens. It calls the TfIdf.java class to calculate the TF-IDF scores, and the CosineSimilarity.java class to calculate the similarity between the parsed documents.

//DocumentParser.java

package com.computergodzilla.tfidf;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * Class to read documents
 *
 * @author Mubin Shrestha
 */
public class DocumentParser {

    //This variable will hold all terms of each document in an array.
    private List<String[]> termsDocsArray = new ArrayList<>();
    private List<String> allTerms = new ArrayList<>(); //to hold all terms
    private List<double[]> tfidfDocsVector = new ArrayList<>();

    /**
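     * Method to read files and store in array.
     * @param filePath : path of the folder that holds the .txt files
     * @throws FileNotFoundException
     * @throws IOException
     */
    public void parseFiles(String filePath) throws FileNotFoundException, IOException {
        File[] allfiles = new File(filePath).listFiles();
        if (allfiles == null) {  //listFiles() returns null when filePath is not a readable folder
            throw new FileNotFoundException("Not a readable folder: " + filePath);
        }
        for (File f : allfiles) {
            if (f.getName().endsWith(".txt")) {
                StringBuilder sb = new StringBuilder();
                try (BufferedReader in = new BufferedReader(new FileReader(f))) {  //closes the reader when done
                    String s;
                    while ((s = in.readLine()) != null) {
                        sb.append(s).append(' ');  //keep the last word of a line apart from the first word of the next
                    }
                }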
                String[] tokenizedTerms = sb.toString().replaceAll("[\\W&&[^\\s]]", "").split("\\W+");   //to get individual terms
                for (String term : tokenizedTerms) {
                    if (!allTerms.contains(term)) {  //avoid duplicate entry
                        allTerms.add(term);
                    }
                }
                termsDocsArray.add(tokenizedTerms);
            }
        }

    }

    /**
     * Method to create termVector according to its tfidf score.
     */
    public void tfIdfCalculator() {
        double tf; //term frequency
        double idf; //inverse document frequency
        double tfidf; //term frequency inverse document frequency
        for (String[] docTermsArray : termsDocsArray) {
            double[] tfidfvectors = new double[allTerms.size()];
            int count = 0;
            for (String terms : allTerms) {
                tf = new TfIdf().tfCalculator(docTermsArray, terms);
                idf = new TfIdf().idfCalculator(termsDocsArray, terms);
                tfidf = tf * idf;
                tfidfvectors[count] = tfidf;
                count++;
            }
            tfidfDocsVector.add(tfidfvectors);  //storing document vectors;            
        }
    }

    /**
     * Method to calculate cosine similarity between all the documents.
     */
    public void getCosineSimilarity() {
        for (int i = 0; i < tfidfDocsVector.size(); i++) {
            for (int j = 0; j < tfidfDocsVector.size(); j++) {
                System.out.println("between " + i + " and " + j + "  =  "
                                   + new CosineSimilarity().cosineSimilarity
                                       (
                                         tfidfDocsVector.get(i), 
                                         tfidfDocsVector.get(j)
                                       )
                                  );
            }
        }
    }
}
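tfIdfCalculator() above calls idfCalculator for every term of every document, although a term's IDF depends only on the collection as a whole. A commenter on this post suggested computing each IDF once and storing it in a table. A minimal sketch of that idea follows; the class IdfCache and its precompute method are hypothetical helpers, not part of the original code, and they use the same 1 + log formula as TfIdf.java.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IdfCache {

    // Computes the IDF of every term exactly once and stores it by term.
    static Map<String, Double> precompute(List<String[]> termsDocsArray, List<String> allTerms) {
        Map<String, Double> idfByTerm = new HashMap<>();
        for (String term : allTerms) {
            double docsWithTerm = 0;
            for (String[] doc : termsDocsArray) {
                for (String s : doc) {
                    if (s.equalsIgnoreCase(term)) {
                        docsWithTerm++;
                        break;  // count each document at most once
                    }
                }
            }
            idfByTerm.put(term, 1 + Math.log(termsDocsArray.size() / docsWithTerm));
        }
        return idfByTerm;
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
                new String[] {"a", "b"},
                new String[] {"a", "c"});
        Map<String, Double> idfByTerm = precompute(docs, Arrays.asList("a", "b", "c"));
        System.out.println(idfByTerm);
    }
}
```

With such a map built once before the loops, the inner loop of tfIdfCalculator() could look up idfByTerm.get(terms) instead of calling idfCalculator again for each document.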


This is the class that calculates Cosine Similarity:

//CosineSimilarity.java
package com.computergodzilla.tfidf;

/**
 * Cosine similarity calculator class
 * @author Mubin Shrestha
 */
public class CosineSimilarity {

    /**
     * Method to calculate cosine similarity between two documents.
     * @param docVector1 : document vector 1 (a)
     * @param docVector2 : document vector 2 (b)
     * @return cosine similarity between docVector1 and docVector2
     */
    public double cosineSimilarity(double[] docVector1, double[] docVector2) {
        double dotProduct = 0.0;
        double magnitude1 = 0.0;
        double magnitude2 = 0.0;
        double cosineSimilarity = 0.0;

        for (int i = 0; i < docVector1.length; i++) //docVector1 and docVector2 must be of same length
        {
            dotProduct += docVector1[i] * docVector2[i];  //a.b
            magnitude1 += Math.pow(docVector1[i], 2);  //(a^2)
            magnitude2 += Math.pow(docVector2[i], 2); //(b^2)
        }

        magnitude1 = Math.sqrt(magnitude1);//sqrt(a^2)
        magnitude2 = Math.sqrt(magnitude2);//sqrt(b^2)

        if (magnitude1 != 0.0 && magnitude2 != 0.0) {  //divide only when both magnitudes are non-zero
            cosineSimilarity = dotProduct / (magnitude1 * magnitude2);
        } else {
            return 0.0;
        }
        return cosineSimilarity;
    }
}
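To get a feel for the values this calculation returns, here is a small self-contained sketch that can be run without any files. The class CosineSimilarityDemo and the sample vectors are made up for illustration. It returns 0.0 when either vector has zero magnitude, which is one way to avoid dividing by zero.

```java
public class CosineSimilarityDemo {

    // Cosine of the angle between a and b; 0.0 when either vector has zero magnitude.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] a = {1.0, 2.0, 0.0};
        double[] b = {2.0, 4.0, 0.0};  // same direction as a
        double[] c = {0.0, 0.0, 3.0};  // shares no non-zero component with a
        double[] zero = {0.0, 0.0, 0.0};
        System.out.println("a vs b    = " + cosine(a, b));     // close to 1
        System.out.println("a vs c    = " + cosine(a, c));     // 0.0
        System.out.println("a vs zero = " + cosine(a, zero));  // 0.0
    }
}
```

Vectors pointing in the same direction score close to 1 regardless of their magnitudes, and vectors with no terms in common score 0. That is why the program prints 1.0 for a document compared with itself.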


Here's the main class to run the code:

//TfIdfMain.java
package com.computergodzilla.tfidf;

import java.io.FileNotFoundException;
import java.io.IOException;

/**
 *
 * @author Mubin Shrestha
 */
public class TfIdfMain {
    
    /**
     * Main method
     * @param args
     * @throws FileNotFoundException
     * @throws IOException 
     */
    public static void main(String args[]) throws FileNotFoundException, IOException
    {
        DocumentParser dp = new DocumentParser();
        dp.parseFiles("D:\\FolderToCalculateCosineSimilarityOf"); // give the location of the folder that holds the .txt files
        dp.tfIdfCalculator(); //calculates tfidf
        dp.getCosineSimilarity(); //calculates cosine similarity   
    }
}



You can also download the whole source code from here: Download.

Overall, I first calculate the TF-IDF scores of all the terms for all the documents, which gives one document vector per document. Then I use those document vectors to calculate the cosine similarity.

If you think this explanation is not enough, hit me up in the comments.
Happy Text-Mining!!

86 comments:

  1. java.lang.NoClassDefFoundError: com/computergodzilla/tfidf/TfIdfMain
    Caused by: java.lang.ClassNotFoundException: com.computergodzilla.tfidf.TfIdfMain
    at java.net.URLClassLoader$1.run(URLClassLoader.java:221)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:209)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:324)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:269)
    at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:337)
    Exception in thread "main" Java Result: 1
    BUILD SUCCESSFUL (total time: 0 seconds)

  3. @prasanna wadekar
    Create a package named "com.computergodzilla.tfidf" and copy all the downloaded files inside this package and run the project. This should solve your problem.

  4. What if i want to print the TfIdf value for a particular term?

  5. @Abha
    You can simply do that by using:
    tf = new TfIdf().tfCalculator(docTermsArray, term); //give your term here
    idf = new TfIdf().idfCalculator(termsDocsArray, term);
    tfidf = tf * idf;
    System.out.println(tfidf); //this is your required tfidf value.

  6. How can I specify the document file name when output "between " + i + "and " + j + " = ") in getCosineSimilarity

  7. @Jubilee:
    First add a list to store all the filenames. For this add below line in DocumentParser.java :
    private List fileNameList = new ArrayList();
    Next add all the filenames to the list as shown below:

    if (f.getName().endsWith(".txt")) {
    fileNameList.add(f.getName()); ///add here
    in = new BufferedReader(new FileReader(f));
    StringBuilder sb = new StringBuilder();
    Then you can specify document file name as below:
    System.out.println("between " + fileNameList.get(i) + " and " + fileNameList.get(j) + " = "

  8. @shresthaMubin Thank you - it works. I also noticed that you have specified that docVector1 and docVector2 must be in the same length. Just wondering where did you specify the length normalization in cosineSimilarity class since not all documents are in the same length to perform comparison.

  9. Thank you for your reply! I also noticed that you have specified that docVector must be in the same length in cosineSimilarity. Just wonder where do you specify the length normalization in that class since not all documents are in the same length.

  10. Thank you for your quick reply! Another question: would like to know if you have done length normalization when comparing two document vectors (in case they are not in the same length) in CosineSimilarity - thanks!

  11. Hi shresthaMubin, thanks for the great tutorial. It's very easy to understand. I'd like to point out a possible optimization that you could do. You could actually precalculate idfCalculator and store it in an Hashtable before you start calculated TF. Both of the arguments used in that function don't change when you start calculating TFIDF. But it's probably easier to understand it if you write the code that way.

  12. Also, in CosineSimilarity.java, for the line:

    if (magnitude1 != 0.0 | magnitude2 != 0.0) {

    Shouldn't it be this instead?

    if ((magnitude1 != 0.0) && (magnitude2 != 0.0)) {

    If one of the variables was zero, it will still end up trying to divide by zero in the original code which is what you seem to be avoiding.

  13. @sw2de3fr4gt

    Yes, that's a bug. I will fix it soon and update the post. Thank you.

  14. @Jubilee:
    The above code works for documents of any length. The document vector is created over all the unique terms of all the documents, so every vector has the same length.

  15. run:
    Exception in thread "main" java.lang.NullPointerException
    at com.computergodzilla.tfidf.DocumentParser.parseFiles(DocumentParser.java:37)
    at com.computergodzilla.tfidf.TfIdfMain.main(TfIdfMain.java:26)
    Java Result: 1
    BUILD SUCCESSFUL (total time: 0 seconds)

    I am getting this error while executing...
    And the program shows error in the below line,

    for (String[] ss : allTerms)

    and the error is,
    incompatible types
    required: java.lang.String[]
    found: java.lang.Object

    Thank u

  16. Exception in thread "main" java.lang.Error: Unresolved compilation problems:
    Type mismatch: cannot convert from element type Object to String[]
    Type mismatch: cannot convert from element type Object to String

    at DocumentParser.tfIdfCalculator(DocumentParser.java:64)
    at TfIdfMain.main(TfIdfMain.java:28)

    I get this error in TFIDF calculator method


    public void tfIdfCalculator() {
    double tf; //term frequency
    double idf; //inverse document frequency
    double tfidf; //term requency inverse document frequency
    for (String[] docTermsArray : termsDocsArray) {
    double[] tfidfvectors = new double[allTerms.size()];
    int count = 0;
    for (String terms : allTerms) {
    tf = new TfIdf().tfCalculator(docTermsArray, terms);
    idf = new TfIdf().idfCalculator(termsDocsArray, terms);
    tfidf = tf * idf;
    tfidfvectors[count] = tfidf;
    count++;
    }
    tfidfDocsVector.add(tfidfvectors); //storing document vectors;
    }
    }

  17. Hello first, thank you for your effort in clarifying the program and I have a question
    How could calculate Cosine Similarity one from file path and other from another path
    What are the possible changes that occur on the program

    Replies
    1. Just modify the function
      parseFiles(String filePath)
      to
      parseFiles(String filePath1, String filePath2)
      and replace
      File[] allfiles = new File(filePath).listFiles();
      with
      List<File> allFiles = new ArrayList<File>();
      for(File f : new File(filePath1).listFiles())
      {
      allFiles.add(f);
      }

      for(File f : new File(filePath2).listFiles())
      {
      allFiles.add(f);
      }

  19. Hi everyone i need help for my assignment which requires me to create a programme to check the tfidf of each word that a user searches.
    1. Loading in all the text document information from all the files. A set of files from Open American National Corpus is used for testing in this assignment.

    2. Pre-process each text document to do the relevant word counts, storing the data in hashmaps(one hashmap for one text document) for fast retrieval during the analysis phase.

    3. Provide a menu for user to enter the search query terms, and then calculate the td-idf score for each text document. For example if user enters query term “Singapore attraction” then the document will have a td-idf score which is the sum of td-idf of Singapore + td-idf of attraction.

    4. Display the top 10 query search documents with the score information. You are required to make use of the Comparable interface to help you do sorting.

  20. run:
    Exception in thread "main" java.lang.NullPointerException
    at com.computergodzilla.tfidf.DocumentParser.parseFiles(DocumentParser.java:37)
    at com.computergodzilla.tfidf.TfIdfMain.main(TfIdfMain.java:26)
    Java Result: 1
    BUILD SUCCESSFUL (total time: 0 seconds)

    I am getting this error while executing...



    i have created a package and placed the code above and executed it in netbeans.
    still it is showing any output.

    Replies
    1. Did you give the location of the source files?

    2. I made these changes in main method
      dp.parseFiles("D:\student.txt");
      student.txt is the tabulated source file that i've given.

      run:
      Exception in thread "main" java.lang.NullPointerException
      at javaapplication3.DocumentParser.parseFiles(DocumentParser.java:36)
      at javaapplication3.TfIdfMain.main(TfIdfMain.java:25)
      Java Result: 1
      BUILD SUCCESSFUL (total time: 0 seconds)

      I'm getting this error.

    3. Use dp.parseFiles("D:\\student.txt"); or dp.parseFiles("D:/student.txt"); instead of dp.parseFiles("D:\student.txt");, because a backslash inside a Java string literal has to be escaped. Also, the above program calculates the cosine similarity between two or more files, and you are using only one file, so it will not work. Finally, pass a folder location to dp.parseFiles(""), not a file name.

  21. error: incompatible types
    for (String[] ss : allTerms) {
    required: String[]
    found: Object
    1 error
    object cannot be converted to string
    error in tfidf.java please hel me

    Replies
    1. This has been a repeated problem for users. I will update my code base soon to make it run with older versions of the JDK. Please upgrade your JDK.

  22. i have updated jdk1.7 to jdk 1.8 as you have said but still giving bsame error . please can you help

    Replies
    1. Download the source code from the download link in the post. I am sure that will work. I will update the blog. Let me know if it works for you.

  23. thanks a lot it works
    i have one more query what if i have find tfidf for only a single text document how to do this ?
    hope you will help
    i am new to java so facing this much problem

  25. i am having issue in code. when i add two files in folder then it shows similarity between them 0.0 but when i add more two only then it shows proper score. why it would ? how can i correct it??

  26. plz tell why it is not showing similarity when i add two files in folder. it just shows 0.0 score. but if i add more than two files only then the score is correct.

    Replies
    1. It was a bug in my code base, and I have corrected it. The issue was not the number of files in the folder; the idf formula was wrong. The idf value must have come out as 0.0 when you ran the program, which happens when both files contain the same word. The code will work fine now; I have also updated the idf formula.

    2. Can you tell me how can i show only those files which cosine score is greater than 0.4??

    3. It's very simple. Make the allFiles variable a public field of the class and add an if condition to the code as below:
      /**
      * Method to calculate cosine similarity between all the documents.
      */
      public void getCosineSimilarity() {
      for (int i = 0; i < tfidfDocsVector.size(); i++) {
      for (int j = 0; j < tfidfDocsVector.size(); j++) {
      double cosineSimilarity = new CosineSimilarity().cosineSimilarity
      (
      tfidfDocsVector.get(i),
      tfidfDocsVector.get(j)
      );
      if(cosineSimilarity > 0.4)
      {
      System.out.println("between " + allFiles[i].getName() + " and " + allFiles[j].getName() + " = " + cosineSimilarity);
      }
      }
      }
      }

  27. public double idfCalculator(List allTerms, String termToCheck) {
    double count = 0;
    for (String[] ss : allTerms) {

    its showing error in this 3rd line now.

  29. Thanks it worked perfectly.

  30. I need Some changes in formula because this formula needs docs in same length . Can we use tfidf formula where it wont affect the length of files on similarity score. one thing if we do use tf= 1 + log (tf) and idf = log(idf)... can we achieve this goal. i did it but getting NaN because tf.idf score is in minus. how can we resolve it. if we can resolve it can you write the code for it.

  31. First, to be clear: the formula does not need documents of the same length; the source documents can be of any length. To calculate cosine similarity, the two vectors in the dot product must be of the same length, but that does not mean the documents must be. My code transforms documents of any length into document vectors of the required length. Please read my "What is cosine similarity" blog post.

  32. hmmm okz thanks for clearing it. Now my question is what would happen if we calculate Tf = 1+Math.log(count / totalterms.length ) and idf Math.log(allTerms.size() / count);. can we do this?? if not why??

    Replies
    1. Please study the wiki page http://en.wikipedia.org/wiki/Tf%E2%80%93idf to clarify your concept of TF and IDF. The formulas you mentioned above are wrong, so you certainly can't use them.

  33. Thanks for this code
    I want to ask you how we can calculate the cosineSimilarity using TFIDF between two ontologies instead of document as the elements of ontologies like class , properties instead of words in a document

  34. shresthaMubin i want source code for the information retrievel system in java which will have following functionalities :

    1. User will give the query to the system
    2. system will show us the related ranked documents retrieved from the directory or corpus.

    kindly help me.. :(
    my email id is : firstwebdevelopers@gmail.com

  35. why diffents inputs come the same output

  36. why diffent inputs comes same output..how to give the input

    Replies
    1. Give the location of the folder where you have all the files to be processed. I have commented the section where you should give the folder location.

  37. can u plz tell me that where i can add file names ,m so confused

  38. error: cannot find symbol
    DocumentParser dp=new DocumentParser() ;

    Replies
    1. Exception in thread "main" java.lang.RuntimeException: Uncompilable source code - incompatible types: java.lang.Object cannot be converted to java.lang.String[]
      at com.computergodzilla.tfidf.DocumentParser.tfIdfCalculator(DocumentParser.java:67)
      at com.computergodzilla.tfidf.TfIdfMain.main(TfIdfMain.java:32)
      Java Result: 1 even i am using jdk 1.8 and i have two txt file of uique word in a folder but it does not worl

    2. Did you try the downloadable source code? The link is https://drive.google.com/file/d/0BzQONlWil3VGRVNmYm5KUEJsTWM/view?usp=sharing. If not, please try it and let me know.

  39. Replies
    1. Why do you need sample data? You can try with any text files.

  40. can you please provide a code for finding idf value of more than one term jointly.

    Replies
    1. There certainly isn't any such thing as calculating idf for "more than one word jointly." TFIDF scoring is for a single term in a collection of documents. Please clarify your concepts regarding TFIDF. BTW, I have provided the TfIdf class for calculating TF and IDF above in the blog. You will have to calculate the tfidf of each term individually.

  41. can you plz provide a code for finding idf of more than one term jointly

    Replies
    1. NOTE: The reply below is the same as my reply to rohini 454 above.

      There certainly isn't any such thing as calculating idf for "more than one word jointly." TFIDF scoring is for a single term in a collection of documents. Please clarify your concepts regarding TFIDF. BTW, I have provided the TfIdf class for calculating TF and IDF above in the blog. You will have to calculate the tfidf of each term individually.

  42. I have copied all java programs in TfIdfMain.java program.i am getting following error.please give a solution for this error.
    error:class TfIdf is public,should be declared in a file named TfIdf.java.

    Replies
    1. You don't have to copy all the Java classes into TfIdfMain.java; as the error says, each public class belongs in its own file. TfIdfMain.java is only the executor class. Look into the tfIdfCalculator method of DocumentParser.java and follow up accordingly.

  43. Can u plz send the vedio(execution of above program).i tried but i always getting an error:can't find the symbol DocumentParser..once plz show me that execution procedure

  44. Please explain the execution procedure of above program..plz help..

  45. I need above requirement urgently...so plz give a reply as early as possible.

    Replies
    1. Hello abc123, create a new project in your favourite IDE. Create a new package called com.computergodzilla.tfidf. Now copy all the above class files into that package. Change the source folder path passed to parseFiles in TfIdfMain.java, and then run the program. I am really busy right now, so if this doesn't help, I will put up a detailed explanation this weekend. Just let me know if it helped. Thank you.

  46. Thank you so much..its working....but i got the outPut as follows:between 0 and 0=1.0
    between 0 and 1=0.0
    between 1 and 0=0.0
    between 1 and 1=1.0
    This is the output what i got...plz explain what represents the above values....

    Replies
    1. Please read my blog post "What is cosine similarity?"
      computergodzilla.blogspot.com/2012/12/what-is-cosine-similarity.html
      to understand what those values mean.

  47. Actually i need tfidf value for particular term which is present in text files....above you have given modifications for finding tdidf value for particular term i tried, but it showing the error as:gladiator cannot be resolved to a variable...here gladiator is a term which is present in text files..i want to findout tfidf value for gladiator term...

    Replies
    1. The above code gives cosine similarity scores. It takes files as input, not terms, so giving gladiator as the input is not accepted, and it is obvious that you will get the error.

  48. How can we finout the tfidf of particulat term...plz explain it...

    Replies
    1. Replace tfIdfCalculator() with the method below:
      /**
      * Method to create termVector according to its tfidf score.
      * @param term : pass your term here.
      */
      public void tfIdfCalculator(String term) {
      double tf; //term frequency
      double idf; //inverse document frequency
      double tfidf; //term frequency inverse document frequency
      for (String[] docTermsArray : termsDocsArray) {
      double[] tfidfvectors = new double[allTerms.size()];
      int count = 0;
      tf = new TfIdf().tfCalculator(docTermsArray, term);
      idf = new TfIdf().idfCalculator(termsDocsArray, term);
      tfidf = tf * idf;
      tfidfvectors[count] = tfidf;
      count++;
      tfidfDocsVector.add(tfidfvectors); //storing document vectors;
      }
      }

      Now, pass your term to above function and enjoy.

  49. Hi shresthaMubin,
    you have a mistake in your downloadable files. In TfIdf.java in the function "idfCalculator" there is missing a "1+":

    return 1 + Math.log(allTerms.size() / count);

    Regards,
    Chris

    Replies
    1. Yes Chris, Thank you. I will correct it soon. Thank you for your valuable comment.

  51. When will you update the code for K-means clustering with cosine similarity as a distance measure?? :) Waiting!!

    Replies
    1. Hi Rizwan,

      I won't be adding code for K-means clustering. Since the cosine measures are already there, computing K-means clustering is a straightforward job. You will find a lot of open source projects, and source code on Stack Overflow or via Google, for K-means clustering.

  52. Awesome code. A great big thank you ;-).
    In December 2014 someone asked you about modifying getCosineSimilarity to print the file names in "between + [i] + " and " + [j]. When I made allFiles in the parseFiles I got a lot of underlined code.

    I changed:
    File[] allfiles = new File(filePath).listFiles();
    to
    public File[] allfiles = new File(filePath).listFiles();
    but received "Illegal start of expression". Can you please help? Thank you.

  53. I managed by declaring Files allfiles as a global variable under the private variables at the beginning.

  54. Hello Mubin

    Would it be possible to modify the code so that it computes the similarity of in one pass? For example; say I have 3 documents of type txt and 10 documents of type html all in one folder and I want to find the cosine similarity of the first 3 with the rest, without comparing each document with another. So the iteration will compare the first document with the remaining 12, the second with 12 and the third with 12 and then stop. Any help would be greatly appreciated. Thanks

  55. what to do at the error
    Exception in thread "main" java.lang.NullPointerException
    at com.computergodzilla.tfidf.DocumentParser.parseFiles(DocumentParser.java:36)
    at com.computergodzilla.tfidf.TfIdfMain.main(TfIdfMain.java:25)
    Java Result: 1
    BUILD SUCCESSFUL (total time: 0 seconds)

  56. how do you perform clustering on the output ans what are the steps for that

  57. Hey Shrestha Mubin,

    This is exactly what I wanted and it worked perfectly. Nice explanation and sample code. Thanks a lot!!!
