Python - Locating Duplicate Words in a Text File

6

I was wondering if you could help me with a Python programming issue. I'm trying to write a program that reads a text file and outputs "word 1 True" if the word has already occurred in the file, or "word 1 False" if this is the first time the word appears.

Here's what I came up with:

fh = open(fname)
lst = list ()
for line in fh:
    words = line.split()
    for word in words:
        if word in words:
            print("word 1 True", word)
        else:
            print("word 1 False", word)

However, it only ever prints "word 1 True".

Please advise.

Thanks!


You need an additional set: look up each word in it to check whether it has already been seen, and add the word to the set if it has not.

Every word from words is going to show up in words, so the test is just an expensive way to say if True:. If you're looking for duplicates, you need a count.
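The set-based approach suggested in the comments above could be sketched roughly as follows. The function name mark_repeats and the sample lines are made up for illustration; any iterable of lines, such as an open file object, should work the same way.

```python
def mark_repeats(lines):
    """Yield (word, seen_before) for every word in an iterable of lines."""
    seen = set()  # words encountered so far, across all lines
    for line in lines:
        for word in line.split():
            seen_before = word in seen  # test against earlier words only
            seen.add(word)
            yield word, seen_before


# Hypothetical sample input; a file opened with open(fname) could be passed instead.
sample = ["the cat sat", "on the mat"]
for word, seen_before in mark_repeats(sample):
    print(word, seen_before)
```

The key difference from the code in the question is that each word is tested against the words seen earlier, not against the line it was just taken from.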

3 Answers

8

A simple (and fast) way to implement this is with a Python dictionary. A dictionary can be thought of as an array whose index (the key) is a string rather than a number; more generally, any hashable value can serve as a key.

This gives some code fragments like:

found_words = {}    # empty dictionary
with open("words1.txt", "rt") as f:
    words1 = f.read().split()  # split on any whitespace; TODO - handle punctuation
for word in words1:
    if word in found_words:
        print(word + " already in file")
    else:
        found_words[word] = True    # the value is unused; only the key matters

Now, while processing your words, checking whether a word is already in the dictionary tells you whether it has been seen before.
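One possible way to handle the punctuation TODO in the fragment above is to normalize each token before the dictionary lookup. This is only a sketch: the helper names normalize and repeated_words are invented here, and treating "Hello," and "hello" as the same word is an assumption that may not suit every text.

```python
import string


def normalize(word):
    """Lowercase a token and strip leading/trailing ASCII punctuation."""
    return word.strip(string.punctuation).lower()


def repeated_words(text):
    """Return the normalized words that had already appeared, in order."""
    found_words = {}
    repeats = []
    for raw in text.split():
        word = normalize(raw)
        if not word:
            continue  # the token consisted of punctuation only
        if word in found_words:
            repeats.append(word)
        else:
            found_words[word] = True
    return repeats


print(repeated_words("Hello, world! Hello again - world."))  # ['hello', 'world']
```

Note that str.strip only removes punctuation at the ends of a token, so a word such as "Let's" keeps its apostrophe.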

5

This code snippet doesn't read from a file, but it's easy to test and study. The main difference is that you must open the file and read it line by line, as you did in your example.

example_file = """
This is a text file example

Let's see how many times example is typed.

"""
result = {}
words = example_file.split()
for word in words:
    # get() returns 0 for a word not yet in the dictionary, so its count starts at 0 + 1
    result[word] = result.get(word, 0) + 1
for word, occurrence in result.items():
    print("word:%s; occurrence:%s" % (word, occurrence))
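To adapt the counting loop above to a real file read line by line, something like the following might work. The function name count_words is made up for this sketch, and io.StringIO stands in for a file on disk so the example can run anywhere.

```python
import io


def count_words(lines):
    """Count occurrences of each whitespace-separated word in an iterable of lines."""
    result = {}
    for line in lines:
        for word in line.split():
            result[word] = result.get(word, 0) + 1
    return result


# io.StringIO stands in for a real file; with a file on disk you could write
# "with open(fname) as fh: counts = count_words(fh)" instead.
fake_file = io.StringIO(
    "This is a text file example\n\nLet's see how many times example is typed.\n"
)
counts = count_words(fake_file)
print(counts["example"])  # 2
```

Because the function only iterates over lines, it never needs to hold the whole file in memory at once.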

UPDATE:

As suggested by @khachik, a better solution is to use collections.Counter.

>>> # Find the ten most common words in Hamlet
>>> import re
>>> from collections import Counter
>>> words = re.findall(r'\w+', open('hamlet.txt').read().lower())
>>> Counter(words).most_common(10)
[('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631),
 ('you', 554),  ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]
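If only the duplicated words are of interest, the Counter result can be filtered by count. This is a rough sketch; the function name duplicated_words and the sample sentence are illustrative only.

```python
import re
from collections import Counter


def duplicated_words(text):
    """Return a dict of lowercased words that occur more than once, with their counts."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    return {word: n for word, n in counts.items() if n > 1}


print(duplicated_words("To be, or not to be"))  # {'to': 2, 'be': 2}
```

Unlike the dictionary loop, this reports duplicates only after the whole text has been counted, so it cannot say at which point a word was first repeated.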

1

You might also want to track previous locations, something like this:

with open(fname) as fh:
    vocab = {}  # maps each word to a list of (line index, word index) pairs
    for i, line in enumerate(fh):
        words = line.split()
        for j, word in enumerate(words):
            if word in vocab:
                locations = vocab[word]
                print(word, "occurs at", locations)
                locations.append((i, j))
            else:
                vocab[word] = [(i, j)]
                # print("First occurrence of", word)
