Question
Compare strings from a very large text file (over 100 GB) with a small text file (about 30 lines) and print all the strings contained in both files
I have two text files. One contains a very long list of strings (100 GB); the other contains about 30 strings. I need to find which lines in the second file also appear in the first file and write them to a third text file. Searching for each line manually is a pain, so I wanted to write a script to do it automatically. I chose Python because it is the only language I know even a little.
Essentially I tried copying this answer since I'm too inexperienced to write my own code: Compare 2 files in Python and extract differences as a strings
smallfile = 'smalllist.txt'
bigfile = 'biglist.txt'

def file_2_list(file):
    with open(file) as file:
        lines = file.readlines()
        lines = [line.rstrip() for line in lines]
    return lines

def diff_lists(lst1, lst2):
    differences = []
    both = []
    for element in lst1:
        if element not in lst2:
            differences.append(element)
        else:
            both.append(element)
    return differences, both

listbig = file_2_list(bigfile)
listsmall = file_2_list(smallfile)
diff, both = diff_lists(listbig, listsmall)
print(both)
I wanted it to print the lines that are in both lists. However, it gave me a "memory error". I'm already using a 64-bit version of Python, so the memory limit shouldn't be an issue, should it? (I have 16 GB of RAM.)
So how can I avoid this "memory error"? Or is there a better way to accomplish this task?
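For reference, here is a minimal sketch of one direction that might avoid the error: keep only the small file in memory (as a set) and read the big file one line at a time instead of calling readlines() on it. The function name find_common_lines, the output path argument, and the UTF-8 encoding are assumptions made for illustration, and this has not been run against a 100 GB file.

```python
def find_common_lines(small_path, big_path, out_path):
    """Write the lines of small_path that also occur in big_path to out_path."""
    # The small file (about 30 lines) fits in memory easily.
    # Assumption: both files are UTF-8 text; blank lines are ignored.
    with open(small_path, encoding="utf-8") as f:
        small_lines = [line.rstrip() for line in f if line.strip()]
    wanted = set(small_lines)

    found = set()
    # Iterating over the file object yields one line at a time,
    # so the big file is never loaded into memory as a whole.
    with open(big_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.rstrip()
            if line in wanted:
                found.add(line)
                # Stop early once every wanted line has been seen.
                if len(found) == len(wanted):
                    break

    # Keep the order in which the lines appear in the small file.
    common = [line for line in small_lines if line in found]
    with open(out_path, "w", encoding="utf-8") as f:
        for line in common:
            f.write(line + "\n")
    return common
```

With the file names from above, this would be called as find_common_lines('smalllist.txt', 'biglist.txt', 'common.txt'), where common.txt is a made-up name for the third file. Memory use should then depend on the small file and the longest single line rather than on the size of the big file, though a full pass over 100 GB will still take a while.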