Conversation
Sourcery refactored master branch
desterhuizen left a comment
I see that the majority of the changes here are mainly code cleanup. I feel that a few small changes could speed these operations up even more.
```diff
 def remove_duplicates_from_file(infile_path, outfile_path="temp.000000000.bopscrk"):
     lines_seen = set()  # holds lines already seen
-    outfile = open(outfile_path, "w")
-    infile = open(infile_path, "r")
-    for line in infile:
-        if line not in lines_seen:  # not a duplicate
-            outfile.write(line)
-            lines_seen.add(line)
-    outfile.close()
+    with open(outfile_path, "w") as outfile:
+        infile = open(infile_path, "r")
+        for line in infile:
+            if line not in lines_seen:  # not a duplicate
+                outfile.write(line)
+                lines_seen.add(line)
```
I would recommend sorting the data read from infile_path before executing the loop; this removes the need for a set(), reducing the memory footprint of this function. With sorted input, the duplicate check is a single comparison of two strings, touching only those memory locations, whereas a set membership test has to hash each line and probe the table to confirm it does not already exist (many operations, even if they are C-level loops!).
I use sorted() here, but if the file is already sorted that call can be removed.
```python
def remove_duplicates_from_file(infile_path, outfile_path="temp.000000000.bopscrk"):
    last_line = ''  # holds the last line seen
    with open(outfile_path, "w") as outfile:
        infile = open(infile_path, "r")
        for line in sorted(infile):
            if line != last_line:  # not a duplicate
                outfile.write(line)
                last_line = line
```
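One caveat, offered as a hedged aside: sorted(infile) reads the whole file into memory before iterating, so the saving is the set's per-entry overhead rather than the file contents themselves. A minimal variant under the same assumptions that also closes the input file via a context manager:

```python
def remove_duplicates_from_file(infile_path, outfile_path="temp.000000000.bopscrk"):
    last_line = ''
    # Both files are closed automatically; sorted() still materializes
    # every input line in memory at once.
    with open(infile_path, "r") as infile, open(outfile_path, "w") as outfile:
        for line in sorted(infile):
            if line != last_line:  # duplicates are adjacent after sorting
                outfile.write(line)
                last_line = line
```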
```diff
 def remove_by_lengths(wordlist, min_length, max_length):
     '''expect a list, return a new list with the values between min and max length provided'''
-    new_wordlist = []
-    for word in wordlist:
-        #if (len(str(word)) < min_length) or (len(str(word)) > max_length): wordlist.remove(word)
-        if (len(str(word)) >= min_length) and (len(str(word)) <= max_length): new_wordlist.append(str(word))
-    return new_wordlist
+    return [
+        str(word)
+        for word in wordlist
+        if (len(str(word)) >= min_length) and (len(str(word)) <= max_length)
+    ]
```
I would make use of the filter function in Python. It may make this slightly cleaner, and it removes some redundant calculations, since len(str(word)) is computed only once per word.
```python
def length_filter(word, min_length, max_length):
    # compute the length once per word instead of twice
    word_length = len(str(word))
    return min_length <= word_length <= max_length

def remove_by_lengths(wordlist, min_length, max_length):
    '''expect a list, return a new list with the values between min and max length provided'''
    # list() is needed in Python 3, where filter() returns a lazy iterator;
    # note this returns the original items rather than str() conversions
    return list(filter(lambda word: length_filter(word, min_length, max_length), wordlist))
```
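A quick usage sketch under those assumptions (the sample wordlist is invented for illustration):

```python
words = ['cat', 'house', 'hippopotamus', 42]
print(remove_by_lengths(words, 3, 5))  # ['cat', 'house']; 42 is dropped since str(42) has length 2
```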
```diff
@@ -69,18 +64,15 @@ def case_transforms(word):
 def leet_transforms(word):
     new_wordlist = []
```
Changing this to a set makes the collection unique as it is built, so there is no need to remove the duplicates at the end. This applies to almost all the other generate and transform functions.
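A minimal sketch of that idea, assuming a transform function that currently accumulates into a list; the leet substitutions below are invented for illustration and are not the project's actual table:

```python
def leet_transforms(word):
    # A set deduplicates on insertion, so no separate
    # remove-duplicates pass is needed afterwards.
    new_words = set()
    # Illustrative substitutions only; the real function's
    # leet table may differ.
    for plain, leet in (('a', '4'), ('e', '3'), ('i', '1'), ('o', '0')):
        new_words.add(word.replace(plain, leet))
    return list(new_words)  # callers expecting a list still get one
```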