Refactor and Cleanup #28

Open

LopeKinz wants to merge 2 commits into r3nt0n:master from LopeKinz:master

Conversation

@LopeKinz

No description provided.

Sourcery AI and others added 2 commits May 23, 2023 07:45

@desterhuizen left a comment

I see that the majority of the changes here are mainly code cleanup. I feel a few small changes could speed up the operations even more.

Comment on lines 47 to +54
  def remove_duplicates_from_file(infile_path, outfile_path="temp.000000000.bopscrk"):
      lines_seen = set()  # holds lines already seen
-     outfile = open(outfile_path, "w")
-     infile = open(infile_path, "r")
-     for line in infile:
-         if line not in lines_seen:  # not a duplicate
-             outfile.write(line)
-             lines_seen.add(line)
-     outfile.close()
+     with open(outfile_path, "w") as outfile:
+         infile = open(infile_path, "r")
+         for line in infile:
+             if line not in lines_seen:  # not a duplicate
+                 outfile.write(line)
+                 lines_seen.add(line)

I would recommend sorting the data read here from infile_path before executing the loop; this removes the need for a set(), reducing the memory footprint of this function. With sorted input, the duplicate check becomes a simple comparison between the current line and the previous one, rather than a membership test against a set holding every line seen so far.

I use sorted() here, but if the file is already sorted it can be removed.

def remove_duplicates_from_file(infile_path, outfile_path="temp.000000000.bopscrk"):
    last_line = ''  # holds the previously written line
    with open(outfile_path, "w") as outfile, open(infile_path, "r") as infile:
        for line in sorted(infile):
            if line != last_line:  # not a duplicate
                outfile.write(line)
                last_line = line
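
For what it's worth, here is a quick sanity check of the sorted variant (the file names and contents are just an illustration, not part of the project):

# hypothetical demo files, for illustration only
with open("words.txt", "w") as f:
    f.write("beta\nalpha\nbeta\nalpha\n")

remove_duplicates_from_file("words.txt", "deduped.txt")

with open("deduped.txt") as f:
    print(f.read())  # "alpha\nbeta\n" -- duplicates gone, output now in sorted order

One trade-off to be aware of: sorted(infile) changes the output order (and materializes the file's lines in memory while sorting), whereas the set-based version preserves the original line order.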

Comment on lines 26 to 33
  def remove_by_lengths(wordlist, min_length, max_length):
      '''expect a list, return a new list with the values between min and max length provided'''
-     new_wordlist = []
-     for word in wordlist:
-         #if (len(str(word)) < min_length) or (len(str(word)) > max_length): wordlist.remove(word)
-         if (len(str(word)) >= min_length) and (len(str(word)) <= max_length): new_wordlist.append(str(word))
-     return new_wordlist
\ No newline at end of file
+     return [
+         str(word)
+         for word in wordlist
+         if (len(str(word)) >= min_length) and (len(str(word)) <= max_length)
+     ]
\ No newline at end of file

I would make use of the filter function in Python. It may make it slightly cleaner and can remove some additional calculations.

def length_filter(word, min_length, max_length):
    # avoid shadowing the min/max builtins; a chained comparison reads cleaner
    word_length = len(str(word))
    return min_length <= word_length <= max_length

def remove_by_lengths(wordlist, min_length, max_length):
    '''expect a list, return a new list with the values between min and max length provided'''
    return list(filter(lambda word: length_filter(word, min_length, max_length), wordlist))
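
For example (output shown assuming the list() wrapper above):

words = ['a', 'ab', 'abc', 'abcd']
print(remove_by_lengths(words, 2, 3))  # ['ab', 'abc']

Note that filter() on its own returns a lazy iterator in Python 3, which is why the result is wrapped in list() to match the docstring. Also, unlike the original, this keeps the items as-is rather than converting them with str(); callers relying on that conversion would need a map(str, ...) step.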

@@ -69,18 +64,15 @@ def case_transforms(word):
 def leet_transforms(word):
     new_wordlist = []

Changing this to a set makes the collection unique as it is built, so there is no need to remove the duplicates at the end. This applies to almost all the other generate and transform functions.
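
As a sketch of what that could look like (the body and substitution pairs below are illustrative, not the tool's actual leet table):

def leet_transforms(word):
    # A set keeps the results unique as they are generated,
    # so no deduplication pass is needed afterwards.
    new_words = set()
    # hypothetical substitutions, for illustration only
    for char, leet in (('a', '4'), ('e', '3'), ('i', '1'), ('o', '0')):
        candidate = word.replace(char, leet)
        if candidate != word:
            new_words.add(candidate)
    return new_words

Depending on what callers expect, the return value may need to be wrapped in list() at the end.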
