Conversation
Updated:
- Redefine the "entanglement" function
- Rewrite the "refine" function
- Change the name of the new untangle method from "permutations" to "ShUnTan"
schlegelp left a comment
Thanks for this PR - there is some good stuff in there. Unfortunately, some of your changes have a large impact on how the library works and what it can be used for, and I'm not on board with those (see my comments). I'm happy to discuss options on how to proceed though.
    method=sort, **sort_kwargs)
    fig = pylab.figure(figsize=(8, 8))
    def draw_tanglegram(linkage_1, linkage_2, labels1, labels2, color_by_diff=True, dend_kwargs={}):
Any reason why you dropped the entire docstring and a couple of parameters?
Also, am I correct that you want to change the workflow so that people produce the linkage themselves (i.e. no more DataFrames), untangle it, and then pass it to the plotting function?
Hi. Sorry for dropping parameters in a way that changes how the library is used. I deleted them because I did not use them, but you are right, those options should stay available to other users.
Yes. The workflow I had in mind is that users produce the linkages first and then use the untangle methods and the "draw" function to get the desired tanglegram layout.
    def untangle_random_search(link1, link2, labels1, labels2, R=100, L=1.5):
    def untangle_random_search(link1, link2, labels1, labels2, R=1000, L=1.0):
    """Untangle dendrogram using a simple random search.
I'm not really on board with you removing the empty lines in the docstrings.
    often one will want to use 0, 1, 1.5 or 2:
    ``sum(abs(x-y)^L)``.
    def entanglement(link1, link2):
Seems like L is still accepted (and other functions use it as a parameter) but entanglement now ignores it and just does squared distance?
This is my mistake. L should be one of 0, 1, 1.5 or 2. I always use L = 2, so I left it fixed at that value. I am fixing it.
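For reference, a minimal sketch of an entanglement sum that honours L, following the ``sum(abs(x-y)^L)`` expression from the docstring. The name `entanglement_sketch` and its dict inputs (label to leaf position) are assumptions made for illustration, not the code in this PR, and no normalisation is applied.

```python
def entanglement_sketch(order1, order2, L=1.5):
    """Return sum(abs(x - y) ** L) over the labels present in both orderings.

    order1, order2: dicts mapping label -> leaf position (hypothetical inputs).
    """
    shared = set(order1) & set(order2)
    if not shared:
        raise ValueError("the two orderings have no labels in common")
    return sum(abs(order1[label] - order2[label]) ** L for label in shared)


# Two orderings of the same three leaves, one the reverse of the other.
o1 = {"A": 0, "B": 1, "C": 2}
o2 = {"A": 2, "B": 1, "C": 0}
print(entanglement_sketch(o1, o2, L=1))  # 4
print(entanglement_sketch(o1, o2, L=2))  # 8
```

With L = 2 this reduces to the squared-distance behaviour discussed above, so the exponent only needs to be threaded through rather than hard-coded.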
    exist_in_both = list(set(lindex1) & set(lindex2))
    ix = np.arange(max(len(lindex1), len(lindex2)))
    if not exist_in_both:
If you are not using labels but just indices, then there is no point in checking if they exist in both. Or am I missing something?
The "leaves_list" function returns the list of the leaves' indices, so we only work with indices. In doing so, we have to assume that the relationship between indices and labels is one-to-one and identical in both trees.
That's fair but I want/need to cater for scenarios where that's not the case.
Totally agree. We should use labels instead of indices.
    index=labelsB)
    # Mapping the "number" (1 til tree size) in the left tree with the right tree
    matching_leaf_vector = np.zeros(max(len(lindex1), len(lindex2)))
    for i in lindex2:
I haven't tested it properly but this for loop can't possibly be faster than the previous array-based solution. Could you elaborate a bit on what the advantage of doing it this way is?
The previous array-based solution is actually not correct. Here we want to match the "number" (from 1 to tree size) of each object in the left tree with the right tree and then calculate the difference between these numbers in the two trees, not the difference between indices. Is that the matching you do with the "dict"? Honestly, I do not understand that part.
The old calculation gives results that differ from those obtained in R.
I disagree re the existing solution being incorrect - it might not yield exactly the same results as in R but certainly does the same job. Using {label: index} dicts is necessary for scenarios where labels in both dendrograms don't match up 1:1.
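A minimal sketch of why {label: index} dicts help when the labels do not match up 1:1: positions are looked up by label, and labels missing from either side are simply skipped. The helper name `shared_positions` and its list inputs are hypothetical, not part of the library.

```python
def shared_positions(leaves1, leaves2):
    """Return the shared labels and their positions in each ordering.

    leaves1, leaves2: labels in left-to-right dendrogram order
    (hypothetical inputs; the two lists may differ in content and length).
    """
    pos1 = {label: i for i, label in enumerate(leaves1)}
    pos2 = {label: i for i, label in enumerate(leaves2)}
    # Keep only labels found on both sides, in the order of the first tree.
    shared = [label for label in leaves1 if label in pos2]
    return shared, [pos1[s] for s in shared], [pos2[s] for s in shared]


# "X" exists only on the left and "Y" only on the right.
shared, p1, p2 = shared_positions(["A", "B", "X", "C"], ["C", "Y", "A", "B"])
print(shared)  # ['A', 'B', 'C']
print(p1)      # [0, 1, 3]
print(p2)      # [2, 3, 0]
```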
I agree it is necessary to use such a dictionary. But I am not sure it matches the numbers (from 1 to tree size) in the left tree with the right tree. What I wanted can be illustrated with the following example:
Left tree: A D E C F
Right tree: C D A F E
Giving objects in the left tree numbers from 1 to tree size yields: 1, 2, 3, 4, 5
Matching these numbers with the right tree: 4, 2, 1, 5, 3
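The same example, written as a short sketch with a label-to-number dict (variable names are illustrative only):

```python
left = ["A", "D", "E", "C", "F"]
right = ["C", "D", "A", "F", "E"]

# Number the objects in the left tree from 1 to tree size.
number_in_left = {label: i for i, label in enumerate(left, start=1)}

# Read those numbers off in the order of the right tree.
matched = [number_in_left[label] for label in right]
print(matched)  # [4, 2, 1, 5, 3]
```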
That's pretty much what the existing function does with the dict.
Then that's my bad for not realizing it.
    def untangle(link1, link2, labels1, labels2, method='random', L=1.5, **kwargs):
    def untangle(link1, link2, labels1, labels2, method='random', L=2.0, **kwargs):
Looks like labels are still accepted but essentially ignored in favour of just using the linkage. This implies that the labels in each linkage always match perfectly (i.e. index left 1 = index 1 right and so on). This may work for toy examples but will not be true for most real world examples. This change is unfortunately a deal breaker.
It is assumed that the sets of objects (labels) in the two dendrograms have a one-to-one correspondence. That case does occur in practice, for example when we apply different hierarchical clustering algorithms to the same dataset.
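One way to make that assumption explicit rather than implicit is to check it up front. This is only a sketch, and the helper name `labels_match_one_to_one` is hypothetical:

```python
def labels_match_one_to_one(labels1, labels2):
    """True if both label lists are duplicate-free and hold the same labels."""
    unique1, unique2 = set(labels1), set(labels2)
    no_duplicates = len(unique1) == len(labels1) and len(unique2) == len(labels2)
    return no_duplicates and unique1 == unique2


print(labels_match_one_to_one(["a", "b", "c"], ["c", "a", "b"]))  # True
print(labels_match_one_to_one(["a", "b", "c"], ["a", "b", "d"]))  # False
```

When the check fails, an index-only approach would silently compare unrelated leaves, which is the scenario the label-based matching is meant to cover.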
    @@ -720,7 +620,6 @@ def shuffle_dendogram(link, copy=True):
    def leaf_order(link, labels=None, as_dict=True):
I still find that the entanglement behaves strangely, and I might know the reason. The problem comes from the leaf_order function.
    leafs_ix = sclust.hierarchy.leaves_list(link)
returns a list of the indices of the objects as they appear in the dendrogram (these indices correspond to the indices in "labels").
    if as_dict:
        if not isinstance(labels, type(None)):
            return dict(zip(labels, leafs_ix))
matches the labels in "labels" with the indices in sclust.hierarchy.leaves_list(link). However, the order of the objects in "labels" differs from their order along the x-axis of the dendrogram, so the matching is wrong.
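A small sketch of the concern, assuming "labels" is given in the original observation order. The list `leafs_ix` below is a hard-coded stand-in for what `leaves_list` could return (observation indices from left to right), not output from a real linkage, and `by_position` is only one possible alternative.

```python
labels = ["A", "B", "C", "D"]  # labels in original observation order
leafs_ix = [2, 0, 3, 1]        # stand-in for leaves_list(link): C, A, D, B

# Pairing as quoted above: the i-th label gets the i-th entry of leafs_ix.
zipped = dict(zip(labels, leafs_ix))

# Alternative: look up each leaf's label, then record its position on the axis.
by_position = {labels[ix]: pos for pos, ix in enumerate(leafs_ix)}

print(zipped)       # {'A': 2, 'B': 0, 'C': 3, 'D': 1}
print(by_position)  # {'C': 0, 'A': 1, 'D': 2, 'B': 3}
```

Here "A" sits at position 1 along the axis, but the zipped dict reports 2, so the two mappings differ whenever the leaf permutation is not its own inverse.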
I can give you an example via email.
Updated: