NetworkX: Find longest path in DAG returning all ties for max
I'm having trouble figuring out how to update the networkx `dag_longest_path()` algorithm so that, instead of returning the first maximum edge it finds, it returns "N" for ties or a list of all edges tied for the maximum weight.
I first created a DAG from a pandas dataframe that contains an edge list like the following subset:
```
edge1        edge2        weight
115252161:T  115252162:A  1.0
115252162:A  115252163:G  1.0
115252163:G  115252164:C  3.0
115252164:C  115252165:A  5.5
115252165:A  115252166:C  5.5
115252162:T  115252163:G  1.0
115252166:C  115252167:A  7.5
115252167:A  115252168:A  7.5
115252168:A  115252169:A  6.5
115252165:A  115252166:G  0.5
```
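Then I use the following code to topologically sort the graph and find the longest path according to the edge weights:
```python
G = nx.from_pandas_edgelist(
    edge_df,
    source="edge1",
    target="edge2",
    edge_attr=["weight"],
    # nx.OrderedDiGraph() in the original post; it was removed in NetworkX 3.0,
    # and nx.DiGraph keeps insertion order on Python 3.7+
    create_using=nx.DiGraph(),
)

longest_path = pd.DataFrame(nx.dag_longest_path(G))
```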
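This works great, except when there is a tie for the maximum weighted edge: it returns the first maximum edge found, whereas I would like it to return just an "N", representing "Null".
So currently the output is:
```
115252161  T
115252162  A
115252163  G
115252164  C
115252165  A
115252166  C
```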
But what I actually need is:
```
115252161  T
115252162  N (or [A,T])
115252163  G
115252164  C
115252165  A
115252166  C
```
The algorithm for finding the longest path is:
```python
def dag_longest_path(G):
    dist = {}  # stores [node, distance] pair
    for node in nx.topological_sort(G):
        # pairs of dist,node for all incoming edges
        pairs = [(dist[v][0] + 1, v) for v in G.pred[node]]
        if pairs:
            dist[node] = max(pairs)
        else:
            dist[node] = (0, node)
    node, (length, _) = max(dist.items(), key=lambda x: x[1])
    path = []
    while length > 0:
        path.append(node)
        length, node = dist[node]
    return list(reversed(path))
```
Reproducible example:
```python
import io

import pandas as pd
import networkx as nx

edge_df = pd.read_csv(
    # io.StringIO stands in for pd.compat.StringIO, which newer pandas releases no longer provide
    io.StringIO(
        """edge1    edge2    weight
115252161:T 115252162:A 1.0
115252162:A 115252163:G 1.0
115252163:G 115252164:C 3.0
115252164:C 115252165:A 5.5
115252165:A 115252166:C 5.5
115252162:T 115252163:G 1.0
115252166:C 115252167:A 7.5
115252167:A 115252168:A 7.5
115252168:A 115252169:A 6.5
115252165:A 115252166:G 0.5"""
    ),
    sep=r" +",
    engine="python",  # regex separators are handled by the Python parser
)

G = nx.from_pandas_edgelist(
    edge_df,
    source="edge1",
    target="edge2",
    edge_attr=["weight"],
    # nx.OrderedDiGraph() in the original post; removed in NetworkX 3.0
    create_using=nx.DiGraph(),
)

longest_path = pd.DataFrame(nx.dag_longest_path(G))
```
I ended up just modeling the behavior in a `defaultdict` of `Counter` objects.
```python
from collections import defaultdict, Counter
```
I changed my edge list into (position, nucleotide, weight) tuples:
```python
test = [(112, "A", 23.0), (113, "T", 27), (112, "T", 12.0), (113, "A", 27), (112, "A", 1.0)]
```
Then I used `defaultdict(Counter)` to quickly sum the weight of each nucleotide at each position:
```python
nucs = defaultdict(Counter)

for key, nuc, weight in test:
    nucs[key][nuc] += weight
```
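And then I looped through the dictionary to pull out all nucleotides whose total equals the maximum value:
```python
seq_list = []  # created once, outside the loop, so calls accumulate across positions
for key, nuc in sorted(nucs.items()):  # sorted so positions come out in order
    max_nuc = []
    max_val = max(nuc.values())
    for x, y in nuc.items():
        if y == max_val:
            max_nuc.append(x)

    # more than one nucleotide at the maximum means a tie
    if len(max_nuc) != 1:
        max_nuc = "N"
    else:
        max_nuc = "".join(max_nuc)

    seq_list.append(max_nuc)

sequence = "".join(seq_list)
```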
This returns the final sequence made of the nucleotide with the maximum value at each position, with N at every tied position:
```
TNGCACAAATGCTGAAAGCTGTACCATANCTGTCTGGTCTTGGCTGAGGTTTCAATGAATGGAATCCCGTAACTCTTGGCCAGTTCGTGGGCTTGTTTTGTATCAACTGTCCTTGTTGGCAAATCACACTTGTTTCCCACTAGCACCAT
```
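For readers who prefer the `[A,T]` form mentioned in the question over a bare "N", the same counting idea can be wrapped in one function. This is only a sketch: the function name `consensus` and the `show_ties` flag are hypothetical, and it assumes that exact equality of the summed weights is an acceptable definition of a tie.

```python
from collections import Counter, defaultdict


def consensus(records, show_ties=False):
    """Build a consensus string from (position, nucleotide, weight) tuples.

    `consensus` and `show_ties` are illustrative names, not part of the
    original answer. A tie means two or more nucleotides share the
    maximum summed weight at a position (exact equality is assumed).
    """
    totals = defaultdict(Counter)
    for position, nuc, weight in records:
        totals[position][nuc] += weight

    calls = []
    for position in sorted(totals):
        counts = totals[position]
        top = max(counts.values())
        winners = sorted(n for n, w in counts.items() if w == top)
        if len(winners) == 1:
            calls.append(winners[0])
        elif show_ties:
            # list every tied nucleotide, e.g. "[A,T]"
            calls.append("[" + ",".join(winners) + "]")
        else:
            calls.append("N")
    return "".join(calls)


test = [(112, "A", 23.0), (113, "T", 27), (112, "T", 12.0), (113, "A", 27), (112, "A", 1.0)]
print(consensus(test))                  # AN
print(consensus(test, show_ties=True))  # A[A,T]
```

At position 112 the totals are A = 24.0 and T = 12.0, so A wins outright; at 113 both A and T total 27, which is reported as a tie.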
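However, the question kept bugging me, so I ended up using node attributes in networkx as a means of flagging each node as being part of a tie or not. Now, when a node is returned in the longest path, I can check its "tie" attribute and replace the node name with "N" if it has been flagged: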
```python
def get_path(self, edge_df):
    G = nx.from_pandas_edgelist(
        edge_df,
        source="edge1",
        target="edge2",
        edge_attr=["weight"],
        # nx.OrderedDiGraph() in the original post; removed in NetworkX 3.0
        create_using=nx.DiGraph(),
    )

    # flag all nodes as not having a tie
    # (NetworkX 2.x and later take the value first, then the attribute name)
    nx.set_node_attributes(G, False, "tie")

    # check nodes for ties
    for node in G.nodes:
        # collect all incoming edges for this node
        edges = G.in_edges(node, data=True)

        # if there are multiple edges
        if len(edges) > 1:
            # find max weight
            max_weight = max(edge[2]["weight"] for edge in edges)

            tie_check = []
            for edge in edges:
                # pull out all edges that match max weight
                if edge[2]["weight"] == max_weight:
                    tie_check.append(edge)

            # check if there are ties
            if len(tie_check) > 1:
                for x in tie_check:
                    # flag the source node of each tied edge
                    # (G.node was removed in NetworkX 2.4; G.nodes replaces it)
                    G.nodes[x[0]]["tie"] = True

    # return longest path
    longest_path = nx.dag_longest_path(G)
    return longest_path
```
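The function above returns the node names but stops short of the substitution itself. A minimal sketch of that last step follows, assuming the tie flags have been read into a plain dict keyed by node name (for example with `nx.get_node_attributes(G, "tie")`). The helper name `path_to_sequence` is hypothetical.

```python
def path_to_sequence(path, tie_flags):
    """Turn a path of 'position:base' node names into a sequence string.

    `tie_flags` maps node name -> bool; nodes that are missing from it are
    treated as not tied. Both names are illustrative, not from the answer.
    """
    bases = []
    for node in path:
        # node names look like '115252162:A'; keep only the base
        _position, _, base = node.partition(":")
        bases.append("N" if tie_flags.get(node, False) else base)
    return "".join(bases)


path = ["115252161:T", "115252162:A", "115252163:G"]
flags = {"115252162:A": True, "115252162:T": True}
print(path_to_sequence(path, flags))  # TNG
```

In the question's data, `115252162:A` and `115252162:T` both enter `115252163:G` with weight 1.0, so both source nodes are flagged and the position is reported as N.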
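Judging by your example, each node is identified by a position ID (the part before the `:`), and nodes at the same position with different bases attached can be treated as one vertex when computing the path.
Basically, drop the base from every node name in `edge_df`, compute the longest path on the positions alone, and then attach the base labels from the original data: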
```python
# keep only the position part of each node name
edge_df_pos = pd.DataFrame(
    {
        "edge1": edge_df.edge1.str.partition(":")[0],
        "edge2": edge_df.edge2.str.partition(":")[0],
        "weight": edge_df.weight,
    }
)

# collect every base label seen at each position
vert_labels = dict()
for col in ("edge1", "edge2"):
    verts, lbls = edge_df[col].str.partition(":")[[0, 2]].values.T
    for vert, lbl in zip(verts, lbls):
        vert_labels.setdefault(vert, set()).add(lbl)

G_pos = nx.from_pandas_edgelist(
    edge_df_pos,
    source="edge1",
    target="edge2",
    edge_attr=["weight"],
    # nx.OrderedDiGraph() in the original post; removed in NetworkX 3.0
    create_using=nx.DiGraph(),
)

longest_path_pos = nx.dag_longest_path(G_pos)

longest_path_df = pd.DataFrame([[node, vert_labels[node]] for node in longest_path_pos])
```
Result:
```python
#            0       1
# 0  115252161     {T}
# 1  115252162  {A, T}
# 2  115252163     {G}
# 3  115252164     {C}
# 4  115252165     {A}
# 5  115252166  {G, C}
# 6  115252167     {A}
# 7  115252168     {A}
# 8  115252169     {A}
```
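To get from these label sets to the single-letter form asked for in the question, each set can be collapsed to its only member or to "N". The helper below is a hypothetical sketch (`collapse_labels` and `tie_symbol` are illustrative names) that works on a plain list of positions and a dict of label sets.

```python
def collapse_labels(path, vert_labels, tie_symbol="N"):
    """Return (position, base) pairs, using tie_symbol where a position
    carries more than one base label. Names here are illustrative only."""
    rows = []
    for position in path:
        labels = vert_labels[position]
        base = next(iter(labels)) if len(labels) == 1 else tie_symbol
        rows.append((position, base))
    return rows


labels = {"115252161": {"T"}, "115252162": {"A", "T"}, "115252163": {"G"}}
for position, base in collapse_labels(["115252161", "115252162", "115252163"], labels):
    print(position, base)
```

One caveat: this treats any position with more than one label as a tie, regardless of weight. In the result above, position 115252166 carries both G and C even though the corresponding edges weigh 0.5 and 5.5, so it would also come out as "N".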
If my interpretation is not correct, I doubt there is a simple extension of the algorithm based on topological sort. The problem is that a graph can admit multiple topological sorts. If you print `dist` as it is defined in `dag_longest_path` for your example, you get something like this:
```python
{'115252161:T': (0, '115252161:T'),
 '115252162:A': (1, '115252161:T'),
 '115252162:T': (0, '115252162:T'),
 '115252163:G': (2, '115252162:A'),
 '115252164:C': (3, '115252163:G'),
 '115252165:A': (4, '115252164:C'),
 '115252166:C': (5, '115252165:A'),
 '115252166:G': (5, '115252165:A'),
 '115252167:A': (6, '115252166:C'),
 '115252168:A': (7, '115252167:A'),
 '115252169:A': (8, '115252168:A')}
```
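Note that `'115252162:T'` appears only in its own entry and is never recorded as the predecessor of any other node, so paths through it cannot be recovered from `dist` alone.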
This line inside the function seems to discard the paths you want, because `max` returns only one result:
```python
node, (length, _) = max(dist.items(), key=lambda x: x[1])
```
I would keep the maximum value and then search for all items that match it, then reuse the existing code to build each path. An example would be something like this:
```python
def dag_longest_path(G):
    dist = {}  # stores [node, distance] pair
    for node in nx.topological_sort(G):
        # pairs of dist,node for all incoming edges
        pairs = [(dist[v][0] + 1, v) for v in G.pred[node]]
        if pairs:
            dist[node] = max(pairs)
        else:
            dist[node] = (0, node)
    # keep the maximum distance itself rather than a single end node
    _, (max_length, _) = max(dist.items(), key=lambda x: x[1])
    # find all nodes whose distance equals the maximum
    nodes = [(node, length) for node, (length, _) in dist.items() if length == max_length]
    paths = []
    # walk back from each of those end nodes and collect the paths
    for node, length in nodes:
        path = []
        while length > 0:
            path.append(node)
            length, node = dist[node]
        paths.append(list(reversed(path)))
    return paths
```
P.S. I have not tested this code to check that it runs correctly.
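The idea of keeping the maximum and collecting everything that matches it can be pushed one step further by remembering every tied predecessor, not just one. The following is a hedged, pure-Python sketch and not the networkx implementation. The function name `all_longest_paths` is hypothetical; it assumes strictly positive weights, no duplicate edges, and that exact equality is an acceptable tie test.

```python
from collections import defaultdict


def all_longest_paths(edges):
    """Return every maximum-weight path of a DAG given as (u, v, weight) triples.

    Illustrative sketch: assumes positive weights, no parallel edges, and
    exact equality as the tie test.
    """
    succ = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for u, v, w in edges:
        succ[u].append((v, w))
        indegree[v] += 1
        nodes.update((u, v))
    if not nodes:
        return []

    # Kahn's algorithm: emit a node once all of its predecessors are emitted.
    order = []
    ready = sorted(n for n in nodes if indegree[n] == 0)
    while ready:
        u = ready.pop()
        order.append(u)
        for v, _ in succ[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)

    # best[n]: weight of the heaviest path ending at n
    # preds[n]: every predecessor that achieves best[n] (ties are kept)
    best = {n: 0 for n in nodes}
    preds = {n: [] for n in nodes}
    for u in order:
        for v, w in succ[u]:
            cand = best[u] + w
            if cand > best[v]:
                best[v], preds[v] = cand, [u]
            elif cand == best[v]:
                preds[v].append(u)

    # Walk back from every end node that reaches the global maximum.
    top = max(best.values())
    stack = [[n] for n in nodes if best[n] == top]
    paths = []
    while stack:
        partial = stack.pop()
        head = partial[0]
        if not preds[head]:
            paths.append(partial)  # reached a start node
        else:
            stack.extend([p] + partial for p in preds[head])
    return paths


edges = [("a", "c", 1.0), ("b", "c", 1.0), ("c", "d", 2.0)]
print(sorted(all_longest_paths(edges)))  # [['a', 'c', 'd'], ['b', 'c', 'd']]
```

This only reports ties in cumulative path weight. A tie between two incoming edges alone, as between `115252162:A` and `115252162:T` in the question's data, is not a tie between complete paths, because the path through `115252162:A` also includes the edge from `115252161:T`.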