Neo4j Cypher query to NetworkX

The basic question is, how do we read an entire graph from a Neo4j store into a NetworkX graph? And another question is, how do we extract subgraphs from Cypher and recreate them in NetworkX, to potentially save memory?

Using a naive query to read all relationships

This is based on cypher-ipython module. This uses a simple query like the following to obtain all the data:

MATCH (n) OPTIONAL MATCH (n)-[r]->() RETURN n, r

This can be read into a graph using the following code. Note that the rows may duplicate both relationships and nodes, but this is taken care of by the use of neo4j IDs.

def rs2graph(rs):
    graph = networkx.MultiDiGraph()

    for record in rs:
        node = record['n']
        if node:
            print("adding node")
            nx_properties = {}
            nx_properties.update(node.properties)
            nx_properties['labels'] = node.labels
            graph.add_node(node.id, **nx_properties)

        relationship = record['r']
        if relationship is not None:   # essential because relationships use hash val
            print("adding edge")
            graph.add_edge(
                relationship.start, relationship.end, key=relationship.type,
                **relationship.properties
            )

    return graph

There's something about this query that is rather inelegant, that is that the result set is essentially 'denormalized'.

Using aggregation functions

Luckily there's another more SQL-ish way to do it, which is to COLLECT the relationships of each node into an array. This then returns lists which represent a distinct node and the complete set of relationships for that node, similar to something like the ARRAY_AGG() and GROUP BY combination in PostgreSQL. This seems much cleaner to me.

# this version expects a collection of rels in the variable 'rels'
# But, this version doesn't handle dangling references
def rs2graph_v2(rs):
    graph = networkx.MultiDiGraph()

    for record in rs:
        node = record['n2']
        if not node:
            raise Exception('every row should have a node')

        print("adding node")
        nx_properties = {}
        nx_properties.update(node.properties)
        nx_properties['labels'] = list(node.labels)
        graph.add_node(node.id, **nx_properties)

        relationship_list = record['rels']

        for relationship in relationship_list:
            print("adding edge")
            graph.add_edge(
                relationship.start, relationship.end, key=relationship.type,
                **relationship.properties
            )

    return graph

Trying to extend to handle subgraphs

When we have relationship types that define subtrees, which are labelled something like :PRECEDES in this case, we can attempt to materialize this sub-graph selected from a given root in memory. In the query below, the Token node with content nonesuch is taken as the root.

This version can be used with a Cypher query like the following:

MATCH (a:Token {content: "nonesuch"})-[:PRECEDES*]->(t:Token)
WITH COLLECT(a) + COLLECT(DISTINCT t) AS nodes_
UNWIND nodes_ AS n
OPTIONAL MATCH p = (n)-[r]-()
WITH n AS n2, COLLECT(DISTINCT RELATIONSHIPS(p)) AS nestedrel
RETURN n2, REDUCE(output = [], rel in nestedrel | output + rel) AS rels

And the Python code to read the result of this query is as such:

# This version has to materialize the entire node set up front in order
# to check for dangling references.  This may induce memory problems in large
# result sets
def rs2graph_v3(rs):
    graph = networkx.MultiDiGraph()

    materialized_result_set = list(rs)
    node_id_set = set([
        record['n2'].id for record in materialized_result_set
    ])

    for record in materialized_result_set:
        node = record['n2']
        if not node:
            raise Exception('every row should have a node')

        print("adding node")
        nx_properties = {}
        nx_properties.update(node.properties)
        nx_properties['labels'] = list(node.labels)
        graph.add_node(node.id, **nx_properties)

        relationship_list = record['rels']

        for relationship in relationship_list:
            print("adding edge")

            # Bear in mind that when we ask for all relationships on a node,
            # we may find a node that PRECEDES the current node -- i.e. a node
            # whose relationship starts outside the current subgraph returned
            # by this query.
            if relationship.start in node_id_set:
                graph.add_edge(
                    relationship.start, relationship.end, key=relationship.type,
                    **relationship.properties
                )
            else:
                print("ignoring dangling relationship [no need to worry]")

    return graph