Graphical Interpretation of data using ArangoDB
Many industries and laboratories still rely on relational database management systems to handle their data. But the raw data they encounter is usually unstructured: it is highly complicated, fast-changing and too massive for conventional technologies to handle efficiently.
In the past, I have advocated working on huge amounts of data using only relational database management. For the purpose of actually knowing what goes on under the hood, I think that handling big data is essential, and the lessons learned from building things from scratch are real game-changers when it comes to tackling real-world data. It is the NoSQL database systems, though, that allow simpler scalability and improved performance when maintaining big unstructured data.
In this article, I describe my work over this summer, in which I dealt with huge and highly connected data, stored in .json format, and discovered relationships between its nodes. To do this, I built a generic API that interprets this data as graphs, so that it is concerned only with the data points and the relationships between them rather than with the values themselves.
I used ArangoDB, which worked perfectly well for this job. It is an open-source NoSQL database that not only works with documents but can also handle graphs natively. I also tested its performance with different numbers of clients working at the same time, which I'll discuss later in detail.
Overall, the article is divided into the following sub-sections:
- Getting Started with ArangoDB: A brief introduction to ArangoDB, ArangoQL and the installation process.
- Building the Graph API: Steps taken to build the API using Java and ArangoDB.
- Using ArangoQL for exploring and visualizing the dataset: Examples of some ArangoQL queries for a given dataset, and using the web interface to visualize the graph database.
- Analyzing the performance of these ArangoQL queries: Building a RESTful API, an introduction to Apache JMeter and the steps taken for performance testing.
ArangoDB
In its own words, it is a multi-threaded “native multi-model database” that allows us to store data as key/value pairs, graphs or documents, and to access any or all of that data using a single declarative query language. It is called a multi-model database because it allows ad hoc queries to run on data stored in different models. We can also choose between single-node and cluster execution. It worked quite efficiently for graph algorithms processed across data spread throughout the cluster.
You can read the whole documentation here. I'll jump directly into some basic concepts and terminology in ArangoDB that will be important for this article.
A database here is a set of collections. These collections are the equivalent of tables in relational databases. They store records, which are referred to as documents. A simple document, by default, has its own immutable handle, _id, which consists of the collection's name and the document key; a primary key, _key, which the user can specify when the document is created (otherwise ArangoDB generates one); and a document revision, _rev, which is maintained by ArangoDB.
ArangoDB allows us to perform various operations on graphs, such as traversals or finding the shortest path. For the graph model, the database uses two kinds of collections: the vertices of a graph are stored in a document collection and the edges in an edge collection. Vertices can be any objects, like users or groups, and edges are the relationships between those objects. While vertices in graphs have the same properties as a simple document in a collection, edges additionally carry a direction in _from and _to, which store document handles as strings. An edge can also carry a user-defined attribute, such as the label used in this article, to name the interconnection.
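To make the attribute layout concrete, here is a minimal Java sketch that builds a vertex and an edge document as plain maps. It is only an illustration, not the classes from my API: the class and method names (GraphDocuments, handle, vertex, edge) are hypothetical, and _rev is left out because ArangoDB maintains it on the server.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: plain maps standing in for ArangoDB documents. */
final class GraphDocuments {

    /** Builds a document handle of the form "<collection>/<key>". */
    static String handle(String collection, String key) {
        return collection + "/" + key;
    }

    /** A vertex document: _key and _id plus a user-defined label. */
    static Map<String, Object> vertex(String collection, String key, String label) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("_key", key);
        doc.put("_id", handle(collection, key));
        doc.put("label", label);
        return doc;
    }

    /** An edge document: the same attributes plus the _from and _to handles. */
    static Map<String, Object> edge(String collection, String key,
                                    String fromHandle, String toHandle, String label) {
        Map<String, Object> doc = vertex(collection, key, label);
        doc.put("_from", fromHandle);
        doc.put("_to", toHandle);
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(vertex("vertices", "1006", "Verizon"));
        System.out.println(edge("edges", "1009", "vertices/1000", "vertices/1006", "level_2"));
    }
}
```

The edge refers to its endpoints only through their handles, which is why a graph query never needs to look at the values stored inside the vertices.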
ArangoDB query language (ArangoQL/AQL)
Similar to SQL, it supports reading and modifying collection data. It is a pure data-manipulation language, it is client-independent, it allows complex query patterns, and it is easy to understand because it uses keywords from the English language. We can execute document queries, graph queries and joins, and it also has ACID support with multi-collection transactions. The key feature I liked is that it can combine different data models in a single query, which makes it easier to explore various connections between data points.
ArangoDB Installation:
To install ArangoDB on your system:
- First, visit this page and select your operating system. I am using Linux distribution — Ubuntu 16.04 LTS.
- Download the Server (.deb) file.
- Install it manually, or open a terminal and type the following commands.
$ echo 'deb https://www.arangodb.com/repositories/arangodb32/xUbuntu_16.04/ /' | sudo tee /etc/apt/sources.list.d/arangodb.list
$ sudo apt-get install apt-transport-https
$ sudo apt-get update
$ sudo apt-get install arangodb3=3.2.1
To check whether it is installed successfully, see if the following command works:
$ arangosh
The default user is root, which has access to all the databases on the server. To create new databases or perform operations, arangosh alone is enough. To start or stop the ArangoDB server, we can simply use the following commands:
$ /etc/init.d/arangod start
$ /etc/init.d/arangod stop
ArangoDB comes with two main storage engines: MMFiles and RocksDB. In simple words, MMFiles is generally well suited for use cases that fit into main memory, while RocksDB allows larger-than-memory working sets. RocksDB is an embeddable persistent key-value store for fast storage. The API I am building here is generic and uses huge miscellaneous data, so to increase performance I use RocksDB for storage. The following command starts ArangoDB with RocksDB:
$ arangod --server.storage-engine=rocksdb /tmp/rocksdb
ArangoDB comes with a built-in web interface for administration purposes, which can be accessed via:
http://127.0.0.1:8529
ArangoDB — Graph API
As I discussed earlier, my first task here was to develop a generic API that interprets large data as a graph model and can perform general operations on it.
I went through the following steps:
- Built a new database and defined a couple of collections, one for nodes and another for links, in a Java API (Eclipse IDE). I worked on a Maven project with the dependencies arangodb-java-driver, junit, slf4j-api, slf4j-nop and velocypack-module-jdk8.
- Defined a generic class for Node, where _id, _key and _rev are the default document attributes, whose values are assigned only when the document is created. For example,
{
"_key": "1006",
"_id": "vertices/1006",
"_rev": "_Vi2UzK2---",
"label": "Verizon",
"vertex": {<some json data>}
}
- Defined a generic class for Link. It additionally contains the _to and _from document handles. For example,
{
"_key": "1009",
"_id": "edges/1009",
"_from": "vertices/1000",
"_to": "vertices/1006",
"_rev": "_Vi2UzK6---",
"label": "level_2"
}
- When the graph was initialized, I started by parsing the data given in the .json file, recursively defining each individual JSON object as a node and the hierarchy between nested objects as links. These nodes were stored in the vertex collection and the links in the edge collection.
- It could now perform the following operations on the above graph:
  - Creating a new node/link using _id/_key/label.
  - Retrieving a node/link using _id/_key/label.
  - Applying different operations to read/write in the graph using AQL.
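The recursive step that turns nested JSON into nodes and links can be sketched roughly as follows. This is a simplified illustration under several assumptions, not the actual API code: the JSON is taken to be already parsed into nested java.util.Map objects, the class name JsonToGraph, the counter-based keys and the "level_<depth>" edge labels are hypothetical, and the documents are only collected in lists rather than written to ArangoDB.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: every nested object becomes a vertex, every parent-child nesting an edge. */
final class JsonToGraph {
    final List<Map<String, Object>> vertices = new ArrayList<>();
    final List<Map<String, Object>> edges = new ArrayList<>();
    private int nextKey = 1000; // hypothetical key scheme, for illustration only

    /** Adds obj as a vertex, recurses into nested objects and returns the vertex handle. */
    String addObject(String label, Map<String, Object> obj, int depth) {
        String key = String.valueOf(nextKey++);
        String id = "vertices/" + key;
        Map<String, Object> plainValues = new LinkedHashMap<>();
        Map<String, Object> vertex = new LinkedHashMap<>();
        vertex.put("_key", key);
        vertex.put("_id", id);
        vertex.put("label", label);
        vertex.put("vertex", plainValues);
        vertices.add(vertex);

        for (Map.Entry<String, Object> entry : obj.entrySet()) {
            Object value = entry.getValue();
            if (value instanceof Map) {
                // A nested object: create a child vertex and link it to this one.
                @SuppressWarnings("unchecked")
                Map<String, Object> child = (Map<String, Object>) value;
                String childId = addObject(entry.getKey(), child, depth + 1);
                Map<String, Object> edge = new LinkedHashMap<>();
                edge.put("_from", id);
                edge.put("_to", childId);
                edge.put("label", "level_" + (depth + 1));
                edges.add(edge);
            } else {
                // Anything else (strings, numbers, arrays) stays as data on the vertex.
                plainValues.put(entry.getKey(), value);
            }
        }
        return id;
    }

    public static void main(String[] args) {
        // Made-up input: {"name": "example", "address": {"city": {"name": "Springfield"}}}
        Map<String, Object> city = new LinkedHashMap<>();
        city.put("name", "Springfield");
        Map<String, Object> address = new LinkedHashMap<>();
        address.put("city", city);
        Map<String, Object> root = new LinkedHashMap<>();
        root.put("name", "example");
        root.put("address", address);

        JsonToGraph graph = new JsonToGraph();
        graph.addObject("Data", root, 0);
        graph.vertices.forEach(System.out::println);
        graph.edges.forEach(System.out::println);
    }
}
```

In a real run, the collected maps would be written to the vertex and edge collections through the Java driver instead of being printed.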
Examples of AQL queries:
- FOR x IN <collection> RETURN x
  Returns all documents.
- FOR x IN <collection> FILTER x.label == 'Data' RETURN x
  Returns documents with "Data" as label.
- FOR u IN <collection> SORT u.label DESC RETURN u
  Sorts documents in the given collection in descending order by label.
- FOR u IN <collection> LIMIT 5 RETURN u
  Returns only 5 documents.
- FOR i IN 1..100 INSERT { value: i } IN numbers
  Inserts documents with the values 1 to 100 into the 'numbers' collection.
- FOR u IN <collection> UPDATE u WITH { label: "has" } IN <collection>
  Updates document labels to "has".
- FOR v, e, p IN 1..3 ANY '<starting_vertex_id>' GRAPH '<graph_name>' RETURN v
  Returns all vertices that can be reached from 'starting_vertex_id' within 1 to 3 edges, following edges in either direction.
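When such queries are issued from Java, the changing parts can be passed as AQL bind parameters (@name for values, @@name for collection names) instead of being concatenated into the query text. The sketch below only prepares the query text and the bind-variable map; the class AqlQueries and its methods are hypothetical helpers, and the driver call that would execute the query is deliberately left out because it depends on the driver version.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: an AQL query text together with its bind variables. */
final class AqlQueries {
    final String text;
    final Map<String, Object> bindVars;

    private AqlQueries(String text, Map<String, Object> bindVars) {
        this.text = text;
        this.bindVars = bindVars;
    }

    /** All documents of a collection that carry the given label. */
    static AqlQueries byLabel(String collection, String label) {
        Map<String, Object> vars = new LinkedHashMap<>();
        vars.put("@coll", collection); // "@@coll" in the text is bound under the key "@coll"
        vars.put("label", label);
        return new AqlQueries("FOR x IN @@coll FILTER x.label == @label RETURN x", vars);
    }

    /** The first n documents of a collection. */
    static AqlQueries firstN(String collection, int n) {
        Map<String, Object> vars = new LinkedHashMap<>();
        vars.put("@coll", collection);
        vars.put("n", n);
        return new AqlQueries("FOR u IN @@coll LIMIT @n RETURN u", vars);
    }

    public static void main(String[] args) {
        AqlQueries query = byLabel("vertices", "Data");
        System.out.println(query.text);
        System.out.println(query.bindVars);
    }
}
```

Keeping the text and the values apart also means a label containing a quote character cannot break the query.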
Some complex graph queries:
- FOR x IN <collection> FILTER x.label == 'Data' LET vin = TO_STRING(x._id)
  FOR v, e, p IN 1..1 ANY vin GRAPH '<graph_name>' RETURN v
  Returns all neighbors of the vertex with label "Data" in the graph.
- FOR node IN <collection> SORT RAND() LIMIT 1 LET rand1 = TO_STRING(node._id)
  FOR node2 IN <collection> SORT RAND() LIMIT 1 LET rand2 = TO_STRING(node2._id)
  FOR v IN ANY SHORTEST_PATH rand1 TO rand2 GRAPH '<graph_name>' RETURN v
  Returns the shortest path between two random vertices.
- FOR x IN <collection> FILTER x.label == 'Data' LET vin = TO_STRING(x._id)
  FOR v, e, p IN 2..2 ANY vin GRAPH '<graph_name>' FILTER v.label == 'city' LET vcity = v.vertex RETURN vcity
  Returns the data of the vertices with "city" as label in the graph.
Graph Visualization
The following is a simple procedure for visualizing the data using the ArangoDB web interface after running the above API:
- Open the web interface in a browser.
- Log in as "root" (no password) and choose the database that is used.
- The home page shows the list of collections in the current database; click a collection's icon to check its documents.
- Click on the Graph tab in the navigation bar; it contains the list of all graphs in the database.
- Click on a graph to visualize it. You can configure the graph settings by clicking on the rightmost icon.
Performance testing
This section covers testing the performance of AQL queries on a given graph. For this purpose, I used Spring Boot, an open-source framework for application creation, to provide a web service that allows communication through a RESTful API. In brief, a RESTful API uses HTTP methods such as GET, PUT, POST and DELETE to perform its functions. To test the performance, I used Apache JMeter to find out how efficiently the AQL queries work and how many concurrent users the server can handle.
This API is also built on Maven with additional dependencies: spring-boot-starter-web, spring-boot-starter-test, and spring-boot-maven-plugin. You can find the RESTful API tutorial here.
The API contains several request-mapping functions with different RESTful methods, running on the localhost server at port 8080. For example,
- To get all documents in the vertices/edges collection:
URL: http://127.0.0.1:8080/node/ || http://127.0.0.1:8080/link/
Method: GET
- To add a new vertex/link:
URL: http://127.0.0.1:8080/node/<new_label> || http://127.0.0.1:8080/link/<new_label>
Method: POST
Body: <some JSON object> || <Start_vertex> <End_vertex>
- To get a specific vertex/link:
URL: http://127.0.0.1:8080/node/<id/key/label> || http://127.0.0.1:8080/link/<id/key/label>
Method: GET
- To execute a query:
URL: http://127.0.0.1:8080/query
Method: POST
Body: <some query>
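As a rough illustration of how a client could talk to these endpoints, the following sketch builds two of the requests with the HTTP client that ships with Java 11 and later. The class GraphApiRequests is hypothetical; sending the query as the raw request body and the Content-Type header are assumptions on my part, so adjust both to whatever the controller actually expects.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Sketch: request builders for the endpoints listed above (nothing is sent here). */
final class GraphApiRequests {
    static final String BASE = "http://127.0.0.1:8080";

    /** GET /node/ : all documents in the vertices collection. */
    static HttpRequest getAllNodes() {
        return HttpRequest.newBuilder(URI.create(BASE + "/node/")).GET().build();
    }

    /** POST /query : executes a query; the body format is an assumption. */
    static HttpRequest runQuery(String aql) {
        return HttpRequest.newBuilder(URI.create(BASE + "/query"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(aql))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = runQuery("FOR x IN vertices RETURN x");
        System.out.println(request.method() + " " + request.uri());
        // With the API running, the request could be sent like this:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```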
Run the API. If an error occurs while binding to the Tomcat server port, you can either change the port number or kill the process already running on it by executing the following command:
$ sudo pkill -9 -f tomcat
Apache JMeter
Apache JMeter is open-source software that is popular for performance testing. This tool is designed to load test functional behavior and measure performance. It can be used to extract information from the API response and use it in the subsequent requests in the test.
To install Apache JMeter on Linux, simply type the following commands in a terminal:
$ sudo apt-get update
$ sudo apt-get install openjdk-8-jre
$ wget -c http://ftp.ps.pl/pub/apache//jmeter/binaries/apache-jmeter-3.2.tgz
$ tar -xf apache-jmeter-3.2.tgz
To run:
$ ./apache-jmeter-3.2/bin/jmeter
When the JMeter window pops up, follow these steps to set up the environment for performance testing of ArangoDB:
1. Right-click on the Test Plan and go to Add > Config Element > HTTP Header Manager, and change the Name if you want. Under the stored headers, add Name -> "Content-type", Value -> "application/json".
2. Now right-click on the Test Plan and go to Add > Threads (Users) > Thread Group, and change the Name of the group.
Set the number of threads (users), say to 10.
Set the Ramp-Up period (in seconds), say to 10.
Set the Loop count, say to 10.
3. Right-click on the Thread Group and select Add > Config Element > HTTP Request Defaults. This sets the default values that your HTTP Request samplers use. In the Web Server section:
- Set Protocol to http.
- Set Server Name or IP to localhost.
- Set Port Number to 8080.
4. Again, right-click on the Thread Group and go to Add > Sampler > HTTP Request. Under the HTTP Request block, the drop-down options in Method are the methods specified in the RESTful API, like GET/POST/PUT/DELETE. Set the Path as defined in the API as the URL mapping value. If the API takes some content as input, write it in the Body Data. Rename the HTTP Request if you want. Repeat the same step for all requests.
5. When done, right-click on the Thread Group and go to Add > Listener > View Results Tree. Specify the file name to which you want to write the performance report.
6. Finally, in the menu bar click Run > Start. It will execute all the HTTP requests and store the results in .csv format.
7. I analyzed the performance of each operation in the above thread group using the Aggregate Graph listener, varying the number of threads/clients running simultaneously.
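To get a feeling for what the thread-group settings in step 2 mean, the arithmetic can be written down as a small Java sketch. The class ThreadGroupPlan is a hypothetical helper, not part of JMeter; it only encodes two facts: the ramp-up period is spread evenly over the threads, and every thread runs each sampler once per loop.

```java
/** Sketch: the load implied by a JMeter thread group's three main settings. */
final class ThreadGroupPlan {
    final int threads;
    final int rampUpSeconds;
    final int loopCount;

    ThreadGroupPlan(int threads, int rampUpSeconds, int loopCount) {
        this.threads = threads;
        this.rampUpSeconds = rampUpSeconds;
        this.loopCount = loopCount;
    }

    /** Delay between the start of two consecutive threads. */
    double secondsBetweenThreadStarts() {
        return (double) rampUpSeconds / threads;
    }

    /** How often one HTTP Request sampler is executed in total. */
    long samplesPerRequest() {
        return (long) threads * loopCount;
    }

    /** Total samples when the thread group contains several samplers. */
    long totalSamples(int samplers) {
        return samplesPerRequest() * samplers;
    }

    public static void main(String[] args) {
        ThreadGroupPlan plan = new ThreadGroupPlan(10, 10, 10);
        System.out.println("New thread every " + plan.secondsBetweenThreadStarts() + " s");
        System.out.println("Samples per request: " + plan.samplesPerRequest());
    }
}
```

With the values from step 2 (10 threads, 10 seconds, 10 loops), one new user starts every second and each request is sampled 100 times.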
At last,
While it still has some very good competitors, like Neo4j for graphs and MongoDB for NoSQL, ArangoDB is overall powerful and flexible because of its multi-model feature, fast enough when dealing with complex datasets, and ready to be used in production environments.