v4.0.0 - Server

Features:

  • New binary protocol support (under the hood)
  • Bulk actions support (under the hood)
  • V2 storage API:
type StorageReadCallback = (error: string | null, version: number, result: any) => void
type StorageWriteCallback = (error: string | null) => void

interface StoragePlugin extends DeepstreamPlugin {
  apiVersion?: number
  set (recordName: string, version: number, data: any, callback: StorageWriteCallback, metaData?: any): void
  get (recordName: string, callback: StorageReadCallback, metaData?: any): void
  delete (recordName: string, callback: StorageWriteCallback, metaData?: any): void
}
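
To make the shape of the new storage API concrete, here is a minimal in-memory adaptor sketched against the interface above. It assumes DeepstreamPlugin only requires a description (the real base class may expose further lifecycle hooks), and the "record does not exist" convention shown is illustrative rather than taken from an actual connector:

// Minimal in-memory sketch of a V2 storage adaptor.
// Assumption: DeepstreamPlugin only demands a description here.
class MemoryStorage extends DeepstreamPlugin implements StoragePlugin {
  public apiVersion = 2
  public description = 'In-memory storage (sketch)'
  private store = new Map<string, { version: number, data: any }>()

  public set (recordName: string, version: number, data: any, callback: StorageWriteCallback): void {
    this.store.set(recordName, { version, data })
    callback(null)
  }

  public get (recordName: string, callback: StorageReadCallback): void {
    const entry = this.store.get(recordName)
    if (!entry) {
      // -1 signals "record does not exist" in this sketch; the exact convention is an assumption
      callback(null, -1, null)
      return
    }
    callback(null, entry.version, entry.data)
  }

  public delete (recordName: string, callback: StorageWriteCallback): void {
    this.store.delete(recordName)
    callback(null)
  }
}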

Improvements

  • Lazy data parsing
  • Upgraded development tools
  • New deepstream.io website

Backwards compatibility

  • All V3 SDKs are no longer compatible due to the new binary protocol

TLDR:

Unsupported SDKs

I wanted to leave this part till the end, but it’s the biggest loss with upgrading to V4 and will be an instant blocker for some users.

We are sad to say that we haven't yet migrated the V3 SDKs other than browser and Node to V4. The reason is that the underlying protocol has changed, and the V3 SDKs constructed and parsed messages all over the code base. This design unfortunately meant that while we could write a binary-to-text parser in the Java SDK, it would just make it maintenance hell.

Our Swift SDK has been ambitious from the start, using J2ObjC to convert the Java code to Objective-C with thick polyfills for Java methods. While this approach has worked, it's really hard to maintain and build.

Our goal going forward is to write a single Kotlin SDK that can run on both iOS and Java. We would also have it expose a much more minimal set of functionality, allowing the SDK to only consume strings rather than objects. This would let us integrate easily with the many different flavours of JSON libraries out there.

This website

There has been a lot of feedback on the differences between our deepstreamHub and deepstream documentation and offerings, where some users were not certain where the line was drawn between open source and enterprise. We also have over a hundred pages of documentation in a world where some of yesterday's hot trends (for example Knockout and AngularJS) have been replaced by others (React, Vue), and even within a single library approaches have been deprecated, replaced or advised against (React mixins, stateful components and hooks). While we love keeping up to date with all the latest chatter in devops and developer land, it's pretty much impossible to do so while also focusing on integrating important features into deepstream's core.

As such I'm happy to say we have migrated all of our OS documentation and website back to open source using the amazing Gatsby framework. Every page can now be edited by the community, and adding pages is as easy as writing a markdown document, adding some images and letting the build take care of the rest. If you would like to do anything fancy, you're more than welcome to add a React component!

Binary Protocol

The driver behind pretty much all of the V4 refactor was our move from the old text-based protocol to a binary one. Before you ask: while we might add actual binary data support to deepstream later on, we currently still use it to carry JSON payloads. But it makes building SDKs and new features so much easier. Seriously. LIKE SO MUCH EASIER.

Okay so first things first, the structure of text vs binary messages:

V3 - Text:

TOPIC | ACTION | meta1 | meta2 | ...metaN | payload +

This string had the initial TOPIC and ACTION read by the parser to find out where to route it, and the rest of the data was figured out within the code module that dealt with it. This gave some benefits, like only fully parsing a message once it was actually required, but it also meant that the message-parsing code was distributed, and adding, for example, a new meta field would require lots of refactoring. Tests also had to create text-based messages even when testing internal code paths. Payload serialization also didn't use plain JSON, but instead a custom form of serialization to minimize bandwidth: U for undefined, T for true, F for false, O for object, an S prefix for strings and an N prefix for numbers.
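
To make that typed encoding concrete, a rough sketch of how it could be implemented looks like this (the real V3 SDKs handled more cases, null for example; the sketch below is illustrative only):

// Sketch of the V3-style typed payload encoding described above:
// U = undefined, T = true, F = false, O = object (JSON), S = string, N = number.
// (null and other special cases are omitted in this sketch.)
function serializeTyped (value: any): string {
  if (value === undefined) return 'U'
  if (value === true) return 'T'
  if (value === false) return 'F'
  if (typeof value === 'string') return 'S' + value
  if (typeof value === 'number') return 'N' + value
  return 'O' + JSON.stringify(value)
}

function deserializeTyped (typed: string): any {
  switch (typed.charAt(0)) {
    case 'U': return undefined
    case 'T': return true
    case 'F': return false
    case 'S': return typed.substring(1)
    case 'N': return parseFloat(typed.substring(1))
    case 'O': return JSON.parse(typed.substring(1))
    default: throw new Error('unknown payload type: ' + typed.charAt(0))
  }
}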

So the message objects in the V3 SDKs and server looked like this:

{
    "topic": "R",
    "action": "S",
    "data": ["A", "recordName"]
}

V4 - Binary:

 /*
 *  0                   1                   2                   3
 *  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 * +-+-------------+-+-------------+-------------------------------+
 * |F|  Message    |A|  Message    |             Meta              |
 * |I|   Topic     |C|  Action     |            Length             |
 * |N|    (7)      |K|   (7)       |             (24)              |
 * +-+-------------+-+-------------+-------------------------------+
 * | Meta Cont.    |              Payload Length (24)              |
 * +---------------+-----------------------------------------------+
 * :                     Meta Data (Meta Length * 8)               :
 * + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - +
 * |                  Payload Data (Payload Length * 8)            :
 * +---------------------------------------------------------------+
 *
 * The first 8 bytes of the message are the header, and the rest of
 * the message is the payload.
 *
 * CONT (1 bit): The continuation bit. If this is set, the payload of
 * the following message must be appended to this one.
 * If this is not set, parsing may finish after the payload is read.
 *
 * RSV{0..3} (1 bit): Reserved for extension.
 *
 * Meta Length (24 bits, unsigned big-endian): The total length of
 *                Meta Data in bytes.
 *                Meta Data can be no longer than 16 MB.
 *
 * Payload Length (24 bits, unsigned big-endian): The total length of 
 *                Payload in bytes.
 *                If Payload is longer than 16 MB, it must be split into 
 *                chunks of less than 2^24 bytes with identical topic and
 *                action, setting the CONT bit in all but the final chunk.
 */

The binary protocol is UTF-8 based, with some bit shifting for things like ACKs to make parsing easier. The only time deepstream actually creates or sees this binary representation is in the parser itself, meaning that as far as the rest of the code is concerned the actual protocol can change at any time.
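
As an illustration of the layout in the diagram above, decoding a single complete header could look roughly like this. This is a sketch only: the real parser also deals with the continuation bit, chunked payloads and partial frames, and the field names here are my own.

// Rough sketch of reading the 8-byte header laid out in the diagram above.
// Field and interface names are illustrative, not taken from the actual parser.
interface BinaryHeader {
  fin: boolean           // first bit of byte 0
  topic: number          // lower 7 bits of byte 0
  isAck: boolean         // first bit of byte 1
  action: number         // lower 7 bits of byte 1
  metaLength: number     // 24-bit unsigned big-endian, bytes 2-4
  payloadLength: number  // 24-bit unsigned big-endian, bytes 5-7
}

function readHeader (buffer: Buffer): BinaryHeader {
  return {
    fin: (buffer[0] & 0x80) !== 0,
    topic: buffer[0] & 0x7f,
    isAck: (buffer[1] & 0x80) !== 0,
    action: buffer[1] & 0x7f,
    metaLength: buffer.readUIntBE(2, 3),
    payloadLength: buffer.readUIntBE(5, 3)
  }
}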

The objects used within V4 SDKs and server look like this:

{
    "topic": 3,
    "action": 2,
    "isAck": true,
    "name": "recordName"
}

This makes writing code a lot easier. At the time of writing, our full message API that can be consumed by any SDK is as follows:

export interface Message {
    topic: TOPIC
    action: ALL_ACTIONS
    name?: string

    isError?: boolean
    isAck?: boolean

    isBulk?: boolean
    bulkId?: number
    bulkAction?: ALL_ACTIONS

    data?: string | Buffer
    parsedData?: RecordData | RPCResult | EventData | AuthData
    payloadEncoding?: PAYLOAD_ENCODING

    parseError?: false

    raw?: string | Buffer

    originalTopic?: TOPIC
    originalAction?: ALL_ACTIONS
    subscription?: string
    names?: Array<string>
    isWriteAck?: boolean
    correlationId?: string
    path?: string
    version?: number
    reason?: string
    url?: string
    protocolVersion?: string
}
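
For example, a record subscription and its ack can be expressed as plain objects against this interface. The enum members used below (TOPIC.RECORD, RECORD_ACTIONS.SUBSCRIBE) are assumed names for the shared protocol constants:

// Illustrative only: the enum member names are assumptions about the
// shared protocol constants used by server and SDKs.
const subscribe: Message = {
  topic: TOPIC.RECORD,
  action: RECORD_ACTIONS.SUBSCRIBE,
  name: 'recordName'
}

const subscribeAck: Message = {
  ...subscribe,
  isAck: true
}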

Using this approach has made adding new features and maintaining current ones significantly easier. And given the combination of TOPICs and ACTIONs, we can pretty much ensure we'll be able to extend it without running out of space any time soon.

Cons

It wouldn't be fair to say that this overhaul has no downsides. There have been some sacrifices that we had to make along the way.

1) If you count messages in the billions, those extra control bytes add up. Data bandwidth is quite expensive on cloud systems, so the lack of compression isn't just a latency issue anymore.

2) Our meta data is a JSON object. It's predefined, meaning we can use a much more optimal parser than the built-in one, and we minimize space by using abbreviations for the metadata names. However, it still takes longer to parse and more bandwidth to transfer. There are optimizations planned to push all of this further down into C++ land to reduce the weight of it occurring on the main node thread, but it's a small step back in optimal performance.

Why yet another proprietary protocol?

Because deepstream offers some very specific features, and has a lot more on the way. For example, we currently have unique concepts such as listening. We are also looking to release a monitoring topic in the 4.1 release, better OS clustering integration in 4.2 and an admin API in 4.3. Tying ourselves into another stack means we unfortunately can't move as quickly as we want with these features.

Typescript

We converted the majority of the codebase to TypeScript, for the benefit of future code maintenance as well as making it easier for people to contribute.

This also means we now have declarations for all the possible plugin interfaces, which should make it much easier for people to write their own once they fork the V4 connector template.

Current custom external connectors are:

  • Authentication
  • Permissioning
  • Storage and Cache
  • Logger
  • Connection Endpoints
  • Generic Plugins

Performance Improvements

Things have changed quite a bit in the Node.js world. Node 10 came out with a new garbage collector, async/await has changed the coding landscape and V8 has been optimised for all the ES6 improvements. However, there's unfortunately a bit of a dark side to all of this: in order to improve performance for the ES6 features most developers now use, the actual performance of ES5 has taken a hit. While there were talks about potentially switching to a totally different language, a total rewrite would have been absolutely impossible. So instead we targeted what I like to call optimistic optimizations, which means that in the worst-case scenario they won't make any difference at all, but if you're lucky you could get boosts of multiple factors.

So what falls under these optimizations?

In this current release there are three parts:

Lazy data parsing

So the downside of using JSON as a data payload is that it's not exactly fast. Without knowing your schema upfront, and given that each record, event or request/response can literally contain anything, there's little we can currently do to improve that. So what we do instead is skip the parsing altogether on the server unless it's actually needed. What this means is that, as far as deepstream is concerned, as long as you don't need to access the data you'll never actually parse it. There are three places where the data payload is actually required:

  1. Permissions, only if you access the data value.

  2. Record patches. A record patch (setting a value with a path) has to apply the patch onto the current value, which requires both the previous and the new value to be parsed (a bandwidth vs CPU usage tradeoff).

  3. Storage adaptors. This is unfortunately unavoidable at the moment, as some storage adaptors don't accept buffers or strings directly. This means that even though we pass the data all the way to the storage SDK optimally, we have to parse it just for the SDK to serialize it again =(. On that topic, node hasn't made it too easy either, with most libraries using the Buffer wrapper while ignoring the more optimal (and not so nice to use) ArrayBuffer. We are looking at extending our storage APIs going forward to allow deepstream to pick between a buffer and an object argument, enabling optimal paths where possible.
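
A simplified sketch of the lazy parsing described above: the raw payload travels on the message and is only run through JSON.parse (and then cached) the first time something actually asks for it. The helper name is mine, not the server's:

// Sketch only: parse the payload lazily and cache the result on the message.
function getParsedData (message: Message): RecordData | RPCResult | EventData | AuthData | undefined {
  if (message.parsedData !== undefined) {
    return message.parsedData
  }
  if (message.data === undefined) {
    return undefined
  }
  try {
    message.parsedData = JSON.parse(message.data.toString())
  } catch (e) {
    // the real server flags a parse error on the message instead
    return undefined
  }
  return message.parsedData
}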

Separation of data storage concerns

This one has been a bit of an interesting decision from day one. In V1 we initially had data stored in records with the following nesting:

{
    _v: 1,
    _d: { "status": "DONE" }
}

That just made searching an absolute pain, so what we did was transform the data to instead store it as follows:

{
    __ds: {
        _v: 1
    },
    "status": "DONE"
}

The reason it's an object is in case we ever decide to add more meta data going forward. The issue with this, however, is that we needed to load the entire record into memory and transform it whenever we wanted to do anything. When you start thinking in bulk (hundreds or thousands of subscriptions), the objects, CPU cycles and immediate GC this uses are just, well, useless.
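
Conceptually, that per-operation transform looked something like the sketch below (function names are illustrative): the server had to merge its version metadata into the user's data on every write and strip it back out on every read.

// Sketch of the old envelope transform, based on the format shown above.
function wrapForStorage (version: number, data: any) {
  return { __ds: { _v: version }, ...data }
}

function unwrapFromStorage (stored: any): { version: number, data: any } {
  const { __ds, ...data } = stored
  return { version: __ds._v, data }
}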

So how did we decide to optimize this? By no longer doing any of the transform logic in the core server. This means that rather than deepstream calling into storage using this:

public set (
    name: string,
    data: { __ds: { _v: number }, [index: string]: any },
    callback: (error: string | null) => void
): void

We do this:

public set (
    name: string,
    version: number,
    data: RecordData,
    callback: (error: string | null) => void
): void

It looks like a tiny change, and for all our current adaptors it's fully backwards compatible. However, the goal is for us to start using things like custom Redis commands to store these entries separately in the cache:

Name                 Example value          Description
recordName_version   5                      The record version
recordName_data      { "name": "Purist" }   The data, untouched by deepstream

This allows us to then do awesome things going forward like:

  • Validating that the version number doesn't conflict in the cache rather than in the server, which is critical when clustering
  • Only requesting the version number of records instead of the entire data-set when using offline storage or doing a head/has
  • Potentially storing deepstream data in a meta collection for clear separation
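
As a purely hypothetical example of the split shown in the table above, a cache adaptor could write the two entries under separate keys, e.g. with ioredis (the key naming and the use of a plain MULTI rather than a custom Redis command are assumptions for illustration):

import Redis from 'ioredis'

// Hypothetical sketch: store version and data under separate keys,
// mirroring the table above. The actual connector may use different
// key names, hashes or custom Redis scripts.
const redis = new Redis()

function setRecord (recordName: string, version: number, data: any, callback: (error: string | null) => void): void {
  redis.multi()
    .set(`${recordName}_version`, version)
    .set(`${recordName}_data`, JSON.stringify(data))
    .exec()
    .then(() => callback(null))
    .catch((error: Error) => callback(error.message))
}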

Bulk APIs

This was probably one of the biggest under-the-hood improvements, and although it can still be seriously optimized going forward, it has already shown a huge performance boost.

So what's the difference?

In V3, if you subscribed to a few thousand records the only optimization that would occur is that they would all be sent within a single WebSocket frame. So something like this (excuse the repetitiveness):

Receives:

Topic    Action      Name
RECORD   SUBSCRIBE   record1
RECORD   SUBSCRIBE   record2
RECORD   SUBSCRIBE   record3

And would have received the following responses:

Sends:

Topic    Action          Name
RECORD   SUBSCRIBE_ACK   record1
RECORD   SUBSCRIBE_ACK   record2
RECORD   SUBSCRIBE_ACK   record3

Whereas now what happens instead is:

Receives:

Topic    Action           Name
RECORD   SUBSCRIBE_BULK   [record1, record2, record3]

Sends:

Topic    Action               CorrelationId
RECORD   SUBSCRIBE_BULK_ACK   12345

This gives deepstream a massive boost in performance, as it doesn't have to care about individual records. However, in terms of permissions it still calls into the permission handler on a per-name basis to ensure the same level of granularity.
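
Conceptually, handling a bulk subscription therefore still fans out per record name before a single bulk ack goes back. The sketch below uses simplified callbacks, not the actual V4 permission handler API:

// Sketch only: a bulk subscribe carries many names, but permissions are
// still checked one name at a time. The callback signatures are simplified.
async function handleBulkSubscribe (
  message: Message,
  canSubscribe: (name: string) => Promise<boolean>,
  subscribe: (name: string) => void
): Promise<void> {
  for (const name of message.names || []) {
    if (await canSubscribe(name)) {
      subscribe(name)
    }
  }
  // a single ack referencing the correlation id is sent back afterwards
}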

Changing development tools

In order to be consistent with all our other repos, we have focused on minimizing the amount of variation between toolsets. As such we now have a consistent toolset of mocha, sinon and TypeScript for our V4 development environments. All adaptors also now use Docker to run their tests, as it really simplifies testing and development across all the separate variations.