1
0
mirror of https://github.com/halaxa/json-machine.git synced 2025-01-18 05:28:14 +01:00
json-machine/README.md

177 lines
7.7 KiB
Markdown
Raw Normal View History

2018-11-29 14:20:06 +01:00
![](img/logo.png)
2018-11-29 11:41:26 +01:00
# JSON Machine
2018-11-29 19:54:39 +01:00
JSON Machine is a fast, efficient and easy-to-use JSON stream parser based on coroutines
2018-11-29 11:41:26 +01:00
developed for unpredictably long JSON streams or documents. Main features are:
- Constant memory footprint for unpredictably large JSON documents.
2018-12-10 11:41:53 +01:00
- Ease of use. Just iterate JSON of any size with `foreach`. No events and callbacks.
- Efficient iteration on any subtree of the document, specified by [Json Pointer](#json-pointer)
2018-11-29 11:41:26 +01:00
- Speed. Performace critical code contains no unnecessary function calls, no regular expressions
and uses native `json_decode` to decode JSON document chunks.
2018-12-01 20:09:38 +01:00
## TL;DR;
2018-12-02 20:29:47 +01:00
JSON Machine is drop-in replacement for non efficient iteration of big JSONs:
2018-12-01 20:09:38 +01:00
2018-12-02 20:29:47 +01:00
```diff
2018-12-01 20:09:38 +01:00
<?php
2018-12-02 20:27:16 +01:00
- $users = json_decode(file_get_contents('500MB-users.json'));
+ $users = \JsonMachine\JsonMachine::fromFile('500MB-users.json');
2018-12-01 20:09:38 +01:00
foreach ($users as $id => $user) {
2018-12-02 20:30:39 +01:00
// just process $user as usual
2018-12-01 20:09:38 +01:00
}
```
2018-12-01 20:11:06 +01:00
Random access like `$users[42]` **is not possible**. However you can scan the collection in `foreach` and find the item.
2018-11-29 19:54:39 +01:00
## Parsing JSON documents
### Simple document
Let's say that `big.json` contains this really big JSON document:
```json
2018-11-10 20:09:07 +01:00
// big.json
{
"apple": {
"color": "red"
},
"pear": {
"color": "yellow"
}
}
```
It can be parsed this way:
```php
<?php
2018-11-29 11:41:26 +01:00
$jsonStream = \JsonMachine\JsonMachine::fromFile('big.json');
2018-11-10 20:25:38 +01:00
foreach ($jsonStream as $name => $data) {
// 1st iteration: $name === "apple" and $data === ["color" => "red"]
// 2nd iteration: $name === "pear" and $data === ["color" => "yellow"]
}
```
Parsing an array instead of a dictionary follows the same logic.
The key in a foreach will be a numeric index of an item.
2018-11-29 19:54:39 +01:00
### Parsing a subtree
2018-11-10 20:25:38 +01:00
If you want to iterate only `fruits-key` subtree in this `fruits.json`:
```json
2018-11-10 20:09:07 +01:00
// fruits.json
{
"fruits-key": {
"apple": {
"color": "red"
},
"pear": {
"color": "yellow"
}
}
}
```
2018-11-10 20:25:38 +01:00
do it like this:
```php
<?php
2018-11-29 11:41:26 +01:00
$jsonStream = \JsonMachine\JsonMachine::fromFile("fruits.json", "/fruits-key" /* <- Json Pointer */);
2018-11-10 20:25:38 +01:00
foreach ($jsonStream as $name => $data) {
2018-11-10 20:09:07 +01:00
// The same as above, which means:
// 1st iteration: $name === "apple" and $data === ["color" => "red"]
// 2nd iteration: $name === "pear" and $data === ["color" => "yellow"]
}
```
2018-11-10 20:09:07 +01:00
2018-11-29 11:41:26 +01:00
> Note:
>
2018-11-12 13:34:35 +01:00
> Value of `fruits-key` is not loaded into memory at once, but only one item in
> `fruits-key` at a time. It is always one item in memory at a time at the level/subtree
2018-11-29 19:54:39 +01:00
> you are currently iterating. Thus the memory consumption is constant.
2018-12-01 20:09:38 +01:00
<a name="json-pointer"></a>
#### Few words about Json Pointer
It's a way of addressing one item in JSON document. See the [Json Pointer RFC 6901](https://tools.ietf.org/html/rfc6901).
It's very handy, because sometimes the JSON structure goes deeper, and you want to iterate a subtree,
not the main level. So you just specify the pointer to the JSON array or object you want to iterate and off you go.
When the parser hits the collection you specified, iteration begins. It is always a second parameter in all
`JsonMachine::from*` functions. If you specify pointer to scalar value (which logically cannot be iterated)
or non existent position in the document, an exception is thrown.
Some examples:
| Json Pointer value | Will iterate through |
|--------------------|---------------------------------------------------------------------------------------------------|
2018-12-10 11:41:53 +01:00
| (empty string) | `["this", "array"]` or `{"a": "this", "b": "dictionary"}` will be iterated (main level - default) |
2018-12-01 20:09:38 +01:00
| `/result/items` | `{"result":{"items":["this","array","will","be","iterated"]}}` |
| `/0/items` | `[{"items":["this","array","will","be","iterated"]}]` (supports array indexes) |
2018-12-02 20:27:16 +01:00
| `/` (gotcha! - a slash followed by an empty string) | `{"":["this","array","will","be","iterated"]}` |
2018-12-01 20:09:38 +01:00
2018-11-29 19:54:39 +01:00
## Parsing API responses
2018-11-29 19:54:39 +01:00
API response or any other JSON stream is parsed exactly the same way as file is. The only difference
is, you use `JsonMachine::fromStream($streamResource)` for it, where `$streamResource` is the stream
resource with the JSON document. The rest is the same as with parsing files.
2018-11-29 19:54:39 +01:00
### GuzzleHttp
2018-11-12 13:05:38 +01:00
Guzzle uses its own streams, but they can be converted back to PHP streams by calling
2018-11-29 19:54:39 +01:00
`\GuzzleHttp\Psr7\StreamWrapper::getResource()`. Pass the result of this function to
`JsonMachine::fromStream` function and you're set up. See working
[GuzzleHttp example](src/examples/guzzleHttp.php).
## Efficiency of parsing streams/files
JSON Machine reads the stream or file 1 JSON item at a time and generates corresponding 1 PHP array at a time.
This is the most efficient way, because if you had say 10,000 users in JSON file and wanted to parse it using
`json_decode(file_get_contents('big.json'))`, you'd have the whole string in memory as well as all the 10,000
PHP structures. Following table demonstrates a concept of the difference:
| | String items in memory at a time | Decoded PHP items in memory at a time | Total |
|----------------------------|---------------------------------:|--------------------------------------:|------:|
| `json_decode` | 10000 | 10000 | 20000 |
| `JsonMachine::fromStream` | 1 | 1 | 2 |
This means, that `JsonMachine::fromStream` is constantly efficient for any size of processed JSON. 100 GB no problem.
## Efficiency of parsing in-memory JSON strings
There is also a method `JsonMachine::fromString()`. You may wonder, why is it there. Why just not use
`json_decode`? True, when parsing short strings, JSON Machine may be overhead. But if you are
forced to parse a big string and the stream is not available, JSON Machine may be better than `json_decode`.
The reason is that unlike `json_decode` it still traverses the JSON string one item at a time and doesn't
load the whole resulting PHP structure into memory at once.
Let's continue with the example with 10,000 users. This time they are all in string in memory.
When decoding that string with `json_decode`, 10,000 arrays (objects) is created in memory and then the result
is returned. JSON Machine on the other hand creates single array for found item in the string and yields it back
to you. When you process this item and iterate to the next one, another single array is created. This is the same
behaviour as with streams/files. Following table puts the concept into perspective:
| | String items in memory at a time | Decoded PHP items in memory at a time | Total |
|----------------------------|---------------------------------:|--------------------------------------:|------:|
| `json_decode` | 10000 | 10000 | 20000 |
| `JsonMachine::fromString` | 10000 | 1 | 10001 |
This is still about twice as efficient than `json_decode`.
## Error handling
2018-11-12 12:37:48 +01:00
When any part of the JSON stream is malformed, `SyntaxError` exception is thrown. Better solution is on the way.
## Running tests
```bash
composer install
vendor/bin/phpunit
```
To run tests on all supported PHP platforms install docker to your machine and run `tests/docker-run-all-platforms.sh`
2018-11-29 14:20:06 +01:00
2018-12-01 20:09:38 +01:00
## Installation
```bash
composer require halaxa/json-machine
```
or clone or download this repository (not recommended).
2018-11-29 14:20:06 +01:00
## License
2018-12-01 20:09:38 +01:00
Apache 2.0
2018-11-29 19:54:39 +01:00
Cogwheel element: Icons made by [TutsPlus](https://www.flaticon.com/authors/tutsplus)
2018-11-29 14:20:06 +01:00
from [www.flaticon.com](https://www.flaticon.com/)
is licensed by [CC 3.0 BY](http://creativecommons.org/licenses/by/3.0/)