Jeff Li


Avro Cookbook : Part III

Recipe 6: Serialize data as JSON

In Avro Cookbook : Part I, if you open the file /tmp/log created by Recipe 3, you will find that it is definitely not a human-readable text format. Avro provides an encoder/decoder mechanism that can serialize data to a text format, namely JSON. Actually, if I wanted to serialize POJOs to JSON, I would rather use Google Gson. Anyway, this is a post about Avro, right?

@Test
public void testSerializeToJson() throws Exception {
   ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
   Schema schema = ReflectData.get().getSchema(LogEntry3.class);

   Encoder encoder = new EncoderFactory().jsonEncoder(schema, outputStream);
   DatumWriter<LogEntry3> writer = new ReflectDatumWriter<>(schema);

   LogEntry3 entry1 = new LogEntry3("Jeff", "readme.md", "192.168.4.1");
   LogEntry3 entry2 = new LogEntry3("John", "readme.txt", "192.168.4.2");

   writer.write(entry1, encoder);
   writer.write(entry2, encoder);
   encoder.flush();

   System.out.println(new String(outputStream.toByteArray()));
}

Refer to Avro Cookbook : Part II to see what the class LogEntry3 looks like. Here is the output:

{"name":"Jeff","resource":"readme.md","ip":"192.168.4.1"}
{"name":"John","resource":"readme.txt","ip":"192.168.4.2"}
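For readers who do not have Part II at hand, LogEntry3 can be assumed to be a plain POJO along these lines (a sketch, not the exact class from Part II):

```java
// A sketch of what LogEntry3 may look like: a plain Java class with three
// String fields, usable by Avro's reflection-based schema generation.
public class LogEntry3 {
   private String name;
   private String resource;
   private String ip;

   // Avro reflection needs a no-arg constructor to instantiate the class.
   public LogEntry3() {
   }

   public LogEntry3(String name, String resource, String ip) {
      this.name = name;
      this.resource = resource;
      this.ip = ip;
   }

   public String getName() { return name; }
   public String getResource() { return resource; }
   public String getIp() { return ip; }
}
```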

Recipe 7: Deserialize JSON data

@Test
public void testDeserializeFromJson() throws Exception {
   String input = "{\"name\":\"Jeff\",\"resource\":\"readme.md\",\"ip\":\"192.168.4.1\"}" +
           "{\"name\":\"John\",\"resource\":\"readme.txt\",\"ip\":\"192.168.4.2\"}";

   ByteArrayInputStream inputStream = new ByteArrayInputStream(input.getBytes());
   Schema schema = ReflectData.get().getSchema(LogEntry3.class);

   JsonDecoder decoder = new DecoderFactory().jsonDecoder(schema, inputStream);
   DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
   GenericRecord entry;
   while (true) {
      try {
         entry = reader.read(null, decoder);
         System.out.println(entry);
      } catch (EOFException exception) {
         break;
      }
   }
}

If you look at the code carefully, you will find several interesting things.

First, there is no explicit separator between the JSON records. That means the Avro JSON decoder can decode JSON data as a stream, which is very helpful when you have to deserialize a large batch of JSON records that have no explicit separator between them.

Second, when parsing the JSON data, the generic reader DatumReader&lt;GenericRecord&gt; is used instead of the specific reader DatumReader&lt;LogEntry3&gt;. I tried to use the specific reader, but it failed with this error:

org.apache.avro.AvroTypeException: Expected start-union. Got VALUE_STRING
	at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:697)
	at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:441)
	at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
	at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
	at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
	at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
	at org.apache.avro.reflect.ReflectDatumReader.readField(ReflectDatumReader.java:230)
	at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)

It turned out that the Avro JSON decoder can only parse JSON data like this:

{"name":"Jeff","resource":{"string":"readme.md"},"ip":{"string":"192.168.4.1"}}
{"name":"John","resource":{"string":"readme.txt"},"ip":{"string":"192.168.4.2"}}

but will not work on data like this:

{"name": "Jeff", "resource": "readme.md", "ip": "192.168.4.1"}
{"name": "John", "resource": "readme.txt", "ip": "192.168.4.2"}
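The wrapping comes down to the schema. When a field is declared as a union with null, as the schema-generated classes from Recipe 2 do for their String fields, Avro's JSON encoding wraps the value in an object naming the union branch. A field definition roughly like this (a sketch, not the exact schema from Part II) produces the wrapped form:

```json
{"name": "resource", "type": ["null", "string"]}
```

With a plain "string" type instead of the union, the encoder produces, and the decoder expects, the unwrapped form.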

After googling the issue, I found this: Issue writing union in avro. So my suggestion is: do not use Avro for JSON serialization and deserialization if you have other choices.

Recipe 8: Serialize an array in JSON

@Test
public void testSerializeArray() throws IOException {
   Schema schema = Schema.createArray(ReflectData.get().getSchema(LogEntry3.class));
   GenericData.Array<LogEntry3> logs = new GenericData.Array<>(2, schema);
   logs.add(new LogEntry3("Jeff", "readme.md", "192.168.5.1"));
   logs.add(new LogEntry3("John", "readme.txt", "192.168.5.2"));

   ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
   Encoder encoder = new EncoderFactory().jsonEncoder(schema, outputStream);
   DatumWriter<GenericData.Array<LogEntry3>> writer = new ReflectDatumWriter<>(schema);
   writer.write(logs, encoder);
   encoder.flush();

   System.out.println(new String(outputStream.toByteArray()));
}

In the code above, no code generation is required; LogEntry3 is just a normal Java class. The serialized data would be:

[{"name":"Jeff","resource":"readme.md","ip":"192.168.5.1"},{"name":"John","resource":"readme.txt","ip":"192.168.5.2"}]

However, if the element type of the array is generated from a schema definition (see Recipe 2), the output would be different:

[{"name":"Jeff","resource":{"string":"readme.md"},"ip":{"string":"192.168.5.1"}},{"name":"John","resource":{"string":"readme.txt"},"ip":{"string":"192.168.5.2"}}]

Still, I would not choose Avro to serialize data to JSON if I could use Gson. However, this recipe is intended to demonstrate how to serialize an array without generating code from a predefined schema.

Recipe 9: Deserialize JSON array data

@Test
public void testDeserializeArray() throws Exception {
   Schema schema = Schema.createArray(ReflectData.get().getSchema(LogEntry3.class));
   String input = "[{\"name\":\"Jeff\",\"resource\":\"readme.md\",\"ip\":\"192.168.5.1\"},{\"name\":\"John\",\"resource\":\"readme.txt\",\"ip\":\"192.168.5.2\"}]";
   Decoder decoder = new DecoderFactory().jsonDecoder(schema, input);
   DatumReader<GenericData.Array<GenericData.Record>> reader = new GenericDatumReader<>(schema);
   GenericData.Array<GenericData.Record> logs = reader.read(null, decoder);
   GenericData.Record entry = logs.get(0);
   System.out.println(entry);
}

Note that only the generic reader works here; I have not figured out how to use a specific reader yet. A specific reader might work fine with the binary encoder, though.

Recipe 10: Deserialize data stream

The word stream means that the size of the source data is unknown. For example, suppose you want to deserialize data like this:

[
   {
      "name":"Jeff",
      "resource":"readme.md",
      "ip":"192.168.5.1"
   },
   {
      "name":"John",
      "resource":"readme.txt",
      "ip":"192.168.5.2"
   }
][
   {
      "name":"Joe",
      "resource":"readme.md",
      "ip":"192.168.5.3"
   },
   {
      "name":"James",
      "resource":"readme.txt",
      "ip":"192.168.5.4"
   }
]

The data has several unusual characteristics that prevent Gson from being applied directly:

* The data as a whole is not valid JSON; it is a set of JSON records. Besides, the data could be very large.
* There is no explicit separator between records. If the records were separated by a character such as \n, each record could be read and parsed easily.
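Without Avro, one workaround would be to split the concatenated arrays by tracking bracket depth and then hand each chunk to an ordinary JSON parser. This is a naive stdlib-only sketch (it ignores brackets inside string literals, which is fine for data like the sample above):

```java
import java.util.ArrayList;
import java.util.List;

public class JsonArraySplitter {
   // Splits concatenated top-level JSON arrays into individual array strings
   // by tracking [ ] nesting depth. Naive sketch: brackets inside string
   // literals are not handled.
   public static List<String> split(String input) {
      List<String> arrays = new ArrayList<>();
      int depth = 0;
      int start = 0;
      for (int i = 0; i < input.length(); i++) {
         char c = input.charAt(i);
         if (c == '[') {
            if (depth == 0) start = i;   // remember where this array begins
            depth++;
         } else if (c == ']') {
            depth--;
            if (depth == 0) arrays.add(input.substring(start, i + 1));
         }
      }
      return arrays;
   }

   public static void main(String[] args) {
      String data = "[{\"name\":\"Jeff\"},{\"name\":\"John\"}][{\"name\":\"Joe\"},{\"name\":\"James\"}]";
      // Two top-level arrays with no separator between them.
      System.out.println(JsonArraySplitter.split(data).size()); // prints 2
   }
}
```

Each returned chunk is a valid JSON array that Gson could parse on its own. Avro's JSON decoder makes this manual step unnecessary, as the next example shows.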

This data format is poor and should be avoided in practice. Sometimes, however, we are just the data consumers. We can parse the data with Avro like this:

@Test
public void testDeserializeStream() throws Exception {
   Schema schema = Schema.createArray(ReflectData.get().getSchema(LogEntry3.class));
   String input = "[{\"name\":\"Jeff\",\"resource\":\"readme.md\",\"ip\":\"192.168.5.1\"},{\"name\":\"John\",\"resource\":\"readme.txt\",\"ip\":\"192.168.5.2\"}]" +
           "[{\"name\":\"Joe\",\"resource\":\"readme.md\",\"ip\":\"192.168.5.3\"},{\"name\":\"James\",\"resource\":\"readme.txt\",\"ip\":\"192.168.5.4\"}]";
   Decoder decoder = new DecoderFactory().jsonDecoder(schema, input);
   DatumReader<GenericData.Array<GenericRecord>> reader = new GenericDatumReader<>(schema);
   GenericData.Array<GenericRecord> parsedRecords;

   int count = 0;
   while (true) {
      try {
         // We can iterate the parsedRecords to get every individual record
         parsedRecords = reader.read(null, decoder);
         count += parsedRecords.size();
      } catch (EOFException e) {
         break;
      }
   }

   Assert.assertEquals(4, count);
}

Summary

This post is the last part of the three-part series on Avro. To be honest, I rarely use Avro myself and am not an Avro expert, so the code examples may not be the most appropriate; they are just intended to help beginners start playing with Avro quickly. Anyone who uses Avro frequently should spend more time looking deeper into the Avro framework. It would be great if these recipes are helpful to you!
