Jeff Li

Be another Jeff

Avro Cookbook: Part II

Recipe 5: Serialize Data without Code Generation

In the previous recipes, schema files had to be defined before serializing or deserializing data, so that Avro's code generation facility could generate the corresponding Java classes. This is the recommended approach when using Avro in Java, but it is not required: you can also parse the schema on the fly, without any code generation.

Parse Schema from String

The schema looks like this:

{
 "namespace": "me.jeffli.avrosamples.model",
 "type": "record",
 "name": "LogEntry2",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "resource",  "type": ["string", "null"]},
     {"name": "ip", "type": ["string", "null"]}
 ]
}

Define the schema as a Java String:

String schemaDesc = "{\n" +
           " \"namespace\": \"me.jeffli.avrosamples.model\",\n" +
           " \"type\": \"record\",\n" +
           " \"name\": \"LogEntry2\",\n" +
           " \"fields\": [\n" +
           "     {\"name\": \"name\", \"type\": \"string\"},\n" +
           "     {\"name\": \"resource\",  \"type\": [\"string\", \"null\"]},\n" +
           "     {\"name\": \"ip\", \"type\": [\"string\", \"null\"]}\n" +
           " ]\n" +
           "}";

Then the code to serialize the data would be this:

@Test
public void testSerializeOnTheFly() throws IOException {
   Schema schema = new Schema.Parser().parse(schemaDesc);
   GenericRecord entry1 = new GenericData.Record(schema);
   entry1.put("name", "Jeffrey");
   entry1.put("resource", "README");
   entry1.put("ip", "192.168.2.1");

   GenericRecord entry2 = new GenericData.Record(schema);
   entry2.put("name", "Johnson");
   entry2.put("resource", "readme.markdown");
   entry2.put("ip", "192.168.2.2");

   DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
   DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(datumWriter);
   File file = new File("/tmp/log2");
   dataFileWriter.create(schema, file);

   dataFileWriter.append(entry1);
   dataFileWriter.append(entry2);
   dataFileWriter.close();
}

From the example, we can see that no external schema file needs to be defined and no extra Java classes are generated.

Parse Schema from Disk File

In the above example, we parse the schema from a String. Since the schema is defined in JSON, it is cumbersome to embed it in a Java String. Fortunately, Avro's Schema.Parser also provides APIs to parse the schema from a disk file, and the schema can even be derived from an existing Java class:

Schema schema = new Schema.Parser().parse(new File("src/test/resources/LogEntry2.avsc"));
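Schema.Parser can also read the schema from an InputStream, which is handy when the .avsc file is packaged on the classpath rather than sitting at a fixed disk path. A minimal sketch; the schema JSON is inlined through a ByteArrayInputStream only to keep the example self-contained:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.apache.avro.Schema;

public class ParseSchemaFromStream {
    public static void main(String[] args) throws IOException {
        String json = "{\"namespace\": \"me.jeffli.avrosamples.model\","
            + " \"type\": \"record\", \"name\": \"LogEntry2\","
            + " \"fields\": [{\"name\": \"name\", \"type\": \"string\"}]}";
        // parse(InputStream) works the same way as parse(File) and
        // parse(String); in real code the stream would typically come from
        // getClass().getResourceAsStream("/LogEntry2.avsc").
        try (InputStream in =
                 new ByteArrayInputStream(json.getBytes(StandardCharsets.UTF_8))) {
            Schema schema = new Schema.Parser().parse(in);
            System.out.println(schema.getFullName());
            // prints me.jeffli.avrosamples.model.LogEntry2
        }
    }
}
```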

Parse Schema from Existing Java Class

Per the JSON schema definition above, an equivalent Java class would look like this:

public class LogEntry3 {
   private String name;
   private String resource;
   private String ip;

   public LogEntry3(String name, String resource, String ip) {
      this.name = name;
      this.resource = resource;
      this.ip = ip;
   }

   public String getName() {
      return name;
   }

   public void setName(String name) {
      this.name = name;
   }

   public String getResource() {
      return resource;
   }

   public void setResource(String resource) {
      this.resource = resource;
   }

   public String getIp() {
      return ip;
   }

   public void setIp(String ip) {
      this.ip = ip;
   }
}

Then the Schema can be fetched easily:

Schema schema = ReflectData.get().getSchema(LogEntry3.class);

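To see exactly what ReflectData inferred, pretty-print the schema with toString(true). One caveat worth checking this way: fields reflected from a plain POJO come out as non-nullable, unlike the ["string", "null"] unions in the hand-written schema; Avro's @Nullable annotation restores the union. A sketch using a trimmed-down stand-in class rather than LogEntry3 itself:

```java
import org.apache.avro.Schema;
import org.apache.avro.reflect.Nullable;
import org.apache.avro.reflect.ReflectData;

public class ReflectSchemaDemo {
    // Cut-down stand-in for LogEntry3; @Nullable turns a field into a
    // union with "null", matching the hand-written schema.
    static class LogEntry {
        String name;
        @Nullable String resource;
        @Nullable String ip;
    }

    public static void main(String[] args) {
        Schema schema = ReflectData.get().getSchema(LogEntry.class);
        // toString(true) pretty-prints the JSON, so it can be compared
        // field by field with the .avsc written by hand.
        System.out.println(schema.toString(true));
    }
}
```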
What is more, you can use ReflectDatumWriter to append objects of that specific type directly to the target file:

@Test
public void testSerializeData() throws IOException {
   Schema schema = ReflectData.get().getSchema(LogEntry3.class);
   File file = new File("/tmp/log3");
   LogEntry3 entry1 = new LogEntry3("Jeff", "readme.txt", "192.168.3.1");
   LogEntry3 entry2 = new LogEntry3("John", "readme.md", "192.168.3.2");

   ReflectDatumWriter<LogEntry3> reflectDatumWriter = new ReflectDatumWriter<>(schema);
   DataFileWriter<LogEntry3> writer = new DataFileWriter<>(reflectDatumWriter).create(schema, file);
   writer.append(entry1);
   writer.append(entry2);
   writer.close();
}
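The reflect-written data can also be read back as typed objects with ReflectDatumReader. One detail to watch: ReflectDatumReader needs a no-arg constructor to instantiate objects while reading, which the LogEntry3 above does not declare. A minimal round-trip sketch with a trimmed-down stand-in class:

```java
import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.reflect.ReflectData;
import org.apache.avro.reflect.ReflectDatumReader;
import org.apache.avro.reflect.ReflectDatumWriter;

public class ReflectRoundTrip {
    // Cut-down stand-in for LogEntry3. Note the no-arg constructor:
    // ReflectDatumReader needs one to instantiate objects while reading.
    static class Entry {
        String name;
        Entry() { }
        Entry(String name) { this.name = name; }
    }

    // Writes one Entry to a temp file, then reads it back as an Entry.
    static String roundTrip() throws IOException {
        Schema schema = ReflectData.get().getSchema(Entry.class);
        File file = File.createTempFile("log3", ".avro");
        file.deleteOnExit();

        try (DataFileWriter<Entry> writer =
                 new DataFileWriter<>(new ReflectDatumWriter<Entry>(schema))) {
            writer.create(schema, file);
            writer.append(new Entry("Jeff"));
        }

        // Because the data was written from a real class, it can be read
        // back directly as that class with ReflectDatumReader.
        try (DataFileReader<Entry> reader =
                 new DataFileReader<>(file, new ReflectDatumReader<>(schema))) {
            return reader.next().name;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip());
    }
}
```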

Recipe 6: Deserialize Data without Code Generation

Deserializing data without code generation is pretty easy. The only difference from Recipe 4 is how the schema is obtained, so the ways to fetch schemas described in Recipe 5 are also applicable here. Below is just the example that loads the schema from a disk file; I am pretty sure you can finish the rest of the code. Note that if the schema was parsed on the fly, without any code generation, you can only use the generic datum reader when deserializing the data, even if you attempt to use ReflectDatumReader.

@Test(dependsOnMethods = "testSerializeOnTheFly")
public void testDeserializeOnTheFly() throws IOException {
   Schema schema = new Schema.Parser().parse(new File("src/test/resources/LogEntry2.avsc"));
   DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);
   File file = new File("/tmp/log2");
   DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(file, datumReader);
   GenericRecord entry = null;
   while (dataFileReader.hasNext()) {
      entry = dataFileReader.next(entry);
      System.out.println(entry);
   }
}
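For Avro data files you can actually go one step further: the object container file format embeds the writer's schema in the file header, so a no-argument GenericDatumReader can read the data without parsing any schema at all. A self-contained sketch, using a trimmed-down one-field LogEntry2 schema:

```java
import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class EmbeddedSchemaDemo {
    // A trimmed-down LogEntry2 schema, kept to one field for brevity.
    static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"namespace\": \"me.jeffli.avrosamples.model\","
        + " \"type\": \"record\", \"name\": \"LogEntry2\","
        + " \"fields\": [{\"name\": \"name\", \"type\": \"string\"}]}");

    static String writeThenReadBack() throws IOException {
        File file = File.createTempFile("log2", ".avro");
        file.deleteOnExit();

        // Write one record exactly as in Recipe 5.
        GenericRecord entry = new GenericData.Record(SCHEMA);
        entry.put("name", "Jeffrey");
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(SCHEMA))) {
            writer.create(SCHEMA, file);
            writer.append(entry);
        }

        // A no-arg GenericDatumReader picks up the writer's schema from the
        // file header, so no .avsc file or schema string is needed to read.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            return reader.next().get("name").toString();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(writeThenReadBack());
    }
}
```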
