Schema Visualization

This example shows how a schema can be turned into an interactive HTML visualization, which makes the schema easier to examine, especially when dealing with large and heterogeneous data.

Tip

This example is also available as a Jupyter notebook. Feel free to run it yourself: schema_visualization.ipynb

First, we load the packages we want to use.

using JsonGrinder, JSON
import JsonGrinder: generate_html

Now we load all samples as a single string.

data_file = "../../../data/recipes.json"
samples_str = open(data_file) do fid
	read(fid, String)
end;

We parse the string into a vector of dictionaries.

samples = convert(Vector{Dict}, JSON.parse(samples_str));
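To make the shape of the parsed data concrete, here is a minimal sketch that runs the same two steps on a small inline JSON string. The name `toy_str` and its two recipes are hypothetical stand-ins for the contents of the data file, not part of the original example.

```julia
using JSON

# Hypothetical two-recipe document standing in for the real data file.
toy_str = """
[
  {"id": 1, "cuisine": "greek", "ingredients": ["feta", "olives"]},
  {"id": 2, "cuisine": "thai", "ingredients": ["rice"]}
]
"""

# A JSON array parses to a vector with one dictionary-like object per element.
parsed = JSON.parse(toy_str)

# Narrow the element type so that every sample is a Dict.
toy_samples = convert(Vector{Dict}, parsed)

println(length(toy_samples))
println(toy_samples[1]["cuisine"])
```

Each element can then be indexed by its JSON keys, which is the form the schema is built from later in this example.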

We print the first sample as JSON, indented by 2 spaces.

JSON.print(samples[1],2)
{
  "id": 10259,
  "ingredients": [
    "romaine lettuce",
    "black olives",
    "grape tomatoes",
    "garlic",
    "pepper",
    "purple onion",
    "seasoning",
    "garbanzo beans",
    "feta cheese crumbles"
  ],
  "cuisine": "greek"
}

We create a schema from all samples.

sch = JsonGrinder.schema(samples)
[Dict]  # updated = 39774
  ├─────────── id: [Scalar - Int64], 10000 unique values  # updated = 39774
  ├────── cuisine: [Scalar - String], 20 unique values  # updated = 39774
  ╰── ingredients: [List]  # updated = 39774
                     ╰── [Scalar - String], 6714 unique values  # updated = 428275

Now we can generate the HTML visualization and write it to a file, keeping only 100 unique values per item.

generate_html("recipes_max_vals=100.html", sch, max_vals=100)
11484

Alternatively, we can generate the HTML keeping all values stored in the schema.

generate_html("recipes.html", sch, max_vals=nothing)
490993

If we omit the first argument, we get the HTML as a string.

generated_html = generate_html(sch, max_vals = 100);
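Because the result is an ordinary string, it can also be saved later with `write` from Julia Base, which returns the number of bytes written. The integers printed after the file-writing calls above look like the same kind of byte count, but that is an assumption about `generate_html`, not something this example states. The sketch below uses a hypothetical stand-in string and a temporary file so that it runs without JsonGrinder.

```julia
# Hypothetical stand-in for the generated HTML string.
html = "<html><body><p>schema</p></body></html>"

# Write into a fresh temporary directory to avoid touching real files.
path = joinpath(mktempdir(), "toy.html")
nbytes = write(path, html)   # `write` returns the number of bytes written

# The returned count matches both the file size and the string size in bytes.
println(nbytes == filesize(path))
println(nbytes == sizeof(html))
```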

Now we can look at the visualization.

Feel free to click the triangles. Individual nodes of the tree are collapsed by default and can be expanded or collapsed by clicking. This way you can easily examine individual parts of the schema. For lists we show a histogram of lengths, for leaves a histogram of values, and so on.
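To illustrate what these two kinds of histograms count, here is a minimal sketch in plain Julia. It is not how JsonGrinder computes them internally; `toy_samples` and both counting dictionaries are hypothetical names used only for illustration.

```julia
# Hypothetical toy samples shaped like the recipes in this example.
toy_samples = [
    Dict("ingredients" => ["salt", "garlic"]),
    Dict("ingredients" => ["salt"]),
    Dict("ingredients" => ["water", "sugar"]),
]

# Histogram of list lengths: how many samples have a list of each length.
length_counts = Dict{Int, Int}()
for s in toy_samples
    n = length(s["ingredients"])
    length_counts[n] = get(length_counts, n, 0) + 1
end

# Histogram of leaf values: how often each string occurs across all lists.
value_counts = Dict{String, Int}()
for s in toy_samples, v in s["ingredients"]
    value_counts[v] = get(value_counts, v, 0) + 1
end

# Print the length histogram ordered by list length.
for n in sort(collect(keys(length_counts)))
    println(n, ": ", length_counts[n])
end
```

The list node in the visualization reports frequencies per list length in this spirit, and the leaf node reports frequencies per value.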

generated_html
" Json schema dump
    [Dict] (updated=39774)
  • cuisine -
      [Scalar - String], 20 unique values, (updated=39774, filled=100.00%, min=brazilian: 467, max=italian: 7838)
    • italian: 7838
    • mexican: 6438
    • southern_us: 4320
    • indian: 3003
    • chinese: 2673
    • french: 2646
    • cajun_creole: 1546
    • thai: 1539
    • japanese: 1423
    • greek: 1175
    • spanish: 989
    • korean: 830
    • vietnamese: 825
    • moroccan: 821
    • british: 804
    • filipino: 755
    • irish: 667
    • jamaican: 526
    • russian: 489
    • brazilian: 467
  • id -
      [Scalar - Int64], 10000 unique values, (updated=39774, filled=100.00%, min=41850: 1, max=23593: 1)
    • 23593: 1
    • 37819: 1
    • 29454: 1
    • 11950: 1
    • 45120: 1
    • 12778: 1
    • 10548: 1
    • 1956: 1
    • 12427: 1
    • 42582: 1
    • 41222: 1
    • 27167: 1
    • 36280: 1
    • 47428: 1
    • 39471: 1
    • 38919: 1
    • 16797: 1
    • 29579: 1
    • 28900: 1
    • 37780: 1
    • 7353: 1
    • 48361: 1
    • 18139: 1
    • 46927: 1
    • 17925: 1
    • 36992: 1
    • 33594: 1
    • 12255: 1
    • 11280: 1
    • 28025: 1
    • 46806: 1
    • 14804: 1
    • 21896: 1
    • 29702: 1
    • 37160: 1
    • 33390: 1
    • 1823: 1
    • 17853: 1
    • 33003: 1
    • 16984: 1
    • 11373: 1
    • 33489: 1
    • 5435: 1
    • 6706: 1
    • 3646: 1
    • 49672: 1
    • 13701: 1
    • 47586: 1
    • 28505: 1
    • 12322: 1
    • 3163: 1
    • 46481: 1
    • 31634: 1
    • 16341: 1
    • 23629: 1
    • 35305: 1
    • 37033: 1
    • 22241: 1
    • 35395: 1
    • 37258: 1
    • 47577: 1
    • 47032: 1
    • 5640: 1
    • 47726: 1
    • 46400: 1
    • 31831: 1
    • 35526: 1
    • 47339: 1
    • 27819: 1
    • 366: 1
    • 37890: 1
    • 27617: 1
    • 9329: 1
    • 40964: 1
    • 19588: 1
    • 33968: 1
    • 18219: 1
    • 79: 1
    • 19391: 1
    • 9374: 1
    • 49378: 1
    • 44623: 1
    • 41931: 1
    • 12795: 1
    • 33642: 1
    • 23214: 1
    • 40228: 1
    • 45920: 1
    • 28442: 1
    • 5001: 1
    • 19448: 1
    • 6515: 1
    • 15982: 1
    • 11775: 1
    • 24042: 1
    • 32017: 1
    • 40725: 1
    • 21512: 1
    • 25196: 1
    • 7560: 1
    • and other 9900 values
  • ingredients -
      [List] (updated=39774, filled=100.00%, mean=10.77, min=1, max=65, 10th percentile=6.0, median=10.0, 90th percentile=17.0)
    • with following frequencies
      • 1: 22
      • 2: 193
      • 3: 549
      • 4: 1128
      • 5: 1891
      • 6: 2662
      • 7: 3329
      • 8: 3556
      • 9: 3753
      • 10: 3677
      • 11: 3512
      • 12: 3146
      • 13: 2698
      • 14: 2253
      • 15: 1809
      • 16: 1439
      • 17: 1160
      • 18: 879
      • 19: 610
      • 20: 504
      • 21: 313
      • 22: 218
      • 23: 141
      • 24: 91
      • 25: 72
      • 26: 46
      • 27: 20
      • 28: 27
      • 29: 21
      • 30: 15
      • 31: 11
      • 32: 4
      • 33: 4
      • 34: 3
      • 35: 3
      • 36: 4
      • 38: 2
      • 40: 3
      • 43: 1
      • 49: 2
      • 52: 1
      • 59: 1
      • 65: 1
    • and data
        [Scalar - String], 6714 unique values, (updated=428275, min=pecan meal: 1, max=salt: 18049)
      • salt: 18049
      • onions: 7972
      • olive oil: 7972
      • water: 7457
      • garlic: 7380
      • sugar: 6434
      • garlic cloves: 6237
      • butter: 4848
      • ground black pepper: 4785
      • all-purpose flour: 4632
      • pepper: 4438
      • vegetable oil: 4385
      • eggs: 3388
      • soy sauce: 3296
      • kosher salt: 3113
      • green onions: 3078
      • tomatoes: 3058
      • large eggs: 2948
      • carrots: 2814
      • unsalted butter: 2782
      • extra-virgin olive oil: 2747
      • ground cumin: 2747
      • black pepper: 2627
      • milk: 2263
      • chili powder: 2036
      • oil: 1970
      • red bell pepper: 1939
      • purple onion: 1896
      • scallions: 1891
      • grated parmesan cheese: 1886
      • sesame oil: 1773
      • corn starch: 1757
      • ginger: 1755
      • baking powder: 1738
      • jalapeno chilies: 1730
      • dried oregano: 1707
      • chopped cilantro fresh: 1698
      • fresh lemon juice: 1679
      • diced tomatoes: 1624
      • fresh parsley: 1604
      • minced garlic: 1583
      • chicken broth: 1554
      • sour cream: 1539
      • cayenne pepper: 1523
      • fresh ginger: 1503
      • brown sugar: 1503
      • cooking spray: 1490
      • shallots: 1477
      • garlic powder: 1442
      • lime: 1439
      • lemon juice: 1395
      • fresh lime juice: 1368
      • flour: 1348
      • honey: 1299
      • vanilla extract: 1298
      • paprika: 1287
      • chopped onion: 1251
      • fish sauce: 1247
      • ground cinnamon: 1231
      • avocado: 1229
      • canola oil: 1223
      • dry white wine: 1218
      • lemon: 1218
      • rice vinegar: 1204
      • yellow onion: 1184
      • green bell pepper: 1180
      • cilantro leaves: 1160
      • tomato paste: 1158
      • heavy cream: 1146
      • cilantro: 1142
      • fresh basil: 1137
      • boneless skinless chicken breasts: 1111
      • flat leaf parsley: 1094
      • white sugar: 1093
      • lime juice: 1072
      • chicken stock: 1039
      • bay leaves: 1036
      • potatoes: 1018
      • chicken: 982
      • corn tortillas: 965
      • salsa: 963
      • cumin: 953
      • ground turmeric: 949
      • freshly ground pepper: 949
      • baking soda: 942
      • sea salt: 940
      • cumin seed: 935
      • garam masala: 925
      • shrimp: 912
      • black beans: 896
      • zucchini: 892
      • ground beef: 878
      • dried thyme: 873
      • large garlic cloves: 873
      • tomato sauce: 865
      • flour tortillas: 865
      • buttermilk: 863
      • plum tomatoes: 858
      • coconut milk: 854
      • granulated sugar: 849
      • and other 6614 values
"

If you like, you can use Electron to display the visualization with the following code (this works when run from the REPL, but not from a Jupyter notebook or in CI).

using ElectronDisplay
using ElectronDisplay: newdisplay
display(newdisplay(), MIME("text/html"), generated_html)