Schema Visualization
In this example we show how a schema can be turned into an interactive HTML visualization, which makes the schema easier to examine, especially when dealing with large and heterogeneous data.
This example is also available as a Jupyter notebook; feel free to run it yourself: schema_visualization.ipynb
First, we load the packages we want to use.
using JsonGrinder, JSON
import JsonGrinder: generate_html
Now we load all samples from the data file as a single string.
data_file = "../../../data/recipes.json"
samples_str = open(data_file) do fid
    read(fid, String)
end;
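As a side note, the `open ... do` block above can usually be shortened to a single `read` call from Julia's Base. The sketch below compares the two forms on a small temporary file; the file name and its contents are made up for illustration and are not part of the recipes data.

```julia
# Self-contained sketch: write a tiny stand-in file, then read it back two ways.
path = joinpath(mktempdir(), "tiny.json")   # hypothetical file, not recipes.json
write(path, """[{"id": 1, "cuisine": "greek"}]""")

# The do-block form used above: the file is closed even if reading throws.
via_open = open(path) do fid
    read(fid, String)
end

# The one-call form: read the whole file into a String.
via_read = read(path, String)

println(via_open == via_read)
```

Both forms return the full file contents as a `String`, so either can feed `JSON.parse` in the next step.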
We parse the string into a vector of dictionaries.
samples = convert(Vector{Dict}, JSON.parse(samples_str));
We print the first sample as JSON to see what the data looks like.
JSON.print(samples[1], 2)
{
  "id": 10259,
  "ingredients": [
    "romaine lettuce",
    "black olives",
    "grape tomatoes",
    "garlic",
    "pepper",
    "purple onion",
    "seasoning",
    "garbanzo beans",
    "feta cheese crumbles"
  ],
  "cuisine": "greek"
}
We create a schema from all samples.
sch = JsonGrinder.schema(samples)
[Dict] # updated = 39774
  ├─────────── id: [Scalar - Int64], 10000 unique values # updated = 39774
  ├────── cuisine: [Scalar - String], 20 unique values # updated = 39774
  ╰── ingredients: [List] # updated = 39774
                     ╰── [Scalar - String], 6714 unique values # updated = 428275
Now we can write the HTML visualization to a file, keeping only 100 unique values per item.
generate_html("recipes_max_vals=100.html", sch, max_vals=100)
11484
Alternatively, we can generate the HTML while keeping all values from the schema.
generate_html("recipes.html", sch, max_vals=nothing)
490993
If we omit the first argument, we get the HTML as a string instead.
generated_html = generate_html(sch, max_vals = 100);
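Because this form returns an ordinary string, it can also be saved later with Julia's Base `write`. The following is a minimal sketch under that assumption; it uses a short stand-in string instead of the real output of `generate_html`, and the output path is hypothetical.

```julia
# Stand-in for the string returned by generate_html(sch, max_vals = 100).
html = "<html><body><p>schema goes here</p></body></html>"

# Hypothetical output location inside a fresh temporary directory.
out_path = joinpath(mktempdir(), "schema.html")

# write(filename, content) returns the number of bytes written.
nbytes = write(out_path, html)

println("wrote $nbytes bytes to $out_path")
```

Keeping the HTML in a variable first is handy when the same string should be both saved and displayed.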
Now we can look at the visualization.
Individual nodes of the tree are collapsed by default; click a triangle to expand or collapse a node. This way you can easily examine individual parts of the schema. For lists we show a histogram of lengths, for leaves a histogram of values, and so on.
generated_html
- [Dict] (updated=39774)
- cuisine -
- [Scalar - String], 20 unique values,
(updated=39774, filled=100.00%, min=brazilian: 467, max=italian: 7838)
- italian: 7838
- mexican: 6438
- southern_us: 4320
- indian: 3003
- chinese: 2673
- french: 2646
- cajun_creole: 1546
- thai: 1539
- japanese: 1423
- greek: 1175
- spanish: 989
- korean: 830
- vietnamese: 825
- moroccan: 821
- british: 804
- filipino: 755
- irish: 667
- jamaican: 526
- russian: 489
- brazilian: 467
- id -
- [Scalar - Int64], 10000 unique values,
(updated=39774, filled=100.00%, min=41850: 1, max=23593: 1)
- 23593: 1
- 37819: 1
- 29454: 1
- 11950: 1
- 45120: 1
- 12778: 1
- 10548: 1
- 1956: 1
- 12427: 1
- 42582: 1
- 41222: 1
- 27167: 1
- 36280: 1
- 47428: 1
- 39471: 1
- 38919: 1
- 16797: 1
- 29579: 1
- 28900: 1
- 37780: 1
- 7353: 1
- 48361: 1
- 18139: 1
- 46927: 1
- 17925: 1
- 36992: 1
- 33594: 1
- 12255: 1
- 11280: 1
- 28025: 1
- 46806: 1
- 14804: 1
- 21896: 1
- 29702: 1
- 37160: 1
- 33390: 1
- 1823: 1
- 17853: 1
- 33003: 1
- 16984: 1
- 11373: 1
- 33489: 1
- 5435: 1
- 6706: 1
- 3646: 1
- 49672: 1
- 13701: 1
- 47586: 1
- 28505: 1
- 12322: 1
- 3163: 1
- 46481: 1
- 31634: 1
- 16341: 1
- 23629: 1
- 35305: 1
- 37033: 1
- 22241: 1
- 35395: 1
- 37258: 1
- 47577: 1
- 47032: 1
- 5640: 1
- 47726: 1
- 46400: 1
- 31831: 1
- 35526: 1
- 47339: 1
- 27819: 1
- 366: 1
- 37890: 1
- 27617: 1
- 9329: 1
- 40964: 1
- 19588: 1
- 33968: 1
- 18219: 1
- 79: 1
- 19391: 1
- 9374: 1
- 49378: 1
- 44623: 1
- 41931: 1
- 12795: 1
- 33642: 1
- 23214: 1
- 40228: 1
- 45920: 1
- 28442: 1
- 5001: 1
- 19448: 1
- 6515: 1
- 15982: 1
- 11775: 1
- 24042: 1
- 32017: 1
- 40725: 1
- 21512: 1
- 25196: 1
- 7560: 1
- and other 9900 values
- ingredients -
- [List] (updated=39774, filled=100.00%, mean=10.77,
min=1, max=65, 10th percentile=6.0, median=10.0, 90th percentile=17.0)
- with following frequencies
- 1: 22
- 2: 193
- 3: 549
- 4: 1128
- 5: 1891
- 6: 2662
- 7: 3329
- 8: 3556
- 9: 3753
- 10: 3677
- 11: 3512
- 12: 3146
- 13: 2698
- 14: 2253
- 15: 1809
- 16: 1439
- 17: 1160
- 18: 879
- 19: 610
- 20: 504
- 21: 313
- 22: 218
- 23: 141
- 24: 91
- 25: 72
- 26: 46
- 27: 20
- 28: 27
- 29: 21
- 30: 15
- 31: 11
- 32: 4
- 33: 4
- 34: 3
- 35: 3
- 36: 4
- 38: 2
- 40: 3
- 43: 1
- 49: 2
- 52: 1
- 59: 1
- 65: 1
- and data
- [Scalar - String], 6714 unique values,
(updated=428275, min=pecan meal: 1, max=salt: 18049)
- salt: 18049
- onions: 7972
- olive oil: 7972
- water: 7457
- garlic: 7380
- sugar: 6434
- garlic cloves: 6237
- butter: 4848
- ground black pepper: 4785
- all-purpose flour: 4632
- pepper: 4438
- vegetable oil: 4385
- eggs: 3388
- soy sauce: 3296
- kosher salt: 3113
- green onions: 3078
- tomatoes: 3058
- large eggs: 2948
- carrots: 2814
- unsalted butter: 2782
- extra-virgin olive oil: 2747
- ground cumin: 2747
- black pepper: 2627
- milk: 2263
- chili powder: 2036
- oil: 1970
- red bell pepper: 1939
- purple onion: 1896
- scallions: 1891
- grated parmesan cheese: 1886
- sesame oil: 1773
- corn starch: 1757
- ginger: 1755
- baking powder: 1738
- jalapeno chilies: 1730
- dried oregano: 1707
- chopped cilantro fresh: 1698
- fresh lemon juice: 1679
- diced tomatoes: 1624
- fresh parsley: 1604
- minced garlic: 1583
- chicken broth: 1554
- sour cream: 1539
- cayenne pepper: 1523
- fresh ginger: 1503
- brown sugar: 1503
- cooking spray: 1490
- shallots: 1477
- garlic powder: 1442
- lime: 1439
- lemon juice: 1395
- fresh lime juice: 1368
- flour: 1348
- honey: 1299
- vanilla extract: 1298
- paprika: 1287
- chopped onion: 1251
- fish sauce: 1247
- ground cinnamon: 1231
- avocado: 1229
- canola oil: 1223
- dry white wine: 1218
- lemon: 1218
- rice vinegar: 1204
- yellow onion: 1184
- green bell pepper: 1180
- cilantro leaves: 1160
- tomato paste: 1158
- heavy cream: 1146
- cilantro: 1142
- fresh basil: 1137
- boneless skinless chicken breasts: 1111
- flat leaf parsley: 1094
- white sugar: 1093
- lime juice: 1072
- chicken stock: 1039
- bay leaves: 1036
- potatoes: 1018
- chicken: 982
- corn tortillas: 965
- salsa: 963
- cumin: 953
- ground turmeric: 949
- freshly ground pepper: 949
- baking soda: 942
- sea salt: 940
- cumin seed: 935
- garam masala: 925
- shrimp: 912
- black beans: 896
- zucchini: 892
- ground beef: 878
- dried thyme: 873
- large garlic cloves: 873
- tomato sauce: 865
- flour tortillas: 865
- buttermilk: 863
- plum tomatoes: 858
- coconut milk: 854
- granulated sugar: 849
- and other 6614 values
- with following frequencies
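To make the two kinds of histograms in the output above more concrete, here is a rough sketch that computes them by hand on three toy samples. It does not use JsonGrinder and is not how the package computes its statistics; the helper `count_values` and the sample values are hypothetical.

```julia
# Toy samples shaped like the recipes data; the values are made up.
samples = [
    Dict("cuisine" => "greek",   "ingredients" => ["feta", "olives", "garlic"]),
    Dict("cuisine" => "italian", "ingredients" => ["garlic", "tomatoes"]),
    Dict("cuisine" => "italian", "ingredients" => ["basil", "tomatoes", "garlic"]),
]

# Count occurrences of each item produced by an iterator (hypothetical helper).
function count_values(items)
    counts = Dict{Any,Int}()
    for item in items
        counts[item] = get(counts, item, 0) + 1
    end
    return counts
end

# Histogram of list lengths, comparable to what is shown for [List] nodes.
length_hist = count_values(length(s["ingredients"]) for s in samples)

# Histograms of values, comparable to what is shown for [Scalar] leaves.
cuisine_hist = count_values(s["cuisine"] for s in samples)
ingredient_hist = count_values(x for s in samples for x in s["ingredients"])

# Sort by count, most frequent first, similar to the ordering in the output above.
top = sort(collect(ingredient_hist), by = last, rev = true)
println(first(top))
```

On the real data, the same idea yields the length frequencies listed under `ingredients` and the value counts listed under each scalar leaf.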
If you like, you can use Electron to open the visualization in a browser window with the following code (this works when run from the REPL, but not from a Jupyter notebook or in CI).
using ElectronDisplay
using ElectronDisplay: newdisplay
display(newdisplay(), MIME{Symbol("text/html")}(), generated_html)