plot/design.txt at master · vdobler/plot · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
Design of ggplot2 for Go
========================

Data Input
----------

Data input is either Slice-Of-Measurements (SOM) of the form

    type Measurement struct {
        Age    int
        Weight float64
        Height float64
        Born   time.Time
	Gender string
    }
    func (m Measurement) BMI() float64 {
        return m.Weight/(m.Height*m.Height)
    }

    var data []Measurement

or of the Collection-Of-Slices (COS) form

    type Measurement struct {
        Age    []int
        Weight []float64
        Height []float64
        Born   []time.Time
        Gender []string
    }
    func (m Measurement) BMI(i int) float64 {
        return m.Weight[i]/(m.Height[i]*m.Height[i])
    }

    var data Measurement

Both forms are converted to a Data Frame which is basically a generic
representation in the COS form. It resembles a R data frame.


Plot Creation
-------------

Plot Creation is the following process:

  1.  Split data according to facetting specification.
      Happens only on facetted plots, each facett is basically
      treated as its own plot and represented as a panel.
      Result: A n x m data frames in Domain Units, original field names


  2.  Prepare Data
      ▔▔▔▔▔▔▔▔▔▔▔
  2a. Some or all fields are mapped to aesthetics.
      Mapped fields are renamed to the aesthetic.
      Unmapped fields are removed from the data frame.

      Result: Data frame in Domain Units, unmapped fields have
      been removed, other have the mapped aes name and are
      scale transformed

  2b. Scales are added for the "known" aesthetics if not
      jet present.
      The mapped fields are transformed according to the
      scale transformation (identitiy, log, sqrt, 1/x, ...)
      Pre-Train Scales. Usefull if upcumming stat wants to knows what
      the full x-range will be.


  3.  Statistical transformation
      ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
  3a. It is checked that the statistical transform can be
      applied to the given data frame

  3b. The data frame is partitioned on any additional discrete
      fields.
      The stat transform is applied to each partition.
      The partitions are joined to produce a result.

  3.  A statistical transform is applied, typically some kind of
      summary statistics like binning, boxplot or smoothing.
      Result: A completely new data frame with new field names.
      (Or the original input if no stat is requested.)


  4.  Wiring stat to geom
      ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
  4a. Some (typically new) fields might be mapped to (new) aestetics.
      Scale transformations are performed during this mapping.

  4b. Rename fields to match expected input from Geom in next step.
      Result: A data frame with field names suitable to be rendered
      as a specific Geom.


  5.  Geom constructio TODO: no longer accurate
      ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
  5a. Apply geom specific position adjustments.

  5b. Train scales.

  5c. Reparametrise Geoms to a set of fundamental geoms.
      Result: One or several fundamantal geoms with their data frames
      in Domain Units with field names suitable for primitive geoms.

  6.  Prepare scales
      ▔▔▔▔▔▔▔▔▔▔▔▔▔
      Set up the remaining fields in each scale so that the scales
      can be used.

  7.  Render fundamental geoms
      ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
      Uses these scales to render the fundamental geoms produced in
      step 5c into Grobs (Graphical Objects).
      Result: Structure of Grobs, coordinates/values in Range Units.

  8   Render rest of plot
      ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
  8a. Render Guides (Axes and Legends)

  8b. Render title, facett boxes.

  8c. Determine space needed for titel, label, guides, etc.
      From this calculate panel size and positions.

  9.  Draw layer grobs
      ▔▔▔▔▔▔▔▔▔▔▔▔▔▔
  9a. Apply Coordinate Transformations: Interprete <x,y> pairs in
      a Grob eg. as <y,x> (flip coordinates), as <r,ϑ> (polar coordiantes)
      or <x,-y> (reversed y) and so on.
      Munching (interpolate lines by lots of small segments.) for
      non-cartesian coordinates.

  9b. Render into panel viewport.


Statistical Transforms
----------------------
Basic idea is dead simple. Take one data frame and construct an other data
frame which summarizes/describes the original one.

Only complication: What to do if the input data frame contains more
slots/fields than the stat can operate on?  Example: StatBin needs
"x" and can use optional "weight". What to do if extra field "Gender"
is present in input data frame?
The following may happen:
  -  Completely ignore the additional fields, pretend they are just
     not there.
  -  Fail. Do not process such data.
  -  Group (facett) input data on the additional field(s) if the
     additional fields are discrete.
Ignore and fail are simple; grouping n the StatBin with extra Gender example:
Split input data frame into two frames Gender=='f' and Gender=='m'
apply Simple StatBin to both and combine both results afterwards to something
like
       x  count density Gender
       5     0    0.0     m
      10     6    0.2     m
      15    11    0.4     m
      20     9    0.3     m
       5     2    0.1     f
      10     5    0.2     f
      15    15    0.5     f
      20     0    0.0     f

The fields x and Gender where present in the input to StatBin, the fields
count and density (also ncount and ndensity) are new fields.
Gender is passed through pretty much unchanged while x is aggregated,
here the center of the bins, i.e. none of the resulting x might occur
in the input data frame.

Statistical transforms cannot be chained. (Well not automatically:
If data frame generation and applying a stat is exported the user could
chain several stats manually.)


Aesthetics Mapping
------------------

Mapping of Aesthetics may happen before and after statistical transform.
Example StatBin: x must be mapped before (as StatBin needs x as input)
while mapping the generated count to y can happen only after computing
the stat.

What happens if one of the known real aesthetics is mapped:
  -  The field in the data frame is renamed to the aesthetics
     (e.g. BMI is replaced by y)
  -  A scale is added for this aesthetics if not present in the plot.
     Depending on the type of the data frame field this might be continous
     or discrete.
  -  The scale is pre-trained: TODO: really here? not own phase?


Complications, differences from ggplot2 for R
---------------------------------------------
All of the following do _not_ produce points for the calculated bins:
  ggplot(diamonds, aes(x=x)) + stat_bin() + geom_point()
  ggplot(diamonds, aes(x=x)) + stat_bin(aes(y=..count..)) + geom_point()
  ggplot(diamonds, aes(x=x)) + stat_bin(aes(y=..count..)) + geom_point(count=count)
  ggplot(diamonds, aes(x=x)) + stat_bin(aes(y=..count..)) + geom_point(y=..count..)
  ggplot(diamonds, aes(x=x)) + stat_bin(aes(y=..count..)) + geom_point(y=count)
  ggplot(diamonds, aes(x=x)) + stat_bin(aes(y=..count..)) + geom_point(y=count)
  ... and so on.
The ggplot2 way of doing this is by use of
  ggplot(diamonds, aes(x=x)) + stat_bin(aes(y=..count..), geom="point")
which is pretty straight forward for an interactive tool but does not
fit well into a prepared plot with fixed layers, stats and geoms.

It seem impossible in ggplot for R to plot geom_crossbar to a calculated
stat_boxplot as the calculated middle cannot be wired to the required
y value. The standard way in ggplot2 for R is to add either a stat or
a geom but not both (maybe change the geom in the stat).  This won't
work properly here.

[[
   Maybe stuff like plot.AddHistogram or plot.AddBoxplot which
   adds suitable stat and geom in one step might be nice
   convenience functions, maybe in a subpackage, once...
]]


Position Adjustments
--------------------
All four adjustments (dodge, jitter, stack and fill) work some geoms
only in ggplot2 for R. Boxplots can use identity and dodge. Stacking (and
filling) works properly only for geom_bar.  The ggplot for R code is a bit
shaky here: It works hardwired on ymin and ymax so that stacking crossbars
looses the the crossbar at the y value as this one is not moved...
It kinda works for ggplot2 for R as the normal way is to add either a
stat _or_ a geom, but not both.

Conclusion: Original ggplot2 for R solution is not practical here.
Solution: Make position adjustment a property of the individial
geoms. Some geoms might not allow and position adjustments at all
or just a subset.


Discrete Position Scales
------------------------
Discrete position scales provide a major obstacle in geom construction:
Assume a discrete string variable stored like this
  "foo"  3.0
  "bar"  4.0
  "waz"  7.0
and try to construct a boxplot (or bars) with this string variable
mapped to x-scale. How to set up, compute, transmit and draw the
with of the boxes, especially with a dodging position adjustment?
Idea:
  - Nothing is really changed during geom construction or rendering.
  - Geom construction produces continuous fields in the data frames,
    e.g. a dodged boxplot might produce two boxes for "foo" positioned
    at 2.7 and 3.3 each of width 0.2
  - When these continuous fields are drawn on a discrete scale any
    value x will be decomposed into x == xi + dx with xi an integer
    and dx in (-0.5,+0.5).  Mapping x (e.g. == 3.2) to the continuous screen
    coordinates is a two step process:
      a) look up xi (== 3) in the discrete scale levels: 3==foo --> 1
      b) add dx, thus producing 1.2
    and map to natural units.


Geoms
-----
There are aa handfull of fundamental geoms, these correspond directly
to grobs.  Other geoms might be simple reparametrizations of a fundamental
grob: E.g. GeomTile, described by <x,y,width,height,fill,...> is just
a different parametrization of a GeomRect of <xmin,ymin,xmax,ymax,fill...>.
Other geoms can be represented by a set of other geoms, e.g. GeomBoxplot
consists of GeomRects, GeomLines and GeomPoints.

Constructing geoms performs the following parts:
  - Adjust positions (dodge, stack, ...).
  - Compute "bounding boxes", e.g. a bar at x=1 might reach from.
    0.25 to 1.75.
  - Train scales with these bounding boxes.
  - Emit fundamental geoms.


Grobs, Graphical Objects
------------------------

Grobs are the primitive building blocks for graphics
  -  Points.   <x,y, style, size, color, alpha>
  -  Text.     <x,y, text, color, size, v-/h-align, font, alpha>
  -  Lines.    <x0,y0,x1,y1,...,xn,yn, style, color, size/width>
  -  Polygon   <x0,y0,x1,y1,...,xn,yn, fill, alpha>
  -  Rectangle <xmin,ymin, xmax,ymax, style, color, width, fill, alpha>
  -  Container <xmin,ymin, xmax,ymax>
All grob coordinates are in the interval [0,1]. Container group grobs
and provide a convenient way to reparam different grobs of e.g. one geom.

TODO: Rectangle Needed? Achievable through Grob-Reparametrisation of
Polygon but may be convenient...


Boxplot
-------

Maybe the most complex stuff. If this works everythings should be fine.

Prerequisite:
  - One discrete field is mapped to x.
  - One field is mapped to y.
  - Other fields may be mapped to color, fill, linetype

Statistical transform:
  - Partition input data by color, fill and linetype.
  - For each partition:
      * Group on the discrete x values
      * For each group compute min, low, med, high, max
        (and the outliers) based on the set of ys.
      * Return DF with (x, min,q1,median,q3,max)
  - Combine partitions to single DF which might look like
    (x, min,q1,median,q3,max, color, linetype)

Geom construction:
  - Wire generated stat fields to geom input fields, e.g.
    rename q1-->low, median-->mid, q3-->high
  - For same-x entries in the DF: Apply position adjustments.
    The only sane here is dodging, but nevertheless, why not
    stack them....
  - While building the geom: Train the scales, especially
    the y scale.
  - Produce a set of "Basic-Geoms" which biject to grobs and
    do not need a reparametrisation: Produce rectangle,
    vertical lines, horizontal median and outlier dots.
  - Outlier dots could be read from the original DF?


Grob construction:
  - All scales should be trained now, prepare projection
    functions.
  - Use projection functions to turn values in the different
    geoms to real grobs in viewport coordinates.


Example of Plot Construction
----------------------------

Step            Example A                          Example B                        Example C                        Example D
°°°°            °°°°°°°°°                          °°°°°°°°°                        °°°°°°°°°                        °°°°°°°°°

                x=Age, y=BMI, fill=Gender          x=Age, fill=Gender               x=Age, fill=Gender               x=Age, y=BMI, fill=Gender, linestyle=Smoker
                Stat Identity                      Stat Bin                         Stat Bin                         Stat Boxplot
                Geom Point                         Geom Bar y=count                 Geom Bar y=density               Geom Boxplot
                Position Identity                  Position Stack                   Position Fill                    Position Dodge
                LogScale on x


Prepare Data    rename field                       rename field                     rename field                     rename field
°°°°°°°°°°°°      Age->x, BMI->y, Gender->fill       Age->x, Gender->fill             Age->x, Gender->fill             Age->x, BMI->y, Gender->fill, Smoker->linestyle
                keep only the field                keep only the field              keep only the field              keep only the field
                  x, y, fill                         x, fill                          x, fill                          x, y, fill, linestyle
                set x <- log(x)


Prepare Scale   (add Scale x), pretrain            add Scale x, pretrain            add Scale x, pretrain            add Scale x, pretrain
°°°°°°°°°°°°°   add Scale y, pretrain                                                                                add Scale y, pretrain
                add Scale fill, pretrain           add Scale fill, pretrain         add Scale fill, pretrain         add Scale fill, pretrain
                                                                                                                     add Scale linestyle, pretrain


Stat Trans.     noop                               partition on fill                partition on fill                partition on fill
°°°°°°°°°°°                                        per fill:                        per fill:                        per fill
                                                      binify (x,count,density)         binify (x,count,density)         partition on linestyle
                                                   combine to:                      combine to:                         per linestyle
                                                      (x,count,density,fill)           (x,count,density,fill)              boxify (x,min,q1,med,q3,max)
	      				       					 				        combine to:
                                                                                                                           (x,min,q1,med,q3,max, linestyle)
                                                                                                                     combine
														        (x,min,q1,med,q3,max, linestyle, fill)


Rewire          noop                               count->y, drop density           density->y, drop count           q1->low, q3->high
°°°°°°                                             add Scale y                      add Scale y                      (add Scale y)


Geom Constr     forach row in data:                foreach rows                     foreach rows                     foreach rows
°°°°°°°°°°°       create one point (x,y,fill)        create bars (x,0,y,fill)         create bars (x,0,y,fill)         create box (x,width,min,...,max,ls,fill)
                                                     if same x exists:                if same x exists:                if same x exists:
                                                        raise to (x,a,a+y,fill)          raise to (x,a,a+y,fill)          move to (x+a, width/b,min,...,max,ls,fill)
                train x,y,fill                     train x, y, fill                 foreach unique x:                train x,y,fill,ls
                                                                                      rescale y-vals
                                                                                    reset y-scale and retrain


The oxboys example on grouping
------------------------------
Original DF with fields:
  - Subject
  - Height
  - Age
  - Occasion


Discrete Scales mapped to shape and linetype
--------------------------------------------
ggplot2 limits number of shapes and linetypes to a handful (otherwise the plot
becomes to complicated and unreadable). This behaviour is okay for an
interactive tool but might be frustating in code.
But even if we do not clip the number of shapes or linestyles there is just
a limited number: What to do if we run out of shapes? Wrap around? Most
probably not as this is ambiguous. Add color if available?


Themes
------
Styling of plots is done by specifiying various properties of individual
plot elements.  The plot elements are grouped hierarchically and settings
are inherited from above (a bit like CSS).