model the world; view your data; control their chaos
We’re heading into the time of year where institutional data integration miseries make a mockery of academic productivity as we scrabble to assemble the outcome of a variety of assessment processes into something that might resemble the basis for a judgment.
I share a module with a colleague (at Strathclyde we use the word “class”, but that might become confusing, given what follows). I do a lot more online assessment than he currently does, so it suits me to key all my student data by username. My colleague keys all his assessment data by registration number. Our institution’s Virtual Learning Environment keys students differently again, for exercises involving anonymous marking. All of these keys are just strings. How do we achieve coherence? Laboriously.
My part of the module is chopped up into topics. Each topic has associated classroom-delivered paper tests and some online materials.
The information about how students have performed in these various
components is managed rather heterogeneously. There might be one file for each paper test. Meanwhile, students each have their own directory recording their interaction with online materials, with a subdirectory for each topic, containing files which relate to their performance in individual assessment items. Some of these files have formats for which I am to blame; other file formats I have thrust upon me. I need to be able to find out who did how well in what by when. I need logic.
And I’m asking myself what I usually ask myself when I need logic: ‘How much of the logic I need can I get from types?’. I’m fond of decidable typechecking, and of various kinds of type-directed program construction (which I much prefer to program-directed type construction). Can we have types for data which help us to audit, integrate and transform them in semantically sensible ways? That’s the kind of problem that we dependent type theorists ought to be able to get our teeth into. But these everyday spreadsheet-this, database-that, log-file-the-other data are really quite unlike the indexed inductive tree-like datatypes which we are used lovingly to be crafting. “Beautiful Abstract Syntax Trees Are Readily Definable” was one of the names we thought about calling Epigram, until we checked the acronym. Dependent type theory is not just sitting on a canned solution to these real world data problems, ready to deploy. Quite a lot of headscratching will be necessary.
‘What’s a good type for a spreadsheet?’ is a reasonable question. ‘What’s a good dependent type for a spreadsheet?’ is a better question. ‘Upon what might a dependent type for a spreadsheet depend, and how much would that really have to do with spreadsheets per se?’ is a question which might lead to an idea. When you have diverse files and online sources all contributing information to some larger resource, we need to establish a broader conceptual framework if we are to work accurately with the individual components. The spreadsheets, database records, forms, etc, are all views or lenses into a larger system which may never exist as a single bucket of bits, but which me might seek to model.
So what I’m looking for is a dependently typed language of metadata. We should be able write a model of the information which ought to exist, and we should be able to write views which describe a particular presentation of some of the data. A machine should then be able to check that a model makes sense, that a view conforms to the model, and that the data is consistent with the view. Given a bunch of views, we should be able to compute whether they cover the model: which data are missing and which are multiply represented. The computational machinery to check, propagate or demand the actual data can then be constructed.
I had a thought about this last summer. Picking some syntax out of thin air, I began to write things like
class Student class Module for Module -: class Test for Student, Module -: prop Participant
What’s going on? I’ve made four declarations: three “classes”, and one relation. A “class” is a conceptual variety of individuals. Classes can be (relatively) global, such as students or modules. Classes can be localized to a context, so that each module has its own class of tests.
The “for” construct localizes the declarations which are subordinated by indentation after the layout herald “-:”. It’s tidier to say that each module has a bunch of tests than that tests exist globally but each test maps to a module. Moreover, it means that tests in different modules need not share a keyspace.
A class is a finite enumeration whose elements are not known at declaration time. A prop is a finite enumeration whose elements are not known at declaration time, but it is known that there’s at most one element. There’s at most one way in which a student can be a participant in a module.
So far, I haven’t said anything about what these wretched individuals might look like. So,
for Student -: email ! String username ! String regNo ! String surname : String forenames : String
I’ve declared a bunch of things which ought to exist in the context of an individual student. The ones with “!” are intended to be keys for students. That’s to say any sensible view of student data should include at least one student key, but it doesn’t really matter which. Of course, with a little more dependently typed goodness, we could enforce formatting properties of email addresses, usernames and registration numbers…some other time. The point is that by introducing abstract notions of individual, outside of the way that those individuals can be keyed, we provide a hook for data integration.
I’m kind of saying what stuff should be stored in a “master record” for each student, but I don’t expect to know all the fields when I introduce the concept of a student.
Another thing that’s fun about bothering to introduce abstract classes of individual is that contextualization can be much more implicit. We do not need to name individuals to talk about stuff that’s pertinent to a typical individual, which means we can write higher-order things in a first order way and handle more of the plumbing by type-based lookup.
class Department for Module -: department : Department moduleId ! String class Test -: item ! String max : [0..] weight : [0..] for Student, Module, Participant, Test -: prop Present -: score : [0..max]
Here, I show how to associate a department with a module, after the fact. I also introduce tests for each module, each with a maximum score: the use of “-:” in the “class Test” declaration just elides an immediately subsequent “for Test -:”.
Correspondingly, for each student participating in a module (and those students might not be from the same department as the module), and for each test, it makes sense to wonder if the student showed up to the test and where in the range of possible marks they scored.
I should be able to write something like
to mean, given a contextualizing department, just the modules for that department.
What’s a view of this information? It might be something like
one Module [moduleId] | for Test [item | ---- | max ] ----------------------------+-------------------------------- for Student, Participant | val : Percentage [surname|foreNames|regNo] | if Present -: | [score] | val = weight * score / max | else -: | ["A"] | val = 0
I’m sure we can negotiate over the two-dimensionality of the syntax (as long as we prioritise reading over writing), but that’s the picture. Scoping goes downward and rightward. The brackets show you what you actually see, which must exist in the given scope. The keyword “one” indicates that we are working within just one module, keyed by the given code. Meanwhile “for” requires us to tabulate all the individuals for whom an environment can be constructed to match the given context.
Meanwhile, the cells in the middle of the table will enable the computation of new local information, “val”. Presence or absence is signified by a score or the constant “A” (which is checkably distinct from all allowable scores), and the definition of “val” is given accordingly.
Note that I have not indicated whether this view is a query or a form. In fact, I have made sure it is valid in both roles. I’d like to be able to instruct the computer to initialize a spreadsheet with as much of this information as is available from other sources. I usually have to do that with cut and paste! After the fashion of pivot tables, I should be able to specify aggregations over rows and columns which are appropriate (e.g., the average score for each test, the total score for each student).
Lots of the ingredients for these tools are in existence already, and it’s clear that my knowledge of them is sadly lacking, having stumbled into the direction of data from the direction of dependent type theory. I seek to educate myself, and help in that regard is always appreciated. Of course, informally, I’m taught about some of the problems by the poor technology with which I face the mundane realities of my existence, and I understand that I can change me more easily than I can change the world. I don’t expect institutional buy-in (I’ll have a go, right enough), but I don’t need it. The point of the modelling language is to build myself a bubble with a permeable membrane: the things from outside the bubble can have sense made of them (by giving a view which describes the role of external data); the things constructed inside the bubble make sense intrinsically (because they were constructed in a model-directed way). Fewer strings! More things!
Edit: I should have included a link to
my slides on this topic, for a talk delivered at Microsoft Research and at York.