Defining the Data Models

Having created the schema, let’s now move on to implementing the processors. We will write two processors: the runs processor for processing the raw run summaries, and the stats processor for querying the metrics. Before that, we will first define the internal models for our samples, libraries, and statistics containers.

We will put all of them inside the nl.lumc.sasc.sentinel.exts.maple package, since everything will be specific to the Maple pipeline support.

Note

Since we will be using the internal models for this part, it is useful to browse the ScalaDoc along the way. A link to the most recent ScalaDoc is available in the sidebar.

To start off, we first consider the types of object we need to define:

  • For the run itself, we’ll define a MapleRunRecord that subclasses nl.lumc.sasc.sentinel.models.BaseRunRecord.
  • For the samples, we’ll define a MapleSampleRecord that subclasses nl.lumc.sasc.sentinel.models.BaseSampleRecord.
  • Likewise, for the libraries (read groups), we’ll define a MapleReadGroupRecord that subclasses nl.lumc.sasc.sentinel.models.BaseReadGroupRecord.
  • And finally, for the statistics, we’ll define MapleStats for the single data points and MapleStatsAggr for aggregated data points.

The definitions of these objects are outlined below. Note that while we define them here in a single file, you are free to split them across multiple files. The important thing is that they have the correct package name (nl.lumc.sasc.sentinel.exts.maple in this case).

MapleRunRecord

Let’s start with the first one: MapleRunRecord. Open a MapleRunRecord.scala file in the appropriate directory and add the following contents (you can use your own package name, if you prefer):

 package nl.lumc.sasc.sentinel.exts.maple

 import java.util.Date

 import org.bson.types.ObjectId

 import nl.lumc.sasc.sentinel.models._
 import nl.lumc.sasc.sentinel.utils.utcTimeNow

 /** Container for a Maple run. */
 case class MapleRunRecord(
   runId: ObjectId,
   uploaderId: String,
   pipeline: String,
   sampleIds: Seq[ObjectId],
   readGroupIds: Seq[ObjectId],
   runName: Option[String] = None,
   deletionTimeUtc: Option[Date] = None,
   creationTimeUtc: Date = utcTimeNow) extends BaseRunRecord

From the definition above, you can already notice a few properties (a short construction sketch follows the list):

  1. Our run record stores most of its IDs as ObjectId, which is the default ID type for MongoDB databases. The uploader ID is kept as a String for later use.
  2. We also store the date when the record is created in creationTimeUtc. We use the utcTimeNow function from the utils package to get the current UTC time.
  3. There is also a deletionTimeUtc attribute that stores when the record is deleted. The default is set to None, since when an object is created it is not yet deleted.
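
For illustration, here is a minimal sketch of how such a record might be constructed. The ID values and names below are made up purely for this example.

 // Hypothetical values for illustration only.
 val runRecord = MapleRunRecord(
   runId = new ObjectId,
   uploaderId = "devUser",
   pipeline = "maple",
   sampleIds = Seq(new ObjectId, new ObjectId),
   readGroupIds = Seq(new ObjectId, new ObjectId),
   runName = Some("run_01"))
 // creationTimeUtc defaults to utcTimeNow and deletionTimeUtc to None.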

MapleSampleRecord

Now let’s move on to the sample record. In the same file, add the following MapleSampleRecord definition:

 /** Container for a single Maple sample. */
 case class MapleSampleRecord(
   stats: MapleSampleStats,
   uploaderId: String,
   runId: ObjectId,
   sampleName: Option[String] = None,
   runName: Option[String] = None) extends BaseSampleRecord

In contrast to MapleRunRecord, our sample record can be quite short since it needs to store less information. The actual metrics will be stored in the yet-to-be-defined MapleSampleStats object, under the stats attribute. The name stats itself is free-form; you can choose any attribute name for your metrics object. You can even define multiple attributes storing different statistics, which is useful for storing different types of metrics on the same level, for example alignment metrics and variant calling metrics for a given sample, as sketched below.
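
As an illustration only, such a record could look like the sketch below. The names MapleAlnStats, MapleVarStats, and MultiStatsSampleRecord are hypothetical and not part of this tutorial; the metrics containers simply follow the same pattern as the statistics containers defined later in this section.

 /** Hypothetical metrics containers (illustration only). */
 case class MapleAlnStats(
   nReadsAligned: Long,
   labels: Option[DataPointLabels] = None) extends LabeledStats

 case class MapleVarStats(
   nSnps: Long,
   labels: Option[DataPointLabels] = None) extends LabeledStats

 /** Hypothetical sample record storing two metrics objects on the same level. */
 case class MultiStatsSampleRecord(
   alnStats: MapleAlnStats,
   varStats: MapleVarStats,
   uploaderId: String,
   runId: ObjectId,
   sampleName: Option[String] = None,
   runName: Option[String] = None) extends BaseSampleRecord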

Notice also that there is no deletionTimeUtc attribute. This is because when sample records are removed from the database, Sentinel removes them completely and does not keep a record of which samples were removed. This is mainly because Sentinel never shows the sample documents in the HTTP interface, so it is free to add and remove samples. Run records, on the other hand, are shown to users, and it is sometimes useful to keep track of the ones that have been deleted.

Finally, notice that now we store the sample name under sampleName in addition to the run name.

MapleReadGroupRecord

Next up is the read group record:

 /** Container for a single Maple read group. */
 case class MapleReadGroupRecord(
   stats: MapleReadGroupStats,
   uploaderId: String,
   runId: ObjectId,
   isPaired: Boolean = true,
   readGroupName: Option[String] = None,
   sampleName: Option[String] = None,
   runName: Option[String] = None) extends BaseReadGroupRecord

This is quite similar to MapleSampleRecord, with two differences (a short construction sketch follows the list):

  1. There is an attribute called isPaired, which, as you can guess, denotes whether the library comes from paired-end sequencing or not. Since Maple handles paired-end files, we can set its default value to true.
  2. There is an additional name attribute: readGroupName, for storing the read group name.
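
Here is a minimal construction sketch with made-up values (MapleReadGroupStats is defined in the next subsection):

 // Hypothetical values for illustration only.
 val readGroupRecord = MapleReadGroupRecord(
   stats = MapleReadGroupStats(nReadsInput = 10000L, nReadsAligned = 9500L),
   uploaderId = "devUser",
   runId = new ObjectId,
   readGroupName = Some("rg_01"),
   sampleName = Some("sample_01"),
   runName = Some("run_01"))
 // isPaired is not supplied, so it defaults to true.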

Statistics container

Finally, we come to the definition of our actual metrics container. Since we store the metrics on two levels, sample and read group, we need to define the metrics container for each of these levels. This is what they look like:

 /** Container for a single Maple sample statistics. */
 case class MapleSampleStats(
   nSnps: Long,
   labels: Option[DataPointLabels] = None) extends LabeledStats

 /** Container for a single Maple read group statistics. */
 case class MapleReadGroupStats(
   nReadsInput: Long,
   nReadsAligned: Long,
   labels: Option[DataPointLabels] = None) extends LabeledStats

For each level, we define a case class that extends LabeledStats. This trait enforces the use of a labels attribute to tag a particular metrics data point with labels. Any given data point must at least be labeled with the database ID of its run record (runId). Optionally, it may also be labeled with the run name, read group name, and/or sample name. All of this is contained within the DataPointLabels instance stored in the labels attribute.
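
For example, single data points can be created as shown below (the numbers are made up). We leave labels at its default here; in practice the runs processor fills it with a DataPointLabels instance carrying at least the run’s database ID.

 // Made-up metric values; labels is left at its default (None) for brevity.
 val sampleStats = MapleSampleStats(nSnps = 250L)
 val readGroupStats = MapleReadGroupStats(nReadsInput = 10000L, nReadsAligned = 9500L)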

The objects defined above store single data points of our metrics. They are instantiated for each sample or read group present in the uploaded JSON summary file. We enforce the use of case classes here for several reasons:

  1. To minimize potential runtime errors, since the case class ensures our stored metrics are all typed. The type information is also used to ensure user-defined metrics work well with the Sentinel core methods (see the short example after this list).
  2. Case classes play nicely with Swagger’s automatic API spec generation. Supplying these as type parameters in our controllers later on results in Swagger generating the JSON object definitions.
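
To illustrate the first point, a wrongly typed metric value is rejected at compile time rather than surfacing as a runtime error:

 val typedStats = MapleSampleStats(nSnps = 250L)
 // val wrongStats = MapleSampleStats(nSnps = "250")  // does not compile: type mismatch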

In addition to the two case classes defined above, we may also want to define the following case classes for storing aggregated data points instead of single data points:

 /** Container for aggregated Maple sample statistics. */
 case class MapleSampleStatsAggr(nSnps: DataPointAggr)

 /** Container for aggregated Maple read group statistics. */
 case class MapleReadGroupStatsAggr(
   nReadsInput: DataPointAggr,
   nReadsAligned: DataPointAggr)

You’ll notice that these are quite similar to the previous case classes, except that:

  1. All the attribute types are DataPointAggr.
  2. There are no labels anymore.

DataPointAggr is another case class that contains aggregated statistics such as avg, max, or median. It is likely that we will use macros to generate these containers in future Sentinel versions, since they are very similar to the case classes that define the single data points.

That concludes the first part of the processors tutorial! Now we can move on to the actual implementation of the processors. Before you go on, however, we would like to note that the processors make use of Scalaz’s disjunction type (popularly known as \/), its EitherT type, and the standard library Future type. If these do not sound familiar, we strongly recommend that you go over our short guides on them first: Composable Error Handling and Asynchronous Processing. Otherwise, feel free to go directly to the processors tutorial: Creating the Processors.