GitHub Issue 875: Assay Multi-File Transform Import Skips First Data Row #7456
cnathe merged 4 commits into release25.11-SNAPSHOT
Conversation
…regardless of if column headers written - note: not yet fixed to make sure selenium test fails as expected on TC first
…regardless of if column headers written
…of the number of Luminex and Standard assay runs that were imported with > 1 data input file
assayMetrics.put("assayRunsWithMultipleInputFiles", new SqlSelector(schema, """
    SELECT COUNT(*) FROM (
        SELECT sourceapplicationid, COUNT(*) AS count FROM exp.data
        WHERE name NOT LIKE '%.log' AND name NOT LIKE '%.Rout' AND name NOT LIKE '%.pdf' AND sourceapplicationid IN (
Wouldn't the .Rout extension only apply to transform scripts written in R? Maybe I'm wrong, but I thought that whatever files are left after the transform script has completed get added as data outputs.
It looks like data type information is encoded in the exp.data lsid. I haven't looked into it, but I'm wondering whether filtering on that might work.
Yeah, if there is a better way to filter down to just the "data" files, that would be great. When I was looking over the set of files (exp.data rows) linked to the runid, I just wanted to make sure we aren't counting the logging info files and other generated files.
I looked at it a little bit and there is some evidence that we use the AbstractAssayProvider.RELATED_FILE_DATA_TYPE for these additional files that get produced from a transform script. I tried this metric variation:
SELECT COUNT(*) FROM (
    SELECT sourceapplicationid, COUNT(*) AS count FROM exp.data
    WHERE lsid NOT LIKE '%:RelatedFile.%' AND sourceapplicationid IN (
        SELECT rowid FROM exp.protocolapplication
        WHERE lsid LIKE '%:SimpleProtocol.CoreStep'
            AND (protocollsid LIKE '%:LuminexAssayProtocol.%' OR protocollsid LIKE '%:GeneralAssayProtocol.%')
    )
    GROUP BY sourceapplicationid
) x WHERE count > 1
It produced the same result as your query. Feel free to play around with it if you are interested, but overall I don't think it is superior to your query. If anything, yours could be a more conservative estimate and might pull in some false positives, but I think that is more desirable than the other direction.
I like that better. I'm switching to using that NOT LIKE RelatedFile instead. Thanks
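For readers who want to reason about the RelatedFile filter without a database, here is a minimal in-memory sketch of the same counting logic. It is an illustration only, not the PR's code: the `DataRow` record and the class name are hypothetical stand-ins for the two exp.data columns the query reads, and the sketch omits the exp.protocolapplication subquery that restricts the count to Luminex and Standard assay protocols.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class MultiInputFileMetricSketch
{
    /** Hypothetical stand-in for the exp.data columns used by the metric. */
    record DataRow(int sourceApplicationId, String lsid) {}

    /** Counts source applications that have more than one data row whose LSID is not a RelatedFile type. */
    static long countRunsWithMultipleInputFiles(List<DataRow> rows)
    {
        // Equivalent of: WHERE lsid NOT LIKE '%:RelatedFile.%' ... GROUP BY sourceapplicationid
        Map<Integer, Long> filesPerApplication = rows.stream()
            .filter(row -> !row.lsid().contains(":RelatedFile."))
            .collect(Collectors.groupingBy(DataRow::sourceApplicationId, Collectors.counting()));

        // Equivalent of the outer: SELECT COUNT(*) ... WHERE count > 1
        return filesPerApplication.values().stream()
            .filter(count -> count > 1)
            .count();
    }
}
```

Under these assumptions, a run with one data file plus a RelatedFile output (for example a transform log) is not counted, while a run with two data files is.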
Rationale
#875 Luminex Multi-File Transform Import Skips First Data Row
When an imported Luminex assay run includes multiple files, we merge (concatenate) the data from those files into a single runData TSV file when writing the data out for the assay transform script. This PR fixes an issue where the first row of the second and subsequent files was skipped.
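The intended behavior can be sketched as follows. This is a simplified illustration, not the actual LabKey implementation: the class name, method name, and the in-memory row representation are all hypothetical. It assumes each input file has already been parsed into data rows only (no header row), so the header is written exactly once and every data row from every file is appended.

```java
import java.util.ArrayList;
import java.util.List;

final class RunDataMergeSketch
{
    /**
     * Builds the lines of a merged TSV from several parsed input files.
     * Each element of {@code files} holds the data rows of one file, without a header row.
     */
    static List<String> merge(List<String> columnNames, List<List<List<String>>> files)
    {
        List<String> lines = new ArrayList<>();

        // Write the column headers once, before any file's data.
        lines.add(String.join("\t", columnNames));

        for (List<List<String>> dataRows : files)
        {
            // Append every row of every file. The rows carry no header,
            // so nothing is skipped for the second and later files.
            for (List<String> row : dataRows)
                lines.add(String.join("\t", row));
        }
        return lines;
    }
}
```

With two input files of two rows and one row, the merged output has four lines: one header followed by three data rows. A version that dropped the first row of each later file would produce only three.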
Related Pull Requests
Changes