crash testing, now 92000 documents continuously tested
Last August we had a collection of approximately 76000 documents. This July we have over 92000 documents. The corpus is mostly gathered from our bugzilla and other projects bugzillas via our get-bugzilla-attachments-by-mimetype script. For testing purposes we continuously import them all, and then export certain formats to multiple destination formats. For example odts are re-exported to odt, doc and docx and the results of what documents crashed (which includes asserts) are uploaded to the crashtest site
I like to imagine that these are typically the type of mean and bitter documents that try to eat innocent office software alive.
The get-bugzilla-attachments-by-mimetype script only downloads new attachments from its target bugzillas which are not already downloaded. The last run to refresh the corpus took over two days to complete. Refreshing isn't fast or cheap so it's fairly infrequent.
The regular crashtesting run to import and reexport the corpus is comparatively frequent, takes approximately 24 hours and typically gets run every two or three days on the latest 64bit Linux master in headless mode.