ODOT reveals workaround for using FILESTREAM & Collector with large datasets

In this article, experts from the Ohio Department of Transportation provide a detailed workflow example, with step by step instruction, for using FILESTREAM with Collector for ArcGIS to gather, manage and store extremely large datasets in the best ways possible for performance, management and scalability, in line with current industry best practices.

Introduction

In the 2008 release of SQL Server, Microsoft introduced the option to enable a column to use FILESTREAM on a field for BLOB data with a data type of VARBINARY (MAX). This opened the door to options for storing larger data such as images, videos, documents, etc., in the systems directory folders (NTFS) instead of embedded inside the database.

From an architectural standpoint, using FILESTREAM has a significant Return on Investment (ROI) as a result of storage cost savings. Databases are stored on better quality drives that cost more to scale. Additionally, with the backup routines for maintenance and Disaster Response (DR) needs, the cost to account for rapid growth due to large BLOB data inside the database is substantial. Leveraging FILESTREAM allows for the database size to remain significantly smaller and offers more versatility by scaling a file server environment that uses lower quality and less expensive drives to store the large attached data.

Currently, ESRI does not support the FILESTREAM option out of the box; instead, these types of BLOB data are embedded within the tables inside the database. While this is beneficial for the transactional aspects of data management, it takes its toll on scalability aspects.

Additionally, the aspect of performance once records get into the 10’s or 100’s of thousands raises concerns. While the FILESTREAM option hasn’t been tested, there also seems to be a lack of documentation on the performance of databases with the attachments stored inside the tables.

Problem set

As a large government organization, the Ohio Department of Transportation has a need to gather data on a large number of different assets. The collection of this data may come from internal employees, external local government partners or contractors. A single enterprise solution capable of standardizing the data collection processes while leveraging existing architectures was needed.

The requirements list was extensive, but key pieces included:

Being able to track changes to the data at the individual user level using Active Directory integration,
Retain historical details,
Perform inspections/assessments against an asset, and
Attach images or documents.

All of this was to be accomplished in the most efficient and cost effective manner, with a scalable architecture to accommodate the growth of existing and new datasets.

As a result of an emergency situation in which Collector for ArcGIS had already been leveraged, ODOT was able to establish some baseline metrics and determined that Collector may be a viable solution. In that case, a total of roughly 5,000 assets making up a single data set, with 1-5 images per asset, were collected. This was done entirely within the ArcGIS Online environment, but once the data was pulled down into the local SDE, it was 15GB in size. For estimation purposes this was generalized out to assess impacts of larger asset datasets, and it was determined that an asset with 100,000 items could scale to roughly 300GB [(100,000/5,000) * 15]. It was easy to quickly recognize that, propagated across the large number of assets that were under consideration, storage could easily climb into the terabytes range.

Solution

The solution was to leverage Collector for ArcGIS as part of the ESRI suite of products, and make use of the Eend-User License Agreement (EULA) already in place to facilitate the front end through ArcGIS Online, and the Collector application while managing, storing and serving the data on ODOT’s back end architecture.

The workflow example outlined below details the concepts behind using Collector for ArcGIS to gather datasets and then manage and store them in the best ways possible for performance, management and scalability, in line with current industry best practices.

Our initial workflow was designed to leverage Collector for ArcGIS in order to gather a new dataset at any point location and store up to 5 asset images for that location.

Current ESRI solution

After creating the data set tables in the ArcSDE Geodatabase on SQL Server 2012, one of the administrative steps is to "Create Attachments", easily done in ArcCatalog by right-clicking the feature class.

ESRI takes control and builds the necessary related table to store the images, PDF's, etc., along with the relationship class that associates these BLOB data to the individual features. During this process a field is created that has the name “DATA”. This field is where the attached BLOB files are physically stored. Each record in this related table now has a single attached data item that is embedded directly within the table. A few other admin steps, such as enabling “Editor Tracking” or “Enable Archiving”, are performed here as well, and the data is ready to be used with Collector.

The issue with ESRI's current solution

The issue with this is that during the “Create Attachments” process there is no option to enable the “DATA” field as a FILESTREAM field. This restricts users or database administrators to this single method of storing these data types in the tables, instead of providing the option to let the database manage the data regardless of which option is used; but this is precisely the moment when this needs to occur, because a field cannot be designated to use FILESTREAM after the table is created — the allocation to use FILESTREAM must be done as part of the table creation process.

While utilizing the field as a FILESTREAM-enabled field may slightly slow down transactions, we felt it was an acceptable trade-off and would add a minimal amount of time to a process that already takes some user waiting and patience. The benefit includes the ability to store the BLOB data on separate disks that can be configured to better optimize their purpose and help with the transactional aspects. Having the BLOB data on separate disks makes it easier to scale and manage growing datasets while also helping to maintain database performance for users. Additionally, handling the backup and routine maintenance aspects is greatly improved because the database is significantly smaller in size.

After posts to the ESRI Idea’s website and escalated tickets submitted through the ESRI helpdesk, there was no helpful solution presented to ODOT for leveraging the FILESTREAM option. With no answers on if or when an out-of-the-box solution would come, as well as no details for workaround options, it seemed that an interesting path was ahead for managing such a large and rapidly growing database and datasets.

Workaround

Despite no existing procedures for leveraging FILESTREAM, it still presented the best solution to meet the needs of ODOT, so a workaround was designed; below are the procedural steps that allow for the creation of a new data set in ArcGIS that can take advantage of the FILESTREAM option.

CAUTION: This workflow is a functional solution but may or may not meet the needs of all organizations. This workflow is structured based on the creation of a brand new data set and is not to be applied to an existing data set. Applying this to an existing data set may lead to the loss of data and/or corruption of the data or database. For ODOT’s purposes the options for “Enabling Editor Tracking”, “Enabling Archiving” and using a non-versioned workflow were necessary. Specifically, this has not been vetted against a versioned workflow and may not provide a working solution to versioned workflows.

DISCLAIMER: The Ohio Department of Transportation and authors of this paper assume no responsibility and cannot be held liable for any corruption of datasets, databases or subsequent breaking of applications or other negative impacts due to attempts to leverage this solution. It is the responsibility of the individual to follow IT best practices with regards to backing up data and working solutions through the development, testing and production environments.

1. Verify FILESTREAM is enabled for your SQL Server instance by going into the SQL Server Configuration Manager and right-clicking on the server instance name. Go down to the pop-out menu and click Properties. In the new pop-up window for Properties, click on the FILESTREAM tab and ensure that FILESTREAM is configured.

2. In SQL Management Studio, right-click on the database on which you want to enable FILESTREAM and go down to Properties. At the bottom of the Properties pop-up window there is a box for FILESTREAM; click “Add” to create the FILESTREAM file group.

3. On the left side, click the “Files” page then click the “Add” button on the bottom of the window. This will add a new file. In the window, click on the “File Type” for the newly created file name and in the drop-down menu select FILESTREAM Data. For the same file, at the right side, click the “Path” and designate the NTFS location where the files will be stored.

4. Create a new feature class (example: FILESTREAM_test).

5. Setup the new feature class for use in Collector.

6. Add Global ID’s, Enable Editor Tracking, Enable Archiving, Create Attachments.

7. Confirm the Attachments' related table and relationships classes are built.

8. Get a screenshot of the indexes that exist on the Attachment related table because you will need to rebuild them later.

9. Go into SQL Management Studio — or similar database tool — and copy the “Create” statement that was generated when ArcGIS built the attachment table (FILESTREAM_test_ATTACH).

1.  USE [Collector]
2.  GO
3.
4.  /****** Object: Table [dbo].[FILESTREAM_TEST__ATTACH] Script Date: 9/1/2015 10:35:16 AM ******/
5.  SET ANSI_NULLS ON
6.  GO
7.
8.  SET QUOTED_IDENTIFIER ON
9.  GO
10.
11. SET ANSI_PADDING ON
12. GO
13.
14. CREATE TABLE [dbo].[FILESTREAM_TEST__ATTACH](
15.   [GDB_ARCHIVE_OID] [int] NOT NULL,
16.   [REL_GLOBALID] [uniqueidentifier] NOT NULL,
17.   [CONTENT_TYPE] [nvarchar](150) NOT NULL,
18.   [ATT_NAME] [nvarchar](250) NOT NULL,
19.   [DATA_SIZE] [int] NOT NULL,
20.   [DATA] [varbinary](max) NULL,
21.   [GLOBALID] [uniqueidentifier] NOT NULL,
22.   [GDB_FROM_DATE] [datetime2](7) NOT NULL,
23.   [GDB_TO_DATE] [datetime2](7) NOT NULL,
24.   [ATTACHMENTID] [int] NOT NULL
25. ) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
26.
27. GO
28
29. SET ANSI_PADDING OFF
30. GO
31. 
32. ALTER TABLE [dbo].[FILESTREAM_TEST__ATTACH] ADD DEFAULT ('{00000000-0000-0000-0000-000000000000}') FOR [GLOBALID]
33. GO
34.
35. ALTER TABLE [dbo].[FILESTREAM_TEST__ATTACH] ADD CONSTRAINT [GDB_FROM_DATE129_def] DEFAULT (getutcdate()) FOR [GDB_FROM_DATE]
36. GO
37.
38. ALTER TABLE [dbo].[FILESTREAM_TEST__ATTACH] ADD CONSTRAINT [GDB_TO_DATE129_def] DEFAULT (CONVERT([datetime2],'9999-12-31 23:59:59.000000')) FOR [GDB_TO_DATE]
39. GO

10. Delete line 25, in this example, except for the bracket.

ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

11. Modify line 21, in this example, to add in ROWGUIDCOL as unique per the details found at Microsoft’s site.

[GLOBALID] [uniqueidentifier] ROWGUIDCOL NOT NULL UNIQUE

12. Modify line 20, in this example, to enable FILESTREAM for the column.

[DATA] [varbinary](max) FILESTREAM NULL

13. Delete the attachment table: FILESTREAM_test__ATTACH using SQL Management Studio. This tricks ArcGIS by keeping the relationship class built and correct.

14. Execute the modified create statement built in steps 9-12 to rebuild the FILESTREAM_test__ATTACH table that now has the necessary FILESTREAM allocations in it.

1.  USE [Collector]
2.  GO
3.
4.  /****** Object: Table [dbo].[FILESTREAM_TEST__ATTACH] Script Date: 9/1/2015 10:35:16 AM ******/
5.  SET ANSI_NULLS ON
6.  GO
7.
8.  SET QUOTED_IDENTIFIER ON
9.  GO
10.
11. SET ANSI_PADDING ON
12. GO
13.
14. CREATE TABLE [dbo].[FILESTREAM_TEST__ATTACH](
15.   [GDB_ARCHIVE_OID] [int] NOT NULL,
16.   [REL_GLOBALID] [uniqueidentifier] NOT NULL,
17.   [CONTENT_TYPE] [nvarchar](150) NOT NULL,
18.   [ATT_NAME] [nvarchar](250) NOT NULL,
19.   [DATA_SIZE] [int] NOT NULL,
20.   [DATA] [varbinary](max) FILESTREAM NULL,
21.   [GLOBALID] [uniqueidentifier] ROWGUIDCOL NOT NULL UNIQUE,
22.   [GDB_FROM_DATE] [datetime2](7) NOT NULL,
23.   [GDB_TO_DATE] [datetime2](7) NOT NULL,
24.   [ATTACHMENTID] [int] NOT NULL
25. )
26.
27. GO
28
29. SET ANSI_PADDING OFF
30. GO
31. 
32. ALTER TABLE [dbo].[FILESTREAM_TEST__ATTACH] ADD DEFAULT ('{00000000-0000-0000-0000-000000000000}') FOR [GLOBALID]
33. GO
34.
35. ALTER TABLE [dbo].[FILESTREAM_TEST__ATTACH] ADD CONSTRAINT [GDB_FROM_DATE129_def] DEFAULT (getutcdate()) FOR [GDB_FROM_DATE]
36. GO
37.
38. ALTER TABLE [dbo].[FILESTREAM_TEST__ATTACH] ADD CONSTRAINT [GDB_TO_DATE129_def] DEFAULT (CONVERT([datetime2],'9999-12-31 23:59:59.000000')) FOR [GDB_TO_DATE]
39. GO

15. Using ArcGIS, recreate the indexes captured in step 5. NOTE: Notice the existing index as a result of the work done in SQL Management Studio: This is fine — DO NOT delete it.

16. Recreate the UUID index.

17. Recreate the GDB index.

18. Review and confirm all 3 indexes exist and are correct.

19. The data set is ready to be used in a service on an ArcGIS server and built into a map to be used with Collector.

Observations

All the core functionalities continue to work as designed by ESRI for the datasets within all the ArcGIS software products. The archiving functionality is working correctly and was vetted by deleting attachments from within Collector. Not only are the record sets within the database properly applying the date time stamps (GDB_TO_DATE) and no longer showing the image, it is also properly retaining the images on the file server location via FILESTREAM to meet archiving needs.

There is some initial review that will need to be done on each dataset in order to understand the mapping of the attached files back to the individual records within the database because the generated attachment file names are system-generated, offering no immediate logical correlation.

If there is no desire to retain images on the file server in accordance with the ESRI archiving processes, then some simple scripting work can be done to identify records that contain a valid “GDB_TO_DATE” value (example: 20150831-09:15:15) in the FILESTREAM_test__ATTACH table, allowing you to delete the appropriate records. This deletion propagates to the file server and properly removes the attached file as well. A per se, invalid value in the “GDB_TO_DATE” field begins with “9999” for the year and identifies that it’s a record that is still in use, representing the most current state of the data and related attachments. Scripting is simply tied to this “GDB_TO_DATE” field to retain all records with “9999” beginnings and delete records up to a specific date and time. Pull system time at that moment or anything older than 1 month would be some example options based on the retention policies and needs of the organization.

Conclusion

Initial testing of Collector using Active Directory Federated Services along with ArcGIS Online and enabling FILESTREAM works, and thus far, meets the requirements and user needs. Based on work done thus far, scaling and performance seem to be optimal for the architecture and needs.

Future potential exists to test this workflow as part of a versioned data set workflow. This is still being evaluated by ODOT and may become a future need, at which point this can be evaluated as a working solution, or a new possible workaround could be defined, should one exist. Additionally, no testing has been done for handling the migration of existing data into this solution. This will be evaluated in the coming months as a new data set is created based on this solution and an Export, Transform, Load (ETL) process will facilitate the migration of existing data into this solution. There are no anticipated issues with this solution but it must be noted that ODOT has not vetted this yet.

Enterprise solutions and architectures are exploding, and as they do, many new challenges and opportunities arise. While ESRI has done great work thus far to provide true, fully-integrated enterprise solutions and architecture options, the continued advancements made by other technologies creates a continuous cycle of triaging opportunities for testing, development and implementation into the ESRI product suites. Hopefully, based on the solution provided here and customer needs as illustrated on the Ideas site and the forums, this database side advancement can be integrated with out-of-the-box functionalities.

For more information, contact Scott Fierro, GISP, CADD/GIS administrator, Ohio Department of Transportation, scott.fierro@gmail.com or Dustin Brown, SQL DBA, Ohio Department of Transportation, dustin.brown2@dot.ohio.gov.