Web Harvester for Sitecore: A Technology-Agnostic Solution for Sitecore Content Migrations

Content migration is one of the key problems organizations must solve when moving to Sitecore, or to a Sitecore headless architecture as part of a composable DXP strategy. This creates the need for a technology-agnostic migration tool that can move content into Sitecore regardless of the platform it currently resides on.

‘Web Harvester for Sitecore’ is a Chrome Extension that enables content administrators and developers to harvest any website’s content and import it into Sitecore. It eliminates the need to build a platform-specific solution for each Sitecore content migration, which improves productivity, efficiency and accuracy.

PREREQUISITES:

  • This tool depends on the Sitecore ItemService RESTful API to create and update items in Sitecore. Ensure that Sitecore.Services.SecurityPolicy in the Sitecore.Services.Client.config of your Sitecore instance is set to either ServicesOnPolicy or ServicesLocalOnlyPolicy (if Sitecore is hosted locally).
  • For importing media into Sitecore,
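
As a rough illustration of what the ItemService prerequisite enables, the sketch below builds (but does not send) an ItemService login request and an item-creation request. The host name, credentials, parent path, template ID and field names are placeholders chosen for the example, and this is not the extension's own code.

```python
import json
from urllib.parse import quote
from urllib.request import Request

SSC_BASE = "/sitecore/api/ssc"  # ItemService route prefix


def build_login_request(host, domain, username, password):
    """Build the ItemService login request; nothing is sent here."""
    body = json.dumps({"domain": domain, "username": username, "password": password})
    return Request(
        f"{host}{SSC_BASE}/auth/login",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_create_item_request(host, parent_path, item_name, template_id, fields,
                              database="master"):
    """Build a request that creates an item under parent_path; nothing is sent here."""
    # Field values travel in the same JSON object as ItemName and TemplateID.
    payload = {"ItemName": item_name, "TemplateID": template_id, **fields}
    # The parent path is URL-encoded into a single path segment.
    encoded_path = quote(parent_path.strip("/"), safe="")
    return Request(
        f"{host}{SSC_BASE}/item/{encoded_path}?database={database}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values for illustration only.
request = build_create_item_request(
    "https://cm.example.local",
    "/sitecore/content/Home",
    "About Us",
    "{00000000-0000-0000-0000-000000000000}",  # hypothetical template ID
    {"Title": "About Us"},
)
print(request.get_method(), request.full_url)
```

If these calls are rejected by the server, the security policy setting described above is a sensible first thing to check.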

INSTALLATION:

CONFIGURING IMPORTS:

  • Specify the URLs to be imported separated by semi-colons. Ensure that all the mentioned URLs follow the same Page DOM structure.
  • Specify the Sitecore URL and authenticate. This lets the extension fetch the Sitecore content tree from the target instance and import content into it.
  • Specify the target content/media locations and the template used to create/update the item.
  • Select the DOM Elements whose content needs to be imported into Sitecore using the DOM picker icons and map them with appropriate template fields
    • For Image & Link fields, once the DOM element is selected, you may need to update the DOM path to add the attribute (e.g. @src, @href) from which the content should be pulled.
    • For Multilist fields, double-click the DOM picker icon to select multiple DOM elements, then click the icon again once the selection is complete. The contents of the selected elements are delimited by ‘|’ and pushed to Sitecore. (Map the DOM elements from the page/URL that has the maximum number of tags/options for the Multilist field, so that no tag/option is missed for any of the imported URLs.)
  • If necessary, configure Replace Options for each DOM element’s content individually by selecting the replace icon and specifying the find/replace text.
    • Fields like Multilist, Droplink, Treelink, etc. require the datasource Sitecore item GUID instead of text. Replace Options should pair each text to find in the content with the datasource GUID to replace it with.
    • The default XML template used by Image & Link fields is auto-populated in the Replace Options once the field is selected, and can be updated if necessary.
  • Any sub-items or dependent items can be added using the ‘Add another Mapping Section’ link
  • Click on the Import button to extract content from the specified DOM elements and import them into Sitecore for all of the specified URLs.
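
The configuration steps above can be pictured as a small pipeline: split the URL list, apply the find/replace pairs to each selected value, and join Multilist values with ‘|’. The sketch below illustrates that flow under stated assumptions. The function names, the sample tag texts and the GUIDs are hypothetical, and the image value only shows the general shape of a Sitecore image field; the extension's actual default template may differ.

```python
import re


def split_urls(raw):
    """Split the semicolon-separated URL list, dropping blank entries."""
    return [url.strip() for url in raw.split(";") if url.strip()]


def apply_replacements(text, replacements):
    """Apply (find, replace) pairs in order; each 'find' is treated as a regex."""
    for find, replace in replacements:
        text = re.sub(find, replace, text)
    return text


def multilist_value(selected_texts, replacements):
    """Turn the texts of several selected elements into a '|'-delimited value."""
    return "|".join(
        apply_replacements(text.strip(), replacements) for text in selected_texts
    )


def image_field_value(media_id):
    """Illustrative raw value for an image field pointing at a media item."""
    return f'<image mediaid="{media_id}" />'


# Hypothetical tag texts mapped to hypothetical datasource GUIDs.
tag_replacements = [
    (r"^News$", "{11111111-1111-1111-1111-111111111111}"),
    (r"^Events$", "{22222222-2222-2222-2222-222222222222}"),
]

print(split_urls("https://example.com/a; https://example.com/b;"))
print(multilist_value(["News", " Events "], tag_replacements))
```

Anchoring each pattern with `^` and `$` keeps a short tag text from matching inside a longer one, which matters when tag names overlap.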

NOTE:

  • Publish all the content locations along with their children once the import is complete.
  • This Chrome Extension uses a few asynchronous jobs to validate and initiate the import, so the start of the import may be delayed depending on the number of URLs. It is recommended to keep the URL count below 250 per import operation to avoid such delays.
  • To ensure performance, the Extension captures DOM elements from the non-processed HTML markup only (i.e. without any JavaScript execution). Hence, for lazy-loaded images, the DOM path should use the attribute present in the non-processed HTML markup (e.g. @data-src), and any lazy-loaded URLs must be imported separately.
  • For importing images, ensure that the IIS user has download permissions for the wwwroot folder. When using Docker, the following command can be included in the Dockerfile of the CM:
    RUN icacls 'C:\inetpub\wwwroot' /grant 'IIS_IUSRS:(F)' /t
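
To see why the lazy-loading note matters, the sketch below reads image sources from raw HTML without executing any JavaScript, preferring a lazy-load attribute over `src` when it is present. The class and function names and the sample markup are hypothetical; the extension itself works with DOM paths rather than this parser.

```python
from html.parser import HTMLParser


class ImageSourceCollector(HTMLParser):
    """Collect image URLs from raw markup, preferring the lazy-load attribute."""

    def __init__(self, lazy_attr="data-src"):
        super().__init__()
        self.lazy_attr = lazy_attr
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        # Without script execution, 'src' may still hold a placeholder image,
        # so the lazy-load attribute wins whenever it is present.
        source = attr_map.get(self.lazy_attr) or attr_map.get("src")
        if source:
            self.sources.append(source)


def image_sources(raw_html, lazy_attr="data-src"):
    """Return the image URLs found in raw_html, in document order."""
    collector = ImageSourceCollector(lazy_attr)
    collector.feed(raw_html)
    collector.close()
    return collector.sources


sample = (
    '<img src="placeholder.gif" data-src="/media/hero.jpg">'
    '<img src="/media/logo.png" />'
)
print(image_sources(sample))
```

In the sample, the first image would resolve to the placeholder if only `src` were read, which is the situation the @data-src advice is meant to avoid.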

The source code for this module is available on GitHub. This module is built for the Sitecore community and doesn’t require any license. It doesn’t collect any of your information. Please check it out and let me know here if you have any feedback, issues or feature requests.

Thank you for using this module!

2 Replies to “Web Harvester for Sitecore: A Technology-Agnostic Solution for Sitecore Content Migrations”

  1. Why this approach, when Data Exchange Framework gives a wider variety of options in pipelines?
    If something lower-level is needed, one could use SPE, which is an industry-proven solution for Sitecore.

    What advantages does this approach have over DEF or SPE?

    1. Hi Martin, DEF is a very powerful, extensible ETL tool. This Chrome Extension is mainly built for one common scenario, migrating content directly from the browser to Sitecore, as we do not have a tech-agnostic way to read content with DEF. Some of the key differentiators:

      • Easy to set up & configure mappings
      • Works for any website irrespective of the technology
      • Allows mapping portions of a page flexibly with XPaths
      • Process content with Regex Replace options
      • Save mapping and reuse later

      To import with DEF, we could use OOB value accessors, writers, readers, connectors, etc., or create custom ones, which can be complex. For other cases, DEF/SPE would be more appropriate.

      Thanks,
      Subbu
