Content migration is one of the key problems organizations have to solve when moving to Sitecore, or to a Sitecore headless architecture as part of a composable DXP strategy. This creates the need for a technology-agnostic content migration tool that can bring content into Sitecore regardless of the platform it currently resides on.
‘Web Harvester for Sitecore’ is a Chrome Extension that lets content administrators and developers harvest any website’s content and import it into Sitecore. It eliminates the need to build a separate platform-specific solution for each Sitecore content migration by providing a tech-agnostic one that improves productivity, efficiency and accuracy.
- This tool depends on the Sitecore ItemService RESTful API to create and update items in Sitecore. Ensure that Sitecore.Services.SecurityPolicy in the Sitecore.Services.Client.config of your Sitecore instance is set to either ServicesOnPolicy or ServicesLocalOnlyPolicy (if Sitecore is hosted locally).
- To import media into Sitecore:
  - Ensure Sitecore PowerShell Extensions is installed in your Sitecore instance.
  - Install the following Sitecore PowerShell module in your Sitecore instance: https://github.com/SubbuRamanathan/web-harvester-for-sitecore/releases/download/v1.0/WebHarvesterForSitecore-v1.0.zip
  - Enable ‘restfulv2’ in the Sitecore PowerShell config as described here
- Install the Chrome Extension from the following link: https://chrome.google.com/webstore/detail/capmoeppijlopehoccajgabbmpmmacnc?hl=en
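To make the ItemService prerequisite more concrete, here is a minimal Python sketch of the kind of requests a client sends to that API. It is not the extension's own code. The routes `/sitecore/api/ssc/auth/login` and `/sitecore/api/ssc/item/{path}` follow the ItemService documentation as I understand it; the host name, credentials, template ID and field names are placeholders, and the helper function names are hypothetical. The requests are only built, never sent.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

JSON_HEADERS = {"Content-Type": "application/json"}


def build_login_request(base_url, domain, username, password):
    """Build (but do not send) an ItemService login call."""
    body = {"domain": domain, "username": username, "password": password}
    return Request(
        base_url.rstrip("/") + "/sitecore/api/ssc/auth/login",
        data=json.dumps(body).encode("utf-8"),
        headers=JSON_HEADERS,
        method="POST",
    )


def build_create_item_request(base_url, parent_path, item_name, template_id,
                              fields, database="master"):
    """Build (but do not send) a call that creates an item under parent_path."""
    # Field values travel in the same JSON object as ItemName and TemplateID.
    payload = {"ItemName": item_name, "TemplateID": template_id, **fields}
    url = "{}/sitecore/api/ssc/item/{}?{}".format(
        base_url.rstrip("/"),
        quote(parent_path.strip("/")),
        urlencode({"database": database}),
    )
    return Request(url, data=json.dumps(payload).encode("utf-8"),
                   headers=JSON_HEADERS, method="POST")


if __name__ == "__main__":
    # Placeholder host, path, template ID and field name (all assumptions).
    request = build_create_item_request(
        "https://cm.example.test",
        "/sitecore/content/Home",
        "About",
        "{00000000-0000-0000-0000-000000000000}",
        {"Title": "About us"},
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

In a real session, the authentication cookie returned by the login call would presumably have to accompany the later item requests, and the security policy named above determines whether these routes are reachable at all.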
- Right-click on the intended website page and select ‘Inspect Element’, or press ‘F12’, to launch Developer Tools. Navigate to the ‘Web Harvester’ panel in Developer Tools.
- Specify the URLs to be imported, separated by semicolons. Ensure that all of these URLs share the same page DOM structure.
- Specify the Sitecore URL and authenticate, so that the extension can fetch the Sitecore content tree from the target Sitecore instance and import content into it.
- Specify the target content/media locations and the template used to create/update the item.
- Select the DOM elements whose content needs to be imported into Sitecore using the DOM picker icons, and map them to the appropriate template fields.
  - For Image & Link fields, once the DOM element is selected you might want to update the DOM path to add the attribute (e.g. @src, @href) from which the content needs to be pulled.
  - For Multilist fields, double-click the DOM picker icon to select multiple DOM elements, then click the icon again once the selection is complete. The selected contents are delimited by ‘|’ and pushed to Sitecore. Map the elements from the page/URL that has the maximum number of tags/options for the Multilist field, so that no tag/option is missed for any of the imported URLs.
- If necessary, configure Replace Options for each piece of DOM content individually by selecting the replace icon and specifying the find/replace texts.
  - Fields like Multilist, Droplink, Treelink, etc. require the GUID of the datasource Sitecore item instead of text. The Replace Options should include each text to find in the content and the datasource GUID to replace it with.
  - The default XML template used by Image & Link fields is auto-populated in the Replace Options once the field is selected, and can be updated if necessary.
- Any sub-items or dependent items can be added using the ‘Add another Mapping Section’ link.
- Click the Import button to extract content from the specified DOM elements and import it into Sitecore for all of the specified URLs.
- Once the import is complete, publish all the content locations along with their children.
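The Replace Options and Multilist behaviour described in the steps above can be sketched as follows. This is only an illustration of the idea (join the selected texts with ‘|’, then swap each text for a datasource GUID), not the extension's actual implementation. The function names and the GUIDs are made up for the example.

```python
def apply_replacements(text, rules):
    """Apply each (find, replace) pair to the text, in the order given."""
    for find, replace in rules:
        text = text.replace(find, replace)
    return text


def multilist_value(selected_texts, rules):
    """Join the selected DOM texts with '|' and map them to GUIDs."""
    joined = "|".join(t.strip() for t in selected_texts if t.strip())
    return apply_replacements(joined, rules)


if __name__ == "__main__":
    # Hypothetical tag names and datasource item GUIDs.
    rules = [
        ("News", "{11111111-1111-1111-1111-111111111111}"),
        ("Events", "{22222222-2222-2222-2222-222222222222}"),
    ]
    print(multilist_value([" News ", "Events"], rules))
```

Because this sketch uses plain substring replacement, a find text that is contained in another option (for example ‘News’ inside ‘Newsletter’) would also be replaced, so in a setup like this the longer texts should be listed first.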
- This Chrome Extension uses a few asynchronous jobs to validate and initiate the import, so the start of the import may be delayed depending on the number of URLs. It is recommended to keep the number of URLs below 250 per import operation to avoid such delays.
- For importing images, ensure that the IIS user has download permissions for the wwwroot folder. When using Docker, the following command can be included in the Dockerfile of the CM image:
RUN icacls 'C:\inetpub\wwwroot' /grant 'IIS_IUSRS:(F)' /t
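For larger migrations, the recommendation to stay below 250 URLs per import can be followed by preparing the URL list in batches beforehand. The sketch below is a hypothetical helper, not part of the extension: it splits a semicolon-separated URL string and groups the URLs into batches of at most 249, each of which can be pasted into the URL field as one import operation.

```python
MAX_URLS_PER_IMPORT = 249  # stays below the 250-URL guideline


def split_urls(raw):
    """Split a semicolon-separated string into a list of trimmed URLs."""
    return [url.strip() for url in raw.split(";") if url.strip()]


def batch_urls(urls, size=MAX_URLS_PER_IMPORT):
    """Group the URLs into consecutive batches of at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [urls[i:i + size] for i in range(0, len(urls), size)]


if __name__ == "__main__":
    # Placeholder URLs on an example domain.
    raw = ";".join(f"https://www.example.test/page-{n}" for n in range(600))
    for batch in batch_urls(split_urls(raw)):
        # Each printed line is one semicolon-separated import batch.
        print(";".join(batch))
```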
The source code for this module is available on GitHub. This module is built for the Sitecore community and doesn’t require any license. It doesn’t collect any of your information. Please check it out and let me know here if you have any feedback, issues or feature requests.
Thank you for using this module!