Alfresco content store cleanup

General information

In Alfresco, information about a document is stored in three places:
  • File system
  • Database
  • Lucene indexes

Content Life Cycle

Where the document information is stored

I created a content item in Alfresco by uploading a binary file. Each content item has a unique ID, which can be found by viewing the details of the uploaded content in the web client. (Format of the node ref: workspace://SpacesStore/a0ec9fcf-3775-4e2c-b3c0-d326bd8acf2b)
The information about the file is stored in the following places:
  1. In file system
    On my file system the content is located at “alf_data\contentstore\2013\7\26\14\29\5ad9140e-1534-48c3-83d7-3050a6b956e2.bin”
    Note: the alf_data location can be found in the "alfresco-global.properties" file
  2. In database
    The main information about the content (metadata) is stored in the alf_node table.
    SELECT an.id, an.store_id, als.protocol, als.identifier, an.uuid FROM alf_node as an, alf_store as als where als.id = an.store_id and an.uuid='a0ec9fcf-3775-4e2c-b3c0-d326bd8acf2b';
  3. Lucene index
    Alfresco stores the indexed data for the workspace store; on my system the path is “alf_data/lucene-indexes/workspace/SpacesStore/...”
    The Lucene index location can be found in the "alfresco-global.properties" file
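
The node ref shown above has three parts that line up with the columns returned by the database query (als.protocol, als.identifier, an.uuid). As a minimal sketch, the split can be expressed in Python; the NodeRef type and the parse_node_ref helper are hypothetical names used only for illustration and are not part of Alfresco:

```python
from typing import NamedTuple


class NodeRef(NamedTuple):
    protocol: str    # corresponds to alf_store.protocol, e.g. "workspace"
    identifier: str  # corresponds to alf_store.identifier, e.g. "SpacesStore"
    uuid: str        # corresponds to alf_node.uuid


def parse_node_ref(node_ref: str) -> NodeRef:
    """Split a 'protocol://identifier/uuid' string into its three parts."""
    protocol, sep, rest = node_ref.partition("://")
    identifier, slash, uuid = rest.partition("/")
    if not (sep and slash and protocol and identifier and uuid):
        raise ValueError(f"not a node reference: {node_ref!r}")
    return NodeRef(protocol, identifier, uuid)


ref = parse_node_ref("workspace://SpacesStore/a0ec9fcf-3775-4e2c-b3c0-d326bd8acf2b")
print(ref.protocol, ref.identifier, ref.uuid)
```

The uuid part is the value to use in the WHERE clause of the query.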

What happens when content is deleted

Now, let's delete that content from Alfresco.
  1. To delete the content, go to the Alfresco web client and click the delete icon.

Now the information about the file is stored in the following places: 
  1. In file system
    On my file system the content is still located at “alf_data\contentstore\2013\7\26\14\29\5ad9140e-1534-48c3-83d7-3050a6b956e2.bin”
  2. In database
     SELECT an.id, an.store_id, als.protocol, als.identifier, an.uuid FROM alf_node as an, alf_store as als where als.id = an.store_id and an.uuid='a0ec9fcf-3775-4e2c-b3c0-d326bd8acf2b';
     (The protocol column now shows that the node is in the archive store.)
  3. Lucene index
    The node is now in the archive store.

Once content is deleted, it is moved into the archive store.



Delete the content from the archive store



Even now the deleted content is not completely removed from the system. Although it is no longer available through the web client, it is still stored in the file system and the database.

  1. In file system
    On my file system the content is still located at “alf_data\contentstore\2013\7\26\14\29\5ad9140e-1534-48c3-83d7-3050a6b956e2.bin”
    Note: the alf_data location can be found in the "alfresco-global.properties" file
  2. In database
    SELECT an.id, an.store_id, als.protocol, als.identifier, an.uuid, an.node_deleted 
    FROM alf_node as an, alf_store as als 
    where als.id = an.store_id and an.uuid='a0ec9fcf-3775-4e2c-b3c0-d326bd8acf2b';
    The node_deleted column now shows ‘1’.

    To find all orphaned content (deleted content that has not yet been moved to the deleted store):
    SELECT * FROM alf_content_url where orphan_time is not null;
  3. Lucene index
    There is no longer an index entry, because we deleted the content from the archive store.
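
To check that an orphaned binary is still on disk, a row returned by the alf_content_url query can be mapped back to a file under the content store. The sketch below assumes that the content_url column holds values of the form store://2013/7/26/14/29/<guid>.bin and that the part after store:// is a path relative to the content store root; both points are assumptions to verify against your own installation, and content_url_to_path is a hypothetical helper name:

```python
from pathlib import PurePosixPath

# Assumed prefix of alf_content_url.content_url values.
STORE_PREFIX = "store://"


def content_url_to_path(content_url: str, store_root: str) -> PurePosixPath:
    """Map a 'store://...' content URL to a path under the content store root."""
    if not content_url.startswith(STORE_PREFIX):
        raise ValueError(f"unexpected content URL: {content_url!r}")
    relative = content_url[len(STORE_PREFIX):]
    # Reject anything that could escape the store root.
    if relative.startswith("/") or ".." in relative.split("/"):
        raise ValueError(f"unsafe content URL: {content_url!r}")
    return PurePosixPath(store_root) / relative


path = content_url_to_path(
    "store://2013/7/26/14/29/5ad9140e-1534-48c3-83d7-3050a6b956e2.bin",
    "alf_data/contentstore",
)
print(path)
```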

Now the content is orphaned

Once all references to a content binary have been removed from the metadata, the content is said to be orphaned. Orphaned content can be deleted or purged from the content store while the system is running. Identifying and either sequestering or deleting the orphaned content is the job of the contentStoreCleaner. 
In the default configuration, the contentStoreCleanerTrigger fires the contentStoreCleaner bean. The default configuration file is “\alfresco\WEB-INF\classes\alfresco\content-services-context.xml”:
<bean id="contentStoreCleaner" class="org.alfresco.repo.content.cleanup.ContentStoreCleaner" >
...
<property name="protectDays" >
 <value>14</value>
</property>
<property name="stores" >
 <list>
    <ref bean="fileContentStore" />
 </list>
</property>
<property name="listeners" >
 <list>
    <ref bean="deletedContentBackupListener" />
 </list>
</property>
</bean>
protectDays
Use this property to dictate the minimum time that content binaries should be kept in the contentStore. In the above example, if a file is created and immediately deleted, it will not be cleaned from the contentStore for at least 14 days. 
Note: To override protectDays, add the property system.content.orphanProtectDays=<number of days> to alfresco-global.properties.
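
The effect of protectDays can be worked through with a small date calculation. This is only a sketch of the rule as described above, not the cleaner's actual implementation; it assumes that orphan_time in alf_content_url is stored as milliseconds since the Unix epoch, and earliest_cleanup and is_eligible are hypothetical helper names:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def earliest_cleanup(orphan_time_ms: int, protect_days: int = 14) -> datetime:
    """Earliest moment at which an orphaned binary may be cleaned."""
    # Assumption: orphan_time is epoch milliseconds.
    orphaned_at = datetime.fromtimestamp(orphan_time_ms / 1000, tz=timezone.utc)
    return orphaned_at + timedelta(days=protect_days)


def is_eligible(orphan_time_ms: int, protect_days: int = 14,
                now: Optional[datetime] = None) -> bool:
    """True once the protection period for the orphaned binary has elapsed."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= earliest_cleanup(orphan_time_ms, protect_days)


orphaned = datetime(2013, 7, 26, 14, 29, tzinfo=timezone.utc)
orphan_ms = int(orphaned.timestamp() * 1000)
print(earliest_cleanup(orphan_ms))
```

With the default of 14 days, content orphaned on 26 July 2013 would not be touched by the cleaner before 9 August 2013.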
stores
This is a list of ContentStore beans to scour for orphaned content.
listeners
When orphaned content is located, these listeners are notified. In this example, the deletedContentBackupListener copies the orphaned content to a separate deletedContentStore.
Note that this configuration does not actually remove the files from the file system; it moves them to the designated deletedContentStore, usually contentstore.deleted. The files can be removed from the deletedContentStore via a script or cron job once an appropriate backup has been performed.
Note: the deletedContentStore (usually contentstore.deleted) is located under the alf_data location, which can be found in the "alfresco-global.properties" file
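
A cleanup script for the deletedContentStore could look like the following sketch. It is not shipped with Alfresco: the function name purge_deleted_store, the 30-day threshold and the alf_data/contentstore.deleted path are assumptions for illustration. It defaults to a dry run, so nothing is removed until dry_run=False is passed, and it should only be run after the backup has been verified.

```python
import time
from pathlib import Path
from typing import List, Optional


def purge_deleted_store(root: str, min_age_days: int, dry_run: bool = True,
                        now: Optional[float] = None) -> List[Path]:
    """Return files under root older than min_age_days; delete them unless dry_run."""
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400  # seconds per day
    matched = []
    for path in sorted(root_path.rglob("*")):
        # Age is judged by the file's last modification time.
        if path.is_file() and path.stat().st_mtime < cutoff:
            if not dry_run:
                path.unlink()
            matched.append(path)
    return matched


if __name__ == "__main__":
    # Hypothetical location; adjust to the alf_data directory of your installation.
    for old_file in purge_deleted_store("alf_data/contentstore.deleted", min_age_days=30):
        print("would remove", old_file)
```

Scheduled from cron, the dry-run output can be reviewed first, and the call switched to dry_run=False once the list looks right.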