There are lots of programming situations where it is necessary to break up a large amount of work into smaller, more manageable pieces or chunks. The technique is known as “Chunking” and is frequently used when a huge list needs to be iterated in batches of say, 100 items at a time.
Reasons For Chunking
Some scenarios where you might need to do chunking include:
- Perform operations in batches to limit memory usage and prevent peaking
- Chunking allows better management of failures: When a failure occurs, the process can resume from the last failure point instead of starting all over
- Run multiple threads (each processing a different chunk) to take advantage of multiple processors and/or cores
- Some third party APIs specifically require that you employ chunking to grab data. I find this one really interesting… so I’ll expand on it with an example:
Most methods/endpoints in the Twitter API are rate limited. The rate limits become even more noticeable if you’re working with very large Twitter accounts. For example, as of this writing, my Twitter account has more than 289,000 followers. When I write apps to retrieve a list of all my Twitter followers and query the results according to certain logic, I always need to carefully apply chunking techniques to prevent exceeding Twitter’s API rate limits. Apps that frequently exceed Twitter’s API rate limits get banned every time.
Chunking In C#
This helper method written in C# will split a generic List into chunks.
/// <summary> /// Break a List<T> into multiple chunks. The <paramref name="list="/> is cleared out and the items are moved into the returned chunks /// </summary> /// <typeparam name="T"></typeparam> /// <param name="list">The list to be chunked</param> /// <param name="chunkSize">The size of each chunk</param> /// <returns>A list of chunks</returns> public static List<List<long>> BreakIntoChunks(List<long> list, int chunkSize) { if (chunkSize <= 0) { throw new ArgumentException("Chunk size must be greater than 0"); } List<List<long>> retVal = new List<List<long>>(); while (list.Count > 0) { int count = list.Count > chunkSize ? chunkSize : list.Count; retVal.Add(list.GetRange(0, count)); list.RemoveRange(0, count); } return retVal; }
Chunking In C#: Another Method
If you prefer to copy items from the big list to the small chunks while keeping the original items in the big list, use this version instead.
NOTE: This version could be said to be less efficient since it keeps duplicate items in memory.
/// <summary> /// Splits a List<T> into multiple chunks /// </summary> /// <typeparam name="T"></typeparam> /// <param name="list">The list to be chunked</param> /// <param name="chunkSize">The size of each chunk</param> /// <returns>A list of chunks</returns> public static List<List<T>> SplitIntoChunks<T>(List<T> list, int chunkSize) { if (chunkSize <= 0) { throw new ArgumentException("Chunk size must be greater than 0"); } List<List<T>> retVal = new List<List<T>>(); int index = 0; while (index < list.Count) { int count = list.Count - index > chunkSize ? chunkSize : list.Count - index; retVal.Add(list.GetRange(index, count)); index += chunkSize; } return retVal; }
Calling The Helper Method
Here’s an example of how you would call the BreakIntoChunks helper method described earlier…
Let’s say I have a C# list containing the IDs (long data type) of all my Twitter followers. Say this list is called FollowerIDs.
Now I want to get another list (called FollowerObjects) containing the full user object (not just the ID) of each of the items in the FollowerIDs list (let’s assume the object data type is IUser).
The Twitter endpoint that provides this information allows for a maximum of 100 user IDs at a time (as of this writing of course).
I will proceed like this:
List<List<long>> ListOfChunks = BreakIntoChunks(FollowerIDs, 100); List<IUser> FollowerObjects = new List<IUser>(); foreach (List<long> ChunkOfHundred in ListOfChunks) { // Do something with ChunkOfHundred. // That "something" will involve calling the appropriate Twitter API to get user objects based on user IDs. // It will also add pauses as necessary to prevent exceeding Twitter's API rate limits. // Add the results of the call/process to the FollowerObjects list. }
Leave a Reply