Friday, January 31, 2014

Making of "Year of the Horse" Greeting video

Today is the first day of the Chinese New Year 2014. It is the year of the horse. If you haven't done so already, go check out the greeting video I created for this.


The story of how this video came to be goes back over 12 years...


In the Beginning
I guess the idea to make this greeting video really came about at the start of 2001. Back in those days, the internet was still very much in its infancy (despite it being the middle of the dotcom bubble of that era). Most people were only just starting to discover "cyberspace", taking halting steps around via their dialup modems (with their slow speeds, low bandwidth, and one critical weakness - a phone off the hook). When it came to communication, email was still very much in vogue, as social networks didn't exist yet, and "cloud computing" could probably only be used to describe a select few businessmen lugging around their heavy, short-battery-life laptops to use on a plane.

So, it is fitting that this whole story begins with a single email - or rather, the attachment it contained. Specifically, it was a little executable (back in the day, the world really hadn't woken up yet to the dangers of receiving unknown executable files in your inbox and just running them) - or rather, an executable Flash file containing an interactive greeting for the Year of the Snake (2001). Sure, the graphics and general execution of the thing were really tacky. And perhaps precisely because it was so crappy, I decided that I could do better for the following year. Thus, the project began.

Techniques and Technology - 2002 Edition
As alluded to in my post on recovering the raw frames used in this video from my Windows 98 computer, back then, I was familiar with three main technologies which I used to create the first version of this greeting: Microsoft Paint, Microsoft Agent / Agent Character Editor, and Visual Basic 6.

With this combination of software, I cobbled together an interactive greeting card/app whose main feature was a little skit starring an animated, talking horse character (complete with computer-generated audio, speech bubbles, and automatic lip-syncing). There were also a few other little bits and pieces that you could access from within the GUI, but the main attraction was always the talking horse:
The original incarnation - Horsey along with speech bubble and the control app in the background.

The workflow for creating this animation went something like this:

First, I'd create the character - specifically, an image of the character in its "Rest Pose" (using MsPaint). Along with this, I'd need to create an image containing the colour palette - the set of colours that could be used for the character and its props. This was limited to 220-something colours IIRC (20 were reserved for system colours, and 1 was used as the alpha colour, since .bmp's don't have an alpha channel).
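For anyone wanting to recreate this constraint with modern tools, here's a minimal sketch of the idea in Python/Pillow (purely illustrative - the filenames and exact colour counts are hypothetical, not from the original workflow):

    # Quantise a frame down to a fixed-size palette, leaving room for
    # reserved system colours plus the one alpha/chromakey colour.
    from PIL import Image

    RESERVED = 20 + 1                    # system colours + the chromakey slot
    MAX_COLOURS = 256 - RESERVED         # what's left for the character art

    frame = Image.open("rest_pose.bmp").convert("RGB")   # hypothetical file
    paletted = frame.quantize(colors=MAX_COLOURS)        # build an adaptive palette
    paletted.save("rest_pose_paletted.bmp")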

From this rest pose, I'd start "animating" the character (again in MsPaint) in a straight-ahead fashion (vs pose to pose - a distinction I didn't know about until many years later). This was done by saving a new file for the first frame of the new action, and then nudging parts around using a combination of box select and hand-painting (particularly over the connection point between the rest of the body and the moved parts).

For some frames, extra elements would be needed (e.g. for the galloping/walking, or drinking actions). In those cases, those extra elements were created in a separate file (where they could be tweaked in isolation with greater precision - e.g. by making it easier to gradually shear the pixels to create rotation effects), and then pasted back into the appropriate frames later.

To save effort drawing duplicate frames, much of the actual "timing" work occurred when I virtually "shot" the animation (or rather, composed the frames together) using the MsAgent Character Editor (ACE). This was usually the first and only place where I got to see the frames in action and check on how the animation was going. As a result, the animation process often involved step-by-step forward animation in MsPaint, followed by importing the frames into ACE to check the timing and whether it was flowing ok. From this initial preview, some frames would need to be duplicated to get the timing to read better, while others might need to be dropped or redrawn to get the desired results.

Finally, once I had some actions put together, and had set up the voice/mouth-shapes/speech-bubble theming settings, I would get to work in Visual Basic to create the application to host the little skit. Using a basic piece of example code I found somewhere (probably as part of the MsAgent documentation - it was interesting coming across this again the other day when cleaning out an old bookshelf) as the skeleton, I'd cobble together a new character-specific version by replacing all the generic/inappropriate parts. Examples of atrocities performed in this code include repeating the GUID for the character once per invocation of an MsAgent API call, and the use of absolute paths (more on this in a bit).

With a bit of back-and-forth between the Visual Basic script and ACE to add new segments and/or refine existing ones, I arrived at a version of the skit that was (IMO at the time, of course ;) ) ready to present to the world.

Distribution Woes
The next step was to try to distribute it to people - mainly family/relatives scattered around the globe.

Perhaps it's necessary to first have a brief refresher lesson on the state of the internet in those days. It was the early 2000's. Internet adoption was still just starting to pick up, with most people still on sluggish dialup connections. Social media and cloud services did not exist yet, and the spectre of the dot-com bust was starting to loom on the horizon... And email still ruled the roost, both as a novelty and as the core workhorse for sharing stuff among friends and family. (For more nostalgic details, check out this post.)

Back then, the world at large had yet to wake up to the dangers of stuff sent as email attachments - particularly of the "executable" kind. In fact, if you wanted to send interactive/animated content, an executable was probably your best bet.

2014 Revival - Overview
Not counting the time it took to research and perform the retrieval of the raw frames and files from my old Windows 98 box, the actual production process of this video took a total of about 8 hours of work.


First Step - Grunt Work
The first step of the revival process turned out to be just a lot of monkey-work to get the frames into a usable/useful modern format.

To be specific, the frames were saved as Windows Bitmaps (.bmp's - the native format of MsPaint at the time), since MsAgent only accepted those as its input files. Since this format doesn't have an alpha channel, MsAgent used a hack where you specify a "chromakey" colour that acts as the alpha colour in the bitmaps.

However, we now have modern formats which work a lot better and support proper alpha - in particular, the PNG (Portable Network Graphics) format, my current favourite. Back in the Windows 98 days though, this was one of the newest formats on the block, and could barely be read by any tools at the time (save for IE, with a really awkward and ugly icon used to represent these files).

So, to bring things up to speed, I needed to take each of the raw frames (in .bmp format), remove the green backgrounds, and then resave the frames as PNGs with proper transparency baked into the files (these were dumped into a separate directory, to avoid confusion). This was done in Paint.NET (since I'm still not sure whether regular MsPaint in Windows 8 is able to correctly handle alpha channels yet).

The "Threshold" setting on the Fill tool there turned out to be the key to making this process quick and easy. By trial and error, I found that this value needs to be < 50:  30-40 is ok for most frames, but 20 or so is needed for the grass (or else the grass would be included in as part of the background.


Second Step - Manual Sequencing of Frames 
Using the agent character file and the old VB script/code as a base, I proceeded to sequence up all the images using the Blender Sequencer. Checking the agent character file settings, I found that it had been animated at 10 FPS. After a quick test, I realised that I'd need to do the animation with a frame/window size that was the same as the one used for the original animation, as defined in the character file (i.e. 281 x 256), since the walking/galloping sequences used offset effects, which required the frame clipping effects to work correctly.
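For reference, setting up a matching scene in a recent Blender via Python looks something like this (a minimal sketch only - the numbers are the ones taken from the character file above):

    import bpy

    scene = bpy.context.scene
    scene.render.resolution_x = 281    # window size from the character file
    scene.render.resolution_y = 256
    scene.render.resolution_percentage = 100
    scene.render.fps = 10              # the original was animated at 10 FPS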

My video editing workspace - I basically assembled the animation in the Blender sequencer. The layout of the strips is as follows: 
   - Track 0: Sound FX - Galloping/Walking sounds
   - Track 1: Main Speech/Audio track
   - Track 2: Additional Audio
   ----------------------------------------
   - Track 4: Main Animation
   - Track 5: Mouth shapes overlays
   - Track 6: Additional mouth shape overlays 

This turned out to be a good 2-3 hours of work, as all the lip-syncing had to be redone by hand, in context. To do this in an economical way, and within the confines of the mouth shapes I had prepared for the character in the past, I ended up approaching this using the "muppet jaw" technique (i.e. place your hand/fist under your jaw, read the text out a few times at full speed - that part is important - trying to feel when the jaw moves, and how open it is every time its position changes). Done right, this produces an effect good enough to convince viewers that the character is speaking the words being heard. Hell, it works for the Muppets, and they're big and popular cultural icons!

Here we see a zoomed-in view of the timeline. There are several notable points here: 
   1) I basically worked from start to finish, in linear order, adding frames one by one until everything was ready. Since I didn't really have any idea how long things would run, forward planning (blocking out sections in advance and then jumping around) wasn't a feasible workflow. 
   2) In the first part, I started out making things hold longer by simply duplicating the strips. However, I soon switched to just stretching the strips out instead.   
   3) In the speaking sections, I'd start by loading up the audio in one of the lower tracks (so that it would be closer to the frames/markers, where it could be lined up more easily). Then, I'd take the default pose and stretch it out across that time range in the main animation track; mouth shapes were added in the track above this on an as-needed basis (see the sketch after this list). I organically settled on this workflow, as I figured that doing it this way would make it easier to see where exactly the dialog sections were. Besides, doing it this way is much less work - you don't have to worry about when the mouth is closed (i.e. only when the mouth is open do you need to specify anything extra/special).
   4) This overlay-on-base technique was so successful that I ended up repeating it in the next section (seen at far right - i.e. the "drinking" scene), where using the lower strips really helped to keep track of what exactly was going on, since that featured a repeating set of frames. Without doing this, I'm sure I would've gone mad!
   5) Markers were used to delineate different shots. I needed some way of doing this to make it easier to get around the timeline. A side effect of this decision though was that it wasn't such a great idea to use the markers for annotating smaller things within the animation process, which is another reason why I ended up with the held-strip method in 3 and 4. 
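To make the held-base + overlay approach from points 3 and 4 concrete, here's a minimal Python sketch of how such strips could be laid down via Blender's sequencer API (all names, paths, channels, and frame numbers here are hypothetical):

    import bpy

    scene = bpy.context.scene
    scene.sequence_editor_create()
    seq = scene.sequence_editor.sequences

    # Audio goes in a low track, close to the frame numbers/markers
    seq.new_sound("line01", "//audio/line01.wav", channel=2, frame_start=40)

    # Hold the default pose across the dialog range by stretching one strip...
    base = seq.new_image("rest_pose", "//frames_png/rest.png", channel=5,
                         frame_start=40)
    base.frame_final_end = 90

    # ...and overlay mouth shapes above it, only where the mouth is open
    for start, end, shape in [(42, 45, "mouth_wide"), (47, 49, "mouth_narrow")]:
        mouth = seq.new_image(shape, "//frames_png/%s.png" % shape, channel=6,
                              frame_start=start)
        mouth.frame_final_end = end
        mouth.blend_type = 'ALPHA_OVER'   # composite over the held base pose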

Although cross-referencing the frames this way was tough, it was really quite a lot of fun once I got into the swing of animating this way. Working at such a low framerate is in many ways quite liberating: you stop worrying about smoothness and/or key poses, and focus more on just getting the relative timing right and conveying the core intentions of the scene, using a strictly limited set of pre-defined poses. I need to explore this technique further, and given my agent character backlog, there could be quite a few followup revival projects to come should the need/want arise.

Framing Video
Once the basic animation was laid down, I needed to export that sequence, and then import it into a "framing" video to bring it up to a more acceptable resolution. Also, the animation had no background colour, so it defaulted to either black or white - neither of which reads that well. And, since the video was effectively square, and had visible clipping on either side (from the entrances and exits), I needed a way to make this less obvious/distracting.

The lanterns overlay image

Enter the lanterns, which I managed to quickly model in Blender (all within less than 30 minutes or so, from conception to completion), especially with the simple low-poly style I used. Originally, the plan was to just have one lantern on each side, but I soon found that I needed something below them to help balance things out a bit better (normally those stringy dangly things, which would have been too much work to model). Seeing as it was a new year video, I decided on firecrackers instead - which are much simpler to model.


Putting it All Together - Breakdown of the Framing Video
To give a better picture of how exactly this worked, here is a little breakdown of how the various assets were assembled together to create the final product.

A breakdown of how various assets were assembled into the final "framing video" using the Blender Sequencer. Elements 1 and 3 were static images placed above and below the video they were framing (i.e. element 2). Element 2 was the animated sequence rendered out during the previous step - this is the undersized horse animation clip. Element 4 was one of 3 images overlaid at certain points in the clip. There is also another element/overlay not shown here in the breakdown (i.e. the copyright watermark), which went above the lanterns layer, but below the blessings.
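In sequencer terms, that layering boils down to stacking strips in channels and compositing with "Alpha Over". Here's a rough sketch of the idea (names and paths are hypothetical, and in reality the horse clip was a PNG image sequence with alpha rather than a single movie file):

    import bpy

    scene = bpy.context.scene
    scene.sequence_editor_create()
    seq = scene.sequence_editor.sequences

    # Channels stack bottom-to-top
    bg    = seq.new_image("background", "//framing/background.png",
                          channel=1, frame_start=1)
    horse = seq.new_movie("horse_anim", "//renders/horse_anim.avi",
                          channel=2, frame_start=1)
    front = seq.new_image("lanterns", "//framing/lanterns.png",
                          channel=3, frame_start=1)
    bless = seq.new_image("blessing_01", "//framing/blessing_01.png",
                          channel=5, frame_start=100)

    # Everything above the background gets composited over what's below it
    for strip in (horse, front, bless):
        strip.blend_type = 'ALPHA_OVER'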

The sequencer layout for the framing video. Use the breakdown as a guide for understanding how this was all put together. As for the sound clip, I ended up obtaining that by importing a video-export of the animation sequence (which couldn't be used for the visuals, as it lacked an alpha channel and had inferior image quality), since there wasn't any other way to get an audio track out of Blender 2.69 (this was needed, as the lipsync accuracy depended on it).


Script Revisions
After completing this and watching it, it became obvious that the old script wasn't quite working that well anymore. Although I'd already tried tweaking it in places to get better flow + timing out of the performance as I imported it (based on having watched the original on its original host, combined with experience I've gained over the years), there were still some clunky parts.

In particular, there was one particular gag that, looking at it now, turned out to be pretty crude/nasty culturally, especially during the CNY period. So, that ended up being cut and replaced with the "turning into money" gag. Of course, this meant that I now had to animate some new material, as those frames didn't exist at all yet. For this, I ended up using Paint.NET, which gave me the option of having alpha on the brushes, plus various other advanced tools - all of which I put to good use.


Inaudible Audio - Replacing the TTS with Live Recordings
The original character was literally able to "speak" to the audience, thanks to the MsAgent technology's use of the Lernout & Hauspie TTS engine. At the time, this was a real blessing, as it made it possible for me to create dialog to go along with my little skits - no mean feat considering that back then, computers didn't come with built-in microphones, and even if I had had a microphone, I lacked software to record with (if you don't count the Windows Sound Recorder, which was kind of crap, as with most other things like that). How things change...

To try to recreate the original experience, I ended up trying to find ways of installing that antiquated/nearly extinct TTS engine on some modern hardware, where I could feed the audio into some of my modern software, which lets me record things with much greater ease. The first major hurdle was trying to actually locate a functioning copy of the thing (since MS officially dropped the whole MsAgent technology stack 5-6 years ago already, and L&H actually went bankrupt after a chain of being bounced around by high-trading bean counters). Once that was in place, I ended up using the "Test Voice" feature in ACE to get it to read out lines of the script (since I couldn't really manage to hack up an executable to interface with the MsAgent stuff, having not really used any pure Windows dev tools for the better part of a decade now, and especially not on my current machine), which I then piped into Audacity to be recorded as if it were coming from the microphone. Admittedly, the quality wasn't quite as good as I remembered it, with a few minor but irritating "harmonics", "squealing", and "other noise" in the background, usually AFTER the text had ended, along with a few other weird clicks and so forth in between. However, in isolation, it "seemed" to work ok enough...

After a few test screenings though, it became obvious that the computer-generated TTS voice output was unintelligible - particularly for most of the target audience. For instance, the line "I feel great" started sounding like "I speak greek" when I listened to it again after dinner. Another one came out as a weird mix of Chinese and English (i.e. "may all your wishes come true" became "lei ho your wishes come true" - translation: "hello, your wishes come true"). I dunno if it was something in the food that night, or whether it was just a clearer head afterwards, but it was clear at that point that this wasn't going to work!

So, in the end, I ended up rerecording some live audio - something that I should probably have just done in the first place, saving myself a whole lot of trouble. This of course required the lipsync to be changed again, for the third time! In some parts it was just a matter of retiming the bits (i.e. some got longer, others shorter), while in other parts, the sound actually required larger or smaller mouth shapes to feel right. Also, I ended up adding some four-letter blessings to be associated with some of the audio, so that it would be even more meaningful for the audience.


Old Vs New - Main Differences
By and large, the final product still features most of the original frames. About 3-5 were omitted, with about 5-8 new ones created. In particular, look out for:
  1) The bucket changing to a lump of gold. This replaces the old ending for that segment, which in retrospect was quite crude and inauspicious!
  2) The moments when the grass actually touches the teeth and starts to disappear. In the original, the grass didn't go anywhere near the mouth; it was simply too hard for me to try to animate at the time.

Apart from that, perhaps the part(s) which changed the most were the messages associated with each segment, which were rewritten to be much more meaningful from a CNY perspective. For example, instead of deadpanning "I am hungry" followed by "Burp! Excuse me!", this was replaced with "Let's Celebrate / Fung Yee Juk Sik".


FFMPEG Woes
It would not be a complete production without fighting with Blender's FFMPEG render output at least once. Originally I rendered using a slightly modified version of the settings I found successful for my previous project. However, the video quality was bad (i.e. highly noisy and degraded).

So, I tried rendering again, this time upping the bitrates (as per the later part of those notes). At this point, it successfully rendered out all the final framed frames, only to spew out a cryptic error at the end of the process and bail out, without having produced any output! "VBV buffer too small!", or something to that effect. Even after adjusting the bitrate settings as I did last time, it kept spewing out that message! Despite several attempts to Google that error message, nothing came up...

In the end, it turned out that the culprit was the "Buffer" / "Rate Control" setting below the Min/Max Bitrate fields. Talk about obscure internal names ("VBV buffer") with no obviously matching UI widget to tweak!

Final FFMPEG Settings:
- Output Container = MPEG
- Encoding Format = MPEG-4
- Bitrate = 20,000
- Min Rate = 0
- Max Rate = 30,000
- Buffer = 2000 (up from 1792 or so)

(NOTE: for reference, the dimensions were 576 x 324 at 10 FPS for this video. This non-standard size + speed combo was probably what necessitated the larger rate-control buffer size.)
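For anyone wanting to replicate this via Python, the equivalent settings in a recent Blender look roughly like this (a sketch only; "MPEG" in the old output-format dropdown corresponds to the FFMPEG file format here):

    import bpy

    rd = bpy.context.scene.render
    rd.resolution_x, rd.resolution_y = 576, 324
    rd.fps = 10

    rd.image_settings.file_format = 'FFMPEG'   # the "MPEG" output container
    rd.ffmpeg.format = 'MPEG4'                 # Encoding Format = MPEG-4
    rd.ffmpeg.codec = 'MPEG4'
    rd.ffmpeg.video_bitrate = 20000
    rd.ffmpeg.minrate = 0
    rd.ffmpeg.maxrate = 30000
    rd.ffmpeg.buffersize = 2000                # the "VBV buffer" the error complained about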


Summary
After some 12 years in the making, I'm pleased to finally be able to complete this project and fulfill a dream that's been a long time coming.

恭喜發財!  心想事成!
(Gong Hay Fat Choy! Sum Seung See Sing!)
