Importing Excel Files with Images

July 9, 2020

Every now and then your PO (Product Owner), business analyst or project leader will ask you to import an Excel file. You will decline, he will insist and, in the end, you will give in.

When the Excel file contains images, importing is not as straight forward as we might think.

Excel File

Let's assume that we have a simple Excel file containing a list of images with their names.

excel

The first naive approach is of course to export the file as csv (comma separated values) and hope that the images are properly serialized. This doesn't work.

We have to look at alternatives. As a seasoned Java engineer, you probably heard of Apache POI. We will tackle the problem with this library.

Apache POI

According the Apache POI's website, the mission of the project is to create and maintain Java APIs for manipulating various file formats based upon the Office Open XML standards (OOXML) as well as Excel 97-2008 format (OLE2). In short, you can read and write MS Excel, Word and PowerPoint files using Java.

Office OpenXML format is the new XML-based file format found in Microsoft Office from 2007 on. This includes XLSX.

For each MS Office application, there exists a component module that attempts to provide a common high level Java API to both OLE2 and OOXML document formats. For Excel, the two formats are: HSSF (Horrible SpreadSheet Format) for OLE2 and XSSF (XML SpreadSheet Format) for OOXML.

HSSF is the POI Project's pure Java implementation of the Excel 97(-2007) (.xls) file format. XSSF is the POI Project's pure Java implementation of the Excel 2007 OOXML (.xlsx) file format. HSSF and XSSF provide ways to read, create, modify, and write Excel spreadsheets.

From now on, we will concentrate on reading the Excel XLSX file shown above with the XSSF format.

Parsing Excel Files

Now we will go through the code for reading an Excel file, extracting the data, and processing images.

Reading a File

InputStream in = Main.class.getResourceAsStream("/images.xlsx");

Java has many ways to read a file. For this exercise, we will just read a file located in the class path.

XSSF Workbook

try (XSSFWorkbook workbook = new XSSFWorkbook(in)) {
    XSSFSheet sheet = workbook.getSheetAt(0);
    ...
}

This constructs a workbook object by buffering the whole stream in memory. The sheet represents the first sheet of the workbook.

Extracting Images

XSSFDrawing patriarch = sheet.createDrawingPatriarch();
List<XSSFShape> shapes = patriarch.getShapes();
Map<Integer, byte[]> imageByLocations = shapes.stream()
        .filter(Picture.class::isInstance)
        .map(s -> (Picture) s)
        .map(this::toMapEntry)
        .collect(toMap(Pair::getKey, Pair::getValue));

This is the tricky part. Images are not saved in the sheet's cells but in a SpreadsheetML drawing called XSSFDrawing in POI's lingo. You can think of SpreadsheetML drawing as a canvas on which to position shapes. From the patriarch (the SpreadsheetML drawing) you can fetch all the shapes. Shapes can be pictures, but also connectors, graphic frames and more. As we are only interested in pictures, we filter the shapes .filter(Picture.class::isInstance). The next step is to find out on which row the picture is located this::toMapEntry. Finally, we build a map containing the row number with the image bytes.

Finding Image Location

Pair<Integer, byte[]> toMapEntry(Picture picture) {
    byte[] data = picture.getPictureData().getData();
    ClientAnchor anchor = picture.getClientAnchor();
    return Pair.of(anchor.getRow1(), data);
}

As the shapes are not placed in cells, we need to find where they are anchored. The top left corner of the shape (anchor.getRow1()) gives us the row number. We build an intermediate object org.apache.commons.lang3.tuple.Pair which allows us to build the map.

Iterating over the Rows

List<Image> images = StreamSupport.stream(sheet.spliterator(), false)
        .filter(this::isNotHeader)
        .map(row -> new Image(row.getCell(0).toString(), imageByLocations.get(row.getRowNum())))
        .collect(toList());

We iterate over each row to extract the cell content. In our case, only the file name is stored in the first cell row.getCell(0).toString() (index 0). With the row number, we can get the image from the map we built in the previous step.

Saving the Pictures

images.forEach(image -> {
        String pathname = String.format("%s/%s.jpg", System.getProperty("user.home"), image.fileName);
        FileUtils.writeByteArrayToFile(new File(pathname), image.bytes);
    });

The final step is to write the images to the file system. For the sake of the example, we will write the image in the home folder.

Conclusion

There are many reasons not to import Excel files. If the files are man-made, you usually end up rewriting your Excel parser with each version. If they are generated, you should rather consider a programmatic interface. Some messaging infrastructure would be my preferred choice, but a REST endpoint could do the trick as well.

For some reasons, the business people like Excel; Probably because it is a familiar application. For software engineers, it is usually the worst case scenario. Luckily for us, Apache POI offers a convenient way to import Excel files.

Throughout the post, we assumed that the images are in a readable format. Unfortunately, they might not, as MS Office uses EMF format as the default image format; There is no simple way to transform an image from EMF to something more universal such as PNG or JPG. So be sure the Excel file you plan to import contains valid image formats.

Appendix

Here is the complete source code of this blog post.

public class Main {
    public static void main(String[] args) throws IOException {
        InputStream in = Main.class.getResourceAsStream("/images.xlsx");
        new Main().importExcel(in);
    }

    void importExcel(InputStream in) throws IOException {
        try (XSSFWorkbook workbook = new XSSFWorkbook(in)) {
            XSSFSheet sheet = workbook.getSheetAt(0);
            XSSFDrawing patriarch = sheet.createDrawingPatriarch();
            List<XSSFShape> shapes = patriarch.getShapes();
            Map<Integer, byte[]> imageByLocations = shapes.stream()
                    .filter(Picture.class::isInstance)
                    .map(s -> (Picture) s)
                    .map(this::toMapEntry)
                    .collect(toMap(Pair::getKey, Pair::getValue));

            List<Image> images = StreamSupport.stream(sheet.spliterator(), false)
                    .filter(this::isNotHeader)
                    .map(row -> new Image(row.getCell(0).toString(), imageByLocations.get(row.getRowNum())))
                    .collect(toList());

            images.forEach(image -> {
                try {
                    String pathname = String.format("%s/%s.jpg", System.getProperty("user.home"), image.fileName);
                    FileUtils.writeByteArrayToFile(new File(pathname), image.bytes);
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
        }
    }

    boolean isNotHeader(Row row) {
        return !row.getCell(0).toString().toLowerCase().contains("name");
    }

    Pair<Integer, byte[]> toMapEntry(Picture picture) {
        byte[] data = picture.getPictureData().getData();
        ClientAnchor anchor = picture.getClientAnchor();
        return Pair.of(anchor.getRow1(), data);
    }

    static class Image {
        String fileName;
        byte[] bytes;

        Image(String fileName, byte[] bytes) {
            this.fileName = fileName;
            this.bytes = bytes;
        }
    }
}
About the author: Frank Bommeli

Frank is a passionate skier and loves travelling. A convinced clean code and clean architect practitioner.

Comments
Join us