I would like to extract text from a given PDF file with Apache PDFBox.
I wrote this code:
PDFTextStripper pdfStripper = null;
PDDocument pdDoc = null;
COSDocument cosDoc = null;
File file = new File(filepath);
PDFParser parser = new PDFParser(new FileInputStream(file));
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(5);
String parsedText = pdfStripper.getText(pdDoc);
System.out.println(parsedText);
However, I got the following error:
Exception in thread "main" java.lang.NullPointerException
at org.apache.fontbox.afm.AFMParser.main(AFMParser.java:304)
I added pdfbox-1.8.5.jar and fontbox-1.8.5.jar to the class path.
Edit
I added System.out.println("program starts"); to the beginning of the program.
I ran it, then I got the same error as mentioned above and program starts did not appear in the console.
Thus, I think I have a problem with the class path or something.
Thank you.
解决方案
I executed your code and it worked properly. Maybe your problem is related to FilePath that you have given to file. I put my pdf in C drive and hard coded the file path.here is my code:
// PDFBox 2.0.8 require org.apache.pdfbox.io.RandomAccessRead
// import org.apache.pdfbox.io.RandomAccessFile;
public class PDFReader{
public static void main(String args[]) {
PDFTextStripper pdfStripper = null;
PDDocument pdDoc = null;
COSDocument cosDoc = null;
File file = new File("C:/my.pdf");
try {
// PDFBox 2.0.8 require org.apache.pdfbox.io.RandomAccessRead
// RandomAccessFile randomAccessFile = new RandomAccessFile(file, "r");
// PDFParser parser = new PDFParser(randomAccessFile);
PDFParser parser = new PDFParser(new FileInputStream(file));
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(5);
String parsedText = pdfStripper.getText(pdDoc);
System.out.println(parsedText);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}