[1/3] utils/md5_file: don't iterate line-by-line

Message ID 20180813172054.17767-1-ross.burton@intel.com
State New

Commit Message

Burton, Ross Aug. 13, 2018, 5:20 p.m.
Opening a file in binary mode and iterating it seems like the simple solution
but will still break on newlines, which for binary files isn't really useful as
the size of the chunks could be huge or tiny.
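
To illustrate the point about uneven chunks, here is a minimal sketch that is not part of the patch. The payload is made up, and `io.BytesIO` stands in for a file opened with "rb", on the assumption that both iterate the same way, splitting on newline bytes.

```python
import io

# Hypothetical binary payload: newline bytes fall at arbitrary offsets.
data = b"\x00" * 5 + b"\n" + b"\xff" * 100000 + b"\n" + b"\n" + b"tail"

# Iterating a binary stream yields "lines": each item ends at the next
# b"\n" (or at end of data), so item sizes depend entirely on the content.
line_sizes = [len(line) for line in io.BytesIO(data)]
print(line_sizes)  # prints [6, 100001, 1, 4]
```

Every byte is still hashed, but each `update()` call receives whatever happens to lie between two newline bytes, which may be a single byte or most of the file.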

Instead, let's be a bit more clever: we'll be MD5ing lots of files, but we don't
want to fill up memory: use mmap() to open the file and read the file in 8k
blocks.

Signed-off-by: Ross Burton <ross.burton@intel.com>

---
 bitbake/lib/bb/utils.py | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

-- 
2.11.0

-- 
_______________________________________________
Openembedded-core mailing list
Openembedded-core@lists.openembedded.org
http://lists.openembedded.org/mailman/listinfo/openembedded-core

Comments

akuster808 Aug. 13, 2018, 6:03 p.m. | #1
On 08/13/2018 10:20 AM, Ross Burton wrote:
> Opening a file in binary mode and iterating it seems like the simple solution
> but will still break on newlines, which for binary files isn't really useful as
> the size of the chunks could be huge or tiny.
>
> Instead, let's be a bit more clever: we'll be MD5ing lots of files, but we don't
> want to fill up memory: use mmap() to open the file and read the file in 8k
> blocks.
>
> Signed-off-by: Ross Burton <ross.burton@intel.com>

Shouldn't this go to the bitbake mailing list?

> ---
>  bitbake/lib/bb/utils.py | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/bitbake/lib/bb/utils.py b/bitbake/lib/bb/utils.py
> index 9903183213b..b20cdabcf01 100644
> --- a/bitbake/lib/bb/utils.py
> +++ b/bitbake/lib/bb/utils.py
> @@ -524,12 +524,17 @@ def md5_file(filename):
>      """
>      Return the hex string representation of the MD5 checksum of filename.
>      """
> -    import hashlib
> -    m = hashlib.md5()
> +    import hashlib, mmap
>
>      with open(filename, "rb") as f:
> -        for line in f:
> -            m.update(line)
> +        m = hashlib.md5()
> +        try:
> +            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
> +                for chunk in iter(lambda: mm.read(8192), b''):
> +                    m.update(chunk)
> +        except ValueError:
> +            # You can't mmap() an empty file so silence this exception
> +            pass
>      return m.hexdigest()
>
>  def sha256_file(filename):

Burton, Ross Aug. 13, 2018, 6:04 p.m. | #2
Yeah, just sent it there, sorry.

On 13 August 2018 at 19:03, akuster808 <akuster808@gmail.com> wrote:
> On 08/13/2018 10:20 AM, Ross Burton wrote:
>> Opening a file in binary mode and iterating it seems like the simple solution
>> but will still break on newlines, which for binary files isn't really useful as
>> the size of the chunks could be huge or tiny.
>>
>> Instead, let's be a bit more clever: we'll be MD5ing lots of files, but we don't
>> want to fill up memory: use mmap() to open the file and read the file in 8k
>> blocks.
>>
>> Signed-off-by: Ross Burton <ross.burton@intel.com>
>
> Shouldn't this go to the bitbake mailing list?
>
>> ---
>>  bitbake/lib/bb/utils.py | 13 +++++++++----
>>  1 file changed, 9 insertions(+), 4 deletions(-)
>>
>> diff --git a/bitbake/lib/bb/utils.py b/bitbake/lib/bb/utils.py
>> index 9903183213b..b20cdabcf01 100644
>> --- a/bitbake/lib/bb/utils.py
>> +++ b/bitbake/lib/bb/utils.py
>> @@ -524,12 +524,17 @@ def md5_file(filename):
>>      """
>>      Return the hex string representation of the MD5 checksum of filename.
>>      """
>> -    import hashlib
>> -    m = hashlib.md5()
>> +    import hashlib, mmap
>>
>>      with open(filename, "rb") as f:
>> -        for line in f:
>> -            m.update(line)
>> +        m = hashlib.md5()
>> +        try:
>> +            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
>> +                for chunk in iter(lambda: mm.read(8192), b''):
>> +                    m.update(chunk)
>> +        except ValueError:
>> +            # You can't mmap() an empty file so silence this exception
>> +            pass
>>      return m.hexdigest()
>>
>>  def sha256_file(filename):


Patch

diff --git a/bitbake/lib/bb/utils.py b/bitbake/lib/bb/utils.py
index 9903183213b..b20cdabcf01 100644
--- a/bitbake/lib/bb/utils.py
+++ b/bitbake/lib/bb/utils.py
@@ -524,12 +524,17 @@  def md5_file(filename):
     """
     Return the hex string representation of the MD5 checksum of filename.
     """
-    import hashlib
-    m = hashlib.md5()
+    import hashlib, mmap
 
     with open(filename, "rb") as f:
-        for line in f:
-            m.update(line)
+        m = hashlib.md5()
+        try:
+            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
+                for chunk in iter(lambda: mm.read(8192), b''):
+                    m.update(chunk)
+        except ValueError:
+            # You can't mmap() an empty file so silence this exception
+            pass
     return m.hexdigest()
 
 def sha256_file(filename):
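
The approach in the patch above can be tried outside BitBake with a small standalone sketch. The function name `md5_file_mmap`, the `blocksize` parameter, and the temporary file names are assumptions made for illustration, not part of `bb.utils`. This variant also checks the file size up front instead of catching `ValueError`; that is a deliberate difference from the patch, not a description of it.

```python
import hashlib
import mmap
import os
import tempfile


def md5_file_mmap(filename, blocksize=8192):
    """Hypothetical standalone variant of md5_file: MD5 hex digest of a file."""
    m = hashlib.md5()
    with open(filename, "rb") as f:
        # mmap() raises ValueError for an empty file, so handle that case
        # explicitly: the digest of no data is simply the initial state.
        if os.fstat(f.fileno()).st_size == 0:
            return m.hexdigest()
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # iter(callable, sentinel) keeps calling mm.read() until it
            # returns b"", i.e. until the end of the mapping is reached.
            for chunk in iter(lambda: mm.read(blocksize), b""):
                m.update(chunk)
    return m.hexdigest()


# Usage: compare against hashing the same bytes held in memory.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.bin")
    payload = os.urandom(20000)  # spans several 8 KiB blocks
    with open(path, "wb") as out:
        out.write(payload)
    assert md5_file_mmap(path) == hashlib.md5(payload).hexdigest()

    empty = os.path.join(tmpdir, "empty.bin")
    open(empty, "wb").close()
    assert md5_file_mmap(empty) == hashlib.md5(b"").hexdigest()
```

The explicit size check is narrower than `except ValueError`, which would also swallow any other `ValueError` raised inside the `try` block. Whether that matters for BitBake's callers is not something this sketch can establish.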